All the solutions I could find online required the file to be read into memory at some point. With only 8GB of RAM and a 17GB file, these were not workable. I needed a solution that would hold only one line in memory at a time and then discard it.
I reasoned that if I scattered the lines of the file across separate files at random, I could introduce some randomisation. I also realised that the more files there were, the greater the randomisation.
Below is the Java code I wrote, as well as a bash wrapper script. It takes a file and a numeric argument denoting how many files to write into. It then assigns each line at random to one of the files until it runs out of lines. These files can then be concatenated together in a post-processing step. I think it's quite a neat solution. I've commented the code for readability, so hopefully it will be reusable. Of course this is not true randomisation, as some ordering is preserved: lines that land in the same output file keep their original relative order. However, it should work for most purposes.
For my 17GB file it took 22 minutes to run, writing to 1000 files. Needless to say, most of that time was taken up by I/O.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.PrintWriter;
import java.util.Random;

public class RandomiseLines
{
    public static void main(String[] args) throws Exception
    {
        if (args.length != 2)
        {
            System.err.println("Usage: java RandomiseLines <file> <No of Output Files>");
            System.exit(1);
        }

        //the number of separate files to place lines into.
        final int FILENUM = Integer.parseInt(args[1]);

        //initialise the random number generator with a fixed seed so runs are repeatable.
        final long SEED = 1;
        Random generator = new Random(SEED);
        //if a seed isn't required, comment out the two lines above and use:
        /*
        Random generator = new Random();
        */

        //initialise the file writers
        PrintWriter[] writers = new PrintWriter[FILENUM];
        for (int i = 0; i < FILENUM; i++)
            writers[i] = new PrintWriter(new FileWriter("out." + i + ".txt"));

        //read in the file one line at a time
        int key;
        String line;
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        while ((line = in.readLine()) != null)
        {
            //generate a random number between 0 and FILENUM - 1
            key = generator.nextInt(FILENUM);
            //write the line to the chosen file
            writers[key].println(line);
        }//while

        //close IO
        in.close();
        for (int i = 0; i < FILENUM; i++)
            writers[i].close();
    }//main
}//class
The following shell script can be used as a wrapper to the programme.
#!/bin/bash
FileNum=10
java RandomiseLines "$1" "$FileNum"
# start from an empty output file (echo "" would leave a blank first line)
: > randomised.txt
for i in $(seq 0 $((FileNum - 1))); do
    cat "out.$i.txt" >> randomised.txt
    rm "out.$i.txt"
done
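To make the ordering caveat concrete, here is a minimal in-memory sketch. It is not part of the program above: the class name BucketShuffleSketch and its two methods are hypothetical, made up for illustration. It distributes lines across buckets the same way RandomiseLines does, so you can see that each bucket still lists its lines in their original order. It then shuffles each bucket before concatenating, which removes that leftover ordering. This assumes each bucket is small enough to fit in memory, which the original program neither requires nor checks.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical illustration only; works on in-memory lists rather than files.
class BucketShuffleSketch
{
    // Assign each line to one of fileNum buckets at random,
    // mirroring what RandomiseLines does with its output files.
    static List<List<String>> distribute(List<String> lines, int fileNum, Random generator)
    {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < fileNum; i++)
            buckets.add(new ArrayList<>());
        for (String line : lines)
            buckets.get(generator.nextInt(fileNum)).add(line);
        return buckets;
    }

    // Shuffle a copy of every bucket, then concatenate them in bucket order.
    static List<String> shuffleAndConcatenate(List<List<String>> buckets, Random generator)
    {
        List<String> result = new ArrayList<>();
        for (List<String> bucket : buckets)
        {
            List<String> copy = new ArrayList<>(bucket);
            Collections.shuffle(copy, generator);
            result.addAll(copy);
        }
        return result;
    }

    public static void main(String[] args)
    {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < 20; i++)
            lines.add("line " + i);

        Random generator = new Random(1);
        List<List<String>> buckets = distribute(lines, 4, generator);

        // Before shuffling, line numbers inside each bucket are still ascending.
        for (List<String> bucket : buckets)
            System.out.println(bucket);

        // After the per-bucket shuffle, that within-bucket order is gone.
        System.out.println(shuffleAndConcatenate(buckets, generator));
    }
}
```

For the file-based version, the equivalent step would be to read each out.N.txt back in, shuffle it, and write it out before concatenating. With 17GB spread over 1000 files, each bucket would average around 17MB, so this should be feasible on modest hardware, though I have not timed it.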