Thursday, August 01, 2013

Randomising the lines of a very large file with Java

I came across an interesting problem today.  I have some nice results from counting data and I wanted to see if the same results would appear if I randomised the underlying dataset.  The problem?  The dataset is a 17 Gigabyte file.

All the solutions I could find online required the whole file to be read into memory at some point.  Obviously, with my 8GB of RAM these were not acceptable solutions.  I needed a solution that would hold only one line in memory at a time and then discard it.

I reasoned that if I wrote the lines of the file into separate files, I could create some randomisation.  I also realised that the more files there were, the greater the randomisation.

Below is the Java code I wrote, as well as a bash wrapper script.  The programme takes a file and a numeric argument denoting how many files to write into.  It then assigns each line at random to one of the files until it runs out of lines.  These files can then be concatenated together in a post-processing step.  I think it's quite a neat solution.  I've commented the code for readability, so hopefully it will be reusable.  Of course this is not true randomisation, as the lines within each output file keep their original relative order; however, it should work for most purposes.

For my 17GB file it took 22 minutes to run, writing to 1000 files.  Needless to say, most of that time was taken up by I/O.


import java.io.PrintWriter;
import java.io.FileWriter;
import java.io.FileReader;
import java.io.BufferedReader;
import java.util.Random;

public class RandomiseLines
{
  public static void main(String [] args) throws Exception
  {
    if(args.length != 2)
    {
      System.out.println("Usage: java RandomiseLines <file> <No of Output Files>");
      System.exit(-1);
    }

    //the number of separate files to place lines into.
    final int FILENUM = Integer.parseInt(args[1]);

    //initialise the random number generator.
    final long SEED = 1;
    Random generator = new Random(SEED);

    //if a fixed seed isn't required, comment out the two lines above and use:
    /*
      Random generator = new Random();
    */

    //initialise the file writers
    PrintWriter [] writers = new PrintWriter[FILENUM];
    for (int i = 0; i < FILENUM; i++)
     writers[i] = new PrintWriter(new FileWriter("out." + i + ".txt"));

    //read in the file
    int key;
    String line;
    BufferedReader in = new BufferedReader(new FileReader(args[0]));
    while((line = in.readLine()) != null)
    {

      //generate a random integer between 0 and FILENUM - 1 (inclusive)
      key = generator.nextInt(FILENUM);

      //write the line to the chosen file;
      writers[key].println(line);
    }//while

    //close IO
    in.close();
    for(int i = 0; i < FILENUM; i++)
     writers[i].close();
   
  }//main
}//class

The following shell script can be used as a wrapper for the programme.

#!/bin/bash

FileNum=10;

java RandomiseLines "$1" "$FileNum"

# create (or truncate) the output file without writing a blank first line
: > randomised.txt

for i in `seq 0 $(($FileNum -1))`; do
 cat out.$i.txt >> randomised.txt;
 rm out.$i.txt
done

exit
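
As noted above, the lines inside each out.N.txt file keep their original relative order.  If that matters, one option is to shuffle each of those files in memory before appending it to the result.  The class below is only a sketch of that idea, not part of the programme above: the names BucketShuffler and appendShuffled are hypothetical, it assumes every individual out.N.txt file fits comfortably in memory (with 1000 files from a 17GB input each one should be roughly 17MB), and it assumes the files are valid text in the platform's default character set.  It would be run in place of the cat loop in the wrapper script.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper: shuffles each intermediate file before concatenation.
final class BucketShuffler
{
  // Reads one intermediate file fully into memory, shuffles its lines,
  // and appends them to the target file (creating it if necessary).
  static void appendShuffled(Path bucket, Path target, Random generator) throws IOException
  {
    Charset charset = Charset.defaultCharset();

    // copy into an ArrayList so the list is guaranteed to be modifiable
    List<String> lines = new ArrayList<>(Files.readAllLines(bucket, charset));
    Collections.shuffle(lines, generator);

    try (BufferedWriter out = Files.newBufferedWriter(target, charset,
           StandardOpenOption.CREATE, StandardOpenOption.APPEND))
    {
      for (String line : lines)
      {
        out.write(line);
        out.newLine();
      }
    }
  }

  public static void main(String [] args) throws IOException
  {
    if(args.length != 1)
    {
      System.out.println("Usage: java BucketShuffler <No of Output Files>");
      System.exit(-1);
    }

    final int FILENUM = Integer.parseInt(args[0]);
    Random generator = new Random(1);

    //start from an empty result file
    Path target = Paths.get("randomised.txt");
    Files.deleteIfExists(target);

    //shuffle and append each intermediate file, then remove it
    for (int i = 0; i < FILENUM; i++)
    {
      Path bucket = Paths.get("out." + i + ".txt");
      appendShuffled(bucket, target, generator);
      Files.delete(bucket);
    }
  }
}
```

After running RandomiseLines, this would be invoked as "java BucketShuffler 10" (using the same number of files).  Only one intermediate file is held in memory at a time, so the memory needed depends on the size of the largest out.N.txt file rather than on the size of the original dataset.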
