Monday, August 19, 2013

ACL 2013 - Post Blog

It's been a little over a week since I got back from ACL.  I think it takes a certain amount of time to process a conference, and I think I'm still processing it.  It was a massively positive experience overall.  It was very encouraging to meet so many people working in similar disciplines, people who engaged with similar problems.  It was also very encouraging to present my research and to get feedback from more experienced members of the community.  Despite being somewhat terrified about the prospect of presenting, I actually really enjoyed it.  People had really good questions which made me think more about my research and get even more excited for it.

A real highlight of the conference was the workshop on Predicting and Improving Text Readability (PITR).  This was a small workshop, with maybe 10-20 people at any one time.  During the course of the day I gave both a poster and an oral presentation.  The people there were working in very similar areas to mine and I got such valuable feedback on my work, and was able to understand and discuss other people's research with them.

I really enjoyed the conference experience and I will definitely be looking to attend another conference in the forthcoming season (as much as time and funding might allow!).  I have some work on automatic thesaurus generation that I am looking to write up and submit to either LREC or EACL.  Their submission dates are close together (15th and 18th October respectively), so I will likely submit the same paper to both to increase my odds of acceptance.

The next big hurdle in my academic career is my progression interview on 4th August.  According to the supporting documentation:
"The student has been working for 18 months on research. It should be possible at this point to determine whether they are capable of achieving at the research project they are attempting"
Which sounds terrifying.  I'm currently choosing not to stress about it: whilst they technically have the option to throw me out at this point, the chances of them doing so are very low.  I'm required to present a short (1000 word) report and give a 10 minute talk.  I already have the talk roughly planned out in my mind, although I've not put any slides together as of yet.

Thursday, August 15, 2013

Orthography, Phonology and Nomenclature. Making sense of word relations!

In my role as computational linguist, I often find myself verging into areas of linguistics which I find nothing short of fascinating.  One such is the complex relation between English orthography (that is, how a word is written) and phonology (how it sounds).  In English, we have a 'deep orthography' meaning that a word doesn't necessarily sound the way it looks, leading to beautiful confusions such as:
weight vs. height

foot vs. food
or (my favourite):
cough vs. enough vs. plough vs. though vs. through.
That's right, 5 distinct sounds from the letters 'ough'.

We also get the interesting phenomenon that one set of letters, with different pronunciations, can have totally different meanings.  For example:
He decided to desert the army.
Camels live in the desert.

This is an example of heteronyms.  Heteronyms are different to homonyms, which have the same pronunciation and spelling, but a different meaning.  These are different again to heterographs, homophones and synonyms.  The table below defines the potential relations between words.  It is taken mostly from this Venn diagram.

Relation                 Meaning    Spelling   Pronunciation
No Relation              Different  Different  Different
Homophone                Different  -          Same
Heterograph              Different  Different  Same
Heteronym                Different  Same       Different
Homonym                  Different  Same       Same
Different Spelling       Same       Different  Same
Different Pronunciation  Same       Same       Different
Synonym                  Same       Different  Different
Same Word                Same       Same       Same

  • No relation: Two words which are not related in any sense.
  • Homophones:  Words which sound the same, but have different meanings.  Further split into the following two categories:
  • Heterographs: Homophones with different spellings.  "There", "Their" and "They're" are a classic example.
  • Homonyms: Homophones with the same spelling.  E.g. "Right" (direction) vs. "Right" (entitlement).
  • Heteronyms: Words that are spelt the same but have a different sound and meaning. E.g. "desert" (leave) vs. "desert" (Sahara) as in the above example.
  • Different Spelling: No technical word here, just words which mean and sound the same but are spelt differently. e.g. "Labor" (US spelling) vs. "Labour" (British Spelling).
  • Different Pronunciation: Again, no technical word, just two words which are written and mean the same, but sound different. E.g. 'the elephant' vs. 'the circus'.  ('the' takes a different sound in each).
  • Synonyms: Two words with the same meaning, but different pronunciations and written forms.  e.g. "friend" and "companion". Useful for lexical simplification as synonyms can be ranked according to their simplicity.
  • Same Word: No difference here whatsoever.
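The table can also be read as a simple decision procedure over three yes/no questions. Below is a minimal Java sketch of that reading. The class and method names are my own for illustration, and it assumes you already know whether the two words share a meaning, a spelling and a pronunciation. It returns the most specific label, so a homophone comes back as either "Heterograph" or "Homonym".

```java
public class WordRelation
{
  // Maps the three comparisons from the table to the name of the relation.
  // Homophones are reported by their more specific sub-category.
  public static String relation(boolean sameMeaning, boolean sameSpelling, boolean samePronunciation)
  {
    if (sameMeaning)
    {
      if (sameSpelling)
        return samePronunciation ? "Same Word" : "Different Pronunciation";
      return samePronunciation ? "Different Spelling" : "Synonym";
    }

    //different meaning from here on
    if (sameSpelling)
      return samePronunciation ? "Homonym" : "Heteronym";
    return samePronunciation ? "Heterograph" : "No Relation";
  }

  public static void main(String [] args)
  {
    //"desert" (leave) vs. "desert" (Sahara): different meaning, same spelling, different sound
    System.out.println(relation(false, true, false));

    //"friend" vs. "companion": same meaning, different spelling, different sound
    System.out.println(relation(true, false, false));
  }
}
```

The hard part in practice is deciding the three inputs, especially whether two words really share a meaning. The sketch leaves that entirely to the caller.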
So there you have it.  I hope this is a helpful contribution to the often confusing world of word relation nomenclature.  I am certainly much clearer on the distinction between these terms as a result of writing this blog.

Monday, August 12, 2013

The Lexical Simplification List

Whilst putting together my literature review, I decided it might be valuable if the references I was collecting were visible to other people who are interested in lexical simplification.  To that end, I have put together a list of all the references I know of which pertain in some way to lexical simplification.  I have tried not to overload this list, so have only included those papers which seem to be explicitly working in lexical simplification, rather than those which mention it in passing.  The list is probably incomplete in its current incarnation, so if you see any papers you think are missing, please do drop me an email and I'll be happy to add them.  To find the list you can follow the tab at the top, or click here.

Further to this, I thought it might be nice to collect together some of the resources I have found helpful on one page.  This means that I have split the resources sections into 'my resources' and 'external resources'.  In the external resources section I have put in some links to useful resources which I have used, but have had no hand in creating.

My idea and hope with this is that somebody wishing to start out in lexical simplification will be able to read through these two lists and find a good bed of research, and a good bed of resources to begin.  I also hope that other more established lexical simplification researchers will find the content interesting and their research will benefit from it.

Thursday, August 01, 2013

Randomising the lines of a very large file with Java

I came across an interesting problem today.  I have some nice results from counting data and I wanted to see if the same results would appear if I randomised the underlying dataset.  The problem?  The dataset is a 17 Gigabyte file.

All the solutions I could find online required the file to be read into memory at some point.  Obviously, with my 8GB of RAM these were not acceptable solutions.  I needed a solution which would allow one line to be in memory at once and then to be discarded.

I reasoned that if I wrote the lines of the file into separate files, I could create some randomisation.  I also realised that the more files there were, the greater the randomisation.

Below is the Java code I wrote, as well as a bash wrapper script.  It takes a file and a numeric argument denoting how many files to write into.  It then assigns each line at random to one of the files until it runs out of lines.  These files can then be concatenated together in a post-processing step.  I think it's quite a neat solution.  I've commented the code for readability, so hopefully it will be reusable.  Of course, this is not true randomisation: lines that land in the same output file keep their original relative order.  However, it should work for most purposes.

For my 17GB file it took 22 minutes to run, writing to 1000 files.  Needless to say, most of that time was taken up by I/O.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.PrintWriter;
import java.util.Random;

public class RandomiseLines
{
  public static void main(String [] args) throws Exception
  {
    if(args.length != 2)
    {
      System.out.println("Usage: java RandomiseLines <file> <No of Output Files>");
      System.exit(1);
    }

    //the number of separate files to place lines into.
    final int FILENUM = Integer.parseInt(args[1]);

    //initialise the random number generator.
    final long SEED = 1;
    Random generator = new Random(SEED);

    //if seed isn't required, comment above and use:
    //Random generator = new Random();

    //initialise the file writers
    PrintWriter [] writers = new PrintWriter[FILENUM];
    for (int i = 0; i < FILENUM; i++)
      writers[i] = new PrintWriter(new FileWriter("out." + i + ".txt"));

    //read in the file, holding only one line in memory at a time
    int key;
    String line;
    BufferedReader in = new BufferedReader(new FileReader(args[0]));
    while((line = in.readLine()) != null)
    {
      //generate a random number between 0 and FILENUM - 1
      key = (int)Math.floor(FILENUM*generator.nextDouble());

      //write the line to the chosen file
      writers[key].println(line);
    }

    //close IO
    in.close();
    for(int i = 0; i < FILENUM; i++)
      writers[i].close();
  }
}

The following shell script can be used as a wrapper for the program.


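#!/bin/bash

#number of output files; assumed here to be passed as the second argument
FileNum=$2

java RandomiseLines "$1" "$FileNum"

#start from an empty output file
: > randomised.txt

for i in `seq 0 $(($FileNum -1))`; do
 cat out.$i.txt >> randomised.txt;
 rm out.$i.txt
done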
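
As noted above, lines that end up in the same output file keep their original relative order. One way to remove that, sketched below, is to shuffle each output file in memory before concatenating. With 1000 files, each one is only a small fraction of the original and should fit comfortably in RAM. This is not part of my original code: the class name ShuffleBucket is hypothetical, and it assumes the out.N.txt files are UTF-8 text that are each small enough to load whole.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class ShuffleBucket
{
  //reads one (small) output file into memory, shuffles its lines and writes them back
  public static void shuffleInPlace(Path bucket, Random generator) throws IOException
  {
    //copy into an ArrayList so the list is guaranteed to be modifiable
    List<String> lines = new ArrayList<>(Files.readAllLines(bucket, StandardCharsets.UTF_8));
    Collections.shuffle(lines, generator);
    Files.write(bucket, lines, StandardCharsets.UTF_8);
  }

  public static void main(String [] args) throws IOException
  {
    if(args.length != 1)
    {
      System.out.println("Usage: java ShuffleBucket <No of Output Files>");
      System.exit(1);
    }

    final int FILENUM = Integer.parseInt(args[0]);
    Random generator = new Random(1);

    //assumes the same out.<i>.txt naming as RandomiseLines
    for(int i = 0; i < FILENUM; i++)
      shuffleInPlace(Paths.get("out." + i + ".txt"), generator);
  }
}
```

In the wrapper script, this would run after RandomiseLines and before the concatenation loop. The cost is one extra read and write of every output file.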