Tuesday, January 28, 2014

XKCD and simplification.

I have been an avid reader of the webcomic xkcd since my days as an undergrad.  If you've never heard of it, I would recommend you check it out; some of the comics are laugh-out-loud funny.  Several of them stick out as having a simplification theme.  I'm going to use this post to look at those comics through the lens of automatic simplification.  I'll try to explain what we can do with the current technology and what we just plain can't.

Simple

Particle accelerators are complex beasts.  I can empathise with the character who has read so much Simple Wikipedia that he can only talk that way now.  One of the techniques we use in simplification is language modelling.  A mathematical model of sentences is trained and then used to score new sentences according to how likely they are to have been produced by the language.  So, for example, "I went to the bank" should receive a higher score than "I to the banking place did go", as the latter sentence is poorly written.  An interesting property of language models is that the scores they give depend heavily on the sentences used to train them.  If you train a model on the text of the English Wikipedia, it will favour the rather difficult language found there.  If you train a model on the text of Simple Wikipedia, it will favour very simple-sounding language, just like the second character in this comic.  A great paper which explains this further (without the xkcd references) is Kauchak (2013) (see the lexical simplification list).
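
If you fancy seeing the idea in code, below is a minimal sketch (nothing like the models or data used in Kauchak (2013); the training sentences and the add-one smoothing are purely illustrative assumptions).  A bigram model is trained on a handful of made-up sentences standing in for a corpus, and then scores the two example sentences above.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class BigramLanguageModel
{
  //counts of single words and of adjacent word pairs seen in training
  private final Map<String, Integer> unigrams = new HashMap<String, Integer>();
  private final Map<String, Integer> bigrams = new HashMap<String, Integer>();
  private final Set<String> vocabulary = new HashSet<String>();

  private static void addCount(Map<String, Integer> counts, String key)
  {
    Integer current = counts.get(key);
    counts.put(key, current == null ? 1 : current + 1);
  }

  private static int getCount(Map<String, Integer> counts, String key)
  {
    Integer current = counts.get(key);
    return current == null ? 0 : current;
  }

  //count the unigrams and bigrams in one training sentence
  public void train(String sentence)
  {
    String[] tokens = ("<s> " + sentence.toLowerCase() + " </s>").split("\\s+");
    for (int i = 0; i < tokens.length; i++)
    {
      vocabulary.add(tokens[i]);
      addCount(unigrams, tokens[i]);
      if (i > 0)
        addCount(bigrams, tokens[i - 1] + " " + tokens[i]);
    }
  }

  //log-probability of the sentence under the model, with add-one smoothing
  //so that unseen word pairs still get a small, non-zero probability
  public double score(String sentence)
  {
    String[] tokens = ("<s> " + sentence.toLowerCase() + " </s>").split("\\s+");
    double logProb = 0.0;
    for (int i = 1; i < tokens.length; i++)
    {
      int pairCount = getCount(bigrams, tokens[i - 1] + " " + tokens[i]);
      int prevCount = getCount(unigrams, tokens[i - 1]);
      logProb += Math.log((pairCount + 1.0) / (prevCount + vocabulary.size()));
    }
    return logProb;
  }

  public static void main(String[] args)
  {
    BigramLanguageModel model = new BigramLanguageModel();

    //a stand-in for a large training corpus (e.g. Simple Wikipedia text)
    model.train("i went to the bank");
    model.train("i went to the shop");
    model.train("she went to the park");

    //the fluent sentence should receive the higher (less negative) score
    System.out.println(model.score("i went to the bank"));
    System.out.println(model.score("i to the banking place did go"));
  }
}

Train the same class on Simple Wikipedia text instead of ordinary text and its scores will start to favour simple-sounding sentences, which is exactly the effect the comic plays on.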

Up Goer Five

This next one is too long to put in this post, but it's worth a read.  The full comic is here: Up Goer Five (or right-click on the image and open it in a new tab to view it).

The comic presents a simplified blueprint of the Saturn V rocket.  The text has been restricted to only the thousand most common words in the English language.  There is some question as to where the statistics for the 'thousand most common words' came from; if they were taken from NASA's technical rocket manuals then very little change may have been needed!  We'll assume they were taken from some comprehensive resource.  The best way of determining this with currently available resources would be to use the top-ranked words in the Google Web1T corpus (Google counted a trillion words of web text and recorded how often each one occurred).
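
As a rough sketch of that idea, the snippet below pulls the N most frequent words out of a unigram count file.  The 'word<TAB>count' file format is an assumption for illustration, and the real Web1T unigram counts are big enough that you might prefer to stream them rather than sort them in memory.

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class TopWords
{
  public static void main(String[] args) throws Exception
  {
    //args[0]: unigram count file, one "word<TAB>count" entry per line
    //args[1]: how many of the most frequent words to keep (e.g. 1000)
    final int N = Integer.parseInt(args[1]);

    //read every (word, count) pair from the file
    List<String[]> entries = new ArrayList<String[]>();
    BufferedReader in = new BufferedReader(new FileReader(args[0]));
    String line;
    while ((line = in.readLine()) != null)
    {
      String[] parts = line.split("\t");
      if (parts.length == 2)
        entries.add(parts);
    }
    in.close();

    //sort by count, most frequent first
    Collections.sort(entries, new Comparator<String[]>()
    {
      public int compare(String[] a, String[] b)
      {
        return Long.valueOf(b[1]).compareTo(Long.valueOf(a[1]));
      }
    });

    //print the N most frequent words: with N = 1000, our approximation
    //to the 'thousand most common words'
    for (int i = 0; i < Math.min(N, entries.size()); i++)
      System.out.println(entries.get(i)[0]);
  }
}

An Up-Goer-style checker would then simply flag any word that does not appear in the resulting list, which is essentially what the text editor mentioned below does.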

The style of translation in this comic is phenomenally difficult to achieve, even for humans.  You can try it for yourself at The Up-Goer Five text editor.  Most technical terms have been substituted with simpler phrases or explanations.  Some of these rely on outside knowledge, which actually has the effect of making the sentence more difficult to understand.  For example, one label reads: "This is full of that stuff they burned in lights before houses had power".  This is referring to kerosene, which is perfectly understandable if you know of kerosene but impenetrable if not.

It would be an interesting experiment to determine the smallest vocabulary required to produce this kind of simplification without having to draw on inferred knowledge (such as the type of fuel lights once burned).  My guess is that you would need 10,000 to 20,000 words before this became a reality.  Even then, it would be difficult to produce text at this level of simplicity automatically: explaining a concept requires genuinely deep understanding and background knowledge, which is difficult to emulate with a machine.

Winter


The above comic touches on an excellent point: if the words we use are understandable, does it matter that they're not the 'correct' words?  I have previously written about lexical complexity, noting that many factors affect how difficult we find a word.  The big factor played on here is context.  For example, the term 'handcoats' in the second panel is understandable (as gloves) because we know from the first panel that 'the sky is cold'.  'Handcoats' is a word you've probably never seen before, and out of context it would be difficult to work out its meaning.  This highlights the importance of selecting words which fit the context of a sentence.  If the right context is established and a simple word fitting that context is used, then the understandability of the sentence increases dramatically.

Wednesday, November 06, 2013

Word Sense Disambiguation

Some words have more than one meaning, yet the brain seems to have an innate ability to work out what a sentence means.  Take the following two sentences:

"I tied my boat to the bank"
"I put my money in the bank"

In the first sentence you probably imagine somebody tying their boat to the side of a river, yet in the second sentence you imagine somebody depositing their money with a financial institution.  That string of four characters, 'b a n k', has completely changed meaning.

Word sense disambiguation (WSD) is a well-researched task in computational linguistics with an important application to lexical simplification.  The majority of previous research splits roughly into three categories:
  • Supervised: Using labelled data, a system builds a classifier which can recognise the different senses of a word, from a variety of features in the words surrounding it.
  • Unsupervised: With unlabelled data, a system learns the different senses of a word.  Classification of new data makes use of the previously learned senses.
  • Knowledge Based: A large knowledge resource such as WordNet provides information about the words which can be used during disambiguation.

WSD is vital to the task of lexical simplification.  Consider simplifying a sentence from the previous example.  If you look up the word 'bank' in a thesaurus you will get a list of synonyms that looks something like the following:

Bank:
Financial Institution; Treasury; Safe;
Edge; Beach; Riverside;

If a system does not employ WSD, then there is no way of telling which of the synonyms are correct for the context.  We do not wish to say "I tied my boat to the treasury", or "I put my money in the riverside".  These examples are at best farcical and at worst nonsensical.  WSD is paramount to selecting the correct set of synonyms.
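
As a toy illustration of how a knowledge-based approach can make that choice, the sketch below uses a Lesk-style overlap between the words of the sentence and a hand-written 'signature' for each sense of 'bank'.  To be clear, this is not the method of any of the papers listed below; the signatures and synonym lists are invented purely for the example.

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class LeskSketch
{
  //a toy 'sense': some signature words describing it, plus the synonyms
  //we would be willing to substitute if this sense is chosen
  static class Sense
  {
    final Set<String> signature;
    final List<String> synonyms;
    Sense(List<String> signature, List<String> synonyms)
    {
      this.signature = new HashSet<String>(signature);
      this.synonyms = synonyms;
    }
  }

  //count how many context words appear in the sense's signature
  static int overlap(Set<String> context, Sense sense)
  {
    int score = 0;
    for (String word : context)
      if (sense.signature.contains(word))
        score++;
    return score;
  }

  public static void main(String[] args)
  {
    Sense financial = new Sense(
      Arrays.asList("money", "account", "loan", "deposit", "cash"),
      Arrays.asList("financial institution", "treasury", "safe"));
    Sense river = new Sense(
      Arrays.asList("river", "boat", "water", "shore", "tied"),
      Arrays.asList("edge", "beach", "riverside"));

    String sentence = "i tied my boat to the bank";
    Set<String> context = new HashSet<String>(
      Arrays.asList(sentence.split("\\s+")));

    //pick whichever sense shares more words with the context
    Sense chosen = overlap(context, river) >= overlap(context, financial)
      ? river : financial;

    //only the synonyms of the chosen sense are candidate substitutions
    System.out.println(chosen.synonyms);
  }
}

Running this on "I tied my boat to the bank" picks the river sense and prints [edge, beach, riverside]; the financial synonyms are never offered as substitutions.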

I will not venture a full explanation of WSD as applied to lexical simplification here.  Suffice it to say that there are four papers which I have so far identified as addressing the matter.  These can be found in the lexical simplification list:

  • Can Spanish be simpler? LexSiS: Lexical simplification for Spanish. Bott et al. 2012
  • Wordnet-based lexical simplification of a document. Thomas and Anderson 2012
  • Putting it simply: a context-aware approach to lexical simplification. Biran et al. 2011
  • Lexical simplification. De Belder et al. 2010

Friday, September 06, 2013

3rd Year

On Wednesday (4/9/2013) I successfully completed my end-of-second-year interview.  This means that I am now officially a third-year PhD student.  I am now at the exact halfway point of my PhD, having completed 24 months with 24 more remaining.  It has been a long road getting here and there is still a long way to go.  Below is a brief summary of what I have achieved in my PhD so far and the goals yet to come.


Completed So Far:

  • Literature Review:  This was the first thing I did as a PhD student.  Reading took up most of the first six months of my research, during which I consumed, refined and categorised as much of the relevant literature as I could find.  I am currently attempting to publish this as a survey paper, since the only available text simplification survey is a technical report from 2008.
  • Lexical Simplification Errors: I recently undertook a pilot study looking at the errors thrown up by the lexical simplification pipeline.  I'm looking to publish this at an upcoming conference, so won't say too much about the results here and now.
  • Complex Word Identification: This was the first element of the lexical simplification pipeline that I studied.  I built a corpus of sentences, each with one word marked as complex, for the purpose of evaluating current methods of identification.  This work was published in two separate workshop papers at ACL 2013.
  • Substitution Generation: Once we have identified a complex word, we must generate a set of substitutions for it.  However, the words which are complex are also those which are least likely to be found in a thesaurus, complicating the task.  To address this I spent considerable effort learning simplifications from massive corpora, with some success.  This work is also currently being written up for publication.


Still to come:

  • Word Sense Disambiguation: The next step in the pipeline is to apply some word sense disambiguation.  This has been done before, so I will be looking at the best ways to apply it and hopefully making a novel contribution here.  I am just starting out on this phase of research and am currently immersed in the WSD literature, trying to get my head round the myriad techniques that already exist there.
  • Synonym Ranking: I looked into the best way to rank synonyms according to their complexity at the start of my project.  The small amount of work I did back then did not uncover anything radical, but it did help me to better understand the structure of a lexical simplification system.  When I revisit this area it will be with the hope of making a significant contribution.  I was really interested in the work David Kauchak presented at ACL 2013 and will be keen to explore what more can be done in this area.
  • User Evaluation: Finally, I will spend some time exploring the effects of each of the modules I have developed on individual users.  It is of paramount importance to evaluate text simplification in the context of the users it is aimed at, and to this end I will be focussing my research on a specific user group, although which group is as yet undecided.
  • Thesis: This will undoubtedly take a significant portion of my final year.  The chapter titles will hopefully be the bullet points you see listed above.

So there you have it.  Although it appears that I have done a lot so far, it still feels like I have a real mountain to climb.  There are significant hurdles and vast amounts of reading, researching and writing ahead.  I look forward to the challenges that the next two years of my PhD will bring.

    Monday, August 19, 2013

    ACL 2013 - Post Blog

    It's been a little over a week since I got back from ACL.  I think it takes a certain amount of time to process a conference, and I'm still processing this one.  It was a massively positive experience overall.  It was very encouraging to meet so many people working in similar disciplines, people who engage with similar problems, and just as encouraging to present my research and get feedback from more experienced members of the community.  Despite being somewhat terrified at the prospect of presenting, I actually really enjoyed it.  People asked really good questions which made me think more about my research and get even more excited about it.

    A real highlight of the conference was the workshop on Predicting and Improving Text Readability (PITR).  This was a small workshop, with maybe 10-20 people at any one time.  During the course of the day I gave both a poster and an oral presentation.  The people there were working in very similar areas to mine and I got such valuable feedback on my work, and was able to understand and discuss other people's research with them.

    I really enjoyed the conference experience and I will definitely be looking to attend another conference in the forthcoming season (as much as time and funding might allow!).  I have some work on automatic thesaurus generation that I am looking to write up and submit to either LREC or EACL.  Their submission dates are close together (15th and 18th October respectively), so I will likely submit the same paper to both to increase my odds of acceptance.

    The next big hurdle in my academic career is my progression interview on 4th September.  According to the supporting documentation:
    "The student has been working for 18 months on research. It should be possible at this point to determine whether they are capable of achieving at the research project they are attempting"
    Which sounds terrifying.  I'm currently choosing not to stress about it: whilst they technically have the option to throw me out at this point, the chances of them doing so are very low.  I'm required to present a short (1000 word) report and give a 10 minute talk.  I already have the talk roughly planned out in my mind, although I've not put any slides together as of yet.

    Thursday, August 15, 2013

    Orthography, Phonology and Nomenclature. Making sense of word relations!

    In my role as a computational linguist, I often find myself straying into areas of linguistics which I find nothing short of fascinating.  One such area is the complex relation between English orthography (that is, how a word is written) and phonology (how it sounds).  English has a 'deep orthography', meaning that a word doesn't necessarily sound the way it looks, leading to beautiful confusions such as:
    weight vs. height

    foot vs. food
    or (my favourite):
    cough vs. enough vs. plough vs. though vs. through.
    That's right, five distinct sounds from the letters 'ough'.

    We also get the interesting phenomenon that one set of letters, with different pronunciations, can have totally different meanings.  For example:
    He decided to desert the army.
    Camels live in the desert.

    These two words are heteronyms.  Heteronyms are different to homonyms, which have the same pronunciation and spelling but a different meaning.  These are different again to heterographs, homophones and synonyms.  The table below defines the potential relations between words.  It is taken mostly from this Venn diagram.

    Relation                   Meaning     Spelling    Pronunciation
    No Relation                Different   Different   Different
    Homophone                  Different   -           Same
    Heterograph                Different   Different   Same
    Heteronym                  Different   Same        Different
    Homonym                    Different   Same        Same
    Different Spelling         Same        Different   Same
    Different Pronunciation    Same        Same        Different
    Synonym                    Same        Different   Different
    Same Word                  Same        Same        Same

    • No relation: Two words which are not related in any sense.
    • Homophones:  Words which sound the same, but have different meanings.  Further split into the following two categories:
    • Heterographs: Homophones with different spellings.  "There", "Their" and "They're" are a classic example.
    • Homonyms: Homophones with the same spelling.  E.g. "Right" (direction) vs. "Right" (entitlement).
    • Heteronyms: Words that are spelt the same but have a different sound and meaning.  E.g. "desert" (leave) vs. "desert" (the Sahara), as in the example above.
    • Different Spelling: No technical word here, just words which mean and sound the same but are spelt differently.  E.g. "Labor" (US spelling) vs. "Labour" (British spelling).
    • Different Pronunciation: Again, no technical word, just two words which are written the same and mean the same, but sound different.  E.g. 'the elephant' vs. 'the circus' ('the' takes a different sound in each).
    • Synonyms: Two words with the same meaning, but different pronunciations and written forms.  E.g. "friend" and "companion".  Useful for lexical simplification, as synonyms can be ranked according to their simplicity.
    • Same Word: No difference here whatsoever.
    So there you have it.  I hope this is a helpful contribution to the often confusing world of word relation nomenclature.  I am certainly much clearer on the distinctions between these terms as a result of writing this post.

    Monday, August 12, 2013

    The Lexical Simplification List

    Whilst putting together my literature review, I decided it might be valuable if the references I was collecting were visible to other people who are interested in lexical simplification.  To that end, I have put together a list of all the references I know of which pertain in some way to lexical simplification.  I have tried not to overload this list, so have only included those papers which are explicitly working on lexical simplification, rather than those which mention it in passing.  The list is probably incomplete in its current incarnation, so if you see any papers you think are missing, please do drop me an email and I'll be happy to add them.  To find the list you can follow the tab at the top, or click here.

    Further to this, I thought it might be nice to collect together some of the resources I have found helpful on one page.  This means that I have split the resources section into 'my resources' and 'external resources'.  In the external resources section I have put some links to useful resources which I have used, but have had no hand in creating.

    My hope with this is that somebody wishing to start out in lexical simplification will be able to read through these two lists and find a good bed of research, and a good bed of resources, to begin with.  I also hope that more established lexical simplification researchers will find the content interesting and that their research will benefit from it.

    Thursday, August 01, 2013

    Randomising the lines of a very large file with Java

    I came across an interesting problem today.  I have some nice results from counting data and I wanted to see whether the same results would appear if I randomised the underlying dataset.  The problem?  The dataset is a 17 GB file.

    All the solutions I could find online required the whole file to be read into memory at some point.  Obviously, with my 8 GB of RAM, these were not acceptable solutions.  I needed an approach which would hold only one line in memory at a time and then discard it.

    I reasoned that if I distributed the lines of the file at random across several separate files, I could create some randomisation.  I also realised that the more files there were, the greater the randomisation.

    Below is the Java code I wrote, as well as a bash wrapper script.  It takes a file and a numeric argument denoting how many files to write into, then assigns each line at random to one of those files until it runs out of lines.  The files can then be concatenated together in a post-processing step.  I think it's quite a neat solution.  I've commented the code for readability, so hopefully it will be reusable.  Of course, this is not true randomisation, as some ordering is preserved within each output file, but it should work for most purposes.

    For my 17 GB file it took 22 minutes to run, writing to 1000 files.  Needless to say, most of that time was taken up by I/O.


    import java.io.PrintWriter;
    import java.io.FileWriter;
    import java.io.FileReader;
    import java.io.BufferedReader;
    import java.util.Random;

    public class RandomiseLines
    {
      public static void main(String [] args) throws Exception
      {
        if(args.length != 2)
        {
          System.out.println("Usage: java RandomiseLines <file> <No of Output Files>");
          System.exit(-1);
        }

        //the number of separate files to place lines into.
        final int FILENUM = Integer.parseInt(args[1]);

        //initialise the random number generator.
        final long SEED = 1;
        Random generator = new Random(SEED);

        //if a fixed seed isn't required, comment out the two lines above and use:
        /*
          Random generator = new Random();
        */

        //initialise the file writers
        PrintWriter [] writers = new PrintWriter[FILENUM];
        for (int i = 0; i < FILENUM; i++)
         writers[i] = new PrintWriter(new FileWriter("out." + i + ".txt"));

        //read in the file
        int key;
        String line;
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        while((line = in.readLine()) != null)
        {

          //generate a random number between 0 and FILENUM - 1
          key = (int)Math.floor(FILENUM*generator.nextDouble());

          //write the line to the chosen file;
          writers[key].println(line);
        }//while

        //close IO
        in.close();
        for(int i = 0; i < FILENUM; i++)
         writers[i].close();
       
      }//main
    }//class

    The following shell script can be used as a wrapper to the programme.
    #!/bin/bash

    FileNum=10;

    java RandomiseLines $1 $FileNum

    echo "" > randomised.txt

    for i in `seq 0 $(($FileNum -1))`; do
     cat out.$i.txt >> randomised.txt;
     rm out.$i.txt
    done

    exit