Friday, September 06, 2013

3rd Year

On Wednesday (4/9/2013) I successfully completed my end of second year interview.  This means that I am now officially a third year PhD student.  I am now at the dead halfway point of my PhD, having completed 24 months with 24 more remaining.  It has been a long road getting here and there is still a long way to go.  Below is a brief analysis of the achievements in my PhD so far and the goals yet to come.


Completed So Far:

  • Literature Review:  This was the first thing I did as a PhD student.  Reading took up most of the first six months of my research.  I consumed, refined and categorised as much of the relevant literature as I could find.  I am currently attempting to publish this as a survey paper, since the only available text simplification survey is a technical report from 2008.
  • Lexical Simplification Errors: I recently undertook a pilot study looking at the errors thrown up by the lexical simplification pipeline.  I'm looking to publish this at an upcoming conference, so I won't say too much about the results here and now.
  • Complex Word Identification: This was the first element of the lexical simplification pipeline that I studied.  I built a corpus of sentences, each with one word marked as complex, for the purpose of evaluating current methods of identification.  This work was published as two separate workshop papers at ACL 2013.
  • Substitution Generation: Once we have identified a complex word, we must generate a set of substitutions for it.  However, the words which are complex are also those which are least likely to be found in a thesaurus, complicating the task.  To address this I spent considerable effort learning simplifications from massive corpora, with some success.  This work is also currently being written up for publication.


Still to come:

  • Word Sense Disambiguation: The next step in the pipeline is to apply some word sense disambiguation.  This has been done before, so I will be looking at the best ways to apply it and hopefully making a novel contribution here.  I am just starting out on this phase of research and am currently immersed in the WSD literature, trying to get my head round the myriad techniques that already exist there.  (A rough sketch of how the pipeline modules fit together appears after this list.)
  • Synonym Ranking: I looked into the best way to rank synonyms according to their complexity at the start of my project.  The small amount of work that I did back then did not discover anything radical, but it did help me to better understand the structure of a lexical simplification system.  When I revisit this area it will be with the hope of making some significant contribution.  I was really interested in the work David Kauchak presented at ACL 2013 and will be keen to explore what more can be done in this area.
  • User Evaluation: Finally, I will spend some time exploring the effects of each of the modules I have developed on individual users.  It is of paramount importance to evaluate text simplification in the context of the users it is aimed at, and to this end I will be focussing my research on a specific user group, although which group is as yet undecided.
  • Thesis: This will undoubtedly take a significant portion of my final year.  The chapter titles will hopefully be the bullet points you see listed above.
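
Taken together, complex word identification, substitution generation, word sense disambiguation and synonym ranking make up the lexical simplification pipeline that runs through the bullet points above.  Purely as an illustration of how the modules slot together, here is a minimal sketch; the class and method names are hypothetical placeholders rather than my actual implementation.

import java.util.ArrayList;
import java.util.List;

public class LexicalSimplificationPipeline
{
  public String simplify(String sentence)
  {
    String result = sentence;
    for(String word : sentence.split("\\s+"))
    {
      //1. Complex Word Identification: skip words already judged simple enough.
      if(!isComplex(word))
        continue;

      //2. Substitution Generation: propose candidate replacements.
      List<String> candidates = generateSubstitutions(word);

      //3. Word Sense Disambiguation: keep only candidates that fit this context.
      candidates = disambiguate(word, sentence, candidates);

      //4. Synonym Ranking: pick the simplest remaining candidate, if any.
      String replacement = rankBySimplicity(candidates);
      if(replacement != null)
        result = result.replace(word, replacement);
    }
    return result;
  }

  //placeholder stubs standing in for the four modules discussed above.
  private boolean isComplex(String word) { return false; }
  private List<String> generateSubstitutions(String word) { return new ArrayList<String>(); }
  private List<String> disambiguate(String word, String context, List<String> candidates) { return candidates; }
  private String rankBySimplicity(List<String> candidates) { return candidates.isEmpty() ? null : candidates.get(0); }
}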

So there you have it.  Although it appears that I have done a lot so far, it still feels like I have a real mountain to climb.  There are significant hurdles and vast amounts of reading, researching and writing ahead.  I look forward to the challenges that the next two years of my PhD will bring.

    Monday, August 19, 2013

    ACL 2013 - Post Blog

    It's been a little over a week since I got back from ACL.  I think it takes a certain amount of time to process a conference, and I think I'm still processing it.  It was a massively positive experience overall.  It was very encouraging to meet so many people working in similar disciplines, people who engaged with similar problems.  It was also very encouraging to present my research and to get feedback from more experienced members of the community.  Despite being somewhat terrified about the prospect of presenting, I actually really enjoyed it.  People had really good questions which made me think more about my research and get even more excited for it.

    A real highlight of the conference was the workshop on Predicting and Improving Text Readability (PITR).  This was a small workshop, with maybe 10-20 people at any one time.  During the course of the day I gave both a poster and an oral presentation.  The people there were working in areas very similar to mine; I got hugely valuable feedback on my work and was able to understand and discuss other people's research with them.

    I really enjoyed the conference experience and I will definitely be looking to attend another conference in the forthcoming season (as much as time and funding might allow!).  I have some work on automatic thesaurus generation that I am looking to write up and submit to either LREC or EACL.  Their submission dates are close together (15th and 18th October respectively), so I will likely submit the same paper to both to increase my odds of acceptance.

    The next big hurdle in my academic career is my progression interview on 4th September.  According to the supporting documentation:
    "The student has been working for 18 months on research. It should be possible at this point to determine whether they are capable of achieving at the research project they are attempting"
    Which sounds terrifying.  I'm currently choosing not to stress about it: whilst they technically have the option to throw me out at this point, the chances of them doing so are very low.  I'm required to present a short (1000 word) report and give a 10 minute talk.  I already have the talk roughly planned out in my mind, although I've not put any slides together as of yet.

    Thursday, August 15, 2013

    Orthography, Phonology and Nomenclature. Making sense of word relations!

    In my role as a computational linguist, I often find myself venturing into areas of linguistics which I find nothing short of fascinating.  One such area is the complex relation between English orthography (that is, how a word is written) and phonology (how it sounds).  In English, we have a 'deep orthography', meaning that a word doesn't necessarily sound the way it looks, leading to beautiful confusions such as:
    weight vs. height

    foot vs. food
    or (my favourite):
    cough vs. enough vs. plough vs. though vs. through.
    That's right: five distinct sounds from the letters 'ough'.

    We also get the interesting phenomenon that one set of letters, with different pronunciations, can have totally different meanings.  For example:
    He decided to desert the army.
    Camels live in the desert.

    This is an example of heteronyms.  Heteronyms are different to homonyms, which have the same pronunciation and spelling but a different meaning.  These are different again to heterographs, homophones and synonyms.  The table below defines the potential relations between words.  It is taken mostly from this Venn diagram.

                             Meaning    Spelling   Pronunciation
    No Relation              Different  Different  Different
    Homophone                Different  -          Same
    Heterograph              Different  Different  Same
    Heteronym                Different  Same       Different
    Homonym                  Different  Same       Same
    Different Spelling       Same       Different  Same
    Different Pronunciation  Same       Same       Different
    Synonym                  Same       Different  Different
    Same Word                Same       Same       Same

    • No relation: Two words which are not related in any sense.
    • Homophones:  Words which sound the same, but have different meanings.  Further split into the following two categories:
    • Heterographs: Homophones with different spellings.  "There", "Their" and "They're" is a classic example.
    • Homonyms: Homophones with the same spelling.  E.g. "Right" (direction) vs. "Right" (entitlement).
    • Heteronyms: Words that are spelt the same but have a different sound and meaning. E.g. "desert" (to leave) vs. "desert" (the Sahara), as in the example above.
    • Different Spelling: No technical word here, just words which mean and sound the same but are spelt differently. e.g. "Labor" (US spelling) vs. "Labour" (British Spelling).
    • Different Pronunciation: Again, no technical word, just two words which are written and mean the same, but sound different. E.g. 'the elephant' vs. 'the circus' ('the' takes a different sound before a vowel than before a consonant).
    • Synonyms: Two words with the same meaning, but different pronunciations and written forms.  e.g. "friend" and "companion". Useful for lexical simplification as synonyms can be ranked according to their simplicity.
    • Same Word: No difference here whatsoever.
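
    For the programmatically minded, the table above boils down to three equality checks.  Below is a minimal Java sketch of that idea; the Word class and its string-valued meaning, spelling and pronunciation fields are purely hypothetical stand-ins, since a real system would need a pronunciation dictionary and some notion of word sense.

    public class WordRelation
    {
      //hypothetical representation of a word; real systems would need richer data.
      static class Word
      {
        String meaning, spelling, pronunciation;
        Word(String m, String s, String p) { meaning = m; spelling = s; pronunciation = p; }
      }

      //classify the relation between two words using the table above.
      static String classify(Word a, Word b)
      {
        boolean meaning = a.meaning.equals(b.meaning);
        boolean spelling = a.spelling.equals(b.spelling);
        boolean sound = a.pronunciation.equals(b.pronunciation);

        if (meaning && spelling && sound)   return "Same Word";
        if (meaning && spelling && !sound)  return "Different Pronunciation";
        if (meaning && !spelling && sound)  return "Different Spelling";
        if (meaning && !spelling && !sound) return "Synonym";
        if (!meaning && spelling && sound)  return "Homonym";
        if (!meaning && spelling && !sound) return "Heteronym";
        if (!meaning && !spelling && sound) return "Heterograph";
        return "No Relation";
      }

      public static void main(String [] args)
      {
        Word desert1 = new Word("to leave", "desert", "dih-ZURT");
        Word desert2 = new Word("arid land", "desert", "DEZ-ert");
        System.out.println(classify(desert1, desert2)); //prints "Heteronym"
      }
    }
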
    So there you have it.  I hope this is a helpful contribution to the often confusing world of word relation nomenclature.  I am certainly much clearer on the distinction between these terms as a result of writing this blog post.

    Monday, August 12, 2013

    The Lexical Simplification List

    Whilst putting together my literature review, I decided it might be valuable if the references I was collecting were visible to other people who are interested in lexical simplification.  To that end, I have put together a list of all the references I know of which pertain in some way to lexical simplification.  I have tried not to overload this list, so have only included those papers which are explicitly working on lexical simplification, rather than those which mention it in passing.  The list is probably incomplete in its current incarnation, so if you see any papers you think are missing, please do drop me an email and I'll be happy to add them.  To find the list you can follow the tab at the top, or click here.

    Further to this, I thought it might be nice to collect together some of the resources I have found helpful on one page.  This means that I have split the resources section into 'my resources' and 'external resources'.  In the external resources section I have put some links to useful resources which I have used, but have had no hand in creating.

    My hope with this is that somebody wishing to start out in lexical simplification will be able to read through these two lists and find a good bed of research and resources from which to begin.  I also hope that more established lexical simplification researchers will find the content interesting and that their research will benefit from it.

    Thursday, August 01, 2013

    Randomising the lines of a very large file with Java

    I came across an interesting problem today.  I have some nice results from counting data and I wanted to see if the same results would appear if I randomised the underlying dataset.  The problem?  The dataset is a 17 Gigabyte file.

    All the solutions I could find online required the file to be read into memory at some point.  Obviously, with my 8GB of RAM these were not acceptable solutions.  I needed a solution which would hold only one line in memory at a time before discarding it.

    I reasoned that if I distributed the lines of the file at random across several separate files, I could create some randomisation.  I also realised that the more files there were, the greater the degree of randomisation.

    Below is the Java code I wrote, as well as a bash wrapper script.  It takes a file and a numeric argument denoting how many files to write into.  It then assigns each line at random to one of the files until it runs out of lines.  These files can then be concatenated together in a post-processing step.  I think it's quite a neat solution.  I've commented the code for readability, so hopefully it will be reusable.  Of course this is not true randomisation, as the relative ordering of lines is preserved within each output file, but it should work for most purposes.

    For my 17GB file it took 22 minutes to run, writing to 1000 files.  Needless to say that most of that time was taken up by I/O.


    import java.io.PrintWriter;
    import java.io.FileWriter;
    import java.io.FileReader;
    import java.io.BufferedReader;
    import java.util.Random;

    public class RandomiseLines
    {
      public static void main(String [] args) throws Exception
      {
        if(args.length != 2)
        {
          System.out.println("Usage: java RandomiseLines <file> <No of Output Files>");
          System.exit(-1);
        }

        //the number of separate files to place lines into.
        final int FILENUM = Integer.parseInt(args[1]);

        //initialise the random number generator.
        final long SEED = 1;
        Random generator = new Random(SEED);

        //if a seed isn't required, comment out the lines above and use:
        /*
          Random generator = new Random();
        */

        //initialise the file writers
        PrintWriter [] writers = new PrintWriter[FILENUM];
        for (int i = 0; i < FILENUM; i++)
         writers[i] = new PrintWriter(new FileWriter("out." + i + ".txt"));

        //read in the file
        int key;
        String line;
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        while((line = in.readLine()) != null)
        {

          //generate a random number between 0 and FILENUM - 1
          key = (int)Math.floor(FILENUM*generator.nextDouble());

          //write the line to the chosen file;
          writers[key].println(line);
        }//while

        //close IO
        in.close();
        for(int i = 0; i < FILENUM; i++)
         writers[i].close();
       
      }//main
    }//class

    The following shell script can be used as a wrapper to the programme.
    #!/bin/bash

    FileNum=10;

    java RandomiseLines $1 $FileNum

    echo "" > randomised.txt

    for i in `seq 0 $(($FileNum -1))`; do
     cat out.$i.txt >> randomised.txt;
     rm out.$i.txt
    done

    exit

    Monday, July 29, 2013

    ACL 2013 Pre-Blog.

    In under a week I will be sitting on a plane headed to Bulgaria.  This year I will be presenting at the Association for Computational Linguistics Conference in Sofia.  I have been fortunate enough to have had two papers accepted for presentation.

    The first paper is part of the Student Research Workshop.  This is a part of the main conference, but they only accept papers from PhD students (making it slightly easier to get into!).  The paper I am presenting details some experiments in attempting to establish a baseline for complex word identification.  I used the CW Corpus (see below) to test a few standard techniques for complex word identification.  It turned out that they all performed fairly similarly, but that in itself was an unexpected and hence interesting finding!  The mode of presentation will be via a poster.  I think this will be quite difficult and will require a lot of energy to stay engaged and motivated, but I'm up for the challenge.

    I'm excited to attend the student research workshop.  It will hopefully be an encouraging experience.  Whilst I don't expect there to be many (if any!) people who are experts in text simplification there, I'm sure it will be very useful to meet other PhD students and see where their work is taking them.

    The second paper is part of a co-located workshop called Predicting and Improving Text Readability for Target Reader Populations (PITR 2013).  This is a much smaller workshop with a more specialised focus.  It is very relevant to my field of research, so I'm looking forward to meeting plenty of like-minded people there.  I have followed the work of some of the presenting authors, so it will be very exciting to meet them face to face.

    This is more than just academic celebrity spotting of course.  The paper I will be presenting is on the CW Corpus, a resource I developed for evaluating the identification of complex words.  There are a lot of implementation details, which for the most part I will try not to bore people with.  The main thing I want to do at this conference is to get people interested in the concept of complex word identification as its own separately evaluable subtask.  Hopefully people will respond well to this, seeing it as a valid area to be working in.  I'm presenting a poster in this workshop and will also be giving a 15 minute talk on my research.

    I'll write about the conference again soon, either whilst I'm out in Bulgaria, or when I get back.

    Wednesday, June 19, 2013

    The importance of being accurate.

    Only a short post today. I am currently writing my transfer report, which is soaking up all of my research time.  I thought I would take some time out from that to write about an interesting phenomenon that occurs in text simplification.

    Accuracy is always to be sought after.  Regardless of your domain, the more accurate your algorithm, the better.  In many domains, however, the occasional inaccurate result can be tolerated.  For example, if you search for the query 'jaguar in the jungle' you are likely to receive lots of results about big cats in their natural habitat, but you may also receive some results about fancy cars in the jungle too.  This is acceptable and may even be helpful, as the original query contained some ambiguity - maybe you really wanted to know about those fancy cars.

    The same thing can occur during text simplification.  Inaccurate identifications or replacements may lead to an incorrect result being present in the final text.  Some of the critical points of failure are as follows:
    • A complex word could be mislabeled as simple - meaning it is not considered for simplification.
    • No replacements may be available for an identified complex word.
    • A replacement which does not make sense in the context of the original word may be selected.
    • A complex replacement may be incorrectly selected over a simpler alternative due to the difficulty of estimating lexical complexity.
    If any of the above pitfalls occur, then either a complex word or an erroneous replacement may creep into the final text.  Unlike in web search, errors are of great detriment to the simplification process.  This is because the point is to have text which is easier to understand.  In the majority of cases, introducing errors into a text will cause it to be more difficult, completely negating any simplification made.  This is a real case of one step forwards and two steps back.  For example:

    A young couple with children will need nearly 12 years to get enough money for a deposit.
    was changed by a rudimentary lexical simplification system to:

    A young couple with children will need nearly 12 years to get enough money for a sediment.
    Not only has a synonym been chosen which is more complicated than the original word, it also does not make any sense in the given context.  By making an error, the system has reduced the understandability of the text, and it would have been better to make no simplification at all.

    To end this post, I will present some practical ways to mitigate this.
    1. Only simplify if you're sure.  Thresholds for deciding whether to simplify should be set high to avoid errors (see the sketch after this list).
    2. Use resources which are well suited to your task, preferably built from as large a corpus as possible.
    3. Investigate these errors in resultant text.  If they are occurring, is there a specific reason?
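
    To make point 1 concrete, here is a minimal sketch of a confidence-gated replacement decision.  The scoring functions and the threshold value are hypothetical placeholders rather than a description of any particular system; the idea is simply that the original word survives unless the system is very confident.

    public class ConservativeSimplifier
    {
      //deliberately high threshold: only simplify if we're sure.
      private static final double THRESHOLD = 0.9;

      //returns the candidate only when both scores clear the threshold,
      //otherwise the original word is left untouched.
      public static String maybeSimplify(String original, String candidate,
                                         double fitsContextScore, double isSimplerScore)
      {
        double confidence = Math.min(fitsContextScore, isSimplerScore);
        return confidence >= THRESHOLD ? candidate : original;
      }

      public static void main(String [] args)
      {
        //"sediment" fits the context of the example sentence above very poorly,
        //so the original word is kept.
        System.out.println(maybeSimplify("deposit", "sediment", 0.1, 0.2));
      }
    }

    The same gating idea applies at every stage of the pipeline: if the evidence for a replacement is weak at any point, it is safer to leave the text alone.
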
    In summary, incomprehensible text is much more complex than understandable yet unsimplified text.  Whilst the goal of text simplification must be to simplify when and wherever possible, this must not be done at the expense of a system's accuracy.  Presenting a reader with error-prone text is as bad as, if not worse than, presenting them with complex text.