Wednesday, May 08, 2013

Lexical Simplification: Background

My PhD work is concerned with a process called lexical simplification. I'm interested in how natural language processing can be applied to make documents easier to read for everyday users. Lexical simplification specifically addresses the barriers to understanding posed by the difficult words in a text.

For example, take the following sentence:

The workers acquiesced to their boss' request.

It is fairly clear that the rarely used verb 'acquiesce' is going to cause understandability issues here. Do you know what it means? In a wider context you might be able to guess the meaning, but here it is fairly difficult to work out. Lexical simplification deals with sentences such as the one above and attempts to process them into more understandable forms. There are several stages in the lexical simplification pipeline. I intend to devote an entire post to each of these as I continue; for now, an overview of each should be sufficient.

The first stage in any lexical simplification system is complex word identification. There are two main approaches to this. In the first, the system attempts to simplify every word: those for which simplifications can be found are transformed, and those which cannot be simplified are left as they are. In the second, some form of thresholding is applied. There are various measures of lexical complexity, which often rely heavily on word frequency. A threshold may be applied to one of these measures to distinguish between complex and simple words. One of the major issues in this field is the lack of evaluation resources. I have a paper on this topic accepted at the ACL 2013 student session, so I will write more at that time.

Suppose that we can get the following lexical complexity values (in this example, a lower value indicates a rarer and therefore more difficult word):

worker: 300
acquiesce: 5
boss: 250
request: 450

If we also assume that our threshold (which is set on some training data) is somewhere between 5 and 250, then we have an indicator that 'acquiesce' is a difficult word.
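As a rough sketch, the thresholding approach might look like the following in Python. This is an illustration under assumptions rather than a real system: the scores are the example values above, read as frequency-like (a lower score means a harder word), the threshold of 100 is an arbitrary pick from the 5 to 250 range, and the function name is my own invention.

```python
# Example scores from the text, read as frequency-like values:
# a LOWER score means a rarer, more difficult word (assumption).
scores = {"worker": 300, "acquiesce": 5, "boss": 250, "request": 450}

# Assumed threshold: any value between 5 and 250 gives the same result here.
# In practice it would be tuned on training data.
THRESHOLD = 100


def find_complex_words(word_scores, threshold):
    """Return the words whose score falls below the threshold."""
    return [word for word, score in word_scores.items() if score < threshold]


complex_words = find_complex_words(scores, THRESHOLD)
print(complex_words)
```

With these values, only 'acquiesce' falls below the threshold, so it is the only word passed on to the later stages.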

The next step, once we have identified this complex word, is to generate a set of synonyms which could replace it. This is typically done with a thesaurus such as WordNet, which could give us some of the following replacements for 'acquiesce':

Acquiesce: accept, accommodate, adapt, agree, allow, cave in, comply, concur, conform, consent, give in, okay, submit, yield
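The candidate generation step can be sketched as a simple thesaurus lookup. To keep the example self-contained, I use a small hand-written dictionary as a stand-in for WordNet; the dictionary and the function name are hypothetical, and a real system would query an actual thesaurus instead.

```python
# Toy stand-in for a thesaurus such as WordNet (assumption: a real system
# would look these up rather than hard-code them). The entry is the
# candidate list given in the text.
TOY_THESAURUS = {
    "acquiesce": [
        "accept", "accommodate", "adapt", "agree", "allow", "cave in",
        "comply", "concur", "conform", "consent", "give in", "okay",
        "submit", "yield",
    ],
}


def generate_candidates(word, thesaurus):
    """Return every listed synonym for a word, or an empty list if unknown."""
    return list(thesaurus.get(word.lower(), []))


candidates = generate_candidates("acquiesce", TOY_THESAURUS)
print(candidates)
```

Note that the lookup returns synonyms for every sense of the word at once, which is why a filtering step is needed afterwards.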

We must then process these to discover which will be valid replacements in the given context. This third step is called word sense disambiguation. It is necessary because a word will typically have several senses, so some replacements will only be valid in certain contexts. In the above example, a word sense disambiguation step may look something like the following:

Acquiesce: accept, accommodate, adapt, agree, allow, cave in, comply, concur, conform, consent, give in, okay, submit, yield

Here, words in green are those that would be valid replacements, and words struck through in red are not valid replacements. This judgement is somewhat subjective, and word sense disambiguation remains an unsolved task in NLP.
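One very simple way to approach this step is to compare the words around the target with a set of "signature" words for each sense and keep the synonyms of the best-matching sense. The sketch below does this with a heavily simplified, Lesk-style overlap count. The sense groupings and signature words are invented purely for illustration (they do not come from WordNet or from my own system), and the function name is hypothetical.

```python
# INVENTED sense inventory for 'acquiesce': the groupings and signature
# words below are illustrative assumptions, not real thesaurus data.
SENSES = {
    "yield to a demand": {
        "synonyms": ["accept", "cave in", "give in"],
        "signature": {"request", "demand", "boss", "order", "pressure"},
    },
    "adjust to fit": {
        "synonyms": ["accommodate", "adapt", "conform"],
        "signature": {"change", "environment", "rules", "fit", "adjust"},
    },
}


def disambiguate(context_words, senses):
    """Return the synonyms of the sense whose signature overlaps most
    with the context words (ties go to the first sense listed)."""
    context = {w.strip(".,'\"").lower() for w in context_words}
    best = max(senses, key=lambda name: len(senses[name]["signature"] & context))
    return senses[best]["synonyms"]


sentence = "The workers acquiesced to their boss' request."
valid = disambiguate(sentence.split(), SENSES)
print(valid)
```

Here the context shares 'boss' and 'request' with the first invented sense and nothing with the second, so the first group of synonyms is kept. Real disambiguation is much harder than this toy overlap suggests.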

The final step is to rank the resulting replacements in order of their simplicity. The simplest will then replace the original word. To do this we revisit our measure of lexical complexity from before. For example, if we have the following values for the remaining candidates:

cave in:
give in

Then we would choose 'accept' as our replacement, giving the simplified sentence:

The workers accepted their boss' request.

This is a much more understandable sentence.
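The ranking step can be sketched as a sort over the same kind of score used for complex word identification. The scores below are invented for illustration only (they are not measured values), chosen so that a higher score means a more familiar word; the function name is also hypothetical.

```python
# INVENTED scores for the remaining candidates (illustration only):
# a HIGHER score means a more familiar, simpler word.
candidate_scores = {"accept": 400, "give in": 120, "cave in": 30}


def rank_by_simplicity(scores):
    """Order candidates from simplest (highest score) to most complex."""
    return sorted(scores, key=scores.get, reverse=True)


ranking = rank_by_simplicity(candidate_scores)
best = ranking[0]
print(ranking, best)
```

Under these assumed scores the top-ranked candidate is 'accept'. Substituting it back into the sentence also needs the correct inflection ('accepted') and the removal of 'to', which this sketch does not attempt.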

There are of course some nuances of the original meaning that are lost in this simplification, but this has to be accepted. The understandability of the sentence is dramatically increased.

My project is currently focusing on each of these stages individually. The hypothesis is that by examining and optimising each stage in turn, it will be possible to improve the final simplification. Work has already taken place on the first stages mentioned above, and work will continue on the rest.

There is much more to lexical simplification than the basic outline presented above, and readers wishing to know more should read the following publications:

Siobhan Devlin and John Tait. 1998. The use of a psycholinguistic database in the simplification of text for aphasic readers. Linguistic Databases, pages 161–173.

Or Biran, Samuel Brody, and Noémie Elhadad. 2011. Putting it simply: a context-aware approach to lexical simplification. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers - Volume 2, HLT ’11, pages 496–501, Stroudsburg, PA, USA. Association for Computational Linguistics.

Stefan Bott, Luz Rello, Biljana Drndarevic, and Horacio Saggion. 2012. Can Spanish be simpler? LexSiS: Lexical simplification for Spanish. In COLING, pages 357–374.

S. M. Aluísio and C. Gasperin. Fostering digital inclusion and accessibility: the PorSimples project for simplification of Portuguese texts. In Proceedings of the NAACL HLT 2010 Young Investigators Workshop on Computational Approaches to Languages of the Americas, YIWCALA ’10, pages 46–53, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics.

L. Feng. Text simplification: A survey, 2008.

L. Specia, S. K. Jauhar, and R. Mihalcea. SemEval-2012 task 1: English lexical simplification. In First Joint Conference on Lexical and Computational Semantics, 2012.
