Wednesday, May 29, 2013

Identifying Complex Words

The very first step in lexical simplification is to identify complex words (CWs).  This is the process of scanning a text and picking out the words which may cause a reader difficulty.  Getting this process right is important, as it is the first stage in the simplification pipeline.  Hence, any errors incurred at this stage will propagate through the rest of the pipeline, resulting in user misunderstanding.

How do we define a CW?

In my previous blog post, I gave several factors that come together to form lexical complexity.  Lexical complexity values can be inferred using the metrics given there.  Typically, word frequency is either used by itself or combined with word length to give a continuous scale on which complexity may be measured.  We can then use this scale to define and identify our CWs as described below.
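As a concrete illustration, a frequency-plus-length score might look something like the sketch below.  This is only a toy sketch: the frequency file name, the normalisation constants and the weighting are all illustrative assumptions rather than a fixed recipe.

```python
import math

def load_frequencies(path="frequency_list.txt"):
    """Load counts from a hypothetical 'word<TAB>count' file (one entry per line)."""
    freqs = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, count = line.rstrip("\n").split("\t")
            freqs[word.lower()] = int(count)
    return freqs

def complexity(word, freqs, alpha=0.7):
    """Combine rarity and length into one score; higher means more complex."""
    count = freqs.get(word.lower(), 0)
    rarity = 1.0 / math.log(count + 2)     # rare or unseen words score highest
    length = min(len(word) / 15.0, 1.0)    # crude normalisation against a 'long' word
    return alpha * rarity + (1 - alpha) * length
```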

How do we identify them?

There are a few different methods in the literature for actually identifying CWs.  I have written a paper discussing and evaluating these, which is referenced at the end of this section.  For now, I'll just give a brief overview of each technique - but please do see the paper for a more in-depth analysis.
  1. The most common technique unsurprisingly requires the least effort.  It involves attempting to simplify every word and doing so wherever possible.  The drawback of this technique is that the CWs are never explicitly identified.  This means that difficult words which can't be simplified (e.g. because there is no simpler alternative) won't be.  It also means that words which are not causing a barrier to understanding may be modified, potentially resulting in error.
  2. Lexical complexity (as explained above) can be used to determine which words in a given sentence are complex.  To do this, a threshold value must be established, above which a word is considered complex.  Selecting a lexical complexity measure which discriminates well is very important here (a sketch of this approach follows the list below).
  3. Machine learning may also be used to some effect.  Typically, Support Vector Machines (SVMs, a type of statistical classifier) have been employed for this task.  Lexical and syntactic features may be combined to give an adequate classifier for this task (sketched further below).
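For technique 2, a minimal sketch of threshold-based identification might look like the following.  The scorer passed in and the threshold value are illustrative only and would need tuning on held-out data.

```python
import re

def identify_cws(sentence, scorer, threshold=0.5):
    """Return the words whose complexity score exceeds the threshold."""
    words = re.findall(r"[A-Za-z']+", sentence)
    return [w for w in words if scorer(w) > threshold]

# Toy usage: a length-only scorer standing in for a real complexity measure.
toy_scorer = lambda w: len(w) / 15.0
print(identify_cws("The committee reached a unanimous accord", toy_scorer))
# With this toy scorer and threshold: ['committee', 'unanimous']
```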
I am soon to publish a comparison of the above techniques at ACL-SRW 2013.  I will put a link up to that paper here when it is available.
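For technique 3, a rough sketch of how such a classifier might be trained with scikit-learn is given below.  The features (word length, a crude syllable proxy and a frequency count) and the toy training data are purely illustrative assumptions; a real system would need far more data and richer lexical and syntactic features.

```python
import re
from sklearn.svm import SVC

def features(word, freqs):
    """Length, a rough syllable count and corpus frequency as a feature vector."""
    syllables = len(re.findall(r"[aeiouy]+", word.lower()))
    return [len(word), syllables, freqs.get(word.lower(), 0)]

def train_cw_classifier(labelled_words, freqs):
    """labelled_words: list of (word, label) pairs, label 1 = complex, 0 = simple."""
    X = [features(w, freqs) for w, _ in labelled_words]
    y = [label for _, label in labelled_words]
    return SVC(kernel="rbf").fit(X, y)

# Toy usage with invented counts and labels.
freqs = {"cat": 90210, "dog": 85000, "perambulate": 12, "obfuscate": 40}
clf = train_cw_classifier(
    [("cat", 0), ("dog", 0), ("perambulate", 1), ("obfuscate", 1)], freqs)
print(clf.predict([features("ameliorate", freqs)]))
```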


The CW corpus

To compare different techniques in CW identification, it is necessary to have an annotated corpus.  My solution to this was to extract sentences from Simple Wikipedia edit histories which had been simplified by a revising editor.  I have a separate paper submitted on this and will write more about it in a future post.  The corpus contains 731 sentences, each with one annotated CW, and can be used for fully automatic evaluation.  The data is available from the resources page.
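Purely as an illustration, evaluation against a corpus of this shape could be automated along the following lines.  The tab-separated file format and the 'return the single most complex word' interface are assumptions made for the sketch; see the resources page for the actual data format.

```python
def evaluate(identify_one, corpus_path="cw_corpus.tsv"):
    """Accuracy of identify_one(sentence) -> word against one gold CW per sentence."""
    correct = total = 0
    with open(corpus_path, encoding="utf-8") as f:
        for line in f:
            sentence, gold_cw = line.rstrip("\n").split("\t")
            total += 1
            if identify_one(sentence).lower() == gold_cw.lower():
                correct += 1
    return correct / total if total else 0.0
```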

User-dependent complexity

Complexity is a subjective measure and will vary from user group to user group, and even from user to user.  For example, take the case of a class of English language learners.  They will all have different levels of English proficiency and differing knowledge of English, based on their experience of it to date.  A language learner who has been on holiday to England several times may have different simplification needs to a language learner who has watched many films in English subtitled in their own language.  A language learner whose first language is Italian will find many English words to be similar to words in their own language; a learner whose first language is German may likewise find many words familiar.  However, German and Italian speakers will not find the same English words familiar.  It could even be hypothesised that words which an Italian speaker finds simple would need to be simplified for a German speaker, and vice versa.

E.g.

  German          English           Italian
  Verwaltung      Administration    Amministrazione
  Apfel           Apple             Mela

The above toy example shows how one language learner's simplification needs may differ from another's.  The German speaker will find the word 'Apple' familiar, yet struggle with 'Administration'; the Italian speaker will experience the reverse.

There is very little work on discerning the individual simplification needs of a user.  This is not a problem confined to language learning (although it may be seen there very clearly); it affects all spheres of text simplification.  A technique which could adapt to a user's needs, perhaps incorporating feedback from the user where appropriate, would go far - one speculative way of doing this is sketched below.
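To make the idea concrete, here is one entirely speculative sketch of per-user adaptation: a personal complexity threshold that is nudged whenever the user gives feedback on a word.  The class, the update rule and the step size are all invented for illustration and are not drawn from any published system.

```python
class AdaptiveThreshold:
    """Per-user complexity threshold, nudged by explicit user feedback."""

    def __init__(self, start=0.5, step=0.02):
        self.threshold = start
        self.step = step

    def is_complex(self, score):
        return score > self.threshold

    def feedback(self, score, understood):
        # If the user understood a word we would have flagged, raise the bar;
        # if they struggled with a word we would have left alone, lower it.
        if understood and self.is_complex(score):
            self.threshold += self.step
        elif not understood and not self.is_complex(score):
            self.threshold -= self.step
```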

1 comment:

  1. Yes, L1 affects what appears difficult. Still, we can use corpora to establish a common core of words that, regardless of difficulty, are needed for receptive competence. Productive competence, on the other hand, can be quite different. A speaker of a Romance language, while relatively weak on receptive competence, might well be able to produce perfectly meaningful and accurate (if stilted) sentences concerning complicated material.
