Lexical Simplification: May 2013

Friday, May 31, 2013

muthesis.cls

About this time last year, the first CDT cohort wrote their long reports. Some people, myself included, chose to use the university's LaTeX thesis class. It is a great resource for thesis writing, but it is not quite designed for end of year reports. To remedy this, I modified some of the style information in the muthesis.cls file to make it more appropriate for an end of year report. The changes made are as follows:

On the title page, the text 'A thesis submitted to the UoM for the degree of doctor of philosophy' has been changed to read 'An end of year report submitted to the UoM'
The word count at the end of the contents page was removed (this was a personal style choice)
On the Abstract page, the word 'thesis' now reads 'end of year report', as in the first point above. I also changed the ordering here from Title, Author, Specification to Title, Specification, Author.
The declaration was removed as this felt out of tone for an end of year report and is not required by the formal specification.

The file is available from the resources page. It will download as a tar archive, which can be unpacked using the command:

tar -xf eoy.tar.gz

Once this has been done, the LaTeX file should compile straight away using the command:

pdflatex EOY.tex

You can then view the resulting pdf in EOY.pdf

Hope this is a helpful resource in writing end of year reports. If you want a hand to modify the style file further then let me know. I'm happy to assist where I can.

Wednesday, May 29, 2013

Identifying Complex Words

The very first step in lexical simplification is to identify complex words (CWs). This is the process of scanning a text and picking out the words which may cause a reader difficulty. Getting this process right is important, as it is the first stage in the simplification pipeline. Any errors incurred at this stage will propagate through the pipeline, resulting in user misunderstanding.

How do we define a CW?

In my previous blog post, I gave several factors that come together to form lexical complexity. Lexical complexity values can be inferred, using the metrics given there. Typically, word frequency is either used by itself, or combined with word length to give a continuous scale on which complexity may be measured. We can then use this scale to define and identify our CWs as described below.

How do we identify them?

There are a few different methods in the literature for actually identifying CWs. I have written a paper discussing and evaluating these, which is referenced at the end of this section. For now, I'll just give a brief overview of each technique, but please do see the paper for a more in-depth analysis.

The most common technique, unsurprisingly, requires the least effort. It involves attempting to simplify every word and doing so where possible. The drawback of this technique is that the CWs are never explicitly identified. This means that difficult words which can't be simplified (e.g. because there is no simpler alternative) are left in place unnoticed. It also means that words which are not causing a barrier to understanding may be modified, potentially introducing errors.
Lexical complexity (as explained above) can be used to determine which words are complex in a given sentence. To do this, a threshold value must be established, which is used to indicate whether a word is complex. Selecting a lexical complexity measure which discriminates well is very important here.
Machine learning may also be used to some effect. Typically, Support Vector Machines (SVMs, a type of statistical classifier) have been employed for this task. Lexical and syntactic features may be combined to give an adequate classifier.
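
As a rough sketch of the thresholding approach, assume that complexity is approximated by a corpus frequency (lower frequency meaning more complex) and that a small set of labelled words is available. The function names and the toy training data below are hypothetical and are only meant to illustrate how a threshold might be set.

```python
def is_complex(frequency, threshold):
    """Flag a word as complex when its frequency falls below the threshold."""
    return frequency < threshold


def choose_threshold(labelled):
    """Pick the candidate threshold that classifies the most labelled words correctly.

    `labelled` is a list of (frequency, is_cw) pairs.
    """
    # Candidate thresholds: every observed frequency, plus one above the maximum
    candidates = sorted({freq for freq, _ in labelled})
    candidates.append(candidates[-1] + 1)

    def accuracy(threshold):
        correct = sum(is_complex(freq, threshold) == label for freq, label in labelled)
        return correct / len(labelled)

    # max() returns the first (lowest) threshold among those with the best accuracy
    return max(candidates, key=accuracy)


# Toy training data (hypothetical): (frequency, annotated as complex?)
training = [(5, True), (20, True), (40, True), (250, False), (300, False), (450, False)]
threshold = choose_threshold(training)
```

With real data the two classes will overlap, so no threshold will separate them perfectly, which is why the choice of complexity measure matters so much.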

I am soon to publish a comparison of the above techniques at the ACL-SRW 2013. I will put a link up to that paper here when it is available.

The CW corpus

To compare different techniques in CW identification, it is necessary to have an annotated corpus. My solution to this was to extract sentences from Simple Wikipedia edit histories which had been simplified by a revising editor. I have a separate paper submitted on this and will write more about it in a future post. The corpus contains 731 sentences, each with one annotated CW. This can be used for fully automatic evaluation. The data is available from the resources page.

User-dependent complexity

Complexity is a subjective measure and will vary from user group to user group and even from user to user. For example, take the case of a class of English language learners. They will all have different levels of English proficiency and will have differing knowledge of English, based on their experience of it to date. A language learner who has been on holiday to England several times may have different simplification needs to a language learner who has watched many films in English, subtitled in their own language. A language learner whose first language is Italian will find many English words similar to words in their own language, and so will a learner whose first language is German. However, German and Italian speakers will not find the same English words familiar. It could even be hypothesised that words which an Italian speaker found simple would need to be simplified for a German speaker, and vice versa.

E.g.

German        English           Italian
Verwaltung    Administration    Amministrazione
Apfel         Apple             Mela

The above toy example shows how one language learner's simplification needs may differ from another's. The German speaker will find the word 'Apple' familiar, yet struggle with 'Administration', whereas the Italian speaker will experience the reverse.

There is very little work on discerning the individual simplification needs of a user. This problem is not confined to language learning (although it may be seen there very clearly); it affects all spheres of text simplification. A technique which could adapt to a user's needs, maybe incorporating feedback from the user where appropriate, would go far.

Friday, May 17, 2013

Lexical Complexity

Lexical simplification often requires some method of determining a word's complexity. At first glance, this sounds like an easy task. If I asked you to tell me which word is simpler, 'sit' or 'repose', you would probably tell me that the first was easier. However, if I asked you to say why, you might find it more difficult to explain.

Many factors influence the complexity of a word. In this post, I will identify six key factors: Length, Morphology, Familiarity, Etymology, Ambiguity and Context. This is not an exhaustive list, and I'm sure other factors contribute too. I have also mentioned how to measure these where appropriate.

1. Length

Word length, measured in either characters or syllables, is a good indicator of complexity. Longer words require the reader to do more work, as they must spend longer looking at the word and discerning its meaning. In the (toy) example above, 'sit' is 3 characters and 1 syllable whereas 'repose' is 6 characters and 2 syllables.
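
Both length measures can be approximated in a few lines. The sketch below is only illustrative: counting characters is exact, but the syllable count is a crude heuristic (counting groups of vowels and discounting a final silent 'e') that will be wrong for plenty of English words. The function names are my own.

```python
import re


def count_characters(word):
    """Length of the word in characters."""
    return len(word)


def estimate_syllables(word):
    """Rough syllable estimate: count vowel groups, discounting a final silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # A trailing 'e' is usually silent ('repose'), but not in '-le' endings ('table')
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    # Every word has at least one syllable
    return max(count, 1)


for word in ("sit", "repose"):
    print(word, count_characters(word), estimate_syllables(word))
```

A pronunciation dictionary would give more reliable syllable counts than this heuristic.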

Length may also affect the following two factors.

2. Morphology

Longer words tend to be made up of many parts, something referred to as morphological complexity. In English, many morphemes may be put together to create one word. For example, 'reposing' may be parsed as: re + pose + ing. Here, three morphemes come together to give a single word, the semantics of which are influenced by each part. Morphosemantics is outside the scope of this blog post (and probably this blog!) but let's just say that the more the reader understands about each part, the more they will understand the word itself. Hence, the more parts there are, the more complex the word will be.

3. Familiarity

The frequency with which we see a word is thought to be a large factor in determining lexical complexity. We are less certain about the meaning of infrequent words, so greater cognitive load is required to assure ourselves we have correctly understood a word in its context. In informal speech and writing (such as film dialogue or sending a text message), short words are usually chosen over longer words for efficiency. This means that we are in contact more often with shorter words than we are with longer words, which may explain in part the correlation between length and complexity.

Familiarity is typically quantified by looking at a word's frequency of occurrence in some large corpus. This was originally done for lexical simplification using Kucera-Francis frequency, which consists of frequency counts from the 1-million word Brown corpus. In more recent times, frequency counts from larger corpora have been employed. In my research I employ SUBTLEX (a word frequency count of subtitles from over 8,000 films), as I have empirically found this to be a useful resource.
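
To make the idea of corpus frequency concrete, here is a minimal sketch that counts word occurrences in a tiny stand-in corpus. The corpus string and the function name are hypothetical; a real system would read counts from a large resource such as those mentioned above rather than computing them from a few sentences.

```python
from collections import Counter
import re


def word_frequencies(text):
    """Count how often each lower-cased word occurs in the text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


# Tiny stand-in corpus (hypothetical), purely for illustration
corpus = "Please sit down. Sit here and rest. The cat will sit on the mat and repose."
frequencies = word_frequencies(corpus)

print(frequencies["sit"], frequencies["repose"], frequencies["acquiesce"])
```

A Counter returns 0 for words it has never seen, which is convenient here: unseen words are treated as maximally unfamiliar.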

4. Etymology

A word's origins and historical formations may contribute to its complexity, as meaning may be inferred from common roots. For example, the Latin word 'sanctus' (meaning holy) is at the etymological root of both the English words 'saint' and 'sanctified'. If the meaning of one of these words is known, then the meaning of the other may be inferred on the basis of their 'sounds-like' relationship.

In the above example, 'sit' is of Proto-Germanic (Anglo-Saxon) origin whereas 'repose' is of Latin origin. Words of Latin and Greek origin are often associated with higher complexity. This is due to a mixture of factors including the widespread influence of the Romans and the use of Latin as an academic language.

To date, I have seen no lexical complexity measures that take into account a word's etymology.

5. Ambiguity

Certain words have a high degree of ambiguity. For example, the word 'bow' has a different meaning in each of the following sentences:

The actors took a bow.
The bow-legged boy stood up.
I hit a bull's eye with my new carbon fibre bow.
The girl wore a bow in her hair.
They stood at the bow of the boat.

A reader must discern the correct interpretation from the context around a word. This can be measured empirically by looking at the number of dictionary definitions given for a word. According to my dictionary, sit has 6 forms as a noun and a further 2 as a verb, whereas repose has 1 form as a noun and 2 forms as a verb. Interestingly, sit is more complex by this measure.

6. Context

There is some evidence to show that context also affects complexity. For example, take the following sentences:

"The rain in Spain falls mainly on the ______"
"Why did the chicken cross the ______"
"To be or not to ___"
"The cat ____ on the mat"

In each of these sentences, you can easily guess the blank word (or failing that, use Google's auto complete feature). If we placed an unexpected word in the blank slot, then the sentence would require more effort from the reader. Words in familiar contexts are simpler than words in unfamiliar contexts. This indicates that a word's complexity is not a static notion, but is influenced by the words around it. This can be modelled using n-gram frequencies to check how likely a word is to co-occur with the words around it.
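
A minimal sketch of the n-gram idea follows, assuming trigram counts taken from a handful of made-up sentences. The corpus, function names and the simple relative-frequency estimate are all my own illustrative choices; real n-gram counts would come from a very large collection.

```python
from collections import Counter


def trigram_counts(sentences):
    """Count trigrams and the two-word prefixes that start them."""
    trigrams, prefixes = Counter(), Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i in range(len(tokens) - 2):
            prefixes[(tokens[i], tokens[i + 1])] += 1
            trigrams[tuple(tokens[i:i + 3])] += 1
    return trigrams, prefixes


def context_probability(word, left_context, trigrams, prefixes):
    """Estimate P(word | two preceding words) by relative frequency."""
    prefix = tuple(left_context)
    if prefixes[prefix] == 0:
        return 0.0  # context never seen, so no estimate is possible
    return trigrams[prefix + (word,)] / prefixes[prefix]


# Made-up corpus (hypothetical), purely for illustration
corpus = [
    "the cat sat on the mat",
    "the cat sat by the door",
    "the cat slept on the mat",
    "the cat reposed on the mat",
]
trigrams, prefixes = trigram_counts(corpus)
p_sat = context_probability("sat", ("the", "cat"), trigrams, prefixes)
p_reposed = context_probability("reposed", ("the", "cat"), trigrams, prefixes)
```

In this toy corpus 'sat' follows 'the cat' twice as often as 'reposed' does, so it would be treated as the more expected (and therefore simpler) word in that context.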

Summary

So, if we put those factors into a table it looks something like this:

Word	"sat"	"repose"
Length (characters)	3	6
Length (syllables)	1	2
Familiarity (frequency)	3383	29
Morphology (morphemes)	1	2
Etymology (origins)	Proto-Germanic	Latin
Ambiguity (senses)	8	3
Context* (frequency)	6.976	0.112

*source: Google n-grams value for query "the cat ____". Value is percentage occurrence and is multiplied by a factor of 10^7

We see that repose is more difficult in every respect except for the number of senses.

Lexical complexity is a hard concept to work with: it is often subjective and shifts from sense to sense and context to context. Any research into determining lexical complexity values must take into account the factors outlined here. The most recent work on determining lexical complexity is the SemEval 2012 task in lexical simplification. This is referenced below for further reading.

L. Specia, S. K. Jauhar, and R. Mihalcea. Semeval-2012 task 1: English lexical simplification. In First Joint Conference on Lexical and Computational Semantics, 2012

Wednesday, May 08, 2013

Lexical Simplification: Background

My PhD work is concerned with a process called lexical simplification. I'm interested in how natural language processing can be applied to make documents easier to read for everyday users. Lexical simplification specifically addresses the barriers to understandability provided by the difficult words in a text.

For example, take the following sentence:

The workers acquiesced to their boss' request.

It is fairly clear that the rarely used verb 'acquiesce' is going to cause understandability issues here. Do you know what it means? Maybe in a wider context you could guess the meaning; here, however, it is fairly difficult to work out. Lexical simplification deals with sentences such as the above and attempts to process them into more understandable forms. There are several stages to the lexical simplification pipeline. I intend to devote an entire post to each one of these as I continue; for now, an overview of each should be sufficient.

The first stage in any lexical simplification system is complex word identification. There are two main approaches to this. In the first, systems attempt to simplify every word: those for which simplifications can be found are transformed, and those which cannot be transformed are left. In the second, some form of thresholding is applied. There are various measures of lexical complexity, which often rely heavily on word frequency. A threshold may be applied to one of these measures to distinguish between complex and simple words. One of the major issues in this field is the lack of evaluation resources. I have a paper on this topic accepted at the ACL student session 2013, so will write more at that time.

Suppose we can obtain the following lexical complexity values:

worker:	300
acquiesce:	5
boss:	250
request:	450

If we also assume that our threshold (which is set on some training data) is somewhere between 5 and 250 then we have an indicator that 'acquiesce' is a difficult word.
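
The thresholding step for this example can be sketched as below. The complexity values are the ones given above; the threshold of 100 is an assumed value chosen from within the range described, and the function name is my own.

```python
# Frequency-style complexity values from the example above (lower = more complex)
complexity = {"worker": 300, "acquiesce": 5, "boss": 250, "request": 450}

# Assumed threshold; any value above 5 and no greater than 250 gives the same result
THRESHOLD = 100


def identify_complex_words(values, threshold):
    """Return the words whose complexity value falls below the threshold."""
    return [word for word, value in values.items() if value < threshold]


complex_words = identify_complex_words(complexity, THRESHOLD)
print(complex_words)
```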

The next step, once we have identified this complex word, is to generate a set of synonyms which could replace it. This is typically done with a thesaurus such as WordNet. This could give us some of the following replacements for 'acquiesce'.

Acquiesce: accept, accommodate, adapt, agree, allow, cave in, comply, concur, conform, consent, give in, okay, submit, yield

We must then process these to discover which will be valid replacements in the given context. This third step is called word sense disambiguation. This is necessary as a word will typically have several senses, so some replacements will only be valid in certain contexts. In the above example, a word sense disambiguation step may look something like the following:

Acquiesce: accept, ~~accommodate~~, ~~adapt~~, agree, ~~allow~~, cave in, comply, ~~concur~~, conform, ~~consent~~, give in, ~~okay~~, submit, yield

Words that are struck through are non-valid replacements; the remaining words would be valid. This is somewhat subjective, and word sense disambiguation remains an unsolved task in NLP.

The final step is to rank the resulting replacements in order of their simplicity. The simplest will then replace the original word. To do this we revisit our measure of lexical complexity from before. For example if we have the following values for the remaining candidates:

accept:	550
agree:	450
cave in:	250
comply:	35
conform:	50
give in:	350
submit:	40
yield:	20

Then we would choose 'accept' as our replacement, giving the simplified sentence:

The workers accepted their boss' request.

This is a much more understandable sentence.

Some nuances of the original meaning are of course lost in this simplification, but this has to be accepted in exchange for the increase in understandability.
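
The ranking step for this example can be sketched as follows, using the values listed above for the candidates that survived disambiguation. The function name is my own, and the sketch assumes (as in the example) that a higher value means a simpler word.

```python
# Values from the example above for the remaining candidates
candidate_scores = {
    "accept": 550,
    "agree": 450,
    "cave in": 250,
    "comply": 35,
    "conform": 50,
    "give in": 350,
    "submit": 40,
    "yield": 20,
}


def rank_candidates(scores):
    """Order candidates from simplest (highest value) to most complex."""
    return sorted(scores, key=scores.get, reverse=True)


ranking = rank_candidates(candidate_scores)
best = ranking[0]
print(best)
```

Note that this only picks the replacement word. Producing the final sentence also means inflecting it to match the original ('acquiesced' becomes 'accepted') and adjusting the surrounding words (dropping 'to'), which this sketch does not attempt.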

My project is currently focusing on each of these stages individually. The hypothesis is that by examining and optimising each stage in turn, it will be possible to improve the final simplification. Work has already taken place in the first stages mentioned above and work will continue on the rest.

There is much more to lexical simplification than the basic outline presented above and readers wishing to know more should look to read the following publications:

Siobhan Devlin and John Tait. 1998. The use of a psycholinguistic database in the simplification of text for aphasic readers. Linguistic Databases, pages 161–173.

Or Biran, Samuel Brody, and Noémie Elhadad. 2011. Putting it simply: a context-aware approach to lexical simplification. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers - Volume 2, HLT '11, pages 496–501, Stroudsburg, PA, USA. Association for Computational Linguistics.

Stefan Bott, Luz Rello, Biljana Drndarevic, and Horacio Saggion. 2012. Can Spanish be simpler? LexSiS: Lexical simplification for Spanish. In COLING, pages 357–374.

S. M. Aluísio and C. Gasperin. Fostering digital inclusion and accessibility: the PorSimples project for simplification of Portuguese texts. In Proceedings of the NAACL HLT 2010 Young Investigators Workshop on Computational Approaches to Languages of the Americas, YIWCALA '10, pages 46–53, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics.

L. Feng. Text simplification: A survey, 2008.

L. Specia, S. K. Jauhar, and R. Mihalcea. Semeval-2012 task 1: English lexical simplification. In First Joint Conference on Lexical and Computational Semantics, 2012.

Monday, May 06, 2013

Me

I thought I would put up some information about myself. So here goes:

Name: Matthew Shardlow

Location: Manchester, United Kingdom

Occupation: PhD student, Graduate Teaching Assistant

Employer: University of Manchester

Supervisor: John McNaught

Co-Supervisor: Simon Harper

Project Title: Lexical Simplification

PhD Track: 4-year centre for doctoral training

Finish date: September 2015

Funding body: EPSRC grant no. EP/I028099/1

Project Description: Making difficult language easier to read by detecting and translating complex vocabulary into easy words.

Research Interests:

Text simplification
Complex word identification
Substitution generation
Word sense disambiguation
Lexical complexity
Large scale corpus linguistics
The use of Wikipedia as a corpus

About

The simplification of the lexicon is an important task at the boundary between natural language generation and assistive technology. It concerns the automatic replacement of complex wordforms with more easily accessible alternatives. Complexity is of course subjective and can be interpreted differently depending upon the text and the reader. At its broadest, the definition may be: 'Any word which reduces the reader's overall understanding of the text'.

This blog details the project outcomes of my PhD in Lexical Simplification (LS). It will serve both as an archive for previous work and as a platform for the promotion of ongoing research. I also intend to publish interesting data-sets that I create during the course of my research.

I intend to post here regularly, but not too often. Hopefully like-minded researchers will find the content on here of interest. If you do use any of the data published here or are inspired by the ideas promoted please drop me a line to encourage me!

Matt.
