Many factors influence the complexity of a word. In this post, I will identify six key factors: Length, Morphology, Familiarity, Etymology, Ambiguity and Context. This is not an exhaustive list, and I'm sure other factors contribute too. I have also mentioned how to measure each factor where appropriate.
1. Length
Word length, measured in either characters or syllables, is a good indicator of complexity. Longer words require the reader to do more work, as they must spend longer looking at the word and discerning its meaning. In the (toy) example above, sit is 3 characters and 1 syllable whereas repose is 6 characters and 2 syllables.
Length may also affect the following two factors.
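As a rough illustration of the two length measures, here is a minimal Python sketch. The function names are my own, and the syllable counter is only a vowel-group heuristic (an assumption for illustration, not a proper syllabifier), so it will be wrong for plenty of English words.

```python
import re

def count_characters(word: str) -> int:
    """Length in characters."""
    return len(word)

def estimate_syllables(word: str) -> int:
    """Rough heuristic: count vowel groups, discounting a silent final 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # A final 'e' after a consonant is usually silent ("repose"),
    # but not in "-le" endings ("table").
    if (count > 1 and word.endswith("e") and not word.endswith("le")
            and word[-2] not in "aeiouy"):
        count -= 1
    return max(count, 1)

for w in ("sit", "repose"):
    print(w, count_characters(w), estimate_syllables(w))
```

For the two toy words this gives 3 characters and 1 syllable for sit, and 6 characters and 2 syllables for repose.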
2. Morphology
Longer words tend to be made up of many parts - something referred to as morphological complexity. In English, many morphemes may be put together to create one word. For example, 'reposing' may be parsed as: re + pose + ing. Here, three morphemes come together to give a single word, the semantics of which are influenced by each part. Morphosemantics is outside the scope of this blog post (and probably this blog!) but let's just say that the more the reader understands about each part, the more they will understand the word itself. Hence, the more parts there are, the more complex the word will be.
3. Familiarity
The frequency with which we see a word is thought to be a large factor in determining lexical complexity. We are less certain about the meaning of infrequent words, so greater cognitive load is required to assure ourselves we have correctly understood a word in its context. In informal speech and writing (such as film dialogue or text messages) short words are usually chosen over longer words for efficiency. This means that we are in contact more often with shorter words than we are with longer words, which may explain in part the correlation between length and complexity.
Familiarity is typically quantified by looking at a word's frequency of occurrence in some large corpus. This was originally done for lexical simplification using Kucera-Francis frequency, which consists of frequency counts from the 1-million-word Brown corpus. In more recent times, frequency counts from larger corpora have been employed. In my research I employ SUBTLEX (a word frequency count of subtitles from over 8,000 films), as I have empirically found this to be a useful resource.
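To make the idea of corpus frequency concrete, here is a minimal sketch that counts word occurrences and normalises them to a per-million rate. The corpus is a made-up toy string, not SUBTLEX or the Brown corpus, and the function name and tokenisation rule are assumptions for illustration only.

```python
from collections import Counter
import re

def frequency_per_million(corpus_text: str) -> dict[str, float]:
    """Map each word to its occurrences per million tokens of the corpus."""
    tokens = re.findall(r"[a-z']+", corpus_text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count * 1_000_000 / total for word, count in counts.items()}

# Toy corpus (hypothetical); a real study would load a large resource instead.
toy_corpus = "please sit down . sit here and sit still while the others repose"
freqs = frequency_per_million(toy_corpus)

# The more familiar word gets the higher frequency.
print(freqs["sit"] > freqs["repose"])
```

Normalising to a per-million rate is a common convention because it lets counts from corpora of different sizes be compared.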
4. Etymology
A word's origins and historical formations may contribute to its complexity, as meaning may be inferred from common roots. For example, the Latin word 'sanctus' (meaning holy) is at the etymological root of both the English words 'saint' and 'sanctified'. If the meaning of one of these words is known, then the meaning of the other may be inferred on the basis of their 'sounds-like' relationship.
In the above example, 'sit' is of Proto-Germanic Anglo-Saxon origin whereas 'repose' is of Latin origin. Words of Latin and Greek origin are often associated with higher complexity. This is due to a mixture of factors including the widespread influence of the Romans and the use of Latin as an academic language.
To date, I have seen no lexical complexity measures that take into account a word's etymology.
5. Ambiguity
Certain words have a high degree of ambiguity. For example, the word 'bow' has a different meaning in each of the following sentences:
The actors took a bow.
The bow-legged boy stood up.
I hit a bull's eye with my new carbon fibre bow.
The girl wore a bow in her hair.
They stood at the bow of the boat.
A reader must discern the correct interpretation from the context around a word. Ambiguity can be measured empirically by counting the number of dictionary definitions given for a word. According to my dictionary, sit has 6 forms as a noun and a further 2 as a verb, whereas repose has 1 form as a noun and 2 forms as a verb. Interestingly, sit is more complex by this measure.
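The sense-counting measure can be sketched in a few lines of Python. The small lookup table below simply hard-codes the dictionary figures quoted above; the names SENSES and ambiguity are hypothetical, and a real system would query an actual dictionary or lexical database rather than a hand-written table.

```python
# Sense counts per part of speech, copied from the dictionary figures above.
SENSES = {
    "sit": {"noun": 6, "verb": 2},
    "repose": {"noun": 1, "verb": 2},
}

def ambiguity(word: str) -> int:
    """Total number of listed senses across all parts of speech (0 if unknown)."""
    return sum(SENSES.get(word, {}).values())

for w in ("sit", "repose"):
    print(w, ambiguity(w))
```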
6. Context
There is some evidence to show that context also affects complexity. For example, take the following sentences:
"The rain in Spain falls mainly on the ______"
"Why did the chicken cross the ______"
"To be or not to ___"
"The cat ____ on the mat"
In each of these sentences, you can easily guess the blank word (or failing that, use Google's autocomplete feature). If we placed an unexpected word in the blank slot, then the sentence would require more effort from the reader. Words in familiar contexts are simpler than words in unfamiliar contexts. This indicates that a word's complexity is not a static notion, but is influenced by the words around it. This can be modelled using n-gram frequencies to check how likely a word is to co-occur with the words around it.
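As a rough sketch of the n-gram idea, the following Python estimates how likely a word is to follow the previous word, using bigram counts from a tiny made-up corpus. The corpus, the function names and the choice of bigrams (rather than longer n-grams or smoothed estimates) are all assumptions for illustration.

```python
from collections import Counter

def train_bigrams(sentences):
    """Count how often each word appears as a left context, and each bigram."""
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        contexts.update(tokens[:-1])              # words that have a successor
        bigrams.update(zip(tokens, tokens[1:]))   # adjacent word pairs
    return contexts, bigrams

def context_score(prev_word, word, contexts, bigrams):
    """Relative frequency of `word` after `prev_word`; 0.0 if the context is unseen."""
    if contexts[prev_word] == 0:
        return 0.0
    return bigrams[(prev_word, word)] / contexts[prev_word]

# Toy corpus (hypothetical); real n-gram counts come from very large corpora.
toy_corpus = [
    "the cat sat on the mat",
    "the cat sat on the sofa",
    "the cat slept on the mat",
]
contexts, bigrams = train_bigrams(toy_corpus)
print(context_score("cat", "sat", contexts, bigrams))      # expected word
print(context_score("cat", "reposed", contexts, bigrams))  # unexpected word
```

The expected word scores higher than the unexpected one, which scores zero here because it never follows 'cat' in the toy corpus. With real data, some form of smoothing would usually be needed so that unseen combinations are not all treated as equally impossible.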
Summary
So, if we put those factors into a table it looks something like this:
Word | "sat" | "repose" |
---|---|---|
Length (characters) | 3 | 6 |
Length (syllables) | 1 | 2 |
Familiarity (frequency) | 3383 | 29 |
Morphology (morphemes) | 1 | 2 |
Etymology (origins) | Proto-Germanic | Latin |
Ambiguity (senses) | 8 | 3 |
Context* (frequency) | 6.976 | 0.112 |
We see that repose is more difficult in every respect except for the number of senses.
Lexical complexity is a hard concept to work with: it is often subjective and shifts from sense to sense and context to context. Any research into determining lexical complexity values must take into account the factors outlined here. The most recent work on determining lexical complexity is the SemEval 2012 task on lexical simplification. This is referenced below for further reading.
L. Specia, S. K. Jauhar, and R. Mihalcea. SemEval-2012 Task 1: English Lexical Simplification. In First Joint Conference on Lexical and Computational Semantics, 2012.