What’s the Difference Between Stemming and Lemmatization? - Ask Dr. Search
Last Updated Nov 2014
A Reader Asks:
What’s the Difference Between Stemming and Lemmatization? I see these terms everywhere, though stemming seems more common.
Editor’s Note: Both spellings, lemmatization and lemmatisation, appear in the literature. This question comes from a FAST Search customer, and FAST spells it with a “z”, so we will use that spelling here. Wikipedia uses an “s”, derived from “lemmatise”. In our opinion both are correct.
Dr. Search Responds:
A great question, thanks! First, I’d like to talk generally about what these terms mean:
As you suspect, they are related. Both have to do with getting words to match each other even when they are not in exactly the same form. For example, if I search for “house” and a document contains the word “houses”, most people would consider that a match we’d like the search engine to find, even though one is singular and the other is plural. This can be extended to match different forms of verbs, adjectives and adverbs, and even to synonyms and wildcard matching. At this extended level, people sometimes call it “fuzzy” matching, though that is not a standard term.
This sounds pretty simple, doesn’t it? It’s such a common practice these days that you might not even have noticed it. It turns out that doing this efficiently, for millions and millions of documents, and for all word forms, is a bit tricky.
But like everything in software, the devil is in the details. Here are some of the challenges, depending on the implementation:
- If an engine DOESN’T do it, users will likely be annoyed because they won’t find what they are looking for. (Techies would say it impacts “recall”.)
- If an engine does too much of it and starts matching everything under the sun, incorrect matches will be returned, and users will also be annoyed. For example, matching “sun” to “son” is probably too aggressive for a first search. (Techies would say it impacts “precision”.)
- There are many exceptions to the rules. The plural of “house” is “houses”, but the plural of “mouse” is “mice”. And although the suffix “ed” often indicates the past tense of a verb, sometimes it is simply part of the word itself. So “lifted” should match “lift”, but “breed” is not the past tense of “bre”.
- If a search engine handles multiple languages, it needs to apply completely different rules to each document depending on its language: for example, handling the “ing” in English verbs and the “ando” in Spanish verbs.
- It’s helpful to be able to turn on and off different types of rules for some queries, or combine them in different ways. For example, I might want to allow for singular and plural nouns to match, but not for synonyms. Or perhaps I’d like to give extra credit to an exact match, but I would also accept a partial match if there are no exact matches.
- Implementation complexity. High-end search engines do give developers this fine-grained level of control, but as you can imagine, fully utilizing it brings added configuration and coding.
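To make the recall and precision trade-off above concrete, here is a deliberately naive suffix-stripping stemmer. This is a toy sketch, not any vendor’s algorithm; the suffix list and length guard are arbitrary choices for illustration.

```python
# A deliberately naive suffix-stripping stemmer, to illustrate the
# precision/recall trade-offs discussed above. Toy sketch only.

def naive_stem(word):
    """Strip a few common English suffixes, with no exception handling."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Where it helps recall: different forms now reduce to the same stem.
print(naive_stem("houses"))   # house
print(naive_stem("lifted"))   # lift

# Where blind rules hurt precision: "ed" is part of the word itself,
# and irregular plurals are missed entirely.
print(naive_stem("breed"))    # bre   -- wrong: "breed" is not "bre" + "ed"
print(naive_stem("mice"))     # mice  -- never matched to "mouse"
```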
Now to your question on the difference between lemmatization and stemming:
Lemmatization implies a broader scope of fuzzy word matching that is still handled by the same subsystem. It suggests certain techniques for low level processing within the engine, and may also reflect an engineering preference in terminology.
FAST and Lucene documentation talk about lemmatization, whereas most other engines refer to this as stemming. Taking FAST as an example, their lemmatization engine handles not only basic word variations like singular vs. plural, but also thesaurus operators like having “hot” match “warm”.
This is not to say that other engines don’t handle synonyms; of course they do. But the low level implementation may live in a different subsystem than the one that handles base stemming.
A lemmatization system would handle matching “car” to “cars” along with matching “car” to “automobile”. In a more traditional search engine, matching “car” to “cars” would be handled by stemming, but matching “car” to “automobile” would be handled by a separate system.
With all that said, most business customers don’t need to worry about it too much. Your vendor will tend towards one term or the other, but as long as they offer the functionality of both, you’re probably OK. Be advised that, for some engines, stemming and thesaurus matching are extra cost options.
A few other questions you might ask the vendor:
Does it work for all parts of speech, or just nouns? Or can this be configured?
Is stemming done via rules or via a dictionary? Or perhaps a hybrid? A dictionary allows you to override certain words that rules would not cover, for example matching “mouse” and “mice”, and not taking the “ed” off the end of “breed”.
Verify that you can combine stemming and thesaurus operations in the same query.
Verify that they give you a standard thesaurus for your language, and that you can add your own custom terms if needed. And ask how this is done.
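The rules-versus-dictionary question above can be illustrated with a tiny hybrid sketch, where an exception dictionary is consulted before any suffix rule fires. The dictionary entries here are invented for illustration, not any vendor’s data.

```python
# Sketch of a rules-plus-dictionary hybrid stemmer: the exception
# dictionary wins over the suffix rules, handling irregular plurals
# ("mice") and protecting false suffixes ("breed").

EXCEPTIONS = {
    "mice": "mouse",    # irregular plural that rules would miss
    "breed": "breed",   # protect a word a blind "ed" rule would mangle
}

def hybrid_stem(word):
    if word in EXCEPTIONS:                   # dictionary checked first
        return EXCEPTIONS[word]
    for suffix in ("ing", "ed", "s"):        # fall back to simple rules
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(hybrid_stem("mice"))    # mouse
print(hybrid_stem("breed"))   # breed
print(hybrid_stem("houses"))  # house
```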
Fuzzy Matching Types:
I’d like to spell out some of the more common types of word matching options on the market. Note that some of these might not be handled by lemmatization, even for vendors who use that term.
- Basic stemming for nouns: cat/cats, mouse/mice, etc.
- Stemming for verbs: run/runs/running/ran
- Other parts of speech
- Thesaurus / synonym matching: dog/pooch/mutt
- Wildcard matching: dog* matches dogma.
- Advanced wildcard: *tion* matches “nation”, “definitions”, etc.
- Regular expression matching.
- Soundex / sound alike: Smith matches Smyth
- Typo, OCR and Levenshtein: lemmatization matches lemmetization and lemmitization
- Case sensitivity / casedex: control over whether “mark” matches “Mark”
- Accent normalization: ensure that resumé matches resume and résumé
- Abbreviations and their expanded form (a type of thesaurus)
- Variations of words with punctuation, such as hyphens. For example: Meta Data, meta-data and metadata.
- Entity normalization: Dates for example, so that 6/18/09 matches June 18th, 2009
In the above list, advanced operations like full wildcard matching, Levenshtein and entity matching would not be simple lemmatization subsystem activities; the engine would have to call on other subsystems and do a lot more work.
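For the curious, the Levenshtein matching listed above is based on edit distance, which can be computed with a classic dynamic-programming routine like this sketch. A match is typically accepted when the distance falls under a small threshold.

```python
# Classic dynamic-programming Levenshtein edit distance, the measure
# behind the typo/OCR matching listed above.

def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

print(levenshtein("lemmatization", "lemmetization"))  # 1
print(levenshtein("lemmatization", "lemmitization"))  # 1
print(levenshtein("smith", "smyth"))                  # 1
```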
Expansion and Reduction
Although you didn’t ask about this, there are a number of techniques that are used to handle stemming and lemmatization. Just in case a vendor starts throwing these terms around, I did want to mention them briefly. This is rather advanced, but something your local search guru should commit to memory.
There are three general ways to achieve matching on different word variations:
- Expand words put into the index to include all variations; this is called “index time expansion” or “document expansion”.
- Expand the query to include all variations; this is called “query expansion” or “runtime expansion”.
- Reduce all words to their base form when creating the index AND for each word in the query when running a search. Since this must be done at both index and query time, it is usually just called “reduction”.
It’s also possible to combine these methods, and there are some good reasons an engine might want to do so.
We’ll walk through these with some examples. Since reduction was actually more common in the early days of the search industry, we’ll work through that first.
Reduction is applied at both index and search time. All words are consistently reduced to their base form so that they will match each other.
Some houses have many mice. Our house only has one mouse.
Each word will be reduced to its base form before being stored into the fulltext search index, so we would have:
some house have many mouse. our house only have one mouse.
Notice that the nouns “houses” and “mice” are reduced to their base singular forms, and the verbs “have” and “has” are reduced to the base form “have”.
Query submitted by user:
how to get rid of mice
Reduced query submitted to search engine:
how to get rid of mouse
Notice that “mice” was reduced to “mouse”, and will match two of the indexed words for our sample document (the 2 instances of “mouse”).
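The reduction walk-through above can be sketched in a few lines of Python. The tiny lemma table is invented for illustration; real systems use large dictionaries and rule sets.

```python
# Toy illustration of reduction: the SAME lemma function is applied to
# document words at index time and to query words at search time, so
# different surface forms meet at a common base form.

LEMMAS = {"houses": "house", "mice": "mouse", "has": "have"}

def reduce_word(w):
    return LEMMAS.get(w, w)

def index_doc(text):
    """Store the reduced form of every word in the document."""
    return {reduce_word(w) for w in text.lower().split()}

doc = index_doc("Some houses have many mice")
query = [reduce_word(w) for w in "how to get rid of mice".split()]

# "mice" in the query was reduced to "mouse", which hits the index.
print([t for t in query if t in doc])  # ['mouse']
```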
The biggest advantage of this method is that the index size is not increased. Since disk space was very expensive in the 1980s and 90s, this was almost always how it was done. There were two disadvantages to this method, however.
- You couldn’t search for the other forms of the word, unless you created an additional index. In our above example, you would not be able to search for the literal word “mice”. In some advanced searches that level of control is nice to have as an option.
- If additional forms of the word need to be added, for example “rodent”, then the documents need to be completely reindexed. In our sample, a search for “rodent” would also be reduced to “mouse”, so the search would still work. But original documents containing the word “rodent” would not have that converted to “mouse” until they are reindexed, so would not be immediately findable. This would give inconsistent and confusing results.
Expansion at Index Time:
In this method additional forms of the words are written to the fulltext index when the document is indexed. The original source document itself is not modified, but additional entries in the index have the other word forms recorded as belonging to that document.
In the example below, we will assume a lemmatizing subsystem that also expands synonyms.
Some houses have many mice. Our house only has one mouse.
Each word will be expanded to include all variations before being stored into the fulltext search index, so we would have:
some few …
houses house home homes domicile …
have has had having possess contains …
many multiple lots dozens …
mice mouse rodent …
our my …
house houses home homes domicile …
only sole solely …
has have had having possess contains …
one single solitary unity …
In the example index above, the dots represent other word forms. I’ve also listed the original form of each word first; some engines keep track of which form was originally used, to maintain the option for doing an exact match.
Later, at search time, the query is submitted as is, without modification. Any variations to the search terms have already been accounted for in the index.
An example query of:
how to get rid of rodents
Would be submitted as is, completely unmodified, and it would match the document about the mouse because rodent was also recorded as belonging to that document.
Although this method does expand the fulltext index, it’s not as bad as it might initially look, due to the efficient way fulltext indexes are organized. For example, the word “house” appears on lines 2 and 7, but the actual string will only be stored once. Most search engines have a very efficient index structure where a word appears only once, followed by a numerical list of pointers indicating all the places it was seen. Techies might call this an “instance vector”, but we won’t go into that here.
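The index time expansion and instance-vector ideas can be sketched as follows. The expansion table is invented for illustration; each term is stored once, mapped to the positions where it (or a form of it) occurred.

```python
# Sketch of index time expansion: each document word is expanded at
# indexing time, and the index stores every term once with a list of
# positions (a toy "instance vector").

from collections import defaultdict

EXPANSIONS = {"houses": ["house"], "house": ["houses"],
              "mice": ["mouse", "rodent"], "mouse": ["mice", "rodent"]}

def build_index(text):
    index = defaultdict(list)          # term -> positions in document
    for pos, word in enumerate(text.lower().split()):
        for term in [word] + EXPANSIONS.get(word, []):
            index[term].append(pos)
    return index

index = build_index("Some houses have many mice")

# The query arrives unmodified; "rodent" still matches because the
# expanded forms were recorded at index time.
print(index["rodent"])  # [4]  -- the position of "mice"
print(index["house"])   # [1]  -- recorded from "houses"
```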
The other problem index time expansion shares with reduction is that if new word variations are added, the documents must be reindexed in order to fully reflect them.
Given the disadvantage of increased index size, plus the same reindexing problem reduction has, why would anybody use this method? Actually, we’ve already covered that. Notice in the above example that I said the query would be submitted unmodified; this is the key advantage. Index time expansion has two main benefits:
- 1: Queries can be submitted directly to the search engine, without modification.
- 2: Performance is on a par with reduction; the larger index can make the search a bit slower, but it’s still very good.
And in theory the index could be appended to if new word variations are added, to just add those new variations to existing documents, though in practice we don’t know of any engine that works in that way.
Expansion at Search Time:
In this method a document’s words are stored in the fulltext index unmodified, and no extra forms are recorded. The index size stays reasonably small and document indexing is relatively fast.
At search time, however, the users’ searches are expanded to include additional word forms.
Going back to our original example, suppose a user submits:
how to get rid of mice
The expanded query submitted to the search engine would be something like:
how method means … to get rid of remove eradicate exterminate … mice mouse rodent rodents…
The dots indicate other expanded terms. And I’ve cheated a bit here: the phrase “get rid of” is three separate words, and this theoretical query expander knew that I meant exterminate, eradicate, etc. Simpler query expanders would expand each word separately, so “get” might become “get, got, getting”, etc., and “rid” might become “rid, ridding”, etc. The same caveat applies equally to index time expansion and reduction.
You’ll notice that the number of query terms in the search has risen dramatically. Adding query terms can slow a search engine down because it has to do quite a bit more work: each expanded word must be looked up in the search engine’s fulltext index, and each lookup takes computing resources. The extent of this slowdown from expanding the query may be larger than the slowdown incurred by index time expansion. There is also some time needed to actually expand the query itself, though this is usually relatively efficient. Keep in mind, however, that index time expansion creates a larger search index and takes longer to index documents.
The one big advantage of query time expansion, and this can be quite significant, is that new synonyms or other word variations can be added at any time!
In our previous example, as soon as “rodent” is defined as a synonym for “mouse”, the very next query will be expanded to include the new term. So query time expansion does not require reindexing of documents.
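A minimal sketch of query time expansion shows how a synonym added at runtime affects the very next query, with no reindexing. The synonym table is invented for illustration.

```python
# Sketch of query time expansion: documents are indexed as-is, and each
# query term is expanded through a synonym table just before searching.

SYNONYMS = {"mice": ["mouse"], "mouse": ["mice"]}

def expand_query(query):
    terms = []
    for word in query.lower().split():
        terms.append(word)
        terms.extend(SYNONYMS.get(word, []))
    return terms

print(expand_query("get rid of mice"))
# ['get', 'rid', 'of', 'mice', 'mouse']

# Add "rodent" as a synonym at runtime: the very next query picks it
# up, and no document needed to be reindexed.
SYNONYMS["mice"].append("rodent")
print(expand_query("get rid of mice"))
# ['get', 'rid', 'of', 'mice', 'mouse', 'rodent']
```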
This is becoming a more popular option as search engines get faster. If they can handle the longer queries (from the expanded terms), then using this method means many fewer document reindexes. This is particularly important for search engine installations that make heavy use of thesaurus terms that are updated frequently.
Expansion techniques, at either index or query time, may not be appropriate for languages that have an inordinate number of word variations, and reduction might be preferred.
The Hybrid Approach: Combining Techniques
I mentioned earlier on that these various methods can be combined in some systems. This is done to minimize the performance compromises inherent in each method. In particular, word rules that almost never change can be stored in the fulltext index, whereas items that could change frequently are expanded only at search time.
In English, the word “mice” has been the plural form of “mouse” for quite some time! This is unlikely to change on some random Monday morning. Whereas in a rapidly evolving technical field, terminology and abbreviations can change rapidly, and perhaps the thesaurus needs monthly updates. In this scenario, the variations of mouse would be stored in the fulltext index, whereas abbreviations and technical terms would be expanded at query time.
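The hybrid approach can be sketched with stable word forms baked into the index and the volatile thesaurus applied only at query time. Both tables are invented for illustration.

```python
# Sketch of the hybrid approach: grammatical variants that rarely
# change are expanded at index time, while the frequently updated
# thesaurus is consulted only at query time.

STABLE_FORMS = {"mice": ["mouse"], "mouse": ["mice"]}   # rarely changes
THESAURUS = {"ps2": ["playstation"]}                    # updated often

def index_terms(text):
    terms = set()
    for word in text.lower().split():
        terms.add(word)
        terms.update(STABLE_FORMS.get(word, []))        # index time
    return terms

def expand_query(query):
    out = []
    for word in query.lower().split():
        out.append(word)
        out.extend(THESAURUS.get(word, []))             # query time
    return out

doc = index_terms("Connect a USB mouse to older computers")
query = expand_query("mice for ps2")
print([t for t in query if t in doc])  # ['mice']
```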
One final example to show this hybrid approach, and this may show my age. In the technology field back in the 1980s, a “ps 2” referred to IBM’s Personal System/2 line of computers, whereas now it generally refers to Sony’s gaming platform, the PlayStation 2.
Connect a USB mouse to older computers with a PS/2 adapter.
The index reflects additional forms of the words whose rules are not likely to change:
connect, connecting, connected, connects ….
USB, usb, U.S.B. …
mouse, mice …
… (some terms omitted) …
PS/2, ps/2, ps 2, ps2, ps-2
adapter, adapt, adapters, adapting, adaptive …
The index has captured most of the static word variations we might care about.
Now imagine a user query of:
mice for ps2
This would be expanded only to get thesaurus variations in place, so this is what would be sent to the engine:
mice rodent for ps2 IBM Personal System 2 /2 Sony Playstation Play Station 2 II two
This query will now match most variations, including word form, punctuation and synonyms.
The query sent to the engine captures both product names, along with variations on the number 2, but it also includes the spurious synonym “rodent”, which is likely not what the user is looking for. That last point is particularly interesting: this search will likely match some documents that mention the word “rodent”, and if a document has enough other pertinent words, it might rank high enough to show near the top of the results. As we said, overuse of fuzzy matching can annoy users. In a large enough document set, disparate contexts that happen to use the same words become much more frequent, so care should be taken when turning on fuzzy matching options.
Both stemming and lemmatization allow queries to match different forms of words. Stemming was commonly implemented with reduction techniques, though this is not universal. Lemmatization implies a possibly broader scope of functionality, which may include synonyms, though most engines support thesaurus-aided search in one form or another. Lemmatization also tends to be expansion based (either index or query time), though again this is not universal. Word variation rules that rarely change are often implemented with reduction or index time expansion, while rules that change periodically are best served by query time expansion.
The main reasons to care about this are:
- For advanced search applications, make sure your vendor offers different choices on how to handle these issues. Ideally a developer should be able to turn various fuzzy matching features on and off as needed.
- Index time expansion increases index size, and possibly indexing time.
- Query time expansion can affect search performance.
- Only query time expansion allows rule changes to be reflected immediately, without the need to reindex documents, allowing you to add thesaurus terms whenever you want.
- A reduction system may not allow you to find exact forms of the words, although this is not always true, depending on the vendor’s implementation.
- Some vendors charge extra for some advanced features. In fairness to them, these features may take more time to test and support. We believe basic stemming should be a minimum requirement.
Hope that’s clear. Keep those questions coming!