Skip to content

JA2. Spelling Correction

Statement

Upon completion of this assignment, you will be able to demonstrate the use of eEdit distance and k-overlap grams to determine spelling corrections in the given text. You will further be able to determine the correct spelling (there is no case-sensitivity) for the words in a given document using an existing dictionary that has vocabulary terms.

Assignment instructions

  • Assume there exists a dictionary on local server with the following four vocabulary terms (note: this is a demo dictionary, while in an actual dictionary there are millions of words/terms):
    • Information
    • Jeopardy
    • Lost
    • Mount Everest
  • A sample document, Doc1, is also available on the server with incorrect spellings:
    • Doc1: At Mount Everest, 12 people were lost. This inforomation about jopardy raised fear.

Using the above data, answer the following questions.

  1. How can the given dictionary be helpful for spelling corrections in Doc1?
  2. Describe the approach used to correct the spellings of ‘information’ and ‘jeopardy’ in the document (Doc1) using Levenshtein distance and k-gram overlap (you may choose the value of k). Create the table, as shown in the Example Levenshtein distance computation to compute the edit distance using Levenshtein distance metric among the pair of strings.
  3. In this situation, the dictionary is sorted. If the dictionary is not sorted, what shall be its impact on the process chosen by you in Q2?

Solution

Spelling correction is an interesting topic, it is a process where a term is looked up in the terms index; and if the term is not found, then the term is term is marked as misspelled and the query processor decides to return an empty result set or correct the term and return the result set of the corrected term.

There are two approaches to spelling correction: the first is isolated term correction where the term is taken separately from the query and the second is contextual term correction where the entire query is taken as a unit and other terms in the query affect the spelling correction process.

The second one is more complex and involves collecting logging and history of query, and sometimes machine learning is involved. This assignment focuses on the first approach, that is, isolated term correction where the edit distance and k-grams proved to be useful.

Edit distance (also known as, Levenshtein distance) is a measure of similarity between two terms, and it is the minimum number of operations required to transform one term into the other. The allowed operations are: insertion, deletion, and substitution of one character at a time. The term with the smallest edit distance is considered to be the most similar to the query term and can be used as a spelling correction.

K-grams are the sequences of k characters in a term, where the term is split into overlapping sequences of k characters, and then run against the terms index; then the term with the highest number of overlapping k-grams with the query term is considered to be the most similar to the query term and can be used as a spelling correction.

In the example above, the term inforomation and jopardy has no appearance in the dictionary; but the question is: do we consider it as a new term and add it to the dictionary or consider it as misspelled and pass it to the correction method?

The answer to the previous question, we must consult a reference dictionary, that is, a dictionary that contains all the terms in a language; if the term does exist in the reference dictionary, it is considered as a new term and added to the dictionary. If the term does not exist in the reference dictionary, it is considered as misspelled and passed to the correction method.

Assuming that we did that, and both inforomation and jopardy are considered as misspelled, we can pass them to the correction method which will use the edit distance and k-grams to correct the spelling of the terms.

To actually correct the spelling of both terms we are going to do a two-step process:

  1. Use k-grams to find a subset of terms in the dictionary that are similar to the misspelled term and call it the candidate set.
  2. Run the edit distance on the candidate set to find the term with the smallest edit distance and consider it as the correct spelling.

The gram K=2 seems suitable for both examples; so we are going to use it.

Correcting inforomation

  • The term inforomation is split into overlapping sequences of 2 characters as the following gram set: in, nf, fo, or, ro, om, ma, at, ti, io, on.
  • Run each gram against the terms index (aka, our dictionary), and we find the following results:
    • ro, om are not found.
    • All other grams are found in information.
  • Thus the candidate set is: information.
  • Well, in this case, so we don’t need to run the edit distance, and we can consider information as the correct spelling.
  • In real life, the candidate set may contain more than one term, and we need to run the edit distance on all of them. Here is an example of the candidate set for the term inforomation:
* i n f o r m a t i o n
* 0 1 2 3 4 5 6 7 8 9 10 11
i 1 0 1 2 3 4 5 6 7 8 9 10
n 2 1 0 1 2 3 4 5 6 7 8 9
f 3 2 1 0 1 2 3 4 5 6 7 8
o 4 3 2 1 0 1 2 3 4 5 6 7
r 5 4 3 2 1 0 1 2 3 4 5 6
o 6 5 4 3 2 1 2 3 4 5 6 7
m 7 6 5 4 3 2 1 2 3 4 5 6
a 8 7 6 5 4 3 2 1 2 3 4 5
t 9 8 7 6 5 4 3 2 1 2 3 4
i 10 9 8 7 6 5 4 3 2 1 2 3
o 11 10 9 8 7 6 5 4 3 2 1 2
n 12 11 10 9 8 7 6 5 4 3 2 1

Each cell [i, j] in the table, represents the number of edits (edits distance), between the row[0,i] and the column[0,j]. For example, the cell [2, 3] represents the number of edits between the term in and the term inf which is 1 edit (insertion of f).

The last cell [11,12] (the very bottom right), contains the distance between the term inforomation and the term information which is 1 edit (deletion of o).

Correcting jopardy

  • The term jopardy is split into overlapping sequences of 2 characters as the following gram set: jo, op, pa, ar, rd, dy.
  • Run each gram against the terms index (aka, our dictionary), and we find the following results:
    • jo is not found.
    • The other grams are found in jeopardy.
  • The candidate set is: jeopardy.
  • Similarly, in this case, so we don’t need to run the edit distance, and we can consider jeopardy as the correct spelling.
* j e o p a r d y
* 0 1 2 3 4 5 6 7 8
j 1 0 1 2 3 4 5 6 7
o 2 1 2 1 2 3 4 5 6
p 3 2 3 2 1 2 3 4 5
a 4 3 4 3 2 1 2 3 4
r 5 4 5 4 3 2 1 2 3
d 6 5 6 5 4 3 2 1 2
y 7 6 7 6 5 4 3 2 1
  • The last cell [8,7] (the very bottom right), contains the distance between the term jopardy and the term jeopardy which is 1 edit (insertion of e).

Impact of unsorted dictionary

When the dictionary is sorted, while running any operation against that dictionary, depending on the data structure used to build the dictionary (has table, binary tree, B tree) the operation will go directly to the place where only terms starts with the same k-letters exist, and then exists early as soon as we passed the last key in that subset.

However, if the index is not sorted, a linear pass through the entire index is required to make sure that we all relevant terms have been processed. In complexity notation, the sorted index is O(log n) while the unsorted index is O(n).

References

  • Manning, C.D., Raghaven, P., & Schütze, H. (2009). An Introduction to Information Retrieval (Online ed.). Cambridge, MA: Cambridge University Press. https://nlp.stanford.edu/IR-book/pdf/03dict.pdf. Chapter 3: Dictionaries and tolerant retrieval.
  • Manning, C.D., Raghaven, P., & Schütze, H. (2009). An Introduction to Information Retrieval (Online ed.). Cambridge, MA: Cambridge University Press. https://nlp.stanford.edu/IR-book/pdf/04const.pdf. Chapter 4: Index Compression.