
2. Dictionaries and Index Construction

Dictionaries and tolerant retrieval [1]

3.1 Search structures for dictionaries

  • Dictionary: the data structure that maps the vocabulary of terms to their postings lists; the two main search structures for it are hashing and search trees.
  • Hashing:
    • It is fast, but does not support partial (prefix) matches, and is not suitable for collections like the web where the keys (terms) keep changing.
    • Each term in the query is hashed separately, thus misspelled or accented words will generate entirely different hash values.
  • Search Trees:
    • Usually balanced binary trees or B-trees.
    • Binary tree: each node has at most two children. B-tree: each node has a number of children between a and b, for suitable constants a and b.
    • Re-balancing a binary tree on every insertion and deletion is expensive; B-trees mitigate this by allowing the number of children per node to vary within the interval [a, b].
    • B-trees are also useful as the dictionary can be stored partially on disk and partially in memory.
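As a minimal sketch of the contrast, the Python toy below (the `prefix_range` helper and the tiny vocabulary are illustrative, not from the source) shows the kind of range lookup a sorted search structure such as a B-tree supports and a hash table cannot — binary search on a sorted vocabulary stands in for the tree:

```python
import bisect

# Toy sorted vocabulary; a real dictionary would hold millions of terms.
vocab = sorted(["cast", "castle", "cat", "lemon", "melon", "moon"])

def prefix_range(prefix):
    # All terms starting with `prefix`, found by two binary searches --
    # the range lookup a search tree supports but a hash table cannot.
    lo = bisect.bisect_left(vocab, prefix)
    hi = bisect.bisect_left(vocab, prefix + "\uffff")
    return vocab[lo:hi]

print(prefix_range("ca"))
```

A hash of each whole term gives no way to enumerate "everything between ca and cb", which is exactly the operation trailing-wildcard queries need.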

3.2 Wildcard queries

  • Useful in: spelling correction, seeking documents that contain different forms of a word, and queries where the user is uncertain of the spelling.
  • Trailing wildcard query: a query with one wildcard at the end (e.g., mon*) => B-trees are suitable, since this is a prefix range lookup.
  • Leading wildcard query: a query with one wildcard at the beginning => reverse B-trees are suitable, where terms are indexed from right to left; thus, for *mon we start looking for n, then o, then m, and words like lemon (nomel when reversed) will be found.
  • General wildcard queries:
    • Can be handled using both normal and reverse B-trees.
    • Strategy:
      • Convert the wildcard query (qw) into Boolean query Q.
      • Run the Boolean query Q against a special index.
      • Answer to Q is QA, a superset of terms matching qw.
      • QA is then filtered to remove terms that do not match qw, to prepare a final query QF.
      • The QF is then run against the main index to get the final answer.
      • The final answer is returned to the user.
  • 3.2.1 Permuterm indexes:
    • This index uses a special character $ to mark the end of a term.
    • When generating the index vocabulary, a set of rotations of each term is generated, and added to the index.
    • Example: term$, erm$t, rm$te, m$ter, $term.
    • When a wildcard query is received, it is converted into the corresponding permuterm and then run against the permuterm index.
    • Example: looking for m*n will be converted to n$m*, and then run against the index.
    • Example of multiple wildcards: fi*mo*er will be converted to er$fi*, and then run against the index.
    • After that the resulting terms are filtered (e.g., for fi*mo*er, to those that actually contain mo), and then run against the main index.
    • Disadvantage: large index size due to the extra rotations added (the dictionary can be about ten times its normal size).
    • The postings of the rotation el$ contain all the terms that end with el (including camel and parcel).
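The rotation-and-prefix-lookup scheme above can be sketched in Python; the helper names and the tiny vocabulary are illustrative, and a linear scan over the rotations stands in for the B-tree prefix lookup a real system would use:

```python
def rotations(term):
    # All rotations of term + "$": term$ -> term$, erm$t, rm$te, m$ter, $term.
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

def build_permuterm(vocab):
    index = {}
    for term in vocab:
        for rot in rotations(term):
            index[rot] = term          # each rotation points back to its term
    return index

def wildcard(index, query):
    # A single-* query X*Y is rotated to Y$X*, i.e. a prefix lookup.
    x, y = query.split("*")
    prefix = y + "$" + x
    return sorted({term for rot, term in index.items() if rot.startswith(prefix)})

idx = build_permuterm(["man", "moon", "mean", "melon", "sun"])
print(wildcard(idx, "m*n"))
```

Here m*n becomes the prefix n$m, which matches the rotations n$ma (man), n$mea (mean), n$melo (melon), and n$moo (moon), but not n$su (sun).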
  • 3.2.2 K-gram indexes:
    • k-gram: a sequence of k characters.
    • Example: the word castle, has the following 3-grams: $ca, cas, ast, stl, tle, le$.
    • This index includes all the k-grams of each term in the vocabulary.
    • Thus, the postings list of the 3-gram stl contains all the terms that have stl in them (including castle).
    • Then the wildcard query is converted into k-grams, and run against the k-gram index.
    • The resulting terms are then filtered (a k-gram match is necessary but not sufficient), and run against the main index.
    • Disadvantage: processing is expensive.
    • Querying such an index is usually hidden behind advanced search functionality in search engines, and is not used by default.
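A minimal sketch of the k-gram approach, including the post-filtering step (the function names and the toy vocabulary are illustrative; a real index would keep its postings lists on disk):

```python
import re
from collections import defaultdict

def kgrams(term, k=3):
    # k-grams of the $-padded term: castle -> $ca, cas, ast, stl, tle, le$.
    padded = "$" + term + "$"
    return [padded[i:i + k] for i in range(len(padded) - k + 1)]

def build_kgram_index(vocab, k=3):
    index = defaultdict(set)
    for term in vocab:
        for g in kgrams(term, k):
            index[g].add(term)
    return index

def wildcard(index, vocab, query, k=3):
    x, y = query.split("*")                    # single-* query X*Y
    grams = []
    for part in ("$" + x, y + "$"):            # k-grams that avoid the *
        grams += [part[i:i + k] for i in range(len(part) - k + 1)]
    candidates = set(vocab)
    for g in grams:
        candidates &= index[g]                 # intersect k-gram postings
    # Post-filter: sharing all k-grams is necessary but not sufficient.
    pattern = re.compile(re.escape(x) + ".*" + re.escape(y) + r"\Z")
    return sorted(t for t in candidates if pattern.match(t))

vocab = ["retrieve", "red", "revive", "remove", "love"]
idx = build_kgram_index(vocab)
print(wildcard(idx, vocab, "re*ve"))
```

The query re*ve is answered by intersecting the postings of $re and ve$, then filtering out any candidate that does not really match the pattern.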

3.3 Spelling correction

  • We look at two main techniques for solving this problem: edit distance and k-gram overlap.
  • 3.3.1 Implementing spelling correction:
    • Two principles:
      • When correcting, choose the nearest term to the misspelled term. Proximity must be defined.
      • When two possible corrections are equally close, choose the more frequent one — either the one that occurs more frequently in the collection, or the one that users search for more often.
    • Spelling correction algorithm variants:
      • Retrieve documents containing both the original (misspelled) term and the corrected term.
      • Retrieve documents containing the corrected term, only if the original has no matches.
      • Retrieve documents containing the corrected term, only if the original has fewer matches than a threshold.
      • Do not retrieve documents containing the corrected term, but offer it as a query suggestion.
  • 3.3.2 Forms of spelling correction:
    • Two forms: isolated term correction and context-sensitive correction.
    • Isolated term correction: only one term is corrected at a time, edit distance and k-gram overlap are used in such correction.
  • 3.3.3 Edit distance:
    • Edit distance: the minimum number of edits required to transform one term into another.
    • Allowed edit operations: insert a char, delete a char, replace a char.
    • Edit distance is also known as Levenshtein distance.
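The edit distance can be computed with the standard dynamic-programming recurrence; this is a straightforward textbook implementation, not tuned for production use:

```python
def edit_distance(s, t):
    # dp[i][j] = minimum edits to turn s[:i] into t[:j] (Levenshtein).
    m, n = len(s), len(t)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                  # delete all of s[:i]
    for j in range(n + 1):
        dp[0][j] = j                  # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # delete a char
                           dp[i][j - 1] + 1,          # insert a char
                           dp[i - 1][j - 1] + cost)   # replace (or keep)
    return dp[m][n]

print(edit_distance("fast", "cats"))   # 3
```

For example, turning fast into cats takes three edits: replace f with c, replace s with t, and replace t with s.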
  • 3.3.4 K-gram indexes for spelling corrections:
    • The idea is to use the k-gram index to retrieve candidate terms that share many k-grams with the misspelled term.
    • Overlap is typically measured with the Jaccard coefficient between the two k-gram sets, keeping only candidates above a threshold.
    • Edit distance is then computed only on this small candidate set, rather than on the whole vocabulary, to find the closest terms.
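A small sketch of candidate retrieval by k-gram overlap, using bigrams and the Jaccard coefficient (the 0.3 threshold and the helper names are arbitrary choices for illustration):

```python
def kgrams(term, k=2):
    # Unpadded k-grams; "bord" -> {"bo", "or", "rd"} for k = 2.
    return {term[i:i + k] for i in range(len(term) - k + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b)

def suggest(vocab, misspelled, k=2, threshold=0.3):
    # Rank vocabulary terms by k-gram overlap with the misspelled term,
    # keeping only those above the (arbitrary) Jaccard threshold.
    query_grams = kgrams(misspelled, k)
    scored = [(jaccard(query_grams, kgrams(t, k)), t) for t in vocab]
    return [t for score, t in sorted(scored, reverse=True) if score >= threshold]

vocab = ["border", "boarder", "lord", "morbid", "sordid"]
print(suggest(vocab, "bord"))
```

For the misspelling bord, border scores 3/5, lord 2/4, and sordid 2/6, while boarder (2/7) and morbid (1/7) fall below the threshold and are dropped.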
  • 3.3.5 Context sensitive spelling corrections:
    • The idea is to use the context of the misspelled term to correct it.
    • The query is treated as a unit, rather than correcting each term separately.
    • That is, search logs of users’ past queries and find the logged query closest to the misspelled query.

3.4 Phonetic correction

  • Misspellings that arise because the user types a query that sounds like the target term.
  • Example: weather and wether or cat and kat.
  • The idea is to generate a phonetic hash for each term in the vocabulary, and then use it to find the closest term to the misspelled term.
  • Soundex is a phonetic hashing algorithm that is used to generate the phonetic hash; here are its steps:
    • Turn every term into a 4-character form (the first letter followed by three digits encoding consonant classes), and build an inverted index from these forms to the original terms.
    • Hash the query term using the same algorithm, and retrieve the terms that share its phonetic hash.
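The hashing step can be sketched with the simplified Soundex variant described in the IR textbook (it omits some refinements of the full Soundex standard, such as the special treatment of h and w):

```python
# Consonant classes for simplified Soundex.
CODES = {}
for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for ch in letters:
        CODES[ch] = digit

def soundex(term):
    term = term.lower()
    first = term[0].upper()
    # Map every remaining letter to a digit; a, e, i, o, u, h, w, y -> "0".
    digits = "".join(CODES.get(ch, "0") for ch in term[1:])
    deduped = []
    for d in digits:                      # collapse runs of the same digit
        if not deduped or deduped[-1] != d:
            deduped.append(d)
    code = "".join(d for d in deduped if d != "0")   # then drop the zeros
    return (first + code + "000")[:4]     # pad/truncate to 4 characters

print(soundex("Herman"))   # H655
```

Terms that sound alike hash to the same code, so a misspelled query can still retrieve its phonetic neighbors from the inverted index keyed on these codes.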

Index construction [2]

4.1 Hardware basics

  • Accessing memory is faster than accessing disk, thus we cache most frequently used data in memory.
  • Disk access is expensive, it involves: seek time (time for the head to move to the right cylinder), rotational delay (time till the exact position of the cylinder is under the reader), transfer time (transfer bytes from disk to memory).
  • Operating systems generally read and write entire blocks; thus, reading a single byte from disk can take as much time as reading the entire block. A buffer is the part of memory that holds a block while it is being read from or written to disk.
  • Data transfers (I/O) are handled by the system bus, so the CPU is free for other tasks while I/O is being done; thus compressing data on the CPU before writing, and decompressing after reading, is a good idea.

4.2 Blocked sort-based indexing BSBI

  • Each term is given a unique integer termID, and a mapping from term to termID is maintained, sorted by term.
  • The collection is then parsed into (termID, docID) pairs, which must be sorted to build the postings lists.
  • While doing the sorting, if the collection is large, we need external sorting algorithms that keep data partially in memory/disk while sorting.
  • The Blocked sort-based indexing algorithm (BSBI) is an external sorting algorithm.
  • It minimizes the number of disk seeks during the sorting process.
  • Steps:
    • Split the collection into blocks of equal size.
    • Save blocks to disk, and sort each block in memory separately, then write the sorted version back to the disk.
    • Merge the sorted blocks into a single sorted list.
  • It builds the terms index in one pass, and the postings lists in another pass.
  • Complexity: O(T log T) where T is the number of terms in the collection.
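The three steps above can be sketched with in-memory lists standing in for on-disk runs; `heapq.merge` performs the k-way merge of the sorted blocks (the function name and the toy pairs are illustrative):

```python
import heapq
import itertools

def bsbi(pairs, block_size=4):
    runs = []
    it = iter(pairs)
    while True:
        block = list(itertools.islice(it, block_size))  # read one block
        if not block:
            break
        runs.append(sorted(block))      # in-memory sort of the block; a real
                                        # system writes each sorted run to disk
    index = {}
    for term_id, doc_id in heapq.merge(*runs):   # k-way merge of sorted runs
        postings = index.setdefault(term_id, [])
        if not postings or postings[-1] != doc_id:   # skip duplicate docIDs
            postings.append(doc_id)
    return index

pairs = [(2, 1), (1, 1), (3, 2), (1, 2), (2, 3), (1, 3)]
print(bsbi(pairs))
```

Because each run is already sorted, the merge reads every run sequentially, which is what keeps the number of disk seeks low in the real algorithm.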

4.3 Single-pass in-memory indexing SPIMI

  • It is also an external sorting algorithm, but builds up the terms index and postings lists in one pass.
  • Posting lists have dynamic sizes while running SPIMI, thus may be written to disk multiple times, if memory is not enough.
  • It is faster and saves memory over BSBI.
  • Terms are sorted before each block is written to disk, so that the blocks can be merged efficiently into the final index.
  • Complexity: O(T)
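A toy sketch of SPIMI's one-pass accumulation, using a token count as a stand-in for the memory budget and in-memory lists in place of disk blocks (all names are illustrative):

```python
from collections import defaultdict

def spimi(token_stream, block_size=6):
    blocks, current, count = [], defaultdict(list), 0
    for term, doc_id in token_stream:
        postings = current[term]              # term is added on first occurrence
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)           # postings list grows dynamically
        count += 1
        if count == block_size:               # "memory" full: flush a sorted block
            blocks.append(dict(sorted(current.items())))
            current, count = defaultdict(list), 0
    if current:
        blocks.append(dict(sorted(current.items())))
    index = defaultdict(list)                 # final merge of the block files
    for block in blocks:
        for term, postings in block.items():
            index[term].extend(p for p in postings if p not in index[term])
    return dict(index)

tokens = [("the", 1), ("cat", 1), ("sat", 1), ("the", 2),
          ("dog", 2), ("ran", 2), ("the", 3), ("cat", 3)]
print(spimi(tokens))
```

Unlike BSBI, no global term-to-termID mapping is needed: each block's dictionary is built directly from the terms, and postings accumulate as tokens stream in.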

4.4 Distributed indexing

  • For large collections, one machine cannot do the indexing alone, thus we need to distribute the indexing process over multiple machines.
  • Where clusters of machines split up the work and each machine builds a part of the index.
  • Web search engines use distributed indexing.
  • The index can be split up between machines according to the terms or according to the documents (each machine indexes a subset of the documents).
  • Search engines prefer document-partitioned indexes.
  • MapReduce:
    • A general framework for distributed computing.
    • Any node in the cluster may fail at any point, thus the framework must be fault-tolerant.
    • A master node splits the work into smaller tasks, and then distributes the tasks to worker nodes or reassigns them if a worker node fails.
    • In general, MapReduce breaks a large computing problem into smaller parts by recasting it in terms of manipulation of key-value pairs.
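The key-value recasting can be illustrated with a toy map and reduce for inverted-index construction (function names are illustrative; a real MapReduce run would shard the intermediate pairs across many reducer machines):

```python
from collections import defaultdict

def map_phase(doc_id, text):
    # Mapper: parse one document split into (term, docID) key-value pairs.
    return [(term, doc_id) for term in text.split()]

def reduce_phase(key_values):
    # Reducer: gather all docIDs for each term into a sorted postings list.
    index = defaultdict(set)
    for term, doc_id in key_values:
        index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

# Two "mappers" work on different splits; one "reducer" sees all the pairs.
pairs = map_phase(1, "new home sales") + map_phase(2, "home sales rise")
print(reduce_phase(pairs))
```

If a worker fails, the master simply hands its split to another machine and reruns the task, which is why the decomposition into independent key-value manipulations makes the framework fault-tolerant.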

4.5 Dynamic indexing

  • Previous methods assume that the collection is static, and no new documents are added.
  • The simplest way to achieve this is to periodically reconstruct the index from scratch.
    • This is a good solution if the number of changes over time is small and a delay in making new documents searchable is acceptable
    • And if enough resources are available to construct a new index while the old one is still available for querying.
  • If there is a requirement that new documents be included quickly, one solution is to maintain two indexes:
    • A large main index and a small auxiliary index that stores new documents.
    • Auxiliary index is kept in memory, thus it can be searched quickly.
    • Both indexes are searched in parallel, and the results are merged.
    • Deletions are recorded in an invalidation bit vector; documents flagged there are filtered out of any results returned from the main index.
    • A document update is handled as a deletion followed by a re-insertion of the new version into the auxiliary index.
    • Periodically, the auxiliary index is merged into the main index.
  • Because of this complexity of dynamic indexing, some large search engines adopt a reconstruction-from-scratch strategy.
    • They do not construct indexes dynamically. Instead, a new index is built from scratch periodically.
    • Query processing is then switched to the new index, and the old index is deleted.
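The main-plus-auxiliary scheme described above can be sketched as a small class; the `deleted` set plays the role of the invalidation bit vector, and all names are illustrative:

```python
class DynamicIndex:
    def __init__(self, main):
        self.main = main        # large main index: term -> sorted docID list
        self.aux = {}           # small in-memory auxiliary index for new docs
        self.deleted = set()    # plays the role of the invalidation bit vector

    def add(self, doc_id, terms):
        for term in terms:
            self.aux.setdefault(term, []).append(doc_id)

    def delete(self, doc_id):
        self.deleted.add(doc_id)

    def search(self, term):
        # Query both indexes in parallel, merge, and filter deleted docs.
        hits = self.main.get(term, []) + self.aux.get(term, [])
        return sorted(d for d in set(hits) if d not in self.deleted)

    def merge(self):
        # Periodic merge of the auxiliary index into the main index.
        for term, docs in self.aux.items():
            merged = set(self.main.get(term, [])) | set(docs)
            self.main[term] = sorted(merged - self.deleted)
        self.aux = {}

idx = DynamicIndex({"cat": [1, 2], "dog": [2]})
idx.add(3, ["cat"])     # new document is searchable immediately
idx.delete(2)
print(idx.search("cat"))
```

Adding document 3 makes it searchable at once via the auxiliary index, while deleting document 2 filters it out of the main index's results before the periodic merge ever runs.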

References


  1. Manning, C.D., Raghavan, P., & Schütze, H. (2009). An Introduction to Information Retrieval (Online ed.). Cambridge, MA: Cambridge University Press. https://nlp.stanford.edu/IR-book/pdf/03dict.pdf. Chapter 3: Dictionaries and tolerant retrieval.

  2. Manning, C.D., Raghavan, P., & Schütze, H. (2009). An Introduction to Information Retrieval (Online ed.). Cambridge, MA: Cambridge University Press. https://nlp.stanford.edu/IR-book/pdf/04const.pdf. Chapter 4: Index construction.