DA5. Query Types Comparison¶
Statement¶
- How can you determine which type of query to use?
- What are the differences between Boolean retrieval, wildcard queries, and phrase queries?
- What are some techniques that can improve the efficiency of computing scoring and ranking in search systems?
Solution¶
How can you determine which type of query to use?¶
The choice of a query type is application-dependent, that is, an application may support only boolean queries, while another may support phrase and boolean queries, and so on. It also depends on the user’s needs, query distribution, and the data (documents in the collection) itself.
A general web search engine would hide the complexity from the end-users, thus it only supports a search box for entering text input, and then a query parser would parse the input and generate multiple queries, such as boolean, phrase, wildcard, etc. The query parser would also determine which query type to use based on the input. Maybe only one of those queries would be used, but if the executed query generated less than the desired number of results, then the search engine would try to execute another query type, and so on (Manning et al., 2009, 1).
A highly specialized search engine, such as a search engine for a specific domain, may give the end-users more control over which type of query they need, and in such cases, the interface would be more complex, allowing for choosing the query type, operators, and any parameters that users are allowed to control.
Nonetheless, supporting any type of query would require the IR system to have the necessary components to support that query type, such as the necessary indexes, data structures, algorithms to process the query, and utilities to rank/filter the results.
What are the differences between Boolean retrieval, wildcard queries, and phrase queries?¶
Boolean queries check only the existence of a term within a document; thus it does not rank or score documents; it has only two possible outcomes: true=1 or false=0. It is also known as an exact match. It can be tuned by using logical operators such as AND, OR, NOT, etc. A simple inverted index is sufficient to support boolean queries (Manning et al., 2009, 1).
Wildcard queries are used to search for terms that match a pattern. It is also known as an approximate match. When a wild query is issued, the query parser must accommodate all possible scenarios and may issue multiple queries to the IR system. For example, if a query is with develop*
, then the query parser may issue multiple queries such as develop
, developed
, developer
, development
, etc (Manning et al., 2009, 1). It requires a special index that supports fast pattern matching, such as tree-like data structures that store terms in a way that allows for fast searching and retrieval. Algorithms such as edit distance, Levenshtein distance, and regular expressions may be used.
Phrase queries are used to search for terms that appear in a specific order. Scoring and ranking are necessary in answering such queries; as the window (the number of tokens that contain query terms) is an important factor in ranking documents. It requires a special index that supports fast phrase matching, and fast-computing of scores, such as positional indexes (Manning et al., 2009, 1).
What are some techniques that can improve the efficiency of computing scoring and ranking in search systems?¶
Here are some techniques that mentioned in (Manning et al., 2009, 1):
- Inexact top K document retrieval: Retrieves a set of documents that are likely to be in the top K documents (but not necessarily exactly the top K documents).
- Index elimination: Only consider documents with high scores and contain as many query terms as possible, and ignore the rest.
- Champion lists: For each term, maintain a short list of documents that are considered best matches for that term.
- Static quality scores and ordering: query-independent scores that measure the quality of a document, and can be used to order documents and alter the ranking of all terms in that document.
- Impact ordering: Instead of default ordering, terms are ordered by their tf(t,d) (term frequency in the document) or IDF. the query then would stop after processing the required K documents.
- Cluster pruning: Pre-assign documents into clusters, and then randomly pick a few documents from each cluster, and then process the rest of the documents in each cluster only if the required K documents are not retrieved.