How to Implement a Search Engine Part 1: Create Index

Ok, let’s start! We will implement a search engine that answers queries on Wikipedia articles. The project has two main parts: first, creating the index by going through the documents; and second, answering search queries using the index we created. We will then also add ranking, classification, compression, and duplicate detection mechanisms.

What is our corpus
Our corpus (document collection) is Wikipedia articles. To simplify our work and focus on the core algorithms, we won’t write the crawler to get the articles from Wikipedia. We will use a prebuilt file which contains approximately 50,000 articles from various fields ranging from computer science to arts. The structure of the file is as follows:

<page>
<id> pageID (an integer) </id>
<title> title of the page </title>
Contents of the article
Can span multiple lines
May contain non-alphanumeric characters and links
</page>
<page>
<id> another pageID (every pageID is distinct) </id>
<title> title of another page </title>
Body of the article
</page>

As we can see, the collection has an XML-like structure. Every article lies between <page> and </page> tags, and the pageID, title, and text are delimited by the corresponding tags. We will write our own routine to parse this structure using regular expressions, assuming that these special tags won’t appear in the body of the articles. However, the articles may contain non-alphanumeric characters and capital letters that we will pay special attention to while building the index. For example, we don’t want to index apple, Apple, and APPLE differently; they are all basically apple. Similarly, considering the grammatical structure of the language, we don’t want to index research, researches, and researching separately; they are all about research. When we search for one of these terms, we would expect results containing any of its variations. Additionally, we don’t want to index words such as ‘the’, ‘a’, and ‘an’ because they appear in almost every document and tell us very little about the document or the query. These very common words are called stop words.

Inverted Index
Chapters 1 and 2 of the Introduction to Information Retrieval book cover the basics of the inverted index very well. To summarize, an inverted index is a data structure that we build while parsing the documents that we are going to answer the search queries on. Given a query, we use the index to return the list of documents relevant for this query. The inverted index contains mappings from terms (words) to the documents that those terms appear in. Each vocabulary term is a key in the index whose value is its postings list. A term’s postings list is the list of documents that the term appears in. To illustrate with an example, if we have the following documents:
Document 1: Information Retrieval and Web Search
Document 2: Search Engine Ranking
Document 3: Web Search Course

Then the postings list of the term ‘web’ would be the list [1, 3], meaning the term ‘web’ appears in documents with IDs 1 and 3. Similarly the postings list of the term ‘search’ would be [1, 2, 3], and for the term ‘course’ the postings list would be [3]. We may want to keep additional information in the index such as the number of occurrences of the term in the whole collection (its term frequency), or the number of different documents that the term appears in (its document frequency), the positions of the term’s occurrences within a document etc. The amount of information we keep in our index will grow as we add more functionality to our search engine.
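To make the mapping concrete, here is a minimal sketch that builds such an index for the three example documents. It uses a naive lowercase-and-split tokenizer with no stemming or stop word removal, just to illustrate the term-to-postings structure:

```python
from collections import defaultdict

def build_index(docs):
    """Build a simple (non-positional) inverted index.

    docs maps document IDs to their text; terms are obtained by
    naive lowercasing and whitespace splitting for illustration.
    """
    index = defaultdict(list)
    for doc_id, text in sorted(docs.items()):
        seen = set()
        for term in text.lower().split():
            if term not in seen:  # record each document only once per term
                index[term].append(doc_id)
                seen.add(term)
    return dict(index)

docs = {
    1: "Information Retrieval and Web Search",
    2: "Search Engine Ranking",
    3: "Web Search Course",
}
index = build_index(docs)
print(index["web"])     # [1, 3]
print(index["search"])  # [1, 2, 3]
print(index["course"])  # [3]
```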

Query Types
So, what types of queries will our search engine answer? We will answer the types of queries that we use while doing search every day. Namely:
1) One Word Queries: queries that consist of a single word, such as movie or hotel.
2) Free Text Queries: queries that contain a sequence of words separated by spaces, such as godfather movie or hotels in San Francisco.
3) Phrase Queries: more advanced queries that again consist of a sequence of words separated by spaces, but typed inside double quotes, where we want the matching documents to contain the terms exactly in the specified order, such as “godfather movie”.

Parsing the Collection
While parsing the document collection we will decide which words will be the terms in the index. As mentioned above, we don’t want every word to be a term. So, while parsing the Wikipedia articles we will perform the following operations on each page in this order:
1) Concatenate the title and the text of the page.
2) Lowercase all words.
3) Get all tokens, where a token is a string of alphanumeric characters terminated by a non-alphanumeric character. The alphanumeric characters are defined to be [a-z0-9]. So, the tokens for the word ‘apple+orange’ would be ‘apple’ and ‘orange’.
4) Filter out all the tokens that are in the stop words list, such as ‘a’, ‘an’, ‘the’.
5) Stem each token using Porter Stemmer to finally obtain the stream of terms. Porter Stemmer removes common endings from words. For example the stemmed version of the words fish, fishes, fishing, fisher, fished are all fish.
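The pipeline above can be sketched as follows. This is an illustrative sketch that stops after step 4, with a tiny made-up stop word list; step 5 would pass each remaining token through an off-the-shelf Porter Stemmer (e.g. the one bundled in the workspace):

```python
import re

STOP_WORDS = {"a", "an", "the"}  # tiny illustrative stop word list

def get_terms(title, text):
    """Apply parsing steps 1-4 to one page (stemming omitted)."""
    content = title + " " + text                 # 1) concatenate title and text
    content = content.lower()                    # 2) lowercase
    tokens = re.findall(r"[a-z0-9]+", content)   # 3) alphanumeric tokens
    return [t for t in tokens if t not in STOP_WORDS]  # 4) drop stop words

print(get_terms("An Apple", "apple+orange is the APPLE"))
# ['apple', 'apple', 'orange', 'is', 'apple']
```

Note how ‘apple+orange’ splits into two tokens and the stop words ‘an’ and ‘the’ are dropped.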

Building the Inverted Index
The inverted index is the main data structure of our search engine. We will use a hash table (Python’s dictionary) to store the inverted index in memory. The reason is that we will perform lots of lookups (one for every term in the document) and add lots of keys (every term is a key), so we want these operations to be very efficient. Since hash tables have average O(1) lookup time and amortized O(1) insertion time, they are very suitable for our needs.

As we discussed earlier, every term will be a key in our dictionary, whose value is its postings list (the list of documents that the term appears in). However, we would like to keep one additional piece of information in the postings list: the positions of the term’s occurrences within the document. The reason is that we need positional information to answer phrase queries, because we want to check whether the terms in the query appear in the specified order. Without knowing the positions of the terms in a document, we can only check whether the query terms simply appear in it; to verify the order, we need their positions. So, for every occurrence of a term in a document, we will keep the occurrence position. For example, if we have the documents:
Document 1: web retrieval web search information
Document 2: search engine web ranking
Document 3: web search course information search

Now the postings list for the term ‘web’ is [ [1, [0, 2]], [2, [2]], [3, [1]] ]. Meaning, the term ‘web’ appears in document 1 at positions 0 and 2 (we start counting positions from 0), in document 2 at position 2, and in document 3 at position 1. The postings list of a term is a list of lists, where each inner list corresponds to a specific document: there is one list for every document that the term appears in, containing the document ID as its first element and the list of the term’s occurrence positions in that document as its second element. As another example, the postings list of the term ‘search’ is [ [1, [3]], [2, [0]], [3, [1, 4]] ], because ‘search’ appears in document 1 at position 3, document 2 at position 0, and document 3 at positions 1 and 4. If a term doesn’t appear in a document, its postings list simply doesn’t have an entry for that document. So, the postings list of the term ‘information’ is [ [1, [4]], [3, [3]] ].

We build the index as follows. First, we extract a page from the collection with our parsing routine. The page content is between <page> and </page> tags. Then we perform the operations listed in the section Parsing the Collection. As a result we have the list of terms in the document. Then we build the inverted index for the page in the format described above. But notice that since we are building an index for just the current page, the postings list of a term won’t be a list of lists. It will simply be a list where the first element is the document ID, and the second element is the list of positions the term appears in that document. Then we merge the index of the current page with the main inverted index, which is the index for the whole corpus. The merging is simple. For every term in the current page, we append its postings list to the postings list of that term in the main index (which is a list of lists as described above).

So, if we use the above example. First, we extract Document 1:
web retrieval web search information
Then we build the index for this document (note that for demonstration the terms are not stemmed; in the actual program a term would be stemmed by the Porter Stemmer before being added to the index, so the word ‘retrieval’ would be added as the term ‘retriev’):
{ ‘web’: [1, [0, 2]], ‘retrieval’: [1, [1]], ‘search’: [1, [3]], ‘information’: [1, [4]] }
Then we merge this dictionary with our main dictionary (which is currently empty because this is the first document in the collection). Our main dictionary becomes:
{ ‘web’: [ [1, [0, 2]] ], ‘retrieval’: [ [1, [1]] ], ‘search’: [ [1, [3]] ], ‘information’: [ [1, [4]] ] }
Note that the postings of a term in the current page index are added inside a list in the main index, because the main index postings lists are lists of lists, with one list for every document the term appears in.

Then we extract the second document:
search engine web ranking
and build the index for that document:
{ ‘search’: [2, [0]], ‘engine’: [2, [1]], ‘web’: [2, [2]], ‘ranking’: [2, [3]] }
And we merge the current page index with the main index, which simply means appending the current postings of each term to the postings list of the corresponding term in the main index. After merging, our main index becomes:
{ ‘web’: [ [1, [0, 2]], [2, [2]] ], ‘retrieval’: [ [1, [1]] ], ‘search’: [ [1, [3]], [2, [0]] ], ‘information’: [ [1, [4]] ], ‘engine’: [ [2, [1]] ], ‘ranking’: [ [2, [3]] ] }
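The walk-through above can be sketched in a few lines; as in the example, the terms are left unstemmed for demonstration:

```python
def index_page(doc_id, terms):
    """Build the positional index for a single page.

    Each term maps to [doc_id, [positions]].
    """
    page_index = {}
    for position, term in enumerate(terms):
        if term in page_index:
            page_index[term][1].append(position)
        else:
            page_index[term] = [doc_id, [position]]
    return page_index

def merge(main_index, page_index):
    """Append each page posting to the term's list of lists in the main index."""
    for term, posting in page_index.items():
        main_index.setdefault(term, []).append(posting)

main_index = {}
merge(main_index, index_page(1, "web retrieval web search information".split()))
merge(main_index, index_page(2, "search engine web ranking".split()))
print(main_index["web"])     # [[1, [0, 2]], [2, [2]]]
print(main_index["search"])  # [[1, [3]], [2, [0]]]
```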

We continue like this and build our main inverted index for the whole collection. Our query answering program will use this inverted index. So, we should save this index to a file because our query index program will be separate. First the create index program will run and write the index to a file. Then the query index program will execute by reading the index from the file and answering the search queries using that index. So, we should decide a format for saving the index to the file.

The index is stored as text in the following format:
Every line of the file contains a separate term. The line starts with the term, followed by the character ‘|’ separating the term from its postings list. The postings list has the following form: the first document ID containing the term, a colon, the positions of the term in that document separated by commas, a semicolon, the second document ID containing the term, a colon, its comma-delimited positions, and so on. Using the above example, the term ‘web’ with postings list [ [1, [0, 2]], [2, [2]] ] would be saved as:

web|1:0,2;2:2
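As a sketch, the on-disk format described above can be serialized and parsed back with two small helper functions (the function names are illustrative, not taken from the actual source code):

```python
def posting_to_line(term, postings):
    """Serialize one term and its postings list to the on-disk format:
    term|docID:pos,pos,...;docID:pos,...
    """
    parts = []
    for doc_id, positions in postings:
        parts.append(str(doc_id) + ":" + ",".join(str(p) for p in positions))
    return term + "|" + ";".join(parts)

def line_to_posting(line):
    """Parse one line back into (term, postings list)."""
    term, _, rest = line.partition("|")
    postings = []
    for chunk in rest.split(";"):
        doc_id, _, pos = chunk.partition(":")
        postings.append([int(doc_id), [int(p) for p in pos.split(",")]])
    return term, postings

line = posting_to_line("web", [[1, [0, 2]], [2, [2]]])
print(line)  # web|1:0,2;2:2
```

Keeping the format round-trippable like this lets the separate query index program rebuild the in-memory dictionary from the file.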

We have two collections. A small test collection of size 40MB which contains around 5,000 documents, and a larger full collection of size 300MB containing around 40,000 documents. These collections are of course not representative of the web, but they are good for experimentation purposes. Let’s look at some statistics:

                                 Test Collection   Full Collection
Time without Stemming (min:sec)             0:23              3:14
Time with Stemming (min:sec)                1:56             15:16
Number of Documents                        5,155            41,141
Size of Collection (MB)                       38               301
Index Size (MB)                               25               201
Number of Terms                          145,005           616,723

Stemming is a very useful and important operation, but it increases the execution time by a factor of 5. The reason is that stemming is performed on every token and involves lots of checks. All statistics except the first row are reported with stemming performed.

The distribution of terms in a collection generally follows Zipf’s Law, which states that the frequency of a word is inversely proportional to its rank. So, the most frequent word will appear twice as often as the second most frequent word, five times as often as the fifth most frequent word, and so on. Let’s verify this by plotting term frequencies in the full collection:

The graph is a log-log plot. The x-axis is the rank of the term (the first is the most frequent term, the second is the second most frequent, etc.), and the y-axis is the collection frequency of that term (its number of occurrences in the full collection). The term distribution approximately follows Zipf’s Law. Even though the model is not a perfect fit, it’s good enough to be considered a useful model.
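As a quick illustration of the rank-frequency relationship, here is a sketch on a small synthetic corpus whose frequencies follow Zipf’s Law exactly; under the law, rank times frequency stays roughly constant:

```python
from collections import Counter

# Hypothetical corpus where term frequencies follow Zipf's law exactly:
# the i-th most frequent term occurs 600 // i times.
terms = ["the", "of", "and", "to", "in", "search"]
corpus = []
for rank, term in enumerate(terms, start=1):
    corpus.extend([term] * (600 // rank))

counts = Counter(corpus)
ranked = counts.most_common()  # [(term, freq), ...] sorted by frequency

# Under Zipf's law, rank * frequency is roughly constant.
for rank, (term, freq) in enumerate(ranked, start=1):
    print(rank, term, freq, rank * freq)
```

On real text the product rank * frequency drifts rather than staying exactly constant, which is why the log-log plot above is only an approximate straight line.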

Source Code
I wrote the code in Python because it’s both easily understood and widely used. Here is the source code. If you want to test it yourself, here is the workspace (size: 10MB). It contains the compressed test collection, the stopwords file, the Porter Stemmer, and the source code.

Next Steps
So, this was our create index program. Next we will write the query index program that answers search queries using the index we just built. Then we will add ranking to our search engine: we will implement the tf-idf (term frequency – inverse document frequency) ranking scheme and PageRank. Then we will add classifiers, namely Naive Bayes, Support Vector Machine (SVM), and Rocchio classifiers. We will also compress the index stored on disk using delta compression with variable length encoding.

