Building a Search Engine

Motivation

Data is generated at a rapid pace, and as it grows it becomes harder to find what is relevant. A search engine indexes this data so that it can quickly return references to relevant data sources. The underlying data does not have to be web pages; it can be files, database records, web services, and so on. All we need is a way to refer to a data source, which we will call a document in the rest of this post.

Let’s build

Overview

The code is written in Python 3 and is available on GitHub: PrajitP/webSearchEngine

We will first collect the data that we want to build our search engine on (120 Wikipedia pages/documents), then build an index for that data, and finally build a ranking algorithm that selects the top n documents for a given search query.

NOTE: For simplicity, we are going to deal with a small dataset, so all our operations can be done in memory.

Let’s get the data

We are going to use the crawler built here.

from crawler import Crawler

def collect_data(outDir):
    crawler = Crawler(outDir = outDir)
    crawler.start(
        domain = 'https://en.wikipedia.org',                    # Explore pages with-in this domain
        seedLink = 'https://en.wikipedia.org/wiki/Database',    # Start exploring from this page
        dumpInterval = 10,                                      # Dump explored pages after every '10' exploration
        maxExplore = 120)

collect_data('crawlerOutput')

Let’s analyze the content of the generated output file

    • Content of file crawlerOutput/1
{'https://en.wikipedia.org/wiki/Computer_software': '<!DOCTYPE html>\n...'
 ...
 'https://en.wikipedia.org/wiki/Microsoft_SQL_Server': '<!DOCTYPE html>\n...'}

Index the data

The index will be a mapping from each word (also known as a term) to the list of documents that the word occurs in (also known as a posting list). Usually posting lists are large, so they are stored in files or a database and only references to them are kept in memory; but for our use case of just 120 documents, the posting lists are small enough to fit in memory.
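
Conceptually, the in-memory index is just a dictionary mapping each term to its posting list. Here is a minimal sketch using a couple of made-up documents (the real posting lists will also carry a per-document score, described later):

from collections import defaultdict

# Toy corpus (hypothetical): document id -> list of terms, already tokenized
docs = {
    1: ['database', 'index', 'query'],
    2: ['query', 'language'],
}
index = defaultdict(list)   # term -> posting list of document ids
for docId, terms in docs.items():
    for term in sorted(set(terms)):
        index[term].append(docId)
print(dict(index))
# {'database': [1], 'index': [1], 'query': [1, 2], 'language': [2]}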

Build index for a single document

Tokenize

The first thing we need to do is break the document content into a list of tokens. Each document is an HTML page, so these are the steps we will follow to tokenize it:

  1. Parse the HTML document and extract the visible text, ignoring all non-ASCII characters.
  2. Tokenize the resulting string; we can simply split on whitespace or use a library like nltk.

from bs4 import BeautifulSoup
import nltk

# Create BeautifulSoup object from html text, and ignore/remove the non-ASCII characters
soup = BeautifulSoup(docData.encode("ascii", errors='ignore'), 'html.parser')
# Remove non-visible tags [Reference: https://stackoverflow.com/questions/1936466/beautifulsoup-grab-visible-webpage-text]
[tag.extract() for tag in soup(['style', 'script', '[document]', 'head', 'title'])]
visible_text = soup.getText()
# Tokenize
tokens = nltk.word_tokenize(visible_text)
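
For example, running these steps on a tiny hand-written HTML string (a hypothetical stand-in for docData, the raw page content from the crawler output) should produce something like:

docData = '<html><head><title>DB</title></head>' \
          '<body><p>SQL is a query language for databases.</p></body></html>'
soup = BeautifulSoup(docData.encode("ascii", errors='ignore'), 'html.parser')
[tag.extract() for tag in soup(['style', 'script', '[document]', 'head', 'title'])]
# nltk.word_tokenize needs the 'punkt' tokenizer data, e.g. via nltk.download('punkt')
print(nltk.word_tokenize(soup.getText()))
# ['SQL', 'is', 'a', 'query', 'language', 'for', 'databases', '.']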

Normalize

We can now normalize the tokens to remove unnecessary information, for example by converting them to lower case and ignoring tokens that contain no alphabetic characters.

newTokens = []
for token in tokens:
    # Convert all character in token to lower case
    token = token.lower()
    # Ignore token which does not have alphabet
    if not re.search(r'.*[a-z].*', token):
        continue
    newTokens.append(token)
tokens = newTokens
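
A quick check on some made-up tokens shows what the filter keeps: lower-cased tokens that contain at least one letter.

import re

sample = ['The', 'SQL-92', '(', '1992', ')', 'standard']   # hypothetical tokens
kept = [t.lower() for t in sample if re.search(r'[a-z]', t.lower())]
print(kept)   # ['the', 'sql-92', 'standard']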

Stemming

Stemming is a specific type of normalization that reduces tokens to their root words; for example, ‘fishing’, ‘fished’ and ‘fisher’ should all reduce to the root ‘fish’.

from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
newTokens = []
for token in tokens:
    newTokens.append(stemmer.stem(token))
tokens = newTokens
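
A quick check with the classic example words from Porter’s paper (exact output can vary slightly between stemmer implementations):

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
for word in ['connect', 'connected', 'connecting', 'connection', 'connections']:
    print(word, '->', stemmer.stem(word))   # all of these should stem to 'connect'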

Stop words

A few words like ‘the’, ‘a’ and ‘is’ occur very frequently. These are called stop words, and they carry very little or no value because they appear in almost every document, which makes them useless for distinguishing one document from another.

from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
tokens = [token for token in tokens if token not in stop_words]
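
A quick check on some hypothetical stemmed tokens (requires nltk’s ‘stopwords’ corpus, e.g. via nltk.download('stopwords')):

from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
sampleTokens = ['sql', 'is', 'a', 'queri', 'languag']     # hypothetical stemmed tokens
print([t for t in sampleTokens if t not in stop_words])   # ['sql', 'queri', 'languag']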

Summarize all the above steps

import re
from bs4 import BeautifulSoup
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

def index_string(text):
    # Tokenize
    tokens = nltk.word_tokenize(text)
    # To capture occurrence of 'token/term' in a 'document'
    tokenFreq = {}
    for token in tokens:
        # Convert all character to lower case
        token = token.lower()
        # Ignore token which does not have alphabet
        if not re.search(r'.*[a-z].*', token):
            continue
        # Apply stemming
        token = stemmer.stem(token)
        # Ignore stopword
        if token in stop_words:
            continue
        # Count token frequency in document
        if token in tokenFreq:
            tokenFreq[token] += 1
        else:
            tokenFreq[token] = 1
    return (tokenFreq, len(tokens))

def index_document(document):
    # Create BeautifulSoup object from html text, and ignore/remove the non-ASCII characters
    soup = BeautifulSoup(document.encode("ascii", errors='ignore'), 'html.parser')
    # Remove non-visible tags [Reference: https://stackoverflow.com/questions/1936466/beautifulsoup-grab-visible-webpage-text]
    [tag.extract() for tag in soup(['style', 'script', '[document]', 'head', 'title'])]
    # Get visible text from html document
    visible_text = soup.getText()
    return index_string(visible_text)
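
A quick sanity check on a made-up string: the exact stems depend on the stemmer, but stop words and punctuation should be gone, and ‘database’/‘Databases’ should collapse to one term.

tokenFreq, totalTokens = index_string('The database stores data. Databases are fast.')
print(tokenFreq)     # e.g. {'databas': 2, 'store': 1, 'data': 1, 'fast': 1}
print(totalTokens)   # 9: raw token count, including punctuation and stop words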

Build index for all documents

Using the above method index_document(document), let’s build the index for the entire corpus of documents.
The reference or identifier for each document is its ‘document link’, which is a relatively large string for an identifier and can inflate the posting list size considerably; to avoid this we will replace the ‘document link’ with a smaller value called a ‘document id’.
Each element in a ‘posting list’ will be a tuple of 2 values: the 1st value is the ‘document id’ and the 2nd value is the ‘score of the term/token in that document’.
We will use ‘term frequency (tf)’ to calculate the score. The term frequency tf(t,d) of term t in document d is the number of times t occurs in d. A document with 10 occurrences of a term is more relevant than a document with 1 occurrence, but it is not 10 times more relevant, so the mapping from frequency to score should not be linear; we therefore take the log of the raw frequency, scoring a term as 1 + log10(tf(t,d)) when tf(t,d) > 0 and 0 otherwise.
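
A small illustration of this log-scaled score, as used in the code below:

import math

def tf_score(freq):
    # Sub-linear term-frequency score
    return 1 + math.log10(freq) if freq else 0

print(tf_score(1))     # 1.0
print(tf_score(10))    # 2.0  -> 10x the occurrences, but only 2x the score
print(tf_score(100))   # 3.0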

import sys
import os
import pickle
import re
import math
from collections import defaultdict

def dumpStructure(fileName, content):
    with open(fileName, 'wb') as fileHandle:
        pickle.dump(content, fileHandle, protocol=pickle.HIGHEST_PROTOCOL)

def index_document(document):
    # ... skipping code here ...

def index_data(inDir, outDir):
    docLink_to_docId_map = {}
    token_to_docId_map   = defaultdict(list)
    # Use 'document id' instead of 'document link' as 'id' is shorter and will take less memory than 'link'
    docId = 1
    # Get list of data files from input directory
    dataFiles = [dataFile for dataFile in os.listdir(inDir) if os.path.isfile(os.path.join(inDir, dataFile))]
    # Read the data file
    for dataFile in dataFiles:
        with open(os.path.join(inDir, dataFile), 'rb') as fileHandle:
            data = pickle.load(fileHandle)
        # Build the 'term document index'
        for docLink,docData in data.items():
            # Map 'document id' to 'document link'
            docLink_to_docId_map[docId] = docLink
            # Index document, i.e. get list of tokens in document
            tokenFreq, totalTokens = index_document(docData)
            # Create token to document mapping
            for token, tokenDocFreq in tokenFreq.items():
                # Calculate 'Term/token Frequency', which states importance of 'Token' in 'Document'
                tokenDocScore = (1 + math.log10(tokenDocFreq)) if tokenDocFreq else 0
                token_to_docId_map[token].append((docId, tokenDocScore))
            # Increment 'document id' for next iteration
            docId += 1
    # Store the output
    if not os.path.exists(outDir):
        os.makedirs(outDir)
    dumpStructure(os.path.join(outDir, 'docLink_to_docId_map'), docLink_to_docId_map)
    dumpStructure(os.path.join(outDir, 'token_to_docId_map'), token_to_docId_map)

index_data('crawlerOutput', 'indexOutput')

Let’s analyze the content of the generated output files

    • Content of file token_to_docId_map
{"address": [(24, 1.0)],
 ...
 "hello": [(17, 1.0), (109, 1.4771212547196624)],
 ...
 "zone": [(1, 1.0), 
          (2, 1.3010299956639813),
          (26, 1.6020599913279625),
          (27, 1.7781512503836436),
          (36, 1.0), 
          (39, 1.0), 
          (42, 1.0), 
          (65, 1.0)],}
    • Content of file docLink_to_docId_map
{1: 'https://en.wikipedia.org/wiki/Sybase',
 2: 'https://en.wikipedia.org/wiki/Microsoft_SQL_Server',
 ...
 120: 'https://en.wikipedia.org/wiki/Database_schema'}

Parse search query and retrieve relevant documents

Now let’s take a search query and find the most relevant documents. The idea is to parse the query string and extract the relevant tokens, gather the documents that contain any of those tokens, rank the documents by their scores with respect to the query, and return the top 5 documents. We will use the TF-IDF score for ranking.
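
As a rough worked example before the code, using the ‘zone’ posting list shown earlier and assuming 120 documents in total:

import math

# Hypothetical numbers taken from the 'zone' posting list above
totalDocCount = 120   # total documents in the corpus
tokenDocCount = 8     # documents that contain the token 'zone'
idf = math.log10(totalDocCount / tokenDocCount)
print(idf)                    # ~1.176: rarer tokens get a larger boost
print(1.0 * idf)              # tf-idf for a document where the tf score is 1.0
print(math.log10(120 / 120))  # 0.0: a token present in every document adds nothing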

def loadStructure(fileName):
    with open(fileName, 'rb') as fileHandle:
        data = pickle.load(fileHandle)
    return data

def index_string(text):
    # ... skipping code here ...

def get_relevant_documents(inDir, query):
    token_to_docId_map = loadStructure(os.path.join(inDir, 'token_to_docId_map'))
    docLink_to_docId_map = loadStructure(os.path.join(inDir, 'docLink_to_docId_map'))
    # Index query
    tokenFreq, totalTokens = index_string(query)
    documentRanks = {}
    # Get scores for all relevant documents using relevant tokens from search query
    for token, tokenDocFreq in tokenFreq.items():
        documentsForToken = token_to_docId_map.get(token, [])   # Tokens not in the index match no documents
        for (docId, tokenDocScore) in documentsForToken:
            # Calculate tf-idf score, properties of this score are:
            # - Increases with number of occurrences with in a document
            # - Increases with rarity of term across documents
            tokenDocCount = len(documentsForToken)    # Number of documents in which token exists
            totalDocCount = len(docLink_to_docId_map) # Total number of documents in the corpus
            tfIdfScore = tokenDocScore * math.log10(totalDocCount/tokenDocCount)
            if docId in documentRanks:
                # Document found for multiple query tokens
                documentRanks[docId] += tfIdfScore
            else:
                documentRanks[docId] = tfIdfScore
    # Sort the documents by score
    sortedDocumentRanks = sorted(documentRanks, key=documentRanks.get, reverse=True)
    # Get top 5 document links
    finalDocumentList = []
    for docId in sortedDocumentRanks[:5]:
        finalDocumentList.append(docLink_to_docId_map[docId])
    # Return top 5 document links
    return finalDocumentList

documents = get_relevant_documents('indexOutput', 'what is sql')
for document in documents:
    print(document)

Scope for improvements

  • Adding support for phrase queries.
  • Storing index in database or file system instead of in memory.
  • Improvements to token filtering and sanitization logic.
  • Improvements to ranking algorithm.