Building Search Engine

Motivation

Data is generated at a rapid pace, and as it grows, it gets harder to find the relevant pieces. A search engine indexes such data so that it can quickly provide a reference to a relevant data source. The underlying data does not have to be web pages; it can be anything from files to databases to web services. All we need is a way to refer to a data source, which we will call a document in the rest of this post.

Let’s build

Overview

The code is written in Python 3 and is available on GitHub: PrajitP/webSearchEngine

We will first collect the data we want to build our search engine on: 120 Wikipedia pages/documents. Then we will build an index over the data, and finally we will build a ranking algorithm to select the top n documents for a given search query.

NOTE: For simplicity, we are going to deal with a small dataset, so all our operations will be in-memory.

Let’s get the data

We are going to use the crawler built here (see “Building Web Crawler” below).

from crawler import Crawler

def collect_data(outDir):
    crawler = Crawler(outDir = outDir)
    crawler.start(
        domain = 'https://en.wikipedia.org',                    # Explore pages with-in this domain
        seedLink = 'https://en.wikipedia.org/wiki/Database',    # Start exploring from this page
        dumpInterval = 10,                                      # Dump explored pages after every '10' exploration
        maxExplore = 120)

collect_data('crawlerOutput')

Let’s analyze the content of the generated output file:

    • Content of file crawlerOutput/1
{'https://en.wikipedia.org/wiki/Computer_software': '<!DOCTYPE html>\n...'
 ...
 'https://en.wikipedia.org/wiki/Microsoft_SQL_Server': '<!DOCTYPE html>\n...'}

Index the data

The index will be a mapping from a word (also known as a term) to the list of documents that the word occurs in (also known as a posting list). Usually posting lists are large, so they are stored in files or a database and only references to them are kept in memory, but for our use case with just 120 documents, the posting lists generated are small enough to fit in memory.
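
To make this concrete, here is a tiny, made-up sketch of an inverted index over three hypothetical documents (the real index built below also stores a score alongside each document id, as explained later):

# A toy inverted index: term -> posting list of document ids (made-up data)
index = {
    'database': [1, 2, 3],   # 'database' occurs in documents 1, 2 and 3
    'sql':      [2, 3],
    'python':   [1],
}
# Documents that contain the query term 'sql':
print(index.get('sql', []))   # [2, 3]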

Build index for single document

Tokenize

The first thing we need to do is break the document content into a list of tokens. The document we have is an HTML page; here are the steps we will follow to tokenize it:

  1. Parse the HTML document and extract the visible text, ignoring all non-ASCII characters.
  2. Tokenize the resulting string; we can simply split it by spaces or use a library like nltk to tokenize.
from bs4 import BeautifulSoup
import nltk

# Create BeautifulSoup object from html text, and ignore/remove the non-ASCII characters
soup = BeautifulSoup(docData.encode("ascii", errors='ignore'), 'html.parser')
# Remove non-visible tags [Reference: https://stackoverflow.com/questions/1936466/beautifulsoup-grab-visible-webpage-text]
[tag.extract() for tag in soup(['style', 'script', '[document]', 'head', 'title'])]
visible_text = soup.getText()
# Tokenize
tokens = nltk.word_tokenize(visible_text)

Normalize

We can now normalize the tokens to remove unnecessary information, for example by lowercasing them and ignoring tokens that contain no alphabetic characters.

import re

newTokens = []
for token in tokens:
    # Convert all character in token to lower case
    token = token.lower()
    # Ignore token which does not have alphabet
    if not re.search(r'.*[a-z].*', token):
        continue
    newTokens.append(token)
tokens = newTokens

Stemming

Stemming is a specific type of normalization that reduces all tokens to their root words; for example ‘fishing’, ‘fished’ and ‘fisher’ all reduce to ‘fish’.

from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
newTokens = []
for token in tokens:
    newTokens.append(stemmer.stem(token))
tokens = newTokens

Stop words

A few words occur very frequently, like ‘the’, ‘a’, ‘is’, etc. These are called stop words and are of little or no use, because they occur in almost every document, which makes them poor at distinguishing one document from another.

from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
tokens = [token for token in tokens if token not in stop_words]

Summarize all the above steps

import re
from bs4 import BeautifulSoup
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

def index_string(text):
    # Tokenize
    tokens = nltk.word_tokenize(text)
    # To capture occurrence of 'token/term' in a 'document'
    tokenFreq = {}
    for token in tokens:
        # Convert all character to lower case
        token = token.lower()
        # Ignore token which does not have alphabet
        if not re.search(r'.*[a-z].*', token):
            continue
        # Apply stemming
        token = stemmer.stem(token)
        # Ignore stopword
        if token in stop_words:
            continue
        # Count token frequency in document
        if token in tokenFreq:
            tokenFreq[token] += 1
        else:
            tokenFreq[token] = 1
    return (tokenFreq, len(tokens))

def index_document(document):
    # Create BeautifulSoup object from html text, and ignore/remove the non-ASCII characters
    soup = BeautifulSoup(document.encode("ascii", errors='ignore'), 'html.parser')
    # Remove non-visible tags [Reference: https://stackoverflow.com/questions/1936466/beautifulsoup-grab-visible-webpage-text]
    [tag.extract() for tag in soup(['style', 'script', '[document]', 'head', 'title'])]
    # Get visible text from html document
    visible_text = soup.getText()
    return index_string(visible_text)
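
As a quick sanity check, here is how one might call index_string on a short string (assuming the required nltk data, such as ‘punkt’ and ‘stopwords’, has been downloaded); the exact stems come from the Porter stemmer, so the output should look roughly like this:

tokenFreq, totalTokens = index_string('The database stores data in database tables')
print(tokenFreq)      # Roughly: {'databas': 2, 'store': 1, 'data': 1, 'tabl': 1}
print(totalTokens)    # 7 (all tokens, counted before filtering)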

Build index for all documents

By using the above method index_document(document), let’s build the index for the entire corpus of documents.
The reference or identifier for each document is its ‘document link’, which is a relatively long string for an identifier and can increase the posting list size by a huge margin; to avoid this we will replace the ‘document link’ with a smaller value called the ‘document id’.
Each element in a ‘posting list’ will be a tuple with 2 values: the 1st value is the ‘document id’ and the 2nd value is the ‘score of the term/token in the document’.
We will use ‘term frequency (tf)’ to calculate the score. The term frequency tf(t,d) of term t in document d is defined as the number of times t occurs in d. A document with 10 occurrences of a term is more relevant than a document with 1 occurrence, but it is not 10 times more relevant, so the mapping between score and frequency should not be linear; to achieve this we take the log of the raw frequency to get our final score.
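
To see why the log helps, here are a few raw term frequencies and the scores they map to under the 1 + log10(tf) scheme used in the code below:

import math

for tf in (1, 2, 10, 100):
    score = 1 + math.log10(tf)
    print(tf, round(score, 2))
# Prints: 1 -> 1.0, 2 -> 1.3, 10 -> 2.0, 100 -> 3.0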

import sys
import os
import pickle
import re
import math
from collections import defaultdict

def dumpStructure(fileName, content):
    with open(fileName, 'wb') as fileHandle:
        pickle.dump(content, fileHandle, protocol=pickle.HIGHEST_PROTOCOL)

def index_document(document):
    # ... skipping code here ...

def index_data(inDir, outDir):
    docLink_to_docId_map = {}
    token_to_docId_map   = defaultdict(list)
    # Use 'document id' instead of 'document link', as the 'id' is shorter and takes less memory than the 'link'
    docId = 1
    # Get list of data files from input directory
    dataFiles = [dataFile for dataFile in os.listdir(inDir) if os.path.isfile(os.path.join(inDir, dataFile))]
    # Read the data file
    for dataFile in dataFiles:
        with open(os.path.join(inDir, dataFile), 'rb') as fileHandle:
            data = pickle.load(fileHandle)
        # Build the 'term document index'
        for docLink,docData in data.items():
            # Map 'document id' to 'document link'
            docLink_to_docId_map[docId] = docLink
            # Index document, i.e. get list of tokens in document
            tokenFreq, totalTokens = index_document(docData)
            # Create token to document mapping
            for token, tokenDocFreq in tokenFreq.items():
                # Calculate 'Term/token Frequency', which states importance of 'Token' in 'Document'
                tokenDocScore = (1 + math.log10(tokenDocFreq)) if tokenDocFreq else 0
                token_to_docId_map[token].append((docId, tokenDocScore))
            # Increment 'document id' for next iteration
            docId += 1
    # Store the output
    if not os.path.exists(outDir):
        os.makedirs(outDir)
    dumpStructure(os.path.join(outDir, 'docLink_to_docId_map'), docLink_to_docId_map)
    dumpStructure(os.path.join(outDir, 'token_to_docId_map'), token_to_docId_map)

index_data('crawlerOutput', 'indexOutput')

Let’s analyze the content of the generated output files:

    • Content of file token_to_docId_map
{"address": [(24, 1.0)],
 ...
 "hello": [(17, 1.0), (109, 1.4771212547196624)],
 ...
 "zone": [(1, 1.0), 
          (2, 1.3010299956639813),
          (26, 1.6020599913279625),
          (27, 1.7781512503836436),
          (36, 1.0), 
          (39, 1.0), 
          (42, 1.0), 
          (65, 1.0)],}
    • Content of file docLink_to_docId_map
{1: 'https://en.wikipedia.org/wiki/Sybase',
 2: 'https://en.wikipedia.org/wiki/Microsoft_SQL_Server',
 ...
 120: 'https://en.wikipedia.org/wiki/Database_schema'}

Parse search query and retrieve relevant documents

Now let’s take a search query and find the most relevant documents. The idea is to parse the query string and extract the relevant tokens, prepare a list of documents that contain any of those tokens, rank the documents by their scores with respect to the query, and return the top 5. We will be using the TF-IDF score for ranking.
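
For intuition about the IDF part of the score: with 120 documents in total, a term that appears in many documents contributes much less than a rare one. A quick sketch of the numbers:

import math

totalDocs = 120
for docsWithTerm in (60, 12, 3):
    idf = math.log10(totalDocs / docsWithTerm)
    print(docsWithTerm, round(idf, 2))
# Prints: 60 -> 0.3, 12 -> 1.0, 3 -> 1.6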

def loadStructure(fileName):
    with open(fileName, 'rb') as fileHandle:
        data = pickle.load(fileHandle)
    return data

def index_string(text):
    # ... skipping code here ...

def get_relevant_documents(inDir, query):
    token_to_docId_map = loadStructure(os.path.join(inDir, 'token_to_docId_map'))
    docLink_to_docId_map = loadStructure(os.path.join(inDir, 'docLink_to_docId_map'))
    # Index query
    tokenFreq, totalTokens = index_string(query)
    documentRanks = {}
    # Get scores for all relevant documents using relevant tokens from search query
    for token, tokenDocFreq in tokenFreq.items():
        documentsForToken = token_to_docId_map[token]
        for (docId, tokenDocScore) in documentsForToken:
            # Calculate tf-idf score, properties of this score are:
            # - Increases with number of occurrences with in a document
            # - Increases with rarity of term across documents
            tokenDocCount = len(documentsForToken)       # Number of documents in which token exists
            totalDocCount = len(docLink_to_docId_map)    # Total number of documents in the corpus
            tfIdfScore = tokenDocScore * math.log10(totalDocCount/tokenDocCount)
            if docId in documentRanks:
                # Document found for multiple query tokens
                documentRanks[docId] += tfIdfScore
            else:
                documentRanks[docId] = tfIdfScore
    # Sort the documents by score
    sortedDocumentRanks = sorted(documentRanks, key=documentRanks.get, reverse=True)
    # Get top 5 document links
    finalDocumentList = []
    for docId in sortedDocumentRanks[:5]:
        finalDocumentList.append(docLink_to_docId_map[docId])
    # Return top 5 document links
    return finalDocumentList

documents = get_relevant_documents('indexOutput', 'what is sql')
for document in documents:
    print(document)

Scope for improvements

  • Adding support for phrase queries.
  • Storing index in database or file system instead of in memory.
  • Improvements to token filtering and sanitization logic.
  • Improvements to ranking algorithm.

Building Web Crawler

Motivation

Today a lot of applications thrive on data, and there is nothing better than the internet to get that data. A web crawler is an important component in most data-driven applications, like web scraping applications, web search engines, etc.

Let’s build

Overview

The code is written in Python 3 and is available on GitHub: PrajitP/webCrawler.

We will start with a seed link, get the content of the link, extract all the links in the content, and repeat the process with each extracted link as a new seed link. Note that since this is a recursive process, we should always add a termination condition.

If we look at it from another angle, the whole web is a directed graph, where every page is a node, outgoing links from a page are its outgoing edges, and incoming links to a page are its incoming edges. The number of nodes (pages) is practically infinite; even the world’s largest crawlers cannot crawl all the pages. We start from a chosen node called the seed and then do a Breadth First Traversal (BFT) to explore nodes (note: as the graph is very big, we need an explicit termination condition). There might be some independent components in the graph (i.e. subsets of nodes that are disconnected from the main web), but as a component grows bigger it is very unlikely to remain independent. This is important because if the seed you select ends up in such a small independent component, you might not be able to explore much data.

Let’s request a page and extract links

We are going to use Python, just because it’s simple and has many libraries that we can use.

import urllib.request
from bs4 import BeautifulSoup

seedLink = 'https://en.wikipedia.org/wiki/Database'
response = urllib.request.urlopen(seedLink)   # Get the content of seed link
soup = BeautifulSoup(response.read(), 'html.parser')  # Parse the received HTML page
for aTag in soup.find_all('a'):        # Find all anchor tags
    link = aTag.get('href')            # Get the values of anchor tag attribute href
    print(link)

Now that we know how to extract links, let’s recursively explore them to get new links.

Let’s build the simple crawler

import urllib.request
from bs4 import BeautifulSoup

seedLink = 'https://en.wikipedia.org/wiki/Database'
candidateLinks = [seedLink]
for candidateLink in candidateLinks:
    response = urllib.request.urlopen(candidateLink)
    soup = BeautifulSoup(response.read(), 'html.parser')
    for aTag in soup.find_all('a'):    # Find all anchor tags
        link = aTag.get('href')        # Get the values of anchor tag attribute href
        candidateLinks.append(link)    # Add the new link to list of candidate links

We have a few major issues with the above code:

  • It will run into an infinite loop, because we can reach an already visited link during our recursive process, which restarts the cycle (a cycle in the directed graph).
  • Even if there is no cycle, the graph is so big that exploring new nodes will practically never end.

So let’s fix the above 2 issues:

import urllib.request
from bs4 import BeautifulSoup

seedLink = 'https://en.wikipedia.org/wiki/Database'
candidateLinks = [seedLink]
exploredLinkList = {}                  # Do not explore already visited node
maxExplore = 50                        # Explore maximum 50 pages
for candidateLink in candidateLinks:
    if candidateLink in exploredLinkList:     # This is to avoid cycles
        continue
    if len(exploredLinkList) >= maxExplore:   # This is to avoid exploring the huge graph
        break
    exploredLinkList[candidateLink] = True    # Mark this link as explored, so we do not visit it again
    response = urllib.request.urlopen(candidateLink)
    soup = BeautifulSoup(response.read(), 'html.parser')
    for aTag in soup.find_all('a'):    # Find all anchor tags
        link = aTag.get('href')        # Get the values of anchor tag attribute href
        candidateLinks.append(link)    # Add the new link to list of candidate links

We still have a few issues with the above code:

  • There are various types of links, and depending on the type we need to handle them accordingly.

Let’s analyze the links we got

Links can be of various types, ranging from links within a page, to links within a domain, to external links, etc. It is very important that we handle links properly to make the crawler more effective and functional. We can categorize links along two major dimensions:

Links based on origin

Links within the page

#Design_and_modeling

Links within domain (internal links)

/w/index.php?title=Database&action=edit&section=5
/wiki/Navigational_database
/wiki/File:CodasylB.png
/wiki/Wikipedia:Verifiability#Burden_of_evidence

Links outside the domain (external links)

http://www.oed.com/view/Entry/47411
https://docs.google.com/viewer?a=v&pid=explorer&chrome=true&srcid=0B4t_NX-QeWDYZGMwOTRmOTItZTg2Zi00YmJkLTg4MTktN2E4MWU0YmZlMjE3

Dynamic vs. static links

Static links are the ones that do not contain any varying component and will always return the same output.

/wiki/Navigational_database
/wiki/File:CodasylB.png
http://www.oed.com/view/Entry/47411

Dynamic links contain a varying component (usually called HTTP GET parameters); depending on the parameter values, the output will differ.

/w/index.php?title=Database&action=edit&section=5
https://docs.google.com/viewer?a=v&pid=explorer&chrome=true&srcid=0B4t_NX-QeWDYZGMwOTRmOTItZTg2Zi00YmJkLTg4MTktN2E4MWU0YmZlMjE3
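
As a side note, here is a minimal sketch (not part of the crawler code; classify_link is a hypothetical helper) that uses Python’s urllib.parse to bucket the example links above along these two dimensions:

from urllib.parse import urlparse

def classify_link(link):
    # Classify a link by origin (within-page / internal / external)
    # and by whether it carries query parameters (dynamic / static).
    parsed = urlparse(link)
    if link.startswith('#'):
        origin = 'within-page'
    elif parsed.netloc:                  # Has a host part, e.g. 'docs.google.com'
        origin = 'external'
    else:
        origin = 'internal'
    kind = 'dynamic' if parsed.query else 'static'
    return origin, kind

print(classify_link('#Design_and_modeling'))                               # ('within-page', 'static')
print(classify_link('/wiki/Navigational_database'))                        # ('internal', 'static')
print(classify_link('/w/index.php?title=Database&action=edit&section=5'))  # ('internal', 'dynamic')
print(classify_link('http://www.oed.com/view/Entry/47411'))                # ('external', 'static')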

Let’s update our code to only access static internal links

# ... skipping code here ...
for candidateLink in candidateLinks:
    # ... skipping code here ...
    response = urllib.request.urlopen(candidateLink)
    soup = BeautifulSoup(response.read(), 'html.parser')
    for aTag in soup.find_all('a'):    # Find all anchor tags
        link = aTag.get('href')
        if link:
            # Below regex will match link starting with '/',
            # Example: '/wiki/Navigational_database'
            # TODO: handle '/wiki/Navigational_database'
            reInternalLink = re.compile(r"^/")      # Link that starts with '/'
            match = reInternalLink.search(link)
            if not match:                           # 'Skip' if not a internal link
                continue
            # Below regex will match any link with '?'
            # Example: '/w/index.php?title=Database&action=edit&section=5'
            reDynamicLink = re.compile(r'\?')       # Link that has '?'
            match = reDynamicLink.search(link)
            if match:                               # 'Skip' if not a static link
                continue
            link = domain + link           # 'domain' here is the domain passed to the crawler, e.g. 'https://en.wikipedia.org'
            candidateLinks.append(link)

Now that we have code to crawl the web, let’s store the content/data retrieved from each link we crawl.

# ... skipping code here ...
# dumpCache(fileName, content) writes the cached page content to a file (definition skipped here)
cacheContent = {}      # Map to cache the data for each link
fileIndex = 0          # Index used to generate the next dump file name
dumpInterval = 10      # Dump after we collect data for 10 links
for candidateLink in candidateLinks:
    # ... skipping code here ...
    response = urllib.request.urlopen(candidateLink)
    responseBody = response.read().decode('utf-8')
    soup = BeautifulSoup(responseBody, 'html.parser')
    cacheContent[candidateLink] = responseBody
    if len(cacheContent) >= dumpInterval:    # This will free up memory at regular intervals
        dumpCache(str(fileIndex), cacheContent)
        fileIndex += 1
        cacheContent = {}
    for aTag in soup.find_all('a'):    # Find all anchor tags
        link = aTag.get('href')
        # Link sanitization and filtering
        # ... skipping code here ...
dumpCache(str(fileIndex), cacheContent)  # Dump the final cache

Scope for improvements

There is a lot of scope for improvements, I will list few of them here:

  • Error handling for urllib.request.urlopen calls; for example, the link may be broken or the server may be down (a minimal sketch follows this list).
  • Refactoring the existing code to be object-oriented.
  • Allowing custom hooks (subroutine reference or object) to manage the ‘link sanitization and filtering’ logic; in most use cases one might want to plug in their own link selection policy.
  • Allowing custom hooks (subroutine reference or object) to manage the ‘data retrieved by links’; in some cases one might want to handle the retrieved information differently than the default behavior.
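
For the first point, here is a minimal sketch of what such error handling might look like, assuming we simply skip links that fail to load (the helper name fetch_page is hypothetical):

import urllib.request
import urllib.error

def fetch_page(link, timeout=10):
    # Return the page body, or None if the link is broken or the server does not respond in time
    try:
        response = urllib.request.urlopen(link, timeout=timeout)
        return response.read().decode('utf-8', errors='ignore')
    except (urllib.error.URLError, ValueError) as error:   # HTTPError is a subclass of URLError
        print('Skipping {}: {}'.format(link, error))
        return None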

Building Custom Terminal

Motivation

The requirement was to have a command-line environment which provides features like:

  • Maintaining a state, a state can be command history, variables, config, defaults, etc.
  • Way to store and restore state.
  • Variable support
    I wanted a way to assign an arbitrary string to a name and use that name on the command line as a substitute for that string. For example, I could just say:

    >> set project_path /home/prajit/projects/TerminalEmulator

    and then use it on the command line:

    >> ls $project_path

  • Smart autocompletion, like variable name completion, command completion,
    command argument completion etc. based on the context.
  • Caching.

Let’s code

Overview

We are not building this from scratch; we will use the GNU Readline library, which is primarily written in C and has bindings for various other languages. In our case the language of choice is Perl (Term::ReadLine::Gnu). I will try to keep the code as simple as possible; you can find the code on GitHub: PrajitP/customTerminal.

Let’s create template code

use Term::ReadLine;

my $term = Term::ReadLine->new('Custom Terminal Name');

while(1) {
    my $cmd = $term->readline('>> ');
    system($cmd);
}

The above code will run in an infinite loop: it waits for the user to type and press enter, and once the user presses enter it executes the input as a system command.
So far it’s a dumb terminal; we have to press ‘Ctrl + c’ to kill it, so let’s add support for our first command, ‘exit’.

use Term::ReadLine;

my %commands = (
    exit => {
        action => sub {   # This is a subroutine reference in Perl
            last;         # This will break the while loop
        },
    },
);

my $term = Term::ReadLine->new('Custom Terminal Name');

while(1) {
    my $cmd = $term->readline('>> ');
    my ($cmd_name, @cmd_args) = split(' ', $cmd);
    if(exists $commands{$cmd_name}) {           # Is first word a known command
        $commands{$cmd_name}->{action}->($cmd); # Run the code for given command
    }
    else {
         system($cmd);
    }
}

Let’s add variable feature

Let’s add a feature where we can set a variable and use it directly on the command line. We will create a new command ‘set’ which takes the arguments ‘variableName variableValue’, and we will add support for a new syntax ‘$variableName’ on the command line, which will be replaced by ‘variableValue’ internally.

use Term::ReadLine;

my %variables = ();
my %commands = (
    # ... skipping code here ...
    set => {
        action => sub {
            my $cmd = shift;
            my ($cmd_name, @cmd_args) = split(' ', $cmd);
            # TODO: Add parameter validation
            $variables{$cmd_args[0]} = $cmd_args[1];
        },
    },
);

sub expand_variables {
    my $cmd = shift;
    # NOTE: 'reverse' & 'sort' make sure we always substitute biggest string first,
    #       this make sure prefix does not get priority,
    #       example: if we have 2 variables 'dir' & 'dir1', 'dir1' will be substituted first.
    for my $key (reverse sort keys %variables) {
        $cmd =~ s/\$\Q$key\E/$variables{$key}/g;
    }
    return $cmd;
}

my $term = Term::ReadLine->new('Custom Terminal Name');

while(1) {
    my $cmd = $term->readline('>> ');
    $cmd = expand_variables($cmd);
    # ... skipping code here ...
}

Let’s add auto complete feature

Let’s add an auto-completion feature to complete commands and variable names: if it’s the first word, suggest command completion; if the word has a ‘$’ prefix, suggest variable name completion.

use Term::ReadLine;

my %variables = ();
my %commands = (
    # ... skipping code here ...
);

sub expand_variables {
    # ... skipping code here ...
}

my $term = Term::ReadLine->new('Custom Terminal Name');
my $term_attr = $term->Attribs;   # Attribute hash of the underlying GNU Readline library
# The completion logic is adapted from 'http://www.perlmonks.org/?node_id=629849'
$term_attr->{attempted_completion_function} = \&complete;

while(1) {
    my $cmd = $term->readline('>> ');
    $cmd = expand_variables($cmd);
    # ... skipping code here ...
}

sub complete {
    my ($text, $line, $start, $end) = @_;

    # Example, let's say user typed: 'ls $fo'
    #          $line = 'ls $fo'
    #          $text = 'fo'
    #          $start = 4
    #          $end   = 6

    my @words = split(' ', $line);
    my $last_word = $words[-1];         # For above example this will be '$fo'

    if($start == 0){
        # First word should be a command
        return $term->completion_matches($text,\&complete_command);
    }
    elsif($last_word =~ /^\$/){
        # If last word start with '$', it can be a variable
        return $term->completion_matches($text,\&complete_variable);
    }
    # Else this will do default to file name completion
    return;
}

sub complete_variable {
    return complete_keyword(@_, [sort keys %variables]);
}

sub complete_command {
    return complete_keyword(@_, [sort keys %commands]);
}

{
    my $i;
    my @keywords;
    # Below function will be called every time the user presses Tab,
    # '$state' value will be '0' for the first press and 'non-zero' for consecutive presses.
    sub complete_keyword {
        my ($text, $state, $keyword_list) = @_;
        return unless $text;
        if($state) {
            $i++;
        }
        else {
            # first call, set starting point
            @keywords = @{$keyword_list};
            $i = 0;
        }
        for (; $i<=$#keywords; $i++) {
            # Return first word starting from index '$i' whose prefix matches '$text'
            return $keywords[$i] if $keywords[$i] =~ /^\Q$text/;
        };
        return undef;
    }
}

Scope for improvements

  • Add redirection (STDOUT & STDERR redirected to a file) and pipe support for custom commands like set, get.
  • UI improvements
    • If the command line gets longer than the terminal width, it wraps onto the same line instead of a new line; fix this.
    • Overwrite command with expanded variable value, so one can know the final command that ran.