Building Web Crawler

Motivation

Today, a lot of applications thrive on data, and there is nothing better than the internet to get that data. A web crawler is an important component in most data-driven applications, like web scraping tools, web search engines, etc.

Let’s build

Overview

The code is written in Python 3 and is available on GitHub at PrajitP/webCrawler.

We will start with a seed link, fetch its content, extract all the links in that content, and then repeat the process with each extracted link as a new seed. Note that since this is a recursive process, we should always add a termination condition.

Looked at from another angle, the whole web is a directed graph: every page is a node, the outgoing links from that page are its outgoing edges, and the links pointing to it are its incoming edges. The number of nodes (pages) is practically infinite; even the world's largest crawlers cannot crawl all of them. We start from a node called the seed and then do a Breadth First Traversal (BFT) to explore nodes (note: as the graph is very big, we need an explicit termination condition). There might be some independent components in the graph (i.e. subsets of nodes disconnected from the main web), but as a component grows bigger it is very unlikely to remain independent. This matters because if the seed you select lands in such a small independent component, you will not be able to explore much data.
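
To make this concrete before we touch the web, here is a minimal sketch of that traversal pattern on a tiny in-memory graph: a plain Breadth First Traversal with a visited set to handle cycles and an explicit node limit as the termination condition (the graph, names and limit below are purely illustrative).

from collections import deque

# A tiny illustrative graph: page -> list of outgoing links
graph = {
    'A': ['B', 'C'],
    'B': ['A', 'D'],
    'C': ['D'],
    'D': ['A'],
}

def bfs(seed, maxNodes=10):
    visited = set()
    queue = deque([seed])
    while queue and len(visited) < maxNodes:   # Explicit termination condition
        node = queue.popleft()
        if node in visited:                    # Avoid cycles
            continue
        visited.add(node)
        queue.extend(graph.get(node, []))      # Follow the node's outgoing edges
    return visited

print(bfs('A'))   # Prints the set of explored pages: {'A', 'B', 'C', 'D'}

The crawler we build below follows exactly this shape, except that "following an edge" means downloading a page and extracting its links.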

Let’s request a page and extract links

We are going to use Python, just because it’s simple and has many libraries that we can use.

import urllib.request
from bs4 import BeautifulSoup

seedLink = 'https://en.wikipedia.org/wiki/Database'
response = urllib.request.urlopen(seedLink)           # Get the content of the seed link
soup = BeautifulSoup(response.read(), 'html.parser')  # Parse the received HTML page
for aTag in soup.find_all('a'):        # Find all anchor tags
    link = aTag.get('href')            # Get the value of the anchor tag attribute 'href'
    print(link)

Now that we know how to extract links, let's recursively explore them to get new links.

Let’s build a simple crawler

import urllib.request
from bs4 import BeautifulSoup

seedLink = 'https://en.wikipedia.org/wiki/Database'
candidateLinks = [seedLink]
for candidateLink in candidateLinks:
    response = urllib.request.urlopen(candidateLink)
    soup = BeautifulSoup(response.read(), 'html.parser')
    for aTag in soup.find_all('a'):    # Find all anchor tags
        link = aTag.get('href')        # Get the value of the anchor tag attribute 'href'
        candidateLinks.append(link)    # Add the new link to the list of candidate links

We have a few major issues with the above code:

  • It will run into an infinite loop, because our recursive process can land on an already explored link, which restarts the cycle (a cycle in the directed graph).
  • Even if there is no cycle, the graph is so big that exploring new nodes will practically never end.

So let's fix the above two issues:

import urllib.request
from bs4 import BeautifulSoup

seedLink = 'https://en.wikipedia.org/wiki/Database'
candidateLinks = [seedLink]
exploredLinks = set()                  # Do not explore already visited nodes
maxExplore = 50                        # Explore a maximum of 50 pages
for candidateLink in candidateLinks:
    if candidateLink in exploredLinks:        # This is to avoid cycles
        continue
    if len(exploredLinks) >= maxExplore:      # This is to avoid exploring the huge graph
        break
    exploredLinks.add(candidateLink)          # Mark this link as explored
    response = urllib.request.urlopen(candidateLink)
    soup = BeautifulSoup(response.read(), 'html.parser')
    for aTag in soup.find_all('a'):    # Find all anchor tags
        link = aTag.get('href')        # Get the value of the anchor tag attribute 'href'
        candidateLinks.append(link)    # Add the new link to the list of candidate links

We still have a few issues with the above code:

  • There are various types of links, and we need to handle each type accordingly.

Let’s analyze the links we got

Links come in various types, from links within a page, to links within the domain, to external links, etc. It is very important that we handle links properly to make the crawler more effective and functional. We can categorize links along two major dimensions:

Links based on origin

Links within the page

#Design_and_modeling

Links within the domain (internal links)

/w/index.php?title=Database&action=edit&section=5
/wiki/Navigational_database
/wiki/File:CodasylB.png
/wiki/Wikipedia:Verifiability#Burden_of_evidence

Links outside the domain (external links)

http://www.oed.com/view/Entry/47411
https://docs.google.com/viewer?a=v&pid=explorer&chrome=true&srcid=0B4t_NX-QeWDYZGMwOTRmOTItZTg2Zi00YmJkLTg4MTktN2E4MWU0YmZlMjE3

Dynamic vs. static links

Static links are the ones that do not contain any varying component and will always return the same output.

/wiki/Navigational_database
/wiki/File:CodasylB.png
http://www.oed.com/view/Entry/47411

Dynamic links contain a varying component (usually called HTTP GET parameters); depending on the parameter values, the output will differ.

/w/index.php?title=Database&action=edit&section=5
https://docs.google.com/viewer?a=v&pid=explorer&chrome=true&srcid=0B4t_NX-QeWDYZGMwOTRmOTItZTg2Zi00YmJkLTg4MTktN2E4MWU0YmZlMjE3
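
As a quick illustration of these two dimensions, one way to classify a raw href is with Python's urllib.parse (the classifyLink helper below is our own illustrative name, and the rules are simplified; real pages also contain corner cases like 'mailto:' or protocol-relative '//host/path' links):

from urllib.parse import urlparse

def classifyLink(link):
    # Classify a raw href along the two dimensions discussed above (simplified sketch)
    parsed = urlparse(link)
    if parsed.netloc:                  # Has a host part -> external link
        origin = 'external'
    elif parsed.path:                  # Path only -> internal link (within the domain)
        origin = 'internal'
    else:                              # Only a fragment like '#...' -> link within the page
        origin = 'within-page'
    kind = 'dynamic' if parsed.query else 'static'   # GET parameters make a link dynamic
    return origin, kind

print(classifyLink('#Design_and_modeling'))                                # ('within-page', 'static')
print(classifyLink('/wiki/Navigational_database'))                         # ('internal', 'static')
print(classifyLink('/w/index.php?title=Database&action=edit&section=5'))   # ('internal', 'dynamic')
print(classifyLink('http://www.oed.com/view/Entry/47411'))                 # ('external', 'static')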

Let’s update our code to only access static internal links

# ... skipping code here ...
# NOTE: this snippet also needs 'import re' and a 'domain' variable
#       (for our seed link, domain = 'https://en.wikipedia.org')
for candidateLink in candidateLinks:
    # ... skipping code here ...
    response = urllib.request.urlopen(candidateLink)
    soup = BeautifulSoup(response.read(), 'html.parser')
    for aTag in soup.find_all('a'):    # Find all anchor tags
        link = aTag.get('href')
        if link:
            # Below regex will match links starting with '/',
            # Example: '/wiki/Navigational_database'
            # TODO: handle '/wiki/Navigational_database'
            reInternalLink = re.compile(r"^/")      # Link that starts with '/'
            match = reInternalLink.search(link)
            if not match:                           # Skip if not an internal link
                continue
            # Below regex will match any link containing '?'
            # Example: '/w/index.php?title=Database&action=edit&section=5'
            reDynamicLink = re.compile(r'\?')       # Link that has '?'
            match = reDynamicLink.search(link)
            if match:                               # Skip if not a static link
                continue
            link = domain + link                    # Make the internal link absolute
            candidateLinks.append(link)

Now that we have code to crawl the web, let's store the content retrieved from each link we crawl.

# ... skipping code here ...
cacheContent = {}      # Map to cache the data for each link
fileIndex = 0          # Index used to generate a new dump file name
dumpInterval = 10      # Dump after we collect data for 10 links
for candidateLink in candidateLinks:
    # ... skipping code here ...
    response = urllib.request.urlopen(candidateLink)
    responseBody = response.read().decode('utf-8')
    soup = BeautifulSoup(responseBody, 'html.parser')
    cacheContent[candidateLink] = responseBody
    if len(cacheContent) >= dumpInterval:    # This frees up memory at regular intervals
        dumpCache(str(fileIndex), cacheContent)
        fileIndex += 1
        cacheContent = {}
    for aTag in soup.find_all('a'):    # Find all anchor tags
        link = aTag.get('href')
        # Link sanitization and filtering
        # ... skipping code here ...
dumpCache(str(fileIndex), cacheContent)  # Dump the final cache
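
The dumpCache helper is not shown in the snippet above; here is a minimal sketch of what it could look like, assuming we simply serialize the link-to-content map as one JSON file per dump (the output directory and file naming below are our own choices):

import json
import os

def dumpCache(fileName, cacheContent, outDir='crawl_dump'):
    # Write the 'link -> page content' map to disk as one JSON file
    os.makedirs(outDir, exist_ok=True)
    outFile = os.path.join(outDir, fileName + '.json')
    with open(outFile, 'w', encoding='utf-8') as fh:
        json.dump(cacheContent, fh, ensure_ascii=False)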

Scope for improvements

There is a lot of scope for improvement; I will list a few ideas here:

  • Error handling for the urllib.request.urlopen calls; for example, the link might be broken or the server might be down (see the sketch after this list).
  • Refactoring the existing code into an object-oriented design.
  • Allowing custom hooks (a subroutine reference or an object) to manage the 'link sanitization and filtering' logic; in most use cases one might want to plug in a custom link selection policy.
  • Allowing custom hooks (a subroutine reference or an object) to manage the 'data retrieved by links'; in some cases one might want to handle the retrieved information differently than the default behavior.
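
For example, the error handling mentioned in the first point could start with a small wrapper around urllib.request.urlopen; this is only a sketch (the fetchPage name and the skip-on-failure policy are our own):

import urllib.error
import urllib.request

def fetchPage(link, timeout=10):
    # Return the page body for 'link', or None if the request fails
    # (broken link, server down, timeout, ...); callers should skip None results
    try:
        response = urllib.request.urlopen(link, timeout=timeout)
        return response.read()
    except (urllib.error.URLError, OSError) as err:
        print('Failed to fetch {}: {}'.format(link, err))
        return None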

Building Custom Terminal

Motivation

The requirement was to have a command-line environment that provides features like:

  • Maintaining state; state can be command history, variables, config, defaults, etc.
  • A way to store and restore that state.
  • Variable support
    I wanted a way to assign an arbitrary string to a name and use that name on the command line as a substitute for that string.  For example, I could just say:

    >> set project_path /home/prajit/projects/TerminalEmulator

    and then use it on the command line:

    >> ls $project_path

  • Smart autocompletion, like variable name completion, command completion,
    command argument completion, etc., based on the context.
  • Caching.

Let’s code

Overview

We are not building this from scratch; we will use the GNU Readline Library, which is primarily written in C and has bindings for various other languages. In our case the language of choice is Perl (Term::ReadLine::Gnu). I will try to keep the code as simple as possible; you can find the code on GitHub at PrajitP/customTerminal.

Let’s create template code

use Term::ReadLine;

my $term = Term::ReadLine->new('Custom Terminal Name');

while(1) {
    my $cmd = $term->readline('>> ');
    system($cmd);
}

The above code runs in an infinite loop: it waits for the user to type something and press enter, and then runs the input as a system command.
So far it's a dumb terminal; we have to press 'Ctrl + c' to kill it, so let's add support for our first command, 'exit'.

use Term::ReadLine;

my %commands = (
    exit => {
        action => sub {   # This is a subroutine reference in Perl
            exit 0;       # Terminate the terminal
        },
    },
);

my $term = Term::ReadLine->new('Custom Terminal Name');

while(1) {
    my $cmd = $term->readline('>> ');
    my ($cmd_name, @cmd_args) = split(' ', $cmd);
    next unless defined $cmd_name;              # Ignore empty input
    if(exists $commands{$cmd_name}) {           # Is the first word a known command?
        $commands{$cmd_name}->{action}->($cmd); # Run the code for the given command
    }
    else {
        system($cmd);
    }
}

Let’s add a variable feature

Let's add a feature where we can set a variable and use it directly on the command line. We will create a new command 'set' which takes the arguments 'variableName variableValue', and we will add support for a new syntax '$variableName' on the command line, which will be replaced by 'variableValue' internally.

use Term::ReadLine;

my %variables = ();
my %commands = (
    # ... skipping code here ...
    set => {
        action => sub {
            my $cmd = shift;
            my ($cmd_name, @cmd_args) = split(' ', $cmd);
            # TODO: Add parameter validation
            $variables{$cmd_args[0]} = $cmd_args[1];
        },
    },
);

sub expand_variables {
    my $cmd = shift;
    # NOTE: 'reverse' & 'sort' make sure we always substitute the longest name first,
    #       so that a variable name which is a prefix of another does not take priority;
    #       example: if we have two variables 'dir' & 'dir1', 'dir1' will be substituted first.
    for my $key (reverse sort keys %variables) {
        $cmd =~ s/\$\Q$key\E/$variables{$key}/g;
    }
    return $cmd;
}

my $term = Term::ReadLine->new('Custom Terminal Name');

while(1) {
    my $cmd = $term->readline('>> ');
    $cmd = expand_variables($cmd);
    # ... skipping code here ...
}

Let’s add an auto complete feature

Let's add an auto completion feature to complete commands and variable names: if it's the first word, suggest command completion; if the word has a '$' prefix, suggest variable name completion.

use Term::ReadLine;

my %variables = ();
my %commands = (
    # ... skipping code here ...
);

sub expand_variables {
    # ... skipping code here ...
}

my $term = Term::ReadLine->new('Custom Terminal Name');
my $term_attr = $term->Attribs;    # Hash ref of GNU Readline attributes
# The completion logic is adapted from 'http://www.perlmonks.org/?node_id=629849'
$term_attr->{attempted_completion_function} = \&complete;

while(1) {
    my $cmd = $term->readline('>> ');
    $cmd = expand_variables($cmd);
    # ... skipping code here ...
}

sub complete {
    my ($text, $line, $start, $end) = @_;

    # Example, let's say user typed: 'ls $fo'
    #          $line = 'ls $fo'
    #          $text = 'fo'
    #          $start = 4
    #          $end   = 6

    my @words = split(' ', $line);
    my $last_word = $words[-1];         # For above example this will be '$fo'

    if($start == 0){
        # First word should be a command
        return $term->completion_matches($text,\&complete_command);
    }
    elsif($last_word =~ /^\$/){
        # If last word start with '$', it can be a variable
        return $term->completion_matches($text,\&complete_variable);
    }
    # Otherwise, returning nothing falls back to the default file name completion
    return;
}

sub complete_variable {
    return complete_keyword(@_, [sort keys %variables]);
}

sub complete_command {
    return complete_keyword(@_, [sort keys %commands]);
}

{
    my $i;
    my @keywords;
    # Below function will be called every time the user presses <TAB>,
    # '$state' value will be '0' for the first <TAB> press and 'non-zero' for consecutive <TAB> presses.
    sub complete_keyword {
        my ($text, $state, $keyword_list) = @_;
        return unless $text;
        if($state) {
            $i++;
        }
        else {
            # first call, set starting point
            @keywords = @{$keyword_list};
            $i = 0;
        }
        for (; $i<=$#keywords; $i++) {
            # Return first word starting from index '$i' whose prefix matches '$text'
            return $keywords[$i] if $keywords[$i] =~ /^\Q$text/;
        };
        return undef;
    }
}

Scope for improvements

  • Add redirection (STDOUT & STDERR redirected to a file) and pipe support to custom commands like set, get.
  • UI improvements
    • If the command line gets long, it wraps onto the same line instead of a new line; fix this.
    • Overwrite the command with the expanded variable values, so one can see the final command that ran.