Building a Web Crawler

Motivation

Today a lot of applications thrive on data, and there is no better source of data than the internet. A web crawler is an important component of most data-driven applications, like web scraping tools, web search engines, etc.

Let’s build

Overview

The code is written in Python 3 and is available on GitHub at PrajitP/webCrawler.

We will start with a seed link, fetch its content, extract all the links from that content, and then repeat the process with each extracted link as a new seed. Note that since this is a recursive process, we should always add a termination condition.

Looked at from another angle, the whole web is a directed graph: every page is a node, the outgoing links from a page are its outgoing edges, and the links pointing to a page are its incoming edges. The number of nodes (pages) is practically infinite; even the world's largest crawlers cannot crawl all of them. We start from a chosen node called the seed and then do a Breadth First Traversal (BFT) to explore the reachable nodes (note: because the graph is so big, we need an explicit termination condition). There may be some independent components in the graph, i.e. subsets of nodes that are disconnected from the main web, but as a component grows bigger it is very unlikely to remain independent. This matters because if the seed you select happens to fall inside such a small independent component, you will not be able to explore much data.
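
To make the traversal idea concrete, here is a minimal sketch of a bounded breadth-first traversal over an arbitrary graph; getNeighbors is a hypothetical placeholder for "download the page and extract its links", which we build in the following sections.

from collections import deque

def breadthFirstTraversal(seed, getNeighbors, maxNodes=50):
    # Explore at most maxNodes nodes, starting from the seed node.
    visited = set()
    frontier = deque([seed])
    while frontier and len(visited) < maxNodes:
        node = frontier.popleft()
        if node in visited:                  # Skip nodes we have already explored (cycles)
            continue
        visited.add(node)
        frontier.extend(getNeighbors(node))  # Enqueue the outgoing edges of this node
    return visited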

Let’s request a page and extract links

We are going to use Python, just because it’s simple and has many libraries that we can use.

import urllib.request
from bs4 import BeautifulSoup

seedLink = 'https://en.wikipedia.org/wiki/Database'
response = urllib.request.urlopen(seedLink)            # Get the content of the seed link
soup = BeautifulSoup(response.read(), 'html.parser')   # Parse the received HTML page
for aTag in soup.find_all('a'):        # Find all anchor tags
    link = aTag.get('href')            # Get the value of the anchor tag's href attribute
    print(link)
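
If you run this, you will notice that some printed href values are None (anchor tags without an href) and many are page-internal fragments such as '#Design_and_modeling'. A quick, illustrative filter, continuing from the snippet above:

for aTag in soup.find_all('a'):
    link = aTag.get('href')
    if link and not link.startswith('#'):   # Drop missing hrefs and in-page fragments
        print(link)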

Now that we know how to extract links, let's recursively explore them to get new links.

Let’s build a simple crawler

import urllib.request
from bs4 import BeautifulSoup

seedLink = 'https://en.wikipedia.org/wiki/Database'
candidateLinks = [seedLink]
for candidateLink in candidateLinks:
    response = urllib.request.urlopen(candidateLink)
    soup = BeautifulSoup(response.read(), 'html.parser')
    for aTag in soup.find_all('a'):    # Find all anchor tags
        link = aTag.get('href')        # Get the value of the anchor tag's href attribute
        candidateLinks.append(link)    # Add the new link to the list of candidate links

We have a few major issues with the above code:

  • It can run into an infinite loop, because the recursive process can bring us back to a link we already visited, which restarts the cycle (a cycle in the directed graph).
  • Even if there is no cycle, the graph is so big that the exploration of new nodes will practically never end.

So let’s fix the above two issues:

import urllib.request
from bs4 import BeautifulSoup

seedLink = 'https://en.wikipedia.org/wiki/Database'
candidateLinks = [seedLink]
exploredLinkList = {}                  # Do not explore already visited nodes
maxExplore = 50                        # Explore at most 50 pages
for candidateLink in candidateLinks:
    if candidateLink in exploredLinkList:     # This is to avoid cycles
        continue
    if len(exploredLinkList) >= maxExplore:   # This is to avoid exploring the huge graph forever
        break
    exploredLinkList[candidateLink] = True    # Mark this link as explored
    response = urllib.request.urlopen(candidateLink)
    soup = BeautifulSoup(response.read(), 'html.parser')
    for aTag in soup.find_all('a'):    # Find all anchor tags
        link = aTag.get('href')        # Get the value of the anchor tag's href attribute
        candidateLinks.append(link)    # Add the new link to the list of candidate links
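
A side note on the loop: iterating over candidateLinks while appending to it works because Python's list iterator checks the list's current length on every step, so items appended during the loop are still visited, and this naturally gives breadth-first order. A tiny, self-contained illustration (not crawler code):

queue = ['a']
for item in queue:
    if len(item) < 3:
        queue.append(item + 'x')   # Items appended mid-loop are still visited later
print(queue)                       # ['a', 'ax', 'axx']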

We still have an issue with the above code:

  • There are various types of links, and depending on the type we need to handle them differently.

Let’s analyze the links we got

Links come in various types, ranging from links within a page, to links within a domain, to external links, etc. It is very important that we handle links properly to make the crawler more effective and functional. We can categorize links along two major dimensions:

Links based on origin

Links within the page

#Design_and_modeling

Links within the domain (internal links)

/w/index.php?title=Database&action=edit&section=5
/wiki/Navigational_database
/wiki/File:CodasylB.png
/wiki/Wikipedia:Verifiability#Burden_of_evidence

Links outside the domain (external links)

http://www.oed.com/view/Entry/47411
https://docs.google.com/viewer?a=v&pid=explorer&chrome=true&srcid=0B4t_NX-QeWDYZGMwOTRmOTItZTg2Zi00YmJkLTg4MTktN2E4MWU0YmZlMjE3

Dynamic vs. static links

Static links are those that do not contain any varying component and will always produce the same output.

/wiki/Navigational_database
/wiki/File:CodasylB.png
http://www.oed.com/view/Entry/47411

Dynamic links contain a varying component (usually called HTTP GET parameters); depending on the parameter values, the output will differ.

/w/index.php?title=Database&action=edit&section=5
https://docs.google.com/viewer?a=v&pid=explorer&chrome=true&srcid=0B4t_NX-QeWDYZGMwOTRmOTItZTg2Zi00YmJkLTg4MTktN2E4MWU0YmZlMjE3
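
These two dimensions can also be detected without regular expressions, using the standard urllib.parse module. The sketch below is only illustrative: the category names are made up, and it treats any absolute URL as external, which matches the examples above.

from urllib.parse import urlparse

def categorizeLink(link):
    parts = urlparse(link)
    if parts.scheme or parts.netloc:       # Absolute URL like 'http://www.oed.com/...'
        origin = 'external'
    elif parts.path:                       # Path like '/wiki/Navigational_database'
        origin = 'internal'
    else:                                  # Only a fragment like '#Design_and_modeling'
        origin = 'within-page'
    kind = 'dynamic' if parts.query else 'static'   # A '?' query means HTTP GET parameters
    return origin, kind

print(categorizeLink('#Design_and_modeling'))                               # ('within-page', 'static')
print(categorizeLink('/wiki/Navigational_database'))                        # ('internal', 'static')
print(categorizeLink('/w/index.php?title=Database&action=edit&section=5'))  # ('internal', 'dynamic')
print(categorizeLink('http://www.oed.com/view/Entry/47411'))                # ('external', 'static')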

Let’s update our code to access only static internal links

import re
# ... skipping code here ...
for candidateLink in candidateLinks:
    # ... skipping code here ...
    response = urllib.request.urlopen(candidateLink)
    soup = BeautifulSoup(response.read(), 'html.parser')
    for aTag in soup.find_all('a'):    # Find all anchor tags
        link = aTag.get('href')
        if link:
            # Below regex will match links starting with '/',
            # Example: '/wiki/Navigational_database'
            reInternalLink = re.compile(r"^/")      # Link that starts with '/'
            match = reInternalLink.search(link)
            if not match:                           # Skip if not an internal link
                continue
            # Below regex will match any link with '?'
            # Example: '/w/index.php?title=Database&action=edit&section=5'
            reDynamicLink = re.compile(r'\?')       # Link that has '?'
            match = reDynamicLink.search(link)
            if match:                               # Skip if it is a dynamic link (not static)
                continue
            link = domain + link                    # Prefix with the domain, e.g. 'https://en.wikipedia.org'
            candidateLinks.append(link)             # Add the new link to the list of candidate links
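
One caveat in the snippet above: 'domain + link' assumes the domain variable was defined earlier and that every internal link is root-relative. A more robust alternative, sketched below, is urllib.parse.urljoin, which resolves an href against the page it was found on:

from urllib.parse import urljoin

# Resolve the href against the page it was extracted from, e.g.:
# urljoin('https://en.wikipedia.org/wiki/Database', '/wiki/Navigational_database')
#   -> 'https://en.wikipedia.org/wiki/Navigational_database'
link = urljoin(candidateLink, link)
candidateLinks.append(link)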

Now that we have code to crawl the web, let’s store the content retrieved from each link we crawl.

# ... skipping code here ...
cacheContent = {}      # Map to cache the data for each link
fileIndex = 0          # Index used to generate a new dump file name
dumpInterval = 10      # Dump after we collect data for 10 links
for candidateLink in candidateLinks:
    # ... skipping code here ...
    response = urllib.request.urlopen(candidateLink)
    responseBody = response.read().decode('utf-8')
    soup = BeautifulSoup(responseBody, 'html.parser')
    cacheContent[candidateLink] = responseBody
    if len(cacheContent) >= dumpInterval:    # This frees up memory at regular intervals
        dumpCache(str(fileIndex), cacheContent)
        fileIndex += 1
        cacheContent = {}
    for aTag in soup.find_all('a'):    # Find all anchor tags
        link = aTag.get('href')
        # Link sanitization and filtering
        # ... skipping code here ...
dumpCache(str(fileIndex), cacheContent)  # Dump the final cache
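
dumpCache itself is not shown here; a minimal sketch of what it could look like, assuming we simply write each batch of {link: content} pairs as a JSON file into an 'output' directory (both the directory and the file naming are assumptions):

import json
import os

def dumpCache(fileIndex, cacheContent, outDir='output'):
    # Write the collected {link: content} map into one JSON file per batch.
    os.makedirs(outDir, exist_ok=True)
    outFile = os.path.join(outDir, 'dump_' + fileIndex + '.json')
    with open(outFile, 'w', encoding='utf-8') as fileHandle:
        json.dump(cacheContent, fileHandle)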

Scope for improvements

There is a lot of scope for improvement; I will list a few ideas here:

  • Error handling for the urllib.request.urlopen calls; for example, a link can be broken or the server can be down (see the sketch after this list).
  • Refactoring the existing code into an object-oriented design.
  • Allowing custom hooks (a callable or an object) to manage the link sanitization and filtering logic; in most use cases one will want to plug in a custom link selection policy.
  • Allowing custom hooks (a callable or an object) to manage the data retrieved from links; in some cases one will want to handle the retrieved information differently from the default behavior.
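
For the first item above, a rough sketch of what the error handling could look like; wrapping the download in a helper makes it easy to skip broken links (the helper name and timeout value are just examples):

import urllib.error
import urllib.request

def fetchPage(link):
    # Return the page body, or None if the link is broken or the server is down.
    try:
        response = urllib.request.urlopen(link, timeout=10)
        return response.read().decode('utf-8')
    except (urllib.error.URLError, UnicodeDecodeError) as error:
        print('Skipping {}: {}'.format(link, error))
        return None

In the crawl loop we would call fetchPage(candidateLink) and simply move on to the next candidate link whenever it returns None.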