
WA7. Web Crawler

This text is divided into three parts, building on my code from the previous assignment:

  1. Statement: the assignment statement, reformatted for easier reading.
  2. Summary: a description of the assignment, my observations, and final notes.
  3. Changes: step-by-step documentation of all my interactions and changes to the code.

Statement

One of the topics covered in CS 3304 Analysis of Algorithms is algorithms for traversing graphs. The structure of the world wide web is an example of a directed graph, with each web page forming a vertex and each URL or web link forming an edge. In Analysis of Algorithms, we learned about different algorithms used to traverse a graph. One traversal approach was referred to as depth-first search (DFS). The basic idea of DFS is that each path, composed of nodes and edges, is traversed using a stack data structure to the required depth. The following code, available in this document, provides an example of such a depth-first approach. The web crawler sets a limit of 500 web pages to be added to the ‘URL frontier’ because, without some limit, the web crawler could easily exceed the memory of the computer with a very large URL frontier.
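
To make the idea concrete, a depth-first crawl with a capped URL frontier can be sketched roughly as follows (the names crawl and MAX_FRONTIER are illustrative, not taken from the example code):

from urllib.request import urlopen
from urllib.parse import urljoin
import re

MAX_FRONTIER = 500          # stop queueing URLs once 500 have been added

def crawl(start_url):
    frontier = [start_url]  # used as a stack, so the traversal is depth-first
    queued = 1              # how many URLs have been placed on the frontier
    visited = set()
    while frontier:
        url = frontier.pop()            # LIFO: most recently discovered URL first
        if url in visited:
            continue
        visited.add(url)
        try:
            html = urlopen(url).read().decode('utf-8', errors='ignore')
        except Exception:
            continue                    # skip pages that cannot be fetched
        for link in re.findall(r'''href=["']([^"']+)["']''', html, re.I):
            if queued < MAX_FRONTIER:   # respect the URL frontier limit
                frontier.append(urljoin(url, link))
                queued += 1
    return visited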

Your web crawler can use the example code as a base, however, you must reuse components of the indexer part 2 that you created as part of the unit 4 assignment.

Your indexer must have the following characteristics:

  • Web crawler must prompt the user to enter a starting website or domain. This must be entered in the form http://mywebpage.com and placed in a queue of URLs to crawl.
  • The web crawler must extract from each visited web page all of the ‘out links’ or the links to other web pages and websites. Each link must be placed into the queue of URLs to crawl. This queue is called the URL frontier.
  • The code that places URLs on the URL frontier should keep track of the number of URLs that are in the frontier and stop adding URLs when there are 500 in the queue.
  • The crawler must extract the text from the web page by using some mechanism to remove all of the HTML tags and formatting. In the example above, the module BeautifulSoup was used to accomplish this. You can use any technique that you want to remove the HTML tags and formatting, however, if you would like to use the BeautifulSoup module, instructions and download links are available along with installation instructions in the unit resources.
  • Your web crawler must produce statistics similar to those listed below to indicate how long it took to index your selected website and key metrics such as the number of documents (in this case, the number of web pages), the number of tokens extracted and processed, and the number of unique terms added to the term dictionary.
  • Your web crawler must use exactly the same format that has been used in the indexer part 2 assignment so that the search engine developed in unit 5 can be used to search your web index.
  • Output Produced Against Example Website
>>>
Enter URL to crawl (must be in the form http://www.domain.com): http://www.hometownlife.com
Start Time: 15:44
Indexing Complete, write to disk: 16:53
Documents 473 Terms 16056 Tokens 2751668
End Time: 21:26
>>>

Summary

This assignment builds on the previous assignments, in which we built an indexer and a search engine. In this assignment, we build a web crawler that crawls a domain, extracts the text from each page, and passes it to the indexer to be indexed.

Each page is treated as a separate document, and the usage of BeautifulSoup has simplified the process of parsing the HTML and extracting the text.

In this text, we crawled the university public website https://www.uopeople.edu/, and the crawling process took around 300 seconds (5 minutes) to complete. The results are shown below:

crawler report

We took an incremental path during development: starting from the example code provided in the assignment, we made one change at a time. The observations are recorded in detail in the Changes section below:

  1. Change the supplied code to use Python 3: we fixed syntax errors and deprecated method calls, and made sure that all dependencies were installed correctly.
  2. Try the first website: we ran the crawler against the university website and reported issues.
  3. Fix BeautifulSoup errors: most of the errors were in the parsing process, where BeautifulSoup4 was incompatible with the code.
  4. Run the first crawling session: we ran the crawler against the university website with relaxed rules (only 10 URLs) and saw how it performed.
  5. Monitor the crawling process: we monitored the crawling process and found that the website is complex, with some requests failing with 403/404 errors, so we introduced a success rate and added it to the report.
  6. Monitor the Indexing process: we monitored the indexing process and made sure that the requirements of the assignment were met; like removing HTML tags, ignoring stopwords, and stemming.

Assessment

Below are my answers to the assessment questions (in bold):

  1. Did the posting include the website that was indexed as well as the required statistics captured when the web crawler was executed against the selected website, including:
    • Number of documents processed: 444 Documents (as 57 URLs failed to be crawled).
    • Total number of terms parsed from all documents: 121298 Tokens.
    • Total number of unique terms found and added to the index: 2106 Terms.
    • Total number of terms found that matched one of the stop words in your program’s stop words list: 46199 stopwords (93 unique stopwords).
  2. Did the posting include the source code of the completed assignment? The source code is attached in a file named crawler7.py.
  3. Does the web crawler program integrate the Porter Stemmer code? Yes, the Porter Stemmer code is integrated into the indexer, and the PorterStemmer.py file is attached to the assignment files.
  4. Did the web crawler program provide a mechanism to remove the HTML tags and formatting from the web page so that only the text would be indexed? Yes, the BeautifulSoup library is used to parse the HTML and extract the text. Using the get_text() method, we can extract only the text from the HTML (without the tags).
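
Putting these answers together, the per-page term pipeline can be sketched roughly as follows. The stop-word list shown is only an illustrative subset, and the stem(word, 0, len(word) - 1) call assumes the classic PorterStemmer.py interface:

import re
from bs4 import BeautifulSoup
from PorterStemmer import PorterStemmer   # PorterStemmer.py attached to the assignment files

stemmer = PorterStemmer()
stopwords = {'the', 'a', 'an', 'and', 'or', 'of', 'to', 'in'}   # illustrative subset

def terms_from_page(html):
    text = BeautifulSoup(html, 'html.parser').get_text()    # drop HTML tags and formatting
    tokens = re.findall(r'[a-z0-9]+', text.lower())          # crude tokenizer
    kept = [t for t in tokens if t not in stopwords]          # ignore stop words
    return [stemmer.stem(t, 0, len(t) - 1) for t in kept]     # Porter-stem the remaining terms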

Changes

1. Change the supplied code to use Python 3

  • The supplied code is written using Python 2, so I changed it to Python 3.
  • urllib2 is replaced by urllib.
  • BeautifulSoup is replaced by bs4.
  • urlparse is replaced by urllib.parse.
  • urlopen is replaced by urllib.request.urlopen.
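
  • For reference, the resulting Python 3 imports look roughly like this (the exact set depends on the rest of the crawler):

# Python 2 -> Python 3
# import urllib2                          -> from urllib.request import urlopen, Request
# from urlparse import urlparse           -> from urllib.parse import urlparse
# from BeautifulSoup import BeautifulSoup -> from bs4 import BeautifulSoup

from urllib.request import urlopen, Request
from urllib.parse import urlparse, unquote
from bs4 import BeautifulSoup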

2. Try the first website

  • I chose to start crawling the university website https://www.uopeople.edu/
  • I received a <HTTPError 403: 'Forbidden'> error.
  • This is usually caused by the website blocking the crawler, as it detects the request is not coming from a browser.
  • We need to mimic a browser request by changing the User-Agent header.
  • The code that does that is below:
from urllib.request import urlopen, Request

req = Request(crawling, headers={'User-Agent': 'Mozilla/5.0'})
response = urlopen(req).read()

3. Fix BeautifulSoup errors

  • Since we changed the code to use Python 3, the BeautifulSoup code needs to be changed as well.
  • There were the following issues:
    • The text parameter has been renamed to string.
    • The features parameter is required when instantiating the BeautifulSoup instance; I set it to html.parser.
    • The response is returned as bytes, so we need to decode it to a string before any processing.
    • The findAll method is deprecated and replaced by find_all.
  • The following code fixes the issues:
from urllib.request import urlopen, Request

req = Request(crawling, headers={'User-Agent': 'Mozilla/5.0'})
response = urlopen(req).read()
response = response.decode('utf-8')
soup = BeautifulSoup(response, 'html.parser')
tok = "".join([p.get_text() for p in soup.find_all("p", string=re.compile("."))])

4. Run the first crawling session

  • I ran the first crawling session against the university website and observed the following:
  • The crawling took a very long time, so I changed the maximum number of URLs to 10 while testing.
  • I introduced the maxUrlFrontierSize to control this feature.
  • I broke the loop of links early when we reached the maximum number of URLs.
  • The following code implements these changes:
maxUrlFrontierSize = 10
if links_queue < maxUrlFrontierSize:
    links = re.findall('''href=["'](.[^"']+)["']''', response, re.I)
    for link in (links.pop(0) for _ in range(len(links))):
        if links_queue >= maxUrlFrontierSize:
            break
        # ... rest of the code
        if link not in crawled:
            links_queue += 1
            tocrawl.append(link)

5. Monitor the crawling process

  • I added a print statement to show the current URL being crawled.
  • I noticed that some pages were failing to be crawled for various reasons, and there was no reporting around that, so:
    • I intercepted the exception and printed it.
    • I introduced a failed_urls list to store the failed URLs and count them (a sketch of this error handling follows the report below).
  • The success rate was reported as 20% at first; see the output below:

success rate report
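
  • The error handling described above can be sketched roughly as follows (crawling here is a placeholder for the URL currently taken off the frontier, as in the crawler code):

from urllib.request import urlopen, Request

failed_urls = []   # URLs that raised an error while being fetched

crawling = 'https://www.uopeople.edu/'   # placeholder: the URL currently being fetched
try:
    req = Request(crawling, headers={'User-Agent': 'Mozilla/5.0'})
    response = urlopen(req).read().decode('utf-8')
except Exception as e:
    # record the failure, report it, and keep crawling the rest of the frontier
    print("Failed to crawl %s: %s" % (crawling, e))
    failed_urls.append(crawling)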

  • We noticed that the failed URLs were mostly due to the following reasons:
    • Fonts that have the .woff2 extension.
    • Images with .webp extension.
    • JSON files queried through WordPress /wp-json/ endpoint.
  • The following code implements these changes:
def extract_file_extension(url):
    # Parse the URL to extract the path
    path = urlparse(url).path
    # Decode URL encoding, if any
    path = unquote(path)
    # Extract the extension
    ext = os.path.splitext(path)[1]
    # If there's a query string after the extension, remove it
    ext = ext.split('?')[0]
    return ext


def is_non_html_link(url):
    path = urlparse(url).path
    if path.startswith('/wp-json/'):
        return True
    return False

# ---- main ----
if links_queue < maxUrlFrontierSize:
    links = re.findall('''href=["'](.[^"']+)["']''', response, re.I)
    for link in (links.pop(0) for _ in range(len(links))):
        if links_queue >= maxUrlFrontierSize:
            break
        # ... rest of the code
        if link not in crawled:
            ext = extract_file_extension(link)
            isNonHtmlLink = is_non_html_link(link)
            if (ext in ['.woff2', '.woff', '.webp'] or isNonHtmlLink):
                # ignore non-HTML links and fonts/images
                continue
            links_queue += 1
            tocrawl.append(link)
  • The success rate jumped to 80% after these changes.
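
  • The success rate itself is not computed in the code above; a rough sketch of how it could be derived from the crawled and failed_urls lists is:

# crawled and failed_urls are the lists maintained by the crawler loop above
crawled = ['https://www.uopeople.edu/']   # placeholder contents
failed_urls = []

attempted = len(crawled) + len(failed_urls)
success_rate = 100.0 * len(crawled) / attempted if attempted else 0.0
print("Crawled %d of %d URLs (success rate: %.0f%%)"
      % (len(crawled), attempted, success_rate))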

6. Monitor the Indexing process

  • The assignment asked us to make sure that the indexing process was similar to the previous assignment.
  • As we are processing HTML pages, we need to make sure that any HTML tags are removed.
  • The assignment provides us with the following code to do that:
def stripTags(s):
    intag = False
    s2 = ""
    for c in s:
        if c == '<':
            intag = True
        elif c == '>':
            intag = False
        if intag != True:
            s2 = s2+c
    return (s2)
  • But since we are using the new version of BeautifulSoup and call the tag.get_text() method, we don’t need this code: get_text() does the same thing.
  • The previous code was deleted.
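
  • For comparison, a minimal example of the get_text() call that replaced stripTags (the sample HTML is made up):

from bs4 import BeautifulSoup

html = "<p>Hello <b>world</b>!</p>"
# stripTags(html) would strip the markup character by character;
# get_text() returns the same visible text directly
print(BeautifulSoup(html, 'html.parser').get_text())   # -> Hello world!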