
DA7. Web Crawlers

Statement

What is involved in creating a web crawler? What are the differences between static and dynamic web content?

Solution

Task 1: What is involved in creating a web crawler?

A web crawler is a program that fetches web pages and copies their content and metadata for later processing. It is usually part of a search engine, where the later processing includes indexing, ranking, and retrieval of the stored content in response to user queries. The crawler is also known as a spider, bot, or web robot (What Is a Web Crawler, 2023).

The following diagram shows the main components and processes of a web crawler (Gautam, 2023):

[Diagram: components and processes of a web crawler]

  1. The cycle starts from a set of Seed URLs that are sent to the URL Frontier, which is a queue of URLs to be crawled.
  2. The HTML Fetcher pops a URL from the URL Frontier and fetches the HTML content of the page.
  3. Before sending the HTTP request to fetch the page, the HTML Fetcher consults a DNS Resolver to resolve the URL's domain name to an IP address; these resolutions are usually cached at the Fetcher level to avoid repeated DNS lookups.
  4. The fetched HTML page is sent to the HTML Parser which extracts textual content, metadata, and links from the page.
  5. The parsed page is sent to Duplicate Detection, where checks (e.g., hashes, checksums, or shingles) determine whether the page has already been seen; if it is a duplicate, the page is discarded and the crawler moves on to the next URL in the URL Frontier.
  6. If the page is not marked as a duplicate, the parsed data goes to the appropriate Data Storage, and the computed information is cached to avoid repeated computations while crawling further pages.
  7. The parsed page is sent to the URL Extractor which extracts URLs from the page.
  8. The extracted URLs are sent to the URL Filter which filters out invalid URLs, URLs that are not allowed to be crawled, and irrelevant URLs (e.g., ads, images, files, or external domains).
  9. The filtered URLs are sent to the URL Loader/Detector which checks whether each URL is already stored in the URL Frontier, already crawled, or new; existing URLs are discarded, and only new, un-crawled URLs are kept.
  10. The new URLs are stored in the URL Storage, which is persistent storage that preserves URLs between crawling sessions.
  11. The new URLs are pushed to the URL Frontier to be crawled in the next cycle(s).
  12. A new URL is popped from the URL Frontier and steps 2-11 are repeated until the URL Frontier is empty; a minimal code sketch of this loop follows the list.
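
To make the cycle above concrete, below is a minimal, single-threaded sketch of the crawl loop in Python, using only the standard library. The names (`crawl`, `LinkExtractor`, the seed URL, the page limit) are illustrative, and the in-memory queue, sets, and dictionary stand in for the distributed URL Frontier, Duplicate Detection cache, and Data Storage components described above.

```python
import hashlib
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """HTML Parser / URL Extractor: collects href values from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, max_pages=50):
    frontier = deque(seed_urls)          # URL Frontier (queue of URLs to crawl)
    seen_urls = set(seed_urls)           # URL Loader/Detector (already known URLs)
    seen_hashes = set()                  # Duplicate Detection (content fingerprints)
    storage = {}                         # Data Storage (url -> page content)

    while frontier and len(storage) < max_pages:
        url = frontier.popleft()         # pop the next URL from the frontier
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except Exception:
            continue                     # HTML Fetcher failed; skip this URL

        fingerprint = hashlib.sha256(html.encode()).hexdigest()
        if fingerprint in seen_hashes:   # duplicate page: discard and move on
            continue
        seen_hashes.add(fingerprint)
        storage[url] = html              # store the fetched page

        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            # URL Filter: keep only http(s) links on the same host
            if urlparse(absolute).scheme not in ("http", "https"):
                continue
            if urlparse(absolute).netloc != urlparse(url).netloc:
                continue
            if absolute not in seen_urls:   # keep only new, un-crawled URLs
                seen_urls.add(absolute)
                frontier.append(absolute)   # push new URLs to the frontier

    return storage


if __name__ == "__main__":
    pages = crawl(["https://example.com/"])   # placeholder seed URL
    print(f"Crawled {len(pages)} page(s)")
```

A production crawler would replace these in-memory structures with persistent, distributed services and add DNS caching, politeness delays, robots.txt checks, and far more robust parsing and filtering.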

As we saw, the crawling process is complex, and such systems must be designed to be scalable, efficient, concurrent, distributed, continuous, and adaptive to new formats and technologies; they must also respect robots.txt, avoid spamming and overloading servers, and be resilient to malicious pages and spider traps (UoPeople, 2023).
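
As a sketch of two of these requirements, respecting robots.txt and not overloading servers, the snippet below uses Python's standard `urllib.robotparser` together with a simple per-host delay; the user-agent string and delay value are placeholders chosen for the example.

```python
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

USER_AGENT = "MyCrawler/0.1"   # placeholder user-agent string
CRAWL_DELAY = 1.0              # assumed minimum delay between requests to one host

robots_cache = {}              # host -> RobotFileParser, fetched once per host
last_request = {}              # host -> timestamp of the previous request


def allowed_and_polite(url):
    """Return True once the URL is allowed by robots.txt and the per-host delay has passed."""
    host = urlparse(url).netloc

    if host not in robots_cache:
        rp = RobotFileParser(f"{urlparse(url).scheme}://{host}/robots.txt")
        try:
            rp.read()
        except Exception:
            pass               # fetch failed: the unread parser conservatively disallows fetches
        robots_cache[host] = rp

    if not robots_cache[host].can_fetch(USER_AGENT, url):
        return False           # disallowed by robots.txt

    elapsed = time.time() - last_request.get(host, 0.0)
    if elapsed < CRAWL_DELAY:
        time.sleep(CRAWL_DELAY - elapsed)   # wait to avoid overloading the server
    last_request[host] = time.time()
    return True
```

A fuller implementation would also honor each site's Crawl-delay directive and refresh cached robots.txt entries periodically.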

Task 2: What are the differences between static and dynamic web content?

Static content is content that does not change over a relatively long period; that is, it does not change between crawling sessions. Such content is usually stored in files on the server and served as-is to the client (Manning et al., 2009).

Dynamic content is the opposite of static content: the server may return a different page each time a client requests it, with the content depending on the URL path or query parameters, cookies, and/or other data included in the request (Manning et al., 2009).

Dynamic content is often not fully indexed by search engines, as the crawler cannot enumerate all possible URLs and parameter combinations; such content is also frequently locked behind an authentication mechanism that the crawler cannot bypass. The move from Web 1.0 to Web 2.0 brought a dramatic increase in dynamic content.
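
To illustrate the contrast, here is a toy Python server (paths, port, and parameter names are invented for the example): the `/about` path returns the same stored bytes on every request, i.e. static content, while every other path builds its response from the query string and the current time, i.e. dynamic content.

```python
from datetime import datetime, timezone
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

STATIC_PAGE = b"<html><body><h1>About us</h1><p>Same bytes for every visitor.</p></body></html>"


class DemoHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        parsed = urlparse(self.path)

        if parsed.path == "/about":
            # Static content: the same stored bytes are returned for every request.
            body = STATIC_PAGE
        else:
            # Dynamic content: the response is generated per request from the
            # query parameters and the current time, so two requests can differ.
            name = parse_qs(parsed.query).get("name", ["guest"])[0]
            now = datetime.now(timezone.utc).isoformat()
            body = f"<html><body><p>Hello {name}, it is {now}.</p></body></html>".encode()

        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    # e.g. http://localhost:8000/about (static) vs http://localhost:8000/hello?name=Lina (dynamic)
    HTTPServer(("localhost", 8000), DemoHandler).serve_forever()
```

Fetching http://localhost:8000/about twice yields identical pages, while http://localhost:8000/hello?name=Lina changes with every request and with every parameter value, which is exactly what makes exhaustive crawling of dynamic content impractical.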

References