JA8. Learning Journal 8

Statement

Your learning journal entry must be a reflective statement that considers the following questions:

1. Describe what you did

This was the last week of the course; it covered crawler architecture, web indexing, and link analysis. I started the week by doing the required reading and taking notes, and then watched some additional YouTube videos. I also completed the discussion assignment and the review quiz, and prepared for the final exam.

2. Describe your reactions to what you did

This week's topic extended what we learned in the previous week and served as a recap of the entire course. It was interesting to learn about the architecture of crawlers and how they work, including the different data structures used in these systems and the reasons for choosing them.

3. Describe any feedback you received or any specific interactions you had. Discuss how they were helpful

I did not receive any feedback that was worth mentioning.

4. Describe your feelings and attitudes

The topic is very complex but interesting. There aren't many large-scale web crawlers in the world, and the few that exist took years of engineering effort to build. That's what I tell myself when I look back at the complexity of the course, knowing that I did not understand everything. I think it is fair to understand the basics and leave the details to future learning experiences, if I ever have any.

5. Describe what you learned

The week started with the architecture of crawlers: the features a crawler must have and those it should have. Crawling is defined as traversing the web graph, and the book uses the Mercator crawler as its reference architecture. The URL frontier, host splitter, DNS resolution, and distributed crawling were discussed in detail. Distributed indexing and the two ways to partition such indexes (by terms or by documents) were also covered, along with the connectivity server, which answers queries about the in-links and out-links of a page (Manning et al., 2009, Chapter 20: Web Crawling and Indexes).
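
To make the URL frontier idea concrete for myself, here is a minimal Python sketch. It is only my own simplification, not Mercator's actual design: it keeps one FIFO queue per host and a heap of the earliest time each host may be contacted again, and it leaves out prioritization entirely. The class name `PoliteFrontier`, the `delay` parameter, and the explicit `now` clock argument are all my assumptions for illustration.

```python
import heapq
from collections import deque
from urllib.parse import urlsplit


class PoliteFrontier:
    """Toy URL frontier: one FIFO queue per host plus a politeness delay.

    Hypothetical simplification for study purposes; there is no
    prioritization and the clock is passed in explicitly.
    """

    def __init__(self, delay=2.0):
        self.delay = delay    # seconds to wait between hits on one host
        self.queues = {}      # host -> deque of pending URLs
        self.heap = []        # (earliest allowed fetch time, host)
        self.ready_at = {}    # host -> earliest time it may be contacted
        self.seen = set()     # URLs already added, to avoid duplicates

    def add(self, url, now=0.0):
        """Queue a URL unless it has been seen before."""
        if url in self.seen:
            return
        self.seen.add(url)
        host = urlsplit(url).netloc
        if host not in self.queues:
            # First pending URL for this host: schedule the host itself,
            # still honoring any delay left over from an earlier fetch.
            self.queues[host] = deque()
            start = max(now, self.ready_at.get(host, now))
            heapq.heappush(self.heap, (start, host))
        self.queues[host].append(url)

    def next_url(self, now):
        """Return a URL whose host may be contacted at `now`, else None."""
        if not self.heap or self.heap[0][0] > now:
            return None
        _, host = heapq.heappop(self.heap)
        queue = self.queues[host]
        url = queue.popleft()
        self.ready_at[host] = now + self.delay
        if queue:
            # More URLs are waiting for this host: reschedule it.
            heapq.heappush(self.heap, (self.ready_at[host], host))
        else:
            del self.queues[host]
        return url
```

With two URLs queued for the same host, the second one is held back until the delay has passed, while a URL for a different host can be handed out immediately. That is the politeness behavior the frontier is meant to enforce.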

The last chapter of the book covers link analysis and how it is used in citation analysis and web crawling. It revisits the web graph and discusses link spam, anchor text, PageRank, random walks with teleportation, Markov chains and their transition probability matrices, hubs and authorities, and selecting a subset of pages in response to a query (Manning et al., 2009, Chapter 21: Link Analysis).
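
As a study aid, PageRank with teleportation can be sketched as a short power iteration in Python. This is my own minimal version under stated assumptions, not code from the book: the function name `pagerank`, the dictionary-based graph, and the fixed iteration count are hypothetical choices, `alpha` is the teleport probability, and every linked page is assumed to appear as a key in the dictionary.

```python
def pagerank(out_links, alpha=0.1, iterations=100):
    """Approximate PageRank by power iteration.

    out_links maps each page to the list of pages it links to.
    alpha is the probability of teleporting to a random page.
    Assumes every link target is itself a key of out_links.
    """
    pages = sorted(out_links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start from a uniform guess

    for _ in range(iterations):
        new_rank = {page: 0.0 for page in pages}
        for page in pages:
            targets = out_links[page]
            if targets:
                # With probability alpha, jump to any page uniformly.
                for other in pages:
                    new_rank[other] += alpha * rank[page] / n
                # Otherwise follow one of the out-links uniformly.
                for target in targets:
                    new_rank[target] += (1 - alpha) * rank[page] / len(targets)
            else:
                # A page with no out-links always teleports.
                for other in pages:
                    new_rank[other] += rank[page] / n
        rank = new_rank
    return rank


# Hypothetical three-page graph: a -> b, b -> a and c, c -> b.
example = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
scores = pagerank(example, alpha=0.5)
```

In this example, page `b` receives links from both other pages, so it ends up with the highest score, while `a` and `c` tie by symmetry. The scores always sum to 1 because they form a probability distribution over pages.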

6. What surprised me or caused me to wonder?

I was surprised by how the book models surfing the web as a matrix of transition probabilities and then follows the user's journey through it; I had never thought of it this way. The ideas of random walks and teleportation were also interesting.

7. What happened that felt particularly challenging? Why was it challenging for me?

Markov chains and transition matrices were complex topics. Although I got the main idea behind them, the math was not clear to me. The math behind computing PageRank was also challenging.
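
Writing out one small case by hand made the matrix less abstract for me. The example below is a sketch under my own assumptions: a hypothetical graph of $N = 3$ pages (page 1 links to page 2, page 2 links to pages 1 and 3, page 3 links to page 2) and a teleport probability $\alpha = 0.5$. Here $A$ is the link matrix with each row divided by the page's number of out-links, $P$ is the resulting transition matrix, and $\pi$ is the steady-state (PageRank) vector.

$$
P = (1-\alpha)\,A + \frac{\alpha}{N}\,\mathbf{1}\mathbf{1}^{T},
\qquad
A = \begin{pmatrix} 0 & 1 & 0 \\ \tfrac{1}{2} & 0 & \tfrac{1}{2} \\ 0 & 1 & 0 \end{pmatrix}
\;\Longrightarrow\;
P = \begin{pmatrix} \tfrac{1}{6} & \tfrac{2}{3} & \tfrac{1}{6} \\ \tfrac{5}{12} & \tfrac{1}{6} & \tfrac{5}{12} \\ \tfrac{1}{6} & \tfrac{2}{3} & \tfrac{1}{6} \end{pmatrix}
$$

$$
\pi P = \pi, \quad \pi_1 + \pi_2 + \pi_3 = 1
\;\Longrightarrow\;
\pi = \left( \tfrac{5}{18},\; \tfrac{4}{9},\; \tfrac{5}{18} \right)
$$

Each row of $P$ sums to 1 because it is a probability distribution over the next page. Page 2 gets the largest share because both other pages link to it.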

8. What skills and knowledge do I recognize that I am gaining?

As a web developer, I think I gained a lot of knowledge about crawlers, data structures, and the algorithms used in information retrieval. The most interesting part was the quantification of the user journey through the web graph, which may help me plan and carry out user experience research later.

9. What am I realizing about myself as a learner?

I think I am not good at math. Every time I face a topic that involves math, I feel my mind drifting away, and I read the text while ignoring the math. Part of the reason is that I will never memorize this math, and if I ever need it, I'll just look it up.

10. In what ways am I able to apply the ideas and concepts gained to my own experience?

As our organization grows, we need an information retrieval system for our documentation, reports, and other documents. I think what I learned will be useful for building data mining pipelines for the company. Even if I don't build these tools from scratch, it is still useful to understand how they work and how to use them.

References

Manning, C.D., Raghavan, P., & Schütze, H. (2009). An Introduction to Information Retrieval (Online ed.). Cambridge, MA: Cambridge University Press. Chapter 20: Web Crawling and Indexes. http://nlp.stanford.edu/IR-book/pdf/20crawl.pdf
Manning, C.D., Raghavan, P., & Schütze, H. (2009). An Introduction to Information Retrieval (Online ed.). Cambridge, MA: Cambridge University Press. Chapter 21: Link Analysis. http://nlp.stanford.edu/IR-book/pdf/21link.pdf