JA7. Learning Journal for Unit 7
Statement
Your learning journal entry must be a reflective statement that considers the following questions:
1. Describe what you did
This was the seventh week of the course, and it covered the basics of web search and crawlers. I started the week by doing the required reading and taking notes, and then watched the lecture videos along with some additional YouTube videos. I also completed the discussion assignment, the programming assignment, and the self-quiz.
2. Describe your reactions to what you did
I found this week's topic interesting because I work in web development professionally and have always wondered how search engines work. I was surprised by the complexity of the topic and by how many things are involved in building a search engine. It was also interesting to learn the difference between the IR systems that power general search engines and the specialized IR systems used in specific domains.
3. Describe any feedback you received or any specific interactions you had. Discuss how they were helpful
I did not receive any feedback that was worth mentioning.
4. Describe your feelings and attitudes
The topic is very complex but interesting. It is not possible to read and understand everything about it in one week, but I think I got a broad picture of how search engines work. The programming assignment took a lot of time, but I was able to build a simple crawler and search engine. Using BeautifulSoup made the crawler easier to build, but it also abstracted away some concepts that we should know about.
5. Describe what you learned
The week started with the architecture of crawlers: how they work, their role in information systems, and what is expected of a good crawler. Politeness, respecting robots.txt, immunity to spider traps, proper handling of duplicate and spam content, performance, scalability, and continuous crawling are all characteristics of a good crawler (UoPeople, 2023).
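Some of these expectations can be illustrated in a few lines. The following is only a minimal sketch, not the code from my programming assignment and not a full crawler: the `ROBOTS_TXT` content, the `journal-bot` agent name, the `max_depth` limit, and the helper names are all assumptions made up for illustration. It uses Python's standard `urllib.robotparser` to respect robots.txt, a set of seen URLs to skip duplicates, and a crude path-depth limit as a guard against spider traps.

```python
from urllib.parse import urldefrag, urlparse
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_robots(text):
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def normalize(url):
    """Drop the #fragment so the same page is not queued twice."""
    return urldefrag(url)[0]


def should_fetch(url, robots, seen, agent="journal-bot", max_depth=5):
    """Return True only for new, allowed, reasonably shallow URLs."""
    url = normalize(url)
    if url in seen:
        return False  # duplicate URL
    depth = len([part for part in urlparse(url).path.split("/") if part])
    if depth > max_depth:
        return False  # crude spider-trap guard (assumed heuristic)
    if not robots.can_fetch(agent, url):
        return False  # disallowed by robots.txt
    seen.add(url)
    return True


robots = build_robots(ROBOTS_TXT)
seen = set()
for candidate in [
    "https://example.com/docs/a",
    "https://example.com/docs/a#top",
    "https://example.com/private/x",
]:
    print(candidate, should_fetch(candidate, robots, seen))

# Politeness: wait this many seconds between requests to the same host.
print("crawl delay:", robots.crawl_delay("journal-bot"))
```

Real crawlers detect duplicates by content (for example with fingerprints) and not only by URL, so the seen-set here covers just the simplest case.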
The textbook starts with a brief history of web search: full-text and taxonomy-based search engines were the first generation, ranked algorithmic engines appeared next, and more recent developments include sponsored and personalized search (Manning et al., 2009).
The textbook then continues with the characteristics of the web. The web is described as decentralized, uncontrolled, open to thousands of languages and dialects (which makes stemming harder), heterogeneous, planetary in scale, not limited to professional writers, and containing all kinds of information, from high-quality peer-reviewed work to outright lies (Manning et al., 2009).
The web graph was explained next, along with the various techniques spammers use to trick search engines. The advertising business model was then introduced, which opened the door to new industries such as SEO (search engine optimization) and SEM (search engine marketing) (Manning et al., 2009).
The user search experience was discussed later, where queries and information needs were clearly explained, and user queries were classified into navigational, informational, and transactional queries (Manning et al., 2009).
6. What surprised me or caused me to wonder?
I was surprised by how complex the structure of a crawler is, even though it is just one part of a search engine. I also think that some components described as parts of the crawler, such as the fetcher, parser, and indexer, are in fact complex enough to deserve their own systems.
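To show how the parser and indexer can be treated as separate pieces, here is a small sketch using only Python's standard library. It is an illustration under my own assumptions and not my assignment code: the `PageParser` class, the `index_page` function, and the sample HTML are hypothetical, the fetcher is left out entirely, and the tokenizer is a plain whitespace split with no stemming or stop-word handling.

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageParser(HTMLParser):
    """Parser stage: collects outgoing links and text tokens from one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.words = []

    def handle_starttag(self, tag, attrs):
        # Resolve relative hrefs against the page URL so they can be queued.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))

    def handle_data(self, data):
        # Naive tokenization: lowercase and split on whitespace.
        self.words.extend(data.lower().split())


def index_page(index, url, html):
    """Indexer stage: add the page's terms to the inverted index, return its links."""
    parser = PageParser(url)
    parser.feed(html)
    for word in parser.words:
        index[word].add(url)
    return parser.links


# Inverted index: term -> set of URLs that contain the term.
index = defaultdict(set)
sample_html = '<p>Web search basics</p><a href="crawl.html">crawler</a>'
links = index_page(index, "https://example.com/docs/", sample_html)
print(links)
print(sorted(index))
```

Even in this toy version the two stages have different concerns (link resolution versus term storage), which hints at why they grow into separate systems at web scale.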
7. What happened that felt particularly challenging? Why was it challenging for me?
The programming assignment was very challenging. I spent an entire day trying to connect the components, but I was happy that I managed to put together a simple crawler and search engine in that time.
8. What skills and knowledge do I recognize that I am gaining?
As a web developer, I think I gained a lot of knowledge about how search engines work, how users search and interact with them, and what users expect from a search engine. The introduction to the history of search engines was also interesting.
I think the most important thing I learned this week was the advertising business model and how it changed the web and search engines, especially since it is the main source of income for many web applications and startups.
9. What am I realizing about myself as a learner?
I think I invested a good amount of time in the programming assignment, which is something I’m proud of, although my implementation is a bit different from the theory we learned in the book. I wish the book contained a reference implementation that we could compare our code to.
10. In what ways am I able to apply the ideas and concepts gained to my own experience?
As our organization grows, we need an information retrieval system for our documentation, reports, and other documents, and I think I can use the knowledge I gained in this area. I will rarely build an IR system or search engine from scratch, but this knowledge will make integrating a third-party system easier.
References
- Manning, C. D., Raghavan, P., & Schütze, H. (2009). An Introduction to Information Retrieval (Online ed.). Cambridge, MA: Cambridge University Press. Chapter 19: Web Search Basics. Available at http://nlp.stanford.edu/IR-book/information-retrieval-book.html?
- UoPeople. (2023). Unit 7 Lecture 1: Introduction to Web Crawling. Uopeople.edu. https://my.uopeople.edu/mod/kalvidres/view.php?id=392894