Basics of Web Scraping Part I: Understanding What Web Scraping Is

Zohaib Bashir
3 min read · Jan 22, 2024


1. Introduction

We live in a world where data has become the lifeline of countless applications and businesses. Many fields depend on publicly available data to do their jobs, but there is so much of it that collecting it by hand is impractical: it takes a long time, and manual collection invites human error. Web scraping addresses this by extracting information from websites automatically. Depending on the task, a scraper can finish in seconds, and it avoids the copy-and-paste mistakes of manual collection, although the results are only as accurate as the source pages and the scraper's own logic. In this guide, I’ll walk you through the basics of web scraping, assuming you are a complete newbie in the field. Keep in mind that web scraping is meant for publicly available data, such as e-commerce websites, news websites, and blogs.

2. Web Scraping

Web scraping refers to the automated process of extracting data from websites. It allows us to gather information from various online sources, transforming unstructured data into a structured format that can be analyzed and utilized. The applications of web scraping are diverse, ranging from market research and price comparison to data analysis and content aggregation. Some of the jobs that utilize data from web scraping are as follows:

  • Data Scientist
  • Business Intelligence Analyst
  • Market Research Analyst
  • Data Engineer
  • Machine Learning Engineer
  • Deep Learning Engineer
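To make the idea of turning unstructured page content into a structured format more concrete, here is a minimal sketch using only Python's standard library. The HTML fragment, the class names (`product`, `name`, `price`), and the helper `extract_products` are all made up for illustration; a real site will have its own markup, and most scrapers use a dedicated parsing library instead.

```python
from html.parser import HTMLParser

# Hypothetical page fragment; a real scraper would download this from a site.
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">Laptop</span><span class="price">999.99</span></li>
  <li class="product"><span class="name">Mouse</span><span class="price">19.50</span></li>
</ul>
"""


class ProductParser(HTMLParser):
    """Collects the text inside <span class="name"> and <span class="price">."""

    def __init__(self):
        super().__init__()
        self.products = []   # one dict per <li class="product">
        self._field = None   # name of the field currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and attrs.get("class") == "product":
            self.products.append({})  # start a new record
        elif tag == "span" and attrs.get("class") in ("name", "price"):
            self._field = attrs["class"]

    def handle_data(self, data):
        # Only keep text that sits inside one of the spans we care about.
        if self._field and self.products:
            self.products[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def extract_products(html):
    """Return a list of {'name': str, 'price': float} records."""
    parser = ProductParser()
    parser.feed(html)
    # Convert the price text to a number so the records are ready for analysis.
    return [{"name": p["name"], "price": float(p["price"])} for p in parser.products]


products = extract_products(SAMPLE_HTML)
print(products)
```

The result is a plain list of dictionaries, which is the kind of structured data the roles listed above would then load into a spreadsheet, database, or analysis tool.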

3. Applications of Web Scraping

  • Market Research: Businesses can use web scraping to monitor competitors, track pricing trends, and gather consumer reviews, enabling them to make informed decisions.
  • Data Analysis: Researchers and analysts can leverage web scraping to collect data for studies, identify patterns, and draw meaningful conclusions.
  • Content Generation: News websites and content platforms often use web scraping to gather information from multiple sources into a centralized hub of data, which can later be used to train machine learning or deep learning models that generate content.
  • Job Market Insights: Job seekers and recruiters can utilize web scraping to gather information about job postings, salary trends, and skill requirements in the job market.

4. Key Concepts

  1. Before you start web scraping, it is important to understand the building blocks of a website: HTML, CSS, and JavaScript. HTML matters most, because a scraper locates data through a page's HTML structure; without a working knowledge of it, extracting information reliably is very difficult. A good resource to build your understanding of HTML can be found here
  2. XPath and CSS selectors are two common ways to locate elements within the HTML structure. XPath describes a path through the document tree, while a CSS selector matches elements by tag, class, id, or attribute.
  3. Robots.txt: most websites have one. Robots.txt is a standard that websites use to communicate with web crawlers and other automated agents, including scrapers, specifying which parts of the site should not be crawled or scraped. The file lives at the root of the site, so you can usually view it by adding /robots.txt after the domain name. These links show the robots.txt files of Google and Amazon, respectively:

· Google robots.txt

· Amazon robots.txt
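The last two concepts can be sketched with Python's standard library alone. The snippet below is only an illustration under stated assumptions: the robots.txt rules, the page fragment, the `example.com` URLs, the `my-bot` user agent, and the helper names are all hypothetical. Note also that `xml.etree.ElementTree` supports only a limited subset of XPath and needs well-formed markup; real-world HTML and CSS selectors usually call for a third-party parsing library.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one is served at <site>/robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

# Hypothetical, well-formed page fragment.
PAGE = """
<html><body>
  <div class="post"><h2>First post</h2><a href="/posts/1">Read</a></div>
  <div class="post"><h2>Second post</h2><a href="/posts/2">Read</a></div>
  <div class="ad"><h2>Sponsored</h2></div>
</body></html>
"""


def is_allowed(robots_txt, user_agent, url):
    """Check a URL against robots.txt rules supplied as text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def post_titles(page):
    """Select the <h2> text of every <div class="post"> using an XPath expression."""
    root = ET.fromstring(page.strip())
    return [h2.text for h2 in root.findall(".//div[@class='post']/h2")]


def post_links(page):
    """Select the href attribute of every link inside a <div class="post">."""
    root = ET.fromstring(page.strip())
    return [a.get("href") for a in root.findall(".//div[@class='post']/a")]


# Check the rules first, then extract only from pages we may fetch.
print(is_allowed(ROBOTS_TXT, "my-bot", "https://example.com/posts/1"))
print(is_allowed(ROBOTS_TXT, "my-bot", "https://example.com/private/data"))
print(post_titles(PAGE))
print(post_links(PAGE))
```

The XPath expression `.//div[@class='post']/h2` reads as "any `div` whose `class` is `post`, then its `h2` child", which is why the sponsored block is skipped. The equivalent CSS selector would be `div.post > h2`.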

5. Conclusion

Web scraping is a crucial tool in today’s digital age, enabling people and businesses to leverage the abundant information on the web. In this part of the series, we’ve examined various uses of web scraping and introduced the fundamental concepts behind it. In Part II, we’ll explore more advanced techniques through coding and practical examples. Get ready for an exciting exploration into the realm of web scraping!
