Best Tool for Web Scraping: BeautifulSoup vs. Regex vs. Advanced Web Scrapers

BeautifulSoup, Regular Expressions, or an advanced web scraper - which is the best tool for web scraping? A deep dive into web scraping using each.



Data extraction through web scraping has become a vital requirement across industries, for purposes ranging from marketing to strategic planning. The core idea of web scraping is to extract data from public websites that do not expose API endpoints for obtaining it.

Modern developers scrape the web in many ways, and the most traditional is writing your own script to extract exactly what you need, typically in a language like Python or JavaScript.

This blog post compares three methods of scraping data: using DOM parsing (via Python and the BeautifulSoup library), Regex (via JavaScript), and an advanced web scraping tool. We will look at each of these approaches and their pros and cons to help you decide the best method for your requirements.

First, we'll see how to scrape a website using DOM parsing. Let's begin.

1. Web Scraping Using DOM Parsing (with Python/BeautifulSoup)

Scraping websites using DOM traversal is relatively easy since there are plenty of third-party libraries for both Python and JavaScript that enable us to do just that. Let's begin with Python.

Step 1: Create a Virtual Environment and install the required libraries.

As we'll be using third-party dependencies (requests for HTTP requests, and BeautifulSoup for our DOM traversal) for scraping, it's best to create a virtual environment.

Execute the following two commands to create the virtual environment and activate it:

python3 -m venv venv
source venv/bin/activate

Or, if you're on Windows,

> python -m venv venv
> venv\Scripts\activate

After activation, the virtual environment's name will appear in the terminal prompt.

To start web scraping, install the required third-party libraries inside venv with these commands:

python -m pip install requests
python -m pip install beautifulsoup4

Step 2: Analyzing the blog to scrape.

Here we will use one of the blogs from GeeksforGeeks to demonstrate the implementation. We will select the following blog URL to start with the scraping script: https://www.geeksforgeeks.org/what-is-web-scraping-and-how-to-use-it/

The scraper will fetch the page at this URL and then collect attributes such as the article title.

Step 3: Implementing the Scraper

First, create a Python file named scraper.py and add the imports shown below:

from bs4 import BeautifulSoup
import requests
import json

The requests library sends a GET request to fetch the blog post's HTML:

blogUrl = "https://www.geeksforgeeks.org/what-is-web-scraping-and-how-to-use-it/"
response = requests.get(blogUrl)

Then, the returned text response must be parsed with BeautifulSoup's built-in HTML parser.

parsedHtml = BeautifulSoup(response.text, 'html.parser')

Then, the blog post's details can be extracted by performing DOM operations on the parsed tree.

blogDetails = []
# Note: this long class string is copied verbatim from the page's markup,
# and will break if the site changes its CSS classes
blog = parsedHtml.find('article', class_='content post-414923 post type-post status-publish format-standard hentry category-guestblogs tag-web-scraping')
blogTitle = blog.find('h1').text
blogDetails.append({
    'blogTitle': blogTitle
})

with open('stories.json', 'w') as file:
    file.write(json.dumps(blogDetails))

Here, the article element is located using the parsedHtml.find('article', class_=...) DOM operation, and its title is captured. Finally, the collected data is appended to a list and written to a JSON file, stories.json.

Now, DOM parsing for web scraping has some advantages as well as some disadvantages. Let's take a look.

Pros and Cons of Web Scraping with DOM Parsing

Pros:

  • Powerful libraries and versatility: One of the biggest advantages of this method for web scraping is its versatility. Python, for example, has several libraries for this purpose, such as BeautifulSoup, Scrapy, and Selenium. These have been in development for years, making it easy to extract data from HTML and XML documents and handle cookies, sessions, and other web scraping tasks.
  • Popularity and active community: DOM parsing is the go-to method for web scraping and you're likely to find plenty of tutorials, support forums, and other resources available to help you with your web scraping projects, whether you're using Python or JavaScript.

Cons:

  • Time and resource-draining process: Scraping large websites or multiple websites at once can be a time-consuming and resource-intensive task. If we're scraping at scale, we may find ourselves running into resource limits far before we run into server blocks. Additionally, some websites may have measures in place to prevent web scraping, and DOM parsing may not be effective at bypassing these measures. Needless to say, for both enterprises and solo developers, this isn't ideal. However, this issue is addressed if an advanced web scraper is used (more on this later).
  • Needs frequent developer intervention when scraping websites with changing structure or layout: Our scripts may break if the website we are scraping changes its layout or structure. This can require frequent maintenance of our web scraping scripts to ensure that they continue to work as intended. This can be a painful process for individual developers as well as enterprises. To avoid this, we can use advanced scrapers built to tackle dynamic websites with changing structures or layouts. As mentioned before, we will see this in more detail later in this article.
  • Data privacy - ethical and legal questions: Web scraping can raise ethical and legal considerations, depending on the website you are scraping and how you use the data you collect. It's important to respect the terms of service of the websites you are scraping and avoid violating any copyrights or other intellectual property rights. To avoid this grey area, it's best to use a scraping tool or service that is built in compliance with major laws and regulations including the new EU data protection regulatory framework, GDPR, and the California Consumer Privacy Act of 2018 (CCPA) - respecting requests to exercise privacy rights.

Next, we'll go through the process of web scraping using Regular Expressions (using JavaScript). Without further ado, let's get started.

2. Web Scraping Using Regular Expressions/JavaScript

If we want to do web scraping without any dependencies (keeping our script lightweight), our best bet is regular expressions ("regexes").

How does this work? Well, HTML is a markup language that structures content with tags, so regexes can match those tags and extract their contents.

For example, let's say we want to extract the text of all the headings (<h1>, <h2>, etc.) on a web page. We could use a regular expression to match the opening and closing tags of each heading and extract the text in between. We'll walk through this process shortly; it can be useful for pulling specific data from a page, or for building a structured representation of its content.

Here, we'll scrape the same website as we did with the Python example.

// Define an async function to fetch the page data
async function fetchPageData() {
  // Make an HTTP request to the web page
  const response = await fetch(
    "https://www.geeksforgeeks.org/what-is-web-scraping-and-how-to-use-it/"
  );
  // Get the HTML content of the page
  const html = await response.text();
  // Use a regular expression to match and extract the headings
  const headingsRegex = /<h[1-6]>(.*?)<\/h[1-6]>/g;
  const headings = html.match(headingsRegex);
  // Return the extracted headings
  return headings;
}
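One caveat: String.match with the /g flag returns the whole matched strings, tags included (e.g. <h2>Title</h2>), not just the text between them. A small helper (hypothetical, not part of the script above) can use matchAll to pull out only the capture group:

```javascript
// Hypothetical helper: extract only the text between the heading tags.
// matchAll gives us access to capture groups, which String.match with
// the /g flag does not.
function extractHeadingText(html) {
  // [^>]* also tolerates attributes on the heading tag
  const headingsRegex = /<h[1-6][^>]*>(.*?)<\/h[1-6]>/g;
  const headings = [];
  for (const match of html.matchAll(headingsRegex)) {
    headings.push(match[1].trim()); // match[1] is the text between the tags
  }
  return headings;
}
```

For instance, extractHeadingText('<h1>Intro</h1><h2 class="sub">Details</h2>') yields the heading texts without their surrounding tags.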

In summary, regular expressions can be useful for parsing document structures such as HTML by allowing us to match and extract specific tags and attributes from the document. This can be useful for extracting data from web pages or creating structured representations of their content.

As we can see, this is a relatively simple method compared to DOM parsing with an external library like BeautifulSoup. However, it is crucial to understand the pros and cons before using regular expressions for web scraping.

Let's see the pros and cons.

Pros and Cons of Web Scraping with Regular Expressions/JavaScript

Pros

  • Extracts precise information: One of the biggest advantages of using regex for web scraping is that it allows us to be very specific and precise in the information that we extract from a web page. With regex, we can define a pattern that will match only the exact text that we are looking for, and ignore everything else on the page. This can be particularly useful when working with web pages that have a lot of irrelevant or noisy information. In the example above, we used this method to extract the headings from the page, ignoring the rest of the data.
  • Widely supported: Another advantage of regex is that it is supported in many different programming languages, including JavaScript. This means that if we are familiar with regex, we can use it in our web scraping projects without having to learn a new language or tool.

However, there are also some potential drawbacks to using regex for web scraping.

Cons

  • Difficult to write and debug: Regex patterns can be complex and hard to read, and it can be easy to make mistakes that cause our pattern to fail to match the text we are looking for. In addition, regex can be slow to run, especially when applied to large amounts of text.
  • Can be inflexible: It can be brittle and inflexible. Because regex patterns are so specific, they can break easily if the structure of the web page changes slightly. This means that we may need to constantly update and maintain our regex patterns as the web page changes over time.
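A minimal sketch of that brittleness (the markup strings here are invented for illustration): a pattern written against the page's current markup silently stops matching after a minor redesign.

```javascript
// A pattern written for the page's current markup...
const strictHeading = /<h1>(.*?)<\/h1>/;

const originalMarkup = "<h1>Latest Posts</h1>";
const redesignedMarkup = '<h1 class="entry-title">Latest Posts</h1>';

// ...matches today, but fails as soon as the site adds a class attribute
const matchesBefore = strictHeading.test(originalMarkup);  // true
const matchesAfter = strictHeading.test(redesignedMarkup); // false
```

Nothing errors out here; the scraper simply starts returning empty results, which is why such patterns need constant maintenance.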

Overall, the pros and cons of using regex for web scraping will depend on our specific situation and needs. If we need to extract very specific information from a web page and are comfortable working with regex, then it can be a useful tool.

However, if we are dealing with large amounts of text or need to be able to adapt to changing web page structures, then other approaches to web scraping may be more suitable. This is where Bright Data's Web Scraper IDE comes in. We'll find out more about this tool in the following section.

3. Web Scraping Using Bright Data's Web Scraper IDE

Instead of using all the coding methods and technologies we discussed above, we can make use of an advanced web scraping tool like Bright Data's Web Scraper IDE. This powerful tool allows us to scrape data and store it in any standard format.

Step 1: Setting up an account to use the Web Scraper IDE

First, go to Bright Data's Web Scraper IDE and click on 'Start Free Trial'. Then we fill out the form with our basic details and sign up for Bright Data.

Here we need to give our organizational ID to register and start using the IDE.

Next, we will create an account with our details and demonstrate how to scrape data using the Bright Data IDE.

Step 2: Accessing the IDE

Once registered, go to the Web Data Platforms page, where we can find many helpful links to check the dataset marketplace, request custom datasets, check Bright Data insights, and access the IDE.

Next, select the 'Get started' button on the Web Scraper IDE card and choose a template to start. Here, we will select a Twitter hashtag search.

Step 3: Understanding the IDE and Scraping data

The IDE is built on top of JavaScript, and you can write your code according to your requirements. It provides functions and commands that are a superset of JavaScript and eases many of the pain points of web scraping by providing a streamlined developer experience.

The IDE itself consists of three main parts, which are:

  1. Interaction Code:

This is where we configure our scraper. Once we have selected a template, we can leave most of these settings unchanged. But if we're writing our own script, the IDE offers us detailed, helpful comments to let us know what each option does, and why we might want to change them. For example,

//Max Time of collection (4.5m + then 30sec to collect result)
let collection_time = new Date(Date.now() + 4.5 * 60000);

Date.now() returns a value in milliseconds, so if we wanted the collector to scrape data for a maximum of 2 minutes instead, we would change this to new Date(Date.now() + 2 * 60000).
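To make the unit conversion explicit, the same arithmetic can be wrapped in a small helper (hypothetical, not part of the IDE's template):

```javascript
// Date.now() is in milliseconds and there are 60000 ms in a minute,
// so the deadline is "now" plus minutes * 60000.
function collectionDeadline(minutes, nowMs = Date.now()) {
  return new Date(nowMs + minutes * 60000);
}

// collectionDeadline(4.5) reproduces the template's default;
// collectionDeadline(2) would cap the collector at two minutes.
```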

Also, we can filter navigation to unwanted pages by configuring an ignore array.

try {
  wait_network_idle({
    ignore: [
      /accounts.google.com/,
      /twitter.com\/sw.js/,
      /twitter.com\/i\/jot/,
    ],
  });
} catch (e) {}

  2. Parser Code:

Extracting HTML is only the first step. In this section, we write code that parses and sanitizes the extracted data into a structure that we want. This will get called on each variable we've collected in our Interaction Code.

For example, if we collected these in our Interaction Code...

// call parse() to get the data
// then call collect() to add a record to your final dataset
let data = parse();
collect({
  url: new URL(location.href),
  title: data.title,
  links: data.links,
});

...we'd have this in our Parser Code to sanitize (get rid of trailing whitespaces) and add structure (add all extracted links to an array instead of plaintext) to those variables.

// Code that will parse page HTML into structured data.
// Use parse() in interaction code to read this data.
return {
  title: $("h1").text().trim(),
  links: $("a")
    .toArray()
    .map((e) => new URL($(e).attr("href"))),
};

  3. User Interaction Cell:

Here, we can:

  • Input - Manage our inputs (in this example, this is where we would put the hashtags we're watching for). As expected, our inputs here would be inserted into our actual scrape code. Click Preview to test it.

  • Run Log - We can view the run logs to monitor the job status.

  • Browser Console - This window is the built-in browser console that displays any errors that occur during the runtime.

  • Browser Network - Here, we can monitor all the requests passed through the IDE.

  • Last Error - Here, we can see the last error that occurred during the scraping task.

  • Output - We can see the scraped result and download the data set.

That's a wrap. Next, we'll take a look at the pros and cons of this process.

Pros and Cons of Web Scraping with Bright Data's Web Scraper IDE

Pros:

  • The IDE is accessible via Bright Data's website, making it easy to start immediately. It does not require any in-house data collection infrastructure.
  • Built on Bright Data's proxy infrastructure, enabling scalability and 99.9% accuracy.
  • Uses an unlocker infrastructure that helps overcome captcha, IP, and device fingerprint-based blocks.
  • Gives us ready-made templates for a variety of data-rich websites, as well as useful functions and commands built on top of JavaScript to help us write our scraper without needing to code everything by hand.
  • Unlike the conventional scraping methods of DOM parsing or using regex, we can extract data from dynamic websites, as the IDE supports browser emulation out of the box, saving us the trouble of configuring Selenium.
  • In a typical web scraping application, we'd need to update our script whenever the website is updated. However, with Bright Data's IDE, we can extract data from any publicly available site without worries as it's a managed service that updates templates for us automatically.
  • Bright Data is committed to compliance with data protection requirements, including GDPR and CCPA.
  • While the other methods do have widespread community support, if we run into an issue, we'll still have to spend time searching for resources online. Bright Data offers 24/7 live support so we don't have to waste time sifting through a sea of information just to find the right solution to our problem.

Cons:

  • There is a free trial, but we need to pay for Bright Data after it ends.
  • As a common issue for all of the methods, only publicly available data can be extracted.

What is the Best Method for Web Scraping?

As we discussed above, DOM parsing is the go-to web scraping method, but it has its issues. It does not scale well, and when a website ships a redesign, we may have to rewrite most of the script.

Regexes are an unconventional strategy that could serve us well, and they keep our scripts lightweight and deployable anywhere. However, Regexes are notoriously finicky and difficult to write, and again, they'll break if the structure/layout of the site changes.

Instead of these methods, then, consider a much more stable scraping solution like the Web Scraper IDE when scaling up. Tools like Bright Data offer free trials, so you can evaluate them before spending any money.

Considering all the above approaches, you can decide what to choose as per your requirements.



