How to Scrape a Website Using Node.js and Cheerio

A step-by-step guide to scraping Amazon with Node.js and Cheerio — using residential proxies to help overcome the challenges of web scraping.

Introduction

Web scraping is a technique employed to extract targeted information or data from websites, and it finds application in various digital businesses that depend on gathering vast amounts of data. Here are some practical applications of web scraping:

  1. Price Comparison: Price comparison websites/extensions use web scraping to gather data from e-commerce websites over time, enabling users to compare product prices.
  2. Social Media Scraping: Web scraping is employed to collect data from social media websites such as Twitter, providing insights into current trends.
  3. Research and Development: Web scraping is instrumental in collecting extensive datasets (statistics, general information, temperature, etc.) from websites. These datasets are then analyzed and utilized for surveys or research and development purposes.
  4. Job Listings: Details regarding job openings and interviews can be scraped from various job portals and then listed in one place so that it is easily accessible to the user.

In this article, we are going to build a basic scraper that extracts customer reviews for a specific product on a popular e-commerce website — Amazon. We will build this using Node.js and Cheerio.

💡 As a bonus, we will also explore the use of proxies in our web scraping script. Proxies are crucial in web scraping as they help bypass certain restrictions that may be placed on your IP address by the target website. By using a proxy, your requests appear to come from different IP addresses, thereby circumventing these limitations and ensuring uninterrupted data collection.

A Primer on Cheerio

Cheerio is a fast and flexible library that parses HTML and XML documents and allows you to use syntax similar to jQuery while working with the parsed data. When combined, Node.js and Cheerio can be used to build efficient web scraping tools.

Here is an example of how you might use it with Node.js:

const axios = require('axios');
const cheerio = require('cheerio');

const url = 'https://example.com';

axios.get(url)
  .then(response => {
    const html = response.data;
    const $ = cheerio.load(html);
    const data = $('div').text();
    console.log(data);
  })
  .catch(error => console.error(error));

In this example, Axios is used to send an HTTP request to the website you want to scrape. The response, which is the HTML of the page, is then loaded into Cheerio (denoted as $ in the code). The script then selects all the div elements on the page and logs their combined text. Axios is needed here because Cheerio is only a parser; it cannot send or receive HTTP requests.
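
Beyond .text(), Cheerio exposes other jQuery-style accessors. For instance (a small illustration, with generic selectors chosen for this example):

// Grab the href attribute of the first link on the page
const firstLink = $("a").first().attr("href");

// Collect the text of every h2 element into a plain array
const headings = $("h2")
  .map((i, el) => $(el).text())
  .get();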

Get ready as we dive into the exciting world of web scraping with Node.js and Cheerio.

Setting Up the Environment

Setting up Node.js

Before starting to set up the project, please make sure that you have Node.js installed on your local machine.

If not, you can download Node.js from the official website. There are official packages available for all the major platforms.

After downloading, run the installer and follow the steps. After installation, you should have access to the node executable program in the command line.

You can verify that Node.js is installed correctly by opening a terminal or command prompt and typing node -v. This should display the installed version of Node.js.

This should return something like:

v18.11.0

Once you make sure that Node.js is properly installed and configured on your machine, you can move on to the next step.

Step 1: Set up a New Node.js Project

First create the folder that will contain our project with this command:

mkdir amazon-scraper

You should now have an empty amazon-scraper directory. Enter the folder with:

cd amazon-scraper

Next initialize an npm project with:

npm init -y

This will create a package.json file for your project.

Now, create an index.js file in the root folder of your project and write some code to test:

// index.js
console.log("Hello, World!")

We will write our application code in this file.

Open your package.json file and add the following script in the scripts section:

"start": "node index.js"

You can now run the command below in your terminal to launch your Node.js script:

npm run start

Step 2: Install Cheerio and Axios

In your terminal or command prompt, navigate to your project directory and run:

npm install cheerio axios

This will install Cheerio and Axios locally for your project and add them as dependencies in your package.json file. Axios will be used to fetch the HTML body of the website, and Cheerio will be used to parse it.
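
After the installation completes, the dependencies section of your package.json should list both packages. The exact versions will vary, but it will look something like:

"dependencies": {
  "axios": "^1.6.0",
  "cheerio": "^1.0.0-rc.12"
}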

Building a Simple Web Scraper

As mentioned earlier, we are set to develop a basic scraper designed to extract customer reviews for any product on Amazon. The scraper’s primary function is to get the HTML of the product details page and retrieve all associated customer reviews.

How are we going to build this tool?

Step 1: Searching for a Product on Amazon

To illustrate the process, let’s consider searching for the “Apple 2023 MacBook Pro Laptop” on Amazon.

To see the reviews on a product, scroll down to the bottom and you will find a list of customer reviews from the United States. This is interesting data to scrape.

Step 2: Copy the Product URL

After searching, we need to copy the URL of the product page to scrape the reviews.

Step 3: Inspect the HTML page

By inspecting the HTML page, especially scrolling down to the bottom where customer reviews are located, we identify that the reviews are encapsulated within <div> HTML elements.

Step 4: Analyzing HTML Elements

We start by analyzing the HTML elements and note the CSS classes that characterize these HTML elements.

Using them, you will be able to define the CSS selectors required to select those HTML elements from the DOM. In detail, note that the review cards are contained in a div with the classes a-section, review, and aok-relative, which correspond to the CSS selector .a-section.review.aok-relative.

Then, given a card, you can extract all its relevant data with the following CSS selectors:

  1. .a-profile > div.a-profile-content > span.a-profile-name: To get the name of the user
  2. .review-title.review-title-content > span:last-child: To get the title of the review
  3. .review-date: To get the date of the review
  4. .reviewText.review-text-content > span:last-child: To get the text of the review

Step 5: Select HTML elements with Cheerio

Cheerio provides several methods to select HTML elements. You can use CSS selectors as you would with jQuery. For instance:

const htmlElementByClass = $(".elementClass");
const htmlElementById = $("#elementId");

You can select HTML elements by passing to $ any valid CSS selector, just as you would do in jQuery. You can also concatenate selection logic with Cheerio methods:

// retrieving the list of reviews
const reviewCards = $("div.a-section.review.aok-relative");

You can then iterate on a list of Cheerio nodes with the each() method, as follows:

// iterating over the list of reviews
$("div.a-section.review.aok-relative").each((i, elem) => {
  // scraping logic…
});

Let’s put everything into action using the provided code:

const axios = require("axios");
const cheerio = require("cheerio");

const url = "https://www.amazon.com/Apple-MacBook-Laptop-12%E2%80%91core-19%E2%80%91core/dp/B0BSHF7LLL";

axios
  .get(url)
  .then((response) => {
    console.log("Fetching data...");

    const html = response.data;
    const $ = cheerio.load(html);

    const reviews = [];

    $("div.a-section.review.aok-relative").each((i, elem) => {
      const name = $(elem)
        .find("a.a-profile > div.a-profile-content > span.a-profile-name")
        .text()
        .trim();
      const reviewTitle = $(elem)
        .find("a.review-title.review-title-content > span:last-child")
        .text()
        .trim();
      const reviewDate = $(elem).find("span.review-date").text().trim();
      const reviewText = $(elem)
        .find("div.reviewText.review-text-content > span:last-child")
        .text()
        .trim();

      reviews.push({
        name,
        reviewTitle,
        reviewDate,
        reviewText,
      });
    });

    console.log("Data ready", reviews);
  })
  .catch((error) => console.error(error));

This code showcases how to use Axios to fetch the HTML of the specified Amazon product page, load it into Cheerio, and extract relevant information about each customer review. The extracted data, including the reviewer’s name, review title, date, and text, is then stored in an array named reviews. The final array is logged to the console, providing the extracted reviews for further use.
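
At this point, the reviews only live in memory. As a minimal sketch of how you might persist them, you could write the array to a JSON file using Node.js's built-in fs module (the output.json filename is just an example):

const fs = require("fs");

// Inside the .then() callback, after the reviews array has been filled:
fs.writeFileSync("output.json", JSON.stringify(reviews, null, 2));
console.log(`Saved ${reviews.length} reviews to output.json`);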

Launch your web scraper in Node.js with:

npm run start

You should see the scraped reviews printed to your console.

What’s the catch? You may not even see this result, and even if you do, it won’t be consistent. Here’s why:

  1. Anti-bot Measures: Websites frequently block access if you’re accessing them programmatically like we are doing right now. Worse, if you’re flagged for doing this too often, your IP might be banned altogether.
  2. Rate Limitations: Some websites limit the number of requests an IP can send within a certain timeframe. If you exceed this limit, your IP might be temporarily or permanently blocked. It’s important to respect these limits and adjust your scraping speed accordingly (a simple throttling sketch follows this list).
  3. CAPTCHAs: Websites often use CAPTCHAs to prevent automated data collection. If a website serves a CAPTCHA, your scraper will not be able to proceed without solving it.
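
One courtesy measure, as hinted at in point 2 above, is to pause between consecutive requests. Here is a minimal sketch, assuming a list of product URLs to fetch; the two-second delay is an arbitrary choice:

const axios = require("axios");

// Resolve after the given number of milliseconds
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function scrapeAll(urls) {
  for (const url of urls) {
    const response = await axios.get(url);
    // …parse response.data with Cheerio, as in the earlier example…
    await sleep(2000); // wait two seconds before the next request
  }
}

Throttling alone, however, won’t get you past IP bans or CAPTCHAs.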

This is where proxies can help.

Using proxies for web scraping

In addition to dealing with the problems mentioned above, proxies can help you overcome a variety of challenges, like running multiple scrapers in parallel from different IP addresses or extracting geo-restricted content, all while preserving anonymity.

In this section, we’ll be using Bright Data’s residential proxies as a solution to common scraping challenges. While we could have used free residential proxies, they’re not ideal: you never know how those proxies were sourced, which could land you in legal trouble, and they’re generally unreliable in terms of uptime and performance.

Bright Data, on the other hand, is a leading provider of web data collection tools and services such as the web scraper IDE, the scraping browser, and SERP API, with a fast, large, and diverse proxy network that boasts a 99.9% uptime guarantee and can handle any scraping challenge. Their proxies are all ethically sourced and comply with major data protection regulations like the GDPR and CCPA.

👉 Learn more about Bright Data’s Proxies and Scraping Infra.

Integrating Bright Data’s Proxies into our Code

Now that we’ve built a basic scraper using Node.js and Cheerio, it’s time to enhance our web scraping capabilities by integrating Bright Data’s proxies. Proxies play a crucial role in overcoming common challenges like IP bans, rate limitations, and geographical restrictions. Bright Data’s proxies, with their reliability and extensive network, can significantly improve the efficiency of our scraping process.

Step 1: Obtaining Bright Data Proxy Credentials

  1. Visit the Bright Data website and sign up for an account if you haven’t already. You can sign up for free, and when adding your payment method, you’ll receive a $5 credit to get you started!
  2. Sign in to your Bright Data control panel.
  3. Click on the My Proxies tab from the top menu.
  4. Click on the Add button and select Residential Proxies from the menu.
  5. Give your proxy a name, and choose the Shared IP type. You can also use a dedicated IP; however, this article does not require one. After selecting these options, click on Add.
  6. Once the proxy is added, you’ll be redirected to the proxy details page, where you can access the necessary details, like the host, username, and password.
  7. Note down the Host, Username, and Password. These will be used for integrating the proxy into your codebase.

Step 2: Integrating Bright Data’s IPs into the Scraping Script

Now, let’s modify our existing scraping script to utilize Bright Data’s proxies. The proxy needs to be integrated into Axios. To do so, open your existing code and create a new variable called config. This variable will hold the necessary proxy details. Here’s an example of how this variable should look:

const config = {
  proxy: {
    protocol: "http",
    host: "brd.superproxy.io",
    port: 22225,
    auth: {
      username: "brd-customer-**********",
      password: "2x5tan*****",
    },
  },
};

The proxy property contains details about the proxy server that our application will use.

  1. The protocol property specifies the protocol that the proxy server uses, which is http in this case.
  2. The host property is the hostname of the proxy server.
  3. The port property is the network port on the proxy server to which the application will connect.
  4. The auth property is a nested object that contains the authentication details required by the proxy server. It has two properties: your Bright Data username and password (a sketch for loading these from environment variables follows just after this list).
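
As flagged in point 4, hard-coding credentials is risky if you ever share or commit this file. One common alternative, sketched below, is to read them from environment variables. BRIGHTDATA_USERNAME and BRIGHTDATA_PASSWORD are names chosen for this example, not ones Bright Data prescribes:

const config = {
  proxy: {
    protocol: "http",
    host: "brd.superproxy.io",
    port: 22225,
    auth: {
      // Example variable names; set them in your shell before running
      username: process.env.BRIGHTDATA_USERNAME,
      password: process.env.BRIGHTDATA_PASSWORD,
    },
  },
};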

The hostname you received from Bright Data’s dashboard consists of both the host and the port number, separated by a colon. However, to use them with Axios, you need to define them in separate host and port keys, as shown above. Remember to replace the values with the ones you received from the dashboard.
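
If you would rather not split the host and port by hand, a small helper does it; the combined string below is illustrative:

// A "host:port" string in the shape shown on the dashboard
const hostAndPort = "brd.superproxy.io:22225";
const [host, portString] = hostAndPort.split(":");
const port = Number(portString); // Axios expects the port as a number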

You are now almost ready to send your requests using Bright Data’s residential proxy. The only change required to our previous code is to pass the config object into Axios, as a second parameter. Here’s how it should look:

axios
  .get(url, config)
  .then((response) => {
    // …
  })
  .catch((error) => console.error(error));

Everything else can remain unchanged.

Running the script now will send the GET request using Bright Data’s Residential Proxy network.

node index.js

You should see the same reviews printed to your console as before.

Nothing is different in the result itself, of course, but you’ll notice what’s changed when you run this script multiple times, with different products if you so wish. You’ll immediately see that the script, now powered by Bright Data’s proxies, retrieves the data successfully every time, without failing or facing any of the potential challenges mentioned earlier.

By integrating Bright Data’s residential proxies, you not only overcome common scraping challenges but also benefit from a more stable and secure web scraping experience.

👉 Learn more about Bright Data’s residential proxies in the official documentation.

Conclusion

Web scraping with Cheerio and Node.js is a powerful and efficient method for extracting data from websites that don’t use JavaScript to load dynamic content. The process involves sending HTTP requests to the target website, parsing the HTML response, and extracting the desired data. But in most cases, websites have protections against scraping, and Cheerio alone might not work well. This is where using a proxy will help.

Bright Data’s proxies offer several advantages for web scraping, including reliability and security, a large proxy network, and geo-targeting, among others. Here are some of the advantages of using Bright Data’s proxies over free proxies:

  1. Reliability: Bright Data’s proxies are fast, stable, and secure. They offer a 99.9% uptime guarantee and a dedicated customer support team. Free proxies, on the other hand, are often slow, unstable, and risky. They can expose your identity, compromise your data, or stop working at any time.
  2. Large Proxy Network: Bright Data leads the proxy market with over 72 million IPs from almost every country and city in the world. This means you can scrape any website from any location, without being blocked or detected. Free proxies usually have a limited number of IPs, which makes them easy to identify and ban by websites.
  3. Security: Bright Data offers the highest level of security and privacy for your web scraping activities. They use encryption, authentication, and rotation techniques to protect your online identity and data. Free proxies do not offer these security measures, and they can often leak your IP address, personal information, or web activity to third parties.
  4. Geo-targeting: Bright Data’s large proxy pool reduces the chances of IP blockages and enables precise geo-targeted (with ASN/ZIP targeting) web scraping. Free proxies usually do not offer this level of precision.
  5. Ethical Compliance: Bright Data’s proxy network consists of 100% ethically sourced IPs, fully compliant with major data protection regulations like the GDPR and CCPA. This means that by using these proxies, you won’t land in any legal trouble. Free proxies can’t guarantee any kind of ethical compliance, and data gathered using them might land you in trouble later on. Here’s a checklist to help you choose an ethically compliant proxy provider.

If you’re interested in learning more about Bright Data’s proxies and using them in your scraping activities, you can sign up for a free trial and explore the various proxy solutions they have to offer.

It’s also important to note that while web scraping is a powerful tool, it should be used ethically and responsibly. Always respect the website’s policies, such as the robots.txt file, and ensure that your data extraction activities do not violate any laws or terms of service.
