If you’re tasked with extracting data from an e-commerce website, whether as a data analyst or marketer, there are two approaches you might take:
- Manual data gathering: This approach is time-consuming and tedious.
- Web scraping: Writing a script for automated data extraction.
While web scraping is more efficient, it has its challenges. You’ll often need to inspect web pages to locate the relevant data, and frequent updates to website element classes or anti-scraping mechanisms can cause your scripts to break, requiring constant maintenance.
Robotic Process Automation (RPA) offers a better alternative by automating the entire data extraction process. Tools like Bright Data’s Web Scraper API, a prime example of RPA, can log into a designated URL, navigate through pages, extract specific data, and transform it into the desired format seamlessly.
What You’ll Learn in This Article
- How to extract data from any website using the Web Scraper API.
- How to collect product reviews and details from e-commerce platforms like Amazon.
- How to use extracted data for competitor analysis.
What Is RPA and How Does It Work?
RPA involves deploying software bots that mimic human actions, such as clicking, typing, and navigating websites. Unlike traditional web scraping methods that directly query web elements, RPA bots interact with websites in the same way a human would, bypassing many anti-scraping defences. These bots rely on predefined rules and workflows to extract, organize, and store data efficiently.
For example, an RPA bot designed for e-commerce data extraction could:
- Log in to an e-commerce platform.
- Search for a specific product category or keyword.
- Extract product details such as prices, reviews, and availability.
- Export the collected data to a structured format like Excel or a database.
Bright Data has an RPA tool called the Web Scraper API, which you will use in this tutorial. The Web Scraper API simplifies web data extraction with advanced features such as automated IP rotation, CAPTCHA solving, and data parsing into structured formats. What sets it apart is its specialised capabilities, including bulk request handling, data discovery, and automated validation. These features, combined with technologies like Residential Proxies and JavaScript Rendering, make it a powerful tool for seamless and efficient data collection.
What are the benefits of using RPA over traditional web scraping for data extraction?
- Dynamic Interaction: Bots can adapt to changes in website layouts or structures by following user-defined workflows, reducing the risk of failure when element classes are updated.
- Scalability: Whether you’re extracting data from a single webpage or thousands of pages, RPA can handle large-scale operations without compromising accuracy.
- Compliance-Friendly: Unlike aggressive scraping methods that may violate website terms of service, RPA mimics human browsing behaviour, making it a more compliant approach.
- Low Maintenance: With proper configuration, RPA workflows require minimal updates, even if websites implement changes to their design or structure.
Using Bright Data’s RPA tool to collect product prices and reviews from Amazon
The Bright Data Web Scraper API provides a robust, scalable solution for extracting data from websites, including dynamic e-commerce platforms like Amazon. Here’s how you can use it to collect product prices and reviews.
Getting Started with the Web Scraper API
Before diving into the implementation, ensure you have the following:
- Bright Data Account: Sign up at Bright Data.
- API Key: Obtain your unique API key from the Bright Data dashboard to authenticate your requests.
Steps to Extract Data from Amazon Using the Scraper API
1. Define Your Target Data
Decide on the specific data points you want to extract, such as:
- Product names
- Prices
- Customer reviews and ratings
- Product descriptions
2. Find the Appropriate Scraper Template
Bright Data’s scraper marketplace offers prebuilt scrapers for various use cases. For this tutorial, you’ll use two templates:
- “Amazon Reviews — collect by URL”
- “Amazon Products — collect by URL”
If you don’t find a suitable prebuilt scraper, you can create one using the no-code or coding method.
Using the “Amazon Reviews — Collect by URL” Template
1. Access the Template:
- Open the Bright Data marketplace and select the “Amazon Reviews — collect by URL” template.
- Choose Scraper API and proceed.
2. Set Up Data Collection:
- Add the Amazon product URL in the “Data Collection APIs” section.
- Paste your API token and copy the generated cURL request.
3. Fetch Your Snapshot ID:
- Use Postman to create a new POST request. Paste the cURL request and send it.
- Copy the returned snapshot ID for further steps.
A snapshot ID is a unique identifier used by the Bright Data Web Scraper API to reference a specific dataset or data snapshot. It allows you to retrieve consistent and structured data that has already been processed and stored by the API. So, instead of fetching raw data directly from a website in real time, the snapshot ID ensures access to pre-scraped data, reducing latency and improving efficiency.
4. Retrieve Reviews:
- In the Bright Data dashboard for the Amazon review, paste your snapshot ID and copy the cURL request.
- In Postman, create a new GET request and paste the cURL request with the snapshot ID.
- Send the request and inspect the response to confirm successful data extraction.
- The response will include essential data such as:
- Review ID
- Rating
- Author Name
- Review Header
- Review Text
- Review Date
- Verified Purchase
- Helpful Count
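Step 4 above boils down to a GET request against the snapshot endpoint. As a minimal sketch (using the base URL and endpoint path that appear later in this tutorial; the snapshot ID below is a placeholder), a helper for building that retrieval URL:

```javascript
// Sketch: build the snapshot retrieval URL used by the Web Scraper API.
// The base URL and endpoint path match the ones used later in this tutorial;
// the snapshot ID here is a placeholder, not a real one.
function buildSnapshotUrl(baseUrl, snapshotId, format = 'json') {
  return `${baseUrl}/datasets/v3/snapshot/${encodeURIComponent(snapshotId)}?format=${format}`;
}

const url = buildSnapshotUrl('https://api.brightdata.com', 's_example_snapshot_id');
// → https://api.brightdata.com/datasets/v3/snapshot/s_example_snapshot_id?format=json
```

You would then send a GET request to this URL with your API key in the Authorization header, which is exactly what Postman does behind the scenes.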
Using the “Amazon Products — Collect by URL” Template
Repeat the same steps as above for the “Amazon Products — collect by URL” template. The response will include essential data such as:
- ASIN
- Title
- Brand
- Price
- Ratings
- Reviews count
- Availability
- Seller Name
With the Web Scraper API, you don’t need to worry about changes to website structures or anti-scraping mechanisms. Your data remains accessible whenever you need it.
Before proceeding to the next step, ensure you have the snapshot IDs for both the reviews and the products.
Building a Product Analysis System using the Web Scraper API
Step 1: Set Up the Environment
1. Install Node.js
Make sure you have Node.js installed. You can download it from the Node.js official site. Verify the installation:
node -v
npm -v
2. Create a Project Directory
Create and navigate to a new project directory:
mkdir product-analysis-system
cd product-analysis-system
Open the project with your preferred code editor.
3. Initialise the Project
Initialise a new Node.js project:
npm init -y
This will create a package.json file.
4. Install Required Packages
Add the necessary dependencies:
npm install axios winston moment danfojs-node dotenv
5. Set Up Environment Variables
Create a .env file in the project directory to store your Bright Data API key:
BRIGHT_DATA_API_KEY=your_api_key_here
Step 2: Understand the Project Structure
Core Functionalities
- Fetching Product Data: The system interacts with the Bright Data API to retrieve product reviews and details.
- Data Processing: Converts raw data into structured objects for analysis.
- Data Analysis: Performs analysis like rating distribution, purchase verification stats, and review trends.
- Report Generation: Summarizes the analysis into a report.
Step 3: Code Walkthrough
1. Environment Configuration
require('dotenv').config();
Loads the .env file for securely managing sensitive information like the API key.
2. Logger Setup
Using the winston library for logging:
const winston = require('winston');
const logger = winston.createLogger({
level: 'info',
format: winston.format.combine(
winston.format.timestamp(),
winston.format.json()
),
transports: [
new winston.transports.Console(),
new winston.transports.File({ filename: 'error.log', level: 'error' }),
new winston.transports.File({ filename: 'combined.log' }),
],
});
- Uses the winston library to log information, errors, and warnings.
- Logs are stored in the error.log and combined.log files.
3. Data Models
ProductReview Class: Defines and structures review data:
const moment = require('moment');

class ProductReview {
constructor({ review_id, rating, author_name, review_header, review_text, review_posted_date, is_verified, helpful_count }) {
this.review_id = review_id;
this.rating = rating;
this.author_name = author_name;
this.review_header = review_header;
this.review_text = review_text;
this.review_date = moment(review_posted_date, 'MMMM DD, YYYY').toDate();
this.verified_purchase = is_verified;
this.helpful_count = helpful_count;
}
}
- Models each review as an object.
- Converts review_posted_date to a JavaScript Date object.
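If you would rather avoid the moment dependency, the same “MMMM DD, YYYY” conversion can be sketched in plain JavaScript (parseReviewDate is a hypothetical helper, not part of the tutorial code):

```javascript
// Sketch: parse dates like "January 15, 2024" without moment.
const MONTHS = [
  'January', 'February', 'March', 'April', 'May', 'June',
  'July', 'August', 'September', 'October', 'November', 'December',
];

function parseReviewDate(text) {
  const match = /^(\w+) (\d{1,2}), (\d{4})$/.exec(text.trim());
  if (!match) return null; // unexpected format
  const monthIndex = MONTHS.indexOf(match[1]);
  if (monthIndex === -1) return null;
  // Use UTC so the result does not shift with the local timezone.
  return new Date(Date.UTC(Number(match[3]), monthIndex, Number(match[2])));
}
```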
ProductDetails Class: Models product details:
class ProductDetails {
constructor({ asin, title, brand, final_price, rating, reviews_count, availability, seller_name }) {
this.asin = asin;
this.title = title;
this.brand = brand;
this.price = parseFloat(final_price);
this.rating = parseFloat(rating);
this.reviews_count = reviews_count;
this.availability = availability;
this.seller_name = seller_name;
}
}
- Models product details like title, price, and reviews_count.
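One caveat: parseFloat returns NaN if final_price ever arrives with a leading currency symbol (for example "$29.99"). A slightly more defensive parser, assuming such formats can occur (parsePrice is a hypothetical helper):

```javascript
// Sketch: strip currency symbols and thousands separators before parsing.
function parsePrice(value) {
  if (typeof value === 'number') return value;
  const cleaned = String(value).replace(/[^0-9.]/g, ''); // drop "$", ",", etc.
  const price = parseFloat(cleaned);
  return Number.isNaN(price) ? null : price; // null signals an unparseable price
}
```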
4. Data Fetching
DataFetcher Class — Handles API requests:
const axios = require('axios');

class DataFetcher {
constructor(baseUrl, apiKey) {
this.baseUrl = baseUrl;
this.axiosInstance = axios.create({
baseURL: baseUrl,
timeout: 10000,
headers: {
Authorization: `Bearer ${apiKey}`,
Accept: 'application/json',
},
});
}
async makeRequest(endpoint) {
try {
const response = await this.axiosInstance.get(endpoint);
return response.data;
} catch (error) {
logger.error(`API request failed: ${error.message}`);
throw new Error(`Failed to fetch data: ${error.message}`);
}
}
async getProductReviews(snapshotId) {
return this.makeRequest(`/datasets/v3/snapshot/${snapshotId}?format=json`);
}
async getProductDetails(snapshotId) {
return this.makeRequest(`/datasets/v3/snapshot/${snapshotId}?format=json`);
}
}
- Fetches data from the Bright Data API using axios.
- Requires snapshot IDs (s_m60gwuw4181n8ikiob and s_m5y480dm2l7109x73g in this case) to get reviews and details.
5. Data Processing
DataProcessor Class — Converts raw API data into usable formats:
class DataProcessor {
static processReviews(rawReviews) {
return rawReviews.map((review) => {
try {
return new ProductReview(review);
} catch (error) {
logger.warn(`Error processing review: ${error.message}`);
return null;
}
}).filter((review) => review !== null);
}
static processProductDetails(rawDetails) {
try {
return new ProductDetails(rawDetails[0]);
} catch (error) {
logger.error(`Error processing product details: ${error.message}`);
throw error;
}
}
}
- Converts raw API data into ProductReview and ProductDetails objects.
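The map-then-filter pattern in DataProcessor (parse each record, log and skip the bad ones) can be exercised in isolation. A minimal, dependency-free sketch, with a hypothetical parse callback standing in for the ProductReview constructor:

```javascript
// Sketch: map raw records to parsed objects, dropping any that throw.
function processRecords(rawRecords, parse) {
  return rawRecords
    .map((record) => {
      try {
        return parse(record);
      } catch {
        return null; // the tutorial code logs a warning at this point
      }
    })
    .filter((record) => record !== null);
}

const parsed = processRecords(
  [{ rating: '5' }, { rating: 'not-a-number' }],
  (raw) => {
    const rating = Number(raw.rating);
    if (Number.isNaN(rating)) throw new Error('invalid rating');
    return { rating };
  }
);
// Only the valid record survives.
```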
6. Data Analysis
DataAnalyzer Class — Analyzes review trends and statistics using danfojs-node:
const { DataFrame } = require('danfojs-node');

class DataAnalyzer {
constructor(reviews, productDetails) {
this.reviews = reviews;
this.productDetails = productDetails;
this.reviewsDataFrame = new DataFrame(
reviews.map((review) => ({
...review,
review_date: review.review_date.toISOString(),
}))
);
}
getRatingDistribution() { /* Calculate rating counts */ }
getVerifiedPurchaseStats() { /* Calculate verified vs. unverified reviews */ }
getReviewTrends() { /* Trends by month */ }
}
- Performs statistical analysis using danfojs-node.
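The method bodies are elided above. One possible plain-JavaScript sketch of the first two analyses (written without danfojs-node so the example stays self-contained; the real class would operate on its DataFrame instead):

```javascript
// Sketch: count reviews per star rating (1-5).
function getRatingDistribution(reviews) {
  const distribution = { 1: 0, 2: 0, 3: 0, 4: 0, 5: 0 };
  for (const review of reviews) {
    if (distribution[review.rating] !== undefined) distribution[review.rating] += 1;
  }
  return distribution;
}

// Sketch: compare verified vs. unverified purchase reviews.
function getVerifiedPurchaseStats(reviews) {
  const verified = reviews.filter((r) => r.verified_purchase).length;
  return {
    verified,
    unverified: reviews.length - verified,
    verified_ratio: reviews.length ? verified / reviews.length : 0,
  };
}
```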
7. Report Generation
ReportGenerator Class — Creates a comprehensive analysis report:
class ReportGenerator {
constructor(analyzer) {
this.analyzer = analyzer;
}
generateSummaryReport() {
return {
product_summary: { /* Summary of product details */ },
review_analysis: { /* Summary of review data */ },
};
}
}
- Combines all analyses into a comprehensive report.
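A sketch of what generateSummaryReport might assemble, assuming the analyzer exposes the analyses shown earlier (field names follow the data models above, but the exact report shape is up to you):

```javascript
// Sketch: combine product details and review analyses into one report object.
function generateSummaryReport(product, ratingDistribution, verifiedStats) {
  return {
    product_summary: {
      title: product.title,
      brand: product.brand,
      price: product.price,
      rating: product.rating,
      reviews_count: product.reviews_count,
    },
    review_analysis: {
      rating_distribution: ratingDistribution,
      verified_purchase_stats: verifiedStats,
    },
  };
}
```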
8. Main Function
The function brings everything together:
async function main() {
const fetcher = new DataFetcher('https://api.brightdata.com', process.env.BRIGHT_DATA_API_KEY);
const [rawReviews, rawProductDetails] = await Promise.all([
fetcher.getProductReviews('s_m60gwuw4181n8ikiob'),
fetcher.getProductDetails('s_m5y480dm2l7109x73g'),
]);
const reviews = DataProcessor.processReviews(rawReviews);
const productDetails = DataProcessor.processProductDetails(rawProductDetails);
const analyzer = new DataAnalyzer(reviews, productDetails);
const reportGenerator = new ReportGenerator(analyzer);
const report = reportGenerator.generateSummaryReport();
console.log(JSON.stringify(report, null, 2));
}
- Fetches and processes the data, runs the analysis, and generates the final report.
Step 4: Run the Program
1. Ensure the .env file contains your API key.
2. Run the program:
node index.js
3. Review the output in the console or logs for errors.
You can find the complete code for this tutorial here. Feel free to extend this project; for example, you could build a dashboard on top of it or incorporate AI for sentiment analysis.
Final Thoughts
By using RPA tools like Bright Data’s Web Scraper API, you can simplify the data extraction process, overcome the challenges of traditional web scraping, and focus on utilising the extracted data for strategic business decisions.
You can sign up for Bright Data to test the Web Scraper API for free!