
How to Bypass CAPTCHA and Anti-Scraping Techniques with an AI-Powered Proxy Solution

Explore the challenges posed to seamless data collection by anti-scraping techniques — and how to overcome them using an AI-powered proxy solution.

Image source: Flickr

From acquiring market insights to monitoring competitive activities and accessing public data for research purposes, web scraping has become an indispensable tool. It empowers both businesses and individuals by providing data-driven insights in the ever-evolving digital landscape.

Nevertheless, the increasing popularity of web data extraction has led websites to adopt more sophisticated measures to safeguard their data against bots and scrapers. This is driven by concerns related to privacy and platform integrity. While these protective measures are understandable, they can present significant challenges for legitimate web scraping projects undertaken by businesses, individuals, or researchers.

One particularly formidable measure is the deployment of CAPTCHAs (Completely Automated Public Turing test to tell Computers and Humans Apart). You have likely encountered these puzzles or challenges, which are designed to differentiate automated bots and scrapers from human users.

CAPTCHAs and other anti-scraping measures pose a formidable challenge to seamless data collection. Integrating manual logic to bypass such challenges can lead to code complexity and ongoing adjustments to accommodate evolving anti-scraping mechanisms on websites. Although third-party proxies are commonly used as solutions, they entail extra infrastructure and still lack 100% reliability.

In this article, we’ll take a look at these anti-scraping measures, explore the challenges they present, and see how to get around them using Nimble’s Web API, an AI-powered, zero-maintenance solution with a built-in premium proxy infrastructure that allows for seamless and continuous data collection.

Web Scraping Challenges: Understanding Anti-Scraping Measures

To grasp the intricacies of web scraping in the modern landscape, it’s essential to understand the mechanisms behind CAPTCHAs and other anti-scraping measures. Let’s take a look at some of the anti-scraping techniques employed by modern websites and how they work.

  • CAPTCHAs: CAPTCHAs are security measures employed by websites to differentiate between bots/scrapers and legitimate human users. Since CAPTCHAs are designed to be solved visually or interactively by humans, automated scraping processes can’t easily tackle them. As a result, the web crawler gets stuck, and your data collection effort comes to a standstill. Websites generally decide when to serve a CAPTCHA through a process known as browser fingerprinting. This involves analyzing various attributes of the user’s browser and device, such as screen resolution, installed fonts, and browser plugins, to create a unique fingerprint containing information about the configuration of a user’s browser as well as its software/hardware environment. When a scraper’s fingerprint is detected, the website may respond with a CAPTCHA to validate the user.
  • IP Address Blacklisting: Most websites keep a log of incoming requests along with their corresponding IP addresses, and they analyze the timing and frequency of these requests to identify patterns. If an IP address is making too many requests, or too regularly, it’s likely to be a bot — and the website either limits the number of requests from the IP (this is known as ‘rate-limiting’), or, in the worst-case scenario, blocks it. The IP address is added to a blacklist, which prevents further access to the website’s resources.
  • Header and User-Agent Monitoring: When a client (whether a web browser or a bot) sends a request to a website’s server, it includes a User-Agent header. Web scrapers often use default or easily identifiable User-Agent strings, making them simple for websites to detect. When a website detects a non-standard or suspicious User-Agent, it may trigger a CAPTCHA to verify the legitimacy of the request (the sketch after this list shows the kind of manual workaround this forces on scrapers).
  • Dynamic Websites: Dynamic websites pose a challenge for web scraping due to their use of client-side technologies like JavaScript. These sites often load content asynchronously, update the DOM dynamically, and make additional requests through AJAX, all of which can be missed by traditional static HTML parsers.
  • Geolocation Blocks: Websites implement geolocation restrictions as a defensive measure to prevent data extraction from specific regions or IP addresses, limiting the data you can access.

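To make the rate-limiting and User-Agent checks above concrete, here is a minimal sketch (in Node.js, using axios) of the kind of manual tuning scrapers typically resort to: sending a browser-like User-Agent header and spacing requests out to stay under rate limits. The target URLs and delay value are purely illustrative.

const axios = require('axios');

// Illustrative target URLs; replace with the pages you actually need
const urls = [
  'https://example.com/products?page=1',
  'https://example.com/products?page=2',
];

// A browser-like User-Agent so requests don't advertise themselves as a script
const headers = {
  'User-Agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36',
  'Accept-Language': 'en-US,en;q=0.9',
};

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function scrapeSequentially() {
  for (const url of urls) {
    const response = await axios.get(url, { headers });
    console.log(url, response.status);

    // Space requests out (here, 2 seconds) to avoid tripping rate limits
    await sleep(2000);
  }
}

scrapeSequentially().catch(console.error);

Even with this kind of tuning in place, none of it addresses fingerprinting, dynamic content, or CAPTCHAs directly, which is exactly where the manual approach starts to break down.
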
Generally, to get around these challenges, you’d need to use third-party libraries and proxies, but they come with their own set of limitations:

  • You’ll need to integrate them into your existing tech stack, which can be time-consuming and may require substantial changes to your codebase.
  • These solutions may not always be entirely reliable, especially in the long term when websites update their CAPTCHA-generation mechanisms. You could find yourself constantly struggling to adapt and maintain scraping efficiency.
  • They require you to invest in additional infrastructure, leading to increased costs and management complexities.

In the next section, we’ll take a look at Nimble’s Web API, which not only helps to get around these anti-scraping measures but also overcomes the limitations of traditional approaches.

Introducing Nimble’s AI-Powered Web API: A Comprehensive Solution

Nimble’s Web API is a versatile, powerful, and zero-maintenance data collection solution that can turn any website into a real-time dataset, immediately ready for use. From accessing the data to delivering it to your preferred storage, every step of the data collection process is fully managed. To initiate the request, you only need to specify the target URL and you can start collecting highly accurate web data in real time.

Nimble API - Collect Data from Any Website

Let’s look at how Nimble’s Web API addresses the challenges previously mentioned.

  • Robust Fingerprint Management: One of the major challenges in scraping the web is getting past device/browser fingerprinting, which flags and blocks your requests as coming from a bot or crawler. Nimble’s fingerprint engine automatically manages TLS fingerprints, canvas checks, headers, cookies, and more, ensuring unrestricted and uninterrupted access to public web data. By sidestepping browser fingerprinting, it addresses the issue of CAPTCHA generation at its root.
  • Premium Proxy Network: Nimble’s API comes with a premium zero-downtime IP infrastructure for continuous and effortless data gathering. It is optimized via AI to intelligently select the best premium IPs for every request, and you can even target granularly down to the state and city level for any geolocation in the world, making localized web data gathering a breeze. Using Nimble’s proxy rotation, you can choose among residential, data center, or ISP IPs to suit your use case. This lets you bypass rate limits and reCAPTCHAs, ensuring maximum success rates even for particularly sensitive pages.
  • IP Pipelines: To oversee multiple proxy use cases, Nimble offers IP pipelines, allowing you to compartmentalize usage and performance metrics, facilitating a more detailed analysis of budget allocation. With IP pipelines, you can conveniently configure proxy settings — such as session control, target geolocation, and IP rotation — only once, which are then automatically inherited by all requests made through that pipeline.
  • Highly Granular Geo-Targeting: Using Nimble’s residential proxies, you can narrow proxy selection down to the country, state, or city level for any geolocation worldwide, allowing you to bypass geolocation blocks while still maintaining maximum success rates even for particularly sensitive pages.
  • Page Interactions: One of Nimble’s most useful features, page interactions enable users to perform operations on a webpage before the web data is collected and returned, such as clicking, typing, and scrolling. These interactions are synchronous and run sequentially one by one, with an overall sequence limit of 60 seconds, and they are supported in real-time, asynchronous, and batch requests. This feature is particularly helpful for pages using lazy loading and infinite scrolling, where the data you need is not (at least initially) in the HTML itself, but it really shines for websites that require user action (form inputs, button clicks, hover events) to dynamically display data. A sketch of what such a request might look like follows this list.
  • Fully Managed, Zero Maintenance, Highly Scalable: From your end, Nimble’s API is a fully managed REST API in the cloud. Make simple API calls, then get back unlimited, structured data with zero engineering or infrastructural complexity involved. This eliminates the need to set up and maintain expensive scraping infrastructure in-house as well as the need to constantly update your code to keep up with the evolution of websites’ anti-scraping mechanisms. This is all taken care of on Nimble’s end.

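To give a feel for how these options come together in practice, here is a rough sketch of a single request body that combines geo-targeting with page interactions. Note that the country, locale, and interactions fields (and the shape of the click/scroll/wait steps) are illustrative assumptions rather than confirmed parameter names; only url, render, and parse are taken from the working example later in this article, so consult Nimble’s documentation for the exact request schema.

// NOTE: 'country', 'locale', and the shape of 'interactions' below are
// illustrative assumptions, not confirmed parameter names. Check Nimble's
// documentation for the current request schema.
const data = {
  url: 'https://www.example.com/search?q=running+shoes', // hypothetical target page
  render: 'true',  // use a real browser (as in the example later in this article)
  parse: 'true',   // have the AI parser return structured JSON
  country: 'US',   // assumed geo-targeting field: route requests through US IPs
  locale: 'en-US', // assumed locale hint
  interactions: [  // assumed page-interaction steps, run in order before capture
    { type: 'scroll', direction: 'down' },
    { type: 'click', selector: '#load-more' },
    { type: 'wait', delay: 1000 },
  ],
};

A body like this would be sent in place of the simpler data object in the axios.post call shown in the next section.
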
Nimble’s Web API comes with various pricing plans depending on your use case, but most importantly, they offer a 22GB free trial of their entire platform, including their premium IP infrastructure.

In the next section, we’ll see Nimble’s Web API in action by scraping a page on Amazon. With popular e-commerce websites like Amazon, there is a high likelihood of encountering CAPTCHAs when scraping large volumes of data. We’ll see just how successful Nimble’s Web API is in bypassing it.

💡 To use Nimble’s API, you’ll have to sign up for a free trial from here (click on ‘Start Free Trial’ under the ‘Free Trial’ plan) and get 100 credits to start your web scraping journey. Once done, you can log in with your account details.

Scraping Amazon Using Nimble

Let’s say you wanted to scrape Amazon search results for ‘holiday gift’.

Normally, even if you used something like Puppeteer to automate this, you’d run into a CAPTCHA or JavaScript-based challenge to root out bots. Even if you got lucky, you’d still be left with a bunch of HTML that you’d have to parse manually to extract the data that you want.

Can we do better? You bet.

Let’s use Nimble’s Realtime API to scrape the exact same page. We’ll pass the render parameter so it drives a real browser behind the scenes, which means we don’t miss any JavaScript-rendered content and can get past JavaScript-based challenges and CAPTCHAs, and the parse parameter so its AI turns the raw extracted data into structured JSON automatically, with no extra code required on our part.

Nimble allows our code here to be much simpler and more straightforward: we’ll use the battle-tested axios library to make a POST request to the API (with the Amazon URL containing our search string) and then write the JSON response to a file. That’s it.

const axios = require('axios');
const fs = require('fs');

// Searching for holiday gifts on Amazon
const requestUrl = 'https://www.amazon.com/s?k=holiday+gift';

// Step 1: Put your Nimble credentials here
const username = "YOUR_USERNAME";
const password = "YOUR_PASSWORD";

// Step 2: Format credentials into a string, and encode it in base64
const credentials = `${username}:${password}`;
const base64Credentials = Buffer.from(credentials).toString('base64');

// Step 3: Configure Nimble's Realtime API
const url = 'https://api.webit.live/api/v1/realtime/web';
const headers = {
  'Authorization': `Basic ${base64Credentials}`,
  'Content-Type': 'application/json'
};
const data = {
  "url": requestUrl,
  "parse": "true", // parse raw response into JSON
  "render": "true" // use a real browser
};

// Step 4: Make the request
axios.post(url, data, { headers })
  .then(response => {
    const responseObj = {
      status: response.status,
      data: response.data,
    };

    const jsonResponse = JSON.stringify(responseObj);

    // Step 5: Save the JSON data to a file
    fs.writeFileSync('data.json', jsonResponse);

  })
  .catch(error => {
    console.error(error);
  });

The only other thing to note here is that Nimble’s API requires your credentials to be sent over as an Authorization header, Base64 encoded.

Your JSON response is going to look like this:

{
  "status": 200,
  "data": {
    "url": "https://www.amazon.com/s?k=christmas+gifts",
    "status": "success",
    "query_time": "2023-11-30T07:51:49.848Z",
    "html_content": "
           ...
       ",
    "status_code": 200,
    "headers": {
      ...
    },
    "parsing": {
      "entities": {
        "SearchResult": [
          {
            "asin": "B0CC2PB9WC",
            "entity_type": "SearchResult",
            "imageUrl": "https://m.media-amazon.com/images/I/91vzETSVSCL._AC_UL320_.jpg",
            "price": "$35.99",
            "prime": false,
            "productName": "Winter Hat Scarf Gloves and Ear Warmer, Warm Knit Beanie Hat Touch Screen Gloves Set Winter Gifts Neck Scarves for Women",
            "productUrl": "/Winter-Gloves-Warmer-Beanie-Scarves/dp/B0CC2PB9WC/ref=sr_1_51?keywords=christmas+gifts&qid=1701330717&sr=8-51",
            "rating": "4.4 out of 5 stars",
            "reviewsCount": 152,
            "sponsored": false
          },
          ...
        ]
      }
    }
  }
  ...
}

Notably, it’ll always contain the raw HTML response (html_content), headers (headers), and the AI-parsed search results — Amazon products — as an array of objects of a classified entity type (here, that’ll be SearchResult), much like you’d expect from a native store API response.
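
As a quick follow-up, here is a small sketch of how you might consume that file. The field names follow the sample response above (data.parsing.entities.SearchResult), so adjust the path if your response is shaped differently.

const fs = require('fs');

// Load the file written by the scraping script above
const saved = JSON.parse(fs.readFileSync('data.json', 'utf8'));

// Drill into the AI-parsed entities, as shown in the sample response
const results = saved.data.parsing.entities.SearchResult || [];

// Print a compact summary of each product
for (const product of results) {
  console.log(`${product.productName} | ${product.price} | ${product.rating}`);
}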

Conclusion

CAPTCHAs can pose considerable challenges to web scraping projects, especially large-scale ones. Nimble’s Web API offers an all-in-one zero-maintenance solution, bypassing CAPTCHAs effortlessly.

From emulating device/browser fingerprints to header and cookie management, Nimble’s Web API ensures uninterrupted scraping at scale, eliminating the need for manual CAPTCHA-bypassing logic, third-party libraries, or additional infrastructure on your part.

Whether you’re an individual or run a business, Nimble lets you access web data without interruption and extract crucial insights while respecting data privacy and integrity. Since it comes with a free trial, give it a shot and see the difference it can make in your web scraping experience.



