Common Anti-Scraping Measures on Websites and How to Bypass Them

Navigating proxy blacklists, Cloudflare, HUMAN/PerimeterX, and Datadome for your ethical web scraping projects.

Website data extraction is a critical tool for businesses seeking to remain competitive in an ever-changing digital landscape. As the internet evolves, so do the methods website owners employ to protect their valuable data and resources from automated bots and scrapers. These digital gatekeepers have become commonplace on websites that want to safeguard sensitive information and maintain the integrity of their online platforms.

However, researchers, developers, and others with legitimate purposes still need a way to access the data these websites put behind those defenses.

In this article, we'll delve into the realm of anti-bot (anti-scraping) measures commonly deployed by websites and explore the techniques used to bypass them...with a focus on how Bright Data's Scraping Browser - a headful, full-GUI browser that is fully compatible with the Puppeteer/Playwright APIs and comes with block-bypassing technology out of the box - can make it easier.

Without further ado, let's dive right in.

1. IP Address Blacklisting

The Problem

Most websites keep a log of incoming requests along with their corresponding IP addresses, and they analyze the timing and frequency of these requests to identify patterns. If an IP address is making too many requests, or too regularly, it's likely to be a bot - and the website either limits the number of requests from the IP (this is known as 'rate-limiting'), or, in the worst-case scenario, blocks it. The IP address is added to a blacklist, which prevents further access to the website's resources.

The Solution

The best way to work around IP blacklisting is to use proxies and rotate them, but many sites are one step ahead of you and maintain proxy blacklists. Using a premium/residential proxy provider is the way to go, but you'll still have to manage the rotation yourself, which quickly gets tedious.
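
For a sense of what that DIY approach involves, here's a rough sketch of manual rotation with Puppeteer - the proxy endpoints below are placeholders, and authenticated proxies would additionally need page.authenticate():

const puppeteer = require("puppeteer");

const proxies = [
  "http://proxy1.example.com:8000",
  "http://proxy2.example.com:8000",
];

(async () => {
  for (const proxy of proxies) {
    // Each launch routes traffic through a different proxy via Chromium's --proxy-server flag
    const browser = await puppeteer.launch({ args: [`--proxy-server=${proxy}`] });
    const page = await browser.newPage();
    await page.goto("https://example.com");
    // ...scrape, then close and move on to the next proxy
    await browser.close();
  }
})();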

Luckily, you don't have to do any of that yourself.

Bright Data's Scraping Browser automatically rotates your IP address through a pool of over 72 million real IP addresses, adding realistic intervals between requests from any single IP so it doesn't trigger anti-bot flagging and making it difficult for websites to identify a consistent pattern. Additionally, you can choose from four different types of proxy services depending on your use case: residential proxies, data center proxies, ISP proxies, and mobile proxies. This enables complex sequences - like bypassing reCAPTCHA on sensitive pages using the 'waterfall solution', which sends requests with data center IPs first, falls back to residential IPs when crawling sensitive pages, and finally switches to mobile IPs if the previous two fail.

You'll never have to spend time worrying about your IP getting blacklisted, managing the IP rotations yourself, or throttling requests.

2. Header and User-Agent Monitoring

The Problem

When a client (whether a web browser or a bot) sends a request to a website's server, it includes a User-Agent header. This header provides information about the requesting client's software, device, and operating system. Websites then establish a baseline profile of the requesting client by monitoring requests from legitimate users, and analyzing the User-Agent headers of these requests to identify patterns associated with popular web browsers or mobile devices.

A regular client's User-Agent string might look something like this:

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36

Many bots and scrapers use custom or uncommon User-Agent strings that deviate from this typical pattern, so websites employ rules and algorithms to detect these anomalies by comparing incoming User-Agent headers against their established baselines.

So when a request's User-Agent header looks suspicious, like this:

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/114.0.5735.133 Safari/537.36

That's indicative of an automation client (like Puppeteer or Playwright) being used, and the website immediately blocks access to its resources (or redirects the request to a JavaScript-based challenge or CAPTCHA).

The Solution

One way to bypass this is by manually modifying the User-Agent header in your Playwright/Puppeteer code to mimic a commonly used web browser or device, making requests appear more like those of legitimate users.
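
In Puppeteer, for instance, that's a single call on the page object - a quick sketch (Playwright's equivalent is passing userAgent to browser.newContext()):

const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // Report a regular desktop Chrome instead of HeadlessChrome
  await page.setUserAgent(
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"
  );
  await page.goto("https://example.com");
  await browser.close();
})();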

However, this is a band-aid fix at best. It's impossible to know exactly which parts of a User-Agent string a specific website is checking. Also, the site may employ additional techniques to verify the authenticity of the User-Agent, making simple spoofing ineffective.

The Scraping Browser uses Bright Data's built-in unlocker infrastructure to bypass this problem for you automatically, no bespoke code required. The unlocker infrastructure handles User-Agent generation - it analyzes each target site's requirements on the fly and generates an appropriate User-Agent string for it, along with support for HTTP/2 features such as header field compression and server push. You also have the flexibility to tailor the generated User-Agents to the specific requirements of your target site, effectively creating User-Agent strings that closely resemble legitimate user traffic.

Learn more about the Scraping Browser's capabilities: Scraping Browser - Automated Browser for Scraping

3. Headless Browser Detection

The Problem

Headless browser detection goes beyond simple User-Agent string monitoring. A headless browser, after all, is a web browser that operates in the background without a graphical user interface (GUI), performing automated tasks on web pages via browser APIs. That makes it faster, lighter on system resources, and easier to run in parallel...but also much easier to detect: websites can probe for installed fonts, browser plugins, screen resolution, and WebGL capabilities - signals that headless environments typically lack or report inconsistently.
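
To get a sense of the signals involved, here's roughly the kind of client-side check a detection script might run - purely illustrative; real services combine many more signals:

const signals = {
  // Automation frameworks typically set navigator.webdriver to true
  webdriver: navigator.webdriver === true,
  // Headless Chrome has historically exposed no browser plugins
  noPlugins: navigator.plugins.length === 0,
  // A software renderer (e.g. SwiftShader) in the WebGL renderer string is another giveaway
  webglRenderer: (() => {
    const gl = document.createElement("canvas").getContext("webgl");
    const ext = gl && gl.getExtension("WEBGL_debug_renderer_info");
    return ext ? gl.getParameter(ext.UNMASKED_RENDERER_WEBGL) : "unavailable";
  })(),
};
console.log(signals);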

Websites also employ a number of behavioral heuristics - watching for predefined mouse movements, rapid scripted form submissions, or uniform browsing patterns. Headless browsers often exhibit predictable, programmatic behavior in contrast to the more erratic, varied, and dynamic nature of human interaction, so these clients are easier to detect and block.

The Solution

Successfully bypassing detection requires accurately emulating a lot of these behaviors, which can be complex and resource-intensive. Website administrators continuously update their detection techniques, making it difficult to know exactly which behaviors to emulate at any given time.

You could just use a headful browser with a full GUI - Puppeteer or Playwright configured to drive a real, visible window - but then you give up the performance benefits. And even that might not be enough: the website might be using an advanced, machine-learning-based bot detection service like HUMAN/PerimeterX that trips you up anyway.
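
For reference, forcing a visible window in Puppeteer is a one-line change (Playwright's launch() accepts the same headless option):

const puppeteer = require("puppeteer");

(async () => {
  // Launch a real, visible Chrome window - slower, but harder to fingerprint as headless
  const browser = await puppeteer.launch({ headless: false });
  // ...drive the page as usual, then:
  await browser.close();
})();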

Bright Data's Scraping Browser, however, is the best of both worlds. It's a full-featured, GUI-enabled headful browser, with an added twist: it runs on Bright Data's own servers. You merely connect to it remotely over a WebSocket URL using the Puppeteer/Playwright API, like so:

const puppeteer = require("puppeteer-core"); // the full 'puppeteer' package works too

const auth = "<your username>:<your password>";

// Inside an async function: connect to the remote Scraping Browser over WebSocket
const browser = await puppeteer.connect({
  browserWSEndpoint: `wss://${auth}@zproxy.lum-superproxy.io:9222`,
});
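
From there, everything else is ordinary Puppeteer code - for example (a minimal sketch; the URL is just a placeholder):

const page = await browser.newPage();
await page.goto("https://example.com");
console.log(await page.title()); // rendering, proxying, and unblocking all happen on the remote browser
await browser.close();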

This gives you all the speed and parallelization of headless browsers, along with the benefits of headful browsers - namely, being much harder to detect - without ever having to worry about performance or scaling.

4. Honeypots

The Problem

A honeypot is a trap a website sets up to catch crawlers: something that looks like a legitimate target for automated activity. Websites include hidden links or forms within their pages that are invisible to human users but accessible to bots and scrapers. These links and forms lead to honeypot pages or dummy resources that contain no valuable information - but triggering them gets your scraper banned from the website.

The Solution

Your scraper could try to detect honeypots itself - programmatically checking each element for properties that hide it from view, inspecting page structures, or looking for suspicious patterns in URLs or form fields. Not only is this tedious, it's also brittle: any change to the website's code could render your logic obsolete overnight. That's not viable if scraping is a business-critical need for you.
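
As an illustration, a naive filter that only follows links a human could actually see might look something like this (a fragile heuristic, for the reasons above):

// Collect only anchors that are actually visible to a human user
const visibleLinks = await page.$$eval("a", (anchors) =>
  anchors
    .filter((a) => {
      const style = window.getComputedStyle(a);
      const rect = a.getBoundingClientRect();
      return (
        style.display !== "none" &&
        style.visibility !== "hidden" &&
        style.opacity !== "0" &&
        rect.width > 0 &&
        rect.height > 0
      );
    })
    .map((a) => a.href)
);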

Bright Data's Scraping Browser, on the other hand, sidesteps these traps preemptively. Because it is a real, full-GUI browser using realistic fingerprint configurations, it avoids being flagged as a crawler in the first place - which is often what causes honeypot elements to be served into the DOM at all. Moreover, the Scraping Browser is updated constantly to keep up with common honeypot techniques and avoid them.

5. CAPTCHAs

The Problem

"Are you a robot?"

We've all met that annoying CAPTCHA prompt at least once - it's the most widely used method of blocking bots and scrapers. When a user attempts to access a protected resource or perform a specific action, such as submitting a form or logging in, the website presents a CAPTCHA challenge (usually a visual puzzle) as part of the verification process.

Once the user submits a response, the website compares it against the expected answers or patterns to verify that it was likely provided by a human, then blocks any invalid or suspicious request that indicates potential bot or scraper activity.

The Solution

To bypass a CAPTCHA automatically, you need to emulate convincing device/browser fingerprints. Using a headful browser (Puppeteer/Playwright with GUI rendering enabled, for example) might help, but for some websites that may not be enough - each site uses different techniques, and some put even real users through CAPTCHA tests at random, sometimes more than once!

Third-party CAPTCHA-solving libraries exist, but they still have to be integrated into your existing tech stack, and they either aren't reliable to begin with or stop being reliable as websites adopt new techniques to block scrapers.

Bright Data's Scraping Browser offers some of the most advanced device/browser fingerprint emulation on the market, letting it avoid most CAPTCHAs outright. When one is unavoidable, it automatically detects the strategy the target website is using and solves the challenge for you - no third-party libraries required. The previously mentioned unlocker infrastructure can solve reCAPTCHA, hCaptcha, px_captcha, SimpleCaptcha, and GeeTest CAPTCHA, and it stays updated to keep pace with websites that develop new ways to detect you.

👉 Learn more about how Bright Data's unlocker infrastructure enables browser/device fingerprint emulation and helps you bypass CAPTCHAs

Conclusion

As web technologies evolve, websites are implementing increasingly sophisticated anti-bot solutions to protect their resources. However, with the right knowledge and techniques, it is possible to overcome these barriers in a responsible and ethical manner.

Bright Data's Scraping Browser, for example, offers an advanced automation solution specifically designed to tackle anti-bot measures, and it can even bypass sophisticated and powerful detection systems like PerimeterX, Datadome, and others.

The best part is that the Scraping Browser comes with a free trial. To sign up, go to the Scraping Browser page on Bright Data's site, click the 'Start free trial' button, and enter your details (you can use your regular email account too). For a more hands-on guide to the Scraping Browser, check out this article, which walks through the process step by step.

With the Scraping Browser, you never need bespoke code or third-party libraries to bypass any of these blocks. Simply connect to the remote Scraping Browser and save time, effort, and resources - while staying within the boundaries of responsible web scraping and automation.



