When scraping large amounts of data, Scrapy stands out as a smart choice thanks to its impressive speed. However, it is more complex to set up than other web scraping libraries such as Beautiful Soup. For instance, if you want to use a headless browser, you'll need to install additional dependencies and configure settings parameters.
Scrapy is an open-source and collaborative framework written in Python. It became one of the fastest solutions for web scraping, and thanks to its extensibility it has been incorporating other tools such as Playwright. It includes built-in support for common web scraping tasks such as handling cookies, user agents, and pagination, and it uses CSS and XPath selectors to parse HTML content.
Playwright is a browser automation framework built by Microsoft, initially designed for JavaScript applications. It has since been ported to other languages, such as Python and Java.
1 - Run Playwright on the Scrapy framework
Before using Playwright, we first need to set up Scrapy, starting by installing the library.
pip install scrapy
For those who are not familiar with Scrapy, the setup is slightly different from other web scraping libraries. The framework is organized within a project folder, which can be created with the following command:
scrapy startproject <my_project>
After running the above command, we’ll get a structured project similar to this one:
├── playwright_scrapy_project
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── my_spider.py
└── scrapy.cfg
Now we’re ready to install Playwright and make the necessary adjustments to the project.
pip install scrapy-playwright
playwright install
The scrapy-playwright dependency is a Scrapy plugin that performs requests using Playwright for Python; you can read more about it here.
To enable Playwright in the Scrapy project, we need to update settings.py accordingly.
# Activate scrapy-playwright
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

PLAYWRIGHT_BROWSER_TYPE = "chromium"

PLAYWRIGHT_LAUNCH_OPTIONS = {
    "headless": False,
}

TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
Note that PLAYWRIGHT_BROWSER_TYPE and PLAYWRIGHT_LAUNCH_OPTIONS are optional, but DOWNLOAD_HANDLERS and TWISTED_REACTOR must be set so that Scrapy can route requests through Playwright.
2 - Make a web scraping script
Scrapy is organized in spiders, which are web scraping classes aimed at different goals. In one Scrapy project you can, for instance, have spiders for different websites, or for several ways of extracting data from the same website. These spiders need to be configured to use Playwright as the headless browser. You can learn more about scraping with Scrapy here.
There are several ways of configuring a spider depending on your needs; in this case, we'll take a more general approach. We can start by creating the spider class and defining the imports, as sketched below.
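The post jumps straight to the parse method, so here is a minimal sketch of what the imports and class definition could look like. The spider name (myspider), the placeholder URL, and the exact start_requests structure are assumptions for illustration, not the author's original code; the important part is the meta dictionary passed to scrapy.Request(), which tells scrapy-playwright to handle the request.
import time

import scrapy


class MySpider(scrapy.Spider):
    # The spider name used later with "scrapy crawl <name>"
    name = "myspider"

    def start_requests(self):
        # "<target_url>" is a placeholder for the website you want to scrape
        yield scrapy.Request(
            url="<target_url>",
            meta={
                # Route the request through Playwright
                "playwright": True,
                # Expose the Playwright page object in response.meta["playwright_page"]
                "playwright_include_page": True,
            },
            callback=self.parse,
        )
The "playwright": True key makes scrapy-playwright download the page with a real browser, while "playwright_include_page" makes the page instance available inside the callback. The parse method below continues this class.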
async def parse(self, response):
    max_retries = 10
    retries = 0
    while retries < max_retries:
        page = response.meta["playwright_page"]

        # Scroll to load more items
        for i in range(5):
            await page.mouse.wheel(0, 150000)
            await page.locator("text='View More'").first.click()

        # Get the content of the current page
        page_content = await page.content()
        items = scrapy.Selector(text=page_content).css('div.<name_of_the_class>')

        # Iterate over the items
        for item in items:
            try:
                title = item.css('span.<title_class> h4::text').get()
                date = item.css('span.<date_class> span::text').get()
                description = item.css('div.<description_class>::text').get()
                link = item.css('a.<link_class>::attr(href)').get()
                await page.goto(link)
                # More content can be scraped inside the link page
                link_content = await page.content()
                user = scrapy.Selector(text=link_content).css('div.<user_class>::text').get()
                retries = max_retries
            # If navigation or extraction fails, retry the whole page
            except Exception as error:
                print(error)
                retries += 1
                time.sleep(10)
                break

            yield {
                "title": title,
                "date": date,
                "description": description,
                "user": user,
            }
First of all, it is important to understand the difference between synchronous and asynchronous scraping. The latter, which is the one we are using, allows multiple requests to be made concurrently without waiting for each one to complete, so the application does not stall while pages load. In synchronous scraping, requests are handled one at a time, which at first glance gives more control, but it is not as efficient as the asynchronous approach.
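As a rough illustration of the difference, outside the Scrapy project and using hypothetical URLs, fetching three pages one at a time versus concurrently with asyncio and Playwright could look like this:
import asyncio

from playwright.async_api import async_playwright

# Hypothetical URLs used only for illustration
URLS = ["https://example.com/1", "https://example.com/2", "https://example.com/3"]


async def fetch(browser, url):
    # Open a page, load the URL, and return the rendered HTML
    page = await browser.new_page()
    await page.goto(url)
    content = await page.content()
    await page.close()
    return content


async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        # Synchronous style: one request at a time
        # for url in URLS:
        #     await fetch(browser, url)
        # Asynchronous style: all requests run concurrently
        await asyncio.gather(*(fetch(browser, url) for url in URLS))
        await browser.close()


asyncio.run(main())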
The script starts with a while loop, which mainly acts as a retry mechanism in case the website blocks a request.
Then the page instance is retrieved with response.meta["playwright_page"]. Just below, a for loop scrolls the website with the page.mouse.wheel() function. This scrolling deals with the infinite scroll issue: after each scroll the page loads more content, meaning more data to be scraped. If you want to know more about infinite scroll, check out my other article.
https://python.plainenglish.io/3-ways-to-scrape-infinite-scroll-websites-with-python-66b38ed66016
After the loop, we use Scrapy syntax to scrape information from the current page. First, we parse the page content with scrapy.Selector(text=page_content).css(), then we iterate over the elements in the items list and extract more fields from each one. Note that the class names and the values title, description, date, etc. are merely placeholders.
Take, for instance, an e-commerce website. We can move to a product's URL by extracting its link. To navigate the page to the new URL, we use the page.goto(link) function and continue scraping from there. That's exactly what the script does.
Finally, at the end of the script, we yield all the extracted elements. To run the crawler, all you need to do is to use the following command in the terminal:
scrapy crawl myspider
Make sure you use the name you've specified in the spider class's name attribute. If you want to save the data to a CSV file, do the following:
scrapy crawl myspider -o data.csv
Conclusion
The integration of Playwright with the Scrapy framework can be difficult to set up at first, mainly because there are several instructions that can be added to the meta dictionary inside the scrapy.Request() class, and mastering them is the key to building a robust web crawler. In this article, we looked at the "playwright_include_page" key, a very general one that lets us use Playwright's page instance directly without adding complexity to the code. However, it is worth taking a look at the other parameters and seeing which ones can be used to build the perfect web scraper for the target website.
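As one example of those other parameters (a minimal sketch, with a hypothetical spider name, URL, and placeholder selectors), scrapy-playwright also supports the "playwright_page_methods" meta key, which runs page actions such as waiting for a selector or scrolling before the response is returned, without keeping the page instance around:
import scrapy
from scrapy_playwright.page import PageMethod


class ProductsSpider(scrapy.Spider):
    # Hypothetical spider used only to illustrate the meta key
    name = "products"

    def start_requests(self):
        yield scrapy.Request(
            url="<target_url>",
            meta={
                "playwright": True,
                # Actions executed on the Playwright page before the response is returned
                "playwright_page_methods": [
                    PageMethod("wait_for_selector", "div.<name_of_the_class>"),
                    PageMethod("evaluate", "window.scrollBy(0, document.body.scrollHeight)"),
                ],
            },
        )

    def parse(self, response):
        # The response already contains the rendered HTML
        for item in response.css("div.<name_of_the_class>"):
            yield {"title": item.css("h4::text").get()}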