Use Web Scraping to Download All PDFs With Python

A guide on using web scraping to download all PDFs with Python.


Downloading hundreds of PDF files manually was…tiresome.

One fine day, a question popped up in my mind: why am I downloading all these files manually? That’s when I started searching for an automatic tool.

This sounded like a fun automation task, and since I was eager to get my hands dirty with web scraping, I decided to give it a try. The idea was to input a link, scrape its source code for all possible PDF files, and then download them. Let’s break down the steps.

Check Validity

Using a simple try-except block, I check whether the entered URL is valid. If it can be opened with urlopen, it is valid; otherwise, the link is invalid and the program terminates.

def check_validity(my_url):
    try:
        urlopen(my_url)
        print("Valid URL")
    except IOError:
        print ("Invalid URL")
        sys.exit()

Read HTML

In Python, the HTML of a web page can be read like this:

html = urlopen(my_url).read()

However, when I tried to print it to my console, it wasn’t a pleasant sight. To get properly formatted, human-readable HTML source code, I parsed it with BeautifulSoup, a Python package for parsing HTML and XML documents:

html_page = bs(html, features="lxml")

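For reference, here is a minimal sketch of that step with the imports spelled out; prettify() is the BeautifulSoup method that re-indents the parse tree into readable markup (the URL below is just a placeholder):

from urllib.request import urlopen
from bs4 import BeautifulSoup as bs

my_url = "https://example.com"         # placeholder URL for illustration
html = urlopen(my_url).read()          # raw bytes of the page source
html_page = bs(html, features="lxml")  # parse the HTML with the lxml parser

# prettify() returns the parse tree as a nicely indented string
print(html_page.prettify())
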
Now, I had two main websites from which I occasionally downloaded PDF files. Upon inspecting the HTML of both, I realized that the content of their meta tags was slightly different. For example, one of the websites had this:

<meta content="Chemistry" property="og:title"/>

while the other website had no og:title and had this instead:

<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>

To get usable metadata, I added this:

og_url = html_page.find("meta", property="og:url")

and got something like this as a result:

<meta content="https://cnds.jacobs-university.de/courses/cs-2019/" property="og:url"/>

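Since find() simply returns None when a page has no og:url meta tag, the lookup can be guarded like this (a small sketch of the check used later on):

# html_page is the BeautifulSoup object from the previous step
og_url = html_page.find("meta", property="og:url")
if og_url:
    # the tag exists; its content attribute holds the page's canonical URL
    print(og_url["content"])
else:
    # no og:url meta tag; fall back to the parsed input URL instead
    print("No og:url found")
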
Parse Input URL

Next, it was time to parse and evaluate the input URL.

base = urlparse(my_url)

The results looked like this:

ParseResult(scheme='https', netloc='cnds.jacobs-university.de', path='/courses/os-2019/', params='', query='', fragment='')

Now, I knew the scheme, netloc (main website address), and the path of the web page.

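Those two pieces are enough to rebuild the site root, which becomes useful later for pages that have no og:url tag; a quick sketch:

from urllib.parse import urlparse

base = urlparse("https://cnds.jacobs-university.de/courses/os-2019/")
site_root = base.scheme + "://" + base.netloc
print(site_root)  # https://cnds.jacobs-university.de
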
Now that I had the HTML source code, I needed to find the exact links to all the PDF files present on that web page. If you know HTML, you’ll know that the <a> tag is used for links.

First, I obtained the links using the href attribute. Next, I checked whether the link ended with a .pdf extension. If the link led to a PDF file, I further checked whether og_url was present.

CNDS Links

If og_url was present, it meant that the link was from a CNDS web page and not from Grader.

Now, the current links looked like p1.pdf, p2.pdf, etc. So, to get a full-fledged link for each PDF file, I extracted the main URL from the content attribute and appended my current link to it. For example, og_url["content"] looked like this:

https://cnds.jacobs-university.de/courses/os-2019/

while the current link was p5.pdf. When appended together, I got the exact link for a PDF file:

https://cnds.jacobs-university.de/courses/os-2019/p5.pdf

for link in html_page.find_all('a'):
    current_link = link.get('href')
    # skip anchors without an href and links that are not PDFs
    if current_link and current_link.endswith('pdf'):
        if og_url:
            print("currentLink", current_link)
            links.append(og_url["content"] + current_link)
        else:
            links.append(base.scheme + "://" + base.netloc + current_link)

Other Links

While trying to download PDFs from another website, I realised that the source codes were different. Hence, the links had to be dealt with differently. Since I had already parsed the URL, I knew its scheme and netloc. Upon appending the current link to it, I could easily get the exact link for my PDF file.
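
For instance, with a hypothetical href of /static/assignment1.pdf, the fallback branch would assemble the download link roughly like this:

# base comes from urlparse(my_url); the path below is invented for illustration
current_link = "/static/assignment1.pdf"
pdf_link = base.scheme + "://" + base.netloc + current_link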

And there it was! My very own note-downloading web scraping tool. Why waste hours downloading files manually when you can copy-paste a link and let Python do its magic?

Here’s what my overall code looked like:

from urllib.request import urlopen   # open and read web pages
from urllib.parse import urlparse    # split a URL into scheme, netloc, path, ...
from bs4 import BeautifulSoup as bs  # parse the HTML source
import wget                          # download each PDF file
import sys                           # exit cleanly on an invalid URL


def check_validity(my_url):
    # the URL is considered valid if urlopen can open it
    try:
        urlopen(my_url)
        print("Valid URL")
    except IOError:
        print("Invalid URL")
        sys.exit()


def get_pdfs(my_url):
    links = []
    html = urlopen(my_url).read()
    html_page = bs(html, features="lxml")
    og_url = html_page.find("meta", property="og:url")
    base = urlparse(my_url)
    print("base", base)
    for link in html_page.find_all('a'):
        current_link = link.get('href')
        # skip anchors without an href and links that are not PDFs
        if current_link and current_link.endswith('pdf'):
            if og_url:
                print("currentLink", current_link)
                links.append(og_url["content"] + current_link)
            else:
                links.append(base.scheme + "://" + base.netloc + current_link)

    for link in links:
        try:
            wget.download(link)
        except Exception:
            print("\nUnable to download", link, "\n")

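To run it end to end, a minimal driver could look like this, assuming the URL is passed on the command line (the script name is hypothetical):

import sys

if __name__ == "__main__":
    # usage: python pdf_downloader.py <url>
    if len(sys.argv) != 2:
        print("Usage: python pdf_downloader.py <url>")
        sys.exit()
    my_url = sys.argv[1]
    check_validity(my_url)
    get_pdfs(my_url)
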
Want to try it? Feel free to fork, clone, and star it on my GitHub. Have ideas to improve it? Create a pull request!

GitHub link: https://github.com/nhammad/PDFDownloader
