Search for Text in a PDF with Python

Nobody likes working with PDFs, but we have to


Working with PDF files is definitely not the easiest thing to do. Many people have trouble editing them, using signing software, waiting through slow load times, dealing with oversized files, and the list goes on. From a developer standpoint, creating PDFs can be complicated, while reading them is not an exact science and can produce unexpected results. Luckily for Python programmers, there is a package called PyPDF2 that can help reduce the stress of working with PDFs.

What is PyPDF2?

In short, PyPDF2 is used for reading PDFs, retrieving metadata, and splitting, merging, cropping, and transforming pages. Needless to say, it can do quite a bit. If you poke around the project’s GitHub repo, you may notice that the package hasn’t been updated in quite a while. The reason is that the creators decided to try a new business model and have begun working on PyPDF4. Don’t fret: at the time of this writing, the creators have said that the new package will be free to use. Since PyPDF4 is still relatively new and could be buggy, I will be using PyPDF2.

Installing and Setup

PyPDF2 makes interacting with PDFs a lot easier. To get started using it with Python, we first need to install it using pip.

pip3 install PyPDF2

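With it installed, we can import the package and create a reader object. Here, file is a PDF that has been opened in binary mode, for example with open(path, "rb"). The class and method names in this article come from PyPDF2 1.x; later releases renamed them (for example PdfReader, reader.pages, and extract_text).

import PyPDF2
reader = PyPDF2.PdfFileReader(file)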
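The reader needs an open file to work with. Here is a minimal sketch of that setup, assuming PyPDF2 1.x is installed. The helper name page_count and its pdf_path argument are illustrative and not part of the library:

```python
def page_count(pdf_path):
    """Return the number of pages in the PDF at pdf_path."""
    # Imported inside the function so this file still loads without PyPDF2.
    import PyPDF2

    # PDFs are binary files, so open with "rb"; the with-block closes the file.
    with open(pdf_path, "rb") as file:
        reader = PyPDF2.PdfFileReader(file)
        return reader.numPages
```

Calling page_count with the path to a PDF on your machine returns its page count. Keeping the reader inside the with-block matters because the reader pulls data from the file as it goes.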

Reading Files

Single Page

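Since a PDF stores each page as a set of instructions for drawing text and graphics at fixed positions, rather than as flowing text, reading the content of a file can be a bit tricky. Fortunately for us, PyPDF2 has a few methods to help make this easier. To read a single page in a file, we will use the getPage method and assign the result to a variable. Page numbers are zero-based, so the first page is page 0.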

page = reader.getPage(PAGE_NUMBER)

After that, using the extractText method will get us all the text on the page we just requested.

page_content = page.extractText()

At this point, if you just want to see the text, all you have to do is print it out.

print(page_content)

But if you want to know whether the page contains a certain string of text, an if statement with the in operator can help with that. Note that this match is case-sensitive.

if search_term in page_content:
    print("Found it!")
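Extracted text often contains line breaks and extra spaces in unexpected places, so an exact substring check can miss text that is visibly on the page. One possible workaround, sketched below, is to normalize case and whitespace on both sides before comparing. The helper name contains_term is hypothetical and plain Python; it does not come from PyPDF2:

```python
def contains_term(page_content, search_term):
    """Case-insensitive search that ignores differences in whitespace."""
    # split() with no arguments breaks on any run of whitespace, including
    # newlines, so joining with single spaces flattens the text.
    normalized_page = " ".join(page_content.split()).lower()
    normalized_term = " ".join(search_term.split()).lower()
    if not normalized_term:
        # An empty term would otherwise match every page.
        return False
    return normalized_term in normalized_page


if contains_term("Annual\nReport  2020", "annual report"):
    print("Found it!")
```

This trades precision for recall: it will also match text that spans a line break in the original layout, which is usually what a reader searching a document expects.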

Whole File

Now if you’re like me and you have a PDF file but no reader software to open it, this next bit will come in handy. Using much of what we have already learned, we will use a for loop to iterate through the pages of the file.

for page_number in range(0, reader.numPages):

Next, we will use some familiar methods to get the content of the pages.

page = reader.getPage(page_number)  
page_content = page.extractText()

Again, we will use an if statement to determine whether the text we are looking for exists on a given page. This time, we will create a list called result_list outside of the for loop, and within the if block we will append a dictionary that contains the page number and the page content. Here is the full code:

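	import PyPDF2

	# file: a PDF opened in binary mode; search_term: the text to look for
	result_list = []
	reader = PyPDF2.PdfFileReader(file)
	for page_number in range(0, reader.numPages):
	    page = reader.getPage(page_number)
	    page_content = page.extractText()
	    if search_term in page_content:
	        # Record the zero-based page number along with the page text
	        result = {
	            "page": page_number,
	            "content": page_content
	        }
	        result_list.append(result)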

The final thing to do is close the file. PdfFileReader itself has no close method, so close the file object that was passed to it.

file.close()
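The steps above can also be wrapped in a function. The sketch below is one way to do it, assuming PyPDF2 1.x. The names find_matches and search_pdf are hypothetical, and the with-block closes the file automatically once the text has been extracted:

```python
def find_matches(page_texts, search_term):
    """Return a dict for every page whose text contains search_term."""
    result_list = []
    # enumerate yields zero-based indexes, matching getPage numbering.
    for page_number, page_content in enumerate(page_texts):
        if search_term in page_content:
            result_list.append({"page": page_number, "content": page_content})
    return result_list


def search_pdf(pdf_path, search_term):
    """Extract the text of every page in the PDF, then search it."""
    import PyPDF2  # assumes PyPDF2 1.x, as used in this article

    with open(pdf_path, "rb") as file:
        reader = PyPDF2.PdfFileReader(file)
        page_texts = [
            reader.getPage(page_number).extractText()
            for page_number in range(reader.numPages)
        ]
    return find_matches(page_texts, search_term)
```

Splitting the matching logic from the file handling means find_matches can be tried out on a plain list of strings, with no PDF involved.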

This basic solution accomplishes the goal of searching through a PDF file. However, it runs a bit slowly on large files, since every page’s text has to be extracted before it can be checked, so you may need to tune it to improve performance.
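One possible tuning, assuming text extraction is the slow step and that you search the same file more than once, is to extract each page only the first time it is needed and reuse the result afterwards. The class below is a hypothetical sketch of that idea. It takes any function that maps a page number to text, so it is not tied to a particular PyPDF2 version:

```python
class PageTextCache:
    """Extract each page's text at most once and reuse it across searches."""

    def __init__(self, extract_page, page_count):
        # extract_page is any callable mapping a page number to its text,
        # e.g. lambda n: reader.getPage(n).extractText()
        self._extract_page = extract_page
        self._page_count = page_count
        self._texts = {}

    def text(self, page_number):
        """Return the page text, extracting it only on first access."""
        if page_number not in self._texts:
            self._texts[page_number] = self._extract_page(page_number)
        return self._texts[page_number]

    def pages_containing(self, search_term):
        """Return the zero-based numbers of pages that contain search_term."""
        return [
            page_number
            for page_number in range(self._page_count)
            if search_term in self.text(page_number)
        ]
```

With a reader in hand, this could be used as PageTextCache(lambda n: reader.getPage(n).extractText(), reader.numPages). The underlying file has to stay open for as long as the cache may still need to extract pages.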

Conclusion

PyPDF2 is a pretty powerful package that makes working with PDFs a lot easier. In this article, we really only scratched the surface of what this library is capable of. Being able to read PDFs is a good starting point that can be used as a springboard to the more advanced topics of splitting, merging, and transforming files. Let me know in the comments below your thoughts about working with PDFs and what libraries you use to aid you in your work (feel free to talk about libraries for different languages). Until next time, cheers!
