How to Convert Scanned Files to Searchable PDF Using Python and Pytesseract

An efficient way to OCR scanned images


Scanned documents and image files are often more useful as searchable PDFs, because searching for keywords in a searchable PDF is much quicker and more convenient than reading through the original image. More importantly, the extracted text can then be used for Natural Language Processing (NLP). In this article, I’m going to show how to turn scanned files into searchable PDFs programmatically using Python and Pytesseract.

Required Libraries

  • pdf2image: a Python module that wraps pdftoppm and pdftocairo to convert PDF pages to images. Its output is a list of PIL image objects, one per page.
  • pytesseract: Python-Tesseract is an optical character recognition (OCR) wrapper for Python. It uses Google’s Tesseract-OCR Engine to extract text from images instead of relying on the underlying text and structure of the PDF. Compared with other Python packages for extracting text from PDFs, it has advantages such as preserving the whitespace between words.
  • PyPDF2: a Python PDF toolkit that can split, crop, and merge PDF pages, among other things.
  • io: part of the Python standard library, so it needs no installation. It lets us treat in-memory bytes as a file-like object.

Install Libraries

pip install pdf2image
pip install pytesseract
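# the code in this article uses the PyPDF2 1.x/2.x names (PdfFileReader, PdfFileWriter), which PyPDF2 3.0 removed
pip install "PyPDF2<3"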

Download and Install Additional Software

Both pdf2image and pytesseract depend on external software.

  • For pdf2image, Windows users have to download poppler, which provides the pdftoppm and pdftocairo utilities that pdf2image calls. On Linux, these are usually installed through the distribution's package manager.
  • For pytesseract, we need to download and install the Tesseract-OCR Engine.
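
Before going further, it can help to confirm that both external tools are actually reachable. The following is a minimal sketch, assuming the executables are named tesseract and pdftoppm and are expected on PATH. find_missing_tools is a hypothetical helper written for this article, not part of either library.

```python
import shutil


def find_missing_tools(tool_names):
    """Return the names from tool_names that cannot be found on PATH."""
    return [name for name in tool_names if shutil.which(name) is None]


if __name__ == "__main__":
    # Assumed executable names: 'tesseract' (Tesseract-OCR) and 'pdftoppm' (poppler)
    missing = find_missing_tools(["tesseract", "pdftoppm"])
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("Both tools were found on PATH.")
```

If a tool is reported as missing, either add its folder to PATH or set its location explicitly in the program.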

Import Libraries

import pytesseract
from pdf2image import convert_from_path
import PyPDF2
import io

Initialize pytesseract and pdf2image

After you download and install the software, you can add the folders containing their executables to the PATH environment variable on your computer. Alternatively, you can set the paths directly in the Python program with the following assignments.

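# Raw strings (r'...') keep the backslashes literal; without the prefix, '\b' in '\bin' is read as a backspace
poppler_path = r'...\pdf2image_poppler\Release-22.01.0-0\poppler-22.01.0\Library\bin'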
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
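
If you would rather not hard-code machine-specific paths, one option is to read them from environment variables and fall back to a default. This is only a sketch: TESSERACT_CMD and POPPLER_PATH are hypothetical variable names chosen for illustration, and path_from_env is not part of pytesseract or pdf2image.

```python
import os


def path_from_env(var_name, default=None):
    """Return the value of the environment variable var_name, or default if it is unset or blank."""
    value = os.environ.get(var_name, "").strip()
    return value if value else default


# Hypothetical variable names; a result of None means "rely on PATH"
tesseract_cmd = path_from_env("TESSERACT_CMD")
poppler_path = path_from_env("POPPLER_PATH")
```

When tesseract_cmd is not None, it can be assigned to pytesseract.pytesseract.tesseract_cmd. poppler_path can be passed to convert_from_path unchanged, because None tells pdf2image to look for poppler on PATH.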

Convert Scanned PDF Page(s) to a Single Searchable PDF

Usually, we have multiple scanned PDF pages in a single file. We can use the following functions to process all the pages with a for-loop.

  • convert_from_path: converts all the scanned PDF pages into images.

  • image_to_pdf_or_hocr: converts an image into a searchable PDF page.

  • PdfFileWriter: creates a PDF writer object for the new PDF file.

  • PdfFileReader: creates a PDF reader object.

  • addPage: adds a page to the PDF writer object.

    # render every page of the scanned PDF as a PIL image
    images = convert_from_path('Receipt.pdf', poppler_path=poppler_path)
    pdf_writer = PyPDF2.PdfFileWriter()
    for image in images:
        # OCR the page image; the result is a one-page searchable PDF as bytes
        page = pytesseract.image_to_pdf_or_hocr(image, extension='pdf')
        pdf = PyPDF2.PdfFileReader(io.BytesIO(page))
        pdf_writer.addPage(pdf.getPage(0))
    # export the searchable PDF to searchable.pdf
    with open("searchable.pdf", "wb") as f:
        pdf_writer.write(f)
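
The same loop can also be wrapped in a reusable function. The sketch below uses the newer PyPDF2 names PdfReader, PdfWriter, add_page, and pages, which are the only ones available from PyPDF2 3.0 onward. The function name scanned_pdf_to_searchable and its parameters are hypothetical, chosen for illustration. The third-party imports sit inside the function so that the input check runs first.

```python
import io
import os


def scanned_pdf_to_searchable(src_pdf, dst_pdf, poppler_path=None):
    """OCR every page of src_pdf and write one searchable PDF to dst_pdf.

    Returns the number of pages written.
    """
    if not os.path.isfile(src_pdf):
        raise FileNotFoundError(src_pdf)

    # Third-party imports are deferred until the input has been checked
    import pytesseract
    from pdf2image import convert_from_path
    from PyPDF2 import PdfReader, PdfWriter  # newer PyPDF2 names (required from 3.0 on)

    images = convert_from_path(src_pdf, poppler_path=poppler_path)
    writer = PdfWriter()
    for image in images:
        # Each call returns a one-page searchable PDF as bytes
        page_bytes = pytesseract.image_to_pdf_or_hocr(image, extension="pdf")
        writer.add_page(PdfReader(io.BytesIO(page_bytes)).pages[0])

    with open(dst_pdf, "wb") as f:
        writer.write(f)
    return len(images)
```

A call such as scanned_pdf_to_searchable('Receipt.pdf', 'searchable.pdf', poppler_path=poppler_path) would then produce the same output file as the loop.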
    

Convert an Image to Searchable PDF

If the scanned file is already in an image format such as TIF, PNG, or JPG, converting it into a searchable PDF file is simpler.

# image_to_pdf_or_hocr accepts an image file path and returns the PDF as bytes
pdf_bytes = pytesseract.image_to_pdf_or_hocr('Receipt.PNG', extension='pdf')
# export to searchable.pdf
with open("searchable.pdf", "wb") as f:
    f.write(pdf_bytes)
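
When several images are converted one at a time, writing every result to searchable.pdf would overwrite the previous one. As a small sketch of one way around that, the output name can be derived from the input name. Both default_pdf_path and image_to_searchable_pdf are hypothetical helpers written for this article, not pytesseract functions.

```python
from pathlib import Path


def default_pdf_path(image_path):
    """Return image_path with its extension replaced by .pdf."""
    return Path(image_path).with_suffix(".pdf")


def image_to_searchable_pdf(image_path, out_path=None):
    """OCR one image file, save it as a searchable PDF, and return the output path."""
    import pytesseract  # deferred so default_pdf_path can be used without Tesseract

    out_path = Path(out_path) if out_path else default_pdf_path(image_path)
    pdf_bytes = pytesseract.image_to_pdf_or_hocr(str(image_path), extension="pdf")
    out_path.write_bytes(pdf_bytes)
    return out_path
```

With this helper, image_to_searchable_pdf('Receipt.PNG') would write Receipt.pdf next to the original image.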

Convert Multiple Images in the Same Folder to a Single Searchable PDF

If you would like to convert many images in the same folder into a single searchable PDF file, you can use os.walk to build a list of paths for every file in that folder and its subfolders, then use the same functions as above to process the images and export them into one searchable PDF file. The folder should contain only image files, since every path found is passed to Tesseract.

import os

# collect the path of every file under images_folder (subfolders included)
all_files = []
for (path, dirs, files) in os.walk('images_folder'):
    for file in files:
        file = os.path.join(path, file)
        all_files.append(file)

pdf_writer = PyPDF2.PdfFileWriter()
for file in all_files:
    page = pytesseract.image_to_pdf_or_hocr(file, extension='pdf')
    pdf = PyPDF2.PdfFileReader(io.BytesIO(page))
    pdf_writer.addPage(pdf.getPage(0))

with open("searchable.pdf", "wb") as f:
    pdf_writer.write(f)
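
Two details are worth handling when the folder is not perfectly clean: os.walk does not promise any particular file order, and it also returns non-image files. The following sketch filters by extension and sorts the result so that page order is predictable. The helper names and the extension list are assumptions for illustration and can be adjusted to your own files.

```python
import os

# Assumed set of image extensions; extend it as needed
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".tif", ".tiff")


def select_image_paths(paths, extensions=IMAGE_EXTENSIONS):
    """Keep only paths with an image extension, sorted case-insensitively."""
    images = [p for p in paths if p.lower().endswith(extensions)]
    return sorted(images, key=str.lower)


def collect_image_paths(folder):
    """Walk folder recursively and return its image files in a stable order."""
    found = []
    for path, _dirs, files in os.walk(folder):
        found.extend(os.path.join(path, name) for name in files)
    return select_image_paths(found)
```

The list returned by collect_image_paths('images_folder') can replace all_files in the loop above. Because the sort is alphabetical, page10.png comes before page2.png, so zero-padded names such as page02.png keep the pages in reading order.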
