How to Convert Scanned Files to Searchable PDF Using Python and Pytesseract

Oftentimes, you would like to make scanned files or image files searchable in PDF, because it is much quicker and more convenient to search keywords with a searchable PDF than in native image format. More importantly, the underlying text would be useful for Natural Language Processing (NLP). In this article, I’m going to talk about how to turn scanned file(s) into searchable PDF programmatically using Python and Pytesseract.

Required Libraries

pdf2image: It is a Python module that wraps pdftoppm and pdftocairo to convert a PDF to an image object. The produced output would be a list of image objects.
pytesseract: Python-Tesseract is an optical character recognition (OCR) tool developed for Python. It uses an OCR engine (namely, Google’s Tesseract-OCR Engine) to extract text from the image(s) instead of relying on underlying text and structure from PDF. pytesseract has the advantages of extracting text from PDF (such as preserving whitespaces between words) over other Python packages.
PyPDF2: It is a Python PDF toolkit, which is capable of splitting, cropping, merging PDF pages and more.
io: It allows us to manage the file-related input and output.

Install Libraries

pip install pdf2image
pip install pytesseract
pip install PyPDF2

Download and Install additional software

We would need additional software to use the libraries.

For pdf2image, we will have to download the poppler for windows users. This software functions similarly to pdftoppm and pdftocairo in a Linux system.
For pytesseract, we will need to download and install Tesseract-OCR Engine.

Import Libraries

import pytesseract
from pdf2image import convert_from_path
import PyPDF2
import io

Initialize pytesseract and pdf2image

After you download and install the software, you can add their executable paths into Environment Variables on your computer. Alternatively, you can run the following commands to directly include their paths in the Python program.

poppler_path = '...\pdf2image_poppler\Release-22.01.0-0\poppler-22.01.0\Library\bin'
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

Convert scanned PDF page(s) to single search PDF

Usually, we have multiple scanned PDF pages in a single file. We can use the following functions to process all the pages with a for-loop.

convert_from_path: it converts all the scanned PDF pages into images.
image_to_pdf_or_hocr: it converts an image into a searchable pdf page
PdfFileWriter: it creates a pdf writer object for the new PDF file.
PdfFileReader: it creates a pdf reader object

addPage: it adds a page to pdf writer object

images = convert_from_path('Receipt.pdf', poppler_path=poppler_path)
pdf_writer = PyPDF2.PdfFileWriter()
for image in images:
    page = pytesseract.image_to_pdf_or_hocr(image, extension='pdf')
    pdf = PyPDF2.PdfFileReader(io.BytesIO(page))
    pdf_writer.addPage(pdf.getPage(0))
# export the searchable PDF to searchable.pdf
with open("searchable.pdf", "wb") as f:
    pdf_writer.write(f)

Convert an Image to Searchable PDF

If the scanned file is in an image format, such as, tif, png, jpg. The process to convert it into a search PDF file is simpler.

PDF = pytesseract.image_to_pdf_or_hocr('Receipt.PNG', extension='pdf')
# export to searchable.pdf
with open("searchable.pdf", "w+b") as f:
    f.write(bytearray(PDF))

Convert Multiple Images in the same folder to a Single searchable PDF

If you would like to convert a lot of images in the same folder into a single searchable PDF file, you can use os.walk to create a list of paths for all the image files in the same folder, then use the same functions mentioned above to process the images and export into a single searchable PDF file.

all_files = []
for (path,dirs,files) in **os.walk**('images_folder'):
    for file in files:
        file = os.path.join(path, file)
        all_files.append(file)

pdf_writer = PyPDF2.PdfFileWriter()
for file in all_files:
    page = pytesseract.image_to_pdf_or_hocr(file, extension='pdf')
    pdf = PyPDF2.PdfFileReader(io.BytesIO(page))
    pdf_writer.addPage(pdf.getPage(0))

with open("searchable.pdf", "wb") as f:
    pdf_writer.write(f)

If you would like to continue exploring PDF scraping, please check out my other articles:

If you enjoy this article and would like to Buy Me a Coffee, please click here.

How to Convert Scanned Files to Searchable PDF Using Python and Pytesseract

An efficient way to OCR scanned images

Required Libraries

Install Libraries

Download and Install additional software

Import Libraries

Initialize pytesseract and pdf2image

Convert scanned PDF page(s) to single search PDF

Convert an Image to Searchable PDF

Convert Multiple Images in the same folder to a Single searchable PDF

Continue Learning

How to Replace Python 'for' Loops with NumPy Operations

How to find all possible combinations in a list (and their sum) with Python

Access files from AWS S3 using pre-signed URLs in Python

How to Use the _eq_() Method to Compare Objects in Python

LangChain in Chains #5: Prompt Template

The Best Python IDE For Mac Users 2024— Part 2

Main Menu

Follow Us

How to Convert Scanned Files to Searchable PDF Using Python and Pytesseract

An efficient way to OCR scanned images

Required Libraries

Install Libraries

Download and Install additional software

Import Libraries

Initialize pytesseract and pdf2image

Convert scanned PDF page(s) to single search PDF

Convert an Image to Searchable PDF

Convert Multiple Images in the same folder to a Single searchable PDF

Continue Learning

How to Replace Python 'for' Loops with NumPy Operations

How to find all possible combinations in a list (and their sum) with Python

Access files from AWS S3 using pre-signed URLs in Python

How to Use the _eq_() Method to Compare Objects in Python

LangChain in Chains #5: Prompt Template

The Best Python IDE For Mac Users 2024— Part 2