Professionals in finance, legal, and healthcare know the grind of working through endless bank statements, legal contracts, medical records, and more, trying to pull out something meaningful from reams of unstructured data.
Fortunately, we live in an era where Large Language Models (LLMs) are more accessible than ever, and it’s relatively simple these days to build an AI-powered app that can identify patterns, trends, and connections to aid — not replace — human teams in making data-driven decisions.
The catch? This is only possible if you can reliably extract structured data from unstructured PDFs to begin with.
The Bottleneck: Simple Text Extraction Is Not Enough
While we can certainly attempt to use any of the several PDF.js-based libraries to extract raw text from PDFs, we’d never be able to preserve data in tabular format — and this is a dealbreaker when processing bank statements, financial reports, insurance documents, and more.
The core problem is that PDFs were never designed to be data repositories. Rather, they were meant for consistent display, built to preserve how documents look — font, text, raster/vector images, tables, forms, and all — across devices and operating systems, instead of being tethered to a structured data model you might be used to working with, like JSON or XML.
This lack of an inherent schema turns data extraction into a tricky task, because content in PDFs isn’t organized logically, but visually.
Rule-based templates can partially address this, but they're a fragile solution that falls apart at scale. Each bank, for example, has its own statement format, and even small design tweaks can throw off template-based extraction. Manually creating and maintaining templates for every bank is impractical: a template is by definition static, requires constant upkeep, and makes for a poor long-term choice.
And without this crucial first step done right, you're left feeding a wall of linear, unformatted text that preserves no context to a non-deterministic LLM, which is never going to give you reliable results at scale. I would never stake a business on this.
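To make that concrete, here's a minimal sketch of what a plain text dump looks like in code, assuming the pdfjs-dist package (one of the PDF.js-based libraries mentioned above):
const fs = require("fs");
const pdfjs = require("pdfjs-dist/legacy/build/pdf.js");

// Pull raw text from page 1 of a PDF. Every heading, table cell, and footnote
// comes back as one flat list of strings, with no record of which table,
// row, or column it belonged to.
async function dumpRawText(path) {
  const data = new Uint8Array(fs.readFileSync(path));
  const doc = await pdfjs.getDocument({ data }).promise;
  const page = await doc.getPage(1);
  const content = await page.getTextContent();
  return content.items.map((item) => item.str).join(" ");
}
Run something like that over a bank statement and the transaction table dissolves into an undifferentiated stream of dates, descriptions, and amounts.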
Is there something better?
For some time now, I’ve been looking for an alternative that could combine powerful document processing with a flexible, scalable API (so, not a standalone desktop app) that integrated seamlessly into our development pipeline — without the headaches of mandatory customization or constant manual oversight. And because security is paramount, I needed it to be deployable on our own infrastructure.
Apryse checked every box.
Apryse is an all-in-one native toolkit for document management. It provides libraries for web, mobile, client, and server use that cover PDF viewing, annotation, editing, creation, and generation, plus the part most relevant to my needs: data extraction via its server SDK, delivering data in JSON, XML, or even XLSX formats.
With Apryse, I can finally shift focus from busywork (data extraction, template maintenance) to building out analytics that drive value at scale. It’s a reliable backbone for high-volume, data-driven operations.
“Intelligent Data Processing”
Here’s what sets it apart: a complex neural network under the hood, using deep learning models to intelligently extract structured data from PDFs. Essentially, the Apryse library uses a pipeline of these models that has “learnt” to recognize what tabular data in a PDF would look like — grids, columns, rows — how they’re positioned in relation to each other, and how they’re different from paragraphs of text, raster/vector images, and so on.
Initialize the library, feed it a PDF as input, and it’ll deliver parsed and structured data in a way that reflects its layout on the page — identifying tables, headers, footers, rows, and columns, and extracting paragraphs/text content along with its reading order and positional data (bounding box coordinates and baselines).
// output.json
{
  "pages": [
    {
      "elements": [
        {
          "columnWidths": [138, 154],
          "groupId": 0,
          "rect": [70, 697, 362, 524],
          "trs": [
            {
              "rect": [70, 697, 362, 675],
              "tds": [
                {
                  "colStart": 0,
                  "contents": [
                    {
                      "baseline": 674.4262771606445,
                      "rect": [
                        71.999997, 698.7012722546347, 190.82616323890954,
                        665.7694694275744
                      ],
                      "text": "ABC Bank",
                      "type": "textLine"
                    }
                  ],
                  "rect": [70, 697, 208, 675],
                  "rowStart": 0,
                  "type": "td"
                }
              ],
              "type": "tr"
            }
            // more rows here
          ],
          "type": "table"
        }
        // more elements here
      ]
    }
    // more pages here
  ]
}
What you get back is highly structured output that makes it possible for your downstream processes to analyze or reformat the data for further insights — perfect for ingestion by an LLM in the next stage of our insights pipeline.
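For example, here's a rough sketch of downstream code that flattens each extracted table back into rows of cell text, based on the field names in the sample output above (real output may vary in nesting and fields):
// Walk the extracted JSON and print every table, one row of cells per line.
const data = JSON.parse(json); // `json` is the string returned by the extraction step

for (const page of data.pages) {
  for (const element of page.elements) {
    if (element.type !== "table") continue;
    for (const row of element.trs) {
      const cells = row.tds.map((td) =>
        td.contents.map((line) => line.text).join(" ")
      );
      console.log(cells.join(" | "));
    }
  }
}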
Let’s see how this works.
Prerequisites
First off, make sure you’re running Node.js 18+, and init a new project.
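Something like this, with the folder name being entirely up to you:
node --version        # should be v18 or newer
mkdir pdf-insights && cd pdf-insights
npm init -y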
We’re going to install the core Apryse library, and its Data Extraction module (which contains the neural network we’ve talked about). We’ll also get dotenv for our environment variables.
You can use your package manager of choice to install the dependencies. Let’s just use NPM here as that’s the one most Node.js users have by default.
npm install @pdftron/pdfnet-node
npm install @pdftron/data-extraction
npm install dotenv
For my LLM needs, I’ll be using the one I have a subscription for — OpenAI. But to keep this tutorial as open-ended as possible, and make sure anyone reading can follow along, we’ll use the Vercel AI SDK, which is a unified interface that allows you to use (and swap out, with ease) OpenAI, Anthropic, Gemini, and whatever else you have access to — even custom ones.
npm install ai @ai-sdk/openai
Finally, API keys.
- Get one for the LLM you’re using, if you need one. For OpenAI, you’ll find it here.
- For your free Apryse API key, log in here to reveal it.
Put them in a .env file in your project folder. Here’s what mine looks like.
OPENAI_API_KEY=openai_api_key_here
APRYSE_API_KEY=apryse_trial_api_key_here
Step 1: Extraction
The entry point for our script is quite simple, really. You import the library, use addResourceSearchPath to point to the data extraction add-on (which is, technically, an external resource), and await the extraction of tabular data from an input PDF (bank-statement.pdf in the same directory, here) as a JSON string.
require("dotenv").config();
const { PDFNet } = require("@pdftron/pdfnet-node");
async function main() {
await PDFNet.addResourceSearchPath(
"./node_modules/@pdftron/data-extraction/lib"
);
try {
const json = await PDFNet.DataExtractionModule.extractDataAsString(
"./bank-statement.pdf",
PDFNet.DataExtractionModule.DataExtractionEngine.e_Tabular
);
console.log("-----Extracted Text------");
console.log(json);
} catch (error) {
console.error("Error :", error);
}
}
If your PDF is password protected, simply set the password on a DataExtractionOptions object, like so:
/* if password protected */
const options = new PDFNet.DataExtractionModule.DataExtractionOptions();
options.setPDFPassword("password");

const json = await PDFNet.DataExtractionModule.extractDataAsString(
  "./bank-statement.pdf",
  PDFNet.DataExtractionModule.DataExtractionEngine.e_Tabular,
  options
);
// rest of code
Oh, and to make sure the Apryse SDK cleans up all in-memory objects once a process has finished running, you should invoke main() through PDFNet.runWithCleanup(), which makes our code at the end of the extraction stage look like this:
require("dotenv").config();
const { PDFNet } = require("@pdftron/pdfnet-node");
async function main() {
await PDFNet.addResourceSearchPath(
"./node_modules/@pdftron/data-extraction/lib"
);
try {
const json = await PDFNet.DataExtractionModule.extractDataAsString(
"./bank-statement.pdf",
PDFNet.DataExtractionModule.DataExtractionEngine.e_Tabular
);
console.log("-----Extracted Text------");
console.log(json);
} catch (error) {
console.error("Error :", error);
}
}
PDFNet.runWithCleanup(main, process.env.APRYSE_API_KEY)
.catch((error) =>
console.error("Apryse library failed to initialize:", error)
)
.then(function () {
PDFNet.shutdown();
});
Again, make sure your Apryse API key is set in the .env file and passed as the second argument to the runWithCleanup function here.
When you run this script, it should print out the extracted, deeply structured JSON that we talked about earlier.
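Assuming you've saved the script as extract.js (the filename is up to you), run it with:
node extract.js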
The next step would be to pass this extracted JSON to an LLM, with a prompt designed to extract insights from it.
Step 2: Making your LLM Ingest Structured Output to Generate Insights
Start by integrating the Vercel AI SDK to handle requests to your LLM provider. We’ll write a simple function that takes in the JSON data from the previous step, and sends it to the LLM with a specific prompt we’ll design for actionable insights.
Our scenario is analyzing a bank statement for internal business use (i.e. we want strategic and financial insights for stakeholders), so this sounds like a fine prompt to use, right?
const prompt = `Analyze the following text, and generate a bullet-point list
of actionable financial and strategic insights for internal business use.`;
Almost there, but not quite. LLMs interpret prompts as text, with no clear separation between your instructions and any commands hidden within the underlying data. Imagine malicious text in a PDF that subtly redirects your prompt to overemphasize certain data, potentially skewing your company’s conclusions towards biased or even harmful decisions. Not a pretty thought.
If this were a traditional application, we’d just handle user inputs with strict sanitization and validation. Prompts for LLMs, though, are harder to secure because they have no formal structure to validate against.
But there are some safeguards we can take. Since we already have structured JSON as input, ready to go, we could just put that under an extra JSON field, and tell the LLM to process nothing that isn’t within that particular key-value pair.
const prompt = `Analyze the transaction data provided in JSON format
under the 'text_to_analyze' field, and nothing else. Each transaction is structured with details
like transaction description, amount, date, and other metadata.
Generate a bullet-point list of actionable financial and strategic insights
for internal business use. Focus on identifying cash flow patterns,
high-spend categories, recurring payments, large or unusual transactions,
and any debt obligations. Provide insights into areas for cost savings,
credit risk, operational efficiency, and potential financial risks.
Each insight should suggest actions or strategic considerations to
improve cash flow stability, optimize resource allocation, or flag
potential financial risks. Expand technical terms as needed to clarify
for business stakeholders.`;
The non-determinism of LLMs means you can spend forever fine-tuning this prompt to your needs, but this should do as a baseline.
After that, we can import the necessary libraries and pass our LLM the data like this.
const { openai } = require("@ai-sdk/openai");
const { generateText } = require("ai");

async function analyze(input) {
  try {
    const { text } = await generateText({
      model: openai("gpt-4o"),
      /* structured inputs to safeguard against prompt injection */
      prompt: `${prompt}\n{"text_to_analyze": ${input}}`,
    });
    if (!text) {
      throw new Error(
        "No response text received from the generateText function."
      );
    }
    return text;
  } catch (error) {
    console.error("Error in analyze function:", error);
    return null; // return null if there’s an error
  }
}
As you can see, the Vercel AI SDK here makes it very easy to swap out the model you want. Everything else will remain the same except the value for the model property.
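For example, switching to Anthropic inside analyze would look something like this (assuming you've installed the @ai-sdk/anthropic package, added its API key to your environment, and have access to whichever model you name):
// npm install @ai-sdk/anthropic
const { anthropic } = require("@ai-sdk/anthropic");

const { text } = await generateText({
  model: anthropic("claude-3-5-sonnet-latest"), // only this line changes; use a model available to you
  prompt: `${prompt}\n{"text_to_analyze": ${input}}`,
});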
Putting it all together and cleaning up the code a bit, here’s what we have.
const { PDFNet } = require("@pdftron/pdfnet-node");
const { openai } = require("@ai-sdk/openai");
const { generateText } = require("ai");
const fs = require("fs");
require("dotenv").config();

const prompt = `YOUR_PROMPT_HERE`; // remember to use structured inputs to safeguard against prompt injection

async function analyze(input) {
  try {
    const { text } = await generateText({
      model: openai("gpt-4o"),
      prompt: `${prompt}\n{"text_to_analyze": ${input}}`, // structured input
    });
    if (!text) {
      throw new Error(
        "No response text received from the generateText function."
      );
    }
    return text;
  } catch (error) {
    console.error("Error in analyze function:", error);
    return null;
  }
}

async function extractDataFromPDF() {
  console.log("Extracting tabular data as a JSON string...");

  /* if password protected */
  // const options = new PDFNet.DataExtractionModule.DataExtractionOptions();
  // options.setPDFPassword("password");

  const json = await PDFNet.DataExtractionModule.extractDataAsString(
    "./bank-statement.pdf",
    PDFNet.DataExtractionModule.DataExtractionEngine.e_Tabular
  ); // include options if you're using them

  fs.writeFileSync("./output.json", json);
  console.log("Result saved");
  return json;
}

async function main() {
  await PDFNet.addResourceSearchPath(
    "./node_modules/@pdftron/data-extraction/lib"
  );
  try {
    const extractedData = await extractDataFromPDF();
    const insights = await analyze(extractedData);
    if (insights) {
      // Save insights to a file
      fs.writeFileSync("insights.md", insights, "utf8");
      console.log("Insights saved to insights.md");
    } else {
      console.error("No insights generated. Skipping file write.");
    }
  } catch (error) {
    console.error("Error:", error);
  }
}

PDFNet.runWithCleanup(main, process.env.APRYSE_API_KEY)
  .catch((error) =>
    console.error("Apryse library failed to initialize:", error)
  )
  .then(function () {
    PDFNet.shutdown();
  });
For clarity, I’ve chosen to write outputs at both stages (the extraction from PDF, as well as the insights generated by the LLM) to files on disk. If something goes wrong, that will hopefully help you diagnose and fine-tune as needed.
Your insights will look something like this. Pretty neat!
// excerpt-of-insights.md
//...
### Significant or Unusual Transactions
1. **Equipment Purchase**: For the **$9,000.00** expense for computers on April 21, 2023,
I suggest verifying that the hardware specifications meet both current
operational demands and potential future needs to prevent frequent upgrades.
I also suggest considering bulk purchasing discounts or bundled warranties to
lower long-term maintenance costs. Additionally, consider leasing instead
of purchasing outright if you want to bring down upfront costs.
2. **Business Travel Expenses**: You spent **$36,500** on travel for the month
of April. This is a significant expense on travel, and I suggest double
checking to make sure the numbers are accurate, and shows clear alignment
between travel goals and expected returns. I also suggest a travel policy
that defines allowable expenses, reimbursement guidelines, and cost-saving
practices (e.g., advance bookings, lodging caps, per diem).
Regularly review this policy, and promote virtual meetings whenever possible.
//...(more)
Lastly, this is outside the scope of this tutorial, but I’d be remiss if I didn’t mention something else: a bank statement is typically a small enough PDF that the structured data Apryse extracts is unlikely to exceed your LLM’s usage or context limits. An 8K-token context window, for example, should comfortably handle most bank statements. But if you need to process larger PDFs, like:
- annual financial reports,
- multi-page legal contracts,
- or comprehensive healthcare records,
then the JSON output can easily exceed token limits for many LLMs, especially once it includes a lot of layout metadata. For these, consider a RAG (Retrieval-Augmented Generation) approach that breaks the content into smaller, more relevant chunks. This way, you can index the chunks and retrieve only the parts most relevant to each query, reducing token costs and staying within your model’s context window.
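As a starting point, here's a rough sketch of the chunking half, reusing the field names from the extraction output shown earlier (the embedding and retrieval steps are left out):
// Rough sketch: split the extracted output into one chunk per element
// (table, paragraph, etc.), so each chunk can be embedded and retrieved on its own.
function chunkExtractedJson(json) {
  const data = JSON.parse(json);
  const chunks = [];
  data.pages.forEach((page, pageIndex) => {
    for (const element of page.elements) {
      chunks.push({
        page: pageIndex + 1,
        type: element.type,
        content: JSON.stringify(element),
      });
    }
  });
  return chunks;
}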
Lessons Learnt
PDF data extraction is usually a crapshoot that involves cobbling together multiple libraries and fighting with inconsistent output. Apryse's deep learning-based approach was a refreshingly straightforward solution.
The SDK’s ability to maintain structural fidelity when converting PDFs to JSON — preserving everything from table relationships to spatial layout — provides the perfect foundation for LLM-powered analysis. No more manual parsing, no more regex gymnastics, and no more guessing at document hierarchy.
The developer experience has been exceptional, too: crystal-clear documentation that answers your questions and lets you dive deep into the API to explore potential use cases, plenty of sample code that actually reflects real-world scenarios, and, of course, it's fast.
I used it for bank statements, but whether you’re building financial analysis tools, document processing pipelines, or any application that needs to extract meaningful data from PDFs, Apryse deserves a serious look. It’s turned what’s typically a painful development process into a straightforward implementation that lets you focus on building features rather than fighting with document parsing.