How to extract PDF data with PDF.js

2021-02-24

We live in a data-driven world, constantly moving data from one place to another. In this brief tutorial, I will show you how to extract PDF content using PDF.js. Its npm package, pdfjs-dist, will help you roll out custom PDF extraction logic or an interface to explore PDF data.

This article is a guest post by Ammon Victor.

This article glosses over the following modern JavaScript (ES6+) concepts: const, promises, async/await, and fat arrow functions.

# run
npm install pdfjs-dist
# or
yarn add pdfjs-dist

Core

/**
 * Note the import of pdfjs-dist/es5/build/pdf, required when in Node.js.
 * With the default import in Node.js, getDocument will throw an error.
 */
const pdfjs = require("pdfjs-dist/es5/build/pdf")

async function getContent(src) {
    const doc = await pdfjs.getDocument(src).promise // note the use of the property promise
    const page = await doc.getPage(1)
    return await page.getTextContent()
}

PDF.js exposes getDocument, which abstracts the logic for opening a PDF. It returns a loading task whose promise property is a promise that resolves with the document once the file has opened successfully. With this document we can access any page with doc.getPage and then read that page's contents with page.getTextContent, which resolves to an object with two properties, items and styles. The data we need is in items.
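getContent above reads only page 1. As a minimal sketch of reading every page, the loop below relies on the document's numPages property and on PDF.js pages being numbered from 1. The helper name getAllPagesContent is hypothetical and not part of PDF.js.

```javascript
// Hypothetical helper: collects the text content of every page.
// `doc` is the document resolved from pdfjs.getDocument(src).promise
async function getAllPagesContent(doc) {
    const pages = []
    // Page numbers start at 1, not 0
    for (let pageNumber = 1; pageNumber <= doc.numPages; pageNumber++) {
        const page = await doc.getPage(pageNumber)
        pages.push(await page.getTextContent())
    }
    return pages
}
```

Each entry in the returned array has the same items and styles shape as the single-page result, so the processing below can be applied page by page.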

Processing

content.items is an array of objects; what we are looking for is the str property of each object.

async function getItems(src) {
    // Perform checks
    const content = await getContent(src)
    /**
     * Expect content.items to have the following structure
     * [{
     *  str: '1',
     *  dir: 'ltr',
     *  width: 4.7715440000000005,
     *  height: 9.106,
     *  transform: [ 9.106, 0, 0, 9.106, 53.396, 663.101 ],
     *  fontName: 'g_d0_f2'
     * }, ...]
     */

    // you can do the following on content.items
    return content.items.map((item) => item.str)
    // This is a new array of strings from the str property
    // Expected output: ['1', '06/02/2013', '$1,500.00', 'job 1, check 1', 'yes', 'deposit',...]
}

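This new array of strings is flat: the cells of the PDF table appear one after another, so every n consecutive entries form one row, where n is the number of columns in the table. The exact pattern depends on how the PDF was structured, so inspect your PDF before relying on it.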
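One way to do that inspection is to group the items by their vertical position and see how many cells land on each row. This is only a sketch under assumptions: it treats transform[5] as the item's y coordinate (the last two entries of a PDF transform matrix are the x and y translation), and it rounds y to the nearest unit, which may be too coarse or too fine for your file. The helper name groupByRow is hypothetical.

```javascript
// Hypothetical helper: groups text items into rows by vertical position.
// Assumes item.transform[5] is the y coordinate of the item.
function groupByRow(items) {
    const rows = new Map()
    for (const item of items) {
        // Round so that tiny differences in y do not split a row
        const y = Math.round(item.transform[5])
        if (!rows.has(y)) rows.set(y, [])
        rows.get(y).push(item.str)
    }
    // A Map keeps insertion order, so rows come out in reading order
    return [...rows.values()]
}
```

If most rows come back with the same length, that length is a reasonable candidate for n.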

With our data in the new array, we can loop through it n entries at a time, saving each row to JSON, to a database, or to a different file format.
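As a rough sketch of that loop, the hypothetical helper toRows below splits the flat array into rows of a given width and serializes the result as JSON. The column count of 6 is an assumption made purely for illustration; use whatever your own PDF's table has.

```javascript
// Hypothetical helper: splits a flat array of strings into rows of `columns` cells
function toRows(strings, columns) {
    const rows = []
    for (let i = 0; i < strings.length; i += columns) {
        rows.push(strings.slice(i, i + columns))
    }
    return rows
}

// Illustrative data only, assuming a six-column table
const sample = ['1', '06/02/2013', '$1,500.00', 'job 1, check 1', 'yes', 'deposit']
const json = JSON.stringify(toRows(sample, 6))
// json is now a string that can be written to a file or sent to an API
```

If the array length is not a multiple of the column count, the last row will be shorter, which is usually a sign that the assumed column count is wrong.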

Usage

Standalone CLI app
Integration into an existing server app
As part of a web app UI/UX

View the GitHub repo extract-pdf-content for demos.
