PDF Text Extract

Extract text from PDFs that contain searchable text. The module is a wrapper that calls the pdftotext command to perform the actual extraction.

Installation

npm install --save pdf-text-extract

You will need the pdftotext binary available on your PATH. Packages are available for many operating systems.

See https://github.com/nisaacson/pdf-extract#osx for how to install the pdftotext command.
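To confirm that the binary can be found before extracting anything, a quick check along the following lines may help. This is a minimal sketch: hasPdftotext is a hypothetical helper, not part of pdf-text-extract, and it only tests whether the command can be started at all.

```javascript
var spawnSync = require('child_process').spawnSync

// Hypothetical helper (not part of pdf-text-extract): returns true when the
// given command can be started, false when it cannot (for example ENOENT).
function hasPdftotext (command) {
  var result = spawnSync(command || 'pdftotext', ['-v'], { stdio: 'ignore' })
  // spawnSync sets result.error when the process could not be spawned
  return !result.error
}

console.log('pdftotext available:', hasPdftotext())
```

If the binary lives outside your PATH, you can pass its full path to the helper, for example hasPdftotext('/opt/bin/pdftotext').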

Usage

As a module

extract(filePath, [options], [pdftotextcommand], callback)

The options and pdftotextcommand arguments are optional.

var path = require('path')
var filePath = path.join(__dirname, 'test/data/multipage.pdf')
var extract = require('pdf-text-extract')
extract(filePath, function (err, pages) {
  if (err) {
    console.dir(err)
    return
  }
  console.dir(pages)
})

The output will be an array where each entry is a page of text. If you want a single string containing all pages, set the option splitPages: false.

var path = require('path')
var filePath = path.join(__dirname, 'test/data/multipage.pdf')
var extract = require('pdf-text-extract')
extract(filePath, { splitPages: false }, function (err, text) {
  if (err) {
    console.dir(err)
    return
  }
  console.dir(text)
})
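The relationship between the two output shapes can be sketched as follows. pdftotext normally separates pages with a form feed character, so a single string can be split on that character. Note that splitIntoPages is a hypothetical helper written for illustration; it is an assumption that the module splits pages in a similar way internally.

```javascript
// Hypothetical helper (not part of pdf-text-extract): split raw pdftotext
// output into pages on the form feed character that ends each page.
function splitIntoPages (text) {
  var pages = text.split('\f')
  // A form feed after the last page leaves an empty trailing entry; drop it
  if (pages.length > 1 && pages[pages.length - 1] === '') {
    pages.pop()
  }
  return pages
}

var sample = 'page one\fpage two\f'
console.dir(splitIntoPages(sample))
```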

You can set the following options:

firstPage : First page to extract

lastPage : Last page to extract

resolution : Resolution in dpi, as specified by pdftotext -r

crop : Should be an object { x:x, y:y, w:w, h:h }

layout : Should be either layout, raw or htmlmeta. Default: layout

encoding : Should be either UCS-2, ASCII7, Latin1, UTF-8, ZapfDingbats or Symbol. Default: UTF-8

eol : End of line convention. One of unix, dos or mac

ownerPassword : Owner password (for encrypted files)

userPassword : User password (for encrypted files)

splitPages : If true, the result will be an array of pages. Default: true
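To see how these options relate to the underlying command, the sketch below maps them onto pdftotext command line flags (-f, -l, -r, -x, -y, -W, -H, -enc, -eol, -opw, -upw). buildPdftotextArgs is a hypothetical helper for illustration only; the module's actual internal mapping may differ, and splitPages is left out because it describes the shape of the result rather than a pdftotext flag.

```javascript
// Hypothetical helper (not part of pdf-text-extract): translate the options
// listed above into an argument array for the pdftotext command.
function buildPdftotextArgs (options) {
  options = options || {}
  var args = []
  if (options.firstPage != null) args.push('-f', String(options.firstPage))
  if (options.lastPage != null) args.push('-l', String(options.lastPage))
  if (options.resolution != null) args.push('-r', String(options.resolution))
  if (options.crop) {
    // Crop area: top-left corner (x, y) plus width and height
    args.push('-x', String(options.crop.x), '-y', String(options.crop.y))
    args.push('-W', String(options.crop.w), '-H', String(options.crop.h))
  }
  // layout, raw and htmlmeta are plain switches; layout is the documented default
  args.push('-' + (options.layout || 'layout'))
  args.push('-enc', options.encoding || 'UTF-8')
  if (options.eol) args.push('-eol', options.eol)
  if (options.ownerPassword) args.push('-opw', options.ownerPassword)
  if (options.userPassword) args.push('-upw', options.userPassword)
  return args
}

console.dir(buildPdftotextArgs({ firstPage: 2, lastPage: 3, eol: 'unix' }))
```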

If needed, you can pass additional options to the extract function. These will be passed to the child_process.spawn call.

var path = require('path')
var filePath = path.join(__dirname, 'test/data/multipage.pdf')
var extract = require('pdf-text-extract')
var options = {
  cwd: './'
}
extract(filePath, options, function (err, pages) {
  if (err) {
    console.dir(err)
    return
  }
  // console.log prints both the label and the pages
  console.log('extracted pages', pages)
})

You can also override the pdftotext command if it is installed in a location that is not in the PATH environment variable.

var path = require('path')
var filePath = path.join(__dirname, 'test/data/multipage.pdf')
var pdfToTextCommand = '/opt/bin/pdftotext'
var extract = require('pdf-text-extract')
var options = {
  cwd: './'
}
extract(filePath, options, pdfToTextCommand, function (err, pages) {
  if (err) {
    console.dir(err)
    return
  }
  // console.log prints both the label and the pages
  console.log('extracted pages', pages)
})

ES6 promises are supported. You can now call .then(onFulfilled[, onRejected]):

var path = require('path')
var filePath = path.join(__dirname, 'test/data/multipage.pdf')
var Extract = require('../index.js')
var extract = new Extract(filePath)
extract.then(function (pages) {
  // console.log prints both the label and the pages
  console.log('extracted pages', pages)
}).catch(function (err) {
  console.error('error:', err)
})
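If you prefer to keep using the callback form shown earlier but still want a promise, a small wrapper is one option. This is a minimal sketch: extractAsync is a hypothetical helper, not part of the module, and it takes the extract function as its first argument so that it does not depend on how the module is loaded.

```javascript
// Hypothetical helper (not part of pdf-text-extract): wrap the callback-style
// extract(filePath, options, callback) call in a Promise.
function extractAsync (extract, filePath, options) {
  return new Promise(function (resolve, reject) {
    extract(filePath, options || {}, function (err, pages) {
      if (err) {
        reject(err)
        return
      }
      resolve(pages)
    })
  })
}

// Usage, assuming the module is installed:
//   var extract = require('pdf-text-extract')
//   extractAsync(extract, filePath, { splitPages: false })
//     .then(function (text) { console.log(text) })
//     .catch(function (err) { console.error('error:', err) })
```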

As a command line tool

npm install -g pdf-text-extract

Execute with the filePath as an argument. The output will be a JSON-formatted array of pages.

pdf-text-extract ./test/data/multipage.pdf

Test