date 2000-06-29

extracts text from PDF files

This library is packaged from the pdf.js examples for use with Node.js.

It reads a PDF file and exports all pages and text items with their coordinates. This can be used, for example, to extract structured table data.
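
As a rough illustration of the table use case, the sketch below groups text items into rows by their vertical position and orders each row left to right. It assumes each text item is an object with `x`, `y` and `str` fields and that `y` grows downward; neither is stated above, and `groupIntoRows` and its `tolerance` parameter are hypothetical names, not part of this package.

```javascript
// Minimal sketch: turn positioned text items into table-like rows.
// Assumed item shape (not confirmed by this README): { x: number, y: number, str: string }.
function groupIntoRows(items, tolerance = 2) {
  // Sort top to bottom first, so items of the same row end up adjacent.
  const sorted = [...items].sort((a, b) => a.y - b.y || a.x - b.x);
  const rows = [];
  for (const item of sorted) {
    const last = rows[rows.length - 1];
    // An item joins the current row if its y is within `tolerance` of the row's first item.
    if (last && Math.abs(last.y - item.y) <= tolerance) {
      last.items.push(item);
    } else {
      rows.push({ y: item.y, items: [item] });
    }
  }
  // Within each row, order cells left to right and keep only the text.
  return rows.map(row =>
    row.items.sort((a, b) => a.x - b.x).map(item => item.str)
  );
}
```

The tolerance absorbs small baseline differences between cells of the same row; a suitable value depends on the font size of the document.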

This package includes a build of pdf.js. Why? pdfjs-dist installs unneeded dependencies into production deployments.

Note: NO OCR!
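
Because there is no OCR, scanned pages will most likely come back without any text. The sketch below lists such pages so they can be handled separately. It assumes the result has a `pages` array whose entries carry a `content` array of items with a `str` field; this shape and the helper name `pagesWithoutText` are assumptions, not something this README confirms.

```javascript
// Minimal sketch: find pages that yielded no text (e.g. scanned images).
// Assumed result shape (hypothetical): { pages: [{ content: [{ str: '...' }] }] }.
function pagesWithoutText(data) {
  const empty = [];
  data.pages.forEach((page, index) => {
    // A page counts as having text if at least one item is non-blank.
    const hasText = (page.content || []).some(
      item => typeof item.str === 'string' && item.str.trim() !== ''
    );
    if (!hasText) empty.push(index + 1); // 1-based position within the result
  });
  return empty;
}
```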

Install

    npm install pdf.js-extract

Example Usage

javascript async with callback

    const PDFExtract = require('pdf.js-extract').PDFExtract;
    const pdfExtract = new PDFExtract();
    const options = {};
    pdfExtract.extract('test.pdf', options, (err, data) => {
      if (err) return console.log(err);
      console.log(data);
    });

typescript async with promise

    import {PDFExtract, PDFExtractOptions} from 'pdf.js-extract';
    const pdfExtract = new PDFExtract();
    const options: PDFExtractOptions = {};
    pdfExtract.extract('test.pdf', options)
      .then(data => console.log(data))
      .catch(err => console.log(err));

Options

    export interface PDFExtractOptions {
      firstPage?: number;
      lastPage?: number;
      password?: string;
      verbosity?: number;
      normalizeWhitespace?: boolean;
      disableCombineTextItems?: boolean;
    }
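
As an example of filling in these options, the sketch below builds a page-range options object and rejects obviously invalid ranges before they reach the extractor. It assumes `firstPage` and `lastPage` are 1-based and inclusive, which is not stated above; `pageRangeOptions` is a hypothetical helper, not part of this package.

```javascript
// Minimal sketch: build { firstPage, lastPage } with basic validation.
// Assumption (not confirmed by this README): page numbers are 1-based and inclusive.
function pageRangeOptions(firstPage, lastPage) {
  if (!Number.isInteger(firstPage) || !Number.isInteger(lastPage)) {
    throw new TypeError('firstPage and lastPage must be integers');
  }
  if (firstPage < 1 || lastPage < firstPage) {
    throw new RangeError('expected 1 <= firstPage <= lastPage');
  }
  return { firstPage, lastPage };
}

// Usage (needs the package and a local PDF, so it is left commented out):
// const { PDFExtract } = require('pdf.js-extract');
// new PDFExtract().extract('test.pdf', pageRangeOptions(1, 3))
//   .then(data => console.log(data))
//   .catch(err => console.log(err));
```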

Example Output