pdf

pdf2docx

Parse PDF file with PyMuPDF and generate docx with python-docx

Showing:

Popularity

Downloads/wk

0

GitHub Stars

223

Maintenance

Last Commit

3mos ago

Contributors

8

Package

Dependencies

6

License

GPL v3

Categories

Readme

pdf2docx

python-version codecov pypi-version license pypi-downloads

  • Extract data from PDF with PyMuPDF, e.g. text, images and drawings
  • Parse layout with rule, e.g. sections, paragraphs, images and tables
  • Generate docx with python-docx

Features

  • Parse and re-create page layout

    • page margin
    • section and column (1 or 2 columns only)
    • page header and footer
  • Parse and re-create paragraph

    • text in horizontal/vertical direction: from left to right, from bottom to top
    • font style, e.g. font name, size, weight, italic and color
    • text format, e.g. highlight, underline, strike-through
    • external hyper link
    • paragraph horizontal alignment (left/right/center/justify) and vertical spacing
    • list style
  • Parse and re-create image

    • in-line image
      • image in Gray/RGB/CMYK mode
      • transparent image
      • floating image, i.e. picture behind text
  • Parse and re-create table

    • border style, e.g. width, color
    • shading style, i.e. background color
    • merged cells
    • vertical direction cell
    • table with partly hidden borders
    • nested tables
  • Parsing pages with multi-processing

It can also be used as a tool to extract table contents since both table content and format/style is parsed.

Limitations

  • Text-based PDF file
  • Left to right language
  • Normal reading direction, no word transformation / rotation
  • Rule-based method can't 100% convert the PDF layout

Documentation

Sample

sample_compare.png

Rate & Review

Great Documentation0
Easy to Use0
Performant0
Highly Customizable0
Bleeding Edge0
Responsive Maintainers0
Poor Documentation0
Hard to Use0
Slow0
Buggy0
Abandoned0
Unwelcoming Community0
100
No reviews found
Be the first to rate

Alternatives

No alternatives found

Tutorials

No tutorials found
Add a tutorial