amazon-textract-textractor-helper

Analyze documents with Amazon Textract and generate output in multiple formats.

Textractor

textractor helps speed up proofs of concept (PoCs) by letting you quickly extract text, forms and tables from documents using Amazon Textract. It can generate output in several formats: raw JSON, JSON for each page in the document, text, text in reading order, key/value pairs exported as CSV, and tables exported as CSV. It can also generate insights or translate detected text using Amazon Comprehend, Amazon Comprehend Medical and Amazon Translate. It uses the Textract response parser library to consume the JSON returned by Amazon Textract.

Prerequisites

Overview

The project is now structured a little differently.

The original textractor implementation is still available under src, while the new PyPI packages for the different features are set up in the following folders:

  • helper
  • caller
  • overlayer
  • prettyprinter

All packages are available on PyPI as well.

Helper

A command-line tool for calling Textract. The command is essentially a wrapper around the caller, overlayer and prettyprinter methods. Usage in the README: https://github.com/aws-samples/amazon-textract-textractor/tree/master/helper

Caller

Makes it easy to call Textract and get the response, including when the response is paginated or stored on S3 through OutputConfig. Usage in README: https://github.com/aws-samples/amazon-textract-textractor/tree/master/caller
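
The pagination that the caller takes care of can be sketched as follows. This is not the caller package's implementation: collect_blocks and fetch_page are hypothetical names, and the pages are faked so the example runs without AWS credentials. The only assumption about the API is that each response page carries a Blocks list and, when more pages remain, a NextToken.

```python
from typing import Callable, Optional


def collect_blocks(fetch_page: Callable[[Optional[str]], dict]) -> list:
    """Gather the Blocks from every page of a paginated Textract result.

    fetch_page is a hypothetical callable: given a NextToken (or None for
    the first page) it returns one response dict, for example by wrapping
    a boto3 Textract client call.
    """
    blocks = []
    token = None
    while True:
        page = fetch_page(token)
        blocks.extend(page.get("Blocks", []))
        token = page.get("NextToken")
        if not token:
            # No NextToken means this was the last page.
            return blocks


# Hand-written stand-in for the real API (illustrative data only).
_fake_pages = {
    None: {"Blocks": [{"Id": "1"}, {"Id": "2"}], "NextToken": "t1"},
    "t1": {"Blocks": [{"Id": "3"}]},
}

all_blocks = collect_blocks(lambda token: _fake_pages[token])
print(len(all_blocks))
```

Passing the page-fetching step in as a function keeps the loop independent of how the response is retrieved, whether from the API or from JSON stored on S3.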

Prettyprinter

Formats the Textract JSON output for easy reading or for use in other systems (e.g. CSV). Usage in README: https://github.com/aws-samples/amazon-textract-textractor/tree/master/prettyprinter
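
As a rough illustration of turning Textract JSON into CSV, the sketch below writes one row per LINE block. It does not use the prettyprinter package: lines_to_csv is a hypothetical helper, and the sample response is hand-written and contains only the fields the sketch reads (BlockType, Page, Text, Confidence).

```python
import csv
import io


def lines_to_csv(response: dict) -> str:
    """Return a CSV string with one row per LINE block in a Textract response."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["page", "text", "confidence"])
    for block in response.get("Blocks", []):
        if block.get("BlockType") == "LINE":
            # Assumption: treat a missing Page key as page 1.
            writer.writerow([block.get("Page", 1), block["Text"], block["Confidence"]])
    return out.getvalue()


# Illustrative data, not real Textract output.
sample = {
    "Blocks": [
        {"BlockType": "PAGE", "Page": 1},
        {"BlockType": "LINE", "Page": 1, "Text": "Name: Jane Doe", "Confidence": 99.5},
        {"BlockType": "WORD", "Page": 1, "Text": "Name:", "Confidence": 99.7},
        {"BlockType": "LINE", "Page": 1, "Text": "Total, due", "Confidence": 98.0},
    ]
}

print(lines_to_csv(sample))
```

Using the csv module rather than joining strings by hand means text that itself contains commas is quoted correctly.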

Overlayer

Generates bounding boxes to make drawing visualizations easier. Usage in README: https://github.com/aws-samples/amazon-textract-textractor/tree/master/overlayer
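
Textract reports each bounding box as ratios of the page width and height (Left, Top, Width, Height), so drawing on an image requires scaling to pixels. A minimal sketch of that conversion follows. It is not the overlayer package's API, and to_pixel_box is a hypothetical name.

```python
def to_pixel_box(bbox: dict, image_width: int, image_height: int) -> tuple:
    """Convert a ratio-based Textract BoundingBox to (xmin, ymin, xmax, ymax) in pixels."""
    xmin = bbox["Left"] * image_width
    ymin = bbox["Top"] * image_height
    xmax = xmin + bbox["Width"] * image_width
    ymax = ymin + bbox["Height"] * image_height
    return tuple(round(v) for v in (xmin, ymin, xmax, ymax))


# Illustrative box covering the middle half of the page width.
box = {"Left": 0.25, "Top": 0.5, "Width": 0.5, "Height": 0.125}
print(to_pixel_box(box, image_width=800, image_height=1000))  # (200, 500, 600, 625)
```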

Original Implementation

Setup

  • Download the code and unzip it on your local machine.
  • Run python -m pip install -r requirements.txt

Usage

Format:

  • python3 textractor.py --documents [file|folder|S3Object|S3Folder] --text --forms --tables --region [AWSRegion] --insights --medical-insights --translate [LanguageCode]

Examples:

  • python3 textractor.py --documents mydoc.jpg --text
  • python3 textractor.py --documents ./mydocs/ --text --forms --tables
  • python3 textractor.py --documents s3://mybucket/mydoc.pdf --text --forms --tables
  • python3 textractor.py --documents s3://mybucket/myfolder/ --forms
  • python3 textractor.py --documents s3://mybucket/myfolder/ --text --forms --tables --region us-east-1 --insights --medical-insights --translate es

The path to a folder, whether on a local drive or in an S3 bucket, must end with /
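
The trailing-slash rule is what lets a tool tell a folder from a single document. Below is a sketch of how a --documents value might be classified. classify_documents_arg is a hypothetical helper, not a function from textractor.py.

```python
from urllib.parse import urlparse


def classify_documents_arg(value: str) -> dict:
    """Classify a --documents value as a local or S3 file or folder.

    Mirrors the documented rule: a value ending with "/" is a folder.
    """
    is_folder = value.endswith("/")
    if value.startswith("s3://"):
        parsed = urlparse(value)
        return {
            "storage": "s3",
            "kind": "folder" if is_folder else "object",
            "bucket": parsed.netloc,
            "key": parsed.path.lstrip("/"),
        }
    return {
        "storage": "local",
        "kind": "folder" if is_folder else "file",
        "path": value,
    }


for example in ("mydoc.jpg", "./mydocs/", "s3://mybucket/mydoc.pdf", "s3://mybucket/myfolder/"):
    print(example, "->", classify_documents_arg(example))
```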

At least one of the flags --text, --forms and --tables is required. You can combine all three.

--region is optional. The default is us-east-1 for local files and folders. For documents in S3, the region of the S3 bucket is used as the default AWS region for calling Amazon Textract.

--insights, --medical-insights and --translate are optional.
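
These rules map naturally onto argparse. The following sketch mirrors the documented flags but is not the parser from textractor.py. build_parser and parse_args are hypothetical names, and leaving --region unset (None) stands for "use the default described above".

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the flags documented in the Usage section."""
    parser = argparse.ArgumentParser(prog="textractor.py")
    parser.add_argument("--documents", required=True)
    parser.add_argument("--text", action="store_true")
    parser.add_argument("--forms", action="store_true")
    parser.add_argument("--tables", action="store_true")
    parser.add_argument("--region", default=None)
    parser.add_argument("--insights", action="store_true")
    parser.add_argument("--medical-insights", action="store_true")
    parser.add_argument("--translate", metavar="LanguageCode", default=None)
    return parser


def parse_args(argv: list) -> argparse.Namespace:
    """Parse argv and enforce that at least one extraction flag is set."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if not (args.text or args.forms or args.tables):
        # parser.error prints a usage message and exits with status 2.
        parser.error("at least one of --text, --forms or --tables is required")
    return args


args = parse_args(["--documents", "s3://mybucket/myfolder/", "--forms", "--translate", "es"])
print(args.forms, args.tables, args.translate)
```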

Generated Output

The tool generates several files, named as follows:

  • document-response.json: Raw JSON response of Amazon Textract API call.
  • document-page-n-response.json: Raw JSON blocks for each page in the document.
  • document-page-n-text.txt: Detected text for each page in the document.
  • document-page-n-text-inreadingorder.txt: Detected text in reading order (multi-column) for each page in the document.
  • document-page-n-forms.csv: Key/Value pairs for each page in the document.
  • document-page-n-tables.csv: Tables detected for each page in the document.
  • document-page-n-table-n-tables.csv: Pretty-printed tables detected for each page in the document.
  • document-page-n-insights-entities.csv: Entities in detected text for each page in the document.
  • document-page-n-insights-sentiment.csv: Sentiment in detected text for each page in the document.
  • document-page-n-insights-keyPhrases.csv: Key phrases in detected text for each page in the document.
  • document-page-n-insights-syntax.csv: Syntax in detected text for each page in the document.
  • document-page-n-medical-insights-entities.csv: Medical entities in detected text for each page in the document.
  • document-page-n-medical-insights-phi.json: Protected health information (PHI) in detected text for each page in the document.
  • document-page-n-text-translation.txt: Translation of detected text for each page in the document.
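
The naming pattern in this list can be expressed as a small helper. This is a sketch of the pattern only. page_output_names is hypothetical and covers a subset of the files. It assumes that "document" stands for the input file's base name, that n is a 1-based page number, and that each file is produced by the flag its name suggests, none of which the list states explicitly.

```python
def page_output_names(document: str, page: int, text: bool = False,
                      forms: bool = False, tables: bool = False) -> list:
    """List per-page output file names for the selected extraction flags."""
    prefix = f"{document}-page-{page}"
    names = [f"{prefix}-response.json"]
    if text:
        names += [f"{prefix}-text.txt", f"{prefix}-text-inreadingorder.txt"]
    if forms:
        names.append(f"{prefix}-forms.csv")
    if tables:
        names.append(f"{prefix}-tables.csv")
    return names


for name in page_output_names("mydoc", 1, text=True, forms=True):
    print(name)
```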

Arguments

Argument             Description
--documents          Name of the document or local folder/S3 bucket
--text               Extract text from the document
--forms              Extract key/value pairs from the document
--tables             Extract tables from the document
--region             AWS region to use for Amazon Textract API call. us-east-1 is default.
--insights           Generate files with sentiment, entities, syntax, and key phrases.
--medical-insights   Generate files with medical entities and PHI.
--translate          Generate file with translation.

Source Code


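# Modules from the src folder of the original implementation (assumed module names)
from tdp import DocumentProcessor
from trp import Document

# Call Amazon Textract and get JSON response
docproc = DocumentProcessor(bucketName, filePath, awsRegion, detectText, detectForms, detectTables)
response = docproc.run()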

# Get DOM
doc = Document(response)

# Iterate over elements in the document
for page in doc.pages:
    # Print lines and words
    for line in page.lines:
        print("Line: {}--{}".format(line.text, line.confidence))
        for word in line.words:
            print("Word: {}--{}".format(word.text, word.confidence))
    
    # Print tables
    for table in page.tables:
        for r, row in enumerate(table.rows):
            for c, cell in enumerate(row.cells):
                print("Table[{}][{}] = {}-{}".format(r, c, cell.text, cell.confidence))

    # Print fields
    for field in page.form.fields:
        print("Field: Key: {}, Value: {}".format(field.key.text, field.value.text))

    # Get field by key
    key = "Phone Number:"
    field = page.form.getFieldByKey(key)
    if field:
        print("Field: Key: {}, Value: {}".format(field.key, field.value))

    # Search fields by key
    key = "address"
    fields = page.form.searchFieldsByKey(key)
    for field in fields:
        print("Field: Key: {}, Value: {}".format(field.key, field.value))
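
The difference between getting a field by key and searching fields by key can be illustrated with plain dictionaries. This sketch assumes that the lookup is an exact match and that the search is a case-insensitive substring match. The snippet above suggests this ("Phone Number:" versus "address") but does not state it, and the function names and sample data below are hypothetical.

```python
def get_field_by_key(fields: dict, key: str):
    """Exact lookup: return the value stored under key, or None if absent."""
    return fields.get(key)


def search_fields_by_key(fields: dict, key: str) -> list:
    """Return (key, value) pairs whose key contains the search term, ignoring case."""
    needle = key.lower()
    return [(k, v) for k, v in fields.items() if needle in k.lower()]


# Illustrative form data, not real Textract output.
fields = {
    "Phone Number:": "555-0100",
    "Home Address:": "123 Any Street",
    "Mailing Address:": "PO Box 1",
}

print(get_field_by_key(fields, "Phone Number:"))
print(search_fields_by_key(fields, "address"))
```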

Cost

  • As you run this tool, it calls different APIs (Amazon Textract and, optionally, Amazon Comprehend, Amazon Comprehend Medical and Amazon Translate) in your AWS account. You will be charged for all API calls made as part of the analysis.
  • If you are using the free tier, the free limit when using only the --text flag differs from the limit when also using the --forms and --tables flags. Check the Amazon Textract pricing page for details.

Other Resources

License

This library is licensed under the Apache 2.0 License.
