Amazon Textract

With Amazon Textract, you can extract text from assets of the content types Document, Spreadsheet, Presentation, and Attachment. Brightspot associates the extracted text with the asset, so editors can search for the asset and use it in their own content.

Note

The Amazon Textract integration is currently not available for image files you add to Brightspot.

This section describes how to configure the Amazon Textract integration in Brightspot, and how to view extracted text.

Including Amazon Textract in a build

The following table lists the dependencies to include in your build configuration.

Artifact | Description
com.psddev:aws-textract | Exposes Textract-related controls in Sites & Settings, as well as the UI and processing to submit and display results of Textract jobs.

Runtime prerequisites

Configuring the integration

This topic explains how to configure the Amazon Textract integration in Brightspot.

To configure Amazon Textract:

  1. Obtain the following from your AWS console:

    • Name of the SQS queue managing messages between Amazon Textract and Brightspot. For a list of your available queues, see your SQS console.
    • ARN of the topic to which Amazon Textract publishes messages. For a list of your available topics, see your SNS console.
    • ARN of the role with permissions to make calls to Amazon Textract. For a list of your available AWS roles, see your IAM console.
  2. Go to Admin > Sites & Settings > Sites > Global.
  3. Configure the interface with Amazon Textract by doing the following:

    1. Expand Integrations > AWS Textract.
    2. Toggle on Enable Textract Service.
    3. Enter the SQS Queue Name, Topic ARN, and Role ARN you determined in step 1.
    4. In the Minimum Block Confidence field, enter the minimum confidence score that a block of text must have to be included in the extracted text. Generally, a higher threshold yields more accurate results (fewer false positives) but may discard correctly recognized text (more false negatives).
  4. Configure the thumbnail generator by doing the following:

    1. Expand CMS > DAM Document Data Extraction Settings.
    2. Under Extractor Services, add an item, and select Textract Document Data Extractor.
    3. From the Thumbnail Extractor list, select Pdf Document Data Extractor.
  5. Click Save.
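
The three values gathered in step 1 are easy to mix up, for example by pasting a queue URL where a queue name is expected. The following Python sketch checks their general shape before you enter them. The function name and the regular expressions are illustrative assumptions based on general AWS naming formats; they are not part of Brightspot or of any AWS SDK, and they do not confirm that the resources actually exist.

```python
import re

# Illustrative patterns based on general AWS naming formats (not exhaustive).
# Queue name: up to 80 letters, digits, hyphens, or underscores; FIFO queues end in ".fifo".
QUEUE_NAME = re.compile(r"^[A-Za-z0-9_-]{1,80}$|^[A-Za-z0-9_-]{1,75}\.fifo$")
# SNS topic ARN: arn:<partition>:sns:<region>:<12-digit account>:<topic name>
TOPIC_ARN = re.compile(r"^arn:aws[a-z-]*:sns:[a-z0-9-]+:\d{12}:[A-Za-z0-9_-]{1,256}(\.fifo)?$")
# IAM role ARN: arn:<partition>:iam::<12-digit account>:role/<optional path/><role name>
ROLE_ARN = re.compile(r"^arn:aws[a-z-]*:iam::\d{12}:role/[\w+=,.@/-]+$")


def check_textract_settings(queue_name: str, topic_arn: str, role_arn: str) -> list[str]:
    """Return a list of problems; an empty list means all three values look well formed."""
    problems = []
    if not QUEUE_NAME.match(queue_name):
        problems.append("SQS Queue Name: expected a queue name, not a URL or an ARN")
    if not TOPIC_ARN.match(topic_arn):
        problems.append("Topic ARN: expected arn:aws:sns:<region>:<account>:<topic>")
    if not ROLE_ARN.match(role_arn):
        problems.append("Role ARN: expected arn:aws:iam::<account>:role/<name>")
    return problems


# Made-up sample values for illustration only.
print(check_textract_settings(
    "textract-results",
    "arn:aws:sns:us-east-1:123456789012:textract-topic",
    "arn:aws:iam::123456789012:role/TextractRole",
))  # prints []
```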

Textract is configured, and editors can view the results of a text extraction in the content edit form.
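
To illustrate the trade-off behind the Minimum Block Confidence setting, the following Python sketch filters a list of blocks shaped like those in an Amazon Textract response, where each block carries a BlockType, a Text value, and a Confidence score from 0 to 100. The function name and sample data are made up, and how Brightspot applies the threshold internally (which block types it considers, and the scale the field expects) is an assumption, not something this documentation states.

```python
def filter_lines(blocks, min_confidence):
    """Keep the text of LINE blocks whose Confidence is at least min_confidence."""
    return [
        block["Text"]
        for block in blocks
        if block.get("BlockType") == "LINE"
        and block.get("Confidence", 0.0) >= min_confidence
    ]


# Made-up sample in the shape of Textract blocks; the second line simulates a poor scan.
blocks = [
    {"BlockType": "PAGE"},
    {"BlockType": "LINE", "Text": "Quarterly report", "Confidence": 99.4},
    {"BlockType": "LINE", "Text": "Qvarterly rep0rt", "Confidence": 62.0},
    {"BlockType": "WORD", "Text": "Quarterly", "Confidence": 99.6},
]

print(filter_lines(blocks, 90.0))  # prints ['Quarterly report']
```

Lowering the threshold to 50.0 in this sample would also keep the misread line, which is the false-positive risk that a low setting carries.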

Viewing an asset's extracted text

When you upload a rich-text file (PDF, Document, Spreadsheet, or Presentation), Brightspot automatically submits it to Amazon Textract for analysis. You can view the extracted text in the content edit form.
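
For orientation, the following Python sketch shows how the Topic ARN, Role ARN, and SQS queue typically fit together in Amazon Textract's asynchronous API: a job is started with a notification channel, and a completion message later arrives on the queue by way of the topic. This is a sketch of the general AWS pattern, not of Brightspot's implementation. The function names are hypothetical, and it assumes the file is stored in Amazon S3, that the StartDocumentTextDetection operation is used, and that the topic delivers to the queue without raw message delivery.

```python
import json


def build_start_request(bucket, key, topic_arn, role_arn):
    """Build keyword arguments for Textract's StartDocumentTextDetection operation."""
    return {
        "DocumentLocation": {"S3Object": {"Bucket": bucket, "Name": key}},
        # Textract assumes role_arn to publish the job's completion status to topic_arn.
        "NotificationChannel": {"SNSTopicArn": topic_arn, "RoleArn": role_arn},
    }


def parse_completion(sqs_body):
    """Return (job_id, status) from an SQS message body that wraps an SNS notification."""
    envelope = json.loads(sqs_body)                 # SNS envelope delivered to the queue
    notification = json.loads(envelope["Message"])  # Textract's own JSON payload
    return notification["JobId"], notification["Status"]
```

With the AWS SDK for Python installed and credentials configured, the request could be sent with `boto3.client("textract").start_document_text_detection(**request)`; the job ID returned there matches the one that `parse_completion` later reads from the queue.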

To view an asset's extracted text:

  1. Search for and open the asset's content edit page.
  2. Expand Main > Extracted Data.

When editors search in the search panel for keywords that appear in an asset's extracted text, Brightspot includes the asset in the search results. For example, if an editor searches for "best CMS" and that text appears in an uploaded document, Brightspot includes the corresponding attachment in the search results.