Amazon Textract
With Amazon Textract, you can extract text from assets based on the content types Document, Spreadsheet, Presentation, and Attachment. Brightspot associates the extracted text with the asset, and editors can then search for and use your asset in their own content.
The Amazon Textract integration is currently not available for image files you add to Brightspot.
This section describes how to configure the Amazon Textract integration in Brightspot, and how to view extracted text.
Including Amazon Textract in a build
The following table lists the dependencies to include in your build configuration.
| Artifact | Description |
|---|---|
| `com.psddev:aws-textract` | Exposes Textract-related controls in Sites & Settings, as well as the UI and processing to submit and display results of Textract jobs. |
Runtime prerequisites
- Developer configuration: Extend the `TextractPostProcessor` class to process what Textract returns.
- Ops configuration: Textract requires an SQS queue, an SNS topic ARN, and an IAM role ARN to make API calls. The role must have permission to call Textract (see How Amazon Textract Works with IAM - Amazon Textract). Textract publishes a completion notification to the topic, and the queue delivers the completion status to Brightspot (see Configuring Amazon Textract for Asynchronous Operations - Amazon Textract).
- CMS configuration: Configure the site interfacing with Amazon Textract. For details, see Configuring the Amazon Textract integration.
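To illustrate how the topic, role, and queue from the ops configuration fit together, the following is a minimal sketch of the parameters Amazon Textract's asynchronous `StartDocumentTextDetection` operation accepts and of the completion message that arrives on the queue. It is not Brightspot's implementation. The helper names `build_async_request` and `parse_completion` are hypothetical, and the sketch assumes the document is stored in S3 and that the SNS topic delivers to SQS without raw message delivery, so the Textract payload is nested inside the SNS envelope's `Message` field.

```python
import json


def build_async_request(bucket, key, topic_arn, role_arn):
    """Build keyword arguments for an asynchronous Textract text-detection job.

    The returned dict can be passed to boto3 as
    textract_client.start_document_text_detection(**kwargs).
    Assumption: the document lives in S3 under the given bucket and key.
    """
    return {
        "DocumentLocation": {"S3Object": {"Bucket": bucket, "Name": key}},
        # Textract assumes the role to publish the completion
        # notification to the topic.
        "NotificationChannel": {"SNSTopicArn": topic_arn, "RoleArn": role_arn},
    }


def parse_completion(sqs_body):
    """Return (job_id, status) from an SQS message body wrapping an SNS notification."""
    envelope = json.loads(sqs_body)            # SNS envelope delivered to SQS
    message = json.loads(envelope["Message"])  # Textract's own JSON payload
    return message["JobId"], message["Status"]
```

When the status is `SUCCEEDED`, the job ID can be used to fetch the results with Textract's `GetDocumentTextDetection` operation.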
Configuring the integration
This topic explains how to configure the Amazon Textract integration in Brightspot.
To configure Amazon Textract:
1. Obtain the following from your AWS console:
   - Name of the SQS queue managing messages between Amazon Textract and Brightspot. For a list of your available queues, see your SQS console.
   - ARN of the topic to which Amazon Textract publishes messages. For a list of your available topics, see your SNS console.
   - ARN of the role with permissions to make calls to Amazon Textract. For a list of your available AWS roles, see your IAM console.
2. In Brightspot, navigate to Admin > Sites & Settings > Sites > Global.
3. Configure the interface with Amazon Textract by doing the following:
   - Expand Integrations > AWS Textract.
   - Toggle on Enable Textract Service.
   - Enter the SQS Queue Name, Topic ARN, and Role ARN you obtained in step 1.
   - In the Minimum Block Confidence field, enter the minimum confidence value required for the text within each block. Higher values generally give more accurate results (fewer false positives) but may miss some matches (more false negatives).
4. Configure the thumbnail generator by doing the following:
   - Expand CMS > DAM Document Data Extraction Settings.
   - Under Extractor Services, add Textract Document Data Extractor.
   - From the Thumbnail Extractor list, select Pdf Document Data Extractor.
5. Click Save.
Textract is configured, and editors can view the results of a text extraction in the content edit form.
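The trade-off behind the Minimum Block Confidence setting can be sketched as a filter over the blocks Textract returns. This is an illustration only, not Brightspot's code. The function name `lines_above_confidence` is hypothetical, and the sketch assumes the threshold uses Textract's own 0 to 100 confidence scale, which may differ from the scale of the Brightspot field.

```python
def lines_above_confidence(blocks, minimum_confidence):
    """Return the text of LINE blocks whose confidence meets the threshold.

    `blocks` is the "Blocks" list from a Textract response; each block is a
    dict with "BlockType" and, for text blocks, "Text" and "Confidence".
    Assumption: the threshold is on Textract's 0-100 scale.
    """
    return [
        block["Text"]
        for block in blocks
        if block.get("BlockType") == "LINE"
        and block.get("Confidence", 0.0) >= minimum_confidence
    ]


# Hand-written sample blocks for illustration, not real Textract output.
sample_blocks = [
    {"BlockType": "PAGE"},
    {"BlockType": "LINE", "Text": "Quarterly report", "Confidence": 99.1},
    {"BlockType": "LINE", "Text": "Qvarterly rep0rt", "Confidence": 41.7},
    {"BlockType": "WORD", "Text": "Quarterly", "Confidence": 99.3},
]

strict = lines_above_confidence(sample_blocks, 80.0)   # keeps only the clean line
lenient = lines_above_confidence(sample_blocks, 40.0)  # also keeps the noisy line
```

With the stricter threshold the misread line is dropped. With the lenient one it is kept, which adds noisy text to what editors can search.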
Viewing an asset's extracted text
When you upload a rich-text file (PDF, Document, Spreadsheet, or Presentation), Brightspot automatically submits it to Amazon Textract for analysis. You can view the extracted text in the content edit form.
To view an asset's extracted text:
- Search for and open the asset's content edit page.
- Expand Main > Extracted Data.
When editors use the search panel to search for keywords that appear in an asset's extracted text, Brightspot includes the asset in the search results. For example, if an editor searches for "best CMS" and that text appears in an uploaded document, Brightspot includes the corresponding attachment in the search results.