7 changes: 7 additions & 0 deletions api-reference/workflow/workflows.mdx
@@ -2250,6 +2250,7 @@ An **Extract** node has a `type` of `structured_data_extractor` and a `subtype`
"json_schema": "<json-schema>",
"extraction_guidance": "<extraction-guidance>"
},
"output_mode": "<elements_with_extracted_data|extracted_data_only>",
"provider": "<provider>",
"model": "<model>"
}
@@ -2267,6 +2268,7 @@ An **Extract** node has a `type` of `structured_data_extractor` and a `subtype`
"json_schema": "<json-schema>",
"extraction_guidance": "<extraction-guidance>"
},
"output_mode": "<elements_with_extracted_data|extracted_data_only>",
"provider": "<provider>",
"model": "<model>"
}
@@ -2282,6 +2284,11 @@ Fields for `settings` include:
- `json_schema`: The extraction schema, in [OpenAI Structured Outputs](https://platform.openai.com/docs/guides/structured-outputs#supported-schemas) format, for the structured data that you want to extract, expressed as a single string.
- `extraction_guidance`: The extraction prompt for the structured data that you want to extract, expressed as a single string.

- `output_mode`: _Optional_. The mode in which to output the extracted data. Allowed values include:

- `elements_with_extracted_data` (the default): Output the extracted data as JSON in an `extracted_data` field inside `metadata` within a parent `DocumentData` element, followed by the other built-in Unstructured document elements.
- `extracted_data_only`: Output only the extracted data as JSON, without any parent `DocumentData` element or any other built-in Unstructured document elements.

- Allowed values for `provider` and `model` include the following:

- `"provider": "anthropic"`
43 changes: 38 additions & 5 deletions ui/data-extractor.mdx
@@ -84,7 +84,9 @@ In the preceding output, the `text` fields contain information about the listing
the square footage, one of the listing's features, and so on. However,
you might want the information presented as `street_address`, `square_footage`, `features`, and so on.

By using the structured data extractor in your Unstructured workflows, you could have Unstructured extract the listing's data in a custom-defined output format similar to the following (ellipses indicate omitted fields for brevity):
By using the structured data extractor in your Unstructured workflows, you can have Unstructured extract the listing's data in a custom-defined output format.

The first custom-defined output format is known as the _elements with extracted data_ format, as follows (ellipses indicate omitted fields for brevity):

```json
[
@@ -127,10 +129,34 @@ By using the structured data extractor in your Unstructured workflows, you could
]
```

In the preceding output, the first document element, of type `DocumentData`, has an `extracted_data` field within `metadata`
In the preceding elements with extracted data output, the first document element in the JSON output, of type `DocumentData`, has an `extracted_data` field within `metadata`
that contains a representation of the document's data in the custom output format that you specify. Beginning with the second document element and continuing
until the end of the document, Unstructured also outputs the document's data as a series of Unstructured's document elements and metadata as it normally would.

The second custom-defined output format is known as the _extracted data only_ format, as follows:

```json
{
"street_address": "221 Queen Street, Melbourne VIC 3000",
"square_footage": 2800,
"price": 1000000,
"features": [
"Recently renovated kitchen",
"Smart home automation system",
"2-car garage with storage space",
"Spacious open-plan layout with natural lighting",
"Designer kitchen with quartz countertops and built-in appliances",
"Master suite with walk-in closet and en-suite bath",
"Covered patio and landscaped backyard garden"
],
"agent_contact": {
"phone": "+01 555 123456"
}
}
```

In the preceding extracted data only output, the document's data is output as JSON and only in the custom output format that you specify.
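
As a minimal sketch of how downstream code might handle either output format, the following Python helper returns the custom extracted data from a parsed output. The function name `get_extracted_data` is hypothetical, and the sketch assumes that each document element carries its element type in a `type` key and that the elements with extracted data format is a JSON array while the extracted data only format is a single JSON object.

```python
from typing import Any, Optional


def get_extracted_data(output: Any) -> Optional[Any]:
    """Return the custom extracted data from either output format.

    Assumption: a list is the "elements with extracted data" format, and
    anything else is the "extracted data only" format.
    """
    if isinstance(output, list):
        # Look for the DocumentData element and read metadata.extracted_data.
        for element in output:
            if isinstance(element, dict) and element.get("type") == "DocumentData":
                return element.get("metadata", {}).get("extracted_data")
        # No DocumentData element was found in the list.
        return None
    # The output is already the extracted data on its own.
    return output
```

Calling the helper on the parsed JSON from either format would then give the same custom-defined object, so the rest of a pipeline does not need to know which `output_mode` was used.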

To use the structured data extractor, you can provide Unstructured with an _extraction schema_, which defines the structure of the data for Unstructured to extract.
Or you can specify an _extraction prompt_ that guides Unstructured on how to extract the data from the source documents, in the format that you want.

@@ -297,7 +323,14 @@ below the threshold of what the structured data extractor typically needs for th

## Saving the extracted data separately

Unstructured does not recommend that you save `DocumentData` elements as rows or entries within a traditional SQL-style destination database or vector store, for the following reasons:
Unstructured recommends that you save the extracted JSON data only to a blob storage, file storage, or NoSQL database destination location.

For the extracted data only output format, because the extracted JSON data is custom-defined, Unstructured's destination connectors
cannot insert it as rows, records, or entries in a traditional SQL-style destination database or vector store.
Instead, save the extracted JSON data to a blob storage, file storage, or NoSQL database destination location. You can
then use your own custom code to insert the extracted JSON data into a traditional SQL-style destination database or vector store.
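
The following Python sketch illustrates what such custom code might look like for the real estate listing example, using the standard library's `sqlite3` module as a stand-in for a SQL-style database. The function name `save_listing`, the `listings` table, and its column layout are all hypothetical; the sketch also assumes that the extracted JSON has already been loaded into a Python dictionary.

```python
import json
import sqlite3
from typing import Any


def save_listing(conn: sqlite3.Connection, listing: dict[str, Any]) -> None:
    """Insert one extracted listing object as a row (hypothetical table layout)."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS listings (
            street_address TEXT,
            square_footage INTEGER,
            price INTEGER,
            features TEXT,
            agent_phone TEXT
        )
        """
    )
    conn.execute(
        "INSERT INTO listings VALUES (?, ?, ?, ?, ?)",
        (
            listing.get("street_address"),
            listing.get("square_footage"),
            listing.get("price"),
            # Store the list of features as a JSON string in a single column.
            json.dumps(listing.get("features", [])),
            # Flatten the nested agent_contact object into one column.
            (listing.get("agent_contact") or {}).get("phone"),
        ),
    )
    conn.commit()
```

A caller would open a connection (for example, `sqlite3.connect("listings.db")`), load the saved JSON file with `json.load`, and pass the resulting dictionary to `save_listing`. Flattening nested fields and serializing arrays is one design choice among several; a production schema would likely differ.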

For the elements with extracted data output format, Unstructured does not recommend that you save `DocumentData` elements as rows, records, or entries within a traditional SQL-style destination database or vector store either, for the following reasons:

- Saving a mixture of `DocumentData` elements and default Unstructured elements (such as `Title`, `NarrativeText`, and `Table`)
in the same table, collection, or index might cause unexpected performance issues or return less useful search and query results.
@@ -456,7 +489,7 @@ Extract the plant information for each of the plants in this document, and prese
- humidity: The humidity requirements for the plant (for example: 'Low', 'Medium', 'High').
```

And Unstructured's output would look like the following:
And Unstructured's elements with extracted data output would look like the following:

```json
[
@@ -739,7 +772,7 @@ Before returning the extraction:
6. Ensure diagnoses and comorbidities are non-overlapping.
```

And Unstructured's output would look like the following:
And Unstructured's elements with extracted data output would look like the following:

```json
[