From 5ecb956db1d83c17a3a2231301a108844a02b1f6 Mon Sep 17 00:00:00 2001 From: Paul Cornell Date: Tue, 10 Feb 2026 12:12:10 -0800 Subject: [PATCH] UI/API: Structured data extractor - options for extracted data with or without document elements --- api-reference/workflow/workflows.mdx | 7 +++++ ui/data-extractor.mdx | 43 ++++++++++++++++++++++++---- 2 files changed, 45 insertions(+), 5 deletions(-) diff --git a/api-reference/workflow/workflows.mdx b/api-reference/workflow/workflows.mdx index a3482d14..76c277b9 100644 --- a/api-reference/workflow/workflows.mdx +++ b/api-reference/workflow/workflows.mdx @@ -2250,6 +2250,7 @@ An **Extract** node has a `type` of `structured_data_extractor` and a `subtype` "json_schema": "", "extraction_guidance": "" }, + "output_mode": "", "provider": "", "model": "" } @@ -2267,6 +2268,7 @@ An **Extract** node has a `type` of `structured_data_extractor` and a `subtype` "json_schema": "", "extraction_guidance": "" }, + "output_mode": "", "provider": "", "model": "" } @@ -2282,6 +2284,11 @@ Fields for `settings` include: - `json_schema`: The extraction schema, in [OpenAI Structured Outputs](https://platform.openai.com/docs/guides/structured-outputs#supported-schemas) format, for the structured data that you want to extract, expressed as a single string. - `extraction_guidance`: The extraction prompt for the structured data that you want to extract, expressed as a single string. +- `output_mode`: _Optional_. The mode in which to output the extracted data. Allowed values include: + + - `elements_with_extracted_data` (the default, if not otherwise specified): Output the extracted data as JSON into an `extracted_data` field inside of `metadata` within a parent `DocumentData` element, followed by other built-in Unstructured document elements. + - `extracted_data_only`: Output only the extracted data as JSON, without any parent `DocumentData` element or any other built-in Unstructured document elements. 
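A minimal sketch of how downstream code might read the extracted data under either `output_mode` value described above. The function name `get_extracted_data` is hypothetical, and the sketch assumes that each document element stores its type under a `type` key and that `extracted_data_only` output is a single JSON object; neither detail is confirmed by this patch.

```python
import json
from typing import Any


def get_extracted_data(output: Any) -> Any:
    """Return the extracted data from parsed output in either output mode.

    Hypothetical helper, not part of the Unstructured API.
    """
    # elements_with_extracted_data: a list of document elements in which a
    # DocumentData element carries the data under metadata.extracted_data.
    # Assumption: the element type is stored under the "type" key.
    if isinstance(output, list):
        for element in output:
            if isinstance(element, dict) and element.get("type") == "DocumentData":
                return element.get("metadata", {}).get("extracted_data")
        raise ValueError("no DocumentData element found in output")
    # extracted_data_only: the output is assumed to be the extracted JSON itself.
    return output


if __name__ == "__main__":
    with_elements = json.loads(
        '[{"type": "DocumentData", "metadata": {"extracted_data": {"price": 1000000}}},'
        ' {"type": "Title", "text": "Listing"}]'
    )
    data_only = json.loads('{"price": 1000000}')
    # Both modes should yield the same extracted object.
    print(get_extracted_data(with_elements) == get_extracted_data(data_only))
```

Branching on the shape of the parsed output, rather than on a separately tracked mode flag, lets the same reader handle files produced before and after `output_mode` is changed on a workflow.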
+ - Allowed values for `provider` and `model` include the following: - `"provider": "anthropic"` diff --git a/ui/data-extractor.mdx b/ui/data-extractor.mdx index 313f970e..24e860a1 100644 --- a/ui/data-extractor.mdx +++ b/ui/data-extractor.mdx @@ -84,7 +84,9 @@ In the preceding output, the `text` fields contain information about the listing the square footage, one of the listing's features, and so on. However, you might want the information presented as `street_address`, `square_footage`, `features`, and so on. -By using the structured data extractor in your Unstructured workflows, you could have Unstructured extract the listing's data in a custom-defined output format similar to the following (ellipses indicate omitted fields for brevity): +By using the structured data extractor in your Unstructured workflows, you can have Unstructured extract the listing's data in a custom-defined output format. + +The first custom-defined output format is known as the _elements with extracted data_ format, as follows (ellipses indicate omitted fields for brevity): ```json [ @@ -127,10 +129,34 @@ By using the structured data extractor in your Unstructured workflows, you could ] ``` -In the preceding output, the first document element, of type `DocumentData`, has an `extracted_data` field within `metadata` +In the preceding elements with extracted data output, the first document element in the JSON output, of type `DocumentData`, has an `extracted_data` field within `metadata` that contains a representation of the document's data in the custom output format that you specify. Beginning with the second document element and continuing until the end of the document, Unstructured also outputs the document's data as a series of Unstructured's document elements and metadata as it normally would. 
+The second custom-defined output format is known as the _extracted data only_ format, as follows: + +```json +{ + "street_address": "221 Queen Street, Melbourne VIC 3000", + "square_footage": 2800, + "price": 1000000, + "features": [ + "Recently renovated kitchen", + "Smart home automation system", + "2-car garage with storage space", + "Spacious open-plan layout with natural lighting", + "Designer kitchen with quartz countertops and built-in appliances", + "Master suite with walk-in closet and en-suite bath", + "Covered patio and landscaped backyard garden" + ], + "agent_contact": { + "phone": "+01 555 123456" + } +} +``` + +In the preceding extracted data only output, the document's data is output as JSON and only in the custom output format that you specify. + To use the structured data extractor, you can provide Unstructured with an _extraction schema_, which defines the structure of the data for Unstructured to extract. Or you can specify an _extraction prompt_ that guides Unstructured on how to extract the data from the source documents, in the format that you want. @@ -297,7 +323,14 @@ below the threshold of what the structured data extractor typically needs for th ## Saving the extracted data separately -Unstructured does not recommend that you save `DocumentData` elements as rows or entries within a traditional SQL-style destination database or vector store, for the following reasons: +Unstructured recommends that you save the extracted JSON data only to a blob storage, file storage, or NoSQL database destination location. + +For the extracted data only output format, because of the custom nature of the extracted JSON data, Unstructured's destination connectors +do not have the built-in ability to insert custom-defined JSON objects as rows, records, or entries in a traditional SQL-style destination database or vector store. +Instead, you should save the extracted JSON data to a blob storage, file storage, or NoSQL database destination location. 
You can +then use your own custom code to insert the extracted JSON data into a traditional SQL-style destination database or vector store. + +For the elements with extracted data output format, Unstructured does not recommend that you save `DocumentData` elements as rows, records, or entries within a traditional SQL-style destination database or vector store either, for the following reasons: - Saving a mixture of `DocumentData` elements and default Unstructured elements such as `Title`, `NarrativeText`, and `Table` elements and so on in the same table, collection, or index might cause unexpected performance issues or might return less useful search and query results. @@ -456,7 +489,7 @@ Extract the plant information for each of the plants in this document, and prese - humidity: The humidity requirements for the plant (for example: 'Low', 'Medium', 'High'). ``` -And Unstructured's output would look like the following: +And Unstructured's elements with extracted data output would look like the following: ```json [ @@ -739,7 +772,7 @@ Before returning the extraction: 6. Ensure diagnoses and comorbidities are non-overlapping. ``` -And Unstructured's output would look like the following: +And Unstructured's elements with extracted data output would look like the following: ```json [