
feat: add JsonItemsDecoder for streaming large JSON responses#1026

Open
devin-ai-integration[bot] wants to merge 9 commits into main from devin/1778790048-streaming-json-items-decoder

@devin-ai-integration

Contributor

Summary

Adds a new declarative JsonItemsDecoder that streams elements of a nested array out of a single JSON document one at a time, so manifest-only connectors can decode multi-GB JSON responses without OOMing.

Related to https://github.com/airbytehq/oncall/issues/12143:

That issue surfaced an OOM at the 8 GB memory cap (exit code 137) on source-amazon-seller-partner's Brand Analytics streams. Today the only ways to JSON-decode a single large document in the declarative CDK are JsonDecoder (full response.content or json.loads) and GzipDecoder wrapping JsonDecoder (full decompress → full parse). Both materialize the entire payload in memory. The closed connector-side fix (airbytehq/airbyte#77709) added a custom Python component for this, but per maintainer feedback we want it as a first-class CDK component so any connector can opt in via YAML.

What's in this PR

CDK-side only:

  • New parser: JsonItemsParser in airbyte_cdk/sources/declarative/decoders/composite_raw_decoder.py, alongside JsonParser / JsonLineParser / CsvParser / GzipParser. Uses ijson.items(stream, f"{items_path}.item") to lazily yield each element of the configured array.
  • New schema entry: JsonItemsDecoder in declarative_component_schema.yaml with items_path (required) and encoding (default utf-8). Added to the anyOf unions for GzipDecoder.decoder, ZipfileDecoder.decoder, and the top-level decoder / download_decoder slots.
  • Pydantic models regenerated via poe assemble.
  • Factory wiring in model_to_component_factory.py: create_json_items_decoder + a new branch in _get_parser that builds JsonItemsParser(items_path=..., encoding=...).
  • Runtime dep: ijson = "^3.3.0" added to [tool.poetry.dependencies] (this is what was in the now-closed build: add ijson as a runtime dependency #1011). Since the CDK now imports ijson directly, dropped the DEP002 ignore entry that PR added.
  • Unit tests in unit_tests/sources/declarative/decoders/test_composite_decoder.py covering: top-level / nested / empty array paths, encoding, gzip composition, required-field validation, and a _CountingStream-based test confirming the parser yields the first item before consuming the full document (lazy streaming).
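
For reviewers who have not used ijson, the parser and the lazy-streaming test described above can be sketched roughly as follows. This is an illustration, not the code in composite_raw_decoder.py: `iter_items`, `CountingStream`, `ByteReader` and `build_document` are hypothetical names, and only the `ijson.items(stream, f"{items_path}.item")` call is taken from the description. The import guard exists only so the sketch loads where ijson is not installed.

```python
import io
import json
from typing import Any, Iterator, Protocol

try:
    import ijson  # third-party; the runtime dependency this PR adds
except ImportError:  # only so the sketch still loads without ijson
    ijson = None


class ByteReader(Protocol):
    """Anything with a bytes-returning read(), which is all the sketch needs."""

    def read(self, size: int = -1) -> bytes: ...


def iter_items(stream: ByteReader, items_path: str) -> Iterator[Any]:
    """Lazily yield each element of the array found at `items_path`.

    Hypothetical helper: it mirrors the f"{items_path}.item" prefix from the
    PR description but is not the CDK's JsonItemsParser.
    """
    if ijson is None:
        raise RuntimeError("ijson is required for this sketch")
    yield from ijson.items(stream, f"{items_path}.item")


class CountingStream:
    """Minimal binary reader that records how many bytes it has handed out."""

    def __init__(self, payload: bytes) -> None:
        self._buffer = io.BytesIO(payload)
        self.bytes_read = 0

    def read(self, size: int = -1) -> bytes:
        chunk = self._buffer.read(size)
        self.bytes_read += len(chunk)
        return chunk


def build_document(record_count: int) -> bytes:
    """Build one JSON document with the records under a nested array."""
    records = [{"id": i, "term": f"term-{i}"} for i in range(record_count)]
    document = {"meta": {"version": 1}, "data": {"records": records}}
    return json.dumps(document).encode("utf-8")


if __name__ == "__main__" and ijson is not None:
    payload = build_document(50_000)
    stream = CountingStream(payload)
    first = next(iter_items(stream, "data.records"))
    # The first record is available after only part of the payload was read.
    print(first, stream.bytes_read, len(payload))
```

The counting wrapper deliberately exposes only `read()`, so every byte the parser consumes passes through the counter. The document is made large on purpose, because the parser reads in buffered chunks and a tiny payload would be consumed in a single read.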

Example connector manifest after this lands:

download_decoder:
  type: GzipDecoder
  decoder:
    type: JsonItemsDecoder
    items_path: dataByDepartmentAndSearchTerm
download_extractor:
  type: DpathExtractor
  field_path: []
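
In plain Python, the layering this manifest asks for might look like the sketch below: a gzip reader wrapped around the response body, with ijson pulling array elements out of the decompressed stream. It is an assumption about the shape of the pipeline, not the CDK's implementation. `decode_gzipped_items` and `make_sample_response` are hypothetical helpers, and the fields inside the sample document are invented, apart from the `dataByDepartmentAndSearchTerm` key taken from the manifest.

```python
import gzip
import io
import json
from typing import Any, BinaryIO, Iterator

try:
    import ijson  # third-party; the runtime dependency this PR adds
except ImportError:  # only so the sketch still loads without ijson
    ijson = None


def decode_gzipped_items(compressed: BinaryIO, items_path: str) -> Iterator[Any]:
    """Yield array elements from a gzip-compressed JSON document."""
    if ijson is None:
        raise RuntimeError("ijson is required for this sketch")
    # GzipFile decompresses on demand as the parser asks for more bytes,
    # so neither the compressed nor the decompressed body is held whole.
    with gzip.GzipFile(fileobj=compressed, mode="rb") as decompressed:
        yield from ijson.items(decompressed, f"{items_path}.item")


def make_sample_response() -> io.BytesIO:
    """Build a small gzip body; the field names inside are made up."""
    document = {
        "reportSpecification": {"reportType": "EXAMPLE"},
        "dataByDepartmentAndSearchTerm": [
            {"searchTerm": "usb cable", "rank": 1},
            {"searchTerm": "phone case", "rank": 2},
        ],
    }
    return io.BytesIO(gzip.compress(json.dumps(document).encode("utf-8")))


if __name__ == "__main__" and ijson is not None:
    body = make_sample_response()
    for record in decode_gzipped_items(body, "dataByDepartmentAndSearchTerm"):
        print(record)
```

With `field_path: []` on the extractor, each yielded element would presumably be emitted as a record as-is, which is why the sketch stops at the decoded items.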

Declarative-First Evaluation

This is the declarative approach. The previous attempt used a custom Python component in the connector; the new component is a generic, reusable building block that any manifest-only connector can opt into via YAML, no custom code required. Existing declarative decoders (JsonDecoder, GzipDecoder wrapping JsonDecoder) cannot stream a single large document — they buffer the full payload — so neither can solve the OOM on their own.

Local verification

  • poetry run pytest unit_tests/sources/declarative/decoders/ -x → 60 passed
  • poetry run pytest unit_tests/sources/declarative/parsers/ -x → 157 passed
  • poetry run ruff check + ruff format --check on changed files → clean
  • poetry run mypy --config-file mypy.ini on the two modified airbyte_cdk/ files → clean

Review & Testing Checklist for Human

  • Confirm the items_path syntax (ijson dotted path with implicit .item suffix, not JSONPath) is the right ergonomics for connector authors. The schema description and parser docstring spell this out, but it differs from how DpathExtractor uses field_path.
  • Sanity-check the anyOf wiring: JsonItemsDecoder is now valid wherever JsonDecoder is valid (top-level decoder / download_decoder, inside GzipDecoder, inside ZipfileDecoder). Confirm that's the desired scope.
  • Confirm pinning to ijson = "^3.3.0" is acceptable (3.3.0 is the floor where ijson.items(stream, "path.item") behaves consistently across backends; 3.5.0 is what poetry currently resolves to).
  • Test plan: once this lands and an SDM image with ijson ships, the follow-up connector PR replaces source-amazon-seller-partner's custom GzipJsonStreamingItemsDecoder with a manifest-only JsonItemsDecoder (5 Brand Analytics streams). End-to-end memory metrics will be captured then; the CDK unit test only asserts the parser is lazy (does not read the full document before yielding the first item) and that ijson is wired correctly.
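
To make the items_path question in the first checklist item concrete, here is a small sketch of how a dotted items_path would map to an ijson prefix, next to the list-of-segments form that field_path uses. Both helper names are hypothetical, and treating an empty path as "top-level array" is an assumption about ijson usage, not something this PR states.

```python
from typing import List


def to_ijson_prefix(items_path: str) -> str:
    """Return the ijson prefix that selects each element of the array.

    ijson addresses the elements of an array at `a.b` as `a.b.item`; the
    elements of a top-level array are addressed as plain `item`.
    ASSUMPTION: an empty items_path stands for a top-level array here.
    """
    return f"{items_path}.item" if items_path else "item"


def to_field_path(items_path: str) -> List[str]:
    """Express the same location as DpathExtractor-style list segments."""
    return [segment for segment in items_path.split(".") if segment]


if __name__ == "__main__":
    for path in ("dataByDepartmentAndSearchTerm", "data.records", ""):
        print(repr(path), "->", to_ijson_prefix(path), to_field_path(path))
```

One ergonomic difference worth weighing: a dotted string cannot tell a nested key apart from a single key that itself contains a dot, whereas a list of segments can.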

Notes

  • JsonItemsDecoder is a sibling of JsonDecoder rather than a flag on it — JSON path semantics and encoding are first-class enough that overloading JsonDecoder would muddy the schema for the common case.
  • Naming was discussed and confirmed: JsonItemsDecoder + items_path. Avoided "streaming" in the name because streaming is implicit for CompositeRawDecoder-backed parsers.
  • Follow-up: once this lands and a new SDM image with ijson is published, I'll open a connector-side PR that:
    • Deletes GzipJsonStreamingItemsDecoder from source-amazon-seller-partner/components.py
    • Switches the 5 Brand Analytics streams in manifest.yaml to JsonItemsDecoder + DpathExtractor
    • Bumps baseImage and the connector PATCH version

Link to Devin session: https://app.devin.ai/sessions/e31a7df6ebe54ce4a68e0eecc7117555

devin-ai-integration Bot and others added 2 commits May 14, 2026 20:20
Adds the ijson streaming JSON parser as a direct dependency so connectors that
ship inside the source-declarative-manifest base image can stream-parse very
large JSON response bodies without materializing the full document in memory.

Motivation: source-amazon-seller-partner currently OOMs while reading
GET_BRAND_ANALYTICS_SEARCH_TERMS_REPORT documents that can exceed 3 GB
uncompressed. See airbytehq/oncall#12143.

Adds a new declarative decoder, JsonItemsDecoder, that streams elements
of a nested array out of a single JSON document one at a time using the
ijson library. This lets manifest-only connectors decode multi-GB JSON
responses (e.g. Amazon Seller Partner Brand Analytics reports) without
loading the full document into memory.

- New `JsonItemsParser` in composite_raw_decoder.py (wraps ijson.items)
- New `JsonItemsDecoder` schema entry, wired into GzipDecoder /
  ZipfileDecoder / top-level decoder unions so it composes with the
  existing decoder hierarchy
- Pydantic models regenerated from schema
- Factory: create_json_items_decoder + JsonItemsDecoderModel handling
  in _get_parser
- Drop ijson from deptry DEP002 ignore list now that the CDK imports it
  directly; update pyproject.toml comment to reflect first-class use
- Unit tests covering top-level, nested, empty, encoding, gzip
  composition, missing path validation, and lazy streaming behavior
@devin-ai-integration

Contributor Author

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

@github-actions


👋 Greetings, Airbyte Team Member!

Here are some helpful tips and reminders for your convenience.


Testing This CDK Version

You can test this version of the CDK using the following:

# Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@devin/1778790048-streaming-json-items-decoder#egg=airbyte-python-cdk[dev]' --help

# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch devin/1778790048-streaming-json-items-decoder

PR Slash Commands

Airbyte Maintainers can execute the following slash commands on your PR:

  • /autofix - Fixes most formatting and linting issues
  • /poetry-lock - Updates poetry.lock file
  • /test - Runs connector tests with the updated CDK
  • /prerelease - Triggers a prerelease publish with default arguments
  • /poe build - Regenerate git-committed build artifacts, such as the pydantic models which are generated from the manifest JSON schema in YAML.
  • /poe <command> - Runs any poe command in the CDK environment

…msDecoder

The earlier regeneration via `poe assemble` produced datamodel-code-generator drift
unrelated to this PR (Optional[conint(ge=1)] instead of Optional[int] + ge=1 kwarg,
removed ScopesJoinStrategy, reordered classes, whitespace in descriptions). That
drift broke mypy on Python 3.13. Reset the generated file to match main and add
only the new `JsonItemsDecoder` Pydantic class manually, mirroring the style of
`JsonDecoder` / `JsonlDecoder`.
@github-actions

github-actions Bot commented May 14, 2026


PyTest Results (Fast)

4 102 tests  +8   4 091 ✅ +9   7m 51s ⏱️ +15s
    1 suites ±0      11 💤  - 1 
    1 files   ±0       0 ❌ ±0 

Results for commit 6be09f3. ± Comparison against base commit 7da322d.

♻️ This comment has been updated with latest results.

@github-actions

github-actions Bot commented May 14, 2026


PyTest Results (Full)

4 105 tests  +8   4 093 ✅ +8   11m 56s ⏱️ + 2m 32s
    1 suites ±0      12 💤 ±0 
    1 files   ±0       0 ❌ ±0 

Results for commit 6be09f3. ± Comparison against base commit 7da322d.

♻️ This comment has been updated with latest results.

ZaneHyattAB and others added 2 commits June 8, 2026 03:39
Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
…ecoder in all Union types

Ran `poe assemble` (datamodel-codegen) against the YAML schema to regenerate
the Pydantic model. This adds JsonItemsDecoder to:
- GzipDecoder.decoder
- ZipfileDecoder.decoder
- SimpleRetriever.decoder
- AsyncRetriever.decoder
- AsyncRetriever.download_decoder

The YAML schema already had JsonItemsDecoder in these anyOf unions, but the
previously generated Python model was stale and missing it. This fixes manifest
validation for GzipDecoder -> JsonItemsDecoder pipelines (needed by
airbytehq/airbyte#78360).

Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
@devin-ai-integration

Contributor Author

Regenerated declarative_component_schema.py: JsonItemsDecoder now accepted in all decoder unions

The YAML schema already included JsonItemsDecoder in the anyOf unions for GzipDecoder.decoder, ZipfileDecoder.decoder, SimpleRetriever.decoder, AsyncRetriever.decoder, and AsyncRetriever.download_decoder — but the generated Pydantic model was stale and missing it.

What changed: Ran poe assemble (datamodel-codegen==0.26.3) to regenerate declarative_component_schema.py from the YAML, then applied ruff format. The regenerated model now includes JsonItemsDecoder in all 5 Union types:

# GzipDecoder.decoder (was missing JsonItemsDecoder)
decoder: Union[CsvDecoder, GzipDecoder, JsonDecoder, JsonItemsDecoder, JsonlDecoder]

# AsyncRetriever.decoder and download_decoder (were missing JsonItemsDecoder)
decoder: Optional[Union[CsvDecoder, GzipDecoder, JsonDecoder, JsonItemsDecoder, JsonlDecoder, ...]]
download_decoder: Optional[Union[CsvDecoder, GzipDecoder, JsonDecoder, JsonItemsDecoder, JsonlDecoder, ...]]

Verification:

  • poetry run ruff check + ruff format --check → clean
  • poetry run pytest unit_tests/sources/declarative/decoders/ -x → all passed
  • poetry run pytest unit_tests/sources/declarative/parsers/ -x → 165 passed

This should unblock the downstream airbytehq/airbyte#78360 manifest validation for GzipDecoder → JsonItemsDecoder pipelines.


Devin session

@devin-ai-integration devin-ai-integration Bot marked this pull request as ready for review June 8, 2026 04:00
@devin-ai-integration

Contributor Author

I reran the cross-PR smoke test against source-amazon-seller-partner PR airbytehq/airbyte#78360 after the declarative component model regeneration. JsonItemsDecoder is now included in the relevant generated Union[...] definitions, and both source-declarative-manifest check and discover pass locally with the valid Amazon Seller Partner customer config. This resolves the previous manifest validation blocker for GzipDecoder -> JsonItemsDecoder; full Brand Analytics OOM validation should happen after this CDK PR is merged/released and the ASP connector PR is rebuilt against the new SDM/CDK image.


Devin session

ZaneHyattAB and others added 3 commits June 8, 2026 04:08
Instead of a full regeneration (which introduced conint/confloat drift that
broke MyPy), this takes the main branch model and makes only the minimal
edits needed:

1. Add JsonItemsDecoder class definition (after JsonDecoder)
2. Add JsonItemsDecoder to GzipDecoder.decoder Union
3. Add JsonItemsDecoder to ZipfileDecoder.decoder Union
4. Add JsonItemsDecoder to SimpleRetriever.decoder Union
5. Add JsonItemsDecoder to AsyncRetriever.decoder Union
6. Add JsonItemsDecoder to AsyncRetriever.download_decoder Union

Verified: MyPy clean (454 source files), ruff lint+format clean,
all decoder (60) and parser (165) tests pass.

Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
The previous poetry lock resolved mypy 2.1.0 (was 1.14.1 on main), which
introduced new type errors in unrelated files. Using --no-update keeps
existing versions pinned and only adds the new ijson dependency.

Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
@ZaneHyattAB

ZaneHyattAB commented Jun 8, 2026

Contributor

/prerelease

Prerelease Job Info

This job triggers the publish workflow with default arguments to create a prerelease.

Prerelease job started... Check job output.

❌ Failed to trigger prerelease workflow.
Prerelease Job Info

This job triggers the publish workflow with default arguments to create a prerelease.

Prerelease job started... Check job output.

✅ Prerelease workflow triggered successfully.

View the publish workflow run: https://github.com/airbytehq/airbyte-python-cdk/actions/runs/27167986557

@devin-ai-integration

Contributor Author

❌ Cannot revive Devin session - the session is too old. Please start a new session instead.

Anatolii Yatsuk (tolik0) added a commit that referenced this pull request Jun 11, 2026
Adds a streaming JSON decoder for very large single-document JSON responses where the
records live under a nested array. JsonItemsParser yields each array element via ijson,
so peak memory is bounded by a single record rather than the whole document. Composes
with the existing CompositeRawDecoder hierarchy (gzip/zip) and is wired into the decoder
unions + factory. Adds ijson as a first-class CDK dependency.

JsonItemsParser also honors a configured non-UTF-8 encoding by transcoding to UTF-8 bytes
via a lazy streaming recoder, keeping ijson on its native byte backend.

Adopts and supersedes #1026.

Co-Authored-By: devin-ai-integration[bot] <devin-ai-integration[bot]@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
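
The "lazy streaming recoder" mentioned in the commit message above can be approximated with the standard library. The sketch below uses `codecs.EncodedFile`, which wraps a binary stream and transcodes on each read. This is a stand-in chosen for illustration, not necessarily what the merged parser does, and `as_utf8_stream` is a hypothetical name.

```python
import codecs
import io
import json
from typing import BinaryIO, Union


def as_utf8_stream(
    raw: BinaryIO, source_encoding: str
) -> Union[BinaryIO, codecs.StreamRecoder]:
    """Wrap `raw` so that reads return UTF-8 bytes, transcoded chunk by chunk.

    Stand-in sketch: the stdlib recoder decodes bytes read from `raw` using
    `source_encoding` and re-encodes them as UTF-8 on the way out.
    """
    if codecs.lookup(source_encoding).name == "utf-8":
        return raw  # already UTF-8, nothing to transcode
    return codecs.EncodedFile(raw, data_encoding="utf-8", file_encoding=source_encoding)


if __name__ == "__main__":
    latin1_body = io.BytesIO('{"items": [{"name": "café"}]}'.encode("latin-1"))
    recoded = as_utf8_stream(latin1_body, "latin-1")
    chunks = []
    while True:
        chunk = recoded.read(8)  # small reads to show nothing is slurped up front
        if not chunk:
            break
        chunks.append(chunk)
    print(json.loads(b"".join(chunks)))
```

Because the wrapper only pulls from the underlying stream when it is read, a byte-oriented parser placed on top of it would keep receiving UTF-8 without the whole body being decoded first.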