Add result_set_type_hints for precise complex type conversion#690

Merged
laughingman7743 merged 18 commits into master from
feat/result-set-type-hints
Feb 28, 2026

laughingman7743 commented Feb 28, 2026

WHAT

Add result_set_type_hints parameter to all cursor execute() methods and change default behavior for nested type conversion.

Breaking Change

_convert_value() no longer performs heuristic type inference (isdigit, float detection, bool detection) for elements inside complex types parsed from Athena's native format. Values now remain as strings by default.

Before: [{string: 1234}, {string: "value"}] (int inferred from varchar)
After: [{string: "1234"}, {string: "value"}] (stays as string)
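The difference between the two behaviors can be illustrated with a minimal sketch. Both function names below are hypothetical and are not PyAthena's actual code; they only mimic the kind of inference described above (isdigit, float detection, bool detection) next to the new string-by-default rule.

```python
from typing import Any, Optional


def infer_heuristically(value: str) -> Any:
    """Old-style guess (illustrative only): infer a Python type from the text."""
    if value == "null":
        return None
    if value in ("true", "false"):
        return value == "true"
    if value.isdigit():
        return int(value)
    try:
        return float(value)
    except ValueError:
        return value


def keep_as_string(value: str) -> Optional[str]:
    """New-style default (illustrative only): only "null" is converted."""
    if value == "null":
        return None
    return value


# A varchar element that happens to look numeric.
element = "1234"
old_result = infer_heuristically(element)  # becomes an int
new_result = keep_as_string(element)       # stays a str
```

The heuristic cannot tell a varchar "1234" from an integer 1234 because the text alone carries no type information, which is why the default now leaves the value untouched.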

New result_set_type_hints Parameter

Users who need typed conversion of nested elements can provide full Athena DDL type signatures:

cursor.execute(
    "SELECT * FROM table",
    result_set_type_hints={
        "field": "array(row(name varchar, age integer))",
        "tags": "array(varchar)",
        "metadata": "map(varchar, integer)",
    }
)

Changes

Core (pyathena/converter.py)

  • TypeNode dataclass for representing parsed type trees
  • parse_type_signature() recursive parser for Athena DDL type strings
  • Typed conversion functions: _convert_value_with_type(), _convert_typed_array(), _convert_typed_map(), _convert_typed_struct()
  • _convert_value() changed to string-by-default (only null → None)
  • Converter.convert() and DefaultTypeConverter.convert() extended with type_hint parameter
  • Parsed type hint caching in DefaultTypeConverter._parsed_hints
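As a rough illustration of what a recursive parser for these type strings might look like, here is a minimal sketch. The names TypeNode and parse_type_signature come from the list above, but the fields (type_name, children, field_names) and the helper _split_top_level are assumptions made for this example, not the actual implementation. The sketch assumes named row fields and balanced parentheses.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TypeNode:
    # Field layout is an assumption for this sketch.
    type_name: str
    children: List["TypeNode"] = field(default_factory=list)
    field_names: Optional[List[str]] = None


def _split_top_level(text: str) -> List[str]:
    """Split on commas that are not nested inside parentheses."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(text):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 0:
            parts.append(text[start:i].strip())
            start = i + 1
    tail = text[start:].strip()
    if tail:
        parts.append(tail)
    return parts


def parse_type_signature(signature: str) -> TypeNode:
    signature = signature.strip()
    open_idx = signature.find("(")
    if open_idx == -1:
        return TypeNode(signature.lower())  # plain scalar such as varchar
    name = signature[:open_idx].strip().lower()
    inner = signature[open_idx + 1 : signature.rfind(")")]
    args = _split_top_level(inner)
    if name in ("row", "struct"):
        names, children = [], []
        for arg in args:
            # Each argument is "<field name> <field type>".
            field_name, _, field_type = arg.partition(" ")
            names.append(field_name)
            children.append(parse_type_signature(field_type))
        return TypeNode(name, children, names)
    if name in ("array", "map"):
        return TypeNode(name, [parse_type_signature(a) for a in args])
    # Parameterized scalars such as decimal(10, 2): keep only the base name.
    return TypeNode(name)


tree = parse_type_signature("array(row(name varchar, age integer))")
```

With a tree like this, a typed converter can walk the parsed value and the TypeNode in parallel, converting each leaf according to its declared type instead of guessing.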

Parameter threading (result_set_type_hints added to)

  • All cursor execute() methods (10 cursor types across sync/async, standard/pandas/arrow/polars/s3fs)
  • All result set constructors and _get_rows() methods
  • All converter convert() methods

Tests (tests/pyathena/test_converter.py)

  • Updated expectations for breaking change
  • 12 tests for parse_type_signature() DDL parser
  • 16 tests for typed conversion via DefaultTypeConverter.convert() with type hints
  • 99/99 tests pass

WHY

The Athena GetQueryResults API only returns base type names (e.g., "array", "map", "row") in ColumnInfo.Type, without nested type signatures. This caused _convert_value() to use heuristic inference, incorrectly converting varchar values like "1234" to int(1234) inside complex types.

Closes #689

🤖 Generated with Claude Code

laughingman7743 and others added 3 commits February 28, 2026 13:40
The Athena GetQueryResults API only returns base type names (e.g., "array",
"map", "row") without nested type signatures, causing _convert_value() to
use heuristic inference that incorrectly converts varchar values like "1234"
to int(1234) inside complex types.

This adds a result_set_type_hints parameter to all cursor execute() methods
so users can provide full Athena DDL type signatures for precise conversion.
Also changes the default behavior so nested elements without type hints
remain as strings instead of being heuristically inferred (breaking change).

Closes #689

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move TypeNode, TypeSignatureParser, and TypedValueConverter into a new
pyathena/parser.py module. TypedValueConverter receives converter
dependencies via constructor injection to avoid circular imports.
Also moves _split_array_items to parser.py as a shared parsing utility.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
laughingman7743 force-pushed the feat/result-set-type-hints branch from 2251ca6 to 5d05791 February 28, 2026 04:40
laughingman7743 and others added 10 commits February 28, 2026 14:03
Native format complex types (map, struct) now return string values
instead of type-inferred values to prevent incorrect conversions
(e.g., varchar "1234" → int 1234). JSON format paths are unaffected.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
TestTypeSignatureParser and TestTypedValueConverter test the parser
module directly, so they belong in a dedicated test file.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Place the private helper function before public classes for
clearer top-down reading order.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Document class-based vs standalone function test patterns,
fixture usage with indirect parametrization, and integration
vs unit test distinction.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ll string

- Only pass type_hint kwarg when hint exists (avoids breaking custom Converters)
- Use json.dumps for dict/list in JSON paths instead of str() (fixes nested structs)
- Use convert() instead of _convert_element() in JSON paths (preserves "null" strings)
- Use _split_array_items in typed map native path (supports nested row/map values)
- Normalize result_set_type_hints keys to lowercase for case-insensitive lookup
- Cache DefaultTypeConverter instance in S3FS converter
- Add unit tests for all fixed edge cases

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
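Two of the fixes in the commit above can be shown with standard-library behavior alone. The sketch below is not the project's code: normalize_hint_keys and is_valid_json are hypothetical helpers, and the sample data is invented for illustration.

```python
import json
from typing import Dict

nested = {"name": "alice", "tags": ["a", "b"], "active": None}

# str() yields a Python repr (single quotes, None), which is not JSON.
as_repr = str(nested)
# json.dumps() yields text that json.loads() can read back.
as_json = json.dumps(nested)


def is_valid_json(text: str) -> bool:
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def normalize_hint_keys(hints: Dict[str, str]) -> Dict[str, str]:
    """Lowercase column names so lookups are case-insensitive (string keys only)."""
    return {key.lower(): value for key, value in hints.items()}


hints = normalize_hint_keys({"Tags": "array(varchar)", "METADATA": "map(varchar, integer)"})
```

Re-serializing a nested value with str() would hand the next conversion step text it cannot parse as JSON, which is consistent with the nested-struct problem the commit describes.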
…tion

- Fix _parse_type_hint docstring to match renamed method
- Add docstring to DefaultTypeConverter.convert
- Remove unused delimiter parameter from _split_type_args
- Use TYPE_CHECKING for DefaultTypeConverter type annotation in S3FS converter

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The JSON parse path in _convert_typed_struct used positional indexing
(field_types[i]) to assign types to fields. This breaks when JSON key
order differs from the type definition order. Use _get_field_type()
which matches by field name first, falling back to positional index.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
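The name-first, position-second lookup described in the commit above could be sketched as follows. The function get_field_type here is a simplified stand-in with an assumed signature, not the project's _get_field_type(), and types are represented as plain strings for brevity.

```python
import json
from typing import List, Optional


def get_field_type(
    field_names: List[str], field_types: List[str], name: str, index: int
) -> Optional[str]:
    """Resolve a field's type by name, falling back to its position."""
    by_name = dict(zip(field_names, field_types))
    if name in by_name:
        return by_name[name]
    if 0 <= index < len(field_types):
        return field_types[index]
    return None


# Type definition order: row(name varchar, age integer)
field_names = ["name", "age"]
field_types = ["varchar", "integer"]

# JSON key order differs from the definition order.
payload = json.loads('{"age": "30", "name": "alice"}')

positional = {key: field_types[i] for i, key in enumerate(payload)}
resolved = {
    key: get_field_type(field_names, field_types, key, i)
    for i, key in enumerate(payload)
}
```

Here the purely positional mapping assigns varchar to age, while the name-based lookup assigns integer, matching the type definition.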
Document the motivation (Athena API lacks nested type info), usage,
constraints (nested arrays in native format, Arrow/Pandas/Polars),
and the breaking change in 3.30.0 (complex type internals kept as
strings without hints).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
laughingman7743 marked this pull request as ready for review February 28, 2026 08:54
laughingman7743 marked this pull request as draft February 28, 2026 10:21
laughingman7743 and others added 4 commits February 28, 2026 19:22
- ResultSet: Pre-compute column_type_hints tuple once in
  _process_metadata instead of per-cell dict creation and .lower()
  lookup. Replace **({} if ... else {}) with simple if/else branching.
  Applied to AthenaResultSet, AthenaDictResultSet, and S3FS.

- Array JSON guard: Add JSON detection heuristic (check for '"', '[{',
  '[null') before json.loads in _convert_typed_array, matching the
  existing pattern in map/struct to avoid JSONDecodeError exceptions
  on native format strings.

- TypeNode field lookup: Add cached _field_type_map dict for O(1)
  name-based field type resolution, replacing O(n) list.index() in
  _get_field_type.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
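The JSON guard for arrays described above might look roughly like the sketch below. The marker list is taken from the commit message; whether the real check uses substring or prefix matching is not stated, so the substring test here is an assumption, as are the function names and the simplified native-format fallback.

```python
import json
from typing import Any, List

# Markers named in the commit message.
_JSON_MARKERS = ('"', "[{", "[null")


def looks_like_json_array(value: str) -> bool:
    """Cheap pre-check so json.loads is not attempted on native-format text."""
    return any(marker in value for marker in _JSON_MARKERS)


def parse_array(value: str) -> List[Any]:
    if looks_like_json_array(value):
        try:
            return json.loads(value)
        except json.JSONDecodeError:
            pass  # fall through to the native-format path
    # Simplified native format: [a, b, c] with no nesting or quoting.
    inner = value.strip()[1:-1].strip()
    return [item.strip() for item in inner.split(",")] if inner else []


json_style = parse_array('["a", "b"]')
native_style = parse_array("[a, b]")
```

The guard trades a few substring checks for avoiding an exception on every native-format cell, which matters when the conversion runs once per value.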
Check column metadata types against _COMPLEX_TYPES (array, map, row,
struct) in _process_metadata. Only compute and store column type hints
when the result set actually contains complex type columns with
matching hints. This eliminates all hint-related overhead in the hot
loop for queries that return only scalar types.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
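One way to picture the pre-computation described above is the sketch below. _COMPLEX_TYPES is named in the commit message; build_column_type_hints and the metadata layout (dicts with "Name" and "Type" keys, as in Athena's ColumnInfo) are assumptions for illustration.

```python
from typing import Dict, List, Optional, Tuple

_COMPLEX_TYPES = frozenset({"array", "map", "row", "struct"})


def build_column_type_hints(
    metadata: List[Dict[str, str]], hints: Optional[Dict[str, str]]
) -> Optional[Tuple[Optional[str], ...]]:
    """Return one hint per column, or None when no complex column has a hint."""
    if not hints:
        return None
    normalized = {key.lower(): value for key, value in hints.items()}
    per_column = tuple(
        normalized.get(col["Name"].lower())
        if col["Type"].lower() in _COMPLEX_TYPES
        else None
        for col in metadata
    )
    # Returning None lets the row loop skip hint handling entirely.
    return per_column if any(per_column) else None


metadata = [
    {"Name": "id", "Type": "integer"},
    {"Name": "Tags", "Type": "array"},
]
column_hints = build_column_type_hints(metadata, {"tags": "array(varchar)"})
```

Because the tuple is built once per result set, the per-cell work reduces to an index lookup, and scalar-only queries pay nothing.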
- Add '[[' to JSON detection guard in _convert_typed_array so nested
  arrays like [[1,2],[3]] are parsed via json.loads instead of falling
  through to native format (which returns None for nested arrays).

- Pre-compute _column_types and _column_names tuples once in
  _process_metadata. Use them in _get_rows to eliminate per-cell
  meta.get("Type") and meta.get("Name") dict lookups.

- S3FSResultSet._fetch() reuses _column_types from parent instead of
  rebuilding from self.description on every call.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…d hints

- Normalize Hive-style DDL syntax (array<struct<a:int>>) to Trino-style
  so users can paste DESCRIBE TABLE output directly as type hints
- Resolve type alias "int" to "integer" in the parser
- Fall back to untyped conversion when typed converter returns None,
  preventing silent data loss on parse failures
- Support integer keys in result_set_type_hints for index-based column
  resolution, enabling hints for duplicate column names
- Update type annotations across all cursor/result_set files

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
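A simplified sketch of the normalization and index-based lookup described above follows. All three function names are hypothetical, the regex-based rewrite is only one possible approach, and the precedence of integer keys over name keys in lookup_hint is an assumption; the commit message does not specify these details.

```python
import re
from typing import Dict, Optional, Union

_TYPE_ALIASES = {"int": "integer"}


def normalize_hive_syntax(signature: str) -> str:
    """Rewrite Hive-style brackets and 'name:type' pairs into Trino-style text."""
    text = signature.replace("<", "(").replace(">", ")")
    return re.sub(r"\s*:\s*", " ", text)


def resolve_alias(type_name: str) -> str:
    """Map a type alias to its canonical name; unknown names pass through."""
    return _TYPE_ALIASES.get(type_name.lower(), type_name.lower())


def lookup_hint(
    hints: Dict[Union[int, str], str], index: int, name: str
) -> Optional[str]:
    """Prefer a hint keyed by column index, then one keyed by lowercase name."""
    if index in hints:
        return hints[index]
    return hints.get(name.lower())


normalized = normalize_hive_syntax("array<struct<a:int>>")
```

Index keys are what make hints usable when a query returns two columns with the same name, since a name-keyed dict cannot distinguish them.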
- Use _find_matching_paren() instead of assuming closing ')' is at
  end of string, so trailing modifiers don't break parsing
- Replace naive comma split with _split_array_items() in unnamed
  struct path to handle nested values correctly

Closes #693, closes #694.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
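The parenthesis-matching fix above can be illustrated with a small depth counter. find_matching_paren below is a hypothetical stand-in for the project's _find_matching_paren(), whose real signature is not shown in this conversation.

```python
def find_matching_paren(text: str, open_idx: int) -> int:
    """Return the index of the ')' that closes the '(' at open_idx, or -1."""
    depth = 0
    for i in range(open_idx, len(text)):
        if text[i] == "(":
            depth += 1
        elif text[i] == ")":
            depth -= 1
            if depth == 0:
                return i
    return -1


signature = "timestamp(3) with time zone"
open_idx = signature.index("(")

# Naive approach: assume the closing parenthesis is the last character.
naive_args = signature[open_idx + 1 : -1]

# Depth-aware approach: stop at the parenthesis that actually matches.
close_idx = find_matching_paren(signature, open_idx)
matched_args = signature[open_idx + 1 : close_idx]
```

With a trailing modifier, the naive slice swallows part of the modifier into the argument text, while the depth-aware version isolates just the arguments.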
laughingman7743 merged commit 90e67bc into master Feb 28, 2026
15 checks passed
laughingman7743 deleted the feat/result-set-type-hints branch February 28, 2026 13:03