```python
polars_df = df.to_polars()  # pl.DataFrame
py_dict = df.to_pydict()    # dict[str, list]
py_list = df.to_pylist()    # list[dict]
count = df.count()          # int
df = df.cache()             # materialize in memory, return DataFrame
```

### Date and Timestamp Type Conversion
Async iteration is also supported via `async for batch in df: ...` (or
`df.execute_stream()`), which is useful when batches are interleaved with
other I/O.

### Caching Intermediate Results

`df.cache()` materializes a DataFrame as an in-memory table and returns a new
DataFrame backed by it. Reach for it when the same intermediate result feeds
multiple downstream queries — without `cache()`, each branch re-executes the
full upstream plan (re-reading files, recomputing filters/aggregates).

```python
base = (
ctx.read_parquet("orders.parquet")
.filter(col("status") == "shipped")
.cache() # materialize once, reuse below
)
by_region = base.aggregate(["region"], [F.sum(col("amount")).alias("total")])
by_customer = base.aggregate(["customer"], [F.sum(col("amount")).alias("total")])
```

Skip `cache()` for single-use DataFrames — the lazy plan is already optimal.

The cached table is owned by the DataFrame returned from `cache()` (and any
DataFrames chained from it). To free the memory, drop every reference — let
them go out of scope, or `del base; del by_region; del by_customer`.

### Writing Results

```python
# Illustrative output paths; each writer executes the plan and
# persists the result to the given location.
df.write_parquet("out/orders.parquet")
df.write_csv("out/orders.csv")
df.write_json("out/orders.json")
```