Commit d528870 (1 parent: c657dad)

Add details on caching to skill

1 file changed: skills/datafusion_python/SKILL.md (24 additions, 0 deletions)
````diff
@@ -253,6 +253,7 @@ polars_df = df.to_polars() # pl.DataFrame
 py_dict = df.to_pydict() # dict[str, list]
 py_list = df.to_pylist() # list[dict]
 count = df.count() # int
+df = df.cache() # materialize in memory, return DataFrame
 ```

 ### Date and Timestamp Type Conversion
@@ -309,6 +310,29 @@ Async iteration is also supported via `async for batch in df: ...` (or
 `df.execute_stream()`), which is useful when batches are interleaved with
 other I/O.

+### Caching Intermediate Results
+
+`df.cache()` materializes a DataFrame as an in-memory table and returns a new
+DataFrame backed by it. Reach for it when the same intermediate result feeds
+multiple downstream queries — without `cache()`, each branch re-executes the
+full upstream plan (re-reading files, recomputing filters/aggregates).
+
+```python
+base = (
+    ctx.read_parquet("orders.parquet")
+    .filter(col("status") == "shipped")
+    .cache()  # materialize once, reuse below
+)
+by_region = base.aggregate(["region"], [F.sum(col("amount")).alias("total")])
+by_customer = base.aggregate(["customer"], [F.sum(col("amount")).alias("total")])
+```
+
+Skip `cache()` for single-use DataFrames — the lazy plan is already optimal.
+
+The cached table is owned by the DataFrame returned from `cache()` (and any
+DataFrames chained from it). To free the memory, drop every reference — let
+them go out of scope, or `del base; del by_region; del by_customer`.
+
 ### Writing Results

 ```python
````
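The added section's claim that, without `cache()`, each downstream branch re-executes the full upstream plan can be illustrated with a stdlib-only sketch. Nothing here is the DataFusion API: `scan_orders`, `shipped`, and `total_by` are hypothetical stand-ins for the scan, filter, and aggregate steps, with a counter playing the role of file I/O.

```python
scan_count = 0

def scan_orders():
    """Hypothetical stand-in for re-reading orders.parquet."""
    global scan_count
    scan_count += 1  # each call simulates one full scan of the source
    return [
        {"region": "EU", "customer": "a", "amount": 10, "status": "shipped"},
        {"region": "US", "customer": "b", "amount": 20, "status": "shipped"},
        {"region": "EU", "customer": "c", "amount": 5, "status": "pending"},
    ]

def shipped():
    # Un-cached lazy plan: every call re-runs the scan and the filter.
    return [r for r in scan_orders() if r["status"] == "shipped"]

def total_by(key, rows):
    out = {}
    for r in rows:
        out[r[key]] = out.get(r[key], 0) + r["amount"]
    return out

# Without cache(): two downstream aggregates trigger two full scans.
by_region = total_by("region", shipped())
by_customer = total_by("customer", shipped())
assert scan_count == 2

# With cache(): materialize the filtered rows once, reuse them.
scan_count = 0
base = shipped()                       # one scan; result held in memory
by_region = total_by("region", base)   # no further scans
by_customer = total_by("customer", base)
assert scan_count == 1
```

The counter makes the cost model concrete: caching trades memory for avoided recomputation, which only pays off once a result has more than one consumer.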

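The ownership note at the end of the addition (the cached table lives until every DataFrame referencing it is dropped) can likewise be sketched with a `weakref` probe. `CachedTable` is a hypothetical stand-in for the in-memory table, and immediate reclamation after the last `del` assumes CPython's reference counting.

```python
import weakref

class CachedTable:
    """Hypothetical stand-in for the table a cache() call materializes."""
    pass

table = CachedTable()
probe = weakref.ref(table)  # observe the table's lifetime without owning it

base = table     # plays the DataFrame returned by cache()
branch = table   # plays a DataFrame chained from it
del table

assert probe() is not None  # still alive: base and branch hold references
del base
del branch
assert probe() is None      # last reference dropped -> memory reclaimed
```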