polars_df = df.to_polars() # pl.DataFrame
py_dict = df.to_pydict() # dict[str, list]
py_list = df.to_pylist() # list[dict]
count = df.count() # int
df = df.cache() # materialize in memory, return DataFrame
```

### Date and Timestamp Type Conversion
Async iteration is also supported via `async for batch in df: ...` (or
`df.execute_stream()`), which is useful when batches are interleaved with
other I/O.

### Caching Intermediate Results

`df.cache()` materializes a DataFrame as an in-memory table and returns a new
DataFrame backed by it. Reach for it when the same intermediate result feeds
multiple downstream queries — without `cache()`, each branch re-executes the
full upstream plan (re-reading files, recomputing filters/aggregates).

```python
base = (
    ctx.read_parquet("orders.parquet")
    .filter(col("status") == "shipped")
    .cache()  # materialize once, reuse below
)
by_region = base.aggregate(["region"], [F.sum(col("amount")).alias("total")])
by_customer = base.aggregate(["customer"], [F.sum(col("amount")).alias("total")])
```

Skip `cache()` for single-use DataFrames — the lazy plan is already optimal.

The cached table is owned by the DataFrame returned from `cache()` (and any
DataFrames chained from it). To free the memory, drop every reference — let
them go out of scope, or `del base; del by_region; del by_customer`.

### Writing Results

```python