Add PySpark ETL best practices cursorrules #203
rishikaidnani wants to merge 4 commits into PatrickJS:main from
Conversation
No actionable comments were generated in the recent review.
Walkthrough
Adds a new "PySpark ETL Best Practices" ruleset and its README, and updates the repository README to link to the new ruleset. Changes are documentation-only; no public APIs or code entities were modified.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~15 minutes
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@rules/pyspark-etl-best-practices-cursorrules-prompt-file/.cursorrules`:
- Around line 352-364: The code assumes the `max_partition` returned by `partition_df.orderBy(...).first()` is non-null; when the partitions table is empty, `.first()` returns `None` and `max_partition["partition_date"]` raises a `TypeError`. In the block that defines `max_partition` and `latest_date`, check that `max_partition is not None` before accessing its keys, and handle the empty case explicitly (for example, set `latest_date` to `None`, or raise a clear error and log it via `processLogger`) so that subsequent logic using `latest_date` (or `partition_df`/`max_partition`) cannot crash.
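A minimal sketch of the suggested guard, factored into a plain helper so the null case is explicit. The helper name is hypothetical; in the real pipeline the input would be the Row returned by `partition_df.orderBy(...).first()`:

```python
def extract_latest_date(max_partition):
    """Return the latest partition date from the Row produced by
    partition_df.orderBy(...).first(), or None when the partitions
    table was empty (i.e. .first() returned None)."""
    if max_partition is None:
        return None
    return max_partition["partition_date"]
```

Callers can then branch on a `None` result (or log via `processLogger` and raise) instead of hitting a `TypeError`.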
- Around line 96-109: Update the example to use `F.col()` for column references in the `datediff` call: replace `F.datediff('date_a', 'date_b')` with `F.datediff(F.col('date_a'), F.col('date_b'))`, so the `date_passed` variable is consistent with the guideline. Ensure the variables `is_delivered`, `date_passed`, and `has_registration` all use `F.col()` where appropriate, and keep the final `F.when(...)` expression unchanged.
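A hedged sketch of what the corrected snippet could look like. Only the `datediff` call is quoted in the comment, so the condition bodies for `is_delivered` and `has_registration` below are invented for illustration:

```python
import pyspark.sql.functions as F

# Hypothetical conditions; only date_passed is quoted in the review comment.
is_delivered = F.col("status") == "delivered"
date_passed = F.datediff(F.col("date_a"), F.col("date_b")) > 0  # F.col() per the guideline
has_registration = F.col("registration_id").isNotNull()

# The final F.when(...) expression stays as in the original, e.g.:
# result = F.when(is_delivered & date_passed & has_registration, ...).otherwise(...)
```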
- Around line 214-254: The examples reference the Window alias `W` (`W.partitionBy`, `W.unboundedPreceding`, `W.unboundedFollowing`) but never define or import it. Add guidance to import and alias Spark's Window class (e.g., `from pyspark.sql.window import Window as W`) near the top of the file or in the Code Style section, so the examples using `W` and the `F.*` conventions are valid and consistent, and update the documentation text to mention importing Window as `W` wherever windowed examples are shown.
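The missing import can be shown as below. The window itself is a hypothetical example (column names `user_id`, `event_ts`, `amount` are assumptions), included only to demonstrate that the `W` alias then resolves:

```python
import pyspark.sql.functions as F
from pyspark.sql.window import Window as W  # the alias the examples assume

# Illustrative unbounded window over each user's events:
w = (W.partitionBy("user_id")
      .orderBy("event_ts")
      .rowsBetween(W.unboundedPreceding, W.unboundedFollowing))
# running_total = F.sum("amount").over(w)
```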
- Around line 259-274: The lambda passed to `map_zip_with` uses `when()` without a prefix. Change every `when(...)` in that lambda (and anywhere else in the snippet) to `F.when(...)` to follow the project convention, and confirm the module alias import (`functions as F`) is present so `F.when` is available. Leave the `map_zip_with` and `map_concat` usages unchanged apart from the `F.when` prefix, to ensure consistent PySpark expression usage.
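A sketch of the corrected lambda under the `F.when` convention. The map column names `map_a` and `map_b` are assumptions, and the conflict rule (prefer the left value) is illustrative, not taken from the file under review:

```python
import pyspark.sql.functions as F

merged = F.map_zip_with(
    "map_a",
    "map_b",
    # Every when() inside the lambda carries the F. prefix:
    lambda k, v1, v2: F.when(v1.isNotNull(), v1).otherwise(v2),
)
```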
- Around line 59-74: The `read_latest` method in class `PartitionedReader` can crash when the table is empty because `.first()` may return `None`. Capture the result of `.agg(F.max(partition_col)).first()` in a variable (e.g., `first_row`), check whether it is `None` (or whether `first_row[0]` is `None`), and handle that case either by returning an empty DataFrame with the target table schema (e.g., `spark.createDataFrame(spark.sparkContext.emptyRDD(), spark.read.table(table_name).schema)` or `spark.read.table(table_name).limit(0)`) or by raising a clear, descriptive error; otherwise filter on `max_val` as before.
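The empty-table guard can be factored into a small pure helper, sketched below. The helper name is hypothetical, and the commented `read_latest` body shows how it would slot into `PartitionedReader` (the `spark` handle and parameters are assumed, since the original class is not shown):

```python
def max_partition_value(first_row):
    """Return the max partition value from df.agg(F.max(col)).first(),
    or None when the table is empty or the column is all-null."""
    if first_row is None:
        return None
    return first_row[0]

# Inside PartitionedReader.read_latest (sketch):
#
#     df = self.spark.read.table(table_name)
#     max_val = max_partition_value(df.agg(F.max(partition_col)).first())
#     if max_val is None:
#         return df.limit(0)  # empty DataFrame with the table's schema
#     return df.filter(F.col(partition_col) == max_val)
```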
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: fe9a030d-4e7c-4503-8aef-52afa13d30f6
📒 Files selected for processing (3)
- README.md
- rules/pyspark-etl-best-practices-cursorrules-prompt-file/.cursorrules
- rules/pyspark-etl-best-practices-cursorrules-prompt-file/README.md
Summary
Adds production-tested PySpark & ETL best practices as a .cursorrules file, the first PySpark/Spark-specific rules in the repository.

What's covered
8 sections covering the full ETL development lifecycle:
- .transform() pipeline composition, shared partition-aware readers, reusable merge utilities
- F.col() prefix convention, named conditions, select over withColumn, alias over withColumnRenamed, chaining limits
- how=, left over right, .alias() for disambiguation, F.broadcast() for small dims, no .dropDuplicates() as a crutch
- row_number vs first, ignorenulls=True, avoid empty partitionBy()
- map_zip_with for conflict-aware merges, transform + array_max for nested structs, avoid UDFs
- F.lit(None) over empty strings, .otherwise() pitfalls, production-safe logging, intentional persist()
- .byName() for schema evolution, __partitions metadata table, write.distribution-mode (none/hash/range)

Credits
Inspired by the Palantir PySpark Style Guide and production experience debugging data skew, cumulative table merges, and Iceberg write patterns.
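As one example of a covered topic, .transform() pipeline composition keeps each step a small named function. Everything below (table and column names, the step bodies) is illustrative, not taken from the ruleset itself:

```python
import pyspark.sql.functions as F

def with_is_delivered(df):
    # select over withColumn, per the ruleset's conventions
    return df.select("*", (F.col("status") == "delivered").alias("is_delivered"))

def only_recent(df):
    return df.filter(F.col("order_date") >= "2024-01-01")

# orders = spark.read.table("orders")  # spark session assumed
# result = orders.transform(with_is_delivered).transform(only_recent)
```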
Checklist
- rules/pyspark-etl-best-practices-cursorrules-prompt-file/ directory with .cursorrules and README.md
- README.md updated with entry in the "Language-Specific" section (alphabetical order)