
Add PySpark ETL best practices cursorrules#203

Open
rishikaidnani wants to merge 4 commits into PatrickJS:main from rishikaidnani:add-pyspark-etl-best-practices

Conversation


@rishikaidnani rishikaidnani commented Mar 21, 2026

Summary

Adds production-tested PySpark & ETL best practices as a .cursorrules file — the first PySpark/Spark-specific rules in the repository.

What's covered

8 sections covering the full ETL development lifecycle:

  1. Project Structure — ETL base class scaffold, config factory pattern, .transform() pipeline composition, shared partition-aware readers, reusable merge utilities
  2. Code Style — F.col() prefix convention, named conditions, select over withColumn, alias over withColumnRenamed, chaining limits
  3. Joins — explicit how=, left over right, .alias() for disambiguation, F.broadcast() for small dims, no .dropDuplicates() as a crutch
  4. Window Functions — explicit frame specification, row_number vs first, ignorenulls=True, avoid empty partitionBy()
  5. Map & Array HOFs — map_zip_with for conflict-aware merges, transform + array_max for nested structs, avoid UDFs
  6. Cumulative Table Patterns — idempotent merges, order-independent conflict resolution, primary key uniqueness validation
  7. Data Quality & Performance — F.lit(None) over empty strings, .otherwise() pitfalls, production-safe logging, intentional persist()
  8. Iceberg Write Patterns — .byName() for schema evolution, __partitions metadata table, write.distribution-mode (none/hash/range)
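The pipeline-composition and named-condition conventions in sections 1–2 can be sketched without a Spark session. The snippet below is a pure-Python analogue (step and field names are hypothetical, not from the PR): in real PySpark, each step would take and return a DataFrame via `df.transform(step)`, and each named condition would be an `F.col()` expression feeding `F.when()`.

```python
# Pure-Python sketch of .transform()-style pipeline composition.
# In PySpark each step would be chained with df.transform(step).
def add_is_active(rows):
    out = []
    for r in rows:
        # Named conditions: bind each predicate to a readable name instead
        # of inlining it, mirroring the "named conditions" rule for F.when().
        is_recent = r["days_since_login"] <= 30
        has_orders = r["order_count"] > 0
        out.append({**r, "is_active": is_recent and has_orders})
    return out

def keep_active(rows):
    return [r for r in rows if r["is_active"]]

def pipeline(rows, *steps):
    # Compose transforms left-to-right, like chained df.transform(...) calls.
    for step in steps:
        rows = step(rows)
    return rows

users = [
    {"id": 1, "days_since_login": 5, "order_count": 2},
    {"id": 2, "days_since_login": 90, "order_count": 0},
]
active = pipeline(users, add_is_active, keep_active)
```

The point of the pattern is that each step stays a small, independently testable function, and the pipeline reads top-to-bottom in execution order.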

Credits

Inspired by the Palantir PySpark Style Guide and production experience debugging data skew, cumulative table merges, and Iceberg write patterns.

Checklist

  • Rule is in its own rules/pyspark-etl-best-practices-cursorrules-prompt-file/ directory
  • Directory contains .cursorrules and README.md
  • Main README.md updated with entry in the "Language-Specific" section (alphabetical order)

Summary by CodeRabbit

  • Documentation
    • Added a comprehensive PySpark ETL best-practices guide covering ETL scaffolding, configuration patterns, pipeline composition, code-style conventions, join/merge strategies, window-function guidance, higher-order/map-merge patterns, idempotent cumulative/snapshot table rules, data-quality guardrails, and Iceberg write/read patterns.
    • Included usage instructions for applying the rules within PySpark projects.


coderabbitai bot commented Mar 21, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5835bed6-ba58-4465-b5a8-5ce1d1afe002

📥 Commits

Reviewing files that changed from the base of the PR and between c3d938c and 3b4df49.

📒 Files selected for processing (1)
  • rules/pyspark-etl-best-practices-cursorrules-prompt-file/.cursorrules
✅ Files skipped from review due to trivial changes (1)
  • rules/pyspark-etl-best-practices-cursorrules-prompt-file/.cursorrules

📝 Walkthrough

Walkthrough

Adds a new "PySpark ETL Best Practices" ruleset and its README, and updates the repository README to link to the new ruleset. Changes are documentation-only; no public APIs or code entities were modified.

Changes

Cohort / File(s) — Summary

  • Main README Update (README.md) — Inserted a new list entry linking to the PySpark ETL Best Practices ruleset with a brief description.
  • PySpark ETL Best Practices Rules (rules/pyspark-etl-best-practices-cursorrules-prompt-file/.cursorrules, rules/pyspark-etl-best-practices-cursorrules-prompt-file/README.md) — Added a comprehensive .cursorrules file prescribing ETL scaffold, config parsing, transform composition, partition-aware readers, join/window/map patterns, data-quality guardrails, and Iceberg write conventions; included a README describing usage.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~15 minutes

Suggested reviewers

  • PatrickJS

Poem

🐰 With whiskers twitching, I hop and cheer,
A ruleset landed, tidy and clear.
Joins and windows, maps in line,
ETL steps now neatly defined.
Hop on, coders — spark the light! ✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Description Check — ✅ Passed — Check skipped: CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed — The title clearly and concisely summarizes the main change: adding a new PySpark ETL best practices cursorrules file to the repository.
  • Docstring Coverage — ✅ Passed — No functions found in the changed files to evaluate docstring coverage; skipping the docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

Tip

You can make CodeRabbit's review stricter and more nitpicky using the `assertive` profile, if that's what you prefer.

Change the reviews.profile setting to assertive to have CodeRabbit nitpick more issues in your PRs.
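Per CodeRabbit's documentation, that setting lives under the `reviews` key of the repository configuration file; a minimal sketch (assuming the documented `.coderabbit.yaml` at the repository root):

```yaml
# .coderabbit.yaml — switch from the default "chill" review profile
reviews:
  profile: assertive
```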


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@rules/pyspark-etl-best-practices-cursorrules-prompt-file/.cursorrules`:
- Around line 352-364: The code assumes max_partition from
partition_df.orderBy(...).first() is non-null; when the partitions table is
empty this will be None and max_partition["partition_date"] will TypeError—fix
by checking max_partition is not None before accessing its keys (in the block
that defines max_partition and latest_date), and handle the empty case (e.g.,
set latest_date to None or raise a clear error/log via processLogger) so
subsequent logic using latest_date (or partition_df/max_partition) won't crash.
- Around line 96-109: Update the example to use F.col() for column references in
the datediff call: replace the current F.datediff('date_a', 'date_b') usage with
a call that passes F.col('date_a') and F.col('date_b') so the date_passed
variable uses F.datediff(F.col('date_a'), F.col('date_b')) consistent with the
guideline; ensure the variables is_delivered, date_passed and has_registration
all use F.col() where appropriate and keep the final F.when(...) expression
unchanged.
- Around line 214-254: The examples reference the Window alias W (W.partitionBy,
W.unboundedPreceding, W.unboundedFollowing) but never define/import it; add
guidance to import and alias Spark's Window class (e.g., "from
pyspark.sql.window import Window as W") near the top or in the Code Style
section so examples using W and the F.* conventions are valid and consistent;
update the documentation text to mention importing Window as W when showing
windowed examples.
- Around line 259-274: The lambda passed to map_zip_with uses when() unprefixed
— change all uses of when(...) to F.when(...) in that lambda (and anywhere else
in this snippet) to follow the project convention; update or confirm the module
alias import (functions as F) is present so F.when is available, and keep
map_zip_with/map_concat usages unchanged except for the F.when prefix to ensure
consistent PySpark expression usage.
- Around line 59-74: The read_latest method in class PartitionedReader can crash
when the table is empty because .first() may return None; modify
PartitionedReader.read_latest to capture the result of
.agg(F.max(partition_col)).first() into a variable (e.g., first_row), check if
it is None (or if first_row[0] is None), and handle that case by returning an
empty DataFrame with the target table schema (e.g., use
spark.createDataFrame(spark.sparkContext.emptyRDD(),
spark.read.table(table_name).schema) or spark.read.table(table_name).limit(0))
or raise a clear, descriptive error; otherwise proceed to filter on max_val as
before.
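Two of the findings above (read_latest and the __partitions query) reduce to the same guard: `.first()` on an empty DataFrame returns `None`, so the row must be checked before indexing. A minimal, Spark-free sketch of that guard — the helper name and the use of a dict in place of a PySpark `Row` are illustrative, not from the PR:

```python
def latest_partition_date(first_row, key="partition_date"):
    # .first() returns None when the DataFrame is empty; guard before
    # indexing, as the review suggests for both flagged code paths.
    if first_row is None:
        # Caller may instead log via its process logger or raise a
        # descriptive error, per the review's suggested alternatives.
        return None
    return first_row[key]
```

In the real code, the same check applies to the result of `.agg(F.max(partition_col)).first()`, with the empty case handled by returning `spark.read.table(table_name).limit(0)` or raising a clear error.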

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: fe9a030d-4e7c-4503-8aef-52afa13d30f6

📥 Commits

Reviewing files that changed from the base of the PR and between fc2ce04 and c3d938c.

📒 Files selected for processing (3)
  • README.md
  • rules/pyspark-etl-best-practices-cursorrules-prompt-file/.cursorrules
  • rules/pyspark-etl-best-practices-cursorrules-prompt-file/README.md

