Feature/real time sink patterns #120
Conversation
…tion-based and ramp load patterns, enhance validation capabilities, and improve performance metrics collection. Introduce new test plans and examples for various execution strategies, along with updates to documentation and Docker configurations.
- Add CardinalityConfigBuilder and NullabilityConfigBuilder for configuring cardinality and nullability in relationships
- Introduce new processors for foreign key uniqueness and connection resolution
- Add comprehensive tests for new features and update existing test cases
- Enhance documentation with new YAML examples and integration tests for execution strategies, including performance test examples

This implementation enables flexible foreign key management in data generation, supporting various relationship patterns and improving data integrity validation.
Resolved review comments (3, one outdated) on api/src/main/scala/io/github/datacatering/datacaterer/api/model/PlanModels.scala
Pull Request Review: Feature/real time sink patterns

This is a substantial PR adding ~22k lines with significant enhancements to foreign key handling, execution strategies, and real-time sink patterns. Overall, the code quality is high with good architecture, but there are several areas that need attention.

🎯 High-Level Assessment

Strengths:
Concerns:
🐛 Critical Issues

1. Missing Test File Reference

File: The file path in the PR shows
Action: Verify all new test files are properly committed to the repository.
…vel and not top level of foreign keys
```scala
source: ForeignKeyRelation = ForeignKeyRelation(),
generate: List[ForeignKeyRelation] = List(),
delete: List[ForeignKeyRelation] = List()
```
Bug: Auxiliary constructor missing new optional parameters
The auxiliary constructor for ForeignKeyRelation doesn't pass the new optional parameters (cardinality, nullability, generationMode) to the primary constructor. It calls this(dataSource, step, List(field)) which only provides 3 arguments, but the primary constructor now expects 6 parameters. This will fail to compile because Scala requires all parameters to be provided when calling the primary constructor from an auxiliary constructor, even if they have default values. The auxiliary constructor should be: def this(dataSource: String, step: String, field: String) = this(dataSource, step, List(field), None, None, None).
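A minimal, self-contained sketch of the fix the comment describes. The exact option types on `ForeignKeyRelation` are assumptions; the real model may differ:

```scala
// Hypothetical stand-in for the real model: the Option[String] types are assumed.
case class ForeignKeyRelation(
  dataSource: String = "default",
  step: String = "default",
  fields: List[String] = List(),
  cardinality: Option[String] = None,
  nullability: Option[String] = None,
  generationMode: Option[String] = None
) {
  // An auxiliary constructor must pass EVERY primary-constructor argument,
  // defaults included, so the new optional parameters are supplied explicitly:
  def this(dataSource: String, step: String, field: String) =
    this(dataSource, step, List(field), None, None, None)
}

val fk = new ForeignKeyRelation("postgres", "accounts", "account_id")
```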
- Updated the build workflow to support both amd64 and arm64 architectures for packaging the application as a Debian package.
- Introduced a new workflow for testing Linux ARM64 builds, including setup for QEMU and Docker Buildx for cross-platform compatibility.
- Adjusted artifact naming conventions to clearly indicate architecture in the output files.
- Modified the build and test workflows to include the installation of fakeroot prior to executing the jpackage command, ensuring successful packaging of the application for multi-architecture builds.
```scala
LOGGER.info(s"Using INDEX-BASED approach: assigning FKs by row position (${recordsPerParent} records per parent)")
applyCardinalityWithIndex(sourceDf, targetDf, sourceFields, targetFields, sourceCount, recordsPerParent.toLong)
```
Bug: Division by zero when cardinality config produces zero value
When using the index-based FK assignment approach, recordsPerParent is computed from CardinalityConfig values and converted to Long before being used as a divisor. If the config has min=0, max=0 (average=0), min=0, max=1 (average=0.5→0 after toLong), or ratio=0.0 (or any value less than 1.0), the division floor(col("_row_num") / recordsPerParent) at line 234 will fail with a division by zero error. Unlike NullabilityConfig which validates its bounds with a require statement, CardinalityConfig has no validation preventing these edge case values.
Additional Locations (1)
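The failure mode and a clamp-style guard can be shown without Spark. The helper name and config shape here are hypothetical:

```scala
// Hypothetical config shape mirroring the review comment.
case class CardinalityConfig(min: Int = 1, max: Int = 1)

def recordsPerParent(config: CardinalityConfig): Long = {
  val average = (config.min + config.max) / 2.0
  // average.toLong truncates: min=0, max=1 gives 0.5 -> 0, a zero divisor.
  // Clamping to at least 1 keeps a later floor(rowNum / recordsPerParent) safe.
  math.max(1L, average.toLong)
}
```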
```scala
val insertOrderDfs = insertOrder
  .map(s => {
    foreignKeyAppliedDfs.find(f => f._1.equalsIgnoreCase(s))
      .getOrElse(s -> taskDfs.find(t => t._1.equalsIgnoreCase(s)).get._2)
```
Bug: Unsafe .get on potentially empty Option causes runtime exception
The insertOrder is calculated from foreignKeyRelations (all FK relations) at line 130, while enabledForeignKeys (line 60) filters out invalid FKs. When building insertOrderDfs, the code tries to find each data source name in taskDfs with .get._2 at line 136. If a FK relation references a disabled or non-existent data source (which gets filtered out from enabledForeignKeys but still contributes to insertOrder), taskDfs.find(...) returns None, and calling .get throws a NoSuchElementException. The insert order should use enabledForeignKeys instead of foreignKeyRelations, or the fallback lookup should handle missing entries gracefully.
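A pure-Scala sketch of the graceful fallback. Strings stand in for DataFrames, and the shapes are assumptions based on the snippet above:

```scala
// Strings stand in for DataFrames in this sketch.
def resolveDf(name: String,
              foreignKeyAppliedDfs: List[(String, String)],
              taskDfs: List[(String, String)]): Option[(String, String)] =
  foreignKeyAppliedDfs.find(_._1.equalsIgnoreCase(name))
    .orElse(taskDfs.find(_._1.equalsIgnoreCase(name)).map(t => name -> t._2))
// Returning Option instead of calling .get lets a filtered-out data source
// be skipped (or reported) instead of throwing NoSuchElementException.
```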
Code Review - PR #120: Feature/real time sink patterns

This is a substantial PR with 167 files changed (16k additions, 3k deletions). I have reviewed the key architectural changes and have the following feedback:

✅ Strengths

1. Excellent Architectural Refactoring - Foreign Key Processing
2. Well-Designed Execution Strategy Framework
3. Metrics & Observability
4. Pekko Migration
5. Test Coverage
```scala
case config if config.ratio.isDefined =>
  // Ratio: set perField with fixed count
  // Use requiredCount for total records, perField count for records per parent
  val recordsPerParent = config.ratio.get.toInt
```
Bug: Truncation of ratio causes fewer records than expected
Using .toInt and .toLong truncates the ratio value when converting from Double to integer types. For a ratio of 2.5, this becomes 2 instead of rounding to 3. This inconsistency with calculateRequiredCount (which uses math.ceil) causes the perField count to generate fewer records than the total required count calculation expects. For example, with 3 parents and ratio 2.5: total required = 8 (via ceil), but perField generates only 6 (3 * 2).
Additional Locations (1)
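The discrepancy reproduces with plain arithmetic:

```scala
val parents = 3
val ratio = 2.5

// calculateRequiredCount rounds the total up ...
val totalRequired = math.ceil(parents * ratio).toLong  // ceil(7.5) = 8

// ... but truncating the per-parent ratio undercounts:
val perField = ratio.toInt                             // 2.5 -> 2
val generated = parents.toLong * perField              // 6 records, 2 short
```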
Resolved review comment on .../scala/io/github/datacatering/datacaterer/core/foreignkey/strategy/CardinalityStrategy.scala
Pull Request Review: Feature/real time sink patterns

This is a major architectural enhancement with 16,079 additions across 93 files.

Overall Assessment

Verdict: This PR demonstrates excellent engineering with well-structured architecture, comprehensive testing, and thoughtful design patterns. There are concerns to address before merging.

Highlights:

Key Concerns

1. TDigest Implementation (HIGH PRIORITY)

TDigest.scala stores all values in memory (up to 100k) - this is NOT a true T-Digest streaming sketch algorithm. Recommendation: Rename to SimplePercentileCalculator or implement actual T-Digest.

2. Memory Risks (HIGH PRIORITY)

3. Race Condition (HIGH PRIORITY)

ForeignKeyProcessor:122-124 mutates taskDfs while iterating. If the same target appears multiple times, later iterations use stale data.

4. Actor System Performance

PekkoStreamingSinkWriter creates/terminates actor systems per call (expensive thread pool initialization).

5. Timeout Configuration

The 300-second timeout cap may be too aggressive for large datasets with low rates.

Potential Bugs

Architecture Review

Foreign Key Processing: Excellent separation of concerns with ForeignKeyProcessor and the strategy pattern. InsertOrderCalculator provides topological sort with circular dependency detection.

Execution Strategies: Well-designed trait-based system with LoadPatternParser providing clean YAML mapping.

Real-Time Sink Writer: Clean refactoring with SinkRouter centralizing routing logic.

Test Coverage

Excellent coverage with 583+ lines in ForeignKeyEndToEndIntegrationTest and 753+ lines in EnhancedForeignKeyIntegrationTest. Gaps: TDigest edge cases, actor lifecycle, circular dependency errors.

Recommendations

HIGH PRIORITY (Address Before Merge):
MEDIUM PRIORITY:
Summary

High-quality work with excellent architecture and comprehensive testing. The architectural improvements significantly outweigh the concerns, which are addressable. Recommendation: Request changes for high-priority items, then approve for merge.
```scala
val nullCount = (totalCount * percentage).toLong
targetDf
  .withColumn("_row_idx", row_number().over(Window.orderBy(lit(1))) - 1)
  .withColumn("_should_null_fk", col("_row_idx") < nullCount)
```
Bug: Nullability "head" and "tail" strategies use non-deterministic ordering
The "head" and "tail" nullability strategies are documented to apply nulls to the "first N%" and "last N%" of records respectively. However, the implementation uses Window.orderBy(lit(1)) which provides no meaningful ordering since all rows have the same constant sort key. This makes the row assignment arbitrary and non-deterministic, causing both strategies to behave essentially like "random". The documented behavior of selecting specific positional records cannot be achieved without an actual column-based ordering.
Additional Locations (1)
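A pure-Scala analog of a deterministic fix: "head" needs a real ordering key, here an assumed stable id column, rather than a constant sort key:

```scala
// ids stands in for a column that gives rows a stable order.
def headNullIds(ids: Seq[Long], percentage: Double): Set[Long] = {
  val nullCount = (ids.size * percentage).toInt
  // Sorting by the key makes "first N%" reproducible across runs and partitions.
  ids.sorted.take(nullCount).toSet
}
```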
…-based approaches

- Updated ForeignKeyEndToEndIntegrationTest to assert expected null rows for foreign keys based on deterministic hash values.
- Modified CardinalityStrategy, DistributedSamplingStrategy, GenerationModeStrategy, and NullabilityStrategy to utilize hash-based methods for consistent results across different Spark environments.
- Introduced SimplePercentileCalculator for efficient percentile calculations in performance metrics, replacing the deprecated T-Digest.
- Added tests for new deterministic behaviors in foreign key strategies and updated existing tests for consistency.
- Improved DataGenerator to support deterministic SQL generation with seed-based hash functions.
```scala
val executableTasks = context.executableTasks

val enabledSources = plan.tasks.filter(_.enabled).map(_.dataSourceName)
val sinkOptions = plan.sinkOptions.get
```
Bug: Unchecked Option.get may throw NoSuchElementException
The process method calls plan.sinkOptions.get without first verifying that sinkOptions is defined. If ForeignKeyProcessor.process() is called directly (e.g., via ForeignKeyUtil.getDataFramesWithForeignKeys) with a plan where sinkOptions is None, this will throw a NoSuchElementException at runtime. The method lacks a guard clause to handle this edge case safely.
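A guard-clause sketch; the Plan/SinkOptions shapes below are stand-ins for illustration:

```scala
// Stand-in shapes for illustration only.
case class SinkOptions(foreignKeys: List[String] = List())
case class Plan(sinkOptions: Option[SinkOptions] = None)

// map/getOrElse avoids the unchecked .get entirely:
def foreignKeysOf(plan: Plan): List[String] =
  plan.sinkOptions.map(_.foreignKeys).getOrElse(List())
```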
```scala
// Use absolute hash value modulo sourceCount for uniform distribution
targetDf.withColumn("_fk_idx", abs(hashExpr) % sourceCount)
case None =>
  targetDf.withColumn("_fk_idx", floor(rand() * sourceCount).cast(LongType))
```
Bug: Division by zero when source DataFrame is empty
The apply method performs modulo and multiplication operations with sourceCount (lines 78 and 80) without checking if it's zero first. If the source DataFrame has no records, sourceCount will be 0, causing a division by zero error during the modulo operation abs(hashExpr) % sourceCount or undefined behavior with rand() * sourceCount. Unlike CardinalityStrategy which properly guards against this case, DistributedSamplingStrategy lacks this protection.
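The missing guard can be sketched in plain Scala, with a Long hash standing in for the Spark hash expression:

```scala
// None signals "no source rows to sample from" instead of dividing by zero.
def fkIndex(hash: Long, sourceCount: Long): Option[Long] =
  if (sourceCount <= 0) None
  else Some(math.abs(hash) % sourceCount)
```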
- Removed obsolete integration test steps from GitHub Actions workflow.
- Improved logging in StepDataCoordinator for better debugging during record generation.
- Updated CardinalityCountAdjustmentProcessor to ensure only foreign key target steps are modified, preventing unintended changes.
- Added new YAML plan and task files for account balances and transactions, including validation of foreign key relationships.
- Introduced integration tests for YAML plan execution to verify record counts and foreign key integrity.
PR Review: Feature/real time sink patterns

This is an impressive and comprehensive enhancement that adds significant new capabilities to Data Caterer. I've reviewed the key architectural changes across load execution, foreign keys, sinks, and metrics. Overall, the code quality is excellent with strong adherence to Scala best practices. Below are my findings organized by category.

🎯 Summary

Verdict: ✅ Approved with minor recommendations

This PR successfully delivers:
💪 Strengths

1. Excellent Architecture
2. Code Quality
3. Test Coverage
4. Backward Compatibility
🔍 Issues & Recommendations

Critical Issues (None found)

High Priority

1. Thread.sleep in RateLimiter (
```scala
val optTargetStep = executableTasks.flatMap(tasks =>
  tasks
    .find(_._1.dataSourceName == target.dataSource)
    .flatMap(_._2.steps.find(_.name == target.step))
```
Bug: Inconsistent case sensitivity in data source name comparison
The lookup for targetPerFieldCount at line 94 uses case-sensitive comparison (==) when comparing dataSourceName with target.dataSource, while all other DataFrame lookups in the same file (lines 123, 135, 136, 139, 148) use equalsIgnoreCase. This inconsistency causes targetPerFieldCount to be None when data source names differ only by case, even though the DataFrames are successfully found elsewhere. This affects the selection between group-based and index-based cardinality strategies, potentially causing incorrect FK assignments.
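The mismatch in isolation:

```scala
val taskDfs = List("Postgres" -> 10L)  // data source name as registered
val target = "postgres"                // name as written in the FK relation

val strict = taskDfs.find(_._1 == target)              // None: case-sensitive
val lax = taskDfs.find(_._1.equalsIgnoreCase(target))  // found
```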
Note
Add pattern-based load execution with rate limiting, strategy-driven foreign keys, performance metrics/validations, and Pekko-based real‑time sink routing with broad docs/examples updates.
0.18.0.

Written by Cursor Bugbot for commit a6f8e4c.