Skip to content

Validation_goldens modfication#2043

Open
niveditasing wants to merge 6 commits into
datacommonsorg:masterfrom
niveditasing:validation_golden_fix
Open

Validation_goldens modfication#2043
niveditasing wants to merge 6 commits into
datacommonsorg:masterfrom
niveditasing:validation_golden_fix

Conversation

@niveditasing
Copy link
Copy Markdown
Contributor

Modified code to fix 2 bugs:
Problem 1: The "must-include" file (top_100k_places.csv) was being loaded into a set of IDs without namespaces (e.g., "Earth", "country/USA"). However, input data (WorldBank.csv) contains values with namespaces (e.g., "dcid:Earth", "dcid:country/USA").
Result: An exact match ("dcid:Earth" == "Earth") is False, so the script filtered out all 450,000+ nodes, resulting in 0 goldens.
Fix: By using strip_namespace(), the script compares "Earth" to "Earth", allowing the nodes to correctly pass the filter.

Problem 2: The default utility function (file_write_csv_dict) is designed for complex data structures. When it sees a single property, it doesn't always recognize it as a column header. Instead, it generates a "cluttered" format with a redundant value column containing stringified dictionaries (e.g., "{'Column name ': 'Column value'}").
Result: This format is difficult to read and doesn't match the standard "single column" CSV format used in other golden files.
Fix: Manual writing allows us to explicitly define the header (Output Column) and write only the clean values, ensuring the output is exactly one column as you expected.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates validator_goldens.py to support namespace-stripped matching when generating goldens and replaces the helper function for writing CSVs with custom csv.DictWriter logic. A review comment points out that extracting CSV column headers from only the first node in golden_nodes assumes all nodes share the same keys, which can lead to errors or missing columns if keys vary. It suggests collecting the union of all keys across all nodes instead.

Comment thread tools/import_validation/validator_goldens.py
@balit-raibot
Copy link
Copy Markdown
Contributor

@gemini-code-assist review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates tools/import_validation/validator_goldens.py to import the csv module, strip namespaces when matching nodes against must_include_values, and replace the helper file_write_csv_dict with a custom csv.DictWriter implementation. The feedback suggests adding newline='' when opening the CSV file to prevent platform-specific carriage return issues and simplifying the logic for extracting and sorting unique keys.

Comment thread tools/import_validation/validator_goldens.py Outdated
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants