[DCP - Ingestion] Update the Preprocessing job to support the importName #515

Open
gmechali wants to merge 1 commit into datacommonsorg:master from gmechali:importconfig

Conversation

@gmechali
Contributor

No description provided.

@codacy-production

Not up to standards ⛔

🔴 Issues 1 high · 1 medium · 10 minor

Alerts:
⚠ 12 issues (≤ 0 issues of at least minor severity)

Results:
12 new issues

Category Results
UnusedCode 1 medium
ErrorProne 1 high
CodeStyle 10 minor

🟢 Metrics 4 complexity · 0 duplication

Metric Results
Complexity 4
Duplication 0

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request adds support for bulk loading and processing multiple imports under subdirectories, merging their configurations, and exporting JSON-LD observations grouped by provenance. The review feedback highlights several opportunities to improve robustness: computing relative paths with a path helper instead of str.replace, validating that parsed configurations are dictionaries, avoiding an UnboundLocalError when opening import directories, skipping empty import names, using a regex for safer directory name sanitization, and logging warnings for empty provenances or conflicting configuration merges.

Comment thread simple/stats/runner.py
raise e

dir_path = fspath.dirname(file.path)
rel_dir = dir_path.replace(base_dir.path, "").strip("/")

high

Using str.replace to strip the base directory from dir_path is error-prone. str.replace removes every occurrence of base_dir.path, not just the leading one, so if that string appears again deeper in the path (for example, a base of /base and a directory /base/import/my_import/base), the later occurrence is removed too and the relative path is corrupted. Compute the relative path with a path helper instead. Note that fs.path.relative takes a single path and only strips its leading slash; assuming fspath is PyFilesystem's fs.path, relativefrom(base, path) is the two-argument form.

Suggested change
- rel_dir = dir_path.replace(base_dir.path, "").strip("/")
+ rel_dir = fspath.relativefrom(base_dir.path, dir_path).strip("/")
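
To make the failure mode concrete, here is a minimal sketch that uses the standard library's posixpath.relpath as a stand-in for the project's path helper. The relative_dir function and the example paths are illustrative assumptions, not code from this PR.

```python
import posixpath


def relative_dir(base: str, dir_path: str) -> str:
    """Return dir_path relative to base, or "" when both are the same directory.

    Hypothetical helper: posixpath.relpath stands in for the project's
    own path utilities.
    """
    rel = posixpath.relpath(dir_path, start=base)
    return "" if rel == "." else rel


base = "/base"
nested = "/base/import/my_import/base"

# str.replace drops BOTH occurrences of "/base", losing the last component.
naive = nested.replace(base, "").strip("/")

# A path-aware helper only removes the leading base directory.
safe = relative_dir(base, nested)

print(naive)  # import/my_import
print(safe)   # import/my_import/base
```

The naive version silently maps two different directories to the same relative path, which is the corruption described above.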

Comment thread simple/stats/runner.py
Comment on lines +697 to +698
rows = self.db.engine.fetch_all("SELECT DISTINCT provenance FROM observations")
provenances = [row[0] for row in rows if row[0]]

high

If any observations have a missing or empty provenance field, they will be completely skipped during the export because provenances only includes non-empty values. To prevent silent data loss, we should log a warning if any observations with an empty provenance are detected.

Suggested change
- rows = self.db.engine.fetch_all("SELECT DISTINCT provenance FROM observations")
- provenances = [row[0] for row in rows if row[0]]
+ rows = self.db.engine.fetch_all("SELECT DISTINCT provenance FROM observations")
+ provenances = [row[0] for row in rows if row[0]]
+ if any(not row[0] for row in rows):
+   logging.warning("Found observations with missing or empty provenance. These will be skipped during export!")
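
The check can be exercised in isolation. The sketch below assumes an in-memory SQLite table named observations with a provenance column (matching the query above); the split_provenances helper and the sample rows are hypothetical.

```python
import logging
import sqlite3


def split_provenances(rows):
    """Return (non-empty provenances, True if any row had a missing/empty one)."""
    provenances = [row[0] for row in rows if row[0]]
    has_missing = any(not row[0] for row in rows)
    return provenances, has_missing


# Illustrative data only: one real provenance, one NULL, one empty string.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE observations (provenance TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO observations VALUES (?, ?)",
    [("c/p/1", "1"), (None, "2"), ("", "3"), ("c/p/1", "4")],
)
rows = conn.execute("SELECT DISTINCT provenance FROM observations").fetchall()

provenances, has_missing = split_provenances(rows)
if has_missing:
    logging.warning(
        "Found observations with missing or empty provenance. "
        "These will be skipped during export!"
    )
```

Both NULL and the empty string are falsy in Python, so a single truthiness test covers the two cases.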

Comment thread simple/stats/runner.py
Comment on lines +243 to +247
try:
  config_data = json.loads(raw_config)
except json.JSONDecodeError as e:
  logging.error("Failed to parse JSON from %s: %s", file.full_path(), e)
  raise e

medium

If a config.json file contains valid JSON but is not a JSON object (e.g., a JSON array or a primitive value), json.loads will succeed but subsequent .get() calls on config_data will raise an AttributeError. We should defensively verify that the parsed JSON is a dictionary.

Suggested change
- try:
-   config_data = json.loads(raw_config)
- except json.JSONDecodeError as e:
-   logging.error("Failed to parse JSON from %s: %s", file.full_path(), e)
-   raise e
+ try:
+   config_data = json.loads(raw_config)
+   if not isinstance(config_data, dict):
+     raise ValueError("Config content must be a JSON object")
+ except (json.JSONDecodeError, ValueError) as e:
+   logging.error("Failed to parse JSON from %s: %s", file.full_path(), e)
+   raise e
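
As a rough illustration of the validation, the parsing step could be factored into a small helper. parse_config_object is a hypothetical name, not a function in this PR.

```python
import json


def parse_config_object(raw_config: str) -> dict:
    """Parse raw_config and require the top-level JSON value to be an object."""
    config_data = json.loads(raw_config)
    if not isinstance(config_data, dict):
        raise ValueError(
            "Config content must be a JSON object, got "
            f"{type(config_data).__name__}"
        )
    return config_data


print(parse_config_object('{"variables": {}}'))  # {'variables': {}}
```

json.JSONDecodeError is a subclass of ValueError, so a caller that catches ValueError handles both malformed JSON and a well-formed non-object value.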

Comment thread simple/stats/runner.py
Comment on lines +289 to +290
for name in import_names:
  name = name.strip()

medium

If import_names contains empty entries (for example, from a value with a trailing comma such as "oecd,"), name.strip() yields an empty string. Calling base_dir.open_dir("") would then open the base directory itself, which is unintended. We should skip empty names.

    for name in import_names:
      name = name.strip()
      if not name:
        continue
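
An alternative is to filter the names once, before the loop. This sketch assumes the names arrive as a single comma-separated string, which the PR does not confirm; parse_import_names is a hypothetical helper.

```python
from typing import List


def parse_import_names(raw: str) -> List[str]:
    """Split a comma-separated string and drop empty or whitespace-only names."""
    return [name.strip() for name in raw.split(",") if name.strip()]


print(parse_import_names("oecd,"))         # ['oecd']
print(parse_import_names(" oecd , ,who"))  # ['oecd', 'who']
```

Filtering up front keeps the loop body free of guard clauses, at the cost of one extra pass over the input.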

Comment thread simple/stats/runner.py
Comment on lines +291 to +303
try:
  imp_dir = base_dir.open_dir(name)
  file = imp_dir.open_file(constants.CONFIG_JSON_FILE_NAME, create_if_missing=False)
  configs.append(file)
except FileNotFoundError:
  logging.info("Config file not found at root of %s. Scanning subdirectories.", name)
  sub_configs = self._find_configs_in_dir(imp_dir)
  if not sub_configs:
    raise FileNotFoundError(f"No config files found for {name}")
  configs.extend(sub_configs)
except ValueError as e:
  logging.error("Invalid directory for import %s: %s", name, e)
  raise e

medium

If base_dir.open_dir(name) itself raises FileNotFoundError, imp_dir is never assigned, so referencing it in the except FileNotFoundError block raises an UnboundLocalError that masks the original error. We should separate the directory opening from the file opening so that imp_dir is bound before it is used.

Suggested change
- try:
-   imp_dir = base_dir.open_dir(name)
-   file = imp_dir.open_file(constants.CONFIG_JSON_FILE_NAME, create_if_missing=False)
-   configs.append(file)
- except FileNotFoundError:
-   logging.info("Config file not found at root of %s. Scanning subdirectories.", name)
-   sub_configs = self._find_configs_in_dir(imp_dir)
-   if not sub_configs:
-     raise FileNotFoundError(f"No config files found for {name}")
-   configs.extend(sub_configs)
- except ValueError as e:
-   logging.error("Invalid directory for import %s: %s", name, e)
-   raise e
+ try:
+   imp_dir = base_dir.open_dir(name)
+ except (FileNotFoundError, ValueError) as e:
+   logging.error("Invalid or missing directory for import %s: %s", name, e)
+   raise e
+ try:
+   file = imp_dir.open_file(constants.CONFIG_JSON_FILE_NAME, create_if_missing=False)
+   configs.append(file)
+ except FileNotFoundError:
+   logging.info("Config file not found at root of %s. Scanning subdirectories.", name)
+   sub_configs = self._find_configs_in_dir(imp_dir)
+   if not sub_configs:
+     raise FileNotFoundError(f"No config files found for {name}")
+   configs.extend(sub_configs)
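
The masking behavior is easy to reproduce outside the runner. In this sketch, open_dir is a hypothetical stand-in for base_dir.open_dir that always fails; combined and separated mirror the two shapes of the try block.

```python
def open_dir(name):
    """Hypothetical stand-in for base_dir.open_dir: the directory is missing."""
    raise FileNotFoundError(name)


def combined(name):
    # Mirrors the original shape: one try block covers both steps.
    try:
        imp_dir = open_dir(name)
        return imp_dir
    except FileNotFoundError:
        # imp_dir was never bound, so this line raises UnboundLocalError
        # and hides the original FileNotFoundError.
        return imp_dir


def separated(name):
    # Mirrors the suggested shape: the directory is opened in its own try.
    try:
        imp_dir = open_dir(name)
    except FileNotFoundError as e:
        raise FileNotFoundError(
            f"Invalid or missing directory for import {name}"
        ) from e
    return imp_dir
```

Calling combined("oecd") surfaces an UnboundLocalError, while separated("oecd") reports the missing directory directly.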

Comment thread simple/stats/runner.py
Comment on lines +720 to +721
# Sanitize provenance name for folder
prov_folder = prov.replace("/", "_").replace(":", "_")

medium

Sanitizing the provenance name by only replacing / and : is not fully robust. Provenance strings (especially URLs) can contain other characters like ?, &, =, or \, which are invalid or problematic in directory names on various filesystems (e.g., Windows). Using a regular expression to replace all non-alphanumeric characters (except underscores and dashes) is much safer and more robust.

Suggested change
- # Sanitize provenance name for folder
- prov_folder = prov.replace("/", "_").replace(":", "_")
+ # Sanitize provenance name for folder to be safe for all filesystems
+ import re
+ prov_folder = re.sub(r'[^a-zA-Z0-9_\-]', '_', prov)
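
A minimal sketch of the same sanitization as a reusable function, with the import at module level and the pattern compiled once. sanitize_folder_name and the example URL are illustrative assumptions.

```python
import re

# Anything other than ASCII letters, digits, underscore, or dash is replaced.
_UNSAFE_CHARS = re.compile(r"[^a-zA-Z0-9_\-]")


def sanitize_folder_name(prov: str) -> str:
    """Replace characters that may be unsafe in directory names with '_'."""
    return _UNSAFE_CHARS.sub("_", prov)


print(sanitize_folder_name("https://example.org/data?x=1&y=2"))
# https___example_org_data_x_1_y_2
```

One caveat applies to both the original and the suggested code: distinct provenances can map to the same folder name (for example, "a/b" and "a:b" both become "a_b"), so callers may want to detect collisions.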

Comment thread simple/stats/runner.py
Comment on lines +260 to +268
# Merge variables
variables = config_data.get("variables", {})
for k, v in variables.items():
  merged_data["variables"][k] = v

# Merge sources
sources = config_data.get("sources", {})
for k, v in sources.items():
  merged_data["sources"][k] = v

medium

When merging variables and sources from multiple configuration files, conflicting definitions for the same key will silently overwrite each other. To prevent hard-to-debug configuration conflicts, we should log a warning if a key is being overwritten with a different value.

Suggested change
- # Merge variables
- variables = config_data.get("variables", {})
- for k, v in variables.items():
-   merged_data["variables"][k] = v
- # Merge sources
- sources = config_data.get("sources", {})
- for k, v in sources.items():
-   merged_data["sources"][k] = v
+ # Merge variables, warning on conflicts
+ variables = config_data.get("variables", {})
+ for k, v in variables.items():
+   if k in merged_data["variables"] and merged_data["variables"][k] != v:
+     logging.warning("Conflicting definition for variable %s. Overwriting.", k)
+   merged_data["variables"][k] = v
+ # Merge sources, warning on conflicts
+ sources = config_data.get("sources", {})
+ for k, v in sources.items():
+   if k in merged_data["sources"] and merged_data["sources"][k] != v:
+     logging.warning("Conflicting definition for source %s. Overwriting.", k)
+   merged_data["sources"][k] = v
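
Since the two loops are identical apart from the section name, they could share one helper. This is a sketch only; merge_section and its return value are hypothetical and not part of the PR.

```python
import logging
from typing import Any, Dict, List


def merge_section(
    merged: Dict[str, Any], incoming: Dict[str, Any], kind: str
) -> List[str]:
    """Merge incoming into merged in place and return the conflicting keys.

    A key conflicts when it already exists with a different value; the
    incoming value still wins, matching the overwrite behavior above.
    """
    conflicts = []
    for key, value in incoming.items():
        if key in merged and merged[key] != value:
            logging.warning(
                "Conflicting definition for %s %s. Overwriting.", kind, key
            )
            conflicts.append(key)
        merged[key] = value
    return conflicts
```

Usage would look like merge_section(merged_data["variables"], config_data.get("variables", {}), "variable"), and likewise for sources. Returning the conflicting keys also makes the behavior straightforward to assert in a unit test.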
