@arash77
Contributor

@arash77 arash77 commented Jan 16, 2026

Introduce a normalization function to convert version fields to strings across various import scripts, ensuring consistent data formatting. This change enhances data integrity when processing tool and package metadata.
Closes research-software-ecosystem/content#1190

@mihai-sysbio

Thanks @arash77, this is a neat contribution. It's mixed together with formatting changes though, which, while a great idea, make it hard to tell the fix apart from pure formatting. Is there a way you could split the two aspects? And if adopting PEP 8 is desired in this repo, how about a GitHub Action that applies it automatically?

@arash77
Contributor Author

arash77 commented Jan 19, 2026

I will exclude the formatting from this PR. I can create a separate PR to discuss how automated formatting could be applied.

@arash77 arash77 force-pushed the normalize-version-fields branch from 17db00a to 74ba363 Compare January 19, 2026 16:23
Add normalize_version_fields function to convert version fields
(which can be int, float, or str) to string type for consistency.

Integrate version normalization into all import scripts:
- bioconda: normalize package.version
- bioconductor: normalize Version
- biotools: normalize version and nested version fields
- galaxytool: normalize Suite_version, conda package version, and workflow versions
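
To illustrate the problem this commit targets, here is a minimal sketch, not code from the PR. It assumes records parsed from JSON, where an unquoted version arrives as an `int` or `float` while a quoted one arrives as a `str`. The `records` data is hypothetical; only the `version` key mirrors the fields listed above.

```python
import json

# Hypothetical input: the same logical field arrives with three different types.
raw = '[{"version": 2}, {"version": 1.10}, {"version": "1.10.0"}]'
records = json.loads(raw)

types_before = [type(r["version"]).__name__ for r in records]

for r in records:
    v = r["version"]
    # bool is a subclass of int in Python, so it is excluded explicitly.
    if isinstance(v, (int, float)) and not isinstance(v, bool):
        r["version"] = str(v)

types_after = [type(r["version"]).__name__ for r in records]
versions = [r["version"] for r in records]
```

After the loop every `version` is a `str`. Note that a value already parsed as a float has lost its trailing zero (`1.10` becomes `"1.1"`), so converting to a string makes the type consistent but cannot restore the original text.
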
Contributor

Copilot AI left a comment

Pull request overview

This pull request introduces a new common utility module for normalizing version fields from numeric types to strings across various metadata import scripts, addressing data integrity issues when processing tool and package metadata.

Changes:

  • Added common/metadata.py module with normalize_version_to_string and normalize_version_fields functions
  • Updated four import scripts (galaxytool-import, biotools-import, bioconductor-import, bioconda-import) to use the new normalization functions

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

Show a summary per file
| File | Description |
| --- | --- |
| common/metadata.py | New utility module providing functions to normalize version fields (integers/floats) to strings with support for nested paths and list structures |
| galaxytool-import/galaxytool-import.py | Integrated version normalization for Suite_version, Latest_suite_conda_package_version, and Related_Workflows latest_version fields |
| biotools-import/import.py | Added version field normalization for both top-level version field and nested version fields within version arrays |
| bioconductor-import/import.py | Applied normalization to the Version field in package metadata |
| bioconda-import/bioconda_importer.py | Normalized package.version field in conda package data |

try:
    if "[" in field_path:
        if "[]." not in field_path:
            list_key = field_path.rstrip("[]")

Copilot AI Jan 22, 2026

Using rstrip("[]") strips every trailing '[' and ']' character from the right side, not just the substring "[]". This could cause unexpected behavior if a field name ends with these characters. Consider checking for the suffix and slicing instead: field_path[:-2] if field_path.endswith("[]") else field_path.

Suggested change
- list_key = field_path.rstrip("[]")
+ list_key = field_path[:-2] if field_path.endswith("[]") else field_path
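
The difference between the two lines can be seen in a small sketch. The helper name `strip_list_suffix` and the sample paths are hypothetical and only serve the comparison; `str.rstrip` treats its argument as a set of characters to strip, not as a suffix.

```python
def strip_list_suffix(field_path):
    """Remove exactly one trailing "[]" marker, if present (hypothetical helper)."""
    return field_path[:-2] if field_path.endswith("[]") else field_path


paths = ["version[]", "matrix[][]", "tags]"]

# rstrip("[]") removes any run of trailing '[' or ']' characters.
with_rstrip = [p.rstrip("[]") for p in paths]

# Slicing removes only a literal "[]" suffix and leaves everything else alone.
with_slice = [strip_list_suffix(p) for p in paths]
```

For `"version[]"` both approaches agree, but `"matrix[][]"` and `"tags]"` are shortened by `rstrip` beyond the single suffix. On Python 3.9 or later, `field_path.removesuffix("[]")` behaves like the slicing variant.
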

Copilot uses AI. Check for mistakes.
Comment on lines +48 to +49
except Exception:
    continue

Copilot AI Jan 22, 2026

The broad exception handler silently swallows all errors, which makes problems with field path syntax or data structure hard to debug. Consider logging the exception, or at least catching the specific expected exceptions separately.
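
One way this suggestion could look is sketched below. It is not the PR's code: the function `normalize_paths`, the callback `normalize_one`, and the chosen exception types are assumptions made for illustration.

```python
import logging

logger = logging.getLogger(__name__)


def normalize_paths(data, field_paths, normalize_one):
    """Apply normalize_one(data, path) to each path; log failures instead of hiding them."""
    failed = []
    for path in field_paths:
        try:
            normalize_one(data, path)
        except (KeyError, IndexError, TypeError, AttributeError) as exc:
            # Only lookup and type errors are expected here; anything else propagates.
            logger.warning("Could not normalize %r: %s", path, exc)
            failed.append(path)
    return failed


def stringify_top_level(data, path):
    """Hypothetical callback: raises KeyError when the key is missing."""
    data[path] = str(data[path])


record = {"version": 3}
failed = normalize_paths(record, ["version", "missing"], stringify_top_level)
```

The loop still continues past a bad path, as the original `continue` does, but each failure now leaves a log entry and is reported back to the caller.
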

    return value


def normalize_version_fields(data, field_paths):

Copilot AI Jan 22, 2026

This function lacks documentation explaining its purpose, parameters, return value, and the syntax for field_paths (e.g., "field", "field.nested", "field[]", "field[].nested"). Adding a docstring would improve maintainability and help users understand how to use the different path syntaxes.

Suggested change
- def normalize_version_fields(data, field_paths):
+ def normalize_version_fields(data, field_paths):
+     """
+     Normalize version-like fields in a metadata dictionary to string form.
+
+     This function walks the provided ``data`` mapping and applies
+     :func:`normalize_version_to_string` to values located at the paths
+     specified in ``field_paths``. The ``data`` dictionary is modified in
+     place and also returned.
+
+     Parameters
+     ----------
+     data : dict
+         A dictionary of metadata to normalize. Must be a mapping; a
+         :class:`TypeError` is raised otherwise.
+     field_paths : Iterable[str]
+         An iterable of field path strings describing which values in
+         ``data`` should be normalized. Supported path syntaxes are:
+
+         * ``"field"``: a top-level key in ``data``.
+         * ``"field.nested"``: a nested key reached by following
+           dictionaries (e.g. ``data["field"]["nested"]``).
+         * ``"field[]"``: a top-level key whose value is a list; every
+           element of that list is normalized.
+         * ``"field[].nested"``: a top-level key whose value is a list of
+           dictionaries; for each dictionary item, the ``"nested"`` key is
+           normalized (e.g. ``data["field"][i]["nested"]``).
+
+     Returns
+     -------
+     dict
+         The same ``data`` dictionary, after in-place normalization.
+     """

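The four path syntaxes described in the suggested docstring could be handled roughly as follows. This is a sketch under assumptions, not the implementation in common/metadata.py (which is only partly visible in this thread); `_to_str` is a hypothetical stand-in for `normalize_version_to_string`, and the `tool` record is invented.

```python
def _to_str(value):
    """Stand-in for normalize_version_to_string: stringify int/float only."""
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return str(value)
    return value


def normalize_version_fields(data, field_paths):
    if not isinstance(data, dict):
        raise TypeError("data must be a dict")
    for path in field_paths:
        if "[]." in path:
            # "field[].nested": normalize one key inside each dict of a list.
            list_key, nested = path.split("[].", 1)
            items = data.get(list_key)
            if isinstance(items, list):
                for item in items:
                    if isinstance(item, dict) and nested in item:
                        item[nested] = _to_str(item[nested])
        elif path.endswith("[]"):
            # "field[]": normalize every element of a top-level list.
            list_key = path[:-2]
            if isinstance(data.get(list_key), list):
                data[list_key] = [_to_str(v) for v in data[list_key]]
        else:
            # "field" or "field.nested": walk dictionaries down to the last key.
            *parents, leaf = path.split(".")
            node = data
            for key in parents:
                node = node.get(key) if isinstance(node, dict) else None
            if isinstance(node, dict) and leaf in node:
                node[leaf] = _to_str(node[leaf])
    return data


tool = {
    "version": 2,
    "package": {"version": 1.5},
    "releases": [1, 2.0, "3.0"],
    "workflows": [{"latest_version": 4}, {"name": "x"}],
}
normalize_version_fields(
    tool,
    ["version", "package.version", "releases[]", "workflows[].latest_version"],
)
```

Missing keys are skipped with explicit checks in this sketch, so no blanket `except` is needed to tolerate absent fields.
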
@@ -0,0 +1,51 @@
def normalize_version_to_string(value):

Copilot AI Jan 22, 2026

This function lacks documentation explaining its purpose, parameters, and behavior. Consider adding a docstring that describes which values are normalized and which are preserved (e.g., None, bool, and strings are returned as-is, while int and float are converted to strings).

Suggested change
- def normalize_version_to_string(value):
+ def normalize_version_to_string(value):
+     """
+     Normalize version-like values to strings while preserving certain types.
+
+     This function is intended to prepare version fields for consistent handling
+     by converting numeric values (int, float) to their string representation.
+
+     Behavior:
+     - None and bool values are returned as-is.
+     - int and float values are converted to strings.
+     - list values are processed recursively, applying the same rules to each
+       element.
+     - dict values are processed recursively, applying the same rules to each
+       value.
+     - All other values (e.g., existing strings, custom objects) are returned
+       unchanged.
+
+     :param value: The value to normalize (may be a scalar, list, or dict).
+     :return: The normalized value with numeric types converted to strings and
+         other types preserved according to the rules above.
+     """

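A function body matching the behavior listed in the suggested docstring might look like the sketch below. It is reconstructed from the docstring's rules only and is an assumption about, not a copy of, the code in common/metadata.py.

```python
def normalize_version_to_string(value):
    # None and bool pass through; bool must be checked before int,
    # because bool is a subclass of int in Python.
    if value is None or isinstance(value, bool):
        return value
    if isinstance(value, (int, float)):
        return str(value)
    if isinstance(value, list):
        return [normalize_version_to_string(v) for v in value]
    if isinstance(value, dict):
        return {k: normalize_version_to_string(v) for k, v in value.items()}
    # Strings and any other objects are returned unchanged.
    return value


examples = [
    normalize_version_to_string(3),
    normalize_version_to_string(True),
    normalize_version_to_string([1, "2", None]),
    normalize_version_to_string({"version": 1.0}),
]
```

The order of the checks matters: testing for `bool` after `int` would turn `True` into the string `"True"`.
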
drop_false = lambda path, key, value: bool(value)
tool_cleaned = remap(tool, visit=drop_false)
tool_cleaned = normalize_version_fields(
    tool_cleaned, ["version", "version[].version"]

Copilot AI Jan 22, 2026

Using both "version" and "version[].version" may result in redundant processing. The normalize_version_to_string function recursively processes lists and dictionaries, so normalizing "version" would already handle nested version fields within list items. Consider whether both field paths are necessary or whether "version" alone would suffice.

Suggested change
-     tool_cleaned, ["version", "version[].version"]
+     tool_cleaned, ["version"]
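
Whether the two paths are interchangeable depends on what else the list items contain. The sketch below assumes a recursive normalizer that follows the rules of the suggested docstring; the function name `normalize` and the `downloads` key are hypothetical and used only to show the difference.

```python
def normalize(value):
    """Recursive stand-in: stringify int/float, descend into lists and dicts."""
    if value is None or isinstance(value, bool):
        return value
    if isinstance(value, (int, float)):
        return str(value)
    if isinstance(value, list):
        return [normalize(v) for v in value]
    if isinstance(value, dict):
        return {k: normalize(v) for k, v in value.items()}
    return value


entries = [{"version": 1.2, "downloads": 10}]

# Path "version" alone: recursion converts every numeric value in each item.
broad = normalize(entries)

# Path "version[].version": only the nested version key is touched.
targeted = [dict(e, version=normalize(e["version"])) for e in entries]
```

Under these assumptions the recursive pass does cover the nested `version`, but it also converts sibling numeric fields such as `downloads`, while the targeted path leaves them as numbers.
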

@hmenager hmenager requested a review from mihai-sysbio January 22, 2026 15:07
Development

Successfully merging this pull request may close these issues.

Consistent field types in metadata formats

2 participants