feat(tools): add HuggingFace config converter#197
feat(tools): add HuggingFace config converter#197rishi-jat wants to merge 2 commits intomodelpack:mainfrom
Conversation
Signed-off-by: Rishi Jat <rishijat098@gmail.com>
Signed-off-by: Rishi Jat <rishijat098@gmail.com>
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a standardized Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces a new ArchitectureConfig to the model configuration, defining architecture-specific parameters such as the number of layers, hidden size, and attention heads. This involves updates to the JSON schema, corresponding Go structs, and new validation tests. Additionally, a Python script hf_to_arch.py is added to convert HuggingFace configurations to this new format, complete with comprehensive unit tests. The review suggests improving the Python script's command-line argument parsing by using argparse and refactoring the Python test script to utilize the unittest framework for enhanced robustness and maintainability.
| if len(sys.argv) != 2: | ||
| print(f"usage: {sys.argv[0]} <config.json>", file=sys.stderr) | ||
| sys.exit(1) | ||
|
|
||
| config_path = sys.argv[1] |
There was a problem hiding this comment.
For more robust and maintainable command-line argument parsing, consider using the argparse module from the standard library. It provides features like automatic help message generation (-h/--help) and is the standard for creating command-line interfaces in Python.
You will need to add import argparse at the top of the file.
| if len(sys.argv) != 2: | |
| print(f"usage: {sys.argv[0]} <config.json>", file=sys.stderr) | |
| sys.exit(1) | |
| config_path = sys.argv[1] | |
| parser = argparse.ArgumentParser(description=__doc__.strip()) | |
| parser.add_argument("config_path", metavar="config.json", help="Path to HuggingFace config.json") | |
| args = parser.parse_args() | |
| config_path = args.config_path |
| def main(): | ||
| test_valid_config() | ||
| test_missing_field() | ||
| test_invalid_json() | ||
| test_file_not_found() | ||
| test_invalid_field_type() | ||
| test_zero_value() | ||
| test_bool_value() | ||
| print("\nAll tests passed.") |
There was a problem hiding this comment.
While this test script is functional, structuring it with Python's standard unittest framework would make it more robust and align with common testing practices. unittest provides a test runner, test discovery, and a richer set of assertion methods (e.g., self.assertEqual(), self.assertIn()). This would also remove the need for a custom main function to run tests and manual print statements for test status.
For example, you could structure your tests in a class:
import unittest
# ... other imports
class HFToArchTests(unittest.TestCase):
def test_valid_config(self):
# ... test logic ...
self.assertEqual(exitcode, 0, f"...")
# ...
if __name__ == '__main__':
unittest.main()This approach is more maintainable and scalable as more tests are added.
There was a problem hiding this comment.
Pull request overview
Adds support for an architecture_config object in the model config schema/Go types and introduces a Python tool to convert a HuggingFace config.json into that architecture-config shape.
Changes:
- Add
tools/hf_to_arch.pyconverter (HFconfig.json→ architecture config JSON) - Extend JSON schema + Go spec types to include
architecture_config - Add schema validation tests and a Python test script for the converter
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| tools/hf_to_arch.py | New converter script that maps key HF fields into an architecture config JSON object |
| tools/hf_to_arch_test.py | New Python test script that shells out to the converter and checks common failure modes |
| specs-go/v1/config.go | Adds ArchitectureConfig to ModelConfig and defines the ArchitectureConfig struct |
| schema/config_test.go | Adds Go tests to validate schema acceptance/rejection of architecture_config |
| schema/config-schema.json | Adds architecture_config to ModelConfig and defines an ArchitectureConfig schema |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| "config": { | ||
| "paramSize": "8b", | ||
| "architecture_config": { | ||
| "type": "transformer", | ||
| "numLayers": 32, | ||
| "hiddenSize": 4096, | ||
| "numAttentionHeads": 32 | ||
| } |
There was a problem hiding this comment.
The tests and schema are using the new architecture_config field name. If the intent is to follow the existing lowerCamelCase JSON convention for config fields, consider renaming this to architectureConfig here (and in the schema + Go structs) to avoid introducing an inconsistent naming style.
| def convert_hf_config(hf_config: dict) -> dict: | ||
| """Convert HuggingFace config to architecture_config format.""" | ||
| arch_config = {"type": "transformer"} | ||
|
|
||
| for arch_key, hf_key in REQUIRED_MAPPINGS.items(): | ||
| if hf_key not in hf_config: | ||
| raise ValueError(f"missing required field: {hf_key}") | ||
| value = hf_config[hf_key] | ||
| if not isinstance(value, int) or isinstance(value, bool): |
There was a problem hiding this comment.
convert_hf_config assumes the parsed JSON is a dict/object. If config.json is a JSON array/string/etc, the current checks (e.g., hf_key not in hf_config) will behave incorrectly or raise confusing errors. Consider validating hf_config is a dict up front and raising a clear ValueError if not.
| exitcode, stdout, stderr = run_script("not valid json {") | ||
|
|
||
| assert exitcode != 0, "expected non-zero exit for invalid JSON" | ||
| assert "invalid JSON" in stderr.lower() or "json" in stderr.lower(), f"error should mention JSON: {stderr}" |
There was a problem hiding this comment.
This assertion lowercases stderr but checks for the mixed-case substring "invalid JSON", which can never match. Consider checking for "invalid json" (lowercase) or just relying on the existing "json" substring check.
| assert "invalid JSON" in stderr.lower() or "json" in stderr.lower(), f"error should mention JSON: {stderr}" | |
| assert "invalid json" in stderr.lower() or "json" in stderr.lower(), f"error should mention JSON: {stderr}" |
| def main(): | ||
| test_valid_config() | ||
| test_missing_field() | ||
| test_invalid_json() | ||
| test_file_not_found() | ||
| test_invalid_field_type() | ||
| test_zero_value() | ||
| test_bool_value() | ||
| print("\nAll tests passed.") | ||
|
|
||
|
|
||
| if __name__ == "__main__": | ||
| main() |
There was a problem hiding this comment.
These Python tests are not picked up by the repository's default make test (which currently runs go test ./... only). If these are meant to run in CI, consider integrating them into the test target/CI workflow (or converting to a Go test that shells out to the tool), otherwise the PR's "Tests included" claim may be misleading.
| "capabilities": { | ||
| "$ref": "#/$defs/ModelCapabilities" | ||
| }, | ||
| "architecture_config": { |
There was a problem hiding this comment.
The new property name architecture_config introduces snake_case into the schema, while other ModelConfig fields use lowerCamelCase (e.g., paramSize, docURL, diffIds). Consider renaming this to architectureConfig (and updating the Go struct tags/tests/tools accordingly) to keep JSON field naming consistent.
| "architecture_config": { | |
| "architectureConfig": { |
| Capabilities *ModelCapabilities `json:"capabilities,omitempty"` | ||
|
|
||
| // Architecture-specific configuration parameters | ||
| ArchitectureConfig *ArchitectureConfig `json:"architecture_config,omitempty"` |
There was a problem hiding this comment.
json:"architecture_config,omitempty" is inconsistent with the existing JSON naming in this file (e.g., paramSize, docURL, diffIds). If possible, align the JSON tag with the prevailing lowerCamelCase convention (e.g., architectureConfig) and update the schema/tests accordingly.
| ArchitectureConfig *ArchitectureConfig `json:"architecture_config,omitempty"` | |
| ArchitectureConfig *ArchitectureConfig `json:"architectureConfig,omitempty"` |
Adds tool to convert HuggingFace config.json to architecture_config format.
Usage:
Tests included.