Feature/harvesting metadata from a provided repository url via cloning#473
Conversation
Added clone utility functions for repository cloning with error handling and cleanup.
Added new command-line arguments for URL and token in harvest command.
Add temporary directory handling for cloning repositories and update token management.
Added an explicit logging shutdown step before clearing the HERMES caches. Without shutting down logging first, the `clean` command fails on Windows with: `An error occurred during execution of clean (Find details in './hermes.log')` "Original exception was: [WinError 32] The process cannot access the file because it is being used by another process: '.hermes\\audit.log'"
|
@sferenz |
|
|
||
| import argparse | ||
| import shutil | ||
| import logging |
There was a problem hiding this comment.
@Aidajafarbigloo Is it necessary for your changes to include logging? If not, please remove it everywhere.
There was a problem hiding this comment.
@sferenz Thank you for the comments.
When using hermes clean command on Windows, I face this error:
Run subcommand clean
Removing HERMES caches...
An error occurred during execution of clean (Find details in './hermes.log')
Error in the "hermes.log":
Original exception was: [WinError 32] The process cannot access the file because it is being used by another process: '.hermes\\audit.log'.
This happens because Windows does not allow deletion of a file that is still open by the current process. The audit.log file inside .hermes is held open by a logging file handler. When shutil.rmtree() attempts to remove the directory, it fails due to the open file handle. I'm using logging.shutdown() to ensure that all logging handlers are closed before the directory is deleted, it does not introduce new logging behavior.
There was a problem hiding this comment.
This was fixed before but somehow the fix got lost... Weird 🤔
There was a problem hiding this comment.
Well, not only once. But it seems more important to not have a log file in the working directory than having a properly cleaned .hermes cache.
you should be able to configure the path to the logfile, though.
|
|
||
| # ---------------- utilities ---------------- | ||
|
|
||
| def _normalize_clone_url(url: str) -> str: |
There was a problem hiding this comment.
Please provide a general comment for each function.
| @@ -0,0 +1,249 @@ | |||
| # SPDX-FileCopyrightText: 2026 OFFIS e.V. | |||
| import toml | ||
|
|
||
|
|
||
| def _load_config(config_path: str) -> dict: |
There was a problem hiding this comment.
Please ensure that every function has a comment
This script bulk-tests HERMES metadata harvesting across multiple repositories, checking for expected metadata files and generating a CSV report.
Add test repositories for bulk testing.
|
@zyzzyxdonta @led02 Could you please review the PR and let me know if something needs to be changed? |
zyzzyxdonta
left a comment
There was a problem hiding this comment.
I left some comments on the cloning procedure and the tests. I haven't looked at how it is integrated into the application yet.
There was a problem hiding this comment.
The tests are currently a standalone script but they should use the Python unittest library or Pytest just like all the other tests do. You rebuilt a lot of things that already exist in these libraries, e.g. output of messages, test fixtures, ...
| """ | ||
|
|
||
| import subprocess | ||
| import pandas as pd |
There was a problem hiding this comment.
Pandas is not part of the dependencies and I would prefer not to add it.
| # JSON file that stores previous yes/no answers per URL to detect regressions across runs. | ||
| state_file = "hermes_bulk_test_state.json" | ||
|
|
||
| # Load previous state (if any). |
There was a problem hiding this comment.
I don't think comparing against previous results makes sense here. Either the code does the right thing (tests succeed) or it doesn't (tests fail). There is no performance degradation over time.
| # Step 2: Run harvest for the repository. | ||
| # Capture stdout/stderr to include HERMES output as an error message when needed. | ||
| proc = subprocess.run( | ||
| ["hermes", "harvest", "--url", url, "--token", token], |
There was a problem hiding this comment.
I don't think it's a good idea to run the whole program like this. Especially, we shouldn't actually clone any real repositories. Doing this can easily lead to our tests degrading, e.g. because repositories change over time or are deleted.
Here are some things that could be tested instead:
- Git repo URLs are constructed correctly
- git command lines (that are passed to subprocess.run) are constructed correctly
I'm not averse to the idea of having end-to-end tests for the whole metadata extraction workflow. But I think it should be based on a dedicated test repo that should ideally be part of the tests/ directory and not cloned. It should be safe to assume that a git clone works if the command parameters are correct.
| Normalization rules: | ||
| - For SSH and HTTPS, append ".git" when missing (common, but not required by all hosts). | ||
| """ | ||
| s = str(url).strip() |
There was a problem hiding this comment.
More expressive variable names would be nice. stripped_url? Or just overwrite the url parameter with the new value so the unstripped value isn't accidentally used again.
| print(f"error: failed to remove temp dir {path!s} after {retries} attempts. " | ||
| f"Please remove it manually. (Often caused by antivirus or open handles.)") | ||
|
|
||
| def _move_or_copy(src: Path, dst: Path): |
There was a problem hiding this comment.
Why make things this complicated? We don't need atomic moves here.
| return | ||
|
|
||
| # Optimized clone succeeded | ||
| if dest_path.exists(): |
There was a problem hiding this comment.
There is no reason to run into this problem at all. Just create a new temporary directory for every clone attempt.
| *, | ||
| root_only: bool = False, | ||
| include_files: Sequence[str] | None = None, | ||
| verbose: bool = False, |
There was a problem hiding this comment.
Instead of this verbose flag and print() calls, the logging features should be used with appropriate log levels (info, warning, error, ...)
| def rmtree_with_retries(path: Path, retries: int = 6, initial_wait: float = 0.1): | ||
| """ | ||
| Recursive directory deletion with retries and read-only handling, for environments where temporary directories may be locked | ||
| or marked read-only (e.g., Windows, CI systems, antivirus interference). |
There was a problem hiding this comment.
I wouldn't call this "retries". git clone always writes the .git/ directory as read-only so the issue that something can not be removed will always occur. I think shutil.rmtree with the _clear_readonly handler should suffice 🤔
| return | ||
| except Exception as e: | ||
| print(f"warn: rmtree attempt {attempt} failed for {path!s}: {e!r}") | ||
| time.sleep(wait) |
There was a problem hiding this comment.
Something is seriously wrong if a sleep is required.
This feature branch introduces functionality for
harvesting metadata from a provided repository URL via cloning.Changes:
hermes harvestcommandhermes harvestcommand (for the plugin githublab)