Dedupe individual tracks during import based on stable identifier#6723
Dedupe individual tracks during import based on stable identifier#6723tommyschnabel wants to merge 8 commits into
Conversation
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #6723 +/- ##
==========================================
- Coverage 74.78% 74.78% -0.01%
==========================================
Files 163 163
Lines 20966 21049 +83
Branches 3302 3326 +24
==========================================
+ Hits 15680 15742 +62
- Misses 4529 4547 +18
- Partials 757 760 +3
🚀 New features to boost your workflow:
|
Move the import-time duplicate-track handling out of the `duplicates` plugin and into importer core, next to the existing duplicate machinery. Add the `import.duplicate_track_resolution` option (off by default). When enabled, album imports check each track against the library using `import.duplicate_keys.item` and resolve matches via the existing `import.duplicate_action`: - `skip` drops the already-imported tracks and imports the rest of the album (a fully-duplicate album is skipped); - `remove` removes the matching old library items; - `keep`/`merge` import everything unchanged; - `ask` prompts the session. The check runs before the autotag lookup so dropped tracks are excluded from the match. The `duplicates` plugin is left untouched. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Match track duplicates against all library items (including album members, not just singletons), so re-importing catches tracks already present in an existing album. When per-track resolution prunes some tracks of an album, suppress the album-level duplicate check so the remaining new tracks are still imported instead of the whole (partial) album being skipped. Add a dedicated `duplicate_track_action` option (inheriting `duplicate_action` when unset) with a new `fold` action that adds the remaining new tracks to the existing album they belong to, completing a partially-imported album. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Collapse the separate `fold` track duplicate action into `skip`: skipping per-track duplicates now folds the remaining new tracks into the existing album they belong to (falling back to a new album when the matches do not resolve to a single album). Removes the `fold` action, prompt option, and `FOLD` resolution. Show the per-track "Skipping duplicate track" message and the whole-album "Skipping album, all tracks are duplicates" message in the warning color. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
6ece05a to
41c25aa
Compare
|
@semohr this is ready for review |
|
Hi @tommyschnabel, thanks for this work! I would suggest to rebase as there has been a considerable amount of changes and see how refactored duplicates handling by introducing |
…_actions # Conflicts: # beets/importer/session.py # beets/importer/stages.py # beets/importer/tasks.py # beets/test/helper.py # docs/changelog.rst
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
@snejus got your changes integrated! Ready for a review 😄 |
…_actions # Conflicts: # docs/changelog.rst
|
@semohr ready for review again 😄 |
There was a problem hiding this comment.
Pull request overview
grug see PR add per-track duplicate handling during album import. goal is make repeat beet import more idempotent by matching tracks on stable key (like mb_trackid) and then skip/remove/keep/ask for only the dup tracks.
Changes:
- add
import.duplicate_track_resolution+import.duplicate_track_actionconfig, plus docs + changelog entry - add importer pipeline stage to detect/resolve per-track dupes before autotag lookup, with optional fold-into-existing-album behavior
- add tests for skip/remove/keep/ask flows and fold behavior
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| test/test_importer.py | add tests for per-track duplicate resolution behaviors |
| beets/ui/commands/import_/session.py | add interactive prompt for track-duplicate resolution in terminal session |
| beets/test/helper.py | extend ImportSessionFixture to simulate track-duplicate prompt answers |
| beets/importer/stages.py | implement per-track duplicate detection + resolution pipeline stage |
| beets/importer/tasks.py | factor duplicate-item deletion helper; support folding new tracks into existing album; remove track-dup items during manipulate stage |
| beets/importer/session.py | add ImportSession abstract hook for track-duplicate resolution + stage wiring |
| beets/config_default.yaml | add default config keys for new options |
| docs/reference/config.rst | document new config options and behavior notes |
| docs/changelog.rst | add changelog entry for new feature |
| existing = [item for matches in duplicates.values() for item in matches] | ||
| ui.print_("Old: " + summarize_items(existing, True)) | ||
| if config["import"]["duplicate_verbose_prompt"]: | ||
| for item in existing: | ||
| print(f" {item}") |
| ui.print_("New: " + summarize_items(list(duplicates), True)) | ||
| if config["import"]["duplicate_verbose_prompt"]: | ||
| for item in duplicates: | ||
| print(f" {item}") |
| Resolution = Enum("Resolution", "REMOVE SKIP KEEPBOTH MERGE") | ||
|
|
||
| default_resolution = "REMOVE" | ||
|
|
||
| def resolve_track_duplicates(self, task, duplicates): | ||
| return { | ||
| self.Resolution.SKIP: "s", | ||
| self.Resolution.KEEPBOTH: "k", | ||
| self.Resolution.REMOVE: "r", | ||
| }.get(self.default_resolution, "k") |
snejus
left a comment
There was a problem hiding this comment.
Well done! The first round of reviews :)
|
|
||
| choose_item = choose_match | ||
|
|
||
| Resolution = Enum("Resolution", "REMOVE SKIP KEEPBOTH MERGE") |
There was a problem hiding this comment.
Why are we introducing a new Enum here? Can't we reuse DuplicateAction?
| """Decide what to do with album tracks already in the library.""" | ||
| log.warning("Some tracks are already in the library!") | ||
|
|
||
| if config["import"]["quiet"]: |
There was a problem hiding this comment.
Cast these config options to bool:
| if config["import"]["quiet"]: | |
| if config["import"]["quiet"].get(bool): |
|
|
||
| return action | ||
|
|
||
| def resolve_track_duplicates(self, task, duplicates) -> str: |
| A typical configuration for completing partially-imported albums while | ||
| autotagging looks like this: | ||
|
|
||
| :: |
There was a problem hiding this comment.
Use
| :: | |
| .. code-block:: yaml |
for syntax highlighting.
| duplicate_action: ask # whole-album duplicates | ||
| duplicate_track_action: skip # per-track duplicates: fold new tracks into the existing album | ||
| duplicate_keys: | ||
| item: mb_trackid # match on a stable id (recommended when autotagging) |
There was a problem hiding this comment.
Users will find it helpful to see the default values as well!
|
|
||
| # Optionally drop or replace album tracks that already exist in | ||
| # the library before the autotag lookup runs. | ||
| stages += [stagefuncs.resolve_track_duplicates(self)] |
There was a problem hiding this comment.
How is this expected to interact with the default threaded: yes mode?
Looking at the pipeline ordering, I wonder whether this stage may check a second task before the first one reaches add(). For example, if two one-track albums with the same duplicate keys are imported in one invocation, could both pass this check in threaded mode while the second one would be skipped in sequential mode? I'm not sufficiently familiar with the pipeline internals to tell whether this is possible, so could you clarify what guarantees we have here?
Have you tested such scenario yourself?
| Uses ``import.duplicate_track_action`` when set, otherwise falls back to | ||
| ``import.duplicate_action``. | ||
| """ | ||
| choices = { |
There was a problem hiding this comment.
Can we re-use DuplicateAction here? See how it used in such cases:
def get_duplicate_action(
self, task: ImportTask, found_duplicates: list[AnyLibModel]
) -> DuplicateAction:
"""Get the configured duplicate action."""
choice = config["import"]["duplicate_action"].as_choice(
DuplicateAction.choices()
)
log.debug("default action for duplicates: {}", choice)
return DuplicateAction(choice) # type: ignore[call-arg]Then, in the logic I want to see conditionals like action is DuplicationAction.SKIP instead of action == "s".
| task.duplicate_tracks_resolved = True | ||
| # Fold the remaining new tracks into the existing album, if the | ||
| # matched duplicates all belong to a single one. | ||
| album_ids = { |
There was a problem hiding this comment.
Could filtering out matches with album_id is None affect how the fold target is selected here?
For example, if one duplicate belongs to an album and another duplicate is a singleton, would album_ids contain only the first duplicate's album ID and cause the remaining tracks to be folded into that album? The documentation says that new tracks are imported as their own album when the matching tracks do not all belong to a single album, so I wonder whether singleton matches should prevent folding in this case. Is this scenario handled elsewhere?
I would like to see a test covering the following situation:
- Incoming album: tracks A, B, and C.
- Existing track A belongs to album X.
- Existing track B is a singleton.
- Both A and B match incoming tracks.
Would you mind adding this case and making the intended outcome for track C explicit? I think this would clarify whether it should be folded into album X or imported as its own album.
|
|
||
| default_resolution = "REMOVE" | ||
|
|
||
| def resolve_track_duplicates(self, task, duplicates): |
There was a problem hiding this comment.
Actually, I doubt that we need this any more. You should be able to control this using the configuration. See the commit where I removed it for good: ca3fdb7
You can step through commits under this PR to see how I removed the need to redefine anything to do with duplicates in helper.py: https://github.com/beetbox/beets/pull/6702/commits
| def choose_item(self, task: SingletonImportTask) -> TrackMatch | Action: | ||
| raise NotImplementedError | ||
|
|
||
| def resolve_track_duplicates(self, task: ImportTask, duplicates) -> str: |
There was a problem hiding this comment.
Presumably this is going to be a list of Items?
| def resolve_track_duplicates(self, task: ImportTask, duplicates) -> str: | |
| def resolve_track_duplicates(self, task: ImportTask, duplicates: list[Item]) -> str: |
Description
This PR adds the
duplicate_track_resolution(bool) andduplicate_track_action(skip,remove,keep/merge,ask) config options toimportto enable deduplication of individual tracks when importing.The recommended config for this is:
With the above config, any duplicate imports of tracks will be dropped, and if the whole album has been imported before, the album import itself will be dropped. If part of an album is deduped, the new tracks will be folded into the existing album. This config allows for idempotency on any repeat
beet importexecutions.Copied from the new docs, here's what each option does:
The
duplicate_keysconfig option is what's used to dedupe, so set it to something stable likemb_trackid.In the previous PR, @snejus suggested adding this as an addition to the album deduplication that already exists. This implementation is separate from the album dedupe logic because the use-cases seemed incompatible.
The album dedupe skips/merges/removes based on an album already being imported, but the use-case I wanted was a deeper track-based mechanism. Bolting this onto the album dedupe produced some really weird results, and I wasn't able to achieve parity with my last impl with the available config options. This is a better, more flexible, approach.
To Do
docs/to describe it.)docs/changelog.rstto the bottom of one of the lists near the top of the document.)