Fix tc class collision with retry and class ID persistence#179
Fix tc class collision with retry and class ID persistence#179sjmiller609 merged 3 commits intomainfrom
Conversation
f69eed2 to
b942731
Compare
4af990a to
ae99868
Compare
ae99868 to
06e4b98
Compare
Replace tc class replace with tc class add + retry loop to properly handle class ID collisions. On File exists error, probe the next class ID (up to 5 attempts). Persist the actual class ID assigned to each allocation so removal and cleanup use the correct ID. Changes: - addVMClass: retry loop with linear probing on collision - Allocation.ClassID: persisted to disk, loaded in deriveAllocation - removeVMClass/deleteTAPDevice: use stored class ID with fallback - CleanupOrphanedClasses: considers stored class IDs from allocations - New metric: hypeman_network_tc_class_collisions_total (attempt label)
06e4b98 to
ce541ca
Compare
deriveClassIDVal and the probing loop now skip both 0 (invalid) and 1 (root class 1:1). The wrap-around guard checks for 0/1 after uint16 increment instead of the unreachable > 0xFFFF comparison.
hiroTamada
left a comment
There was a problem hiding this comment.
Nice fix — clean linear probing with good backwards compat and observability. Two minor nits inline.
|
|
||
| // Build set of class IDs that belong to existing TAP devices | ||
| // Build set of class IDs that belong to existing TAP devices. | ||
| // Include both derived class IDs and stored class IDs from allocations |
There was a problem hiding this comment.
nit: swallowing the error here means a transient ListAllocations failure would return an empty map, making every class look orphaned and get deleted. might be worth logging a warning and returning early on error.
| output, err := cmd.CombinedOutput() | ||
| if err != nil { | ||
| // Check for "File exists" collision (exit status 2). | ||
| var exitErr *exec.ExitError |
There was a problem hiding this comment.
nit: the goto classAdded is fine and clear, but this loop could also be a small extracted function that returns (string, error) — matter of taste.
Avoids deleting valid probed classes when a transient ListAllocations error returns an empty set of stored class IDs.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
| // Persist assigned tc class ID so removal uses the correct ID after collisions. | ||
| if classID != "" { | ||
| m.saveClassID(req.InstanceID, classID) | ||
| } |
There was a problem hiding this comment.
Stale classID file not cleared when rate limiting removed
Medium Severity
Both CreateAllocation and RecreateAllocation only call saveClassID when classID != "". If an instance previously had upload rate limiting (classID file saved to disk) and is later recreated without it (e.g., RecreateAllocation with uploadBps = 0), the old classid file persists with a stale value. deriveAllocation then loads this stale classID into alloc.ClassID, causing ReleaseAllocation to call deleteTAPDevice with the wrong classID — potentially deleting another VM's HTB class.


Summary
Replaces
tc class replacewithtc class add+ retry loop to properly detect and handle class ID hash collisions.Problem: Two different TAP names can hash to the same 16-bit class ID via
deriveClassID. Usingtc class replacesilently overwrites the existing class, breaking rate limiting for the original VM.Solution:
addVMClassnow usestc class addand retries with linear probing on "File exists" errors (up to 5 attempts)classidfile in the instance directoryremoveVMClassanddeleteTAPDeviceuse the stored class ID (falling back toderiveClassIDfor backwards compatibility with old allocations)CleanupOrphanedClassesconsiders both derived and stored class IDs when determining which classes are validhypeman_network_tc_class_collisions_totalwithattemptlabel (initial/retry) tracks collision frequencyChanges
lib/network/bridge_linux.go— retry loop inaddVMClass, classID threading throughcreateTAPDevice/deleteTAPDevice/removeVMClass/CleanupOrphanedClasseslib/network/bridge_darwin.go— updated stubs to match new signatureslib/network/types.go—ClassIDfield onAllocationlib/network/allocate.go— persist/load classID, pass stored ID on releaselib/network/derive.go— load stored classID inderiveAllocationlib/network/metrics.go—hypeman_network_tc_class_collisions_totalcounterBackwards compatibility
Old allocations without a
classidfile continue to work —removeVMClassandCleanupOrphanedClassesfall back toderiveClassID.Note
Medium Risk
Touches Linux traffic-control setup/teardown and cleanup logic; mistakes could leak or delete the wrong
tcclasses and impact VM rate limiting. Changes are localized but affect host networking behavior across restarts.Overview
Fixes a Linux upload-shaping edge case where two TAPs could hash to the same HTB class ID and silently clobber each other by switching to
tc class addwith a small collision-retry probe loop inaddVMClass.Threads the actual assigned class ID through TAP creation and persists it to an instance
classidfile, addsAllocation.ClassID, uses it during release/teardown, and updates orphaned class cleanup to treat stored (probed) IDs as valid. Adds a new metric counterhypeman_network_tc_class_collisions_total(labeledinitial/retry) to track collision frequency, and updates macOS stubs for the new function signatures.Written by Cursor Bugbot for commit 9719ac1. This will update automatically on new commits. Configure here.