Implement Runtime NVMe Instance Storage Discovery Using AWS EBS Symlinks#396

Open
neddp wants to merge 9 commits into main from fix-nvme-instance-storage-discovery

Conversation

@neddp
Member

@neddp neddp commented Feb 2, 2026

Problem

On AWS Nitro-based instances with NVMe devices, the kernel's PCIe enumeration order is non-deterministic. This means:

  • /dev/nvme0n1 could be the root EBS volume OR instance storage
  • /dev/nvme1n1 could be instance storage OR the root EBS volume
  • The order varies between boots and instance types
  • There is no guaranteed ordering

Solution

Implemented runtime discovery to reliably identify instance storage by excluding EBS volumes.

Discovery Algorithm

  1. Glob all NVMe devices: /dev/nvme*n1
  2. Glob EBS symlinks: /dev/disk/by-id/nvme-Amazon_Elastic_Block_Store_*
  3. Resolve each symlink to its target device
  4. Subtract EBS devices from all NVMe devices = instance storage
  5. Validate count matches CPI expectations
  6. Partition only the discovered instance storage devices
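The steps above can be sketched roughly as follows. This is a minimal illustration rather than the PR's actual code: the function name `discoverInstanceStorage`, its parameters, and the injected `globFn`/`resolveFn` seams are assumptions made so the set subtraction can be run without real devices. Partitioning (step 6) is left out.

```go
package main

import (
	"fmt"
	"path/filepath"
	"sort"
)

// discoverInstanceStorage returns the NVMe devices that are not the target of
// any EBS symlink. globFn and resolveFn are hypothetical seams (not the
// agent's real API) so the logic can be exercised without real hardware.
func discoverInstanceStorage(
	globFn func(pattern string) ([]string, error),
	resolveFn func(path string) (string, error),
	nvmePattern, ebsPattern string,
	expectedCount int,
) ([]string, error) {
	// Steps 1 and 2: glob all NVMe devices and all EBS symlinks.
	all, err := globFn(nvmePattern)
	if err != nil {
		return nil, fmt.Errorf("globbing NVMe devices: %w", err)
	}
	links, err := globFn(ebsPattern)
	if err != nil {
		return nil, fmt.Errorf("globbing EBS symlinks: %w", err)
	}

	// Step 3: resolve each symlink to its target device.
	ebs := map[string]bool{}
	for _, link := range links {
		target, err := resolveFn(link)
		if err != nil {
			return nil, fmt.Errorf("resolving %s: %w", link, err)
		}
		ebs[target] = true
	}

	// Step 4: everything that is not EBS is treated as instance storage.
	var instance []string
	for _, dev := range all {
		if !ebs[dev] {
			instance = append(instance, dev)
		}
	}
	sort.Strings(instance)

	// Step 5: the count must match what the CPI reported.
	if len(instance) != expectedCount {
		return nil, fmt.Errorf("expected %d instance storage devices, found %d: %v",
			expectedCount, len(instance), instance)
	}
	return instance, nil
}

func main() {
	devices, err := discoverInstanceStorage(
		filepath.Glob,
		filepath.EvalSymlinks,
		"/dev/nvme*n1",
		"/dev/disk/by-id/nvme-Amazon_Elastic_Block_Store_*",
		1, // assumed CPI expectation for this example
	)
	fmt.Println(devices, err)
}
```

The symlink glob can also match partition links, which resolve to partition nodes rather than whole devices. Those targets never appear in the `/dev/nvme*n1` set, so they do not affect the subtraction.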

Why EBS Symlinks Are Reliable

AWS automatically creates persistent symlinks for all EBS volumes via udev rules:

/dev/disk/by-id/nvme-Amazon_Elastic_Block_Store_vol{volume_id}

Backwards Compatibility

Non-NVMe instances: No changes to behavior

  • Traditional Xen instances (/dev/xvdb, /dev/sdb) use CPI paths directly
  • Paravirtual instances work as before

This must be merged together with the CPI changes - cloudfoundry/bosh-aws-cpi-release#196


Pair @Ivaylogi98

Contributor

@rkoster rkoster left a comment

In general I would have expected this logic to go into the https://github.com/cloudfoundry/bosh-agent/tree/main/infrastructure/devicepathresolver package.

Comment thread platform/linux_platform.go Outdated
p.logger.Debug(logTag, "Found NVMe devices: %v", allNvmeDevices)

// Identify EBS volumes via symlinks
ebsSymlinks, err := p.fs.Glob("/dev/disk/by-id/nvme-Amazon_Elastic_Block_Store_*")
Contributor

This should somehow be passed in via the agent config in the stemcell builder, because it is IaaS specific.

Created a new component instanceStorageResolver for the linux platform. It has 3 variants:

  • autoDetectingInstanceStorageResolver - used by the linux platform; it detects which storage resolver to use
  • awsNVMeInstanceStorageResolver - detects which devices are instance storage; contains the new logic
  • identityInstanceStorageResolver - simple resolver for the traditional naming used before this change
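One way the three variants could fit together is sketched below. The interface signature, the prefix check used for auto-detection, and the `stubNVMeResolver` stand-in are assumptions for illustration; only the method name `DiscoverInstanceStorage` and the three roles come from this thread.

```go
package main

import (
	"fmt"
	"strings"
)

// InstanceStorageResolver maps the device paths reported by the CPI to the
// paths that should be partitioned. The signature is an assumption.
type InstanceStorageResolver interface {
	DiscoverInstanceStorage(cpiPaths []string) ([]string, error)
}

// identityResolver returns the CPI paths unchanged (traditional naming).
type identityResolver struct{}

func (identityResolver) DiscoverInstanceStorage(cpiPaths []string) ([]string, error) {
	return cpiPaths, nil
}

// stubNVMeResolver stands in for the NVMe-specific discovery logic and simply
// returns a fixed list of devices.
type stubNVMeResolver struct{ discovered []string }

func (r stubNVMeResolver) DiscoverInstanceStorage([]string) ([]string, error) {
	return r.discovered, nil
}

// autoDetectingResolver delegates to nvme when any CPI path looks like an
// NVMe device, and to identity otherwise (hypothetical detection rule).
type autoDetectingResolver struct {
	nvme, identity InstanceStorageResolver
}

func (r autoDetectingResolver) DiscoverInstanceStorage(cpiPaths []string) ([]string, error) {
	for _, p := range cpiPaths {
		if strings.HasPrefix(p, "/dev/nvme") {
			return r.nvme.DiscoverInstanceStorage(cpiPaths)
		}
	}
	return r.identity.DiscoverInstanceStorage(cpiPaths)
}

func main() {
	r := autoDetectingResolver{
		nvme:     stubNVMeResolver{discovered: []string{"/dev/nvme1n1"}},
		identity: identityResolver{},
	}
	fmt.Println(r.DiscoverInstanceStorage([]string{"/dev/xvdb"}))
	fmt.Println(r.DiscoverInstanceStorage([]string{"/dev/nvme0n1"}))
}
```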

@github-project-automation github-project-automation Bot moved this from Inbox to Waiting for Changes | Open for Contribution in Foundational Infrastructure Working Group Feb 3, 2026
@neddp
Member Author

neddp commented Feb 3, 2026

In general I would have expected this logic to go into the https://github.com/cloudfoundry/bosh-agent/tree/main/infrastructure/devicepathresolver package.

Thank you for the review! That was a big oversight on my end; I'll look into it.

@rkoster
Contributor

rkoster commented Feb 3, 2026

No worries 🙂

@beyhan
Member

beyhan commented Feb 5, 2026

We discussed this during the FI WG meeting, and this has to rely on the stemcell agent settings and the agent strategy for disk handling.

@neddp neddp requested a review from rkoster February 9, 2026 14:27
@neddp neddp changed the title from "Implement Runtime NVMe Instance Storage Discovery Using EBS Symlinks" to "Implement Runtime NVMe Instance Storage Discovery Using AWS EBS Symlinks" Feb 9, 2026
@rkoster
Contributor

rkoster commented Feb 12, 2026

As discussed during the working group meeting, focus is now on validating: cloudfoundry/bosh-aws-cpi-release#196 (comment)

@rkoster
Contributor

rkoster commented Feb 19, 2026

As per: cloudfoundry/bosh-aws-cpi-release#196 (comment) this change is still needed. Please continue reviewing.

@rkoster rkoster requested review from a team and ramonskie and removed request for a team February 19, 2026 15:54
Contributor

@rkoster rkoster left a comment

Still think this PR could be done in an IaaS agnostic way.


type awsNVMeInstanceStorageResolver struct {
fs boshsys.FileSystem
devicePathResolver DevicePathResolver
Contributor

The devicePathResolver field is stored in the struct but never used anywhere in the awsNVMeInstanceStorageResolver implementation. The DiscoverInstanceStorage method only uses fs for globbing and symlink resolution.

it was reworked in bb98869

nvmeDevicePattern string,
) InstanceStorageResolver {
if ebsSymlinkPattern == "" {
ebsSymlinkPattern = "/dev/disk/by-id/nvme-Amazon_Elastic_Block_Store_*"
Contributor

I would still like this to be passed in as configuration, so that the strategy can be iaas agnostic. This pattern might some day be re-usable.

Reworked in bb98869
PR in stemcell builder to add patterns to config: cloudfoundry/bosh-linux-stemcell-builder#592

neddp and others added 3 commits February 25, 2026 14:06
* Refactor instance storage discovery into configurable component

Implement auto-detection for instance storage disk type

* Fix windows tests

* Fix windows tests (but for real this time)
@neddp neddp force-pushed the fix-nvme-instance-storage-discovery branch from e7d00b4 to dcd857a February 25, 2026 12:09
@rkoster
Contributor

rkoster commented Mar 26, 2026

@neddp could you take a look at these failing unit tests?

@neddp
Member Author

neddp commented Mar 26, 2026

Hi @rkoster,

We still haven't had the time to test the changes on an actual deployment. I will move the PR to draft until we can confirm everything is working fine.

We'll address the tests as well.

@neddp neddp marked this pull request as draft March 26, 2026 13:04
* Make implementation iaas-agnostic

* Rename storage resolver files

* Fix tests

* Remove instance storage resolver

* Don't use the aws pattern as default

* Refactor NVMe instance storage discovery and remove unused symlink patterns

* Enhance NVMe instance storage discovery with managed volume pattern support

* Fix unit tests

* Don't run windows unit tests when not supported

* Simplify FakeDevicePathResolver by removing unused fields and methods

* Wait for udev to settle before resolving EBS symlinks

* Add debug logs

* Import udev and add comment about why it's needed
@coderabbitai

coderabbitai Bot commented Apr 30, 2026

Warning

Rate limit exceeded

@neddp has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 55 minutes and 49 seconds before requesting another review.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: eb1aa006-4e73-4c0f-ba74-675b6f862409

📥 Commits

Reviewing files that changed from the base of the PR and between 9491b7e and 834ec2a.

📒 Files selected for processing (1)
  • infrastructure/devicepathresolver/symlink_device_resolver_test.go

Walkthrough

Adds a new SymlinkDeviceResolver for globbing and resolving device symlinks and coordinating udev trigger/settle. Modifies FakeDevicePathResolver to record all diskSettings calls in a slice; tests updated from exact equality to containment/collection assertions. Refactors LinuxPlatform.SetupRawEphemeralDisks to optionally discover NVMe instance-storage devices via the symlink resolver (with configurable patterns) and updates its constructor signature and options. Provider wiring injects the new resolver into Linux platform construction; tests are extended accordingly.
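The change to the fake described in the walkthrough might look something like the sketch below. The `diskSettings` type is a simplified stand-in for the agent's settings type, and the `Calls` field and `wasCalledWith` helper are hypothetical names. The point is only to show why tests move from equality on a single recorded value to containment over a slice once the resolver is invoked more than once.

```go
package main

import "fmt"

// diskSettings is a simplified stand-in for the agent's disk settings type.
type diskSettings struct{ Path string }

// fakeResolver records every diskSettings it is asked to resolve, instead of
// overwriting a single "last call" field.
type fakeResolver struct {
	Calls []diskSettings
}

// GetRealDevicePath is loosely modelled on the resolver interface; it echoes
// the requested path and reports no timeout.
func (f *fakeResolver) GetRealDevicePath(s diskSettings) (string, bool, error) {
	f.Calls = append(f.Calls, s)
	return s.Path, false, nil
}

// wasCalledWith is a containment check over all recorded calls.
func (f *fakeResolver) wasCalledWith(path string) bool {
	for _, c := range f.Calls {
		if c.Path == path {
			return true
		}
	}
	return false
}

func main() {
	f := &fakeResolver{}
	f.GetRealDevicePath(diskSettings{Path: "/dev/nvme0n1"})
	f.GetRealDevicePath(diskSettings{Path: "/dev/nvme1n1"})
	fmt.Println(len(f.Calls), f.wasCalledWith("/dev/nvme0n1"), f.wasCalledWith("/dev/xvdb"))
}
```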

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and specifically describes the main change: implementing runtime NVMe instance storage discovery using AWS EBS symlinks, which matches the core functionality added throughout the changeset. |
| Description check | ✅ Passed | The description is directly related to the changeset, detailing the problem (non-deterministic PCIe enumeration on AWS Nitro instances), the implemented solution (runtime discovery algorithm), and backwards compatibility considerations. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@neddp
Member Author

neddp commented Apr 30, 2026

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Apr 30, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@neddp neddp marked this pull request as ready for review April 30, 2026 12:33

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@infrastructure/devicepathresolver/symlink_device_resolver_test.go`:
- Around line 72-92: The test currently expects ResolveSymlinksToDevices to
silently skip unresolved symlinks; instead modify the resolver
(ResolveSymlinksToDevices in the symlink device resolver) to treat any existing
symlink that cannot be resolved as an error (fail-closed) and return that error
to the caller; update the symlink_device_resolver_test.go case that created an
unresolved "/dev/disk/by-id/..." entry so it now asserts that
ResolveSymlinksToDevices returns a non-nil error (and no successful mapping)
rather than skipping the broken symlink. Ensure the change surfaces the
underlying readlink/resolve error from the resolver.

In `@platform/linux_platform.go`:
- Around line 811-825: The current logic gates NVMe runtime discovery on
CPI-reported path prefixes via hasNVMeDevices(devices), which can skip NVMe
resolution when CPI reports aliases; change the condition so that if a managed
volume pattern (p.options.InstanceStorageManagedVolumePattern != "") and the
symlink resolver (p.symlinkDeviceResolver != nil) are present, always call
p.discoverNVMeInstanceStorage(devices) instead of falling back to
p.discoverIdentityInstanceStorage(devices), removing the hasNVMeDevices(devices)
check; keep references to discoverNVMeInstanceStorage,
discoverIdentityInstanceStorage, hasNVMeDevices,
p.options.InstanceStorageManagedVolumePattern, and p.symlinkDeviceResolver to
locate and update the conditional.
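The suggested condition can be reduced to a small predicate, sketched below. The `options` struct and the `useNVMeDiscovery` helper are illustrative names only; the field name `InstanceStorageManagedVolumePattern` is taken from the comment above.

```go
package main

import "fmt"

// options mirrors only the field named in the review comment.
type options struct {
	InstanceStorageManagedVolumePattern string
}

// useNVMeDiscovery reports whether runtime NVMe discovery should run. Following
// the suggestion, it depends only on configuration and on a symlink resolver
// being available, not on the path prefixes reported by the CPI.
func useNVMeDiscovery(opts options, hasSymlinkResolver bool) bool {
	return opts.InstanceStorageManagedVolumePattern != "" && hasSymlinkResolver
}

func main() {
	configured := options{
		InstanceStorageManagedVolumePattern: "/dev/disk/by-id/nvme-Amazon_Elastic_Block_Store_*",
	}
	fmt.Println(useNVMeDiscovery(configured, true))  // discovery path
	fmt.Println(useNVMeDiscovery(options{}, true))   // identity path: no pattern configured
	fmt.Println(useNVMeDiscovery(configured, false)) // identity path: no resolver injected
}
```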
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: ed13b328-8e26-46ab-81e0-b8bed7f74223

📥 Commits

Reviewing files that changed from the base of the PR and between a47020a and bb98869.

📒 Files selected for processing (10)
  • agent/bootstrap_test.go
  • infrastructure/devicepathresolver/fakes/fake_device_path_resolver.go
  • infrastructure/devicepathresolver/fallback_device_path_resolver_test.go
  • infrastructure/devicepathresolver/scsi_device_path_resolver_test.go
  • infrastructure/devicepathresolver/symlink_device_resolver.go
  • infrastructure/devicepathresolver/symlink_device_resolver_test.go
  • infrastructure/devicepathresolver/virtio_device_path_resolver_test.go
  • platform/linux_platform.go
  • platform/linux_platform_test.go
  • platform/provider.go

Comment thread infrastructure/devicepathresolver/symlink_device_resolver_test.go Outdated
Comment thread platform/linux_platform.go
@neddp neddp requested a review from rkoster April 30, 2026 12:54
Comment thread infrastructure/devicepathresolver/symlink_device_resolver.go Outdated
Co-authored-by: Ivaylo Ivanov <ivaylogi98@gmail.com>
Comment thread infrastructure/devicepathresolver/symlink_device_resolver_test.go Outdated

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@infrastructure/devicepathresolver/symlink_device_resolver_test.go`:
- Around line 88-90: The test file symlink_device_resolver_test.go has a
formatting/indentation issue around the assertion block that calls
resolver.ResolveSymlinksToDevices and checks the error; run goimports (or gofmt)
on the file to normalize imports and indentation so the block with
ResolveSymlinksToDevices, Expect(err).To(HaveOccurred()), and
Expect(err.Error()).To(ContainSubstring("nvme-invalid")) is properly formatted.
Ensure the file compiles and lints cleanly after running the formatter.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 9d923cdb-2fa3-4c19-bc3c-66c23b259996

📥 Commits

Reviewing files that changed from the base of the PR and between bb98869 and 9491b7e.

📒 Files selected for processing (4)
  • infrastructure/devicepathresolver/symlink_device_resolver.go
  • infrastructure/devicepathresolver/symlink_device_resolver_test.go
  • platform/linux_platform.go
  • platform/linux_platform_test.go

Comment thread infrastructure/devicepathresolver/symlink_device_resolver_test.go Outdated
@beyhan beyhan moved this from Waiting for Changes | Open for Contribution to Pending Review | Discussion in Foundational Infrastructure Working Group Apr 30, 2026
Labels

None yet

Projects

Status: Pending Review | Discussion

4 participants