Cross-load packed weight cache reuse for XNNPACK (#19988)

Open: doggeral wants to merge 1 commit into pytorch:main from doggeral:export-D106717093

@doggeral doggeral commented Jun 3, 2026

Summary:

Add cross-load reuse + multi-PTE safety to the file-backed packed weight cache (D106673663). The first PTE in a session calls save_packed_index() to append a trailer; subsequent process launches mmap the file and pre-populate name_to_packed_data_metadata_ so look_up() hits for every saved weight and xnn_create_runtime skips packing entirely.

Cache file format

[packed data regions]                          (written by reserve_space)
[index entries]                                (written by save_packed_index)
  each: name_len(4B) | name(N) | file_offset(8B) | data_size(8B)
[footer: 20 bytes]
  index_start(8B) | entry_count(4B) | magic "XPWC"(4B) | version(4B)
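
To make the footer layout concrete, here is a minimal sketch of encoding and validating the 20-byte footer. It is not the code from this PR: the `CacheFooter` struct, the helper names, and the little-endian byte order are all assumptions made for illustration, since the description does not state how integers are serialized.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical in-memory view of the footer fields listed above.
struct CacheFooter {
  uint64_t index_start;  // file offset where the index entries begin
  uint32_t entry_count;  // number of index entries
  uint32_t version;      // format version
};

constexpr size_t kFooterSize = 20;  // 8 + 4 + 4 + 4 bytes
constexpr char kMagic[4] = {'X', 'P', 'W', 'C'};

// Assumption: integers are stored little-endian.
template <typename T>
void put_le(uint8_t* dst, T value) {
  for (size_t i = 0; i < sizeof(T); ++i) {
    dst[i] = static_cast<uint8_t>(value >> (8 * i));
  }
}

template <typename T>
T get_le(const uint8_t* src) {
  T value = 0;
  for (size_t i = 0; i < sizeof(T); ++i) {
    value |= static_cast<T>(src[i]) << (8 * i);
  }
  return value;
}

// Serializes: index_start | entry_count | magic | version.
std::array<uint8_t, kFooterSize> encode_footer(const CacheFooter& f) {
  std::array<uint8_t, kFooterSize> out{};
  put_le<uint64_t>(out.data(), f.index_start);
  put_le<uint32_t>(out.data() + 8, f.entry_count);
  std::memcpy(out.data() + 12, kMagic, sizeof(kMagic));
  put_le<uint32_t>(out.data() + 16, f.version);
  return out;
}

// Reads the last 20 bytes of the file image. Returns false (leaving *out
// untouched) if the file is too short, the magic does not match, or the
// index would start inside the footer.
bool decode_footer(const uint8_t* file, size_t file_size, CacheFooter* out) {
  if (file_size < kFooterSize) return false;
  const uint8_t* p = file + file_size - kFooterSize;
  if (std::memcmp(p + 12, kMagic, sizeof(kMagic)) != 0) return false;
  const uint64_t index_start = get_le<uint64_t>(p);
  if (index_start > file_size - kFooterSize) return false;
  out->index_start = index_start;
  out->entry_count = get_le<uint32_t>(p + 8);
  out->version = get_le<uint32_t>(p + 16);
  return true;
}
```

Placing the footer at the end of the file lets a reader find the index with one fixed-size read from the tail, and a failed magic check is a cheap signal to fall back to repacking.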

Lifecycle invariants

  • cache_loaded_ gate: load_packed_cache() runs at most once per process per path. Subsequent PTE inits for the same path reopen the write fd without re-reading the trailer.
  • from_load flag: persistent entries (loaded from trailer or promoted on save) skip delete_packed_data cleanup. This keeps the mmap region and metadata alive across PTE unload/reload, so the next init hits the cache instead of repacking. Without this, every PTE destroy/recreate cycle appended a fresh copy to the file (~450 MB per cycle).
  • No-op save short-circuit: save_packed_index returns early when no new reserve_space happened since the last save, avoiding the mtime churn that previously made the cache file look modified on every model load.
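
The three invariants above can be modeled with a small in-memory class. This is a hedged sketch rather than the PR's implementation: `PackedCacheModel`, `PackedEntry`, `dirty_`, and `trailer_writes_` are hypothetical, the method signatures are simplified, and no file I/O happens. Only the member names that appear in the description (`cache_loaded_`, `from_load`, `reserve_space`, `save_packed_index`, `delete_packed_data`, `look_up`) are taken from the source.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical metadata record for one packed weight.
struct PackedEntry {
  size_t file_offset;
  size_t data_size;
  bool from_load;  // true once the entry is persistent
};

class PackedCacheModel {
 public:
  // Gate: the trailer is read at most once. Returns true only on the call
  // that would actually read it.
  bool load_packed_cache() {
    if (cache_loaded_) return false;
    cache_loaded_ = true;
    return true;
  }

  // Appends a new region and marks the index as needing a save.
  void reserve_space(const std::string& name, size_t size) {
    entries_[name] = PackedEntry{end_of_data_, size, false};
    end_of_data_ += size;
    dirty_ = true;
  }

  // No-op short-circuit: returns false without "writing" when nothing was
  // reserved since the last save. Saved entries are promoted to persistent.
  bool save_packed_index() {
    if (!dirty_) return false;
    for (auto& kv : entries_) kv.second.from_load = true;
    dirty_ = false;
    ++trailer_writes_;
    return true;
  }

  // Persistent entries survive cleanup; returns true if the entry was dropped.
  bool delete_packed_data(const std::string& name) {
    auto it = entries_.find(name);
    if (it == entries_.end() || it->second.from_load) return false;
    entries_.erase(it);
    return true;
  }

  bool look_up(const std::string& name) const {
    return entries_.count(name) != 0;
  }

  int trailer_writes() const { return trailer_writes_; }

 private:
  std::unordered_map<std::string, PackedEntry> entries_;
  size_t end_of_data_ = 0;
  bool cache_loaded_ = false;
  bool dirty_ = false;
  int trailer_writes_ = 0;
};
```

In this model, unloading and reloading a PTE after a save leaves the entry in place, so the next `look_up` hits instead of triggering another `reserve_space`, which is the behavior the `from_load` bullet describes.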

Multi-PTE behavior

  • Multiple PTEs (or methods that don't share weights) in the same model load share one cache file. Each PTE's reserve_space extends the file; finalize_for_runtime msyncs only newly added regions; save_packed_index writes one trailer covering all PTEs at the end of the load.
  • Sibling PTEs that opt out of the mmap path (caller passes empty packed_cache_path) early-return from initialize_for_runtime and fall through to heap allocation, without touching the singleton's PLLM state.
  • Cross-model coexistence relies on caller-side discipline: only models that opt in set a non-empty cache path. Setting different non-empty paths concurrently is not supported by this singleton design.
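
One way to sync only newly added regions is to track how far the file has already been flushed and compute the range past that point. The sketch below covers only the range arithmetic; the watermark approach, `SyncRange`, and `range_to_sync` are assumptions, since the description does not say how new regions are tracked. The start is rounded down to a page boundary because POSIX `msync` expects a page-aligned address.

```cpp
#include <cstddef>

// Hypothetical description of the byte range to pass to msync.
struct SyncRange {
  size_t offset;  // page-aligned start offset within the mapping
  size_t length;  // number of bytes to sync; 0 means nothing to do
};

// synced_end: end of the data already flushed by an earlier PTE.
// data_end:   current end of the packed data after this PTE's reserve_space.
// page_size:  system page size, e.g. from sysconf(_SC_PAGESIZE).
SyncRange range_to_sync(size_t synced_end, size_t data_end, size_t page_size) {
  if (data_end <= synced_end) return SyncRange{0, 0};
  // Round the start down so the address handed to msync is page-aligned.
  const size_t start = synced_end - (synced_end % page_size);
  return SyncRange{start, data_end - start};
}
```

With a 4096-byte page, data previously flushed up to byte 5000, and new data ending at byte 9000, the range starts at offset 4096 and covers 4904 bytes, so the partially filled page is flushed again along with the new one.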

Caller change

XNNPACKBackend::init always calls set_packed_cache_path (with empty string for non-opted-in PTEs). This keeps the singleton path in sync with the current PTE instead of inheriting a sibling's path.
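
The interaction between the caller change and the opt-out path can be sketched as follows. This is an illustrative model, not the backend code: `WeightsCacheModel`, `AllocMode`, `backend_init`, and `mmap_inits_` are hypothetical, and only `set_packed_cache_path` and `initialize_for_runtime` are names from the description (with simplified signatures).

```cpp
#include <string>

// Hypothetical result of initialization: which allocation path is used.
enum class AllocMode { kHeap, kMmapFile };

class WeightsCacheModel {
 public:
  void set_packed_cache_path(const std::string& path) { path_ = path; }

  // Opted-out PTEs (empty path) return early and leave mmap state untouched.
  AllocMode initialize_for_runtime() {
    if (path_.empty()) return AllocMode::kHeap;
    ++mmap_inits_;  // stands in for opening or extending the cache file
    return AllocMode::kMmapFile;
  }

  int mmap_inits() const { return mmap_inits_; }

 private:
  std::string path_;
  int mmap_inits_ = 0;
};

// Stand-in for the caller: the path is always set, even when empty, so a PTE
// never inherits the path left behind by a sibling.
AllocMode backend_init(WeightsCacheModel& cache,
                       const std::string& packed_cache_path) {
  cache.set_packed_cache_path(packed_cache_path);
  return cache.initialize_for_runtime();
}
```

If the caller skipped `set_packed_cache_path` for the opted-out PTE, the second init in this model would reuse the first PTE's path and take the mmap path unintentionally.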

Test Plan

buck2 test fbcode//executorch/backends/xnnpack/test:test_xnn_weights_cache  # 5 pass
buck2 build fbsource//xplat/executorch/backends/xnnpack:xnnpack_backendApple
buck2 build fbsource//xplat/executorch/backends/xnnpack:xnnpack_backend
buck2 build fbcode//executorch/backends/xnnpack:xnnpack_backend

On device (iOS Stella build, PLLM + Llama3 runner):

  • Cold start: load (1184 entries) from cache, reserve_mmap=0 for cached weights
  • Cache file size stable at ~593 MB across PLLM unload/reload cycles
  • app_peak ~700 MB (vs ~2.5 GB pre-fix)
  • compressed ~100 MB (vs ~1.7 GB pre-fix)

Reviewed By: GregoryComer

Differential Revision: D106717093

@doggeral doggeral requested a review from digantdesai as a code owner June 3, 2026 21:47

pytorch-bot Bot commented Jun 3, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19988

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There are 1 currently active SEVs. If your PR is affected, please view them below:

❌ 5 New Failures, 1 Pending

As of commit f6f3ee8 with merge base 185bd09 (image):

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 3, 2026

meta-codesync Bot commented Jun 3, 2026

@doggeral has exported this pull request. If you are a Meta employee, you can view the originating Diff in D106717093.

github-actions Bot commented Jun 3, 2026

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@digantdesai digantdesai left a comment

Review automatically exported from Phabricator review in Meta.

@meta-codesync meta-codesync Bot changed the title from "Cross-load packed weight cache reuse for XNNPACK" to "Cross-load packed weight cache reuse for XNNPACK (#19988)" Jun 12, 2026
@doggeral doggeral force-pushed the export-D106717093 branch from f6f3ee8 to 072586f Compare June 12, 2026 23:26

@digantdesai digantdesai left a comment

Review automatically exported from Phabricator review in Meta.

@doggeral doggeral dismissed digantdesai’s stale review June 12, 2026 23:29

addressed comments already

Labels: CLA Signed, fb-exported, meta-exported