Draft
208 commits
ab02e85
Replace Claude imports with symlinks
Nucs Apr 14, 2026
b1d1731
feat(NpyIter): Implement 8 NumPy parity fixes for NpyIter
Nucs Apr 15, 2026
8335532
refactor(NpyIter): Support unlimited dimensions (NumSharp divergence)
Nucs Apr 15, 2026
b71a5e8
feat(NpyIter): Add NpyAxisIter and logical reduction infrastructure
Nucs Apr 15, 2026
5c2d6fd
fix(tests): Convert TUnit [Test] to MSTest [TestMethod]
Nucs Apr 15, 2026
372a8e7
feat(NpyIter): Implement axis reordering before coalescing for full 1…
Nucs Apr 15, 2026
932a836
feat(NpyIter): Add NumPy parity features and comprehensive test coverage
Nucs Apr 15, 2026
9e6ebfd
feat(NpyIter): Implement full F-order and K-order support with MULTI_…
Nucs Apr 15, 2026
4534093
feat(NpyIter): Implement GotoIndex for flat C/F index jumping
Nucs Apr 15, 2026
3a383df
feat(NpyIter): Implement Copy() for independent iterator copies
Nucs Apr 15, 2026
a620349
docs(NpyIter): Update remaining features list
Nucs Apr 15, 2026
6b883b3
feat(NpyIter): Implement negative stride flipping for memory-order it…
Nucs Apr 15, 2026
f140b4b
feat(NpyIter): Implement GetIterView for operand view with iterator axes
Nucs Apr 15, 2026
d00df9e
feat(NpyIter): Implement cast support for type conversion during buff…
Nucs Apr 15, 2026
719d668
feat(NpyIter): Implement reduction support via op_axes
Nucs Apr 15, 2026
cfde429
feat(NpyIter): Improve reduction NumPy parity
Nucs Apr 16, 2026
f542426
feat(NpyIter): Implement buffered reduction double-loop with full Num…
Nucs Apr 16, 2026
8da97a2
fix(NpyIter): Fix buffered reduction for small buffers (bufferSize < …
Nucs Apr 16, 2026
0943e04
docs(NpyIter): Add comprehensive implementation audit
Nucs Apr 16, 2026
2ffc73a
feat(NpyIter): Implement unlimited operands (NumPy NPY_MAXARGS parity)
Nucs Apr 16, 2026
2f42caf
docs(NpyIter): Add deep audit with 4 comparison techniques
Nucs Apr 16, 2026
fc4790a
fix(tests): Mark NpyIter iteration order differences as [Misaligned]
Nucs Apr 16, 2026
12e3629
fix(NpyIter): Fix F-order iteration to match NumPy behavior
Nucs Apr 16, 2026
0d5c2ef
fix(NpyIter): Fix K-order iteration for broadcast and non-contiguous …
Nucs Apr 16, 2026
3d47d17
fix(NpyIter): Achieve 100% NumPy 2.4.2 parity - 7 bugs fixed via TDD
Nucs Apr 16, 2026
b823f81
feat(Shape): Minimal multi-order memory layout support (C/F/A/K)
Nucs Apr 19, 2026
0376003
refactor(Shape): Align contiguity computation with NumPy conventions
Nucs Apr 19, 2026
9457bc7
feat(NpyIter): Implement 10 missing NumPy APIs + battletest to 566 sc…
Nucs Apr 20, 2026
528a1da
test(order): Add TDD coverage for C/F/A/K order support across API su…
Nucs Apr 20, 2026
3b55e9e
test(order): Expand coverage to every np.* function accepting order
Nucs Apr 20, 2026
41d65f7
test(order): Add coverage for ops, statistics, manipulation, matmul
Nucs Apr 20, 2026
c10b4a6
test(order): Round 4 — unary math, division, in-place, NaN-aware, bro…
Nucs Apr 20, 2026
47b6400
fix(order): Wire F-order support through copy/conversion and _like/as…
Nucs Apr 20, 2026
2ba101f
feat(NpyIter): Three-tier custom-op API + expanded NpyExpr DSL
Nucs Apr 20, 2026
4b2f7a9
fix(order): Wire F-order support through flatten/ravel/reshape/eye (G…
Nucs Apr 20, 2026
50de6c9
fix(order): Add asarray/asanyarray/asfortranarray/ascontiguousarray +…
Nucs Apr 20, 2026
23806cd
fix(order): NDArray.argsort copies non-C-contig input to C-contig fir…
Nucs Apr 20, 2026
42381d5
docs(NDIter): Max-effort amend — gotchas, validation, 4 new bugs, 4 n…
Nucs Apr 20, 2026
39ef08c
fix(order): Post-hoc F-contig preservation across ILKernel dispatch +…
Nucs Apr 20, 2026
ee8c65b
perf(flatten): Drop redundant ArraySlice clone on F-order path
Nucs Apr 20, 2026
74a92e9
feat(NpyExpr): Add Call() for arbitrary delegate/MethodInfo invocation
Nucs Apr 20, 2026
53a506f
fix(order): Review cleanups — dim aliasing, modf Type overload, resha…
Nucs Apr 20, 2026
e7ec2fd
docs(NDIter): Promote Call to dedicated subsection + memory-model sec…
Nucs Apr 20, 2026
25b058a
docs(NDIter): Add 7-technique quick reference + decision tree at top
Nucs Apr 20, 2026
387c4e6
refactor(NpyIter): Rename Tier A/B/C to Tier 3A/3B/3C
Nucs Apr 20, 2026
c1f6e84
fix(NPTypeCode): Char SizeOf returned 1 (real=2); GetPriority Decimal…
Nucs Apr 20, 2026
3d1a529
feat(examples): 2-layer MLP on MNIST with single-NpyIter bias+ReLU fu…
Nucs Apr 20, 2026
b5ede36
test(order): Section 41 — Reductions keepdims=True on F-contig (17 te…
Nucs Apr 20, 2026
cfe2a77
test(order): Section 42 — np.sort API gap (1 test, 1 [OpenBugs])
Nucs Apr 20, 2026
f90fe45
test(order): Section 43 — matmul/dot/outer/convolve layout (11 tests,…
Nucs Apr 20, 2026
779f6fc
test(order): Section 44 — Broadcasting from F-contig (5 tests, 0 [Ope…
Nucs Apr 20, 2026
e18caef
feat(examples): Trainable MNIST MLP -- fused forward + backward + Ada…
Nucs Apr 20, 2026
76b9c4e
test(order): Section 45 — Manipulation ops layout (20 tests, 2 [OpenB…
Nucs Apr 20, 2026
2e48d2c
test(order): Section 46 — File I/O fortran_order flag (4 tests, 3 [Op…
Nucs Apr 20, 2026
3f7172e
test(order): Section 47 — around / round_ (6 tests, 3 [OpenBugs])
Nucs Apr 20, 2026
b02a304
test(order): Section 49 — Decimal scalar-full path (10 tests, 1 [Open…
Nucs Apr 20, 2026
61db29e
test(order): Section 50 — Edge cases (12 tests, 1 [OpenBugs])
Nucs Apr 20, 2026
eda98fb
test(order): Section 51 — Fancy-write isolation repros (5 tests, 3 [O…
Nucs Apr 20, 2026
cd38eb1
perf(examples/mlp): 31x faster training -- copy transposed views befo…
Nucs Apr 21, 2026
7e46030
feat(Char8): 1-byte Char8 type with 100% NumPy/Python bytes parity + …
Nucs Apr 21, 2026
038d1ca
feat(mlp): periodic test eval + 100-epoch demo config
Nucs Apr 21, 2026
1783b48
fix(examples): complete all stubbed/broken NN scaffolding classes
Nucs Apr 21, 2026
edcf866
Add NDArray documentation
Nucs Apr 21, 2026
6e1da5d
perf(matmul): stride-native GEMM for all 12 dtypes — no copies
Nucs Apr 21, 2026
ef0c0b8
feat(tile): 1-to-1 parity with NumPy 2.x — battletest + edge-case cov…
Nucs Apr 22, 2026
259e893
docs(tile): update CLAUDE.md inventory + unmark Tile_ApiGap
Nucs Apr 22, 2026
572f6b6
refactor(iterators): migrate all production callers from MultiIterato…
Nucs Apr 22, 2026
d12d7ba
feat(npyiter): promote Iterators/ to full public API + NDArray overloads
Nucs Apr 22, 2026
9b2749b
refactor(iterators): Phase 2 migration — NaN reductions, BooleanMask,…
Nucs Apr 22, 2026
8af86b2
refactor(iterators): Phase 2 cont. — random sampling, casting, GetEnu…
Nucs Apr 22, 2026
7264173
refactor(iterators): rewrite NDIterator as NpyIter wrapper, delete le…
Nucs Apr 22, 2026
51ad43c
fix(npyiter): ForEach/ExecuteGeneric/ExecuteReducing read past end wi…
Nucs Apr 22, 2026
bb205d3
docs(examples): CLAUDE.md for the NeuralNetwork.NumSharp project
Nucs Apr 22, 2026
fb4b7dc
refactor(iterators): NDIterator now iterates lazily — no materialized…
Nucs Apr 22, 2026
b86b348
refactor(iterators): NDIterator fully backed by NpyIter state
Nucs Apr 22, 2026
e2318d4
fix(storage): DTypeSize reports in-memory stride, not Marshal.SizeOf
Nucs Apr 22, 2026
d364e7f
refactor(iterators+docs): cleanup from NpyIter migration
Nucs Apr 23, 2026
574a0d8
refactor(npfunc): replace ~400 NPTypeCode switch cases with NpFunc ge…
Nucs Apr 23, 2026
c3bbe9a
fix(clip): Complex IComparable constraint + Half NaN propagation
Nucs Apr 23, 2026
a96c9d9
fix(power): NumPy-aligned np.power — crash fix, neg-exp ValueError, f…
Nucs May 13, 2026
ff1149f
docs(audit): nditer branch quality audit V1+V2 with chapter findings
Nucs May 13, 2026
7a1d44e
test(audit-v2): OpenBugs reproductions for V2 audit Tier 1 findings
Nucs May 13, 2026
65ef76f
docs(claude): refresh project doc for F-order support and NpyIter
Nucs May 13, 2026
01d57a0
fix(unmanaged): correct CopyTo direction + bounds in ArraySlice / Hel…
Nucs May 13, 2026
1a9646f
fix(shape+convert): preserve scalar offset on Clone; fix ArrayConvert…
Nucs May 13, 2026
f21ea30
fix(storage+ndarray): keep TensorEngine in sync; correct cast for F-c…
Nucs May 13, 2026
414b35f
fix(default-engine): propagate TensorEngine through Cast and Transpose
Nucs May 13, 2026
9c9a396
fix(creation+manipulation): wire TensorEngine through copy / reshape …
Nucs May 13, 2026
443b7e0
feat(np.where): NumPy-aligned C/F output layout selection
Nucs May 13, 2026
50737cd
feat(np.tile): preserve input order on all-ones / no-reps path; refre…
Nucs May 13, 2026
fa60569
chore: deep-clone Options.RemainingArgs; drop trailing newline in Npy…
Nucs May 13, 2026
40371e4
fix(npyiter): deep-copy buffered Clone buffers; preserve stride width…
Nucs May 13, 2026
d4b2af3
test(clone): regression suite for unmanaged copy + storage + iterator…
Nucs May 13, 2026
786d705
feat(dtype): full 15-dtype parity for SByte/Half/Complex across hot p…
Nucs May 13, 2026
e188939
fix(ravel): return view for ravel('F') of F-contig source — eliminate…
Nucs May 15, 2026
aacee45
test(ravel): cover F-contig column-slice offset and both-C+F-contig p…
Nucs May 15, 2026
3505edc
feat(np.clip): NumPy 2.x parity — add min=/max= keyword aliases and d…
Nucs May 15, 2026
79c1894
fix(np.clip): NumPy 2.x dtype promotion, out= dtype validation, NaN-o…
Nucs May 15, 2026
9334bd7
perf(np.clip): scalar-bounds fast path + fix latent ClipScalar min>ma…
Nucs May 15, 2026
1a6c07b
perf(np.clip): fused copy+clip + skip redundant astype clones
Nucs May 15, 2026
10064ab
refactor(np.clip): all kernels via IL-generated DynamicMethods, width…
Nucs May 15, 2026
96104c4
merge: bring clip-min-max-aliases (NumPy 2.x clip parity + IL kernels…
Nucs May 15, 2026
dc9875c
feat(unique): add 5 missing NumPy 2.x keyword arguments
Nucs May 15, 2026
0a62d41
perf(unique): switch flat path from Dictionary to sort+mask — up to 4…
Nucs May 15, 2026
f77efcf
perf(unique): NaN-partition + default sort — eliminates Comparer over…
Nucs May 15, 2026
c3e02a4
feat(unique): add long-indexed fallback for n > Array.MaxLength
Nucs May 15, 2026
9a071e4
feat(searchsorted): full NumPy API parity — side, sorter, multidim va…
Nucs May 15, 2026
6b16c91
perf(searchsorted): IL-emitted typed binary-search kernels (5–25× ove…
Nucs May 15, 2026
633d35c
perf(searchsorted): emit specialized contiguous-a kernel — bakes elem…
Nucs May 15, 2026
4414681
feat(np.copyto): add casting= and where= parameters; fix unwriteable …
Nucs May 15, 2026
be75ea5
perf(np.copyto): 0-d where-mask short-circuit, matching NumPy's ndite…
Nucs May 15, 2026
c371294
perf(np.unique): route legacy unique() to optimized kwargs path + add…
Nucs May 20, 2026
b0ac360
refactor(np.copyto): centralize cross-dtype SIMD casts via ILKernelGe…
Nucs May 20, 2026
2516c49
perf(np.copyto): add IL StridedCastKernel for broadcast/strided cases…
Nucs May 20, 2026
526e5a7
perf(np.copyto): IL MaskedCastKernel for where= masked copies (1.0-1.…
Nucs May 20, 2026
cc6977f
perf(np.copyto): broadcast convert-once + memcpy fast path (int32->fl…
Nucs May 20, 2026
40322ac
perf(np.copyto): IL-emitted any-true mask prescan inside MaskedCastKe…
Nucs May 20, 2026
d576e54
perf(np.nonzero): IL-emitted SIMD kernel — closes 8-241x gap to NumPy
Nucs May 20, 2026
24301a3
perf(np.where): IL-emitted scalar-broadcast Where kernels
Nucs May 20, 2026
e177e99
refactor(IL): consolidate Vector* reflection into VectorMethodCache
Nucs May 20, 2026
99bd588
refactor(IL): finish VectorMethodCache migration — drop CachedMethods…
Nucs May 20, 2026
18a34e1
refactor(IL): route last typeof(Vector{128,256}<T>) sites through Vec…
Nucs May 20, 2026
1a96551
refactor(IL): introduce ScalarMethodCache — companion to VectorMethod…
Nucs May 20, 2026
a598478
refactor(IL): centralize internal helper-method reflection via GetHelper
Nucs May 20, 2026
65bef6c
feat(np.expand_dims): NumPy 2.x tuple-axis support
Nucs May 20, 2026
2fac93d
feat(np.concatenate): full NumPy 2.x parity + fast paths
Nucs May 20, 2026
f8154b9
perf(np.concatenate): F-contig fast path
Nucs May 21, 2026
4c171f8
bench(np.concatenate): manual Stopwatch sweep + NumPy companion
Nucs May 21, 2026
0642ef9
docs(np.concatenate): justify custom fast paths over NpyIter.Copy
Nucs May 21, 2026
bf9bd4e
perf(NpyIter): K-order axis permutation in CreateCopyState
Nucs May 21, 2026
043d4a7
perf(NpyIter): skip np.broadcast_to in CreateCopyState when shapes match
Nucs May 21, 2026
3302b0b
feat(memory): atomic reference counting (ARC) + IDisposable on NDArray
Nucs May 21, 2026
45a7b33
test(arc): extend coverage with lessons learned from adversarial probing
Nucs May 21, 2026
294d432
perf(np.concatenate): use `using` on internal intermediates for atomi…
Nucs May 21, 2026
392529f
perf(np.allclose): use `using` on the np.isclose intermediate
Nucs May 21, 2026
2330bd1
perf(np.convolve): use `using` on the full-convolution intermediate
Nucs May 21, 2026
05342fe
perf(np.random.shuffle): use `using` on swap temps in SwapSlicesAxis0
Nucs May 21, 2026
c09c129
perf(np.random.exponential): use `using` on the log intermediate
Nucs May 21, 2026
8a9e0d6
perf(boolean-mask indexing): `using` on per-iter view wrappers
Nucs May 21, 2026
28653b5
perf(argmax/argmin fallback): `using` on per-iter slice in ArgReducti…
Nucs May 21, 2026
d1602d6
perf(np.eye): use `using` on the flat view wrapper
Nucs May 21, 2026
8b6ebed
perf(NDArray.Contains): `using` on the equality intermediate
Nucs May 21, 2026
d279f9b
perf(np.tile): `using` on promoted/broadcasted/contiguous intermediates
Nucs May 21, 2026
f90142c
perf(NDArray.flatten): `using` on the F-order copy intermediate
Nucs May 21, 2026
27c32c9
revert(NDArray.flatten): drop the using on fcopy; one-liner the np.ey…
Nucs May 21, 2026
c7336ff
perf(NDArray.flatten): `using` on the F-order copy intermediate
Nucs May 21, 2026
3785b7f
feat(np.all, np.any): tuple-axis, out=, where= (NumPy 2.x parity)
Nucs May 21, 2026
793d6bc
feat(np.asarray): full NumPy 2.x parity — copy=, like=, device=, dtyp…
Nucs May 21, 2026
8050eaf
perf(np.all, np.any axis): IL kernel with AVX2 gather and specialized…
Nucs May 21, 2026
3410448
feat(np.repeat): add axis parameter with IL-emitted 3-loop kernel
Nucs May 21, 2026
f50e6c1
perf(np.repeat): SIMD-broadcast inner loop with cnt-gated three-stage…
Nucs May 21, 2026
3a2b30c
feat(np.median, np.percentile, np.quantile): full NumPy 2.x parity
Nucs May 22, 2026
f4d6bc2
feat(np.argwhere): add NumPy 2.4.2 argwhere with IL-kernel two-pass scan
Nucs May 22, 2026
c23a549
refactor(np.quantile): route through ILKernelGenerator, remove the dt…
Nucs May 22, 2026
7655528
refactor(np.argwhere): rewrite the contig path as per-dtype IL kernels
Nucs May 22, 2026
4e0404b
feat(np.ptp): NumPy 2.x peak-to-peak with axis / tuple-axis / keepdim…
Nucs May 22, 2026
c5af4e0
test(np.ptp): add 15 variation tests from the post-implementation audit
Nucs May 22, 2026
946d550
perf(reduction.axis): op-tag struct generics + C-contig fast paths
Nucs May 22, 2026
4ad62bb
refactor(np.nonzero): per-dtype IL kernels + per-dim expand, drop boo…
Nucs May 22, 2026
6191279
perf(reduction.axis): F-contiguous fast path (transposed view = .T)
Nucs May 22, 2026
999072c
perf(reduction.axis): generalize leading-axis to any inner-contig view
Nucs May 22, 2026
abe5968
feat(np.average): NumPy 2.x parity (weights, returned, tuple-axis, ke…
Nucs May 22, 2026
89d0b3e
feat(np.flatnonzero): direct ArgwhereFlat kernel route, 1-D int64 result
Nucs May 22, 2026
20d7350
fix(np.average): NumPy parity fixes — Half/Complex weighted + empty z…
Nucs May 22, 2026
37236ad
feat(indexing): np.unravel_index / ravel_multi_index / indices via IL…
Nucs May 22, 2026
9c49598
fix(indexing): broadcast, zero-shape, stride-inheritance review fixes
Nucs May 22, 2026
74b3009
feat(indexing): np.take / np.put / np.place via IL byte-level scatter…
Nucs May 22, 2026
4d7d797
perf(np.average): fused weighted-sum NpyIter kernel (1.3-1.6x faster …
Nucs May 22, 2026
e4c699e
fix(indexing): take out=, put/place non-contig writeback review fixes
Nucs May 22, 2026
1e883f7
feat(indexing): np.extract and np.compress (Phase 4)
Nucs May 22, 2026
86a51a6
fix(indexing): take out= enforces NumPy safe-cast direction (Phase 4 …
Nucs May 23, 2026
c1ccf92
perf(indexing): fused FilterAxisKernel for np.extract / np.compress
Nucs May 23, 2026
fa76194
refactor(np.average): route through ILKernelGenerator, drop struct-ke…
Nucs May 23, 2026
fa55b28
feat(indexing): np.diagonal and np.trace with fused TraceKernel (Phas…
Nucs May 23, 2026
a44aa07
perf(indexing): TraceKernel covers Half / Decimal / Complex (+inline …
Nucs May 24, 2026
97c8769
feat(manipulation): np.delete / np.insert / np.append + fix CloneData…
Nucs May 25, 2026
88e6606
perf(manipulation): direct-view np.split family (1.5-3.2x NumPy)
Nucs May 25, 2026
bbdeceb
feat(manipulation): np.pad — full NumPy 2.4.2 parity, 11 modes + call…
Nucs May 25, 2026
2242342
perf(pad): remove uniform-fill pessimization + workaround mean axis bug
Nucs May 25, 2026
0488362
perf(manipulation): O(1) split sub-shape derivation, 1.5-4x NumPy acr…
Nucs May 25, 2026
76d0436
refactor(np.split): rename SplitInt -> SplitBySections for clarity
Nucs May 25, 2026
d403e89
perf(NpyIter,kernels): L1+L2+L4-a — Avx2 intrinsics, 8x pairwise redu…
Nucs May 25, 2026
7ae5bf4
perf(kernels): IL-emit UnmanagedStorage.Alias typed-field copier (no …
Nucs May 25, 2026
697aae8
perf(reduction,binaryop): L2 axis IL kernel + L3-a F-contig coalesce
Nucs May 25, 2026
dd64acc
perf(binaryop): L3-a real SimdChunk kernel — hoist outer coord calc o…
Nucs May 25, 2026
10df507
perf(binaryop): L3-b — route ALL non-contig/non-scalar to SimdChunk
Nucs May 25, 2026
e15d051
perf(binaryop): L3-c SIMD inner loop for SimdChunk (contig+contig)
Nucs May 25, 2026
499f991
perf(binaryop): L3-d SimdScalarL/R for (M,1)+(M,N) and (M,N)+(M,1)
Nucs May 25, 2026
631802d
fix(pad): preserve dtype in linear_ramp arithmetic (Complex imaginary)
Nucs May 25, 2026
8c0ad72
perf(NpyIter Tier 3B): port L3-d SimdScalarL/R broadcast into InnerLo…
Nucs May 25, 2026
3a43e96
perf(DefaultEngine.BinaryOp): route same-dtype binary ops via NpyIter…
Nucs May 25, 2026
ec5419a
perf(DefaultEngine.BinaryOp): route mixed-dtype binary ops via NpyIte…
Nucs May 25, 2026
41b90bc
perf(UnmanagedMemoryBlock): tcache-style buffer pool for sub-1MiB allocs
Nucs May 25, 2026
f44530e
perf(DefaultEngine.UnaryOp): route unary ops via NpyIter Tier 3B
Nucs May 25, 2026
a853940
perf(axis-0 reduction): column-tiled accumulation eliminates output R…
Nucs May 25, 2026
e378ecb
perf(axis-0 reduction): drop unsuccessful 8× unroll + intrinsic-wrapp…
Nucs May 25, 2026
d1bd5fa
perf(comparison kernel): PDEP-based packed mask-to-bool store
Nucs May 25, 2026
8c89474
feat(engine): bool/char max/min + complex quantile (full pad dtype pa…
Nucs May 25, 2026
7740e6a
perf(reductions): Phase 0 bug surgery — 217x mean, 21x var/std, 20x c…
Nucs May 25, 2026
a97b394
perf(comparison): Phase 1 — route broadcast/strided compares via NpyIter
Nucs May 25, 2026
a3ed4f5
perf(reductions): Phase 1 — route non-contig flat reductions via NpyIter
Nucs May 25, 2026
ac5cb1f
perf(unary): Phase 2 — int SIMD for abs/negate/square
Nucs May 25, 2026
a0f5e48
docs(argmax kernel): document the platform gap on Vector256.Max(float)
Nucs May 25, 2026
435a07e
perf(axis reduce): type-specialized AddOp/MulOp structs for float/double
Nucs May 26, 2026
6ac6eed
perf(axis reduce): widening SIMD for sum/prod/min/max (int32->int64, …
Nucs May 26, 2026
35186d7
refactor(kernels): split ILKernelGenerator into ILKernelGenerator (Np…
Nucs May 26, 2026
32b03ab
docs(CLAUDE.md): document ILKernelGenerator/DirectILKernelGenerator s…
Nucs May 26, 2026
7e8f4e1
refactor(examples): finish ILKernelGenerator -> DirectILKernelGenerat…
Nucs May 26, 2026
1 change: 1 addition & 0 deletions .agents/skills/np-function/SKILL.md
1 change: 1 addition & 0 deletions .agents/skills/np-tests/SKILL.md
136 changes: 104 additions & 32 deletions .claude/CLAUDE.md

Large diffs are not rendered by default.

191 changes: 191 additions & 0 deletions .claude/commands/np-function.md
@@ -0,0 +1,191 @@
---
name: np-function
description: Implement a NumPy np.* function in NumSharp with full API parity, optimizations, and variation coverage (NumPy 2.4.2 source of truth).
argument-hint: <np.function_name or description>
---

When the user requests /np-function, follow these instructions carefully:

# np-function command

We aim to support NumPy's np.* API to the fullest. We align with NumPy 2.4.2 as the source of truth and provide exactly the same API (np.* overloads) as NumPy does.
This session we are focusing on: """$ARGUMENTS"""
Your job is to work on np.* functions, no more than one at a time unless they are closely related.

The high-level development cycle for an np.* function is defined as follows:

## 1. Read, investigate, learn and experiment
Read how NumPy (src\NumPy\) implements the np function(s) you are about to implement, noting all parameters and overloads.
NumPy is the source of truth and if NumPy does A, we do A but in NumSharp's C# way.

### Definition of Done:
- At the end of step 1 you fully understand:
- How the np function works internally in NumPy and how it reacts to inputs / parameters.
- What parameters the np function accepts and what modes the function works in.
- What optimizations NumPy uses and what optimizations we can use.
- How best to integrate with our existing infrastructure.
- Whether to use ILKernelGenerator or NpyIter to implement the loop.
- Do not implement a struct kernel.
## 2. Implement np method/s
- Implement np methods to the fullest, integrating into our existing infrastructure and patterns.
- Our implementation might differ from NumPy's because NumPy uses C++ macros while we generate IL methods at runtime to achieve peak performance and CPU acceleration. Still, any input given to NumPy must produce the same output in NumSharp, with complete parity.
- Our implementation must provide the same parameters as the NumPy function and support all dtypes NumSharp currently supports.
- Do not create a function per dtype/NPTypeCode or if-else/switch-case per dtype/NPTypeCode to call a specialized path.
- Do not use the struct kernel pattern.
- Do utilize IL generation (ILKernelGenerator) and/or NpyIter to implement the function, including fast paths.
- Any loops must be implemented via NpyIter or via ILKernelGenerator.

## Tools:
### Asserting, Validating, Comparing, Experimenting and Probing
"dotnet run <<'EOFDOTNET'" and "python <<'EOFPYTHON'" can both be used to assert, validate, compare, test, and confirm how behaviors, edge cases, parameter variations, happy flows, and unhappy flows act for given inputs.
These CLI commands allow rapid development and experimentation.
Paths in '#:project' and other '#' directives must be absolute.

### Benchmarking
Use "dotnet run <<'EOFDOTNET'" and "python <<'EOFPYTHON'" to produce professional benchmarks.

#### Benchmarking Rules of Thumb
- We must be at least 1.5x as fast as NumPy across all execution variations and modes (all dtypes, all parameter combinations; see "Variations for Asserting, Validating, Comparing and Experimenting").
- There is a reason towards why NumPy does

## Optimizations and Implementation
Our codebase uses and follows the following techniques:

### A. Specialization & code generation

- Runtime IL emission per cache key — DynamicMethod generates a kernel once per (op, dtypes, layout) and the JIT compiles it to native; subsequent calls hit a ConcurrentDictionary lookup.
- Per-startup SIMD width baking — VectorBits resolved once via IsHardwareAccelerated; the emitted IL targets exactly one of V128/V256/V512 with no runtime width branch.
- Layout-specialized kernel paths — Generate distinct kernels for SimdFull / SimdScalarLeft / SimdScalarRight / SimdChunk / General instead of one kernel with runtime layout branches; layout becomes part of the cache key.
- Signature collapse for fast paths — Contig kernels drop stride/shape args; scalar-broadcast kernels take T scalar not T*; cuts indirection and shrinks the IL body.
- Helper-call vs inline-IL choice — When an op has a tidy generic-constrained C# helper (e.g. CumSumHelperSameType<T>), the kernel emits a single Call and lets the JIT inline; only complex bodies inline the IL loop themselves.
- Negative cache for unsupported combos — _castUnsupported/_maskedCastUnsupported record dtype pairs that fail IL gen so retries are O(1) instead of re-attempting emission.
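
The caching bullets above (one kernel per cache key, plus a negative cache for combinations that cannot be emitted) can be sketched in a language-neutral way. This is a minimal Python illustration, not NumSharp's implementation: `get_kernel`, `_emit`, `stats`, and the key shape `(op, dtypes, layout)` are hypothetical names, and a plain lambda stands in for an emitted DynamicMethod.

```python
from typing import Callable, Dict, Optional, Set, Tuple

# Hypothetical cache key: (op name, dtype names, layout path).
Key = Tuple[str, Tuple[str, ...], str]

_kernels: Dict[Key, Callable] = {}
_unsupported: Set[Key] = set()      # negative cache
stats = {"emits": 0}                # counts how often "emission" actually runs


def _emit(key: Key) -> Optional[Callable]:
    """Stand-in for IL emission: returns a specialized callable or None."""
    stats["emits"] += 1
    op, dtypes, _layout = key
    if "decimal" in dtypes:         # pretend this combination cannot be emitted
        return None
    if op == "add":
        return lambda a, b: [x + y for x, y in zip(a, b)]
    return None


def get_kernel(key: Key) -> Optional[Callable]:
    if key in _unsupported:         # O(1) rejection, no second emission attempt
        return None
    kernel = _kernels.get(key)
    if kernel is None:
        kernel = _emit(key)
        if kernel is None:
            _unsupported.add(key)
            return None
        _kernels[key] = kernel
    return kernel
```

Asking twice for the same key returns the same callable and runs the emitter once; asking twice for a failing key also runs the emitter only once.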

### B. Loop shaping

- 4x-8x unrolling with independent accumulators — Body processes 4-8 vectors per iter into 4-8 separate accumulators; breaks the carried dependency so the CPU can keep several SIMD ops in flight instead of stalling on a single accumulator.
- Three-stage loop — Unrolled SIMD body + 1-vector remainder + scalar tail; handles any count without padding.
- Inner-contig runtime dispatch — Inside strided kernels, compare each operand's stride to its element size; branch into the SIMD inner body when all match, else strided.
- Cache-friendly loop ordering — IKJ in MatMul so the inner SIMD walk is over sequential B[k,:] memory; A[i,k] is broadcast once and reused across all j.
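
The three-stage loop with independent accumulators can be illustrated without SIMD by treating a short list as one "vector". This is a sketch under assumptions: `LANES`, `UNROLL`, `vadd`, and `three_stage_sum` are made-up names, and the real kernels are emitted IL rather than Python.

```python
LANES = 4      # assumed vector width in elements
UNROLL = 4     # vectors consumed per unrolled iteration


def vadd(acc, data, start):
    """Lane-wise add of one LANES-wide slice into an accumulator 'vector'."""
    return [acc[j] + data[start + j] for j in range(LANES)]


def three_stage_sum(data):
    n = len(data)
    accs = [[0] * LANES for _ in range(UNROLL)]   # independent accumulators
    i = 0
    step = LANES * UNROLL

    # Stage 1: unrolled body; each accumulator has its own dependency chain.
    while i + step <= n:
        for u in range(UNROLL):
            accs[u] = vadd(accs[u], data, i + u * LANES)
        i += step

    # Stage 2: one-vector remainder.
    while i + LANES <= n:
        accs[0] = vadd(accs[0], data, i)
        i += LANES

    # Tree merge of the accumulators, then horizontal reduce across lanes.
    accs[0] = [a + b for a, b in zip(accs[0], accs[1])]
    accs[2] = [a + b for a, b in zip(accs[2], accs[3])]
    merged = [a + b for a, b in zip(accs[0], accs[2])]
    total = sum(merged)

    # Stage 3: scalar tail for the last (n % LANES) elements.
    while i < n:
        total += data[i]
        i += 1
    return total
```

Because the accumulators start at the Sum identity (0), an empty input falls through all three stages and returns 0.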

### C. SIMD primitives

- Mask→uint via ExtractMostSignificantBits — Convert a Vector mask to packed bits in a uint — the universal building block for All/Any/NonZero/CountTrue/CopyMasked.
- Bit-scan loop (TrailingZeroCount + bits &= bits-1) — Materialize lane indices from a packed mask one-at-a-time without a per-lane branch; standard idiom for sparse-extract.
- Self-equality NaN mask — Equals(v, v) produces lanes that are true for non-NaN (NaN ≠ NaN); used to zero/count out NaNs in NaN-aware reductions.
- Branchless ConditionalSelect — Per-lane gating without a branch; used by Where and masked cross-dtype copy.
- Scalar pre-broadcast — Vector.Create(scalar) hoisted into a local before the loop so the body re-uses it instead of reloading; used by scalar-broadcast variants of binary/where/clip.
- Op-specific identity seeding — Reduction accumulators are pre-loaded with 0 (Sum), 1 (Prod), MinValue (Max), MaxValue (Min) — also defines the empty-array result.
- Tree merge + horizontal halving — Multi-accumulator finalization: acc0 op= acc1; acc2 op= acc3; acc0 op= acc2, then horizontal reduce across the lanes.
- Early-exit on mask state — All/Any/IsAllZero return immediately when the packed bits hit the terminal pattern, skipping the rest of the array.
- Vectorized index discovery, scalar scatter — Even when the data store can't be vectorized (gather/scatter limits), the mask scan that finds the indices is fully SIMD.
- AVX2 gather for strided float/double — Strided axis reductions use intrinsic gather when the dtype is gather-capable.
- Width-adaptive emit via GetVectorContainerType() — One emission function picks Vector{128|256|512} methods through a cache; the same source code path covers all widths.
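
The mask-to-bits and bit-scan idioms above combine into a sparse index extractor. The following is a rough Python model: an integer stands in for the packed mask that ExtractMostSignificantBits would produce, and `lanes_from_mask` / `nonzero_indices` are illustrative names, not NumSharp functions.

```python
def lanes_from_mask(bits: int):
    """Yield set-bit positions, lowest first, clearing one bit per step."""
    while bits:
        lowest = bits & -bits              # isolate the lowest set bit
        yield lowest.bit_length() - 1      # equals the trailing-zero count
        bits &= bits - 1                   # clear the lowest set bit


def nonzero_indices(values, lanes=8):
    """Indices of non-zero elements, discovered one packed mask at a time."""
    out = []
    for base in range(0, len(values), lanes):
        chunk = values[base:base + lanes]
        # Stand-in for a vector compare followed by packing the mask to bits.
        bits = sum(1 << j for j, v in enumerate(chunk) if v != 0)
        out.extend(base + j for j in lanes_from_mask(bits))
    return out
```

The inner loop runs once per set bit rather than once per lane, which is why the idiom suits sparse masks.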

### D. Memory & pointer

- Cpblk IL intrinsic — Same-type contiguous copy emits the CLR block-memcpy opcode directly instead of a loop.
- Incremental coord advance — Outer-dim walks update offsets by adding strides rather than recomputing via flat → div/mod per element.
- Pre-computed dim strides in stack array — Axis kernels pre-build output-dim strides on the stack so each output index → input offset is O(ndim) muladds, no divmods.
- Pointer/stride prologue hoisting — Inner-loop factory snapshots dataptrs[i] and strides[i] into locals once at the top so the loop body works against locals, not memory loads.
- Pre-size-then-fill — np.nonzero runs an IL-emitted popcount first to size the output buffer, then a second IL-emitted bit-scan kernel writes indices; avoids the "alloc max-size temp" pathology.
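
To make the "incremental coord advance" bullet concrete, here is a small comparison of the two ways to walk a strided array. It is a sketch with hypothetical function names and element (not byte) strides; both functions return the element offsets in C iteration order.

```python
def offsets_divmod(shape, strides):
    """Reference: recompute each offset from the flat index with div/mod."""
    total = 1
    for d in shape:
        total *= d
    result = []
    for flat in range(total):
        off, rem = 0, flat
        for dim, stride in zip(reversed(shape), reversed(strides)):
            rem, coord = divmod(rem, dim)
            off += coord * stride
        result.append(off)
    return result


def offsets_incremental(shape, strides):
    """Odometer walk: add one stride per step, rewind an axis on carry."""
    ndim = len(shape)
    if any(d == 0 for d in shape):
        return []
    coords = [0] * ndim
    off = 0
    result = []
    while True:
        result.append(off)
        axis = ndim - 1
        while axis >= 0:
            coords[axis] += 1
            off += strides[axis]
            if coords[axis] < shape[axis]:
                break
            off -= strides[axis] * shape[axis]   # rewind this axis to 0
            coords[axis] = 0
            axis -= 1
        if axis < 0:                             # carried out of axis 0: done
            return result
```

The incremental version performs only additions and subtractions per element, and it handles zero strides (broadcast) and negative strides (reversed views) without special cases.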

### E. Algorithmic

- Two-pass algorithms — ArgMax (find value → find index), Var/Std (mean → squared diffs), masked-copy (count → place). First pass enables vectorization; second pass exploits the known result.
- Monotonic-bound carry — searchsorted carries the lower bound L from the previous iteration when consecutive keys ascend, mirroring NumPy's binsearch.cpp.
- Short-circuit prescan — Quick SIMD all-zero check on a boolean mask short-circuits the whole np.where(cond) pipeline when the condition is fully false.
- Type-promotion-aware path skip — SIMD reduction skipped when input != accumulator (e.g. sum(int32)→int64) because Vector<T> can't widen lanes; falls to scalar IL.
- Two-tier inner-loop API — Callers choose between Tier 3A (raw IL body) for full control or Tier 3B (scalar/vector body lambdas wrapped in the standard 4×-unrolled shell) for boilerplate elimination.
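
The monotonic-bound carry can be shown with a plain binary search. This is a simplified variant written for illustration, not a transcription of NumPy's binsearch.cpp or of NumSharp's kernel; `searchsorted_left` is a hypothetical name and NaN ordering is ignored.

```python
def searchsorted_left(a, keys):
    """Left-side insertion points; keeps the previous lower bound when keys ascend."""
    out = []
    lo, hi = 0, len(a)
    prev = None
    for key in keys:
        if prev is not None and key >= prev:
            hi = len(a)            # keep lo: the answer cannot move left
        else:
            lo, hi = 0, len(a)     # first key, or key decreased: full range
        while lo < hi:
            mid = (lo + hi) // 2
            if a[mid] < key:
                lo = mid + 1
            else:
                hi = mid
        out.append(lo)             # lo now holds this key's insertion point
        prev = key
    return out
```

When the keys are themselves sorted, each search starts from the previous answer, so the searched range shrinks as the keys advance.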

### F. Cross-type bridging

- Decimal-via-double bridge — All transcendental decimal ops emit decimal→double→Math.*→decimal inline IL.
- Bool-mask lane expansion — 1-byte mask is widened through WidenLower chain to match the 1/2/4/8-byte data lane width before ConditionalSelect.
- Magnitude comparison for Complex — ArgMax/ArgMin on Complex compares |z|, since Complex has no native ordering.

### G. NumPy semantic compliance

- NumPy-overflow shift semantics — Branch on shift >= bitWidth returns 0 (or -1 for signed-negative right shift) instead of C# x << (n & 31) masking.
- Sign-preserving zero in Modf — Explicit fixup so modf(-0.0) = (-0.0, -0.0) and modf(+inf) = (+0.0, +inf) per C standard.
- Vacuous truth for empty reductions — all([])=True, any([])=False, identity-valued Sum/Prod/Max/Min for empty arrays.
- NEP50-aligned accumulator types — Reduction kernels promote int32→int64 for Sum/Prod/CumSum, dropping out of SIMD when needed.

### H. Reflection & caching

- MethodInfo cache (fail-fast at type load) — Math.*, Vector*.*, Decimal.* reflection resolved in static initializers with ?? throw; emission never pays GetMethod cost.
- Width-resolved generic method cache — VectorMethodCache.V(VectorBits, clrType) returns the right Vector{W}<T> type and Generic(VectorBits, name, clrType, paramCount) returns the right method handle.
- ConcurrentDictionary.GetOrAdd keyed by structural value — All kernel caches use struct keys with stable Equals/GetHashCode; thread-safe lazy init via GetOrAdd.


## Variations for Asserting, Validating, Comparing and Experimenting
These variations are the range of possible inputs for which we must match NumPy's output for complete parity.
Total: ~51 distinct variations (25 single-array layouts, 6 pairwise paths, 8 per-operand flags, 8 iteration flags, 4 composite execution paths).

### A. Single-array layouts

- C-contiguous — Row-major, stride[-1]==1 and stride[i]==shape[i+1]*stride[i+1]; baseline fast path via IsContiguous.
- F-contiguous — Column-major, stride[0]==1; contiguous 1-D arrays are both C and F. Detected via IsFContiguous.
- Strided / non-contiguous — Arbitrary strides, neither C nor F; built via step slicing or axis swap.
- Transposed — Strides permuted by .T / swapaxes / moveaxis; usually non-contig.
- Negative-stride view — Reversed slicing ([::-1]); strides are signed-negative.
- Simple slice — offset!=0, not broadcast; fast GetOffsetSimple path (IsSimpleSlice).
- Sliced + composed — a[1:5].T, a[1:3][:,None,:]; offset combined with permutation or broadcast.
- Broadcasted — stride=0 with dim>1 (BROADCASTED flag); read-only per NumPy.
- Scalar-broadcast — All strides zero (IsScalarBroadcast); load value once and reuse.
- Partial broadcast — Some axes stride=0, others not; common (1,N)→(M,N) case.
- Scalar (0-d) — ndim==0, size==1, no strides.
- 0-D view from integer indexing — a[0,0,0] shares storage; distinct from np.array(5.0) which owns.
- 1-element 1-D — ndim==1, size==1; ambiguous against 0-d in some paths.
- Empty — size==0 (e.g. np.zeros((0,3))); reductions must return identity.
- Empty + composed — np.zeros((0,3))[::2,:]; rare but must not crash.
- NewAxis-inserted dim — a[None,:] adds dim=1, stride=0; not flagged broadcast since dim=1.
- Singleton dim (dim=1) — Stride is moot; NumPy treats as contig.
- Higher-rank (5+D) — Stack-allocated coord/stride arrays in kernels may have bounds.
- Stride > bufferSize — Negative-stride views can have offset + stride*(dim-1) >= bufferSize.
- Reshape view vs copy — Reshape returns a view when contig allows, materializes otherwise.
- Fancy-indexed result — Always a fresh C-contig owning array, never a view.
- Boolean-mask result — Always a contig owning copy.
- Read-only / non-writeable — IsWriteable==false (set on broadcast views); writes throw.
- Non-owning view — OwnsData==false; writes alias the parent.
- Aligned — ALIGNED flag; always true for managed allocs but a real NumPy axis.
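
Several of the layouts above are decided purely by shape and strides, so they can be expressed as small predicates for test fixtures. The following is a minimal sketch, assuming strides are counted in elements (as in the C-contiguous bullet) and using hypothetical helper names; it is not the NumSharp `IsContiguous` / `IsFContiguous` implementation. It follows the rule that size-1 axes are ignored and empty arrays count as contiguous.

```python
def is_c_contiguous(shape, strides):
    """Row-major check; size-1 axes are ignored, empty arrays are contiguous."""
    if any(dim == 0 for dim in shape):
        return True
    expected = 1
    for dim, stride in zip(reversed(shape), reversed(strides)):
        if dim != 1:
            if stride != expected:
                return False
            expected *= dim
    return True


def is_f_contiguous(shape, strides):
    """Column-major check: the same walk, starting from the first axis."""
    if any(dim == 0 for dim in shape):
        return True
    expected = 1
    for dim, stride in zip(shape, strides):
        if dim != 1:
            if stride != expected:
                return False
            expected *= dim
    return True


def is_broadcast(shape, strides):
    """True when some axis longer than 1 has stride 0."""
    return any(dim > 1 and stride == 0 for dim, stride in zip(shape, strides))


def is_scalar_broadcast(shape, strides):
    """Assumption: ndim > 0 and every stride is zero."""
    return len(shape) > 0 and all(stride == 0 for stride in strides)


# (shape, element strides) fixtures named after the bullets above.
layouts = {
    "c_contig": ((3, 4), (4, 1)),
    "transposed": ((4, 3), (1, 4)),
    "step_slice": ((3, 2), (4, 2)),
    "reversed": ((5,), (-1,)),
    "partial_broadcast": ((3, 4), (0, 1)),
    "scalar_broadcast": ((3, 4), (0, 0)),
    "newaxis": ((1, 4), (0, 1)),
    "empty": ((0, 3), (3, 1)),
    "scalar_0d": ((), ()),
}

for name, (shape, strides) in layouts.items():
    print(name, is_c_contiguous(shape, strides), is_f_contiguous(shape, strides),
          is_broadcast(shape, strides), is_scalar_broadcast(shape, strides))
```

The same fixture table can drive a parity test by comparing each predicate against the corresponding NumPy flag for an array built with that layout.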

### B. Pairwise (binary-op) paths — MixedTypeKernelKey.Path

- SimdFull — Both operands C-contig same dtype; SIMD baseline.
- SimdScalarRight — RHS is 0-d / scalar-broadcast, LHS is array.
- SimdScalarLeft — LHS is 0-d / scalar-broadcast, RHS is array.
- SimdChunk — Inner dim contig for both, outer strided.
- General — Arbitrary strides on either side; coordinate iteration.
- Mixed dtypes — Orthogonal axis: same layout, different LHS/RHS/result dtypes (NEP50 promotion).
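
The five layout-driven paths above can be read as a decision order. The sketch below is one plausible ordering, inferred only from the bullet descriptions; `classify_path` and its helpers are hypothetical, and the real selection logic behind `MixedTypeKernelKey.Path` may test conditions in a different order or consider dtype as well. Operands are `(shape, strides)` pairs with strides in elements.

```python
def is_c_contiguous(shape, strides):
    """Row-major check; size-1 axes are ignored, empty arrays are contiguous."""
    if any(dim == 0 for dim in shape):
        return True
    expected = 1
    for dim, stride in zip(reversed(shape), reversed(strides)):
        if dim != 1:
            if stride != expected:
                return False
            expected *= dim
    return True


def is_scalar_like(shape, strides):
    """0-d operand or scalar-broadcast operand (every stride is zero)."""
    return all(stride == 0 for stride in strides)


def inner_contiguous(shape, strides):
    """True when the innermost axis is unit-stride."""
    return len(shape) > 0 and strides[-1] == 1


def classify_path(lhs, rhs):
    """Assumed decision order: scalar cases, full SIMD, chunked SIMD, general."""
    if is_scalar_like(*rhs) and not is_scalar_like(*lhs):
        return "SimdScalarRight"
    if is_scalar_like(*lhs) and not is_scalar_like(*rhs):
        return "SimdScalarLeft"
    if is_c_contiguous(*lhs) and is_c_contiguous(*rhs):
        return "SimdFull"
    if inner_contiguous(*lhs) and inner_contiguous(*rhs):
        return "SimdChunk"
    return "General"


dense = ((3, 4), (4, 1))
scalar = ((), ())
row_window = ((3, 4), (8, 1))   # rows taken from a wider parent array
transposed = ((3, 4), (1, 3))

print(classify_path(dense, dense))
print(classify_path(dense, scalar))
print(classify_path(scalar, dense))
print(classify_path(row_window, dense))
print(classify_path(transposed, dense))
```

Mixed dtypes stay out of this function on purpose: as the last bullet notes, dtype is an orthogonal axis, so each layout path would be exercised once per dtype combination.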

### C. Per-operand variations — NpyIterOpFlags

- Aliased operands — Same buffer on both sides (a + a, out=a); no non-aliasing assumption.
- Overlapping views — Two views with partial overlap (a[1:] and a[:-1]); writes can clobber unread reads.
- In-place output (out=) — Output aliases an input; loop order must respect read-before-write.
- Reduction operand — Output has stride=0 along the reduction axis (REDUCE flag).
- Write-masked operand — WRITEMASKED: write only where mask is true.
- Virtual operand — VIRTUAL: no backing array, computed on demand.
- Buffered / casting operand — CAST / FORCECOPY / HAS_WRITEBACK: type conversion needs a temp.
- Read-only operand — READ without WRITE; matters for output selection.
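
The aliasing and overlap cases above can be made concrete with a small model. This is a sketch under stated assumptions: both views index the same flat buffer, offsets and strides are in elements, and `extent`, `may_overlap`, and the two `add_shifted_*` functions are hypothetical helpers rather than NpyIter APIs. The extent test is conservative: it can report overlap for interleaved views that never touch the same element.

```python
def extent(offset, shape, strides):
    """Inclusive (lo, hi) element range a view can touch, or None if empty."""
    if any(dim == 0 for dim in shape):
        return None
    lo = hi = offset
    for dim, stride in zip(shape, strides):
        span = stride * (dim - 1)
        if span < 0:
            lo += span   # negative strides extend the range downward
        else:
            hi += span
    return lo, hi


def may_overlap(view_a, view_b):
    """Conservative check: True when the two extents intersect."""
    ext_a, ext_b = extent(*view_a), extent(*view_b)
    if ext_a is None or ext_b is None:
        return False
    return ext_a[0] <= ext_b[1] and ext_b[0] <= ext_a[1]


def add_shifted_naive(buf):
    """buf[1:] = buf[1:] + buf[:-1] with a forward loop and no temp."""
    for i in range(1, len(buf)):
        buf[i] = buf[i] + buf[i - 1]   # reads a value already overwritten
    return buf


def add_shifted_safe(buf):
    """Same operation, reading from a snapshot taken before any write."""
    src = list(buf)
    for i in range(1, len(buf)):
        buf[i] = src[i] + src[i - 1]
    return buf


tail = (1, (4,), (1,))   # a[1:] of a 5-element buffer
head = (0, (4,), (1,))   # a[:-1] of the same buffer
print(may_overlap(tail, head))
print(add_shifted_naive([1, 2, 3, 4, 5]))  # [1, 3, 6, 10, 15]
print(add_shifted_safe([1, 2, 3, 4, 5]))   # [1, 3, 5, 7, 9]
```

The naive loop turns the shifted add into a running sum, which is the clobbering the overlapping-views bullet warns about; the snapshot version gives the result expected when operands do not overlap.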

### D. Iteration-level variations — NpyIterFlags

- Coalesced dimensions — Consecutive axes with matching strides collapsed; ndim=4 may arrive as ndim=1.
- IDENTPERM vs NEGPERM — Axis iteration order: identity vs flipped (negative stride on some axis).
- External loop (EXLOOP) — Kernel sees only the inner axis; outer loop driven by iterator.
- Ranged iteration (RANGE) — Partial traversal of a subset.
- GROWINNER — Inner-loop length varies across outer iterations.
- GATHER_ELIGIBLE — Strided inner axis but dtype supports AVX2 gather.
- EARLY_EXIT — Op supports short-circuit (All/Any/IsAllZero).
- PARALLEL_SAFE — Outer loop has no cross-iteration dependency.
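
The coalescing bullet is the one most likely to surprise a kernel author, so a worked sketch may help. The function below merges adjacent axes whenever the outer stride equals the inner length times the inner stride, and drops size-1 axes. It is a simplified model with a hypothetical name (`coalesce`): it assumes a non-empty array, takes strides in elements, and does not reorder axes first, which the iterator described in this document does before coalescing.

```python
def coalesce(shape, strides):
    """Merge adjacent axes that form one uniform run; drop size-1 axes."""
    out_shape, out_strides = [], []
    for dim, stride in zip(shape, strides):
        if dim == 1:
            continue  # a singleton axis contributes nothing to iteration
        if out_shape and out_strides[-1] == dim * stride:
            # The previous axis steps exactly over this one: fuse them.
            out_shape[-1] *= dim
            out_strides[-1] = stride
        else:
            out_shape.append(dim)
            out_strides.append(stride)
    return tuple(out_shape), tuple(out_strides)


# A C-contiguous 4-D array collapses to a single axis.
print(coalesce((2, 3, 4, 5), (60, 20, 5, 1)))   # ((120,), (1,))
# a[:, ::2] on a (3, 4) array is a uniform stride-2 run of 6 elements.
print(coalesce((3, 2), (4, 2)))                 # ((6,), (2,))
# A transposed layout cannot be fused without reordering axes.
print(coalesce((4, 3), (1, 4)))                 # ((4, 3), (1, 4))
```

A kernel test matrix therefore needs cases keyed by the post-coalescing rank, not only by the rank the caller passed in.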

### E. NpyIter composite execution paths

- Source-broadcast + dest-contig — Common reduction shape.
- Source-contig + dest-strided — Writing into a sliced output.
- Buffer-required path — Dtype mismatch or alignment forces NpyIter to insert a temp; kernel sees contig but indirect.
- Reused reduce loops — REUSE_REDUCE_LOOPS: inner-loop kernel runs against successive output positions without re-derivation.
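
The reduction-related paths above, together with the stride=0 reduction operand from section C, can be pictured with a coordinate loop in which the output simply does not advance along the reduced axis. This is a minimal sketch with a hypothetical helper (`reduce_sum`) over flat Python lists and element strides; it models the addressing only, not NpyIter's buffering or loop reuse.

```python
from itertools import product


def reduce_sum(src, shape, src_strides, out, out_strides):
    """Accumulate src into out; out has stride 0 along each reduced axis.

    `out` must be pre-filled with the reduction identity (0 for sum).
    """
    for coord in product(*(range(dim) for dim in shape)):
        src_off = sum(c * s for c, s in zip(coord, src_strides))
        out_off = sum(c * s for c, s in zip(coord, out_strides))
        out[out_off] += src[src_off]
    return out


src = [1, 2, 3, 4, 5, 6]  # a (2, 3) row-major array

# Reduce axis 1: output stride is 0 along axis 1.
row_sums = reduce_sum(src, (2, 3), (3, 1), [0, 0], (1, 0))
# Reduce axis 0: output stride is 0 along axis 0.
col_sums = reduce_sum(src, (2, 3), (3, 1), [0, 0, 0], (0, 1))
# Empty input: the loop body never runs, so the identity is returned.
empty_sums = reduce_sum([], (0, 3), (3, 1), [0, 0, 0], (0, 1))

print(row_sums)    # [6, 15]
print(col_sums)    # [5, 7, 9]
print(empty_sums)  # [0, 0, 0]
```

The empty case ties back to the single-array list: a reduction over a size-0 axis must yield the identity, which falls out naturally when the output is initialized before the loop.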

37 changes: 0 additions & 37 deletions .claude/skills/np-function/SKILL.md

This file was deleted.

1 change: 1 addition & 0 deletions AGENTS.md
2 changes: 2 additions & 0 deletions ARCHITECTURE.md
@@ -480,6 +480,8 @@ MoveNext = () => *((T*)Address + shape.GetOffset(index++));

## Code Generation

For the practical implementation rules used by `DefaultEngine` and `ILKernelGenerator`, see `docs/DEFAULTENGINE_ILKERNEL_PLAYBOOK.md`. That guide captures the recurring engine patterns, optimization conventions, and test expectations that are only implicit in the source code.

### Regen Templating

NumSharp uses Regen (a custom templating engine) to generate type-specific code. This results in approximately **200,000 lines of generated code**.
2 changes: 1 addition & 1 deletion benchmark/NumSharp.Benchmark.Exploration/Program.cs
@@ -393,7 +393,7 @@ private class Options
Dtypes = Dtypes,
Sizes = Sizes,
OutputPath = OutputPath,
- RemainingArgs = RemainingArgs
+ RemainingArgs = (string[])RemainingArgs.Clone()
};
}
}