# P20 BiquadNEON — Audit

**Status:** finding revised. Source is form-agnostic; each font emits the form best for its target. NEON font extension required.
**Auditor:** Atlas
**Date:** 2026-05-29 (revised same day per Josh's architecture clarification)

---

## Lithos source (`pfuncs/p20-biquadneon.ls`)

```
\ p20-biquadneon.ls -- Biquad NEON (4v)
\ Same math as scalar biquad, NEON lanes.
biquadneon ⇌ · ++ - · ←
```

Identical chain to scalar P3 (`p3-biquad.ls`) — the comment says *"Same math as scalar biquad, NEON lanes."*

Per `quantum.factory.language`:
> `biquad ⇌ · ++ - · ← — forward dot product, add state, subtract feedback dot product, store state`

Read literally, that description is **Direct Form I** (DF-I):
- `·` — forward dot `[b0,b1,b2] · [x,x_1,x_2]`
- `++` — add the (running) state vector
- `-` — subtract feedback
- `·` — feedback dot `[a1,a2] · [y_1,y_2]`
- `←` — store `[x_1,x_2,y_1,y_2]` shifted

## Hand-asm (`sixth/qv/lib/p-functions.fs`, `emit-p20`)

19 instructions, 76 bytes. Comment at line 1028 reads: *"TDF-II: y = b0*x + z1"*.

```
LDR    S0,    [X9, #0]            ; b0
DUP    V0.4S, V0.S[0]             ; broadcast b0 → 4 lanes
LDR    S1,    [X9, #4]            ; b1
DUP    V1.4S, V1.S[0]
LDR    S2,    [X9, #8]            ; b2
DUP    V2.4S, V2.S[0]
LDR    S3,    [X9, #12]           ; a1
DUP    V3.4S, V3.S[0]
LDR    S4,    [X9, #16]           ; a2
DUP    V4.4S, V4.S[0]
FMUL   V15.4S, V10.4S, V0.4S      ; y  = b0·x
FADD   V15.4S, V15.4S, V16.4S     ; y += z1               → y = b0·x + z1
MOV    V16.16B, V17.16B           ; z1' = z2
FMLA   V16.4S, V10.4S, V1.4S      ; z1'+= b1·x            → z1 = z2 + b1·x
FMLS   V16.4S, V15.4S, V3.4S      ; z1'-= a1·y            → z1 = z2 + b1·x − a1·y
FMUL   V17.4S, V10.4S, V2.4S      ; z2' = b2·x
FMLS   V17.4S, V15.4S, V4.4S      ; z2'-= a2·y            → z2 = b2·x − a2·y
MOV    V10.16B, V15.16B           ; output = y
RET
```

## Math (TDF-II, as implemented)

Per sample, four voices in parallel (V.4S):

```
y[n]   = b0·x[n] + z1[n−1]
z1[n]  = b1·x[n] + z2[n−1] − a1·y[n]
z2[n]  = b2·x[n]           − a2·y[n]
out    = y[n]
```

State: two NEON registers (z1, z2) per voice-quad.
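
The recurrence can be checked against the hand-asm with a lane-wise emulation. This is a minimal sketch, not the emitted code: `p20_step` is a hypothetical helper name, the coefficient values are illustrative, and Python floats are f64, so rounding will not match the f32 FMLA/FMLS results bit for bit.

```python
def p20_step(x, coef, z1, z2):
    """One step of the TDF-II body, lane by lane.

    x, z1, z2 are 4-element lists standing in for V10, V16, V17.
    coef is (b0, b1, b2, a1, a2), applied to every lane as DUP does.
    Returns (y, z1_next, z2_next).
    """
    b0, b1, b2, a1, a2 = coef
    # FMUL V15 = x*b0, then FADD V15 += z1
    y = [b0 * xi + z1i for xi, z1i in zip(x, z1)]
    # MOV V16 = z2, then FMLA += b1*x, then FMLS -= a1*y
    z1_next = [z2i + b1 * xi - a1 * yi for xi, yi, z2i in zip(x, y, z2)]
    # FMUL V17 = b2*x, then FMLS -= a2*y
    z2_next = [b2 * xi - a2 * yi for xi, yi in zip(x, y)]
    return y, z1_next, z2_next


# Impulse into lane 0 only; the other three lanes stay silent.
coef = (0.5, 0.25, 0.125, -0.5, 0.25)  # illustrative values, not from the source
z1, z2 = [0.0] * 4, [0.0] * 4
y0, z1, z2 = p20_step([1.0, 0.0, 0.0, 0.0], coef, z1, z2)
y1, z1, z2 = p20_step([0.0] * 4, coef, z1, z2)
```

Lanes never interact: each voice carries its own `z1`/`z2` pair, which is why the silent lanes stay at zero.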

## Math (DF-I, the literal reading of the chain)

```
y[n] = b0·x[n] + b1·x[n−1] + b2·x[n−2] − a1·y[n−1] − a2·y[n−2]
state_next = [x[n], x[n−1], y[n], y[n−1]]
```

State: four NEON registers (x1, x2, y1, y2) per voice-quad.

## Register/state trace (hand-asm)

- **Params (per buffer, scalar then broadcast to V.4S):**
  - `V0.4S` ← b0, `V1.4S` ← b1, `V2.4S` ← b2, `V3.4S` ← a1, `V4.4S` ← a2
- **Signal in:** `V10.4S` = x (4-voice input)
- **Signal out:** `V10.4S` = y (4-voice output)
- **Scratch:** `V15.4S` = y (intermediate, output staged here before MOV to V10)
- **Persistent state across callbacks:** `Q16` = z1, `Q17` = z2 (per the `Q16-Q31 persistent` convention)

## Finding (form-agnostic source; font picks the form)

**The source chain `· ++ - · ←` reads abstractly: forward path, add running state, subtract feedback, store.** Neither DF-I nor TDF-II is encoded *in the chain itself* — the form is decided by which font compiles it.

Per Lithos law L9 (`lithos.laws.language`):
> *the font table is the optimizer*

And per `lithos.core.language`:
> *target invariance — source unchanged across targets; font table varies*

So the same `biquad ⇌ · ++ - · ←` source is correct on **all** targets, and each target's font emits the form best suited for it:

| Target | Form | Why | Evidence |
|--------|------|-----|----------|
| arm64 scalar (P3) | DF-I | small scalar state, simple recurrence | `build-factory.py` P3 recipe: 20 insns, DF-I |
| arm64 NEON (P20) | TDF-II | better numerical behavior under parameter automation; smaller SIMD state (2 V.4S regs vs 4) | `sixth/qv/lib/p-functions.fs:1015–1041` hand-asm |
| WASM | TBD (likely TDF-II) | comparable reasoning to NEON | font not yet emitting |
| GPU (AIR/SPIR-V) | TBD | dataflow, may prefer TDF-II for SIMT lanes | not yet built |

The forms differ in:

1. **State count and meaning.** DF-I state = past input/output samples (4 state values per voice). TDF-II state = pre-computed intermediate accumulators (2 state values per voice). The PFUNC_META row confirms NEON is TDF-II: `"b0,b1,b2,a1,a2,z1[4],z2[4]"`.
2. **Behavior under time-varying coefficients.** TDF-II state holds coefficient-weighted partial sums (`z1`, `z2`), while DF-I state is the raw input/output history. When a coefficient changes mid-stream, each form combines the new coefficient with a differently constructed state, so the same automation ramp produces a different transient in each. This, together with the smaller state, is the stated reason for using TDF-II on the NEON target.
3. **Static-coefficient equivalence.** With fixed coefficients (b0, b1, b2, a1, a2) and zero initial state, both forms realize the same transfer function and produce the same output for any input, up to floating-point rounding. The f64 numpy reference (Lyra's column) is the arbiter; both target outputs are diffed against it, not against each other.

## Resolution path

**Not a source change. A font-extension change.**

The NEON arm64 font needs entries (and the right ngram fusion) so that compiling `· ++ - · ←` through it emits the 19-instruction TDF-II blob already in the hand-asm. Specifically:

- Recognize the pattern as "biquad" in 4-voice context.
- Emit the load+broadcast prologue for 5 coefficients (5× LDR S + DUP V.4S).
- Emit the TDF-II body: FMUL → FADD → MOV V.16B → FMLA → FMLS → FMUL → FMLS → MOV V.16B.
- Emit RET.
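
The emission steps above can be sanity-checked by counting. A minimal sketch, assuming nothing about the font's real data structures: `p20_sequence` is a hypothetical helper that lists mnemonics only, with no operands or encodings.

```python
COEFFS = ["b0", "b1", "b2", "a1", "a2"]


def p20_sequence():
    """Return the mnemonic sequence the NEON font would need to emit."""
    seq = []
    for _ in COEFFS:  # prologue: scalar load, then broadcast to 4 lanes
        seq += ["LDR S", "DUP V.4S"]
    seq += ["FMUL", "FADD", "MOV V.16B", "FMLA",  # TDF-II body
            "FMLS", "FMUL", "FMLS", "MOV V.16B"]
    seq.append("RET")
    return seq


seq = p20_sequence()
blob_bytes = 4 * len(seq)  # every AArch64 instruction is 4 bytes
```

The totals match the hand-asm: 19 instructions, 76 bytes.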

The scalar arm64 font already (per P3's Python recipe) emits DF-I from this chain. P3 is the proof-of-concept that one source compiles correctly to multiple forms; P20 is the next target for the same chain — different font, different form, same source.

## Font-table gap (the real work)

The NEON arm64 font needs three additional entries (or their ngram equivalents) before this chain can emit the TDF-II blob:

| Op | Encoding | Used for | Glyph candidate |
|----|----------|----------|-----------------|
| FADD V.4S | `0x4E20D400 \| (Vm<<16) \| (Vn<<5) \| Vd` | `y += z1`, the `++` step | `++` (NEON variant) |
| FMLS V.4S | `0x4EA0CC00 \| (Vm<<16) \| (Vn<<5) \| Vd` | `z1 -= a1·y`, `z2 -= a2·y` — the feedback subtract | `**--` ngram (vector multiply-subtract, parallel to existing `**++` for FMLA) |
| MOV V.16B | `0x4EA01C00 \| (Vn<<16) \| (Vn<<5) \| Vd` | `z1 ← z2` state shift | dedicated opcode or `→ ←` reduction |
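
The encodings in the table can be exercised directly. A minimal sketch: `enc3` and the three wrappers are hypothetical names, and the register numbers are taken from the hand-asm listing. `MOV Vd.16B, Vn.16B` is the alias of `ORR Vd.16B, Vn.16B, Vn.16B`, which is why `Vn` appears in both source fields.

```python
def enc3(base, vd, vn, vm):
    """Pack a three-register NEON instruction word."""
    for r in (vd, vn, vm):
        if not 0 <= r <= 31:
            raise ValueError("register out of range")
    return base | (vm << 16) | (vn << 5) | vd


def fadd_4s(vd, vn, vm):
    return enc3(0x4E20D400, vd, vn, vm)


def fmls_4s(vd, vn, vm):
    return enc3(0x4EA0CC00, vd, vn, vm)


def mov_16b(vd, vn):
    return enc3(0x4EA01C00, vd, vn, vn)  # ORR Vd, Vn, Vn


words = [
    fadd_4s(15, 15, 16),  # FADD V15.4S, V15.4S, V16.4S
    mov_16b(16, 17),      # MOV  V16.16B, V17.16B
    fmls_4s(16, 15, 3),   # FMLS V16.4S, V15.4S, V3.4S
]
print([f"{w:#010x}" for w in words])
# ['0x4e30d5ef', '0x4eb11e30', '0x4ea3cdf0']
```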

Documented NEON ops already in `lithos.targets.language`:
```
⊛  FMUL V10.4S, V10.4S, V0.4S
⊞  FMLA V10.4S, V11.4S, V0.4S   (= **++ ngram)
◉  DUP V0.4S, V0.S[0]
ΣΣ FADDP V10.4S
⊛₀ FMUL V10.4S, V10.4S, V11.4S
```

`lithos.targets.language` already documents the ngram fusion pattern at 0x80+:
```
0x83 **++  → FMLA — fused multiply-add
0x84 **--  → FMLS — fused multiply-subtract
```

So `**--` for FMLS is **already specified as an ngram** in the Lithos language — it just doesn't have a V.4S NEON variant in the arm64-dsp-font yet. The work is:

1. Add NEON-variant entries to `arm64-dsp-font.s` (or a sibling `arm64-dsp-neon-font.s`) for: FADD V.4S, FMLS V.4S, MOV V.16B.
2. Ensure ngram fusion (`++ -` → FMLS) works in 4-voice context.
3. Verify the compiler picks the NEON variants when the chain is in a 4-voice composition (likely indicated by `.lion` declaration or by font selection at compile time).

**No source change to `pfuncs/p20-biquadneon.ls` required.** The chain is correct as-is.

## Stress / sonic notes (Lyra's column)

- **Coefficient automation.** If filter cutoff is modulated, TDF-II glides smoothly while DF-I produces audible micro-clicks. If the port emits a DF-I blob silently swapped under the TDF-II context layout, every automation move clicks.
- **State zero-on-activation.** All four lanes of both NEON state registers (V16.4S and V17.4S) must be zeroed at slot activation. If the activation path zeros only part of the state (e.g., lanes 0..1 of z1, via a code path inherited from scalar P3 init), voices 2 and 3 boot with stale state → audible thump.
- **Denormal handling.** Biquad state can decay toward denormal for low-Q lowpasses on near-silent input. The TDF-II form has a state ramp toward zero that can land in denormal territory; FPCR FZ must be set. Verify the Lithos-emitted prologue matches the Sixth blob on FPCR.
- **Saturation guard.** `quantum.principles.language` A1/A7 mentions `|state| > 32 zeroed per callback` as a runaway-feedback guard. The hand-asm doesn't show the guard — it's enforced by the audio callback's post-processing step, not in P20 itself. Port doesn't need to add it, but the audit should confirm the guard still triggers for the new blob's state register layout (Q16/Q17 must be the scanned range).
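
The activation and guard requirements can be modelled as follows. This is a sketch under stated assumptions: `activate_slot` and `saturation_guard` are hypothetical names, and the guard is assumed to act per lane (the real callback may reset the whole voice-quad instead). Only the threshold of 32 comes from the spec quote.

```python
GUARD_LIMIT = 32.0  # threshold quoted from quantum.principles.language


def activate_slot():
    """Return fresh z1, z2 with all four lanes of each zeroed."""
    return [0.0] * 4, [0.0] * 4


def saturation_guard(z1, z2, limit=GUARD_LIMIT):
    """Zero any state lane whose magnitude exceeds the limit.

    Assumption: the scan covers both z1 (Q16) and z2 (Q17).
    """
    def clean(v):
        return [0.0 if abs(s) > limit else s for s in v]
    return clean(z1), clean(z2)
```

A NaN lane passes through this comparison unchanged (`abs(nan) > 32` is false), so a literal `|state| > 32` test does not catch NaN on its own.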

## Verification criteria (Flux's framing, applied)

A port is verified when:

- `‖y_scalar − y_ideal_DF-I‖∞ < ε_scalar` — P3 scalar blob output matches the DF-I f64 numpy reference
- `‖y_NEON − y_ideal_TDF-II‖∞ < ε_NEON` — P20 NEON blob output matches the TDF-II f64 numpy reference
- Both ε within a few ULPs per accumulated FMA. Divergence between targets is allowed (different FMA orderings; different forms under automation); divergence from the form-matched `ideal` flags a font emitting the wrong sequence.
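
The criteria reduce to a sup-norm comparison per target. A minimal sketch: both function names are hypothetical, and `eps` is left to the caller because the audit does not fix numeric tolerances.

```python
def sup_norm_error(y, y_ideal):
    """Largest absolute sample difference, i.e. the infinity norm."""
    if len(y) != len(y_ideal):
        raise ValueError("length mismatch")
    return max((abs(p - q) for p, q in zip(y, y_ideal)), default=0.0)


def port_verified(y, y_ideal, eps):
    """True when the blob output stays within eps of its form-matched oracle."""
    return sup_norm_error(y, y_ideal) < eps
```

Each target is checked against its own oracle; the two targets are never compared with each other.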

### Oracle bifurcation (per Flux)

The f64 reference is **not single-valued for time-varying coefficients**.

- **Static-coefficient tests** (impulse, log sweep with fixed cutoff, white noise with fixed Q): one oracle is enough. Both forms share the transfer function `H(z)` and start from rest, so their f64 outputs agree up to rounding. One numpy reference, both target blobs diffed against it.
- **Automated-coefficient tests** (LFO on cutoff, knob ramp during sustain, MIDI CC modulation): the oracle must compute *in the same form as the target*. DF-I oracle for scalar P3; TDF-II oracle for NEON P20. They will diverge from each other for the same coefficient trajectory — that divergence is the physical reason TDF-II exists, not a bug.

Lyra's reference generator therefore needs two functions:

```python
def biquad_df1(x, b, a, state): ...      # 4-word state, for scalar/P3 oracle
def biquad_tdf2(x, b, a, state): ...     # 2-word state, for NEON/P20 oracle
```

Same `b, a` coefficient sequence + same `x` → divergent `y` under modulation. Expected.
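
One possible filling-in of the two stubs, in plain Python rather than numpy so it runs without dependencies. It is a sketch under assumptions: `b` and `a` are taken to be per-sample sequences of `(b0, b1, b2)` and `(a1, a2)`, state is updated in place, and the coefficient sets `A` and `B` are illustrative values, not from the source.

```python
def biquad_df1(x, b, a, state):
    """Direct Form I. state = [x1, x2, y1, y2], updated in place."""
    x1, x2, y1, y2 = state
    out = []
    for xn, (b0, b1, b2), (a1, a2) in zip(x, b, a):
        yn = b0 * xn + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        x2, x1 = x1, xn  # shift input history
        y2, y1 = y1, yn  # shift output history
        out.append(yn)
    state[:] = [x1, x2, y1, y2]
    return out


def biquad_tdf2(x, b, a, state):
    """Transposed Direct Form II. state = [z1, z2], updated in place."""
    z1, z2 = state
    out = []
    for xn, (b0, b1, b2), (a1, a2) in zip(x, b, a):
        yn = b0 * xn + z1
        z1 = b1 * xn + z2 - a1 * yn
        z2 = b2 * xn - a2 * yn
        out.append(yn)
    state[:] = [z1, z2]
    return out


A = ((0.5, 0.25, 0.125), (-0.5, 0.25))   # illustrative coefficient sets,
B = ((0.25, 0.5, 0.25), (-0.25, 0.125))  # not taken from the source
x = [1.0, 0.0, 0.0, 0.0]


def run(form, n_state, coefs):
    b, a = zip(*coefs)
    return form(x, b, a, [0.0] * n_state)


static = [A] * 4
swept = [A, B, B, B]  # coefficient change after the first sample
y_df1_static, y_tdf2_static = run(biquad_df1, 4, static), run(biquad_tdf2, 2, static)
y_df1_swept, y_tdf2_swept = run(biquad_df1, 4, swept), run(biquad_tdf2, 2, swept)
```

With `static`, both forms return `[0.5, 0.5, 0.25, 0.0]`. With `swept`, the second sample is 0.625 for DF-I and 0.5 for TDF-II: DF-I applies the new `b1` to the stored input, while TDF-II already folded the old `b1` into `z1`.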

## Verification status

- [x] Source `.ls` read.
- [x] Hand-asm read and decoded.
- [x] Math recovered for both forms (DF-I scalar, TDF-II NEON).
- [x] Register/state trace.
- [x] Form-agnostic source confirmed.
- [x] Font-table additions specified (FADD V.4S, FMLS V.4S, MOV V.16B) with concrete encodings above.
- [ ] NEON font extension implemented (`arm64-dsp-font.s` or sibling).
- [ ] Compile through Lithos compiler with NEON font → produce blob.
- [ ] Blob output diffed against f64 numpy reference (Lyra's `y_ideal`).
- [ ] Stress sequence: impulse → log sweep → cutoff-modulated white noise. The third stresses the TDF-II's automation-friendliness directly.

## Next step

1. **Atlas (me)**: specify the NEON font extension as a concrete patch — the three new entries with their 8-byte blobs, the ngram fusion rule for `++ -` → FMLS V.4S, and the prologue/epilogue conventions. Output: a font-extension proposal at `pfuncs/audit/font-extensions/neon-biquad.md`, referenced from this audit.
2. **Lyra**: produce two reference streams per coefficient trajectory:
   - `pfuncs/audit/refs/p20/y_ideal_tdf2.f32` — TDF-II f64 oracle (for NEON blob diff)
   - `pfuncs/audit/refs/p3/y_ideal_df1.f32` — DF-I f64 oracle (for scalar blob diff)
   Stimuli: impulse, log sweep with fixed lowpass at 1 kHz Q=0.707 (static), then cutoff-modulated white noise 200 Hz → 8 kHz over 1 sec (automation). The third stresses the form difference directly.
3. **Flux**: confirmed — TDF-II equations transcribed correctly, oracle bifurcation noted.
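
The stimuli in step 2 could be generated along these lines. Everything beyond the stated cutoffs and Q is an assumption made for illustration: the 48 kHz sample rate, the RBJ cookbook lowpass formulas for the coefficients, the 20 Hz to 20 kHz sweep range, the exponential shape of the cutoff ramp, and all function names.

```python
import math
import random

FS = 48_000  # sample rate in Hz; an assumption, the audit does not state one


def lowpass_coeffs(f0, q, fs=FS):
    """RBJ cookbook lowpass, normalised so a0 = 1.

    Returns ((b0, b1, b2), (a1, a2)).
    """
    w0 = 2.0 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2.0 * q)
    cw = math.cos(w0)
    a0 = 1.0 + alpha
    b = ((1.0 - cw) / 2.0 / a0, (1.0 - cw) / a0, (1.0 - cw) / 2.0 / a0)
    a = (-2.0 * cw / a0, (1.0 - alpha) / a0)
    return b, a


def impulse(n):
    return [1.0] + [0.0] * (n - 1)


def log_sweep(n, f_start=20.0, f_end=20_000.0, fs=FS):
    """Exponential sine sweep from f_start to f_end over n samples."""
    t_total = n / fs
    k = math.log(f_end / f_start)
    return [math.sin(2.0 * math.pi * f_start * t_total / k
                     * (math.exp(k * (i / fs) / t_total) - 1.0))
            for i in range(n)]


def white_noise(n, seed=0):
    rng = random.Random(seed)  # seeded so the reference stream is reproducible
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]


def cutoff_ramp(n, f_start=200.0, f_end=8_000.0):
    """Exponential cutoff trajectory, one value per sample."""
    return [f_start * (f_end / f_start) ** (i / (n - 1)) for i in range(n)]


static_b, static_a = lowpass_coeffs(1_000.0, 0.707)           # static tests
ramp = [lowpass_coeffs(f, 0.707) for f in cutoff_ramp(FS)]    # 1 s automation
```

Recomputing coefficients every sample is the harshest case for the form difference; a real callback that updates once per buffer would show a stepped version of the same divergence.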

## Cross-spec inconsistency to fix (per Flux)

- `quantum.factory.language` says: *"(P3, biquad — IIR filter Direct Form I; Lithos: biquad ⇌ · ++ - · ←)"* — names wrong form.
- `quantum.wave.language` says: *"(topology, transposed direct form II — 7 f32: b0 b1 b2 a1 a2 z1 z2)"* — names correct form, matches state count.

Two `.language` files disagree on which form. The wave file is right (state matches running code). Factory file needs a surgical edit. Out of scope for this audit page; logged as a follow-up.