Notebook 021 - The Lab Graded Itself

We asked the morning forecast to face the afternoon tape. It passed on posture and failed on direction.

For weeks ONYX Lab was a wall of tables — scenario matrices, alert ladders, weakness lists, playbook prose. Useful for audit. Exhausting to read at 8:15 AM when you only need to know three things: where price is relative to the plan, what branch is live, and whether the desk should deploy.

So we rebuilt it. Graph first: probability paths, buy zone, exit target, void line, and a price ladder underneath. Symbol Focus became tables instead of nested cards. The detailed Markov math moved into a collapsible audit drawer. The Lab graph is context only: it does not approve trades, flip AUTO, or touch the broker. The written plan and Guard still decide.

Then we did the part most research tools skip: we graded it.

Tuesday, June 16, was an FOMC-eve session — fragile tape, selective deployment, every watchlist name on the board as WATCH-only. At 08:03 ET the Lab published its premarket read. At 16:15 ET we replayed the session against the same entry, chase, target, and void levels from that morning report and asked a simple question: did the likely direction — up, sideways, or down — match what the tape actually did?

As always: this is a paper-trading system and a process journal, not trading advice.

The scoreboard

Eleven symbols. Four direction matches. 36.4%.

Layer	Grade	What happened
Direction probabilities	D+	BEAR/sideways skew on mega caps; six names closed green vs plan entry while the model said down or chop
Deployment / capital dial	A	SELECTIVE 35%, zero A-rows, flat book — correct for a fragile day
Chase-miss / plan weaknesses	A	Pre-flagged missed-no-chase on the names that gapped above max chase
Scenario branch coverage	A	Every realized day type had a named branch in the morning matrix
Plan vs execution alignment	C	Board said WATCH-only; some auto brackets were armed anyway

The headline number is ugly. We are not going to sand it down. A calibration tool that only publishes good days is just marketing.

What the Lab said before the open

The morning blend — historical Markov state probabilities plus lane metrics from the opening-range-momentum profile — leaned sideways or down on most of the watchlist. Only two names carried strong up conviction above 70%: the semis leaders. Apple was the cleanest high-confidence call: 70% sideways, and the session played out as chop relative to entry.

The Investment Quality Board matched: SIDEWAYS / FRAGILE_RISK_ON, deployment dial at SELECTIVE 35%, no executable A or strong-B rows. Trade-probability edge was negative on every name. The Lab's job that morning was not to talk us into size. It was to describe branches and warn us where the plan geometry would break.

That part worked.

What the tape did instead

The dominant story was not "the market went up" or "the market went down." It was gap_up_chase_miss_with_reclaim — four times.

The pattern: names opened firm, already above the buy zone and max chase before a pullback fill was possible. Reclaim happened later. Close vs entry was green on several mega caps. No desk fill on the gapped names because the plan correctly forbids chasing above the written cap.

So direction labels and executable outcomes diverged:

META, GOOGL, MSFT, AMZN — Lab said BEAR or sideways; tape closed up vs entry (+1.5% to +2%) but no fill because chase was exceeded at the open.
AAPL — Lab said sideways (70%); tape was sideways vs entry. Chase exceeded; no live fill. Best call of the day.
AMD, ARM — Lab said up; tape hit target (direction bucket = up) even though AMD closed below entry after a violent open.
NVDA, RKLB — Lab said down; tape was flat/sideways. Close misses, not catastrophic.

Broad indices faded into the close — SPY and QQQ down vs session VWAP, semis weak, risk-off distribution into Wednesday's FOMC. The morning fragile / selective posture was right even while individual symbols printed green vs their plan entries.

What we got right (and why it mattered more than 36%)

Three layers of the Lab did their job even when the direction label did not:

1. Deployment posture. The desk finished flat. Zero P&L. On a day when several names looked tradable in hindsight, the capital dial and WATCH-only board said do not force it. That is the outcome we want when trade-probability edge is negative across the book.

2. Chase-miss warnings. Plan weaknesses flagged likely missed-no-chase on the two names we were closest to arming. Both validated live: one closed above max chase; the other gapped above chase at the open. The Lab did not predict the direction perfectly. It predicted we would not get the entry we wrote — and it was right.

3. Scenario coverage. Every realized archetype — limit fill without target, fill with target hit, target hit without fill, gap-up chase miss with reclaim — had a named branch in the morning scenario matrix. When the afternoon happened, we were not improvising labels. We were matching tape to a branch we had already written.

That is the difference between a forecast tool and a desk rehearsal tool. The Lab is supposed to make the day legible before it happens, not replace the trader.

What we missed (and what it teaches)

Historical BEAR state is not today's direction. When Markov says BEAR and the stock gaps above your buy zone, the relevant question is not "will it go up?" It is "will we get filled without violating chase?" The model overweighted regime memory vs opening geometry.

Direction buckets mix two different questions. A name can hit target at the open (bucket = up) while closing red vs entry (looks like failure on a P&L chart). AMD was both. Scoring only on likely-direction vs bucket hides the executable story.

Midday was worse than close. On the two armed lanes at 13:30 ET, intraday calibration was 0 for 2: AAPL looked up vs entry, AMD looked down. The formal close score improved only because AAPL settled back into the sideways bucket. A single end-of-day label is not enough.

Lab context and execution can disagree. The board said WATCH-only. Some brackets were armed. When the Lab says "selective, weak edge" and the execution layer partially ignores that, the scorecard blames the model for a problem that was also operational.

How we plan to improve ONYX Lab

We are not throwing out the graph or the probabilities. We are splitting what we score from what we show, and making the gap-aware failure modes first-class.

1. Two scores, not one

Keep direction match (likely vs up/sideways/down bucket) for research honesty. Add a separate executable-outcome score:

Chase miss predicted and occurred → credit
Fill without target → neutral / lane-specific
Target hit → credit
Void breached → debit

Tuesday would score much better on executable geometry than on direction labels — and that matches how the desk actually uses the tool.
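The credit/debit rules above could be sketched as a small scoring function. This is a minimal illustration, not the Lab's actual scorer: the SessionOutcome fields, the +1 / 0 / -1 point values, and the precedence order (void checked before target) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class SessionOutcome:
    """Hypothetical per-symbol replay record; field names are illustrative."""
    symbol: str
    chase_miss_predicted: bool  # morning plan flagged a likely missed-no-chase
    chase_miss_occurred: bool   # price stayed above max chase, so no fill
    filled: bool
    target_hit: bool
    void_breached: bool


def executable_score(o: SessionOutcome) -> int:
    """Return +1 (credit), 0 (neutral), or -1 (debit).

    Point values and rule precedence are assumed, not taken from the Lab.
    """
    if o.void_breached:
        return -1  # void breached -> debit
    if o.target_hit:
        return 1   # target hit -> credit
    if o.chase_miss_predicted and o.chase_miss_occurred:
        return 1   # chase miss predicted and occurred -> credit
    # Fill without target, or anything else: neutral / lane-specific.
    return 0
```

Keeping this score separate from the direction match means a pre-flagged chase miss earns credit even when the direction label was wrong.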

2. Gap-aware premarket branch

When premarket last is already above max chase, the Lab should promote a gap-open branch to the top of the graph and downgrade the default pullback path — not silently keep drawing three symmetric probability curves as if entry were reachable. The scenario matrix already names chase miss; the graph should lead with it when the math says it is the base case.
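One way to express that promotion rule, as a rough sketch: compare premarket last to max chase and, if entry is already out of reach, move the gap-open branch to the front of the display order. The function name, the branch labels, and the "gap_open" default are placeholders, not the Lab's real scenario names.

```python
def order_branches(premarket_last: float, max_chase: float,
                   branches: list[str], gap_branch: str = "gap_open") -> list[str]:
    """Return branches in display order.

    If premarket last is above max chase, the gap branch leads and the
    remaining branches keep their relative order. Branch names are
    hypothetical labels used only for this example.
    """
    if premarket_last > max_chase and gap_branch in branches:
        return [gap_branch] + [b for b in branches if b != gap_branch]
    return list(branches)
```

The same comparison could also drive the downgrade of the default pullback path; that part is left out here because the source does not say how the downgrade should be weighted.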

3. Richer "actual" labels for scoring

Add secondary buckets for calibration reports:

Close vs prior close (did the stock go up on the day?)
Close vs session VWAP
Close vs plan entry (current method)

Tuesday's mega-cap "misses" were often green vs entry on a down-broad-tape day. Multiple reference frames stop one bucket from telling the whole story.
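A sketch of how the three reference frames might be computed side by side. The 0.5% sideways band and all function and key names are assumptions for illustration. The sketch also ignores the target-hit override that can set the current bucket to "up" regardless of the close.

```python
def bucket(close: float, reference: float, band_pct: float = 0.5) -> str:
    """Classify close vs a reference price as up / sideways / down.

    band_pct is an assumed sideways tolerance in percent, not a Lab setting.
    """
    change_pct = (close - reference) / reference * 100.0
    if change_pct > band_pct:
        return "up"
    if change_pct < -band_pct:
        return "down"
    return "sideways"


def actual_labels(close: float, prior_close: float,
                  session_vwap: float, plan_entry: float) -> dict[str, str]:
    """Label one close against all three reference frames."""
    return {
        "vs_prior_close": bucket(close, prior_close),
        "vs_session_vwap": bucket(close, session_vwap),
        "vs_plan_entry": bucket(close, plan_entry),
    }
```

With labels like these, a name can read "up" vs plan entry and "down" vs prior close on the same day, which is the disagreement a single bucket hides.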

4. Session score in the UI, every day

The post-close pipeline already builds replay artifacts and runs session scoring. The Lab UI now has Yesterday / Overall score panels; next step is making the today vs yesterday comparison impossible to miss after close — direction match, mean Brier, dominant archetype, and tomorrow hints on the same screen as the graph.
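For readers unfamiliar with the mean Brier figure, here is a minimal sketch of the standard multi-class Brier score over the three direction buckets. Whether the Lab uses this exact form (summed over classes, unnormalized) is an assumption; the function names are illustrative.

```python
OUTCOMES = ("up", "sideways", "down")


def brier(probs: dict[str, float], actual: str) -> float:
    """Multi-class Brier score for one forecast: sum of squared errors
    between the forecast probabilities and the one-hot realized outcome.
    0.0 is a perfect forecast; 2.0 is the worst possible."""
    return sum((probs[k] - (1.0 if k == actual else 0.0)) ** 2 for k in OUTCOMES)


def mean_brier(forecasts: list[tuple[dict[str, float], str]]) -> float:
    """Average Brier score over (probabilities, realized bucket) pairs."""
    if not forecasts:
        raise ValueError("need at least one forecast to score")
    return sum(brier(p, a) for p, a in forecasts) / len(forecasts)
```

Under this form, a 70% sideways call that verifies contributes (0.7 - 1)^2 = 0.09 from the sideways term alone; the rest depends on how the remaining 30% was split between up and down.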

5. Midday calibration as a first-class artifact

Formal score waits for close. Midday calibration (armed names only, live last vs entry, chase state, posture blocks) should auto-generate at 10:30 and 13:30 ET so we catch 0 for 2 at lunch before the close bucket flattens the story.

6. Board–Lab–execution alignment gate

If the Investment Quality Board is WATCH-only and the deployment dial is below threshold, the Lab banner should surface a hard alignment warning when auto brackets or PLAN LIVE are armed for those symbols. Context and authority should not contradict without an explicit written amendment.
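The gate could be as simple as the check below. This is a sketch under stated assumptions: the "WATCH_ONLY" posture string, the 50% default threshold, and the function signature are invented for the example, and the written-amendment override is not modeled.

```python
def alignment_warnings(board_posture: str, deployment_pct: float,
                       armed_symbols: set[str],
                       threshold_pct: float = 50.0) -> list[str]:
    """Return one warning per armed symbol when the board is WATCH-only
    and the deployment dial is below the (assumed) threshold."""
    if board_posture != "WATCH_ONLY" or deployment_pct >= threshold_pct:
        return []
    return [
        f"{symbol}: armed while board is WATCH-only at {deployment_pct:.0f}% deployment"
        for symbol in sorted(armed_symbols)
    ]
```

Run against Tuesday's inputs (WATCH-only board, 35% dial, AAPL and AMD armed), a check like this would have produced two warnings for the banner.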

7. Rolling accuracy by archetype

Track direction match and executable-outcome match over rolling 5- and 20-session windows, split by archetype (chase miss, fill+target, chop). We expect direction match to stay noisy when TP edge is negative. We expect chase-miss prediction to be stable. Publish both so we know which layer to trust.
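A minimal sketch of the rolling split, assuming each scored symbol is stored as a (session, archetype, matched) record and that session keys sort chronologically (for example ISO dates). The record shape and function name are hypothetical.

```python
from collections import defaultdict


def rolling_match_rate(records: list[tuple[str, str, bool]],
                       window: int) -> dict[str, float]:
    """Match rate per archetype over the most recent `window` sessions.

    records: (session_key, archetype, matched) tuples. Session keys are
    assumed to sort in chronological order.
    """
    if window <= 0:
        raise ValueError("window must be a positive number of sessions")
    sessions = sorted({session for session, _, _ in records})
    recent = set(sessions[-window:])

    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for session, archetype, matched in records:
        if session in recent:
            totals[archetype] += 1
            hits[archetype] += int(matched)
    return {archetype: hits[archetype] / totals[archetype] for archetype in totals}
```

Calling it once with direction-match records and once with executable-outcome records, at window=5 and window=20, would give the four series described above.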

8. Negative-edge days: demote direction, promote geometry

When every name carries negative trade-probability edge, the Lab UI should default to ladder + chase/void geometry and collapse direction conviction visuals, or label them "historical prior only." Tuesday was that day. The desk needed "do not chase" more than "70% sideways."
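As a sketch, the default could hang off a single check on the book's trade-probability edges. The view-mode keys and values below are invented for illustration and do not describe the Lab UI's actual configuration.

```python
def lab_view_mode(tp_edges: dict[str, float]) -> dict[str, str]:
    """Pick the default Lab layout from per-symbol trade-probability edge.

    Returns a hypothetical view config; the keys and values are placeholders.
    """
    all_negative = bool(tp_edges) and all(edge < 0 for edge in tp_edges.values())
    if all_negative:
        return {
            "primary": "ladder_geometry",        # ladder + chase/void levels lead
            "direction_visuals": "collapsed",
            "direction_label": "historical prior only",
        }
    return {
        "primary": "probability_graph",
        "direction_visuals": "expanded",
        "direction_label": "",
    }
```

An empty book falls through to the normal layout here; that is a design choice for the example, not something the source specifies.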

What we actually learned

We rebuilt ONYX Lab to be readable. Then we let it sit across from the tape and take a grade. 36% direction match is a fail if you think the Lab is a fortune teller. It is a pass on the harder question if you think the Lab is a rehearsal and guardrail system: capital stayed flat, chase misses were pre-flagged, every afternoon archetype was named before the open.

The work ahead is not only "make the probabilities smarter." It is to score the right things, lead with geometry when the gap makes entry impossible, and keep Lab context, board posture, and execution authority from disagreeing.

We will publish the next score after FOMC day the same way — honestly, with the number at the top, and with a clearer split between what we forecast and what we could actually trade.