about

what this is

Internal sandbox for the criterion-probe. The probe is a frozen backbone (Qwen-2.5-3B-Instruct) plus a 2048-dim LR head. It returns the probability that a single English criterion sentence is satisfied by a piece of content. It reads pragmatic register at the ASSESSMENT seed position; it doesn't reason.
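A minimal sketch of what a head like this computes, for orientation only. It assumes "LR" means logistic regression and that the backbone hands over one 2048-dim hidden-state vector taken at the ASSESSMENT seed position. The function name, the weights, and the bias are hypothetical; this is not the probe's actual code.

```python
import math
from typing import Sequence

HIDDEN_DIM = 2048  # head width stated in the description above


def lr_head_probability(
    hidden: Sequence[float],   # assumed: frozen-backbone hidden state at the seed position
    weights: Sequence[float],  # hypothetical learned LR weights
    bias: float,               # hypothetical learned LR bias
) -> float:
    """Return sigmoid(w . h + b), the probability that the criterion is satisfied."""
    if len(hidden) != HIDDEN_DIM or len(weights) != HIDDEN_DIM:
        raise ValueError(f"expected {HIDDEN_DIM}-dim vectors")
    logit = sum(h * w for h, w in zip(hidden, weights)) + bias
    # Numerically stable sigmoid: never exponentiate a large positive number.
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)
    return z / (1.0 + z)
```

Only the head would be trained in a setup like this; the backbone stays frozen, which fits the note that the probe reads a representation and doesn't reason.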

how to use it

playground: one criterion + one content, returns a single score.
batch: up to 64 (criterion, content) pairs in one call.
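If you have more than 64 pairs, they need to be split across several batch calls. A minimal sketch of the splitting step, assuming pairs are plain (criterion, content) string tuples; `chunk_pairs` is a hypothetical helper, and the actual batch call is left out because its interface isn't documented here.

```python
from typing import Iterator, List, Sequence, Tuple

MAX_BATCH = 64  # per-call limit stated above

Pair = Tuple[str, str]  # (criterion, content)


def chunk_pairs(pairs: Sequence[Pair], size: int = MAX_BATCH) -> Iterator[List[Pair]]:
    """Yield consecutive slices of `pairs`, each holding at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(pairs), size):
        yield list(pairs[start:start + size])
```

Input order is preserved, so scores that come back per chunk can be concatenated and lined up with the original pairs.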

what it does well / badly

Well: topical, lexical, sentiment, multi-clause AND/OR, deontic, and counterfactual criteria. AUCs are typically 0.9+.
Badly: negation (anti-correlated on Qwen-2.5-3B, see SETTLED_FINDINGS §4.1), arithmetic / counting, presupposition vs. assertion, and abstract quality judgments ("comprehensive", "helpful").
Anti-correct on compound vibe criteria: a multi-axis criterion like "comprehensive AND helpful AND clear AND child-accessible" ranks the WORST items highest. For quality assessment, decompose the criterion into atoms or use an LM judge.
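One way the "decompose into atoms" advice could look in practice: score each atomic criterion separately and keep the per-atom scores visible instead of sending one compound sentence. This is a sketch under assumptions. `score` stands in for whatever single-pair scoring call you use, `score_decomposed` is a hypothetical helper, and taking the minimum as the summary is an illustrative choice, not a documented recommendation.

```python
from typing import Callable, Dict, Sequence, Tuple

# Assumed shape of a single-pair scorer: (criterion, content) -> probability.
ScoreFn = Callable[[str, str], float]


def score_decomposed(
    content: str,
    atoms: Sequence[str],
    score: ScoreFn,
) -> Tuple[Dict[str, float], float]:
    """Score each atomic criterion on its own.

    Returns the per-atom scores and the weakest one as a conservative summary.
    """
    if not atoms:
        raise ValueError("need at least one atomic criterion")
    per_atom = {atom: score(atom, content) for atom in atoms}
    return per_atom, min(per_atom.values())
```

Atoms should be criteria of the kinds listed under "Well". An atom that is itself an abstract quality judgment ("helpful") falls back into the "Badly" list, and decomposition won't rescue it.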

upstream status

checking…

