playground

Score a piece of content against one or many criteria at once. Each criterion gets an independent score in [0, 1]. The probe reads pragmatic register, not arbitrary logic. Typical latency is ~50-200 ms for a single call and ~80-300 ms for a few criteria batched together.

interpretation notes

Each criterion is scored independently against the same content — the probe doesn't know about the other criteria. So "is about cooking" and "is about food" will both fire on a cooking post (they're not mutually exclusive labels).

The probe reads pragmatic register, not arbitrary logic. It's strong on topical / lexical / sentiment / temporal criteria; weak or anti-correlated on negation, arithmetic / counting, and abstract quality judgments. See SETTLED_FINDINGS.md §3 + §4 in the repo for the full capability ladder.

Scores are ranking values, not calibrated probabilities. Expected calibration error (ECE) on the base probe is 0.13-0.24 depending on the distribution. A threshold of 0.5 is a reasonable default, but a customer with 150+ labeled items per criterion can fit isotonic calibration to tighten that.
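
As a rough illustration of what fitting isotonic calibration per criterion could look like, here is a minimal pure-Python sketch using pool-adjacent-violators. The function names `fit_isotonic` and `calibrate` are hypothetical and not part of the service; the sketch assumes you have raw probe scores and binary human labels for a single criterion.

```python
from bisect import bisect_left
from itertools import groupby


def fit_isotonic(scores, labels):
    """Fit a non-decreasing step function from raw scores to observed
    positive rates using pool-adjacent-violators (hypothetical helper)."""
    pairs = sorted(zip(scores, labels))
    blocks = []  # each block: [sum_of_labels, item_count, upper_score_edge]
    for s, grp in groupby(pairs, key=lambda p: p[0]):
        ys = [y for _, y in grp]  # pool tied scores into one block
        blocks.append([float(sum(ys)), len(ys), s])
        # Merge backwards while the previous block's mean exceeds this one's.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count, edge = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = edge
    thresholds = [b[2] for b in blocks]
    values = [b[0] / b[1] for b in blocks]
    return thresholds, values


def calibrate(score, thresholds, values):
    """Map a raw score to the fitted rate of the first block whose upper
    edge is at or above the score; clamp above the last edge."""
    i = bisect_left(thresholds, score)
    return values[min(i, len(values) - 1)]


# Toy data for one criterion (made up for illustration only).
raw = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
gold = [0, 0, 1, 0, 1, 0, 1, 1]
thresholds, values = fit_isotonic(raw, gold)
```

The fit is done separately for each criterion because each one is scored independently. With only eight toy items the step function is coarse; the 150+ items mentioned above give it enough resolution to be useful.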

Bands: score ≥ 0.85 = high, ≥ 0.5 = medium, ≥ 0.15 = low, < 0.15 = very low. Hard cap of 64 criteria per call (set in service/main.py).
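
The band thresholds and the per-call cap can be mirrored client-side. This is a small sketch, not the service's code: `band`, `chunk_criteria`, and `MAX_CRITERIA_PER_CALL` are hypothetical names, and only the threshold values and the limit of 64 come from the notes above.

```python
MAX_CRITERIA_PER_CALL = 64  # hard cap stated above; the constant name is an assumption


def band(score: float) -> str:
    """Map a raw score in [0, 1] to the display band described above."""
    if score >= 0.85:
        return "high"
    if score >= 0.5:
        return "medium"
    if score >= 0.15:
        return "low"
    return "very low"


def chunk_criteria(criteria, size=MAX_CRITERIA_PER_CALL):
    """Split a long criteria list into batches that fit under the cap."""
    return [criteria[i:i + size] for i in range(0, len(criteria), size)]


# Example: 150 made-up criteria split into batches of at most 64.
batches = chunk_criteria([f"criterion {n}" for n in range(150)])
```

Because each criterion is scored independently, splitting a long list across several calls should not change any individual score; it only adds round trips.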

internal — Qwen-2.5-3B-Instruct + base_probe_2130 via gex44