playground
Score a piece of content against one or many criteria at once. Each criterion gets an independent score in [0, 1]; the probe reads pragmatic register, not arbitrary logic. Latency is roughly 50-200 ms for a single call and roughly 80-300 ms for a few criteria batched together.
interpretation notes
Each criterion is scored independently against the same content — the probe doesn't know about the other criteria. So "is about cooking" and "is about food" will both fire on a cooking post (they're not mutually exclusive labels).
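Because the scores are independent, a caller would normally keep every criterion that clears a threshold rather than pick a single winner. A minimal sketch of that, assuming the scores arrive as a plain name-to-score mapping (the dict shape, the `matching_criteria` helper, and the score values are made up for illustration and are not the service's actual response format):

```python
# Hypothetical scores for one cooking post; values are invented for illustration.
scores = {
    "is about cooking": 0.93,
    "is about food": 0.88,
    "is about politics": 0.04,
}


def matching_criteria(scores: dict[str, float], threshold: float = 0.5) -> list[str]:
    """Return every criterion whose score clears the threshold (multi-label)."""
    return [name for name, score in scores.items() if score >= threshold]


matched = matching_criteria(scores)
print(matched)
```

Both overlapping criteria survive the filter; an argmax over the scores would have discarded one of them.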
The probe reads pragmatic register, not arbitrary logic. It's strong on topical / lexical / sentiment / temporal criteria; weak or anti-correlated on negation, arithmetic / counting, and abstract quality judgments. See SETTLED_FINDINGS.md §3 + §4 in the repo for the full capability ladder.
Scores are ranking values, not calibrated probabilities. Expected calibration error (ECE) on the base probe is 0.13-0.24 depending on the distribution. A threshold of 0.5 is a reasonable default, but a customer with 150+ labeled items per criterion can fit isotonic calibration to tighten that.
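Isotonic calibration for one criterion can be sketched with the pool-adjacent-violators algorithm. This is a standard-library-only illustration, not the service's own code: `fit_isotonic` and `calibrate` are hypothetical helper names, and the eight labeled items below are toy data (a real fit would use the 150+ labeled items mentioned above).

```python
from bisect import bisect_left
from collections import defaultdict


def fit_isotonic(scores, labels):
    """Fit a non-decreasing step function from raw scores to label rates."""
    # Pool items that share the same raw score so ties get one value.
    grouped = defaultdict(lambda: [0.0, 0])
    for s, y in zip(scores, labels):
        grouped[s][0] += y
        grouped[s][1] += 1

    blocks = []  # each block: [sum of labels, item count, highest score covered]
    for s in sorted(grouped):
        total, count = grouped[s]
        blocks.append([total, count, s])
        # Merge backwards while a block's mean exceeds the mean of the next one.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper

    uppers = [b[2] for b in blocks]
    means = [b[0] / b[1] for b in blocks]
    return uppers, means


def calibrate(score, model):
    """Map a raw score to the mean label rate of the block that covers it."""
    uppers, means = model
    i = bisect_left(uppers, score)
    return means[min(i, len(means) - 1)]


# Toy labeled set for a single criterion (1 = criterion truly applies).
raw = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
truth = [0, 0, 1, 0, 1, 0, 1, 1]
model = fit_isotonic(raw, truth)
print(calibrate(0.35, model))
```

The fitted mapping preserves the ranking of the raw scores, so it only changes what a given threshold means, not the order of items. Where scikit-learn is available, its `IsotonicRegression` class is an alternative to hand-rolling this.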
Bands: score ≥ 0.85 = high, ≥ 0.5 = medium, ≥ 0.15 = low, < 0.15 = very low. Hard cap of 64 criteria per call (set in service/main.py).
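The band cutoffs and the per-call cap translate into a few lines of client-side code. A small sketch, assuming the caller handles banding and batching itself (`band`, `chunk_criteria`, and `MAX_CRITERIA_PER_CALL` are illustrative names, not identifiers from service/main.py):

```python
MAX_CRITERIA_PER_CALL = 64  # hard cap stated above; the constant name is hypothetical


def band(score: float) -> str:
    """Map a score in [0, 1] to its band using the documented cutoffs."""
    if score >= 0.85:
        return "high"
    if score >= 0.5:
        return "medium"
    if score >= 0.15:
        return "low"
    return "very low"


def chunk_criteria(criteria: list[str], size: int = MAX_CRITERIA_PER_CALL) -> list[list[str]]:
    """Split a criteria list into batches that each fit within one call."""
    return [criteria[i:i + size] for i in range(0, len(criteria), size)]


criteria = [f"criterion {n}" for n in range(150)]
batches = chunk_criteria(criteria)
print([len(b) for b in batches], band(0.62))
```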