playground

Score a piece of content against one or many criteria at once. Each criterion gets an independent score in [0, 1]. The probe reads pragmatic register, not arbitrary logic. Latency is ~50-200 ms for a single call and ~80-300 ms for a few criteria batched together.

interpretation notes

Each criterion is scored independently against the same content — the probe doesn't know about the other criteria. So "is about cooking" and "is about food" will both fire on a cooking post (they're not mutually exclusive labels).

The probe reads pragmatic register, not arbitrary logic. It's strong on topical / lexical / sentiment / temporal criteria; weak or anti-correlated on negation, arithmetic / counting, and abstract quality judgments. See SETTLED_FINDINGS.md §3 + §4 in the repo for the full capability ladder.

Scores are ranking values, not calibrated probabilities. Expected calibration error (ECE) on the base probe is 0.13-0.24 depending on the distribution. A threshold of 0.5 is a reasonable default, but a customer with 150+ labeled items per criterion can fit isotonic calibration to tighten that.
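As a rough illustration of what fitting isotonic calibration involves, here is a minimal pure-Python sketch of the pool-adjacent-violators (PAV) method. The function names `fit_isotonic` and `calibrate` are hypothetical and not part of the service, the six data points are toy values rather than real probe output, and the result is a step function (library implementations often interpolate between blocks instead). A real fit would use the 150+ labeled items per criterion mentioned above.

```python
from bisect import bisect_left


def fit_isotonic(scores, labels):
    """Fit a non-decreasing step function from raw scores to label rates.

    Returns (thresholds, values): the upper score bound of each pooled block
    and the calibrated value assigned to scores inside that block.
    """
    # Pool tied scores first so equal scores always share one value.
    grouped = {}
    for score, label in zip(scores, labels):
        total, count = grouped.get(score, (0.0, 0))
        grouped[score] = (total + label, count + 1)

    # Each block is [sum_of_labels, item_count, upper_score_bound].
    blocks = []
    for score in sorted(grouped):
        total, count = grouped[score]
        blocks.append([total, count, score])
        # Merge backwards while the previous block's mean exceeds the last one.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper

    thresholds = [block[2] for block in blocks]
    values = [block[0] / block[1] for block in blocks]
    return thresholds, values


def calibrate(score, thresholds, values):
    """Map a raw score to its block's fitted value, clamped at the top end."""
    index = bisect_left(thresholds, score)
    return values[min(index, len(values) - 1)]


# Toy data (hypothetical): raw probe scores with human yes/no labels.
raw_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
human_labels = [0, 0, 1, 0, 1, 1]
thresholds, values = fit_isotonic(raw_scores, human_labels)
print(calibrate(0.35, thresholds, values))  # 0.5
```

The out-of-order pair at 0.3 and 0.4 is pooled into one block with value 0.5, which is how the fit keeps the mapping monotonic while preserving the ranking the raw scores already provide.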

Bands: score ≥ 0.85 = high, 0.5 to < 0.85 = medium, 0.15 to < 0.5 = low, < 0.15 = very low. Hard cap of 64 criteria per call (set in service/main.py).
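The bands and the per-call cap can be mirrored client-side with a few lines. This is a sketch only: `band` and `chunk_criteria` are hypothetical helper names, not functions from the service, and the constants are copied from the text above. Since each criterion is scored independently, splitting a long criteria list across several calls should not change individual scores, though that is an inference from the notes above rather than a documented guarantee.

```python
MAX_CRITERIA_PER_CALL = 64  # hard cap per call, as stated in the text above


def band(score):
    """Map a score in [0, 1] to the band names used on this page."""
    if score >= 0.85:
        return "high"
    if score >= 0.5:
        return "medium"
    if score >= 0.15:
        return "low"
    return "very low"


def chunk_criteria(criteria, size=MAX_CRITERIA_PER_CALL):
    """Split a criteria list into batches that respect the per-call cap."""
    return [criteria[i:i + size] for i in range(0, len(criteria), size)]


# Hypothetical usage: 130 criteria need three calls (64 + 64 + 2).
batches = chunk_criteria([f"criterion {n}" for n in range(130)])
print([len(batch) for batch in batches])  # [64, 64, 2]
print(band(0.72))  # medium
```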