Why three concepts?
WorldJen exposes three distinct evaluation surfaces — Score, Rank, and Bench — because each answers a different question. Using the right one keeps your workflow lean and the results meaningful.
Score: "How good is this clip?"
Score is for single-clip evaluation. You have one generated video and want raw per-dimension numbers back. Use it when:
- You're prototyping a model and want quick sanity feedback after each generation.
- You're A/B testing prompts and want to compare two clips at a glance.
- You want to inspect one clip's evaluator scoring without setting up a full benchmark.
Score sessions are user-scoped: each user has exactly one. Uploads accumulate in the session until you call reset. There is no leaderboard and no built-in comparison; you get back only the dimension scores for each clip you uploaded.
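The session behaviour described above can be modelled in a few lines. This is a toy in-memory sketch, not the WorldJen SDK: the ScoreSession class, its upload and reset methods, get_session, the file names, and the dimension names and values are all hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ScoreSession:
    """Toy stand-in for one user's Score session (hypothetical, not the SDK)."""
    user_id: str
    uploads: list = field(default_factory=list)

    def upload(self, clip_name, dimension_scores):
        # Store the clip with its own per-dimension scores; nothing is ranked.
        self.uploads.append({"clip": clip_name, "scores": dict(dimension_scores)})
        return self.uploads[-1]["scores"]

    def reset(self):
        # Drop everything accumulated so far.
        self.uploads.clear()


_sessions = {}


def get_session(user_id):
    # One session per user: repeated lookups return the same object.
    if user_id not in _sessions:
        _sessions[user_id] = ScoreSession(user_id)
    return _sessions[user_id]


# Placeholder scores; a real evaluator would produce these.
session = get_session("alice")
session.upload("take1.mp4", {"motion": 0.8, "fidelity": 0.7})
session.upload("take2.mp4", {"motion": 0.6, "fidelity": 0.9})
count_before_reset = len(get_session("alice").uploads)
session.reset()
count_after_reset = len(get_session("alice").uploads)
print(count_before_reset, count_after_reset)
```

The point of the sketch is that the session is keyed by user rather than by clip or prompt, so uploads pile up until an explicit reset.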
Rank: "Which of these is the best?"
Rank is for comparative ranking across clips that share a prompt. You generated five variants of "a cat sitting on a windowsill" and want them sorted. Use it when:
- You're sweeping a hyperparameter and want to see which setting wins.
- You're comparing model variants on the same prompt without spinning up a full Bench run.
- You're building a leaderboard for blog content or a paper.
Rank sessions lock to the prompt of the first upload. The lock is enforced server-side, so concurrent uploads (CLI + dashboard, multiple scripts) can't disagree about what's being ranked. Pass --prompt on the first upload; on subsequent uploads, either omit it or pass the same prompt again. Passing a different prompt fails fast with RankPromptMismatchError. See Rank prompt lock for the state machine in detail.
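The lock can be pictured as a two-state machine: unlocked until the first upload, then fixed to that prompt. The sketch below is a local model only, not the server implementation. RankSession and its upload method are hypothetical, RankPromptMismatchError is redefined here as a stand-in rather than imported from the SDK, and two behaviours are assumptions: that the first upload must carry a prompt, and that prompts are compared by exact string equality.

```python
import threading


class RankPromptMismatchError(Exception):
    """Local stand-in for the error named in the text (not the SDK class)."""


class RankSession:
    """Toy model of the prompt lock: unlocked until the first upload, then fixed."""

    def __init__(self):
        # A single mutex mimics having one point of enforcement on the server.
        self._mutex = threading.Lock()
        self.prompt = None
        self.clips = []

    def upload(self, clip_name, prompt=None):
        with self._mutex:
            if self.prompt is None:
                # Assumption: the first upload has to supply the prompt.
                if prompt is None:
                    raise ValueError("the first upload must supply a prompt")
                self.prompt = prompt
            elif prompt is not None and prompt != self.prompt:
                # Assumption: exact string comparison decides a mismatch.
                raise RankPromptMismatchError(
                    f"session is locked to {self.prompt!r}, got {prompt!r}"
                )
            self.clips.append(clip_name)


session = RankSession()
session.upload("v1.mp4", prompt="a cat sitting on a windowsill")  # locks the session
session.upload("v2.mp4")                                          # prompt omitted: accepted
try:
    session.upload("v3.mp4", prompt="a dog on a sofa")            # different prompt
except RankPromptMismatchError as exc:
    print("rejected:", exc)
```

Because the check and the append happen under the same mutex, two concurrent first uploads with different prompts cannot both succeed in this model; whichever arrives second is rejected.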
Bench: "How does the model perform overall?"
Bench is for full benchmark runs at scale. Many prompts, all your dimensions, possibly multiple models compared side-by-side, executed on a worker queue. Use it when:
- You're publishing model results and need a reproducible scorecard with CSV export.
- You're regression-testing a model across releases.
- You're comparing two models head-to-head with a reference model.
Bench has two flavours. bench.create(...) enqueues work on a GPU runner that you provisioned in advance — your script exits immediately and the worker handles generation and upload. bench.run_with_pipeline(...) is the in-process variant for scripts that already have the model loaded in Python; the SDK drives generation, upload, and (optionally) waits for evaluator scores in the same process.
Picking the right surface
Work through these checks in order:
- Are you evaluating a single clip? If yes, use Score.
- Do all your clips share one prompt, and do you want them sorted? If yes, use Rank.
- Otherwise (many prompts, multiple models, CSV exports, reproducibility), use Bench.
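The checks above amount to a short decision function. The sketch below is illustrative only; pick_surface and its parameter names are hypothetical and are not part of the SDK.

```python
def pick_surface(num_clips, shares_one_prompt, want_sorted):
    """Return 'Score', 'Rank', or 'Bench' by applying the checks in order."""
    if num_clips == 1:
        return "Score"
    if shares_one_prompt and want_sorted:
        return "Rank"
    # Everything else: many prompts, multiple models, CSV exports, reproducibility.
    return "Bench"


print(pick_surface(1, True, False))     # Score
print(pick_surface(5, True, True))      # Rank
print(pick_surface(200, False, False))  # Bench
```

The order matters: a single clip goes to Score even though it trivially shares one prompt with itself.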
In practice, most teams use all three: Score during prototyping, Rank for prompt sweeps, and Bench for the formal evaluations that ship in release notes.
How the surfaces relate
All three feed the same evaluator service. A dimension score on a Score upload uses the same prompts, same questions, and same model as the equivalent dimension score on a Bench run. The differences are entirely in input shape and output ergonomics — Score returns per-video scores, Rank returns a leaderboard, Bench returns a CSV and aggregate statistics.
That means you can prototype on Score, compare candidates on Rank, then commit to a Bench run, confident that the per-dimension scores will agree across all three surfaces.
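One way to picture the relationship is as three views over the same per-dimension scores. The sketch below is a local illustration under stated assumptions, not how the service is implemented: the SCORES table holds made-up placeholder values, the function names are hypothetical, and ordering the leaderboard by the unweighted mean of dimension scores is an assumption, since the text does not say how Rank sorts.

```python
import csv
import io
from statistics import mean

# Placeholder evaluator output: clip -> {dimension: score}. Values are made up.
SCORES = {
    "clip_a": {"motion": 0.5, "fidelity": 1.0},
    "clip_b": {"motion": 1.0, "fidelity": 1.0},
}


def score_view(clip):
    # Score-style output: the raw per-dimension numbers for one clip.
    return SCORES[clip]


def rank_view(clips):
    # Rank-style output: a leaderboard, best first.
    # Assumption: sort by the unweighted mean of the dimension scores.
    return sorted(clips, key=lambda c: mean(SCORES[c].values()), reverse=True)


def bench_view(clips):
    # Bench-style output: a CSV of every score plus per-dimension averages.
    dims = sorted({d for c in clips for d in SCORES[c]})
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["clip", *dims])
    for c in clips:
        writer.writerow([c, *(SCORES[c][d] for d in dims)])
    aggregates = {d: mean(SCORES[c][d] for c in clips) for d in dims}
    return buf.getvalue(), aggregates


clips = ["clip_a", "clip_b"]
leaderboard = rank_view(clips)
csv_text, aggregates = bench_view(clips)
print(score_view("clip_a"))
print(leaderboard)
print(csv_text, aggregates)
```

All three views read from the same table, which is why a number seen in one view matches the corresponding number in the others.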
See also
- How to Use — first steps in the dashboard and SDK
- Rank prompt lock — the locking state machine in detail
