Why three concepts?

WorldJen exposes three distinct evaluation surfaces — Score, Rank, and Bench — because each answers a different question. Using the right one keeps your workflow lean and the results meaningful.

Score: "How good is this clip?"

Score is for single-clip evaluation. You have one generated video and want raw per-dimension numbers back. Use it when:

  • You're prototyping a model and want quick sanity feedback after each generation.
  • You're A/B testing prompts and want to compare two clips at a glance.
  • You want to inspect one clip's evaluator scoring without setting up a full benchmark.

Score sessions are user-scoped: each user has exactly one. Uploads accumulate in the session until you call reset. There is no leaderboard and no built-in comparison, just the dimension scores for each clip you uploaded.

Rank: "Which of these is the best?"

Rank is for comparative ranking across clips that share a prompt. You generated five variants of "a cat sitting on a windowsill" and want them sorted. Use it when:

  • You're sweeping a hyperparameter and want to see which setting wins.
  • You're comparing model variants on the same prompt without spinning up a full Bench run.
  • You're collecting a leaderboard for blog content or a paper.

Rank sessions lock to the prompt of the first upload. The lock is enforced server-side so concurrent uploads (CLI + dashboard, multiple scripts) can't disagree about what's being ranked. Pass --prompt once; you can omit it on subsequent uploads. Passing a different prompt fails fast with RankPromptMismatchError. See Rank prompt lock for the state machine in detail.
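
To make the lock behaviour concrete, here is a minimal in-memory sketch of the rule described above. It is a toy model, not WorldJen's server code or SDK: RankSessionModel is a hypothetical class, the exception is a local stand-in that only borrows the name RankPromptMismatchError, and two behaviours are assumptions rather than documented facts (the first upload must carry a prompt, and re-sending the identical prompt is accepted).

```python
import threading


class RankPromptMismatchError(Exception):
    """Local stand-in named after the documented error; not the SDK class."""


class RankSessionModel:
    """Hypothetical toy model of a Rank session's prompt lock."""

    def __init__(self):
        self._mutex = threading.Lock()  # serialises concurrent uploads
        self.prompt = None              # None means "not locked yet"
        self.clips = []

    def upload(self, clip_name, prompt=None):
        with self._mutex:
            if self.prompt is None:
                # Assumption: the first upload must supply the prompt.
                if prompt is None:
                    raise ValueError("the first upload must supply a prompt")
                self.prompt = prompt  # the session is now locked
            elif prompt is not None and prompt != self.prompt:
                # A different prompt after the lock is rejected immediately.
                raise RankPromptMismatchError(
                    f"session is locked to {self.prompt!r}, got {prompt!r}"
                )
            self.clips.append(clip_name)
            return self.prompt


session = RankSessionModel()
session.upload("variant_a.mp4", prompt="a cat sitting on a windowsill")
session.upload("variant_b.mp4")  # prompt omitted: the locked prompt applies
try:
    session.upload("variant_c.mp4", prompt="a dog on a sofa")
except RankPromptMismatchError as exc:
    print(f"rejected: {exc}")
```

Holding a single mutex around both the check and the assignment is what keeps two simultaneous first uploads from each believing they set the prompt; the real service presumably achieves the same effect with its own server-side mechanism.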

Bench: "How does the model perform overall?"

Bench is for full benchmark runs at scale. Many prompts, all your dimensions, possibly multiple models compared side-by-side, executed on a worker queue. Use it when:

  • You're publishing model results and need a reproducible scorecard with CSV export.
  • You're regression-testing a model across releases.
  • You're comparing two models head-to-head with a reference model.

Bench has two flavours. bench.create(...) enqueues work on a GPU runner that you provisioned in advance — your script exits immediately and the worker handles generation and upload. bench.run_with_pipeline(...) is the in-process variant for scripts that already have the model loaded in Python; the SDK drives generation, upload, and (optionally) waits for evaluator scores in the same process.

Picking the right surface

Ask three questions in order:

  1. Are you evaluating a single clip? If yes, Score.
  2. Do all your clips share one prompt and you want them sorted? If yes, Rank.
  3. Do you need anything else (many prompts, multiple models, CSV exports, reproducibility)? If yes, Bench.
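
The ordered checks above can be written as a small helper. This is only an illustration of the decision order; pick_surface and its parameters are hypothetical names and are not part of the WorldJen SDK.

```python
def pick_surface(num_clips: int, num_prompts: int, want_sorted: bool = False) -> str:
    """Return "Score", "Rank", or "Bench" by applying the checks in order."""
    if num_clips < 1 or num_prompts < 1:
        raise ValueError("need at least one clip and one prompt")
    # 1. A single clip is always a Score job.
    if num_clips == 1:
        return "Score"
    # 2. Several clips that share one prompt and need sorting go to Rank.
    if num_prompts == 1 and want_sorted:
        return "Rank"
    # 3. Everything else (many prompts, multiple models, exports) is Bench.
    return "Bench"


print(pick_surface(num_clips=1, num_prompts=1))
print(pick_surface(num_clips=5, num_prompts=1, want_sorted=True))
print(pick_surface(num_clips=200, num_prompts=40))
```

Because the checks run in order, a single clip resolves to Score even if the other arguments would also match a later rule.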

In practice, most teams use all three: Score during prototyping, Rank for prompt sweeps, and Bench for the formal evaluations that ship in release notes.

How the surfaces relate

All three feed the same evaluator service. A dimension score on a Score upload uses the same prompts, same questions, and same model as the equivalent dimension score on a Bench run. The differences are entirely in input shape and output ergonomics — Score returns per-video scores, Rank returns a leaderboard, Bench returns a CSV and aggregate statistics.

That means you can prototype on Score, compare candidates on Rank, then commit to a Bench run with confidence that the numbers will agree.
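
As a rough illustration of "same numbers, different output shape", the sketch below takes one set of per-clip dimension scores and presents it three ways. Everything here is assumed for the example: the dimension names, the score values, ranking by the mean across dimensions, and aggregating by a per-dimension mean. The actual ranking and aggregation rules used by WorldJen are not stated in this page.

```python
from statistics import mean

# Hypothetical evaluator output: clip -> {dimension: score}.
scores = {
    "variant_a.mp4": {"motion": 0.8, "fidelity": 0.6},
    "variant_b.mp4": {"motion": 0.9, "fidelity": 0.7},
}


def score_view(scores):
    """Score-style output: per-video dimension scores, unchanged."""
    return {clip: dict(dims) for clip, dims in scores.items()}


def rank_view(scores):
    """Rank-style output: clips sorted best-first (assumed: by mean score)."""
    board = [(clip, mean(dims.values())) for clip, dims in scores.items()]
    return sorted(board, key=lambda row: row[1], reverse=True)


def bench_view(scores):
    """Bench-style output: one aggregate per dimension (assumed: the mean)."""
    dimensions = sorted({d for dims in scores.values() for d in dims})
    return {
        d: mean(dims[d] for dims in scores.values() if d in dims)
        for d in dimensions
    }


print(score_view(scores))
print(rank_view(scores))
print(bench_view(scores))
```

All three views read from the same underlying scores, which is why numbers seen on one surface should line up with the others.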

See also