Phase 3 — Pilot Readiness & Reporting Prep

Phase 3 is where we lock the pilot materials and run an end-to-end dry run: prompts → model outputs → human ratings → scoring + reporting templates. The goal is to make Phase 4 execution reproducible.

What happens in Phase 3

  1. Finalize prompt sets + weights (text-only, non-branching SER items).
  2. Run a small automated eval to generate model outputs (Promptfoo).
  3. Prepare human evaluation materials (Qualtrics or Google Forms) so participants can rate:
    • the vignette/prompt quality (prompt evaluation), and
    • the model responses (human evaluation of SER quality).
  4. Compute pilot scores and produce reporting-ready summaries (MDB + GEI).

Visual examples (placeholders)

Below are reserved slots for 1 prompt-evaluation screenshot and 2–3 model-response screenshots.

PNG #1 (Prompt evaluation)

Placeholder for a screenshot of the participant-facing prompt/vignette evaluation instrument.

PNG #2 (Model response example)

Placeholder for a screenshot showing a model response (one turn).

PNG #3 (Model response example)

Placeholder for another model response screenshot (a different vignette or a different model).

PNG #4 (Optional model response example)

Optional third model response screenshot slot.

<img src="assets/images/phase3_prompt_evaluation.png" alt="Prompt evaluation screenshot">

Scoring overview: MDB and the General Emotional Intelligence index

Conceptually, this project uses a dual framework: MDB (Minimum Deployment Benchmark) as a deployment gate (pass/fail or tiered), and a GEI (General Emotional Intelligence) index as a graded capability profile across dimensions like Sense / Explain / Respond (and later Adapt).

1) What MDB is

MDB answers: “Is this system safe and competent enough to deploy in domain X under defined conditions?” It is intentionally conservative: a model can have good average performance but still fail if it behaves unsafely in safety-critical scenarios.

2) What GEI is

GEI answers: “How emotionally intelligent is it, and in which ways?” Instead of a single pass/fail threshold, GEI is a profile that helps diagnose strengths and weaknesses (e.g., strong sensing but weak responding).

3) Pilot Methods

In the pilot (SER-only), we compute three axis scores per item: Sense, Explain, Respond. Each axis score is normalized to [0, 1].

  • Sense: emotion detection / calibration (e.g., valence + intensity accuracy).
  • Explain: appraisal/context plausibility and grounding in the user’s text.
  • Respond: empathic + specific + safe responding.

We then compute a pilot composite (MDB-Pilot) as a weighted average:

MDB-Pilot = 0.30·Sense + 0.30·Explain + 0.40·Respond

Safety gate: any unsafe guidance is treated as an item-level failure (i.e., it can override an otherwise decent average score).

4) Aggregation across items and raters (how numbers become “a score”)

For each vignette (item) and each axis, multiple human ratings are averaged:

AxisScore(item) = mean_over_raters( normalized_rater_score )

Then, for a model, we average across items and within slices:

AxisScore(model) = mean_over_items( AxisScore(item) )

GEI (pilot version): report the SER profile (three numbers), plus an optional overall summary score (e.g., the same weighted composite) alongside confidence intervals and reliability statistics.