Phase 3 — Pilot Readiness & Reporting Prep

Pilot readiness

Phase 3 is a readiness sprint. We finalize the text-only, non-branching micro-dialogues (2–5 turns each) established in Phase 2, confirm rubric wording, and resolve any open safety flags. The full set of 200 text items is completed and frozen in this phase. The goal is to lock prompts, scoring weights, and documentation so that Phase 4 can run the pilot against a fixed item set.
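
The freeze criteria above (200 items, text-only, 2–5 turns each) lend themselves to an automated check before the set is locked. The sketch below is illustrative only: the item schema (`id`, `turns`, `speaker`, `text`) and the function name `validate_items` are assumptions made for this example, not the project's actual data format. A flat list of turns is linear by construction, so non-branching is implied by the assumed schema rather than checked separately.

```python
from typing import Any

EXPECTED_ITEM_COUNT = 200    # item total stated for the Phase 3 freeze
MIN_TURNS, MAX_TURNS = 2, 5  # micro-dialogue length bounds


def validate_items(items: list[dict[str, Any]],
                   expected_count: int = EXPECTED_ITEM_COUNT) -> list[str]:
    """Return human-readable problems; an empty list means the set passes."""
    problems: list[str] = []
    if len(items) != expected_count:
        problems.append(f"expected {expected_count} items, found {len(items)}")

    seen_ids: set[str] = set()
    for item in items:
        # Hypothetical schema: each item is a dict with "id" and "turns".
        item_id = str(item.get("id", "<missing id>"))
        if item_id in seen_ids:
            problems.append(f"{item_id}: duplicate id")
        seen_ids.add(item_id)

        turns = item.get("turns", [])
        if not MIN_TURNS <= len(turns) <= MAX_TURNS:
            problems.append(
                f"{item_id}: {len(turns)} turns (need {MIN_TURNS}-{MAX_TURNS})"
            )

        # Text-only: every turn must carry a non-empty string.
        for index, turn in enumerate(turns):
            text = turn.get("text")
            if not isinstance(text, str) or not text.strip():
                problems.append(f"{item_id}: turn {index} has no text")
    return problems
```

Running a check like this as the last step before freezing gives a single pass/fail signal, and the returned messages point at the specific items that still need revision.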

Reporting scaffolding

  • Finalize the SER (Sense, Explain, Respond) axis weighting plan for the MDB-Pilot composite: 0.30·Sense + 0.30·Explain + 0.40·Respond.
  • Draft reliability templates (human and optional LLM-judge) so Phase 4 evaluations can capture confidence intervals and inter-rater metrics consistently.
  • Track safety review notes and flag items that may require revision or removal before Phase 4.
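
As a rough illustration of the first two items, the sketch below computes the weighted composite and one possible inter-rater statistic. The weights come from the plan above. Everything else is an assumption for the example: the function names, the per-axis score dictionary, and the choice of Cohen's kappa (the plan does not name a specific inter-rater metric).

```python
import math
from collections import Counter
from typing import Hashable, Sequence

# Axis weights from the MDB-Pilot weighting plan.
SER_WEIGHTS = {"sense": 0.30, "explain": 0.30, "respond": 0.40}


def ser_composite(scores: dict[str, float]) -> float:
    """Weighted sum of per-axis scores; all axes must share one scale."""
    if not math.isclose(sum(SER_WEIGHTS.values()), 1.0):
        raise ValueError("SER weights must sum to 1")
    return sum(weight * scores[axis] for axis, weight in SER_WEIGHTS.items())


def cohen_kappa(rater_a: Sequence[Hashable],
                rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labelling the same items (assumed metric)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty item list")
    n = len(rater_a)
    # Observed agreement: share of items given identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if math.isclose(expected, 1.0):
        # Both raters used one identical label throughout; kappa is
        # undefined here, so this sketch reports full agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

For example, `ser_composite({"sense": 4, "explain": 3, "respond": 5})` evaluates to roughly 4.1 on whatever scale the rubric uses. Keeping the weights in one dictionary means a later change to the plan touches a single line, and the sum-to-one guard catches an inconsistent edit.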

Future extensions

  • Branching: optional adaptive dialogues layered onto the pilot after baseline reporting.
  • Multimodal: add speech/video affect sensing once the text-only baseline is stable.
  • Cultural & linguistic: expand item pools and rubrics beyond English with local expert review.
  • Phase 5 trajectory: design the multimodal, cross-cultural, and longitudinal benchmark framework that will follow the pilot execution work of Phase 4.

Prompt configuration

The configuration below captures the Promptfoo setup we are finalizing in Phase 3 before running full model evaluations in Phase 4.

The ten examples shown here are a sample drawn from the larger configuration captured in phase3_promptfoo.yaml.

As we tune this configuration we also continue to refine Phase 1 literature mappings and Phase 2 rubric materials so the evaluation pipeline is cohesive once Phase 4 begins.


Contribute prompts or rubric suggestions

Help us finalize the text-only, non-branching, 2–5 turn prompt sets and the SER (Sense, Explain, Respond) grading hints for the pilot.