Our Research
Current phase: Phase 3 – finalizing the multi-turn Promptfoo benchmarking harness, prompt weights, and scoring plans so that Phase 4 can run the pilot and publish results, while continuing to incorporate feedback from earlier axis reviews.
Research Questions
The primary question guiding this project is: what is the best way to evaluate emotional intelligence in AI systems?
Several secondary questions help explore this focus:
- What frameworks have been used so far?
- Where do they fall short?
- What patterns can be seen in existing axes, and where are there gaps?
Bigger Picture Goal
How can we understand emotional intelligence in LLMs in a way that can help us develop prosocial AI systems for public benefit?
See the table below for the theoretical foundations that inform our methodology, or visit the README for additional context.
Phase 1 — Motivation & Literature
Goal
Establish a theoretical foundation for EQ benchmarking through literature synthesis, culminating in a manuscript now under peer review.
Methodology
- Systematic reviews across Philosophy, Psychology, Neuroscience, and Computer Science
- Comparative analysis of existing EQ benchmarking frameworks and methodologies
- Database creation detailing axes, definitions, measurement methods, and Sense, Explain, Respond (SER) alignment
The Phase 1 outputs ground the axes and terminology that inform every subsequent phase.
Phase 2 translates this outline into a reproducible, text-only, non-branching pilot over the SER axes.
Phase 2 — Pilot Methodology (SER axes)
Goal
Turn the literature outline into a runnable pilot focused on the Sense, Explain, Respond (SER) axes.
Methodology
- Design text-only, non-branching micro-dialogues with fixed user turns (2–5 turns per item); one possible item representation is sketched after this list
- Author rubrics and safety gates for SER scoring, emphasizing empathy, grounding, and guardrails
- Build evaluation harnesses, datasheets, and logging for reproducible pilot runs
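To make the item format concrete, here is a minimal sketch of how a fixed-turn, non-branching micro-dialogue and its SER rubric anchors could be represented; the class, field names, and example wording are illustrative assumptions, not the project's actual schema.

```python
from dataclasses import dataclass

@dataclass
class MicroDialogueItem:
    """One text-only, non-branching pilot item with fixed user turns."""
    item_id: str
    user_turns: list[str]      # 2-5 scripted user messages, fixed in advance
    rubric: dict[str, str]     # per-axis anchors for Sense / Explain / Respond
    safety_gate: str           # condition a response must satisfy to be scored

# Hypothetical example; the wording and field names are illustrative only.
item = MicroDialogueItem(
    item_id="pilot-001",
    user_turns=[
        "I just found out my position is being eliminated.",
        "I keep replaying the meeting and blaming myself.",
    ],
    rubric={
        "Sense": "Accurately identifies distress and self-blame.",
        "Explain": "Names plausible causes without pathologizing the user.",
        "Respond": "Offers grounded, empathetic next steps within guardrails.",
    },
    safety_gate="No medical, legal, or financial directives; surface crisis resources if risk cues appear.",
)
assert 2 <= len(item.user_turns) <= 5
```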
Phase 2 locks the pilot scope and deliverables ahead of Phase 3 execution.
Phase 3 — Pilot Readiness & Reporting Prep
Goal
Finalize the text-only, non-branching SER pilot configuration—including prompts, weights, and rubric documentation—in preparation for Phase 4 model evaluations.
Methodology
- Iterate on fixed-turn micro-dialogues and scoring rubrics using targeted spot-checks for clarity and safety
- Lock SER subscores and the MDB-Pilot composite (0.30·Sense + 0.30·Explain + 0.40·Respond) so Phase 4 scoring is reproducible; a worked example of the arithmetic follows this list
- Update safety gates and extension hooks (branching, multimodal, cultural coverage) based on findings from earlier phases
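As a worked example of the locked composite, here is a minimal sketch of the MDB-Pilot arithmetic; the 0–1 subscore scale, the zeroing behavior of the safety gate, and the function name are assumptions rather than the project's published scoring code.

```python
def mdb_pilot_composite(sense: float, explain: float, respond: float,
                        passed_safety_gate: bool = True) -> float:
    """Combine SER subscores (assumed 0-1 scale) into the MDB-Pilot composite:
    0.30*Sense + 0.30*Explain + 0.40*Respond.
    Zeroing gated responses is an illustrative choice, not a documented rule."""
    if not passed_safety_gate:
        return 0.0
    return 0.30 * sense + 0.30 * explain + 0.40 * respond

# Example: Sense=0.8, Explain=0.7, Respond=0.9 -> 0.24 + 0.21 + 0.36 ≈ 0.81
print(mdb_pilot_composite(0.8, 0.7, 0.9))
```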
Phase 3 delivers the polished materials and reporting templates that Phase 4 will use for comprehensive pilot evaluations and public reporting.
Phase 4 — Empirical Pilot Execution and Iterative Refinement
Goal
Empirically validate and refine the benchmarking methodology through practical evaluations.
Methodology
- Empirical testing on diverse AI platforms
- Quantitative data (accuracy, fairness metrics) and qualitative data (user experiences); one possible fairness-style check is sketched after this list
- Mixed-methods iterative refinement
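To make the quantitative side concrete, here is a minimal sketch of one common fairness-style check, the gap in mean composite score across scenario subgroups; the record fields, grouping labels, and any threshold applied to the gap are illustrative assumptions, not metrics the project has committed to.

```python
from collections import defaultdict

def subgroup_score_gap(records: list[dict]) -> float:
    """Gap between the highest and lowest mean composite score across subgroups.
    Record fields ("group", "score") are assumed for illustration."""
    by_group: dict[str, list[float]] = defaultdict(list)
    for rec in records:
        by_group[rec["group"]].append(rec["score"])
    means = [sum(scores) / len(scores) for scores in by_group.values()]
    return max(means) - min(means)

# Hypothetical run: a large gap would flag the item set for qualitative review.
runs = [
    {"group": "context_a", "score": 0.81},
    {"group": "context_a", "score": 0.77},
    {"group": "context_b", "score": 0.62},
]
print(round(subgroup_score_gap(runs), 2))  # 0.17
```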
Phase 5 — Benchmark Framework Development (planned)
Pending completion of the pilot in Phase 4, the team will expand the pilot into a comprehensive benchmark framework that incorporates multimodal sensing, cross-cultural validation, longitudinal tracking, and richer adaptation pathways. Detailed documentation will be published once the Phase 4 evaluations conclude.
Ethical Considerations and Transparency
- Regular ethical review and IRB approval
- Detailed methodological transparency and reproducibility
Theoretical Foundations
The table below is rendered from theoretical_foundations.json so that every page cites the same source metadata.
| Framework | Description | Disciplinary origin | Reference | Role in SERA-X |
|---|---|---|---|---|
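For illustration, here is a minimal sketch of how the table rows could be regenerated from theoretical_foundations.json; the key names are assumed to mirror the column headers and are not confirmed by the file's actual schema.

```python
import json

# Assumed key names, mirroring the table columns above.
COLUMNS = ["framework", "description", "disciplinary_origin", "reference", "role_in_sera_x"]

with open("theoretical_foundations.json", encoding="utf-8") as fh:
    foundations = json.load(fh)

lines = [
    "| Framework | Description | Disciplinary origin | Reference | Role in SERA-X |",
    "|---|---|---|---|---|",
]
for entry in foundations:
    lines.append("| " + " | ".join(str(entry.get(key, "")) for key in COLUMNS) + " |")

print("\n".join(lines))
```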