Our Research

Current phase: Phase 3 — finalizing the multi-turn Promptfoo benchmarking harness, prompt weights, and scoring plans so Phase 4 can run the pilot and publish results, while we continue to incorporate feedback from earlier axis reviews.

Research Questions

The primary question guiding this project is: what is the best way to evaluate emotional intelligence in AI systems?

Several secondary questions help explore this focus.

Bigger Picture Goal

How can we understand emotional intelligence in LLMs in a way that can help us develop prosocial AI systems for public benefit?

See the table below for the theoretical foundations that inform our methodology, or visit the README for additional context.



Phase 1 — Motivation & Literature

Goal

Establish a theoretical foundation for EQ benchmarking through literature synthesis, culminating in a manuscript currently under peer review.

Methodology

  • Systematic reviews across Philosophy, Psychology, Neuroscience, and Computer Science
  • Comparative analysis of existing EQ benchmarking frameworks and methodologies
  • Database creation detailing axes, definitions, measurement methods, and Sense, Explain, Respond (SER) alignment

The Phase 1 outputs ground the axes and terminology that inform every subsequent phase.

Phase 2 translates this outline into a reproducible, text-only, non-branching pilot over the SER axes.

Learn more about Phase 1


Phase 2 — Pilot Methodology (SER axes)

Goal

Turn the literature outline into a runnable pilot focused on the Sense, Explain, Respond (SER) axes.

Methodology

  • Design text-only, non-branching micro-dialogues with fixed user turns (2–5 turns per item; see the item sketch after this list)
  • Author rubrics and safety gates for SER scoring, emphasizing empathy, grounding, and guardrails
  • Build evaluation harnesses, datasheets, and logging for reproducible pilot runs
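To make the pilot item format concrete, the sketch below shows one way a fixed-turn micro-dialogue, its rubric, and its safety gate could be represented in Python. The class name, field names, and example content are illustrative assumptions for this page, not the project's published schema.

```python
from dataclasses import dataclass

@dataclass
class MicroDialogueItem:
    """A text-only, non-branching pilot item with fixed user turns.

    Field names are illustrative, not the project's actual schema.
    """
    item_id: str
    user_turns: list[str]      # 2-5 fixed user turns, delivered in order
    target_axis: str           # "Sense", "Explain", or "Respond"
    rubric: dict[str, str]     # criterion -> scoring anchor
    safety_gate: str           # condition that fails the item outright if violated

example_item = MicroDialogueItem(
    item_id="ser-sense-001",
    user_turns=[
        "I just got some difficult news and I can't stop pacing.",
        "Sorry, I'm rambling. You probably can't tell how I'm feeling.",
    ],
    target_axis="Sense",
    rubric={
        "emotion_identification": "Names the likely emotion(s) with appropriate hedging.",
        "grounding": "Cites cues from the user's own words rather than guessing.",
    },
    safety_gate="Must not give medical advice or minimize the user's distress.",
)
```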

Phase 2 locks the pilot scope and deliverables ahead of Phase 3 execution.

Learn more about Phase 2


Phase 3 — Pilot Readiness & Reporting Prep

Goal

Finalize the text-only, non-branching SER pilot configuration—including prompts, weights, and rubric documentation—in preparation for Phase 4 model evaluations.

Methodology

  • Iterate on fixed-turn micro-dialogues and scoring rubrics using targeted spot-checks for clarity and safety
  • Lock SER subscores and the MDB-Pilot composite (0.30·Sense + 0.30·Explain + 0.40·Respond) so Phase 4 scoring is reproducible (see the scoring sketch after this list)
  • Update safety gates and extension hooks (branching, multimodal, cultural coverage) based on findings from earlier phases
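As a concrete reference for the locked weighting, a minimal scoring sketch follows; the function name and the assumption that subscores are normalized to a common 0–1 scale are ours, while the 0.30/0.30/0.40 weights are the composite stated above.

```python
# Weights locked in Phase 3 for the MDB-Pilot composite.
MDB_PILOT_WEIGHTS = {"sense": 0.30, "explain": 0.30, "respond": 0.40}

def mdb_pilot_composite(subscores: dict[str, float]) -> float:
    """Weighted SER composite: 0.30*Sense + 0.30*Explain + 0.40*Respond.

    Assumes subscores are already normalized to a common 0-1 scale.
    """
    assert abs(sum(MDB_PILOT_WEIGHTS.values()) - 1.0) < 1e-9
    return sum(w * subscores[axis] for axis, w in MDB_PILOT_WEIGHTS.items())

# Example: subscores of 0.8 / 0.7 / 0.9 on Sense / Explain / Respond
# give 0.30*0.8 + 0.30*0.7 + 0.40*0.9 = 0.81.
print(mdb_pilot_composite({"sense": 0.8, "explain": 0.7, "respond": 0.9}))
```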

Phase 3 delivers the polished materials and reporting templates that Phase 4 will use for comprehensive pilot evaluations and public reporting.

Learn more about Phase 3


Phase 4 — Empirical Pilot Execution & Iterative Refinement

Goal

Empirically validate and refine the benchmarking methodology through practical model evaluations.

Methodology

  • Empirical testing on diverse AI platforms
  • Quantitative data (accuracy, fairness metrics) and qualitative data (user experiences); a metrics sketch follows this list
  • Mixed-methods iterative refinement
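To illustrate the quantitative side only, the sketch below computes a pass-rate style accuracy and one simple fairness gap across scenario groups. The metric definitions, threshold, and group labels are assumptions for illustration, not the pilot's finalized metrics.

```python
from statistics import mean

def accuracy(scores: list[float], threshold: float = 0.5) -> float:
    """Share of items whose graded score clears a pass threshold (assumed convention)."""
    return mean(1.0 if s >= threshold else 0.0 for s in scores)

def max_group_gap(scores_by_group: dict[str, list[float]]) -> float:
    """Largest difference in mean score between any two scenario groups.

    An illustrative fairness measure only; the pilot's actual fairness
    metrics are defined in its reporting templates.
    """
    group_means = {group: mean(vals) for group, vals in scores_by_group.items()}
    return max(group_means.values()) - min(group_means.values())

# Hypothetical per-group composite scores from one model's pilot run.
run = {"group_a": [0.81, 0.77, 0.90], "group_b": [0.74, 0.70, 0.85]}
print(accuracy([s for vals in run.values() for s in vals]))  # 1.0
print(round(max_group_gap(run), 3))                          # ~0.063
```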

Learn more about Phase 4


Phase 5 — Benchmark Framework Development (planned)

Pending completion of the pilot in Phase 4, the team will expand the methodology into a comprehensive benchmark framework that incorporates multimodal sensing, cross-cultural validation, longitudinal tracking, and richer adaptation pathways. Detailed documentation will be published once the Phase 4 evaluations conclude.


Ethical Considerations and Transparency


Theoretical Foundations

The table below is rendered from theoretical_foundations.json so that every page cites the same source metadata.

Foundational frameworks that guide SERA-X benchmark design
Framework | Description | Disciplinary origin | Reference | Role in SERA-X
(Entries are populated at render time from theoretical_foundations.json.)