
Tentative axes:
- Sense
- Explain
- Respond
- Adapt
- Extended
Sense
This axis evaluates an AI system’s ability to detect and recognize emotional content in its input.
What We Evaluate
Can the model identify expressed or implied emotions from user dialogue, even when indirect or masked?
Example Tasks
- Classify a text utterance as expressing one of 10 emotion categories.
- Estimate continuous valence/arousal from user messages.
- Distinguish literal from sarcastic emotional tone.
Metrics Used
Macro-F1 for emotion label classification; concordance correlation coefficient (CCC) for valence/arousal regression.
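Both metrics are standard, so a minimal standard-library sketch may help pin down what is being computed. The function names macro_f1 and ccc are illustrative, and the choice of population (1/n) moments for CCC is an assumption; the notes do not specify which variant the benchmark uses.

```python
from statistics import fmean

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return fmean(scores)

def ccc(x, y):
    """Concordance correlation coefficient with population (1/n) moments (assumed variant)."""
    mx, my = fmean(x), fmean(y)
    vx = fmean([(a - mx) ** 2 for a in x])
    vy = fmean([(b - my) ** 2 for b in y])
    cov = fmean([(a - mx) * (b - my) for a, b in zip(x, y)])
    denom = vx + vy + (mx - my) ** 2
    # denom == 0 only when both series are the same constant; treating that
    # as perfect agreement is a convention chosen for this sketch.
    return 2 * cov / denom if denom else 1.0
```

Unlike Pearson correlation, CCC penalizes a constant offset: predictions that track the gold valence perfectly but sit one point too high score below 1.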
Explain
This axis tests whether the system can infer the causes or appraisals that explain an emotion.
What We Evaluate
Can the system identify what situational factors or beliefs caused an emotion?
Example Tasks
- Given a story and emotion, identify the core appraisal.
- Predict what the character will feel next if new information is revealed.
- Classify the formal object of an emotion (e.g., danger, loss, betrayal).
Metrics Used
Appraisal vector accuracy, cause-label accuracy, partial credit for next-emotion inference.
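The notes do not say how partial credit for next-emotion inference is awarded. One possible scheme, sketched below purely as an assumption, gives full credit for an exact match and half credit for a prediction in the same emotion family. The EMOTION_FAMILY grouping, the 0.5 weight, and the function names are all hypothetical.

```python
# Hypothetical grouping of emotion labels into families; not taken from the benchmark.
EMOTION_FAMILY = {
    "fear": "threat", "anxiety": "threat",
    "sadness": "loss", "grief": "loss",
    "anger": "offense", "frustration": "offense",
    "joy": "gain", "relief": "gain",
}

def next_emotion_credit(gold, predicted, near_miss=0.5):
    """Return 1.0 for an exact match, near_miss for a same-family label, else 0.0."""
    if predicted == gold:
        return 1.0
    gold_family = EMOTION_FAMILY.get(gold)
    if gold_family is not None and gold_family == EMOTION_FAMILY.get(predicted):
        return near_miss
    return 0.0

def mean_credit(pairs):
    """Average credit over a list of (gold, predicted) label pairs."""
    return sum(next_emotion_credit(g, p) for g, p in pairs) / len(pairs)
```

With near_miss set to 0.0 this reduces to plain accuracy, so the same function could also report the cause-label style of scoring.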
Respond
This axis evaluates whether the AI can generate helpful, emotionally appropriate responses.
What We Evaluate
Do responses show empathy, helpfulness, and alignment with the user’s affective state?
Example Tasks
- Complete a response to a user expressing fear, sadness, or frustration.
- Choose or rephrase the most supportive reply from multiple candidates.
- Match the user’s tone and emotional style.
Metrics Used
Human Empathy Score (HES), rated on a 1–5 scale.
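How individual ratings are combined into a single HES is not stated here. A simple option, assumed for illustration only, is to average the ratings each response receives and then average across responses. The function name and the input layout (one list of ratings per response) are hypothetical.

```python
from statistics import fmean

def human_empathy_score(ratings_per_response):
    """Mean over responses of each response's mean rating on the 1-5 scale."""
    per_response = []
    for ratings in ratings_per_response:
        # Reject empty rating lists and values outside the 1-5 scale.
        if not ratings or any(not 1 <= r <= 5 for r in ratings):
            raise ValueError("each response needs at least one rating in [1, 5]")
        per_response.append(fmean(ratings))
    return fmean(per_response)
```

Averaging per response first keeps a heavily rated response from dominating the overall score.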
Adapt
This axis measures how well the model adjusts to new cultural, linguistic, or situational emotion norms.
What We Evaluate
Does the system retain old skills while adapting to new affective expressions and norms?
Example Tasks
- Classify emotions in a dataset using culturally specific idioms.
- Respond empathetically to unfamiliar dialogue styles or memes.
- Compare performance before and after fine-tuning on a new domain.
Metrics Used
Forward transfer gain and catastrophic forgetting, measured on “Experience Pack” tasks.
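Definitions of these two quantities vary across the continual-learning literature. The sketch below assumes the simple before/after comparison suggested by the example tasks: gain is the mean score change on the new-domain tasks, and forgetting is the mean score drop on previously learned tasks. The task names and function names are hypothetical.

```python
def mean_delta(before, after, tasks):
    """Mean of (after - before) scores over the given task names."""
    return sum(after[t] - before[t] for t in tasks) / len(tasks)

def adaptation_report(before, after, old_tasks, new_tasks):
    """Summarize adaptation from per-task scores measured before and after fine-tuning."""
    return {
        # Positive means the model improved on the new domain.
        "forward_transfer_gain": mean_delta(before, after, new_tasks),
        # Positive means the model got worse on tasks it already handled.
        "forgetting": -mean_delta(before, after, old_tasks),
    }

# Hypothetical scores for one old task and one new "Experience Pack" task.
before = {"en_emotion": 0.80, "idiom_pack": 0.50}
after = {"en_emotion": 0.74, "idiom_pack": 0.65}
report = adaptation_report(before, after, ["en_emotion"], ["idiom_pack"])
```

Reporting the two numbers side by side makes the trade-off visible: a large gain bought with large forgetting is a different outcome from a modest gain with none.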
Extended
This axis evaluates AI emotional intelligence across interactions, social contexts, and system-level effects.
What We Evaluate
Emotional intelligence is assessed at multiple levels of human-AI interaction, including:
- Interactional Flexibility: Immediate adaptive emotional responses.
- Contextual Awareness: Broader social and cultural adaptability.
- Ethical and Environmental Responsibility: Fairness, transparency, inclusivity, user autonomy, and sustainability.
- Systemic Impact: Broader societal and ecological implications.
Example Tasks
- Measure how much a human operator improves model responses in a human-AI dyad.
- Score empathic quality before and after UI edits.
- Track interaction cost for human fixes.
Metrics Used
Human-AI dyad empathy gain, interaction cost, scenario goal completion.
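The three metrics can be read as per-session quantities averaged over sessions. The sketch below assumes empathy gain is the HES of the operator-edited response minus the HES of the unedited model response, and that interaction cost is measured in operator seconds. The DyadSession record, its field names, and the cost unit are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class DyadSession:
    model_only_hes: float    # empathy rating of the unedited model response
    dyad_hes: float          # empathy rating after the operator's edits
    operator_seconds: float  # hypothetical unit of interaction cost
    goal_completed: bool     # whether the scenario goal was reached

def summarize(sessions):
    """Average empathy gain, interaction cost, and goal completion over sessions."""
    n = len(sessions)
    return {
        "empathy_gain": sum(s.dyad_hes - s.model_only_hes for s in sessions) / n,
        "mean_cost_seconds": sum(s.operator_seconds for s in sessions) / n,
        "goal_completion": sum(s.goal_completed for s in sessions) / n,
    }

sessions = [
    DyadSession(3.0, 4.0, 20.0, True),
    DyadSession(2.5, 3.0, 40.0, False),
]
summary = summarize(sessions)
```

Keeping gain and cost in the same summary allows a cost-per-point-of-gain comparison across interface variants.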