A Computerized Multimodal Battery for Emotion Recognition and Visual Attention Assessment in School-Aged Children
- Published in: ICEIS 2026, 28th International Conference on Enterprise Information Systems, Vol. 3 (pp. 1959–1966)
- License: CC BY-NC-ND 4.0
Reading emotions is a skill — and some children need help measuring it
Recognizing emotions is a cornerstone of social cognition. For children, it shapes how they make friends, behave in class, and feel about themselves. When a child struggles to read emotional cues, the effects ripple outward — into prosocial behaviour, self-regulation, and everyday engagement at school. These difficulties are also frequently associated with neurodevelopmental and behavioural conditions, which makes measuring them well genuinely important.
The problem is how we measure it. Many existing instruments lean on static facial photographs or a single type of stimulus. That limits ecological validity: real social life is multimodal — moving faces, tone of voice, context — and a still photo captures almost none of that. What was missing, the authors argue, is a single computerized battery that brings together multimodal emotional stimuli, avatar-based representations, and eye tracking, designed specifically for school-aged children.
That battery is BACRE-I.
- 6 assessment phases, plus 1 training phase
- 82 children aged 8–12 assessed
- 7 emotion categories, including neutral
What BACRE-I is
BACRE-I is a computerized, multimodal assessment tool built around six phases, preceded by a training phase so each child gets comfortable with the interface and the response procedure before any real data is collected (training data is discarded). Children respond with a simplified, adapted input device, choosing the emotion they think best matches each stimulus. Crucially, no feedback is given during the assessment, to avoid learning effects skewing later phases.
The six phases probe emotion recognition through progressively different channels:
| Phase | Modality | What it tests |
|---|---|---|
| 1 | Static images | Recognition from still facial photographs |
| 2 | Dynamic videos | Recognition from short moving-face clips |
| 3 | Vocal stimuli | Recognition from voice alone (no face) |
| 4 | Audiovisual clips | Recognition from short clips with context |
| 5 | Avatar-based | A 3D facial avatar built from the child's own features |
| 6 | Eye tracking | Static and dynamic faces + gaze acquisition |
The seven emotion categories follow the classic set: happiness, sadness, anger, fear, surprise, disgust, and neutral. The implementation is open source (github.com/motaoliveiraufpr/emote) to support transparency and reproducibility.
Top tip
Eye tracking isn't just Phase 6 — gaze is recorded throughout all phases. That lets the system tell apart two very different kinds of error: a child who looked at the face but misread it, versus a child who simply wasn't looking.
The engineering behind the phases
What makes this a systems contribution, rather than just a set of tasks, is the pipeline underneath each phase.
- Stimuli selection and validation. Stimuli were drawn from well-established, publicly available databases (standardized facial image sets, audiovisual emotion databases, child-appropriate video excerpts). Each candidate was then reviewed by a panel of expert judges from psychology, speech therapy, and health fields; only stimuli that reached a minimum inter-judge agreement threshold made the final battery.
- Avatar generation. An end-to-end pipeline reconstructs a 3D facial model from a single input image of the child, rendered in real time in the Unity engine. Hair was deliberately removed from the model to avoid visual distraction and keep attention on the expression itself.
- Facial expression analysis. During specific phases, the child's own facial expressions are recorded and analysed with the Affectiva SDK, which detects facial action units and maps them to emotion categories — adding a continuous, objective signal that reduces observer bias.
- Eye tracking. A non-invasive, remote infrared eye tracker captures gaze with high temporal and spatial accuracy while allowing natural head movement — important when your participants are children. Areas of interest are predefined over facial regions (forehead, eyes, nose, mouth, lateral, and non-focused), enabling heat maps and quantitative attention measures.
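The expert-validation step in the first item above can be sketched as a simple agreement filter. This is a minimal illustration, not the authors' procedure: the source does not state which agreement measure or threshold was used, so the modal-label share, the 0.8 cut-off, and all names and ratings below are assumptions.

```python
from collections import Counter

# Hypothetical threshold; the source does not report the value used.
MIN_AGREEMENT = 0.8


def modal_agreement(labels):
    """Return (most common label, share of judges who chose it)."""
    if not labels:
        raise ValueError("at least one judge rating is required")
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


def select_stimuli(ratings, min_agreement=MIN_AGREEMENT):
    """Keep stimuli whose modal label reaches the agreement threshold."""
    kept = {}
    for stimulus_id, labels in ratings.items():
        label, share = modal_agreement(labels)
        if share >= min_agreement:
            kept[stimulus_id] = label
    return kept


# Invented ratings from five hypothetical judges, for illustration only.
ratings = {
    "img_001": ["happiness"] * 5,
    "img_002": ["fear", "surprise", "fear", "surprise", "neutral"],
    "img_003": ["anger", "anger", "anger", "anger", "disgust"],
}
selected = select_stimuli(ratings)
print(selected)
```

In this toy data, `img_002` is dropped because only two of five judges agree on its label. A chance-corrected statistic such as Fleiss' kappa would be a stricter alternative to raw percent agreement.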
How the study ran
The battery was applied to 82 children aged 8–12, recruited from Brazilian public elementary schools. Each child was assessed individually in a controlled environment with standardized lighting and minimal auditory distraction, completing the training phase and then all six assessment phases in a single session. A trained professional was present only to clarify instructions and keep the child engaged — never to give feedback or influence responses. The system automatically logged phase-level accuracy, execution time, and eye-tracking data.
The study followed formal ethical procedures: it was approved by the Research Ethics Committee of the Federal University of Paraná (CAAE 09255519.8.0000.0102), with informed consent from guardians (TCLE, Termo de Consentimento Livre e Esclarecido) and age-appropriate assent from the children themselves (TALE, Termo de Assentimento Livre e Esclarecido). All data were anonymized, and images captured for avatars and expression analysis were used exclusively for research.
What the numbers showed
Recognition accuracy varied substantially across modalities — which is exactly the point of a multimodal design.
| Phase | Mean accuracy | Mean time |
|---|---|---|
| 1 — Static images | 75.20% | 115.87 s |
| 2 — Dynamic videos | 82.18% | 145.70 s |
| 3 — Vocal stimuli | 91.46% | 87.93 s |
| 4 — Audiovisual clips | 76.36% | 494.39 s |
| 5 — Avatar-based | 45.93% | 65.68 s |
| 6 — Eye tracking | 72.40% | 149.80 s |
Across the full battery (mean total time ≈ 1059 s), children averaged 49.95 correct responses and 16.03 errors. The standout patterns:
- Voice won. Isolated vocal stimuli produced the highest accuracy (91.46%) — under controlled conditions, tone of voice alone carried strongly salient affective information.
- Motion beat stills. Dynamic video (82.18%) outperformed static images (75.20%), echoing prior work: the subtle transitions in a moving face improve interpretability.
- Avatars were hardest. The self-referential 3D avatars yielded the lowest accuracy (45.93%) — and the fastest responses.
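The aggregate figures can be cross-checked against the per-phase table with a few lines of arithmetic. The sketch below uses only values reported above; the derived overall accuracy is a back-of-the-envelope ratio of the two reported means, not a figure given by the authors.

```python
# Mean execution time per phase in seconds, copied from the table above.
phase_time_s = {
    "static": 115.87,
    "dynamic": 145.70,
    "vocal": 87.93,
    "audiovisual": 494.39,
    "avatar": 65.68,
    "eye_tracking": 149.80,
}

total_time_s = sum(phase_time_s.values())

# Mean correct and incorrect responses across the full battery.
mean_correct = 49.95
mean_errors = 16.03

mean_responses = mean_correct + mean_errors
overall_accuracy = mean_correct / mean_responses

print(f"total time: {total_time_s:.2f} s")
print(f"mean responses: {mean_responses:.2f}")
print(f"overall accuracy: {overall_accuracy:.1%}")
```

The phase times sum to about 1059.37 s, matching the reported total. The two means imply roughly 66 responses per child and an overall accuracy near 75.7%.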
For gaze, fixations concentrated on the upper and central face: forehead (mean 5.41), eyes (2.81), and nose (1.06), with the mouth and lateral regions receiving far fewer. This suggests that children attended mainly to the central face, including the eye region, which typically carries affective cues.
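Counting fixations per area of interest can be sketched as a point-in-rectangle test. This is an illustration under stated assumptions rather than the BACRE-I implementation: the `Aoi` type, the rectangle coordinates (normalised to the stimulus frame), and the sample fixations are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Aoi:
    """Axis-aligned rectangle in normalised stimulus coordinates (0 to 1)."""
    name: str
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, x, y):
        return self.x0 <= x < self.x1 and self.y0 <= y < self.y1


# Hypothetical layout; real AOIs would be drawn per stimulus.
AOIS = [
    Aoi("forehead", 0.30, 0.10, 0.70, 0.30),
    Aoi("eyes", 0.30, 0.30, 0.70, 0.45),
    Aoi("nose", 0.42, 0.45, 0.58, 0.62),
    Aoi("mouth", 0.35, 0.62, 0.65, 0.78),
    Aoi("lateral", 0.15, 0.10, 0.30, 0.78),
    Aoi("lateral", 0.70, 0.10, 0.85, 0.78),
]


def classify_fixation(x, y, aois=AOIS):
    """Return the first AOI containing the point, else 'non_focused'."""
    for aoi in aois:
        if aoi.contains(x, y):
            return aoi.name
    return "non_focused"


def count_fixations(fixations, aois=AOIS):
    """Count fixations per AOI from (x, y) fixation centres."""
    return Counter(classify_fixation(x, y, aois) for x, y in fixations)


# Invented fixation centres for one trial.
fixations = [(0.5, 0.2), (0.45, 0.35), (0.55, 0.4), (0.5, 0.5), (0.05, 0.9)]
counts = count_fixations(fixations)
print(dict(counts))
```

Averaging such counts over trials and children would yield per-region summaries of the kind quoted above, though the exact aggregation used in the study is not described here.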
Why "fast but wrong" is the interesting result
A naïve reading would call the avatar phase a weakness. The authors argue the opposite, and the execution-time data is what makes the case:
Rapid responses accompanied by lower accuracy may reflect increased uncertainty or reduced familiarity with the representational format — a more immediate but less accurate judgment strategy. Shorter response time should not be interpreted as greater ease.
In other words, the avatar phase doesn't measure the same thing worse — it probes a distinct, more demanding layer of emotion recognition. Synthetic, self-referential faces appear to require extra interpretive effort. By contrast, the audiovisual clips took by far the longest (494 s), suggesting that richer contextual information increases processing demand rather than easing it.
This is where the eye-tracking module earns its place. A low score paired with on-target fixations tells a different story than a low score paired with wandering gaze. The system can therefore distinguish recognition difficulty from attentional disengagement — interpretive depth that a single accuracy number can never provide.
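The distinction drawn here can be written as a small decision rule over a trial's outcome and gaze. The sketch below is hypothetical: the `on_face_ratio` measure, the 0.5 cut-off, and the label names are assumptions for illustration, since the source does not specify how the system makes this distinction.

```python
from dataclasses import dataclass

# Hypothetical cut-off: share of gaze samples on the face needed to
# count the child as attending to the stimulus.
ATTENTION_THRESHOLD = 0.5


@dataclass
class Trial:
    correct: bool
    on_face_ratio: float  # fraction of gaze samples inside face AOIs, 0 to 1


def interpret_trial(trial, threshold=ATTENTION_THRESHOLD):
    """Label a trial by combining response accuracy with gaze engagement."""
    attended = trial.on_face_ratio >= threshold
    if trial.correct:
        return "correct" if attended else "correct_low_attention"
    return "recognition_difficulty" if attended else "attentional_disengagement"


# Invented trials, for illustration only.
trials = [Trial(True, 0.9), Trial(False, 0.8), Trial(False, 0.1)]
labels = [interpret_trial(t) for t in trials]
print(labels)
```

A fixed threshold is the simplest possible rule; in practice the cut-off would need to be calibrated against the tracker's noise and the duration of each stimulus.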
The real contribution: a framework, not just a test
From a systems perspective, BACRE-I's value is the unified workflow: multimodal stimulus presentation, automated metric recording, and gaze-based analysis in one place. That integration enables structured, phase-level comparison and yields a far richer dataset than isolated, static tasks. The authors are explicit that BACRE-I should be read not merely as a collection of tasks, but as a computational framework for organizing, recording, and analysing multimodal emotion-recognition data in children.
It also builds on the group's earlier work — including their smartwatch-based system for monitoring stress in driving students — continuing a line of research that turns affective signals into structured, analysable data.
Conclusions and what's next
By combining static and dynamic faces, voice, self-referential avatars, and eye tracking in a single session, BACRE-I delivers a structured, phase-level picture of how a child recognizes emotion across modalities. The experiment showed it captures meaningful variation in accuracy, time, and visual exploration — and that pairing behaviour with gaze makes children's responses genuinely interpretable.
The honest limitations: the avatar representation and the gaze analysis can both be refined, and the aggregate analysis doesn't exhaust the dataset. Future work points toward larger and more diverse samples, longitudinal designs, and deeper area-of-interest analysis — plus adapting the framework to other developmental, educational, and clinical scenarios.
Acknowledgments
This research was supported by the Coordination for the Improvement of Higher Education Personnel (CAPES), and carried out in partnership with the Neuropsychology Laboratory (Labneuro) of the Department of Psychology at the Federal University of Paraná (DEPSI–UFPR), whose interdisciplinary collaboration was fundamental to developing and applying the battery.
How to cite
de Oliveira, T. M., Telaska, T. dos S., Casa, C., Lourenço, F. C., de Sá Riechi, T. I. J. and Silva, L. (2026). A Computerized Multimodal Battery for Emotion Recognition and Visual Attention Assessment in School-Aged Children. In Proceedings of the 28th International Conference on Enterprise Information Systems (ICEIS 2026), Vol. 3, pp. 1959–1966. DOI: 10.5220/0014927500004018.
This is a plain-language adaptation of the peer-reviewed paper, prepared for a general audience. For the full methodology, tables, and complete reference list, please consult the original publication.