Stefan Scherer


Computational Nonverbal Behavior Analytics

My research aims to automatically identify, characterize, model, and synthesize individuals' multimodal nonverbal behavior within both human-machine and machine-mediated human-human interaction. The emerging technology in this field is relevant to a wide range of interaction applications, including healthcare and education. For example, characterizing nonverbal behavior and associating it with underlying clinical conditions, such as depression or post-traumatic stress, holds transformative potential and could significantly improve treatment and the efficiency of the healthcare system. This potential is recognized by several leading research funding agencies, such as DARPA with the Detection and Computational Analysis of Psychological Signals (DCAPS) project and the joint NSF/NIH program on Smart and Connected Health. Within the educational context, assessing the proficiency and expertise of individuals' social skills, in particular for those with learning disabilities or social anxiety, can help create individualized and precisely targeted education scenarios. The potential of cyber-learning and machine-assisted training for individuals with autism spectrum disorders is reflected in numerous open research programs; the Department of Defense, for example, pursues a clear agenda with the CDMRP Autism Research Program. Overall, this vibrant and highly multidisciplinary area of research, which integrates psychology, machine learning, multimodal sensor fusion, and pattern recognition, is emerging as an essential field of investigation for computer science.

In the recent past, my research has evolved in the direction of understanding multimodal nonverbal behavior in the context of clinical disorders and for educational purposes. Even though we have found exciting results and identified nonverbal indicators of suicidality and proficient public speaking, there remains extensive potential for future research in machine learning, data mining, and data visualization and interpretation. One of my present research interests is identifying optimal tools and algorithms that let clinicians make use of the rich data produced by our automatic nonverbal behavior tracking and machine learning approaches. Further, I am interested in developing smart data mining and pattern recognition algorithms that will enable clinicians to browse data and identify salient moments of interest in their patients' histories, interviews, and clinical records. Within the educational domain, I envision a platform that allows socially anxious individuals or individuals with learning disabilities to learn social skills in a forgiving and widely available environment incorporating virtual humans and possibly robots. Such a platform has potential for both training and standardized evaluation in education.

Public speaking performances are characterized not only by the presentation of the content, but also by the presenter's nonverbal behavior, such as gestures, tone of voice, vocal variety, and facial expressions. Within this work, we seek to identify automatic nonverbal behavior descriptors that correlate with expert assessments of behaviors characteristic of good and bad public speaking performances. We recorded a multimodal corpus with our virtual audience public speaking training platform Cicero, named after the great Roman orator Marcus Tullius Cicero. We further utilize the behavior descriptors to automatically approximate the overall assessment of the performance using support vector regression in a speaker-independent experiment and obtain promising results approaching human performance.

We could identify the following main findings in this work:

  1. Expert estimates of nonverbal behaviors, such as flow of speech, vocal variety, or avoided eye contact with the audience, are significantly correlated with an overall assessment of a presenter's performance.
  2. Using multimodal information from three synchronized sensors, we could identify automatic behavior descriptors that correlate strongly with expert estimates of nonverbal behaviors, comprising estimates for a clear intonation, vocal variety, pacing around, and eye contact with the audience.
  3. We then automatically approximate the experts' overall performance assessment with a mean error of .660 on a seven point scale. Further, the automatic approximation using support vector regression correlates significantly with the experts' opinion with Spearman's rho = .617 (p = .025), which approaches the correlation between the experts' opinions (i.e. rho = .648).
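The speaker-independent regression setup described above can be sketched as follows. This is a minimal illustration of the protocol (leave-one-speaker-out cross-validation with support vector regression), not the original Cicero pipeline; the features, speaker assignments, and expert scores below are synthetic placeholders.

```python
# Sketch of a speaker-independent SVR evaluation; all data is synthetic.
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
n_clips, n_features = 45, 12
X = rng.normal(size=(n_clips, n_features))   # behavior descriptors per clip
speakers = np.repeat(np.arange(9), 5)        # 9 speakers, 5 clips each
y = rng.uniform(1, 7, size=n_clips)          # expert scores on a 7-point scale

model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0))
preds = np.empty(n_clips)
# Hold out all clips of one speaker at a time, so the model never sees
# the test speaker during training.
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=speakers):
    model.fit(X[train_idx], y[train_idx])
    preds[test_idx] = model.predict(X[test_idx])

mae = np.mean(np.abs(preds - y))  # mean absolute error on the 7-point scale
print(f"speaker-independent MAE: {mae:.3f}")
```

On real data, the mean error and the Spearman correlation against the experts' scores would be computed from these held-out predictions.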
Motivated by these promising results, we plan to expand the Cicero research platform, presented at IVA 2013 in Edinburgh [1], to incorporate a reactive virtual audience. Cicero will enable us to conduct a wide variety of experiments, ranging from performance assessments to psychological experiments that would not be possible with a real human audience.

Recently, the project titled "CHS: Small: Investigating an Interactive Computational Framework for Nonverbal Interpersonal Skills Training" was funded with $500,000 by the National Science Foundation (Award IIS-1421330).

Team Cohesion Project - Multimodal Computational Framework for the Assessment and Enhancement of Team Cohesion and Performance in Human-System Integration

High military unit cohesion is a critical factor that enhances unit performance and promotes individual resilience to combat-related trauma. Traditional approaches to quantifying unit cohesion largely rely on questionnaires and cumbersome coding approaches, while the precise behavioral patterns underlying high unit cohesion remain unknown. Advancements in automatic behavior analysis, tracking, and machine learning enable novel approaches to assess complex and dynamic behavior of individuals and help further our understanding of the underlying mechanisms of high team cohesion. Understanding these mechanisms will allow us to predict and control dynamic interaction among humans and enable technology to become an equal member in human-system interaction enhancing and sustaining cohesion as a human team member would.

Within this ARL-HRED funded project we investigate the mechanisms of team cohesion and mutually-adaptive behavior through a comprehensive cybernetic research framework using a dataset of ROTC cadets performing a laboratory-based, standardized, drone-flying task that requires cooperation amongst team members. Our research framework aims to: 1) Sense – Employ multimodal data acquisition in standardized virtual team training/evaluation scenarios to monitor individual and team psychological, behavioral, and physiological state information; 2) Assess – Investigate how individual and team behavior as well as cohesion relate to performance and predict positive and negative outcomes using multimodal machine learning approaches; and 3) Enhance (Phase 2) – Utilize the developed multimodal computational framework to automatically predict team cohesion in real-time. Phase 2 aims to enhance training outcomes and individuals’ capabilities to act as a highly cohesive unit through online interventions and automatically generated after action reviews. The thorough understanding of the underlying mechanisms of team cohesion will inform the development of technology that tightly cooperates with human team members and has the potential to optimize human performance.

For this project we leverage a Department of Defense-funded US Army Medical Research Acquisition Activity project “Psychobiological Assessment and Enhancement of Team Cohesion and Psychological Resilience using a Virtual Team Cohesion Test” (JW140070; PI: Woolley, Joshua). This parent project aims to investigate the precise psychobiological mechanisms that serve unit cohesion.

co-Principal Investigators:

  • Prof. Jonathan Gratch (USC ICT)
  • Prof. Joshua Woolley MD (UCSF)

SmartVoice - Automatic Speech Analysis

Within the SmartVoice project we focus on developing data resources and algorithms that enable voice analysis and can enhance speech-processing technologies, such as expressive speech synthesis. Control and analysis of voice characteristics are critical for many areas of speech technology. With SmartVoice, we seek to improve state-of-the-art voice source analysis techniques and voice quality characterization approaches that could have a large impact on a wide variety of technologies and applications.

The work depicted here is sponsored by the U.S. Army Research Laboratory (ARL) under contract number W911NF-14-D-0005 and was partially sponsored by DARPA under contract number W911NF-04-D-0005. Statements and opinions expressed and content included do not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.

The above figure shows an example vowel space assessment for two male subjects. The male reference sample vowel space (i.e. /i/, /a/, /u/), depicted in red, is compared to the subjects' vowel spaces, depicted in green, for a subject who scored positively for depression on the self-assessment questionnaires (A) and a subject who scored negatively (B). The vowel spaces are visualized on a two-dimensional plot with Formant 1 on the x-axis and Formant 2 on the y-axis (both in Hz). Two-dimensional vowel centers are additionally displayed for both the male reference sample (red x-symbols) and the investigated subjects' vowel space cluster centroids (green circles). The corners of the triangular vowel space for both subjects are determined through minimal distance of the cluster centroids to the reference locations of /i/, /a/, /u/. The grey dots depict all observations of the first two formants across an entire interview. The vowel space of the subject scoring positively (A) is visibly smaller than that of the non-depressed subject (B), resulting in a smaller vowel space ratio value. (C) Basic overview of the approach to automatically assess the vowel space ratio. The process is separated into two major steps: speech processing (i.e. voicing detection and vowel tracking) and vowel space assessment (i.e. vector quantization using k-means clustering and vowel space ratio calculation). The output of the algorithm is the ratio between the reference sample vowel space (depicted as a red triangle) and the individual's vowel space (depicted as a green triangle). The larger the ratio, the larger the individual's vowel space with respect to the reference. (D) Observed mean vowel space ratios across conditions depression (D) vs. no-depression (ND) for interview data as well as read speech, PTSD (P) vs. no-PTSD (NP), and suicidal (S) vs. non-suicidal (NS) for interview data. The whiskers signify standard errors and the brackets show significant results with * p < .05 and ** p < .01.
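The vowel space assessment step described in (C) can be sketched as below: cluster the (F1, F2) observations with k-means, snap the centroids closest to the /i/, /a/, /u/ reference locations to form the speaker's triangle, and report its area relative to the reference triangle. The reference formant values and the synthetic observations are illustrative assumptions, not the study's data or exact parameters.

```python
# Sketch of the vowel space ratio computation; values are illustrative.
import numpy as np
from scipy.cluster.vq import kmeans2

# Assumed male reference (F1, F2) in Hz for /i/, /a/, /u/.
REFERENCE = np.array([[300.0, 2300.0], [700.0, 1200.0], [350.0, 800.0]])

def triangle_area(pts):
    """Area of the triangle spanned by three (F1, F2) points."""
    a, b, c = pts
    return 0.5 * abs((b[0] - a[0]) * (c[1] - a[1])
                     - (b[1] - a[1]) * (c[0] - a[0]))

def vowel_space_ratio(formants, n_clusters=10, seed=0):
    """Ratio of the speaker's vowel triangle area to the reference's."""
    centroids, _ = kmeans2(formants, n_clusters, seed=seed, minit="++")
    # For each reference vowel, pick the closest cluster centroid
    # as a corner of the speaker's vowel space triangle.
    corners = np.array([
        centroids[np.argmin(np.linalg.norm(centroids - ref, axis=1))]
        for ref in REFERENCE
    ])
    return triangle_area(corners) / triangle_area(REFERENCE)

rng = np.random.default_rng(1)
# Synthetic formant observations scattered around the reference vowels.
obs = np.vstack([ref + rng.normal(scale=60.0, size=(200, 2))
                 for ref in REFERENCE])
ratio = vowel_space_ratio(obs)
print(f"vowel space ratio: {ratio:.2f}")
```

A ratio well below 1 would indicate a reduced vowel space relative to the reference, the pattern reported above for the depressed group.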

COVAREP - A Cooperative Voice Analysis Repository for Speech Technologies

COVAREP is an open-source repository of advanced speech processing algorithms stored in a GitHub project where researchers in speech processing can store and share original implementations of published algorithms. Over the past few decades a vast array of advanced speech processing algorithms have been developed, often offering significant improvements over the existing state-of-the-art. Such algorithms can have a reasonably high degree of complexity and, hence, can be difficult to accurately re-implement based on article descriptions. Another issue is the so-called 'bug magnet effect' with re-implementations frequently having significant differences from the original ones. The consequence of all this has been that many promising developments have been under-exploited or discarded, with researchers tending to stick to conventional analysis methods.

By developing COVAREP we hope to address this by encouraging authors to include original implementations of their algorithms, thus resulting in a single de facto version for the speech community to refer to. COVAREP already includes several advanced speech processing algorithms ready for download and use in either Matlab or Octave (Contributions).

Contributors:

  • University of Crete, Gilles Degottex and Georgos Kafentzis
  • Trinity College Dublin, John Kane
  • University of Mons, Thomas Drugman
  • Aalto University, Tuomo Raitio and Jouni Pohjalainen
  • University of Southern California, Stefan Scherer

Detection and Computational Analysis of Psychological Signals (DCAPS)

A DARPA-funded research project investigating the use of telemedicine and virtual humans to address barriers to care and to provide better care for service members who seek treatment for psychological issues, including post-traumatic stress, depression, and suicide risk. Within the DCAPS project, I work on the identification and automatic detection of relevant nonverbal psychological signals (e.g. lack of mutual gaze, fidgeting, voice quality, and prosodic trends) in real time using audiovisual sensors.

Preliminary results on the investigation and identification of such indicators of psychological disorders were presented at the IEEE Face and Gesture Conference (FG 2013) in Shanghai [8] and at the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2013) in Vancouver [7].

Recently this work was covered in print and online media by New Scientist and Gizmodo. Unfortunately, both articles refer to the work as a digital shrink or virtual therapist, which is not the main purpose of the project. The research predominantly focuses on supporting healthcare providers, not on replacing them.

Principal Investigators:

  • Prof. Albert "Skip" Rizzo
  • Prof. Louis-Philippe Morency

Suicide is a very serious problem. In the United States, it ranks as one of the most frequent causes of death among teenagers between the ages of 12 and 17. We investigate speech characteristics of prosody as well as voice quality in dyadic interviews with suicidal and non-suicidal adolescents. In these interviews, the adolescents answer a set of specifically designed questions.

Based on this limited dataset, we reveal statistically significant differences in the speech patterns of suicidal adolescents within the investigated interview corpus. Further, we achieve respectable classification capabilities with basic machine learning approaches both on an utterance as well as an interview level. The work shows promising results in a speaker-independent classification experiment based on only a dozen speech features. We believe that once the algorithms are refined and integrated with other methods, they may be of value to the clinician.

Major findings of this study:

  1. Based on the few extracted features, we could identify the speech of suicidal and non-suicidal adolescents with a high degree of accuracy in a speaker-independent analysis scenario.
  2. Suicidal adolescents exhibit significantly more breathy voice qualities than non-suicidal subjects.
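The speaker-independent classification protocol behind these findings can be sketched as follows: an SVM over a small set of speech features, evaluated with leave-one-subject-out splits at the utterance level and aggregated per interview by majority vote. All data below is synthetic and illustrative; the actual study also compared hidden Markov models, which are omitted here for brevity.

```python
# Sketch of speaker-independent utterance- and interview-level
# classification; all data is synthetic.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
n_subjects, utt_per_subject, n_features = 16, 20, 12
subjects = np.repeat(np.arange(n_subjects), utt_per_subject)
labels = np.arange(n_subjects) % 2          # subject-level class (0/1)
y = labels[subjects]                        # utterances inherit the label
# Synthetic features with a small class-dependent shift.
X = rng.normal(size=(y.size, n_features)) + 0.5 * y[:, None]

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
utt_pred = np.empty_like(y)
# Hold out all utterances of one subject at a time.
for tr, te in LeaveOneGroupOut().split(X, y, groups=subjects):
    clf.fit(X[tr], y[tr])
    utt_pred[te] = clf.predict(X[te])

utt_acc = np.mean(utt_pred == y)
# Interview-level decision: majority vote over a subject's utterances.
interview_pred = np.array([round(utt_pred[subjects == s].mean())
                           for s in range(n_subjects)])
interview_acc = np.mean(interview_pred == labels)
print(f"utterance acc: {utt_acc:.2f}, interview acc: {interview_acc:.2f}")
```

Aggregating per interview typically smooths out utterance-level errors, which is one reason interview-level accuracy can exceed utterance-level accuracy.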

We are confident that with some additional refinement we can support professional healthcare providers with objective speech measures of suicidal patients to improve clinical assessments. This work is published at the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP2013) in Vancouver [7].

The figure below depicts a boxplot comparison of two significantly different acoustic measures for the two groups of suicidal and non-suicidal adolescents. The measures indicate a breathier voice quality for suicidal subjects. The right figure summarizes accuracies in % for speaker-independent classification experiments at the interview and utterance levels. The performance of hidden Markov models (blue) is compared to that of support vector machines (red).


Cerebella

Cerebella is a project directed by Prof. Stacy Marsella at the USC Institute for Creative Technologies. It aims at the automatic generation of physical behaviors for virtual agents. The name Cerebella is derived from the Latin plural of the word cerebellum and means "little brains". Much like a multitude of small brains, Cerebella enables virtual humans not only to generate a variety of interaction-relevant behaviors, such as head nods or beat gestures, but also to underline the spoken words with gestural metaphors.

In the embedded YouTube video below, some of the core features of Cerebella are highlighted. The acoustic analysis, which was my main personal contribution to the development, integrates word emphasis detection and an overall agitation level assessment based on our voice quality assessment algorithms using fuzzy-input fuzzy-output support vector machines [9].

Perception Markup Language (PML)

Modern virtual agents require knowledge about their environment, the interaction itself, and their interlocutors' behavior in order to show appropriate nonverbal behavior as well as to adapt dialog policies accordingly. Recent achievements in the area of automatic behavior recognition and understanding can provide information about the interactants' multimodal nonverbal behavior and subsequently their affective states. Hence, we introduce a perception markup language (PML), a first step towards a standardized representation of perceived nonverbal behaviors. PML follows several design concepts, namely compatibility and synergy, modeling uncertainty, multiple interpretative layers, and extensibility, in order to maximize its usefulness for the research community. We show how we successfully integrated PML into a fully automated virtual agent system for healthcare applications, as seen in DCAPS.
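To make the design concepts concrete, a PML-style message might look like the sketch below. The element and attribute names here are hypothetical illustrations, not the published PML schema; the point is the layered interpretation (sensing vs. behavior) and the explicit confidence attached to each perceived behavior.

```python
# Hypothetical PML-style message; element/attribute names are illustrative.
import xml.etree.ElementTree as ET

pml = ET.Element("pml", time="12.40s", interlocutor="subject")
# Sensing layer: raw perceptual observations with uncertainty.
sensing = ET.SubElement(pml, "sensingLayer")
ET.SubElement(sensing, "gaze", direction="away", confidence="0.82")
# Behavior layer: a higher-level interpretation, also with uncertainty.
behavior = ET.SubElement(pml, "behaviorLayer")
ET.SubElement(behavior, "attention", level="low", confidence="0.74")

message = ET.tostring(pml, encoding="unicode")
print(message)
```

A consumer such as a nonverbal behavior generation manager or dialog manager could then react to the low-attention interpretation while taking its confidence into account.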

This work was first published at IVA 2012 in Santa Cruz, CA [19]. The figure below exemplifies a typical interaction with our virtual agent and highlights several key moments when PML messages are used. In these key moments, Ellie, the virtual human, reacts to the subject's nonverbal behavior in a way that would not have been possible without the information provided by PML. She, for example, exhibits a head nod when the subject is pausing a lot in the conversation to encourage the subject to continue speaking (see (a)). In (b), the subject exhibits low attention by looking away. A PML message with this information is sent to the nonverbal behavior generation manager (e.g. Cerebella) and the virtual agent is signaled to lean forward in an effort to engage the subject. Figure (c) shows an instance where PML signals the dialog manager that the subject's attention level is low. This message triggers a branching in the dialog policy.