Moscow, Russia - Miami, US (PRWEB) March 22, 2018
In this article, we illustrate how a crowdsourcing approach can be used to create a large-scale data corpus for affective computing.
At the current stage of information technology development, automatic emotion detection, recognition and classification is an important and urgent goal. Just a few decades ago the idea seemed like pure fantasy, but today it is a reality. The emotion detection and recognition (EDRS) market is booming. According to a recent Markets&Markets report, the EDRS market will grow from $6.72 bn in 2016 to $36.07 bn by 2021, a compound annual growth rate of 39.9%.
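As a quick sanity check on the market figures quoted above, the growth rate can be recomputed with the standard compound annual growth rate formula. The Python sketch below is only an illustration: the function name `cagr` is chosen here and does not come from the report.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate as a fraction (0.10 means 10% per year)."""
    return (end_value / start_value) ** (1 / years) - 1


# Figures quoted in the text, in billions of US dollars.
rate = cagr(6.72, 36.07, 2021 - 2016)
print(f"{rate:.1%}")  # prints 39.9%
```

The result matches the quoted 39.9% when the growth is compounded over the five years from 2016 to 2021.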
We have seen only the tip of the iceberg when it comes to human-computer interaction (HCI), but without the development of cognitive computing technologies there will be no true, precise and correct machine emotional intelligence.
Machine emotional intelligence is evolving rapidly. The current boom in Artificial Emotional Intelligence, or Emotion AI, was catalyzed in part by success in teaching computers to recognize the contents of videos by training deep neural networks on large labeled databases. Building such machine learning models requires video databases, because the algorithms need a large training corpus of annotated data.
While a photo is just one static image, a video shows narrative in motion. Video is time-consuming to annotate manually, and it is computationally expensive to store and process. In this context, one of the main limitations in the development of Emotion AI is the lack of suitable video databases. Creating video databases of annotated human expressive behavior for deep learning is a difficult and resource-consuming process. That is why high-quality video databases for neural network training are in short supply.
All existing video databases have the following disadvantages:
- They are acted, so the emotions performed by the actors (professional and amateur) are not natural;
- They are limited in size. Automatic recognition of human behavior needs much more data to cover the variation in how states are expressed;
- The classifications of emotions are narrow and insufficient for certain practical purposes.
Having found no database that suited all our needs, we decided to create one ourselves using an efficient crowdsourcing approach. As a scientific research company working in affective and cognitive science, psychology, computer vision and machine learning, we at Neurodata Lab are, roughly speaking, trying to teach computers to recognize human emotions. To create algorithms that recognize human emotions, you first need to provide them with massive amounts of already tagged and labeled information to learn from and test on. We believe that manual annotation, in which annotators watch videos and match the emotions shown in them against preselected classifications, is the most accurate and reliable way to collect data.
To obtain a large-scale and reliable database, we developed Emotion Miner, a global online video-annotation platform for multimodal emotion and behavior data collection, labeling and processing.
The corpus we collected with the help of the Emotion Miner platform is named “Emotion Miner Corpus” (EMC).
First, we needed to fill the platform with appropriate video content. In total, we chose more than 4000 video clips (140 hours in total duration) for our platform. English-language video fragments were extracted from available and/or proprietary public content (interviews, debates, talk shows, etc.) and put through a multiple-annotation procedure based on a detailed system of emotion scales. Our goal was to select videos reflecting a wide variety of emotions shown by different people in various shooting conditions.
When selecting video content for the platform, we were guided by the following principles:
- All the videos are in English;
- All the scenes contain people;
- The video formats are monologue, dialogue, group conversation and public presentation.
To let the audio and video processing algorithms work more reliably, we excluded black-and-white videos, videos with extraneous noise or background music, and videos with unusual (creative) angles or a moving camera.
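The selection principles and exclusion criteria above can be read as a simple filter over clip metadata. The Python sketch below is a minimal illustration under stated assumptions: the `ClipInfo` fields and the `is_eligible` function are hypothetical names invented here, not the platform's actual schema, and in practice such flags would have to be set by a human reviewer or a separate detector.

```python
from dataclasses import dataclass

# The four video formats named in the selection principles.
ALLOWED_FORMATS = {"monologue", "dialogue", "group conversation", "public presentation"}


@dataclass
class ClipInfo:
    # Hypothetical metadata fields, assumed for illustration only.
    language: str
    contains_people: bool
    video_format: str
    black_and_white: bool = False
    extraneous_noise: bool = False
    background_music: bool = False
    unusual_angle: bool = False
    moving_camera: bool = False


def is_eligible(clip: ClipInfo) -> bool:
    """Apply the inclusion principles first, then the exclusion criteria."""
    included = (
        clip.language == "en"
        and clip.contains_people
        and clip.video_format in ALLOWED_FORMATS
    )
    excluded = any(
        [
            clip.black_and_white,
            clip.extraneous_noise,
            clip.background_music,
            clip.unusual_angle,
            clip.moving_camera,
        ]
    )
    return included and not excluded


interview = ClipInfo(language="en", contains_people=True, video_format="dialogue")
concert = ClipInfo(language="en", contains_people=True, video_format="dialogue",
                   background_music=True)
```

Here `is_eligible(interview)` is true, while `is_eligible(concert)` is false because of the background music.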
As mentioned above, the videos on the platform were marked manually by annotators. Annotation means watching separate fragments and either marking one or more emotional states recognized in the video (how many depends on the emotional scale) from a given set of options, or choosing the “None of these” option if the current emotion is not on the list.
The total number of registered annotators reached approximately 20 000 users from 30 countries worldwide. They were paid for watching sets of video clips on the platform. Each clip was divided into a maximum of thirty short fragments to be annotated. In total, more than one hundred thousand video fragments were presented on the platform. Each fragment lasted approximately 5 seconds and was marked by at least 10 annotators.
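The text does not say how the ten or more votes per fragment were combined into a final label. One simple possibility, shown here purely as an assumption, is to compute for each label the share of annotators who selected it. In the Python sketch below, the function name `label_shares` and the sample votes are invented for illustration.

```python
from collections import Counter

MIN_ANNOTATORS = 10  # from the text: each fragment was marked by at least 10 annotators


def label_shares(annotations):
    """Return, for each label, the share of annotators who selected it.

    `annotations` holds one collection of selected labels per annotator.
    """
    if len(annotations) < MIN_ANNOTATORS:
        raise ValueError("fragment has fewer annotations than required")
    # set() ensures an annotator is counted once per label.
    counts = Counter(label for selected in annotations for label in set(selected))
    return {label: n / len(annotations) for label, n in counts.items()}


# Invented votes for one fragment from ten annotators.
votes = (
    [{"happiness"}] * 6
    + [{"happiness", "surprise"}] * 2
    + [{"None of these"}] * 2
)
shares = label_shares(votes)
```

With these sample votes, "happiness" is chosen by eight of the ten annotators, while "surprise" and "None of these" are each chosen by two. A threshold on the share could then decide which labels to keep, but any such threshold would be a further assumption.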
Emotion scale system
We developed our own emotional scale system. Our goal was not to propose a new universal model of human behavior, but to create a reasonably comprehensive list of affective states and social behaviors in public interaction. We structured emotions into 4 categories:
- Basic emotions
- Mental states and behavior
- Person and situation
Based on these categories we identified 22 scales. They are: engagement, high self-disclosure, high self-confidence, pleasure, happiness, friendliness, mental effort, self-presentation, pride, low pleasure, sadness, admiration, anger, anxiety, contempt, hostility, surprise, low self-confidence, low self-disclosure, disengagement, disgust and shame.
These scales apply only to behavior patterns that an outside annotator can observe. They do not cover personal traits or hidden states.
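For readers who want to work with these labels programmatically, the 22 scales can be held in a simple constant and used to check an annotator's selection. The Python sketch below is an illustration only: the names `SCALES`, `NONE_OPTION` and `validate_selection` are invented here, and the rule that “None of these” cannot be combined with other labels is our reading of the either/or choice described for annotators.

```python
# The 22 scales, in the order listed in the text.
SCALES = (
    "engagement", "high self-disclosure", "high self-confidence", "pleasure",
    "happiness", "friendliness", "mental effort", "self-presentation", "pride",
    "low pleasure", "sadness", "admiration", "anger", "anxiety", "contempt",
    "hostility", "surprise", "low self-confidence", "low self-disclosure",
    "disengagement", "disgust", "shame",
)
NONE_OPTION = "None of these"


def validate_selection(selected):
    """Return the selection as a set, or raise ValueError if it is not allowed."""
    chosen = set(selected)
    if not chosen:
        raise ValueError("at least one option must be selected")
    unknown = chosen - set(SCALES) - {NONE_OPTION}
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    # Assumption: the fallback option excludes all other labels.
    if NONE_OPTION in chosen and len(chosen) > 1:
        raise ValueError("'None of these' cannot be combined with other labels")
    return chosen


checked = validate_selection(["anger", "contempt"])
```

A label outside the list, such as "boredom", would raise a `ValueError` instead of being silently accepted.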
To summarize, Emotion Miner Corpus has the following numerical characteristics:
- 140 hours of videos
- More than 110 000 annotated video fragments
- 1.3 million marked frames
- 22 emotional states
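These totals can be cross-checked against the per-fragment and per-clip figures given earlier. The Python sketch below is a rough estimate only: the text says “more than” for both the clip count and the fragment count, so the averages are approximate, and the variable names are chosen here for illustration.

```python
TOTAL_HOURS = 140             # total video duration
FRAGMENTS = 110_000           # "more than 110 000" annotated fragments
CLIPS = 4_000                 # "more than 4000" video clips
MAX_FRAGMENTS_PER_CLIP = 30   # stated upper limit per clip

# Average fragment length in seconds, and average number of fragments per clip.
avg_fragment_seconds = TOTAL_HOURS * 3600 / FRAGMENTS
avg_fragments_per_clip = FRAGMENTS / CLIPS

print(round(avg_fragment_seconds, 2))  # prints 4.58
print(avg_fragments_per_clip)          # prints 27.5
```

Both estimates are consistent with the description: fragments of approximately 5 seconds, and no more than thirty fragments per clip on average.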
A significant increase in scale and diversity over existing video datasets makes Emotion Miner Corpus unique. The detailed information on emotions, divided into 22 scales, and the size of the Emotion Miner Corpus make it a valuable, high-quality addition to the existing databases available to the global community for studying and modeling multimodal and expressive human communication. Our experiments clearly demonstrate that the corpus will also be useful for studying the correlations between audio and facial features in the context of affective and social behaviors in public interaction.
On the basis of Emotion Miner Corpus, we began to develop a multimodal system for recognizing emotions and social behavior, including analysis of the face, eyes, body movements, gestures, voice and speech.
We hope that Emotion Miner Corpus, which took less than 6 months to collect (from the initial development of the web platform to the mature processing stages), will play a significant role in the further development of human-machine interaction and contribute to the creation of multimodal Emotion AI.