Automatic body communication extraction through markerless motion capture

  1. Marcos Ramiro, Álvaro
Supervised by:
  1. Daniel Pizarro Pérez Supervisor
  2. Daniel Gatica Pérez Co-supervisor
  3. Marta Marrón Romera Co-supervisor

Defense university: Universidad de Alcalá

Defense date: 17 June 2014

Committee:
  1. Luis M. Bergasa Pascual Chair
  2. Javier Macías Guarasa Secretary
  3. José Luis Alba Castro Member
  4. Luis Baumela Molina Member
  5. Jean Marc Odobez Member

Type: Thesis

Teseo: 120314

Abstract

This thesis addresses the automatic extraction of nonverbal communication by means of computer vision techniques. Nonverbal communication plays a significant role in how we perceive each other in a social context, and it has therefore been studied intensively in social psychology and cognitive science. However, such studies have always needed an interpreter: a person who judges the perceived traits of the analyzed subject, or who codes specific behaviors. This judgment always carries a degree of subjectivity, which can lead to inconsistencies across evaluations. Depending on the amount of data available, it can also be a cumbersome, time-consuming task. An automatic system that does not depend on human interpretation is key to addressing this problem, because it provides consistency when studying the behaviors present. We address this task by means of human markerless motion capture.

Markerless motion capture extracts the position of human body parts in images and videos. While wearable sensors exist for the same purpose, the discomfort associated with them makes movements less natural. There are three main sensor set-ups in markerless motion capture: multi-camera, single camera and depth camera. In this thesis we make contributions to all of them. We first designed a multi-camera approach based on 3D scene reconstruction through Visual Hulls. We took advantage of non-linear regression methods to simplify the search in the high-dimensional human pose space, which allowed us to track multiple subjects simultaneously with a single tracker. With the help of a refinement process, we were able to provide better generalization capabilities. We then developed a single-camera method based on the idea of hand saliency: we hypothesized that the hands are the parts of the image that move fastest over the course of a video. To this end, we designed a new hand tracker based on a Decision Trees algorithm, and simultaneously performed action recognition. We later extended this approach by fusing the information provided by a depth camera into the hand saliency map equations. Finally, we developed a highly appearance-invariant method for motion capture, again using a single color camera. Using dense optical flow and a torso detector, we were able first to classify the body parts in the image and then to obtain the body configuration. This last contribution is a step toward removing the appearance-related problems of markerless motion capture.

We evaluated all the approaches on public and private datasets, matching or improving on state-of-the-art performance. Additionally, we applied some of the ideas behind our methods to infer a series of social constructs from real job interviews. We extracted and aggregated a series of manually annotated and automatic features from videos, and showed the correlation between them and personality traits or job performance. Finally, we were able to predict some of those traits with a regression scheme.