Title : Temporally Consistent Semantic Segmentation in Videos
S. Hussain Raza, Ph. D. Candidate in ECE (https://sites.google.com/site/shussainraza5/)
- Prof. Irfan Essa (advisor), School of Interactive Computing
- Prof. David Anderson (co-advisor), School of Electrical and Computer Engineering
- Prof. Frank Dellaert, School of Interactive Computing
- Prof. Anthony Yezzi, School of Electrical and Computer Engineering
- Prof. Chris Barnes, School of Electrical and Computer Engineering
- Prof. Rahul Sukthanker, Department of Computer Science and Robotics, Carnegie Mellon University.
The objective of this Thesis research is to develop algorithms for temporally consistent semantic segmentation in videos. Though many different forms of semantic segmentations exist, this research is focused on the problem of temporally-consistent holistic scene understanding in outdoor videos. Holistic scene understanding requires an understanding of many individual aspects of the scene including 3D layout, objects present, occlusion boundaries, and depth. Such a description of a dynamic scene would be useful for many robotic applications including object reasoning, 3D perception, video analysis, video coding, segmentation, navigation and activity recognition.
Scene understanding has been studied with great success for still images. However, scene understanding in videos requires additional approaches to account for the temporal variation, dynamic information, and exploiting causality. As a first step, image-based scene understanding methods can be directly applied to individual video frames to generate a description of the scene. However, these methods do not exploit temporal information across neighboring frames. Further, lacking temporal consistency, image-based methods can result in temporally-inconsistent labels across frames. This inconsistency can impact performance, as scene labels suddenly change between frames.
The objective of our this study is to develop temporally consistent scene descriptive algorithms by processing videos efficiently, exploiting causality and data-redundancy, and cater for scene dynamics. Specifically, we achieve our research objects by (1) extracting geometric context from videos to give broad 3D structure of the scene with all objects present, (2) detecting occlusion boundaries in videos due to depth discontinuity, and (3) estimating depth in videos by combining monocular and motion features with semantic features and occlusion boundaries.