Paper in ACM UIST 2022 on “Synthesis-Assisted Video Prototyping From a Document”
Video productions commonly start with a script, especially for talking head videos that feature a speaker narrating to the camera. When the source materials come from a written document — such as a web tutorial, it takes iterations to refine content from a text article to a spoken dialogue, while considering visual compositions in each scene. We propose Doc2Video, a video prototyping approach that converts a document to interactive scripting with a preview of synthetic talking head videos. Our pipeline decomposes a source document into a series of scenes, each automatically creating a synthesized video of a virtual instructor. Designed for a specific domain — programming cookbooks, we apply visual elements from the source document, such as a keyword, a code snippet or a screenshot, in suitable layouts. Users edit narration sentences, break or combine sections, and modify visuals to prototype a video in our Editing UI. We evaluated our pipeline with public programming cookbooks. Feedback from professional creators shows that our method provided a reasonable starting point to engage them in interactive scripting for a narrated instructional video.
Paper in ACM CHI 2021 on “Automatic Generation of Two-Level Hierarchical Tutorials from Instructional Makeup Videos”
We present a multi-modal approach for automatically generating hierarchical tutorials from instructional makeup videos. Our approach is inspired by prior research in cognitive psychology, which suggests that people mentally segment procedural tasks into event hierarchies, where coarse-grained events focus on objects while fine-grained events focus on actions. In the instructional makeup domain, we find that objects correspond to facial parts while fine-grained steps correspond to actions on those facial parts. Given an input instructional makeup video, we apply a set of heuristics that combine computer vision techniques with transcript text analysis to automatically identify the fine-level action steps and group these steps by facial part to form the coarse-level events. We provide a voice-enabled, mixed-media UI to visualize the resulting hierarchy and allow users to efficiently navigate the tutorial (e.g., skip ahead, return to previous steps) at their own pace. Users can navigate the hierarchy at both the facial-part and action-step levels using click-based interactions and voice commands. We demonstrate the effectiveness of segmentation algorithms and the resulting mixed-media UI on a variety of input makeup videos. A user study shows that users prefer following instructional makeup videos in our mixed-media format to the standard video UI and that they find our format much easier to navigate.
Paper in Ubicomp 2015: "A Practical Approach for Recognizing Eating Moments with Wrist-Mounted Inertial Sensing"
Paper Abstract Recognizing when eating activities take place is one of the key challenges in automated food intake monitoring. Despite progress over the years, most proposed approaches have been largely impractical for everyday usage, requiring multiple onbody sensors or specialized devices such as neck collars for swallow detection. In this paper, we describe the implementation […]
Paper in ACM Ubicomp 2013 "Technological approaches for addressing privacy concerns when recognizing eating behaviors with wearable cameras"
[bibtex file=IrfanEssaWS.bib key=2013-Thomaz-TAAPCWREBWWC] Abstract First-person point-of-view (FPPOV) images taken by wearable cameras can be used to better understand people’s eating habits. Human computation is a way to provide effective analysis of FPPOV images in cases where algorithmic approaches currently fail. However, privacy is a serious concern. We provide a framework, the privacy-saliency matrix, for understanding […]
Paper in IEEE Personal Communications (2000) on “Ubiquitous sensing for smart and aware environments”
Abstract As computing technology continues to become increasingly pervasive and ubiquitous, we envision the development of environments that can sense what we are doing and support our daily activities. In this article, we outline our efforts toward building such environments and discuss the importance of a sensing and signal-understanding infrastructure that leads to awareness of […]
Paper: PUI (1997) "Prosody Analysis for Speaker Affect Determination"
Andrew Gardner and Irfan Essa (1997) “Prosody Analysis for Speaker Affect Determination” In Proceedings of Perceptual User Interfaces Workshop (PUI 1997), Banff, Alberta, CANADA, Oct 1997 [PDF][Project Site] Abstract Speech is a complex waveform containing verbal (e.g. phoneme, syllable, and word) and nonverbal (e.g. speaker identity, emotional state, and tone) information. Both the verbal and […]