April 23, 2023 / Last updated : August 9, 2023 irfan UIST

Paper in UIST 2023 on “Slide Gestalt: Automatic Structure Extraction in Slide Decks for Non-Visual Access”

Presentation slides commonly use visual patterns for structural navigation, such as titles, dividers, and build slides. However, screen readers do not capture such intention, making it time-consuming and less accessible for blind and visually impaired (BVI) users to linearly consume slides with repeated content. We present Slide Gestalt, an automatic approach that identifies the hierarchical structure in a slide deck. Slide Gestalt computes the visual and textual correspondences between slides to generate hierarchical groupings. Using our UI, readers can interactively navigate the slide deck from the higher-level section overview down to the lower-level description of a slide group or individual elements. We derived slide consumption and authoring practices from interviews with BVI readers and sighted creators and an analysis of 100 decks. We ran our pipeline on 50 real-world slide decks and a large dataset. Feedback from eight BVI participants showed that Slide Gestalt helped them navigate a slide deck by anchoring content more efficiently, compared to using accessible slides.
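To make the idea of grouping slides by textual correspondence concrete, here is a minimal sketch. It is not the Slide Gestalt pipeline: it ignores visual features entirely and assumes that consecutive slides belong to one group (for example, a run of build slides) when the Jaccard overlap of their words reaches a threshold. The function names, the threshold of 0.5, and the sample deck are all hypothetical.

```python
from __future__ import annotations


def jaccard(a: set[str], b: set[str]) -> float:
    """Word-set overlap between two slides (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def group_slides(slides: list[str], threshold: float = 0.5) -> list[list[int]]:
    """Group consecutive slide indices whose text overlaps enough.

    Assumption for illustration only: a slide joins the current group when
    its word overlap with the previous slide is at least `threshold`.
    """
    groups: list[list[int]] = []
    prev_tokens: set[str] | None = None
    for index, text in enumerate(slides):
        tokens = set(text.lower().split())
        if prev_tokens is not None and jaccard(prev_tokens, tokens) >= threshold:
            groups[-1].append(index)  # continues the current group
        else:
            groups.append([index])  # starts a new group
        prev_tokens = tokens
    return groups


# Hypothetical deck: slides 1-3 are build slides that repeat earlier content.
deck = [
    "Agenda",
    "Results accuracy",
    "Results accuracy latency",
    "Results accuracy latency cost",
    "Thank you",
]
groups = group_slides(deck)
print(groups)
```

A reader could then be offered one summary per group instead of hearing the repeated content of every build slide in turn.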

March 10, 2023 / Last updated : March 14, 2023 irfan Publications

Some recent publications for 2023

Here is a list of some recent works accepted for publication that I am honored to be part of. These will be appearing in CHI, ICLR, and CVPR. Excited to share these new efforts.

October 15, 2022 / Last updated : March 20, 2023 irfan UIST

Paper in ACM UIST 2022 on “Synthesis-Assisted Video Prototyping From a Document”

Video productions commonly start with a script, especially for talking head videos that feature a speaker narrating to the camera. When the source material comes from a written document, such as a web tutorial, it takes iterations to refine content from a text article to a spoken dialogue, while considering visual compositions in each scene. We propose Doc2Video, a video prototyping approach that converts a document to interactive scripting with a preview of synthetic talking head videos. Our pipeline decomposes a source document into a series of scenes, each automatically creating a synthesized video of a virtual instructor. Designed for a specific domain, programming cookbooks, our approach applies visual elements from the source document, such as a keyword, a code snippet, or a screenshot, in suitable layouts. Users edit narration sentences, break or combine sections, and modify visuals to prototype a video in our Editing UI. We evaluated our pipeline with public programming cookbooks. Feedback from professional creators shows that our method provided a reasonable starting point to engage them in interactive scripting for a narrated instructional video.

February 25, 2021 / Last updated : March 15, 2023 irfan CHI

Paper in ACM CHI 2021 on “Automatic Generation of Two-Level Hierarchical Tutorials from Instructional Makeup Videos”

We present a multi-modal approach for automatically generating hierarchical tutorials from instructional makeup videos. Our approach is inspired by prior research in cognitive psychology, which suggests that people mentally segment procedural tasks into event hierarchies, where coarse-grained events focus on objects while fine-grained events focus on actions. In the instructional makeup domain, we find that objects correspond to facial parts while fine-grained steps correspond to actions on those facial parts. Given an input instructional makeup video, we apply a set of heuristics that combine computer vision techniques with transcript text analysis to automatically identify the fine-level action steps and group these steps by facial part to form the coarse-level events. We provide a voice-enabled, mixed-media UI to visualize the resulting hierarchy and allow users to efficiently navigate the tutorial (e.g., skip ahead, return to previous steps) at their own pace. Users can navigate the hierarchy at both the facial-part and action-step levels using click-based interactions and voice commands. We demonstrate the effectiveness of segmentation algorithms and the resulting mixed-media UI on a variety of input makeup videos. A user study shows that users prefer following instructional makeup videos in our mixed-media format to the standard video UI and that they find our format much easier to navigate.
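The two-level structure described above can be sketched in a few lines. This is an illustration only, not the paper's heuristics: it assumes the fine-level steps have already been detected and labeled with a facial part, and it assumes that a coarse-level event is a run of consecutive steps on the same facial part. The `Step` class, its fields, and the sample steps are hypothetical.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class Step:
    start_sec: float   # when the step begins in the video
    facial_part: str   # coarse-level label, e.g. "eyes"
    action: str        # fine-level action description


def build_hierarchy(steps):
    """Return [(facial_part, [actions...]), ...] in video order.

    Assumption for illustration only: consecutive steps that share a
    facial part are merged into one coarse-level event.
    """
    ordered = sorted(steps, key=lambda s: s.start_sec)
    return [
        (part, [s.action for s in run])
        for part, run in groupby(ordered, key=lambda s: s.facial_part)
    ]


# Hypothetical, already-labeled fine-level steps.
steps = [
    Step(40.0, "eyes", "blend eyeshadow"),
    Step(12.0, "eyes", "apply primer"),
    Step(95.0, "lips", "apply lipstick"),
]
hierarchy = build_hierarchy(steps)
print(hierarchy)
```

Each tuple corresponds to one coarse-level entry a viewer could jump to, with its fine-level actions listed underneath.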

September 8, 2015 / Last updated : March 25, 2023 irfan Ubicomp

Paper in Ubicomp 2015: "A Practical Approach for Recognizing Eating Moments with Wrist-Mounted Inertial Sensing"

Paper Abstract Recognizing when eating activities take place is one of the key challenges in automated food intake monitoring. Despite progress over the years, most proposed approaches have been largely impractical for everyday usage, requiring multiple on-body sensors or specialized devices such as neck collars for swallow detection. In this paper, we describe the implementation […]

September 14, 2013 / Last updated : February 21, 2021 irfan Ubicomp

Paper in ACM Ubicomp 2013 "Technological approaches for addressing privacy concerns when recognizing eating behaviors with wearable cameras"

Abstract First-person point-of-view (FPPOV) images taken by wearable cameras can be used to better understand people’s eating habits. Human computation is a way to provide effective analysis of FPPOV images in cases where algorithmic approaches currently fail. However, privacy is a serious concern. We provide a framework, the privacy-saliency matrix, for understanding […]

October 14, 2000 / Last updated : February 21, 2021 irfan Publications

Paper in IEEE Personal Communications (2000) on “Ubiquitous sensing for smart and aware environments”

Abstract As computing technology continues to become increasingly pervasive and ubiquitous, we envision the development of environments that can sense what we are doing and support our daily activities. In this article, we outline our efforts toward building such environments and discuss the importance of a sensing and signal-understanding infrastructure that leads to awareness of […]

October 12, 1997 / Last updated : October 12, 1997 irfan Affective Computing

Paper: PUI (1997) "Prosody Analysis for Speaker Affect Determination"

Andrew Gardner and Irfan Essa (1997) “Prosody Analysis for Speaker Affect Determination” In Proceedings of Perceptual User Interfaces Workshop (PUI 1997), Banff, Alberta, Canada, Oct 1997. Abstract Speech is a complex waveform containing verbal (e.g. phoneme, syllable, and word) and nonverbal (e.g. speaker identity, emotional state, and tone) information. Both the verbal and […]