Paper / Citation
We propose to learn pixel-level segmentations of objects from weakly labeled (tagged) internet videos. Specifically, given a large collection of raw YouTube content, along with potentially noisy tags, our goal is to automatically generate spatiotemporal masks for each tagged object, such as a “dog”, without employing any pre-trained object detectors. We formulate this problem as learning weakly supervised classifiers for a set of independent spatiotemporal segments. The object seeds obtained using segment-level classifiers are further refined using graph cuts to generate high-precision object masks. Our results, obtained by training on a dataset of 20,000 YouTube videos weakly tagged into 15 classes, demonstrate the automatic extraction of pixel-level object masks. Evaluated against a ground-truthed subset of 50,000 frames with pixel-level annotations, we confirm that our proposed methods can learn good object masks just by watching YouTube.
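The seed-then-refine idea in the abstract can be sketched as a binary s-t min-cut problem: segment-level classifier scores act as unary costs, and a smoothness term between neighboring segments pulls uncertain segments toward the label of their neighbors. The following is a minimal toy illustration (not the paper's actual model): the scores, the 1-D neighbor layout, and the smoothness weight are all assumed for demonstration.

```python
from collections import deque

def min_cut_mask(scores, smooth):
    """Binary foreground/background labeling by s-t min cut (Edmonds-Karp).

    scores: per-segment classifier scores in [0, 1] ("is this the object?").
    smooth: smoothness weight between adjacent segments (Potts-style penalty).
    """
    n = len(scores)
    SRC, SINK = n, n + 1
    cap = [[0.0] * (n + 2) for _ in range(n + 2)]
    for i, s in enumerate(scores):
        cap[SRC][i] = s          # cut cost if segment i is labeled background
        cap[i][SINK] = 1.0 - s   # cut cost if segment i is labeled foreground
    for i in range(n - 1):       # penalty when neighbors get different labels
        cap[i][i + 1] = smooth
        cap[i + 1][i] = smooth
    # Edmonds-Karp: push flow along shortest augmenting paths until none remain.
    while True:
        parent = {SRC: None}
        q = deque([SRC])
        while q and SINK not in parent:
            u = q.popleft()
            for v in range(n + 2):
                if v not in parent and cap[u][v] > 1e-12:
                    parent[v] = u
                    q.append(v)
        if SINK not in parent:
            break
        path, v = [], SINK
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(cap[u][v] for u, v in path)
        for u, v in path:        # update residual capacities
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck
    # Foreground = segments still reachable from SRC in the residual graph.
    seen = {SRC}
    q = deque([SRC])
    while q:
        u = q.popleft()
        for v in range(n + 2):
            if v not in seen and cap[u][v] > 1e-12:
                seen.add(v)
                q.append(v)
    return [1 if i in seen else 0 for i in range(n)]

# Toy scores (assumed): naive thresholding at 0.5 would drop segment 2,
# but the smoothness term pulls it into the object mask.
scores = [0.9, 0.8, 0.45, 0.7, 0.1, 0.2]
print(min_cut_mask(scores, smooth=0.5))  # → [1, 1, 1, 1, 0, 0]
```

In the paper's setting the unary terms would come from the weakly supervised segment classifiers and the neighborhood structure from spatiotemporal adjacency; the toy chain above only illustrates how graph-cut refinement turns noisy seeds into a coherent mask.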
- Presented at: ECCV 2012 Workshop on Web-scale Vision and Social Media, October 7–12, 2012, Florence, Italy.
- Awarded the Best Paper Award!