Profile

Dr. Paul Voigtlaender
Email: voigtlaender@vision.rwth-aachen.de

Teaching

Einführung in die Informatik (Introduction to Computer Science), WS 19/20

Machine Learning, WS 18/19

Machine Learning, WS 17/18

Students

(hiwi = student assistant)

Current

Jens Piekenbrinck (master thesis)

Past

Ali Mohamed Fatouh Ahmed (master thesis)
Bruno Vollmer (master thesis)
Kinan Halloum (master thesis)
Bin Huang (hiwi)
Berin Gnana (hiwi)
Blin Beqa (hiwi)
Bojana Stefanovska (hiwi)
Rohit Ravi (hiwi)
Umair Sabir (hiwi)
Jonathan Schieren (hiwi)
Valentin Steiner (hiwi)
Sourabh Swain (hiwi)
Michael Krause (master thesis)
Jonathon Luiten (master thesis)
Sabarinath Mahadevan (master thesis)
Hendrik Gruß (master thesis)
Valentin Steiner (bachelor thesis)

Publications

Point-VOS: Pointing Up Video Object Segmentation

Idil Esen Zulfikar*, Sabarinath Mahadevan*, Paul Voigtlaender*, Bastian Leibe

IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2024

Current state-of-the-art Video Object Segmentation (VOS) methods rely on dense per-object mask annotations both during training and testing. This requires time-consuming and costly video annotation mechanisms. We propose a novel Point-VOS task with a spatio-temporally sparse point-wise annotation scheme that substantially reduces the annotation effort. We apply our annotation scheme to two large-scale video datasets with text descriptions and annotate over 19M points across 133K objects in 32K videos. Based on our annotations, we propose a new Point-VOS benchmark, and a corresponding point-based training mechanism, which we use to establish strong baseline results. We show that existing VOS methods can easily be adapted to leverage our point annotations during training, and can achieve results close to the fully-supervised performance when trained on pseudo-masks generated from these points. In addition, we show that our data can be used to improve models that connect vision and language, by evaluating it on the Video Narrative Grounding (VNG) task. We will make our code and annotations available at https://pointvos.github.io.

Downloads: Project Page arXiv

BURST: A Benchmark for Unifying Object Recognition, Segmentation and Tracking in Video

Ali Athar, Jonathon Luiten, Paul Voigtlaender, Tarasha Khurana, Achal Dave, Bastian Leibe, Deva Ramanan

Winter Conference on Computer Vision (WACV) 2023

Multiple existing benchmarks involve tracking and segmenting objects in video e.g., Video Object Segmentation (VOS) and Multi-Object Tracking and Segmentation (MOTS), but there is little interaction between them due to the use of disparate benchmark datasets and metrics (e.g. J&F, mAP, sMOTSA). As a result, published works usually target a particular benchmark, and are not easily comparable to each another. We believe that the development of generalized methods that can tackle multiple tasks requires greater cohesion among these research sub-communities. In this paper, we aim to facilitate this by proposing BURST, a dataset which contains thousands of diverse videos with high-quality object masks, and an associated benchmark with six tasks involving object tracking and segmentation in video. All tasks are evaluated using the same data and comparable metrics, which enables researchers to consider them in unison, and hence, more effectively pool knowledge from different methods across different tasks. Additionally, we demonstrate several baselines for all tasks and show that approaches for one task can be applied to another with a quantifiable and explainable performance difference.

» Show Videos
» Show BibTeX

@inproceedings{athar2023burst,
title={BURST: A Benchmark for Unifying Object Recognition, Segmentation and Tracking in Video},
author={Athar, Ali and Luiten, Jonathon and Voigtlaender, Paul and Khurana, Tarasha and Dave, Achal and Leibe, Bastian and Ramanan, Deva},
booktitle={WACV},
year={2023}
}

Downloads: Paper GitHub Page

Reducing the Annotation Effort for Video Object Segmentation Datasets

Paul Voigtlaender, Lishu Luo, Chun Yuan, Yong Jiang, Bastian Leibe

2021 Winter Conference on Applications of Computer Vision (WACV ’21)

For further progress in video object segmentation (VOS), larger, more diverse, and more challenging datasets will be necessary. However, densely labeling every frame with pixel masks does not scale to large datasets. We use a deep convolutional network to automatically create pseudo-labels on a pixel level from much cheaper bounding box annotations and investigate how far such pseudo-labels can carry us for training state-of-the-art VOS approaches. A very encouraging result of our study is that adding a manually annotated mask in only a single video frame for each object is sufficient to generate pseudo-labels which can be used to train a VOS method to reach almost the same performance level as when training with fully segmented videos. We use this workflow to create pixel pseudo-labels for the training set of the challenging tracking dataset TAO, and we manually annotate a subset of the validation set. Together, we obtain the new TAO-VOS benchmark, which we make publicly available at http://www.vision.rwth-aachen.de/page/taovos. While the performance of state-of-the-art methods on existing datasets starts to saturate, TAO-VOS remains very challenging for current algorithms and reveals their shortcomings.