We annotated 126 validation sequences of the Tracking Any Object (TAO) dataset with segmentation masks for video object segmentation. Additionally, we semi-automatically annotated all 500 training sequences while ensuring high quality (see the paper below for details).

Compared to existing VOS datasets, sequences in TAO-VOS are significantly longer, contain more objects per sequence, and span more distinct classes:

Performance on DAVIS and YouTube-VOS saturates, but not on TAO-VOS, which remains challenging:
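The benchmark scores referred to here are based on the standard VOS region-similarity measure J, i.e. the intersection-over-union (Jaccard index) between predicted and ground-truth masks. A minimal sketch of that metric (the function name is ours):

```python
import numpy as np

def region_similarity(pred, gt):
    """J measure used by DAVIS-style VOS benchmarks: intersection over
    union between the binary predicted mask and the ground-truth mask."""
    pred = np.asarray(pred).astype(bool)
    gt = np.asarray(gt).astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as a perfect match
    return np.logical_and(pred, gt).sum() / union
```

On full benchmarks this is averaged per object over all annotated frames of a sequence, then over all sequences.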

Note that both TAO and TAO-VOS are annotated at only 1 FPS. The annotations and qualitative results are visualized in the following video:
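Assuming the released annotations follow the DAVIS/YouTube-VOS convention of palette-indexed PNG masks (pixel value 0 for background, value k for object k) — this format is an assumption here, not stated above — one annotated frame can be split into per-object binary masks like this (function names are ours):

```python
import numpy as np
from PIL import Image

def split_objects(mask):
    """Split one annotation frame into per-object binary masks,
    assuming pixel value 0 = background and value k = object k."""
    mask = np.asarray(mask)
    return {int(k): mask == k for k in np.unique(mask) if k != 0}

def load_objects(png_path):
    """Load a palette-PNG annotation file and split it by object ID."""
    return split_objects(np.array(Image.open(png_path)))
```

At 1 FPS, consecutive annotated masks are a full second apart, so any evaluation or training code should match mask files to frames by timestamp or filename rather than assuming dense per-frame labels.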


To further increase the quality of the training annotations, we added additional manually annotated masks for the training set. These additional masks were not used in the paper but are included in the benchmark release.


Reducing the Annotation Effort for Video Object Segmentation Datasets
Paul Voigtlaender, Lishu Luo, Chun Yuan, Yong Jiang, Bastian Leibe
Accepted at WACV 2021



If you use this benchmark, please cite

@inproceedings{Voigtlaender21WACV,
  title={Reducing the Annotation Effort for Video Object Segmentation Datasets},
  author={Paul Voigtlaender and Lishu Luo and Chun Yuan and Yong Jiang and Bastian Leibe},
  booktitle={WACV},
  year={2021}
}
and also the original TAO paper
@inproceedings{Dave20ECCV,
  title={TAO: A Large-Scale Benchmark for Tracking Any Object},
  author={Achal Dave and Tarasha Khurana and Pavel Tokmakov and Cordelia Schmid and Deva Ramanan},
  booktitle={ECCV},
  year={2020}
}


We would like to thank the creators of the original datasets on which TAO-VOS is based: TAO, Charades, LaSOT, ArgoVerse, AVA, YFCC100M, BDD-100K, and HACS.


If you have questions, please contact Paul Voigtlaender via voigtlaender@vision.rwth-aachen.de

Visual Computing Institute, RWTH Aachen University