Tracking Objects as Points
Xingyi Zhou, Vladlen Koltun, Philipp Krähenbühl
UT Austin & Intel Labs
Early trackers
[Link]
Current frameworks: Tracking-after-detection
[Figure: Frame t-1 and Frame t]
Tang et al. 2017: Re-identification features, pose features
Xu et al. 2019: Spatial-temporal trajectories
Simultaneous detection and tracking
[Figure: Frame t-1 and Frame t]
Bergmann et al. 2019 Tracking without bells and whistles
[Figure: method overview. Inputs: Frame t, Frame t-1, Tracks t-1. Deep Network. Outputs: Detections t, Offsets t → t-1.]
Advantages
• Simplified tracking conditioned detection.
Conditioned detection
• Ours: implicit prior heatmap.
• Tracktor [Bergmann et al. 2019]: explicit region proposal.
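A minimal sketch of what an implicit prior heatmap could look like: the centers of the tracks from frame t-1 are rendered as Gaussian blobs in a single-channel map that can be passed to the network alongside the two frames. The function name `render_prior_heatmap` and the fixed `sigma` are illustrative assumptions, not values taken from the slides, and the sketch uses only the standard library to stay self-contained.

```python
import math


def render_prior_heatmap(centers, height, width, sigma=2.0):
    """Render previous-frame track centers as a single-channel heatmap.

    centers: iterable of (x, y) pixel coordinates of tracks at frame t-1.
    sigma:   assumed fixed blob radius (hypothetical; not from the slides).
    Returns a height x width list of lists with values in [0, 1].
    """
    heatmap = [[0.0] * width for _ in range(height)]
    for cx, cy in centers:
        for y in range(height):
            for x in range(width):
                d2 = (x - cx) ** 2 + (y - cy) ** 2
                value = math.exp(-d2 / (2.0 * sigma ** 2))
                # Overlapping blobs keep the element-wise maximum.
                if value > heatmap[y][x]:
                    heatmap[y][x] = value
    return heatmap


# One prior track centered at x=3, y=2 in a 5 x 8 map.
prior = render_prior_heatmap([(3, 2)], height=5, width=8)
```

With no prior tracks the map is all zeros, so under this sketch the same network input format also covers the first frame of a sequence.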
• Simplified matching.
Point-based matching
• Ours: greedy matching by point distance.
• Prior works: Hungarian algorithm, separate motion model, additional association features.
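Greedy matching by point distance can be sketched as follows. Each detection center at frame t is shifted by its predicted offset to estimate where the object was at frame t-1, then assigned to the closest unmatched previous track. The function name `greedy_match`, the `max_dist` threshold, the confidence-ordered processing, and the sign convention (offset added to the current center) are assumptions made for illustration.

```python
import math


def greedy_match(detections, offsets, prev_tracks, max_dist):
    """Assign a track id to every detection by greedy point-distance matching.

    detections:  list of (x, y, score) centers at frame t.
    offsets:     list of (dx, dy), assumed to point from frame t to frame t-1.
    prev_tracks: dict mapping track id -> (x, y) center at frame t-1.
    max_dist:    hypothetical matching radius in pixels.
    Returns a list of track ids, one per detection, in input order.
    """
    next_id = max(prev_tracks, default=-1) + 1
    unmatched = dict(prev_tracks)
    assigned = [None] * len(detections)
    # Process the most confident detections first.
    order = sorted(range(len(detections)),
                   key=lambda i: detections[i][2], reverse=True)
    for i in order:
        x, y, _ = detections[i]
        dx, dy = offsets[i]
        px, py = x + dx, y + dy  # estimated position at frame t-1
        best_id, best_dist = None, max_dist
        for track_id, (tx, ty) in unmatched.items():
            dist = math.hypot(px - tx, py - ty)
            if dist <= best_dist:
                best_id, best_dist = track_id, dist
        if best_id is None:
            # No previous track within the radius: start a new track.
            best_id = next_id
            next_id += 1
        else:
            del unmatched[best_id]
        assigned[i] = best_id
    return assigned


prev_tracks = {0: (10.0, 10.0), 1: (50.0, 50.0)}
detections = [(12.0, 10.0, 0.9), (52.0, 50.0, 0.8), (100.0, 100.0, 0.7)]
offsets = [(-2.0, 0.0), (-2.0, 0.0), (0.0, 0.0)]
ids = greedy_match(detections, offsets, prev_tracks, max_dist=5.0)
```

In this usage example the first two detections inherit ids 0 and 1, and the third, which has no previous track within the radius, starts a new track. No Hungarian solver, motion model, or appearance feature is involved in the sketch.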
• Simplified training on videos.
[Figure: Frame t-1 and Frame t]
Results
Results - KITTI
Extend to monocular 3D tracking
Results - monocular 3D tracking on nuScenes
Ablation studies
[Figure: bar charts comparing detection only, w/o offset, w/o heatmap, and Ours on MOT17 (30 FPS), KITTI (10 FPS), and nuScenes (2 FPS). Axis ranges: 63 to 67 (MOT17), 84 to 89 (KITTI), 0 to 30 (nuScenes). Highlighted comparisons: without vs. with heatmap; without vs. with offset.]
Ablation studies - motion models
Trained on image data only
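One plausible way to train on still images, sketched under assumptions: a pseudo previous frame is simulated by randomly scaling and translating the image, and the same transform applied to the object centers yields ground-truth offsets. The function name `simulate_previous_centers` and the ranges `max_shift` and `max_scale` are hypothetical; the slides do not give these details.

```python
import random


def simulate_previous_centers(centers, image_size,
                              max_shift=0.05, max_scale=0.05, rng=None):
    """Simulate where objects would sit in a pseudo previous frame.

    centers:    list of (x, y) object centers in the still image.
    image_size: (width, height) of the image in pixels.
    max_shift:  assumed maximum translation, as a fraction of image size.
    max_scale:  assumed maximum relative scale change.
    Returns (prev_centers, offsets), where each offset is prev minus current.
    """
    rng = rng or random.Random()
    width, height = image_size
    scale = 1.0 + rng.uniform(-max_scale, max_scale)
    shift_x = rng.uniform(-max_shift, max_shift) * width
    shift_y = rng.uniform(-max_shift, max_shift) * height
    cx0, cy0 = width / 2.0, height / 2.0
    prev_centers, offsets = [], []
    for x, y in centers:
        # Scale about the image center, then translate.
        px = (x - cx0) * scale + cx0 + shift_x
        py = (y - cy0) * scale + cy0 + shift_y
        prev_centers.append((px, py))
        offsets.append((px - x, py - y))
    return prev_centers, offsets


prev_centers, offsets = simulate_previous_centers(
    [(100.0, 50.0), (200.0, 150.0)], image_size=(400, 300),
    rng=random.Random(0))
```

The image itself would be warped with the same scale and shift to produce the pseudo previous frame; that step is omitted here to keep the sketch free of image-library dependencies.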
Code is available!
[Link]