ACM MM Grand Challenge on

Large-scale Human-centric Video Analysis in Complex Events


Overview

This grand challenge aims to advance large-scale human-centric video analysis in complex events using multimedia techniques. We propose the largest existing dataset, named Human-in-Events (HiEve), for understanding human motion, pose, and action in a variety of realistic events, especially crowded and complex events. Four challenging tasks are established on our dataset, encouraging researchers to address very challenging and realistic problems in human-centric analysis. Our challenge will benefit research in a wide range of multimedia and computer vision areas, including multimedia content analysis.

At the end of the challenge, all teams will be ranked based on objective evaluation. The top-3 performing teams in each track will receive certificates and awards. High-performing teams will also be invited to submit challenge papers (4 pages) and present their solutions during the conference.


Dataset

In this grand challenge, we focus on very challenging and realistic tasks of human-centric analysis in various crowded and complex events, including getting on/off the subway, collisions, fighting, and earthquake escape (cf. Figure 1). To the best of our knowledge, few existing human analysis approaches report their performance under such complex events. With this in mind, we propose a dataset, named Human-in-Events (HiEve), with large-scale and densely annotated labels covering a wide range of tasks in human-centric analysis (1M+ poses and 56k+ action labels, cf. Table 1).

Compared with related workshops and challenges, our challenge has the following unique characteristics:


- Our challenge covers a wide range of human-centric understanding tasks including motion, pose, and action, while previous workshops focus only on a subset of these tasks (cf. Table 1).

- Our challenge has a substantially larger data scale, including the currently largest number of poses (>1M), the largest number of complex-event action labels (>56k), and one of the largest numbers of long-term trajectories (average trajectory length >480).

- Our challenge focuses on challenging scenes from various crowded and complex events (such as dining, earthquake escape, getting off the subway, and collisions, cf. Figure 1), while related workshops mostly address normal or relatively simple scenes.

Figure 1: Samples from our dataset of various complex events such as (a) dining in a canteen, (b) earthquake escape, (c) getting off a train, and (d) bike collision. Our dataset also contains densely-annotated labels for a wide range of vision tasks including person identification and tracking (a, b), pose estimation and tracking (c), and person-level action recognition (d).
           
Table 1: Comparison among different datasets. "NA" indicates not available, "~" denotes an approximate value, "traj." means trajectory, and "avg" indicates average trajectory length.

Dataset      | # pose    | # box     | # traj. (avg) | # action | pose track | surveillance | complex events
MSCOCO       | 105,698   | 105,698   | NA            | NA       | ×          | ×            | ×
MPII         | 14,993    | 14,993    | NA            | 410      | ×          | ×            | ×
CrowdPose    | ~80,000   | ~80,000   | NA            | NA       | ×          | ×            | ×
PoseTrack    | ~267,000  | ~26,000   | 5,245 (49)    | NA       | √          | ×            | ×
MOT16        | NA        | 292,733   | 1,276 (229)   | NA       | ×          | √            | ×
MOT17        | NA        | 901,119   | 3,993 (226)   | NA       | ×          | √            | ×
MOT20        | NA        | 1,652,040 | 3,457 (478)   | NA       | ×          | √            | ×
Avenue       | NA        | NA        | NA            | 15       | ×          | √            | ×
UCF-Crime    | NA        | NA        | NA            | 1,900    | ×          | √            | √
HiEve (Ours) | 1,099,357 | 1,302,481 | 2,687 (485)   | 56,643   | √          | √            | √
• Format Description

Please refer to this page for the detailed format description of our dataset.

• How To

• Register an account for evaluation and refer to the Tracks page to learn about the datasets and tracks of the challenge.

• Download the datasets from the Data page to train your models; code for evaluating your model performance on the training set can be obtained here.

• Submit results on the test set to our evaluation server.

• After a while, check your evaluation results on the LeaderBoard page.

More information and rules can be found here.

Track-1: Multi-person Motion Tracking in Complex Events

This track aims to estimate the location and trajectory of each identity throughout a video. For the human detection and tracking annotation, we assign each person in each frame a bounding box and an ID that is unique throughout the video. Compared with existing multiple object tracking challenges, this track features longer average trajectory lengths, which makes the tracking task more difficult. In addition, our dataset captures videos from 30+ different scenarios, including subway stations, streets, and dining halls, making the tracking problem even more challenging.
Specifically, we provide two sub-tracks:

(1) Public Detection: In this sub-track, competitors may only use the public object detection results provided by us, which are obtained with Faster R-CNN.

(2) Private Detection: Participants in this sub-track may generate their own detection bounding boxes with any detector.


• Dataset Description

In this track, the bounding box and ID of each pedestrian are annotated, as in MOT. Annotation is provided for every frame.
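
For orientation only: MOT-style box annotations and tracking results are usually plain CSV files with one box per line. The exact field order and coordinate convention for our dataset are defined on the Format Description page, so the line below is just an illustrative placeholder following the common MOTChallenge layout:

frame, id, bb_left, bb_top, bb_width, bb_height, conf, x, y, z
1, 3, 794.27, 247.59, 71.25, 174.00, 1, -1, -1, -1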


• Submission Format

Please refer to the Format Description page for the detailed format description of our dataset. Train your models on the provided training set; the testing set will be released in the test stage.


• Evaluation Metrics

- MOTA and MOTP. MOTA accounts for false positives, missed targets, and identity switches. MOTP measures the similarity between predicted trajectories and the ground truth. Please see [1] for details. Our final performance ranking in this track is based on MOTA.

- w-MOTA. This metric is computed in a similar manner to MOTA, except that we assign a higher weight γ to ID switches happening in disconnected tracks. It is mainly proposed to evaluate the performance of submitted algorithms on hard, disconnected cases. Please refer to our paper for more details.

- ID F1 Score. The ratio of correctly identified detections over the average number of ground-truth and computed detections. Please see [2] for details.

- ID Sw. The total number of identity switches. Please see [3] for details.

- ID Sw-DT. The total number of identity switches happening in disconnected tracks. This value is used when computing w-MOTA. Please refer to our paper for more details.

Challengers can find the reference evaluation code here.
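
For intuition, here is a minimal Python sketch of how MOTA and w-MOTA can be aggregated once per-frame matching has produced the usual counts. The weight gamma below is only a placeholder, and the official definitions follow [1] and our paper, so please rely on the reference evaluation code for actual scoring.

def mota(num_fp, num_fn, num_idsw, num_gt):
    # MOTA = 1 - (FP + FN + ID switches) / total number of ground-truth boxes
    return 1.0 - (num_fp + num_fn + num_idsw) / float(num_gt)

def w_mota(num_fp, num_fn, num_idsw, num_idsw_dt, num_gt, gamma=2.0):
    # w-MOTA up-weights the ID switches that happen in disconnected tracks
    # (ID Sw-DT) by gamma; gamma=2.0 is a placeholder value for illustration.
    weighted_idsw = (num_idsw - num_idsw_dt) + gamma * num_idsw_dt
    return 1.0 - (num_fp + num_fn + weighted_idsw) / float(num_gt)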

Track-2: Crowd Pose Estimation in Complex Events

This task aims at estimating human poses in each frame. Compared with existing pose estimation challenges, our dataset is much larger in scale and includes more frequent occlusions. Moreover, it involves more real-scene pose patterns in various complex events.


• Dataset Description

In this track, the skeleton keypoints of pedestrians are annotated. Annotation is performed per frame.


• Submission Format

Please refer to the Format Description page for the detailed format description of our dataset. Train your models on the provided training set; the testing set will be released in the test stage.


• Evaluation Metrics

- AP@α is a widely used metric for keypoint estimation. If a predicted pose proposal has the highest PCKp@α (Percentage of Correct Keypoints, where α is the distance threshold that determines whether a detected keypoint matches an annotated keypoint) with a certain ground-truth pose, it is taken as a true positive; otherwise it is regarded as a false positive. The AP value is the area under the precision-recall curve.

- w-AP@α To prevent methods from focusing only on simple and uncrowded scenarios in the dataset, we assign larger weights to a test image during evaluation if it has (1) a higher Crowd Index or (2) anomalous behaviors (e.g., fighting, fall-over, crouching-bowing). The AP values calculated with these weights are called weighted AP (w-AP). Please refer to our paper for more details about weighted AP.

- AP@avg We take the average of AP@0.5, AP@0.75, and AP@0.9 as an overall measurement of keypoint estimation results, where 0.5, 0.75, and 0.9 are the distance thresholds used for computing PCKp.

- w-AP@avg We take the average of w-AP@0.5, w-AP@0.75, and w-AP@0.9 as an overall measurement of keypoint estimation results on weighted video frames, where 0.5, 0.75, and 0.9 are the distance thresholds used for computing PCKp. Our final performance ranking in this track is based on w-AP@avg.

Challengers can find the reference evaluation code here.
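
As an informal illustration of PCKp@α (not the official implementation), the sketch below marks a predicted keypoint as correct when it falls within α times a per-person normalization scale of the annotated keypoint. The choice of normalization (e.g., head or torso size) is an assumption made here for illustration; the matching of pose proposals to ground truth and the precision-recall integration for AP are handled by the reference evaluation code.

import numpy as np

def pckp(pred_kpts, gt_kpts, gt_visible, scale, alpha=0.5):
    # pred_kpts, gt_kpts: (K, 2) arrays of keypoint coordinates;
    # gt_visible: (K,) boolean mask of annotated keypoints;
    # scale: per-person normalization factor (an assumption for illustration).
    dists = np.linalg.norm(pred_kpts - gt_kpts, axis=1)
    correct = (dists <= alpha * scale) & gt_visible
    return correct.sum() / max(gt_visible.sum(), 1)

Each predicted pose is then matched to the ground-truth pose with which it achieves the highest PCKp@α, and AP@α is the area under the resulting precision-recall curve.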

Track-3: Crowd Pose Tracking in Complex Events

This task aims at estimating human poses in each frame and associating keypoints of the same identity across frames. Compared with existing pose tracking challenges, our dataset is much larger in scale and includes more frequent occlusions. Moreover, it involves more real-scene pose patterns in various complex events.


• Dataset Description

In this track, the skeleton keypoints and the skeleton ID of each pedestrian are annotated. Annotation is performed per frame.


• Submission Format

Please refer to the Format Description page for the detailed format description of our dataset. Train your models on the provided training set; the testing set will be released in the test stage.


• Evaluation Metrics

- MOTA and MOTP. MOTA accounts for false positives, missed targets, and identity switches. MOTP measures the similarity between predicted trajectories and the ground truth. Please see [1] for details. Our final performance ranking in this track is based on MOTA.

- AP is a widely used metric for keypoint estimation. If a predicted pose proposal has the highest PCKp (Percentage of Correct Keypoints) with a certain ground-truth pose, it is taken as a true positive; otherwise it is regarded as a false positive. The AP value is the area under the precision-recall curve.

Challengers can find the reference evaluation code here.

Track-4: Person-level Action Recognition in Complex Events

The action recognition task requires participants to predict an action label for each individual in labeled frames. In our dataset, we annotate the actions of all individuals every 20 frames in a video. For group actions, we assign the action label to each group member involved in the event. In total, we define 15 action categories.


• Dataset Description

In this track, the annotation includes the bounding box and action class of each pedestrian. Annotation is provided every 20 frames.


• Submission Format

Please refer to the Format Description page for the detailed format description of our dataset. Submissions should be in AVA format. Train your models on the provided training set; the testing set will be released in the test stage.
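
As a rough sketch only: AVA-style result files are CSV lines of the form video_id, frame_timestamp, x1, y1, x2, y2, action_id, score (with box coordinates normalized to [0, 1] in the original AVA convention). The exact fields, ID conventions, and coordinate ranges expected by our evaluation server are defined on the Format Description page; the file name and values below are placeholders.

import csv

# Hypothetical detections: (video_id, frame_timestamp, normalized box
# [x1, y1, x2, y2], action_id, confidence score); all values are placeholders.
detections = [
    ("video_001", 120, (0.21, 0.35, 0.48, 0.91), 3, 0.87),
]

with open("track4_results.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for vid, ts, (x1, y1, x2, y2), action_id, score in detections:
        writer.writerow([vid, ts, x1, y1, x2, y2, action_id, score])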


• Evaluation Metrics

- f-mAP@α The frame mAP (f-mAP) is a common metric for evaluating spatial action detection accuracy on a single frame. Specifically, each prediction consists of a bounding box and a predicted action label. If the box has an overlap larger than a threshold α with an unmatched ground-truth box of the same label, the prediction is counted as a true positive; otherwise it is a false positive. This process is conducted on each frame annotated with action boxes. The AP value for each label is computed as the area under its precision-recall curve, and the mean AP is obtained by averaging the AP values over all labels.

- wf-mAP@α Considering the unbalanced distribution of action categories in the dataset, we assign smaller weights to test samples belonging to dominant categories. In addition, we assign larger weights to frames from crowded and occluded scenarios to encourage models that perform better in complex scenes. Similar to w-AP, the frame mAP value calculated with these weights is called weighted frame mAP (wf-mAP for short). Please refer to our paper for more details about wf-mAP.

- f-mAP@avg We report f-mAP@0.5, f-mAP@0.6, and f-mAP@0.75, where 0.5, 0.6, and 0.75 are the IoU thresholds used to determine true/false positives, and take their mean as an overall measurement, denoted f-mAP@avg.

- wf-mAP@avg Similarly, we report wf-mAP@0.5, wf-mAP@0.6, and wf-mAP@0.75 and take their mean as an overall measurement, denoted wf-mAP@avg. Our final performance ranking in this track is based on wf-mAP@avg.

Challengers can find the reference evaluation code here.
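
For intuition, the sketch below shows the greedy matching step behind f-mAP@α on one frame: predictions are processed in descending score order and matched to an unmatched ground-truth box of the same label when their IoU reaches α, and the resulting true/false-positive flags feed the per-label precision-recall curves. This is illustrative only; the reference evaluation code defines the official procedure.

def box_iou(box_a, box_b):
    # Boxes are (x1, y1, x2, y2) in pixel coordinates.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def match_frame(preds, gts, alpha=0.5):
    # preds: list of (box, label, score); gts: list of (box, label).
    # Returns one True (true positive) / False (false positive) flag per
    # prediction, which is what the precision-recall curve is built from.
    matched = [False] * len(gts)
    flags = []
    for box, label, _score in sorted(preds, key=lambda p: -p[2]):
        best_idx, best_iou = -1, alpha
        for i, (gt_box, gt_label) in enumerate(gts):
            if matched[i] or gt_label != label:
                continue
            overlap = box_iou(box, gt_box)
            if overlap >= best_iou:
                best_idx, best_iou = i, overlap
        if best_idx >= 0:
            matched[best_idx] = True
        flags.append(best_idx >= 0)
    return flags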

Reference Code for Evaluation Metrics

For challengers who want to evaluate their models' performance on the training set on their own machines, the following code repositories may be helpful. They are the prototypes of our evaluation scripts and can be used to compute common metrics such as MOTA and f-mAP.

- Reference evaluation code for multiple object tracking.

- Reference evaluation code for pose estimation and pose tracking.

- Reference evaluation code for action detection.

Paper bib

The paper for our dataset has been released. Please click here to download it.

References

[1] Bernardin, K. & Stiefelhagen, R. Evaluating Multiple Object Tracking Performance: The CLEAR MOT Metrics. EURASIP Journal on Image and Video Processing, 2008(1):1-10, 2008.

[2] Ristani, E., Solera, F., Zou, R., Cucchiara, R. & Tomasi, C. Performance Measures and a Data Set for Multi-Target, Multi-Camera Tracking. In ECCV workshop on Benchmarking Multi-Target Tracking, 2016.

[3] Li, Y., Huang, C. & Nevatia, R. Learning to Associate: HybridBoosted Multi-Target Tracker for Crowded Scene. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2009.

[4] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: European Conference on Computer Vision. pp. 740–755. Springer (2014)

[5] Andriluka, M., Pishchulin, L., Gehler, P., Schiele, B.: 2d human pose estimation: New benchmark and state of the art analysis. In: Proceedings of the IEEE Conference on computer Vision and Pattern Recognition. pp. 3686–3693 (2014)

[6] Li, J., Wang, C., Zhu, H., Mao, Y., Fang, H.S., Lu, C.: Crowdpose: Efficient crowded scenes pose estimation and a new benchmark. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 10863–10872(2019)

[7] Iqbal, U., Milan, A., Gall, J.: Posetrack: Joint multi-person pose estimation and tracking. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2011–2020 (2017)

[8] Dendorfer, P., Rezatofighi, H., Milan, A., Shi, J., Cremers, D., Reid, I., Roth, S., Schindler, K., Leal-Taixé, L.: MOT20: A benchmark for multi object tracking in crowded scenes. arXiv preprint arXiv:2003.09003 (2020)

[9] Lu, C., Shi, J., Jia, J.: Abnormal event detection at 150 fps in matlab. In: Proceedings of the IEEE international conference on computer vision. pp. 2720–2727 (2013)

[10] Sultani, W., Chen, C., Shah, M.: Real-world anomaly detection in surveillance videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 6479–6488 (2018)