Challenge and Dataset on Large-scale Human-centric Video Analysis in Complex Events (HiEve)


Overview

This grand challenge aims to advance large-scale human-centric video analysis in complex events using multimedia techniques. We propose the largest existing dataset (named Human-in-Events, or HiEve) for understanding human motion, pose, and action in a variety of realistic events, especially crowded & complex events. Four challenging tasks are established on our dataset, encouraging researchers to address very challenging and realistic problems in human-centric analysis. The challenge will benefit research in a wide range of multimedia and computer vision areas, including multimedia content analysis.


Dataset

In this grand challenge, we focus on very challenging and realistic tasks of human-centric analysis in various crowded & complex events, including getting on/off the subway, collisions, fighting, and earthquake escape (cf. Figure 1). To the best of our knowledge, few existing human analysis approaches report their performance under such complex events. With this consideration, we further propose a dataset (named Human-in-Events, or HiEve) with large-scale and densely-annotated labels covering a wide range of tasks in human-centric analysis (1M+ poses and 56k+ action labels, cf. Table 1).

Compared with related workshops & challenges, our challenge has the following unique characteristics:


• Our challenge covers a wide range of human-centric understanding tasks including motion, pose, and action, while previous workshops focus only on a subset of these tasks (cf. Table 1).

• Our challenge has a substantially larger data scale, including the currently largest number of poses (>1M), the largest number of complex-event action labels (>56k), and one of the largest numbers of long-term trajectories (average trajectory length >480 frames).

• Our challenge focuses on challenging scenes from various crowded & complex events (such as dining, earthquake escape, getting off the subway, and collisions, cf. Figure 1), while the related workshops mostly address normal or relatively simple scenes.

Figure 1: Samples from our dataset covering various complex events, such as (a) dining in a canteen, (b) earthquake escape, (c) getting off a train, and (d) bike collision. Our dataset also contains densely-annotated labels for a wide range of vision tasks, including person identification and tracking (a, b), pose estimation and tracking (c), and person-level action recognition (d).
Table 1: Comparison among different datasets. "NA" indicates not available, "~" denotes an approximate value, "traj." means trajectory, and "avg" indicates the average trajectory length.

Dataset      | # pose    | # box     | # traj. (avg) | # action | pose track | surveillance | complex events
-------------|-----------|-----------|---------------|----------|------------|--------------|---------------
MSCOCO       | 105,698   | 105,698   | NA            | NA       | ×          | ×            | ×
MPII         | 14,993    | 14,993    | NA            | 410      | ×          | ×            | ×
CrowdPose    | ~80,000   | ~80,000   | NA            | NA       | ×          | ×            | ×
PoseTrack    | ~267,000  | ~26,000   | 5,245 (49)    | NA       | √          | ×            | ×
MOT16        | NA        | 292,733   | 1,276 (229)   | NA       | ×          | √            | ×
MOT17        | NA        | 901,119   | 3,993 (226)   | NA       | ×          | √            | ×
MOT20        | NA        | 1,652,040 | 3,457 (478)   | NA       | ×          | √            | ×
Avenue       | NA        | NA        | NA            | 15       | ×          | √            | ×
UCF-Crime    | NA        | NA        | NA            | 1,900    | ×          | √            | √
HiEve (Ours) | 1,099,357 | 1,302,481 | 2,687 (485)   | 56,643   | √          | √            | √
• Format Description

Please refer to this page for the detailed format description of our dataset.

• How To

• Register an account for evaluation and refer to the Tracks page to learn about the datasets and tracks of the challenge.

• Download the datasets from the Data page to train your models; code for evaluating your model's performance on the training set can be obtained here.

• Submit your results on the test set to our evaluation server.

• After a while, check your evaluation results on the LeaderBoard page.

More information and rules can be found here.

Track-1: Multi-person Motion Tracking in Complex Events

This track aims to estimate the location and the corresponding trajectory of each identity throughout a video. For the human detection and tracking annotation, we assign each person in each frame a bounding box and an ID that is unique throughout the video. Compared with existing multiple object tracking challenges, this track has a longer average trajectory length, which makes tracking more difficult. Moreover, our dataset captures videos in 30+ different scenarios, including subway stations, streets, and dining halls, making the tracking problem even more challenging.
Specifically, we provide two sub-tracks:

(1) Public Detection: In this sub-track, competitors may only use the public object detection results provided by us, which are obtained via Faster R-CNN.

(2) Private Detection: Participants in this sub-track may generate their own detection bounding boxes with any detector.


• Dataset Description

In this track, the bounding box and ID of each pedestrian are annotated, as in MOT. Annotation is performed per frame.


• Submission Format

Please refer to the Format Description page for the detailed format description of our dataset. Train with the given training set; the test set will be provided in the test stage.


• Evaluation Metrics

- MOTA and MOTP. MOTA accounts for false positives, missed targets, and identity switches; MOTP measures how closely the predicted trajectories match the ground truth. Please see [1] for details. The final performance ranking in this track is based on MOTA.

- w-MOTA. This metric is computed in the same manner as MOTA, except that a higher weight γ is assigned to identity switches occurring in disconnected tracks. It is mainly intended to evaluate how well submitted algorithms handle these hard disconnected cases. Please refer to our paper for more details.

- ID F1 Score. The ratio of correctly identified detections over the average number of ground-truth and computed detections. Please see [2] for details.

- ID Sw. The total number of identity switches. Please see [3] for details.

- ID Sw-DT. The total number of identity switches occurring in disconnected tracks. This value is used when computing w-MOTA. Please refer to our paper for more details.

Challengers can refer to the reference evaluation code here. A minimal sketch of the MOTA/w-MOTA computation is also given below.
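As a rough illustration (not the official evaluation script), the following sketch shows how MOTA, and the weighted variant described above, could be computed once the error counts have been accumulated by a CLEAR-MOT style matcher [1]; the function names, the exact form of the w-MOTA weighting, and the value of γ are assumptions.

```python
# Minimal sketch: MOTA / w-MOTA from accumulated error counts.
# Counts are assumed to come from a CLEAR-MOT style matcher [1];
# all names and the gamma value below are illustrative only.

def mota(num_fp: int, num_fn: int, num_idsw: int, num_gt: int) -> float:
    """Standard MOTA: 1 - (FP + FN + IDSW) / GT."""
    return 1.0 - (num_fp + num_fn + num_idsw) / num_gt


def w_mota(num_fp: int, num_fn: int, num_idsw: int, num_idsw_dt: int,
           num_gt: int, gamma: float = 2.0) -> float:
    """w-MOTA sketch: identity switches on disconnected tracks (ID Sw-DT)
    receive a higher weight gamma, following the track description; the
    exact formula and gamma value are defined in the HiEve paper."""
    weighted_idsw = (num_idsw - num_idsw_dt) + gamma * num_idsw_dt
    return 1.0 - (num_fp + num_fn + weighted_idsw) / num_gt


if __name__ == "__main__":
    # Toy numbers, for illustration only.
    print(mota(num_fp=120, num_fn=300, num_idsw=25, num_gt=5000))
    print(w_mota(num_fp=120, num_fn=300, num_idsw=25, num_idsw_dt=10, num_gt=5000))
```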

Track-2: Crowd Pose Estimation in Complex Events

This task aims at estimating human poses on each frame. Compared with existing pose estimation challenges, our dataset is much larger in scale and includes more frequent occlusions. Moreover, it involves more real-scene pose patterns in various complex events.


• Dataset Description

In this track, the skeleton keypoints of pedestrians are annotated. Annotation is performed per frame.


• Submission Format

Please refer to the Format Description page for the detailed format description of our dataset. Train with the given training set; the test set will be provided in the test stage.


• Evaluation Metrics

- AP@α is a widely used metric for keypoint estimation. A predicted pose proposal is matched to the ground-truth pose with which it achieves the highest PCKp@α (Percentage of Correct Keypoints, where α is the distance threshold that determines whether a detected keypoint matches an annotated keypoint); matched proposals are counted as true positives and unmatched ones as false positives. The AP value is the area under the precision-recall curve.

- w-AP@α To discourage methods that focus only on the simple cases and uncrowded scenarios in the dataset, we assign larger weights to a test image during evaluation if it has (1) a higher Crowd Index or (2) anomalous behaviors (e.g., fighting, fall-over, crouching-bowing). The AP values calculated with these weights are called weighted AP (w-AP). Please refer to our paper for more details about weighted AP.

- AP@avg We take the average of AP@0.5, AP@0.75, and AP@0.9 as an overall measurement of the keypoint estimation results, where 0.5, 0.75, and 0.9 are the distance thresholds used for computing PCKp.

- w-AP@avg We take the average of w-AP@0.5, w-AP@0.75, and w-AP@0.9 as an overall measurement of the keypoint estimation results on weighted video frames. The final performance ranking in this track is based on w-AP@avg.

Challengers can refer to the reference evaluation code here. A minimal sketch of the PCKp matching and AP computation is also given below.
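The following sketch illustrates, under simplifying assumptions (a greedy confidence-ordered matching, a hypothetical data layout, a hypothetical normalization scale for the distance threshold, and an illustrative acceptance threshold), how the PCKp-based matching and the AP described above could be computed; it is not the official evaluation script.

```python
import numpy as np

# Sketch only: PCKp-based matching and AP as the area under the
# precision-recall curve. Data layout and thresholds are assumptions,
# not the official HiEve definitions.

def pckp(pred_kpts: np.ndarray, gt_kpts: np.ndarray, alpha: float, scale: float) -> float:
    """Fraction of keypoints whose distance to the annotated keypoint is
    within alpha * scale (scale could be, e.g., the person box diagonal)."""
    dists = np.linalg.norm(pred_kpts - gt_kpts, axis=1)
    return float(np.mean(dists <= alpha * scale))


def ap_at_alpha(predictions, ground_truths, alpha, accept_thresh=0.5):
    """predictions: list of (score, keypoints); ground_truths: list of
    (keypoints, scale). Each prediction, in descending score order, is
    matched to the yet-unmatched ground truth with which it has the
    highest PCKp@alpha; accept_thresh is an illustrative assumption."""
    matched = [False] * len(ground_truths)
    tp_flags = []
    for score, kpts in sorted(predictions, key=lambda p: -p[0]):
        pck_scores = [(-1.0 if matched[i] else pckp(kpts, g_kpts, alpha, g_scale))
                      for i, (g_kpts, g_scale) in enumerate(ground_truths)]
        best = int(np.argmax(pck_scores)) if pck_scores else -1
        if best >= 0 and pck_scores[best] >= accept_thresh:
            matched[best] = True
            tp_flags.append(1.0)
        else:
            tp_flags.append(0.0)
    cum_tp = np.cumsum(tp_flags)
    recall = cum_tp / max(len(ground_truths), 1)
    precision = cum_tp / (np.arange(len(tp_flags)) + 1)
    # Area under the precision-recall curve (rectangular approximation).
    return float(np.sum(np.diff(np.concatenate(([0.0], recall))) * precision))
```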

Track-3: Crowd Pose Tracking in Complex Events

This task aims at estimating human poses on each frame and associating keypoints of the same identity across frames. Compared with existing pose tracking challenges, our dataset is much larger in scale and includes more frequent occlusions. Moreover, it involves more real-scene pose patterns in various complex events.


• Dataset Description

In this track, the skeleton keypoints and the corresponding skeleton IDs of pedestrians are annotated. Annotation is performed per frame.


• Submission Format

Please refer to the Format Description page for the detailed format description of our dataset. Train with the given training set; the test set will be provided in the test stage.


• Evaluation Metrics

- MOTA and MOTP. MOTA accounts for false positives, missed targets, and identity switches; MOTP measures how closely the predicted results match the ground truth. Please see [1] for details. The final performance ranking in this track is based on MOTA.

- AP is a widely used metric for keypoint estimation. A predicted pose proposal is matched to the ground-truth pose with which it achieves the highest PCKp (Percentage of Correct Keypoints); matched proposals are counted as true positives and unmatched ones as false positives. The AP value is the area under the precision-recall curve.

Challengers can refer to the reference evaluation code here.

Track-4: Person-level Action Recognition in Complex Events

The action recognition task requires participants to predict an action label for each individual in the labeled frames. In our dataset, we annotate the actions of all individuals every 20 frames in a video. For group actions, we assign the action label to each group member involved in the event. In total, we define 15 action categories.


• Dataset Description

In this track, the bounding box and action class of each pedestrian are annotated. Annotation is performed every 20 frames.


• Submission Format

Please refer to the Format Description page for the detailed format description of our dataset. Submissions should be in the AVA format. Train with the given training set; the test set will be provided in the test stage.


• Evaluation Metrics

- f-mAP@α The frame mAP (f-mAP) is a common metric for evaluating spatial action detection accuracy on a single frame. Specifically, each prediction consists of a bounding box and a predicted action label; if its overlap with an unmatched ground-truth box of the same label is larger than a threshold α, it is counted as a true positive, otherwise as a false positive. This process is conducted on each frame annotated with action boxes. The AP value for each label is computed as the area under the precision-recall curve, and the mean AP is obtained by averaging the per-label AP values.

- wf-mAP@α Considering the unbalanced distribution of action categories in the dataset, we assign smaller weights to test samples belonging to dominant categories. In addition, we assign larger weights to frames from crowded and occluded scenarios to encourage models that perform better in complex scenes. Similar to weighted AP, the frame mAP calculated with these weights is called weighted frame-mAP (wf-mAP for short). Please refer to our paper for more details about weighted frame-mAP.

- f-mAP@avg We report f-mAP@0.5, f-mAP@0.6, and f-mAP@0.75, where 0.5, 0.6, and 0.75 are the IoU thresholds used to determine true/false positives, and take their mean as an overall measurement, denoted f-mAP@avg.

- wf-mAP@avg Similarly, we report wf-mAP@0.5, wf-mAP@0.6, and wf-mAP@0.75 and take their mean as an overall measurement, denoted wf-mAP@avg. The final performance ranking in this track is based on wf-mAP@avg.

Challengers can refer to the reference evaluation code here. A minimal sketch of the IoU-based matching used for frame-level action detection is also given below.
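As a rough illustration (not the official evaluation script), the sketch below shows an IoU-based true/false-positive assignment of the kind described above for a single annotated frame; the data structures and the greedy confidence-ordered matching are assumptions.

```python
# Sketch only: IoU-based true/false-positive assignment for frame-level
# action detection (f-mAP). Data structures are illustrative; the official
# evaluation follows the AVA-style protocol referenced above.

def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)


def match_frame(preds, gts, alpha=0.5):
    """preds: list of (score, box, label); gts: list of (box, label).
    Returns a true/false-positive flag per prediction (in descending score
    order), matching each prediction to an unmatched ground-truth box of
    the same label with IoU larger than alpha."""
    used = [False] * len(gts)
    flags = []
    for score, box, label in sorted(preds, key=lambda p: -p[0]):
        best, best_iou = -1, alpha
        for i, (g_box, g_label) in enumerate(gts):
            if used[i] or g_label != label:
                continue
            ov = iou(box, g_box)
            if ov > best_iou:
                best, best_iou = i, ov
        if best >= 0:
            used[best] = True
            flags.append(True)
        else:
            flags.append(False)
    return flags
```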

Track-5: Pedestrian Detection in Complex and Crowded Events

The pedestrian detection task requires participants to predict a bounding box for each individual on each frame. Compared with existing pedestrian detection challenges, our dataset is much larger in scale and includes more frequent occlusions. Moreover, it involves more real-scene patterns in various complex events.


• Dataset Description

In this track, the bounding boxes of pedestrians are annotated. Annotation is performed per frame.


• Submission Format

Please refer to the Format Description page for the detailed format description of our dataset. Train with the given training set; the test set will be provided in the test stage.


• Evaluation Metrics

- MR The Miss Rate (MR), i.e., the log-average miss rate over False Positives Per Image (FPPI) in the range [10^-2, 10^0], is the most commonly used metric in pedestrian detection. MR is a good indicator for algorithms applied in real-world applications. It is very sensitive to false positives (FPs); in particular, high-confidence FPs significantly harm MR. A smaller MR indicates better performance.

- AP Average Precision is popular for measuring the accuracy of object detectors such as Faster R-CNN, SSD, etc. AP averages the precision over recall values from 0 to 1 and thus reflects both the precision and the recall of the detection results. It is sensitive to the recall scores, especially in crowded scenarios. A larger AP indicates better performance.

- JI The Jaccard Index is mainly used to evaluate the counting ability of a detector. Unlike AP and MR, which are defined on a prediction sequence with decreasing confidence, JI evaluates how much the prediction set overlaps the ground-truth set. Usually, the prediction set is generated by applying a confidence threshold. A larger JI indicates better performance.

Challengers can refer to the reference evaluation code here. A minimal sketch of the log-average Miss Rate computation is also given below.
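As a rough sketch (not the official evaluation code), and under assumptions about the input format (miss-rate/FPPI operating points obtained by sweeping the detection confidence threshold), the following shows how the log-average Miss Rate over FPPI in [10^-2, 10^0] described above could be computed.

```python
import numpy as np

# Sketch only: log-average Miss Rate (MR) over FPPI in [1e-2, 1e0].
# Inputs are assumed to be miss-rate / FPPI pairs swept over confidence
# thresholds; the 9 reference points and geometric mean follow the common
# pedestrian-detection protocol, not necessarily the exact HiEve script.

def log_average_miss_rate(miss_rates: np.ndarray, fppi: np.ndarray,
                          num_points: int = 9) -> float:
    """Average the miss rate at `num_points` FPPI values evenly spaced in
    log-space between 1e-2 and 1e0, taking at each reference point the miss
    rate of the closest operating point with FPPI <= that value."""
    refs = np.logspace(-2.0, 0.0, num_points)
    order = np.argsort(fppi)
    fppi, miss_rates = fppi[order], miss_rates[order]
    sampled = []
    for ref in refs:
        idx = np.searchsorted(fppi, ref, side="right") - 1
        # If no operating point has FPPI <= ref, fall back to the first one.
        sampled.append(miss_rates[max(idx, 0)])
    # Geometric mean of the sampled miss rates (the "log-average").
    return float(np.exp(np.mean(np.log(np.maximum(sampled, 1e-10)))))
```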

Track-6: Object Re-identification in Complex and Crowded Events

We divide Track-6 into two sub-tracks: Track-6.1: Group Re-Identification in Complex and Crowded Events and Track-6.2: In-scene Video Re-Identification in Complex and Crowded Events. The group re-id task requires participants to re-identify groups of people under different camera views; our dataset is challenging due to its large variations in group layout and human pose. The video re-id task requires participants to re-identify people across different video clips; our dataset is challenging because HiEve contains many discontinuous trajectories and complex events.


• Dataset Description

For Track-6.1, the group ID, the single-pedestrian correspondences, and the single-pedestrian bounding boxes are annotated. For Track-6.2, the person ID of each video clip is annotated.


• Evaluation Metrics

- CMC The Cumulated Matching Characteristic (CMC) curve is commonly used in object re-identification [12]. It measures the rank-k correct match rate. A minimal sketch of the rank-k computation is given below.
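As a rough illustration (not the official script), the sketch below shows how a rank-k CMC score could be computed from a query-by-gallery distance matrix; the single-camera, integer-label setting is an assumption.

```python
import numpy as np

# Sketch only: rank-k CMC from a query-by-gallery distance matrix.
# Assumes each query has at least one correct match in the gallery and that
# identity labels are integers; this illustrates the metric, not the
# official HiEve evaluation protocol.

def cmc_rank_k(dist: np.ndarray, query_ids: np.ndarray,
               gallery_ids: np.ndarray, k: int = 1) -> float:
    """Fraction of queries whose correct identity appears within the
    top-k gallery entries when ranked by increasing distance."""
    hits = 0
    for q in range(dist.shape[0]):
        ranked = np.argsort(dist[q])          # closest gallery items first
        top_k_ids = gallery_ids[ranked[:k]]
        hits += int(np.any(top_k_ids == query_ids[q]))
    return hits / dist.shape[0]
```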

Reference Code for Evaluation Metrics

For challengers who want to evaluate their models' performance on the training set on their own machines, the following code repositories may be helpful. They are the prototypes of our evaluation scripts and can be used to evaluate common metrics such as MOTA or f-mAP.

- Reference evaluation code for multiple object tracking.

- Reference evaluation code for pose estimation and pose tracking.

- Reference evaluation code for action detection.

- Reference evaluation code for pedestrian detection.

Paper bib

The paper for our dataset has been released. Please click here to download it.

References

[1] Bernardin, K. & Stiefelhagen, R. Evaluating Multiple Object Tracking Performance: The CLEAR MOT Metrics. EURASIP Journal on Image and Video Processing, 2008(1):1-10, 2008.

[2] Ristani, E., Solera, F., Zou, R., Cucchiara, R. & Tomasi, C. Performance Measures and a Data Set for Multi-Target, Multi-Camera Tracking. In ECCV workshop on Benchmarking Multi-Target Tracking, 2016.

[3] Li, Y., Huang, C. & Nevatia, R. Learning to Associate: HybridBoosted Multi-Target Tracker for Crowded Scene. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2009.

[4] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: European conference on computer vision. pp. 740–755. Springer (2014)

[5] Andriluka, M., Pishchulin, L., Gehler, P., Schiele, B.: 2d human pose estimation: New benchmark and state of the art analysis. In: Proceedings of the IEEE Conference on computer Vision and Pattern Recognition. pp. 3686–3693 (2014)

[6] Li, J., Wang, C., Zhu, H., Mao, Y., Fang, H.S., Lu, C.: Crowdpose: Efficient crowded scenes pose estimation and a new benchmark. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 10863–10872(2019)

[7] Iqbal, U., Milan, A., Gall, J.: Posetrack: Joint multi-person pose estimation and tracking. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2011–2020 (2017)

[8] Dendorfer, P., Rezatofighi, H., Milan, A., Shi, J., Cremers, D., Reid, I., Roth, S., Schindler, K., Leal-Taixé, L.: MOT20: A benchmark for multi object tracking in crowded scenes. arXiv preprint arXiv:2003.09003 (2020)

[9] Lu, C., Shi, J., Jia, J.: Abnormal event detection at 150 fps in matlab. In: Proceedings of the IEEE international conference on computer vision. pp. 2720–2727 (2013)

[10] Sultani, W., Chen, C., Shah, M.: Real-world anomaly detection in surveillance videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 6479–6488 (2018)

[11] Chu, X., Zheng, A., Zhang, X., & Sun, J.: Detection in crowded scenes: One proposal, multiple predictions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12214-12223 (2020)

[12] Zhu, X., Jing, X., Wu, F., & Feng, H.: Video-Based Person Re-Identification by Simultaneously Learning Intra-Video and Inter-Video Distance Metrics. In IEEE Transactions on Image Processing, 27, 2018.