ACM MM Grand Challenge on

Large-scale Human-centric Video Analysis in Complex Events


Format Description

We provide both track-by-track and all-track annotations. Our definitions of human pose and action are shown in Table 1 and Table 2, respectively.


• Pose&Action definition
Table 1: Our definition of human pose.
part name         part index
Nose              0
Chest             1
Right-shoulder    2
Right-elbow       3
Right-wrist       4
Left-shoulder     5
Left-elbow        6
Left-wrist        7
Right-hip         8
Right-knee        9
Right-ankle       10
Left-hip          11
Left-knee         12
Left-ankle        13

Table 2: Our definition of action.
action name               action index
walking-alone             1
walking-together          2
running-alone             3
running-together          4
riding                    5
sitting-talking           6
sitting-alone             7
queuing                   8
standing-alone            9
gathering                 10
fighting                  11
fall-over                 12
walking-up-down-stairs    13
crouching-bowing          14

• All-track ground truth format

For each video, we provide a frame-by-frame annotation. For example, for frame 0, the annotation file is 000000.xml. The content of the annotation is shown below.

{
  "annotation":{
    "folder":"hm_in_fighting1",
    "filename":"0.jpg",
    "path":"D:\hm_in_fighting1\0.jpg",
    "source":{
      "database":"Unknown"},
    "size":{
      "width":"1920","height":"1080","depth":"3"},
    "segmented":"0",
    "object":[
    {
      "name":"P-1", "action":"standing-alone", "pose":"unspecified", "truncated":"0", "difficult":"0",
      "bndbox":{
         "xmin":"747", "ymin":"103", "xmax":"1044", "ymax":"1079", "Nose": "928,185,0","Chest":"943,278,0", ...}
    },{...}, ...]
  }
}

The three numbers after each body part are x, y, and flag, in that order. When flag=1, the body part is invisible; when flag=0, the body part is visible; and when flag=-1, the body part is not annotated.
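
As an illustration of the "x,y,flag" convention, here is a minimal Python sketch that parses one body-part value. The names Keypoint, parse_keypoint and FLAG_MEANING are hypothetical helpers, not part of the challenge toolkit, and the sketch assumes integer pixel coordinates as in the sample annotation.

```python
from typing import NamedTuple

# Flag values as described in the text: 0 visible, 1 invisible, -1 not annotated.
FLAG_MEANING = {0: "visible", 1: "invisible", -1: "not annotated"}


class Keypoint(NamedTuple):
    x: int
    y: int
    flag: int


def parse_keypoint(value: str) -> Keypoint:
    """Parse an "x,y,flag" string such as "928,185,0" (integer coordinates assumed)."""
    x, y, flag = (int(v) for v in value.split(","))
    if flag not in FLAG_MEANING:
        raise ValueError(f"unexpected flag {flag!r}")
    return Keypoint(x, y, flag)


nose = parse_keypoint("928,185,0")
print(nose.x, nose.y, FLAG_MEANING[nose.flag])
```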


• Submission file structure

Please submit your results as a single .zip file. The results for each sequence must be stored in a separate file in the archive's root folder, and each file name must exactly match the sequence name.
The file structure is as follows, taking Track-2 as an example:

- submission.zip
— 20.json
— 21.json
......
......
— 32.json
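
One possible way to build such an archive is sketched below in Python using the standard zipfile module. The function name pack_submission and its parameters are hypothetical. The sketch assumes all result files sit in one local directory and are already named after their sequences.

```python
import zipfile
from pathlib import Path


def pack_submission(result_dir: str, zip_path: str, suffix: str = ".json") -> list[str]:
    """Zip every result file in result_dir into the archive root; return the stored names."""
    names = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(Path(result_dir).glob(f"*{suffix}")):
            # arcname=path.name drops the directory part, so the file sits in the root folder.
            zf.write(path, arcname=path.name)
            names.append(path.name)
    return names
```

For Track-1 or Track-4 results, the same sketch would be called with suffix=".txt" or suffix=".csv".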


• Track-by-track ground truth & submission format

Track-1: For each sequence, there is a separate .txt ground truth file. The ground truth .txt file contains one object instance per line. Each line must contain 10 values:
frame, id, bb_left, bb_top, bb_width, bb_height, flag, -1, -1, -1
The value "flag" determines whether the entry is to be considered. A value of 0 means that this particular instance is ignored in the evaluation, while any other value can be used to mark it as active.
For submission, the results for each sequence must be stored in a separate .txt file in the archive's root folder, and the file name must be exactly like the sequence name. The result file format should be:
frame, id, bb_left, bb_top, bb_width, bb_height, conf, -1, -1, -1
The "conf" value is the detection confidence. One example can be downloaded here.
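
The two 10-value layouts above can be handled as in the following Python sketch. The function names are hypothetical, and the sketch assumes the values are separated by commas (optional surrounding spaces are tolerated when reading). It is not the official evaluation code.

```python
def format_track1_line(frame, track_id, left, top, width, height, conf):
    """Build one result row: frame, id, bb_left, bb_top, bb_width, bb_height, conf, -1, -1, -1."""
    fields = [frame, track_id, left, top, width, height, conf, -1, -1, -1]
    return ",".join(str(f) for f in fields)


def read_active_gt(lines):
    """Yield ground truth rows (as lists of strings) whose flag, the 7th value, is non-zero."""
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        values = [v.strip() for v in line.split(",")]
        if len(values) != 10:
            raise ValueError(f"expected 10 values, got {len(values)}: {line!r}")
        if float(values[6]) != 0:
            yield values


print(format_track1_line(0, 1, 747, 103, 297, 976, 0.9))
```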

Track-2: For each sequence, there is a separate .json ground truth file. The format of the file is shown below.


{
  "annolist":[
    {
      "image":[
        {
          "name":"000000.jpg"
        }
      ],
      "ignore_regions":[],
      "annorect":[
        {
          "x1":[625],
          "y1":[94],
          "x2":[681],
          "y2":[178],
          "score":[0.9],
          "track_id":[0],
          "annopoints":[
            {
              "point":[
                {
                  "id":[0],
                  "x":[394],
                  "y":[173],
                  "score":[1]
                },
                { ... }
              ]
            }
          ]
        },
        { ... }
      ]
    },
    { ... }
  ]
}

Here, "ignore_regions" marks areas that are excluded from evaluation.
For submission, the results for each sequence must be stored in a separate .json file in the archive's root folder, and the file name must exactly match the sequence name. The result format is basically the same as the ground truth, but neither "track_id" nor "ignore_regions" is required. Note that a prediction is required for every frame, even if some frames contain no objects, and the frames must appear in order. One example can be downloaded here.
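
A minimal Python sketch of assembling a Track-2 result in this layout follows. The helper names are hypothetical. Two assumptions are made that the text does not confirm: frame names follow the zero-padded "000000.jpg" pattern of the sample, and a frame without detections is written with an empty "annorect" list.

```python
import json


def make_point(part_index, x, y, score):
    """One keypoint; scalars are wrapped in one-element lists as in the sample."""
    return {"id": [part_index], "x": [x], "y": [y], "score": [score]}


def make_person(box, score, keypoints):
    """One person entry; keypoints is a list of (part_index, x, y, score) tuples."""
    x1, y1, x2, y2 = box
    return {
        "x1": [x1], "y1": [y1], "x2": [x2], "y2": [y2],
        "score": [score],
        "annopoints": [{"point": [make_point(*kp) for kp in keypoints]}],
    }


def build_track2_result(num_frames, detections):
    """detections maps frame index -> list of person entries; every frame is emitted in order."""
    annolist = []
    for frame in range(num_frames):
        annolist.append({
            "image": [{"name": f"{frame:06d}.jpg"}],  # assumed naming pattern
            "annorect": detections.get(frame, []),    # assumed form for empty frames
        })
    return {"annolist": annolist}


person = make_person((625, 94, 681, 178), 0.9, [(0, 394, 173, 1)])
result = build_track2_result(3, {0: [person]})
text = json.dumps(result)  # string to write to the sequence's .json file
```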

Track-3: The ground truth format of Track-3 is the same as Track-2.
The submission format for Track-3 is also similar to that of Track-2; the difference is that in Track-3, "track_id" is required. One example can be downloaded here.
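
As a small sketch of that difference, the hypothetical Python helper below adds "track_id" to a Track-2 style person entry, wrapping the value in a one-element list as the ground truth sample does.

```python
def with_track_id(annorect: dict, track_id: int) -> dict:
    """Return a copy of a Track-2 style person entry with "track_id" added."""
    entry = dict(annorect)  # shallow copy so the input entry is left unchanged
    entry["track_id"] = [track_id]
    return entry


entry = with_track_id({"x1": [625], "y1": [94], "x2": [681], "y2": [178], "score": [0.9]}, 0)
```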

Track-4: For each sequence, there is a separate .csv ground truth file. Each row in the .csv file contains an annotation for one person performing an action in an interval of 20 frames, where that annotation is associated with the first frame. Different persons and multiple action labels are described in separate rows. The format of a row is the following: video_id, frame_id, person_box, action_id, person_id
>> video_id: video identifier
>> frame_id: frame index, should be an integer multiple of 20 and start from 0.
>> person_box: top-left (x1, y1) and bottom-right (x2, y2) normalized with respect to frame size, where (0.0, 0.0) corresponds to the top left, and (1.0, 1.0) corresponds to the bottom right.
>> action_id: identifier of an action class, see Humaninevents_action_list.txt
>> person_id: a unique integer allowing this box to be linked to other boxes depicting the same person in adjacent frames of this video.
For submission, the results for each sequence must be stored in a separate .csv file in the archive's root folder, and the file name must exactly match the sequence name. The result format is basically the same as the ground truth, except that "person_id" is not required; a detection score takes its place. One example can be downloaded here.
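
The row layout and box normalization can be sketched in Python as follows. The function names are hypothetical. The sketch assumes that person_box is written as four separate columns (x1, y1, x2, y2) and rounds to four decimals. Neither detail is specified above, so check the downloadable example before relying on it.

```python
import csv
import io


def normalize_box(x1, y1, x2, y2, width, height):
    """Scale pixel corners so (0.0, 0.0) is the top left and (1.0, 1.0) the bottom right."""
    return (x1 / width, y1 / height, x2 / width, y2 / height)


def track4_row(video_id, frame_id, box_px, frame_size, action_id, score):
    """Build one submission row: video_id, frame_id, x1, y1, x2, y2, action_id, score."""
    if frame_id % 20 != 0:
        raise ValueError("frame_id must be an integer multiple of 20")
    width, height = frame_size
    box = normalize_box(*box_px, width, height)
    # Four box columns and 4-decimal precision are assumptions of this sketch.
    return [video_id, frame_id, *(f"{v:.4f}" for v in box), action_id, score]


buffer = io.StringIO()
csv.writer(buffer).writerow(
    track4_row("20", 40, (747, 103, 1044, 1079), (1920, 1080), 9, 0.8)
)
```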


• Format Checking Script:
Challengers can download the format checking script here to ensure the correct format before submission.