Human detection and tracking

Khoa Le, Ph.D.
5 min read · Nov 5, 2023


Human detection and tracking is a fundamental computer vision task that involves identifying and following the movements of individuals within a given scene. This technology plays a crucial role in various real-world applications, ranging from surveillance and security to autonomous vehicles and human-computer interaction. The primary objective of human detection is to locate and classify humans within an image or video frame, while tracking focuses on maintaining continuity as these individuals move across different frames, allowing for their monitoring and analysis.

This tutorial is divided into three parts, aiming to guide you in creating your own human detection and tracking system. It leverages the YOLOv8 detection algorithm and multiple tracking methods. The detection model was trained on the CrowdHuman dataset and the Multiple Object Tracking (MOT) datasets. In this post, we'll walk you through the data preparation for this endeavor.

To develop and evaluate human detection and tracking algorithms, researchers and practitioners often rely on large datasets that cover a diverse range of real-world scenarios and challenges. Two prominent datasets in this domain are CrowdHuman and the Multiple Object Tracking (MOT) benchmark datasets.

CrowdHuman Dataset: CrowdHuman is a widely used dataset for human detection. It contains 15,000 images for training, 4,370 for validation, and 5,000 for testing, all annotated with human instances. What sets CrowdHuman apart is its diversity, featuring crowded scenes, heavy occlusion, and a wide variety of poses and scales. Its annotations are exhaustive and cover a broad array of scenes.

CrowdHuman benchmark with full body, visible body, and head bounding box annotations for each person.

In this dataset, the combined count of individual persons in the training and validation subsets reaches a staggering 470,000, with an average of 22.6 pedestrians per image. Moreover, detailed annotations are provided, including bounding boxes for the visible region, bounding boxes for the head region, and full-body bounding boxes, making it a comprehensive and invaluable resource for human detection research.
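Each split ships with a single .odgt annotation file in which every line is a standalone JSON record describing one image. The shortened record below is a sketch with illustrative values, but the keys are the ones the conversion code below relies on: fbox, vbox, and hbox hold the full-body, visible-body, and head boxes as [x, y, w, h] in pixels, and tag distinguishes real persons from ignore regions.

{"ID": "284193,faa9000f2678b5e",
 "gtboxes": [
    {"tag": "person",
     "hbox": [123, 129, 63, 64],
     "vbox": [108, 115, 134, 307],
     "fbox": [108, 115, 134, 398],
     "extra": {"box_id": 0, "occ": 0}},
    {"tag": "mask",
     "hbox": [300, 80, 20, 20],
     "vbox": [300, 80, 60, 120],
     "fbox": [300, 80, 60, 120],
     "extra": {"ignore": 1}}
 ]}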

The data is first converted to COCO format using this code:

import os
import json

from PIL import Image
from tqdm import tqdm


def load_func(fpath):
    # Helper: each line of a CrowdHuman .odgt file is a standalone JSON record.
    with open(fpath, 'r') as fid:
        lines = fid.readlines()
    return [json.loads(line.strip('\n')) for line in lines]


def crowdhuman_to_coco(DATA_PATH, OUT_PATH, SPLITS=['val', 'train'], DEBUG=False):
    if OUT_PATH is None:
        OUT_PATH = DATA_PATH + 'annotations/'
    if not os.path.exists(OUT_PATH):
        os.makedirs(OUT_PATH, exist_ok=True)
    for split in SPLITS:
        data_path = DATA_PATH + split
        out_path = OUT_PATH + '{}.json'.format(split)
        out = {'images': [], 'annotations': [],
               'categories': [{'id': 1, 'name': 'person'}]}
        ann_path = DATA_PATH + 'annotation_{}.odgt'.format(split)
        anns_data = load_func(ann_path)
        image_cnt = 0
        ann_cnt = 0
        for ann_data in tqdm(anns_data):
            image_cnt += 1
            file_path = DATA_PATH + split + '/Images/{}.jpg'.format(ann_data['ID'])
            assert os.path.isfile(file_path)
            im = Image.open(file_path)
            image_info = {'file_name': 'Images/{}.jpg'.format(ann_data['ID']),
                          'id': image_cnt,
                          'height': im.size[1],
                          'width': im.size[0]}
            out['images'].append(image_info)
            if split != 'test':
                anns = ann_data['gtboxes']
                for i in range(len(anns)):
                    if anns[i]['tag'] == 'mask':
                        continue  # ignore non-human (crowd/ignore) regions
                    assert anns[i]['tag'] == 'person'
                    ann_cnt += 1
                    fbox = anns[i]['fbox']  # fbox means full-body box
                    ann = {'id': ann_cnt,
                           'category_id': 1,
                           'image_id': image_cnt,
                           'track_id': -1,
                           'bbox_vis': anns[i]['vbox'],
                           'bbox': fbox,
                           'area': fbox[2] * fbox[3],
                           'iscrowd': 1 if 'extra' in anns[i] and
                                           'ignore' in anns[i]['extra'] and
                                           anns[i]['extra']['ignore'] == 1 else 0}
                    out['annotations'].append(ann)
        print('loaded {} for {} images and {} samples'.format(
            split, len(out['images']), len(out['annotations'])))
        json.dump(out, open(out_path, 'w'))

Here we only keep the person class and ignore the head bounding boxes.
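To run the conversion on a local copy of CrowdHuman, the call looks roughly like this; the paths below are placeholders for wherever the images and the annotation_train.odgt / annotation_val.odgt files were unpacked:

DATA_PATH = '/data/crowdhuman/'        # contains train/Images, val/Images, annotation_*.odgt
OUT_PATH = DATA_PATH + 'annotations/'  # train.json and val.json are written here
crowdhuman_to_coco(DATA_PATH, OUT_PATH, SPLITS=['val', 'train'])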

Multiple Object Tracking (MOT) Datasets: MOT datasets, on the other hand, are designed specifically for the task of human tracking. These datasets consist of video sequences with annotations for tracking individual humans as they move throughout the frames. MOT datasets are critical for evaluating the robustness and accuracy of tracking algorithms, as they provide ground truth data to measure how well an algorithm can consistently follow and identify individuals over time. Well-known MOT datasets include MOT17 and MOT20, each with various challenging scenarios, making them essential for tracking research.
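For reference, MOTChallenge ground truth comes as one plain-text gt/gt.txt file per sequence, with one box per line in the order: frame, track id, box left, box top, box width, box height, a confidence/active flag, a class id, and a visibility ratio. A minimal loading sketch (the sequence path and helper name are my own) could look like this:

import csv

def load_mot_gt(gt_path='MOT17/train/MOT17-02-FRCNN/gt/gt.txt'):
    # Read MOTChallenge ground truth into a dict: frame -> list of (track_id, [x, y, w, h]).
    boxes_per_frame = {}
    with open(gt_path) as f:
        for row in csv.reader(f):
            frame, track_id = int(row[0]), int(row[1])
            x, y, w, h = map(float, row[2:6])
            conf, cls = float(row[6]), int(row[7])
            # Keep only active pedestrian annotations (class 1, active flag 1).
            if conf == 0 or cls != 1:
                continue
            boxes_per_frame.setdefault(frame, []).append((track_id, [x, y, w, h]))
    return boxes_per_frame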

The COCO (Common Objects in Context) format dataset and the YOLO (You Only Look Once) format dataset are two widely used data formats for object detection and image recognition tasks, each with its specific structure and purposes. Here’s an overview of each format:

COCO Format Dataset:

Annotation Structure: The COCO dataset format is known for its structured and comprehensive annotations. A single JSON file per dataset split lists every image together with the objects it contains. Annotations typically include the object category, bounding box coordinates, segmentation masks, and additional attributes such as keypoints for pose estimation. COCO also supports annotations for keypoint detection, stuff and panoptic segmentation, image captioning, and more.

Annotation format:

{
  "images": [{
      "file_name": "000001.jpg",
      "height": 500,
      "width": 353,
      "id": 1
    }, {
      ...
    }, {
      "file_name": "009962.jpg",
      "height": 375,
      "width": 500,
      "id": 9962
    }, {
      "file_name": "009963.jpg",
      "height": 500,
      "width": 374,
      "id": 9963
  }],
  "type": "instances",
  "annotations": [{
      "segmentation": [
        [47, 239, 47, 371, 195, 371, 195, 239]
      ],
      "area": 19536,
      "iscrowd": 0,
      "image_id": 1,
      "bbox": [47, 239, 148, 132],
      "category_id": 12,
      "id": 1,
      "ignore": 0
    }, {
      "segmentation": [
        [138, 199, 138, 301, 207, 301, 207, 199]
      ],
      "area": 7038,
      "iscrowd": 0,
      "image_id": 2,
      "bbox": [138, 199, 69, 102],
      "category_id": 19,
      "id": 3,
      "ignore": 0
    }, {
      ...
    }, {
      "segmentation": [
        [1, 2, 1, 500, 374, 500, 374, 2]
      ],
      "area": 185754,
      "iscrowd": 0,
      "image_id": 9963,
      "bbox": [1, 2, 373, 498],
      "category_id": 7,
      "id": 14976,
      "ignore": 0
  }],
  "categories": [{
      "supercategory": "none",
      "id": 1,
      "name": "aeroplane"
    }, {
      "supercategory": "none",
      "id": 2,
      "name": "bicycle"
    }, {
      "supercategory": "none",
      "id": 3,
      "name": "bird"
    }, {
      "supercategory": "none",
      "id": 4,
      "name": "boat"
    }, {
      ...
    }, {
      "supercategory": "none",
      "id": 19,
      "name": "train"
    }, {
      "supercategory": "none",
      "id": 20,
      "name": "tvmonitor"
  }]
}

YOLO Format Dataset:

Annotation Structure: The YOLO (You Only Look Once) format dataset is specifically designed for object detection tasks. YOLO annotations are usually stored in a text file associated with each image, and the format is simpler compared to COCO. The folder structure is defined as follows:
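A typical layout (the directory names follow the common Ultralytics convention and are illustrative rather than mandated) is:

dataset/
  images/
    train/    000001.jpg, 000002.jpg, ...
    val/
  labels/
    train/    000001.txt, 000002.txt, ...   (same base names as the images)
    val/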

The labels folder contains one text file per image, named after the corresponding image in the images folder. Each line of a label file describes one object in YOLO format: the class id followed by the bounding-box center x, center y, width, and height, all normalized to the image dimensions.
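For example, a label file for an image containing two people might read as follows (one object per line; the class id here is 0 for person and the numbers are purely illustrative):

0 0.481250 0.633333 0.237500 0.416667
0 0.104688 0.585417 0.089062 0.245833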

Since each dataset has its own format, in this series of tutorials I first merge all the data into COCO format and then convert it to YOLO format to train the detection models. An example of COCO-to-YOLO conversion is in this script.
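The linked script is not reproduced here, but the heart of any COCO-to-YOLO conversion is the box re-normalization sketched below; the function and variable names are my own illustration, and the class ids are simply shifted to start at 0 as YOLO expects.

import collections
import json
import os

def coco_bbox_to_yolo(bbox, img_w, img_h):
    # COCO stores [x_min, y_min, width, height] in pixels;
    # YOLO wants [x_center, y_center, width, height] normalized to [0, 1].
    x, y, w, h = bbox
    return (x + w / 2) / img_w, (y + h / 2) / img_h, w / img_w, h / img_h

def coco_to_yolo_labels(coco_json, out_dir):
    # Write one YOLO label file per image from a COCO-style JSON
    # (same structure as the example above).
    os.makedirs(out_dir, exist_ok=True)
    coco = json.load(open(coco_json))
    images = {img['id']: img for img in coco['images']}
    lines = collections.defaultdict(list)
    for ann in coco['annotations']:
        img = images[ann['image_id']]
        cx, cy, w, h = coco_bbox_to_yolo(ann['bbox'], img['width'], img['height'])
        lines[ann['image_id']].append(
            '{} {:.6f} {:.6f} {:.6f} {:.6f}'.format(ann['category_id'] - 1, cx, cy, w, h))
    for img_id, rows in lines.items():
        stem = os.path.splitext(os.path.basename(images[img_id]['file_name']))[0]
        with open(os.path.join(out_dir, stem + '.txt'), 'w') as f:
            f.write('\n'.join(rows) + '\n')

# e.g. coco_to_yolo_labels('annotations/train.json', 'labels/train')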


Khoa Le, Ph.D.

I do Data Science on Medical Imaging and Finance, and love them both.