Extract a subset of COCO

The COCO2017 training set contains about 118K images and the validation set contains 5K images, which makes it a relatively large dataset. When debugging or doing quick verification, loading the full annotation JSON consumes more resources and slows down startup.

The extract_subcoco.py script extracts a subset of images selected by number, class, or bbox area size. Use the --num-img, --classes, and --area-size parameters to get a COCO subset that meets the specified conditions.

For example, extract images with the script as follows:

python tools/misc/extract_subcoco.py \
    ${ROOT} \
    ${OUT_DIR} \
    --num-img 20 \
    --classes cat dog person \
    --area-size small

This extracts 20 images and keeps only the annotations that belong to cat, dog, or person and whose bbox area size is small. After filtering by class and area size, images left without any annotations are not chosen, which guarantees that every extracted image has annotation info.
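The filtering described above can be sketched in plain Python. This is a minimal illustration, not the script's actual implementation: the function name `filter_annotations` is hypothetical, the `AREA_RANGES` thresholds are assumed to follow the usual COCO evaluation buckets (32² and 96² pixels), and the input is a small hand-made dict in standard COCO JSON layout.

```python
# Hypothetical sketch; names and thresholds are assumptions,
# not taken from extract_subcoco.py.
AREA_RANGES = {
    "small": (0.0, 32 ** 2),
    "medium": (32 ** 2, 96 ** 2),
    "large": (96 ** 2, float("inf")),
}


def filter_annotations(coco, classes=None, area_size=None):
    """Return (kept_annotations, kept_image_ids) for a COCO-style dict."""
    # Map the requested class names to category ids; None keeps every class.
    wanted_ids = {
        cat["id"] for cat in coco["categories"]
        if classes is None or cat["name"] in classes
    }
    # None keeps every area size.
    low, high = AREA_RANGES[area_size] if area_size else (0.0, float("inf"))

    kept = [
        ann for ann in coco["annotations"]
        if ann["category_id"] in wanted_ids and low <= ann["area"] < high
    ]
    # Only images that still own at least one annotation survive.
    image_ids = {ann["image_id"] for ann in kept}
    return kept, image_ids


# Toy data: image 1 has a small cat and a small car,
# image 2 has a large person, image 3 has only a small car.
toy = {
    "images": [{"id": 1}, {"id": 2}, {"id": 3}],
    "categories": [
        {"id": 1, "name": "person"},
        {"id": 3, "name": "car"},
        {"id": 17, "name": "cat"},
    ],
    "annotations": [
        {"id": 1, "image_id": 1, "category_id": 17, "area": 500.0},
        {"id": 2, "image_id": 1, "category_id": 3, "area": 400.0},
        {"id": 3, "image_id": 2, "category_id": 1, "area": 20000.0},
        {"id": 4, "image_id": 3, "category_id": 3, "area": 100.0},
    ],
}

kept, image_ids = filter_annotations(toy, ["cat", "dog", "person"], "small")
print([ann["id"] for ann in kept], sorted(image_ids))
```

In this toy case only image 1 remains: image 2 loses its single annotation to the area filter and image 3 to the class filter, so neither would be eligible for extraction.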

Currently, only COCO2017 is supported. Support for user-defined datasets in the standard COCO JSON format is planned for the future.

The root path folder format is as follows:

├── root
│   ├── annotations
│   ├── train2017
│   ├── val2017
│   ├── test2017
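Before running the script, it can help to confirm that the root folder matches this layout. The helper below is a hypothetical convenience check, not part of extract_subcoco.py; whether the script actually requires all four sub-folders is not stated here, so the list is simply copied from the tree above.

```python
import tempfile
from pathlib import Path

# Sub-folders copied from the layout above.
EXPECTED_DIRS = ("annotations", "train2017", "val2017", "test2017")


def missing_dirs(root):
    """Return the expected sub-folders that do not exist under root."""
    root = Path(root)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]


# Demo on a temporary folder that only has two of the four sub-folders.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("annotations", "val2017"):
        (Path(tmp) / name).mkdir()
    missing = missing_dirs(tmp)

print(missing)  # ['train2017', 'test2017']
```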

Extract 10 training images and 10 validation images, drawing both from the 5K validation set only.

python tools/misc/extract_subcoco.py ${ROOT} ${OUT_DIR} --num-img 10

Extract 20 training images from the training set and 20 validation images from the validation set.

python tools/misc/extract_subcoco.py ${ROOT} ${OUT_DIR} --num-img 20 --use-training-set

Set the global seed to 1. By default, no seed is set.

python tools/misc/extract_subcoco.py ${ROOT} ${OUT_DIR} --num-img 20 --use-training-set --seed 1
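The effect of a fixed seed can be illustrated with a small sketch. How extract_subcoco.py seeds its random number generator internally is not shown here, so this example assumes a simple approach with Python's standard `random` module; `sample_image_ids` is a hypothetical name.

```python
import random


def sample_image_ids(image_ids, num_img, seed=None):
    """Pick up to num_img ids; a fixed seed makes the choice repeatable."""
    rng = random.Random(seed)  # seed=None falls back to system entropy
    ids = sorted(image_ids)    # sort so the result does not depend on input order
    return rng.sample(ids, min(num_img, len(ids)))


first = sample_image_ids(range(100), 20, seed=1)
second = sample_image_ids(range(100), 20, seed=1)
print(first == second)  # True: same seed, same subset
```

Without a seed, repeated runs generally select different images, which is why fixing the seed is useful when a subset has to be reproduced.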

Extract images by specified classes.

python tools/misc/extract_subcoco.py ${ROOT} ${OUT_DIR} --classes cat dog person

Extract images by specified bbox area size.

python tools/misc/extract_subcoco.py ${ROOT} ${OUT_DIR} --area-size small