Product detection in movies and TV shows using machine learning – part 2: Dataset and implementations

Dataset for training

Existing object detection models are mostly trained to recognise common everyday objects such as dogs, people and cars. We’ll use these pre-trained models later on when we do scene/character detection. For logo detection, we need to train our own model for the logos we want to detect. Training requires a sizeable dataset labelled with ground truth (where logos appear in the images). Because we are detecting logos in movies and TV shows, products are often not in perfect focus, lighting or orientation, and are sometimes obstructed by other objects, so our model needs to be trained on “in-the-wild” images captured in imperfect conditions. We use the following two datasets.

Dataset 1: The Logos-32plus dataset is a “publicly-available collection of photos showing 32 different logo brands (1.9GB). It is meant for the evaluation of logo retrieval and multi-class logo detection/recognition systems on real-world images”. The dataset has one subfolder per class (“adidas”, “aldi”, …), and each subfolder contains a set of JPG images. The file groundtruth.mat contains a MATLAB struct-array “groundtruth”, each element of which has the following fields: the relative path to the image (e.g. ‘adidas\000001.jpg’), bboxes (bounding box format is X,Y,W,H), and the logo name. The dataset contains around 340 Coca-Cola logo images.

Logo32plus

Dataset 2: The Logos in the Wild Dataset is a large-scale set of web-collected images with logo annotations provided in Pascal VOC style. The current version (v2.0) of the dataset consists of 11,054 images with 32,850 annotated logo bounding boxes of 871 brands. It is an in-the-wild logo dataset, where the logos appear as a natural part of the image rather than as the raw original logo graphics. Images are collected by Google image search based on a list of well-known brands and companies. The bounding boxes are given in the format (x_min, y_min, x_max, y_max) in absolute pixels.
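To make the annotation format concrete, here is a minimal sketch of reading bounding boxes from a Pascal VOC style XML annotation using only the Python standard library. The function name parse_voc_boxes, the file name and the brand label in the sample are our own illustrative assumptions, not something taken from the dataset itself.

```python
import xml.etree.ElementTree as ET


def parse_voc_boxes(xml_text):
    """Return a list of (name, x_min, y_min, x_max, y_max) tuples
    from the text of one Pascal VOC style annotation."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.iter("object"):
        name = obj.findtext("name")
        bndbox = obj.find("bndbox")
        # Coordinates are absolute pixels; some tools write them as floats.
        coords = tuple(
            int(float(bndbox.findtext(tag)))
            for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        boxes.append((name, *coords))
    return boxes


# Hypothetical annotation used only to exercise the parser.
SAMPLE = """<annotation>
  <filename>example.jpg</filename>
  <size><width>600</width><height>428</height><depth>3</depth></size>
  <object>
    <name>coca-cola</name>
    <bndbox><xmin>435</xmin><ymin>22</ymin><xmax>569</xmax><ymax>70</ymax></bndbox>
  </object>
</annotation>"""

if __name__ == "__main__":
    for box in parse_voc_boxes(SAMPLE):
        print(box)
```

In practice the XML text would be read from each annotation file on disk, and the brand names would need filtering to the logos of interest.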

The dataset does not provide the actual images but URLs to fetch them from various online sources, such as: http://bilder.t-online.de/b/47/70/93/86/id_47709386/920/tid_da/platz-1-bei-den-plakaten-coca-cola-foto-imas-.jpg So one must write a script to download the images from the URLs. Not all URLs are still valid, and we ended up with around 530 Coca-Cola logo images. The image (600×428) below contains three logos, and the ground truth is (435, 22, 569, 70), (308, 274, 351, 292) and (209, 225, 245, 243).
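A download script of this kind could look roughly like the sketch below. It is not the script we used; the names fetch_url and download_images, the ten-second timeout and the numbered .jpg output names are all assumptions made for illustration. The key point is that dead or malformed URLs are recorded and skipped rather than stopping the run.

```python
import os
import urllib.error
import urllib.request


def fetch_url(url, timeout=10):
    """Return the raw bytes served at url (raises on network/HTTP errors)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def download_images(urls, out_dir, fetch=fetch_url):
    """Try to download every URL into out_dir.

    Returns (saved_paths, failures), where failures is a list of
    (url, reason) pairs for links that could not be fetched.
    """
    os.makedirs(out_dir, exist_ok=True)
    saved, failed = [], []
    for index, url in enumerate(urls):
        try:
            data = fetch(url)
        except (urllib.error.URLError, OSError, ValueError) as exc:
            # Dead link, timeout or malformed URL: note it and move on.
            failed.append((url, str(exc)))
            continue
        if not data:
            failed.append((url, "empty response"))
            continue
        # Assumed naming scheme: position in the URL list, always .jpg.
        path = os.path.join(out_dir, f"{index:06d}.jpg")
        with open(path, "wb") as handle:
            handle.write(data)
        saved.append(path)
    return saved, failed
```

Passing the fetch function as a parameter makes it easy to try the loop without touching the network. A fuller version would also check that each downloaded file really is a readable image, since some servers return an HTML error page instead.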

As different object detection implementations use different conventions to define bounding boxes, it is necessary to write conversion scripts to map between them. This is not a horrendous task but requires basic programming skills.
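As an illustration, the sketch below converts between the two dataset formats mentioned above, (X, Y, W, H) and (x_min, y_min, x_max, y_max), and the normalised centre format that Darknet-style YOLO label files expect (class id, x_center, y_center, width, height, all relative to the image size). The function names are our own, and we assume X,Y in the first format is the top-left corner of the box.

```python
def xywh_to_corners(x, y, w, h):
    """(X, Y, W, H) with a top-left origin -> (x_min, y_min, x_max, y_max)."""
    return (x, y, x + w, y + h)


def corners_to_yolo(x_min, y_min, x_max, y_max, img_w, img_h):
    """Absolute pixel corners -> (x_center, y_center, width, height),
    each normalised to the range 0..1 by the image size."""
    x_center = (x_min + x_max) / 2.0 / img_w
    y_center = (y_min + y_max) / 2.0 / img_h
    width = (x_max - x_min) / img_w
    height = (y_max - y_min) / img_h
    return (x_center, y_center, width, height)


def yolo_to_corners(x_center, y_center, width, height, img_w, img_h):
    """Inverse of corners_to_yolo, rounded back to whole pixels."""
    x_min = round((x_center - width / 2.0) * img_w)
    y_min = round((y_center - height / 2.0) * img_h)
    x_max = round((x_center + width / 2.0) * img_w)
    y_max = round((y_center + height / 2.0) * img_h)
    return (x_min, y_min, x_max, y_max)


def format_label(class_id, yolo_box):
    """One line of a Darknet-style label file."""
    return f"{class_id} " + " ".join(f"{value:.6f}" for value in yolo_box)


if __name__ == "__main__":
    # First ground-truth box of the 600x428 example image.
    box = corners_to_yolo(435, 22, 569, 70, img_w=600, img_h=428)
    print(format_label(0, box))
```

The class id 0 here simply assumes a single-class (Coca-Cola only) setup; with more brands each one would get its own index.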

Part of the Coca-cola image dataset

Implementations

We have an iMac Pro and an HP workstation running Windows to host the project development. Both platforms have a decent Xeon processor and plenty of memory. However, our ConvNet-heavy application requires GPU acceleration to reduce training and detection time from days and hours to hours and minutes. GPU acceleration for ML is led by Nvidia thanks to its range of graphics cards and its software support, which includes the CUDA toolkit for parallel computing / image processing and the Deep Neural Network library (cuDNN) for deep learning. GPU acceleration for ML is currently not officially possible on Mac (Nvidia and Apple haven’t found a way to work together), and we want to steer away from hacks.

The authors of YOLOv3 provide an official implementation detailed at https://pjreddie.com/darknet/yolo/. You’ll need to install Darknet, an open source neural network framework written in C and CUDA. It is easy to compile and run the baseline configuration on Mac and Linux, but there is little support for the Windows platform.

We then tested two alternative solutions.

The first is YunYang’s TensorFlow 2.0 Python implementation. It supports training, inference and evaluation. The source code is not too difficult to follow, and the author also wrote a tech blog (in Chinese) with some useful details. This “minimal” YOLO implementation is ideal for people who want to learn the code, but some useful features are missing or not implemented in full; data augmentation is one example. There are also challenges with NaN loss (when the loss becomes NaN, no learning takes place), which requires a lot of tweaking to avoid. Also, although we have an Nvidia RTX 4000 graphics card with 8GB memory, we kept running out of GPU memory whenever we increased the input image size or batch size. This is, however, not necessarily an issue specific to YunYang’s implementation.

We then switched to AlexeyAB’s Darknet C implementation, which can make use of Tensor Cores. There are instructions on how to compile the C source code for Windows. The process is quite convoluted, but the instructions are clear and easy to follow. Some training configurations require a basic understanding of the YOLO architecture to tune. It is possible to use a Python wrapper over the binary or to compile YOLO as a DLL, both of which are very handy for linking the core detection functions to user applications. There is also an extended version of YOLOv3 with five output scales rather than three, which could potentially improve small object detection.
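As an example of the kind of tuning involved, the sketch below computes two sets of values that typically need changing in a YOLOv3 .cfg file when training on custom classes. The filters formula follows directly from the architecture: each of the three anchors per output scale predicts four box coordinates, one objectness score and one score per class. The max_batches and steps values follow a rule of thumb as we recall it from the AlexeyAB repository README, so treat them as an assumption and check the README for the current guidance. The function names are ours.

```python
def yolo_head_filters(num_classes, anchors_per_scale=3):
    """Number of filters in the convolutional layer feeding each [yolo] layer.

    Every anchor predicts 4 box coordinates + 1 objectness score
    + num_classes class scores.
    """
    return (num_classes + 5) * anchors_per_scale


def suggested_schedule(num_classes, num_train_images):
    """Rule-of-thumb training length (an assumption, see lead-in):
    max_batches = classes * 2000, but at least the number of training
    images and at least 6000; learning-rate steps at 80% and 90%.
    """
    max_batches = max(num_classes * 2000, num_train_images, 6000)
    steps = (max_batches * 8 // 10, max_batches * 9 // 10)
    return max_batches, steps


if __name__ == "__main__":
    # Single-class setup; 870 is only an illustrative image count.
    print("filters =", yolo_head_filters(1))
    print("max_batches, steps =", suggested_schedule(1, 870))
```

For the standard 80-class COCO configuration the same formula gives 255 filters, which matches the value in the stock YOLOv3 configuration.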

To address the GPU out-of-memory issue, we quickly acquired an NVIDIA TITAN RTX to replace the RTX 4000. NVIDIA claims it “is the fastest PC graphics card ever built. It’s powered by the award-winning Turing architecture, bringing 130 Tensor TFLOPs of performance, 576 tensor cores, and 24 GB of ultra-fast GDDR6 memory to your PC”. It is possible to scale up to a two-card configuration with a TITAN RTX NVLINK BRIDGE. [We were lucky to have acquired this before the coronavirus shutdown in the UK…]

An image for our eyes…
