Public image datasets are very handy when it comes to ML training but at some point you’ll face a product/logo that is not covered by any existing dataset. In our case, we are experimenting with detecting Cadbury Roses and Cadbury Heroes products. We need to construct an image dataset to cover these two products.
Instagram – use an image acquisition tool such as Instaloader (https://instaloader.github.io) to fetch images based on hashtags (#cadburyheroes and #cadburyroses).
The three sources provide around 5,000 raw images with a significant number of duplicates and unrelated items. A manual process is needed to filter the dataset. Going through thousands of files is tedious, so to make things slightly easier, I made a small GUI application. When you first start the application, it prompts for your image directory. Then it loads the first image. You then use the Left and Right arrow keys to decide whether to keep the image for ML training or discard it (LEFT to skip and RIGHT to keep). No files are deleted; instead they are moved to the corresponding sub-directories “skip” and “keep”. Once one of the two arrow keys is pressed, the application loads the next image. It’s pretty much a one-hand operation so you have the other hand free to feed yourself coffee/tea… The tool is available on GitHub. It’s based on wxPython and I’ve only tested it on Mac (pythonw).
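For anyone who wants to script the same keep/skip workflow without the GUI, here is a minimal sketch of the file-moving part only. It is not the code of the actual tool: the function names, the key-to-folder mapping and the list of image extensions are my own assumptions for illustration.

```python
import shutil
from pathlib import Path

# Hypothetical mapping from arrow key to destination sub-directory.
DECISIONS = {"LEFT": "skip", "RIGHT": "keep"}


def pending_images(image_dir, extensions=(".jpg", ".jpeg", ".png")):
    """List images still waiting for a decision, in a stable order."""
    return sorted(
        p for p in Path(image_dir).iterdir()
        if p.is_file() and p.suffix.lower() in extensions
    )


def sort_image(image_path, key):
    """Move one image into the 'skip' or 'keep' sub-directory next to it."""
    image_path = Path(image_path)
    target_dir = image_path.parent / DECISIONS[key]
    target_dir.mkdir(exist_ok=True)  # created on first use
    destination = target_dir / image_path.name
    # Files are moved, never deleted, so a wrong decision can be undone.
    shutil.move(str(image_path), str(destination))
    return destination
```

A GUI (wxPython or anything else) would call pending_images once at start-up and sort_image on every key press.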
Labelling the dataset requires manual input of bounding box coordinates and label. A few tools are available including: LabelImg and Yolo_mark. I also set up “Video labelling tool” as one of the assignment topics for my CS module Media Technology. So hopefully we’ll see some interesting designs for video labelling. In this case we use Yolo_mark as it directly exports in the labelling format required by our framework.
Depending on the actual product and packaging, the logo layout and typeface vary. I separate them into four classes: the Cadbury logo, “Heroes” (including the one with the star), “Roses” in precursive (new), and “Roses” in cursive (old), coded as cadbury_logo, cadbury_heroes, cadbury_roses_precursive, and cadbury_roses_cursive.
Training has been an ongoing process to test which configurations work best for us. Normally you set the training config, dataset and validation strategy, then sit back and wait for the model performance to peak. Figures below show plots of loss and mAP@0.5 for 4000 iterations of training with input size of 608 and batch size of 32. Loss generally drops as expected and we can get mAP around 90% with careful configurations. The training process saves model weights every 1000 iterations plus the best, last and final version of weights. The training itself takes a few hours so I usually run it overnight.
The performance measures are based on our image dataset. To evaluate how the model actually performs on test videos, it is essential to do manual verification. This means feeding the videos frame by frame to the pre-loaded model and then assembling the results as videos. Because YOLO detects objects at 3 scales, the input test image size has a great influence on recall. Our experiments suggest that an input size of 1600 (for full HD videos) leads to the best results, so the input HD content is slightly downsampled and padded. The images below show the detections of multiple logos in the test video.
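To make the “downsampled and padded” step concrete, here is a small sketch of the geometry. It assumes a square network input and padding split evenly on both sides (often called letterboxing); the function names are mine and the framework’s own resizing may differ in detail.

```python
def letterbox_geometry(src_w, src_h, target=1600):
    """Fit a frame into a target x target square while keeping aspect ratio.

    Returns (scale, resized_size, padding) where padding is
    (left, top, right, bottom) in pixels.
    """
    scale = min(target / src_w, target / src_h)
    new_w, new_h = round(src_w * scale), round(src_h * scale)
    pad_x, pad_y = target - new_w, target - new_h
    left, top = pad_x // 2, pad_y // 2  # assumption: padding split evenly
    return scale, (new_w, new_h), (left, top, pad_x - left, pad_y - top)


def to_source_coords(box, scale, left, top):
    """Map (x_min, y_min, x_max, y_max) from the padded image back to the frame."""
    x_min, y_min, x_max, y_max = box
    return (
        (x_min - left) / scale,
        (y_min - top) / scale,
        (x_max - left) / scale,
        (y_max - top) / scale,
    )
```

For a 1920×1080 frame and a target of 1600, the frame is resized to 1600×900 and 350 pixels of padding are added at the top and bottom. Detections then need to be mapped back with to_source_coords before they are drawn on the original frame.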
It is clear that training a model on image dataset for video content CAN work, but there are many challenges. Many factors such as brightness, contrast, motion blur, and video compression all impact the outcomes of the detection. Some of the negative impacts can be mitigated by tuning the augmentation (to mimic how things look in motion pictures) and I suspect a lot can be done once we start to exploit the temporal relationship between video frames (instead of considering them as independent images).
Existing object detection models are mostly trained to recognise common everyday objects such as dogs, people, cars, etc. We’ll use these pre-trained models later on when we do scene/character detection. For logo detection, we need to train our own model for the logos we need to detect. The training requires a sizeable dataset labelled with ground truth (where logos appear in the images). We are to detect logos in movies and TV shows, where products are often not in perfect focus, lighting or orientation, and are sometimes obstructed by other objects. So our model needs to be trained using “in-the-wild” images in non-perfect conditions. We use the following two datasets.
Dataset 1: Logos-32plus dataset is a “publicly-available collection of photos showing 32 different logo brands (1.9GB). It is meant for the evaluation of logo retrieval and multi-class logo detection/recognition systems on real-world images”. The dataset has separate sub-folders with one subfolder per class (“adidas”, “aldi”, …). Each subfolder contains a list of JPG images. groundtruth.mat contains a MATLAB struct-array “groundtruth”, each element having the following fields: relative path to the image (e.g. ‘adidas\000001.jpg’), bboxes (bounding box format is X,Y,W,H), and logo name. The dataset contains around 340 Coca-cola logo images.
Dataset 2: The Logos in the Wild Dataset is a large-scale set of web collected images with provided logo annotations in Pascal VOC style. The current version (v2.0) of the dataset consists of 11,054 images with 32,850 annotated logo bounding boxes of 871 brands. It is an in-the-wild logo dataset where images include the logos as natural part instead of the raw original logo graphics. Images are collected by Google image search based on a list of well-known brands and companies. The bounding boxes are given in the format of (x_min, y_min, x_max, y_max) in absolute pixels.
As different object detection implementations use different methods to define bounding boxes, it is necessary to write conversion scripts to map between the bounding box definitions. This is not a horrendous task but requires basic programming skills.
We have an iMac Pro and an HP workstation with Windows OS to host the project development. Both platforms have a decent Xeon processor and plenty of memory. However our ConvNet-heavy application requires GPU acceleration to reduce training and detection time from days and hours to hours and minutes. ML GPU acceleration is led by Nvidia thanks to its range of graphics cards and software support which includes its CUDA toolkit for parallel computing / image processing and Deep Neural Network library (cuDNN) for deep learning support. GPU acceleration for ML is currently not possible on Mac officially (Nvidia and Apple haven’t found a way to work together). We also want to steer away from hacks.
The authors of YOLOv3 provide an official implementation detailed at https://pjreddie.com/darknet/yolo/. You’ll need to install Darknet, an open source neural network framework written in C and CUDA. It is easy to compile and run the baseline configuration on Mac and Linux. There is little support for the Windows platform.
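As a rough sketch of what such a conversion script does, the functions below map the two dataset formats above to the Darknet/YOLO label format (class id followed by box centre and size, all normalised by the image dimensions). I am assuming that X,Y in the Logos-32plus format is the top-left corner of the box; the function names are mine.

```python
def xywh_to_yolo(x, y, w, h, img_w, img_h):
    """Absolute (x, y, w, h), with (x, y) assumed to be the top-left corner,
    to normalised (x_centre, y_centre, width, height)."""
    return ((x + w / 2) / img_w, (y + h / 2) / img_h, w / img_w, h / img_h)


def voc_to_yolo(x_min, y_min, x_max, y_max, img_w, img_h):
    """Pascal VOC style corners in absolute pixels to the normalised YOLO box."""
    return xywh_to_yolo(x_min, y_min, x_max - x_min, y_max - y_min, img_w, img_h)


def yolo_label_line(class_id, box):
    """Format one line of a YOLO label file: '<class> <xc> <yc> <w> <h>'."""
    return f"{class_id} " + " ".join(f"{v:.6f}" for v in box)
```

For example, a 200×100 box with its top-left corner at (100, 50) in a 400×200 image becomes 0.5 0.5 0.5 0.5 in YOLO coordinates, whichever of the two source formats it came from.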
We then tested two alternative solutions.
The first is YunYang’s TensorFlow 2.0 Python implementation. It has support for training, inference and evaluation. The source code is not too difficult to follow and the author also wrote a tech blog (in Chinese) with some useful details. This “minimal” YOLO implementation is ideal for people who want to learn the code but some useful features are missing or not implemented in full. Data augmentation is an example. There are also challenges with NaN loss (no loss outputs then no learning) that require a lot of tweaking. Also, although we have an Nvidia RTX 4000 graphics card with 8GB memory, we kept running out of GPU memory when we increased the input image size or batch size. This is however not necessarily an issue linked to YunYang’s implementation.
We then switched to AlexeyAB’s Darknet C implementation that uses Tensor cores. There are instructions on how to compile the C source code for Windows. The process is quite convoluted but the instructions are clear to follow. Some training configurations require a basic understanding of the YOLO architecture to tune. It is possible to use a Python wrapper over the binary or compile YOLO as a DLL, both of which will be very handy to link the detection core functions to user applications. There is also an extended version of YOLOv3 with 5 size outputs rather than 3 to potentially improve small object detection.
To address the GPU out-of-memory issue, we quickly acquired an NVIDIA TITAN RTX to replace the RTX 4000. NVIDIA claims it “is the fastest PC graphics card ever built. It’s powered by the award-winning Turing architecture, bringing 130 Tensor TFLOPs of performance, 576 tensor cores, and 24 GB of ultra-fast GDDR6 memory to your PC”. It is possible to scale up and have a two-card configuration with a TITAN RTX NVLINK BRIDGE. [we are lucky to have acquired this before the Coronavirus shutdown in the UK…]
This blog series discusses some R&D work within an ongoing “Big Idea” project. The project is in collaboration with the Big Film Group Ltd, a leading Product Placement Agency working with Blue Chip clients across UK and International entertainment properties.
Currently, as part of its service, the company offers clients an evaluation of the impact of product placement on TV programmes and films. The service is essential to the customer experience and the growth of business. The evaluation is carried out via human inspection of programmes to mark all corresponding appearances and mentions within the content of broadcast media, mainly TV and film. This is a manual operation which is time-consuming and costly, and requires intense concentration. We believe that this whole process can eventually be automated using media processing techniques, AI and machine learning. [project design document]
The long-term goal would be to offer a monitoring service across all broadcast media which would allow agencies and their clients to know where, when and how their brands and companies are being talked about on air. For PR, Advertising and Social Media agencies this information would be particularly valuable. There is no existing solution readily available and we believe that there would be high demand for information and services of this nature. [project design document]
The first phase of the project is to prototype a core function: product detection. We want something that can detect Coca-cola products in sample videos provided by the company. The H.264/AAC encoded sample videos are roughly 40 seconds long, with a resolution of 1920×1080 and a frame rate of 25 fps. Coca-cola products appear at various points of the sample videos for durations of between half a second and several seconds.
It is quite clear that we can map the core function to an ML object detection problem. Object detection has seen major and successful developments in the past 5 years. So our focus is to analyse the requirements and pick the best of the existing frameworks to develop a working solution.
Requirement 1: Logo detection. To simplify the solution, we start with logo detection. This means that we do not differentiate different products/packages of the same brand nor their colours. So Coca-cola cans and glass bottles are considered the same.
Requirement 2: Accuracy. The goal is to reach near human-level accuracy overall, but there are some major differences between the two. With human inspection, we expect few false positive (FP) detections but a degree of false negatives (FN) when very brief appearances of a product are not picked up by human eyes. For an ML-based solution, detection can be carried out frame-by-frame but there is a good chance of both FP and FN.
Requirement 3: Speed. As we are prototyping for video, the speed of the framework is important. There is no hard requirement on processing frame rate but few people would like to wait an hour or two for each TV show or movie to be processed.
Requirement 4: End system. We are setting no constraint on end systems (both for training and runtime). We’ll develop the application on a physical workstation (not virtualised) while assuming a similar system will be available at runtime. It is possible to move the system at runtime to the cloud.
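One common way to put numbers on the FP/FN trade-off is precision and recall. The sketch below uses the standard definitions; the function name is mine and the counts used in the note after the code are made up for illustration, not measurements from the project.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN).

    Returns 0.0 for a measure whose denominator would be zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```

With these definitions, a human inspector who never raises a false alarm but misses 10 out of 100 brief appearances has a precision of 1.0 and a recall of 0.9, while a detector with 80 hits, 20 false alarms and 20 misses scores 0.8 on both.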
Two things constitute an object detection task: localisation (where things are) and classification (what they are). 1) Localisation predicts the coordinates of a bounding box that contains an object (and the likelihood of an object existing in that box). Different frameworks may use different coordinate systems such as (x_min, y_min, x_max, y_max) or (x_centre, y_centre, width, height). 2) Classification tells us the probability of the object in the bounding box belonging to each of a set of pre-defined classes OR a distribution of probabilities over the set of pre-defined classes when the classes are exclusive (i.e., when the object cannot be associated with multiple classes).
There are two main ML framework families for object detection: Region-Based Convolutional Neural Networks (R-CNNs) and You Only Look Once (YOLO). Both have seen some major updates in the past few years. Without going into the technical details too much, I’ll compare the two and discuss the reasons for our choice.
R-CNN is one of the first end-to-end working solutions for object detection. It selects regions that likely contain objects using selective search, a greedy search process that finds all possible regions and selects the 2,000 best ones (in the format of coordinates). The selected regions then go through ConvNet feature extraction before a separate classifier makes predictions for each region. R-CNN splits key functions into independent modules, which is a reasonable choice for prototyping, and it has shown relatively good performance. The main issue of R-CNN is its speed. As thousands of regions go through the ConvNet for each image, the process can be extremely slow. Processing a single image can take tens of seconds.
Fast R-CNN and Faster R-CNN introduced some significant architectural changes in order to improve the efficiency of the process (hence the names). The changes include shifting the ConvNet to an earlier stage of the process so there is less (no) overlap on ConvNet operations over each image. The functionality of the ConvNet is also extended beyond the initial feature extraction to support region proposal (Region Proposal Network (RPN)) as well as classification (replacing SVM with ConvNet+activation function such as softmax). As a result, the architectural components also become more integrated. Faster R-CNN can process an image in less than a second. In summary, the R-CNN family started from a good performance baseline then gradually improved its speed to achieve “real-time” detection. Detectron (Mask R-CNN) is a good starting point to test out recent developments on R-CNN.
Compared with R-CNN, YOLO is designed for speedy detection when accuracy is not mission critical. Instead of searching for appearance of objects in every possible location, YOLO uses a grid-based search. The grid fixes the anchor points in each image and a number of bounding boxes (such as 3) are created at each anchor point. The grid size is determined by the stride when convolution operations are applied to the image. So for a 416 x 416 image, a stride of 16 will result in a grid of 26 x 26. A large stride means a greater reduction to the feature image dimension, hence it allows the bounding boxes to cover large objects. This design is inspired by the Inception model behind GoogLeNet. Instead of constructing a very deep sequential CNN and relying on small features to build up larger features, filters of different sizes operate in parallel and the results are concatenated. This is similar to having telephoto, prime, and wide-angle camera lenses on your smartphone shooting at the same time, so you are picking up small, medium and large objects in one shot.
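The two classification options are commonly implemented with a per-class sigmoid (independent probabilities, so an object can carry several labels) and a softmax (one distribution over exclusive classes). Here is a plain-Python sketch of both; the function names are mine and the raw scores are assumed to come from the last layer of some classifier.

```python
import math


def sigmoid_scores(logits):
    """Independent per-class probabilities; they need not sum to 1."""
    return [1 / (1 + math.exp(-z)) for z in logits]


def softmax_scores(logits):
    """A single probability distribution over mutually exclusive classes."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

With softmax, raising the score of one class necessarily lowers the probability of the others, which is only appropriate when the classes really are exclusive.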
The standard configuration of YOLO has three stride sizes 32, 16, and 8 (which map to grids of 13 x 13, 26 x 26 and 52 x 52 for a 416 x 416 image), responsible for objects of large, medium and small sizes respectively. So the three grids will generate 13 x 13 x 3 + 26 x 26 x 3 + 52 x 52 x 3 = 10,647 bounding boxes as a fixed and manageable starting point. Because we are doing a sampled search and not a full search, some objects might be missed. But that’s the cost of a speed-first approach. In fact, the stride-based dimension reduction (and not ConvNet+maxpooling) is also a choice for speed and not for accuracy.
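The grid and box arithmetic is easy to check for other input sizes. This is a small sketch with a function name of my own, assuming the input size is a multiple of the largest stride and 3 boxes per grid cell.

```python
def yolo_grids(input_size=416, strides=(32, 16, 8), boxes_per_cell=3):
    """Return the grid size per stride and the total number of candidate boxes."""
    grids = [input_size // s for s in strides]
    total = sum(g * g * boxes_per_cell for g in grids)
    return grids, total


# 416 x 416 input: grids of 13, 26 and 52 cells per side, 10647 boxes in total
# 608 x 608 input: grids of 19, 38 and 76 cells per side, 22743 boxes in total
```

The number of candidate boxes grows with the square of the input size, which is one reason why larger inputs cost more time and GPU memory.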
YOLO has three major releases: You Only Look Once: Unified, Real-Time Object Detection; YOLO9000: Better, Faster, Stronger; and YOLOv3: An Incremental Improvement. Each version is an attempt to improve model performance while maintaining the speed for real-time object detection. YOLOv3 uses a deep 53-layer ConvNet, Darknet-53, for feature extraction followed by another 53-layer ConvNet for detection at three size levels. So the 106-layer architecture is fully convolutional (FCN) and does not contain any conventional Dense layers. A connected Dense layer requires input data to be flattened, so it limits the size of input images. An FCN design gives us the freedom to use any input image size (not without its own problems), a key feature for dealing with high res content such as HDTV.
The figure above compares the performance (Mean Average Precision, mAP) and speed of some modern object detection models. mAP is a measurement that factors in both localisation accuracy (IoU) and classification accuracy (Precision-Recall Curve). COCO is a tough dataset to get good mAP on, so anything above 50 mAP@0.5 is considered amazing. YOLOv3 clearly shows its advantages in speed (the authors put them “off the chart” to make a point…) while its performance is on a par with others. It is also noticed that larger input images (such as 608 x 608) can help with YOLOv3’s performance with some penalty on speed.
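For reference, IoU (Intersection over Union) is the overlap between a predicted box and the ground-truth box divided by the area they cover together. A minimal sketch, assuming boxes are given as (x_min, y_min, x_max, y_max) and using a function name of my own:

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap; zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0
```

The “0.5” in mAP@0.5 is the IoU threshold: a detection only counts as correct if its IoU with a ground-truth box of the same class is at least 0.5.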
It is important to point out that the performance comparison from the related work may not apply to our problem space. These models are likely to behave differently over high res image data extracted from video content. Based on the project requirements, phase 1 will have YOLOv3 as our reference framework.
The new year started with a couple of interesting projects on AI.
In a HEIF-funded “Big Ideas” project, I am working with a brand placement company to prototype a solution that uses computer vision and deep learning to automate the evaluation of how brands and products appear in movies and TV shows. This is to hopefully assist if not replace the daunting manual work of a human evaluator. Measuring the impact of brand placement is a complex topic and it is underpinned by the capabilities of detecting the presence of products. Object detection (classification + localisation) is a well researched topic with many established deep learning frameworks available. Our early prototypes, which use YOLOv3-based CNN (convolutional neural network) structures and are trained on the FlickrLogos-32 dataset, have shown promising outcomes. There is a long list of TODOs linked to gamma correction, motion complexity, etc.
Our analysis of eye gaze and body motion data from a previous VR experiment continues. The main focus is on feature extraction, clustering and data visualisation. There are quite a few interesting observations made by a PhD researcher on how men and women behave differently in VR and how this could contribute to an improved measurement of user attention.
The research on human attention in VR is not limited to passive measurement and we already have some plans to experiment with creative art. We spent hours observing how young men and women interact with VR paintings, which has inspired us to develop generative artworks that capture user experience of art encounters. Our first VR generative art demo will be hosted in Milton Keynes Gallery project space in Feb 2020 as part of Alison Goodyear’s Paint Park exhibition. My SDCN project has been supporting the research as part of its Connected VR use case.
I am thrilled to join the editorial board of Springer’s Multimedia Systems journal. Since 1993, Multimedia Systems has been a leading journal in the field covering enduring topics related to multimedia computing, AI, human factors, communication, and applications. The world of multimedia and computing is constantly evolving. I am really looking forward to working with other editors, reviewers, and authors to get the best research and engineering papers to our readers as quickly as we can while maintaining a high publishing standard.
This journal details innovative research ideas, emerging technologies, state-of-the-art methods and tools in all aspects of multimedia computing, communication, storage, and applications. It features theoretical, experimental, and survey articles.
Coverage in Multimedia Systems includes:
Integration of digital video and audio capabilities in computer systems
Multimedia information encoding and data interchange formats
Operating system mechanisms for digital multimedia
Digital video and audio networking and communication
Storage models and structures
Methodologies, paradigms, tools, and software architectures for supporting multimedia applications
Multimedia applications and application program interfaces, and multimedia end system architectures.
In Part 1, I introduced the architecture and showed some sample charts of my Smart campus project. The non-intrusive use of Wi-Fi data for campus services and student experience is really cool.
As we are approaching the start of university term, I have less time to work on this project. So my focus was to prototype a “student-facing” application that visualises live building information. The idea is that students can tell which computing labs are free, where to find quiet study areas or check if the student helpdesk is too busy to visit. The security team can also use it to see if there is any abnormal activity at certain times of the day.
The chart below shows a screenshot of a live floor heatmap with breakdowns of lecture rooms (labelled white), study areas (also labelled white), staff areas (labelled black), and service areas (labelled grey).
Technically the application is split into three parts: user facing front-end (floor chart), data feed (JSON feed) and backend (data processing). The data feed layer provides the necessary segregation so that user requests don’t trigger backend operations directly.
The front-end chart is still based on the Highcharts framework, though I needed to manually draw the custom map using Inkscape based on the actual floor map, export the map as SVG, and convert it to map JSON using Highcharts’ online tool. At the same time, the mapping between areas (e.g., lecture rooms) and their corresponding APs must also be recorded in the database. This is a very time-consuming process that requires a bit of graphic editing skills and a lot of patience.
The backend functions adopt a “10 minute moving average window” and periodically calculate the AP/area device population to generate data for each area defined in the custom floor map. I also filtered out devices that are simply passing by APs to reduce noise in data (e.g., a person walking along the corridor will not leave a trace). The data is then merged with the floormap JSON to generate the data feed every few minutes in static JSON file format.
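A rough sketch of the kind of calculation involved is below. It is not the project’s backend code: the record layout (timestamp in seconds, device id, area), the function names and the rule for dropping passers-by (fewer than two sightings in the same area within the window) are all my assumptions for illustration.

```python
from collections import defaultdict

WINDOW_SECONDS = 10 * 60  # the 10-minute window


def area_population(sightings, now, window=WINDOW_SECONDS, min_sightings=2):
    """Count devices per area inside the window ending at `now`.

    sightings: iterable of (timestamp, device_id, area) tuples.
    Devices seen fewer than `min_sightings` times in an area are treated
    as passing by and ignored (hypothetical noise filter).
    """
    hits = defaultdict(int)
    for ts, device, area in sightings:
        if now - window < ts <= now:
            hits[(area, device)] += 1
    population = defaultdict(int)
    for (area, _device), count in hits.items():
        if count >= min_sightings:
            population[area] += 1
    return dict(population)


def moving_average(counts, size=10):
    """Average of the most recent `size` samples (e.g. one sample per minute)."""
    recent = counts[-size:]
    return sum(recent) / len(recent) if recent else 0.0
```

The resulting per-area numbers are what would be merged with the floor map JSON to produce the static data feed.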
A finishing touch is the chart annotation for most floor areas. I use different label colours so that areas with different functions can be clearly identified.