Product detection in movies and TV shows using machine learning – part 5: Finding sweeties quick

In Part 4, I made a start on a new training dataset by harvesting publicly accessible photos from social media. The main benefit of user-generated content is that the photos were taken in real-world settings, so they are close to what the target logos would look like in a film. For content selection and labelling, my own filtering tool and Yolo_mark worked pretty well. It wasn’t easy to label 600+ images but the workflow is decent. The three classes are: 0 – Cadbury, 1 – ROSES, and 2 – HEROES. There are some typeface variations of ROSES. You need to be patient and consistent with the labelling strategy. As humans, we are able to draw on information from different sources very quickly when making a decision. So if I were actively looking for a particular logo while knowing that the logo is definitely present, I could still point at an unidentifiable blob of pixels and be 100% certain that it’s a Cadbury logo on a discarded purple wrapper. It may not be realistic to expect a “low-level” machine learning model with a small training set to capture what a human could do in this case. Therefore I limited the labelling to only the logos that I could visually identify directly.

Labelling a photo using Yolo_mark (Image copyrights belong to its owner)

The training process wasn’t much different from the previous modelling for the Coca-cola logo, except for some further tidying of the dataset (minor issues with missing files, etc.). With a baseline configuration, it took about 6 hours to complete 6000 epochs, with a pretty good result based on the detection of three logos.

The images below illustrate what the model picks up from some standard photos (using the slider to see “before” and “after”).

logo detection

Another example:

logo detection

I’ve also tested the model on some videos provided by our partner. I won’t be able to show them here due to copyright, but it’s safe to say that the model works very well, with room for improvement. Some adjustments can be made on the modelling side, such as increasing the size of training images (currently downsampled to 608×608), increasing the number of detection layers to accommodate a larger range of logo sizes, or perhaps giving the new YOLOv4 a go!

This update concludes the “Product detection in movies and TV shows using machine learning” series. The dataset used for Cadbury, Roses, and Heroes training will be made public for anyone interested in giving it a go or expanding their own logo detector. I am still pushing this topic forward and will start a new series soon!

Product detection in movies and TV shows using machine learning – part 1: Background

Product detection in movies and TV shows using machine learning – part 2: Dataset and implementations

Product detection in movies and TV shows using machine learning – part 3: Training and Results

Product detection in movies and TV shows using machine learning – part 4: Start a new dataset

Smart Campus project – part 3 (COVID-19) – in progress – [10 June 2020]

I have been playing with the project data to study the impact of COVID-19 social distancing / lockdown on the university, especially the use of campus facilities. Meanwhile there is some time series analysis and behavioural modelling that I’d like to complete sooner rather than later. Everything has taken me so much longer than I planned. Here are some previews followed by moaning, i.e., the painful processes to generate these.

Number of connected devices on campus

The above shows some regular patterns of crowd density and how the numbers dropped due to COVID-19 lockdown. Students started to reduce their time on campus in the week prior to the official campus closure.


The autocorrelation graph shows a possible daily pattern (the data is resampled at a 5-minute interval, so 288 samples make a day, hence the location of the first peak).
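To make the 288-sample peak concrete, here is a minimal sketch of the autocorrelation check. It assumes nothing about the real data: the series is a synthetic sine wave standing in for the device counts, and the helper name is made up for illustration.

```python
import math

SAMPLES_PER_DAY = 24 * 60 // 5  # 288 samples at a 5-minute interval


def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at a given positive lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    cov = sum((series[i] - mean) * (series[i + lag] - mean) for i in range(n - lag))
    return cov / var


# Synthetic stand-in for device counts: one week with a clean daily cycle.
counts = [
    100 + 80 * math.sin(2 * math.pi * t / SAMPLES_PER_DAY)
    for t in range(7 * SAMPLES_PER_DAY)
]

# Search for the strongest correlation between half a day and a day and a half.
lags = range(SAMPLES_PER_DAY // 2, 3 * SAMPLES_PER_DAY // 2)
best_lag = max(lags, key=lambda k: autocorrelation(counts, k))
print(best_lag)
```

On real data the peak is noisier, but a daily rhythm should still put the first strong peak at or near lag 288.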

Seasonal decomposition

Seasonal decomposition based on the hypothesis of a weekly pattern. There is also a strong hourly pattern, which I’ll explain in the paper (not written yet!).

A comparison of area crowd density dynamics on one floor of an academic building. From left to right: pre-lockdown baseline, near-lockdown reduced social contact, and working/studying from home during lockdown.

The charts above show the area crowd density dynamics on one floor of an academic building. The one on the left shows how an academic workspace, a few classrooms and study areas were used during a normal week, when few people in the UK felt that COVID-19 was relevant to them. The middle one shows the week when there were increasing reports of COVID-19 cases in the UK and the government was changing its tone and advising social distancing. Staff and students reduced their hours on campus. The one on the right shows a week during university closure (building still accessible for exceptional purposes).

Using the system to monitor real-time crowd data provides a lot of insights but it’s somewhat passive. It’s the modelling, simulation and predictions that make the system truly useful. I have done some work on this and I’ll gradually update this post with some analysis results:

The first thing I tried was standard time-series analysis. A lot of people don’t think it’s a big deal but it’s tricky to get things right. There are many models to try and they are all based on the assumption that we can predict future data based on previous observations. ARIMA (Auto Regressive Integrated Moving Average) is a common time-series analysis method characterised by 3 terms: p, d, q. Together they determine which part(s) of the observed data to use and how to adjust the data (based on how things change over time) to form a prediction. The seasonal variant of ARIMA (SARIMA) introduces additional seasonal terms to capture seasonal differences. Our campus WIFI data is not only non-stationary but also has multiple seasonalities embedded: at a high level, the university has terms, each term has a start and end with special activities, each week of the term has weekdays and weekends, and each weekday has lecturing hours and non-lecturing hours. The standard SARIMA can only capture one seasonality but it will be our starting point to experiment with crowd predictions.

SARIMA prediction

The figure above shows the predictions of campus occupancy level on Friday 6th March. The blue curve plots the observed data for that week (Monday to Friday). The green curve depicts the “intra-week” predictions based on data observed during the same week, i.e., using Monday-Thursday’s data to guess Friday’s data. This method can respond to extraordinary situations in a particular week. If we chain all weekday data then in theory it’s possible to make a prediction for any weekday of the week, practically ignoring the differences across weekdays. However, we know that people’s activities across weekdays are not entirely identical. Students have special activities on Wednesdays and everyone tries to finish early on Fridays. This explains why the intra-week predictions overestimate the occupancy level for Friday afternoon. The orange curve gives the “inter-week” prediction based on the previous four Fridays. This method captures normal activities on Fridays but is agnostic to week-specific changes (e.g., the week prior to exams). Balancing intra- and inter-week predictions using a simple element-wise mean, the red curve shows the “combined” prediction. For this particular prediction exercise, the combined method does not show a better MSE compared with the inter-week version, partially due to the overestimates.
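The combining step and the MSE comparison can be sketched in a few lines. The numbers below are made up to mimic a Friday afternoon where the intra-week model overestimates; they are not project data, and the function names are only for illustration.

```python
def combine(intra_week, inter_week):
    """Element-wise mean of two equally long prediction series."""
    if len(intra_week) != len(inter_week):
        raise ValueError("predictions must cover the same time steps")
    return [(a + b) / 2 for a, b in zip(intra_week, inter_week)]


def mse(predicted, observed):
    """Mean squared error between a prediction and the observed series."""
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed)


# Illustrative values only.
observed = [900, 700, 500, 300]
intra = [950, 850, 700, 500]  # overestimates the early Friday finish
inter = [910, 690, 520, 310]

combined = combine(intra, inter)
print(mse(intra, observed), mse(inter, observed), mse(combined, observed))
```

With these toy values the combined curve sits between the two inputs, so its error is lower than the intra-week one but higher than the inter-week one, which is the same ordering described above.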

SARIMA prediction

The figure above shows the week prior to the university’s closure in response to COVID-19. This week is considered an “abnormal” week as students and staff started to spend more time studying or working from home. In this case, the intra-week model successfully captures the changes in that week. There must be a better way to balance the two models to take the best from both worlds but I will try other options first.

All the modelling above was done using pmdarima, a Python implementation of R’s auto.arima feature. To speed up the process, the data was subsampled to a 30-minute interval. The number of observations per seasonal cycle m was set to 48 (24 hours x 2 samples per hour) to define a daily cycle.
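A rough sketch of that setup follows. Two things are assumptions on my side: the subsampling here averages blocks of six 5-minute samples (picking every sixth sample would be an equally plausible reading), and the helper names are hypothetical. The pmdarima call uses auto_arima with its seasonal and m parameters and is wrapped in a function so the rest runs without the library installed.

```python
SAMPLES_PER_DAY_30MIN = 24 * 2  # m = 48 observations per daily cycle


def subsample(series, factor=6):
    """Average consecutive blocks of `factor` samples (5 min x 6 = 30 min)."""
    usable = len(series) - len(series) % factor
    return [sum(series[i:i + factor]) / factor for i in range(0, usable, factor)]


def fit_daily_sarima(series, m=SAMPLES_PER_DAY_30MIN):
    """Fit a seasonal model with a single daily cycle (requires pmdarima)."""
    import pmdarima as pm  # imported lazily so subsample() works without it

    return pm.auto_arima(series, seasonal=True, m=m, suppress_warnings=True)


# One day of placeholder 5-minute samples, not real device counts.
five_minute = list(range(288))
half_hourly = subsample(five_minute)
print(len(half_hourly))
```

With real counts, something like model = fit_daily_sarima(half_hourly_counts) followed by model.predict(n_periods=48) would give a one-day-ahead forecast.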

[more to come soon]

Some technical details…

  • The main tables on the live DB have 500+ million records (which take about 300 GB of space). It took a week to replicate them on a second DB so I can decouple the messy analysis queries from the main one.
  • A few Python scripts correlated loose data in the DB, which got me a 150+ GB CSV file for time series analysis. From there, the lovely Pandas happily chews the CSV like it’s bamboo shoots.
  • The crowd density floor map was done for live data (10 minute coverage). To reprogramme it for historical data and generate the animation above, a few things had to be done:
    • A Python script ploughed through the main DB table (yes, the one with 500 million records) and derived area density at a 10-minute interval. The script also did a few other things at the same time so the whole thing took a few hours.
    • A new PHP page loaded the data in, then some JavaScript looped through the data and displayed the charts dynamically. It’s mainly setInterval() calling Highcharts’ setData, removeAnnotation and addAnnotation.
    • To save the results as videos / GIFs, I tested screen capturing, which turned out to be OK but recording the screen just didn’t feel right to me. So I went down the route of image exporting and merging. Highcharts’ offline exporting exportChartLocal() works pretty well within the setInterval() loop until a memory issue appeared. Throwing in a few more lines of JavaScript to stagger the exporting “fixed” the issue. FFMPEG was brought in to convert the image sequence to video.
    • Future work will get all these tidied up and automated.
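The 10-minute area density step in the list above can be sketched roughly as follows. The record layout (timestamp, area, device id) is hypothetical and stands in for the real DB table, and the sample records are invented.

```python
from collections import defaultdict
from datetime import datetime


def bucket_start(ts):
    """Round a timestamp down to the start of its 10-minute bucket."""
    return ts.replace(minute=ts.minute - ts.minute % 10, second=0, microsecond=0)


def area_density(records):
    """Count distinct devices per (area, 10-minute bucket).

    `records` is an iterable of (timestamp, area, device_id) tuples;
    this schema is an assumption, not the project's actual one.
    """
    seen = defaultdict(set)
    for ts, area, device in records:
        seen[(area, bucket_start(ts))].add(device)
    return {key: len(devices) for key, devices in seen.items()}


records = [
    (datetime(2020, 3, 6, 10, 1), "lab-1", "aa"),
    (datetime(2020, 3, 6, 10, 4), "lab-1", "aa"),  # same device, counted once
    (datetime(2020, 3, 6, 10, 9, 59), "lab-1", "bb"),
    (datetime(2020, 3, 6, 10, 10), "lab-1", "bb"),  # falls in the next bucket
    (datetime(2020, 3, 6, 10, 3), "cafe", "cc"),
]
density = area_density(records)
```

Counting distinct devices rather than raw records avoids inflating the density when one device reports several times within a bucket.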

[to be continued]

Product detection in movies and TV shows using machine learning – part 4: Start a new dataset

Public image datasets are very handy when it comes to ML training but at some point you’ll face a product/logo that is not covered by any existing dataset. In our case, we are experimenting with detecting Cadbury Roses and Cadbury Heroes products. We need to construct an image dataset to cover these two products.

Two steps to put the elephant in the fridge:

Open the fridge

There are a few sources for in-the-wild images:

  1. Google Images – search “Cadbury Heroes” and “Cadbury Roses”, then use a downloading tool such as Download All Images to fetch all image results.
  2. Flickr – same process as above.
  3. Instagram – use an image acquisition tool such as Instaloader to fetch images based on hashtags (#cadburyheroes and #cadburyroses).

The three sources provide around 5,000 raw images with a significant number of duplicates and unrelated items. A manual process is needed to filter the dataset. Going through thousands of files is tedious, so to make things slightly easier, I made a small GUI application. When you first start the application, it prompts for your image directory. Then it loads the first image. You then use the Left and Right arrow keys to decide whether to keep the image for ML training or discard it (LEFT to skip and RIGHT to keep). No files are deleted; instead they are moved to the corresponding sub-directories “skip” and “keep”. Once one of the two arrow keys is pressed, the application loads the next image. It’s pretty much a one-hand operation so you have the other hand free to feed yourself coffee/tea… The tool is available on Github. It’s based on wxPython and I’ve only tested it on Mac (pythonw).
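The keep/skip behaviour boils down to a file move per key press. Below is a minimal sketch of that core step without the wxPython GUI; it is not the code from the actual tool, and the function name and demo file are made up.

```python
import shutil
import tempfile
from pathlib import Path


def triage(image_path, keep):
    """Move an image into a "keep" or "skip" sub-directory next to it."""
    image_path = Path(image_path)
    target_dir = image_path.parent / ("keep" if keep else "skip")
    target_dir.mkdir(exist_ok=True)
    destination = target_dir / image_path.name
    shutil.move(str(image_path), str(destination))
    return destination


# Demo on a throwaway directory so no real files are touched.
with tempfile.TemporaryDirectory() as folder:
    sample = Path(folder) / "roses_001.jpg"
    sample.write_bytes(b"placeholder")
    moved = triage(sample, keep=True)  # what a RIGHT arrow press would do
    kept_ok = moved.exists() and not sample.exists()
    relative = moved.relative_to(folder).as_posix()
```

Because files are moved rather than deleted, a wrong key press can be undone by dragging the file back.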

Insert elephant

Labelling the dataset requires manual input of bounding box coordinates and label. A few tools are available including: LabelImg and Yolo_mark. I also set up “Video labelling tool” as one of the assignment topics for my CS module Media Technology. So hopefully we’ll see some interesting designs for video labelling. In this case we use Yolo_mark as it directly exports in the labelling format required by our framework.

Depending on the actual product and packaging, the logo layout and typeface vary. I am separating them into four classes: the Cadbury logo, “Heroes” (including the one with the star), “Roses” in precursive (new), and “Roses” in cursive (old), coded as cadbury_logo, cadbury_heroes, cadbury_roses_precursive, and cadbury_roses_cursive.

Product detection in movies and TV shows using machine learning – part 3: Training and Results

Training has been an ongoing process to test which configurations work best for us. Normally you set the training config, dataset and validation strategy, then sit back and wait for the model performance to peak. The figures below show plots of loss and mAP@0.5 for 4000 iterations of training with an input size of 608 and a batch size of 32. Loss generally drops as expected and we can get an mAP of around 90% with careful configuration. The training process saves model weights every 1000 iterations plus the best, last and final versions of the weights. The training itself takes a few hours so I usually run it overnight.

The performance measures are based on our image dataset. To evaluate how the model actually performs on test videos, it is essential to do manual verification. This means feeding the videos frame by frame to the pre-loaded model and then assembling the results as videos. Because YOLO detects objects at 3 scales, the input test image size has a great influence on recall. Our experiments suggest that an input size of 1600 (for full HD videos) leads to the best results. So the input HD content is slightly downsampled and padded. The images below show the detections of multiple logos in the test video.
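To see what “slightly downsampled and padded” means in numbers, here is a small sketch of the geometry. It assumes a square 1600 x 1600 network input with the frame centred on the canvas; the actual padding strategy in our pipeline may differ.

```python
def letterbox_geometry(src_w, src_h, target=1600):
    """Scale a frame to fit a square `target` canvas and centre it with padding.

    Returns (new_width, new_height, pad_x, pad_y), where pad_x and pad_y are
    the padding on each side horizontally and vertically.
    """
    scale = min(target / src_w, target / src_h)
    new_w, new_h = round(src_w * scale), round(src_h * scale)
    pad_x, pad_y = (target - new_w) // 2, (target - new_h) // 2
    return new_w, new_h, pad_x, pad_y


full_hd = letterbox_geometry(1920, 1080)
print(full_hd)
```

For a full HD frame this gives a 1600 x 900 image with 350 pixels of padding above and below.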

Detection results (copyrights belong to owner)
Detection results (copyrights belong to owner)

It is clear that training a model on image dataset for video content CAN work, but there are many challenges. Many factors such as brightness, contrast, motion blur, and video compression all impact the outcomes of the detection. Some of the negative impacts can be mitigated by tuning the augmentation (to mimic how things look in motion pictures) and I suspect a lot can be done once we start to exploit the temporal relationship between video frames (instead of considering them as independent images).

Product detection in movies and TV shows using machine learning – part 2: Dataset and implementations

Dataset for training

Existing object detection models are mostly trained to recognise common everyday objects such as dogs, people, cars, etc. We’ll use these pre-trained models later on when we do scene/character detection. For logo detection, we need to train our own model for the logos we need to detect. The training requires a sizeable dataset labelled with ground truth (where logos appear in the images). We are detecting logos in movies and TV shows, where products are often not in perfect focus, lighting conditions or orientation, and are sometimes obstructed by other objects. So our model needs to be trained using “in-the-wild” images in non-perfect conditions. We use the following two datasets.

Dataset 1: Logos-32plus dataset is a “publicly-available collection of photos showing 32 different logo brands (1.9GB). It is meant for the evaluation of logo retrieval and multi-class logo detection/recognition systems on real-world images”. The dataset has separate sub-folders with one subfolder per class (“adidas”, “aldi”, …). Each subfolder contains a list of JPG images. groundtruth.mat contains a MATLAB struct-array “groundtruth”, each element having the following fields: relative path to the image (e.g. ‘adidas\000001.jpg’), bboxes (bounding box format is X,Y,W,H), and logo name. The dataset contains around 340 Coca-cola logo images.


Dataset 2: The Logos in the Wild Dataset is a large-scale set of web collected images with provided logo annotations in Pascal VOC style. The current version (v2.0) of the dataset consists of 11,054 images with 32,850 annotated logo bounding boxes of 871 brands. It is an in-the-wild logo dataset where images include the logos as natural part instead of the raw original logo graphics. Images are collected by Google image search based on a list of well-known brands and companies. The bounding boxes are given in the format of (x_min, y_min, x_max, y_max) in absolute pixels.

The dataset does not provide the actual images but URLs to fetch images from various online sources, so one must write a script to download images from the URLs. Not all URLs are valid and we extracted around 530 Coca-cola logo images. The image (600×428) below contains three logos and the ground truth is (435, 22, 569, 70), (308, 274, 351, 292), and (209, 225, 245, 243).

As different object detection implementations use different methods to define bounding boxes, it is necessary to write conversion scripts to map between bounding box definitions. This is not a horrendous task but requires basic programming skills.
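A conversion script of this kind can be very small. The sketch below maps Pascal VOC style corners to the normalised centre format used by Darknet label files, using the first ground-truth box from the 600×428 example image above. Treating the X,Y of the Logos-32plus X,Y,W,H format as the top-left corner is an assumption.

```python
def voc_to_yolo(box, img_w, img_h):
    """(x_min, y_min, x_max, y_max) in pixels -> normalised (xc, yc, w, h)."""
    x_min, y_min, x_max, y_max = box
    return (
        (x_min + x_max) / 2 / img_w,
        (y_min + y_max) / 2 / img_h,
        (x_max - x_min) / img_w,
        (y_max - y_min) / img_h,
    )


def xywh_to_voc(box):
    """(X, Y, W, H), assuming a top-left origin -> (x_min, y_min, x_max, y_max)."""
    x, y, w, h = box
    return (x, y, x + w, y + h)


# First ground-truth box of the 600x428 example image.
yolo_box = voc_to_yolo((435, 22, 569, 70), img_w=600, img_h=428)
print(" ".join(f"{v:.6f}" for v in yolo_box))
```

Prefixing the printed values with a class index gives one line of a Darknet-style label file.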

Part of the Coca-cola image dataset


We have an iMac Pro and an HP workstation with Windows OS to host the project development. Both platforms have a decent Xeon processor and plenty of memory. However our ConvNet-heavy application requires GPU acceleration to reduce training and detection time from days and hours to hours and minutes. ML GPU acceleration is led by Nvidia thanks to its range of graphics cards and software support, which includes its CUDA toolkit for parallel computing / image processing and its Deep Neural Network library (cuDNN) for deep learning support. GPU acceleration for ML is currently not officially possible on Mac (Nvidia and Apple haven’t found a way to work together). We also want to steer away from hacks.

The authors of YOLOv3 provide an official implementation. You’ll need to install Darknet, an open source neural network framework written in C and CUDA. It is easy to compile and run the baseline configuration on Mac and Linux. There is little support for the Windows platform.

We then tested two alternative solutions.

The first is YunYang’s TensorFlow 2.0 Python implementation. It has support for training, inference and evaluation. The source code is not too difficult to follow and the author also wrote a tech blog (in Chinese) with some useful details. This “minimal” YOLO implementation is ideal for people who want to learn the code but some useful features are missing or not implemented in full. Data augmentation is an example. There are also challenges with NaN loss (no loss output means no learning) that require a lot of tweaking. Also, although we have an Nvidia RTX 4000 graphics card with 8GB memory, we kept running out of GPU memory when we increased the input image size or batch size. This is however not necessarily an issue linked to YunYang’s implementation.

We then switched to AlexeyAB’s darknet C implementation that uses Tensor cores. There are instructions on how to compile the C source code for Windows. The process is quite convoluted but the instructions are clear to follow. Some training configurations require basic understanding of YOLO architecture to tune. It is possible to use a Python wrapper over the binary or compile YOLO as a DLL both of which will be very handy to link the detection core functions to user applications. There is also an extended version of YOLOv3 with 5 size outputs rather than 3 to potentially improve small object detection.

To address the GPU out-of-memory issue, we quickly acquired an NVIDIA TITAN RTX to replace the RTX 4000. NVIDIA claims it “is the fastest PC graphics card ever built. It’s powered by the award-winning Turing architecture, bringing 130 Tensor TFLOPs of performance, 576 tensor cores, and 24 GB of ultra-fast GDDR6 memory to your PC”. It is possible to scale up to a two-card configuration with a TITAN RTX NVLINK BRIDGE. [we are lucky to have acquired this before the Coronavirus shutdown in the UK…]

An image for our eyes…

Product detection in movies and TV shows using machine learning – part 1: background

This blog series discusses some R&D work within an ongoing “Big Idea” project. The project is in collaboration with the Big Film Group Ltd, a leading Product Placement Agency working with Blue Chip clients across UK and International entertainment properties.


Currently as part of the service the company offers to clients an evaluation on the impact of product placement on TV programmes and films. The service is essential to the customer experience and the growth of business. The evaluation is carried out via human inspection over programmes to mark all corresponding appearances and mentions within the content of broadcast media, mainly TV and film. This is a manual operation which is time-consuming, requires intense concentration and costly. We believe that this whole process can eventually be automated using media processing techniques, AI and machine learning. [project design document]

The long-term goal would be to offer a monitoring service across all broadcast media which would allow agencies and their clients to know where, when and how their brands and companies are being talked about on air. For PR, Advertising and Social Media agencies this information would be particularly valuable. There is no existing solution readily available and we believe that there would be high demand for information and services of this nature. [project design document]

The first phase of the project is to prototype a core function: product detection. We want something that can detect Coca-cola products in sample videos provided by the company. The H.264/AAC encoded sample videos are roughly 40 seconds long, with a resolution of 1920×1080 and a frame rate of 25 fps. Coca-cola products appear at various points in the sample videos for durations between half a second and several seconds.

A scene from TV show Geordie Shore with multiple Coca-cola products (copyrights belong to their respective owners)


It is quite clear that we can map the core function to an ML object detection problem. Object detection has seen some major and successful developments in the past 5 years. So our focus is to analyse the requirements and pick the best of the existing frameworks to develop a working solution.

  • Requirement 1: Logo detection. To simplify the solution, we start with logo detection. This means that we do not differentiate different products/packages of the same brand nor their colours. So Coca-cola cans and glass bottles are considered the same.
  • Requirement 2: Accuracy. The goal is to reach near human-level accuracy overall but there are some major differences between the two. With human inspection, we expect few false positive (FP) detections but a degree of false negatives (FN) when very brief appearances of a product are not picked up by human eyes. For an ML-based solution, detection can be carried out frame by frame but there is a good chance of both FP and FN.
  • Requirement 3: Speed. As we are prototyping for video, the speed of the framework is important. There is no hard requirement on processing framerate but few people would like to wait an hour or two of processing time on each TV show and movie.
  • Requirement 4: End system. We are setting no constraint on end systems (both for training and runtime). We’ll develop the application in a physical workstation (not virtualised) while assuming a similar system will be available at runtime. It is possible to move the system at runtime to the cloud.
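For Requirement 2, the FP/FN trade-off is usually summarised as precision and recall. The sketch below uses invented counts purely to illustrate the contrast between a human inspector (few FPs, some FNs) and a frame-by-frame model (both); these are not measurements from the project.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP), recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Illustrative counts only.
human = precision_recall(tp=80, fp=0, fn=20)
model = precision_recall(tp=90, fp=30, fn=10)
print(human, model)
```

In this toy case the human has perfect precision but lower recall, while the model finds more appearances at the cost of some false alarms.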


Two things constitute an object detection task: localisation (where things are) and classification (what they are). 1) Localisation predicts the coordinates of a bounding box that contains an object (and the likelihood of an object existing in that box). Different frameworks may use different coordinate systems such as (x_min, y_min, x_max, y_max) or (x_centre, y_centre, width, height). 2) Classification tells us the probability of the object in the bounding box belonging to each of a set of pre-defined classes, OR a distribution of probabilities across those classes when the classes are exclusive (i.e., when the object cannot be associated with multiple classes).
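The two classification outputs can be illustrated with the usual activation functions: independent sigmoids give a separate probability per class, while softmax gives one distribution over exclusive classes. This is a generic sketch with made-up scores, not code from any particular framework.

```python
import math


def sigmoid(scores):
    """Independent per-class probabilities; they need not sum to 1."""
    return [1 / (1 + math.exp(-s)) for s in scores]


def softmax(scores):
    """One probability distribution over mutually exclusive classes."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


raw_scores = [2.0, 1.0, 0.1]  # made-up class scores for one bounding box
independent = sigmoid(raw_scores)
exclusive = softmax(raw_scores)
print(independent, exclusive)
```

Both keep the same class ranking; the difference is whether the probabilities compete with each other.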

There are two main ML framework families for object detection: Region-Based Convolutional Neural Networks (R-CNNs) and You Only Look Once (YOLO). Both families have seen some major updates in the past few years. Without going into the technical details too much, I’ll compare the two and discuss the reasons for our choice.


R-CNN is one of the first end-to-end working solutions for object detection. It selects regions that likely contain objects using selective search, a greedy search process that finds all possible regions and selects the 2,000 best ones (in the format of coordinates). The selected regions then go through ConvNet feature extraction before a separate classifier makes predictions for each region. R-CNN splits key functions into independent modules, which is a reasonable choice for prototyping, and it has shown relatively good performance. The main issue of R-CNN is its speed. As thousands of regions go through the ConvNet for each image, the process can be extremely slow. Processing a single image can take tens of seconds.

Fast R-CNN and Faster R-CNN introduced some significant architectural changes in order to improve the efficiency of the process (hence the names). The changes include shifting the ConvNet to an earlier stage of the process so there is less (no) overlap in ConvNet operations over each image. The functionality of the ConvNet is also extended beyond the initial feature extraction to support region proposal (Region Proposal Network (RPN)) as well as classification (replacing SVM with ConvNet+activation function such as softmax). As a result, the architectural components also become more integrated. Faster R-CNN can process an image in less than a second. In summary, the R-CNN family started from a good performance baseline then gradually improved its speed to achieve “real-time” detection. Detectron (Mask R-CNN) is a good starting point to test out recent developments in R-CNN.

Compared with R-CNN, YOLO is designed for speedy detection when accuracy is not mission critical. Instead of searching for the appearance of objects in every possible location, YOLO uses a grid-based search. The grid fixes the anchor points in each image and a number of bounding boxes (such as 3) are created at each anchor point. The grid size is determined by the stride when convolution operations are applied to the image. So for a 416 x 416 image, a stride of 16 will result in a grid of 26 x 26. A large stride means a greater reduction of the feature image dimension, hence it allows the bounding boxes to cover large objects. This design is inspired by the Inception model behind GoogLeNet. Instead of constructing a very deep sequential CNN and relying on small features to build up larger features, filters of different sizes operate in parallel and the results are concatenated. This is similar to having telephoto, prime, and wide-angle camera lenses on your smartphone shooting at the same time, so you are picking up small, medium and large objects in one shot.

Inception model

The standard configuration of YOLO has three stride sizes 32, 16, and 8 (which map to 13 x 13, 26 x 26 and 52 x 52 grids for a 416 x 416 image), responsible for large, medium and small objects respectively. So the three grids will generate 13 x 13 x 3 + 26 x 26 x 3 + 52 x 52 x 3 bounding boxes as a fixed and manageable starting point. Because we are doing a sampled search and not a full search, some objects might be missed. But that’s the cost of a speed-first approach. In fact, the stride-based dimension reduction (and not ConvNet+maxpooling) is also a choice for speed and not for accuracy.
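The grid sizes and the number of candidate boxes follow directly from the input size and strides, which a few lines can confirm. The function name is made up for illustration; the strides and the 3 boxes per cell are the values quoted above.

```python
def yolo_grid_summary(input_size=416, strides=(32, 16, 8), boxes_per_cell=3):
    """Return the grid size per stride and the total number of candidate boxes."""
    grids = [input_size // stride for stride in strides]
    total_boxes = sum(g * g * boxes_per_cell for g in grids)
    return grids, total_boxes


grids_416, boxes_416 = yolo_grid_summary(416)
grids_608, boxes_608 = yolo_grid_summary(608)
print(grids_416, boxes_416)
print(grids_608, boxes_608)
```

For a 416 x 416 input this gives 10,647 boxes; moving to 608 x 608 roughly doubles that to 22,743, which is one reason larger inputs cost speed.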

YOLOv3 (errata: the first detection is at layer 82 and not 84)

YOLO has three major releases: YOLO (You Only Look Once: Unified, Real-Time Object Detection), YOLO9000 (Better, Faster, Stronger), and YOLOv3 (An Incremental Improvement). Each version is an attempt to improve model performance while maintaining the speed for real-time object detection. YOLOv3 uses a deep 53-layer ConvNet, darknet-53, for feature extraction followed by another 53 layers for detection at three size levels. So the 106-layer architecture is fully convolutional (FCN) and does not contain any conventional Dense layers. A fully connected Dense layer requires its input to be flattened to a fixed length, so it limits the size of input images. An FCN design therefore gives us the freedom to use any input image size (not without its own problems), a key feature for dealing with high res content such as HDTV.

Performance and speed comparison (source)

The figure above compares the performance (Mean Average Precision, mAP) and speed of some modern object detection models. mAP is a measurement that factors in both localisation accuracy (IoU) and classification accuracy (Precision-Recall Curve). COCO is a tough dataset on which to get a good mAP, so anything above 50 mAP@0.5 is considered amazing. YOLOv3 clearly shows its advantage in speed (the authors put it “off the chart” to make a point…) while its performance is on a par with others. It is also noted that larger input images (such as 608 x 608) can help with YOLOv3’s performance with some penalty on speed.
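The IoU part of mAP@0.5 is easy to show in code: a detection counts as a match when its overlap with the ground-truth box is at least 0.5. The sketch below uses corner-format boxes; the predicted box is invented for illustration and the ground-truth box is only an example.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union else 0.0


ground_truth = (435, 22, 569, 70)
predicted = (445, 26, 575, 74)  # made-up detection, slightly shifted
overlap = iou(ground_truth, predicted)
is_match_at_05 = overlap >= 0.5
print(round(overlap, 3), is_match_at_05)
```

A stricter threshold such as 0.75 would demand tighter localisation from the same detection.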

It is important to point out that the performance comparison from the related work may not apply to our problem space. These models are likely to behave differently over high res image data extracted from video content. Based on the project requirements, phase 1 will have YOLOv3 as our reference framework.

AI and generative art

The new year started with a couple of interesting projects on AI.

In a HEIF-funded “Big Ideas” project, I am working with a brand placement company to prototype a solution that uses computer vision and deep learning to automate the evaluation of how brands and products appear in movies and TV shows. This is to hopefully assist if not replace the daunting manual work of a human evaluator. Measuring the impact of brand placement is a complex topic and it is underpinned by the capability to detect the presence of products. Object detection (classification + localisation) is a well researched topic with many established deep learning frameworks available. Our early prototypes, which use YOLOv3-based CNN (convolutional neural network) structures and were trained on the FlickrLogos-32 dataset, have shown promising outcomes. There is a long list of TODOs linked to gamma correction, motion complexity, etc.

Detection of Coca-Cola logo

Our analysis of eye gaze and body motion data from a previous VR experiment continues. The main focus is on feature extraction, clustering and data visualisation. There are quite a few interesting observations made by a PhD researcher on how men and women behave differently in VR and how this could contribute to an improved measurement of user attention.

Gaze direction in user experiments

The research on human attention in VR is not limited to passive measurement and we already have some plans to experiment with creative art. We spent hours observing how young men and women interact with VR paintings, which has inspired us to develop generative artworks that capture the user experience of art encounters. Our first VR generative art demo will be hosted in the Milton Keynes Gallery project space in Feb 2020 as part of Alison Goodyear’s Paint Park exhibition. My SDCN project has been supporting the research as part of its Connected VR use case.

Associate Editor of Springer Multimedia Systems

I am thrilled to join the editorial board of the Springer Multimedia Systems journal. Since 1993, Multimedia Systems has been a leading journal in the field, covering enduring topics related to multimedia computing, AI, human factors, communication, and applications. The world of multimedia and computing is constantly evolving. I am really looking forward to working with other editors, reviewers, and authors to get the best research and engineering papers to our readers as quickly as we can while maintaining a high publishing standard.

Multimedia Systems

ISSN: 0942-4962 (Print) 1432-1882 (Online)


This journal details innovative research ideas, emerging technologies, state-of-the-art methods and tools in all aspects of multimedia computing, communication, storage, and applications. It features theoretical, experimental, and survey articles.

Coverage in Multimedia Systems includes:

  • Integration of digital video and audio capabilities in computer systems
  • Multimedia information encoding and data interchange formats
  • Operating system mechanisms for digital multimedia
  • Digital video and audio networking and communication
  • Storage models and structures
  • Methodologies, paradigms, tools, and software architectures for supporting multimedia applications
  • Multimedia applications and application program interfaces, and multimedia end system architectures.

Smart Campus project – part 2

In Part 1, I introduced the architecture and showed some sample charts of my Smart Campus project. The non-intrusive use of WiFi data for campus services and student experience is really cool.

As we are approaching the start of the university term, I have less time to work on this project. So my focus was to prototype a “student-facing” application that visualises live building information. The idea is that students can see which computing labs are free, find quiet study areas, or check whether the student helpdesk is too busy to visit. The security team can also use it to see whether there is any abnormal activity at certain times of the day.

The chart below shows a screenshot of a live floor heatmap with breakdowns of lecture rooms (labelled white), study areas (also labelled white), staff areas (labelled black), and service areas (labelled grey).

floor heatmap (not for redistribution)

Technically the application is split into three parts: user facing front-end (floor chart), data feed (JSON feed) and backend (data processing). The data feed layer provides the necessary segregation so that user requests don’t trigger backend operations directly.

The front-end chart is still based on the Highcharts framework, though I needed to manually draw the custom map in Inkscape based on the actual floor map, export the map as SVG, and convert it to map JSON using Highcharts’ online tool. At the same time, the mapping between areas (e.g., lecture rooms) and their corresponding APs must also be recorded in the database. This is a very time-consuming process that requires a bit of graphic editing skill and a lot of patience.

The backend functions adopt a “10 minute moving average window” and periodically calculate the AP/area device population to generate data for each area defined in the custom floor map. I also filtered out devices that are simply passing by APs to reduce noise in data (e.g., a person walking along the corridor will not leave a trace). The data is then merged with the floormap JSON to generate the data feed every few minutes in static JSON file format.
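
The backend code is not reproduced in this post, so the following is only a minimal sketch of the idea rather than the actual implementation. It assumes observations arrive as (timestamp, device_id, area) tuples, treats a device as “passing by” if it is seen fewer than a hypothetical MIN_SIGHTINGS times in an area within the window, and reads the moving average as the mean number of dwelling devices per minute over the last 10 minutes. The feed keys (“id”, “value”) are also placeholders.

```python
from collections import defaultdict
from datetime import datetime, timedelta
import json

WINDOW = timedelta(minutes=10)   # the "10 minute moving average window"
MIN_SIGHTINGS = 3                # hypothetical threshold for "not just passing by"


def area_population(observations, now):
    """Mean number of dwelling devices per minute, per area, over the window.

    observations: iterable of (timestamp, device_id, area) tuples (assumed format).
    """
    start = now - WINDOW
    recent = [o for o in observations if start < o[0] <= now]

    # Count how often each device was seen in each area within the window.
    sightings = defaultdict(int)
    for _, device, area in recent:
        sightings[(device, area)] += 1

    # Per-minute sets of devices, skipping devices that were only passing by.
    per_minute = defaultdict(lambda: defaultdict(set))
    for ts, device, area in recent:
        if sightings[(device, area)] >= MIN_SIGHTINGS:
            minute = ts.replace(second=0, microsecond=0)
            per_minute[area][minute].add(device)

    minutes_in_window = WINDOW.total_seconds() / 60
    return {
        area: sum(len(devices) for devices in buckets.values()) / minutes_in_window
        for area, buckets in per_minute.items()
    }


def build_feed(population, now):
    """Serialise the per-area values as a static JSON document."""
    return json.dumps({
        "generated": now.isoformat(),
        "areas": [{"id": area, "value": round(value, 2)}
                  for area, value in sorted(population.items())],
    })
```

Writing the output of build_feed() to a static file every few minutes keeps the front-end decoupled from the backend: the chart only ever fetches the file, never triggers a calculation.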

A finishing touch is the chart annotation for most floor areas. I use different label colours so that areas with different functions can be clearly identified.

TB to Unity – A small software tool for creative VR artists

[I am still learning Unity/abstract art. Do let me know if you spot me doing anything silly.]


Google Tilt Brush (TB) is a virtual art studio that enables artists to create paintings in VR. It’s packed with features for editing and sharing. Just as physical artworks require a gallery for exhibition, TB VR paintings need a specialised environment for their audiences. Game engines such as Unity are a natural choice since they offer a wide spectrum of tools to help with installing artwork, controlling the environment, and choreographing interactions with the audience. You can also “bake” the outcomes for different platforms.

The standard workflow to port an artwork to Unity is: Export TB artwork as a FBX file -> Import FBX into Unity and add it to the scene -> Apply Brush material to mesh using the content provided by the tiltbrush-toolkit. This works well until you want to do anything specific with each brush stroke, such as hand-tracking to see where people touch the artwork (yes, it’s OK to touch! I even put my head into one to see what’s inside). In Unity, artworks are stored in meshes and there is no one-to-one mapping between brush stroke and mesh. In fact, all strokes of the same brush type are merged into one big mesh (even when they are not connected) when they are exported from TB. This is (according to a TB engineer) to make the export/import process more efficient.

The painting below was done using only one brush type, “WetPaint”, despite the different colours, patterns and physical locations of the strokes. So in the eyes of Unity, all five thousand brush strokes are one mesh, and there is nothing you can do about it because this is already fixed in the FBX when the artwork is exported from TB. This simply won’t work if an artist wants to continue her creative process in Unity or collaborate with game developers to create interactive content.

Abstract VR Painting Sketch Copyright@Alison Goodyear

To fix it, we have to bypass TB’s FBX export function. Luckily, TB also exports artworks in JSON format. Using the Python-based export tools in tiltbrush-toolkit, it’s possible to convert JSON to FBX with your own configuration. Judging from the developer comments in the source code, these export tools came before TB supported direct FBX export. Specifically, the “” script allows us to perform the conversion with a few useful options, including whether to merge strokes (“--no-merge-brush”). However, not merging strokes by brush type led to loose meshes in Unity with no obvious clue of their brush type.

With some simple modifications to the source code, the script exports meshes with the brush type as a prefix in the mesh names, as shown below. This setup makes it easy to select all strokes with the same brush type, lock them, and apply brush materials in one go. I also added a sequence number at the end of the mesh name (starting from 1000). Occasionally, we put multiple artworks in the same Unity scene, like a virtual gallery. It is then important to be able to differentiate meshes from different artworks in the asset list. This is done by including the original JSON filename in the mesh name (“alig” in the picture below). At the moment, we are working on understanding how audiences interact with paint of different colours, so the colour of the stroke (in “abgr little-endian, rgba big-endian”) is also encoded for quick access in Unity. As a whole, the mesh naming scheme is: BRUSHTYPE_STARTINGCOLOUR_JSONNAME_ID. All of this is based on some simple hacking of the “write_fbx_meshes()” and “add_mesh_to_scene()” functions.
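
To make the naming scheme concrete, here is a small illustrative sketch. It is not the modified toolkit code: mesh_name(), parse_mesh_name() and abgr_to_rgba() are hypothetical helpers, the colour is written as 8 hex digits purely as an assumption, and the uint32 is assumed to read as 0xAABBGGRR (alpha in the top byte, red in the bottom byte), which is my reading of “abgr little-endian”.

```python
import os


def abgr_to_rgba(colour):
    """Split a uint32 colour (assumed to read as 0xAABBGGRR) into (r, g, b, a)."""
    r = colour & 0xFF
    g = (colour >> 8) & 0xFF
    b = (colour >> 16) & 0xFF
    a = (colour >> 24) & 0xFF
    return r, g, b, a


def mesh_name(brush_type, start_colour, json_path, index, first_id=1000):
    """Build BRUSHTYPE_STARTINGCOLOUR_JSONNAME_ID for the index-th stroke."""
    json_name = os.path.splitext(os.path.basename(json_path))[0]
    return "{}_{:08x}_{}_{}".format(brush_type, start_colour, json_name, first_id + index)


def parse_mesh_name(name):
    """Recover the fields from a mesh name.

    Splits from the right, so it assumes the JSON name contains no underscore.
    """
    brush_type, colour_hex, json_name, mesh_id = name.rsplit("_", 3)
    return brush_type, int(colour_hex, 16), json_name, int(mesh_id)


name = mesh_name("WetPaint", 0xFF0080FF, "exports/alig.json", 0)
print(name)                      # WetPaint_ff0080ff_alig_1000
print(abgr_to_rgba(0xFF0080FF))  # (255, 128, 0, 255)
```

Putting the brush type first means an alphabetical sort of the asset list groups strokes by brush, which is what makes the select, lock and apply-material step quick.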

Coding metadata of brush strokes in their names is sufficient in most cases, though there are experiments where we need more detailed, fine-grained brush information. As far as colour is concerned, it is imperative to log the “colour array” since the colour may change along the stroke. In our mesh names, we only record the starting colour. To support better data-driven research, we also export the full stroke metadata as a JSON file alongside the FBX. The schema is:

'v': V,      # list of positions (3-tuples)
'n': N,      # list of normals (3-tuples, or None if missing)
'uv0': UV0,  # list of uv0 (2-, 3-, 4-tuples, or None if missing)
'uv1': UV1,  # see uv0
'c': C,      # list of colors, as a uint32. abgr little-endian, rgba big-endian
't': T,      # list of tangents (4-tuples, or None if missing)
'tri': TRI   # list of triangles (3-tuples of ints)
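
As an example of why the full colour array matters, the sketch below summarises one stroke record that follows the schema above and counts how often the colour changes along it. The helpers colour_runs() and summarise_stroke() are hypothetical names for illustration and are not part of tiltbrush-toolkit; the sample stroke is made up.

```python
def colour_runs(colours):
    """Collapse a per-vertex colour list into (colour, vertex_count) runs."""
    runs = []
    for colour in colours:
        if runs and runs[-1][0] == colour:
            runs[-1][1] += 1
        else:
            runs.append([colour, 1])
    return [tuple(run) for run in runs]


def summarise_stroke(stroke):
    """Return a few per-stroke statistics from a record using the schema keys."""
    colours = stroke["c"]
    runs = colour_runs(colours)
    return {
        "vertices": len(stroke["v"]),
        "triangles": len(stroke["tri"]),
        "start_colour": colours[0] if colours else None,
        "colour_changes": max(len(runs) - 1, 0),
    }


# A made-up stroke with four vertices whose colour changes once half-way along.
stroke = {
    "v": [(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0)],
    "n": None, "uv0": None, "uv1": None, "t": None,
    "c": [0xFF0000FF, 0xFF0000FF, 0xFF00FF00, 0xFF00FF00],
    "tri": [(0, 1, 2), (1, 2, 3)],
}
print(summarise_stroke(stroke))
```

A stroke with a non-zero “colour_changes” value is exactly the case where the starting colour in the mesh name alone would be misleading.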

The modified script is available here:

Another example of Alison’s “Peacock” painting imported in Unity:

Copyright@Alison Goodyear