Using VR and machine learning for art and mental health

[ update: the research idea described in this post has supported a successful outline proposal to EPSRC High-risk speculative engineering and ICT research: New Horizons ]

In the past few years, we have had a series of projects on capturing and modelling human attention in VR applications. Our research shows that eye gaze and body movements play a pivotal role in capturing human perception, intent, and experience. We truly believe that VR is not just another computerised environment with fancy graphics. With the help of biometric sensors and machine learning, VR can become the most persuasive technology known to HCI designers. In a recent project, we demonstrated how machine learning can be used to automatically study visitor behaviours in a VR art exhibition without any prior knowledge of the artwork. The resultant model then drives autonomous avatars (see below) that guide other visitors based on their eye gaze and mobility patterns. With the “AI avatars”, we observed a significant increase in visitors’ interactions with the VR artwork and very positive feedback on the overall user experience.

Image generated by Murtada Dohan, a PhD student at UoN.

The COVID-19 pandemic and its prolonged impact on health services made us rethink our research priorities. While we are still enthusiastic about digital arts, we wanted to make good use of our VR and data science know-how for healthcare innovation. Using VR and AI in healthcare is not a new idea. There is already a large body of research on VR-based therapies, especially for the treatment of phobias and dementia, and AI has been used to develop chatbots, detect COVID-19 symptoms, and more. The research we have seen so far is very promising from an academic perspective, but most of it aims at augmenting traditional practices for improved outcomes. This means that any developed application will still need to be operated by a technician in a controlled setting. Recognising the healthcare innovations in the research communities, we are interested in a new form of design that can deliver automated or even autonomous assessment and treatment of diseases in a remote location, e.g., patients’ own homes or an easily accessible community centre. This will ultimately help reduce the number of healthcare appointments and patients’ trips to hospitals.

The pandemic has had long-lasting impacts on public mental health due to social isolation, loss of coping mechanisms, reduced access to health services, etc. We believe VR and AI research should see a major shift from exploratory proof-of-concept to product-focused development with wider public engagement. Just as every Tesla car and every Google search improves their underlying ML models, mental health innovation must aim at large-scale user trials to achieve any major transformation. To this end, we are now partnering with the R&D department of a leading mental health institution to engineer new VR applications. We hope that customised VR stimuli and NLP dialogue engines will lead to more effective treatments that were not possible in the past due to constraints in the physical world. We are also quite excited about the opportunities to automate the assessment of mental disorders through biometric sensors and machine learning.

The development of BSc AI and Data Science programme

This is a belated post on developing a new BSc AI and Data Science (Hons) programme. The programme successfully passed validation in early 2022 and we are now accepting applications for the 22/23 academic year.

The development of the new programme is an answer to the growing demand for machine learning engineers and scientists in the UK job market. Using AI and machine learning to increase productivity, save costs, and assist new designs is no longer a privilege of large tech companies and government organisations. In the past few years, we have worked with many small and micro-businesses that are enthusiastic about adopting AI techniques and recruiting AI talent. Although we have been teaching AI-related topics such as computer vision, deep learning and graph databases within our existing programmes for many years, it is now imperative to design a dedicated BSc programme that captures recent advances in AI as well as the legal, ethical, and environmental challenges that may follow. I am pleased to have had the chance to be part of this development as the programme lead.

We had two parallel procedures taking place: computing market research and the CAIeRO Planner. The market research was carried out by key academics who currently teach AI-related modules. We ran a few case studies of similar programmes offered by our main competitors and of current job vacancies for ML engineers, researchers, and data analysts. We noticed that many AI programmes are offered as a collection of discrete data science and machine learning modules that do not synergise with each other. While this may give prospective students the impression of a rich and sophisticated course, students do not get the best value while hopping between those modules. We wanted to follow the theme of responsible and human-centred AI while providing a clear path to success and a sense of accomplishment along the way. The research on the job market was especially important because we wanted to continuously champion hands-on learning and practical skills. This exercise gave us a general idea of the toolsets, frameworks, workflows, and R&D environments that our students will be expected to master in their future workplaces.

Planning the technical content is only half of the story. The University has a large and dedicated Learning Technology team to support module and programme development and improvement. Two learning technologists were assigned to our programme to support detailed designs at both programme and module levels. We used an in-house planner, Creating Aligned Interactive educational Resource Opportunities (CAIeRO), to guide the exercises.

We started with the “look and feel”, learning outcomes, mission statement and assessment strategy for the programme as a whole, using interactive tools and sharable environments such as Padlet. All members of the programme team had equal input to the design. The whole process was carried out through multiple online sessions over a few weeks. Because everyone came to the meetings fully prepared, the sessions were really effective and super engaging. The programme-level design then became the blueprint for module-level designs, ensuring coherence and consistency across all modules.

We then identified four new modules for the programme: Mathematics for Computer Science, Introduction to AI, Natural Language Processing, and Cloud Computing and Big Data. We also reworked some existing modules such as Advanced AI and Applications, and Media Technology to better accommodate the programme learning outcomes.

Developing module-level learning outcomes can be challenging, especially when we need to maintain coherence between modules at the same level. As student-facing documents, the module specifications also need to be clear and concise. We used a toolkit called COGS, which stands for Changemaker Outcomes for Graduate Success. It includes a series of guidelines that help staff write clear and robust learning outcomes appropriate to the academic level of study, in order to clarify for students what is expected of them across the different stages of their study. I found this tool extremely useful when developing the new modules, knowing that my colleagues would be using similar language for the related modules.

We also took a few extra steps to make sure that the learning outcomes will be assessed using a range of tools, including assignments, projects, time-constrained assessments and dissertations. Most modules also offer a mix of face-to-face contact and a small number of online contact hours for active and blended learning. This allows students to work on subject tasks online before they join the classes, a practice that could greatly improve student engagement.

If you are interested in more details about our programme, please don’t hesitate to contact me.

Product detection in movies and TV shows using machine learning – part 5: Finding sweeties quick

In Part 4, I made a start on establishing a new training dataset by harvesting publicly accessible photos on social media. The main benefit of using user-generated content is that the photos were taken in real-world settings, hence close to what the target logos would look like in a film. For content selection and labelling, my own filtering tool and Yolo_Mark worked pretty well. It wasn’t easy to label 600+ images but the workflow is decent. The three classes are: 0 – Cadbury, 1 – ROSES, and 2 – HEROES. There are some typeface variations of ROSES, so you need to be patient and consistent with the labelling strategy. As humans, we are able to acquire information from different sources very quickly while making a decision. So if I were actively looking for a particular logo while knowing the logo is definitely present, I could still point at an unidentifiable blob of pixels and be 100% certain that it’s a Cadbury logo on a discarded purple wrapper. It may not be realistic to expect a “low-level” machine learning model with a small training set to capture what a human could do in this case. Therefore I limited the labelling to only the logos that I could visually identify directly.
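For reference, Yolo_Mark writes one plain-text file per image in the darknet format: each line is `<class> <x_centre> <y_centre> <width> <height>`, with coordinates normalised to the image size. A made-up two-box example for the classes above (class 0 = Cadbury, class 1 = ROSES; the numbers are illustrative, not from my dataset):

```
0 0.512 0.430 0.210 0.115
1 0.268 0.705 0.180 0.090
```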

Labelling a photo using Yolo_mark (Image copyrights belong to its owner)

The training process wasn’t much different from the previous modelling for the Coca-cola logo, except for some further tidying of the dataset (minor issues with missing files, etc.). With a baseline configuration, it took about 6 hours to complete 6000 iterations, with a pretty good result based on the detection of the three logos.

The images below illustrate what the model picks up from some standard photos (using the slider to see “before” and “after”).

logo detection

Another example:

logo detection

I’ve also tested the model on some videos provided by our partner. I won’t be able to show them here due to copyright, but it’s safe to say that the model works very well, with room for improvement. Some adjustments can be made on the modelling side, such as increasing the size of training images (currently downsampled to 608×608), increasing the number of detection layers to accommodate a larger range of logo sizes, or perhaps giving the new YOLOv4 a go!

This update concludes the “Product detection in movies and TV shows using machine learning” series. The dataset used for the Cadbury, Roses, and Heroes training will be made public for anyone interested in giving it a go or expanding their own logo detector. I am still pushing this topic forward and will start a new series soon!

Product detection in movies and TV shows using machine learning – part 1: Background

Product detection in movies and TV shows using machine learning – part 2: Dataset and implementations

Product detection in movies and TV shows using machine learning – part 3: Training and Results

Product detection in movies and TV shows using machine learning – part 4: Start a new dataset

Smart Campus project – part 3 (COVID-19) – in progress – [10 June 2020]

I have been playing with the project data to study the impact of COVID-19 social distancing and lockdown on the university, especially the use of campus facilities. Meanwhile, there are some time series analyses and behavioural modelling that I’d like to complete sooner rather than later. Everything has taken me much longer than planned. Here are some previews, followed by some moaning, i.e., the painful processes behind generating them.

Number of connected devices on campus

The above shows some regular patterns of crowd density and how the numbers dropped due to COVID-19 lockdown. Students started to reduce their time on campus in the week prior to the official campus closure.

Autocorrelation

The autocorrelation graph shows a possible daily pattern (data resampled at 5-minute intervals, so 288 samples make a day, hence the location of the first peak).

Seasonal decomposition

Seasonal decomposition based on the hypothesis of a weekly pattern. There is also a strong hourly pattern, which I’ll explain in the paper (not written yet!).

A comparison of area crowd density dynamics on one floor of an academic building. From left to right: pre-lockdown baseline, near-lockdown reduced social contact, and working/studying from home during lockdown.

These show the area crowd density dynamics of one floor of an academic building. The one on the left shows how an academic workspace, a few classrooms and study areas were used during a normal week, when few people in the UK felt that COVID-19 was relevant to them. The middle one shows the week when there were increasing reports of COVID-19 cases in the UK and the government was changing its tone and advising social distancing; staff and students reduced their hours on campus. The one on the right shows a week during the university closure (the building was still accessible for exceptional purposes).

Using the system to monitor real-time crowd data provides a lot of insight, but it is somewhat passive. It’s the modelling, simulation and prediction that make the system truly useful. I have done some work on this and I’ll gradually update this post with some analysis results:

The first thing I tried is standard time-series analysis. A lot of people don’t think it’s a big deal, but it’s tricky to get things right. There are many models to try, and they are all based on the assumption that we can predict future data from previous observations. ARIMA (Auto Regressive Integrated Moving Average) is a common time-series analysis method characterised by three terms: p, d, q. Together they determine which part(s) of the observed data to use and how to adjust the data (based on how things change over time) to form a prediction. The seasonal variation of ARIMA (SARIMA) introduces additional seasonal terms to capture seasonal differences. Our campus WiFi data is not only non-stationary but also has multiple seasonalities embedded: at a high level, the university has terms; each term has a start and an end with special activities; each week of the term has weekdays and weekends; and each weekday has lecturing hours and non-lecturing hours. The standard SARIMA can only capture one seasonality, but it will be our starting point for experimenting with crowd predictions.

SARIMA prediction

The figure above shows the predictions of campus occupancy level on Friday 6th March. The blue curve plots the observed data for that week (Monday to Friday). The green curve depicts the “intra-week” predictions based on data observed during the same week, i.e., using Monday–Thursday’s data to guess Friday’s. This method can respond to extraordinary situations in a particular week. If we chain all weekday data, then in theory it is possible to make a prediction for any weekday, practically ignoring the differences across weekdays. However, we know that people’s activities across weekdays are not entirely identical: students have special activities on Wednesdays and everyone tries to finish early on Fridays. This explains why the intra-week predictions overestimate the occupancy level for Friday afternoon. The orange curve gives the “inter-week” prediction based on the previous four Fridays. This method captures normal activities on Fridays but is agnostic to week-specific changes (e.g., the week prior to exams). Balancing intra- and inter-week predictions using a simple element-wise mean, the red curve shows the “combined” prediction. For this particular prediction exercise, the combined method does not show a better MSE than the inter-week version, partially due to the overestimates.
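The combining step itself is trivial; below is a minimal sketch with placeholder arrays standing in for the two prediction series and the observation (the real series come from the SARIMA models described below):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# placeholders: one Friday at 30-minute resolution (48 samples each);
# in practice these come from the intra-/inter-week SARIMA models and the observed WiFi counts
observed = np.random.rand(48)
intra_week = np.random.rand(48)
inter_week = np.random.rand(48)

combined = (intra_week + inter_week) / 2        # element-wise mean of the two predictions

for name, pred in [("intra-week", intra_week),
                   ("inter-week", inter_week),
                   ("combined", combined)]:
    print(name, mean_squared_error(observed, pred))
```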

SARIMA prediction

The figure above shows the week prior to the university’s closure in response to COVID-19. This week is considered an “abnormal” week as students and staff started to spend more time studying or working from home. In this case, the intra-week model successfully captures the changes in that week. There must be a better way to balance the two models to take the best from both worlds, but I will try other options first.

All the modelling above was done using pmdarima, a Python implementation of R’s auto.arima feature. To speed up the process, the data was subsampled to a 30-minute interval. The number of observations per seasonal cycle, m, was set to 48 (24 hours x 2 samples per hour) to define a daily cycle.
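For anyone wanting to reproduce the general workflow, the core is only a few lines. This is a minimal sketch rather than the actual analysis code; the file and column names are placeholders and the auto_arima settings are simplified:

```python
import pandas as pd
import pmdarima as pm

# load the per-device-count time series and subsample to 30-minute intervals
df = pd.read_csv("device_counts.csv", parse_dates=["timestamp"], index_col="timestamp")
series = df["connected_devices"].resample("30T").mean().ffill()

train, test = series[:-48], series[-48:]          # hold out the last day (48 samples)

# m=48 tells auto_arima to look for a daily cycle at 30-minute resolution
model = pm.auto_arima(train, seasonal=True, m=48,
                      suppress_warnings=True, error_action="ignore")

forecast = model.predict(n_periods=48)            # predict the held-out day
print(model.summary())
```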

[more to come soon]


Some technical details…

  • The main tables on the live DB have 500+ million records (taking about 300 GB of space). It took a week to replicate them on a second DB so I could decouple the messy analysis queries from the main one.
  • A few Python scripts correlated loose data in the DB, which got me a 150+ GB CSV file for time series analysis. From there, the lovely Pandas happily chews through the CSV like bamboo shoots.
  • The crowd density floor map was originally built for live data (10-minute coverage). To reprogramme it for historical data and generate the animation above, a few things had to be done:
    • A Python script ploughed through the main DB table (yes, the one with 500 million records) and derived area density at 10-minute intervals. The script also did a few other things at the same time, so the whole thing took a few hours.
    • A new PHP page loaded the data, then some JavaScript looped through it and displayed the charts dynamically. It’s mainly setInterval() calling Highcharts’ setData, removeAnnotation and addAnnotation.
    • To save the results as videos/GIFs, I tested screen capturing, which turned out to be OK, but recording the screen just didn’t feel right to me. So I went down the route of image exporting and merging. Highcharts’ offline exporting exportChartLocal() works pretty well within the setInterval() loop until a memory issue appeared. Throwing in a few more lines of JavaScript to stagger the exporting “fixed” the issue. FFmpeg was brought in to convert the image sequence to video.
    • Future work will get all these tidied up and automated.

[to be continued]

Product detection in movies and TV shows using machine learning – part 4: Start a new dataset

Public image datasets are very handy when it comes to ML training, but at some point you’ll face a product/logo that is not covered by any existing dataset. In our case, we are experimenting with detecting Cadbury Roses and Cadbury Heroes products, so we need to construct an image dataset covering these two products.

Two steps to put the elephant in the fridge:

Open the fridge

There are a few sources for in-the-wild images:

  1. Google Images – search “Cadbury Heroes” and “Cadbury Roses”, then use a downloading tool such as Download All Images (https://download-all-images.mobilefirst.me/) to fetch all image results.
  2. Flickr – same process as above.
  3. Instagram – use an image acquisition tool such as Instaloader (https://instaloader.github.io) to fetch images based on hashtags (#cadburyheroes and #cadburyroses); a minimal fetching sketch follows this list.
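The Instagram step boils down to very little code. Below is a minimal sketch assuming Instaloader’s Python API (its download_hashtag helper), with an arbitrary cap on the number of posts per hashtag:

```python
import instaloader

# skip videos and metadata; we only want the photos for labelling
loader = instaloader.Instaloader(download_videos=False,
                                 download_comments=False,
                                 save_metadata=False)

for tag in ("cadburyheroes", "cadburyroses"):
    # download_hashtag saves posts into a folder named after the hashtag
    loader.download_hashtag(tag, max_count=1000)   # the cap is arbitrary
```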

The three sources provide around 5,000 raw images with a significant number of duplicates and unrelated items, so a manual process is needed to filter the dataset. Going through thousands of files is tedious, so to make things slightly easier, I made a small GUI application. When you first start the application, it prompts for your image directory and loads the first image. You then use the Left and Right arrow keys to decide whether to keep the image for ML training or discard it (LEFT to skip and RIGHT to keep). No files are deleted; instead they are moved to the corresponding sub-directories “skip” and “keep”. Once one of the two arrow keys is pressed, the application loads the next image. It’s pretty much a one-hand operation, so you have the other hand free to feed yourself coffee/tea… The tool is available on GitHub. It’s based on wxPython and I’ve only tested it on Mac (pythonw).
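The actual tool is the one on GitHub; the sketch below is not that code but a minimal wxPython reimplementation of the same keep/skip idea (the image directory and window size are placeholders):

```python
import os
import shutil
import wx

class ImageSorter(wx.Frame):
    """Show one image at a time; RIGHT arrow keeps it, LEFT arrow skips it."""

    def __init__(self, folder):
        super().__init__(None, title="keep (RIGHT) / skip (LEFT)")
        self.folder = folder
        self.files = sorted(f for f in os.listdir(folder)
                            if f.lower().endswith((".jpg", ".jpeg", ".png")))
        self.index = 0
        for sub in ("keep", "skip"):
            os.makedirs(os.path.join(folder, sub), exist_ok=True)
        self.bitmap = wx.StaticBitmap(self)
        self.SetClientSize(800, 600)
        self.Bind(wx.EVT_CHAR_HOOK, self.on_key)
        self.show_current()

    def show_current(self):
        if self.index >= len(self.files):
            self.Close()
            return
        image = wx.Image(os.path.join(self.folder, self.files[self.index]))
        image = image.Scale(800, 600, wx.IMAGE_QUALITY_HIGH)  # quick and dirty resize
        self.bitmap.SetBitmap(wx.Bitmap(image))

    def on_key(self, event):
        key = event.GetKeyCode()
        if key in (wx.WXK_LEFT, wx.WXK_RIGHT):
            dest = "keep" if key == wx.WXK_RIGHT else "skip"
            name = self.files[self.index]
            shutil.move(os.path.join(self.folder, name),
                        os.path.join(self.folder, dest, name))
            self.index += 1
            self.show_current()
        else:
            event.Skip()

if __name__ == "__main__":
    app = wx.App()
    ImageSorter("/path/to/raw/images").Show()   # placeholder directory
    app.MainLoop()
```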

Insert elephant

Labelling the dataset requires manual input of bounding box coordinates and labels. A few tools are available, including LabelImg and Yolo_mark. I have also set up “Video labelling tool” as one of the assignment topics for my CS module Media Technology, so hopefully we’ll see some interesting designs for video labelling. In this case we use Yolo_mark as it directly exports in the labelling format required by our framework.

Depending on the actual product and packaging, the logo layout and typeface vary. I am separating them into four classes: the Cadbury logo, “Heroes” (including the one with the star), “Roses” in precursive (new), and “Roses” in cursive (old), coded as cadbury_logo, cadbury_heroes, cadbury_roses_precursive, and cadbury_roses_cursive.

Product detection in movies and TV shows using machine learning – part 1: Background

Product detection in movies and TV shows using machine learning – part 2: Dataset and implementations

Product detection in movies and TV shows using machine learning – part 3: Training and Results

Product detection in movies and TV shows using machine learning – part 3: Training and Results

Training has been an ongoing process of testing which configurations work best for us. Normally you set the training config, dataset and validation strategy, then sit back and wait for the model performance to peak. The figures below show plots of loss and mAP@0.5 for 4000 iterations of training with an input size of 608 and a batch size of 32. Loss generally drops as expected and we can get mAP of around 90% with careful configuration. The training process saves model weights every 1000 iterations, plus the best, last and final versions of the weights. The training itself takes a few hours, so I usually run it overnight.

The performance measures are based on our image dataset. To evaluate how the model actually performs on test videos, it is essential to do manual verification. This means feeding the videos frame by frame to the pre-loaded model, then assembling the results as videos. Because YOLO detects objects at three scales, the input test image size has a great influence on recall. Our experiments suggest that an input size of 1600 (for full HD videos) leads to the best results, so input HD content is slightly downsampled and padded. The images below show the detections of multiple logos in the test video.
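Our verification runs were driven by the darknet binaries, but the same frame-by-frame loop can be sketched with OpenCV’s DNN module, which reads darknet .cfg/.weights pairs. File names and thresholds below are placeholders, and non-maximum suppression is omitted for brevity:

```python
import cv2
import numpy as np

# placeholder paths; any darknet-format YOLOv3 cfg/weights pair will do
net = cv2.dnn.readNetFromDarknet("yolov3-logo.cfg", "yolov3-logo.weights")
out_names = net.getUnconnectedOutLayersNames()

cap = cv2.VideoCapture("sample.mp4")
fourcc = cv2.VideoWriter_fourcc(*"mp4v")
writer = None

while True:
    ok, frame = cap.read()
    if not ok:
        break
    h, w = frame.shape[:2]
    if writer is None:
        writer = cv2.VideoWriter("annotated.mp4", fourcc, 25.0, (w, h))

    # the network input size is a free choice for a fully convolutional model (multiple of 32)
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (1600, 1600), swapRB=True, crop=False)
    net.setInput(blob)
    outputs = net.forward(out_names)

    for output in outputs:
        for det in output:                      # det = [cx, cy, bw, bh, objectness, class scores...]
            scores = det[5:]
            class_id = int(np.argmax(scores))
            conf = float(scores[class_id])
            if conf < 0.5:                      # arbitrary confidence threshold
                continue
            cx, cy, bw, bh = det[0] * w, det[1] * h, det[2] * w, det[3] * h
            x, y = int(cx - bw / 2), int(cy - bh / 2)
            cv2.rectangle(frame, (x, y), (x + int(bw), y + int(bh)), (0, 255, 0), 2)

    writer.write(frame)

cap.release()
if writer is not None:
    writer.release()
```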

Detection results (copyrights belong to owner)
Detection results (copyrights belong to owner)

It is clear that training a model on an image dataset for video content CAN work, but there are many challenges. Factors such as brightness, contrast, motion blur, and video compression all impact the outcomes of the detection. Some of the negative impacts can be mitigated by tuning the augmentation (to mimic how things look in motion pictures), and I suspect a lot more can be done once we start to exploit the temporal relationship between video frames (instead of treating them as independent images).

Product detection in movies and TV shows using machine learning – part 1: Background

Product detection in movies and TV shows using machine learning – part 2: Dataset and implementations

Product detection in movies and TV shows using machine learning – part 2: Dataset and implementations

Dataset for training

Existing object detection models are mostly trained to recognise common everyday objects such as dogs, people, cars, etc. We’ll use these pre-trained models later on when we do scene/character detection. For logo detection, we need to train our own model for the logos we want to detect. The training requires a sizeable dataset labelled with ground truth (where logos appear in the images). Because we are detecting logos in movies and TV shows, products are often not in perfect focus, lighting or orientation, and are sometimes obstructed by other objects. So our model needs to be trained using “in-the-wild” images captured in non-perfect conditions. We use the following two datasets.

Dataset 1: The Logos-32plus dataset is a “publicly-available collection of photos showing 32 different logo brands (1.9GB). It is meant for the evaluation of logo retrieval and multi-class logo detection/recognition systems on real-world images”. The dataset has separate sub-folders, one per class (“adidas”, “aldi”, …), and each subfolder contains a list of JPG images. groundtruth.mat contains a MATLAB struct-array “groundtruth”, with each element having the following fields: the relative path to the image (e.g. ‘adidas\000001.jpg’), bboxes (bounding box format is X,Y,W,H), and the logo name. The dataset contains around 340 Coca-cola logo images.
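Reading the MATLAB annotations from Python is straightforward with SciPy. In the sketch below the field names (‘relpath’, ‘bboxes’, ‘name’) are my assumptions based on the dataset description; check dtype.names against the actual file first:

```python
from scipy.io import loadmat

mat = loadmat("groundtruth.mat", squeeze_me=True)
gt = mat["groundtruth"]
print(gt.dtype.names)          # discover the actual field names first

for entry in gt:
    # field names below are assumptions based on the dataset description
    rel_path = entry["relpath"]
    bboxes = entry["bboxes"]   # rows of (X, Y, W, H)
    name = entry["name"]
    print(name, rel_path, bboxes)
```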

Logo32plus

Dataset 2: The Logos in the Wild Dataset is a large-scale set of web-collected images with logo annotations provided in Pascal VOC style. The current version (v2.0) of the dataset consists of 11,054 images with 32,850 annotated logo bounding boxes of 871 brands. It is an in-the-wild logo dataset where the logos appear as a natural part of the images rather than as raw original logo graphics. Images were collected by Google image search based on a list of well-known brands and companies. The bounding boxes are given in the format (x_min, y_min, x_max, y_max) in absolute pixels.

The dataset does not provide the actual images but URLs to fetch the images from various online sources, such as: http://bilder.t-online.de/b/47/70/93/86/id_47709386/920/tid_da/platz-1-bei-den-plakaten-coca-cola-foto-imas-.jpg So one must write a script to download the images from the URLs. Not all URLs are still valid; we extracted around 530 Coca-cola logo images. The image (600×428) below contains three logos and the ground truth is (435, 22, 569, 70), (308, 274, 351, 292), and (209, 225, 245, 243).
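The fetching script only needs a few lines. Below is a minimal sketch where the URL list file name is made up and dead links are simply skipped:

```python
import hashlib
import os
import requests

def fetch_images(url_file="cocacola_urls.txt", out_dir="images"):
    """Download each URL in the list; invalid or vanished URLs are silently skipped."""
    os.makedirs(out_dir, exist_ok=True)
    with open(url_file) as f:
        urls = [line.strip() for line in f if line.strip()]
    for url in urls:
        try:
            resp = requests.get(url, timeout=10)
            resp.raise_for_status()
        except requests.RequestException:
            continue
        # hash the URL to get a stable, unique file name
        name = hashlib.md5(url.encode()).hexdigest() + ".jpg"
        with open(os.path.join(out_dir, name), "wb") as out:
            out.write(resp.content)

if __name__ == "__main__":
    fetch_images()
```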

As different object detection implementations use different conventions to define bounding boxes, it is necessary to write conversion scripts to map between them. This is not a horrendous task but requires basic programming skills.
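For example, converting the Pascal-VOC-style absolute (x_min, y_min, x_max, y_max) boxes into the normalised (x_centre, y_centre, width, height) format used by darknet/YOLO is a small function; the sketch below is checked against the first box of the 600×428 example image above:

```python
def voc_to_yolo(box, img_w, img_h):
    """(x_min, y_min, x_max, y_max) in pixels -> normalised (x_centre, y_centre, w, h)."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2 / img_w,
            (y_min + y_max) / 2 / img_h,
            (x_max - x_min) / img_w,
            (y_max - y_min) / img_h)

# first logo of the 600x428 example image above
print(voc_to_yolo((435, 22, 569, 70), 600, 428))
# -> approximately (0.837, 0.107, 0.223, 0.112)
```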

Part of the Coca-cola image dataset


Implementations

We have an iMac Pro and an HP workstation with Windows to host the project development. Both platforms have a decent Xeon processor and plenty of memory. However, our ConvNet-heavy application requires GPU acceleration to reduce training and detection times from days and hours to hours and minutes. ML GPU acceleration is led by Nvidia thanks to its range of graphics cards and software support, which includes the CUDA toolkit for parallel computing / image processing and the cuDNN Deep Neural Network library for deep learning. GPU acceleration for ML is currently not officially possible on a Mac (Nvidia and Apple haven’t found a way to work together), and we want to steer away from hacks.

The authors of YOLOv3 provide an official implementation detailed at https://pjreddie.com/darknet/yolo/. You’ll need to install Darknet, an open source neural network framework written in C and CUDA. It is easy to compile and run the baseline configuration on Mac and Linux, but there is little support for the Windows platform.

We then tested two alternative solutions.

The first is YunYang’s TensorFlow 2.0 Python implementation. It has support for training, inference and evaluation. The source code is not too difficult to follow and the author also wrote a tech blog (in Chinese) with some useful details. This “minimal” YOLO implementation is ideal for people who want to learn the code, but some useful features are missing or not implemented in full; data augmentation is an example. There are also challenges with NaN loss (no loss outputs, hence no learning) that require a lot of tweaking. Also, although we have an Nvidia RTX 4000 graphics card with 8GB memory, we kept running out of GPU memory when we increased the input image size or batch size. This is, however, not an issue necessarily linked to YunYang’s implementation.

We then switched to AlexeyAB’s darknet C implementation, which uses Tensor cores. There are instructions on how to compile the C source code for Windows; the process is quite convoluted but the instructions are clear to follow. Some training configurations require a basic understanding of the YOLO architecture to tune. It is possible to use a Python wrapper over the binary or compile YOLO as a DLL, both of which will be very handy for linking the detection core functions to user applications. There is also an extended version of YOLOv3 with 5 size outputs rather than 3 to potentially improve small object detection.

To address the GPU out-of-memory issue, we quickly acquired an NVIDIA TITAN RTX to replace the RTX 4000. NVIDIA claims it “is the fastest PC graphics card ever built. It’s powered by the award-winning Turing architecture, bringing 130 Tensor TFLOPs of performance, 576 tensor cores, and 24 GB of ultra-fast GDDR6 memory to your PC”. It is also possible to scale up to a two-card configuration with a TITAN RTX NVLINK BRIDGE. [We were lucky to have acquired this before the coronavirus shutdown in the UK…]

An image for our eyes…

Product detection in movies and TV shows using machine learning – part 1: Background

Product detection in movies and TV shows using machine learning – part 1: background

This blog series discusses some R&D work within an ongoing “Big Idea” project. The project is in collaboration with the Big Film Group Ltd, a leading Product Placement Agency working with Blue Chip clients across UK and International entertainment properties.

Background

Currently, as part of its service, the company offers clients an evaluation of the impact of product placement in TV programmes and films. The service is essential to the customer experience and the growth of the business. The evaluation is carried out via human inspection of programmes to mark all corresponding appearances and mentions within the content of broadcast media, mainly TV and film. This is a manual operation that is time-consuming, costly, and requires intense concentration. We believe that this whole process can eventually be automated using media processing techniques, AI and machine learning. [project design document]

The long-term goal would be to offer a monitoring service across all broadcast media which would allow agencies and their clients to know where, when and how their brands and companies are being talked about on air. For PR, Advertising and Social Media agencies this information would be particularly valuable. There is no existing solution readily available and we believe that there would be high demand for information and services of this nature. [project design document]

The first phase of the project is to prototype a core function: product detection. We want something that can detect Coca-cola products in sample videos provided by the company. The H.264/AAC encoded sample videos are roughly 40 seconds long, with a resolution of 1920×1080 and a frame rate of 25 fps. Coca-cola products appear at various points in the sample videos for durations between half a second and several seconds.

A scene from TV show Geordie Shore with multiple Coca-cola products (copyrights belong to their respective owners)

Requirements

It is quite clear that we can map the core function to an ML object detection problem. Object detection has seen major developments and successes in the past 5 years, so our focus is to analyse the requirements and pick the best of the existing frameworks to develop a working solution.

  • Requirement 1: Logo detection. To simplify the solution, we start with logo detection. This means that we do not differentiate between different products/packages of the same brand, nor their colours. So Coca-cola cans and glass bottles are considered the same.
  • Requirement 2: Accuracy. The goal is to reach near human-level accuracy overall, but there are some major differences between the two. With human inspection, we expect few false positive (FP) detections but a degree of false negatives (FN) when very brief appearances of a product are not picked up by human eyes. For an ML-based solution, detection can be carried out frame by frame, but there is a good chance of both FP and FN.
  • Requirement 3: Speed. As we are prototyping for video, the speed of the framework is important. There is no hard requirement on processing frame rate, but few people would like to wait an hour or two of processing time for each TV show or movie.
  • Requirement 4: End system. We are setting no constraints on end systems (for either training or runtime). We’ll develop the application on a physical workstation (not virtualised) while assuming a similar system will be available at runtime. It is possible to move the runtime system to the cloud.

Framework

Two things constitute an object detection task: localisation (where things are) and classification (what it is). 1) Localisation predicts the coordinates of a bounding box that contains an object (and the likelihood of an object existing in that box). Different frameworks may use different coordinate systems, such as (x_min, y_min, x_max, y_max) or (x_centre, y_centre, width, height). 2) Classification tells us the probability of the object in the bounding box belonging to each of a set of pre-defined classes, either as independent probabilities or as a single distribution over the classes when the classes are exclusive (i.e., when an object cannot be associated with multiple classes).

There are two main ML framework families for object detection: Region-Based Convolutional Neural Networks (R-CNNs) and You Only Look Once (YOLO). Both families have seen major updates in the past few years. Without going into the technical details too much, I’ll compare the two and discuss the reasons for our choice.

R-CNN

R-CNN is one of the first end-to-end working solutions for object detection. It selects regions that likely contain objects using selective search, a greedy search process that finds all possible regions and selects the 2,000 best ones (in the form of coordinates). The selected regions then go through ConvNet feature extraction before a separate classifier makes predictions for each region. R-CNN splits key functions into independent modules, which is a reasonable choice for prototyping, and it has shown relatively good performance. The main issue with R-CNN is its speed: as thousands of regions go through the ConvNet for each image, the process can be extremely slow, and processing a single image can take tens of seconds.

Fast R-CNN and Faster R-CNN introduced some significant architectural changes to improve the efficiency of the process (hence the names). The changes include shifting the ConvNet to an earlier stage of the process so there is less (or no) overlap in ConvNet operations over each image. The functionality of the ConvNet is also extended beyond the initial feature extraction to support region proposal (the Region Proposal Network (RPN)) as well as classification (replacing the SVM with a ConvNet plus an activation function such as softmax). As a result, the architectural components also become more integrated. Faster R-CNN can process an image in less than a second. In summary, the R-CNN family started from a good performance baseline and then gradually improved its speed to achieve “real-time” detection. Detectron (Mask R-CNN) is a good starting point to test out recent developments on R-CNN.

Compared with R-CNN, YOLO is designed for speedy detection when accuracy is not mission-critical. Instead of searching for the appearance of objects in every possible location, YOLO uses a grid-based search. The grid fixes the anchor points in each image and a number of bounding boxes (such as 3) are created at each anchor point. The grid size is determined by the stride when convolution operations are applied to the image. So for a 416 x 416 image, a stride of 16 will result in a grid of 26 x 26. A larger stride means a greater reduction of the feature map dimensions, hence it allows the bounding boxes to cover larger objects. This design is inspired by the Inception model behind GoogLeNet: instead of constructing a very deep sequential CNN and relying on small features to build up larger features, filters of different sizes operate in parallel and the results are concatenated. This is similar to having telephoto, prime, and wide-angle camera lenses on your smartphone shooting at the same time, so you are picking up small, medium and large objects in one shot.

Inception model

The standard configuration of YOLO has three stride sizes, 32, 16, and 8 (which map to grids of 13 x 13, 26 x 26 and 52 x 52 for a 416 x 416 image), responsible for objects of large, medium and small sizes respectively. The three grids generate 13 x 13 x 3 + 26 x 26 x 3 + 52 x 52 x 3 = 10,647 bounding boxes as a fixed and manageable starting point. Because we are doing a sampled search and not a full search, some objects might be missed, but that’s the cost of a speed-first approach. In fact, the stride-based dimension reduction (rather than ConvNet + max-pooling) is also a choice for speed and not for accuracy.
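The arithmetic is easy to verify; a quick sketch of the candidate-box count for any input size (which must be a multiple of 32):

```python
def yolo_candidate_boxes(input_size=416, strides=(32, 16, 8), boxes_per_cell=3):
    """Count the fixed set of candidate boxes produced by YOLOv3's three detection grids."""
    grids = [input_size // s for s in strides]          # 13, 26, 52 for a 416 input
    return sum(g * g * boxes_per_cell for g in grids)

print(yolo_candidate_boxes(416))   # 10647
print(yolo_candidate_boxes(608))   # larger inputs -> more candidate boxes (22743)
```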

YOLOv3 (errata: the first detection is at layer 82 and not 84)

YOLO has three major releases: “You Only Look Once: Unified, Real-Time Object Detection”, “YOLO9000: Better, Faster, Stronger”, and “YOLOv3: An Incremental Improvement”. Each version is an attempt to improve model performance while maintaining the speed needed for real-time object detection. YOLOv3 uses a deep 53-layer ConvNet, darknet-53, for feature extraction, followed by another 53-layer ConvNet for detection at three size levels. The 106-layer architecture is fully convolutional (FCN) and does not contain any conventional Dense layers. A connected Dense layer requires input data to be flattened, which limits the size of input images, so an FCN design gives us the freedom to use any input image size (not without its own problems), a key feature for dealing with high-res content such as HDTV.

Performance and speed comparison (source)

The figure above compares the performance (mean average precision, mAP) and speed of some modern object detection models. mAP is a measurement that factors in both localisation accuracy (IoU) and classification accuracy (the precision-recall curve). COCO is a tough dataset on which to get a good mAP, so anything above 50 at mAP@0.5 is considered amazing. YOLOv3 clearly shows its advantage in speed (the authors put it “off the chart” to make a point…) while its performance is on a par with the others. It is also noticeable that larger input images (such as 608 x 608) can help YOLOv3’s performance with some penalty on speed.

It is important to point out that the performance comparison from the related work may not apply to our problem space. These models are likely to behave differently over high res image data extracted from video content. Based on the project requirements, phase 1 will have YOLOv3 as our reference framework.

AI and generative art

The new year started with a couple of interesting projects on AI.

In a HEIF-funded “Big Ideas” project, I am working with a brand placement company to prototype a solution that uses computer vision and deep learning to automate the evaluation of how brands and products appear in movies and TV shows. This will hopefully assist, if not replace, the daunting manual work of a human evaluator. Measuring the impact of brand placement is a complex topic and it is underpinned by the capability to detect the presence of products. Object detection (classification + localisation) is a well-researched topic with many established deep learning frameworks available. Our early prototypes, which use YOLOv3-based CNN (convolutional neural network) structures trained on the FlickrLogos-32 dataset, have shown promising outcomes. There is a long list of TODOs linked to gamma correction, motion complexity, etc.


Detection of Coca-Cola logo

Our analysis of eye gaze and body motion data from a previous VR experiment continues. The main focus is on feature extraction, clustering and data visualisation. There are quite a few interesting observations made by a PhD researcher on how men and women behave differently in VR and how this could contribute to an improved measurement of user attention.

Gaze direction in user experiments

The research on human attention in VR is not limited to passive measurement, and we already have some plans to experiment with creative art. We spent hours observing how young men and women interact with VR paintings, which has inspired us to develop generative artworks that capture the user experience of art encounters. Our first VR generative art demo will be hosted in the Milton Keynes Gallery project space in Feb 2020 as part of Alison Goodyear’s Paint Park exhibition. My SDCN project has been supporting the research as part of its Connected VR use case.

Associate Editor of Springer Multimedia Systems

I am thrilled to join the editorial board of the Springer journal Multimedia Systems. Since 1993, Multimedia Systems has been a leading journal in the field, covering enduring topics related to multimedia computing, AI, human factors, communication, and applications. The world of multimedia and computing is constantly evolving. I am really looking forward to working with other editors, reviewers, and authors to get the best research and engineering papers to our readers as quickly as we can while maintaining a high publishing standard.

Multimedia Systems

ISSN: 0942-4962 (Print) 1432-1882 (Online)

Description

This journal details innovative research ideas, emerging technologies, state-of-the-art methods and tools in all aspects of multimedia computing, communication, storage, and applications. It features theoretical, experimental, and survey articles.

Coverage in Multimedia Systems includes:

  • Integration of digital video and audio capabilities in computer systems
  • Multimedia information encoding and data interchange formats
  • Operating system mechanisms for digital multimedia
  • Digital video and audio networking and communication
  • Storage models and structures
  • Methodologies, paradigms, tools, and software architectures for supporting multimedia applications
  • Multimedia applications and application program interfaces, and multimedia end system architectures.