Product detection in movies and TV shows using machine learning – part 2: Dataset and implementations

Dataset for training

Existing object detection models are mostly trained to recognise common everyday objects such as dogs, people and cars. We'll use these pre-trained models later on when we do scene/character detection. For logo detection, we need to train our own model for the logos we want to detect. The training requires a sizeable dataset labelled with ground truth (where logos appear in the images). Because we are detecting logos in movies and TV shows, products are often not in perfect focus, lighting or orientation, and are sometimes obstructed by other objects. So our model needs to be trained on "in-the-wild" images captured in non-perfect conditions. We use the following two datasets.

Dataset 1: The Logos-32plus dataset is a "publicly-available collection of photos showing 32 different logo brands (1.9GB). It is meant for the evaluation of logo retrieval and multi-class logo detection/recognition systems on real-world images". The dataset has one sub-folder per class ("adidas", "aldi", …), each containing a list of JPG images. groundtruth.mat contains a MATLAB struct-array "groundtruth", each element having the following fields: relative path to the image (e.g. 'adidas\000001.jpg'), bboxes (bounding box format is X, Y, W, H), and logo name. The dataset contains around 340 Coca-cola logo images.
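For reference, the annotations can be read in Python with scipy; this is only a sketch, and the field names used below ('relpath', 'bboxes', 'name') are assumptions based on the dataset description, so check the actual struct fields after loading:

import scipy.io

mat = scipy.io.loadmat("groundtruth.mat", squeeze_me=True, struct_as_record=False)
for entry in mat["groundtruth"]:
    # assumed field names; inspect entry._fieldnames to confirm the real ones
    print(entry.relpath, entry.name, entry.bboxes)  # bbox rows are X, Y, W, H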

Logo32plus

Dataset 2: The Logos in the Wild Dataset is a large-scale set of web-collected images with logo annotations provided in Pascal VOC style. The current version (v2.0) of the dataset consists of 11,054 images with 32,850 annotated logo bounding boxes of 871 brands. It is an in-the-wild logo dataset where the logos appear as a natural part of the images rather than as raw original logo graphics. Images were collected via Google image search based on a list of well-known brands and companies. The bounding boxes are given in the format of (x_min, y_min, x_max, y_max) in absolute pixels.

The dataset does not provide actual images but URLs to fetch images from various online sources such as: http://bilder.t-online.de/b/47/70/93/86/id_47709386/920/tid_da/platz-1-bei-den-plakaten-coca-cola-foto-imas-.jpg So one must write a script to download images from the URLs. Not all URLs are still valid, and we extracted around 530 Coca-cola logo images. The image (600×428) below contains three logos and the ground truth is (435, 22, 569, 70), (308, 274, 351, 292), and (209, 225, 245, 243).
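The download script is nothing fancy; a minimal sketch along these lines works (the URL list format and output layout are assumptions, and invalid URLs are simply skipped):

import os
import requests

def download_images(url_file, out_dir="cocacola"):
    os.makedirs(out_dir, exist_ok=True)
    with open(url_file) as f:
        urls = [line.strip() for line in f if line.strip()]
    for i, url in enumerate(urls):
        try:
            r = requests.get(url, timeout=10)
            r.raise_for_status()
        except requests.RequestException:
            continue  # many URLs are no longer valid; just skip them
        with open(os.path.join(out_dir, f"{i:06d}.jpg"), "wb") as out:
            out.write(r.content)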

As different object detection implementations use different methods to define bounding boxes, it is necessary to write conversion scripts to map between the definitions, as sketched below. This is not a horrendous task but it does require basic programming skills.
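A couple of helper functions cover the conversions we needed; the last one produces the normalised (x_centre, y_centre, width, height) convention used by YOLO label files:

def xywh_to_xyxy(x, y, w, h):
    # Logos-32plus (X, Y, W, H) -> Logos in the Wild (x_min, y_min, x_max, y_max)
    return x, y, x + w, y + h

def xyxy_to_xywh(x_min, y_min, x_max, y_max):
    return x_min, y_min, x_max - x_min, y_max - y_min

def xyxy_to_yolo(x_min, y_min, x_max, y_max, img_w, img_h):
    # YOLO training labels: centre coordinates and sizes normalised by image size
    return ((x_min + x_max) / 2 / img_w, (y_min + y_max) / 2 / img_h,
            (x_max - x_min) / img_w, (y_max - y_min) / img_h)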

Part of the Coca-cola image dataset

Useful links:

Implementations

We have an iMac Pro and an HP workstation running Windows to host the project development. Both platforms have a decent Xeon processor and plenty of memory. However, our ConvNet-heavy application requires GPU acceleration to reduce training and detection time from days and hours to hours and minutes. ML GPU acceleration is led by Nvidia thanks to its range of graphics cards and software support, which includes the CUDA toolkit for parallel computing / image processing and the CUDA Deep Neural Network library (cuDNN) for deep learning. GPU acceleration for ML is currently not officially possible on Mac (Nvidia and Apple haven't found a way to work together), and we want to steer away from hacks.

The authors of YOLOv3 provide an official implementation detailed at https://pjreddie.com/darknet/yolo/. You'll need to install Darknet, an open-source neural network framework written in C and CUDA. It is easy to compile and run the baseline configuration on Mac and Linux, but there is little support for the Windows platform.

We then tested two alternative solutions.

The first is YunYang's TensorFlow 2.0 Python implementation. It supports training, inference and evaluation. The source code is not too difficult to follow and the author also wrote a tech blog (in Chinese) with some useful details. This "minimal" YOLO implementation is ideal for people who want to learn the code, but some useful features are missing or not implemented in full; data augmentation is an example. There are also challenges with NaN loss (no loss outputs, so no learning) that require a lot of tweaking. Also, although we have an Nvidia RTX 4000 graphics card with 8GB of memory, we kept running out of GPU memory when we increased the input image size or batch size. This issue is, however, not necessarily linked to YunYang's implementation.
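For reference, one mitigation that can help in TensorFlow 2 is enabling GPU memory growth so that memory is allocated on demand rather than all at start-up; it does not raise the 8GB ceiling, it only avoids the framework grabbing everything up front:

import tensorflow as tf

# Allocate GPU memory incrementally instead of reserving it all at once.
for gpu in tf.config.experimental.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)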

We then switched to AlexeyAB's darknet C implementation, which uses Tensor cores. There are instructions on how to compile the C source code for Windows. The process is quite convoluted but the instructions are easy to follow. Some training configurations require a basic understanding of the YOLO architecture to tune. It is possible to use a Python wrapper over the binary or compile YOLO as a DLL, both of which are very handy for linking the detection core functions to user applications. There is also an extended version of YOLOv3 with 5 size outputs rather than 3 to potentially improve small object detection.
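As a rough sketch of the wrapper route, the compiled binary can simply be driven from Python with a subprocess call; the data/cfg/weights file names below are placeholders, and the flags are those documented in AlexeyAB's repository:

import subprocess

# -ext_output prints box coordinates; -dont_show suppresses the OpenCV window.
result = subprocess.run(
    ["darknet.exe", "detector", "test",
     "data/obj.data", "cfg/yolov3-logo.cfg", "backup/yolov3-logo_final.weights",
     "-ext_output", "-dont_show", "test_frame.jpg"],
    capture_output=True, text=True)
print(result.stdout)  # parse detected classes, confidences and boxes from here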

To address the GPU out-of-memory issue, we quickly acquired an NVIDIA TITAN RTX to replace the RTX 4000. NVIDIA claims it "is the fastest PC graphics card ever built. It's powered by the award-winning Turing architecture, bringing 130 Tensor TFLOPs of performance, 576 tensor cores, and 24 GB of ultra-fast GDDR6 memory to your PC". It is possible to scale up to a two-card configuration with a TITAN RTX NVLINK BRIDGE. [We were lucky to have acquired this before the Coronavirus shutdown in the UK…]

An image for our eyes…

Product detection in movies and TV shows using machine learning – part 1: Background

This blog series discusses some R&D work within an ongoing "Big Idea" project. The project is in collaboration with the Big Film Group Ltd, a leading Product Placement Agency working with Blue Chip clients across UK and International entertainment properties.

Background

Currently, as part of its service, the company offers clients an evaluation of the impact of product placement in TV programmes and films. The service is essential to the customer experience and the growth of the business. The evaluation is carried out via human inspection of programmes to mark all corresponding appearances and mentions within broadcast media content, mainly TV and film. This manual operation is time-consuming, costly and requires intense concentration. We believe that this whole process can eventually be automated using media processing techniques, AI and machine learning. [project design document]

The long-term goal would be to offer a monitoring service across all broadcast media which would allow agencies and their clients to know where, when and how their brands and companies are being talked about on air. For PR, Advertising and Social Media agencies this information would be particularly valuable. There is no existing solution readily available and we believe that there would be high demand for information and services of this nature. [project design document]

The first phase of the project is to prototype a core function: product detection. We want something that can detect Coca-cola products in sample videos provided by the company. The H.264/AAC encoded sample videos are roughly 40 seconds long, at a resolution of 1920×1080 and a frame rate of 25 fps. Coca-cola products appear at various points in the sample videos for durations between half a second and several seconds.

A scene from TV show Geordie Shore with multiple Coca-cola products (copyrights belong to their respective owners)

Requirements

It is quite clear that we can map the core function to an ML object detection problem. Object detection has seen major developments and successes in the past 5 years. So our focus is to analyse the requirements and pick the best from existing frameworks to develop a working solution.

  • Requirement 1: Logo detection. To simplify the solution, we start with logo detection. This means that we do not differentiate between different products/packages of the same brand, nor their colours. So Coca-cola cans and glass bottles are considered the same.
  • Requirement 2: Accuracy. The goal is to reach near human-level accuracy overall, but there are some major differences between the two. With human inspection, we expect few false positive (FP) detections but a degree of false negatives (FN) when very brief appearances of a product are not picked up by human eyes. For an ML-based solution, detection can be carried out frame by frame, but there is a good chance of both FPs and FNs.
  • Requirement 3: Speed. As we are prototyping for video, the speed of the framework is important. There is no hard requirement on processing frame rate, but few people would want to wait an hour or two of processing time for each TV show or movie.
  • Requirement 4: End system. We are setting no constraints on end systems (for either training or runtime). We'll develop the application on a physical workstation (not virtualised) while assuming a similar system will be available at runtime. It is possible to move the runtime system to the cloud.

Framework

Two things constitute an object detection task: localisation (where things are) and classification (what it is). 1) Localisation predicts the coordinates of a bounding box that contains an object (and the likelihood of an object existing in that box). Different frameworks may use different coordinate systems, such as (x_min, y_min, x_max, y_max) or (x_centre, y_centre, width, height). 2) Classification tells us the probability of the object in the bounding box belonging to each of a set of pre-defined classes, OR a distribution of probabilities over the classes when the classes are exclusive (i.e., when an object cannot be associated with multiple classes).
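The difference between the two classification outputs can be illustrated with a toy example (the numbers are arbitrary):

import numpy as np

logits = np.array([2.0, 1.0, 0.1])            # raw class scores for one box

# Non-exclusive classes: an independent sigmoid per class; values need not sum to 1.
sigmoid = 1.0 / (1.0 + np.exp(-logits))

# Mutually exclusive classes: softmax gives a probability distribution summing to 1.
softmax = np.exp(logits) / np.exp(logits).sum()

print(sigmoid)   # roughly [0.88 0.73 0.52]
print(softmax)   # roughly [0.66 0.24 0.10]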

There are two main ML framework families for object detection: Region-Based Convolutional Neural Networks (R-CNNs) and You Only Look Once (YOLO). Both families have seen major updates in the past few years. Without going into the technical details too much, I'll compare the two and discuss the reasons for our choice.

R-CNN

R-CNN is one of the first end-to-end working solutions for object detection. It selects regions that likely contain objects using selective search, a greedy search process that finds all possible regions and selects the 2,000 best ones (in the form of coordinates). The selected regions then go through ConvNet feature extraction before a separate classifier makes predictions for each region. R-CNN splits key functions into independent modules, which is a reasonable choice for prototyping, and it has shown relatively good performance. The main issue with R-CNN is its speed. As thousands of regions go through the ConvNet for each image, the process can be extremely slow: processing a single image can take tens of seconds.

Fast R-CNN and Faster R-CNN introduced some significant architectural changes to improve the efficiency of the process (hence the names). The changes include shifting the ConvNet to an earlier stage of the process so there is less (or no) overlap in ConvNet operations over each image. The functionality of the ConvNet is also extended beyond the initial feature extraction to support region proposal (the Region Proposal Network (RPN)) as well as classification (replacing the SVM with ConvNet + an activation function such as softmax). As a result, the architectural components also become more integrated. Faster R-CNN can process an image in less than a second. In summary, the R-CNN family started from a good performance baseline and then gradually improved its speed to achieve "real-time" detection. Detectron (Mask R-CNN) is a good starting point to test out recent developments on R-CNN.

Compared with R-CNN, YOLO is designed for speedy detection when accuracy is not mission-critical. Instead of searching for the appearance of objects in every possible location, YOLO uses a grid-based search. The grid fixes the anchor points in each image and a number of bounding boxes (such as 3) are created at each anchor point. The grid size is determined by the stride when convolution operations are applied to the image, so for a 416 x 416 image, a stride of 16 will result in a grid of 26 x 26. A larger stride means a greater reduction of the feature map dimensions, hence it allows the bounding boxes to cover larger objects. This design is inspired by the Inception model behind GoogLeNet. Instead of constructing a very deep sequential CNN and relying on small features to build up larger features, filters of different sizes operate in parallel and the results are concatenated. This is similar to having telephoto, prime, and wide-angle camera lenses on your smartphone shooting at the same time, so you pick up small, medium and large objects in one shot.

Inception model

The standard configuration of YOLO has three stride sizes 32, 16, and 8 (which map to grids of 13 x 13, 26 x 26 and 52 x 52 for a 416 x 416 image), responsible for objects of large, medium and small sizes respectively. So the three grids generate 13 x 13 x 3 + 26 x 26 x 3 + 52 x 52 x 3 = 10,647 bounding boxes as a fixed and manageable starting point. Because we are doing a sampled search and not a full search, some objects might be missed, but that's the cost of a speed-first approach. In fact, the stride-based dimension reduction (rather than ConvNet + max-pooling) is also a choice made for speed and not for accuracy.
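A quick back-of-the-envelope check of the box count for different input sizes:

def yolo_box_count(input_size=416, strides=(32, 16, 8), anchors_per_cell=3):
    # Number of candidate boxes evaluated: one set of anchors per grid cell per scale.
    total = 0
    for s in strides:
        grid = input_size // s          # 13, 26, 52 for a 416 input
        total += grid * grid * anchors_per_cell
    return total

print(yolo_box_count(416))   # 10647
print(yolo_box_count(608))   # larger inputs mean more boxes and slower inference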

YOLOv3 (errata: the first detection is at layer 82 and not 84)

YOLO has three major releases: You Only Look Once: Unified, Real-Time Object Detection, YOLO9000: Better, Faster, Stronger, and YOLOv3: An Incremental Improvement. Each version is an attempt to improve model performance while maintaining the speed for real-time object detection. YOLOv3 uses a deep 53-layer ConvNet (darknet-53) for feature extraction followed by another 53 layers for detection at three size levels. The resulting 106-layer architecture is fully convolutional (FCN) and does not contain any conventional Dense layers. A connected Dense layer requires input data to be flattened, which limits the size of input images, so an FCN design gives us the freedom to use any input image size (not without its own problems), a key feature for dealing with high-res content such as HDTV.

Performance and speed comparison (source)

The figure above compares the performance (mean Average Precision, mAP) and speed of some modern object detection models. mAP is a measurement that factors in both localisation accuracy (IoU) and classification accuracy (the Precision-Recall curve). COCO is a tough dataset on which to get a good mAP, so anything above 50 mAP@0.5 is considered amazing. YOLOv3 clearly shows its advantage in speed (the authors put themselves "off the chart" to make a point…) while its performance is on a par with others. It is also noticeable that larger input images (such as 608 x 608) can help YOLOv3's performance with some penalty on speed.
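For completeness, IoU (the localisation half of the measurement) is straightforward to compute for two boxes in (x_min, y_min, x_max, y_max) form:

def iou(box_a, box_b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

# A detection typically counts as a true positive at mAP@0.5 when
# iou(predicted_box, ground_truth_box) >= 0.5 and the class matches.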

It is important to point out that the performance comparison from related work may not apply to our problem space. These models are likely to behave differently over high-res image data extracted from video content. Based on the project requirements, phase 1 will use YOLOv3 as our reference framework.

AI and generative art

The new year started with a couple of interesting projects on AI.

In a HEIF-funded "Big Ideas" project, I am working with a brand placement company to prototype a solution that uses computer vision and deep learning to automate the evaluation of how brands and products appear in movies and TV shows. This will hopefully assist, if not replace, the daunting manual work of a human evaluator. Measuring the impact of brand placement is a complex topic and it is underpinned by the capability to detect the presence of products. Object detection (classification + localisation) is a well-researched topic with many established deep learning frameworks available. Our early prototypes, which use YOLOv3-based CNN (convolutional neural network) structures and are trained on the FlickrLogos-32 dataset, have shown promising outcomes. There is a long list of TODOs linked to gamma correction, motion complexity, etc.


Detection of Coca-Cola logo

Our analysis of eye gaze and body motion data from a previous VR experiment continues. The main focus is on feature extraction, clustering and data visualisation. There are quite a few interesting observations made by a PhD researcher on how men and women behave differently in VR and how this could contribute to an improved measurement of user attention.

Gaze direction in user experiments

The research on human attention in VR is not limited by passive measurement and we already have some plans to experiment with creative art. We spent hours observing how young men and women interact with VR paintings which has inspired us to develop generative artworks that capture user experience of art encounters. Our first VR generative art demo will be hosted in Milton Keynes Gallery project space in Feb 2020 as part of Alison Goodyear’s Paint Park exhibition. My SDCN project has been supporting the research as part of its Connected VR use case.

Associate Editor of Springer Multimedia Systems

I am thrilled to join the editorial board of the Springer Multimedia Systems journal. Since 1993, Multimedia Systems has been a leading journal in the field, covering enduring topics related to multimedia computing, AI, human factors, communication, and applications. The world of multimedia and computing is constantly evolving. I am really looking forward to working with other editors, reviewers, and authors to get the best research and engineering papers to our readers as quickly as we can while maintaining a high publishing standard.

Multimedia Systems

ISSN: 0942-4962 (Print) 1432-1882 (Online)

Description

This journal details innovative research ideas, emerging technologies, state-of-the-art methods and tools in all aspects of multimedia computing, communication, storage, and applications. It features theoretical, experimental, and survey articles.

Coverage in Multimedia Systems includes:

  • Integration of digital video and audio capabilities in computer systems
  • Multimedia information encoding and data interchange formats
  • Operating system mechanisms for digital multimedia
  • Digital video and audio networking and communication
  • Storage models and structures
  • Methodologies, paradigms, tools, and software architectures for supporting multimedia applications
  • Multimedia applications and application program interfaces, and multimedia end system architectures.

Smart Campus project – part 2

In Part 1, I introduced the architecture and showed some sample charts from my Smart Campus project. The non-intrusive use of WiFi data for campus services and student experience is really cool.

As we approach the start of the university term, I have less time to work on this project. So my focus was to prototype a "student-facing" application that visualises live building information. The idea is that students can tell which computing labs are free, where to find quiet study areas, or check whether the student helpdesk is too busy to visit. The security team can also use it to spot any abnormal activity at certain times of the day.

The chart below shows a screenshot of a live floor heatmap with breakdowns of lecture rooms (labelled white), study areas (also labelled white), staff areas (labelled black), and service areas (labelled grey).

floor heatmap (not for redistribution)

Technically, the application is split into three parts: a user-facing front-end (floor chart), a data feed (JSON) and a backend (data processing). The data feed layer provides the necessary segregation so that user requests don't trigger backend operations directly.

The front-end chart is still based on the Highcharts framework, though I needed to manually draw the custom map in Inkscape based on the actual floor plan, export it as SVG, and convert it to map JSON using Highcharts' online tool. At the same time, the mapping between areas (e.g., lecture rooms) and their corresponding APs must also be recorded in the database. This is a very time-consuming process that requires a bit of graphic editing skill and a lot of patience.

The backend functions adopt a "10-minute moving average window" and periodically calculate the AP/area device population to generate data for each area defined in the custom floor map. I also filtered out devices that are simply passing by APs to reduce noise in the data (e.g., a person walking along the corridor will not leave a trace). The data is then merged with the floor-map JSON to generate the data feed every few minutes as a static JSON file.
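A simplified sketch of this aggregation, assuming the samples have already been loaded into a pandas DataFrame with timestamp, area and device_hash columns (the actual table and column names differ):

import pandas as pd

def area_population(samples: pd.DataFrame, window="10min"):
    # Unique devices seen in each area per 1-minute sample...
    per_minute = (samples
                  .groupby([pd.Grouper(key="timestamp", freq="1min"), "area"])
                  ["device_hash"].nunique()
                  .unstack(fill_value=0))
    # ...smoothed with a 10-minute moving average window.
    return per_minute.rolling(window).mean()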

A finishing touch is the chart annotation for most floor areas. I use different label colours so areas with different functions can be clearly identified.

TB to Unity – A small software tool for creative VR artists

[I am still learning Unity/abstract art. Do let me know if you spot me doing anything silly.]

References:
https://github.com/googlevr/tilt-brush-toolkit
https://docs.google.com/document/d/1YID89te9oDjinCkJ9R65bLZ3PpJk1W4S1SM2Ccc6-9w/edit
https://blog.google/products/tilt-brush/showcase-your-art-new-ways-tilt-brush-toolkit/

Google Tilt Brush (TB) is a virtual art studio that enables artists to create paintings in VR. It's packed with features for editing and sharing. Just as physical artworks require a gallery for exhibition, TB VR paintings need a specialised environment for their audiences. Game engines such as Unity are a natural choice since they offer a wide spectrum of tools to help install artwork, control the environment, and choreograph interactions with the audience. You can also "bake" the outcomes for different platforms.

The standard workflow to port an artwork to Unity is: export the TB artwork as an FBX file -> import the FBX into Unity and add it to the scene -> apply Brush materials to the meshes using the content provided by the tiltbrush-toolkit. This works well until you want to do anything specific with each brush stroke, such as hand-tracking to see where people touch the artwork (yes, it's ok to touch! I even put my head into one to see what's inside). In Unity, artworks are stored in meshes and there is no one-to-one mapping between brush strokes and meshes. In fact, all strokes of the same brush type are merged into one big mesh (even when they are not connected) when they are exported from TB. This is (according to a TB engineer) to make the export/import process more efficient.

The painting below was made using only one brush type, "WetPaint", in spite of the different colours, patterns and physical locations of the strokes. So in the eyes of Unity, all five thousand brush strokes are one mesh and there is nothing you can do about it, as it's already fixed in the FBX when the artwork was exported from TB. This simply won't work if an artist wants to continue her creative process in Unity or collaborate with game developers to create interactive content.

Abstract VR Painting Sketch Copyright@Alison Goodyear

To fix it, we have to bypass TB's FBX export function. Luckily, TB also exports artworks in JSON format. Using the Python-based export tools in the tiltbrush-toolkit, it's possible to convert JSON to FBX with your own configurations. Judging from the developer comments in the source code, these export tools came before TB supported direct FBX export. Specifically, the "geometry_json_to_fbx.py" script allows us to perform the conversion with a few useful options, including whether to merge strokes ("--no-merge-brush"). However, not merging strokes by brush type led to loose meshes in Unity with no obvious clue of their brush type.

With some simple modifications to the source code, the script exports meshes with the brush type as a prefix in the mesh names, as shown below. This setup makes it easy to select all strokes with the same brush type, lock them, and apply brush materials in one go. I also added a sequence number at the end of the mesh name (starting from 1000). Occasionally, we put multiple artworks in the same Unity scene, like a virtual gallery. It is then important to be able to differentiate meshes from different artworks in the asset list. This is done by including the original JSON filename in the mesh name ("alig" in the picture below). At the moment, we are working on understanding how audiences interact with paint of different colours, so the colour of the stroke (in "abgr little-endian, rgba big-endian" format) is also coded for quick access in Unity. As a whole, the mesh naming scheme is: BRUSHTYPE_STARTINGCOLOUR_JSONNAME_ID. All of this is based on some simple hacking of the "write_fbx_meshes()" and "add_mesh_to_scene()" functions.
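The naming scheme itself is trivial to construct; a simplified illustration (not the toolkit's actual code):

def make_mesh_name(brush_type, starting_colour, json_name, seq):
    # starting_colour is the uint32 "abgr little-endian, rgba big-endian" value
    return f"{brush_type}_{starting_colour:08x}_{json_name}_{seq}"

print(make_mesh_name("WetPaint", 0xFF2A7FD4, "alig", 1000))
# -> WetPaint_ff2a7fd4_alig_1000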

Coding metadata of brush strokes in their names is sufficient in most cases, though there are experiments where we need more detailed / fine-grained brush information. As far as colour is concerned, it is important to log the full "colour array" since the colour may change along the stroke; in our mesh names, we only record the starting colour. To support better data-driven research, we also export the full stroke metadata as a JSON file alongside the FBX. The schema is:

{'fbxname': FBXNAME,
 'fbxmeta':
  [{'meshname': MESHNAME,
    'meshmeta':
     {'brush_name': BRUSHNAME,
      'brush_guid': BRUSH_GUID,
      'v': V,      # list of positions (3-tuples)
      'n': N,      # list of normals (3-tuples, or None if missing)
      'uv0': UV0,  # list of uv0 (2-, 3-, 4-tuples, or None if missing)
      'uv1': UV1,  # see uv0
      'c': C,      # list of colors, as a uint32. abgr little-endian, rgba big-endian
      't': T,      # list of tangents (4-tuples, or None if missing)
      'tri': TRI   # list of triangles (3-tuples of ints)
     }
   }, {}, …]
}
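Following the colour convention quoted above, the uint32 value can be unpacked back into channels on the analysis side; a minimal sketch:

def decode_colour(c: int):
    # "abgr little-endian, rgba big-endian": red sits in the most significant
    # byte and alpha in the least significant one.
    r = (c >> 24) & 0xFF
    g = (c >> 16) & 0xFF
    b = (c >> 8) & 0xFF
    a = c & 0xFF
    return r, g, b, a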

The modified script is available here: https://github.com/MrMMu/tiltbrushfbxexport

Another example of Alison’s “Peacock” painting imported in Unity:

Copyright@Alison Goodyear

Research on the fairness of networked multimedia to appear in FAT/MM WS at ACM Multimedia 2019

A job well done for a first-year PhD student.

SDCN: Software Defined Cognitive Networking


Basil, A. et al., A Software Defined Network Based Research on Fairness in Multimedia, FAT/MM WS, 27th ACM International Conference on Multimedia (ACM MM 2019), France. 10/2019

The demand for online distribution of high quality and high throughput content has led to a non-cooperative competition of network resources between a growing number of media applications. This causes a significant impact on network efficiency, the quality of user experience (QoE) as well as a discrepancy of QoE across user devices. Within a multi-user multi-device environment, measuring and maintaining perceivable fairness becomes as critical as achieving the QoE on individual user applications. This paper discusses application- and human-level fairness over networked multimedia applications and how such fairness can be managed through novel network designs using programmable networks such as software-defined networks (SDN).



“Disruptive” VR art? A quick update

Lovely sunset view from Lowry

Our visit to TVX 2019 was a tremendous success. Murtada and Alison's lightning talks were well received and we managed to have two demos in the BBC Quay House on the last day.

Alison's VR painting demo had a great start, then took an interesting turn and became a community art creation exercise. Audiences from different backgrounds built on each other's creations and the artwork just kept growing in multiple dimensions (no canvas to limit you, and no one is afraid of making a "digital mess"). This has really inspired us to look into collaborative VR art more closely.

Alison's VR Painting demo (trust me, I tried tidying the desk)

Murtada’s gaze-controlled game has seen a lot of visitors who “always wanted to do something with eye-tracking in VR”. We are already working on the third version of the game. We have changed the strategy from “building a research tool that contains games elements” to “building a professional VR game with research tool integrated”. The game will also be part of a use case for our Intelligent Networks experiments.

Murtada’s gaze-control game demo

Immediately after TVX, we also organised a workshop at the Merged Futures event on our campus. Our audience was mainly SMEs and educators from Northants and nearby counties.

VR arts and education workshop at Merged Futures 2019, UON

Slides from the workshop:

Smart Campus project – part 1

Most research in communication networks is quite fundamental, such as sending data frames from point A to point B as quickly as possible with little loss on the way. Some networking research can also benefit communities indirectly. I recently started a new collaboration with our University IT department on a smart campus project where we use anonymised data sampled from a range of on-campus services for service improvement and automation, with the help of information visualisation and data analytics. The first stage of the project is very much focused on the "intent-based" networking infrastructure by Cisco on Waterside campus. This state-of-the-art system provides us with a central console and APIs to manage all network switches and 1000+ wireless APs. Systematically studying how user devices connect to our APs can help us, in a non-intrusive fashion, better understand the way(s) our campus is used, and use that intelligence to improve our campus services. Although it is possible to correlate data from various university information systems to infer the ownership of devices connected to our wireless networks, my research does not make use of any data related to user identity at this stage. Not only because it is unnecessary (we are only interested in how people use the campus as a whole), but also because of how privacy and data protection rules are implemented. This is not to say that we'll avoid any research on individual user behaviours. There are many use cases around timetabling, bus services, personal wellbeing and safety that will require volunteers to sign up to participate.

This part 1 blog shares the R&D architecture and some early prototypes of data visualisation before they evolve into something humongous.

A few samples of the charts we have:

  • Wireless connected devices in an academic building with breakdowns on each floor. There is a clear weekly and daily pattern. We are able to tell which floors are over- or under-used and improve our energy efficiency / help students or staff find free space to work. [image not for redistribution]
  • "Anomaly" due to a fire alarm test (hundreds of devices leaving the building in minutes). We can examine how people leave from different areas of the building and identify any bottleneck. [image not for redistribution]
  • Connected devices on campus throughout a typical off-term day with breakdowns in different areas (buildings, zones, etc.). [image not for redistribution]
  • Heatmap of devices connected in an academic building in off-term weeks. The heat strips are grouped by weekday, except an Open Day Saturday. [image not for redistribution]
  • Device movements between buildings/areas. It helps us understand the complex dependencies between parts of our infrastructure and how we can improve the user experience. [image not for redistribution]
  • How connected devices are distributed across campus in the past 7 days and the top 5 areas on each floor of academic buildings. [image not for redistribution]

So how were the charts made?

The source of our networking data is the Cisco controllers. The DNA Centre offers secure APIs while the WLC has a well-structured interface for data scraping. Either option worked for us, so we have Python-based data sampling functions programmed for both interfaces. What we collect is a "snapshot" of all devices in our wireless networks and the details of the APs they are connected to. All device information such as MAC addresses can be hashed, as long as we can differentiate one device from another (count unique devices) and associate a device across different samples.

We think of devices' movements on campus as a continuous signal, and the sampling process is essentially an ADC (analog to digital conversion) exercise similar to audio sampling. The Nyquist theorem instructs us to use a sampling frequency at least twice the highest frequency of the analog signal to faithfully capture the characteristics of the input. In practice, the signal frequency is determined by the density of wireless APs in an area and how fast people travel. In a seating area on our Learning Hub ground floor, I could easily pass a handful of APs during a minute-long walk. Following the maths and sampling from the control centre every few seconds risks killing the data source (and, unlikely but possibly, the entire campus network). As the first prototype, I compromised on a 1/min sampling rate. This may not affect our understanding of the movement between buildings that much (unless you run really fast between buildings) but we might need some sensible data interpolation for indoor movements (e.g., a device didn't teleport from the third-floor library to a fourth-floor classroom; it travelled via the stairwell/lift).
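The sampling loop itself boils down to something like the sketch below, where fetch_snapshot() stands in for the DNA Centre API call or WLC scrape (a placeholder, not a real API) and only hashed identifiers are kept:

import hashlib
import pickle
import time
from datetime import datetime, timezone

def hash_mac(mac: str) -> str:
    return hashlib.sha256(mac.encode()).hexdigest()

def sample_forever(fetch_snapshot, interval=60):
    while True:
        # fetch_snapshot() is assumed to return dicts with 'mac' and 'ap' keys
        snapshot = [{"device": hash_mac(d["mac"]), "ap": d["ap"]}
                    for d in fetch_snapshot()]
        stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
        with open(f"samples/{stamp}.pkl", "wb") as f:   # one Pickle file per sample
            pickle.dump(snapshot, f)
        time.sleep(interval)                            # 1/min sampling rate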

Architecture (greyed out elements will be discussed in future blogs)

The sampling outcomes are stored as data snippets in the format of Python Pickle files (one file per sample). The files are then picked up asynchronously by a Python-based data filtering and DB insertion process, which writes the data into a database for analysis. Processed Pickle files are archived and hopefully never needed again. Separating the sampling and DB insertion makes things easier when you are prototyping (e.g., changing the DB table structure or data types while sampling continues).
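The companion filter-and-insert step follows the same spirit; a minimal sketch with sqlite3 standing in for the actual database backend and assumed table/column names:

import glob
import os
import pickle
import sqlite3   # stand-in for the actual database backend

def ingest(sample_dir="samples", archive_dir="archive", db_path="campus.db"):
    conn = sqlite3.connect(db_path)
    for path in sorted(glob.glob(os.path.join(sample_dir, "*.pkl"))):
        with open(path, "rb") as f:
            snapshot = pickle.load(f)
        stamp = os.path.splitext(os.path.basename(path))[0]
        conn.executemany(
            "INSERT INTO samples (stamp, device, ap) VALUES (?, ?, ?)",
            [(stamp, row["device"], row["ap"]) for row in snapshot])
        conn.commit()
        # archive the processed file so it is never picked up again
        os.rename(path, os.path.join(archive_dir, os.path.basename(path)))
    conn.close()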

Data growth [image not for redistribution]

With the records in our DB growing at a rate of millions per day, some resource-intensive pre-processing / aggregation (such as the number of unique devices per hour on each floor of a building) needs to be done periodically to accelerate any subsequent server-side functions for data visualisation, reducing the volume of data going to the web server by several orders of magnitude. This comes at the cost of inserting additional entries in the database and risking "seams" between iterations of pre-processing, but the benefit clearly outweighs the cost.

The visualisation process is split into two parts: the plot (chart) and the data feed. There are many choices for professional-looking static information plotting, such as Matplotlib and ggplot2 (see how the BBC Visual and Data Journalism team works with graphics in R). Knowing that we'll present the figures in interactive workshops, I made a start with web-based dynamic charts that "bring data to life" and allow us to illustrate layers of information while encouraging exploration. Frameworks that support such tasks include D3.js and Highcharts (a list of 14 can be found here). Between the two, D3 gives you more freedom to customise your chart but you'll need to be an SVG guru (with a degree of artistic excellence) to master it. Meanwhile, Highcharts provides many sample charts to begin with and the data feed is easy to programme. It's an ideal tool for prototyping and only some basic knowledge of Javascript is needed. To feed structured data to Highcharts, we pair each chart page with a PHP worker for data aggregation and formatting. The workflow is as follows:
1) The client-side webpage loads all elements, including the Highcharts framework and the HTML elements that accommodate the chart.
2) A JQuery function waits for the page load to complete and initiates a Highcharts instance with the data feed left open (empty).
3) The same function then calls a separate Javascript function that performs an AJAX call to the corresponding PHP worker.
4) The PHP worker runs server-side code, fetches data from MySQL, and performs any data aggregation and formatting necessary before returning the JSON-encoded results to the front-end Javascript function.
5) Upon receiving the results, the Javascript function conducts lightweight data demultiplexing for more complex chart types and sets the data attribute of the Highcharts instance with the new data feed.
For certain charts, we also provide some extra user input fields to help deal with user queries (e.g., plot data from a particular day).

Data science and network management at IEEE IM 2019, Washington, D.C.

IEEE IM 2019 – Washington DC, USA (link to papers)
Following IM 2017 in the picturesque Lisbon, one of the most beautiful cities in Europe, this year’s event was held in the US capital city during its peak cherry blossom season.

The conference adopted the theme of "Intelligent Management for the Next Wave of Cyber and Social Networks". Besides the regular tracks, the five-day conference featured some great tutorials, keynotes and panels. I have pages of notes and many contacts to follow up.

A few highlights are: zero-touch network and service management (and how it's actually "touch less" rather than touchless!), Huawei's Big Packet Protocol (network management via packet header programming), DARPA's off-planet network management (fractionated architectures for satellites), Blockchain's social, political and regulatory challenges (does not work with GDPR?) by UZH, data science/ML for network management from Google and Orange Labs (with some Python notebooks and a comprehensive survey paper of 500+ references), and many more. I am hoping to write more about some of them in the future when I have a chance to study them further. There are certainly some good topics for student projects.

Since I am linked to both the multimedia/HCI and communication network communities, I have the opportunity to observe the different approaches and challenges faced by these communities towards AI and ML. In multimedia communities, it's relatively easy to acquire large and clean datasets, and there is a high level of tolerance when it comes to "trial and error": 1) no one will get upset if a few out of a hundred image search results are not accurate, and 2) you can piggy-back some training module / reinforcement learning on your services to improve the model. Furthermore, applications are often part of a closed proprietary environment (end-to-end control) and users are not that bothered about giving up their data. In networking, things are not far from "mission impossible". 95% accuracy in packet forwarding will not get you very far, and there is not much infrastructure available to track any data, let alone make any data open for research. Even when there are tools to do so, you are likely to encounter encryption or information that is too deep to extract in practice. Also, tracking network data seems to attract more controversy. We have a long and interesting way to go.

Washington, D.C. is surrounded by some amazing places to visit. George Washington's riverside Mount Vernon is surely worth a trip. Not far from Dulles airport is Great Falls Park, with spectacular waterfalls on the Potomac river that separates Maryland and Virginia. Further west are the 100-mile scenic Skyline Drive and the Appalachian Trail in Shenandoah National Park.