Most research in communication networks are quite fundamental such as sending data frame from point A to point B as quickly as possible with little loss on the way. Some networking research can also benefit communities indirectly. I recently started a new collaboration with our University IT department on a smart campus project where we use anonymised data sampled from a range of on-campus services for service improvement and automation with the help of information visualisation and data analytics. The first stage of the project is very much focused on the “intent-based” networking infrastructure by Cisco on Waterside campus. The SoTA system provides us with a central console and APIs to manage all network switches and 1000+ wireless APs. Systematically studying how user devices are connected to our APs can help us, in a non-intrusive fashion, better understand the way(s) our campus are used, and use that intelligence to improve our campus services. Although it’s possible to correlate data from various university information systems to infer ownership of devices connected to our wireless networks, my research does not make use of any data related user identity at this stage. Not only because it is unnecessary (we are only interested in how people use the campus as a whole), but also because how privacy and data protection rules are implemented. This is not to say that we’ll avoid any research on individual user behaviours. There are many use cases around timetabling, bus services, personal wellbeing and safety that will require volunteers to sign up to participate.
This part 1 blog shares the R&D architecture and some early prototypes of data visualisation before they evolve into something humongous.
A few samples of the charts we have:
So how were the charts made:
The source of our networking data is the Cisco controllers. The DNA centre offers secure APIs while the WLC has a well structured interface for data scraping. Either option worked for us so we have Python-based data sampling functions programmed for both interfaces. What we collect is a “snapshot” of all devices in our wireless networks and the details of the APs they are connected to. All device information such as MAC addresses can be hashed as long as we can differentiate one device from another (count unique devices) and associate a device across different samples. We think devices’ movements on campus as a continuous signal. The sampling process is essentially an ADC (analog to digital conversion) exercise similar to audio sampling. The Nyquist Theorem instructs us to take a minimum sampling frequency as least twice the highest frequency of the analog signal to faithfully capture the characteristics of the input. In practice, the signal frequency is determined by the density of wireless APs in an area and how fast people travel. In a seating area on our Learning Hub ground floor, I could easily pass a handful of APs during a minute long walk. Following the math and sampling from control centre every few seconds risks killing the data source (and unlikely but possibly entire campus network). As the first prototype, I compromised on a 1/min sampling rate. This may not affect our understanding of the movement between buildings that much (unless you run really fast between buildings) but we might need some sensible data interpolation for indoor movements (e.g., a device didn’t teleport from the third floor library to a fourth floor class room, it traveled via stairwell/lift).
The sampling outcome are stored as data snippets in the format of Python Pickle files (one file per sample). The files are then picked up asynchronously by a Python-based data filtering and DB insertion process to insert the data in a database for data analysis. Processed Pickle files are archived and hopefully never needed again. Separating the sampling and DB insertion makes things easier when you are prototyping (e.g., changing DB table structure or data type while sampling continues).
With the records in our DB growing at a rate of millions per day, some resource intensive pre-processing / aggregation (such as the number of unique devices per hour on each floor of a building) need to be done periodically to accelerate any following server-side functions for data visualisation, reducing the volume of data going to a web server by several orders of magnitude. This is at the cost of inserting additional entries in the database and risking creating “seams” between iterations of pre-processing but the benefit clearly outweighs the cost.
1) The client-side webpage loads all elements including the Highcharts framework and the HTML elements that accommodates the chart.
2) A JQuery function waits for the page load to complete and initiates a Highcharts instance with the data feed left open (empty).
For certain charts, we also provided some extra user input fields to help dealing with user queries (e.g., plot data from a particular day).