Breaking Down the Analytics Starter Dataset - Part 2

DevRel | 01 Apr 2025
9 minutes reading time


Introduction

Welcome to part 2 of the "Getting Started" guide for Vodafone Analytics on the Developer Marketplace. In this guide, we'll continue from where we left off in part 1 and look at using interactive maps to visualise the data produced by our advanced analytics tools.

Pre-requisites

This guide assumes you are continuing from where part 1 ended. If you have since closed that session, follow the "Setup" section below to resume the analysis.

Setup

You can run these commands to resume your session:

# Run the activation script to configure the shell environment
source vfa/bin/activate
# Run jupyter notebooks
jupyter notebook

A new page should open in your browser. In it, click on the file that was created in the previous tutorial; in this case, it is the "Untitled.ipynb" file shown in the image below:


Figure 1: reopening a Python notebook file

It should already contain all the variables and helper functions created during the last run-through. In case it doesn't, here they are:

import pandas as pd
import geopandas as gpd
import mercantile
from shapely.geometry import Polygon
import matplotlib.pyplot as plt
import contextily as ctx
import folium
from folium.plugins import HeatMap
import plotly.express as px

# Specify the path to your CSV file
CSV_FOOTFALL_FILE = './footfall-measures_20240701_20240731_starter.csv'
CSV_FOOTFALL_AGG_FILE = './footfall-measures_20240701_20240731_starter-agg.csv'

EXPORT_HEATMAP_FILE = './vfa_heatmap.html'
EXPORT_CHOROPLETH_FILE = './vfa_choropleth.html'

# Helper functions
# Function to calculate the centroid of a quadkey
def quadkey_to_lat_lon(quadkey):
    # Convert quadkey to tile
    tile = mercantile.quadkey_to_tile(quadkey)
    # Get the bounding box of the tile
    bounds = mercantile.bounds(tile)
    # Calculate the center of the tile
    lat = (bounds.north + bounds.south) / 2
    lon = (bounds.east + bounds.west) / 2
    return lat, lon

# Function to convert a quadkey to a polygon
def quadkey_to_polygon(quadkey):
    # Convert quadkey to tile coordinates
    tile = mercantile.quadkey_to_tile(quadkey)
    # Get bounding box for the tile
    bounds = mercantile.bounds(tile)
    # Create a polygon from the bounding box
    return Polygon([
        (bounds.west, bounds.south),  # Bottom left
        (bounds.west, bounds.north),  # Top left
        (bounds.east, bounds.north),  # Top right
        (bounds.east, bounds.south),  # Bottom right
        (bounds.west, bounds.south)   # Close polygon
    ])

# Function to calculate the best map zoom level given a dataframe of latitude and longitude points
def map_zoom_level(df):
    # Calculate the range between map locations
    lat_range = df['latitude'].max() - df['latitude'].min()
    lon_range = df['longitude'].max() - df['longitude'].min()

    # Determine a suitable zoom level, adjusting the constant factor as needed
    if max(lat_range, lon_range) < 1:
        zoom_level = 10  # Higher zoom level for small area
    elif max(lat_range, lon_range) < 5:
        zoom_level = 6  # Medium zoom level for medium area
    else:
        zoom_level = 4  # Lower zoom level for large area    

    return zoom_level+3

df = pd.read_csv(CSV_FOOTFALL_FILE, dtype=str)
df['measure_value'] = pd.to_numeric(df['measure_value'])

# Remove the daily entries and keep just the hourly ones: a 10-character
# 'time_aggregation' value ('YYYY-MM-DD') is a daily entry, while longer
# values include an hour component and are hourly
df['measure_granularity'] = df['time_aggregation'].apply(
    lambda x: 'daily' if len(x) == 10 else 'hourly'
)
df = df[df['measure_granularity'] != 'daily']

df_agg = pd.read_csv(CSV_FOOTFALL_AGG_FILE, dtype=str)
df_agg['measure_avg'] = pd.to_numeric(df_agg['measure_avg'])
df_agg['measure_max'] = pd.to_numeric(df_agg['measure_max'])

# Create the extra supporting columns used for mapping
df_agg[['latitude', 'longitude']] = df_agg['dst_id'].apply(quadkey_to_lat_lon).apply(pd.Series)

# Creating a new column with geometry data
df_agg['geometry'] = df_agg['dst_id'].apply(quadkey_to_polygon)

# Top 10 busiest tiles by their average footfall
df_top_10_avg = df_agg.sort_values(by='measure_avg', ascending=False).head(10)

# Top 10 busiest tiles by their maximum footfall
df_top_10_max = df_agg.sort_values(by='measure_max', ascending=False).head(10)

Now our session is ready to continue.
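Optionally, a quick check confirms that the dataframes were rebuilt as expected before moving on:

# Optional sanity check on the rebuilt dataframes
print(df.shape, df_agg.shape)
df_agg.head()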

Analysing overlapping results

Observant readers may have noticed that in the resulting map from the first guide, out of a possible 20 tiles (top 10 average and top 10 maximum), only 14 were present. As a reminder, here is that map:


Figure 2: resulting map from first tutorial

This section will show how to drill down into the dataset and examine those overlapping results.

The next code snippet merges the two dataframes on the "dst_id" field, making the intersection visible.

# Merge DataFrames on 'dst_id' to find common values
common_ids = pd.merge(df_top_10_avg, df_top_10_max, on='dst_id')
print('there are {} common IDs that appear in both datasets'.format(len(common_ids)))
there are 6 common IDs that appear in both datasets

Here are the details of the quadkeys appearing in both dataframes:

common_ids
dst_id	measure_min_x	measure_max_x	measure_avg_x	latitude_x	longitude_x	geometry_x	measure_min_y	measure_max_y	measure_avg_y	latitude_y	longitude_y	geometry_y
0	03131313111232022	4806	83226	35277.0	51.516434	-0.130463	POLYGON ((-0.1318359375 51.51557978375592, -0....	4806	83226	35277.0	51.516434	-0.130463	POLYGON ((-0.1318359375 51.51557978375592, -0....
1	03131313113001233	5116	68560	33165.0	51.495920	-0.144196	POLYGON ((-0.14556884765625 51.49506473014369,...	5116	68560	33165.0	51.495920	-0.144196	POLYGON ((-0.14556884765625 51.49506473014369,...
2	03131313111232220	6423	81998	30680.0	51.511307	-0.130463	POLYGON ((-0.1318359375 51.510451886248575, -0...	6423	81998	30680.0	51.511307	-0.130463	POLYGON ((-0.1318359375 51.510451886248575, -0...
3	03131313111232221	4666	85729	30605.0	51.511307	-0.127716	POLYGON ((-0.12908935546875 51.510451886248575...	4666	85729	30605.0	51.511307	-0.127716	POLYGON ((-0.12908935546875 51.510451886248575...
4	03131313111223300	2411	75081	30510.0	51.514725	-0.141449	POLYGON ((-0.142822265625 51.513870548723986, ...	2411	75081	30510.0	51.514725	-0.141449	POLYGON ((-0.142822265625 51.513870548723986, ...
5	03131313111223331	5578	74285	28956.0	51.511307	-0.133209	POLYGON ((-0.13458251953125 51.510451886248575...	5578	74285	28956.0	51.511307	-0.133209	POLYGON ((-0.13458251953125 51.510451886248575...

With this information, you can now plot the individual hourly values of the common ids over the entire month on a chart.

# Use the common ids from above to look at the hourly counts over the month
ids = common_ids['dst_id']

# Create a new dataframe based only on the common ids
df_hourly_tile = df[df['dst_id'].isin(ids)]

# Drop the columns we don't need and sort the remaining rows by the 'time_aggregation' column
df_hourly_tile = df_hourly_tile.drop(columns=['dst_unit', 'measure_name']).sort_values(by='time_aggregation', ascending=True)

# Create an interactive line chart using Plotly
fig = px.line(
    df_hourly_tile, x='time_aggregation', y='measure_value',
    color='dst_id', markers=True,
    labels={'time_aggregation': 'Time Aggregation (Hourly)', 'measure_value': 'Measure Value'},
    title='Interactive Line Chart of Measure Value by dst_id'
)

# Update layout for better visuals
fig.update_layout(
    xaxis_title='Time Aggregation (Hourly)', yaxis_title='Measure Value',
    legend_title='dst_id', xaxis=dict(tickangle=45)
)

# Show the interactive plot
fig.show()


Figure 3: line chart of the hourly values of the common ids
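With a full month of hourly points on the x-axis, the chart can get crowded. One optional tweak, not used in the figure above, is Plotly's built-in range slider, which lets readers focus on a narrower time window:

# Optional: add a range slider under the x-axis for easier navigation
fig.update_xaxes(rangeslider_visible=True)
fig.show()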

Plotting interactive maps

While the map shown in figure 2 provides a good visualisation of the subset of data extracted from the starter file, it doesn't offer much interactivity: users cannot zoom in or out, or pan around it.

This section introduces tools that provide a more interactive visual experience. To do this, the Folium mapping package is used, allowing all the tiles from the dataset to be displayed as a heatmap. Other mapping libraries can achieve similar results, depending on personal preference. The output can also be exported as an HTML file to be loaded in a browser.

Since all the tiles are going to be analysed, let's return to the original 'df_agg' dataframe loaded earlier.

# Use the CartoDB Positron grayscale custom tile layer

# Example URL: Using 'CartoDB Positron' as a light grayscale alternative
custom_tile_url = 'https://{s}.basemaps.cartocdn.com/light_all/{z}/{x}/{y}.png'

# Calculate the best map zoom level
zoom_level = map_zoom_level(df_agg)

# Prepare data for heatmap
heatmap_data = df_agg[['latitude', 'longitude', 'measure_max']].dropna().values.tolist()

m = folium.Map(
    location=[df_agg['latitude'].mean(), df_agg['longitude'].mean()],
    zoom_start=zoom_level, tiles=custom_tile_url, attr='CartoDB Positron'
)

# Add a heatmap layer
HeatMap(heatmap_data, min_opacity=0.5, radius=15, blur=10).add_to(m)

# Save to HTML and display
m.save(EXPORT_HEATMAP_FILE)
m


Figure 4: screen capture showing heatmap

The Folium map allows readers to interact with the data and zoom in to see the busiest areas. Two key clusters stand out: Soho and Covent Garden (popular with both national and international tourists), and Bishopsgate, where Liverpool Street station is located (a popular hub for commuters and visitors).
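As noted above, other libraries can achieve a similar result. Purely as an illustration, and not part of the original workflow, here is a minimal sketch of the same heatmap using Plotly Express (already imported as px); the radius and styling values are assumptions to tune:

# Illustrative alternative: a density map built with Plotly Express
fig = px.density_mapbox(
    df_agg, lat='latitude', lon='longitude', z='measure_max',
    radius=15, zoom=zoom_level, mapbox_style='carto-positron',
    title='Footfall heatmap (Plotly Express alternative)'
)
fig.show()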

Folium also provides a Choropleth map, which is excellent for presenting measures against different-sized areas.

For this next step, let's separate the geographic data from the measurement data:

# Prepare the GeoJSON data - we will split the spatial data and the measurements into 2 separate dataframes;
# if you use 3rd-party GeoJSON data in the future, you will be able to use those files or URIs to provide the geometry objects.

# Calculate the desired zoom level, create the map and add the data layer
map_latitude_center, map_longitude_center = df_agg['latitude'].mean(), df_agg['longitude'].mean()
zoom_start = map_zoom_level(df_agg)

# Split the 'data' and 'geo_data' into 2 separate data frames
spatial_cols = ['dst_id', 'geometry', 'latitude', 'longitude'] # columns for the geospatial dataframe
measurement_cols = ['dst_id', 'measure_avg']

# Create the geo data frame and use the projection 4326 (WGS84)
geo_data = gpd.GeoDataFrame(df_agg[spatial_cols].copy(), geometry='geometry', crs='EPSG:4326')
data = df_agg[measurement_cols]

# Create a Folium map
m = folium.Map(location=[map_latitude_center, map_longitude_center], zoom_start=zoom_start, tiles='CartoDB Positron')

# Add a Choropleth map
folium.Choropleth(
	geo_data=geo_data.to_json(),
	name='Choropleth',
	data=data,
	columns=['dst_id', 'measure_avg'],
	key_on='feature.properties.dst_id',
	fill_color='YlOrRd',
	fill_opacity=0.2,
	line_opacity=0.1,
	legend_name='Measure Value',
).add_to(m)

# Add layer control
folium.LayerControl().add_to(m)

# Save to HTML and display
m.save(EXPORT_CHOROPLETH_FILE)
m


Figure 5: screen capture showing the Choropleth map
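One optional refinement, sketched here rather than taken from the original guide, is a transparent GeoJson overlay so that hovering over a tile reveals its quadkey; the layer name and styling values are illustrative:

# Optional: hover tooltips showing each tile's quadkey
folium.GeoJson(
    geo_data.to_json(),
    name='Tile info',
    style_function=lambda feature: {'fillOpacity': 0, 'weight': 0},
    tooltip=folium.GeoJsonTooltip(fields=['dst_id'], aliases=['Quadkey:']),
).add_to(m)
m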

Changing the size of the tiles

Now the quadkeys will be aggregated to demonstrate a powerful feature of tiles over some other coordinate systems. To move up a level (reduce the zoom level), simply remove the last digit from the "dst_id": a quadkey-17 contains 17 digits, a quadkey-15 contains 15 digits, and so on.
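To see this in action, the mercantile library (already imported) can confirm that truncating a quadkey yields its parent tile. The quadkey below is taken from the common_ids table earlier; any other would do:

# Dropping the last digit of a quadkey moves up one zoom level (the parent tile)
qk17 = '03131313111232022'  # a quadkey-17 from the common_ids table
parent = mercantile.parent(mercantile.quadkey_to_tile(qk17))
print(mercantile.quadkey(parent) == qk17[:-1])  # True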

A new column called "dst_id_agg" can now be created, two zoom levels up from the original, to showcase this:

# Here is how to create the new column: zoom level 15 is used here, but you can experiment with any level below the original 17
zoom_level = 15
df_agg['dst_id_agg'] = df_agg['dst_id'].str[:zoom_level]
df_agg.head(5)
dst_id	measure_min	measure_max	measure_avg	latitude	longitude	geometry	dst_id_agg
0	03131313110133133	18	50	32.0	51.571095	-0.177155	POLYGON ((-0.17852783203125 51.570241445811234...	031313131101331
1	03131313110133230	10	141	44.0	51.565973	-0.190887	POLYGON ((-0.1922607421875 51.56511970412417, ...	031313131101332
2	03131313110133231	35	701	279.0	51.565973	-0.188141	POLYGON ((-0.18951416015625 51.56511970412417,...	031313131101332
3	03131313110133232	51	178	96.0	51.564266	-0.190887	POLYGON ((-0.1922607421875 51.563412328675874,...	031313131101332
4	03131313110133301	89	583	413.0	51.569388	-0.182648	POLYGON ((-0.18402099609375 51.56853426269097,...	031313131101333

The output above shows the difference between the "dst_id" and the "dst_id_agg" columns: one is 17 digits long while the other is 15.

Next, create the new qk15 level dataset.

Note: this is not the proper way to achieve this in a production environment (it takes an average of an average), but it is used here to demonstrate the functionality. Alternatively, you can go back to the original dataset and create the new column there; a sketch of that approach follows the output below.

# Creates aggregates for minimum, maximum, mean (average)
df_agg_qk15 = df_agg.groupby('dst_id_agg', as_index=False).agg(
	measure_min=('measure_min', 'min'),
	measure_max=('measure_max', 'max'),
	measure_avg=('measure_avg', 'mean')
)

# Convert the mean average to a whole number
df_agg_qk15['measure_avg'] = df_agg_qk15['measure_avg'].round(0)

# Check the first 5 rows
df_agg_qk15.head(5)
dst_id_agg	measure_min	measure_max	measure_avg
0	031313131101331	18	50	32.0
1	031313131101332	10	701	140.0
2	031313131101333	10	3394	384.0
3	031313131103100	727	2186	1341.0
4	031313131103101	134	3944	1542.0
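As mentioned in the note above, the more rigorous route is to derive the qk15 column on the original hourly dataframe and aggregate the raw values directly, avoiding the average of averages. The sketch below is illustrative only and is not used in the rest of this guide:

# Sketch only: aggregate the raw hourly values instead of averaging averages
df['dst_id_agg'] = df['dst_id'].str[:15]
df_agg_qk15_raw = df.groupby('dst_id_agg', as_index=False).agg(
    measure_min=('measure_value', 'min'),
    measure_max=('measure_value', 'max'),
    measure_avg=('measure_value', 'mean')
)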

Now, enrich the dataframe with geometry values, then split it, ready for mapping:

# Creating a new column with geometry data
df_agg_qk15['geometry'] = df_agg_qk15['dst_id_agg'].apply(quadkey_to_polygon)

# Create the lat and lng columns - used for working out the view point for the map
df_agg_qk15[['latitude', 'longitude']] = df_agg_qk15['dst_id_agg'].apply(quadkey_to_lat_lon).apply(pd.Series)

# Split the 'data' and 'geo_data' into 2 separate data frames
# Columns for the geospatial dataframe
spatial_cols = ['dst_id_agg', 'geometry', 'latitude', 'longitude']
measurement_cols = ['dst_id_agg', 'measure_avg', 'measure_max']


# Create the geo data frame and use the projection 4326 (WGS84)
geo_data = gpd.GeoDataFrame(df_agg_qk15[spatial_cols].copy(), geometry='geometry', crs='EPSG:4326')
data = df_agg_qk15[measurement_cols]

Finally, plot the aggregated tiles on a map:

# Create a Folium map
m = folium.Map(location=[df_agg_qk15['latitude'].mean(), df_agg_qk15['longitude'].mean()],
	zoom_start=zoom_start,
	tiles='CartoDB Positron')
	
# Add a Choropleth map
folium.Choropleth(
	geo_data=geo_data.to_json(),
	name='Choropleth',
	data=data,
	columns=['dst_id_agg', 'measure_avg'],
	key_on='feature.properties.dst_id_agg',
	fill_color='YlOrRd',
	fill_opacity=0.8,
	line_opacity=0.5,
	legend_name='Avg',
).add_to(m)

# Add layer control
folium.LayerControl().add_to(m)
m


Figure 6: screen capture showing the Choropleth map, now with 2 zoom levels up

The aggregated results at quadkey-15 show the same trends as quadkey-17, but the tiles in figure 6 are noticeably larger than those in figure 5, as expected: each quadkey-15 tile covers a 4x4 block of quadkey-17 tiles.
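If you want to verify the aggregation numerically, comparing the tile counts is a quick optional check; the count should drop sharply after moving up two zoom levels:

# Optional check: the qk15 frame should contain far fewer tiles than the qk17 one
print(len(df_agg), 'qk17 tiles ->', len(df_agg_qk15), 'qk15 tiles')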

Summary

Congratulations! You've now completed this guide to exploring insights from the starter dataset using interactive maps. Be sure to check out future guides and blog posts to see how other companies and academic institutions are using these valuable data sources.

Don't forget to register your interest by filling out and submitting the form located here. That ensures you stay up to date with the latest on Analytics and other API offerings. It is also a great way to get in touch with us.

See you around!
