Building a Data Studio Fitness Leaderboard

It may not be the leaderboard at Augusta, but the #PanderaFitnessClub is all about having a leaderboard. Over the past few posts I’ve gone through the topics leading up to this, the capstone of my series on building a Fitness Leaderboard. In my last three posts I covered how I went about:

Designing the Architecture I would need
Implementing a Data Vault in BigQuery
Building the Pipelines

In this article I will focus on building out a Data Studio dashboard to serve to the members of the #PanderaFitnessClub community. To circle back on the why: we were using Strava to track friendly competitions in our community, but the way Strava clubs are built, the leaderboards only rank members on swim, bike, or run activities. That left many of our members out. Whether they lifted weights, walked, hiked, surfed, or even played badminton, we wanted everyone to be able to participate, and this dashboard would help with that.

One of the big things about #PanderaFitnessClub is that it is not about who is the fastest or who can ride the farthest. It’s a community of encouragement and support to get as many people active as possible. So when it came time to build our leaderboard, we wanted the focus to be on participating, and participating often. We went with metrics that reflected that while still giving everyone something tangible to compete on:

Most Overall Time (Month), 2 points: Whoever has the most time logged over the month.
Most Activities (Month), 2 points: Whoever logs the most activities longer than 25 minutes in the month.
Threshold: 20 activities (Month) — 2 points: Points awarded if you hit a threshold of 20 logged activities in a month.
Monthly 5k — 2 points: Each month points are available for those that run a 5k.
Fastest 5k time — 1 point: For those that do run in the 5k, whoever posts the best time gets points.
Monthly Event — 3 points: This one is still in the works, but in addition to the 5k we are planning on holding some kind of monthly event, where members can participate for points.
Most Overall Time (Week) — 1 point: Whoever has the most time logged in a calendar week within a month.
Most Activities (Week) — 1 point: Whoever logs the most activities longer than 25 minutes in a calendar week.

These metrics would be the foundation of the leaderboard, and to support them I would need to surface the data landed in the previous articles. As mentioned previously, I implemented a data vault model in its own dataset and am surfacing all the aggregates via views in their own dataset. Below are all of the views used for the dashboard:

fact_activity — as mentioned before, this is the base view that handles deduplication and collects the latest version of each record.

WITH act_sat_latest as (
SELECT activity_hub_seq,
MAX(sat_load_date) as latest_load_date
FROM `strava-int.strava_datavault.activity_sat`
GROUP BY activity_hub_seq)

SELECT DISTINCT ath_s.athlete_hub_seq
, ath_s.firstname
, ath_s.lastname
, act_s.*
FROM act_sat_latest asl
JOIN `strava-int.strava_datavault.activity_sat` act_s
on (asl.activity_hub_seq = act_s.activity_hub_seq
and asl.latest_load_date = act_s.sat_load_date)
JOIN `strava-int.strava_datavault.athlete_activity_link` aal
on act_s.activity_hub_seq = aal.activity_hub_seq
JOIN `strava-int.strava_datavault.athlete_sat` ath_s
on aal.athlete_hub_seq = ath_s.athlete_hub_seq
WHERE act_s.delete_ind = False
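
As a quick sanity check on the dedup logic, a query along these lines (assuming the view is saved as fact_activity in the strava_information_vault dataset, which is how the aggregate views below reference it) should return no rows, since only the latest version of each activity should surface:

-- Illustrative check: each activity should appear exactly once in fact_activity
SELECT activity_hub_seq
, COUNT(*) as version_count
FROM `strava-int.strava_information_vault.fact_activity`
GROUP BY activity_hub_seq
HAVING COUNT(*) > 1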

agg_overall_activity — calculates the monthly activity based metrics

WITH qualified_act as (
SELECT athlete_hub_seq
,CONCAT(firstname," ", lastname) as athlete
, FORMAT_TIMESTAMP("%b %Y", start_date) as event_month
, count(activity_hub_seq) act_count
FROM `strava-int.strava_information_vault.fact_activity`
where moving_time_s > 1500 -- 1,500 seconds = 25 minutes
GROUP BY athlete_hub_seq
,CONCAT(firstname," ", lastname)
, FORMAT_TIMESTAMP("%b %Y", start_date)
)

select athlete_hub_seq
, athlete
, event_month
-- , event_week
, act_count
, RANK()
OVER (PARTITION BY event_month order by act_count desc) as month_rank
, CASE WHEN (RANK()
OVER (PARTITION BY event_month order by act_count desc)) = 1 THEN 2 ELSE 0 END as most_activities_points
, CASE WHEN act_count >= 20 THEN 2 ELSE 0 END as twenty_activities_points
from qualified_act

agg_overall_time — calculates the monthly time based metrics

WITH qualified_act as (
SELECT athlete_hub_seq
,CONCAT(firstname," ", lastname) as athlete
, FORMAT_TIMESTAMP("%b %Y", start_date) as event_month
, sum(moving_time_s) as total_time_s
FROM `strava-int.strava_information_vault.fact_activity`
GROUP BY athlete_hub_seq
,CONCAT(firstname," ", lastname)
, FORMAT_TIMESTAMP("%b %Y", start_date)
)

select athlete_hub_seq
, athlete
, event_month
, total_time_s
, RANK()
OVER (PARTITION BY event_month order by total_time_s desc) as month_rank
, CASE WHEN (RANK()
OVER (PARTITION BY event_month order by total_time_s desc)) = 1 THEN 2 ELSE 0 END as month_points
from qualified_act
where total_time_s > 0

agg_week_activity — calculates the weekly activity based metrics

WITH qualified_act as (
SELECT athlete_hub_seq
,CONCAT(firstname," ", lastname) as athlete
, FORMAT_TIMESTAMP("%b %Y", start_date) as event_month
, FORMAT_TIMESTAMP("%W", start_date) as event_week
, count(activity_hub_seq) act_count
FROM `strava-int.strava_information_vault.fact_activity`
where moving_time_s > 1500
GROUP BY athlete_hub_seq
,CONCAT(firstname," ", lastname)
, FORMAT_TIMESTAMP("%b %Y", start_date)
, FORMAT_TIMESTAMP("%W", start_date)
)

select athlete_hub_seq
, athlete
, event_month
, event_week
, act_count
, RANK()
OVER (PARTITION BY event_week order by act_count desc) as week_rank
, CASE WHEN (RANK()
OVER (PARTITION BY event_week order by act_count desc)) = 1 THEN 1 ELSE 0 END as week_points
from qualified_act

agg_week_time — calculates the weekly time based metrics

WITH qualified_act as (
SELECT athlete_hub_seq
,CONCAT(firstname," ", lastname) as athlete
, FORMAT_TIMESTAMP("%b %Y", start_date) as event_month
, FORMAT_TIMESTAMP("%W", start_date) as event_week
, sum(moving_time_s) as total_time_s
FROM `strava-int.strava_information_vault.fact_activity`
GROUP BY athlete_hub_seq
,CONCAT(firstname," ", lastname)
, FORMAT_TIMESTAMP("%b %Y", start_date)
, FORMAT_TIMESTAMP("%W", start_date)
)

select athlete_hub_seq
, athlete
, event_month
, event_week
, total_time_s
, RANK()
OVER (PARTITION BY event_week order by total_time_s desc) as week_rank
, CASE WHEN (RANK()
OVER (PARTITION BY event_week order by total_time_s desc)) = 1 THEN 1 ELSE 0 END as week_points
from qualified_act
where total_time_s > 0

agg_monthly_5k — calculates the monthly 5k based metrics

WITH _5k_act as (
SELECT athlete_hub_seq
,CONCAT(firstname," ", lastname) as athlete
, FORMAT_TIMESTAMP("%b %Y", start_date) as event_month
, activity_hub_seq
, distance_mi
, activity_type
, average_speed_mi
, 2 as monthly_5k
FROM `strava-int.strava_information_vault.fact_activity`
where distance_mi >= 3.08 -- a 5k is ~3.11 miles
and lower(activity_name) like '%pandera monthly 5k%'
)
select distinct fa.athlete_hub_seq
, CONCAT(fa.firstname," ", fa.lastname) as athlete
, FORMAT_TIMESTAMP("%b %Y", fa.start_date) as event_month
, fka.activity_hub_seq
, fka.distance_mi
, fka.activity_type
, fka.average_speed_mi
, COALESCE(fka.monthly_5k, 0) as monthly_5k
, RANK()
OVER (PARTITION BY FORMAT_TIMESTAMP("%b %Y", fa.start_date) order by fka.average_speed_mi desc) as month_rank
, CASE WHEN (RANK()
OVER (PARTITION BY FORMAT_TIMESTAMP("%b %Y", fa.start_date) order by fka.average_speed_mi desc)) = 1 then 1 else 0 end as fastest_monthly_5k
from `strava-int.strava_information_vault.fact_activity` fa
left join _5k_act fka
on fa.athlete_hub_seq = fka.athlete_hub_seq
and FORMAT_TIMESTAMP("%b %Y", fa.start_date) = fka.event_month
order by fka.average_speed_mi asc

Additionally, I wanted to bring in some stats around how much we were doing as a group: how far people have travelled, time spent active, and calories burned.

agg_overall_stats

SELECT athlete_hub_seq
, CONCAT( firstname, ' ', lastname) as athlete
, activity_type
, FORMAT_TIMESTAMP("%b %Y", start_date) as event_month
, sum(moving_time_s/60) activity_minutes
, sum(distance_mi) distance_mi
, sum( total_elevation_gain_m) elev_gain_m
, sum(kilojoules) kilojoules
, sum(calories) calories
FROM `strava-int.strava_information_vault.fact_activity`
WHERE delete_ind = FALSE
GROUP by athlete_hub_seq
, CONCAT( firstname, ' ', lastname)
, activity_type
, FORMAT_TIMESTAMP("%b %Y", start_date)

For the purpose of this build I wanted to quickly build out a dashboard, so I went with Data Studio. It was pretty ugly but it was a solid MVP to kick us off and get going.

To achieve the above, I brought in each of the scoring views as data sources and created a blended dataset.

I also brought in the agg_overall_stats view to create the KPIs at the bottom of the screen.
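
In SQL terms, those KPI cards boil down to simple sums over that view. A rough equivalent query, assuming the agg views live alongside fact_activity in the strava_information_vault dataset, would be something like:

-- Illustrative only: Data Studio computes these aggregations itself from the view
SELECT event_month
, ROUND(SUM(activity_minutes) / 60, 1) as total_hours
, ROUND(SUM(distance_mi), 1) as total_miles
, ROUND(SUM(elev_gain_m), 0) as total_elevation_gain_m
, ROUND(SUM(calories), 0) as total_calories
FROM `strava-int.strava_information_vault.agg_overall_stats`
GROUP BY event_month
ORDER BY PARSE_TIMESTAMP("%b %Y", event_month) DESC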

Ultimately though, it was pretty ugly, and some of the amazing UI/UX folks we have politely let me know it could be better. I co-opted some internal color schemes and changed the layout a bit for a much-needed facelift.

One thing that was nagging me about the dashboard was the performance. It was fairly slow on an initial load.

Original Query
Load: 18.3 s
Month Selection: 5.44 s

I really wanted to change this because it detracted from the user experience. I happened to have had a conversation about Data Studio blended datasets earlier in the week with someone, and they mentioned pushing the blending down into BigQuery instead. So I did just that.

Instead of blending four datasets in Data Studio, I created a new view in BigQuery that did the same combination. The SQL was not super complex, so I went ahead and validated the performance.
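
A minimal sketch of that kind of view, joining the four scoring views on athlete and month and rolling the weekly points up to the month (the agg_leaderboard name and the dataset it lives in are assumptions here for illustration; the 5k view can be added the same way):

-- Hypothetical combined view to replace the Data Studio blend
CREATE OR REPLACE VIEW `strava-int.strava_information_vault.agg_leaderboard` AS
WITH week_time as (
SELECT athlete_hub_seq, event_month, SUM(week_points) as week_time_points
FROM `strava-int.strava_information_vault.agg_week_time`
GROUP BY athlete_hub_seq, event_month),
week_act as (
SELECT athlete_hub_seq, event_month, SUM(week_points) as week_activity_points
FROM `strava-int.strava_information_vault.agg_week_activity`
GROUP BY athlete_hub_seq, event_month)
SELECT ot.athlete
, ot.event_month
, ot.month_points
, COALESCE(oa.most_activities_points, 0) as most_activities_points
, COALESCE(oa.twenty_activities_points, 0) as twenty_activities_points
, COALESCE(wt.week_time_points, 0) as week_time_points
, COALESCE(wa.week_activity_points, 0) as week_activity_points
, ot.month_points
+ COALESCE(oa.most_activities_points, 0)
+ COALESCE(oa.twenty_activities_points, 0)
+ COALESCE(wt.week_time_points, 0)
+ COALESCE(wa.week_activity_points, 0) as total_points
FROM `strava-int.strava_information_vault.agg_overall_time` ot
LEFT JOIN `strava-int.strava_information_vault.agg_overall_activity` oa
on ot.athlete_hub_seq = oa.athlete_hub_seq
and ot.event_month = oa.event_month
LEFT JOIN week_time wt
on ot.athlete_hub_seq = wt.athlete_hub_seq
and ot.event_month = wt.event_month
LEFT JOIN week_act wa
on ot.athlete_hub_seq = wa.athlete_hub_seq
and ot.event_month = wa.event_month

With a single view like this, Data Studio issues one query per chart instead of blending four result sets on the fly, which lines up with the load time improvement below.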

Joined Query
Load: 6.99 s
Month Selection: 5.5 s

The pattern I used here is pretty common: serverless ingestion points into Google Cloud, BigQuery as a data warehouse, and Data Studio as an exploration and visualization tool. It really lends itself to being almost a self-service implementation for marketing teams looking to get quick insights into their data. From implementing the pipeline to the ugly MVP, I spent six to eight working hours, most of which went into working out various aspects of the data coming through to the dashboard. So if you use BigQuery and want to visualize your data, I would highly recommend Data Studio.

So how has this worked out for us? It has honestly been a great tool for the #PanderaFitnessClub. We rolled it out mid-quarantine, and it is a great way to keep us all connected even though we have not had the chance to see each other in person for months. I have watched someone go out and hit a long run and post it to the channel, only for someone else to immediately head out and do the same to retake the lead. It has brought us closer together and continued our drive to stay fit through all of this.

If you think you would want to implement this at your company, here are some links to get you going:

GitHub repos for this project:
Authentication
Webhook
Write
BQ DDLs and View SQL

Strava API
