With the coronavirus upending everyone’s normal routine, we at Pandera have been working to stay active, both individually and as a culture. Our culture has always been about supporting one another, striving for the best, and, of course, data. This is how I used some of my free time to support that culture.
About a year ago I started playing around with the idea of doing a multi-sport event, specifically a duathlon. I wanted a challenge and an excuse to get back on my neglected road bike. However, one thing I was not prepared for was how out of shape I was from a cardio perspective. I had not run any real distance in a few years, and it had probably been longer than that for biking. After a couple of weeks of stumbling through workouts, the runs especially, I was ready to move on to something else. I looked for support in some of my local running groups, but unfortunately I could not commit to their schedules, as my days are pretty hectic with the normal day-to-day. So I looked for support where I spend most of my day: at work.
Luckily for me, as I asked around there was a lot of interest in forming an activity group, so I created the Pandera Fitness Club (PFC) channel in our Google Chat. It gave me and others a place to talk, push ourselves, and encourage each other. We have had great participation: about ⅓ of the company is subscribed to the channel, with ⅓ of that group being active. Over the course of the year, PFC members have competed in multi-sport events, 5Ks, half-marathons, and a group Tough Mudder, and have laid down some serious time running, biking, and in general staying fit.
As our first anniversary approached, I wanted to get feedback on what could make PFC even better. Overwhelmingly, the answer was more group activities and some kind of leaderboard. Given that our teams are spread across the country, doing either of those in person was going to be difficult, so I looked to Strava to help cover that gap. We could create a club in their application, host digital events, and have an embedded leaderboard. We went ahead with that and got people to start logging activities in Strava. What we came to realize is that the leaderboards only cover running, biking, or swimming, which left a lot of our members out. We needed a better way to aggregate and serve up the data that was being collected in Strava.
The solution was fairly simple: pull the data out and write it to a database. From there I would be able to put a visualization tool on top and display what made sense for PFC.
After reviewing the Strava API documentation, I determined that I would need the following logical components:
- Authentication method with NoSQL storage
- Webhook: Strava handles this differently than I have typically seen, in that they do not actually send the record through the webhook. Instead, the event includes a way to identify the object and, if it is an update, what changed on the record.
- Data Pipeline: Since the record is not sent as part of the webhook, you need to make a separate request to actually get the record, then move the data over to BigQuery.
- Data Warehouse: Just plain storage is all I really need here, but I did want to try something different. Internally we have been discussing Data Vault 2.0 as a modeling strategy, and I wanted to get an actual implementation of it under my belt.
- Raw Storage: To store the original raw data, just in case anything needs to be reprocessed.
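To make the webhook and pipeline bullets a little more concrete, here is a minimal sketch of turning a webhook event into the follow-up request for the actual record. It is not the code PFC runs. It assumes the event fields described in Strava's webhook documentation (`object_type`, `object_id`, `aspect_type`, `owner_id`, `updates`) and the `GET /api/v3/activities/{id}` endpoint. The helper name `build_activity_request`, the sample IDs, and the token value are all made up for illustration.

```python
from typing import Optional

STRAVA_API = "https://www.strava.com/api/v3"


def build_activity_request(event: dict, access_token: str) -> Optional[dict]:
    """Turn a webhook event into the request needed to get the record.

    Hypothetical helper. Returns None when there is nothing to fetch
    (non-activity events, or deletes).
    """
    if event.get("object_type") != "activity":
        return None  # e.g. athlete events
    if event.get("aspect_type") not in ("create", "update"):
        return None  # a deleted activity can no longer be requested
    return {
        "method": "GET",
        "url": f"{STRAVA_API}/activities/{event['object_id']}",
        "headers": {"Authorization": f"Bearer {access_token}"},
        # owner_id tells us whose stored token to use for the call
        "owner_id": event["owner_id"],
    }


# Example event: it carries identifiers only, not the activity itself.
sample_event = {
    "object_type": "activity",
    "object_id": 1234567890,  # made-up ID
    "aspect_type": "create",
    "owner_id": 42,  # made-up athlete ID
    "updates": {},
}
request = build_activity_request(sample_event, access_token="hypothetical-token")
print(request["url"])
```

The function only describes the request rather than sending it, which keeps the decision logic easy to test without touching the network.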
I decided to use GCP to host the infrastructure I would need, and I ended up settling on the architecture below. This is a pretty common architecture that fits a lot of use cases: pulling social, mar-tech, or any other data that comes from an API. We have actually used it a few times to implement data feeds from our CRM and PMS systems.
As a quick note on the diagram above: if you need to make architecture diagrams, I highly recommend draw.io. It has prebuilt icons for GCP, AWS, and Azure, the ability to add custom icons, and a lot of generic icons and shapes.
Now back to the application. App Engine will serve up a static page where users can authenticate and allow me to make the token request from Strava. I will then store those tokens in Datastore for easy retrieval and updates when I request the activity data. A Cloud Function acts as the callback for the webhook and pushes the message to Pub/Sub, where a separate Cloud Function makes the actual request to Strava and stores the output in both GCS and BigQuery. Since the authorization and webhook are vanilla, I won’t get into those, but I do want to go through the function that makes the API call and the Data Vault model, as this is the more involved portion.
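As a rough illustration of that flow, the sketch below shows the hand-off from the webhook callback to the worker function. It is a simplified stand-in, not the actual implementation. It assumes the worker receives the Pub/Sub payload base64-encoded under a `data` key, as Pub/Sub-triggered Cloud Functions do, and it replaces the Strava, GCS, and BigQuery calls with injected callables so it runs anywhere. The function names, the `activities/<id>.json` object path, and the sample values are all hypothetical.

```python
import base64
import json
from typing import Callable


def to_pubsub_data(webhook_event: dict) -> bytes:
    """Webhook side: serialize the event into the bytes published to Pub/Sub."""
    return json.dumps(webhook_event).encode("utf-8")


def handle_message(
    message: dict,
    fetch_activity: Callable[[int, int], dict],
    write_raw: Callable[[str, str], None],
    write_row: Callable[[dict], None],
) -> None:
    """Worker side: decode the message, fetch the activity, store it twice."""
    # The payload arrives base64-encoded under the "data" key.
    event = json.loads(base64.b64decode(message["data"]).decode("utf-8"))
    activity = fetch_activity(event["owner_id"], event["object_id"])
    # Keep a raw copy (GCS in the real setup) so anything can be reprocessed.
    write_raw(f"activities/{event['object_id']}.json", json.dumps(activity))
    # Then hand the record to the warehouse writer (BigQuery in the real setup).
    write_row(activity)


# Usage with in-memory stand-ins for Strava, GCS, and BigQuery.
raw_store, rows = {}, []
event = {"object_type": "activity", "object_id": 101,
         "aspect_type": "create", "owner_id": 7}
message = {"data": base64.b64encode(to_pubsub_data(event)).decode("ascii")}
handle_message(
    message,
    fetch_activity=lambda owner_id, activity_id: {
        "id": activity_id, "athlete": owner_id, "type": "Run"},
    write_raw=raw_store.__setitem__,
    write_row=rows.append,
)
print(sorted(raw_store), len(rows))
```

Passing the storage and API calls in as arguments is only a convenience for the sketch. In a deployed function they would be replaced by the real Strava request and the GCS and BigQuery client calls.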
So check back in with me in a few weeks, when I’ll go through the solution in more depth, how it was implemented, and more on retaining our culture through the new norm!