Data Quality: A Lesson in Eating Your Own Dog Food

Typically when we engage with customers, I like to reflect on what they are asking us to do. It helps set the trend for topics I engage in with new customers, and keeps me up-to-date. Twice this week alone, the topics of Data Quality and Data Governance have come up.

Reflection on those topics came while digging through some data for a new forecasting and capacity planning tool we are working on in-house. We are a consulting company, so the majority of our revenue comes from statements of work where we staff resources on projects. Anything that can streamline that process, help optimize resource utilization, and keep the organization as a whole informed is always beneficial.

We have bounced around between various CRM tools over the years. We’ve used Salesforce, NetSuite, and Excel workbooks, but today we are using noCRM. It is relatively new for us so it is still in its infancy when it comes to defined rules and processes.

As I started looking through the data, at key columns like Company, Description, Start Date, and Status, I immediately thought to myself: we have Data Quality issues.

Eating our own dog food isn’t something that comes easily, especially when we talk about Data Quality and Governance. These issues cannot be fixed by a tool, but they can be fixed by changing the way people work, enforcing process, and understanding the data and how it is used.

But why should a company invest in a Data Quality effort like this, and what is the actual benefit? Most Data Quality and Data Governance programs fail because they cannot quantify the cost implications. For us it is pretty straightforward: bad data leads to delayed project starts, which can leave resources sitting unallocated. Leaving a single resource unallocated for a week can amount to $8,000. On its own that is a pretty nominal amount; however, on average one of our top talent resources changes projects three times a year, which is when these gaps normally arise. Applied across our billable resource pool, that can come to $240k annually.
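As a rough back-of-the-envelope check on those figures, here is a small Python sketch. The $8,000 weekly cost, the three project changes per year, and the $240k total come from the paragraph above. The one-idle-week-per-change figure is an assumption made only for illustration, and the pool size is simply what falls out of the arithmetic under that assumption.

```python
# Figures quoted in the post.
COST_PER_IDLE_WEEK = 8_000       # dollars for one unallocated resource-week
PROJECT_CHANGES_PER_YEAR = 3     # average project changes per resource per year
ANNUAL_TOTAL = 240_000           # dollars across the billable resource pool

# Assumption (not stated in the post): each project change costs one idle week.
IDLE_WEEKS_PER_CHANGE = 1


def annual_gap_cost(resources: int) -> int:
    """Yearly cost of allocation gaps for a pool of `resources` people."""
    idle_weeks = resources * PROJECT_CHANGES_PER_YEAR * IDLE_WEEKS_PER_CHANGE
    return idle_weeks * COST_PER_IDLE_WEEK


per_resource = annual_gap_cost(1)
# Pool size that would reproduce the quoted annual total under the assumption.
implied_pool_size = ANNUAL_TOTAL // per_resource

print(f"Per resource per year: ${per_resource:,}")
print(f"Implied billable pool size: {implied_pool_size}")
```

If the real gaps are longer or shorter than a week, only IDLE_WEEKS_PER_CHANGE needs to change to see how sensitive the annual figure is.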

So, over the next few weeks, Pandera is going to be my client. I’ll keep posting about the struggles of change, the impacts of Data Quality, and how we at Pandera become a more data-centric company in our operations.

Step 1: Understanding Our Data

I won’t be talking about the benefits of CRM A over B. Instead I’ll focus on how our people use noCRM and what each field means to them, and hopefully come out the other side with a data dictionary of our own, along with some basic rules around which fields are critical and how the data should conform.

To do this I am using one of our accelerator templates with our Florida Managing Partner, Kevin Curley, and our Miami Solutions Partner, Ed Pestana. We sat down and reviewed the process a salesperson goes through when entering an opportunity into the system, from initial entry to closure. One of the key things I noticed was that noCRM does not enforce rules on most of the fields. Ensuring data quality at the source is really the ideal situation; unfortunately, it does not appear we can guarantee that due to tool limitations.

When we talk about data quality issues, they normally fall into one or more of the following categories: People, Process, Technology, and Data. Since the technology in this case doesn’t allow us to enforce rules, we are going to take an alternative approach: modifying the process, and potentially the behavior of the people, behind the data entry.

Armed with the information from the meeting, I’ll be heading off to define that process as well as the data dictionary so everyone is in lockstep about what the data in each field truly represents to us.
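To make the idea of a data dictionary with rules a little more concrete, here is a minimal Python sketch of what one could look like in code. The four field names are the columns mentioned earlier in this post. Everything else (the FieldRule structure, the descriptions, which fields are flagged critical, and the ISO date format) is a hypothetical placeholder, not the dictionary we will actually end up with.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Dict, List, Optional


def is_iso_date(value: str) -> bool:
    """Return True when the value parses as YYYY-MM-DD (an assumed format)."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False


@dataclass(frozen=True)
class FieldRule:
    name: str                  # column name as it appears in the export
    description: str           # what the field means to the business
    critical: bool             # must the field be populated?
    conforms: Optional[Callable[[str], bool]] = None  # optional format check


# Hypothetical entries: descriptions and flags are placeholders.
DATA_DICTIONARY = [
    FieldRule("Company", "Client the opportunity belongs to", critical=True),
    FieldRule("Description", "Free-text summary of the work", critical=False),
    FieldRule("Start Date", "Expected project start", critical=True,
              conforms=is_iso_date),
    FieldRule("Status", "Where the opportunity sits in the process",
              critical=True),
]


def validate(record: Dict[str, str]) -> List[str]:
    """Return a list of human-readable issues for a single record."""
    issues = []
    for rule in DATA_DICTIONARY:
        value = (record.get(rule.name) or "").strip()
        if not value:
            if rule.critical:
                issues.append(f"{rule.name}: missing critical value")
            continue
        if rule.conforms and not rule.conforms(value):
            issues.append(f"{rule.name}: value {value!r} does not conform")
    return issues


print(validate({"Company": " ", "Start Date": "01/15/2020", "Status": "Open"}))
```

Because noCRM will not enforce these rules at entry time, a check like this would have to run after the fact, for example against each export.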

In the meantime, I also want to start playing with the data from noCRM to see if there are any issues. When it comes to understanding data, it is always easier for me to just look at it all. Luckily, noCRM has a data export tool for leads that lets me do just that, and I will be using that feature for the first go-around. Eventually I am going to land the data in BigQuery, but for now I just want to profile it.

Dataprep is a good starting point for that because setup is pretty straightforward: I can manually upload a file, or point to a file in Cloud Storage or a BigQuery table, and start checking things like cardinality, null values, etc.
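For anyone who wants the same kind of first look without Dataprep, the two checks named above (cardinality and null values) are easy to sketch with the Python standard library. This is not what Dataprep does internally, just a minimal stand-in. The sample rows are invented for illustration and are not real noCRM data.

```python
import csv
import io
from typing import Dict, TextIO


def profile_csv(handle: TextIO) -> Dict[str, Dict[str, int]]:
    """Count rows, missing values, and distinct values for each column."""
    reader = csv.DictReader(handle)
    rows = list(reader)
    profile = {}
    for column in reader.fieldnames or []:
        # Treat empty strings and whitespace-only cells as missing.
        values = [(row.get(column) or "").strip() for row in rows]
        present = [value for value in values if value]
        profile[column] = {
            "rows": len(values),
            "missing": len(values) - len(present),
            "cardinality": len(set(present)),
        }
    return profile


# Invented sample standing in for a lead export file.
SAMPLE_EXPORT = """Company,Status,Start Date
Acme,Open,2020-01-15
,Open,01/20/2020
Acme,,
Globex,Won,2020-02-01
"""

profile = profile_csv(io.StringIO(SAMPLE_EXPORT))
for column, stats in profile.items():
    print(column, stats)
```

To run it against a real export, open the file with open(path, newline="") and pass the handle to profile_csv instead of the in-memory sample.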

Once the file is uploaded, a flow is created. A flow is the general name for a pipeline in Dataprep. I won’t be using it to build the pipeline, just to do the data exploration. To get there, I’ll add a recipe and edit it.

As you can see below, Dataprep does some level of profiling of the data.

Some quick observations:

  1. Empty records make up most of the Company data in the sample set.
  2. There are multiple fields being used to describe a Client (Client, Company, Company Name, Company_Name), and there is a lot of inconsistency in their use.
  3. Data is not being input into many of the fields. You can tell by the coloring of the horizontal bar graph: teal is where values are present, and dark gray is where a value is missing.
  4. There are also some data type discrepancies between some of the system dates.

So I know that when I start looking at Data Quality, some data input processes will need to be formalized, we will need a shared understanding of which fields to use, and I will potentially need some transformations when processing some of the date fields.
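As a first idea of what those transformations might look like, here is a minimal Python sketch that collapses the four client-style fields into one value and normalizes dates that arrive in more than one format. The four field names come from the observations above. The precedence order between them and the list of date formats are assumptions for illustration, since neither has been decided yet.

```python
from datetime import date, datetime
from typing import Dict, Optional

# Field names observed in the export; the precedence order is an assumption.
CLIENT_FIELDS = ("Company", "Client", "Company Name", "Company_Name")

# Assumed formats; the real export may use others.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%Y-%m-%d %H:%M:%S")


def coalesce_client(record: Dict[str, str]) -> Optional[str]:
    """Return the first non-empty client-style field, or None if all are empty."""
    for field in CLIENT_FIELDS:
        value = (record.get(field) or "").strip()
        if value:
            return value
    return None


def parse_date(raw: Optional[str]) -> Optional[date]:
    """Try each assumed format in turn; return None when nothing matches."""
    text = (raw or "").strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None


record = {"Company": "", "Company_Name": " Acme ", "Start Date": "01/20/2020"}
print(coalesce_client(record), parse_date(record["Start Date"]))
```

Returning None rather than raising keeps unparseable values visible as missing, so they can be counted and chased down instead of silently dropped.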

Well, that’s it for now. Next time we will look at the Data Dictionary and any Data Quality rules and processes that came out of the business meetings and data exploration activities covered in this post, as well as setting up a more permanent data pipeline from noCRM into BigQuery using Python.