We wanted to drive business insights. "Data-driven decision-making" - that was going to be us. To do that, we realized we needed to become the source of truth for our company's internal data, so we set about aggregating as many data sources as possible into an internal data warehouse. Like so many data teams, we wanted people to come to us with their data questions so we could answer them.
Soon into our journey, the reality of a resource-constrained team set in. We could only answer some of the questions that came our way, and could only build BI reports with insights for some situations. We had high-quality data that people wanted, but we became the bottleneck.
We realized our most valuable assets were the cleaned, high-quality datasets we'd created. We would generate BI visualizations for our stakeholders as quickly and effectively as we could, but most people just wanted the underlying data to do their own analyses. Even with all the visualization power of BI tools, our most requested report was just a table of numbers. Other teams were asking for our data, but we needed a better way to get it to them. If we could get the underlying data directly to stakeholders, we could serve far more use cases and stop being such a bottleneck.
To do this responsibly, we wrestled with a few challenges:
A standard security posture, and the one our company followed, is the principle of least privilege. Our data governance dictated that people should have access only to the data they need, and no unnecessary access beyond that.
One potential solution for exposing the data to our stakeholders was direct database access, but we quickly dismissed that idea. Our cleaned datasets were too often commingled with draft ones, and we didn't have the infrastructure to grant permissions on specific tables easily. Our leadership was also concerned about 'external' parties accessing our database. So we had to get our data out of the data warehouse in a controlled way.
When we used the data for our own BI reports and analyses, we could run quality checks to make sure nothing looked wrong in our production reports, and if something did, we could remedy it quickly. If we were to hand our data to other parties, we would need more robust quality checking and would have to communicate any caveats about using the data to those teams.
Some of the people who wanted the data were experts managing their own data warehouses, while others would work with it in plain Excel. Some teams wanted our data fed automatically into an internal CRM or another SaaS platform, and some were content with the BI visualizations we had pre-built. We had to support all of these use cases from a single source-of-truth dataset.
Given these challenges, how could we start to distribute our data responsibly?
We reluctantly turned to every data engineer's silver bullet - the CSV file. CSVs solved some of our problems, though not all of them:
We could filter our SQL queries to pull only the data relevant to the person requesting it. They could always forward the file to people who weren't supposed to have access, but we had to trust them.
We could run manual data quality checks in SQL before creating the CSV files. Each CSV was accompanied by an email explaining the data and its applicable use cases. We added timestamps to the filenames so people knew when we last refreshed the data, but the static nature of the files meant we had to manually send new ones on a weekly or monthly cadence (a sketch of this workflow follows this list).
CSVs are the 'easiest' data transfer method, which let us cater to both the Excel users and the more advanced ones. The advanced users had to take an extra step to load the CSV into whatever format they wanted, but at least they were getting the data, so this was a good compromise.
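To make that workflow concrete, here's a minimal sketch of what one of our manual refreshes amounted to: filter the query to the requester's slice, sanity-check the result, and write a timestamped CSV. This assumes pandas and SQLAlchemy against a Postgres-style warehouse, and the table, column, and filter names are hypothetical stand-ins rather than our actual schema:

```python
from datetime import date

import pandas as pd
import sqlalchemy

# Hypothetical warehouse connection string - swap in your own.
engine = sqlalchemy.create_engine("postgresql://user:pass@warehouse:5432/analytics")

# Filter the query down to just the rows this stakeholder should see.
query = sqlalchemy.text("""
    SELECT order_id, region, revenue, refreshed_at
    FROM cleaned.orders
    WHERE region = :region
""")
df = pd.read_sql(query, engine, params={"region": "EMEA"})

# Manual quality gate before anything leaves the warehouse.
assert not df.empty, "Refresh produced no rows - investigate before sending"
assert df["revenue"].notna().all(), "Null revenue values - do not send"

# Timestamp in the filename tells recipients when the data was last refreshed.
df.to_csv(f"orders_emea_{date.today():%Y%m%d}.csv", index=False)
```

Simple enough for any one request - the problem was doing this by hand, per stakeholder, per refresh.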
As demand grew, sending CSVs became untenable for our most popular datasets, especially given all the manual work each data refresh involved. To scale, we had to shift to more self-serve options. But to enable self-serve data access by external parties, we had to be deliberate about how we set it up. Each dataset had to be accompanied by the appropriate caveats - what to use it for and what not to use it for.
To replace the quality checks we had been performing manually before each CSV send, we now had to write automated tests for the key datasets, run on every refresh (sketched below). We needed an owner for each dataset to maintain and enhance it as user feedback arrived. We also started tracking usage of our key datasets, both to prioritize our efforts on the most-used ones and to assess the value our team was driving. We cobbled this functionality together from our existing tools, which eventually got the job done, but the setup was long and arduous.
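For illustration, the automated tests amounted to something like the following sketch: a set of SQL checks that must all pass before a refresh is published. The check names, table, and freshness threshold are hypothetical, and in practice ours lived across a few different tools rather than one tidy script:

```python
import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine("postgresql://user:pass@warehouse:5432/analytics")

# Each check is a SQL query that should return zero rows;
# any rows coming back means the check failed.
CHECKS = {
    "no_null_keys": "SELECT order_id FROM cleaned.orders WHERE order_id IS NULL",
    "no_negative_revenue": "SELECT order_id FROM cleaned.orders WHERE revenue < 0",
    "refreshed_recently": """
        SELECT 1 WHERE (SELECT max(refreshed_at) FROM cleaned.orders)
                       < now() - interval '1 day'
    """,
}

def run_checks() -> list[str]:
    """Run every check and return the names of the ones that failed."""
    return [
        name
        for name, sql in CHECKS.items()
        if not pd.read_sql(sqlalchemy.text(sql), engine).empty
    ]

# Block the refresh (and anything downstream) if any check fails.
if failed := run_checks():
    raise SystemExit(f"Refresh blocked; failed checks: {failed}")
```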
Up until this point, our team had generally operated by completing projects. We would take requests, iterate with the stakeholder, and deliver a completed analysis or dashboard. Some of our stakeholders were in the product org, and we worked with them to maintain a product catalog and track KPIs around product usage, quality, and support. We devised criteria for adding datasets to a product catalog, including requirements for having an owner and passing specific governance gates. As we worked with them, we realized: this was exactly what we were doing with our key datasets! The processes we were putting in place to maintain our self-serve datasets were effectively creating "Data Products".
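To give a sense of what a catalog entry carried, here's a hypothetical sketch of the per-product metadata those criteria implied - an accountable owner, usage caveats, governance status, and a refresh cadence. The exact fields are illustrative, not our production schema:

```python
from dataclasses import dataclass

@dataclass
class DataProduct:
    """One entry in a data product catalog."""
    name: str
    owner: str                     # every product needs an accountable owner
    description: str
    caveats: list[str]             # what the data should and shouldn't be used for
    governance_gates_passed: bool  # e.g. automated tests and an access review
    refresh_cadence: str

catalog = [
    DataProduct(
        name="cleaned.orders",
        owner="analytics-team@example.com",
        description="Deduplicated order records, one row per order.",
        caveats=["Excludes test orders", "Revenue figures are pre-tax"],
        governance_gates_passed=True,
        refresh_cadence="daily",
    ),
]
```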
We then fully adopted the product mindset - we had a catalog of data products, and our team managed them as data product managers. This mindset shift was a big unlock, but even after realizing we needed to treat our data assets as products, we still needed help executing various parts of the product lifecycle. With no dedicated data distribution or data product management tools available, we rigged our existing tools together to cover the gaps - it worked, but it took a lot of effort and was far from perfect.
Our team still takes on projects for specific requests, but building out our data products is how the team's impact will scale. Data visualizations still have their place for known reporting needs, but I suspect many business insights and analytics teams will evolve as ours did - from driving projects to creating and maintaining data as a product.
We learned all of this through trial and error, and we're excited to share strategic advice with your team on how to make better use of your data. Feel free to book a call with us and start the journey!