In the age of unprecedented tech innovation, it can be easy to forget our roots.
Data is the lifeblood of an organization. Companies such as Google, Uber, and Facebook have built business models around the data they collect. With all the promise data offers, it also presents challenges. For example, securing consensus across the organization on the definition of data elements can be a daunting task. The time required to clean and prep data for use can be excessive and expensive. Long turnaround times for new reports cause frustration, so users end up building their own reports in Excel. But one of the biggest challenges is slow system response times. I’ve always thought there had to be a better answer to all this frustration.
Thinking through today’s biggest data puzzles has made me reflect, though, on just how far we’ve come.
One of my most memorable projects was building an enterprise data warehouse (EDW) for a large federal agency. The EDW is where organizations collect and store key data from across the organization so they can measure performance against key indicators. Although I had spent my entire career developing software applications, I felt like I was fresh out of college when I had to design and build my first data warehouse. It required a fundamental shift in thinking. The word data took on an entirely new meaning.
A few months into that project, I learned that most of the difficult design decisions had to do with designing data models that would perform well when users submitted complex queries. This was no easy feat. We had to anticipate how users were going to query the data before we could design the data models. It also required significant amounts of code to pre-aggregate data so we could present it just as the user would expect it. Any small change request made us tremble because we knew it had the potential to cause ripple effects through our code and data models.
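To make that pain concrete, here is a minimal Python sketch of the kind of pre-aggregation code I mean. The sales data, column names, and the hard-coded (region, product) grouping are all made up for illustration, not taken from the actual project:

```python
from collections import defaultdict

# Hypothetical raw fact rows: (region, product, sale_amount).
raw_sales = [
    ("east", "widget", 120.0),
    ("east", "widget", 80.0),
    ("west", "gadget", 200.0),
    ("east", "gadget", 50.0),
]

def pre_aggregate(rows):
    """Roll raw rows up into a summary table keyed by (region, product).

    The grouping keys must be chosen *before* anyone queries the data.
    A query that slices by a dimension we did not anticipate (say, date)
    forces a rebuild of this summary and of the code that loads it --
    which is exactly why small change requests rippled so far.
    """
    summary = defaultdict(float)
    for region, product, amount in rows:
        summary[(region, product)] += amount
    return dict(summary)

summary_table = pre_aggregate(raw_sales)
print(summary_table[("east", "widget")])  # 200.0
```

The brittleness lives in that fixed key: every new way of slicing the data means another summary table and another load job.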
Then, about 10 years ago, a new set of data visualization tools emerged that promised to ease the pain. We bought into that promise and were early adopters. Now we could just take raw data and create our own visualizations without having to wait on perfect data models. Despite the potential, we quickly realized this didn’t scale well. As soon as we reached 100 million records, we started to have major performance problems. Even so, quirks and all, these new tools represented a qualitative victory: they helped us envision data uses that would have seemed impossible just a few years prior. They were not a silver bullet, but they were a step in the right direction.
MemSQL, a small startup founded by two former Facebook employees, recognized these pain points and took the next step: it built an in-memory database specifically to address the performance problems our data team, and many others, had run into. I won’t go into all the details, but MemSQL’s in-memory database performs so well that it eliminates the need to pre-aggregate data. This means users can get access to data as quickly as they can collect it. It also means development teams don’t need to spend their time building perfect data models that anticipate a user’s every interaction with the warehouse. Instead, they can shift their energy to delivering better insights.
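To show the shift this enables, here is a hedged Python sketch (this is not MemSQL’s API, just an illustration of the idea, with made-up sales data): when scanning raw rows is fast enough, the grouping dimension becomes a query-time choice rather than something baked into a summary table in advance:

```python
from collections import defaultdict

# Hypothetical raw fact rows: (region, product, sale_amount).
raw_sales = [
    ("east", "widget", 120.0),
    ("east", "widget", 80.0),
    ("west", "gadget", 200.0),
    ("east", "gadget", 50.0),
]

def aggregate_on_the_fly(rows, key):
    """Group raw rows by any dimension chosen at query time.

    No summary table has to exist in advance, so a new question
    needs a new `key` function -- not a new data model or load job.
    """
    totals = defaultdict(float)
    for row in rows:
        totals[key(row)] += row[2]
    return dict(totals)

# Slice by region today...
print(aggregate_on_the_fly(raw_sales, key=lambda r: r[0]))
# ...and by product tomorrow, without touching the data model.
print(aggregate_on_the_fly(raw_sales, key=lambda r: r[1]))
```

The design point is that the anticipation work moves out of the schema and into the query, which is exactly where an exploratory user wants it.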
I’m sure this cocoon of smart kids in Silicon Valley will continue to push the envelope with innovative database solutions that enable us to unlock the value hidden in our data. I just wish those solutions had been around when I worked on my first data warehouse project.