From Legacy to New World, How We are Revolutionizing our Data Pipeline
You are the head of logistics and delivery at a corporation called Groceries to You. Just as the title implies, the company delivers groceries from various stores to a wide variety of customers. Your most prized innovation is the Grand Grocery Voyager, a super cool van that delivers groceries from grocery stores to customers across harsh weather conditions, distances and terrain. Your Grand Grocery Voyager (GGV) makes thousands of people, and their stomachs, very happy. Maybe you’re not in charge of revolutionizing our data pipeline with the GGV, but you know…baby steps, right?
However, this was not always the case. The Grand Grocery Voyager wasn’t always the solution to the grocery delivering problem. Groceries to You had much more humble beginnings.
You started as an eager smiley faced employee at Groceries to You, back when it was a new company. One day your boss told you that you needed to design a system that would deliver groceries from Mega Grocery Mart to a customer. However, this needed to be done in a week. Using your knowledge of transportation and the problem at hand, you concluded that the best solution would be a horse and buggy. They’re cheap and get the job done, plus you love horses. The business grew very fast, and soon everyone wanted to have their groceries delivered to them. Struggling to meet demands, you decided to create new horse and buggy systems each with a slight variation. Some customers lived across the desert, while others lived in a marshy area. As a result, you developed special tires for the buggies to handle the different terrain. Another grocery store called Groceries R Us also wanted to utilize your services, but they packed their groceries differently than Mega Grocery Mart. As a result, you designed a new fleet of buggies that could carry these strangely wrapped groceries.
As the years went on, you gathered a handful of clients who all packed their groceries in different ways and all needed to be transported a little differently based on where they were going. Your clients wanted to start delivering groceries to the next town over. Delivering to the new destinations would take a long time to accomplish with your poor tired horses. You decided to tie two horses to each buggy to make everyone happy. You knew that in the future, your clients would want to deliver food to farther and farther towns. Food goes bad after a while though. It’s at that moment that a new brilliant idea was born. Your mind started to form what will become the Grand Grocery Voyager: a van that can store any oddly shaped package product, with tires that can overcome any obstacle, and a turbo engine that can move faster than a buggy pulled by 600 horses.
Building the Grand Grocery Voyager is the current mission being faced by the Data and Analytics team here at 360insights, except instead of groceries it’s data. In the past, the Data and Analytics team was known as the “Data Feeds” team; their sole mission being to build and maintain the company’s ETL system.
ETL stands for Extract, Transform, and Load, which is the process of taking data from one place in one form or another, working some magic on the data to transform it into another form and loading it into some other location.
360insights has many different clients across different industries, and unless there’s a secret “data formatting” meeting that takes place amongst all our clients, every client’s data is going to look different. If you take a moment and put on your handy dandy thinking cap, you can probably conclude that every ETL process is going to be different.
Once upon a time, the data feeds team was a one-man show named Joe (it rhymes). Once upon a time, 360insights was also a lot smaller than what they are today.
Once upon a time, the solution to the ETL problem was a glorious composition of C# code and SQL stored procedures.
Each new transform had to be coded from scratch. To give you an idea of how much work this was, it usually took at least two weeks to code a new transform and to test it. This is all fine and dandy if you only have a few to do, but 360insights isn’t what it once was; it got bigger, and those few clients grew into over 100 clients across multiple industries. Soon, the little data feeds program became much bigger and more complex. You can think of our data systems like a horse and buggy: up until now it’s done what we’ve needed it to do and has done it well. Whenever we needed to go a bit farther or faster, we added a new horse to the existing buggy. However, no matter how many horses we add, we won’t be able to go as far or as fast as the car.
The old system simply took client data and loaded it directly into Core, the main platform, which happens to be a transactional database. The problem with this is that the system had a tendency to create duplicates. This often lead to a lot of difficulties with regards to specific tasks like building reports because everyday things like dealers appear multiple times, appear multiple times, where there should be only one.
This is where the “Data and Analytics” team comes into play. With more syllables comes more responsibility. In order for the team to be able to do more, to go farther, and to go faster, there needs to be a car. Getting that car in place will take more time, energy, and planning; but once it is built it will be a lot easier to go further and faster. If you haven’t picked up on the analogy yet, the “car” represents the new data system that is being developed.
The big challenge is that operations don’t stop, so the team has to maintain and develop the old ETL system while trying to engineer a sophisticated car. So, if it’s so much work then why bother? A more efficient and sophisticated data pipeline will allow us to onboard clients more quickly, and vastly improve the experience for current clients, thus upholding one of our founding principles: providing an unbelievable experience for our clients.
The new world data system is split into two components: the first dealing with how and where the data is stored, and the second addressing how we get data from one point to another (or more affectionately known as ETL).
Complicating the first component is the fact that data is split into two types, transactional data and master data. Transactional data is precisely what it sounds like; it keeps track of all events.
For example, if Billy buys a coffee from Bob at Mega Coffee Cafe. The transaction would be the interaction between Billy and Bob that involves buying the coffee. The master data would be stuff like Billy’s information, Bob’s information, the coffee type purchased, and the address of Mega Coffee Cafe.
Now, say that every time a transaction took place at any of Mega Coffee Cafe’s locations, a new entry gets stored about which location it was at and the type of coffee purchased. Some of these entries might be referring to the same coffee type or location but might be slightly different from the others (spaces, short forms, capitalizations, an accidental typo etc.) If you were tasked with finding out about what locations sold the most coffee, you’d likely run into some issues. Each location is going to be present multiple times in slightly different ways which can make finding out what you want quite difficult. This is when something called MDM (Master Data Management) comes into play. MDM is about maintaining a single “source of truth” Instead of storing the same location or coffee type multiple times, only one entry for each unique type is stored and then later referred to. This avoids duplicate locations and coffee types, which will make finding and using information a lot easier.
The Data and Analytics team is building an MDM solution for the new data world.
The second problem tackled by our new world data system stems from 360insights having a wide variety of clients from various industries. All the data tends to be formatted a little differently. However, all the data must be transformed into something that is more usable for the company. This is where something called a “customer canonical” data store comes into play. The idea behind this is it allows data of different formats to communicate and be used with each other. Before, data was just loaded directly into Core; now it is loaded first into this customer canonical data store where it is formatted consistently. The customer canonical system serves as the “translator” between the client’s data and our own. The customer canonical system feeds client data into the MDM system and will update information such as new dealers. This way, if two clients share the same dealer, that dealer can be represented as a single entity as opposed to multiple.
All this data needs to have a way to travel from one location to another.
The ETL system is the bridge between all the data stores. Previously, the ETL system was a complex assortment of custom code and stored procedures which took a long time to build, maintain, and test.
In the new world, an ETL tool called Pentaho is being adapted to perform these tasks. In Pentaho, ETL processes are built using a “drag and drop” Graphical User Interface (GUI) and are stored in an XML format. This makes processes a lot faster to build and easier to maintain. With a process that originally took two weeks to build in the old world, now takes only a few hours. One of the nice things about Pentaho is that all logging features are built in, and you don’t have to develop them separately, saving a lot of effort and time. Additionally, since Pentaho is built for creating ETL processes all the features you need are built in, such as reading and writing to various inputs and outputs. Pentaho offers a lot more flexibility and functionality than what is offered in stored procedures. Stored procedures are code that allows you to interact directly with a single database technology, such as Microsoft SQL Server. What makes Pentaho great is that it allows you to interact with a wide variety of data stores without depending on the database-embedded code, allowing for more visibility at a higher level. The Data and Analytics team has also been working on creating an automated testing system for transforms created in Pentaho. This allows for easier and quicker testing, cutting down hours of manual testing time to a few minutes.
Even though a customer canonical data model, an MDM system, and an easier, faster ETL tool will definitely make the Data and Analytics Team’s lives a lot easier, there is a lot of planning and work that has been put into this “new world” state and there will be a lot more work and planning to bring this data masterpiece to life. Nobody ever said it was easy to build a car, but it will be a lot faster than the horse and buggy.
Holly Oegema is a student in Software Engineering studying at the University of Waterloo. Currently, she is working on the 360Insights development team as an intern.