So this post is a bit different as it doesn’t relate to mobile or web development in particular; it’s more of a conceptualization and analysis. Recently, I was tasked with moving and summarizing large sets of data; we’re talking tens of millions of records, resulting in hundreds of gigabytes of space needed. At first I took a rudimentary approach which involved writing a console app in C# using ADO.NET. I won’t even go into using an ORM (Object Relational Mapper) like Entity Framework; let’s just say: they’re not efficient in this scenario.
Now, I began my approach by creating SQL database views to select and aggregate the data, so that my calls from the console app would be simple, clean, and maintainable from a code perspective. But there was an inherent flaw: there was so much data that SQL Server was taking over 5 minutes to return the results of a simple GROUP BY, for example:
SELECT Name, Id, SUM(Numbers) FROM example GROUP BY Name, Id
Now it can be argued that with the correct indices and a powerful enough server this shouldn’t be an issue. You’d be correct; however, throwing money at a server is not always an option, and it was going to become cumbersome to write database views and a console application and keep them in sync. Enter Microsoft’s SQL Server Integration Services (SSIS, whose packages are saved as .dtsx files), which provides a great alternative and solution for an ETL process. Microsoft’s documentation is actually on point when it comes to creating and implementing an SSIS solution, surprisingly!
Creating an SSIS solution is as simple as installing SQL Server Data Tools (SSDT). Just as a heads up, you’ll want a local instance of SQL Server installed on the machine as well; the version doesn’t matter. If that isn’t an option, don’t worry about it too much, because SSDT will install everything you need to develop and test, though not necessarily to deploy correctly. Anyway, once you have the SSDT package installed, you can create a new project/solution in Visual Studio and start dragging controls into your package.
A “package” is the unit of work in this type of project/solution, and a package contains many steps, or “flows”. Flows distinguish and show how data is extracted, transformed, and then loaded (ETL). You have multiple options for how you complete these processes via “controls”. The caveat to remember is that some controls process data row by row, which can be a costly performance hit if you are not careful. The ultimate goal is to process data in sets and minimize row-by-row processing. So, for example, the previously mentioned query can be improved on in an SSIS package.
Essentially, we’ll be offloading the aggregation into SSIS, because the data is all held in memory, where the aggregation can outperform what SQL Server provides on its own here. So we create a new control to extract the data from our SQL Server via an OLE DB connection and pipe it into an Aggregate control. What is beautiful about this step is that the Aggregate control is smart enough (most of the time) to know which columns need to be grouped by and which aggregate function should be performed (mainly SUM()). Granted, you’ll always want to double-check and configure it appropriately, but this was a nice touch. Well done, Microsoft! Now, after the Aggregate control finishes its thing, we pipe it to another control, and so on, until we hit a “destination” control type.
The next step in an ETL process is to check your target database and see if there are identical records. When there are identical records we have to make a choice: update or ignore. If there is no identical record, then we simply want to insert the new ones. Once we’ve identified the records that need a change, we’ll want to set up a staging table, drop those records into it, and perform an update by joining the staging table and the main table on the correct keys, which results in one transactional update.
How do we complete the “T” and the “L”, and which controls do we use? Well, we actually need to extract the target’s data before we can continue with the process, because we need to know what to transform and what to load, rather than perform unnecessary calculations or aggregations. We do this with a “Lookup” control, which connects to your target database and can pipe data to two different flows: one that matches and one that does not. See where we’re going? 😀 Let’s explore the update flow: you’ll want to connect the Lookup control to a “Conditional Split” control, which allows you to create an expression to check the extracted data against what already exists and pipe the data accordingly. If the data needs updating, we’ll want to pipe it to a staging table rather than an “OLE DB Command”, because an “OLE DB Command” processes row by row, which is not what we want, remember. Finally, we’re going to connect both this “Conditional Split” output and the Lookup’s no-match output to their own “OLE DB Destination” controls. This control is pretty straightforward and simply asks for your destination table and gives you a few options on how to load it. Always use “FAST LOAD”: it uses batch inserts while the other options do not.
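Conceptually, the Lookup’s matched/unmatched split is equivalent to a left join against the target table; here’s a rough T-SQL sketch of the idea (table and column names are hypothetical, not from the actual project):

```sql
-- Rough equivalent of the Lookup control's two outputs:
-- rows where t.Id IS NULL have no match in the target (insert flow),
-- all other rows matched and are candidates for the update flow.
SELECT s.Name,
       s.Id,
       s.Total,
       CASE WHEN t.Id IS NULL THEN 'no match (insert)'
            ELSE 'match (update check)'
       END AS Flow
FROM   SourceAggregated AS s
LEFT JOIN TargetTable AS t
       ON t.Name = s.Name
      AND t.Id   = s.Id;
```

The Conditional Split then plays the role of an extra WHERE-style predicate on the matched rows, sending only genuinely changed rows to the staging table.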
Once all the previous data flows have completed, you’ll want to execute a SQL statement or stored procedure to update the target table joined on the staging table, then clean up the staging table after the fact.
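A minimal sketch of that final step, assuming a hypothetical target table `Example` and staging table `Staging_Example` keyed on `Name` and `Id`:

```sql
-- Set-based update: join the target to the staging table on the keys,
-- so every changed row is applied in a single transactional statement
-- instead of row by row.
UPDATE tgt
SET    tgt.Total = stg.Total
FROM   Example AS tgt
INNER JOIN Staging_Example AS stg
       ON stg.Name = tgt.Name
      AND stg.Id   = tgt.Id;

-- Clean out the staging table so it's empty for the next package run.
TRUNCATE TABLE Staging_Example;
```

This is the same join described above, just expressed as the Execute SQL Task you’d drop in after the data flows finish.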
This process took what was initially 7 seconds for SQL Server to return the data down to less than 1 second to pull, aggregate, and insert/update the data accordingly; what an improvement! I’d say that was a great day’s worth of work and a lot learned!
Till next time, happy coding!