At DotControl & Rockboost we apply a data-first approach. We don’t make assumptions about what our clients’ customers would like. Instead, we build, go live early with an MVP, measure, and learn. We literally live and breathe data. Data in itself is not valuable; it’s the information and insights you can extract from it that are valuable.
Before you can even start thinking about extracting learnings from data, you must first make sure all your data is stored, cleaned and accessible in a uniform way. You have to think about how to make sure new or updated data is automatically loaded into the system, without compromising the speed or availability of the system where data experiments are being performed. We’re not talking about gigabytes, but potentially petabytes of information, coming from various data sources, from .csv text files to IoT-generated event data.
Traditionally, the solution for analyzing big data sets was to set up an on-premises data warehouse. Due to the huge costs involved, both in hardware and in maintenance, this was only attainable for very large companies. With the advent of cloud computing, the costs of both data storage and the computing power needed to analyze this data decreased considerably, putting the analysis of big data sets within reach of small and medium-sized companies as well.
Unfortunately, the license and/or maintenance costs of these traditional data warehouse solutions are still out of reach for most companies. On the other hand, some new kids on the block are jumping into the gap the traditional vendors left in the cloud data warehousing arena, making cloud-based big data analysis available to mid-sized companies as well.
These cloud-optimised data warehouses are called Data Lakes. A Data Lake is basically a central repository where you can store all your structured (.csv, SQL), semi-structured (JSON) and even unstructured data (binary files such as PDFs and images). Besides allowing other forms of data storage than structured data, Data Lakes also try to overcome other limitations of traditional data warehouses, for example by offering flexible storage and computing costs on a pay-per-use basis.
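To make the "one repository, any format" idea concrete, here is a minimal sketch in plain Python (not any vendor SDK; the file names and contents are made up for illustration). The point is simply that structured, semi-structured and unstructured files all land in the same store, unchanged:

```python
import json
import pathlib

# Illustration only: at its simplest, a Data Lake is one repository
# that accepts any file format as-is.
lake = pathlib.Path("lake/raw")
lake.mkdir(parents=True, exist_ok=True)

# Structured: a CSV export from a relational database.
(lake / "orders.csv").write_text("order_id,amount\n1,19.99\n2,5.00\n")

# Semi-structured: a JSON event from a web application.
(lake / "event.json").write_text(json.dumps({"type": "click", "page": "/home"}))

# Unstructured: binary content such as a PDF or an image.
(lake / "scan.bin").write_bytes(b"%PDF-1.4 ...")

names = sorted(p.name for p in lake.iterdir())
print(names)
```

Notice that nothing is parsed or validated on the way in; making sense of these files is deferred to a later step, which is exactly where the ELT process below comes in.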
Traditionally, data warehouses took an ETL approach to getting and storing data: Extract, Transform and Load. Data Lakes usually employ a slightly modified process called ELT. First, data is Extracted from the various data sources, then it is Loaded into the Data Lake storage as-is; the decreased cost of storage and computing in the cloud makes it feasible to keep this raw data around. Only after the data has been loaded into the Data Lake is it Transformed to make it suitable for data analysis.
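The three ELT steps can be sketched in a few lines of Python (a toy example with made-up data, using an in-memory SQLite table as a stand-in for the analytics layer). Note how the raw CSV is loaded untouched, and cleaning only happens in the final Transform step:

```python
import csv
import io
import sqlite3

# 1. Extract: pull raw records from a source system, here a CSV export
#    with inconsistent casing and whitespace (made-up sample data).
raw_csv = "customer,amount\nalice, 10\nBOB,5\nalice,7\n"

# 2. Load: store the data as-is in the lake's raw zone; no cleaning yet.
raw_zone = {"sales/2024/export.csv": raw_csv}

# 3. Transform: only now, inside the lake, clean and reshape for analysis.
rows = list(csv.DictReader(io.StringIO(raw_zone["sales/2024/export.csv"])))
clean = [(r["customer"].strip().lower(), float(r["amount"])) for r in rows]

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (customer TEXT, amount REAL)")
db.executemany("INSERT INTO sales VALUES (?, ?)", clean)
totals = dict(db.execute(
    "SELECT customer, SUM(amount) FROM sales GROUP BY customer"))
print(sorted(totals.items()))  # [('alice', 17.0), ('bob', 5.0)]
```

Because the raw file is kept, you can always re-run the Transform step with different cleaning rules later, something a classic ETL pipeline (which discards the raw input after transformation) cannot do.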
Data Lake solutions are offered by several vendors, such as Amazon Web Services and Microsoft Azure. Because we’re a Microsoft-oriented company, we’ll take a brief look at the Azure Data Lake offering. Mapped onto the ELT process, the Extract phase is handled by Azure Data Factory, a data integration service that can connect to and query a wide range of data sources. This data can then be Loaded into Azure Data Lake storage, which is essentially Azure Blob-based data storage. For data Transformation Microsoft introduced a new language called U-SQL, a combination of SQL and C#-like language components that allows easy data manipulation and can be executed in parallel. This ensures performance when handling big data sets. At the other end of the Data Lake offering, Microsoft provides Azure Data Lake Analytics, where these U-SQL jobs run to transform Data Lake data for further analysis.
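The reason parallel execution keeps transformations fast is worth spelling out. As an illustration only, here is the same pattern in plain Python (not U-SQL or the Azure SDK): partition the data, aggregate each partition independently, and merge the partial results. Each per-partition aggregate touches no shared state, so an engine like Data Lake Analytics can run them on many nodes at once:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Made-up event data standing in for a big data set.
events = [("click", 1), ("view", 1), ("click", 1), ("view", 1), ("buy", 1)] * 1000

def aggregate(partition):
    # Per-partition aggregation: no shared state, so partitions
    # can be processed independently and in parallel.
    counts = Counter()
    for kind, n in partition:
        counts[kind] += n
    return counts

# Partition the data (a real engine distributes this across many nodes).
chunks = [events[i::4] for i in range(4)]

with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(aggregate, chunks))

# Merge step: combine the partial aggregates into the final result.
total = sum(partials, Counter())
print(sorted(total.items()))  # [('buy', 1000), ('click', 2000), ('view', 2000)]
```

This split-aggregate-merge shape is what lets the same transformation scale from megabytes on a laptop to petabytes on a cluster.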
When running a Data Lake project we use an adapted version of the Scrum process, called the Microsoft Team Data Science Process (TDSP). TDSP gives you a clear format, from a clear definition of the project and of team roles and responsibilities, to data dictionary documentation and an agile way of executing data science user stories. The TDSP life cycle consists of the steps below:
- Business understanding
- Data acquisition and understanding
- Modeling
- Deployment
- Customer acceptance
Another big advantage of using the TDSP process is its focus on setting up a CI/CD-like pipeline. Where in the old days the complete project depended on the data scientist’s machine, TDSP applies the lessons learned in software development to the data science process. This means all sample data, data dictionary documentation, data transformation scripts and so on are committed to the data science project’s git repository. This ensures all team members can share scripts and can have their environment up and running in minutes.
Our approach is always based on The Lean Startup, and we apply it to our data science projects as well: focus on getting a cost-effective MVP live as soon as possible and iterate from there. A data science project never has a guaranteed outcome; a lot of time is spent on data wrangling and exploration, machine learning model parameter experimentation, and analysis of results. By combining TDSP with an MVP mindset we can make sure we successfully deliver and run a data science project.
As a company, you need to start building your Data Lake because it’s the intelligent base of the services offered by your Intelligent Digital Mesh. Your Data Lake will not only store all your data, from on-premises data to analytics and IoT streaming data, but in the future it will also offer you services like Augmented Analytics.
So if data is the new oil, you can see a Data Lake as an oil field, ready to be tapped for insights.