Data engineering, the immersed part of the digitalization iceberg
Updated: Jul 25, 2019
Two weeks ago the DI Summit organized by Digityser took place at the Free University of Brussels (ULB). AI experts shared very interesting insights and use cases. During one of the masterclasses, a big consulting company presented an impressive AI use case that they had built within a month, but only after 6 months of collecting business requirements and getting access to the right data, in the right format. In other words, 6 months of data engineering before they could build their use case. The effort invested in the real value creation (the AI model) is small compared with what it took to be able to kick it off… And trust me, these guys are not amateurs! This is actually a very common observation within companies, and it testifies to a too-frequently accepted status quo which must be challenged.
Data engineering consists mostly of getting data ready-to-use without harming existing systems. When it comes to data-driven projects such as AI, data science or any regular data integration project (e.g. building a customer 360 view), 80% of the effort - let’s be optimistic - is spent on data engineering to get the necessary data ready-to-use, while only 20% is actually invested in building value out of this data. This may sound weird to you and you may wonder: “If this data already exists in your systems, why on earth should it take 6 months to get it ready-to-use?”. This article explains what this 80% entails, reflects upon the ‘why’ of this situation and gives direction on how you can get out of this never-ending loop.
> TREE53 provides data engineering & operations expertise and training. Our Kafka®-based solution automates data engineering, governance and security. Discover how.
So what does a data engineer do?
There is no clear definition of where data engineering starts and stops. In the context of this article, I will just take the example of what data engineering typically looks like for an analytics use case (e.g. BI reporting, AI, data science etc.).
When a new project comes up where data sources need to be unlocked for a specific use case, the very first thing a data engineer will need to do is understand which data is needed and where it is located. This involves a lot of meetings in order to obtain the information and get access to the data. It is important to note that different data sources (e.g. a CRM vs an ERP) do not talk naturally to one another. This is why data engineers build data flows between different systems, storage facilities and other environments (e.g. a BI tool, an AI model environment etc.) involved in this use case, taking into account the specific needs of that use case (e.g. batch delivery vs real-time analytics). The idea is to retrieve and deliver data without impacting other existing systems.
Once unlocked, data requires a lot of preparation and transformation at individual (record) level and at a more encompassing level (datasets) in order to build a curated and consistent dataset. Data from different data sources is heterogeneous. It is a bit as if it were written in different languages and needed to be translated into a common one, a common format. Of course not every record is worth using: some are “corrupt” or “inaccurate” and need to be replaced, modified or deleted for the quality of the final solution to be ensured (see Dr Shaomin Wu in “A review on coarse warranty data and analysis”).
Besides ensuring the quality of each record, data engineers compare them with one another to make sure that duplicates are out of the way and that some data is combined/aggregated to give a better understanding of the datasets. This is of course an iterative process that only stops when the data seems ready for the use case (BI reporting, AI, data science, data integration etc.). Changing the data that is needed implies another round of data engineering work, whether it is in the context of the project or long after its delivery. Once a project is delivered, maintaining and fixing bugs on this dataset stays important to ensure the quality of the solution.
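To make the preparation steps above more tangible, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the sample records, the field names (`email`, `amount`) and the rule that two records with the same email and amount are duplicates. Real projects need far richer rules, but the shape is the same: translate each record into a common format, drop the corrupt ones, remove duplicates, then aggregate.

```python
from collections import defaultdict

# Hypothetical raw records coming from two sources (a CRM and an ERP)
# that describe the same customers with different conventions.
raw_records = [
    {"source": "crm", "email": " Ann@Example.com ", "amount": "120.50"},
    {"source": "erp", "email": "ann@example.com", "amount": "120.50"},  # duplicate
    {"source": "erp", "email": "bob@example.com", "amount": "80"},
    {"source": "crm", "email": "", "amount": "15"},                     # no key
    {"source": "crm", "email": "bob@example.com", "amount": "n/a"},     # bad amount
]


def normalize(record):
    """Translate one record into the common format, or return None if unusable."""
    email = record["email"].strip().lower()
    if not email:
        return None
    try:
        amount = float(record["amount"])
    except ValueError:
        return None
    return {"email": email, "amount": amount}


def curate(records):
    """Clean, de-duplicate and aggregate records into a total amount per email."""
    cleaned = [r for r in map(normalize, records) if r is not None]

    # Assumed duplicate rule: same email and same amount means same record.
    seen = set()
    unique = []
    for r in cleaned:
        key = (r["email"], r["amount"])
        if key not in seen:
            seen.add(key)
            unique.append(r)

    totals = defaultdict(float)
    for r in unique:
        totals[r["email"]] += r["amount"]
    return dict(totals)


curated = curate(raw_records)
print(curated)
```

Each time the use case changes (a new field, a new source, a new duplicate rule), functions like these have to be revisited, which is exactly the iterative loop described above.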
The robustness of the final solution depends not only on how many of these tasks are automated (data delivery, data cleansing, change management etc.) but also on the level of security (e.g. encryption) and data governance (e.g. access rules, data catalog, data workflow documentation, etc.) built around the solution. This is also typically part of data engineering concerns, and it’s unfortunately often overlooked due to tight deadlines.
The above-described status quo of data engineering accounts for the considerable time dedicated to getting data ready-to-use for building value on top of it. Moreover, it mostly involves tasks that are often done manually and repeated over and over again instead of building upon existing flows. Most of these redundancies started with constraints imposed by old technologies that got so embedded in IT culture that they became the norm. Let’s take a data integration effort as an example. Data integration is often designed to serve one specific use case and is implemented as a tailored point-to-point ETL connector, which leaves little flexibility for reuse in other use cases. It only takes the departure of the data engineer who created the connector for the company to forget how it works, or even that it exists. All in all, this often leads to a spaghetti-bowl data landscape that no one understands or dares to touch anymore. In this context, the impact of change and evolution on existing data pipelines is often unpredictable and hard to control. Bug fixing, re-engineering and any other change take time. In other words: it sometimes becomes safer to build the exact same connector all over again than to risk a system breakdown by touching an existing ETL connector.
How can digitalization truly happen when it takes you months or years just to have data ready-to-use or just ready-to-experiment-with?
If any new business requirement or any little change in the data requires significant data engineering efforts, your current data engineering practices will be a hurdle to business agility. Business agility is an important component of a business’s digital transformation journey, because no one knows exactly what businesses will need to look like in the near future. Companies must therefore be able to try out different things, just like a start-up would do to discover its market. As a consequence, business requirements are constantly evolving and need to be dealt with as soon as possible. This calls for new technologies, concepts and movements in order to accelerate the delivery and ensure the long-term quality of data-driven projects.
Some technological and conceptual highlights: automate, automate, automate.
A key element that will substantially improve any data engineering effort lies in the automation and data centralization opportunities opening up with the rise of data platforms as a sort of central nervous system of a company. Such a platform can be developed in-house (with the right open-source software) or supplied by a third party. We are currently preparing a series of articles on the essential features of such a platform (e.g. security and access control, schema management, governance, lineage tracking, etc.). So, if you want to know more about it, stay tuned. In the context of this article, I would like to cite a few benefits that will tremendously change the way data-driven solutions are developed.
First of all, a data platform can integrate any type of tool or system, once. This means two things:
You can keep and work with any ‘best-of-breed’ tool you want; there’s no need for a large enterprise suite that imposes its own tools.
Data is always readily available for new use cases, since once data sources are integrated into this central platform you can index (use) this data at will without any extra data integration efforts. This is why we call it “connect once, index at will”.
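A bit of back-of-the-envelope arithmetic shows why “connect once” matters. The sketch below is only an illustration under a deliberately pessimistic assumption: every consuming use case needs data from every source. With tailored point-to-point connectors that means one connector per (source, consumer) pair, whereas with a central platform each system is connected exactly once. The function names are hypothetical.

```python
def point_to_point_connectors(sources: int, consumers: int) -> int:
    """One tailored connector per (source, consumer) pair."""
    return sources * consumers


def platform_connectors(sources: int, consumers: int) -> int:
    """Each source and each consumer is connected to the platform once."""
    return sources + consumers


# Example figures (assumed): 10 source systems feeding 6 use cases.
print(point_to_point_connectors(10, 6))  # 60 connectors to build and maintain
print(platform_connectors(10, 6))        # 16 connections to the platform
```

Adding a seventh use case costs ten new connectors in the first model and a single new connection in the second.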
Another important benefit of a data platform is the backward-compatibility capabilities that should come with it. It should guarantee that any change in one of the integrated systems will be coordinated across other systems so that they can adapt automatically to that change.
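One common way to approach this kind of guarantee is to check every schema change before it is accepted. The sketch below is a simplified, hypothetical rule written for this article, loosely inspired by the checks schema registries perform; it is not the API of any particular product. Here a schema is a plain dictionary mapping field names to a type and an optional default, and a change counts as backward compatible if existing fields keep their type and every new field comes with a default.

```python
def is_backward_compatible(old_schema: dict, new_schema: dict) -> bool:
    """Return True if data written with old_schema can still be read with new_schema.

    Simplified rule (an assumption for illustration):
    - a field present in both schemas must keep the same type;
    - a field added in new_schema must declare a default value;
    - a field removed from new_schema is simply ignored by readers.
    """
    for name, spec in new_schema.items():
        if name in old_schema:
            if old_schema[name]["type"] != spec["type"]:
                return False
        elif "default" not in spec:
            return False
    return True


customer_v1 = {
    "id": {"type": "int"},
    "email": {"type": "string"},
}

# Adds a field with a default: old records can still be read.
customer_v2 = {**customer_v1, "country": {"type": "string", "default": "unknown"}}

# Changes the type of an existing field: old records can no longer be read.
customer_v3 = {**customer_v1, "id": {"type": "string"}}

print(is_backward_compatible(customer_v1, customer_v2))
print(is_backward_compatible(customer_v1, customer_v3))
```

Rejecting an incompatible change at this point is what prevents it from silently breaking the systems downstream.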
Thirdly, the self-service data capability facilitates any type of analytics use case since e.g. data scientists can just pick up the necessary data and play with it until their model is built. This illustrates that data is readily available: data engineering has been done once and now the main focus of any project can be about using the data to build business value.
Finally, the data governance aspects of such a data platform can tremendously accelerate data engineering by automating information updates about where data is, what type of data (e.g. personal data) is available, who has access to it and how it is transformed. This eliminates most of the meetings needed in the first phase of an analytics project, as mentioned earlier in this article.
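As a rough illustration of what such governance information can look like once it is machine-readable, here is a minimal sketch of a data catalog. The structure, field names and sample entries are all hypothetical; the point is only that “where is the data, is it personal, who may use it and where does it come from” become simple queries instead of meetings.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    name: str                      # dataset name
    location: str                  # where the data lives
    contains_personal_data: bool   # what type of data it is
    allowed_roles: set = field(default_factory=set)   # who has access
    derived_from: list = field(default_factory=list)  # how it was produced


# Hypothetical catalog content.
catalog = [
    CatalogEntry("crm.customers", "crm-db", True, {"data_engineer"}),
    CatalogEntry("erp.invoices", "erp-db", False, {"data_engineer", "analyst"}),
    CatalogEntry(
        "analytics.revenue_per_customer",
        "warehouse",
        False,
        {"data_engineer", "analyst", "data_scientist"},
        derived_from=["crm.customers", "erp.invoices"],
    ),
]


def datasets_for(role, entries):
    """Names of the datasets a given role is allowed to use."""
    return [e.name for e in entries if role in e.allowed_roles]


def personal_datasets(entries):
    """Names of the datasets flagged as containing personal data."""
    return [e.name for e in entries if e.contains_personal_data]


print(datasets_for("data_scientist", catalog))
print(personal_datasets(catalog))
```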
But technology alone is of course not enough, as the status quo cannot be dislodged from people’s minds overnight. This is why movements such as DataOps look at different angles to optimize data engineering efforts. DataOps can sound like something of a buzzword, but the idea behind it is worth attention: improve integration, communication and automation of data flows across the organization by leveraging people collaboration, processes and technology.
The DataOps term just entered Gartner’s Hype Cycle for Data Management. It is defined by Gartner as “a collaborative data management practice focused on improving the communication, integration and automation of data flows between data managers and consumers across an organization. The goal of DataOps is to create predictable delivery and change management of data, data models and related artifacts. DataOps uses technology to automate data delivery with the appropriate levels of security, quality and metadata to improve the use and value of data in a dynamic environment.” In other words, it aims at implementing a process to manage, store and use data in an agile and efficient way while reducing the cycle time of data analytics… An interesting concept to explore!
It sometimes takes companies a lot of bad and costly experiences before they realize how important data engineering skills are.
To conclude this article, I would like to highlight two “DataOps smells” or frustrations that might make you want to reflect upon your own data engineering practices.
We often see prospects and customers underestimating data engineering. Many of them ask data scientists to do this part of the work as well. However, this is a bit like hiring a pianist to play the guitar. Don’t get me wrong, some data scientists will do a great job on this, but be aware that these people will only spend 20% of their time on what they were actually hired for: build a model, train it and deliver results. This can be highly frustrating not only for the data scientist but also for future data engineers who might need to fix issues caused by a lack of data engineering expertise. So if your data scientists are spending so much time on data engineering, this might be a symptom of something bigger.
The second frustration I would like to highlight is of course how a business department reacts when a little change in data might take months before it can be realized on the IT side. In terms of agility, companies should of course question their organization, the methodologies used to develop solutions but also the role of their data infrastructure and thereby of their data engineering in all of this.
Thanks for reading, and please let us know in the comments below whether there is a specific question or task you would like us to zoom in on in the next article.