The State of Data Engineering in 2021 - The integration problem
The current state of data engineering is quite frustrating.
For anyone coming from the field of data science, where new tools emerge every day, it doesn’t take long to realize that the data engineering situation is not the same.
The Current State
If you take a look at the Data Science Hierarchy of Needs pyramid, you can see that the second layer from the base is about reliable data flow and the storage of structured and unstructured data. That is exactly the job of a Data Engineer.
Furthermore, a fairly recent article on the Andreessen Horowitz (a16z) blog and 2 other articles, here and here, reflect the deep transformation that Data Engineering is undergoing right now. This transformation has an impact on the capital being raised and on the job market.
Data engineers are not data scientists, and do not necessarily have domain expertise. To become a data engineer, a person needs to be able to tackle programming challenges that do not come naturally to a developer because of the data quality aspects behind them. At the most extreme level, however, data engineers also need to be able to analyze data creatively and craft systems that can extract business insights from it.
And yet, the current state of Data Engineering is very immature, probably because it’s much less hyped than Machine Learning or Data Science.
This creates a vicious cycle.
On the technical side, the paradigm has been shifting from traditional ETL (Extract - Transform - Load) to ELT (Extract - Load - Transform), often written EL(T).
This evolution has brought forward a very important actor in the T step, which will be discussed in later articles: DBT. DBT has created a new standard in data transformation with Open Source tools, along with a business model based on team collaboration and advanced features.
However, as has been pointed out, the E and L steps come before the T one.
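To make the difference between the two paradigms concrete, here is a minimal sketch, assuming an in-memory SQLite database as a stand-in for the warehouse. The rows, table names and function names are all hypothetical and chosen only for illustration. In the ETL variant the aggregation happens in application code before anything is loaded. In the ELT variant the raw rows are loaded untouched and the transformation is a SQL statement run inside the warehouse, which is the kind of step a transformation tool takes over.

```python
import sqlite3

# Hypothetical raw rows, standing in for what a connector would extract.
RAW_ORDERS = [
    {"id": 1, "amount_cents": 1250, "country": "de"},
    {"id": 2, "amount_cents": 830, "country": "FR"},
    {"id": 3, "amount_cents": 4000, "country": "de"},
]


def extract():
    """Extract step: return raw rows from the (hypothetical) source."""
    return list(RAW_ORDERS)


def run_etl(conn):
    """ETL: transform in application code, then load only the result."""
    totals = {}
    for row in extract():
        country = row["country"].upper()
        totals[country] = totals.get(country, 0) + row["amount_cents"]
    conn.execute("CREATE TABLE etl_revenue (country TEXT, revenue_cents INTEGER)")
    conn.executemany("INSERT INTO etl_revenue VALUES (?, ?)", sorted(totals.items()))
    return conn.execute(
        "SELECT country, revenue_cents FROM etl_revenue ORDER BY country"
    ).fetchall()


def run_elt(conn):
    """ELT: load the raw rows untouched, then transform with SQL in the warehouse."""
    conn.execute(
        "CREATE TABLE raw_orders (id INTEGER, amount_cents INTEGER, country TEXT)"
    )
    conn.executemany(
        "INSERT INTO raw_orders VALUES (:id, :amount_cents, :country)", extract()
    )
    # The T step runs where the data already lives.
    conn.execute(
        """
        CREATE TABLE elt_revenue AS
        SELECT UPPER(country) AS country, SUM(amount_cents) AS revenue_cents
        FROM raw_orders
        GROUP BY UPPER(country)
        """
    )
    return conn.execute(
        "SELECT country, revenue_cents FROM elt_revenue ORDER BY country"
    ).fetchall()


if __name__ == "__main__":
    warehouse = sqlite3.connect(":memory:")
    print(run_etl(warehouse))
    print(run_elt(warehouse))
```

Both functions end up with the same aggregated table. The practical difference is that the ELT variant keeps the raw rows in the warehouse, so the transformation can be changed and re-run later without extracting the data again.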
And here’s where the problems begin.
The data integration problem
There used to be an open source solution for creating connectors: the Singer protocol, created and maintained by Stitch.
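For readers who have not seen it, the core idea of Singer is that a connector (a "tap") writes JSON messages to standard output, one per line, and a loader (a "target") reads them from standard input. The sketch below is a rough illustration of that message flow using only the Python standard library rather than any Singer helper package. The stream name, the rows, the function names and the shape of the bookmark inside the STATE message are assumptions made for the example.

```python
import json
import sys

# Hypothetical stream name and rows, used only for illustration.
STREAM = "users"
ROWS = [
    {"id": 1, "email": "ada@example.com"},
    {"id": 2, "email": "linus@example.com"},
]


def build_messages(stream, rows):
    """Build the SCHEMA, RECORD and STATE messages a tap would emit."""
    messages = [
        {
            "type": "SCHEMA",
            "stream": stream,
            "key_properties": ["id"],
            "schema": {
                "type": "object",
                "properties": {
                    "id": {"type": "integer"},
                    "email": {"type": "string"},
                },
            },
        }
    ]
    for row in rows:
        messages.append({"type": "RECORD", "stream": stream, "record": row})
    # The STATE value is free-form; a "last seen id" bookmark is an assumption here.
    last_id = max((row["id"] for row in rows), default=None)
    messages.append(
        {"type": "STATE", "value": {"bookmarks": {stream: {"last_id": last_id}}}}
    )
    return messages


def emit(messages, out=sys.stdout):
    """Write one JSON message per line, as a tap does on stdout."""
    for message in messages:
        out.write(json.dumps(message) + "\n")


if __name__ == "__main__":
    emit(build_messages(STREAM, ROWS))
```

Because the contract is just lines of JSON on a pipe, any tap can in principle be combined with any target, which is what made the protocol attractive as a shared standard for connectors.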
Since the acquisition of Stitch by Talend, this protocol has not been updated as actively as it used to be.
This creates a situation where on one hand you have expensive commercial solutions (Talend, Matillion, Fivetran, …) and on the other hand you have disorganized and heterogeneous ad-hoc solutions, built by data engineers who write their own connectors for special purposes.
Even Fivetran’s CEO pointed out this situation.
The good news is that where there’s a problem, there’s a market. And where there’s a market, entrepreneurs will find a way to address it. Among those entrepreneurs are the founders of Airbyte, a very young company that just raised $5.2M to tackle this exact problem.
Their view of the dilemma between Open-Source and Commercial solutions is well summarized in their article on the subject.
The new kids on the block
However, they are not the only ones who decided to provide a solution to this problem. Other companies have been working on it as well. The two most notable ones are:
- GitLab, with their solution Meltano
- Wise (formerly TransferWise), with PipelineWise
They all provide different solutions and different perspectives.
Conclusion
The purpose of this article is to explain one of the many problems companies face when it comes to data engineering. In recent months, new players have emerged trying to solve the integration problem, and at some point it’s important to choose one. In the next article we will share the criteria that we used at Meister to do so.