Why use an open source framework for data engineering

As an article written by the ManoMano team pointed out, many companies think about solutions without first defining the right problem.

Our approach was different:

  1. Define the problems we are trying to solve. Are they important? Complex? What is the potential impact in the short term? And in the long term?
  2. Define the characteristics (requirements) of the tool that we need
  3. Index the tools that exist on the market and see which one(s) best fit the requirements
  4. Analyse the return on investment of each solution when multiple solutions have similar features

In our case, the problem has these characteristics:

  • We do not require many connectors
  • The data we generate internally is considerable: we have already hit 1.5 billion data points counting only one of our products
  • We have a freemium model, so a significant amount of the data we manipulate does not generate an immediate financial return, which affects the short- to medium-term ROI we are looking for
  • We did not want to be locked into a single cloud platform vendor

These elements inclined us towards an open source solution. However, we didn’t want to fall into the trap of the “easiest solution”.

One of the founders of Airbyte accurately describes the situation in a blog post: “The truth is that in a lot of enterprises, it can be easier to justify hiring a new employee than getting even a modest budget approved for an external vendor. Depending on the open-source project, most or all of the job could be already done. That’s one of the reasons why most companies use open-source technologies rather than external vendors.”

However, in our case we felt pretty confident that, given the small number of connectors we would need to create, we could tackle the problem without significant engineering overhead.

Among the different solutions that we found on the market, the three that were the most appealing were:

  • PipelineWise
  • Meltano
  • Airbyte

Since we are the data team, we decided to take a measurable approach: we created a spreadsheet where we carefully selected the features we would require from the tool and compared those three against it.

The use case

In order to test the different solutions in front of us, we took a very pragmatic approach: we defined what we wanted to do and tried to do it with all of them. By doing so, we measured the amount of effort required to complete the task.

Moving forward, we decided to solve a medium-difficulty task: it would differentiate between the alternatives better than an easy one, without requiring a lot of time. A hard task might be even more differentiating, but it would also take more time per se, making the test time-consuming independently of what we were trying to measure.

The data that we wanted to move was from Hubspot to BigQuery.

In the next articles I’ll describe in detail how this pipeline ended up being built.

The comparison

In order to better understand the differences between the solutions, we followed the same procedure for each:

  • Subscribed to and participated in each solution’s Slack channel
  • Installed the framework on a GCP Compute Engine VM instance
  • Tried to create and launch the pipeline

PipelineWise

👍 The good:

  • The compatibility with Singer protocol
  • The quality of the taps and targets, considered a very good reference among Singer protocol users
  • The simplicity of integration

🙁 The less good:

  • The documentation is much less explicit than that of the alternatives
  • No integration with DBT for the T process
  • There was no BigQuery target
  • Orchestration doesn’t come out of the box

Meltano

👍 The good:

  • The compatibility with the Singer protocol
  • Relatively big community in Slack
  • Developer oriented, with a good CLI experience
  • DBT integration out of the box
  • Orchestration easy to put in place
  • The Singer SDK (work in progress), whose objective is to standardize tap creation, making the ecosystem more homogeneous while accelerating the creation of new taps
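To make these points concrete, here is a sketch of what a `meltano.yml` tying the extractor, loader, and DBT transformer together might look like. The plugin variants, setting names, and placeholder project are assumptions; check the current Meltano docs for the real ones.

```yaml
# Hypothetical meltano.yml fragment: Hubspot extractor, BigQuery loader,
# and dbt transformer declared together. Names and settings are assumed.
plugins:
  extractors:
    - name: tap-hubspot
  loaders:
    - name: target-bigquery
      config:
        project_id: my-gcp-project   # placeholder
        dataset_id: raw_hubspot      # placeholder
  transformers:
    - name: dbt
```

With such a file in place, a command along the lines of `meltano elt tap-hubspot target-bigquery` runs the pipeline, and an orchestrator can be declared as just another plugin.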

🙁 The less good:

  • As a very new project, many features are still buggy, and with a small team the speed of development is relatively low
  • You still depend on the heterogeneous landscape of taps and targets, which makes finding the right combination during onboarding painful. In our case we used a target that was not adapted to the Hubspot tap, ending up with tables with an unreasonable number of columns [screenshot ] because of the JSON structure of Hubspot API responses
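To illustrate why nested JSON responses blow up into wide tables, here is a small Python sketch. The record shape is made up (merely Hubspot-like); the point is that a naive target flattens every leaf key into its own column.

```python
# Sketch of how flattening nested JSON explodes into many columns.
# The record below is a made-up, Hubspot-like CRM object.
def flatten(obj, prefix=""):
    """Recursively flatten a nested dict into dot-separated column names."""
    cols = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            cols.update(flatten(value, prefix=name + "."))
        else:
            cols[name] = value
    return cols

record = {
    "id": "101",
    "properties": {
        "email": "a@example.com",
        "analytics": {"source": "ORGANIC", "num_visits": 7},
    },
}

cols = flatten(record)
print(len(cols), sorted(cols))  # one tiny record already yields 4 columns
```

Real Hubspot objects carry hundreds of properties, so a target that flattens this way produces tables with hundreds of columns.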

Airbyte

👍 The good:

  • Relatively big community in Slack
  • A more user-friendly interface for non-technical people
  • DBT integration, but not as easy as with Meltano
  • They raised money, so the team will grow and accelerate
  • The will to create a new standard for taps and targets, becoming the DBT of connectors

🙁 The less good:

  • Because of the fundraising, a pivot or change in business model could create vendor lock-in, since their protocol is not compatible with Singer’s out of the box
  • Debugging is complicated
  • The speed of some pipelines was very slow
  • We had to disable normalization, which dumped raw JSON objects into BigQuery and required a more complicated transformation [screenshot ]
  • The containerization of the system is complex because it relies on docker-compose spinning up many Docker images
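As a sketch of what that extra transformation implies: with normalization disabled, each record lands in the warehouse as one raw JSON blob, and every field has to be extracted afterwards (in SQL, DBT, or code). The payloads below are made-up, Hubspot-like examples.

```python
import json

# Sketch: rows arrive as raw JSON strings (Airbyte keeps them in a raw
# data column), so fields must be parsed out after loading.
raw_rows = [
    '{"id": "101", "properties": {"email": "a@example.com"}}',
    '{"id": "102", "properties": {"email": "b@example.com"}}',
]

parsed = [json.loads(row) for row in raw_rows]
emails = [record["properties"]["email"] for record in parsed]
print(emails)
```

In practice this parsing would live in a BigQuery SQL view (using its JSON functions) or a DBT model rather than Python, but the extra step exists either way.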

Here’s a screenshot of recent (25/02/2021) user feedback about their experience with Airbyte.

Conclusion

As per our analysis, we decided to go with Meltano. Why?

  • Because it allows us to use the community taps without rewriting them
  • We can always start with the Singer protocol and move to Airbyte if it grows, but the other way around doesn’t work; if Airbyte ends up not evolving as expected, we would need to rewrite the code twice
  • The developer experience was better overall

 
 
