
There are so many tools out there that help with your data projects. Some are free and open-source, and some you have to buy. Should you use open-source or vendor data tools? It depends on your situation. There is no right or wrong answer to this question.
In this post, I’ll spill my thoughts on what it means to use open-source solutions vs vendor solutions.
Open-Source Solutions
Open-source solutions are those whose source code is publicly available, usually at no cost. You can generally use an open-source tool however you want. Take the pandas library in Python: it’s open-source, so you can use it for data processing and analysis for free. There might be some limitations on usage depending on the tool’s license, but generally speaking, open-source means you can use it for free. If you think about it, many programming languages are open-source too.
When I was first getting started in the data space, and in tech in general, I was surprised to learn that people develop tools and software for others to use at no cost. I thought, what a great world we live in. Later I realized this is only half true.
Some open-source tools are developed with a plan that their extended features will be available to paid users only. Nowadays, many companies do that. They release an open-source tool and, while it gains users, actively develop paid features on top of it. That is how they make money. I personally don’t have any objection to this approach. They’re giving away a free tool anyway, so they’re free to build extra features to make money.
Going back to open-source software (OSS) in general, if you’re trying to decide whether to adopt an open-source tool at your company, I’d suggest you look at these elements:
- How many users and how many contributors are there?
- How actively is the project being developed?
- Are there any companies supporting the tool financially?
- How complete is their documentation?
- How easy is it to get support from the community?
At work, you’re there to help your company make money. So whatever you use to support that needs to be reliable. Some open-source tools are developed by a single person with no financial support. Others have companies backing them financially. Make sure you look at these elements before committing to a tool.
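If it helps to make the checklist above concrete, here is a minimal sketch that turns it into a rough score. Everything in it is an assumption for illustration: the `ProjectSignals` fields, the `health_score` function, and especially the thresholds (10 contributors, 180 days) are hypothetical, and I only score contributors, not users. Treat it as a conversation starter, not a real metric.

```python
from dataclasses import dataclass


@dataclass
class ProjectSignals:
    """Hypothetical snapshot of an open-source project's health."""
    contributors: int              # people who have committed code
    days_since_last_release: int   # proxy for how actively it is developed
    has_company_backing: bool      # any company supporting it financially
    docs_complete: bool            # your own judgment after reading the docs
    community_responsive: bool     # your own judgment after asking a question


def health_score(p: ProjectSignals) -> int:
    """Return a 0-5 score: one point per checklist item that looks healthy.

    The thresholds are arbitrary assumptions; tune them to your risk tolerance.
    """
    checks = [
        p.contributors >= 10,
        p.days_since_last_release <= 180,
        p.has_company_backing,
        p.docs_complete,
        p.community_responsive,
    ]
    return sum(checks)


# Two made-up projects: a solo side project and a company-backed one.
solo_project = ProjectSignals(1, 400, False, True, False)
backed_project = ProjectSignals(250, 14, True, True, True)
print(health_score(solo_project), health_score(backed_project))
```

A low score doesn’t mean you can’t use the tool. It just means you should know who will fix it when it breaks.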
Vendor Solutions
Vendor solutions, on the other hand, make it easy for you to get up and running right away with the tool, removing the majority of the heavy burden of implementing and maintaining the solution. Their products or services tend to be more hands-off, requiring less effort on your end to get the job done. They even offer you some extended support that you can use to troubleshoot any issues you encounter. In exchange, you pay the vendor a licensing fee, or you pay for the resources (e.g. CPU, RAM) you use.
One caveat is that, even though vendors tell you that you’ll do less work, you may still have to implement custom solutions depending on your needs. One time, a client I was working with was using Fivetran for data ingestion, but they had a source system that Fivetran didn’t have a connector for. What did I do? I had to develop a custom Python script to ingest that data. Fivetran offers a way to integrate custom scripts so that it can trigger them on a schedule, so I had to build my script with that integration in mind.
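To give a feel for what “building with that integration in mind” can look like, here is a minimal sketch of an incremental sync function that a scheduler could call repeatedly. This is not Fivetran’s actual interface and not the script I wrote for that client. The names `sync`, `fetch`, and `fake_source`, the `updated_at` field, and the shape of the returned dict are all assumptions made up for illustration.

```python
import json
from typing import Callable, Iterable

# Hypothetical record type: each source row is a dict with an "updated_at" field.
Record = dict


def sync(fetch: Callable[[str], Iterable[Record]], state: dict) -> dict:
    """Run one incremental sync.

    `fetch(cursor)` stands in for whatever call reads the source system and
    returns rows changed after `cursor`. `state` carries the cursor between
    scheduled runs so that each run only pulls new data.
    """
    cursor = state.get("cursor", "")
    rows = sorted(fetch(cursor), key=lambda r: r["updated_at"])
    # Advance the cursor to the newest row seen, or keep it if nothing changed.
    new_cursor = rows[-1]["updated_at"] if rows else cursor
    return {"state": {"cursor": new_cursor}, "rows": rows}


def fake_source(cursor: str) -> list[Record]:
    """Made-up source system used only to exercise the sketch."""
    data = [
        {"id": 1, "updated_at": "2024-01-01"},
        {"id": 2, "updated_at": "2024-01-05"},
    ]
    return [r for r in data if r["updated_at"] > cursor]


first = sync(fake_source, {})               # first run pulls everything
second = sync(fake_source, first["state"])  # second run finds nothing new
print(json.dumps(first["state"]), len(second["rows"]))
```

The point of the design is that the script itself stays stateless. Whatever triggers it is responsible for handing back the state from the previous run.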
Since you pay for a solution, you tend to get a better support system. If you’re using a popular tool like Databricks, Snowflake, or dbt, there are online communities for each of them where you can ask questions and get answers from other users. You may also get to ask the vendor’s representatives questions, depending on your licensing tier.
One thing I have to mention about using a vendor solution is that you’ll want to look at how easy it is to get out of it. For example, vendors usually make it easy to get data into their platform, but oftentimes it’s not as easy to get the data out. Open table formats like Delta Lake and Iceberg are making this part more open, but if you’re not careful, you’ll experience what’s called vendor lock-in, meaning you can’t leave even though you want to.
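One way to keep an exit door open is to keep your own logic separate from the vendor-specific reading and writing. Here is a minimal sketch of that idea, using only the Python standard library. The `Sink` interface, `CsvSink`, `clean`, and `run_pipeline` are hypothetical names I made up, and CSV is just a stand-in for whatever open format fits your case.

```python
import csv
import io
from typing import Iterable, Protocol


class Sink(Protocol):
    """Anything that can receive rows: a vendor platform, a file, etc."""
    def write(self, rows: Iterable[dict]) -> None: ...


class CsvSink:
    """Writes rows as CSV, an open format that any tool can read."""
    def __init__(self, buffer: io.StringIO) -> None:
        self.buffer = buffer

    def write(self, rows: Iterable[dict]) -> None:
        rows = list(rows)
        if not rows:
            return
        writer = csv.DictWriter(self.buffer, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


def clean(rows: Iterable[dict]) -> list[dict]:
    """Vendor-independent logic: drop rows that have no customer id."""
    return [r for r in rows if r.get("customer_id")]


def run_pipeline(rows: Iterable[dict], sink: Sink) -> None:
    """The pipeline only knows about the Sink interface, not the vendor."""
    sink.write(clean(rows))


out = io.StringIO()
run_pipeline(
    [{"customer_id": "a1", "amount": 10}, {"customer_id": "", "amount": 5}],
    CsvSink(out),
)
print(out.getvalue())
```

If you later switch vendors, you write a new sink and `clean` stays exactly as it is.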
Conclusion
I suggest you look at the data tooling landscape holistically before picking a tool. There are many tools out there, and there may be one that fits your exact need at an economical price. Starting with open-source tools is also great. You can test out features and understand what the tool has to offer before spending a lot of money. With open-source tools, your data and potentially your code logic are likely to be portable as well, in case you decide to use another tool after a proof of concept (PoC). If you go with a vendor solution, make sure you have some open-source components if possible, to prepare yourself for a future migration (which will happen).