By Vincent Huysmans – Data Engineer
Hi, I’m Vincent Huysmans – data engineer at InfoFarm. You might remember me from the previous blogs ‘streaming-first as data architecture’ and ‘streaming-first with Apache Flink’. In this blog, I will add to the data platform blog series by explaining why ‘serverless’ data platforms are powerful, scalable and cost-efficient.
What is “Serverless”?
First things first, let me give you a quick explanation of what we mean by the term “serverless”. When we talk about serverless, we don’t mean that you don’t need servers to run your applications. It is impossible to magically use an application without hardware. With the serverless model we mean that we eliminate the need to manage servers and focus only on building and using the application.
This model of course requires a cloud-native approach. On-premises servers need to be managed, so they are not serverless. Instead of managing our own servers, we hand that responsibility to the cloud provider. They will handle the routine work of provisioning, maintaining, and scaling the server infrastructure. Developers don’t need to be expert server administrators to deploy and run their code. They can simply package their code, hand it to the corresponding service, and they are good to go.
Once the developer has deployed their application, the cloud vendor will automatically provision and scale the needed server infrastructure based on demand. When there is no demand and the application sits idle, you pay little or nothing for compute, thanks to the serverless ‘pay-as-you-go’ model.
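To make “package their code and give it to the service” a bit more concrete, here is a minimal sketch of what such a unit of code might look like. The `handler(event, context)` signature follows the convention AWS Lambda uses for Python functions; the event shape (a `records` list with an `amount` field) is purely hypothetical and only serves as an illustration.

```python
import json


def handler(event, context):
    """Entry point the platform would call once per incoming event.

    The event layout used here ("records" with an "amount" field) is an
    assumption made for this sketch, not a real service contract.
    """
    records = event.get("records", [])
    total = sum(record.get("amount", 0) for record in records)
    # Return a small JSON summary; the platform decides where it runs.
    return {
        "statusCode": 200,
        "body": json.dumps({"count": len(records), "total": total}),
    }


if __name__ == "__main__":
    # Local call for illustration only; no server setup is involved.
    sample_event = {"records": [{"amount": 5}, {"amount": 7}]}
    print(handler(sample_event, None))
```

Note that nothing in this code mentions machines, ports or scaling. Those concerns are left to the service that hosts the function.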
As you can see, this serverless paradigm gives developers many advantages such as eliminating operational overhead and reducing cost. But can this model also be used when building data platforms? Can data engineers benefit from the same advantages as application developers?
What is a Serverless Data Platform?
The answer to the questions above is of course “yes”. Data platforms can benefit from the same serverless advantages as described above. As mentioned in a previous blog on data platforms, a data platform consists of different components. Each component has its own responsibility, such as data ingestion, data storage, data catalog, data transformation or data analysis. Each of these components can nowadays be covered by numerous serverless services provided by your favorite cloud vendor.
So, a serverless data platform is nothing more than a collection of serverless services that cover all the layers of your data platform. This platform has the advantage that you only pay for what you use, and that it doesn’t require any operational setup and maintenance. We just tell the service which data to use and how we want to process it, and the platform takes care of the rest.
“Serverless”: the silver bullet for data platforms?
“Is a serverless data platform always the right fit for your organization?” – That’s a question we should be asking ourselves before building a serverless data platform.
When constructing a serverless data platform, we need to take a few things into consideration:
The (in)famous 3 V’s: Volume, velocity and variety. When choosing the right tools for your data platform, you need to keep these three concepts in mind, even if you’ve heard them so many times already. What will the data look like? When will the data be available, and at what rate will it come in? Is my data platform required to support stream processing, batch processing, or both?
Operational complexity: One of the biggest advantages of choosing a serverless data platform is that it eliminates a lot of operational overhead. This will not only save budget on maintenance, but also on the initial setup of the platform. In the past, setting up the first version of a big data platform would take months, if not years. Now it can be completed within a few days. But can we really give away all this control? Isn’t it also necessary to adapt our infrastructure ourselves? And if so, do we have the resources to set up and maintain our own servers?
Cost: Does going serverless always result in a lower cost for my use-cases? Can we afford to trade in operational overhead for a higher usage cost? How can we limit the cost when data spikes occur?
Vendor lock-in: Going entirely serverless means locking yourself in to a particular cloud vendor, as they will be responsible for hosting, provisioning and managing your resources. For many organizations this is difficult, as they must be able to switch to another vendor whenever necessary, and porting to another vendor’s platform can take a large amount of effort and cost. Such organizations need solutions that give them more control over their applications and the underlying infrastructure, which makes a possible change of vendor easier.
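The cost question in the list above comes down to simple arithmetic: an always-on server costs the same whether it is busy or idle, while a pay-as-you-go service costs more the more you use it. The sketch below compares the two and computes the break-even point. All prices and the “unit” of usage are made-up numbers for illustration, not real vendor pricing.

```python
HOURS_PER_MONTH = 730.0  # common approximation of hours in a month


def always_on_cost(hourly_rate: float, hours: float = HOURS_PER_MONTH) -> float:
    """Monthly cost of a server that runs 24/7, regardless of usage."""
    return hourly_rate * hours


def serverless_cost(price_per_unit: float, units: float) -> float:
    """Monthly cost when paying only per unit of work (hypothetical unit)."""
    return price_per_unit * units


def break_even_units(hourly_rate: float, price_per_unit: float,
                     hours: float = HOURS_PER_MONTH) -> float:
    """Usage level at which both models cost the same."""
    return always_on_cost(hourly_rate, hours) / price_per_unit


if __name__ == "__main__":
    # Illustrative numbers only: 0.50 per server hour, 0.25 per unit of work.
    print(always_on_cost(0.5))          # 365.0
    print(serverless_cost(0.25, 400))   # 100.0
    print(break_even_units(0.5, 0.25))  # 1460.0
```

With these assumed prices, serverless is cheaper below 1460 units of work per month and more expensive above it. That is why steady, high-volume workloads and unexpected data spikes deserve a closer look before you commit.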
But before sounding too pessimistic: I still think that in most cases a serverless architecture is a good fit for building a data platform. Cloud providers nowadays offer a wide range of serverless services that can be helpful to data engineers. Rather than running traditional 24/7 applications for their data platform, they can now create a data platform consisting of fully managed serverless services. Taking advantage of all the characteristics of serverless, data engineers can concentrate on what is really important: the data.
In the upcoming posts, we will dive deeper into building a Serverless Data Platform on both AWS and Microsoft Azure – stay tuned!
Want to start building your own platform straight away? Take a look at the InfoFarm One Day Data Platform, a reference architecture for both AWS and Azure. We will get you going with a fully operational data platform in only one day!
If you’re excited about our content, make sure to drop us a follow on LinkedIn and stay informed about the next blog in this series. Interested in what a data platform would look like for your organization? Book a meeting with one of our data architects and we’ll tell you all about it!