By Ben Vermeersch – Managing Partner
Making the wrong choice of platform. The horror scenario for every IT Manager or Architect. We are afraid of buying into the wrong ecosystem and this results in lengthy studies, doubt, standstill and not much being built at all.
In this blog post, we’ll shed light on the differences between cloud-native Data Platforms on Amazon Web Services (AWS) and Microsoft Azure so you can make an informed decision and start exploring your data as quickly as possible.
Want to jump to conclusions? At the end I’ll give you a decision tree to simplify your platform choice. But first, let’s explore the differences between Data Platforms on AWS and Azure.
Tomato – Tomato
This blog could be really, really short:
There are no real differences between Data Platforms in AWS or Azure.
Differences are limited. End of blog :)
But let me elaborate. Almost every service in either AWS or Azure has a counterpart in the competitor’s offering. Microsoft even has a nice overview in their documentation.
All core services and functionalities you’ll need will be there, either in AWS or Azure. Storage? You’ll use Amazon S3 or Azure Data Lake Storage. Ingest data from an API? AWS Lambda or Azure Functions will do the job. That nifty Spark ETL job? You can execute that in AWS Glue or Azure Synapse. Even the price point is fairly similar.
Even if one cloud provider releases a compelling new feature, you can rest assured that their competitor will release very similar functionality shortly afterwards.
So, for the core functions of your Data Platform, both competitors have got you covered.
That being said, there are some smaller features where one platform has, at the time of writing, the upper hand over its competitor. In the next sections I’ll highlight some of the current functional differences between AWS and Azure.
Cell-level and row-level Security
A common paradigm used by many enterprises is to restrict data access based on the user profile or the organization to which the user belongs. Previously, you had to enforce this by duplicating the original data or by creating materialized and non-materialized views on filtered datasets. However, these solutions often break the concept of a single source of truth and result in write amplification, which doubles or triples storage. The large number of copies also makes the platform more complex and harder to manage.
AWS Lake Formation supports simple row-level security and cell-level security:
- Basic row-level security allows you to specify filter expressions that limit access to specific rows of a table to a user.
- Cell-level security builds on row-level security by allowing you to hide or show specific columns along with providing access to specific rows.
In Azure Synapse, row-level and column-level security are supported in the dedicated SQL pool (formerly SQL DW), but not in the Apache Spark pool or the serverless SQL pool.
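To make the difference between row-level and cell-level security concrete, here is a minimal sketch in plain Python. It does not call AWS or Azure: the function `apply_data_filter`, the sample table and its column names are all hypothetical and only illustrate the idea of filtering a single copy of the data at read time instead of duplicating it per audience.

```python
from typing import Callable, Iterable, Optional

Row = dict  # one record, keyed by column name


def apply_data_filter(
    rows: Iterable[Row],
    row_filter: Callable[[Row], bool],
    visible_columns: Optional[list] = None,
) -> list:
    """Return the rows the filter admits, trimmed to the visible columns.

    visible_columns=None means row-level security only: every column
    of an admitted row is returned.
    """
    result = []
    for row in rows:
        if not row_filter(row):
            continue  # row-level security: the user never sees this row
        if visible_columns is None:
            result.append(dict(row))
        else:
            # cell-level security: keep the row, expose only some columns
            result.append({col: row[col] for col in visible_columns})
    return result


# Hypothetical sample table; names and values are made up for illustration.
CUSTOMERS = [
    {"id": 1, "country": "BE", "name": "Ann", "iban": "BE00 1111"},
    {"id": 2, "country": "NL", "name": "Bram", "iban": "NL00 2222"},
    {"id": 3, "country": "BE", "name": "Cas", "iban": "BE00 3333"},
]


def is_belgian(row: Row) -> bool:
    return row["country"] == "BE"


# A Belgian account manager sees only Belgian customers (row-level) ...
row_level = apply_data_filter(CUSTOMERS, is_belgian)
# ... and a support agent additionally gets the IBAN column hidden (cell-level).
cell_level = apply_data_filter(CUSTOMERS, is_belgian, ["id", "name"])
```

The source table is stored once and stays untouched, and each audience gets its own view of it at query time. That is what lets you drop the duplicated datasets described above.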
Database Migration Service
The AWS Database Migration Service does more than migrate an entire database to an AWS database or to S3. It also lets you do change data capture (CDC) on the source database and write only those changes to a Kinesis data stream or to S3.
We use this as an easy way to keep a source and reporting DB in sync, or to rebuild individual transactions instead of relying on daily snapshots of a database.
Azure limits CDC to SQL Server databases. Other sources require third-party software such as Oracle GoldenGate or Kafka Connect as a workaround.
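As a rough illustration of why a change feed is more useful than daily snapshots, the sketch below replays a list of change events on top of an initial snapshot. The event layout (an `op` code of `I`, `U` or `D` plus the affected row) is an assumption made for this example. The records a real CDC tool writes depend on the tool and its settings.

```python
def replay_changes(snapshot, changes, key="id"):
    """Apply ordered change events to a snapshot and return the new state.

    snapshot: list of rows (dicts) as they were at the initial full load.
    changes:  ordered list of {"op": "I" | "U" | "D", "row": {...}} events.
    """
    state = {row[key]: dict(row) for row in snapshot}
    for change in changes:
        op, row = change["op"], change["row"]
        if op in ("I", "U"):
            # Insert and update both leave the latest version of the row.
            state[row[key]] = dict(row)
        elif op == "D":
            state.pop(row[key], None)
        else:
            raise ValueError(f"unknown operation: {op!r}")
    return sorted(state.values(), key=lambda r: r[key])


# Hypothetical orders table and change feed, for illustration only.
initial_load = [
    {"id": 1, "status": "new"},
    {"id": 2, "status": "new"},
]
change_feed = [
    {"op": "U", "row": {"id": 1, "status": "paid"}},
    {"op": "I", "row": {"id": 3, "status": "new"}},
    {"op": "D", "row": {"id": 2}},
]

reporting_table = replay_changes(initial_load, change_feed)
```

Because every intermediate event is kept, you can also replay the feed up to any point in time. A daily snapshot would only ever show the end state of the day.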
Redshift Spectrum – Query your data lake and data warehouse simultaneously
In our standard Data Platform architecture, we aim to store as much data as possible in the lake as it is the most cost-efficient option for infrequent analytical workloads.
More frequently accessed data will live in your data warehouse. But what if you want to combine both sources? Instead of having to build an entirely new dataset in either your lake (S3 + Athena) or warehouse (Redshift), you can use Redshift Spectrum to query data in your Data Lake and Redshift Data Warehouse simultaneously.
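The idea of one query spanning two stores can be tried out locally. The sketch below is only an analogy using Python’s built-in `sqlite3` module, not Redshift syntax: the main database plays the role of the warehouse, an attached database named `lake` plays the role of the data lake, and all table names and rows are made up. In Redshift itself, the lake side is exposed through an external schema rather than an attached database.

```python
import sqlite3

# The main in-memory database stands in for the warehouse ...
conn = sqlite3.connect(":memory:")
# ... and a second, attached database stands in for the data lake.
conn.execute("ATTACH DATABASE ':memory:' AS lake")

conn.execute("CREATE TABLE main.customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE lake.clickstream (customer_id INTEGER, page TEXT)")

conn.executemany(
    "INSERT INTO main.customers VALUES (?, ?)",
    [(1, "Ann"), (2, "Bram")],
)
conn.executemany(
    "INSERT INTO lake.clickstream VALUES (?, ?)",
    [(1, "/home"), (1, "/pricing"), (2, "/home")],
)

# One statement joins "warehouse" and "lake" data without copying either side.
query = """
SELECT c.name, COUNT(*) AS page_views
FROM main.customers AS c
JOIN lake.clickstream AS k ON k.customer_id = c.id
GROUP BY c.name
ORDER BY c.name
"""
page_views = conn.execute(query).fetchall()
```

Neither dataset had to be rebuilt in the other store to answer the question. The join happens at query time.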
Microsoft Purview
Knowing what data is in your data platform and knowing what it means is really important for data-driven organizations. Microsoft Purview is a unified data governance solution that describes all assets in your Data Platform. Not only can it automatically list all available data across different sources and systems in your Data Platform, but you can also couple this to a Glossary with a business and technical description of your data fields. You can assign Data Stewards (integrated with your Active Directory), automatically classify your data, and apply sensitivity labels.
In the AWS Glue Data Catalog, the AWS counterpart, functionality is limited to an overview of all your tables and columns in the Data Platform, sensitivity info and not much more. For more complete Data Governance options, you need to turn to third-party tools like Apache Atlas (on which Purview is based) or Amundsen.
Microsoft Power BI
Building reports and visualizations on top of your data is an important task in every Data Platform. Microsoft offers Power BI, a powerful Business Intelligence tool that has been around for years and runs on Azure as well. It has many types of visualizations, is highly configurable and is very well known within the Business Intelligence community. Microsoft is also integrating it more and more within its Office 365 offering, making report sharing very easy.
Microsoft Power BI is without any doubt the most feature-complete Business Intelligence option. It doesn’t come cheap, however.
AWS does offer Amazon QuickSight as an alternative. It uses a pay-per-use model instead of fixed licensing fees and is a lot cheaper than Microsoft Power BI. All core functionality you would expect from a BI Reporting tool is there, but it’s definitely not up there with the likes of Power BI or Tableau.
But since there’s nothing keeping you from using your BI tool of choice with any Data Platform, we don’t see this as a unique selling point for Azure-based Data Platforms.
Buyer’s guide – AWS or Azure?
Knowing all this: should you choose AWS or Azure as your vendor of choice for building your Data Platform?
From a functional point of view, it doesn’t really matter. You won’t miss out on any features when choosing either platform, and the competitor usually catches up quickly on minor advantages.
Look at your current landscape
The first question you should ask yourself: what does your current IT landscape look like? Are you already using AWS or Azure for other operational services within your organization? Then it is best to stick to one platform, making everything easier to maintain.
One exception to this rule I want to add is Azure Active Directory (and Office 365). Almost every organization makes use of these services, and they integrate really well with AWS. So, if those are the only Azure services you use and you have no specific intention of using other Azure services, that’s not a showstopper for using AWS.
Look at your skills
If you have the luxury of starting from scratch, you can base your decision on arbitrary things like the ease of use of the interface, or the color scheme of the console. But what matters most is that you look at your people. Who will be building and maintaining the platform? What is their skillset? Do they have any prior experience with either platform?
Or if you don’t have the people within your organization: which cloud experts can you easily source externally? And what is their cost?
After all, the skills of your development team determine how well your data platform will perform.
Cloud Data Platform Choice Decision Tree
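The questions from this buyer’s guide can also be written down as a few lines of Python. This is a minimal sketch of the reasoning in the text above, not a reproduction of the decision tree figure. The function name `choose_platform`, its parameters and the returned labels are made up for illustration.

```python
from typing import Optional, Tuple

PLATFORMS = ("aws", "azure")


def choose_platform(
    current_cloud: Optional[str],
    team_skills: Optional[str],
    external_experts: Optional[str] = None,
) -> Tuple[str, str]:
    """Return (platform, deciding factor) for a new data platform.

    current_cloud:    cloud already used for operational services, or None.
                      Pass None if Azure AD / Office 365 is all you use on Azure.
    team_skills:      platform your own people have experience with, or None.
    external_experts: platform for which you can easily and affordably
                      source experts externally, or None.
    """
    questions = [
        (current_cloud, "existing landscape"),
        (team_skills, "in-house skills"),
        (external_experts, "external experts"),
    ]
    # Walk the questions in order of priority; the first clear answer decides.
    for answer, reason in questions:
        if answer in PLATFORMS:
            return answer, reason
    # Functionally both platforms will do, so nothing forces the choice.
    return "either", "no deciding factor"


# An organization that only uses Office 365 and has a team with AWS experience:
decision = choose_platform(current_cloud=None, team_skills="aws")
```

The order of the questions carries the message: an existing landscape outweighs skills, and skills outweigh personal preference for one console over the other.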
If you’re excited about our content, make sure to follow the InfoFarm company page on LinkedIn and stay informed about the next blog in this series. Interested in what a data platform would look like for your organization? Book a meeting with one of our data architects and we’ll tell you all about it!
Want to start building your own data platform straight away? Take a look at the InfoFarm One Day Data Platform, a reference architecture for both AWS and Azure. We get you going with a fully operational data platform in only one day! More info on our website.