Don’t take data for granted



We joked for a long time that the world ran out of data. It’s definitely the kind of statement that gets a rise out of people. But one could argue that, after the emergence of big data more than a decade ago, data eventually disappeared from the headlines in favor of AI, cloud and microservices. With the cloud making it almost trivial to pile up terabytes in object storage and spin compute cores up or down at will, it’s tempting to wonder whether we’re starting to take data for granted.

Data is more important than ever. It goes without saying that the so-called Three Vs of big data are no longer exceptional. Big data is so 2014 – in the 2020s we just call it ‘data’. And data comes from more sources and places than ever. That has led to a chicken-and-egg scenario as distributed databases become more commonplace: the cloud makes them possible, and the use cases for global deployment require them. And, by the way, did we mention the edge? In many cases, that data isn’t going anywhere, so the processing has to come to it.

There is no panacea for extending data processing to the edge. Getting to the edge means pushing a lot of intelligence down, because there won’t be enough bandwidth to bring in all the streams of data, much of it low density (e.g. instrument readouts) where the value comes only from aggregation. And at the back end, or shall we say the hub (in a distributed environment, multiple hubs), there will be a need to converge real-time data (e.g. streaming, data in motion) with historical data (e.g. data at rest).
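To make the bandwidth point concrete, here is a minimal Python sketch of the kind of pre-aggregation an edge node might do before anything crosses the network; the sensor name, one-minute window and summary fields are illustrative assumptions, not anything prescribed here.

```python
# Hypothetical edge-side pre-aggregation: instead of streaming every raw
# instrument readout to the hub, roll readings up into compact summaries.
from collections import defaultdict
from statistics import mean

def aggregate_readings(readings, window_seconds=60):
    """Group (sensor_id, timestamp, value) tuples into fixed time windows
    and keep only count/min/max/mean per sensor per window."""
    buckets = defaultdict(list)
    for sensor_id, ts, value in readings:
        window = int(ts // window_seconds)  # e.g. one-minute buckets
        buckets[(sensor_id, window)].append(value)

    summaries = []
    for (sensor_id, window), values in sorted(buckets.items()):
        summaries.append({
            "sensor": sensor_id,
            "window_start": window * window_seconds,
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "mean": round(mean(values), 3),
        })
    return summaries  # only these summary rows would cross the network to the hub

# Example: 120 raw readouts collapse into 10 one-minute summaries.
raw = [("temp-01", t, 20.0 + (t % 7) * 0.1) for t in range(0, 600, 5)]
print(aggregate_readings(raw))
```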

Eliminate data complexity

That’s been a dream since the early days of what we called big data, when the only practical solution was the Lambda architecture – separating the real-time and batch layers. As a result, streaming typically required separate platforms, and the results would then be ingested into the database or data lake. That was a complex architecture that required multiple tools, a lot of data movement, and extra steps to merge results.
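As a rough illustration of why that pattern was complex, here is a hedged Python sketch of the classic Lambda arrangement: a batch view computed over data at rest, a speed view over recent events, and an extra merge step at query time. The counters and function names are illustrative, not any particular product’s API.

```python
# Illustrative Lambda-style serving: the batch and speed layers are computed
# separately and must be merged whenever a query is answered.
from collections import Counter

def batch_view(historical_events):
    """Batch layer: periodically recomputed over all data at rest."""
    return Counter(e["key"] for e in historical_events)

def speed_view(recent_events):
    """Speed layer: incremental counts over events not yet in the batch view."""
    return Counter(e["key"] for e in recent_events)

def serve(key, batch, speed):
    """Serving layer: the extra merge step Lambda architectures require."""
    return batch.get(key, 0) + speed.get(key, 0)

historical = [{"key": "page_a"}] * 1000 + [{"key": "page_b"}] * 250
recent = [{"key": "page_a"}] * 7

print(serve("page_a", batch_view(historical), speed_view(recent)))  # 1007
```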

With the rise of cloud-native architecture, where we containerize, deploy microservices, and separate the data and compute layers, we are now bringing it all together and shedding complexity. Allocate some nodes as Kafka sinks, run change data capture on other nodes, persist data on others, and it’s all under one umbrella on the same physical or virtual cluster.
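As a sketch of what allocating ‘some nodes as Kafka sinks’ can look like, here is a minimal consumer that drains a change-data topic into a local SQLite table. The topic name, broker address and message shape are assumptions, and kafka-python is just one client library you could use.

```python
# Minimal sketch of a sink node: consume change events from Kafka and persist
# them locally, so streaming and stored data live under one umbrella.
import json
import sqlite3
from kafka import KafkaConsumer  # kafka-python, assumed to be installed

conn = sqlite3.connect("materialized_view.db")
conn.execute("CREATE TABLE IF NOT EXISTS customers (id TEXT PRIMARY KEY, doc TEXT)")

consumer = KafkaConsumer(
    "customer-changes",                    # hypothetical CDC topic
    bootstrap_servers=["localhost:9092"],  # assumed broker address
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

for msg in consumer:
    change = msg.value                     # e.g. {"id": "42", "after": {...}}
    conn.execute(
        "INSERT OR REPLACE INTO customers (id, doc) VALUES (?, ?)",
        (change["id"], json.dumps(change.get("after"))),
    )
    conn.commit()
```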

And because data moves globally, we have to worry about managing it. There are more and more mandates that data remain in its country of origin, and, depending on the jurisdiction, varying privacy rights and data retention requirements.

Indirectly, restrictions on data traffic across national borders raise the question of the hybrid cloud. There are other reasons for data gravity, especially with established back office systems managing financial and customer data, where the interdependencies between legacy applications can make it impractical to move data to a public cloud. Those well-anchored ERP systems and the like are the last frontier for cloud adoption.

Data lives on the edge

So on-premises data centers won’t disappear anytime soon, but increasingly, as HPE’s motto puts it, the cloud can come to you. The draw is the operational simplicity and flexibility of having a common control plane and the on-demand pricing model associated with public clouds. As we usher in the new decade, we predict that the 2020s will be the era of the Hybrid Default. Not surprisingly, HPE has more than doubled its on-demand hybrid/private cloud business year over year.

Demand for the cloud is not a zero-sum game; growing demand for hybrid cloud or private cloud is not at the expense of public cloud. And that’s where things get crazy, as cloud providers have built an increasingly mind-boggling array of choices.

When we last counted, AWS had over 250 services; looking at the data and analytics portfolio alone, there are 16 databases and 30 machine learning (ML) services. The burden is on the customer to put the pieces together – for example, when they use a service like Redshift or BigQuery and want data pipelines to ingest and transform data in motion, visualization to deliver ad hoc analytics and, of course, advanced machine learning.

Help is on the way. For example, you can now run ML models inside Redshift or BigQuery in some cases, and you can reach into other AWS or Google databases with federated queries. Azure, for its part, has strived with Synapse to be more of an end-to-end service, with the components built in or activated with a single click. But these are just opening shots – cloud providers, hopefully with an ecosystem of partners, need to put more of the pieces together.
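For instance, here is a rough Python sketch of training and scoring a model directly inside the warehouse with BigQuery ML; the dataset, table and column names are placeholders, and Redshift ML offers a broadly similar in-warehouse CREATE MODEL workflow.

```python
# Rough sketch: train and use a model inside BigQuery via BigQuery ML, so no
# separate ML platform or data export is needed.
from google.cloud import bigquery  # assumes credentials are already configured

client = bigquery.Client()

# Hypothetical dataset, table and column names.
client.query("""
    CREATE OR REPLACE MODEL `analytics.churn_model`
    OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
    SELECT tenure_months, monthly_spend, support_tickets, churned
    FROM `analytics.customers`
""").result()

# Score new rows with the in-warehouse model.
rows = client.query("""
    SELECT customer_id, predicted_churned
    FROM ML.PREDICT(MODEL `analytics.churn_model`,
                    (SELECT * FROM `analytics.new_customers`))
""").result()

for row in rows:
    print(row.customer_id, row.predicted_churned)
```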

The magic of data meshes

In all of this, we’ve skipped over one of the liveliest topics of the past year: the discussion of data meshes. They arose in response to the shortcomings of data lakes, namely that it is all too easy to lose or bury data, and they call for the teams that use the data to take active ownership of it. Against that, there are concerns that such practices may not scale, or may simply erect new data silos.

And so with all of this as a backdrop, we’re excited to set up a residency here at VentureBeat in the Data Pipeline, along with fellow partners in crime Andrew Brust and Hyoun Park. Hold on, let’s go for a ride.


This post, “Don’t take data for granted,” was originally published at https://venturebeat.com/2022/03/29/dont-take-data-for-granted/.
