We Open Source

A community education resource

June 18, 2024

The challenges and opportunities of building a modern enterprise data ecosystem on the cloud

Watch this presentation to find ways to unlock the full potential of cloud-based data management.

By Raja Chattopadhyay

This talk explores building a modern enterprise data ecosystem on the cloud. It covers data domain design, seamless data transfer, user experience enhancements, crucial roles, robust security, and the power of machine learning.

Key takeaways

Welcome to this presentation, where we delve into modern data architecture. In an era where data has become the cornerstone of innovation and competitive advantage, organizations worldwide face the need to develop modern data strategies. We will explore the latest technologies and frameworks that empower organizations to transform data into actionable insights.
The data lake is a simple storage layer where structured and unstructured data can be stored and consumed. However, this simple storage comes with a number of challenges. Addressing them is essential to optimizing the functionality and efficiency of the data lake.
Data writers are assigned specific areas within the data lake, designated for their respective suborganizations. Data within each suborganization is structured into distinct datasets. Within each dataset's designated area, data is organized into delta files and aggregates. Data is partitioned for efficient storage and retrieval.
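The layout described above can be sketched as a function that builds a storage prefix. This is only an illustration: the talk does not name a partition key or a path scheme, so the date-based partitioning, the `partition_path` helper, and the example names `payments` and `transactions` are all assumptions.

```python
from datetime import date
from pathlib import PurePosixPath

# Assumed: each dataset area holds exactly these two kinds of data.
KINDS = ("delta", "aggregates")


def partition_path(suborg: str, dataset: str, kind: str, day: date) -> PurePosixPath:
    """Build the storage prefix for one day of one dataset (hypothetical scheme)."""
    if kind not in KINDS:
        raise ValueError(f"kind must be one of {KINDS}, got {kind!r}")
    return PurePosixPath(
        suborg,                    # area assigned to the writing suborganization
        dataset,                   # distinct dataset within that area
        kind,                      # delta files or aggregates
        f"year={day.year}",        # assumed date-based partition keys
        f"month={day.month:02d}",
        f"day={day.day:02d}",
    )


path = partition_path("payments", "transactions", "delta", date(2024, 6, 18))
print(path)
```

Partitioning by date is one common choice because many queries filter on time; a different key may suit other access patterns better.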
Various consumers will use this data: data analysts, data scientists, applications, or consumers we do not yet know of. All of them need the data to be stored in a platform that is fit for consumption. A key challenge is keeping the data consistent across those platforms.
Multiple data publishers write data in various formats. We need a component that can move the published data into any of the data platforms that consumers want to consume from. One of the challenges this component has to solve is ensuring that the data stays consistent.
The data pipeline preprocesses data before it is written to the storage platforms. The basic steps within the pipeline are applying data governance and routing data to the appropriate sinks. A data pipeline offers a plug-and-play approach for adding and updating data governance.
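A minimal sketch of the plug-and-play idea, assuming governance rules are plain functions applied in order before a record is routed to its sinks. The `Pipeline` class, the two example steps, and the sink names are hypothetical and are not taken from the talk.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]
Step = Callable[[Record], Record]


def mask_email(record: Record) -> Record:
    """Hypothetical governance step: hide the local part of an email address."""
    out = dict(record)
    email = out.get("email")
    if isinstance(email, str) and "@" in email:
        out["email"] = "***@" + email.split("@", 1)[1]
    return out


def tag_classification(record: Record) -> Record:
    """Hypothetical governance step: label records that carry personal data."""
    out = dict(record)
    out["classification"] = "restricted" if "email" in out else "internal"
    return out


class Pipeline:
    """Applies governance steps in order, then routes the record to sinks."""

    def __init__(self, steps: List[Step], sinks: Dict[str, List[Record]]):
        self.steps = list(steps)
        self.sinks = sinks  # in-memory lists stand in for real storage platforms

    def add_step(self, step: Step) -> None:
        # Plugging in a new rule does not require touching existing steps.
        self.steps.append(step)

    def publish(self, record: Record, targets: List[str]) -> Record:
        for step in self.steps:
            record = step(record)
        for name in targets:
            self.sinks[name].append(record)
        return record


sinks: Dict[str, List[Record]] = {"warehouse": [], "search": []}
pipeline = Pipeline([mask_email], sinks)
pipeline.add_step(tag_classification)
result = pipeline.publish({"id": 1, "email": "ana@example.com"}, ["warehouse"])
print(result)
```

Because every step has the same signature, updating governance means swapping or appending a function rather than changing the routing code.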
How can the platform we have defined so far support machine learning? There are a few stages. Raw data is available in the data lake. A transformation job within the data transformation platform extracts the features and stores them in a feature platform. Finally, a UI binds all of this together: the data portal.
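The stages above can be illustrated with a toy transformation job. This is a sketch under stated assumptions: the event fields `customer_id` and `amount`, the two features, and the dictionary standing in for the feature platform are all hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List


def extract_features(raw_events: Iterable[dict]) -> Dict[str, Dict[str, float]]:
    """Aggregate raw events into per-customer features (hypothetical job)."""
    amounts: Dict[str, List[float]] = defaultdict(list)
    for event in raw_events:
        amounts[event["customer_id"]].append(float(event["amount"]))
    return {
        customer: {"txn_count": float(len(values)), "avg_amount": mean(values)}
        for customer, values in amounts.items()
    }


# Raw data as it might sit in the data lake (made-up sample records).
raw = [
    {"customer_id": "c1", "amount": 10.0},
    {"customer_id": "c1", "amount": 30.0},
    {"customer_id": "c2", "amount": 5.0},
]

# A plain dict stands in for the feature platform, keyed by customer.
feature_store: Dict[str, Dict[str, float]] = {}
feature_store.update(extract_features(raw))
print(feature_store["c1"])
```

Keeping extracted features in one keyed store lets training and serving read the same values instead of recomputing them from raw data.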
Several challenges remain. Data security is one. Data integration is another, in particular integrating on-premise data with new data. Cloud capacity is also a concern: although the cloud is a good abstraction, we still need to plan for capacity.

Read the full transcript on the Conf42 Cloud Native 2024 site.

About the Author

Engineering Manager @ Capital One

Read Raja Chattopadhyay's Full Bio

This work is licensed under a Creative Commons Attribution-Share Alike 4.0 International License.

The opinions expressed on this website are those of each author, not of the author's employer or All Things Open/We Love Open Source.
