Databricks Delta Sharing Breakthrough
Author: Inza Khan
Databricks’ Delta Sharing marks a watershed moment in the world of data sharing, addressing a long-standing industry challenge. Traditionally, enterprises have grappled with the constraints of vendor-tied data-sharing solutions, hindering seamless collaboration between organizations utilizing different platforms. This new open protocol shatters these limitations, offering a secure and real-time data-sharing avenue independent of the platform hosting the data.
Delta Sharing not only addresses proprietary lock-in but also tackles the challenge of sharing unstructured data. Businesses increasingly want to share diverse data types such as images, videos, dashboards, and machine learning models. Unlike traditional data-sharing tools geared towards structured data, Delta Sharing is designed to support unstructured data as well, and it can be used from Python in addition to SQL. This flexibility lets it serve the needs of data engineers, analysts, and scientists alike.
Delta Sharing is more than a solution to isolated data-sharing challenges. It extends the applicability of the Lakehouse architecture that is rapidly gaining traction in organizations today. Delta Sharing therefore supports an open, simple, and collaborative approach to data and AI within an organization and also acts as a bridge between organizations. It lets them share and collaborate on data without being tied to the same platform, fostering innovation and efficiency in an interconnected world.
What is Delta Sharing?
Traditional data-sharing methods have long made collaboration difficult and security hard to enforce. Methods like FTP, emailing copies of flat files, or custom APIs often scale poorly, demand manual infrastructure upkeep, and make real-time access difficult. Security concerns compound these issues, especially when large datasets are copied between parties. This is where Databricks Delta Sharing comes in. It is an open protocol that redefines collaborative data sharing while keeping security controls in place.
Developed by Databricks, Delta Sharing is not just a solution; it’s a paradigm shift in how data is shared and accessed across organizations. At its core, Delta Sharing eliminates the need to duplicate data or to restrict recipients to a single platform. It acts as a bridge, fostering collaboration by providing a secure and real-time data-sharing avenue. Its significance is underscored by the inclusion of third-party tools that manage, govern, and audit shared data, giving organizations a comprehensive solution for efficient data collaboration.
Delta Sharing’s Comprehensive Approach
The Delta Sharing project doesn’t merely offer a solution; it aims to address fundamental challenges in data movement, scalability, security, and governance. The open protocol defines a structured approach to the secure, real-time exchange of large datasets among both internal and external stakeholders. The open-source REST protocol, a cornerstone of Delta Sharing, provides a secure and consistent way to access data stored in cloud object stores such as Amazon S3, Azure Data Lake Storage (ADLS), or Google Cloud Storage (GCS).
The protocol specification introduces critical concepts such as shares, schemas, tables, recipients, and sharing servers. Shares logically group schemas, allowing them to be shared with one or more recipients. Schemas, in turn, group tables, with each table representing a Delta Lake table or view. Recipients are the principals granted secure access to shared tables, and a sharing server is the service that implements the protocol and serves these objects to recipients. This structure makes it possible to implement the rigorous data governance controls that are crucial in today’s data-centric landscape.
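The hierarchy described above can be pictured with a small in-memory model. This is only an illustrative sketch, not how any sharing server stores its metadata: the class names, the `grant` and `visible_tables` helpers, and the example names (`sales_share`, `emea`, `partner_a`) are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Table:
    """A shared table (in the protocol, a Delta Lake table or view)."""
    name: str


@dataclass
class Schema:
    """A schema groups tables."""
    name: str
    tables: list[Table] = field(default_factory=list)


@dataclass
class Share:
    """A share groups schemas and is granted to one or more recipients."""
    name: str
    schemas: list[Schema] = field(default_factory=list)
    recipients: set[str] = field(default_factory=set)

    def grant(self, recipient: str) -> None:
        # Hypothetical helper: record that a recipient may read this share.
        self.recipients.add(recipient)

    def visible_tables(self, recipient: str) -> list[str]:
        # A recipient sees every table in the share, or nothing at all.
        if recipient not in self.recipients:
            return []
        return [
            f"{self.name}.{schema.name}.{table.name}"
            for schema in self.schemas
            for table in schema.tables
        ]


share = Share("sales_share", [Schema("emea", [Table("orders"), Table("returns")])])
share.grant("partner_a")
print(share.visible_tables("partner_a"))
```

In this sketch access is granted per share rather than per table, which mirrors the grouping described in the text: the share is the unit handed to a recipient.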
Delta Sharing Reference Server: A Hub for Secure Implementation
The Delta Sharing Reference Server is the starting point in the Delta Sharing ecosystem. It is a simple development server that implements the Delta Sharing Protocol, and anyone can build on it to create custom implementations of the protocol over their own data. This reference implementation acts as a foundation for wider adoption of the Delta Sharing protocol. Databricks takes the reference server to the next level by providing first-party support and a full-fledged Delta Sharing implementation out of the box. A major advantage of sharing through Databricks is access to Unity Catalog for data governance.
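To make the client side of the protocol more concrete, the sketch below builds (but does not send) a request that lists the tables in one schema of a share. It assumes the path layout and bearer-token authentication described in the open protocol specification, and a profile file carrying `endpoint` and `bearerToken` fields; the host, token, share, and schema names are placeholders, and the helper functions are hypothetical.

```python
import json
import urllib.request
from urllib.parse import quote


def load_profile(text: str) -> dict:
    """Parse the JSON profile a provider hands to a recipient."""
    profile = json.loads(text)
    for key in ("endpoint", "bearerToken"):
        if key not in profile:
            raise ValueError(f"profile is missing {key!r}")
    return profile


def list_tables_request(profile: dict, share: str, schema: str) -> urllib.request.Request:
    """Build a GET request for the tables in one schema of one share."""
    base = profile["endpoint"].rstrip("/")
    url = f"{base}/shares/{quote(share, safe='')}/schemas/{quote(schema, safe='')}/tables"
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {profile['bearerToken']}"},
        method="GET",
    )


# Placeholder profile: the endpoint and token below are not real.
EXAMPLE_PROFILE = """
{
  "shareCredentialsVersion": 1,
  "endpoint": "https://sharing.example.com/delta-sharing/",
  "bearerToken": "<token>"
}
"""

request = list_tables_request(load_profile(EXAMPLE_PROFILE), "sales_share", "emea")
print(request.get_method(), request.full_url)
# urllib.request.urlopen(request) would send it; omitted because the host is a placeholder.
```

Because the request is plain HTTPS with a bearer token, any language with an HTTP client could talk to a sharing server in the same way.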
Connectors for Apache Spark
To make Delta Sharing more accessible, pre-built connectors are available, including connectors for Apache Spark and for Python. The Apache Spark Connector enables the loading of shared tables through SQL, Python (as PySpark), Java, Scala, and R. The Python Connector allows users to load shared tables directly as pandas DataFrames into their project, making it easy to integrate into nearly any data-science workflow. This flexibility means that technical experts, regardless of their preferred workflow, can incorporate Delta Sharing into their data management practices.
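As a rough illustration of the Python Connector, the sketch below loads a shared table as a pandas DataFrame. It assumes the `delta-sharing` package is installed and that the provider has supplied a profile file; the file name `config.share` and the share, schema, and table names are placeholders, and `table_url` and `load_shared_table` are hypothetical helpers wrapped around the connector's `load_as_pandas` function.

```python
def table_url(profile_path: str, share: str, schema: str, table: str) -> str:
    """Build the '<profile>#<share>.<schema>.<table>' address the connector expects."""
    return f"{profile_path}#{share}.{schema}.{table}"


def load_shared_table(profile_path: str, share: str, schema: str, table: str):
    """Load a shared table as a pandas DataFrame.

    The import is deferred so this module can be imported even where
    the delta-sharing package is not installed.
    """
    import delta_sharing  # pip install delta-sharing

    return delta_sharing.load_as_pandas(table_url(profile_path, share, schema, table))


url = table_url("config.share", "sales_share", "emea", "orders")
print(url)
# df = load_shared_table("config.share", "sales_share", "emea", "orders")
```

Inside a Spark session with the Apache Spark Connector available, the same address can be passed to `delta_sharing.load_as_spark` to get a Spark DataFrame instead.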
Delta Sharing on Databricks in Action: Core Benefits
Open cross-platform sharing
Delta Sharing stands as a beacon of open collaboration, giving organizations the freedom to share raw data assets across different platforms. By providing a layer of abstraction, it removes the constraints of vendor lock-in, enabling users to share valuable data in Delta Lake and Apache Parquet formats regardless of the platform the recipient uses.
Share live data with no replication
Delta Sharing changes the data-sharing landscape by enabling real-time collaboration without data replication. Users can share live data across diverse data platforms, clouds, or regions, which removes the complexity of copying or duplicating information and reduces turnaround time in the process.
Centralized governance
Delta Sharing on Databricks introduces a profound shift in data governance by offering a centralized platform for comprehensive management, governance, and auditing of shared data. With robust features, including identity verification, role-based access control (RBAC), and detailed audit trails, organizations can ensure a secure and transparent data-sharing experience.
Databricks Marketplace for data products
Delta Sharing’s platform-agnostic approach extends beyond conventional data sharing; it enables a marketplace for data products from any platform. Through the open and standardized Delta Sharing protocol, Databricks has built a marketplace where data providers and data consumers can easily connect. Providers can build, package, and distribute data sets, machine learning models, and notebooks efficiently. This centralized marketplace streamlines the process, fostering collaboration and helping valuable data products reach their intended audience.
Databricks Clean Rooms
Databricks’ implementation of Delta Sharing addresses the critical need for privacy in collaborative environments through privacy-safe data clean rooms. Built on Unity Catalog, this feature enables secure collaboration with customers and partners on any cloud, creating a protected environment that prioritizes data privacy without compromising the collaborative experience.
Delta Sharing’s Scalability and Financial Advantage
Delta Sharing presents a compelling case for cost-effectiveness in the realm of data sharing. By sidestepping the expenses and intricacies tied to data integration solutions, Delta Sharing emerges as a financially viable option for enterprises, irrespective of their scale.
One of the key cost-saving features is Delta Sharing’s ability to facilitate data sharing directly from existing cloud object stores without the need for replication. This not only streamlines the process but also significantly reduces storage costs. The protocol eliminates the necessity for setting up separate computing environments, making data sharing a more efficient and cost-effective endeavor.
As data-sharing needs continue to grow, Delta Sharing rises to the occasion with its scalable architecture. The protocol is designed to handle larger data volumes while maintaining performance. This scalable approach lets enterprises devote resources and development effort to their core work instead of grappling with the complexities of data sharing.
Conclusion
Delta Sharing emerges as a beacon of change, ushering in a future where collaboration knows no bounds. It liberates organizations from the constraints of vendor dependencies, providing a secure, real-time, and platform-independent avenue for data sharing. As the data landscape continues to evolve, Delta Sharing stands as a testament to the power of open-source principles and collaborative innovation, paving the way for a future where data collaboration is seamless, inclusive, and limitless.