Challenges in Safeguarding Data and AI Projects

Author: Ryan Shiva

In today’s data-driven landscape, large organizations face the challenge of securing enormous volumes of sensitive data while still extracting meaningful insights from it. As revealed by the IBM Cost of a Data Breach Report 2023, the global average cost of a data breach surged to $4.45 million in 2023, a 15% increase from 2020. In response to this escalating threat landscape, 51% of organizations plan to bolster their cybersecurity spending in the coming year. At the same time, companies struggle to manage vast structured and unstructured datasets that feed real-time analytics dashboards and machine learning models. The evolution of separate tools and methodologies for data engineering, data warehousing, and AI projects has led to fragmented security practices, hindering innovation and creating significant security gaps.

The historical approach of operating in data silos, where data engineers and data scientists use different toolsets, has created substantial challenges. These tools tend to be poorly integrated across data workflows, slowing innovation and introducing security vulnerabilities. A unified data platform becomes paramount to bridge these gaps and strengthen overall data security.

Databricks as the Unified Solution

Databricks provides a compelling solution to the challenges of securing data and AI projects. It bridges the gap between traditional data silos with a unified platform that supports data engineering, data warehousing, and AI use cases on a single integrated cloud platform. Leveraging open standards such as the Delta Lake storage format, Databricks delivers the reliability, strong governance, and performance of a data warehouse while retaining the flexibility and machine learning support associated with data lakes.

The incorporation of open-source tools like MLflow and Hyperopt within Databricks further emphasizes its commitment to openness and collaboration. Crucially, native integration of AI tooling simplifies workflows, providing a seamless experience for data professionals. This unity is fundamental to streamlining processes and ensuring that security considerations are woven into every aspect of data and AI projects.

As organizations grapple with challenges related to fragmented security, poor reliability, and disjointed governance, Databricks stands as the linchpin for unifying workflows, implementing robust security measures, and meeting compliance requirements.

In the subsequent sections of this blog post, we will delve into the architecture of the Databricks Lakehouse Platform, exploring key components such as Unity Catalog. We will also provide actionable insights into security best practices for managing Databricks accounts and workspaces, empowering you to fortify your deployments effectively.

Diving into Databricks Architecture: Control and Compute Planes

To begin our overview of Databricks architecture and how it relates to data security, it’s crucial to understand the interplay between the control plane and the compute plane. The control plane encompasses the backend services that Databricks manages within your Databricks account; notebook commands and workspace configurations are stored here. The compute plane, by contrast, is where your data is processed, and it resides in your own cloud provider account on AWS, Azure, or GCP. Databricks adheres to a single-tenant model in the compute plane, so Databricks remains unaware of the specific data your teams process on the platform. Because your data lake is stored in a customer-controlled account, you retain complete control and ownership of your data.

Databricks provides encryption for both data at rest and data in motion, offering a comprehensive safeguard for your data. Data in the control plane is always encrypted at rest, and the compute plane supports local encryption at rest through encrypted storage buckets. Databricks also encrypts data passed between the control and compute planes.

If the data being processed is particularly sensitive, or your organization has stringent compliance requirements, Databricks offers additional encryption settings to increase security. One example is intra-cluster encryption. Clusters are virtual machines configured to run data engineering and machine learning workloads on Databricks, and a cluster with multiple nodes communicates over a network to share data; intra-cluster encryption protects that traffic, as the sketch below illustrates. This example demonstrates how Databricks can be configured to suit the security needs of your organization.
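As an illustration, here is a minimal sketch of creating a cluster with Spark’s network encryption options applied. The spark_conf keys are open-source Spark settings; the workspace URL, token, runtime version, and node type are placeholders, and depending on your cloud and pricing tier, Databricks may instead expose inter-node encryption through a platform security setting or an init script, so consult the documentation for your environment.

```python
import requests

# Placeholders -- replace with your own workspace URL and access token.
HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"

# Sketch: create a cluster whose Spark conf enables AES-based
# encryption of traffic (and spilled files) between worker nodes.
cluster_spec = {
    "cluster_name": "encrypted-intra-cluster",
    "spark_version": "13.3.x-scala2.12",   # example runtime; choose a supported one
    "node_type_id": "i3.xlarge",           # example AWS node type
    "num_workers": 2,
    "spark_conf": {
        "spark.authenticate": "true",            # prerequisite for RPC encryption
        "spark.network.crypto.enabled": "true",  # encrypt inter-node RPC/shuffle traffic
        "spark.io.encryption.enabled": "true",   # encrypt shuffle spill files on local disk
    },
}

resp = requests.post(
    f"{HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
resp.raise_for_status()
print("Created cluster:", resp.json()["cluster_id"])
```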

User activities, data access, and commands are meticulously logged, with the option of automatic delivery to a cloud storage bucket. Customers can also employ customer-managed keys for additional layers of security. This meticulous approach to encryption and auditing aligns with Databricks’ commitment to securing your data throughout its lifecycle.
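For example, audit-log delivery can be configured once at the account level. The sketch below uses the Databricks Account API on AWS; the account ID, credential ID, and storage-configuration ID are placeholders for objects you must register with your account beforehand.

```python
import requests

# Placeholders -- substitute your own account ID, admin token, and the
# credential/storage objects registered via the Account API.
ACCOUNT_ID = "<databricks-account-id>"
TOKEN = "<account-admin-token>"
BASE = "https://accounts.cloud.databricks.com"  # AWS account console endpoint

# Sketch: deliver audit logs automatically, as JSON, to an S3 bucket
# previously registered as a storage configuration.
payload = {
    "log_delivery_configuration": {
        "log_type": "AUDIT_LOGS",
        "output_format": "JSON",
        "credentials_id": "<credentials-uuid>",
        "storage_configuration_id": "<storage-config-uuid>",
        "delivery_path_prefix": "audit-logs",
    }
}

resp = requests.post(
    f"{BASE}/api/2.0/accounts/{ACCOUNT_ID}/log-delivery",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
)
resp.raise_for_status()
```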

As we dissect the intricate layers of Databricks architecture, it becomes evident how each element is purposefully designed to address the challenges highlighted earlier—fragmented security, poor reliability, and disjointed governance. Next, we will explore the Databricks features and best practices to achieve unmatched security for your data and users.

Databricks Security Best Practices

Deploy a Workspace with Secure Cluster Connectivity

All new workspaces are automatically configured with secure cluster connectivity, a default setting that significantly enhances network security. With this feature enabled, customer Virtual Private Clouds (VPCs) have no open ports, and Databricks Runtime cluster nodes within the compute plane operate without public IP addresses. This not only simplifies network administration but also eliminates the need to open ports on security groups or establish complex network peering configurations. Note that secure cluster connectivity cannot be added to an existing workspace; to use it, you must create a new workspace.
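For reference, here is a hedged sketch of provisioning a new workspace on the AWS E2 platform through the Account API, where secure cluster connectivity is the default. Every ID shown is a placeholder for an object you must register beforehand (credentials, a storage configuration, and optionally a customer-managed VPC); the exact fields differ on Azure and GCP.

```python
import requests

# Placeholders -- replace with IDs from your own Databricks account.
ACCOUNT_ID = "<databricks-account-id>"
TOKEN = "<account-admin-token>"
BASE = "https://accounts.cloud.databricks.com"

# Sketch: new workspaces created on the AWS E2 platform get secure
# cluster connectivity by default -- no inbound ports, no public IPs
# on Databricks Runtime cluster nodes.
payload = {
    "workspace_name": "analytics-prod",
    "aws_region": "us-east-1",
    "credentials_id": "<credentials-uuid>",
    "storage_configuration_id": "<storage-config-uuid>",
    "network_id": "<customer-managed-vpc-uuid>",  # optional: customer-managed VPC
}

resp = requests.post(
    f"{BASE}/api/2.0/accounts/{ACCOUNT_ID}/workspaces",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
)
resp.raise_for_status()
print("Created workspace:", resp.json()["workspace_id"])
```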

Implement IP Access Lists for Authorized User Access Control

IP access lists in Databricks workspaces provide a crucial layer of network security by allowing administrators to specify authorized IP addresses (or CIDR ranges) for user connections. This ensures that users can access the service only through predefined, secure networks, mitigating the risk associated with unsecured connections. Administrators have the flexibility to permit specific IP addresses belonging to egress gateways or user environments while also having the option to block specific IP addresses or subnets. Additionally, Databricks workspaces support PrivateLink to restrict public internet access. The use of IP access lists can also extend to the account console, where administrators can control access based on specified IP addresses or CIDR ranges, providing a comprehensive approach to network security.
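As a concrete starting point, the sketch below enables the feature and registers an allow list through the workspace REST API. The workspace URL, token, and CIDR range (an IETF documentation range) are placeholders to replace with your organization’s real egress addresses.

```python
import requests

# Placeholders -- replace with your workspace URL, an admin token, and
# your organization's actual egress CIDR ranges.
HOST = "https://<your-workspace>.cloud.databricks.com"
HEADERS = {"Authorization": "Bearer <admin-token>"}

# Step 1: turn on the IP access list feature for the workspace.
requests.patch(
    f"{HOST}/api/2.0/workspace-conf",
    headers=HEADERS,
    json={"enableIpAccessLists": "true"},
).raise_for_status()

# Step 2: allow connections only from the corporate egress gateway.
requests.post(
    f"{HOST}/api/2.0/ip-access-lists",
    headers=HEADERS,
    json={
        "label": "corporate-vpn",
        "list_type": "ALLOW",                # use "BLOCK" to deny specific ranges
        "ip_addresses": ["203.0.113.0/24"],  # documentation-range placeholder
    },
).raise_for_status()
```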

Isolate by Line of Business (LOB) for Enhanced Security

Organizing workspaces by Line of Business (LOB) in an enterprise context provides real security benefits. In this model, each functional unit receives dedicated workspaces, typically covering development, staging, and production, with separate cloud accounts for each LOB under the same Databricks account. This approach offers isolated assets, improved governance, and efficient automation, but it demands upfront planning and specialized expertise. By isolating each LOB in its own workspaces, environments stay less cluttered, access is easier to manage, and overall risk is reduced. Smaller organizations with fewer datasets may not benefit from this model, since it requires additional setup and automation by skilled administrators.

Enable Unity Catalog for Unified Data Governance

Databricks recommends that all users enable and migrate to Unity Catalog, a centralized solution for governing data and AI assets within the Lakehouse. It allows users to manage access to tables, feature stores, models, and all other Databricks objects across every workspace in the account. Unity Catalog employs a define-once, secure-everywhere approach and offers built-in auditing and lineage so users can explore the history of each table. Unity Catalog uses a three-level namespace (catalog.schema.table) to reference all data assets.
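To make this concrete, here is a minimal sketch that can run in a Databricks notebook, where spark is predefined. The catalog, schema, table, and group names are illustrative examples.

```python
# Runnable in a Databricks notebook, where `spark` is predefined.
# The names (main.sales.orders, `analysts`) are illustrative.

# Three-level namespace: catalog.schema.table
df = spark.table("main.sales.orders")

# Define access once; Unity Catalog enforces it across every workspace
# attached to the same metastore.
spark.sql("GRANT USE CATALOG ON CATALOG main TO `analysts`")
spark.sql("GRANT USE SCHEMA ON SCHEMA main.sales TO `analysts`")
spark.sql("GRANT SELECT ON TABLE main.sales.orders TO `analysts`")
```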

Avoid Using DBFS for Cloud Storage Mounts

Databricks offers the convenience of mounting cloud object storage to the Databricks File System (DBFS), simplifying data access for users. However, mounted data does not integrate with Unity Catalog, and Databricks recommends steering away from DBFS mounts. Mounting to DBFS creates a security issue: everyone with access to the workspace can reach the data at the mounted location. Instead, the recommended approach is to manage data locations and secrets with Unity Catalog, where workspace admins can maintain fine-grained access control lists. Admins may also leverage session-scoped connections using provider secret scopes (e.g., Azure Key Vault, AWS Parameter Store) and access control lists. This method ensures secure access to storage locations, allowing precise control over who has access to the associated service principal secrets.
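As a sketch of the Unity Catalog alternative, the example below registers an external location and grants access to a single group, in place of a workspace-wide mount. The bucket URL, credential name, and group name are illustrative, and the commands assume a notebook where spark is predefined.

```python
# Illustrative names throughout; run in a Databricks notebook
# where `spark` is predefined.

# Register the bucket once as an external location, backed by a
# storage credential that only privileged admins can see.
spark.sql("""
  CREATE EXTERNAL LOCATION IF NOT EXISTS raw_landing
  URL 's3://example-bucket/landing'
  WITH (STORAGE CREDENTIAL landing_credential)
""")

# Grant fine-grained access instead of exposing the path to everyone.
spark.sql("GRANT READ FILES ON EXTERNAL LOCATION raw_landing TO `data_engineers`")

# Users holding the grant can then read directly -- no workspace-wide mount.
df = spark.read.format("json").load("s3://example-bucket/landing/events/")
```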

By following the best practices outlined in this blog post, you can achieve better security for your own Databricks deployments. However, it’s always best to seek out experienced Databricks engineers and architects to help you define a comprehensive security strategy tailored to the specific needs of your organization. The development team at Xorbix Technologies specializes in crafting custom solutions, ensuring that your Databricks environment not only meets the highest security standards but also aligns seamlessly with your unique business requirements. Partner with Xorbix Technologies for expert guidance in architecting a fortified Databricks deployment that will stand resilient against evolving security challenges.

Get In Touch With Us

Would you like to discuss how Xorbix Technologies, Inc. can help with your enterprise IT needs?

