Build an Open Lakehouse with Unity Catalog: A Guide to Data Freedom and Flexibility
Author: Tyler Faulkner
24 June, 2025
In today’s AI-driven landscape, organizations generate vast amounts of structured and unstructured data that can become trapped in proprietary platforms, limiting flexibility and ownership. To address this, enterprises must adopt open standards and interoperable architectures. As of 2025, Databricks and the broader open-source community are leading this shift with solutions like Unity Catalog.
In this blog, we will explore guiding principles, the “San Fran Paper” scenario, example interoperability architectures, Unity Catalog’s role, and the ongoing challenges of schema translation.
Introduction
As AI and data science innovations exploded over the past two decades, enterprises built sprawling data platforms, only to discover they had traded scalability for vendor lock-in. The cost? Your data lives in a proprietary silo, accessible only through that vendor’s APIs, query engines, and tooling.
In the age of unstructured data and Artificial Intelligence, organizations need an Open Data Lake paradigm: one that keeps your files in open table formats, applies unified governance, and lets any compute engine (Spark, Trino, DuckDB, you name it) access the same data seamlessly across platforms. Enter Unity Catalog, the open-by-design layer that helps you orchestrate your lakehouse assets across engines while sidestepping lock-in.
How Does Unity Catalog Work?
At its core, Unity Catalog is your lakehouse’s north star: a centralized metadata plane that knows about every table, file, model, and notebook in your data lake, regardless of format or compute engine. It exposes several open REST APIs so external tools can integrate seamlessly:
Unity Catalog REST API
- CRUD operations on catalogs, schemas, tables, views, and user permissions.
- Lineage endpoints to graph upstream/downstream dependencies.
Iceberg REST API
- Direct read/write of Apache Iceberg tables registered in Unity Catalog, complete with snapshot management.
Delta Sharing REST API
- Secure, open-protocol data sharing for Delta and Parquet tables across organizations—no proprietary connector needed.
Model Asset API
- Register, version, and serve ML models (TensorFlow, PyTorch, ONNX) alongside your tables and files.
This API suite lets any client, whether an on-prem Spark cluster, a Trino federation node, or a custom Python CLI, discover, query, and govern your assets.
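To make that concrete, here is a minimal sketch of listing catalogs and tables over the Unity Catalog REST API from Python. It assumes an open-source Unity Catalog server reachable at localhost:8080 under the default /api/2.1/unity-catalog path and a bearer token; adjust the host, token, and catalog/schema names for your deployment.

```python
import requests

# Assumed local OSS Unity Catalog server; swap in your own host and token.
BASE_URL = "http://localhost:8080/api/2.1/unity-catalog"
HEADERS = {"Authorization": "Bearer <your-token>"}

def list_catalogs() -> list[dict]:
    """Return the catalogs visible to the caller."""
    resp = requests.get(f"{BASE_URL}/catalogs", headers=HEADERS)
    resp.raise_for_status()
    return resp.json().get("catalogs", [])

def list_tables(catalog: str, schema: str) -> list[dict]:
    """Return table metadata (name, format, storage location) for one schema."""
    resp = requests.get(
        f"{BASE_URL}/tables",
        headers=HEADERS,
        params={"catalog_name": catalog, "schema_name": schema},
    )
    resp.raise_for_status()
    return resp.json().get("tables", [])

if __name__ == "__main__":
    for cat in list_catalogs():
        print("catalog:", cat.get("name"))
    for tbl in list_tables("unity", "default"):
        print("table:", tbl.get("name"), tbl.get("data_source_format"))
```

The permissions and lineage endpoints follow the same request pattern, so any HTTP-capable tool can be wired in without a proprietary connector.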
- Centralized Metadata Store
  - Unity Catalog stores metadata about Delta, Iceberg, Parquet, and other open table formats in a unified catalog.
  - It extends governance and lineage beyond tables to include unstructured assets, models, and jobs.
- Unified Governance & Access Controls
  - Say goodbye to “governance sprawl” across Glue, HMS, and custom RBAC scripts: Unity Catalog provides a single source for policies.
  - Credential vending services hand out temporary credentials so Spark, Trino, or EMR can safely talk to your S3/ADLS buckets without extra config.
- Catalog Federation & Delta Sharing
  - Want Trino to read the same Iceberg table that Spark writes? Catalog federation lets external engines read directly from the file locations defined in Unity Catalog.
  - Delta Sharing extends secure, open data sharing across organizations—no proprietary wire protocols.
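On the consumer side of Delta Sharing, the open delta-sharing Python client is all a recipient needs. The sketch below uses placeholder share, schema, and table names; the profile file comes from the data provider.

```python
import delta_sharing

# Profile file issued by the data provider (endpoint + bearer token).
profile = "config.share"

# Discover everything the provider has shared with you.
client = delta_sharing.SharingClient(profile)
for table in client.list_all_tables():
    print(table.share, table.schema, table.name)

# Load one shared table into pandas; "<share>.<schema>.<table>" is a placeholder.
df = delta_sharing.load_as_pandas(f"{profile}#sales_share.bronze.orders")
print(df.head())
```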
In non-Databricks contexts, Unity Catalog plugs into AWS Glue or Hive Metastore to retrieve existing metadata, then surfaces those tables (and more) through its uniform API. Projects, permissions, and lineage graphs span clusters, clouds, and engines, so your data teams can focus on insights, not on “Why can’t I see that table?”
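Federation in the other direction looks similar: an engine running outside Databricks can read Unity Catalog-managed Iceberg tables by pointing a standard Iceberg REST catalog at Unity Catalog’s Iceberg endpoint. The PySpark sketch below assumes the /api/2.1/unity-catalog/iceberg endpoint path, a personal access token, and placeholder catalog and table names; Trino, DuckDB, and other Iceberg-aware engines take the same REST catalog properties.

```python
from pyspark.sql import SparkSession

# Unity Catalog's Iceberg REST endpoint (assumed path; adjust for your workspace).
UC_ICEBERG_URI = "https://<workspace-host>/api/2.1/unity-catalog/iceberg"

spark = (
    SparkSession.builder.appName("uc-iceberg-federation")
    # Iceberg Spark runtime for your Spark/Scala version (assumed coordinates).
    .config(
        "spark.jars.packages",
        "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1",
    )
    .config(
        "spark.sql.extensions",
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    )
    # Register Unity Catalog as an Iceberg REST catalog named `uc`.
    .config("spark.sql.catalog.uc", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.uc.type", "rest")
    .config("spark.sql.catalog.uc.uri", UC_ICEBERG_URI)
    .config("spark.sql.catalog.uc.token", "<personal-access-token>")
    .getOrCreate()
)

# Read the same table another engine writes through Unity Catalog.
spark.sql("SELECT * FROM uc.bronze.orders LIMIT 10").show()
```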
Interoperability in Action
Heads up: this interoperability example comes from James Malone & Aniruth Narayanan at Databricks Summit 2025, who walked through an architecture showing how multiple engines can coexist on the same open data.
Scenario: The San Fran Paper Company
Team Roles
- Dina (Data Engineer)
  - Facing siloed, duplicated data.
- Paolo (Platform Architect)
  - Struggling with inconsistent governance across systems.
- Blake (Business Analyst)
  - Wants seamless querying across platforms without lock-in.
Pain Points & Solutions
| Problem | Impact | Unity Catalog-Powered Fix |
| --- | --- | --- |
| Duplicate Data | Multiple “bronze” copies confuse teams | Single source of truth via open table formats (Parquet/Delta/Iceberg) |
| Fragmented Governance | Inconsistent permissions, discoverability barriers | Centralized policies & lineage in Unity Catalog |
| Query Lock-In | Analyst stuck in one engine | Catalog federation & open APIs let any engine query the same data |
Example Architectures
- Streaming Ingestion
  - Kafka → Structured Streaming (Spark) → Unity Catalog (Delta Bronze Layer)
  - Kafka Connect → Iceberg REST → Unity Catalog (Iceberg Bronze Layer)
Both pipelines write to open table formats registered in Unity Catalog, so both Delta and Iceberg consumers can coexist in the same lakehouse.
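As a sketch of the first pipeline, here is a minimal PySpark Structured Streaming job that lands raw Kafka events in a Unity Catalog-managed Delta bronze table. It assumes a runtime where the Kafka and Delta connectors are already available (for example, Databricks Runtime); the brokers, topic, checkpoint path, and three-level table name are placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, current_timestamp

spark = SparkSession.builder.appName("kafka-to-delta-bronze").getOrCreate()

# Read the raw event stream from Kafka (placeholder brokers and topic).
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker-1:9092")
    .option("subscribe", "orders")
    .option("startingOffsets", "latest")
    .load()
)

# Keep the payload raw in bronze; parsing and cleansing happen in silver.
bronze = events.select(
    col("key").cast("string").alias("key"),
    col("value").cast("string").alias("payload"),
    col("topic"),
    col("timestamp").alias("event_ts"),
    current_timestamp().alias("ingest_ts"),
)

# Append into a Unity Catalog-managed Delta table (catalog.schema.table).
query = (
    bronze.writeStream.format("delta")
    .option("checkpointLocation", "/Volumes/main/bronze/checkpoints/orders")
    .outputMode("append")
    .toTable("main.bronze.orders_raw")
)
```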
- Partner Data Sync
  - Workday + HubSpot → Fivetran → Unity Catalog REST → Delta Bronze
  - Fivetran writes in Parquet/Iceberg; Unity Catalog exposes these tables alongside managed Delta tables for downstream transformation.
- Multi-Engine ETL & Transformation
  - Databricks Spark SQL → Unity Catalog REST (Silver Layer)
    - Leverages auto-optimized clusters and liquid clustering at the catalog level for high-performance Parquet/Delta writes (a sketch of this step follows the list).
  - EMR + DBR → Unity Catalog REST → Silver Layer
    - EMR clusters use Unity Catalog credentials; transformations enjoy the same clustering optimizations.
  - Dataproc → Iceberg APIs → Silver Layer
    - Dataproc jobs read/write Iceberg via Unity Catalog, inheriting liquid clustering outside of the Spark runtime.
  - Snowflake → Iceberg REST → Unity Catalog
    - Snowflake writes native Iceberg tables; Unity Catalog manages access and governance across both worlds.
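To ground that Spark SQL silver-layer step, here is a minimal sketch that builds a silver table from the bronze stream above and upserts cleansed rows into it, declaring liquid clustering at table creation. Table names, columns, and JSON fields are placeholders, and the CLUSTER BY clause assumes a Delta runtime that supports liquid clustering.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bronze-to-silver").getOrCreate()

# Create the silver table once, with liquid clustering on a common filter column.
spark.sql("""
    CREATE TABLE IF NOT EXISTS main.silver.orders (
        order_id    STRING,
        customer_id STRING,
        amount      DECIMAL(18, 2),
        event_ts    TIMESTAMP
    )
    CLUSTER BY (event_ts)
""")

# Upsert cleansed bronze rows into silver (placeholder JSON fields).
spark.sql("""
    MERGE INTO main.silver.orders AS s
    USING (
        SELECT
            get_json_object(payload, '$.order_id')    AS order_id,
            get_json_object(payload, '$.customer_id') AS customer_id,
            CAST(get_json_object(payload, '$.amount') AS DECIMAL(18, 2)) AS amount,
            event_ts
        FROM main.bronze.orders_raw
    ) AS b
    ON s.order_id = b.order_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```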
And a big shout-out: at Databricks Summit 2025, Michelle Leon (Staff Product Manager) gave a fantastic live demo of Starburst, DuckDB, and Daft in Python, all reading from and writing to Unity Catalog, proving interoperability works end to end.
Current Pain Point – Schema Translations
This schema translation discussion is based on a presentation given by Eric Sun, Head of Data Platform at Coinbase, at Databricks Summit 2025. Even with open formats and a unified catalog, schema mismatches can trip you up:
Data Type Gaps:
- IDL formats (Avro, Thrift) lack complex SQL types like timestamp, decimal256, or enum.
- Iceberg does not natively support unsigned integers or custom precision floats (FP16).
Metadata Loss:
- Partitioning, indexing, and custom lineage annotations often vanish when data lands in the lake.
- OLTP systems have PKs and indexes; lakes rely on pushdown filters and clustering hints instead.
Polyglot Sources & Targets:
- MySQL → Delta, Oracle → Iceberg, Kafka → Parquet… point-to-point pipelines become spaghetti.
- Reverse-ETL (Lake → OLTP) demands robust translation back into richer schemas.
Toward a Standardized Schema Translation Service
Standardizing schema translation is essential for maintaining data consistency and reliability across heterogeneous systems. A common framework ensures that critical metadata, such as data types, precision, and semantic meaning, travels intact between sources and targets, reducing errors, accelerating integrations, and simplifying maintenance.
- Logical Types & Semantic Metadata
  - Define a “superset” logical schema (e.g., support enum, high-precision decimals) with a shared metadata contract (a hypothetical sketch follows this list).
  - Preserve semantic tags (e.g., user_id vs. order_id) so downstream agents and scientists understand context.
- Hub-and-Spoke Pattern
  - Ingest all sources into a central “Schema Master” service via REST/gRPC.
  - Evolve target lake schemas automatically and generate mapping snippets for pipelines.
  - Push validated schemas back to source and sink systems for consistency.
- Open Ecosystem Collaboration
  - Contribute translators for Protobuf unions, NoSQL JSON schemas, and streaming formats.
  - Build community-driven connectors so every major system speaks the same schema language.
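No such standard exists yet, so the following is a purely hypothetical sketch of what a “superset” logical field plus per-target type mapping could look like. The LogicalField contract and the Delta/Iceberg mappings below are illustrative assumptions, not a real library or Unity Catalog feature.

```python
from dataclasses import dataclass

@dataclass
class LogicalField:
    """Hypothetical 'superset' logical field carrying semantic metadata."""
    name: str
    logical_type: str       # e.g. "uint32", "enum", "decimal(38,9)", "timestamp_ntz"
    semantic_tag: str = ""  # e.g. "user_id", "order_id"; preserved for downstream users
    nullable: bool = True

# Illustrative mappings; a real service would cover far more cases and
# flag lossy conversions instead of silently widening types.
DELTA_TYPES = {"uint32": "BIGINT", "enum": "STRING", "timestamp_ntz": "TIMESTAMP_NTZ"}
ICEBERG_TYPES = {"uint32": "long", "enum": "string", "timestamp_ntz": "timestamp"}

def translate(fields: list[LogicalField], target: str) -> list[dict]:
    """Map logical fields to a target dialect, keeping semantic tags as metadata."""
    mapping = DELTA_TYPES if target == "delta" else ICEBERG_TYPES
    return [
        {
            "name": f.name,
            "type": mapping.get(f.logical_type, f.logical_type),
            "nullable": f.nullable,
            "metadata": {"semantic_tag": f.semantic_tag, "logical_type": f.logical_type},
        }
        for f in fields
    ]

schema = [
    LogicalField("user_id", "uint32", semantic_tag="user_id", nullable=False),
    LogicalField("status", "enum", semantic_tag="order_status"),
]
print(translate(schema, "delta"))
print(translate(schema, "iceberg"))
```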
A Call for Open Standardization
The industry must champion an open, unified schema translation standard that complements platforms like Unity Catalog. By embedding translation services directly into Unity Catalog, a capability Eric Sun advocated for at Databricks Summit 2025, organizations can ensure consistent type mappings, preserve semantic context, and streamline integrations across diverse systems. Such standardization will reduce complexity, foster innovation, and solidify data interoperability as the foundation for modern analytics and AI workloads.
Conclusion
Building an Open Lakehouse is not just about choosing Parquet over proprietary blobs; it is about adopting open table formats (Delta and Iceberg), centralized governance, and interoperable compute that travels with your data. Unity Catalog delivers the metadata backbone: one catalog, many engines, zero lock-in. As you design your pipelines, from Kafka streaming to Snowflake writes, keep an eye on schema translation tooling so that your data’s semantics stay intact from source to insight.
In a world where AI demands agility and experimentation, you need to move fast, and keeping your data free is non-negotiable. Start today by:
- Embracing open formats (Delta or Iceberg) for all your layers.
- Centralizing policies and lineage in a unified catalog.
- Building schema translation services that preserve precision and context.
Xorbix can help you leverage Unity Catalog to ensure your business maintains full control over its data. Our team will work with you to design open, scalable architectures, implement unified governance policies, and integrate seamlessly with your existing tools, so you can innovate confidently, scale efficiently, and avoid the risks of vendor lock-in.