Build an Open Lakehouse with Unity Catalog: A Guide to Data Freedom and Flexibility
Author: Tyler Faulkner
24 June, 2025
In today’s AI-driven landscape, organizations generate vast amounts of structured and unstructured data that can become trapped in proprietary platforms, limiting flexibility and ownership. To address this, enterprises must adopt open standards and interoperable architectures. As of 2025, Databricks and the broader open-source community are leading this shift with solutions like Unity Catalog.
In this blog, we will explore guiding principles, the “San Fran Paper” scenario, example interoperability architectures, Unity Catalog’s role, and the ongoing challenges of schema translation.
Introduction
As AI and data science innovations exploded over the past two decades, enterprises built sprawling data platforms, only to discover they had traded scalability for vendor lock-in. The cost? Your data lives in a proprietary silo, accessible only through that vendor’s APIs, query engines, and tooling.
In the age of unstructured data and Artificial Intelligence, organizations need an Open Data Lake paradigm: one that keeps your files in open table formats, applies unified governance, and lets any compute engine (Spark, Trino, DuckDB, you name it) access the same data seamlessly across platforms. Enter Unity Catalog, the open-by-design layer that helps you orchestrate your lakehouse assets across engines while sidestepping lock-in.
How Does Unity Catalog Work?
At its core, Unity Catalog is your lakehouse’s north star: a centralized metadata plane that knows about every table, file, model, and notebook in your data lake, regardless of format or compute engine. It exposes several open REST APIs so external tools can integrate seamlessly:
Unity Catalog REST API
- CRUD operations on catalogs, schemas, tables, views, and user permissions.
- Lineage endpoints to graph upstream/downstream dependencies.
Iceberg REST API
- Direct read/write of Apache Iceberg tables registered in Unity Catalog, complete with snapshot management.
Delta Sharing REST API
- Secure, open-protocol data sharing for Delta and Parquet tables across organizations—no proprietary connector needed.
Model Asset API
- Register, version, and serve ML models (TensorFlow, PyTorch, ONNX) alongside your tables and files.
This API suite lets any client, whether an on-prem Spark cluster, a Trino federation node, or a custom Python CLI, discover, query, and govern your assets.
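To make that concrete, here is a minimal sketch of listing catalogs and tables over the Unity Catalog REST API from Python. It assumes an open-source Unity Catalog server reachable at localhost:8080 under the default /api/2.1/unity-catalog path and a bearer token; adjust the host, token, and catalog/schema names for your deployment.

```python
import requests

# Assumed local OSS Unity Catalog server; swap in your own host and token.
BASE_URL = "http://localhost:8080/api/2.1/unity-catalog"
HEADERS = {"Authorization": "Bearer <your-token>"}

def list_catalogs() -> list[dict]:
    """Return the catalogs visible to the caller."""
    resp = requests.get(f"{BASE_URL}/catalogs", headers=HEADERS)
    resp.raise_for_status()
    return resp.json().get("catalogs", [])

def list_tables(catalog: str, schema: str) -> list[dict]:
    """Return table metadata (name, format, storage location) for one schema."""
    resp = requests.get(
        f"{BASE_URL}/tables",
        headers=HEADERS,
        params={"catalog_name": catalog, "schema_name": schema},
    )
    resp.raise_for_status()
    return resp.json().get("tables", [])

if __name__ == "__main__":
    for cat in list_catalogs():
        print("catalog:", cat.get("name"))
    for tbl in list_tables("unity", "default"):
        print("table:", tbl.get("name"), tbl.get("data_source_format"))
```

The permissions and lineage endpoints follow the same request pattern, so any HTTP-capable tool can be wired in without a proprietary connector.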
- Centralized Metadata Store
  - Unity Catalog stores metadata about Delta, Iceberg, Parquet, and other open table formats in a unified catalog.
  - It extends governance and lineage beyond tables to include unstructured assets, models, and jobs.
- Unified Governance & Access Controls
  - Say goodbye to “governance sprawl” across Glue, HMS, and custom RBAC scripts: Unity Catalog provides a single source for policies.
  - Credential vending services hand out temporary credentials so Spark, Trino, or EMR can safely talk to your S3/ADLS buckets without extra config.
- Catalog Federation & Delta Sharing
  - Want Trino to read the same Iceberg table that Spark writes? Catalog federation lets external engines read directly from the file locations defined in Unity Catalog.
  - Delta Sharing extends secure, open data sharing across organizations—no proprietary wire protocols.
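On the consumer side of Delta Sharing, the open delta-sharing Python client is all a recipient needs. The sketch below uses placeholder share, schema, and table names; the profile file comes from the data provider.

```python
import delta_sharing

# Profile file issued by the data provider (endpoint + bearer token).
profile = "config.share"

# Discover everything the provider has shared with you.
client = delta_sharing.SharingClient(profile)
for table in client.list_all_tables():
    print(table.share, table.schema, table.name)

# Load one shared table into pandas; "<share>.<schema>.<table>" is a placeholder.
df = delta_sharing.load_as_pandas(f"{profile}#sales_share.bronze.orders")
print(df.head())
```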
In non-Databricks contexts, Unity Catalog plugs into AWS Glue or Hive Metastore to retrieve existing metadata, then surfaces those tables (and more) through its uniform API. Projects, permissions, and lineage graphs span clusters, clouds, and engines, so your data teams can focus on insights, not on “Why can’t I see that table?”
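Federation in the other direction looks similar: an engine running outside Databricks can read Unity Catalog-managed Iceberg tables by pointing a standard Iceberg REST catalog at Unity Catalog’s Iceberg endpoint. The PySpark sketch below assumes the /api/2.1/unity-catalog/iceberg endpoint path, a personal access token, and placeholder catalog and table names; Trino, DuckDB, and other Iceberg-aware engines take the same REST catalog properties.

```python
from pyspark.sql import SparkSession

# Unity Catalog's Iceberg REST endpoint (assumed path; adjust for your workspace).
UC_ICEBERG_URI = "https://<workspace-host>/api/2.1/unity-catalog/iceberg"

spark = (
    SparkSession.builder.appName("uc-iceberg-federation")
    # Iceberg Spark runtime for your Spark/Scala version (assumed coordinates).
    .config(
        "spark.jars.packages",
        "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1",
    )
    .config(
        "spark.sql.extensions",
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    )
    # Register Unity Catalog as an Iceberg REST catalog named `uc`.
    .config("spark.sql.catalog.uc", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.uc.type", "rest")
    .config("spark.sql.catalog.uc.uri", UC_ICEBERG_URI)
    .config("spark.sql.catalog.uc.token", "<personal-access-token>")
    .getOrCreate()
)

# Read the same table another engine writes through Unity Catalog.
spark.sql("SELECT * FROM uc.bronze.orders LIMIT 10").show()
```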
Interoperability in Action
Heads up: this interoperability example comes from James Malone & Aniruth Narayanan at Databricks Summit 2025, who walked through an architecture showing how multiple engines can coexist on the same open data.
Scenario: The San Fran Paper Company
Team Roles
- Dina (Data Engineer)
  - Facing siloed, duplicated data.
- Paolo (Platform Architect)
  - Struggling with inconsistent governance across systems.
- Blake (Business Analyst)
  - Wants seamless querying across platforms without lock-in.
Pain Points & Solutions
| Problem | Impact | Unity Catalog-Powered Fix |
| --- | --- | --- |
| Duplicate Data | Multiple “bronze” copies confuse teams | Single source of truth via open table formats (Parquet/Delta/Iceberg) |
| Fragmented Governance | Inconsistent permissions, discoverability barriers | Centralized policies & lineage in Unity Catalog |
| Query Lock-In | Analyst stuck in one engine | Catalog federation & open APIs let any engine query the same data |
Example Architectures
- Streaming Ingestion
  - Kafka → Structured Streaming (Spark) → Unity Catalog (Delta Bronze Layer)
  - Kafka Connect → Iceberg REST → Unity Catalog (Iceberg Bronze Layer)
Both pipelines write to open table formats registered in Unity Catalog, so both Delta and Iceberg consumers can coexist in the same lakehouse.
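As a sketch of the first pipeline, here is a minimal PySpark Structured Streaming job that lands raw Kafka events in a Unity Catalog-managed Delta bronze table. It assumes a runtime where the Kafka and Delta connectors are already available (for example, Databricks Runtime); the brokers, topic, checkpoint path, and three-level table name are placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, current_timestamp

spark = SparkSession.builder.appName("kafka-to-delta-bronze").getOrCreate()

# Read the raw event stream from Kafka (placeholder brokers and topic).
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker-1:9092")
    .option("subscribe", "orders")
    .option("startingOffsets", "latest")
    .load()
)

# Keep the payload raw in bronze; parsing and cleansing happen in silver.
bronze = events.select(
    col("key").cast("string").alias("key"),
    col("value").cast("string").alias("payload"),
    col("topic"),
    col("timestamp").alias("event_ts"),
    current_timestamp().alias("ingest_ts"),
)

# Append into a Unity Catalog-managed Delta table (catalog.schema.table).
query = (
    bronze.writeStream.format("delta")
    .option("checkpointLocation", "/Volumes/main/bronze/checkpoints/orders")
    .outputMode("append")
    .toTable("main.bronze.orders_raw")
)
```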
- Partner Data Sync
  - Workday + HubSpot → Fivetran → Unity Catalog REST → Delta Bronze
  - Fivetran writes in Parquet/Iceberg; Unity Catalog exposes these tables alongside managed Delta tables for downstream transformation.
- Multi-Engine ETL & Transformation
  - Databricks Spark SQL → Unity Catalog REST (Silver Layer)
    - Leverages auto-optimized clusters and liquid clustering at the catalog level for high-performance Parquet/Delta writes (a sketch of this step follows the list).
  - EMR + DBR → Unity Catalog REST → Silver Layer
    - EMR clusters use Unity Catalog credentials; transformations enjoy the same clustering optimizations.
  - Dataproc → Iceberg APIs → Silver Layer
    - Dataproc jobs read/write Iceberg via Unity Catalog, inheriting liquid clustering outside of the Spark runtime.
  - Snowflake → Iceberg REST → Unity Catalog
    - Snowflake writes native Iceberg tables; Unity Catalog manages access and governance across both worlds.
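To ground that Spark SQL silver-layer step, here is a minimal sketch that builds a silver table from the bronze stream above and upserts cleansed rows into it, declaring liquid clustering at table creation. Table names, columns, and JSON fields are placeholders, and the CLUSTER BY clause assumes a Delta runtime that supports liquid clustering.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bronze-to-silver").getOrCreate()

# Create the silver table once, with liquid clustering on a common filter column.
spark.sql("""
    CREATE TABLE IF NOT EXISTS main.silver.orders (
        order_id    STRING,
        customer_id STRING,
        amount      DECIMAL(18, 2),
        event_ts    TIMESTAMP
    )
    CLUSTER BY (event_ts)
""")

# Upsert cleansed bronze rows into silver (placeholder JSON fields).
spark.sql("""
    MERGE INTO main.silver.orders AS s
    USING (
        SELECT
            get_json_object(payload, '$.order_id')    AS order_id,
            get_json_object(payload, '$.customer_id') AS customer_id,
            CAST(get_json_object(payload, '$.amount') AS DECIMAL(18, 2)) AS amount,
            event_ts
        FROM main.bronze.orders_raw
    ) AS b
    ON s.order_id = b.order_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```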
And a big shout-out: at Databricks Summit 2025, Michelle Leon (Staff Product Manager) gave a fantastic live demo of Starburst, DuckDB, and Daft in Python, all reading from and writing to Unity Catalog, proving interoperability works end to end.
Current Pain Point – Schema Translations
This schema translation discussion is based on a presentation given by Eric Sun, Head of Data Platform at Coinbase, at Databricks Summit 2025. Even with open formats and a unified catalog, schema mismatches can trip you up:
Data Type Gaps:
- IDL formats (Avro, Thrift) lack complex SQL types like timestamp, decimal256, or enum.
- Iceberg does not natively support unsigned integers or custom precision floats (FP16).
Metadata Loss:
- Partitioning, indexing, and custom lineage annotations often vanish when data lands in the lake.
- OLTP systems have PKs and indexes; lakes rely on pushdown filters and clustering hints instead.
Polyglot Sources & Targets:
- MySQL → Delta, Oracle → Iceberg, Kafka → Parquet… point-to-point pipelines become spaghetti.
- Reverse-ETL (Lake → OLTP) demands robust translation back into richer schemas.
Toward a Standardized Schema Translation Service
Standardizing schema translation is essential for maintaining data consistency and reliability across heterogeneous systems. A common framework ensures that critical metadata, such as data types, precision, and semantic meaning, travels intact between sources and targets, reducing errors, accelerating integrations, and simplifying maintenance.
- Logical Types & Semantic Metadata
  - Define a “superset” logical schema (e.g., support enum, high-precision decimals) with a shared metadata contract (a hypothetical sketch follows this list).
  - Preserve semantic tags (e.g., user_id vs. order_id) so downstream agents and scientists understand context.
- Hub-and-Spoke Pattern
  - Ingest all sources into a central “Schema Master” service via REST/gRPC.
  - Evolve target lake schemas automatically and generate mapping snippets for pipelines.
  - Push validated schemas back to source and sink systems for consistency.
- Open Ecosystem Collaboration
  - Contribute translators for Protobuf unions, NoSQL JSON schemas, and streaming formats.
  - Build community-driven connectors so every major system speaks the same schema language.
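No such standard exists yet, so the following is a purely hypothetical sketch of what a “superset” logical field plus per-target type mapping could look like. The LogicalField contract and the Delta/Iceberg mappings below are illustrative assumptions, not a real library or Unity Catalog feature.

```python
from dataclasses import dataclass

@dataclass
class LogicalField:
    """Hypothetical 'superset' logical field carrying semantic metadata."""
    name: str
    logical_type: str       # e.g. "uint32", "enum", "decimal(38,9)", "timestamp_ntz"
    semantic_tag: str = ""  # e.g. "user_id", "order_id"; preserved for downstream users
    nullable: bool = True

# Illustrative mappings; a real service would cover far more cases and
# flag lossy conversions instead of silently widening types.
DELTA_TYPES = {"uint32": "BIGINT", "enum": "STRING", "timestamp_ntz": "TIMESTAMP_NTZ"}
ICEBERG_TYPES = {"uint32": "long", "enum": "string", "timestamp_ntz": "timestamp"}

def translate(fields: list[LogicalField], target: str) -> list[dict]:
    """Map logical fields to a target dialect, keeping semantic tags as metadata."""
    mapping = DELTA_TYPES if target == "delta" else ICEBERG_TYPES
    return [
        {
            "name": f.name,
            "type": mapping.get(f.logical_type, f.logical_type),
            "nullable": f.nullable,
            "metadata": {"semantic_tag": f.semantic_tag, "logical_type": f.logical_type},
        }
        for f in fields
    ]

schema = [
    LogicalField("user_id", "uint32", semantic_tag="user_id", nullable=False),
    LogicalField("status", "enum", semantic_tag="order_status"),
]
print(translate(schema, "delta"))
print(translate(schema, "iceberg"))
```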
A Call for Open Standardization
The industry must champion an open, unified schema translation standard that complements platforms like Unity Catalog. By embedding translation services directly into Unity Catalog, a capability Eric Sun advocated for at Databricks Summit 2025, organizations can ensure consistent type mappings, preserve semantic context, and streamline integrations across diverse systems. Such standardization will reduce complexity, foster innovation, and solidify data interoperability as the foundation for modern analytics and AI workloads.
Conclusion
Building an Open Lakehouse is not just about choosing Parquet over proprietary blobs; it is about adopting open table formats (Delta and Iceberg), centralized governance, and interoperable compute that travels with your data. Unity Catalog delivers the metadata backbone: one catalog, many engines, zero lock-in. As you design your pipelines, from Kafka streaming to Snowflake writes, keep an eye on schema translation tooling so that your data’s semantics stay intact from source to insight.
In a world where AI demands agility and experimentation, you need to move fast, and keeping your data free is non-negotiable. Start today by:
- Embracing open formats (Delta or Iceberg) for all your layers.
- Centralizing policies and lineage in a unified catalog.
- Building schema translation services that preserve precision and context.
Xorbix can help you leverage Unity Catalog to ensure your business maintains full control over its data. Our team will work with you to design open, scalable architectures, implement unified governance policies, and integrate seamlessly with your existing tools, so you can innovate confidently, scale efficiently, and avoid the risks of vendor lock-in.