Important Spark Features from Databricks Data+AI Summit 2025: Spark Connect and PySpark 4.0 DataSource API
Author: Tyler Faulkner
25 June, 2025
Introduction
The 2025 Databricks Data+AI Summit focused primarily on exciting new features on the Databricks platform, including the new Lakebase functionality, which brings true OLTP databases to Databricks, and the new free edition of Databricks, allowing more people to get their hands on the platform (especially in the education world). While these are great features, the core power behind Spark has also gained new capabilities that may be even more impactful to the everyday workflow of engineers and developers.
I was able to personally attend two sessions during the conference, one covering Spark Connect, which transforms the entire Spark paradigm into a client-server architecture by abstracting the compute layer and allowing any language or app to interact directly with Spark.
The other session demonstrated how to create a custom data connector for Spark directly in Python, enabling engineers to easily source custom data types without needing to interact with Java or Scala directly.
Spark Connect
To fully understand the revolution that is Spark Connect, it is important to review Spark's compute architecture and how it works. In the classic model, application code interacts directly with the driver node and executes in a JVM context (PySpark utilizes Java under the hood). While this often works perfectly fine with a single user or application interacting with a driver, many issues begin to arise once multiple users or applications connect to the same driver at the same time.
Four core issues come up with multi-tenancy. First, out-of-memory errors crash the entire driver, which stops execution for all tenants. Second, conflicts arise when tenants require incompatible dependencies. Third, upgrading drivers or clusters becomes very difficult because of the direct link between code and compute. Finally, it is challenging to debug issues when multiple tenants are executing against the same driver.
Serverless compute is often touted as a solution to these problems; however, serverless compute is more of a band-aid than a fix for the core architectural issues. Spark Connect, on the other hand, presents a new client-server architecture that solves the issues noted above.
This new architecture directly addresses the problem of multi-tenancy on clusters, rather than simply masking it with serverless. If you have ever used the Databricks Connect extension in VS Code to execute notebooks locally, you’ve already used Spark Connect and seen its value.
How Does Spark Connect Work?
A Spark Connect “client” is standard Spark code with a connection attribute indicating that Spark Connect is in use and pointing to the Spark Connect server. The Spark Connect “server” is an API layer that sits between clients and the Spark driver.
All code is executed on the source machine—the only part sent to the server is an unresolved query plan, which is language agnostic. When the server receives this unresolved query plan, it is forwarded to the driver, where the driver resolves and executes the query.
Effectively, this means all code now runs on the client machine rather than on the Spark driver node. All the driver needs to do is process data. Finally, the results are sent back to the client using gRPC and Arrow.
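To make this concrete, here is a minimal sketch of a Python client pointing at a Spark Connect server; the sc:// address is a placeholder for wherever your server actually runs:
from pyspark.sql import SparkSession

# Connect to a Spark Connect server instead of a local JVM driver.
# The address below is a placeholder; the default Spark Connect port is 15002.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

# The DataFrame API is unchanged; only the unresolved query plan
# is sent over gRPC, and results come back as Arrow batches.
df = spark.range(10).filter("id % 2 = 0")
df.show()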
What is in it for me?
This true separation of code and compute directly addresses the four key issues in the standard Spark compute architecture. Out-of-memory issues no longer affect the driver—they happen on the client, allowing the driver to keep running even if one tenant’s process fails.
Dependency management is handled on the client machine, preventing conflicts when multiple pieces of code are executed in the same driver. Since code executes on the client’s machine, observability is much easier, as logs appear locally instead of on the driver node. Finally, upgrades become easier since nodes can be migrated independently without affecting existing jobs or applications.
Note that there are a handful of Spark APIs not supported with Spark Connect: RDDs, SparkContext, and SparkML. Authentication is also not included in the Spark Connect server, so it must be implemented with a reverse proxy.
If your needs don’t require the unsupported Spark APIs, you can begin using Spark Connect today. It allows developers to debug code locally and enables new languages to work with Spark, so production applications can run Spark and serve live data. Languages that now support Spark through this new paradigm include, but are not limited to, Swift, Go, and .NET. This effectively allows Spark to run nearly anywhere.
PySpark DataSource API
One major feature in the recent Spark 4.0 release that I’m especially excited about is the PySpark DataSource API. 100% of the work I do is in PySpark, and while most of my needs are met, some flexibility is lost compared to Java/Scala implementations. One of those limitations was the inability to create custom data formats for loading custom data types with Spark-native syntax, forcing clunky workarounds for tasks like reading API data with Spark.
A data format is the unique string you provide to the .format() function to denote what kind of file or data you’re reading. With this new API, you are now able to create custom formats to load new data types without the need for inefficient workarounds that don’t feel very “Spark-y.”
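For example, reading a CSV file uses the built-in "csv" format string; the path here is just a placeholder:
spark.read.format("csv").option("header", "true").load("/path/to/file.csv")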
The best way to illustrate the power of this API is to walk through a small example where I’ll cover a basic implementation of an Excel data format for Spark. I’m going to provide pseudocode for configuring an Excel batch read process for brevity. Streaming reads and writes are also supported, and I suggest reviewing the PySpark documentation to further your learning.
Below is the pseudocode for a custom Python class that implements the pyspark.sql.datasource.DataSource interface:
from pyspark.sql.datasource import DataSource, DataSourceReader
from pyspark.sql.types import StructType


class ExcelDataSource(DataSource):
    @classmethod
    def name(cls):
        return "excel"  # unique name used in the format() call

    def schema(self):
        # Default schema to use for the data source
        # Can be overridden using the .schema() Spark method
        return "name string, dept string"

    def reader(self, schema: StructType) -> DataSourceReader:
        return ExcelDataSourceReader(schema, self.options)
Below is an example pseudocode implementation of the ExcelDataSourceReader:
from typing import Sequence
import os

import pandas as pd

from pyspark.sql.datasource import DataSourceReader, InputPartition
from pyspark.sql.types import StructType


class ExcelDataSourceReader(DataSourceReader):
    def __init__(self, schema, options):
        self.schema: StructType = schema
        self.options = options
        self.path = options.get("path")  # internal name for the load() parameter

    def partitions(self) -> Sequence[InputPartition]:
        # Determine the number of tasks for Spark to execute
        # This logic can be extended for dynamic partitioning
        if self.path.endswith(".xlsx"):
            return [InputPartition(self.path)]
        partitions = []
        for dirpath, _, filenames in os.walk(self.path):
            for filename in filenames:
                partitions.append(InputPartition(f"{dirpath}/{filename}"))
        return partitions

    def read(self, partition):
        # Specific logic to read one partition
        # This can use any Excel library (pandas here)
        df = pd.read_excel(partition.value)
        for _, row in df.iterrows():
            yield tuple(row)
Register the data source
Running the following commands will register the custom data source with the Spark session and use it:
spark.dataSource.register(ExcelDataSource)
Read from a custom data source
spark.read.format("excel").load("c:/path/to/folder").show()
Now we have successfully created a custom Excel data source for Spark that can easily integrate into any existing Spark flow. This example barely scratches the surface of the flexibility offered by this API.
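As a taste of what else is possible, here is a hedged sketch of what a batch writer for the same source could look like, using the DataSourceWriter and WriterCommitMessage interfaces from pyspark.sql.datasource. Treat it as pseudocode in the same spirit as the reader above; the file-naming scheme and the pandas/openpyxl dependency are my own assumptions:
from pyspark import TaskContext
from pyspark.sql.datasource import DataSourceWriter, WriterCommitMessage

import pandas as pd


class ExcelDataSourceWriter(DataSourceWriter):
    def __init__(self, options, overwrite):
        self.path = options.get("path")
        self.overwrite = overwrite

    def write(self, iterator):
        # Called once per partition; the iterator yields that partition's Rows
        rows = [row.asDict() for row in iterator]
        # Illustrative naming scheme: one file per partition (assumes pandas + openpyxl)
        part_id = TaskContext.get().partitionId()
        pd.DataFrame(rows).to_excel(f"{self.path}/part-{part_id}.xlsx", index=False)
        return WriterCommitMessage()

# ExcelDataSource would also need a writer() method, for example:
# def writer(self, schema, overwrite):
#     return ExcelDataSourceWriter(self.options, overwrite)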
If you wish to learn more about the Spark DataSource API and how to implement writing and streaming, please read this documentation. Please note that to utilize this functionality in Databricks solutions, the runtime version must be set to 15.2 or above.
Conclusion
The newest Spark features are pushing the ecosystem toward broader accessibility and flexibility. With tools like Spark Connect and the PySpark DataSource API, engineers can now build scalable data solutions using familiar languages, no deep JVM knowledge required. This helps lower the barrier to entry and opens up Spark development to a wider range of teams and use cases.
At Xorbix Technologies, staying current on emerging tools and best practices is just part of how we operate. It’s how we continue to deliver modern, efficient solutions, and why we’re always watching where Spark goes next.