The Truth Behind the Spark Connector for Microsoft Fabric: Performance, Cost, and Security Risks
Author: Ryan Shiva
7 April 2025
Fine-grained access control is a fundamental requirement for data scientists and engineers working with sensitive data. Consider a health insurance company with millions of patient records it wants to use to train ML models and feed its analytics tools, with the goal of reducing costs and improving patient outcomes. That data is certain to contain protected health information regulated by HIPAA, and the organization must ensure it does not expose sensitive information to unauthorized users through its downstream analytics and models.
This is why fine-grained access control is essential. Row-level security (RLS) ensures team members only see patient records they’re authorized to access. Column-level security (CLS) masks sensitive fields like social security numbers or specific diagnoses based on user roles.
Unfortunately, Microsoft Fabric falls short by failing to provide row-level and column-level security for OneLake data accessed via notebooks. This is where the Spark connector for Microsoft Fabric Data Warehouse and SQL endpoint steps in: a workaround that attempts to plug this gap. As you will soon see, the Spark connector does not ensure your data is secure, because users can simply query the data directly with Spark. It’s like a securely locked door standing alone in an open field: you can walk around it, making the security completely ineffective.
The Spark Connector: A Small Bandage on a Large Wound
The Spark connector provides a mechanism for Spark developers and data scientists to access and manipulate data stored in a Fabric Data Warehouse or a lakehouse. This method can enforce fine-grained access control policies defined at the warehouse or the SQL endpoint level.
How It Works
The Spark connector comes preinstalled as part of the Fabric runtime environment. At its core, the connector relies on the synapsesql method to enable interaction with data:
synapsesql(tableName:String="<Part 1.Part 2.Part 3>") => org.apache.spark.sql.DataFrame
This setup allows Spark to interact with the data through a SQL endpoint or a warehouse, which acts as an intermediary. However, there is no way to ensure that users will always use this method to read data. Users can bypass the connector entirely and use Spark to access the data directly, circumventing the fine-grained access controls defined at the warehouse or SQL endpoint level. This fundamental weakness underscores the limitations of relying on this workaround to secure sensitive data.
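To make the two access paths concrete, here is a minimal Scala sketch of what each looks like in a Fabric notebook. The warehouse, lakehouse, and table names are hypothetical, and the import shown reflects the connector’s commonly documented implicits package, which may vary across Fabric runtime versions:
// Path 1: read through the connector. The query runs via the warehouse / SQL endpoint,
// so any RLS/CLS policies defined there are applied to the result.
// Import path is the commonly documented one and may differ by runtime version.
import com.microsoft.spark.fabric.tds.implicits.read.FabricSparkTDSImplicits._
val governedDf = spark.read.synapsesql("PatientsWarehouse.dbo.claims") // hypothetical <warehouse>.<schema>.<table>
// Path 2: read the same data directly with Spark against the lakehouse table.
// Nothing forces a user down Path 1, and this read never touches the SQL endpoint,
// so the endpoint-level security policies are simply never evaluated.
val ungovernedDf = spark.sql("SELECT * FROM patients_lakehouse.claims") // hypothetical lakehouse table
Both reads return an ordinary DataFrame; only the first one ever consults the endpoint’s security policies.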
The Price of a Workaround
1. Subpar Performance
The Spark connector workaround slows performance. Because it relies on a basic JDBC connection to the SQL endpoint, it severely limits Spark’s parallelism. Reading big datasets becomes significantly slower, and essential optimizations, such as metadata-only queries, are absent. The connector is unsuitable for large-scale operations, especially when the transformations cannot be pushed down to the SQL endpoint.
To illustrate the performance degradation when using the SQL endpoint connector compared to Spark, here are two different analytical queries:
Query 1: Running a simple count query on a relatively big dataset (8 billion rows) using Spark and the Spark SQL endpoint connector shows a huge performance difference. It took only 18 seconds on Spark, but 1,315 seconds using the connector. This is 73x slower performance compared to Spark. The screenshots below show the results of the test.
Spark on its own (fast)
Spark SQL endpoint connector (slow)
Query 2: Here’s a standard analytical query that joins fact and dimension tables, filters data, and performs aggregation:
SELECT dt.d_year,
item.i_brand_id AS brand_id,
item.i_brand AS brand,
SUM(ss_sales_price) AS sum_agg
FROM date_dim dt,
store_sales,
item
WHERE dt.d_date_sk = store_sales.ss_sold_date_sk
AND store_sales.ss_item_sk = item.i_item_sk
AND item.i_manufact_id = 816
AND dt.d_moy = 11
GROUP BY dt.d_year,
item.i_brand,
item.i_brand_id
ORDER BY dt.d_year,
sum_agg DESC,
brand_id
LIMIT 100;
The query took 1 minute and 34 seconds to execute on Spark, but 1 hour and 18 minutes using the Spark SQL endpoint connector. This dramatic 49x slowdown highlights the inefficiency of the connector for complex analytical tasks. The screenshots below show the results of the test.
Spark on its own (fast)
Spark SQL endpoint connector (slow)
These constraints render the connector inefficient and impractical for high-performance data processing. Ironically, high-performance large-scale data processing is one of the primary reasons organizations use Spark. This workaround, however, diminishes Spark’s core capabilities, reducing it to a shadow of its intended power.
2. Unnecessary, Duplicative Costs
The connector not only severely impacts performance but also increases costs. It slows down queries, requires more processing time, and demands more expensive compute resources, driving up the overall cost of data operations. Notebook queries executed against a SQL endpoint or a warehouse consume both Spark and SQL endpoint vCores, resulting in substantial resource usage. SQL endpoint and warehouse compute is four times more expensive than Spark compute per core: 1 Capacity Unit (CU) buys 0.5 SQL endpoint cores but 2 Spark vCores.
This dual resource consumption and the slower performance inflate capacity usage, potentially increasing the risk of throttling the Fabric capacity. When capacity is throttled to the point of being frozen, users encounter errors for any actions requiring Fabric compute resources, effectively halting their workloads until the capacity debt is resolved.
To further emphasize the cost implications, let’s revisit Query 2 from the performance analysis section. The Spark query consumed 1,510 CUs, costing $0.0755. In contrast, the Spark SQL endpoint connector query consumed 40,360 CUs for Spark (costing $2.018) plus an additional 8,110 CUs for the SQL endpoint (costing $0.4055), for a total of $2.4235, making it approximately 32x more expensive than the Spark query.
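As a rough sanity check, the following Scala sketch reproduces that arithmetic. It assumes the CU figures above are CU-seconds billed at the rate implied by those same figures (roughly $0.18 per CU-hour); actual rates vary by region and SKU:
// Back-of-the-envelope check of the cost comparison above.
// Assumption: CU figures are CU-seconds billed at the rate implied by the numbers
// above (~$0.18 per CU-hour); your region and SKU may differ.
val dollarsPerCuSecond = 0.18 / 3600 // ~= $0.00005
val sparkOnlyCost = 1510 * dollarsPerCuSecond // ~= $0.0755
val connectorCost = (40360 + 8110) * dollarsPerCuSecond // ~= $2.42 (Spark + SQL endpoint CUs)
println(f"Ratio: ${connectorCost / sparkOnlyCost}%.1fx") // ~= 32x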
3. Reliability Conundrum
The Spark connector for SQL endpoint in Microsoft Fabric introduces significant reliability challenges that data engineers and analysts need to be aware of. One of the main issues is the inconsistent data synchronization between Delta tables in the data lake and their corresponding SQL endpoint representations. This discrepancy can lead to a frustrating scenario where fresh data is readily accessible via Spark but remains invisible when queried through the SQL endpoint.
While users can manually force a metadata refresh using the Fabric portal to sync the data, this workaround is far from ideal in automated data pipeline scenarios. As of now, there’s no built-in method to automate this refresh process within data pipelines, which poses a significant obstacle for implementing robust medallion architectures. Consider a pipeline designed to feed data through bronze, silver, and gold layers: newly ingested data in the bronze layer might not propagate downstream if subsequent tasks using the connector fail to see the recently added records.
Even if automatic syncing becomes available in the future, it could introduce new complications. Syncing operations for non-optimized Delta tables or environments with a large number of tables can cause substantial delays in pipeline execution. In extreme cases, data engineers might find themselves forced to migrate lakehouses to different workspaces to mitigate these performance issues.
4. Security and Governance: Hoping for Compliance, Delivering Chaos
The connector was supposed to solve the fine-grained access control challenge. However, even with all the problems the Spark SQL endpoint connector introduces (performance, cost, reliability), it still doesn’t fully address fine-grained access control. The major limitation is the inability to enforce the use of the connector in notebooks. Users can bypass it by reading data directly with Spark, ignoring the fine-grained access controls defined at the SQL endpoint level. If a user has access to the data in OneLake, nothing prevents them from accessing it via Spark.
Admins must ensure that users do not have access to any dataset at the OneLake level and provide access only through the SQL endpoint. This might be achievable by sharing the lakehouse without granting item-level permissions and using T-SQL to define object, column, and row-level security. However, there are scenarios where this can be bypassed.
For example, a user who needs write access within the same workspace will require the contributor role, which inherently grants them read access to lakehouse data at the OneLake level. This access enables them to bypass the security defined at the endpoint level.
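To see how little stands in the way, here is a minimal sketch of what such a contributor could run in a notebook. The workspace, lakehouse, and table names are hypothetical, and the path follows the general OneLake ABFS pattern:
// A workspace contributor can read the Delta table straight from OneLake.
// This read bypasses the SQL endpoint entirely, so RLS/CLS defined there never applies.
// All names below are hypothetical; the path follows the general OneLake ABFS pattern:
// abfss://<workspace>@onelake.dfs.fabric.microsoft.com/<lakehouse>.Lakehouse/Tables/<table>
val bypassedDf = spark.read
  .format("delta")
  .load("abfss://HealthAnalytics@onelake.dfs.fabric.microsoft.com/Patients.Lakehouse/Tables/claims")
bypassedDf.show() // full rows and columns, regardless of endpoint-level policies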
Ultimately, this makes fine-grained access control optional, and optional security is effectively no security. Organizations cannot afford to rely on mere hopes that users will comply. A robust, enforceable solution is critical to safeguarding sensitive data.
Conclusion: Yet Another Security Loophole
The Spark connector for Microsoft Fabric Data Warehouse is not a robust solution – it’s unfortunately a flawed workaround. While it addresses fine-grained access control in certain scenarios, it remains limited in its ability to handle all use cases, and the trade-offs in cost, performance, and reliability are too severe to ignore. Organizations relying on this connector must prepare for inflated expenses, sluggish performance, and potential security breaches.
Unlike Microsoft Fabric’s approach, other platforms have successfully implemented comprehensive solutions for this fundamental security requirement. For instance, Databricks’ Unity Catalog provides native fine-grained access control at both row and column levels without forcing users into performance-degrading workarounds. Unity Catalog seamlessly enforces these controls regardless of how users access the data – whether through SQL, Spark, or other interfaces – ensuring that security policies can’t be bypassed.
For a sustainable and scalable approach, Microsoft must address these shortcomings in Fabric with native support for fine-grained access control rather than relying on easily circumvented workarounds that compromise performance, reliability, and security.