Databricks Logs Explained: Where to Look When Things Break, From Driver to Delta

Author: Aishwarya Manoharan

20 May, 2026

Introduction

A Databricks job fails… or worse, it runs but performs poorly.

You open the workspace and face a familiar question:

Where do you start?

Driver logs? Spark UI? Executor logs? Query history?

Without a clear approach, it’s easy to jump between tabs and waste time chasing symptoms instead of root causes.

This guide provides a structured, layer-by-layer approach to Databricks logs, so you know exactly where to look, what each log tells you, and how it connects to real debugging scenarios and certification concepts.

It walks through each log type with:

  • What to look for 
  • What it tells us 
  • Exactly where to find it in the Databricks UI 

The Mental Model: Debugging Top-Down

Before diving into individual logs, anchor yourself in this hierarchy:

Cluster → Driver → Executor → Stage → Task → Table (Delta)

  1. Cluster: Infrastructure and lifecycle events 
  2. Driver: Job orchestration and failures 
  3. Executor: Parallel task execution and resource issues 
  4. Stage: Shuffle boundaries and data distribution 
  5. Task: Fine-grained execution and skew 
  6. Table (Delta): Data operations and history

Rule of Thumb

  • Debug top-down (start broad, narrow down) 
  • Optimize bottom-up (fix root causes at task level)

Core Log Types with Deep Interpretation

1. Cluster Event Logs

Scope: Cluster lifecycle

Where to find (UI path): Compute → Cluster → Event Log tab

What we see in the logs and how to interpret it

Cluster start / terminate events

  • Cluster starting → resources being provisioned 
  • Cluster terminated → job finished, failure, or idle timeout
  • Unexpected termination → check policies, spot/preemptible loss, or failures 

Autoscaling actions (scale up / down)

  • Scaling up (adding workers) → workload requires more parallelism 
  • Scaling down (removing workers) → cluster is underutilized 
  • Frequent scale up/down → unstable workload or poor partitioning 
  • No scale up despite load → autoscaling limits or misconfiguration 

Init script execution

  • Success → environment correctly configured 
  • Failure → dependency/setup issue (libraries, mounts, configs) 
  • Long execution time → slowing cluster startup 

Errors during cluster setup

  • Library install failure → dependency mismatch 
  • Node allocation failure → cloud capacity or quota issue 
  • Permission errors → IAM / role misconfiguration
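To get a quick overview of a busy event log, it can help to tally events by type. The sketch below assumes the events have already been exported (for example, through the Clusters API) as a list of dictionaries with a `type` field. The function name and the sample event names are illustrative and should be checked against what your workspace actually returns.

```python
from collections import Counter

def summarize_events(events):
    """Count cluster events by their `type` field."""
    return Counter(event.get("type", "UNKNOWN") for event in events)

# Hypothetical sample; real events carry more fields (timestamp, details, ...).
sample_events = [
    {"type": "STARTING"},
    {"type": "RESIZING"},
    {"type": "UPSIZE_COMPLETED"},
    {"type": "RESIZING"},
    {"type": "TERMINATING"},
]

for event_type, count in summarize_events(sample_events).most_common():
    print(f"{event_type}: {count}")
```

A type that dominates the tally (many resize events, or repeated terminations) is usually the place to start reading the individual entries.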

2. Driver Logs

Scope: Job orchestration

Where to find (UI path): Compute → Cluster → Driver Logs

OR

Workflows → Job → Run → Driver Logs

What we see in the logs and how to interpret it

SparkContext initialization

  • Successful init → cluster ready for execution 
  • Failure → configuration issue or incompatible settings 

Query planning and execution coordination

  • Logical/physical plan generation → Spark deciding execution strategy 
  • Long planning time → complex query or large schema 

Exceptions and stack traces

  • NullPointerException / AnalysisException → code or schema issue
  • Job aborted → failure in execution stage 
  • Repeated failures → systemic issue, not transient 

Broadcast join behavior

  • Broadcast created → small table optimized for join 
  • Broadcast too large → failure or fallback to shuffle join 

Driver OutOfMemory (OOM)

  • Large collect() or toPandas() → data pulled to driver 
  • Large broadcast → exceeds driver memory 
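Driver logs can be long, so a first pass that searches for known failure signatures often saves time. The following is a minimal sketch, assuming the log has been downloaded and split into lines. The signature table and the sample log lines are hypothetical and far from exhaustive.

```python
import re

# Signature -> likely meaning. Illustrative only; extend for your workloads.
SIGNATURES = {
    r"java\.lang\.OutOfMemoryError": "out of memory (check collect(), toPandas(), broadcasts)",
    r"AnalysisException": "code or schema issue",
    r"NullPointerException": "code issue (null handling)",
    r"Job aborted": "failure in an execution stage",
}

def scan_driver_log(lines):
    """Return (line_number, meaning) pairs for lines matching a known signature."""
    hits = []
    for number, line in enumerate(lines, start=1):
        for pattern, meaning in SIGNATURES.items():
            if re.search(pattern, line):
                hits.append((number, meaning))
    return hits

# Hypothetical log excerpt for illustration.
sample_log = [
    "INFO SparkContext: Running Spark",
    "ERROR: org.apache.spark.sql.AnalysisException: cannot resolve column",
    "ERROR: java.lang.OutOfMemoryError: Java heap space",
]

for number, meaning in scan_driver_log(sample_log):
    print(f"line {number}: {meaning}")
```

The scan only points at candidate lines. The surrounding stack trace still needs to be read to confirm the cause.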

3. Executor Logs

Scope: Worker nodes

Where to find (UI path): Compute → Cluster → Executors → stdout / stderr

OR

Spark UI → Executors → Logs

What we see in the logs and how to interpret it

Task execution logs

  • Normal execution → tasks distributed properly 
  • Repeated retries → instability or skew 

Memory usage and GC (Garbage Collection)

  • Frequent GC → memory pressure 
  • Long GC pauses → inefficient memory allocation 

Spill to disk (very important)

  • Spill occurs → memory insufficient for operation 
  • Heavy spill → performance degradation 
  • No spill → workload fits in memory 

Shuffle operations

  • Shuffle read/write → data redistribution across nodes 
  • Large shuffle → expensive joins/aggregations 

Executor failures

  • Executor lost → node crash or resource exhaustion 
  • Fetch failures → shuffle data unavailable 
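The Executors view reports task time and GC time per executor. One rough way to turn "frequent GC" into a number is the share of task time spent in garbage collection. In the sketch below, the function names and the 10% threshold are assumptions chosen for illustration, not Databricks settings.

```python
def gc_fraction(task_time_ms, gc_time_ms):
    """Share of an executor's task time that was spent in garbage collection."""
    if task_time_ms <= 0:
        return 0.0
    return gc_time_ms / task_time_ms

def under_memory_pressure(task_time_ms, gc_time_ms, threshold=0.10):
    """True when GC takes more than `threshold` of task time (assumed rule of thumb)."""
    return gc_fraction(task_time_ms, gc_time_ms) > threshold

# Hypothetical numbers read off the Executors view.
print(under_memory_pressure(task_time_ms=600_000, gc_time_ms=120_000))
print(under_memory_pressure(task_time_ms=600_000, gc_time_ms=12_000))
```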

4. Stage Logs (Spark UI)

Scope: Stage-level execution

Where to find (UI path): Workflows → Job → Run → Spark UI → Stages tab

What we see in the logs and how to interpret it

Shuffle read size

  • Large read → heavy dependency on previous stage 
  • Skewed read → uneven data distribution 

Shuffle write size

  • Large write → expensive transformation (join/groupBy) 
  • Small write → efficient stage 

Stage duration

  • Long duration → bottleneck stage 
  • Short duration → efficient processing 

Task distribution within stage

  • Even distribution → balanced workload 
  • Uneven distribution → data skew 

Stage retries

  • Retry occurred → transient failure or instability 
  • Multiple retries → deeper issue (data or infra) 

5. Task Logs

Scope: Individual tasks

Where to find (UI path): Spark UI → Stages → Select Stage → Tasks

What we see in the logs and how to interpret it

Task execution time

  • Uniform times → balanced partitions 
  • One task much slower → skew 

Input size

  • Large input → heavy partition 
  • Uneven input → skew 

Output size

  • Large output → data expansion 
  • Small output → filtering or aggregation 

Spill (memory → disk)

  • Spill present → memory insufficient 
  • Heavy spill → tuning needed (memory, partitions) 

Locality level

  • PROCESS_LOCAL / NODE_LOCAL → data is read close to where the task runs (efficient execution)
  • RACK_LOCAL / ANY → remote reads and network overhead
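The "one task much slower" signal can be made concrete by comparing the slowest task with the median task in a stage. This is a minimal sketch, assuming the task durations have been copied from the Tasks table. The function names and the threshold of 5x are arbitrary illustrative choices.

```python
from statistics import median

def skew_ratio(durations_s):
    """Ratio of the slowest task duration to the median task duration."""
    if not durations_s:
        raise ValueError("need at least one task duration")
    mid = median(durations_s)
    return max(durations_s) / mid if mid > 0 else float("inf")

def looks_skewed(durations_s, threshold=5.0):
    """True when the slowest task is at least `threshold` times the median."""
    return skew_ratio(durations_s) >= threshold

# Hypothetical task durations in seconds.
balanced = [10, 11, 9, 10, 12]
skewed = [10, 11, 9, 10, 95]

print(looks_skewed(balanced), looks_skewed(skewed))
```

The same comparison works for input size per task: a partition far larger than the median is the usual cause of the slow task.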

6. SQL Query History

Scope: SQL queries

Where to find (UI path): SQL Warehouses → Query History

OR

SQL Editor → Query History

What we see in the logs and how to interpret it

Query execution time

  • Long time → inefficient query 
  • Short time → optimized execution 

Query plan

  • Simple plan → efficient execution 
  • Complex plan → multiple joins/aggregations 

Photon usage

  • Photon enabled → optimized engine 
  • Photon not used → missed optimization opportunity 

7. Delta Table History

Scope: Table-level operations

Where to find: DESCRIBE HISTORY table_name

What we see in the logs and how to interpret it

Write operations

  • Frequent small writes → small file problem 
  • Batched writes → efficient ingestion 

MERGE operations

  • Frequent merges → upsert-heavy workload 
  • Large merges → performance cost 

OPTIMIZE operations

  • Regular optimize → good file compaction 
  • Missing optimize → degraded read performance 

VACUUM operations

  • Performed → storage cleanup 
  • Not performed → storage bloat 
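The checks above can be sketched as a small review function over the rows returned by DESCRIBE HISTORY. The sketch assumes each row has been converted to a dictionary with `version`, `operation`, and `operationMetrics` keys, and that write metrics include `numFiles` and `numOutputBytes` as strings. The function name, the 32 MB threshold, and the sample rows are hypothetical.

```python
def review_history(rows, small_file_bytes=32 * 1024 * 1024):
    """Flag small-file writes and missing maintenance in table history rows."""
    findings = []
    operations = {row["operation"] for row in rows}
    for row in rows:
        if row["operation"] != "WRITE":
            continue
        metrics = row.get("operationMetrics") or {}
        files = int(metrics.get("numFiles", 0))
        size = int(metrics.get("numOutputBytes", 0))
        # Average file size per write is a rough proxy for the small file problem.
        if files and size / files < small_file_bytes:
            findings.append(
                f"version {row['version']}: small files (avg {size // files} bytes)"
            )
    if "OPTIMIZE" not in operations:
        findings.append("no OPTIMIZE in history")
    if not any(op.startswith("VACUUM") for op in operations):
        findings.append("no VACUUM in history")
    return findings

# Hypothetical history rows for illustration.
sample_rows = [
    {"version": 3, "operation": "WRITE",
     "operationMetrics": {"numFiles": "200", "numOutputBytes": "2000000"}},
    {"version": 2, "operation": "MERGE", "operationMetrics": {}},
    {"version": 1, "operation": "OPTIMIZE", "operationMetrics": {}},
]

for finding in review_history(sample_rows):
    print(finding)
```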

8. Ganglia Metrics

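Scope: Cluster resource usage (Ganglia on older Databricks Runtime versions; newer runtimes show built-in compute metrics in the same tab)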

Where to find (UI path): Compute → Cluster → Metrics tab

What we see in the logs and how to interpret it

CPU usage

  • High CPU → compute-bound workload 
  • Low CPU → underutilization 

Memory usage

  • High memory → risk of spill/OOM 
  • Low memory → over-provisioned cluster 

Network I/O

  • High network → heavy shuffle 
  • Low network → minimal data movement 

9. Audit Logs

Scope: Workspace-level activity

Where to find: Admin Console → Audit Logs

OR

Cloud Storage (log delivery)

What we see in the logs and how to interpret it

User actions

  • Frequent access → active usage 
  • Unexpected access → potential security issue 

Permission changes

  • Changes detected → governance activity 
  • Unauthorized changes → security risk 

10. Streaming Query Logs

Scope: Structured Streaming

Where to find: Notebook → query.lastProgress

OR

Spark UI → Structured Streaming tab

What we see in the logs and how to interpret it

Input rows per second

  • High input → heavy ingestion rate 
  • Increasing input → growing load 

Processed rows per second

  • Matches input → system keeping up 
  • Lower than input → backlog forming 

Batch duration

  • Increasing duration → system under stress 
  • Stable duration → healthy pipeline 

Latency

  • High latency → delayed processing 
  • Low latency → near real-time 
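The input-versus-processed comparison can be sketched as a small check on the dictionary returned by query.lastProgress. The sketch assumes the `inputRowsPerSecond` and `processedRowsPerSecond` fields are present. The function name, the returned labels, and the sample values are illustrative.

```python
def streaming_health(progress):
    """Classify a lastProgress-style dict by comparing input and processing rates."""
    if progress is None:
        # lastProgress is empty until the first micro-batch completes.
        return "no progress yet"
    input_rate = progress.get("inputRowsPerSecond") or 0.0
    processed_rate = progress.get("processedRowsPerSecond") or 0.0
    if input_rate == 0:
        return "idle"
    if processed_rate < input_rate:
        return "backlog forming"
    return "keeping up"

# Hypothetical progress snapshot; a real one has many more fields.
sample_progress = {
    "batchId": 42,
    "inputRowsPerSecond": 1000.0,
    "processedRowsPerSecond": 600.0,
}
print(streaming_health(sample_progress))
```

A single snapshot can mislead, so it is more reliable to apply the check across several recent batches and look for a sustained pattern.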

Debugging Scenarios: Putting It All Together

Real-world issues are rarely obvious. The following scenarios show how to use these logs together to diagnose common but tricky problems.

Autoscaling Not Working as Expected

Symptom

Our job is slow, and we expect Databricks to add more workers, but it doesn’t.

Think of it like this:

Autoscaling should bring in more “workers” when there’s too much work. If it doesn’t, our job stays slow because not enough machines are helping.

Where to look: Compute → Cluster → Event Log tab

What to look for and what it tells us

No “scaling up” events

We don’t see messages about adding workers, which means autoscaling is not being triggered.

Possible reasons:

  • Max workers limit already reached 
  • Not enough pending tasks (Spark doesn’t think it needs more workers) 

Frequent scale up and scale down

Workers are added and removed repeatedly.

  • Workload is unstable or uneven 
  • Often caused by poor partitioning or bursty jobs 
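One way to quantify this flapping is to count how often the cluster switches between growing and shrinking. The sketch below assumes the worker count after each resize has been noted from the event log, in time order. The function name and the sample sequences are hypothetical.

```python
def count_direction_changes(worker_counts):
    """Count switches between scaling up and scaling down in a worker-count series."""
    changes = 0
    previous = 0  # +1 while growing, -1 while shrinking, 0 before the first resize
    for before, after in zip(worker_counts, worker_counts[1:]):
        if after == before:
            continue
        direction = 1 if after > before else -1
        if previous and direction != previous:
            changes += 1
        previous = direction
    return changes

# Hypothetical worker counts over the life of a job.
steady = [2, 4, 6, 8, 8, 8]
flapping = [2, 6, 2, 6, 2, 8]

print(count_direction_changes(steady), count_direction_changes(flapping))
```

A steady ramp produces no direction changes, while a workload that repeatedly gains and loses workers produces several within a single run.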

Scaling happens too late

Workers are added, but only after the job is already slow.

  • Autoscaling is reacting, but too slowly 

What we check next (practical steps)

1. Check cluster limits (very first step)

Go to: Compute → Cluster → Configuration

Look at:

  • Min workers 
  • Max workers 

If max workers is already reached, autoscaling cannot scale further.

If min workers is too low, scaling may start too late.

2. Check if there are enough tasks to trigger scaling

Go to: Spark UI → Stages → Tasks

If we see only a few tasks running:

  • Spark does not need more executors 

Fix: increase the number of partitions (for example, with repartition())

3. Check task parallelism vs cluster size

If we have:

  • 10 tasks 
  • 20 workers 

At least half the workers will sit idle, since 10 tasks can keep at most 10 workers busy.

Autoscaling will not scale up because it is not needed.
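The arithmetic behind this check is simple enough to sketch. The example assumes each worker core provides one task slot, which is the common default. The function name and the `cores_per_worker` parameter are introduced here for illustration.

```python
def slot_utilization(tasks, workers, cores_per_worker=1):
    """Fraction of task slots that the given number of tasks can keep busy."""
    total_slots = workers * cores_per_worker
    if total_slots <= 0:
        raise ValueError("cluster must have at least one task slot")
    return min(1.0, tasks / total_slots)

# The example from the text: 10 tasks on 20 workers.
print(slot_utilization(tasks=10, workers=20))
# With 4 cores per worker, the same 10 tasks use an even smaller share.
print(slot_utilization(tasks=10, workers=20, cores_per_worker=4))
```

When utilization is well below 1.0, adding workers cannot speed the stage up. Increasing the number of partitions so that tasks at least match the available slots is the more useful change.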

4. Check workload pattern (spiky vs steady)

If tasks appear in bursts:

  • Autoscaling may scale up and immediately scale down 

Fix:

  • Improve partitioning 
  • Avoid uneven workloads

5. Check stage behavior (hidden bottleneck)

Go to: Spark UI → Stages

If one stage is slow but runs only a few tasks (little parallelism):

  • Autoscaling cannot help 

What this tells us

Autoscaling depends on how much parallel work Spark can see.

If our job doesn’t expose enough parallelism, or if limits are too tight, scaling won’t behave the way we expect.

Final Takeaway

Each log answers a different question, but only if we interpret it correctly.

  • Cluster tells us if infrastructure is healthy 
  • Driver tells us why the job failed 
  • Executors tell us how work is executed 
  • Stages and Tasks tell us where performance breaks 
  • Delta tells us what happened to our data 

Debug top-down. Optimize bottom-up.

