MLOps in Databricks: Designing a Structured and Repeatable Machine Learning Lifecycle
Author: Andrew McQueen
25 February 2026
Introduction
I recently gave a presentation on Machine Learning Operations (MLOps) in Databricks at one of Xorbix Technologies’ Databricks User Group meetings. The demo portion of the presentation was a simple walkthrough from data ingestion to modeling on the platform. There are a few topics covered here that were left out of that presentation. This blog will explore the machine learning lifecycle and how features available in Databricks support a structured MLOps workflow.
What is MLOps?
As Databricks defines it in their Big Book of MLOps, MLOps is “the set of processes and automation for managing data, code, and models to improve performance, stability, and long-term efficiency in ML systems.” This includes the typical processes of exploratory data analysis and wrangling, feature engineering, model experimentation and validation, deployment, and monitoring. Our end goal is to create a lifecycle which is repeatable, traceable, and operational.
The Medallion Architecture
Before we get into any machine learning, it is worth outlining the Medallion Architecture, which frames the entire MLOps process. From ingestion to downstream consumption, this architecture standardizes our data, separates it into logical stages, and supports the ML lifecycle. There are three data layers: bronze, silver, and gold. At each stage, we aim to produce consistent data that can be reused across different tasks.
The bronze layer holds our raw data. This data is kept exactly as it arrives, without transformation or cleaning. It serves as the source of truth and allows us to trace any downstream results back to their origin.
The silver layer is our cleaning and validation layer. The goal of this stage is to produce data that is reliable enough to be used in model training. We move from raw, potentially messy data to cleaned and deduplicated tables with enforced schemas. Standardizing columns, handling missing values and invalid records, and performing necessary joins should happen here. By the end of this stage, our data should be consistent, validated, and structured.
The gold layer is for business-ready data. In traditional business intelligence workflows, this layer contains the tables used in dashboards and reporting. In a machine learning workflow, this is where we take the clean data from the silver layer and transform it into model-ready features. This stage contains our feature engineering, including aggregations, derived features such as lags and seasonality signals, and any transformations needed to produce a consistent training dataset. Now that we have outlined the architecture, we can go through each step of the machine learning process.
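To make the flow between layers concrete, here is a minimal sketch of a bronze-to-silver step in plain Python. The records and field names are hypothetical; in Databricks this logic would normally be PySpark transformations writing Delta tables, but the shape of the stage is the same.

```python
# Hypothetical "bronze" records: raw events, possibly duplicated or incomplete.
bronze = [
    {"order_id": "A1", "amount": "120.50", "region": "WI"},
    {"order_id": "A1", "amount": "120.50", "region": "WI"},   # duplicate
    {"order_id": "A2", "amount": None,     "region": "wi"},   # missing amount
    {"order_id": "A3", "amount": "88.00",  "region": "IL"},
]

def to_silver(records):
    """Deduplicate, enforce types, standardize columns, drop invalid rows."""
    seen, silver = set(), []
    for r in records:
        if r["order_id"] in seen:
            continue                       # deduplicate on the business key
        seen.add(r["order_id"])
        if r["amount"] is None:
            continue                       # drop records failing validation
        silver.append({
            "order_id": r["order_id"],
            "amount": float(r["amount"]),  # enforce the schema's types
            "region": r["region"].upper()  # standardize categorical values
        })
    return silver

silver = to_silver(bronze)
# silver -> [{'order_id': 'A1', 'amount': 120.5, 'region': 'WI'},
#            {'order_id': 'A3', 'amount': 88.0, 'region': 'IL'}]
```

The important property is that the bronze records are never modified; the silver table is always derived from them, so any silver row can be traced back to its raw origin.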
Exploratory Data Analysis
Exploratory Data Analysis (EDA) is a core component of any ML project. Our focus in MLOps is not to eliminate ad-hoc analysis, but to transition from an isolated notebook to a structured pipeline. None of this analysis goes away: a notebook with plotted distributions, correlation matrices, outlier identification, and exploratory aggregations will still exist for discovery. We simply want to transfer its findings into something reproducible.
However, in a traditional workflow, many of these transformations remain embedded in a notebook. They may be rerun manually, adjusted over time, or forgotten entirely. This makes it difficult to reproduce results, validate assumptions, or scale the solution.
When moving from discovery to production, the findings should be implemented in whichever pipeline stage fits our architecture. To make the process operational, we move transformations into our silver or gold layers, assumptions into formal data validation rules, and exploratory aggregations into the gold layer as structured feature logic. In doing so, the notebook becomes an experimentation environment, while the pipeline becomes the source of truth.
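One way to formalize EDA assumptions is to express them as named validation rules that run inside the pipeline. The rules and values below are hypothetical, and in Databricks they would more naturally be Delta Live Tables expectations; plain Python is used here only to show the structure:

```python
# Hypothetical validation rules distilled from EDA findings.
RULES = {
    "amount_positive": lambda row: row["amount"] > 0,
    "region_known":    lambda row: row["region"] in {"WI", "IL", "MN"},
}

def validate(rows):
    """Return rows passing every rule, plus a count of failures per rule."""
    failures = {name: 0 for name in RULES}
    passed = []
    for row in rows:
        ok = True
        for name, rule in RULES.items():
            if not rule(row):
                failures[name] += 1
                ok = False
        if ok:
            passed.append(row)
    return passed, failures

rows = [{"amount": 120.5, "region": "WI"},
        {"amount": -3.0,  "region": "WI"},
        {"amount": 50.0,  "region": "??"}]
passed, failures = validate(rows)
# passed has 1 row; failures -> {'amount_positive': 1, 'region_known': 1}
```

Because each rule is named, failure counts can be logged per run, turning informal notebook assumptions into a monitored part of the pipeline.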
Feature Engineering
By the time we get to feature engineering, we should have cleaned data with an enforced schema. Afterward, we will have a feature table (the first gold-layer table) which will be used in model experimentation. Feature engineering is where any insights from EDA, such as relationships and potential signals, are translated into structured inputs for a model. It is the combination of clean data, EDA, and domain knowledge that defines model-ready data.
During EDA, we begin to understand the behavior of the target variable by experimenting with derived variables, aggregations, or transformations. These are implemented formally such that they become standardized feature definitions that can be reused across model training and inference.
For example, in a regression setting, we may consider interaction terms or normalized or transformed variables. For classification, this might include encoding categorical variables. In time series forecasting, we may introduce lagged values or rolling statistics. Whichever modeling approach we take, feature logic should be structured, versioned, and reproducible.
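As an illustration of the time series case, here is a small stdlib sketch of lagged values and trailing rolling means. The series, lag, and window size are arbitrary choices for the example:

```python
def lag_and_rolling(series, lag=1, window=3):
    """Derive lagged values and trailing rolling means from an ordered series.
    Positions without enough history get None."""
    lags = [None] * lag + series[:-lag]
    rolling = [
        sum(series[i - window + 1:i + 1]) / window if i >= window - 1 else None
        for i in range(len(series))
    ]
    return lags, rolling

sales = [10, 12, 11, 15, 14]
lags, rolling = lag_and_rolling(sales)
# lags    -> [None, 10, 12, 11, 15]
# rolling -> [None, None, 11.0, ~12.67, ~13.33]
```

Keeping a definition like this in the gold layer, rather than inline in a training notebook, is what lets training and inference share identical feature logic.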
Feature engineering represents the transition from prepared data to model experimentation. At this stage, we are no longer focused on data quality alone, but on constructing meaningful representations of the underlying problem. By centralizing feature definitions within the gold layer, we create a clear boundary between data preparation and model training, allowing experimentation to remain focused on evaluation rather than data manipulation.
Model Experimentation
With our features defined, we can now move into model experimentation. This is often where traditional machine learning workflows begin to diverge from operational ones. Training models in a notebook or script is straightforward, but managing multiple experiments, comparing results, and reproducing a specific model—while keeping everything organized—is more complex.
In a typical workflow, model parameters might be adjusted manually, metrics logged informally, and artifacts saved locally. This may be sufficient for a model that will be used once, but it does not scale when multiple experiments are being evaluated. In an MLOps framework, our goal is structured tracking that documents the modeling process and provides a historical record. A well-designed workflow should allow us to answer questions such as: Which data version was used to train this model? Which parameters were applied? Why did performance change between runs?
MLflow is an open-source platform designed to manage experiment tracking, model versioning, lineage, stage transitions, and deployment. In Databricks, MLflow is fully integrated and can be used programmatically or through the UI. Using MLflow, we can log model parameters, evaluation metrics, and artifacts for each training run. Each run is stored as a structured record, providing traceability and consistent comparison across experiments.
This structure also helps separate experimentation from deployment. Multiple models can be trained and evaluated using the same feature definitions, with metrics logged in a standardized format. Baseline models, alternative algorithms, and hyperparameter variations can be compared objectively rather than relying on informal notes or memory.
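MLflow stores each run as a structured record of parameters and metrics, which is what makes comparison objective. Purely to illustrate that idea, here is a stdlib sketch over hypothetical run records; in practice you would query runs through the MLflow tracking API rather than build them by hand:

```python
# Hypothetical run records of the kind MLflow stores per training run.
runs = [
    {"run_id": "r1", "params": {"model": "baseline"},
     "metrics": {"rmse": 15.2}},
    {"run_id": "r2", "params": {"model": "rf", "n_estimators": 200},
     "metrics": {"rmse": 12.4}},
    {"run_id": "r3", "params": {"model": "rf", "n_estimators": 500},
     "metrics": {"rmse": 12.6}},
]

def best_run(runs, metric="rmse", lower_is_better=True):
    """Select the run with the best value for a given metric."""
    key = lambda r: r["metrics"][metric]
    return min(runs, key=key) if lower_is_better else max(runs, key=key)

winner = best_run(runs)
# winner["run_id"] -> "r2"
```

Because every run records the same metric under the same name, selection reduces to a query, not a judgment call over scattered notes.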
By formalizing experiment tracking in this way, model evaluation becomes repeatable and transparent. Model selection is driven by measured performance, and any promoted model can be traced back to its exact training configuration. Because data in the Lakehouse is versioned, models can also be traced back to the exact state of the data used during training. With experiments tracked and compared systematically, we are ready to move into model validation and promotion.
Model Validation
Now that experiments are tracked and organized, the next step is model validation. While experiment tracking allows us to compare runs, model validation defines the criteria by which we determine the appropriate lifecycle stage for the model and whether it should be promoted. In an MLOps framework, validation should be structured and consistent, with clearly defined evaluation datasets, baseline models, and appropriate metrics aligned to the problem type.
For regression, the metrics we choose may include the mean squared error or mean absolute error. For classification, we may use precision, recall, F1 score, or area under the ROC curve. What is important is that these metrics are applied consistently across all candidate models.
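For reference, a couple of these metrics are simple enough to state directly. In practice a library such as scikit-learn would supply them; the stdlib definitions below just make the formulas explicit:

```python
def mae(y_true, y_pred):
    """Mean absolute error: average magnitude of prediction errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels (1 = positive class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# mae([3.0, 5.0], [2.5, 6.0]) -> 0.75
# precision_recall([1, 0, 1, 1], [1, 1, 0, 1]) -> (2/3, 2/3)
```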
This is also where decisions based on how the model will be used come into play. For example, we may ask whether a complex model with marginally higher accuracy but unstable performance across data segments is preferable to a simpler model with more consistent behavior. We may also consider the importance of interpretability depending on the business context. Because experiments have been tracked systematically, these decisions can be made based on performance trends rather than isolated results.
There should be a clear separation of training, validation, and test datasets to ensure that evaluation reflects generalization rather than memorization. When these splits are defined consistently and applied systematically, validation becomes reproducible rather than subjective.
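One common way to make splits reproducible is to assign each record deterministically by hashing a stable identifier, rather than sampling randomly on every run. A sketch, with hypothetical split fractions:

```python
import hashlib

def assign_split(record_id, train=0.7, validation=0.15):
    """Deterministically assign a record to train/validation/test by hashing
    its ID, so the split is identical across reruns and environments."""
    bucket = int(hashlib.md5(record_id.encode()).hexdigest(), 16) % 100 / 100
    if bucket < train:
        return "train"
    if bucket < train + validation:
        return "validation"
    return "test"

splits = {rid: assign_split(rid) for rid in ["A1", "A2", "A3"]}
# Every rerun produces the same assignment for the same IDs.
```

Note that for time series problems a chronological cutoff, not a hash, is usually the right split; the point is only that the rule is fixed and recorded, not rerolled per experiment.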
By formalizing validation criteria within our workflow, model selection is performed against clearly defined and measurable criteria. Once a model meets predefined validation standards, it can move forward in the lifecycle to the model registry and promotion process.
Model Registry & Promotion
Once a model meets predefined validation criteria, it moves beyond experimentation and into lifecycle management. Rather than treating a trained model as a file saved from a notebook, we register it as a versioned asset within the system. Over time, production models may be replaced, and maintaining historical versions allows us to analyze performance changes and investigate issues such as model drift.
Using the MLflow Model Registry, each model version is stored along with its associated parameters, metrics, and lineage. The results from experimentation and validation remain tied to each registered model version. Registration creates a clear boundary between experimentation and deployment. We are now promoting a specific, versioned model entry.
The registry introduces lifecycle stages such as Development, Staging, and Production. These stages clarify how a model is intended to be used and provide structure around promotion decisions. Once in Production, models can be deployed for batch scoring or real-time inference depending on the use case.
Versioning also allows for safe iteration. If a newly promoted model does not perform as expected, we can roll back to a previous version. The history of changes is preserved, and each stage transition is traceable.
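The MLflow Model Registry provides versioning, stage transitions, and rollback natively. Purely to illustrate why version history makes rollback safe, here is a toy stdlib sketch (not the MLflow API; version numbers and metrics are invented):

```python
class ModelRegistry:
    """Toy sketch of versioned models with stage transitions and rollback."""
    def __init__(self):
        self.versions = {}      # version -> metadata (metrics, lineage, ...)
        self.production = None  # version currently serving
        self.history = []       # audit trail of transitions

    def register(self, version, metadata):
        self.versions[version] = metadata

    def promote(self, version):
        self.history.append(("promote", version))
        self.production = version

    def rollback(self):
        """Revert to the previously promoted version, if any."""
        promotions = [v for action, v in self.history if action == "promote"]
        if len(promotions) >= 2:
            self.production = promotions[-2]
            self.history.append(("rollback", self.production))

registry = ModelRegistry()
registry.register(1, {"rmse": 12.4})
registry.register(2, {"rmse": 13.9})
registry.promote(1)
registry.promote(2)
registry.rollback()
# registry.production -> 1; both versions and the full history remain.
```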
By incorporating a model registry into the workflow, we move from simply training models to managing them as long-lived assets. This governance layer ensures that deployment is intentional, auditable, and aligned with the broader ML lifecycle.
Monitoring
Deploying a model to production is not the end of the ML lifecycle. Once a model is generating predictions, its performance must be observed over time to ensure it continues to behave as expected.
Monitoring often begins with tracking model performance as new labeled data becomes available. For regression models, this may involve observing changes in error metrics. For classification models, we may track precision, recall, or other relevant metrics over time. Consistent tracking allows us to detect degradation early.
Monitoring should also include the incoming data itself. If the distribution of features begins to shift from what the model was trained on, performance may decline even if initial validation metrics were strong. Identifying data drift provides an early signal that retraining may be necessary.
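A drift check can be as simple as comparing the live distribution of a feature against its training distribution. The sketch below uses a crude mean-shift score in place of formal tests such as PSI or Kolmogorov-Smirnov, and the numbers and threshold are made up:

```python
import statistics

def drift_score(train_values, live_values):
    """Shift in means between live and training data, measured in training
    standard deviations. A crude stand-in for formal drift tests."""
    mu = statistics.mean(train_values)
    sigma = statistics.stdev(train_values)
    return abs(statistics.mean(live_values) - mu) / sigma if sigma else 0.0

train   = [10, 12, 11, 13, 12, 11]
stable  = [11, 12, 12, 10]
shifted = [18, 20, 19, 21]

# A threshold (here 2.0) turns the score into a retraining signal.
needs_retrain = drift_score(train, shifted) > 2.0   # True
```

Evaluated per feature on a schedule, a score like this gives the early warning described above: the input distribution has moved, even before enough labeled data arrives to show metric degradation.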
In addition to technical metrics, business outcomes should be considered. A model may perform well statistically but fail to produce meaningful impact in practice. Monitoring helps ensure that model performance remains aligned with operational goals.
When performance thresholds are no longer met, retraining workflows can be triggered. In this way, monitoring closes the loop between production and experimentation, reinforcing the continuous nature of MLOps.
Orchestration with Databricks Workflows
While each component of the ML lifecycle can be built independently, operational maturity requires automation. Orchestration ensures that data ingestion, feature generation, model training, validation, scoring, and monitoring occur in a coordinated and repeatable manner.
Within the Databricks ecosystem, workflows can be scheduled and structured to manage dependencies between tasks. For example, a pipeline may begin with data ingestion into the bronze layer, followed by transformations into silver and gold tables. Model retraining can then be triggered on a defined schedule or in response to monitoring signals. After validation, scoring jobs can run automatically, updating downstream tables and dashboards.
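Expressed as a Databricks Workflows job definition (Jobs API 2.1 style), these dependencies become explicit. The job name, notebook paths, and schedule below are hypothetical:

```json
{
  "name": "ml-lifecycle-pipeline",
  "schedule": {
    "quartz_cron_expression": "0 0 2 * * ?",
    "timezone_id": "UTC"
  },
  "tasks": [
    {"task_key": "ingest_bronze",
     "notebook_task": {"notebook_path": "/pipelines/ingest_bronze"}},
    {"task_key": "build_silver",
     "depends_on": [{"task_key": "ingest_bronze"}],
     "notebook_task": {"notebook_path": "/pipelines/build_silver"}},
    {"task_key": "build_gold_features",
     "depends_on": [{"task_key": "build_silver"}],
     "notebook_task": {"notebook_path": "/pipelines/build_gold_features"}},
    {"task_key": "score",
     "depends_on": [{"task_key": "build_gold_features"}],
     "notebook_task": {"notebook_path": "/pipelines/batch_score"}}
  ]
}
```

Each task_key becomes a node in the dependency graph, so a failure in the silver build stops downstream feature generation and scoring rather than letting them run on stale data.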
Workflows allow teams to move from manually executed notebooks to automated pipelines. Each stage of the ML lifecycle becomes a task within a larger, orchestrated system. Dependencies are explicit, schedules are controlled, and execution history is recorded. This provides visibility into when models were retrained, what data was used, and how results were generated.
Orchestration ties together all previous sections of the lifecycle. Without it, even well-designed architectures remain partially manual. With it, the ML system becomes operational, repeatable, and scalable.
Conclusion
Machine Learning Operations is not defined by a single tool or feature. It is the integration of data architecture, feature engineering, experimentation, validation, governance, monitoring, and orchestration into a cohesive system.
The Medallion Architecture provides a structured foundation for data. Exploratory analysis transitions into formalized feature engineering. Experiment tracking ensures reproducibility and clarity. Validation defines measurable standards by which models are evaluated. The model registry governs lifecycle transitions. Monitoring closes the feedback loop. Orchestration automates the entire process.
When these components work together, machine learning moves from isolated experimentation to a managed, repeatable lifecycle. Models are no longer artifacts created in notebooks, but versioned assets operating within a controlled system.