Building a Scalable LLM Testing Pipeline with Databricks

Author: Ammar Malik

28 May, 2024

Databricks is a data platform that lets users build a wide range of data-driven applications, from machine learning models and visualizations to LLM-powered tools. When building these products, users can draw on the Databricks Marketplace to speed up development with assets such as A/B testing frameworks or pre-trained models. One such asset is a Solution Accelerator, a packaged tool that handles a specific, niche task within a data application.

At Xorbix, we developed a Solution Accelerator to support the testing of custom LLM solutions.

The Testing Solution Accelerator

The basic premise of this tool is to test LLM solutions against metrics such as answer relevancy, contextual recall, and hallucination. Users upload documents, create a RAG pipeline, and measure how well their LLM performs. They can then take these metrics into further visualizations and analysis to check whether the LLM solution meets whatever threshold they set. The script is designed for automated LLM testing: an engineer sets it up as a Databricks job and schedules it to run and return insights.

How It Works

Document Ingestion

Document ingestion is straightforward: the document path is stored in a variable and passed to a function that reads and chunks the text. Breaking the text into chunks helps reduce the number of calls to the API, which mitigates the rate limits that most LLM APIs impose. The pipeline can ingest multiple documents as well.
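In practice, the chunking can be as simple as splitting the raw text into fixed-size, slightly overlapping pieces. The sketch below is illustrative only: the function name, chunk size, overlap, and file path are assumptions rather than the accelerator's exact values.

```python
# Illustrative sketch of document ingestion and chunking (names and sizes are assumptions).
from pathlib import Path

def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into overlapping chunks so each downstream API call stays small."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # overlap preserves context across chunk boundaries
    return chunks

# Multiple documents can be ingested the same way.
document_paths = ["/dbfs/FileStore/docs/sample_document.txt"]  # hypothetical path
all_chunks = []
for path in document_paths:
    all_chunks.extend(chunk_text(Path(path).read_text(encoding="utf-8")))
```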

Generating the Datasets

We then pass the chunked document text to DeepEval’s Synthesizer class to generate the testing dataset. Because the text is already chunked, the synthesizer does a better job of extracting the right context for each test case. We also need to specify which model the synthesizer should use; this can be either the evaluation LLM or the LLM under test, since the generated questions will serve as the inputs to the LLM we are testing.
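A rough sketch of this step is shown below, continuing from the all_chunks list in the previous sketch. The Synthesizer method and parameter names follow recent DeepEval documentation and may differ from the version used in the accelerator; the model name is an assumption.

```python
# Sketch of synthetic test-set generation with DeepEval's Synthesizer
# (method and parameter names may vary across DeepEval versions).
from deepeval.synthesizer import Synthesizer

# Each chunk becomes a context group the synthesizer can draw questions from.
contexts = [[chunk] for chunk in all_chunks]

synthesizer = Synthesizer(model="gpt-4o")  # generation model is configurable (assumed name)
goldens = synthesizer.generate_goldens_from_contexts(contexts=contexts)

# Each "golden" carries a generated question (input) and the context it came from.
questions = [golden.input for golden in goldens]
```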

Getting LLM Response

One of the most crucial fields for each test case is the actual output from the LLM under test. We pass the questions from the generated dataset through the LLM we are testing and record each response.
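Concretely, this is a loop over the generated questions. The query_rag_pipeline helper below is hypothetical and stands in for whatever client or endpoint serves the RAG pipeline being tested; the loop continues from the goldens produced in the previous sketch.

```python
# Hypothetical sketch: collect the actual output and retrieved context for each question.
def query_rag_pipeline(question: str) -> tuple[str, list[str]]:
    """Stand-in for the real RAG call; returns (answer, retrieved_chunks)."""
    retrieved = ["<retrieved chunk>"]   # replace with the vector-store lookup
    answer = "<model answer>"           # replace with the LLM response
    return answer, retrieved

responses = []
for golden in goldens:
    answer, retrieved = query_rag_pipeline(golden.input)
    responses.append({
        "input": golden.input,
        "actual_output": answer,
        "retrieval_context": retrieved,
        "context": golden.context,
    })
```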

Creating the Test Cases

After we have all the relevant fields, we create a list of test case objects using the DeepEval library, passing in the information each test case needs: the input, the ground-truth context, the retrieval context (the chunks returned by the RAG pipeline), the actual output, and so on. Each test case object can then be run through the different metrics.
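With DeepEval, each record maps onto an LLMTestCase; the field names (input, actual_output, context, retrieval_context) are DeepEval's own, while the surrounding loop is a sketch that continues from the responses list above.

```python
# Build DeepEval test case objects from the collected fields.
from deepeval.test_case import LLMTestCase

test_cases = [
    LLMTestCase(
        input=r["input"],                          # generated question
        actual_output=r["actual_output"],          # answer from the LLM under test
        context=r["context"],                      # ground-truth chunks from the source docs
        retrieval_context=r["retrieval_context"],  # chunks returned by the RAG pipeline
    )
    for r in responses
]
```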

Running the Test Cases

The test cases are run through a modified version of DeepEval’s metrics; a later section explains why we had to modify the library. Once the test cases are ready, we store them in a dataframe. The main testing function then extracts the relevant information and runs it through the metrics. We use metrics such as answer relevancy, which checks how relevant the response is to the question, and hallucination, which checks how much information the LLM invents rather than drawing from the ground truth. The resulting scores are stored in another dataframe, which we use to view the results and build visualizations.
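A simplified version of the scoring loop is sketched below, using DeepEval's stock metrics (the accelerator runs modified versions, as described later) and a pandas dataframe for the results; the thresholds and metric choices are illustrative.

```python
# Simplified scoring loop using stock DeepEval metrics (the accelerator uses modified ones).
import pandas as pd
from deepeval.metrics import AnswerRelevancyMetric, HallucinationMetric

metrics = [
    AnswerRelevancyMetric(threshold=0.7),  # threshold values are illustrative
    HallucinationMetric(threshold=0.5),
]

rows = []
for case in test_cases:
    for metric in metrics:
        metric.measure(case)  # each measurement calls the evaluation LLM
        rows.append({
            "input": case.input,
            "metric": type(metric).__name__,
            "score": metric.score,
            "passed": metric.is_successful(),
        })

results_df = pd.DataFrame(rows)
```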

Interpreting the Results

After the results are stored, helper functions identify the test cases that passed or failed. Users can review these to fine-tune their LLM toward the expected responses. We built visualizations such as a bar chart of the average score per metric, which summarizes the LLM’s performance on the key metrics the user selected. We also include a simple line chart that shows how the scores vary, since it is important to analyze which specific test cases the LLM failed. By comparing the score variation against the individual test cases, users can check whether a specific topic extracted from the documents is causing issues.
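Both charts can be produced directly from the results dataframe. The sketch below assumes the results_df layout from the previous snippet and uses matplotlib, which may not match the accelerator's exact plotting code.

```python
# Sketch of the visualizations, assuming the results_df layout from the previous snippet.
import matplotlib.pyplot as plt

# Bar chart: average score per metric.
results_df.groupby("metric")["score"].mean().plot(kind="bar", title="Average score per metric")
plt.ylabel("score")
plt.tight_layout()
plt.show()

# Line chart: score variation across test cases for each metric.
results_df.pivot_table(index="input", columns="metric", values="score").plot(
    kind="line", marker="o", title="Score per test case"
)
plt.tight_layout()
plt.show()

# Helper view: the test cases that failed, for closer inspection.
failed_cases = results_df[~results_df["passed"]]
```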

DeepEval Modifications

One of the issues we faced during development was related to token usage; another involved garbage collection. The token usage problem had an easy fix: adding a timer that waits 60 seconds at certain points in the program before the query limit is reached. The garbage collection issue was a major one. When running many tests, around 100, the notebook would freeze and the cluster would get stuck in what looked like an infinite loop.
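The fix amounts to tracking requests and sleeping when the per-minute budget is about to be exhausted. A minimal sketch follows; the request budget is an assumption, since the real limit depends on the LLM provider.

```python
# Minimal throttling sketch: pause before exceeding an assumed per-minute request budget.
import time

REQUESTS_PER_MINUTE = 60  # assumed budget; the real limit depends on the provider
_calls = 0
_window_start = time.time()

def throttle() -> None:
    """Call before each LLM request; sleeps out the rest of the minute if the budget is spent."""
    global _calls, _window_start
    elapsed = time.time() - _window_start
    if elapsed >= 60:
        _calls, _window_start = 0, time.time()
    elif _calls >= REQUESTS_PER_MINUTE:
        time.sleep(60 - elapsed)
        _calls, _window_start = 0, time.time()
    _calls += 1
```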

To mitigate this, we tried several strategies to reduce the program’s memory usage: we implemented batch processing to limit how many items are loaded into memory, increased the Spark cluster’s heap, driver, and executor memory, and experimented with different configurations, but with no success.

Another issue was that the widgets generated by DeepEval were also causing the page to break. At that point we decided to modify the DeepEval code to suit our purposes: we adjusted the prompts, wrote a custom function that reduced the number of calls to the LLM, and removed the progress indicator. With these changes in place, the project was complete.

Conclusion

The Databricks Solution Accelerator developed for testing custom LLM solutions provides a comprehensive and automated approach to evaluating the performance of language models. By ingesting documents, generating test datasets, and running a suite of metrics to assess factors like answer relevancy, contextual recall, and hallucination, this tool empowers users to thoroughly analyze the capabilities of their LLM implementations. The accelerator’s modular design, with customizable components for document ingestion, dataset generation, and metric calculation, allows for flexibility and adaptability to suit the unique needs of different LLM use cases.

The ability to visualize test results through intuitive charts further enhances the user’s understanding of the LLM’s strengths and weaknesses, enabling informed decisions on model refinement and optimization. The challenges faced during development, such as token usage limitations and memory management issues, were successfully overcome through strategic modifications to the underlying DeepEval library. This demonstrates the team’s technical expertise and commitment to delivering a robust and reliable solution.

Overall, the Databricks Solution Accelerator for LLM testing represents a valuable contribution to the data science community, providing a powerful tool to streamline the evaluation and improvement of custom language models. Its adoption can lead to more accurate, reliable, and impactful LLM-powered applications across a wide range of industries.

Explore how the Databricks Solution Accelerator can automate LLM testing and enhance your language model’s performance. Contact Xorbix Technologies today to discuss how this solution can benefit your organization.
