Overcoming Challenges in the Databricks GenAI Hackathon
Author: Ryan Shiva
03 July, 2024
I recently had the incredible opportunity to participate in the GenAI Hackathon at the Databricks Data + AI Summit 2024. Our team, consisting of three other talented individuals whom I met at the event, was selected as a top 5 finalist out of 56 competing teams. In this blog post, I’ll share our journey of creating the “Smart Real Estate Advisor” chatbot and the technical details behind our implementation.
The Challenge
The hackathon challenged participants to build applications and solutions using the latest GenAI models on the Databricks platform. We were encouraged to use either datasets from the Databricks Marketplace or reputable open-source datasets. The focus was on showcasing open-source LLMs such as Llama 3, Mixtral, and DBRX.
Our Solution: Smart Real Estate Advisor
We decided to create an AI-powered chatbot that could generate real estate listing recommendations based on natural language input from users. Our solution aimed to make the process of finding the perfect property more accessible, efficient, and user-friendly.
Key Features:
- Natural language processing to extract user preferences.
- Personalized property recommendations based on user criteria.
- Detailed property information on demand.
Overcoming LLM Query Challenges: A Lesson in Problem-Solving
At the beginning of project development, we encountered a significant hurdle while attempting to query the DBRX LLM using the OpenAI Python library. Despite using a code snippet copied directly from the Databricks workspace, we consistently received a confusing error message. This had us scrambling to figure out the issue, as querying the LLM was obviously the most important part of our project! Was our Databricks authentication token incorrect, or did we fail to install a dependency? After spending a solid chunk of time troubleshooting, we decided to ask one of the Hackathon assistants for help. It was then we discovered that the OpenAI service was experiencing a widespread issue affecting multiple teams.
Ultimately, we implemented a workaround using the databricks_genai_inference Python library, which allowed us to query the LLM without relying on the OpenAI service. This experience highlighted the importance of seeking help when faced with a persistent challenge and the value of flexibility in problem-solving approaches.
Technical Implementation
Data Preprocessing
We started by finding a suitable dataset in the Databricks Marketplace and importing it into our workspace. We then worked on preprocessing the real estate dataset to ensure it was in a format that could be easily queried based on user preferences. Here’s a snippet of our data preprocessing code:
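from pyspark.sql import functions as f
from pyspark.sql.window import Window

# custom_catalog_name and custom_schema_name are set earlier in the notebook.
property_pre_processed_df = spark.read.table(f"{custom_catalog_name}.{custom_schema_name}.us_listings_daily_pre_processed")
selected_property_df = property_pre_processed_df.select(
    'pid', 'address', 'baths', 'beds',
    'home_type', 'sqft', 'price', 'rent',
    'city', 'state', 'zip',
    'postingIsRental', 'description', 'great_schools_rating', 'date')
# Assumed definition (not shown in the original snippet): rank each property's listings, newest first.
window_spec = Window.partitionBy('pid').orderBy(f.col('date').desc())
filtered_property_df = selected_property_df.filter(f.col('price').isNotNull())\
    .withColumn('date', f.col('date').cast('date'))\
    .withColumn('row_number', f.row_number().over(window_spec))\
    .filter(f.col('row_number') == 1)\
    .drop('row_number')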
This code selects relevant columns, filters out null prices, and ensures we have the most recent listing for each property. This improved both query performance and the reliability of our results.
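To make the deduplication step concrete without a Spark cluster, here is a minimal plain-Python sketch of the same idea. It assumes listings are grouped by pid and ranked by date, newest first; the function name latest_listing_per_property and the sample rows are made up for illustration and are not part of our notebook.

```python
from datetime import date

def latest_listing_per_property(rows):
    """Keep only the most recent row for each pid, skipping rows with no price."""
    latest = {}
    for row in rows:
        if row["price"] is None:
            continue  # mirrors the isNotNull() filter on price
        current = latest.get(row["pid"])
        if current is None or row["date"] > current["date"]:
            latest[row["pid"]] = row  # mirrors keeping row_number == 1
    return list(latest.values())

# Made-up sample data: two snapshots per property.
sample_rows = [
    {"pid": 1, "price": 350000, "date": date(2024, 5, 1)},
    {"pid": 1, "price": 340000, "date": date(2024, 6, 1)},
    {"pid": 2, "price": None, "date": date(2024, 6, 2)},
    {"pid": 2, "price": 500000, "date": date(2024, 4, 15)},
]
deduplicated = latest_listing_per_property(sample_rows)
```

Note that the null-price filter runs before the ranking, so a property whose newest snapshot has no price falls back to its latest priced snapshot rather than disappearing.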
Leveraging DBRX for Natural Language Processing
We chose to use Databricks’ DBRX model for our project due to its strong performance in code-related tasks and general-purpose capabilities. Here’s how we used DBRX to extract user preferences:
def extract_user_preferences(user_input):
    # `chat` is the LLM chat session created earlier in the notebook.
    preferences = {}
    prompt = f"Extract the following information from the user input and provide the output in JSON format:\n\n{user_input}\n\n{{\"budget\": <integer or null>,\n\"state\": <two-letter state abbreviation or null>,\n\"city\": <string or null>,\n\"beds\": <integer or null>,\n\"baths\": <float or null>,\n\"sqft\": <integer or null>}}"
    response = chat.reply(prompt)
    # ... (JSON parsing code)
    return preferences
This function uses DBRX to extract structured information from the user’s natural language input, which is then used to filter the real estate dataset.
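The JSON parsing step is elided in the snippet above. As a rough illustration only (not the code we ran), one way to turn a model reply into a preferences dictionary is to take the text between the first "{" and the last "}", parse it, and keep only the expected keys that are not null. The names parse_preferences and EXPECTED_KEYS and the sample reply are hypothetical.

```python
import json

# Keys requested in the prompt; anything else in the reply is ignored.
EXPECTED_KEYS = ("budget", "state", "city", "beds", "baths", "sqft")

def parse_preferences(reply_text):
    """Parse the JSON object embedded in an LLM reply; return {} if none is usable."""
    start = reply_text.find("{")
    end = reply_text.rfind("}")
    if start == -1 or end <= start:
        return {}
    try:
        parsed = json.loads(reply_text[start:end + 1])
    except json.JSONDecodeError:
        return {}
    # Drop null values so downstream code only sees preferences the user stated.
    return {key: parsed[key] for key in EXPECTED_KEYS if parsed.get(key) is not None}

# Made-up reply in the shape the prompt asks for, with some leading chatter.
sample_reply = (
    'Here is the JSON:\n'
    '{"budget": 500000, "state": "TX", "city": "Austin", '
    '"beds": 3, "baths": null, "sqft": null}'
)
preferences = parse_preferences(sample_reply)
```

Tolerating extra text around the JSON matters because models do not always return the object alone, and returning an empty dictionary on failure lets the caller fall back to asking the user again.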
Property Recommendation Engine
Once we have the user’s preferences, we use them to filter the preprocessed dataset and generate recommendations:
def get_property_recommendations(user_preferences):
    filtered_data = real_estate_data
    # Drop listings priced below 30,000.
    filtered_data = filtered_data.filter(filtered_data.price >= 30000)
    # The prompt asks for null when a value is missing, so skip the filter if budget is None.
    if user_preferences.get("budget") is not None:
        filtered_data = filtered_data.filter(filtered_data.price <= user_preferences["budget"])
    # ... (additional filtering based on user preferences)
    recommendations = []
    # Cap the result at 20 rows before collecting them to the driver.
    for row in filtered_data.limit(20).collect():
        address = row.address
        city = row.city
        state = row.state
        price = row.price
        beds = row.beds
        baths = row.baths
        sqft = row.sqft
        recommendation = f"Address: {address}, City: {city}, State: {state}, Price: {price}, Beds: {beds}, Baths: {baths}, Sqft: {sqft}"
        recommendations.append(recommendation)
    return "\n".join(recommendations)
This function applies the user’s preferences to filter the dataset and returns up to 20 matching properties as a newline-separated string, one property per line.
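The remaining preference filters are elided in the function above. The plain-Python sketch below shows one way each preference could map to a condition, with unset (None) preferences skipped so they never exclude a listing. It rests on assumptions that are not from our notebook: beds, baths, and sqft are treated as minimums, city and state as case-insensitive exact matches, and the name matches_preferences and the sample listings are invented for illustration.

```python
def matches_preferences(listing, prefs, min_price=30000):
    """Return True if a listing (dict) satisfies every preference that is set."""
    if listing["price"] < min_price:
        return False
    budget = prefs.get("budget")
    if budget is not None and listing["price"] > budget:
        return False
    for key in ("state", "city"):  # exact match, ignoring case (assumption)
        wanted = prefs.get(key)
        if wanted is not None and listing[key].lower() != wanted.lower():
            return False
    for key in ("beds", "baths", "sqft"):  # treated as minimums (assumption)
        wanted = prefs.get(key)
        if wanted is not None and listing[key] < wanted:
            return False
    return True

# Made-up sample listings.
listings = [
    {"address": "1 Main St", "city": "Austin", "state": "TX",
     "price": 450000, "beds": 3, "baths": 2.0, "sqft": 1800},
    {"address": "2 Oak Ave", "city": "Austin", "state": "TX",
     "price": 650000, "beds": 4, "baths": 3.0, "sqft": 2400},
    {"address": "3 Elm Rd", "city": "Dallas", "state": "TX",
     "price": 300000, "beds": 2, "baths": 1.0, "sqft": 1100},
]
prefs = {"budget": 500000, "city": "austin", "beds": 3, "baths": None}
matching = [item["address"] for item in listings if matches_preferences(item, prefs)]
```

The same pattern carries over to Spark: apply a filter only when the corresponding preference is not None, so a null from the model never turns into a comparison that silently removes every row.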
Challenges We Overcame
- Teamwork: Working with a team of individuals with varying levels of Databricks experience posed an initial challenge. However, it was impressive to see how quickly everyone adapted to the platform and collaborated effectively.
- Data Preprocessing: A significant portion of our time was spent on data preprocessing to ensure our dataset was in a format that could be easily queried based on user preferences. We also ran into issues due to gaps in our dataset that needed to be addressed (for instance, the dataset only contained data from certain states).
- Model Selection: We initially considered using Llama 3 70B but found that DBRX was more reliable in returning responses in the JSON format we needed.
- Time Limitation: Because we only had 6 hours to work on the project, we couldn’t finish every feature we had planned by the deadline. For example, we wanted to implement a front end for the chatbot that would provide a better user experience than typing queries into Databricks notebook cells. Instead, we prioritized the features that mattered most under the judging criteria.
Conclusion
Participating in the Databricks GenAI Hackathon was an exhilarating experience. Our team’s success in reaching the top 5 finalists demonstrates the power of collaboration and the potential of GenAI in solving real-world problems. The Smart Real Estate Advisor showcases how LLMs like DBRX can be leveraged to create intuitive, user-friendly applications that bridge the gap between complex datasets and end-users.
This hackathon not only allowed us to showcase our skills but also provided valuable insights into the capabilities of cutting-edge AI models and the Databricks platform. As we look to the future, it’s clear that the integration of GenAI in various industries, including real estate, has the potential to revolutionize how we interact with data and make decisions.