Machine Learning

A Deep Dive into Machine Learning with Databricks AutoML: Revolutionizing Retail Sales Predictions

At Brightly, we thrive on shaping the digital solutions of tomorrow. In the ever-evolving landscape of data science and machine learning - and, importantly, of the business decisions they help to improve - tools that streamline the model development process are invaluable. One such tool is Databricks AutoML, part of a collaborative and scalable environment for machine learning. This case study describes the solution we built in partnership with a prominent car dealer to enhance their business efficiency with a predictive model that estimates used car sales cycles. Sales and market data formed the backbone of a robust machine learning model designed to predict the resale time and price for each car in inventory.

The Genesis: A POC Outside Databricks

The journey began with a Proof of Concept (POC) outside the Databricks ecosystem. The goal was to predict how long it would take to sell a car - say, a 2018 Tesla Model S with 160 000 km on the odometer or a fully equipped 2022 Toyota RAV4 Plug-in Hybrid - on the car dealer's website. The data included information about previously sold cars as well as relevant market trends and seasons. Various predictive models, such as Random Forest, Linear Regression, and XGBoost, were tested to build an overall understanding of the data, leading to the identification of the key features that influenced sales time prediction accuracy.
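
To illustrate this kind of model bake-off, here is a minimal sketch on synthetic data, using scikit-learn's GradientBoostingRegressor as a stand-in for XGBoost; the feature names and data-generating process are invented for the example and do not reflect the customer's actual data:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)

# Synthetic stand-in for the dealer's data: car age, mileage, and a
# seasonal demand index drive a (noisy, mildly non-linear) days-to-sell.
n = 500
age_years = rng.uniform(0, 10, n)
mileage_km = rng.uniform(0, 250_000, n)
season_idx = rng.uniform(0, 1, n)
X = np.column_stack([age_years, mileage_km, season_idx])
y = (20 + 3 * age_years + 0.3 * age_years**2
     + mileage_km / 10_000 - 15 * season_idx + rng.normal(0, 2, n))

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
}

results = {}
for name, model in models.items():
    # scikit-learn reports negated MAE; flip the sign for readability.
    results[name] = -cross_val_score(
        model, X, y, cv=5, scoring="neg_mean_absolute_error").mean()
    print(f"{name}: MAE = {results[name]:.2f} days")
```

Comparing several model families on a shared cross-validation metric like this is what surfaced XGBoost as the strongest candidate in the POC.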

Remarkably, all models yielded satisfactory prediction accuracy. Still, it was evident that the gradient-boosted decision tree model XGBoost gave the best accuracy and was the most efficient and easiest to use in terms of data preparation. It is worth mentioning some benefits of XGBoost for a retail sales prediction use case:

  • Accuracy: XGBoost can capture complex relationships and patterns in data, which is crucial for making accurate sales predictions. It is also robust to overfitting, which is important when working with sales data to ensure the model generalizes well to new, unseen data rather than fitting the training data too closely.
  • Performance: XGBoost is optimized for performance and can handle large datasets efficiently. It is an ensemble learning method combining multiple decision tree predictions to create a robust model. This ensemble approach helps improve the overall performance of the model.
  • Handling non-linearity: Sales data often involves non-linear relationships and interactions between various features. XGBoost can automatically handle such complexities.
  • Handling missing data: XGBoost has built-in capabilities to handle missing data, which can be expected in sales datasets. XGBoost does not require the imputation of missing values, simplifying the preprocessing steps. 

XGBoost also provides a feature importance score, which helps identify the most influential features in predicting sales. Understanding the importance of features can provide valuable insights into which factors contribute the most to sales outcomes.
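
A brief sketch of how such importance scores can be read off a trained booster - again with scikit-learn's gradient boosting standing in for XGBoost, and with invented feature names and data:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)
n = 400
mileage = rng.uniform(0, 250_000, n)
age = rng.uniform(0, 10, n)
noise_feat = rng.uniform(0, 1, n)          # irrelevant on purpose
X = np.column_stack([mileage, age, noise_feat])
y = mileage / 5_000 + 2 * age + rng.normal(0, 1, n)

model = GradientBoostingRegressor(random_state=0).fit(X, y)
for name, score in zip(["mileage", "age", "noise"], model.feature_importances_):
    print(f"{name}: {score:.3f}")
```

The irrelevant feature ends up with an importance score near zero, which is exactly the kind of signal that helps prune uninformative inputs from a sales model.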

The plot above indicates that Feature 1 is, on average, the most important feature for the predicted value. The color represents the original value of a feature: as the value of Feature 1 decreases, so does the predicted value.

The quest for further optimization and scalability led to migrating the project into Databricks. On this unified analytics platform, different models can be tested against the data and their parameters tuned automatically.

Bringing it to Databricks: A Seamless Transition

The initial phase involved importing code and building the project within a Databricks Repository, leveraging Databricks’ Git integration and support for various data platforms (in this case, Google Cloud Platform). 

Data retrieval jobs were orchestrated with Databricks Jobs Workflow, which efficiently organizes the ingestion of data from various APIs into Databricks tables. These tables are written to and managed in Databricks' Unity Catalog, which provides a secure solution for governing and sharing data and, importantly, also for registering and executing machine learning models that read data directly from Databricks tables. This streamlined process facilitated a smooth transition from the POC to the development and, eventually, the production environment.

With the data securely stored in Databricks, our focus shifted to running the project with Databricks' automated machine learning tool, AutoML.

Data ingestion and model training end-to-end orchestration in Databricks Workflows.

AutoML in Action: Model Training and Evaluation

Screenshot from Databricks’ ML Experiments showing results from a training session, where model versions are ranked by Mean Absolute Error (MAE).

Databricks AutoML can be accessed easily in the user interface. AutoML simplifies the model training process across the machine learning lifecycle, allowing data scientists to experiment with various machine learning algorithms without extensive manual tuning or the “boilerplate” code otherwise needed for cleaning data and fitting models. This is particularly beneficial when dealing with large datasets and complex feature interactions.

AutoML allows users to track experiments, compare and evaluate results between different models, add code to a model’s training runs, and save, share, and deploy models. It can also help select the most suitable algorithm for the task. As we already knew this would be XGBoost, it was easy to pass that choice as a parameter.

With a few lines of code, AutoML initiates the training process on distributed compute, saves the model to the Databricks Model Registry, and retrains it on new data at a chosen interval. Each retrained model can be saved as a new version.
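
As a rough sketch, a scheduled training notebook built around the AutoML Python API might look like the following; the table name, target column, and timeout are illustrative, and this runs only on a Databricks cluster (where `spark` is predefined):

```python
from databricks import automl

# Kick off an AutoML regression experiment on distributed compute.
# Table name, target column, and timeout are illustrative values.
summary = automl.regress(
    dataset=spark.table("sales.inventory_features"),
    target_col="days_to_sell",
    primary_metric="mae",
    timeout_minutes=60,
)

# The best trial's model can then be registered, so each scheduled
# retraining run becomes a new model version.
print(summary.best_trial.model_path)
```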

Most importantly, AutoML will do feature engineering for you. It automates model tuning, which makes it ideal for a sales prediction project with ever-changing data, where the model is retrained at a steady interval on new data from the sales process. It optimizes hyperparameters that enhance model performance (such as decision tree depth) and assesses each model version against the evaluation metrics that suit your model best.
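
The hyperparameter sweep that AutoML automates can be pictured as a grid search; a manual scikit-learn equivalent over tree depth and learning rate, scored by MAE, might look like this (the grid values and synthetic data are illustrative):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(2)
X = rng.uniform(0, 1, size=(300, 4))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(0, 0.05, 300)

# AutoML sweeps settings like these behind the scenes; this is the
# manual equivalent with a small grid over depth and learning rate.
search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid={"max_depth": [2, 3, 5], "learning_rate": [0.05, 0.1]},
    scoring="neg_mean_absolute_error",
    cv=3,
)
search.fit(X, y)
print("best params:", search.best_params_)
print(f"best MAE: {-search.best_score_:.3f}")
```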

AutoML trains models with different parameters and ranks the results in the user interface. The user can select the model, parameters, and evaluation metrics of choice, as well as the length of the training period, and view the results for different models ranked by the selected metric, as in the screenshot above.
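
Since the runs are ranked by MAE, it is worth recalling what that metric is: the average absolute difference between predicted and actual values - here, days to sell. A tiny sketch with invented numbers:

```python
from sklearn.metrics import mean_absolute_error

# Hypothetical days-to-sell figures for four cars.
actual = [30, 45, 12, 60]
predicted = [28, 50, 15, 55]

# MAE = mean of |actual - predicted| = (2 + 5 + 3 + 5) / 4
print(mean_absolute_error(actual, predicted))  # → 3.75
```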

We had already run and tested the model and knew what to expect. AutoML confirmed the direction, underscoring the efficiency of the AutoML process in fine-tuning the model for better predictive accuracy.

Beyond data governance, Unity Catalog also serves as a centralized repository where models and other artifacts can be stored, managed, and shared. After successfully training the AutoML model, the next step was to register it in Unity Catalog. This ensures easy access for future use and promotes collaboration within the data science team, fostering a culture of knowledge sharing and reproducibility.
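
A sketch of this registration step via the MLflow client, which is how Unity Catalog model registration is typically driven; the catalog, schema, and model names are illustrative, and `<run_id>` stands in for a real training run's ID:

```python
import mlflow

# Target Unity Catalog rather than the legacy workspace registry.
mlflow.set_registry_uri("databricks-uc")

# Register under a three-level name (catalog.schema.model);
# the names here are illustrative.
mlflow.register_model(
    model_uri="runs:/<run_id>/model",
    name="main.sales.used_car_sales_cycle",
)
```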

Good to Remember 

Tools like AutoML are designed to be user-friendly and automated, which also means the options for exploring, tuning, and customizing models can be limited. We were able to transition to Databricks with relatively little effort because we had already tested different models on the data.

Automation tools may generate complex black-box models where it can be challenging to understand the reasoning behind predictions. You will therefore need a good general understanding of the input data - its possible outliers, missing values, and other anomalies - a sense of what to expect from different types of models and parameters, and domain-specific knowledge. Human insight is needed to guide the process.

Data privacy planning may also be needed, as sales data can include sensitive information. And with large datasets, estimate the required computational resources in advance.

Conclusion: A Glimpse into the Future of Data Science

The journey from a POC to an AutoML-driven model in Databricks exemplifies the power of combining domain knowledge - the intuitive vision of the customer whose business the solution helps - with cutting-edge tools. AutoML expedites the model development process and improves predictive accuracy, making it a game-changer for industries like the retail sector that rely heavily on data-driven insights. The key learnings from this use case were:

  • Current powerful and accurate algorithms such as XGBoost work efficiently on many types of data companies already have - such as sales data.
  • Databricks integrates widely and offers a platform ecosystem that covers everything from pulling code from Git repositories to automating model training jobs.
  • The possibilities become limitless when you have solid practical insights based on domain knowledge coupled with data.

As the synergy between machine learning and collaborative platforms like Databricks continues to evolve, we can expect a paradigm shift in how organizations approach data science. The success story of this sales time prediction project serves as proof of the future potential of AutoML in Databricks, hinting at projects where complex data science tasks are seamlessly integrated into scalable and collaborative environments.

At Brightly, we support you with building predictive machine learning solutions. Contact us to learn more.

Authors

Varpu Rantala

With a decade of expertise in advanced online analytics data development, particularly within the media industry and SaaS context, I am dedicated to discovering optimal and innovative methods for efficiently ingesting, processing, and delivering data, as well as enhancing the overall data user experience. I hold a Ph.D. focused on developing digital methods in media research and also enjoy exploring data and visualization from a creative perspective.