PyData London 2023

From zero to a working ML system with only Python, free serverless services and FTI pipelines
06-02, 09:00–10:30 (Europe/London), Minories

We will build an operational ML system to predict air quality in London. Instead of a single monolithic ML pipeline, we will build a more manageable system as three FTI pipelines: a Feature pipeline, a Training pipeline, and an Inference pipeline, connected by a feature store. The feature pipeline scrapes new data and provides historical data (air quality observations and weather forecasts). The training pipeline produces a model using the air quality observations and features. The inference pipeline takes weather forecasts and predicts air quality for London, visualized in a UI. The system will be hosted on free serverless services: Modal, Hugging Face Spaces, and Hopsworks. It will be a continually improving ML system that keeps collecting more data, makes better predictions, and provides a hindcast with insights into its historical performance.
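
To make the FTI split concrete, here is a minimal sketch of three pipelines that communicate only through a shared store. Everything in it is an assumption for illustration: `InMemoryFeatureStore` is a hypothetical in-memory stand-in (not the Hopsworks API), the single `wind_speed` feature and the one-variable least-squares model are placeholders, and the numbers are made up.

```python
from dataclasses import dataclass, field
from datetime import date
from statistics import mean


@dataclass
class InMemoryFeatureStore:
    """Hypothetical stand-in for a feature store; NOT the Hopsworks API."""
    weather: dict = field(default_factory=dict)      # day -> {"wind_speed": float}
    air_quality: dict = field(default_factory=dict)  # day -> observed PM2.5 (label)


def feature_pipeline(store, raw_rows):
    """Write weather features (and labels, when observed) to the store."""
    for row in raw_rows:
        store.weather[row["day"]] = {"wind_speed": row["wind_speed"]}
        if row.get("pm25") is not None:
            store.air_quality[row["day"]] = row["pm25"]


def training_pipeline(store):
    """Fit pm25 = intercept + slope * wind_speed by ordinary least squares."""
    days = sorted(d for d in store.air_quality if d in store.weather)
    xs = [store.weather[d]["wind_speed"] for d in days]
    ys = [store.air_quality[d] for d in days]
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return {"intercept": y_bar - slope * x_bar, "slope": slope}


def inference_pipeline(store, model, day):
    """Predict PM2.5 for a day that has forecast features but no label yet."""
    features = store.weather[day]
    return model["intercept"] + model["slope"] * features["wind_speed"]


# Made-up sample data: three observed days plus one forecast-only day.
store = InMemoryFeatureStore()
feature_pipeline(store, [
    {"day": date(2023, 6, 1), "wind_speed": 2.0, "pm25": 30.0},
    {"day": date(2023, 6, 2), "wind_speed": 4.0, "pm25": 20.0},
    {"day": date(2023, 6, 3), "wind_speed": 6.0, "pm25": 10.0},
    {"day": date(2023, 6, 4), "wind_speed": 5.0, "pm25": None},  # forecast only
])
model = training_pipeline(store)
prediction = inference_pipeline(store, model, date(2023, 6, 4))
```

Because each function reads from and writes to the store rather than calling the others, each one could be run, scheduled, and tested as a separate program.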

This tutorial will produce three different Python programs (Feature/Training/Inference pipelines) that, when plugged together, make a production ML system. First, we will understand the data sources: public, crowd-sourced air quality measurements that can be retrieved with an API key or scraped from a web page, and weather predictions/observations that can be retrieved with free API services. The prediction problem is to predict air quality at the location of existing air quality sensors, using weather forecast data as the primary features. We will show you how to write a Python program as a feature pipeline that can both scrape new data and provide historical data (air quality observations and weather forecasts). We will show you how to schedule this feature pipeline to run daily using Modal (you could also use GitHub Actions or any one of the many free Python orchestration services available today). Our feature pipeline will store our features in a free serverless feature store (Hopsworks), and then we will write a training pipeline that reads features and air quality observations (labels) to train a model to predict air quality given a weather forecast (a set of weather features). Finally, we will develop a UI using Hugging Face Spaces that will include a batch inference program to retrieve the latest weather forecast features and the model, and to predict air quality. We will show you how to log predictions, so that you can build a continually improving ML system that provides hindcasts with insights into its historical performance.
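
The prediction logging and hindcast idea can be sketched roughly as follows. This is an illustration under stated assumptions, not the tutorial's implementation: the log is a plain dictionary, the helper names (`log_prediction`, `record_observation`, `hindcast`) are hypothetical, mean absolute error is just one possible performance measure, and the values are invented.

```python
from datetime import date


def log_prediction(log, day, predicted_pm25):
    """Store a prediction at inference time; the outcome is not known yet."""
    log[day] = {"predicted": predicted_pm25, "observed": None}


def record_observation(log, day, observed_pm25):
    """Attach the measured value once it arrives for a logged day."""
    if day in log:
        log[day]["observed"] = observed_pm25


def hindcast(log):
    """Return (day, predicted, observed) rows and their mean absolute error."""
    rows = [
        (day, entry["predicted"], entry["observed"])
        for day, entry in sorted(log.items())
        if entry["observed"] is not None
    ]
    if not rows:
        return rows, None
    mae = sum(abs(pred - obs) for _, pred, obs in rows) / len(rows)
    return rows, mae


# Invented values: three logged predictions, two of which have outcomes so far.
prediction_log = {}
log_prediction(prediction_log, date(2023, 6, 4), 15.0)
log_prediction(prediction_log, date(2023, 6, 5), 20.0)
log_prediction(prediction_log, date(2023, 6, 6), 25.0)
record_observation(prediction_log, date(2023, 6, 4), 18.0)
record_observation(prediction_log, date(2023, 6, 5), 16.0)

rows, mae = hindcast(prediction_log)
```

Days without an observation yet are skipped, so the hindcast only scores predictions whose outcomes are already known.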

For this tutorial, you will need experience with programming in Python, a laptop and Internet access.

Prior Knowledge Expected

No previous knowledge expected

Jim Dowling is CEO of Hopsworks and an Associate Professor at KTH Royal Institute of Technology. He is one of the main developers of the open-source Hopsworks platform, a horizontally scalable data platform for machine learning that includes the industry’s first Feature Store.