PyData London 2023

Getting from data to insights with powerful XAI & DataViz open-source tools (inc. diving deep into shap’s TreeExplainer)
06-04, 10:15–10:55 (Europe/London), Warwick


Have you ever wanted a standard and efficient process for approaching new datasets? Do you want a systematic way of highlighting complex nonlinear or low-frequency patterns in your data?

In this talk, I will share the open-source stack that I use to efficiently extract interesting insights from any dataset. I will teach you how to use data visualisation, gradient boosted decision trees and XAI tools to quickly find hidden patterns, de-risk your projects early and debug your models.

In our field, the tools we use, and our ability to make the most of them, are often key to success.

For tabular data, gradient boosted decision trees (LightGBM, XGBoost & CatBoost) have been the go-to algorithms for classification and regression problems for a few years because they are good at capturing nonlinear and complex relationships in the data. Understanding their results, however, is not trivial. As a consequence, in any application where the relationships in the data need to be understood, these more powerful models are typically left out.

In this talk, you will learn how to break free from this trade-off, and get the performance of gradient boosted decision trees as well as interpretable local and global explanations. You will learn how to dig beyond surface-level statistics to bring to light:
- Complex nonlinear interactions,
- Low-frequency but high-impact factors,
- Clusters of your population that behave similarly.

I will share my battle-tested process to extract insights from tabular data by combining gradient-boosted tree algorithms with the SHAP (SHapley Additive exPlanations) black-box model explainer.
I will quickly review the paper from Scott Lundberg, the creator of the shap library, that inspired this approach. I will then present some of my results applying this method:
1. Client segmentation (semi-supervised)
2. First dive into a new dataset (for some contract work)

In this talk, we will cover the following productivity-boosting open-source tools:
- Gradient-boosted tree algorithms: LightGBM, XGBoost & CatBoost
- What are SHAP (SHapley Additive exPlanations) values and how to compute them efficiently with TreeExplainer
- Tools to automate your EDA: DataPrep, ydata-profiling (formerly pandas-profiling) and Sweetviz
- The parallel plot tool to see your dataset across many dimensions: HiPlot
- A calm way to organise your data wrangling: pandas.pipe
- A tool to standardise the MLOps workflow: MLflow
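
To give a feel for what SHAP values are before seeing TreeExplainer, here is a minimal sketch that computes exact Shapley values by brute force for a single row. It is not the shap library's implementation: the names `shapley_values`, `toy_model` and `background` are hypothetical, the toy model stands in for a trained tree ensemble, and "missing" features are assumed to be filled in from a small background sample.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, background):
    """Exact Shapley values for one row x, enumerating every feature coalition."""
    n = len(x)

    def value(coalition):
        # Average prediction when the features in `coalition` are fixed to x
        # and the remaining features are taken from each background row in turn.
        total = 0.0
        for row in background:
            mixed = [x[i] if i in coalition else row[i] for i in range(n)]
            total += predict(mixed)
        return total / len(background)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size that exclude feature i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    # value(set()) is the base value: the average prediction over the background.
    return phi, value(set())


def toy_model(row):
    # Hypothetical stand-in for a trained model: one interaction, one linear term.
    return 2.0 * row[0] * row[1] + row[2]


background = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [1.0, 0.0, 2.0], [0.0, 1.0, 3.0]]
x = [1.0, 1.0, 5.0]
phi, base_value = shapley_values(toy_model, x, background)

# Local explanation: base value plus per-feature contributions.
print(base_value, phi, toy_model(x))
```

The contributions add up to the gap between the model's prediction for `x` and the base value, which is the additive property that gives SHAP its name. The enumeration above grows exponentially with the number of features; TreeExplainer avoids this cost by exploiting the structure of the trees.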

Prior Knowledge Expected

No previous knowledge expected

Raphaël Lüthi leads machine learning & AI @ Groupe Mutuel, one of the leading health insurance companies in Switzerland. With 5+ years of experience working on data and ML problems, Raphaël specialises in creating the right conditions for data science projects (and teams) to thrive and reliably deliver value.