PyData London 2023

08:00
08:00
60min
Breakfast & Registration
Minories
08:00
60min
Breakfast & Registration
Warwick
08:00
60min
Breakfast & Registration
Salisbury
09:00
09:00
90min
A dive into Hyperparameter Optimization
Tanay Agrawal

The tutorial aims to introduce the audience to the power of Hyperparameter Optimization. It will help them learn how a few simple Python libraries can make a huge difference in the behavior of their ML models.

We start with understanding the importance of hyperparameters and the different distributions they are selected from. We then review some basic methods of optimizing hyperparameters, moving on to distributed methods and then to Bayesian optimization methods. We'll use these algorithms hands-on and play around with search spaces. We'll try out packages like Hyperopt, Dask and Optuna to tune hyperparameters.
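
As a rough illustration of the most basic method in that progression, the sketch below runs a random search in plain Python. The objective function, the hyperparameter names and their ranges are all invented for illustration; in the tutorial itself, packages such as Hyperopt or Optuna would take over the sampling and bookkeeping.

```python
import math
import random


def sample_params(rng):
    """Draw one configuration from a hypothetical search space."""
    return {
        # Learning rates are commonly sampled log-uniformly (here 1e-4 to 1e-1).
        "learning_rate": 10 ** rng.uniform(-4, -1),
        # Integer-valued hyperparameter sampled uniformly from 2..10.
        "max_depth": rng.randint(2, 10),
    }


def toy_loss(params):
    """Stand-in for a validation loss; lowest near lr=0.01 and depth=6."""
    lr_term = (math.log10(params["learning_rate"]) + 2) ** 2
    depth_term = ((params["max_depth"] - 6) / 4) ** 2
    return lr_term + depth_term


def random_search(n_trials, seed=0):
    """Evaluate n_trials random configurations and keep the best one."""
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    for _ in range(n_trials):
        params = sample_params(rng)
        loss = toy_loss(params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss


best_params, best_loss = random_search(n_trials=50)
print(best_params, round(best_loss, 4))
```

Bayesian methods differ mainly in how the next configuration is chosen: instead of sampling blindly, they use the losses observed so far to propose promising candidates.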

This tutorial will help beginner-level ML practitioners and working professionals use these methods in their applied ML tasks. They will be able to enhance the model quality and tune hyperparameters for bulky experiments more effectively.

Prior Knowledge Expected - Basic Python, a very basic understanding of machine learning.
Good to have - experience with libraries like scikit-learn (just knowing model.fit() should be enough).

Salisbury
09:00
90min
From zero to a working ML system with only Python, free serverless services and FTI pipelines
Jim Dowling

We will build an operational ML system to predict air quality in London. Instead of a single monolithic ML pipeline, we will build a more manageable system as 3 FTI pipelines: a Feature pipeline, a Training pipeline, and an Inference pipeline, connected together by a feature store. The feature pipeline scrapes new data and provides historical data (air quality observations and weather forecasts). The training pipeline produces a model using the air quality observations and features. The inference pipeline takes weather forecasts and predicts air quality for London, visualized in a UI. The system will be hosted on free serverless services: Modal, Hugging Face Spaces, and Hopsworks. It will be a continually improving ML system that keeps collecting more data, making better predictions, and providing a hindcast with insights into its historical performance.
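
The split into three pipelines around a shared store can be sketched in a few lines of plain Python. This is only a toy under stated assumptions: the `FeatureStore` class is a hypothetical in-memory stand-in (not the Hopsworks API), the observations are invented, and a one-variable least-squares line replaces whatever model the tutorial actually trains.

```python
class FeatureStore:
    """Hypothetical in-memory stand-in for a real feature store."""

    def __init__(self):
        self.rows = []

    def insert(self, rows):
        self.rows.extend(rows)

    def read(self):
        return list(self.rows)


def feature_pipeline(store, raw_observations):
    """Turn raw observations into feature rows and write them to the store."""
    rows = [{"wind_speed": o["wind_speed"], "pm25": o["pm25"]} for o in raw_observations]
    store.insert(rows)


def training_pipeline(store):
    """Fit pm25 = slope * wind_speed + intercept by ordinary least squares."""
    rows = store.read()
    xs = [r["wind_speed"] for r in rows]
    ys = [r["pm25"] for r in rows]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return {"slope": slope, "intercept": mean_y - slope * mean_x}


def inference_pipeline(model, forecasts):
    """Predict air quality for each weather forecast row."""
    return [model["slope"] * f["wind_speed"] + model["intercept"] for f in forecasts]


# Invented historical observations: windier days have lower PM2.5.
observations = [
    {"wind_speed": 2.0, "pm25": 30.0},
    {"wind_speed": 4.0, "pm25": 24.0},
    {"wind_speed": 6.0, "pm25": 18.0},
    {"wind_speed": 8.0, "pm25": 12.0},
]

store = FeatureStore()
feature_pipeline(store, observations)
model = training_pipeline(store)
predictions = inference_pipeline(model, [{"wind_speed": 5.0}])
print(predictions)
```

Because each pipeline only talks to the store or to the trained model, each one can be scheduled, tested and replaced independently.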

Minories
09:00
90min
sktime - python toolbox for time series: time series classification, regression, clustering, with modular time series distances and kernels
Sagar Mishra

sktime is a widely used scikit-learn compatible library for learning with time series. sktime is easily extensible by anyone, and interoperable with the pydata/numfocus stack.

This tutorial explains how to use sktime for three learning tasks with independent instances of time series: time series classification, regression, clustering. It also explains their close connection to time series distances, kernels, and time series alignment, and how to flexibly combine such estimators to classifiers, regressors, clusterers with custom distances/kernels or feature extraction steps.

This is a continuation of the sktime introductory tutorial at PyData Global 2021.

Warwick
10:30
10:30
30min
Break
Minories
10:30
30min
Break
Warwick
10:30
30min
Break
Salisbury
10:30
30min
Break
Salisbury
11:00
11:00
90min
An Introduction to Polars
Jonny Edwards

Polars is a next-generation DataFrame library which aims to be fast, efficient, composable and lazy! This introductory tutorial will take you through the basics of getting started with Polars in Python. We will demonstrate its out-of-the-box multi-core efficiency by composing advanced filters and joins, and then compare the results with traditional pandas workflows. As a finale we will look at some lazy processing when applying Polars to large-scale datasets.

Minories
11:00
90min
Entering the Forest with TensorFlow - An intro
Lisa Carpenter

This 90-minute tutorial provides an introduction to using TensorFlow for building random forest models. The tutorial will begin with an overview of the random forest algorithm and its advantages in the context of machine learning. Next, participants will learn how to implement a random forest model using TensorFlow's high-level API, Keras. The tutorial will cover important concepts such as model architecture, hyperparameter tuning, and training and evaluation techniques. Additionally, participants will learn how to use TensorFlow's TensorBoard to visualize and monitor their models during training. The tutorial will conclude with a discussion of best practices and tips for improving the performance of random forest models. By the end of the tutorial, participants will have gained a solid understanding of how to use TensorFlow to build powerful and accurate random forest models.

Warwick
11:00
90min
From Passive to Active: Exploring the Benefits of Active Learning in Data Science
Mate Timar

Active Learning is a powerful technique in the field of data science that enables efficient use of labelling resources. In this 90-minute-long hands-on tutorial, we will provide a step-by-step guide on how to apply basic Active Learning techniques for a document classification problem.

The tutorial will begin with an introduction to Active Learning, followed by a brief discussion of its cost and time savings benefits. Next, we will implement clustering to select the first batch of training data. Then, we will train a document classification model and analyse fundamental Active Learning concepts such as diversity, isolation, and model uncertainty. We will compare different metrics to select the best points for annotation.
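
To make the idea of model uncertainty concrete, here is a minimal sketch of one common selection rule, entropy-based uncertainty sampling. The document IDs and class probabilities are invented; in practice they would come from the trained document classifier, and the tutorial may well compare other metrics.

```python
import math


def entropy(probs):
    """Shannon entropy of a predicted class distribution (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_annotation(predictions, batch_size):
    """Return the IDs of the unlabelled documents the model is least sure about."""
    ranked = sorted(
        predictions,
        key=lambda doc_id: entropy(predictions[doc_id]),
        reverse=True,
    )
    return ranked[:batch_size]


# Hypothetical class probabilities for four unlabelled documents.
predictions = {
    "doc_a": [0.98, 0.01, 0.01],
    "doc_b": [0.40, 0.35, 0.25],
    "doc_c": [0.70, 0.20, 0.10],
    "doc_d": [0.34, 0.33, 0.33],
}

to_label = select_for_annotation(predictions, batch_size=2)
print(to_label)
```

Documents with near-uniform predictions are sent for annotation first, while confidently classified ones such as `doc_a` are left for later.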

Finally, we will evaluate the model's performance and compare the results of Active Learning with random annotation. Throughout the tutorial, attendees will have the opportunity to work on their implementation and receive assistance.

By the end of this tutorial, attendees will better understand the principles of Active Learning and how to apply them to their own supervised learning problems, enabling them to make more efficient use of their labelling resources.

Salisbury
12:30
12:30
60min
Lunch
Minories
12:30
60min
Lunch
Warwick
12:30
60min
Lunch
Salisbury
13:30
13:30
90min
Bring best practices to your messy data science team!
Gabriel Harris

If part of your job is to constantly poke your fellow data scientists to isolate project environments, update requirements, clean code, write consistent docstrings, etc., then you should definitely join us for this very hands-on tutorial with reproducibility, compliance, and consistency in mind.

Minories
13:30
90min
Building an End-to-End Open-Source Modern Data Platform for Biomedical Data
Alberto Labarga

Join us for a 90-minute tutorial on how to build an end-to-end open-source modern data platform for biomedical data using Python-based tools. In this tutorial, we will explore the technologies related to data warehousing, data integration, data transformation, data orchestration, and data visualization. We will use open-source tools such as DBT, Apache Airflow, Openmetadata, and Querybook to build the platform. All materials will be available on GitHub for attendees to access.

Warwick
13:30
90min
Hands-on Intro to developing Explainability for Recommendation Systems
Ade Idowu

Over the last decade, the commercial use of recommendation engines/systems by businesses has grown substantially, enabling the flexible and accurate recommendation of items/services to users. Examples of popular recommenders include (to name a few) the movie, video and book recommendation engines offered by Netflix, YouTube and Amazon respectively.

Most recommender systems are “black-box” algorithms trained to infer relevant items for users, using techniques such as collaborative filtering, content-based filtering or hybrid models. The algorithms used in these systems are broadly opaque, so the predicted recommendations lack full interpretability/explainability. Making recommenders explainable is essential: it provides transparency and answers the question, for users and system designers alike, of why the engine recommended particular items.

Over the last few years there has been a growing area of research and development in explainable recommendation systems. Explainable recommendation systems are generally classified as Post-hoc (i.e. explainability is done post-recommendation) or Intrinsic (explainability is integrated into the recommender model) approaches. This workshop will provide a hands-on implementation of some of these approaches.

Salisbury
15:00
15:00
30min
Break & Snack
Minories
15:00
30min
Break & Snack
Warwick
15:00
30min
Break & Snack
Salisbury
15:30
15:30
90min
Delta Lake 101: How many water metaphors does it take to describe data?
Holly Smith, Eoin O'Flanagan

Delta Lake is an open-source storage framework that enables the creation of a Lakehouse architecture using a variety of compute engines such as Spark, PrestoDB, Flink, Trino, and Hive from Python. Its high data reliability and optimized query performance make it an ideal solution for supporting big data use cases, including batch and streaming data ingestion, fast interactive queries, and machine learning.

Warwick
15:30
90min
MLflow workshop
Theodore Meynard

In this tutorial, we will learn the basics of MLflow. After introducing the library and the problem it solves, we will implement an end-to-end machine learning lifecycle using MLflow.

Salisbury
15:30
90min
Martial Arts Meets Machine Learning: Recognizing Judo Throws with MMAction2
Habeeb Shopeju

Object detection is arguably the most common Computer Vision task. It is applied to images and videos across various domains. However, action recognition is a tad different from object detection because it can be difficult to tell certain actions from a single image. It is hard to tell if a door is being opened or closed or tell what martial art technique is being executed from an image.

In this tutorial, the MMAction2 framework will be used to train an action recognition model to detect what Judo throws are being performed in videos. While it will be fun seeing Machine Learning techniques applied to Martial Arts, the knowledge and techniques applied can easily be generalized to other action recognition tasks where simple object detection does not suffice.

Minories
08:00
08:00
60min
Breakfast & Registration
Minories
08:00
60min
Breakfast & Registration
Warwick
08:00
60min
Breakfast & Registration
Salisbury
08:00
60min
Breakfast & Registration
Beaumont
09:00
09:00
45min
Keynote: Large Language Models: From Prototype to Production
Ines Montani

Keynote with Ines Montani

Warwick
09:45
09:45
30min
Break & Snacks
Minories
09:45
30min
Break & Snacks
Warwick
09:45
30min
Break & Snacks
Salisbury
10:15
10:15
40min
Building a data science solution for an NGO when you don’t know what infrastructure it will run on: a case study predicting tutor supply and demand mismatch
Adam Hill

The Brilliant Club supports less advantaged students to access and succeed in the UK’s most competitive universities. They do this by mobilising the PhD community to support students in schools via their courses and tutoring programme. A challenge they face is being able to anticipate the tutor supply they need to meet the increasing demands of their programmes as they expand nationally. A team of six DataKind UK volunteers worked with The Brilliant Club to develop a way to forecast and visualise the mismatch between tutor supply and demand across the UK. This is a talk about how we collaboratively explored their data and built a valuable, new tool for them and, crucially, how we did so in a flexible, scalable way that provides them with immediate value but also will fit into their future use of digital and cloud-based tools. This talk is for people intrigued by deploying new, data-driven solutions in organisations that are only just maturing into the data space. No previous knowledge is required.

Minories
10:15
40min
Causal modelling of agent-customer pairing outcomes to optimise call centre performance
Petros Syntelis

Large-scale call centres are the frontline of customer experience across many industries. Optimizing their operations is crucial for achieving better customer service. We model agent-customer pairing as a “talent” allocation problem. In this talk, we discuss how we used uplift modelling to provide real-time agent-customer pairings that drive a positive lift in overall interaction score (which can come from any arbitrary scoring function). We discuss the challenges of developing and deploying such models to make real-time interventions in call centres. Similar approaches can be used to drive uplift of any important business KPI with respect to an allocation decision.

Warwick
10:15
40min
The Opinionated Python Stack I chose for my Company’s ML Projects and how I bundled it in a Project Generator
Yannick Wolff

Have you ever struggled with choosing the right tools for your Machine Learning projects? As a Lead Data Scientist in a consulting firm, I faced this challenge repeatedly and finally converged on a small set of technologies that make it possible to build reliable and scalable projects with a great DX (Developer Experience). In this talk, I will share the key components of my ML stack, including DVC, Streamlit, FastAPI, Terraform and other powerful tools to streamline the development and experimentation processes. Finally, through a live demo, I will show you the Project Generator I’ve built to encourage adoption of these technologies and to help Data Scientists focus on the ML itself rather than the "plumbing" around it. Attendees should have a basic understanding of Python and Machine Learning concepts.

Salisbury
11:00
11:00
40min
Code Smells in Data Science: What can we do about them?
Laszlo Sragner

We all want to write cleaner code but usually don't know where to start. It also doesn't help that most guides available are written for software engineers and not data scientists.

Code smells are a taxonomy and a well-defined set of instructions on how to identify typical antipatterns in your code and change them in a few steps.

In this talk, I will select a short list of typical code smells that frequently appear in data-intensive workflows and walk you through how to resolve them.

Salisbury
11:00
40min
Executives at PyData
Ian Ozsvald

Executives at PyData is a facilitated discussion session for executives and leaders to discuss challenges around designing and delivering successful data projects, organizational communication, product management and design, hiring, and team growth.

We'll announce the agenda at the start of the session. You can ask questions or raise issues to get feedback from other leaders in the room, NumFOCUS board members, and Ian and James.

Organized by Ian Ozsvald (London) and James Powell (New York)

Beaumont
11:00
40min
Large scale agent-based simulations: how to do it right, and how we used one to optimise fibre broadband rollout across the UK
James Schofield, Tristan West

What’s the optimal way to upgrade a broadband network to fibre? In this session we’ll talk about how we used actor-based simulations and discrete optimisation to build a planning tool that has not only optimised one of the biggest fibre upgrade operations in the UK, but also unlocked powerful scenario testing capabilities. We’ll go through how to architect scalable, agent-based simulations using only open-source libraries and Python, and take you on our journey (including pointing out pitfalls) towards optimising UK-wide fibre broadband rollout.

Warwick
11:00
40min
Python for the Public Sector: How data science is being put to work for the public good
Arthur Turrell

In this talk, I will take you on a rollercoaster tour of how data science is delivering for the public good at the Office for National Statistics’ Data Science Campus and beyond. Drawing on examples from the dozens of data scientists working at the Campus, you’ll find out how Python is improving the public sector already in a myriad of ways, from creating or improving national statistics, to forecasting the economy, to dealing with Covid-19, to evaluating efforts to tackle the gender pay gap. We’ll even see how a tweet by a food campaigner led to a huge effort to web-scrape budget brand offerings in UK supermarkets—analysis that made it onto every major UK news programme! And we’ll look ahead to the challenges, and potential, of Python for the public sector in the future.

Minories
11:45
11:45
40min
Green software - building sustainable Python data analytics
Mark Pinkerton

Were you aware that the cloud infrastructure powering modern computing has a larger greenhouse gas footprint than commercial aviation? This talk is aimed at developers and data scientists who are concerned about the impact of their work on the environment and want to explore practical solutions to address this challenge. We will explain how greenhouse gas emissions are categorized and estimated for computing. We will also introduce approaches to developing more sustainable software and provide practical examples using Python data analytics.

Minories
11:45
40min
MLOps in practice: our journey from batch to real-time inference
Theodore Meynard

I will present the challenges we encountered while migrating an ML model from batch to real-time predictions and how we handled them. In particular, I will focus on the design decisions and open-source tools we built to test the code, data and models as part of the CI/CD pipeline and enable us to ship fast with confidence.

Warwick
11:45
40min
✨ fastAPI facts we wish we'd known beforehand. Spoiler: It's not about getting started.
Alexander Hendorf

An exchange of views on FastAPI in practice.

FastAPI has become an integral part of the PyData ecosystem. FastAPI is great: it helps many engineers create REST APIs based on the OpenAPI standard and run them asynchronously. It has a thriving community and educational documentation.

FastAPI does a great job of getting people started with APIs quickly.

This talk will point out some obstacles and dark spots that I wish we had known about before. In this talk we want to highlight solutions based on experience building a data hub in asset management.

Salisbury
12:30
12:30
60min
Lunch
Minories
12:30
60min
Lunch
Warwick
12:30
60min
Lunch
Salisbury
13:30
13:30
60min
Lightning Talks
Warwick
14:30
14:30
30min
Break & Snack
Minories
14:30
30min
Break & Snack
Warwick
14:30
30min
Break & Snack
Salisbury
15:00
15:00
40min
Event Driven Machine Learning
Natan Mish

This talk focuses on the benefits of using an event-driven approach for machine learning products. We will cover the basics of event-driven architecture for software development and provide examples of how it can be applied for machine learning use cases. The talk will be accompanied by live examples and code for you to follow along, using open source tools such as Apache Kafka, FastAPI and River. By the end of the talk, you'll have a good understanding of the advantages of event-driven architectures, such as improved scalability and responsiveness. If you are a machine learning practitioner interested in exploring this topic, this talk is a great starting point in which we will cover the concepts, tools and common pitfalls of the event driven framework for machine learning products.
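
As a small, self-contained illustration of the event-driven idea, the sketch below consumes a stream of events and updates a model incrementally as labels arrive. Everything here is an assumption made for illustration: the event types are invented, a `deque` stands in for a message broker such as Apache Kafka, and `RunningMean` is a toy model whose `learn_one`/`predict_one` method names merely echo the online-learning style of libraries like River.

```python
from collections import deque


class RunningMean:
    """Toy incremental model: predicts the mean of the targets seen so far."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def learn_one(self, y):
        # Update the mean in place without storing past observations.
        self.n += 1
        self.mean += (y - self.mean) / self.n

    def predict_one(self):
        return self.mean


def process(events, model):
    """Consume events in arrival order and return the predictions served."""
    queue = deque(events)  # stand-in for a topic on a message broker
    served = []
    while queue:
        event = queue.popleft()
        if event["type"] == "prediction_requested":
            served.append(model.predict_one())
        elif event["type"] == "label_arrived":
            model.learn_one(event["value"])
    return served


# Hypothetical event stream mixing new labels and prediction requests.
events = [
    {"type": "label_arrived", "value": 10.0},
    {"type": "prediction_requested"},
    {"type": "label_arrived", "value": 20.0},
    {"type": "prediction_requested"},
]

served = process(events, RunningMean())
print(served)
```

Each prediction reflects only the labels that arrived before it, which is what lets such a system respond to new data without a separate batch retraining step.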

Minories
15:00
40min
Garbage in -> Pydantic -> you're golden!
Samuel Colvin

Pydantic is a data validation library for Python that has seen massive adoption over the last few years. It's used by major data science and ML libraries like Spacy, Huggingface and jinja-ai, and overall Pydantic is downloaded over 50m times a month!

In this talk Samuel Colvin, the creator of Pydantic, will cover two subjects which have seen massive interest in recent years:

  • How Pydantic can be used to prepare data for processing thereby saving time and avoiding errors
  • The emergence of Rust as the go-to language for high performance Python libraries - how this might go in the future, and the benefits and drawbacks of the trend

Salisbury
15:00
40min
Serverless Python Analytics at Petabyte scale using ArcticDB
William Dealtry

At Man Group we ingest data of all shapes and sizes, from market prices to weather, ESG reporting to news media. Storing that data in a format that is useful both to researchers and to strategies trading more than a billion dollars is a unique technical challenge, one that has resulted in two iterations of our high-performance data storage product ArcticDB. The aim is nothing less than to bring the performance and analytical capabilities of a server infrastructure right into the Python client. With the client now available on Conda and PyPI, and the source published on GitHub, I want to take you on a tour of our database: the design rationale, the ups and downs of developing a new database while supporting a billion-dollar trading estate, and the lessons we've learned along the way.

Warwick
15:00
40min
Teach data science better (a discussion)
Lisa Carpenter

This discussion session is for educators to talk about how we teach data science in industry and academia. It'll be a guided discussion: we'll vote on the top topics to discuss at the start and then we'll work our way through problems, solutions and new ideas. Maybe we'll get to talk about ChatGPT, or using Jupyter, or when to "teach in an IDE", or how to balance lecture vs problem solving vs homework. All topics can be up for voting at the start.

Beaumont
15:45
15:45
40min
Language Models for Music Recommendation
Nischal, Raghotham S

Music streaming services like Spotify and YouTube are famous for their recommendation systems, and each service takes a unique approach to recommending and personalizing content. While most users are happy with the recommendations provided, there is a section of users who are curious about how and why a certain track is recommended. Complex recommendation systems take into account various factors like track metadata, user metadata, and play counts, along with the track content itself.

Inspired by Andrej Karpathy to build our own GPT, we set out to use language models to build our own music recommendation system.

Minories
15:45
40min
Mastering Great Expectations: Ensuring Data Quality in Your Data Pipelines.
Carsten Frommhold

Join me as we dive into the world of Great Expectations, an open-source tool that helps data-driven teams ensure their pipelines deliver high-quality data. You will be introduced to the key concepts of Great Expectations, including data validation, documentation and lineage. Plus, I'll show you how to set up Great Expectations in a cloud environment, using Google Cloud Platform as an example. By the end of this talk, you'll have a solid understanding of how Great Expectations can improve the reliability and correctness of your datasets and transformations.

Warwick
15:45
40min
Pandas 2, Dask or Polars? Quickly tackling larger data on a single machine
Ian Ozsvald, Giles Weaver

Pandas 2 brings new Arrow data types, faster calculations and better scalability. Dask scales Pandas across cores. Polars is a new competitor to Pandas designed around Arrow with native multicore support. Which should you choose for modern research workflows? We'll solve a "just about fits in RAM" data task using all three, talking about the pros and cons so you can make the best choice for your research workflow. You'll leave with a clear idea of whether Pandas 2, Dask or Polars is the tool for your team to invest in.

Salisbury
16:30
16:30
40min
Autoencoders for Time Series Clustering
Vincenzo Crescimanna, Valerio Bonometti

AutoEncoders (AEs) are among the most popular techniques in modern machine learning. Thanks to their strong representation learning capability, they can be used not only to generate data, but also for many other tasks, e.g. clustering, dimensionality reduction and transfer learning.

Despite their popularity, they are usually advertised for applications with static tabular data (e.g. for recommender systems) and image data (e.g. for computer vision tasks). With this talk we will try to shed some light on a less well-known area of application, namely the use of AEs with time series data. After a brief introduction to AEs, we will highlight challenges to their application in the time-series domain, with a particular focus on clustering, feature extraction and transfer learning.

The talk is for everyone with an interest in deep learning, time series and their intersection. Although some working knowledge of applied machine learning (deep learning in particular) and time series analysis would be beneficial, the talk will be delivered in a format accessible to all data science practitioners.

Warwick
16:30
40min
Discussing Higher Performance Python (Birds of a Feather session)
Ian Ozsvald

This discussion session is for anyone using Python for higher performance work. You probably use Pandas, NumPy, Polars, Dask, Vaex, Modin, cuDF or any of the related tools. You've got questions, and you want to know what other people are using, what's pragmatic and where new opportunities might exist.

This will be a guided discussion: we'll vote on topics at the start of the session and then host Ian will work through the list.

Beaumont
16:30
40min
The Future of MLOps: Embedding Active Learning into Your ML Model Development Pipelines
Frederik Hvilshøj

Join us for an insightful session on active learning and its applications in machine learning. You will learn how leading teams are embedding active learning into their ML pipelines and how to build your first active learning loop. This session is for ML engineers and data scientists (aspiring or practitioners) who want to stay updated on the latest techniques and learn how to implement active learning with open-source tools.

Minories
16:30
40min
Unleashing the Power of dbt and Python for Modern Data Stack
Meder Kamalov

This talk will introduce the PyData community to dbt and demonstrate how to leverage Python to unlock its full potential. Attendees will learn best practices for working with dbt, how to integrate it with other tools in their data stack, and how to use Python packages like fal to perform complex data analysis. With real-world examples and use cases, this talk will equip attendees with the tools to build a modern, scalable, and maintainable data infrastructure.

Salisbury
08:00
08:00
60min
Breakfast & Registration
Minories
08:00
60min
Breakfast & Registration
Warwick
08:00
60min
Breakfast & Registration
Salisbury
08:00
60min
Breakfast & Registration
Beaumont
09:00
09:00
40min
Keynote: Using data science for social good
Lisa Carpenter, Antonio Campello

Keynote with Lisa Carpenter and Antonio Campello

Warwick
09:45
09:45
30min
Break
Minories
09:45
30min
Break
Warwick
09:45
30min
Break
Salisbury
10:15
10:15
40min
Building a skills extraction library using NLP tools
India Kerle, Liz Gallagher, Jack Vines

There is no publicly available data on the skills that are commonly required in UK online job adverts, despite this information being useful for a range of use cases. To address this, we have built an open source skills extraction Python library using spaCy and huggingface. Our approach is twofold: we train a named entity recognition model to extract skill entities from job adverts, then map them onto any standardised skills taxonomy. By applying this algorithm to a dataset of scraped online job adverts, we are then able to find skill similarities amongst occupations, and regional differences in skill requirements.

Minories
10:15
40min
Getting from data to insights with powerful XAI & DataViz open-source tool (inc. diving deep into shap’s TreeExplainer)
Raphaël Lüthi

Have you ever wanted a standard and efficient process to approach new datasets? Do you want a systematic way of highlighting complex nonlinear or low frequency patterns in your data?

In this talk, I will share the open-source stack that I use to efficiently extract interesting insights from any dataset. I will teach you how to use data visualisation, gradient boosted decision trees and XAI tools to quickly find hidden patterns, de-risk your projects early, or debug your models.

Warwick
10:15
40min
New Developments in Pandas and Dask Dataframes
Matthew Rocklin

We're in a new era of dataframe development. Libraries like Arrow, Polars, DuckDB, Vaex, Modin, and others stretch the bounds of performance on what we think can be done with tabular data in Python. These systems have great benchmarking results and generate significant buzz on social media.

Pandas, the community favorite, is also innovating, although with less buzz. Structural improvements like Arrow data types, copy on write, and more bring the world's most popular dataframe library (55% of Python users) significantly better performance and memory use. Additionally, Dask, a parallel computing library developed closely with Pandas, has added new features in the last year, like memory-stable shuffling and task queueing, along with recent experiments in query optimization, which we'll discuss as well.

In this talk we'll highlight some of these new features and show the impact they make on speed and cost on real-world workloads, as well as a vision for future development.

Salisbury
11:00
11:00
40min
Data Storytelling through Visualization
Marysia Winkels

Data is everywhere. It is through analysis and visualization that we are able to turn data into information that can be used to drive better decision making. Out-of-the-box tools will allow you to create a chart, but if you want people to take action, your numbers need to tell a compelling story. Learn how elements of storytelling can be applied to data visualization.

Warwick
11:00
40min
From correlations to causality in machine learning– a gentle guide to causal inference
Steve Goodman

Today most conventional ML systems look to exploit correlations in data in order to draw inferences. However, as we learned back in school Statistics class, correlation is not causation. So when you need to know the ‘why’ behind a particular prediction, or why A outperforms B in an experiment, relying on correlations is insufficient. Furthermore, some ML models are built purely for explainability and insight rather than prediction, in order to understand how the world works so that we could potentially make some kind of policy change, e.g. what if we had chosen a different strategy or tactic: would the outcome have been different, and if so, by how much? To answer these kinds of questions, you need to delve into the world of causality.

This talk is a gentle (and occasionally entertaining) introduction to the interdisciplinary field of causality and how it is starting to impact machine learning. You will learn what kinds of questions causal inference can answer, and how it can address some of the limitations of current explainable ML methods, under certain conditions. I draw upon use cases from financial services and marketing, and I will show a short practical example of how combining human domain knowledge (intuitively via Graphical Causal Models) with your data can sometimes unlock insights not recoverable by purely data-driven approaches.

Salisbury
11:00
40min
Unconference: Is it GDPR 2.0? What is CRA and how can it affect you and OSS in general
Cheuk Ting Ho

The European Parliament proposed a Cyber Resilience Act, which basically wants all software to have an “EC” stamp on it. There is a non-commercial carve-out, but it is still not enough to make sure open-source projects with limited resources are exempted from the Act. How will it affect the OSS ecosystem?

Beaumont
11:00
40min
Web Data Extraction with Deep Learning
Konstantin Lopukhin

Extracting data from web pages is a problem which is not so well covered and researched compared to image or text classification, object detection or named entity recognition. But this problem is extremely exciting to look into and rewarding to work on, because web pages can be represented in so many ways: as a screenshot of a page, as its text, as an HTML tree, as a sequence of elements with discrete and continuous properties, and in other ways. This leads to many diverse approaches, which often combine different input types and ways of data representation inside one model. In this talk we will explore several intriguing approaches for web data extraction, and see how you can come up with novel approaches and grow your model according to the task at hand.

This talk is intended for anyone with an interest in neural networks. I hope it gives you inspiration and intuition for building deep learning models tailored to the structure of your data. This is applicable not only to web information extraction, but also to document extraction and other domains with structured text or image inputs.

Minories
11:45
11:45
40min
"Unstructured" terabyte-scale textual data processing in a distributed cluster
Jay Chia

ChatGPT has reignited worldwide interest in text data, capturing the imaginations of thousands of developers, but how do we actually build large scale production pipelines for working with and processing this highly unstructured data?

SQL is a great language for simple data modalities that fit in a database table, but when it comes to complex "unstructured" data, it is Python that really shines. In this talk, we show how easy it is to go from data storage to querying and processing large amounts of unstructured data using modern open-source Python tooling such as Ray, Daft and HuggingFace models.

Minories
11:45
40min
Data work across Industries - Discussion
John Carney

This session will facilitate a discussion exploring the differences in data work between different industries, including eCommerce, Insurance, Cyber Security, and Finance.

We will discuss the challenges and opportunities of data work in each industry, as well as the skills and knowledge that data professionals may need, in order to be successful. We will cover Data Engineering and Data Science specifically, but this is an open forum for anyone to discuss data challenges in different industries.

Beaumont
11:45
40min
How to build stunning Data Science Web applications in Python
Florian Jacta

This talk presents Taipy, a new low-code Python package that allows you to create complete Data Science applications, including graphical visualization and managing algorithms, pipelines, and scenarios.

Warwick
11:45
40min
Robot Holmes and the Vision-Language Murder Mysteries
Johannes Kolbe

We will follow master detective Robot Holmes on his way to solve one of his hardest cases so far - a series of mysterious murders in the city of MLington. The traces lead him to the Vision-Language part of town, which has been a quiet and tranquil place with few incidents until lately. For a few months the neighbourhood has been growing extensively and careless benchmark leaders are dropping dead at an alarming rate.

Robot Holmes sets out to find the cause for this new development and will gather intel on some of the most notorious of the new citizens of the Vision-Language neighbourhood and find out what makes them tick.

Salisbury
12:30
12:30
60min
Lunch
Minories
12:30
60min
Lunch
Warwick
12:30
60min
Lunch
Salisbury
12:30
60min
PyData Organizers Meetup

We welcome all PyData Organizers to join us for an open discussion during lunch.

Beaumont
13:30
13:30
60min
Lightning Talks
Warwick
14:30
14:30
30min
Break & Snack
Minories
14:30
30min
Break & Snack
Warwick
14:30
30min
Break & Snack
Salisbury
15:00
15:00
40min
Actionable Machine Learning in the Browser with PyScript
Valerio Maggio

PyScript brings the full PyData stack to the browser, opening up unprecedented use cases for interactive data-intensive applications. In this scenario, the web browser becomes a ubiquitous computing platform, operating within a (nearly) zero-installation & server-less environment.

In this talk, we will explore how to create full-fledged interactive front-end machine learning applications using PyScript. We will dive into the main features of the PyScript platform (e.g. built-in JavaScript integration and local modules), discussing the new data & design patterns (e.g. loading heterogeneous data in the browser) required to adapt to and overcome the limitations imposed by the new operating environment (i.e. the browser).

Salisbury
15:00
40min
ChatGPT, LLMs, and the future of data science
Ben Auffarth

ChatGPT and the GPT models by OpenAI have brought about a revolution in the way we think about the world: not only in how we write texts, but in how we can process information about the world. Let's discuss the capabilities and limitations of large language models including ChatGPT, as well as possible applications, tooling, data security, wider societal implications, and ethics. Some applications have gone as far as automating data analysis, so this also poses a question about the future of data science.

Beaumont
15:00
40min
Driving down the Memray lane - Profiling your data science work
Cheuk Ting Ho

When handling a large amount of data, memory profiling the data science workflow becomes more important. It gives you insight into which process consumes lots of memory. In this talk, we will introduce Memray, a Python memory profiling tool, and its new Jupyter plugin.

Minories
15:00
40min
The Sound of Your Footsteps is a Digital Biomarker
Debayan Das

This will be a gentle introduction to the world of clinical gait analysis and how your gait (a.k.a. the way you walk) is a digital biomarker for predicting physical and cognitive health. We will talk about digital biomarker engineering from unconventional sources of data (footstep sounds, for example). To demonstrate a real-life application, I will briefly mention how the R&D team at MiiCare uses acoustic machine learning for fall risk assessment for older adults living alone at home and in care homes across the UK.

You should join this talk if you are interested in digital health, digital biomarker engineering and applications of AI in social care.

Warwick
15:45
15:45
40min
Grouped Weighted Summary Statistics in pandas… and Other Things
James Powell

This talk is about grouped weighted summary statistics in pandas… and some other things.
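
As a rough illustration of the topic (not taken from the talk itself), a grouped weighted mean in pandas can be sketched as below. The column names, the sample data and the helper `grouped_weighted_mean` are all hypothetical, and this is only one of several ways to express the calculation.

```python
import pandas as pd

# Hypothetical data: prices weighted by units sold, grouped by product.
df = pd.DataFrame({
    "product": ["a", "a", "b", "b"],
    "price": [10.0, 20.0, 5.0, 15.0],
    "units": [1, 3, 2, 2],
})

def grouped_weighted_mean(frame, group_col, value_col, weight_col):
    """Return sum(value * weight) / sum(weight) per group as a Series."""
    # Multiply first, then aggregate, so both sums stay vectorised.
    weighted = frame[value_col] * frame[weight_col]
    numerator = weighted.groupby(frame[group_col]).sum()
    denominator = frame[weight_col].groupby(frame[group_col]).sum()
    return numerator / denominator

result = grouped_weighted_mean(df, "product", "price", "units")
```

Computing the weighted products before grouping avoids a per-group Python function call, which tends to matter on large frames.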

Minories
15:45
40min
Introduction to RL for pricing problems
Cesc Cunillera

Reinforcement learning (RL) has become the go-to framework when working with decision processes. Originally demonstrating superhuman performance in videogames, applications of reinforcement learning providing state-of-the-art results now extend to a myriad of areas: from drug discovery to autonomous driving and computer vision, just to name a few.

In this talk, we will concentrate on the application of RL to pricing environments. In particular, we will consider how Ben, our friendly neighbourhood gelato merchant, might approach the dynamic problem of pricing his products throughout the year with RL. We will introduce the problem as a Markov decision process and review the most common archetypes of RL algorithms to solve it while highlighting various pitfalls and challenges, always with a focus on its implementation to pricing.

By the end of the talk, we will be able to help Ben set up a pricing model for his delicious gelato!
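
To make the Markov decision process framing concrete, here is a minimal sketch using tabular Q-learning, which is one common RL archetype and not necessarily the one the talk covers. Everything in it is an assumption made for illustration: the two seasons, the three candidate prices, the linear demand curve and all of the numbers.

```python
import random

# Toy MDP (all values hypothetical): the state is the season, the action is a price.
SEASONS = ["winter", "summer"]   # seasons alternate regardless of the price chosen
PRICES = [2.0, 3.0, 4.0]
BASE_DEMAND = {"winter": 10.0, "summer": 30.0}
PRICE_SENSITIVITY = 3.0

def revenue(season, price):
    """Reward: price times a linear, non-negative demand curve."""
    demand = max(0.0, BASE_DEMAND[season] - PRICE_SENSITIVITY * price)
    return price * demand

def train(steps=20000, alpha=0.1, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration."""
    rng = random.Random(seed)
    q = {s: [0.0] * len(PRICES) for s in SEASONS}
    state = "winter"
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(len(PRICES))      # explore
        else:
            action = max(range(len(PRICES)), key=lambda a: q[state][a])  # exploit
        reward = revenue(state, PRICES[action])
        next_state = "summer" if state == "winter" else "winter"
        # Standard Q-learning update towards the bootstrapped target.
        target = reward + gamma * max(q[next_state])
        q[state][action] += alpha * (target - q[state][action])
        state = next_state
    return q

def greedy_prices(q):
    """Read off the price with the highest learned value in each season."""
    return {s: PRICES[max(range(len(PRICES)), key=lambda a: q[s][a])] for s in SEASONS}

q_table = train()
policy = greedy_prices(q_table)
```

In this toy setup the revenue-maximising prices are 2.0 in winter and 4.0 in summer, so the learned policy can be checked against them. A realistic pricing environment would have noisy demand and transitions that depend on the price, which is where the pitfalls mentioned above come in.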

Warwick
15:45
40min
Software Engineering Practices in Data Science
Laszlo Sragner, Matteo Latinov

This discussion session focuses on exploring the application of software engineering practices in the field of data science. Join us to delve into essential aspects such as Python packages, IDEs, testing, refactoring, and architecture that play a crucial role in building robust and scalable data science solutions. We will discuss how adopting software engineering principles can enhance the reliability, maintainability, and efficiency of data science projects. Whether you're a DS manager or practitioner, this session offers a platform to exchange insights, share experiences, and discover innovative approaches to integrating software engineering practices into the data science workflow.

Beaumont
15:45
40min
Synthetic data: what is it and why do we need it?
Zhaozhi Qian

One of the biggest barriers to machine learning and data analytics is the difficulty of accessing high-quality data. Synthetic data has been widely recognized as a promising remedy to this problem. It allows sharing, augmenting and de-biasing data for building performant and socially responsible ML systems. In this talk, I will give an overview of the significant progress in the theory and methodology of synthetic data over the past five years. I will also introduce the open-source library, Synthcity, which implements an array of cutting-edge synthetic data generators to address data scarcity, privacy, and bias. The participants will walk away with a deeper understanding of the theory and practice of synthetic data, an understanding of when which methods apply (or do not apply) to their specific use case, and be ready to apply them in hackathons, competitions, and their day-to-day work.

Salisbury
16:30
16:30
40min
Building end-to-end internal analytics products using Python and open-source
Leo Anthias

This talk educates the audience on how to create end-to-end data products using the Python data ecosystem, from data integration to reporting, dashboards, apps, and surfacing insights.

After analyzing the features found in popular proprietary analytics products across various verticals, this talk will demonstrate how data teams can use open-source libraries to create and deploy applications which are accessible to non-technical end users but hold distinct advantages over proprietary alternatives.

Minories
16:30
40min
Deploying Real-Time Machine Learning Models Using Serverless AWS
Pedro Tabacof

In this presentation, I will show how to use AWS Lambda and API Gateway to deploy real-time machine learning models developed in Python. I will use these tools to create a serverless web endpoint and serve model predictions with high availability/scalability. These tools provide a relatively simple and cost-effective solution for data scientists and machine learning engineers looking to deploy models without the hassle of managing servers and without needing to rely on third parties. I will cover potential pitfalls to be aware of, such as Lambda's cold start delays and memory limitations. Through code examples and practical tips, attendees will gain a solid understanding of how to use serverless AWS to deploy and serve their own machine learning models at scale.
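
As a hedged sketch of the pattern described above (not the speaker's code), a Lambda handler behind an API Gateway proxy integration might look like the following. The `predict` function, `WEIGHTS` and `BIAS` are hypothetical stand-ins for a real trained model, and the request format with a `features` list is an assumption.

```python
import json

# Assumption: a real deployment would load a trained model here, at import time,
# so that warm invocations reuse it. This hypothetical linear scorer stands in for it.
WEIGHTS = [0.5, -0.25]
BIAS = 1.0

def predict(features):
    """Hypothetical model: a fixed linear function of the input features."""
    return BIAS + sum(w * x for w, x in zip(WEIGHTS, features))

def _response(status, payload):
    """Build the response shape expected by an API Gateway proxy integration."""
    return {
        "statusCode": status,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(payload),
    }

def lambda_handler(event, context):
    """Entry point: parse the JSON request body and return a prediction."""
    try:
        payload = json.loads(event.get("body") or "{}")
        features = [float(x) for x in payload["features"]]
    except (KeyError, TypeError, ValueError):
        return _response(400, {"error": "body must be JSON with a 'features' list"})
    if len(features) != len(WEIGHTS):
        return _response(400, {"error": "unexpected number of features"})
    return _response(200, {"prediction": predict(features)})
```

Keeping the model at module level means it is loaded once per container rather than once per request, which does not remove cold starts but keeps warm invocations fast.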

Salisbury
16:30
40min
The 11 Types of Comedy and Large Language Models (LLMs)
Sam Joseph

This talk is about how advances in Large Language Models (LLMs) are helping make inroads into the 11 types of comedy. For many years most, but not all, types of humour were beyond the reach of automated systems. This talk is for those interested in comedy, how it is created, the state of the art in LLMs, and comedy datasets. This talk will include specific code examples as well as trying to be humorous in its own way. At the end the audience will have learnt how LLMs are changing the human/computer comedy landscape.

Warwick