PyData London 2023

I spent 10 years as an astrophysics researcher analysing high-energy data from space telescopes in the search for new objects in the universe and a better understanding of what we already knew to be out there. In 2015 I transitioned to data science, joining a smart-cities startup called HAL24K. Over the next 8 years, I built data science solutions that enabled city governments and suppliers to derive actionable intelligence from their data to make cities more efficient, better informed and to make better use of resources. During that time I built and led a team of 10 data scientists and helped the company spin out four new companies. In 2022, I joined ComplyAdvantage as a Senior Data Scientist working to combat financial crime and fraud.

I have supported DataKind UK since 2015 in their mission to bring pro-bono data science support to charities and NGOs in the third sector. And I have been an active member of the PyData community over the same time period.

Building a data science solution for an NGO when you don’t know what infrastructure it will run on: a case study predicting tutor supply and demand mismatch

Ade Idowu

A lead software engineer and data scientist. Has over 15 years’ experience in the development of software and AI/ML solutions. Pragmatic, analytic problem solver and builder of artificial intelligence solutions for businesses seeking efficiency and value. A passionate advocate of the development and use of ethical AI in products and services.

Hands-on Intro to developing Explainability for Recommendation Systems

Alberto Labarga

Alberto Labarga is a data engineer with over 5 years of experience in the healthcare industry. He specializes in building end-to-end modern data platforms for biomedical data analysis using open-source tools and libraries. Alberto has a Bachelor's degree in Computer Science and a Master's degree in Biomedical Engineering, both from the University of Madrid. He has worked on a variety of data engineering projects, including data warehousing, data integration, data transformation, and data orchestration. Alberto is passionate about open-source technologies and enjoys sharing his knowledge with others. In his free time, he enjoys hiking and playing guitar.

Building an End-to-End Open-Source Modern Data Platform for Biomedical Data

Alexander Hendorf

Alexander Hendorf is responsible for data and artificial intelligence at the boutique consultancy KÖNIGSWEG GmbH. Through his commitment as a speaker and chair of various international conferences he is a proven expert in the field of data intelligence. He's been appointed Python Software Foundation and EuroPython fellow for his various contributions. He's a dedicated community organiser (PyConDE & PyData Berlin, PyData Südwest, PyData Frankfurt, EuroPython, EuroSciPy). He has many years of experience in the practical application, introduction and communication of data and AI-driven strategies and decision-making processes.

✨ fastAPI facts we wish we'd known beforehand. Spoiler: It's not about getting started.

Antonio Campello

Antonio is a senior data scientist at Digital Science, helping clients leverage machine learning to extract useful information from large-scale academic literature data. Prior to that, he was a senior data scientist at the Wellcome Trust, responsible for developing tools to inform funding decisions and portfolio analysis. He holds a PhD in applied mathematics and has over 7 years of experience working in signal processing, data science and machine learning with organisations across the globe, such as AT&T Labs, Télécom Paristech, and Imperial College London. He is a former chapter leader for Datakind UK, a volunteer-led organisation using data science in the service of humanity, and in 2022 he was selected as an emerging leader in philanthropy by the Technology Association of Grantmakers.

Keynote: Using data science for social good

Arthur Turrell

Arthur Turrell is Deputy Director of the Data Science Campus at the UK’s Office for National Statistics (ONS). Following studies in physics and mathematics, Arthur obtained his PhD in plasma physics from Imperial College London. He retains a keen interest in physics and has written a popular science book about nuclear fusion titled ‘The Star Builders’. He began his career in the public sector as a research economist in the Bank of England’s data science team where he led projects combining economics and data science. While at the Bank, he chose mathematical elements of Alan Turing’s work to feature on the UK’s £50 note and published research on labour markets, real-time data, natural language processing, forecasting, and macroeconomic modelling. He is also the author of several open source software packages and the open source online training book ‘Coding for Economists’. In 2021, he moved to the Data Science Campus, the UK public sector’s centre of excellence in data science.

Python for the Public Sector: How data science is being put to work for the public good

Ben Auffarth

Ben is a machine learning engineer and developer. With a PhD in computer science from KTH, he simulated brain connectivity on high-performance computers (up to 64k cores), authored scientific papers on feature selection and clustering, and designed and implemented a decision engine processing hundreds of thousands of financial transactions per day. As part of his previous work, he's trained large language models (deep learning) on millions of text documents for the purpose of information extraction from text documents.

ChatGPT, LLMs, and the future of data science

Carsten Frommhold

Carsten works as a data science consultant for Datadrivers, a consulting company based in Hamburg.
After working in risk management and graduating in mathematics, he entered the field five years ago. He focuses on the development of end-to-end AI solutions for customers in various industries, preferably in the cloud.

Mastering Great Expectations: Ensuring Data Quality in Your Data Pipelines.

Cesc Cunillera

After 7 years of academic research experience in string theory and cosmology, Cesc brings his unique blend of expertise to Data Science. With a keen interest in machine learning, optimisation problems and pricing, Cesc has been leading the Reinforcement Learning capabilities of the Data Science team at Tesco.

Introduction to RL for pricing problems

Cheuk Ting Ho

Before working in Developer Relations, Cheuk was a Data Scientist at various companies, roles that demanded strong numerical and programming skills, especially in Python. To follow her passion for the tech community, Cheuk is now working with the open-source community. Cheuk also contributes to multiple Open Source libraries like Hypothesis, Django and Pandas.

Besides her work, Cheuk enjoys talking about Python on personal streaming platforms and podcasts. Cheuk has also been a speaker at universities and various conferences. Besides speaking at conferences, Cheuk also organises events for developers. Conferences that Cheuk has organised include EuroPython (of which she is a board member), PyData Global and Pyjamas Conf. Believing in Tech Diversity and Inclusion, Cheuk constantly organises workshops and mentored sprints for minority groups. In 2021, Cheuk became a Python Software Foundation fellow.

Unconference: Is it GDPR 2.0? What is CRA and how can it affect you and OSS in general
Driving down the Memray lane - Profiling your data science work

Debayan Das

I'm the Interim Chief Data Officer at MiiCare, a Medtech company based out of London specialising in AI-powered Virtual Wards and D2A processes for older adults in the UK. My primary responsibilities lie at the intersection of Clinical Data Science, Cloud Architecture Design & Implementation, Data Security and Acoustic Machine Learning. I build AI and Data Infrastructures which empower older adults to live safely, happily and independently in the comfort of their own homes.

I have a Bachelor's Degree in Computer Science & Engineering with majors in AI and Cloud Computing. I have spent the last 5 years collaborating with academic institutions and care service providers in the UK to develop and use AI for social impact.

If you want to learn more about me, you can connect with me on LinkedIn!

The Sound of Your Footsteps is a Digital Biomarker

Eoin O'Flanagan

Eoin is a Senior Resident Solutions Architect at Databricks. He has worked on data platforms in a variety of industries, including Retail, Financial and Manufacturing.

Delta Lake 101: How many water metaphors does it take to describe data?

Florian Jacta

Specialist in Taipy, a low-code open-source Python package enabling any Python developer to easily develop a production-ready AI application. Package pre-sales and after-sales function.
Data Scientist for Groupe Les Mousquetaires (Intermarche) and ATOS.
Developed several Predictive Models as part of strategic AI projects.
Master in Applied Mathematics from INSA, Major in Data Science and Mathematical Optimization.

How to build stunning Data Science Web applications in Python

Frederik Hvilshøj

Frederik is the Machine Learning Lead at Encord. He has a long background in computer vision and deep learning and has completed a PhD in Explainable Deep Learning and Generative Models at Aarhus University. Before his PhD, Frederik studied for an MSc in computer science while being a teaching assistant for "Introduction to databases" and "Pervasive computing and Software Architecture" at Aarhus University.

The Future of MLOps: Embedding Active Learning into Your ML Model Development Pipelines

Gabriel Harris

Lead Data Scientist and Data Science Manager at NIQ Brandbank

Bring best practices to your messy data science team!

Giles Weaver

Data scientist. Domain expertise in maritime shipping (AIS). User of PySpark & Dask for over five years. Formerly a bioinformatician. Available for contract work.

Pandas 2, Dask or Polars? Quickly tackling larger data on a single machine

Habeeb Shopeju

Research Engineer - Machine Learning at Thomson Reuters Labs

Martial Arts Meets Machine Learning: Recognizing Judo Throws with MMAction2

Holly Smith

Holly Smith is a multi-award-winning Data & AI expert who has over a decade of experience working with Data & AI teams in a variety of capacities, from individual contributors all the way up to leadership. She has spent the last four years at Databricks working with many multinational companies as they embark on their journey to the cutting edge of data. She has also worked with non-profits through Datakind UK to advise on data strategy and bring data skills to social change organisations.

Delta Lake 101: How many water metaphors does it take to describe data?

Ian Ozsvald

Ian is a Chief Data Scientist who has helped co-organise the annual PyDataLondon conference, raising $100k+ annually for the open source movement, along with the associated 11,000+ member monthly meetup. Using data science he's helped clients find $2M in recoverable fraud, created the core IP which opened funding rounds for automated recruitment start-ups and diagnosed how major media companies can better supply recommendations to viewers. He gives conference talks internationally, often as keynote speaker, and is the author of the bestselling O'Reilly book High Performance Python (2nd edition). He has over 25 years of experience as a senior data science leader, trainer and team coach. For fun he's walked by his high-energy Springer Spaniel, surfs the Cornish coast and drinks fine coffee. Past talks and articles can be found at:

https://ianozsvald.com/
https://github.com/ianozsvald/
https://twitter.com/ianozsvald
https://fosstodon.org/@ianozsvald
https://www.linkedin.com/in/ianozsvald/

Pandas 2, Dask or Polars? Quickly tackling larger data on a single machine
Discussing Higher Performance Python (Birds of a Feather session)
Executives at PyData

India Kerle

I am a data scientist with a background in quantitative social science and product management. At Nesta, I use natural language processing and machine learning to understand the changing skill demand landscape from millions of job adverts.

Building a skills extraction library using NLP tools

Ines Montani

Ines Montani is a developer specializing in tools for AI and NLP technology. She’s the co-founder and CEO of Explosion and a core developer of spaCy, a popular open-source library for Natural Language Processing in Python, and Prodigy, a modern annotation tool for creating training data for machine learning models.

Keynote: Large Language Models: From Prototype to Production

Jack Vines

Jack Vines is Lead Data Engineer at Nesta. He is interested in socially impactful applications of data science and open data. His work centres around large scale data collection and pipelines, and has included building infrastructure to collect over 3 million online job adverts per year, whilst applying large natural language models efficiently to enrich them.

Building a skills extraction library using NLP tools

James Powell

James Powell is the founder and lead instructor at Don’t Use This Code. He currently serves as Chairman of the NumFOCUS Board of Directors, helping to oversee the governance and sustainability of all of the major tools in the Python data analysis ecosystem (i.e., pandas, NumPy, Jupyter, Matplotlib). At NumFOCUS, he helps build global open source communities for data scientists, data engineers, and business analysts. He helps NumFOCUS run the PyData conference series and has sat on speaker selection and organizing committees for 18 conferences. James is also a prolific speaker: since 2013, he has given over seventy conference talks at over fifty Python events worldwide. In fact, he is the second most prolific speaker in the PyData and Python ecosystem.

Grouped Weighted Summary Statistics in pandas… and Other Things

James Schofield

Head of Data Science. PhD in keeping the lights on (electrical engineering) from Imperial College London, worked at easyJet scheduling your flights, worked at Kaluza scheduling your electric vehicle charging, now works at Virgin Media O2 scheduling fibre broadband rollout. James also loves scheduling things.

Large scale agent-based simulations: how to do it right, and how we used one to optimise fibre broadband rollout across the UK

Jay Chia

Jay Chia is originally from Singapore and has worked in companies such as Lyft (on autonomous vehicles) and Freenome (AI-powered cancer detection genomics) building large-scale machine learning and Python data infrastructure. Most recently, Jay started a startup and is now maintaining Daft (www.getdaft.io): the open-sourced Python distributed dataframe for complex data. He also knows how to drive and command tanks from serving in the Singapore military, and would love to chat if you are interested in big data frameworks. Or tanks.

"Unstructured" terabyte-scale textual data processing in a distributed cluster

Jim Dowling

Jim Dowling is CEO of Hopsworks and an Associate Professor at KTH Royal Institute of Technology. He is one of the main developers of the open-source Hopsworks platform, a horizontally scalable data platform for machine learning that includes the industry’s first Feature Store.

From zero to a working ML system with only Python, free serverless services and FTI pipelines

Johannes Kolbe

Johannes is a Data Scientist at celebrate company by day and an AI storyteller by night.

After experiences in research at Fraunhofer Fokus Institute and tinkering with sensor setups for autonomous vehicles, he decided to get more hands-on and joined celebrate company, where he is now bringing models from the NLP and CV research world into production.

Since last year he has occasionally led a Computer Vision Study Group on the Hugging Face Discord server, where he presents papers embedded in presentations with a little geeky twist. You can find the recordings on the Hugging Face YouTube channel.

He holds a Master's degree in Computer Science with a focus on cognitive systems from TU Berlin.

This will be his first ever conference talk ;-)

Robot Holmes and the Vision-Language Murder Mysteries

John Carney

Dr John Carney is a Manchester-based Data Scientist and Engineering Architect for Netacea. He also runs a consultancy, PDFTA Ltd, helping organisations deliver value from Machine Learning products. John is a founder and co-organiser of PyData Manchester. John has worked in Data for around 10 years, starting with wheat genetics, moving through local authorities, eCommerce and industrial sensors, and is now working in cyber-security.

Data work across Industries - Discussion

Konstantin Lopukhin

I lead Machine Learning research and development at Zyte, where we work on making web data accessible to more people through products, services and open source projects. I have also participated in and won Kaggle competitions, achieving a Competitions Grandmaster title, and I contribute to the community through talks and by sharing code and knowledge.

Web Data Extraction with Deep Learning

Laszlo Sragner

I run Hypergolic, a consultancy in London specialising in Machine Learning Product Management.

Formerly I was Head of Data Science at Arkera, a fintech startup in London, where I built market intelligence products with Natural Language Processing for Tier 1 investment banks and hedge funds.

Before that, I worked in mobile gaming for King Digital (makers of Candy Crush), specialising in player behaviour and monetisation.

I started my career as a quant researcher writing trading strategies at multiple investment managers.

Code Smells in Data Science: What can we do about them?
Software Engineering Practices in Data Science

Leo Anthias

Building end-to-end internal analytics products using Python and open-source

Lisa Carpenter

Lisa is the lead data science instructor at Digital Futures, with responsibility for the design of our Data Science programme and delivery of a world-class learning experience for our engineers. Lisa has over 10 years' experience in the data industry.

Teach data science better (a discussion)
Keynote: Using data science for social good
Entering the Forest with TensorFlow - An intro

Liz Gallagher

Liz is a Data Scientist with experience in natural language processing, machine learning, data analytics, agent-based modelling and evolutionary game theory. She applies these skills to areas such as the labour market, research funding, searching for policy impact, and modelling human behaviour. She currently works at Nesta, where she works on several projects that involve extracting information from job advert text to understand the labour market.

Building a skills extraction library using NLP tools

Mark Pinkerton

Mark is the VP of software engineering at Risilience, a Cambridge-based start-up that provides an analytics and SaaS platform for corporate businesses to assess their climate change risks. His interests and experience are in the application of modern data analysis techniques and frameworks to risk analytics and the benefits of running these analytics in the cloud. For fun, Mark surfs the Cornish coast as often as possible and tries to keep up with his two young children.

Green software - building sustainable Python data analytics

Marysia Winkels

Marysia is a Data Scientist and Data Science Educator at GoDataDriven. In addition to this, she is also chair of the PyData Amsterdam committee.

Data Storytelling through Visualization

Mate Timar

Mate Timar is a physicist turned data scientist, with expertise in both fields. He obtained his degree in physics and went on to specialise in strongly correlated quantum systems during his research career.

Driven by his passion for exploring the intersection of physics and data science, Mate eventually transitioned into the world of data science. He is now an expert in Bayesian Statistics, Interpretability, Experimentation, and Active Learning.

From Passive to Active: Exploring the Benefits of Active Learning in Data Science

Matteo Latinov

I am currently a Machine Learning Engineer at Yanmar R&D Europe.

I transitioned from the world of aftersales to data science in early 2020 and got my first role as a data scientist at La Marzocco, where I built ETL pipelines on AWS and managed the cloud infrastructure with Terraform. A year and a half later, I joined Yanmar and have been working on applying machine learning in the engine and powertrain sector.

I value code quality and have a bit of a soft spot for software design.

Software Engineering Practices in Data Science

Matthew Rocklin

Matthew is an open source software developer in the numeric Python ecosystem. He maintains several PyData libraries, but today focuses mostly on Dask, a library for scalable computing. Matthew worked for Anaconda Inc for several years, then built out the Dask team at NVIDIA for RAPIDS, and most recently founded Coiled to improve Python's scalability with Dask for large organizations.

Matthew holds a bachelor's degree from UC Berkeley in physics and mathematics, and a PhD in computer science from the University of Chicago.

Website: https://matthewrocklin.com
Dask: https://dask.org/
Coiled: https://coiled.io

New Developments in Pandas and Dask Dataframes

Meder Kamalov

Meder Kamalov is a Software Engineer at Features and Labels, a start-up that is building tools for the modern data stack. We are developers of fal and fal-serverless.

Unleashing the Power of dbt and Python for Modern Data Stack

Natan Mish

Lead Machine Learning Engineer at Zimmer Biomet. London School of Economics graduate with an MSc in Applied Social Data Science. Passionate about using Machine Learning to solve complicated problems. I have experience analysing, researching and building data products in the financial, real estate, transportation and healthcare industries. Curious about (almost) everything and always happy to take on new experiences and challenges. I love finding bugs, especially if they're my own making!

Event Driven Machine Learning

Nischal

Nischal is currently Vice President of Data and ML at scoutbee, a Berlin-based company operating in the procurement space.

Having worked in the industry over the last 12+ years across enterprise companies and startups, Nischal has had the privilege of building and being part of teams that have designed and implemented data engineering and data science products to solve hard problems. Understanding the challenges of building data systems with Machine learning at the helm of it and taking them from research to production has been a fascinating and rewarding experience for Nischal.

Language Models for Music Recommendation

Pedro Tabacof

Pedro Tabacof is based in Dublin and is currently a staff machine learning scientist at Intercom. Previously, he worked at Wildlife Studios (mobile gaming), Nubank (fintech) and iFood (food delivery app). He has used and deployed machine learning models for anti-fraud, credit risk, lifetime value and marketing attribution, using XGBoost or LightGBM in almost all cases. Academically, he has a master's degree in deep learning and 300+ citations.

Deploying Real-Time Machine Learning Models Using Serverless AWS

Petros Syntelis

Lead Data Scientist at Virgin Media O2. Petros is an ex-astrophysicist who modelled solar eruptions as a postdoctoral researcher and lecturer at the University of St Andrews. Following his academic posts, he has applied machine learning to financial documents, genetic data and telecommunications. Petros is passionate about using data science and machine learning to model data, derive insights and productionise solutions.

Causal modelling of agent-customer pairing outcomes to optimise call centre performance

Raghotham S

Raghotham currently leads NLP and computer vision teams at PayPal. Previously, he has built and led ML teams from scratch for various small and large enterprises.

Language Models for Music Recommendation

Raphaël Lüthi

Raphaël Lüthi leads machine learning & AI @ Groupe Mutuel, one of the leading health insurance companies in Switzerland. With 5+ years of experience working on data and ML problems, Raphaël specialises in creating the right conditions for data science projects (and teams) to thrive and reliably deliver value.

Getting from data to insights with powerful XAI & DataViz open-source tool (inc. diving deep into shap’s TreeExplainer)

Sagar Mishra

Sagar Mishra is a core developer of sktime, and a Software Engineer at Oorjaa, a company that specializes in logistics optimization. He holds a Bachelor's degree from the Indian Institute of Dhanbad. He has also volunteered as a social worker for NGOs that aim to improve quality of life in underdeveloped regions of India.

sktime - python toolbox for time series: time series classification, regression, clustering, with modular time series distances and kernels

Sam Joseph

The 11 Types of Comedy and Large Language Models (LLMs)

Samuel Colvin

Open source Python and Rust developer, maintainer of Pydantic and other libraries.

Samuel recently founded Pydantic Services Inc. to build great developer tools by applying the same principles that have made Pydantic so successful.

Garbage in -> Pydantic -> you're golden!

Steve Goodman

Steve has 20 years' experience in data analytics and data science, mostly in the fields of marketing, consulting and financial services. He is currently a Data Science Lead at Tide, a financial services platform based in London. He holds a PhD in Applied Statistics and an MBA.

From correlations to causality in machine learning– a gentle guide to causal inference

Tanay Agrawal

Tanay Agrawal is a Deep Learning Engineer, currently working with Curl HG. He specializes in Computer Vision and Deep Learning. He has worked extensively on Hyperparameter Optimization and has published a book on the subject, "Hyperparameter Optimization in Machine Learning", with Apress.

A dive into Hyperparameter Optimization

Theodore Meynard

Theodore Meynard is a data scientist at GetYourGuide. He works on our ranking algorithm to help customers find the best activities to book and locations to explore. He is one of the co-organisers of the PyData Berlin meetup. When he is not programming, he loves riding his bike looking for the best bakery-patisserie in town.

MLflow workshop
MLOps in practice: our journey from batch to real-time inference

Tristan West

Lead data scientist. PhD in condensed matter physics from Imperial College London, worked at Ocado scheduling your food deliveries, worked at Babylon scheduling your medical appointments, and now works at Virgin Media O2 scheduling your fibre broadband rollout. Tristan loves scheduling things.

Large scale agent-based simulations: how to do it right, and how we used one to optimise fibre broadband rollout across the UK

Valerio Bonometti

Autoencoders for Time Series Clustering

Valerio Maggio

Valerio Maggio is a Researcher, a Data Scientist Advocate at Anaconda, and a casual "Magic: The Gathering" wizard. He is well versed in open science and research software, supporting the adoption of best software development practice (e.g. Code Review) in Data Science. Valerio is also an open-source contributor, and an active member of the Python community. Over the last twelve years he has contributed to and volunteered in the organization of many international conferences and community meetups like PyCon Italy, PyData, EuroPython, and EuroSciPy. All his talks, workshop materials and random ramblings are publicly available on his Speaker Deck and GitHub profiles.

Actionable Machine Learning in the Browser with PyScript

William Dealtry

William Dealtry has been working in both Python and C++ for many years, and has been a member of the C++ standardization committee for more than a decade. Currently he is the Architect of a new open-source Dataframe database, ArcticDB, which is backed by long-time Python enthusiasts Man Group and Bloomberg.

Serverless Python Analytics at Petabyte scale using ArcticDB

Yannick Wolff

As an Engineering Manager at Sicara, I work on various Machine Learning projects (Computer Vision, NLP, Time Series) to be pushed to production.

Over the last few years, I have developed a passion for data science tooling and continuous improvement in our ways of working. I am now in charge of iterating on my company’s data science technical stack.

The Opinionated Python Stack I chose for my Company’s ML Projects and how I bundled it in a Project Generator

Zhaozhi Qian

I am a postdoc at the van der Schaar Lab at the University of Cambridge. In the past, I have led and contributed to the development of a host of novel algorithms for synthetic data (a list of publications can be found here). I am also leading the development of Synthcity, an open-source software library that aims to democratise cutting-edge research in synthetic data.

Prior to joining academia, I worked as a data scientist in one of the largest mobile games companies in the world, designing and implementing AI-powered systems that automatically optimize performance marketing campaigns. I also proudly worked for the NHS as a volunteer during the pandemic, contributing to the UK's first ICU capacity planning and forecasting system.

Synthetic data: what is it and why do we need it?

Jonny Edwards

My experience is at the intersection between academic innovation and its application in a commercial setting. After active academic research in Neural Networks in the mid-90s, and teaching Machine Learning (ML) and Information Theory in UK universities, I have successfully founded three commercial concerns in the advanced computing domain: Thoughtful Technology, a data-science and ML consultancy with over twenty successful projects in the last 10 years; LifeQueue, a medical ML business which generated a state-of-the-art automated head and neck cancer diagnosis system; and Temporal Computing, an advanced hardware business using time-delays as storage. Throughout this I have remained active in research, becoming an international lead on temporal computing and an early contributor to advanced explainable-AI methods. Recently, I was involved in the COVID response: firstly, providing load prediction for the NHS-login service, a single sign-in for many NHS services; secondly, creating a graph-based search engine for a leading COVID academic corpus; and finally, as a lead data scientist in the central NHS data team.

An Introduction to Polars

Vincenzo Crescimanna

Data Scientist at Tesco

Autoencoders for Time Series Clustering