PyData London 2023

An Introduction to Polars
06-02, 11:00–12:30 (Europe/London), Minories

Polars is a next-generation dataframe library that aims to be fast, efficient, composable and lazy! This introductory tutorial will take you through the basics of getting started with Polars in Python. We will demonstrate its out-of-the-box multi-core efficiency by composing advanced filters and joins, and then compare these with traditional pandas workflows. As a finale, we will look at lazy processing when applying Polars to large-scale datasets.


Background

The tutorial targets intermediate data scientists who use pandas as part of their existing data science toolkit.
The central premise of the tutorial is that Polars is faster and more composable, resulting in cleaner and more productive workflows.

The ultimate aim of this tutorial is to "convert" pandas users to polars :) !

Introduction

  1. Installation - basic installation using pip
  2. Data type basics - column types and coalescing
  3. Interop with pandas/numpy - how this relates to the traditional numpy dtypes
  4. File reading basics - the standard operations to read data into a dataframe from a host of different formats

Standard Workflow

  1. Accessing columns
  2. Filtering - filters compose cleanly in Polars, so we will go through a few examples
  3. Grouping - grouping also runs across multiple cores out of the box
  4. Joining
  5. Row based operations

Advanced Workflow

  1. A note about multicore - Polars is written in Rust under the hood, and Rust's correctness guarantees allow for clean multicore processing. We will spend five minutes demonstrating this on a large dataset.
  2. Case study, lazy geospatial processing: The final part of the tutorial will be a case study of efficient lazy geospatial processing. We will go through the efficiency gains of using the lazy interface to filter a large collection of geospatial data in a multicore way, to find points within defined polygonal shapes. We will show that large amounts of data can be processed efficiently even on a relatively small setup, and that complex filters can be applied to disk-backed data.

Prior Knowledge Expected

Previous knowledge expected

My experience is at the intersection between academic innovation and its
application in a commercial setting.
After active academic research in Neural Networks in the mid-90s, and teaching
Machine Learning (ML) and Information Theory in UK universities, I have
successfully founded three commercial concerns in the advanced computing
domain:
Thoughtful Technology, a data science and ML consultancy with over twenty
successful projects in the last 10 years; LifeQueue, a medical ML business which
generated a state-of-the-art automated head and neck cancer diagnosis system;
and Temporal Computing, an advanced hardware business using time delays as
storage. Throughout this I have remained active in
research, becoming an international lead on temporal computing and an early
contributor to advanced explainable-AI methods. Recently, I was involved in
the COVID response: firstly, providing load prediction for the NHS-login
service, a single sign-in for many NHS services; secondly, creating a
graph-based search engine for a leading COVID academic corpus; and finally,
as a lead data scientist in the central NHS data team.