PyData London 2023

Jay Chia

Jay Chia is originally from Singapore and has worked at companies such as Lyft (on autonomous vehicles) and Freenome (AI-powered cancer detection genomics), building large-scale machine learning and Python data infrastructure. Most recently, Jay founded a startup and now maintains Daft, the open-source Python distributed dataframe for complex data. He also knows how to drive and command tanks from serving in the Singapore military, and would love to chat if you are interested in big data frameworks. Or tanks.

"Unstructured" terabyte-scale textual data processing in a distributed cluster
Jay Chia

ChatGPT has reignited worldwide interest in text data, capturing the imaginations of thousands of developers. But how do we actually build large-scale production pipelines for processing this highly unstructured data?

SQL is a great language for simple data modalities that fit in a database table, but when it comes to complex "unstructured" data, it is Python that really shines. In this talk, we show how easy it is to go from data storage to querying and processing large amounts of unstructured data using modern open-source Python tooling such as Ray, Daft and HuggingFace models.