PyData London 2023

"Unstructured" terabyte-scale textual data processing in a distributed cluster
06-04, 11:45–12:25 (Europe/London), Minories

ChatGPT has reignited worldwide interest in text data, capturing the imaginations of thousands of developers. But how do we actually build large-scale production pipelines for processing this highly unstructured data?

SQL is a great language for simple data modalities that fit in a database table, but when it comes to complex "unstructured" data, it is Python that really shines. In this talk, we show how easy it is to go from data storage to querying and processing large amounts of unstructured data using modern open-source Python tooling such as Ray, Daft and HuggingFace models.


So you have an amazing corpus of "unstructured" textual data - maybe hundreds of thousands of PDFs and emails on a filesystem, and all their associated metadata in a database. And now you want to run some of this cool new Generative AI/NLP technology on it... What now?

This talk walks through some of the amazing open-source Python tooling available for your workflows, including tools from HuggingFace (https://github.com/huggingface), Ray (https://www.ray.io/) and Daft (https://www.getdaft.io/). Writing code on your laptop that scales to hundreds of machines and terabytes of data has never been so easy!

If you are sitting on a large amount of complex data such as PDFs, emails, HTML and all their associated tabular metadata, then join us as we show you the tooling Python offers for putting that data to work across use cases such as content generation, search, data extraction and more.


Prior Knowledge Expected

No previous knowledge expected

Jay Chia is originally from Singapore and has worked at companies such as Lyft (on autonomous vehicles) and Freenome (AI-powered cancer detection genomics), building large-scale machine learning and Python data infrastructure. Most recently, Jay founded a startup and now maintains Daft (www.getdaft.io), the open-source distributed Python dataframe for complex data. He also knows how to drive and command tanks from serving in the Singapore military, and would love to chat if you are interested in big data frameworks. Or tanks.