PyData London 2023

Synthetic data: what is it and why do we need it?
06-04, 15:45–16:25 (Europe/London), Salisbury

One of the biggest barriers to machine learning and data analytics is the difficulty of accessing high-quality data. Synthetic data has been widely recognized as a promising remedy to this problem. It allows sharing, augmenting and de-biasing data for building performant and socially responsible ML systems. In this talk, I will give an overview of the significant progress in the theory and methodology of synthetic data over the past five years. I will also introduce the open-source library Synthcity, which implements an array of cutting-edge synthetic data generators to address data scarcity, privacy, and bias. Participants will walk away with a deeper understanding of the theory and practice of synthetic data, a sense of which methods apply (or do not apply) to their specific use case, and the readiness to apply them in hackathons, competitions, and their day-to-day work.


Goal

The primary objectives of this talk are:

  1. Presenting synthetic data as a viable solution to the common problems of data scarcity, bias, and sensitivity through examples and case studies.
  2. Introducing a map of generative algorithms so that participants can choose the right one for their needs.
  3. Familiarizing the participants with Synthcity, an open-source Python library that offers an array of cutting-edge synthetic data generators designed to solve the use cases discussed above.

Outline

The talk will be structured as follows:

  1. I will start by introducing the various "data challenges" that hamper the adoption of analytics and machine learning, with a focus on data scarcity, bias and sensitivity. (5 min)
  2. I will proceed to give an overview of the state-of-the-art generative algorithms that are designed to address these challenges via data synthesis. (10 min)
  3. I will then introduce the open-source Synthcity library that implements these algorithms. (10 min)
  4. I will conclude the talk with several case studies where Synthcity helped practitioners extract value and insights through synthetic data. (15 min)

Target audience

The target audience comprises the following groups:

  • Data scientists and engineers who are keen to address the issues of data scarcity, privacy, and fairness with the goal of building better and more socially responsible AI.
  • Data owners who are interested in finding solutions for data sharing and de-biasing.
  • Developers who are interested in contributing to a cutting-edge software library for synthetic data generation.

No prior experience with synthetic data or generative AI is required.

Key takeaways

The talk will convey the following messages:

  1. Synthetic data is an emerging technology that can serve as a solution to the challenges of data scarcity, bias, sensitivity and more.
  2. A variety of cutting-edge AI-powered generative algorithms are available in the open-source Synthcity library, which has been successfully applied to real-world analytics projects.
  3. The Synthcity community is open to all developers and users who wish to further engage with synthetic data technologies.

Prior Knowledge Expected

No previous knowledge expected

I am a postdoc at the van der Schaar Lab at the University of Cambridge. In the past, I have led and contributed to the development of a host of novel algorithms for synthetic data (a list of publications can be found here). I am also leading the development of Synthcity, an open-source software library that aims to democratise cutting-edge research in synthetic data.

Prior to joining academia, I worked as a data scientist at one of the largest mobile games companies in the world, designing and implementing AI-powered systems that automatically optimize performance marketing campaigns. I also proudly worked for the NHS as a volunteer during the pandemic, contributing to the UK's first ICU capacity planning and forecasting system.