PyData London 2023

Building an End-to-End Open-Source Modern Data Platform for Biomedical Data
06-02, 13:30–15:00 (Europe/London), Warwick

Join us for a 90-minute tutorial on how to build an end-to-end open-source modern data platform for biomedical data using Python-based tools. In this tutorial, we will explore the technologies related to data warehousing, data integration, data transformation, data orchestration, and data visualization. We will use open-source tools such as dbt, Apache Airflow, OpenMetadata, and Querybook to build the platform. All materials will be available on GitHub for attendees to access.


Data engineering has experienced enormous growth in recent years, allowing for rapid progress and innovation as more people than ever are thinking about data resources and how to better leverage them. In this tutorial, we will build an end-to-end modern data platform for the analysis of medical data using open-source tools and libraries.

We will start with an overview of the platform components, including data warehousing, data integration, data transformation, data orchestration, and data visualization. We will then dive into each component, exploring the technologies and tools that make up the platform.

We will use Python-based tools such as dbt, Apache Airflow, OpenMetadata, and Querybook to build the platform. We will walk through the process step by step, from creating a data warehouse to integrating data from multiple sources, transforming the data, orchestrating data workflows, and visualizing the data.
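To give a feel for how these steps fit together, here is a minimal sketch that uses only the Python standard library. It is not taken from the tutorial materials: an in-memory SQLite database stands in for the data warehouse, `graphlib.TopologicalSorter` stands in for an orchestrator such as Apache Airflow, and a plain SQL statement plays the role of a dbt model. The table names, sample records, and function names are all hypothetical.

```python
import sqlite3
from graphlib import TopologicalSorter

# Hypothetical sample records (assumption: not from the tutorial's datasets).
RAW_PATIENTS = [
    (1, "1980-05-01", "F"),
    (2, "1975-11-23", "M"),
    (3, "1990-02-14", "F"),
]


def create_warehouse(conn):
    """Create the raw landing table (SQLite stands in for a real warehouse)."""
    conn.execute(
        "CREATE TABLE raw_patients (patient_id INTEGER, birth_date TEXT, sex TEXT)"
    )


def integrate(conn):
    """Load source records into the raw table (the data integration step)."""
    conn.executemany("INSERT INTO raw_patients VALUES (?, ?, ?)", RAW_PATIENTS)


def transform(conn):
    """Derive an aggregated table with SQL (the role a dbt model would play)."""
    conn.execute(
        """
        CREATE TABLE patients_by_sex AS
        SELECT sex, COUNT(*) AS n_patients
        FROM raw_patients
        GROUP BY sex
        """
    )


def report(conn):
    """Query the derived table (a stand-in for the visualization layer)."""
    return conn.execute(
        "SELECT sex, n_patients FROM patients_by_sex ORDER BY sex"
    ).fetchall()


TASKS = {
    "create_warehouse": create_warehouse,
    "integrate": integrate,
    "transform": transform,
    "report": report,
}

# Each task maps to the set of tasks that must finish before it runs.
DEPENDENCIES = {
    "create_warehouse": set(),
    "integrate": {"create_warehouse"},
    "transform": {"integrate"},
    "report": {"transform"},
}


def run_pipeline():
    """Run every task in dependency order and return the report rows."""
    conn = sqlite3.connect(":memory:")
    results = {}
    for name in TopologicalSorter(DEPENDENCIES).static_order():
        results[name] = TASKS[name](conn)
    conn.close()
    return results["report"]


if __name__ == "__main__":
    print(run_pipeline())
```

A real orchestrator adds scheduling, retries, and monitoring on top of this dependency ordering, and a real transformation layer adds testing and documentation of the SQL models; the sketch only shows the order in which the stages depend on one another.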

This tutorial is aimed at attendees who want to learn how to build an end-to-end modern data platform for biomedical data using Python-based tools. They will also learn about the open-source tools and libraries used along the way, which they can then apply to their own data engineering projects.

No specific background knowledge is needed to attend this tutorial, although familiarity with Python and basic data engineering concepts will be helpful. All materials will be available on GitHub (https://github.com/bsc-health-data/pydatalondon23-modern-data-stack), and attendees will have the opportunity to follow along and build the platform themselves.


Prior Knowledge Expected

No previous knowledge expected

Alberto Labarga is a data engineer with over 5 years of experience in the healthcare industry. He specializes in building end-to-end modern data platforms for biomedical data analysis using open-source tools and libraries. Alberto has a Bachelor's degree in Computer Science and a Master's degree in Biomedical Engineering, both from the University of Madrid. He has worked on a variety of data engineering projects, including data warehousing, data integration, data transformation, and data orchestration. Alberto is passionate about open-source technologies and enjoys sharing his knowledge with others. In his free time, he enjoys hiking and playing guitar.