06-04, 11:00–11:40 (Europe/London), Minories
Extracting data from web pages is a problem that is less well covered and researched than image or text classification, object detection or named entity recognition. But it is extremely exciting to look into and rewarding to work on, because web pages can be represented in so many ways: as a screenshot of a page, as its text, as an HTML tree, as a sequence of elements with discrete and continuous properties, and in other ways. This leads to many diverse approaches, which often combine different input types and ways of representing data inside one model. In this talk we will explore several intriguing approaches to web data extraction, and see how you can come up with novel approaches and grow your model according to the task at hand.
This talk is intended for anyone with an interest in neural networks. I hope it gives you inspiration and intuition for building deep learning models tailored to the structure of your data. This is applicable not only to web information extraction, but also to document extraction and other domains with structured text or image inputs.
This talk has several main goals:
- Get a sense of how web data extraction can be turned into a machine learning problem, and what kind of features are available.
- Show how a complex model can grow from simple ideas, picking up tricks from related areas and papers. This part is broadly applicable, and I hope it will inspire people to experiment more and see how building blocks and input modalities can be combined. The approach is based on our work on this problem at Zyte.
- Explore several modern approaches to this problem, such as transformer-based architectures like MarkupLM, WebFormer and LayoutLM.
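
To give a flavour of the first goal, here is a minimal sketch of how a page can be flattened into a sequence of elements with discrete and continuous properties. It is only an illustration under my own assumptions, not the model or feature set used at Zyte or in the papers above: the class `NodeFeatureExtractor`, the helper `extract_nodes`, the sample `PAGE` and the chosen features are all hypothetical, and only the Python standard library is used.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class NodeFeatureExtractor(HTMLParser):
    """Flatten an HTML document into one feature dict per text node."""

    def __init__(self):
        super().__init__()
        self.stack = []  # (tag, css_class) pairs from the root to the current element
        self.nodes = []  # collected feature dicts, in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append((tag, dict(attrs).get("class") or ""))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag, tolerating sloppy markup.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.stack:
            return
        tag, css_class = self.stack[-1]
        self.nodes.append({
            "text": text,                                 # raw text, e.g. for a text encoder
            "tag": tag,                                   # discrete property
            "css_class": css_class,                       # discrete property
            "path": "/".join(t for t, _ in self.stack),   # position in the HTML tree
            "depth": len(self.stack),                     # numeric property
            "digit_ratio": sum(c.isdigit() for c in text) / len(text),  # continuous property
        })


def extract_nodes(html):
    """Return the text nodes of `html` as a list of feature dicts."""
    parser = NodeFeatureExtractor()
    parser.feed(html)
    parser.close()
    return parser.nodes


# Hypothetical product page used only for this illustration.
PAGE = """
<html><body>
  <h1 class="title">Blue Mug</h1>
  <div class="offer"><span class="price">12.50</span></div>
  <p>Holds 350 ml.</p>
</body></html>
"""

nodes = extract_nodes(PAGE)
for node in nodes:
    print(node["path"], node["css_class"], repr(node["text"]), round(node["digit_ratio"], 2))
```

With a representation like this, extraction can be framed as labelling each node (for example "name", "price" or "other"), and the same sequence could later be enriched with other inputs such as rendered positions from a screenshot.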
Previous knowledge expected
I lead Machine Learning research and development at Zyte, where we work on making web data accessible to more people through products, services and open source projects. I have also participated in and won Kaggle competitions, achieving the Competitions Grandmaster title, and I contribute to the community through talks and by sharing code and knowledge.