PyData London 2023

Building a skills extraction library using NLP tools
06-04, 10:15–10:55 (Europe/London), Minories

There is no publicly available data on the skills that are commonly required in UK online job adverts, despite this information being useful for a range of use cases. To address this, we have built an open source skills extraction Python library using spaCy and Hugging Face. Our approach is twofold: we train a named entity recognition (NER) model to extract skill entities from job adverts, then map them onto any standardised skills taxonomy. By applying this algorithm to a dataset of scraped online job adverts, we are able to find skill similarities amongst occupations and regional differences in skill requirements.


There is no publicly available data on the skills that are commonly required in UK job adverts. As a result, there is very little understanding of either the skill specialities that exist in different regions of the UK or the skills required for given occupations. To help address this, we built an open source skills extraction Python library using spaCy and Hugging Face. Our approach is twofold: we extract skills from job adverts, then map them onto any standardised skills taxonomy, such as the European Commission’s European Skills, Competences, Qualifications and Occupations (ESCO) or Lightcast’s Open Skills.

In this talk we will explain our pipeline: from building infrastructure to scrape millions of job adverts, through the torments of labelling skill entities and training the NER model, to finally matching entities onto a given taxonomy using Hugging Face’s sentence transformers. We will also showcase our open-source Streamlit tools, including a demo app that allows anyone to extract skills and an interactive analysis blog.
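To give a feel for the final mapping step, here is a minimal sketch of matching extracted skill spans to taxonomy labels by cosine similarity. It is not the library's implementation: the taxonomy entries, the `embed` and `map_to_taxonomy` helpers, and the threshold value are all hypothetical, and character trigram counts stand in for the sentence transformer embeddings described above so that the example runs with the standard library alone.

```python
import math
from collections import Counter

# Hypothetical taxonomy labels; a real run would load ESCO or Open Skills entries.
TAXONOMY = [
    "data analysis",
    "project management",
    "customer service",
    "software development",
]


def embed(text: str, n: int = 3) -> Counter:
    """Toy stand-in for a sentence embedding: counts of character n-grams."""
    padded = f" {text.lower().strip()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def map_to_taxonomy(skill_span, taxonomy=TAXONOMY, threshold=0.3):
    """Return (best label or None, similarity score) for one extracted span."""
    span_vec = embed(skill_span)
    best_label, best_score = max(
        ((label, cosine(span_vec, embed(label))) for label in taxonomy),
        key=lambda pair: pair[1],
    )
    # Assumed rule: spans with no sufficiently similar label stay unmapped.
    if best_score < threshold:
        return None, best_score
    return best_label, best_score


# Pretend these spans came out of the NER step.
extracted_spans = ["analysing data", "managing projects", "juggling"]
matches = {span: map_to_taxonomy(span)[0] for span in extracted_spans}
print(matches)
```

With real sentence embeddings in place of `embed`, the same nearest-label search would also catch paraphrases that share no surface characters with the taxonomy label, which the trigram stand-in cannot do.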


Prior Knowledge Expected

Previous knowledge expected

I am a data scientist with a background in quantitative social science and product management. At Nesta, I use natural language processing and machine learning to understand the changing skill demand landscape from millions of job adverts.

Liz is a Data Scientist with experience in natural language processing, machine learning, data analytics, agent-based modelling and evolutionary game theory. She applies these skills to areas such as the labour market, research funding, searching for policy impact, and modelling human behaviour. She currently works at Nesta, where she works on several projects that involve extracting information from job advert text to understand the labour market.

Jack Vines is Lead Data Engineer at Nesta. He is interested in socially impactful applications of data science and open data. His work centres around large scale data collection and pipelines, and has included building infrastructure to collect over 3 million online job adverts per year, whilst applying large natural language models efficiently to enrich them.