Transformer-based language models and active learning for finite-population estimation from textual data, with applications in crime statistics

Abstract

In recent years, large language models (LLMs) have been developed based on deep neural networks with transformer architectures (Vaswani et al., 2017). These models have set a new standard in text classification (e.g., Devlin et al., 2018; Liu et al., 2019).

These LLMs open up new possibilities for finite-population estimation from textual data, the kind of estimation that underlies a large part of the descriptive statistics produced by public agencies, research, and industry. This project aims to develop and use Swedish language models and active learning to produce statistics from textual data efficiently.

The project will focus on two areas: (1) the development of language models for use in finite-population estimation and (2) the analysis and estimation of error sources for population estimators. The application area is official crime statistics and their use in the insurance industry.
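To illustrate how a text classifier and a manually coded sample can be combined in finite-population estimation, the following is a minimal sketch of a textbook difference estimator for a population proportion. It is not the project's method. It assumes a simple random sample drawn without replacement and a binary classification, and the function and variable names are hypothetical.

```python
import statistics


def difference_estimate(predictions, sample_ids, labels):
    """Difference estimator of a population proportion (illustrative sketch).

    predictions: 0/1 classifier outputs for all N documents in the population
    sample_ids:  indices of a simple random sample drawn without replacement
                 (at least two documents)
    labels:      mapping from sampled index to its manual 0/1 label
    Returns the estimated proportion and its estimated standard error.
    """
    N = len(predictions)
    n = len(sample_ids)

    # Classification errors observed on the manually coded sample.
    residuals = [labels[i] - predictions[i] for i in sample_ids]

    # Mean prediction over the whole population, corrected by the
    # mean classification error estimated from the sample.
    estimate = sum(predictions) / N + sum(residuals) / n

    # Variance under simple random sampling, with the
    # finite-population correction (1 - n/N).
    variance = (1 - n / N) * statistics.variance(residuals) / n
    return estimate, variance ** 0.5


# Toy population of ten documents; 1 means the text belongs to the category.
truth = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
predictions = [1, 0, 0, 0, 0, 1, 0, 1, 0, 0]  # assumed classifier output
sample_ids = [0, 3, 5, 9]                     # assumed manually coded sample
labels = {i: truth[i] for i in sample_ids}

estimate, std_error = difference_estimate(predictions, sample_ids, labels)
print(estimate, std_error)
```

In this sketch the standard error depends on how often the classifier disagrees with the manual labels, so a more accurate language model yields a more precise estimate for the same amount of manual coding.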

If the project is successful, it will enable more efficient production of statistics from large volumes of text. The project will also, hopefully, enable new official crime statistics (regional and national) that cannot be produced with conventional methods.

Project information

The project is partly financed by Länsförsäkringars forskningsfond and is conducted in close collaboration with the Swedish National Council for Crime Prevention (Brå).