The Parquet data format and its use at Statistics Denmark

Luke W. Johnston

2025-10-24

Outline

  1. Refresher: How Statistics Denmark currently stores data.

  2. Intro to the Parquet file format.

  3. Problems that Parquet solves.

Statistics Denmark's data storage

Data format: Proprietary SAS format

For example, BEF register:

bef2018.sas7bdat
bef2019.sas7bdat
bef2020.sas7bdat
bef2021.sas7bdat
bef2022.sas7bdat

Challenge: It takes many minutes to load a single year of data (in R).

Data updates create extra work for us

bef2021.sas7bdat
bef2022.sas7bdat
December_2023/bef2022.sas7bdat
December_2023/bef2023.sas7bdat

Can you see the issue?

Metadata is confusing and poorly documented

  • Variables are not consistent across years.

  • Finding the metadata is difficult.

  • Some variables are stored as numeric but are actually categorical.

Use something other than SAS? Data gets duplicated

E.g. Stata creates its own .dta copies of the data, doubling storage needs.

Parquet file format

https://parquet.apache.org/

Parquet is a column-based data storage format

Most data formats are row-based, like CSV. Newer formats tend to be column-based.

Row vs column-based storage

Row-based

name,sex,age
Tim,M,30
Jenny,F,25

Column-based

name,Tim,Jenny
sex,M,F
age,30,25

Column-based storage has better compression

sex,M,F,F,M,M,F,F,F
age,30,30,25,32,31,40,39,50
diabetes,0,1,0,0,1,0,0,0

…becomes…

sex,M,F{2},M{2},F{3}
age,30{2},25,32,31,40,39,50
diabetes,0,1,0{2},1,0{3}
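The run-length idea above can be sketched in a few lines of Python. Parquet's real encodings (e.g. dictionary encoding combined with run-length encoding) are more sophisticated, so this is only an illustration:

```python
from itertools import groupby

def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]

sex = ["M", "F", "F", "M", "M", "F", "F", "F"]
print(run_length_encode(sex))  # [('M', 1), ('F', 2), ('M', 2), ('F', 3)]
```

Low-cardinality register variables (sex, municipality codes, diagnosis flags) have long runs, which is why they compress so well column-by-column.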

Loading data is faster

  • Computers read files line by line.
  • In a column layout, each line holds one column, so every value on it has the same data type.
  • Only the columns you need are read.

Only need age? Only that line is read:

sex,M,F
age,30,25
diabetes,0,1

…reads only…

age,30,25

Parquet is 50-75% smaller than other formats

File sizes of the bef register for 2017 in SAS, CSV, Stata, and Parquet formats.

File type            Size
SAS (.sas7bdat)      1.45 GB
CSV (.csv)           ~90% of SAS
Stata (.dta)         745 MB
Parquet (.parquet)   398 MB

Personal experience: 500 GB of SAS files became 80 GB as Parquet.

Can partition data by a value (e.g. year)

bef/
├── year=2018/
│   └── part-0.parquet
├── year=2019/
│   └── part-0.parquet
├── year=2020/
│   └── part-0.parquet
└── year=2021/
    └── part-0.parquet

A partitioned Parquet dataset can be loaded all at once

Load in R with arrow package:

bef <- arrow::open_dataset("bef")

Opens all years in a fraction of a second, compared to ~5 minutes to load one year from SAS. It is fast because only file metadata is scanned; the data itself is read lazily, when actually needed.

Easy connection to DuckDB engine

DuckDB (https://duckdb.org/) is a recent, powerful SQL engine designed for analytical queries.

bef <- arrow::open_dataset("bef") |>
    arrow::to_duckdb()

SAS and Python can load Parquet, but Stata cannot

(But we should be pushing for R or Python use anyway.)

Problems Parquet solves

Less space used = less money spent

DST charges for storage used.

Faster loading and analysis times

Parquet loads multiple files in seconds, compared to minutes for other formats.

The sooner a researcher is done = less money spent

DST charges per user on a project.