This repository provides a diverse suite of real-world and synthetic floating-point datasets designed for benchmarking numeric parsing and string-conversion algorithms.
All datasets are stored as plain text, one numeric value per line, making them easy to inspect and reproducible across systems.
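
Because every file holds one numeric value per line, loading a dataset takes only a few lines of code. The following is a minimal sketch, not part of the repository: the helper names `load_values` and `summarize` are hypothetical, and it assumes each non-blank line is a literal that Python's `float()` accepts.

```python
def load_values(path):
    """Parse a one-value-per-line text file into a list of Python floats."""
    values = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            text = line.strip()
            if text:  # tolerate blank lines (an assumption about the files)
                values.append(float(text))
    return values


def summarize(values):
    """Return the count and the range of a non-empty list of floats."""
    return {"count": len(values), "min": min(values), "max": max(values)}
```

For example, `summarize(load_values("number_files/canada.txt"))` would report how many values the file contains and the range they span.
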
The goal of this dataset collection is to provide a representative, diverse,
and challenging benchmark corpus mirroring the numeric values commonly
encountered in practice, ranging from geospatial data, scientific simulations,
astronomy catalogs, financial time series, and machine-learning model weights
to pathological IEEE-754 edge cases. Only two datasets (numbers.txt and
hellfloat64.txt) are synthetic, serving well-defined roles: a simple uniform
baseline and a comprehensive stress-test for string-conversion algorithms.
Below is a detailed description of each dataset, including the type of numeric values it contains and the typical real-world scenarios it represents.
- Extracted from the GeoJSON dataset of geographic features.
- Contains latitude, longitude, elevations, and associated attributes.
- Representative of GIS pipelines, navigation systems, and open-data APIs.
- Numeric values from a marine robotics inverse-kinematics example.
- Contains small real numbers (between -1 and 4.4) from a physical simulation.
- Typical of control systems and scientific computing workloads.
- Vertex coordinates and related mesh data from a triangulated 3D surface.
- Similar to formats used in CAD, graphics engines, and scientific visualization.
- Heavy on small numbers (between -1 and 3), but realistic and widespread.
- Daily closing prices of Bitcoin (USD), from 2020-01-01 to 2022-07-31.
- Representative of financial APIs, trading systems, and real-time dashboards.
- Synthetic baseline dataset (numbers.txt) with a simple uniform distribution.
- Useful for comparisons but not intended to represent real-world patterns.
- Serialized weights of the MobileNetV3-Large ImageNet model.
- Contains millions of FP32 values (both small and moderately large), typical of:
  - neural networks,
  - gradient updates,
  - machine-learning pipelines.
- Extracted from ESA Gaia DR3.
- Includes:
  - right ascension / declination,
  - parallax, proper motions,
  - photometric fluxes,
  - galactic & ecliptic coordinates.
- True scientific dataset with large dynamic range, typical of astronomy and big-science archives.
- NOAA NCEI “Global Hourly” dataset (temperature, dew point, visibility, pressure).
- Extremely common real-world data format: noisy, irregular, and API-like.
- Extracted from NOAA GFS model GRIB2 files.
- Contains fields such as geopotential height, temperature, humidity, pressure, wind components.
- True scientific FP32 with meaningful numerical variety and scaling.
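
Several of the datasets above (the MobileNetV3 weights and the NOAA fields) are described as FP32, while a plain `float()` call in Python parses text as binary64. The sketch below shows one way to narrow a parsed value to binary32 and inspect its bit pattern using only the standard `struct` module. The helper names are hypothetical. Note that parsing to binary64 and then narrowing can, in rare cases, differ from parsing the text directly as binary32 (double rounding), so this is an illustration rather than a reference parser.

```python
import struct


def narrow_to_float32(value):
    """Round a binary64 Python float to the nearest binary32 value."""
    # Packing with "f" performs the narrowing; unpacking widens it back
    # to a Python float. Raises OverflowError for finite values that are
    # too large for binary32.
    return struct.unpack("<f", struct.pack("<f", value))[0]


def float32_bits(value):
    """Return the 32-bit IEEE-754 pattern of value narrowed to binary32."""
    return struct.unpack("<I", struct.pack("<f", value))[0]
```

For instance, `float32_bits(float("1.0"))` yields `0x3F800000`, and `narrow_to_float32(0.1) != 0.1` because 0.1 has different nearest representations in binary32 and binary64.
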
A custom-built dataset (hellfloat64.txt) designed to stress special values.
Contains:
- subnormals,
- powers of two across the full range,
- powers of ten with exponents from -308 to +308,
- values near rounding boundaries,
- extreme magnitudes,
- structured edge cases (±0, EPS, min/max normal/subnormal),
- log-distributed extreme values.
This dataset is purely synthetic and intended as a worst-case test harness.
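
To make a few of these categories concrete, here is a small illustrative sketch that builds some binary64 edge cases with the Python standard library. It is not the repository's generator script: the function names are hypothetical, and the selection covers only the structured edge cases, powers of two, and powers of ten listed above.

```python
import math
import sys


def edge_case_values():
    """Return an illustrative (not exhaustive) list of binary64 edge cases."""
    info = sys.float_info
    min_subnormal = math.ldexp(1.0, -1074)  # smallest positive subnormal
    values = [
        0.0,
        -0.0,
        info.epsilon,              # gap between 1.0 and the next float
        info.min,                  # smallest positive normal
        info.max,                  # largest finite value
        min_subnormal,
        info.min - min_subnormal,  # largest subnormal (exact subtraction)
    ]
    # Powers of two across the full binary64 range, subnormals included.
    values += [math.ldexp(1.0, e) for e in range(-1074, 1024)]
    # Powers of ten with decimal exponents from -308 to +308.
    values += [float(f"1e{e}") for e in range(-308, 309)]
    return values


def to_lines(values):
    """Format each value with repr(), which round-trips through float()."""
    return [repr(v) for v in values]
```

Writing `"\n".join(to_lines(edge_case_values()))` to a file would produce text in the same one-value-per-line layout as the other datasets.
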
Several datasets can be regenerated automatically using the scripts found in
the scripts/ directory. All Python dependencies are managed using
uv.
Example commands:
cd scripts
uv sync
./noaa_gfs_1p00.sh
uv run ./noaa_global_hourly_f32.py
uv run ./gaia.py
uv run ./mobilenetv3.py
uv run ./hellfloat64.py

You can list or inspect the files with these shell commands:
ls -l number_files/
head -n 30 number_files/canada.txt
wc -l number_files/gaia.txt

If you use this dataset in research or a publication, please cite it.
Example BibTeX entry:
@misc{float-data,
title = {float-data: A collection of floating-point numbers},
author = {Jaël Champagne Gareau and Daniel Lemire},
year = {2025},
howpublished = {\url{https://github.com/fastfloat/float-data}}
}