
@sinanshamsudheen

Add Apache TsFile format support

Adds support for loading .tsfile datasets. Closes #7922.

What's TsFile?

Apache TsFile is a columnar file format for time-series data, widely used in IoT. The TsFile community requested this integration and offered to help maintain it.

What I did

Added a TsFile builder in packaged_modules/tsfile/, following the same pattern as the HDF5 builder. Registered the module, mapped the .tsfile extension to it, and added tsfile>=2.0.0 as an optional dependency in setup.py.

The builder calls tsfile.to_dataframe() in iterator mode for memory-efficient reading, then converts the result to PyArrow tables. The schema is inferred automatically from the file metadata.
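The batched-reading idea can be illustrated without the tsfile package. This is only a sketch, assuming rows arrive from some iterator: `iter_batches` and `sample_rows` are hypothetical names, not part of this PR or of the tsfile API, and the actual builder produces PyArrow tables rather than Python lists.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_batches(rows: Iterable[T], batch_size: int = 10000) -> Iterator[List[T]]:
    """Yield lists of at most batch_size rows without materializing all rows."""
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    it = iter(rows)
    while True:
        # islice pulls only the next batch_size items from the iterator.
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical stand-in for rows read from a file: (timestamp, temperature) pairs.
sample_rows = ((1609459200000 + i * 1000, 20.0 + i) for i in range(5))
batches = list(iter_batches(sample_rows, batch_size=2))
```

Only one batch is held in memory at a time, which is the property the iterator mode is meant to provide.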

Config options

  • batch_size - rows per batch (default 10000)
  • table_name - which table to read (for multi-table files)
  • columns - filter specific columns
  • start_time / end_time - time-range filtering
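One way to picture these options together is a small dataclass with basic validation. This is a hypothetical sketch, not the `TsFileConfig` class from the PR: the class name `TsFileOptionsSketch`, the `None` defaults, and the specific validation rules are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TsFileOptionsSketch:
    """Hypothetical mirror of the options listed above; not the PR's TsFileConfig."""

    batch_size: int = 10000              # rows per batch
    table_name: Optional[str] = None     # table to read in a multi-table file
    columns: Optional[List[str]] = None  # assumed: None means "no column filter"
    start_time: Optional[int] = None     # lower bound of the time range, if any
    end_time: Optional[int] = None       # upper bound of the time range, if any

    def __post_init__(self) -> None:
        # Assumed validation rules, chosen for illustration only.
        if self.batch_size <= 0:
            raise ValueError("batch_size must be a positive integer")
        if (
            self.start_time is not None
            and self.end_time is not None
            and self.start_time > self.end_time
        ):
            raise ValueError("start_time must not be greater than end_time")


opts = TsFileOptionsSketch(columns=["temperature"], start_time=1609459200000)
```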

Usage

from datasets import load_dataset

ds = load_dataset("tsfile", data_files=["data.tsfile"], split="train")

# with column selection and time-range filtering
ds = load_dataset("tsfile", data_files=["data.tsfile"],
                  columns=["temperature"], start_time=1609459200000)
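The value 1609459200000 in the example corresponds to 2021-01-01 00:00:00 UTC when read as milliseconds since the Unix epoch. The description does not state the unit of start_time / end_time, so treat the following as a sketch under that assumption; `to_epoch_ms` is a hypothetical helper, not part of datasets or tsfile.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_ms(dt: datetime) -> int:
    """Convert a timezone-aware datetime to whole milliseconds since the Unix epoch."""
    if dt.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    # Integer division of timedeltas avoids floating-point rounding.
    return (dt - EPOCH) // timedelta(milliseconds=1)


# Assumed unit: epoch milliseconds, as the example value suggests.
start_time = to_epoch_ms(datetime(2021, 1, 1, tzinfo=timezone.utc))
end_time = to_epoch_ms(datetime(2021, 1, 2, tzinfo=timezone.utc))
```

Requiring a timezone-aware datetime avoids silently interpreting a naive value in the local timezone.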

Tests

Added 11 tests covering config validation, basic loading, data integrity, feature inference, and error handling. All pass.

Implements support for the Apache TsFile time-series data format (huggingface#7922).

TsFile is a columnar storage format designed for IoT and time-series
data, providing efficient compression and high query performance.

Changes:
- Add TsFileConfig and TsFile builder classes in packaged_modules/tsfile/
- Register tsfile module with .tsfile extension mapping
- Add tsfile>=2.0.0 as optional dependency in setup.py
- Add comprehensive test suite (11 tests)

Features:
- Automatic feature/schema inference from TsFile metadata
- Support for table-model and tree-model TsFiles
- Time-range filtering via start_time/end_time parameters
- Column selection via columns parameter
- Memory-efficient iterator-based reading with configurable batch_size

Co-authored-by: TsFile Community
