Hunt Instead of Wait: Evaluating Deep Data Research on Large Language Models
We introduce DDR-Bench (Deep Data Research Benchmark): a benchmark for evaluating LLM agents on autonomous data exploration and analysis tasks without predefined queries or goals.
DDR-Bench provides a framework for running and evaluating LLM-based data analysis agents across three real-world domains:
| Scenario | Domain | Data Type | Task |
|---|---|---|---|
| MIMIC | Healthcare | MIMIC-IV patient records | Extract clinical insights from electronic health records |
| 10-K | Finance | SEC 10-K filings | Analyze XBRL filings data to get company financial insights |
| GLOBEM | Behavioral | GLOBEM wearable data | Discover user well-being patterns from wearable data |
- Python 3.10+
- Required packages:

```shell
pip3 install -r requirements.txt
```

- Modify the configuration file `config.yaml` to set up paths and LLM settings.
- Set up API keys as environment variables:

```shell
export GEMINI_API_KEY="your-gemini-api-key"
export AZURE_OPENAI_API_KEY="your-azure-openai-key"
export AZURE_OPENAI_ENDPOINT="your-azure-endpoint"
# Or other provider keys as needed
```
Use `run_agent.py` to run data analysis agents. All configurations (LLM provider, model, data paths) are managed in `config.yaml`. See the configuration section for more details.
```shell
python run_agent.py --scenario mimic
python run_agent.py --scenario 10k
python run_agent.py --scenario globem
```

Each run creates a log directory under `base_log_dir` from `config.yaml`, where all agent trajectories, run metadata, and insights are stored. The evaluation script then reads this log directory to score the agent's performance.
Note: path configurations (`db_path`, `data_path`) and LLM settings (`provider`, `model`, `api_key`) should be set in `config.yaml`.
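Before launching an agent, it can be useful to check that the scenario you want to run actually has its path configured. The helper below is a minimal sketch, not part of the repo; it only assumes the `scenarios.*.db_path` / `data_path` layout shown in the configuration section.

```python
# Hypothetical helper (not part of DDR-Bench): verify that a scenario's
# required path key is set in the parsed config.yaml before running.
REQUIRED_PATH_KEY = {"mimic": "db_path", "10k": "db_path", "globem": "data_path"}

def check_scenario(cfg: dict, scenario: str) -> bool:
    """True if the scenario section defines a non-empty required path key."""
    section = cfg.get("scenarios", {}).get(scenario, {})
    return bool(section.get(REQUIRED_PATH_KEY[scenario]))

cfg = {"scenarios": {"mimic": {"db_path": "/path/to/mimic_iv.db"}}}
print(check_scenario(cfg, "mimic"))   # True
print(check_scenario(cfg, "globem"))  # False
```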
Use `run_evaluation.py` to evaluate agent results with an LLM-as-a-judge against the checklists in `data/*/qa.json`. Like the agent runner, evaluation paths are pulled from `config.yaml`:
```shell
# Evaluate MIMIC results
python run_evaluation.py --scenario mimic --log-dir ./{base_log_dir}/mimic

# Evaluate 10-K results
python run_evaluation.py --scenario 10k --log-dir ./{base_log_dir}/10k

# Evaluate GLOBEM results
python run_evaluation.py --scenario globem --log-dir ./{base_log_dir}/globem
```

Configuration is managed via `config.yaml`. You only need to adjust the following essential settings; all other parameters (such as `max_turns` and `log_level`) are pre-configured with sensible defaults.
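The real judging logic and the exact `qa.json` schema live in `run_evaluation.py`; as a rough intuition for checklist-based scoring, here is a toy keyword-matching stand-in (everything below — the function, the sample insight, and the checklist items — is hypothetical):

```python
# Toy stand-in for the LLM judge: score an agent insight by the fraction
# of checklist items it mentions. The real evaluation prompts an LLM.
def judge(insight: str, checklist: list[str]) -> float:
    """Fraction of checklist items whose text appears in the insight."""
    text = insight.lower()
    hits = sum(1 for item in checklist if item.lower() in text)
    return hits / len(checklist)

score = judge(
    "Sepsis patients show elevated lactate and longer ICU stays.",
    ["lactate", "ICU stay", "mortality"],
)
print(round(score, 2))  # 0.67
```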
```yaml
# 1. LLM Provider Settings
provider:
  default_provider: "gemini"   # Options: gemini, vllm, openai, minimax
  default_model: "gemini-2.0-flash"

evaluation:
  provider: "azure"            # Provider for LLM-as-judge
  model: "gpt-5-mini"

# 2. Database Paths (set to your local database locations)
scenarios:
  mimic:
    db_path: "/path/to/mimic_iv.db"            # Download from physionet.org, coming soon
  10k:
    db_path: "/path/to/10k_financial_data.db"  # Download from https://huggingface.co/datasets/thinkwee/DDRBench_10K/tree/main/raw
  globem:
    data_path: "/path/to/globem_data/"         # Download from physionet.org, coming soon
```

Set up your API keys as environment variables:
| Variable | Description |
|---|---|
| `GEMINI_API_KEY` | Google Gemini API key |
| `AZURE_OPENAI_API_KEY` | Azure OpenAI API key |
| `AZURE_OPENAI_ENDPOINT` | Azure OpenAI endpoint URL |
| `OPENAI_API_KEY` | OpenAI API key (if not using Azure) |
| `MINIMAX_API_KEY` | MiniMax API key |
- 10-K
- The checklist, database, and agent trajectories are fully open-sourced.
- Checklist: https://github.com/thinkwee/DDR_Bench/tree/main/data
- Database: https://huggingface.co/collections/thinkwee/ddrbench
- Agent trajectories: https://huggingface.co/datasets/thinkwee/DDRBench_10K_trajectory
- MIMIC
- Access requires certification from PhysioNet.
- Obtain certification and download the data from MIMIC IV v3.1 at https://physionet.org/content/mimiciv/3.1/
- Use `scripts/construct_mimic_sqlite.py` to convert the data into a SQLite database
- Update `scenarios.mimic.db_path` in `config.yaml` to point to the generated SQLite database
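The actual conversion is done by `scripts/construct_mimic_sqlite.py`; purely as an illustration of the CSV-to-SQLite idea, here is a self-contained toy where a two-row CSV stands in for a MIMIC-IV table (the table name and columns are invented for the example):

```python
import csv
import sqlite3
import tempfile
from pathlib import Path

# Illustrative sketch only: a toy CSV stands in for a real MIMIC-IV table.
tmp = Path(tempfile.mkdtemp())
(tmp / "patients.csv").write_text("subject_id,anchor_age\n1,65\n2,47\n")

conn = sqlite3.connect(tmp / "mimic_iv.db")
with open(tmp / "patients.csv", newline="") as f:
    rows = csv.reader(f)
    header = next(rows)
    conn.execute(f"CREATE TABLE patients ({', '.join(header)})")
    conn.executemany("INSERT INTO patients VALUES (?, ?)", rows)
conn.commit()

print(conn.execute("SELECT COUNT(*) FROM patients").fetchone()[0])  # 2
```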
- GLOBEM
- Access also requires certification from PhysioNet.
- Obtain certification and download the data from https://physionet.org/content/globem/1.1/
- Use `scripts/process_globem.py` to preprocess the data
- Update `scenarios.globem.data_path` in `config.yaml` to point to the processed data directory
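The preprocessing itself is handled by `scripts/process_globem.py`; as a loose sketch of what turning raw wearable samples into per-day features looks like, here is a toy aggregation (the sample data and feature names are invented; the real script's output format may differ):

```python
from collections import defaultdict
from statistics import mean

# Toy stand-in for wearable preprocessing: aggregate raw samples per day.
samples = [  # hypothetical (date, step_count) readings
    ("2020-03-01", 4200), ("2020-03-01", 3100),
    ("2020-03-02", 8000),
]

daily = defaultdict(list)
for day, steps in samples:
    daily[day].append(steps)

features = {day: {"total_steps": sum(v), "mean_steps": mean(v)}
            for day, v in daily.items()}
print(features["2020-03-01"]["total_steps"])  # 7300
```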

