Converts BIRD benchmark SQL queries into API call sequences, producing two benchmark types:
- SLOT-BIRD (`slot`): Slot-filling style — generic data manipulation functions
- SEL-BIRD (`sel`): Selection style — specific data manipulation functions and table-dependent getter functions
```bash
git clone git@github.com:IBM/live-api-bench.git live-api-bench
cd live-api-bench
pip install -r requirements.txt
```

This installs vakra-benchmark (the tool-calling library) along with other dependencies.
The pipeline reads SQLite databases and query JSON files from a database directory. The location is resolved in this order:
1. `--db-path <path>` CLI argument
2. `BIRD_DB_PATH` environment variable
3. Default: `db/` at the repo root
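A minimal sketch of this resolution order, for reference (the function name `resolve_db_path` and its argument are illustrative, not the repo's actual API):

```python
import os
from pathlib import Path

def resolve_db_path(cli_db_path: str | None = None) -> Path:
    """Resolve the BIRD database directory in the documented priority order."""
    if cli_db_path:                        # 1. --db-path CLI argument
        return Path(cli_db_path)
    env_path = os.environ.get("BIRD_DB_PATH")
    if env_path:                           # 2. BIRD_DB_PATH environment variable
        return Path(env_path)
    return Path("db")                      # 3. default: db/ at the repo root
```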
The expected layout is:
```
live-api-bench/db/
├── train.json           # BIRD train query set
├── dev.json             # BIRD dev query set
├── train_queries/       # Per-dataset prefiltered query files (auto-generated)
├── dev_queries/         # Per-dataset prefiltered query files (auto-generated)
├── train_databases/     # BIRD train SQLite databases
│   └── <dataset>/
│       ├── <dataset>.sqlite
│       └── database_description/
│           └── *.csv
└── dev_databases/       # BIRD dev SQLite databases
    └── <dataset>/
        ├── <dataset>.sqlite
        └── database_description/
            └── *.csv
```
Download the BIRD benchmark data from bird-bench.github.io and place `train.json` / `dev.json` and the `train_databases/` / `dev_databases/` folders under `db/`.
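Once the files are in place, a quick sanity check like the following can confirm the layout (an illustrative helper, not part of the repo):

```python
from pathlib import Path

# Check that each item the pipeline expects exists under db/.
db = Path("db")
for required in ("train.json", "dev.json", "train_databases", "dev_databases"):
    status = "OK" if (db / required).exists() else "MISSING"
    print(f"{required}: {status}")
```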
Run the translation pipeline from the live-api-bench directory:
```bash
python live_api_bench/python_tools/run_bird_translation.py [OPTIONS]
```

| Flag | Choices | Default | Description |
|---|---|---|---|
| `-m, --mode` | `train`, `dev` | `dev` | Which BIRD split to use |
| `-d, --dataset` | any BIRD db name | — | Process a single named database (overrides `-m`) |
| `-api, --api_style` | `slot`, `sel` | `slot` | Output format: slot-filling or selection |
| `--db-path` | any path | see above | Path to the BIRD database directory |
```bash
# Defaults: all dev databases, slot-filling output
python live_api_bench/python_tools/run_bird_translation.py

# All train databases, selection output
python live_api_bench/python_tools/run_bird_translation.py -m train -api sel

# Single specific database
python live_api_bench/python_tools/run_bird_translation.py -d formula_1 -api slot
```

Results are written to `output/slot/<dataset>.json` or `output/sel/<dataset>.json`. Each file is a list of entries with the translated API call sequence, the original SQL and natural-language question, the gold answer, and the tool specifications used.
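To get a feel for a result file, something like the following can be used to inspect its structure (a sketch; the exact per-entry field names are not listed here, so the snippet just prints them):

```python
import json

# Load one translated benchmark file and look at its first entry.
with open("output/slot/formula_1.json") as f:
    entries = json.load(f)

print(f"{len(entries)} entries")
print(sorted(entries[0].keys()))  # fields of a single benchmark entry
```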
Not all BIRD SQL queries can be translated. The pipeline supports a subset of SQL patterns; unsupported patterns are skipped and logged in the run report.
- Simple `SELECT` with column projections, `WHERE` filters (equality, `<`, `>`, `LIKE`), `ORDER BY`, and `LIMIT`
- `INNER JOIN` across two or more tables on a single equality condition
- Aggregations in `SELECT`: `COUNT`, `SUM`, `AVG`, `MIN`, `MAX`
- `GROUP BY` with aggregations
- `CAST` expressions in `SELECT` (slot-filling only)
- Arithmetic expressions in `SELECT` (e.g. `col1 / col2`, `col1 - col2`) (slot-filling only)
- `DISTINCT`
| Pattern | Affects | Notes |
|---|---|---|
| Subqueries (any nested `SELECT`) | both | Includes scalar subqueries, correlated `EXISTS`, derived tables in `FROM`, and multi-`SELECT` arithmetic. The most common failure category. |
| Arithmetic in `WHERE` | both | Expressions like `WHERE col1 / col2 > 0.3` or `WHERE col1 + col2 > 500` are not parsed; arithmetic is only supported in `SELECT`. |
| Window functions (`RANK() OVER`, etc.) | both | No translation strategy exists for window functions. |
| `!=` / `<>` operator in `WHERE` | both | Only `=`, `<`, `>`, `<=`, `>=`, and `LIKE` comparisons are handled. |
| Math functions in `ORDER BY` (`ABS`, etc.) | both | Math function calls in `ORDER BY` are not recognized as column expressions. |
| Multi-condition `JOIN ... ON` | both | `JOIN t ON t.a = x AND t.b = y` — only a single equality per `ON` clause is supported. |
| `IN (val, val, ...)` in `WHERE` | both | Literal-list `IN` clauses are not handled by the `WHERE` parser. |
| Multiple `OR` in `WHERE` | both | Any number of `AND` clauses plus at most one `OR` clause is supported; two or more `OR` clauses are not. |
| Missing column getter | sel only | If a column appears in `SELECT` but no getter for it exists in the toolbox, translation fails. |
For a detailed breakdown with sample IDs and fix-complexity estimates, see `QUERY_SUPPORT.md`.
To run the translated benchmark against the vakra MCP server, each sample must be registered in `environment/configs/mcp_tool_universe_id_mapping.yaml` inside the vakra repo. The MCP server loads this file at startup and uses it as a lookup table: given a sample's UUID, it retrieves the `initialize_active_data` arguments (table aliases, join sequence, database path) and the server type (`slot_filling` or `selection`) needed to set up the data environment before the agent starts issuing tool calls.

Each benchmark sample targets a specific set of tables in a specific SQLite database, possibly with aliased table names and a join sequence that must be established before any tool calls can execute. Rather than embedding this configuration in the benchmark input itself, the MCP server externalizes it: the agent passes a `tool_universe_id` (the sample UUID) to `get_data()`, and the server looks up the corresponding configuration in the YAML to initialize the active data correctly.
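Conceptually, the lookup the server performs looks like this (a sketch only; the YAML key names below are assumptions based on the description above, not the file's documented schema):

```python
import yaml

with open("environment/configs/mcp_tool_universe_id_mapping.yaml") as f:
    mapping = yaml.safe_load(f)

entry = mapping["<sample-uuid>"]             # keyed by the sample's UUID
init_args = entry["initialize_active_data"]  # assumed key: table aliases, join sequence, db path
server_type = entry["server_type"]           # assumed key: slot_filling or selection
```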
After placing new benchmark output files in `data/test/capability_1_bi_apis/output/` inside the vakra repo, run:

```bash
# From the vakra repo root
python environment/configs/generate_task_1_tool_universe_mapping.py
```

This scans all `*.json` files under `data/test/capability_1_bi_apis/output/`, extracts the `initialize_active_data` arguments and server type from each sample's ground-truth tool call sequence, and writes the result to `environment/configs/mcp_tool_universe_id_mapping.yaml`.

Pass `--dry-run` to print the entry count without writing the file.
Each JSON file must be a list of records in vakra's output format:
```json
{
"uuid": "<unique sample ID>",
"domain": "<database name>",
"output": [
{
"turn_id": 0,
"query": "<natural language question>",
"sequence": [
{
"tool_call": [
{
"name": "initialize_active_data",
"arguments": {
"alias_to_table_dict": { ... },
"condition_sequence": [ ... ],
"database_path": "<path to .sqlite>"
},
"label": "starting_table_var"
},
...
]
}
]
}
]
}
```

The server type (`slot_filling` or `selection`) is inferred automatically from which tool names appear in the tool call sequence. Note that this format differs from live-api-bench's native output (which uses `sample_id`/`dataset_name` and a flat tool call list); a conversion step is required before running the generate script.
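A minimal conversion sketch, assuming the native records expose the question and tool call list under `question` and `tool_calls` (those two field names are guesses; only `sample_id`/`dataset_name` and the flat tool call list are documented):

```python
import json

def to_vakra_record(native: dict) -> dict:
    """Wrap a native live-api-bench record in vakra's output format."""
    return {
        "uuid": native["sample_id"],
        "domain": native["dataset_name"],
        "output": [{
            "turn_id": 0,
            "query": native["question"],                        # assumed field name
            "sequence": [{"tool_call": native["tool_calls"]}],  # assumed field name
        }],
    }

with open("output/slot/formula_1.json") as f:
    records = [to_vakra_record(r) for r in json.load(f)]

with open("formula_1_vakra.json", "w") as f:
    json.dump(records, f, indent=2)
```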