@zzhfz zzhfz commented Dec 11, 2025

1. Description

Add InfiniLM inference benchmark framework

  • Complete framework skeleton with config management
  • Direct mode runner for offline batch inference
  • Service mode runner for real-time service testing
  • InfiniLM adapter with proper API usage
  • Trace client for request simulation
  • GPU monitoring and metrics collection
  • Prompt/token generators for testing
  • Standardized JSON output format
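As an illustrative sketch of the standardized JSON output format, the field names below are taken from the example result.json shown later in this PR; the `build_result` helper itself is hypothetical, not the framework's actual API:

```python
import json
import time


def build_result(run_id, testcase, success, config, metrics):
    """Assemble a result document in the standardized shape (sketch;
    field names mirror the example result.json in this PR)."""
    return {
        "run_id": run_id,
        "testcase": testcase,
        "success": 1 if success else 0,
        "time": time.strftime("%Y-%m-%d %H:%M:%S"),
        "config": config,
        "metrics": metrics,
    }


result = build_result(
    run_id="my.custom.runid.20251211_124341.mqqdqdss",
    testcase="infer.InfiniLM.Direct",
    success=True,
    config={"framework": "infinilm", "model": "Qwen3-1.7B"},
    metrics=[
        # Scalars carry a value; timeseries point at a raw-data CSV.
        {"name": "infer.ppl", "type": "scalar", "unit": None, "value": 14.74},
        {"name": "infer.ttft", "type": "timeseries", "unit": "ms",
         "raw_data_url": "./infer/example_infer_ttft.csv"},
    ],
)
print(json.dumps(result, indent=2)[:60])
```

Scalar metrics embed their value inline, while timeseries metrics reference an external CSV, which keeps the summary JSON small.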

2. Evidence

1. Direct mode
  • input.json:
{
  "run_id": "my.custom.runid",
  "testcase": "infer.InfiniLM.Direct",
  "config": {
    "model": "Qwen3-1.7B",
    "model_path": "/var/qy_home/sunjinge/models/Qwen3-1.7B",
    "model_config": "/var/qy_home/sunjinge/models/Qwen3-1.7B",

    "device": {
      "gpu_platform": "nvidia",
      "device_ids": [0],
      "cpu_only": false
    },

    "train_dataset": null,
    "validation_dataset": null,
    "test_dataset": "./test_perplexity_data.json",
    "output_dir": "./test_output_real",

    "infer_args": {
      "parallel": {
        "dp": 1,
        "tp": 1,
        "pp": 1,
        "sp": 1
      },
      "static_batch_size": 4,
      "prompt_token_num": 128,
      "output_token_num": 128,
      "max_seq_len": 512,
      "temperature": 0.7,
      "top_p": 0.9,
      "top_k": 50
    },

    "warmup_iterations": 1,
    "measured_iterations": 2
  }
}
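A config like the one above lends itself to a sanity check before the run starts. A minimal sketch, assuming nothing about the framework's real config loader (key names come from the input.json above; `validate_direct_config` is hypothetical):

```python
def validate_direct_config(cfg):
    """Check the fields a direct-mode run needs (sketch; key names taken
    from the example input.json, the validator itself is hypothetical)."""
    required = ["model", "model_path", "device", "infer_args"]
    missing = [k for k in required if k not in cfg]
    if missing:
        raise ValueError(f"missing config keys: {missing}")
    infer = cfg["infer_args"]
    if infer["static_batch_size"] < 1:
        raise ValueError("static_batch_size must be >= 1")
    # Prompt plus generated tokens must fit in the sequence window.
    if infer["prompt_token_num"] + infer["output_token_num"] > infer["max_seq_len"]:
        raise ValueError("prompt + output tokens exceed max_seq_len")
    return cfg


cfg = {
    "model": "Qwen3-1.7B",
    "model_path": "/path/to/Qwen3-1.7B",  # placeholder path
    "device": {"gpu_platform": "nvidia", "device_ids": [0], "cpu_only": False},
    "infer_args": {"static_batch_size": 4, "prompt_token_num": 128,
                   "output_token_num": 128, "max_seq_len": 512},
}
validate_direct_config(cfg)  # 128 + 128 <= 512, so this passes
```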

  • Partial results of the run:
2025-12-11 12:44:05,907 - infer_runner_base - INFO - Results saved to: test_output_real/infer/my.custom.runid.20251211_124341.mqqdqdss_results.json
2025-12-11 12:44:05,909 - infer_runner_base - INFO - ============================================================
2025-12-11 12:44:05,909 - infer_runner_base - INFO - BENCHMARK STATISTICS
2025-12-11 12:44:05,910 - infer_runner_base - INFO - ============================================================
2025-12-11 12:44:05,910 - infer_runner_base - INFO - Average latency: 1834.85 ms
2025-12-11 12:44:05,910 - infer_runner_base - INFO - P95 latency: 1835.95 ms
2025-12-11 12:44:05,910 - infer_runner_base - INFO - Average TTFT: 15.36 ms
2025-12-11 12:44:05,910 - infer_runner_base - INFO - P95 TTFT: 15.38 ms
2025-12-11 12:44:05,910 - infer_runner_base - INFO - Average throughput: 279.04 tokens/s/gpu
2025-12-11 12:44:05,910 - infer_runner_base - INFO - Success rate: 100.00%
2025-12-11 12:44:05,910 - infer_runner_base - INFO - Total duration: 22.96 s
2025-12-11 12:44:05,910 - infer_runner_base - INFO - ============================================================
2025-12-11 12:44:05,911 - infer_runner_base - INFO - Benchmark completed successfully: my.custom.runid.20251211_124341.mqqdqdss
2025-12-11 12:44:05,911 - __main__ - INFO - ============================================================
2025-12-11 12:44:05,911 - __main__ - INFO - BENCHMARK COMPLETED SUCCESSFULLY
2025-12-11 12:44:05,911 - __main__ - INFO - ============================================================
2025-12-11 12:44:05,911 - __main__ - INFO - Results saved to: test_output_real/infer/my.custom.runid.20251211_124341.mqqdqdss_results.json
2025-12-11 12:44:05,911 - __main__ - INFO - Benchmark success status: 1
  • result.json:
(megatron) sunjinge@node1:~/InfiniLM$ cat test_output_real/infer/my.custom.runid.20251211_124341.mqqdqdss_results.json
{
  "run_id": "my.custom.runid.20251211_124341.mqqdqdss",
  "testcase": "infer.InfiniLM.Direct",
  "success": 1,
  "time": "2025-12-11 12:44:05",
  "config": {
    "command": "python scripts/jiuge.py --nvidia /var/qy_home/sunjinge/models/Qwen3-1.7B 1 --batch-size 4",
    "framework": "infinilm",
    "model": "Qwen3-1.7B",
    "model_config": "/var/qy_home/sunjinge/models/Qwen3-1.7B",
    "train_dataset": null,
    "validation_dataset": null,
    "test_dataset": "./test_perplexity_data.json",
    "infer_args": {
      "parallel": {
        "dp": 1,
        "tp": 1,
        "pp": 1,
        "sp": 1
      },
      "static_batch_size": 4,
      "prompt_token_num": 128,
      "output_token_num": 128,
      "max_seq_len": 512,
      "temperature": 0.7,
      "top_p": 0.9,
      "top_k": 50
    },
    "warmup_iterations": 1,
    "measured_iterations": 2
  },
  "metrics": [
    {
      "name": "infer.ppl",
      "type": "scalar",
      "unit": null,
      "value": 14.736358586474928
    },
    {
      "name": "infer.accuracy",
      "type": "scalar",
      "unit": null,
      "value": 0.0
    },
    {
      "name": "infer.compute_latency",
      "type": "timeseries",
      "unit": "ms",
      "raw_data_url": "./infer/my.custom.runid.20251211_124341.mqqdqdss_infer_latency.csv"
    },
    {
      "name": "infer.ttft",
      "type": "timeseries",
      "unit": "ms",
      "raw_data_url": "./infer/my.custom.runid.20251211_124341.mqqdqdss_infer_ttft.csv"
    },
    {
      "name": "infer.direct_throughput_tps",
      "type": "timeseries",
      "unit": "tokens/s/gpu",
      "raw_data_url": "./infer/my.custom.runid.20251211_124341.mqqdqdss_infer_throughput.csv"
    },
    {
      "name": "infer.peak_memory_usage",
      "type": "scalar",
      "unit": "GB",
      "value": 26.573242
    }
  ]
}
  • CSV example:
sunjinge@node1:~/InfiniLM$ cat test_output_real/infer/my.custom.runid.20251211_124341.mqqdqdss_infer_latency.csv
timestamp,latency_ms
0,1833.7632436305285
1,1833.7632436305285
2,1833.7632436305285
3,1833.7632436305285
4,1835.945948958397
5,1835.945948958397
6,1835.945948958397
7,1835.945948958397
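The summary statistics in the log above can be reproduced from this latency series. A sketch, assuming the reported P95 uses linear interpolation over the sorted samples (which matches the logged 1834.85 / 1835.95 ms exactly):

```python
def summarize(latencies_ms):
    """Average and P95 of a latency series in ms (sketch; assumes a
    linear-interpolation percentile, which reproduces the logged values)."""
    vals = sorted(latencies_ms)
    avg = sum(vals) / len(vals)
    rank = 0.95 * (len(vals) - 1)          # fractional index of the P95
    lo = int(rank)
    frac = rank - lo
    hi = min(lo + 1, len(vals) - 1)
    p95 = vals[lo] + frac * (vals[hi] - vals[lo])
    return round(avg, 2), round(p95, 2)


# The eight samples from the CSV above: 4 requests per batch, 2 batches.
samples = [1833.7632436305285] * 4 + [1835.945948958397] * 4
print(summarize(samples))  # -> (1834.85, 1835.95)
```

The eight identical values per batch reflect static batching: all four requests in a batch finish together, so they share one latency measurement.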
2. Service mode
  • input.json
sunjinge@node1:~/InfiniLM$ cat inference/configs/real_infinilm_service_small.json
{
  "run_id": "my.custom.runid",
  "testcase": "infer.InfiniLM.Service",
  "config": {
    "model": "Qwen3-1.7B",
    "model_path":"/var/qy_home/sunjinge/models/Qwen3-1.7B",
    "model_config": "/var/qy_home/sunjinge/models/Qwen3-1.7B",

    "device": {
      "gpu_platform": "nvidia",
      "device_ids": [0],
      "cpu_only": false
    },

    "train_dataset": null,
    "validation_dataset": null,
    "test_dataset": null,
    "output_dir": "./test_output_service",

    "infer_args": {
      "parallel": {
        "dp": 1,
        "tp": 1,
        "pp": 1,
        "sp": 1
      },
      "request_trace": "./test_trace.csv",
      "concurrency": 4,
      "max_seq_len": 2048,
      "stream": true,
      "timeout_ms": 30000
    },

    "timeout_ms": 60000,
    "warmup_iterations": 10,
    "measured_iterations": 50
  }
}
  • Partial results of the run:
2025-12-11 12:32:58,780 - utils.trace_client - DEBUG - Request req-0049 completed: TTFT=18.21ms, E2E=18.61ms, tokens=1
2025-12-11 12:32:58,784 - utils.trace_client - INFO - Trace run completed: 100.00% success rate, 40.14ms avg TTFT, 72.29ms avg E2E latency
2025-12-11 12:32:58,785 - utils.trace_client - INFO - Trace results saved to test_output_service/infer
2025-12-11 12:33:01,417 - utils.gpu_monitor - DEBUG - Current GPU memory usage: [22541] MiB
2025-12-11 12:33:01,917 - utils.gpu_monitor - INFO - GPU monitoring stopped
2025-12-11 12:33:01,918 - service_infer_runner - INFO - Peak GPU memory usage during test: 24.692383 GB
2025-12-11 12:33:01,919 - service_infer_runner - INFO - Collecting service inference metrics
2025-12-11 12:33:04,944 - adapters.infinilm_adapter - INFO - GPU memory - Peak: 0.00 GB, Current: 0.00 GB
2025-12-11 12:33:04,945 - infer_runner_base - DEBUG - Adding missing metric: infer.accuracy_mmlu
2025-12-11 12:33:04,945 - infer_runner_base - DEBUG - Adding missing metric: infer.peak_memory_usage
2025-12-11 12:33:04,945 - infer_runner_base - DEBUG - Adding missing metric: infer.compute_latency
2025-12-11 12:33:04,945 - infer_runner_base - DEBUG - Created placeholder file: my.custom.runid.20251211_123239.g07zk3ex_infer_compute_latency.csv
2025-12-11 12:33:04,945 - infer_runner_base - DEBUG - Adding missing metric: infer.max_throughput_tps
2025-12-11 12:33:04,946 - infer_runner_base - DEBUG - Created placeholder file: my.custom.runid.20251211_123239.g07zk3ex_infer_max_throughput.csv
2025-12-11 12:33:04,946 - infer_runner_base - INFO - Total metrics in JSON: 12
2025-12-11 12:33:04,946 - infer_runner_base - DEBUG -   - infer.avg_ttft (scalar)
2025-12-11 12:33:04,946 - infer_runner_base - DEBUG -   - infer.avg_e2e_latency (scalar)
2025-12-11 12:33:04,946 - infer_runner_base - DEBUG -   - infer.avg_throughput_tps (scalar)
2025-12-11 12:33:04,946 - infer_runner_base - DEBUG -   - infer.success_rate (scalar)
2025-12-11 12:33:04,946 - infer_runner_base - DEBUG -   - infer.total_requests (scalar)
2025-12-11 12:33:04,946 - infer_runner_base - DEBUG -   - infer.e2e_latency (timeseries)
2025-12-11 12:33:04,946 - infer_runner_base - DEBUG -   - infer.ttft (timeseries)
2025-12-11 12:33:04,946 - infer_runner_base - DEBUG -   - infer.response_per_second (timeseries)
2025-12-11 12:33:04,946 - infer_runner_base - DEBUG -   - infer.accuracy_mmlu (scalar)
2025-12-11 12:33:04,946 - infer_runner_base - DEBUG -   - infer.peak_memory_usage (scalar)
2025-12-11 12:33:04,946 - infer_runner_base - DEBUG -   - infer.compute_latency (timeseries)
2025-12-11 12:33:04,946 - infer_runner_base - DEBUG -   - infer.max_throughput_tps (timeseries)
2025-12-11 12:33:04,946 - infer_runner_base - INFO - Results saved to: test_output_service/infer/my.custom.runid.20251211_123239.g07zk3ex_results.json
2025-12-11 12:33:04,946 - service_infer_runner - INFO - Real peak GPU memory usage: 24.692383 GB
2025-12-11 12:33:04,946 - service_infer_runner - INFO - Service metrics saved to: test_output_service/infer/my.custom.runid.20251211_123239.g07zk3ex_results.json
2025-12-11 12:33:04,947 - infer_runner_base - INFO - ============================================================
2025-12-11 12:33:04,947 - infer_runner_base - INFO - BENCHMARK STATISTICS
2025-12-11 12:33:04,947 - infer_runner_base - INFO - ============================================================
2025-12-11 12:33:04,947 - infer_runner_base - INFO - Average latency: 72.29 ms
2025-12-11 12:33:04,947 - infer_runner_base - INFO - P95 latency: 334.51 ms
2025-12-11 12:33:04,947 - infer_runner_base - INFO - Average TTFT: 40.14 ms
2025-12-11 12:33:04,947 - infer_runner_base - INFO - P95 TTFT: 92.17 ms
2025-12-11 12:33:04,947 - infer_runner_base - INFO - Average throughput: 3.00 requests/s
2025-12-11 12:33:04,947 - infer_runner_base - INFO - Success rate: 100.00%
2025-12-11 12:33:04,947 - infer_runner_base - INFO - Total duration: 23.03 s
2025-12-11 12:33:04,948 - infer_runner_base - INFO - ============================================================
2025-12-11 12:33:04,948 - infer_runner_base - INFO - Benchmark completed successfully: my.custom.runid.20251211_123239.g07zk3ex
2025-12-11 12:33:04,948 - __main__ - INFO - ============================================================
2025-12-11 12:33:04,948 - __main__ - INFO - BENCHMARK COMPLETED SUCCESSFULLY
2025-12-11 12:33:04,948 - __main__ - INFO - ============================================================
2025-12-11 12:33:04,948 - __main__ - INFO - Results saved to: test_output_service/infer/my.custom.runid.20251211_123239.g07zk3ex_results.json
2025-12-11 12:33:04,948 - __main__ - INFO - Benchmark success status: 1
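The peak-memory figure in the log is consistent with per-sample MiB readings reduced by dividing by 1024 (24.692383 GB corresponds to 25285 MiB). A sketch of that reduction; the monitor's real sampling loop is not shown here:

```python
def peak_gb(samples_mib):
    """Reduce a series of GPU memory samples (MiB) to the peak in GB.
    Sketch: assumes GB = MiB / 1024 rounded to six decimals, which
    matches the logged 24.692383 GB."""
    return round(max(samples_mib) / 1024, 6)


# Example samples; 22541 MiB appears in the monitor's DEBUG output above.
print(peak_gb([22541, 25285, 24100]))  # -> 24.692383
```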

  • result.json:
{
  "run_id": "my.custom.runid.20251211_154047.elp9ujpk",
  "testcase": "infer.InfiniLM.Service",
  "success": 1,
  "time": "2025-12-11 15:41:11",
  "config": {
    "command": "python scripts/launch_server.py --model-path /var/qy_home/sunjinge/models/Qwen3-1.7B --dev nvidia --ndev 1 --max-tokens 2048 # Trace test: python trace_client.py --trace ./test_trace.csv --concurrency 4",
    "framework": "infinilm",
    "model": "Qwen3-1.7B",
    "model_config": "/var/qy_home/sunjinge/models/Qwen3-1.7B",
    "train_dataset": null,
    "validation_dataset": null,
    "test_dataset": null,
    "infer_args": {
      "parallel": {
        "dp": 1,
        "tp": 1,
        "pp": 1,
        "sp": 1
      },
      "request_trace": "./test_trace.csv",
      "concurrency": 4,
      "max_seq_len": 2048,
      "stream": true,
      "timeout_ms": 30000
    },
    "warmup_iterations": 10,
    "measured_iterations": 50
  },
  "metrics": [
    {
      "name": "infer.accuracy_mmlu",
      "type": "scalar",
      "value": null,
      "unit": null
    },
    {
      "name": "infer.e2e_latency",
      "type": "timeseries",
      "raw_data_url": "./infer/my.custom.runid.20251211_154047.elp9ujpk_infer_latency.csv",
      "unit": "ms"
    },
    {
      "name": "infer.ttft",
      "type": "timeseries",
      "raw_data_url": "./infer/my.custom.runid.20251211_154047.elp9ujpk_infer_ttft.csv",
      "unit": "ms"
    },
    {
      "name": "infer.peak_memory_usage",
      "type": "scalar",
      "value": 24.114258,
      "unit": "GB"
    },
    {
      "name": "infer.response_per_second",
      "type": "timeseries",
      "raw_data_url": "./infer/my.custom.runid.20251211_154047.elp9ujpk_infer_throughput.csv",
      "unit": null
    },
    {
      "name": "infer.compute_latency",
      "type": "timeseries",
      "raw_data_url": "./infer/my.custom.runid.20251211_154047.elp9ujpk_infer_compute_latency.csv",
      "unit": "ms"
    },
    {
      "name": "infer.max_throughput_tps",
      "type": "timeseries",
      "raw_data_url": "./infer/my.custom.runid.20251211_154047.elp9ujpk_infer_max_throughput.csv",
      "unit": "tokens/s/gpu"
    }
  ]
}
  • CSV example:
sunjinge@node1:~/InfiniLM$ cat test_output_service/infer/my.custom.runid.20251211_154047.elp9ujpk_infer_latency.csv
timestamp,latency_ms
0,13.034327886998653
1,18.155450001358986
2,22.843563929200172
3,86.28608845174313
4,22.767651826143265
5,22.943492978811264
6,23.340691812336445
7,22.912921383976936
8,14.578982256352901
9,39.56571500748396
10,87.50941324979067
... ...
  • test_trace.csv
sunjinge@node1:~/InfiniLM$ cat test_trace.csv
request_id,arrival_timestamp_ms,input_token_num,output_token_num
req-0000,56.5,128,256
req-0001,120.97,64,64
req-0002,121.06,256,64
req-0003,197.59,512,64
req-0004,218.33,128,64
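The trace client replays such a file by issuing each request at its arrival timestamp. A minimal parsing-and-scheduling sketch (column names come from the CSV above; the inter-arrival computation is illustrative, not the client's actual replay loop, which also drives concurrency and streaming):

```python
import csv
import io

# The five-row trace from the example above, inlined for a self-contained demo.
TRACE = """request_id,arrival_timestamp_ms,input_token_num,output_token_num
req-0000,56.5,128,256
req-0001,120.97,64,64
req-0002,121.06,256,64
req-0003,197.59,512,64
req-0004,218.33,128,64
"""


def load_trace(text):
    """Parse a request trace into (request_id, arrival_ms, in_tok, out_tok)."""
    rows = csv.DictReader(io.StringIO(text))
    return [(r["request_id"], float(r["arrival_timestamp_ms"]),
             int(r["input_token_num"]), int(r["output_token_num"]))
            for r in rows]


trace = load_trace(TRACE)
# Inter-arrival gaps: how long the client sleeps before sending each request.
gaps = [b[1] - a[1] for a, b in zip(trace, trace[1:])]
print(len(trace), round(gaps[0], 2))  # -> 5 64.47
```

Note that req-0001 and req-0002 arrive 0.09 ms apart, so with `concurrency: 4` they are effectively dispatched together, which is what stresses the server's batching path.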

@zzhfz zzhfz requested review from baominghelly and bitzyz and removed request for baominghelly and bitzyz December 11, 2025 08:11