@zzhfz zzhfz commented Dec 11, 2025

1. Description

Add InfiniLM inference benchmark framework

  • Complete framework skeleton with config management
  • Direct mode runner for offline batch inference
  • Service mode runner for real-time service testing
  • InfiniLM adapter with proper API usage
  • Trace client for request simulation
  • GPU monitoring and metrics collection
  • Prompt/token generators for testing
  • Standardized JSON output format
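As an illustrative sketch of the standardized JSON output format, the field names below are taken from the example result.json shown later in this PR; the `build_result` helper itself is hypothetical, not the framework's actual API:

```python
import json
import time


def build_result(run_id, testcase, success, config, metrics):
    """Assemble a result document in the standardized shape (sketch;
    field names mirror the example result.json in this PR)."""
    return {
        "run_id": run_id,
        "testcase": testcase,
        "success": 1 if success else 0,
        "time": time.strftime("%Y-%m-%d %H:%M:%S"),
        "config": config,
        "metrics": metrics,
    }


result = build_result(
    run_id="my.custom.runid.20251211_124341.mqqdqdss",
    testcase="infer.InfiniLM.Direct",
    success=True,
    config={"framework": "infinilm", "model": "Qwen3-1.7B"},
    metrics=[
        # Scalars carry a value; timeseries point at a raw-data CSV.
        {"name": "infer.ppl", "type": "scalar", "unit": None, "value": 14.74},
        {"name": "infer.ttft", "type": "timeseries", "unit": "ms",
         "raw_data_url": "./infer/example_infer_ttft.csv"},
    ],
)
print(json.dumps(result, indent=2)[:60])
```

Scalar metrics embed their value inline, while timeseries metrics reference an external CSV, which keeps the summary JSON small.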

2. Evidence

1. Direct mode
  • input.json:
{
  "run_id": "my.custom.runid",
  "testcase": "infer.InfiniLM.Direct",
  "config": {
    "model": "Qwen3-1.7B",
    "model_path": "/var/qy_home/sunjinge/models/Qwen3-1.7B",
    "model_config": "/var/qy_home/sunjinge/models/Qwen3-1.7B",

    "device": {
      "gpu_platform": "nvidia",
      "device_ids": [0],
      "cpu_only": false
    },

    "train_dataset": null,
    "validation_dataset": null,
    "test_dataset": "./test_perplexity_data.json",
    "output_dir": "./test_output_real",

    "infer_args": {
      "parallel": {
        "dp": 1,
        "tp": 1,
        "pp": 1,
        "sp": 1
      },
      "static_batch_size": 4,
      "prompt_token_num": 128,
      "output_token_num": 128,
      "max_seq_len": 512,
      "temperature": 0.7,
      "top_p": 0.9,
      "top_k": 50
    },

    "warmup_iterations": 1,
    "measured_iterations": 2
  }
}
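A config like the one above lends itself to a sanity check before the run starts. A minimal sketch, assuming nothing about the framework's real config loader (key names come from the input.json above; `validate_direct_config` is hypothetical):

```python
def validate_direct_config(cfg):
    """Check the fields a direct-mode run needs (sketch; key names taken
    from the example input.json, the validator itself is hypothetical)."""
    required = ["model", "model_path", "device", "infer_args"]
    missing = [k for k in required if k not in cfg]
    if missing:
        raise ValueError(f"missing config keys: {missing}")
    infer = cfg["infer_args"]
    if infer["static_batch_size"] < 1:
        raise ValueError("static_batch_size must be >= 1")
    # Prompt plus generated tokens must fit in the sequence window.
    if infer["prompt_token_num"] + infer["output_token_num"] > infer["max_seq_len"]:
        raise ValueError("prompt + output tokens exceed max_seq_len")
    return cfg


cfg = {
    "model": "Qwen3-1.7B",
    "model_path": "/path/to/Qwen3-1.7B",  # placeholder path
    "device": {"gpu_platform": "nvidia", "device_ids": [0], "cpu_only": False},
    "infer_args": {"static_batch_size": 4, "prompt_token_num": 128,
                   "output_token_num": 128, "max_seq_len": 512},
}
validate_direct_config(cfg)  # 128 + 128 <= 512, so this passes
```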

  • Partial results of the run:
2025-12-11 12:44:05,907 - infer_runner_base - INFO - Results saved to: test_output_real/infer/my.custom.runid.20251211_124341.mqqdqdss_results.json
2025-12-11 12:44:05,909 - infer_runner_base - INFO - ============================================================
2025-12-11 12:44:05,909 - infer_runner_base - INFO - BENCHMARK STATISTICS
2025-12-11 12:44:05,910 - infer_runner_base - INFO - ============================================================
2025-12-11 12:44:05,910 - infer_runner_base - INFO - Average latency: 1834.85 ms
2025-12-11 12:44:05,910 - infer_runner_base - INFO - P95 latency: 1835.95 ms
2025-12-11 12:44:05,910 - infer_runner_base - INFO - Average TTFT: 15.36 ms
2025-12-11 12:44:05,910 - infer_runner_base - INFO - P95 TTFT: 15.38 ms
2025-12-11 12:44:05,910 - infer_runner_base - INFO - Average throughput: 279.04 tokens/s/gpu
2025-12-11 12:44:05,910 - infer_runner_base - INFO - Success rate: 100.00%
2025-12-11 12:44:05,910 - infer_runner_base - INFO - Total duration: 22.96 s
2025-12-11 12:44:05,910 - infer_runner_base - INFO - ============================================================
2025-12-11 12:44:05,911 - infer_runner_base - INFO - Benchmark completed successfully: my.custom.runid.20251211_124341.mqqdqdss
2025-12-11 12:44:05,911 - __main__ - INFO - ============================================================
2025-12-11 12:44:05,911 - __main__ - INFO - BENCHMARK COMPLETED SUCCESSFULLY
2025-12-11 12:44:05,911 - __main__ - INFO - ============================================================
2025-12-11 12:44:05,911 - __main__ - INFO - Results saved to: test_output_real/infer/my.custom.runid.20251211_124341.mqqdqdss_results.json
2025-12-11 12:44:05,911 - __main__ - INFO - Benchmark success status: 1
  • result.json:
(megatron) sunjinge@node1:~/InfiniLM$ cat test_output_real/infer/my.custom.runid.20251211_124341.mqqdqdss_results.json
{
  "run_id": "my.custom.runid.20251211_124341.mqqdqdss",
  "testcase": "infer.InfiniLM.Direct",
  "success": 1,
  "time": "2025-12-11 12:44:05",
  "config": {
    "command": "python scripts/jiuge.py --nvidia /var/qy_home/sunjinge/models/Qwen3-1.7B 1 --batch-size 4",
    "framework": "infinilm",
    "model": "Qwen3-1.7B",
    "model_config": "/var/qy_home/sunjinge/models/Qwen3-1.7B",
    "train_dataset": null,
    "validation_dataset": null,
    "test_dataset": "./test_perplexity_data.json",
    "infer_args": {
      "parallel": {
        "dp": 1,
        "tp": 1,
        "pp": 1,
        "sp": 1
      },
      "static_batch_size": 4,
      "prompt_token_num": 128,
      "output_token_num": 128,
      "max_seq_len": 512,
      "temperature": 0.7,
      "top_p": 0.9,
      "top_k": 50
    },
    "warmup_iterations": 1,
    "measured_iterations": 2
  },
  "metrics": [
    {
      "name": "infer.ppl",
      "type": "scalar",
      "unit": null,
      "value": 14.736358586474928
    },
    {
      "name": "infer.accuracy",
      "type": "scalar",
      "unit": null,
      "value": 0.0
    },
    {
      "name": "infer.compute_latency",
      "type": "timeseries",
      "unit": "ms",
      "raw_data_url": "./infer/my.custom.runid.20251211_124341.mqqdqdss_infer_latency.csv"
    },
    {
      "name": "infer.ttft",
      "type": "timeseries",
      "unit": "ms",
      "raw_data_url": "./infer/my.custom.runid.20251211_124341.mqqdqdss_infer_ttft.csv"
    },
    {
      "name": "infer.direct_throughput_tps",
      "type": "timeseries",
      "unit": "tokens/s/gpu",
      "raw_data_url": "./infer/my.custom.runid.20251211_124341.mqqdqdss_infer_throughput.csv"
    },
    {
      "name": "infer.peak_memory_usage",
      "type": "scalar",
      "unit": "GB",
      "value": 26.573242
    }
  ]
}
  • CSV example:
sunjinge@node1:~/InfiniLM$ cat test_output_real/infer/my.custom.runid.20251211_124341.mqqdqdss_infer_latency.csv
timestamp,latency_ms
0,1833.7632436305285
1,1833.7632436305285
2,1833.7632436305285
3,1833.7632436305285
4,1835.945948958397
5,1835.945948958397
6,1835.945948958397
7,1835.945948958397
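The summary statistics in the log above can be reproduced from this latency series. A sketch, assuming the reported P95 uses linear interpolation over the sorted samples (which matches the logged 1834.85 / 1835.95 ms exactly):

```python
def summarize(latencies_ms):
    """Average and P95 of a latency series in ms (sketch; assumes a
    linear-interpolation percentile, which reproduces the logged values)."""
    vals = sorted(latencies_ms)
    avg = sum(vals) / len(vals)
    rank = 0.95 * (len(vals) - 1)          # fractional index of the P95
    lo = int(rank)
    frac = rank - lo
    hi = min(lo + 1, len(vals) - 1)
    p95 = vals[lo] + frac * (vals[hi] - vals[lo])
    return round(avg, 2), round(p95, 2)


# The eight samples from the CSV above: 4 requests per batch, 2 batches.
samples = [1833.7632436305285] * 4 + [1835.945948958397] * 4
print(summarize(samples))  # -> (1834.85, 1835.95)
```

The eight identical values per batch reflect static batching: all four requests in a batch finish together, so they share one latency measurement.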
2. Service mode
  • input.json
sunjinge@node1:~/InfiniLM$ cat inference/configs/real_infinilm_service_small.json
{
  "run_id": "my.custom.runid",
  "testcase": "infer.InfiniLM.Service",
  "config": {
    "model": "Qwen3-1.7B",
    "model_path":"/var/qy_home/sunjinge/models/Qwen3-1.7B",
    "model_config": "/var/qy_home/sunjinge/models/Qwen3-1.7B",

    "device": {
      "gpu_platform": "nvidia",
      "device_ids": [0],
      "cpu_only": false
    },

    "train_dataset": null,
    "validation_dataset": null,
    "test_dataset": null,
    "output_dir": "./test_output_service",

    "infer_args": {
      "parallel": {
        "dp": 1,
        "tp": 1,
        "pp": 1,
        "sp": 1
      },
      "request_trace": "./test_trace.csv",
      "concurrency": 4,
      "max_seq_len": 2048,
      "stream": true,
      "timeout_ms": 30000
    },

    "timeout_ms": 60000,
    "warmup_iterations": 10,
    "measured_iterations": 50
  }
}
  • Partial results of the run:
2025-12-11 12:32:58,780 - utils.trace_client - DEBUG - Request req-0049 completed: TTFT=18.21ms, E2E=18.61ms, tokens=1
2025-12-11 12:32:58,784 - utils.trace_client - INFO - Trace run completed: 100.00% success rate, 40.14ms avg TTFT, 72.29ms avg E2E latency
2025-12-11 12:32:58,785 - utils.trace_client - INFO - Trace results saved to test_output_service/infer
2025-12-11 12:33:01,417 - utils.gpu_monitor - DEBUG - Current GPU memory usage: [22541] MiB
2025-12-11 12:33:01,917 - utils.gpu_monitor - INFO - GPU monitoring stopped
2025-12-11 12:33:01,918 - service_infer_runner - INFO - Peak GPU memory usage during test: 24.692383 GB
2025-12-11 12:33:01,919 - service_infer_runner - INFO - Collecting service inference metrics
2025-12-11 12:33:04,944 - adapters.infinilm_adapter - INFO - GPU memory - Peak: 0.00 GB, Current: 0.00 GB
2025-12-11 12:33:04,945 - infer_runner_base - DEBUG - Adding missing metric: infer.accuracy_mmlu
2025-12-11 12:33:04,945 - infer_runner_base - DEBUG - Adding missing metric: infer.peak_memory_usage
2025-12-11 12:33:04,945 - infer_runner_base - DEBUG - Adding missing metric: infer.compute_latency
2025-12-11 12:33:04,945 - infer_runner_base - DEBUG - Created placeholder file: my.custom.runid.20251211_123239.g07zk3ex_infer_compute_latency.csv
2025-12-11 12:33:04,945 - infer_runner_base - DEBUG - Adding missing metric: infer.max_throughput_tps
2025-12-11 12:33:04,946 - infer_runner_base - DEBUG - Created placeholder file: my.custom.runid.20251211_123239.g07zk3ex_infer_max_throughput.csv
2025-12-11 12:33:04,946 - infer_runner_base - INFO - Total metrics in JSON: 12
2025-12-11 12:33:04,946 - infer_runner_base - DEBUG -   - infer.avg_ttft (scalar)
2025-12-11 12:33:04,946 - infer_runner_base - DEBUG -   - infer.avg_e2e_latency (scalar)
2025-12-11 12:33:04,946 - infer_runner_base - DEBUG -   - infer.avg_throughput_tps (scalar)
2025-12-11 12:33:04,946 - infer_runner_base - DEBUG -   - infer.success_rate (scalar)
2025-12-11 12:33:04,946 - infer_runner_base - DEBUG -   - infer.total_requests (scalar)
2025-12-11 12:33:04,946 - infer_runner_base - DEBUG -   - infer.e2e_latency (timeseries)
2025-12-11 12:33:04,946 - infer_runner_base - DEBUG -   - infer.ttft (timeseries)
2025-12-11 12:33:04,946 - infer_runner_base - DEBUG -   - infer.response_per_second (timeseries)
2025-12-11 12:33:04,946 - infer_runner_base - DEBUG -   - infer.accuracy_mmlu (scalar)
2025-12-11 12:33:04,946 - infer_runner_base - DEBUG -   - infer.peak_memory_usage (scalar)
2025-12-11 12:33:04,946 - infer_runner_base - DEBUG -   - infer.compute_latency (timeseries)
2025-12-11 12:33:04,946 - infer_runner_base - DEBUG -   - infer.max_throughput_tps (timeseries)
2025-12-11 12:33:04,946 - infer_runner_base - INFO - Results saved to: test_output_service/infer/my.custom.runid.20251211_123239.g07zk3ex_results.json
2025-12-11 12:33:04,946 - service_infer_runner - INFO - Real peak GPU memory usage: 24.692383 GB
2025-12-11 12:33:04,946 - service_infer_runner - INFO - Service metrics saved to: test_output_service/infer/my.custom.runid.20251211_123239.g07zk3ex_results.json
2025-12-11 12:33:04,947 - infer_runner_base - INFO - ============================================================
2025-12-11 12:33:04,947 - infer_runner_base - INFO - BENCHMARK STATISTICS
2025-12-11 12:33:04,947 - infer_runner_base - INFO - ============================================================
2025-12-11 12:33:04,947 - infer_runner_base - INFO - Average latency: 72.29 ms
2025-12-11 12:33:04,947 - infer_runner_base - INFO - P95 latency: 334.51 ms
2025-12-11 12:33:04,947 - infer_runner_base - INFO - Average TTFT: 40.14 ms
2025-12-11 12:33:04,947 - infer_runner_base - INFO - P95 TTFT: 92.17 ms
2025-12-11 12:33:04,947 - infer_runner_base - INFO - Average throughput: 3.00 requests/s
2025-12-11 12:33:04,947 - infer_runner_base - INFO - Success rate: 100.00%
2025-12-11 12:33:04,947 - infer_runner_base - INFO - Total duration: 23.03 s
2025-12-11 12:33:04,948 - infer_runner_base - INFO - ============================================================
2025-12-11 12:33:04,948 - infer_runner_base - INFO - Benchmark completed successfully: my.custom.runid.20251211_123239.g07zk3ex
2025-12-11 12:33:04,948 - __main__ - INFO - ============================================================
2025-12-11 12:33:04,948 - __main__ - INFO - BENCHMARK COMPLETED SUCCESSFULLY
2025-12-11 12:33:04,948 - __main__ - INFO - ============================================================
2025-12-11 12:33:04,948 - __main__ - INFO - Results saved to: test_output_service/infer/my.custom.runid.20251211_123239.g07zk3ex_results.json
2025-12-11 12:33:04,948 - __main__ - INFO - Benchmark success status: 1
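The peak-memory figure in the log is consistent with per-sample MiB readings reduced by dividing by 1024 (24.692383 GB corresponds to 25285 MiB). A sketch of that reduction; the monitor's real sampling loop is not shown here:

```python
def peak_gb(samples_mib):
    """Reduce a series of GPU memory samples (MiB) to the peak in GB.
    Sketch: assumes GB = MiB / 1024 rounded to six decimals, which
    matches the logged 24.692383 GB."""
    return round(max(samples_mib) / 1024, 6)


# Example samples; 22541 MiB appears in the monitor's DEBUG output above.
print(peak_gb([22541, 25285, 24100]))  # -> 24.692383
```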

  • result.json:
{
  "run_id": "my.custom.runid.20251211_154047.elp9ujpk",
  "testcase": "infer.InfiniLM.Service",
  "success": 1,
  "time": "2025-12-11 15:41:11",
  "config": {
    "command": "python scripts/launch_server.py --model-path /var/qy_home/sunjinge/models/Qwen3-1.7B --dev nvidia --ndev 1 --max-tokens 2048 # Trace test: python trace_client.py --trace ./test_trace.csv --concurrency 4",
    "framework": "infinilm",
    "model": "Qwen3-1.7B",
    "model_config": "/var/qy_home/sunjinge/models/Qwen3-1.7B",
    "train_dataset": null,
    "validation_dataset": null,
    "test_dataset": null,
    "infer_args": {
      "parallel": {
        "dp": 1,
        "tp": 1,
        "pp": 1,
        "sp": 1
      },
      "request_trace": "./test_trace.csv",
      "concurrency": 4,
      "max_seq_len": 2048,
      "stream": true,
      "timeout_ms": 30000
    },
    "warmup_iterations": 10,
    "measured_iterations": 50
  },
  "metrics": [
    {
      "name": "infer.accuracy_mmlu",
      "type": "scalar",
      "value": null,
      "unit": null
    },
    {
      "name": "infer.e2e_latency",
      "type": "timeseries",
      "raw_data_url": "./infer/my.custom.runid.20251211_154047.elp9ujpk_infer_latency.csv",
      "unit": "ms"
    },
    {
      "name": "infer.ttft",
      "type": "timeseries",
      "raw_data_url": "./infer/my.custom.runid.20251211_154047.elp9ujpk_infer_ttft.csv",
      "unit": "ms"
    },
    {
      "name": "infer.peak_memory_usage",
      "type": "scalar",
      "value": 24.114258,
      "unit": "GB"
    },
    {
      "name": "infer.response_per_second",
      "type": "timeseries",
      "raw_data_url": "./infer/my.custom.runid.20251211_154047.elp9ujpk_infer_throughput.csv",
      "unit": null
    },
    {
      "name": "infer.compute_latency",
      "type": "timeseries",
      "raw_data_url": "./infer/my.custom.runid.20251211_154047.elp9ujpk_infer_compute_latency.csv",
      "unit": "ms"
    },
    {
      "name": "infer.max_throughput_tps",
      "type": "timeseries",
      "raw_data_url": "./infer/my.custom.runid.20251211_154047.elp9ujpk_infer_max_throughput.csv",
      "unit": "tokens/s/gpu"
    }
  ]
}
  • CSV example:
sunjinge@node1:~/InfiniLM$ cat test_output_service/infer/my.custom.runid.20251211_154047.elp9ujpk_infer_latency.csv
timestamp,latency_ms
0,13.034327886998653
1,18.155450001358986
2,22.843563929200172
3,86.28608845174313
4,22.767651826143265
5,22.943492978811264
6,23.340691812336445
7,22.912921383976936
8,14.578982256352901
9,39.56571500748396
10,87.50941324979067
... ...
  • test_trace.csv
sunjinge@node1:~/InfiniLM$ cat test_trace.csv
request_id,arrival_timestamp_ms,input_token_num,output_token_num
req-0000,56.5,128,256
req-0001,120.97,64,64
req-0002,121.06,256,64
req-0003,197.59,512,64
req-0004,218.33,128,64
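The trace client replays such a file by issuing each request at its arrival timestamp. A minimal parsing-and-scheduling sketch (column names come from the CSV above; the inter-arrival computation is illustrative, not the client's actual replay loop, which also drives concurrency and streaming):

```python
import csv
import io

# The five-row trace from the example above, inlined for a self-contained demo.
TRACE = """request_id,arrival_timestamp_ms,input_token_num,output_token_num
req-0000,56.5,128,256
req-0001,120.97,64,64
req-0002,121.06,256,64
req-0003,197.59,512,64
req-0004,218.33,128,64
"""


def load_trace(text):
    """Parse a request trace into (request_id, arrival_ms, in_tok, out_tok)."""
    rows = csv.DictReader(io.StringIO(text))
    return [(r["request_id"], float(r["arrival_timestamp_ms"]),
             int(r["input_token_num"]), int(r["output_token_num"]))
            for r in rows]


trace = load_trace(TRACE)
# Inter-arrival gaps: how long the client sleeps before sending each request.
gaps = [b[1] - a[1] for a, b in zip(trace, trace[1:])]
print(len(trace), round(gaps[0], 2))  # -> 5 64.47
```

Note that req-0001 and req-0002 arrive 0.09 ms apart, so with `concurrency: 4` they are effectively dispatched together, which is what stresses the server's batching path.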

@zzhfz zzhfz requested review from baominghelly and bitzyz and removed request for baominghelly and bitzyz December 11, 2025 08:11