tmesh CLI User Documentation
Overview
tmesh-cli is a benchmarking tool for LLM inference endpoints. It measures performance metrics for OpenAI-compatible API endpoints, with a focus on testing KV cache offloading capabilities in TensorMesh deployments.
Installation
tmesh requires Python 3.9 or higher. Using uv to set up the environment is recommended:
curl -LsSf https://astral.sh/uv/install.sh | sh
uv venv --python 3.12
source .venv/bin/activate
Then install the latest wheel:
uv pip install tmesh
Verify Installation
After installation, verify tmesh is available:
tmesh-cli --help
You should see the help menu with available commands.
Quick Start
Run a benchmark against your LLM endpoint:
tmesh-cli benchmark --endpoint "http://localhost:8000" --api-key "your-api-key"
The benchmark will run continuously until interrupted (Ctrl+C).
Command Reference
tmesh-cli benchmark
Runs an infinite benchmarking workload against an OpenAI-compatible LLM endpoint.
Required Arguments
--endpoint <URL> - The OpenAI-compatible API endpoint URL
--api-key <KEY> - API key for authentication
Optional Arguments
--model <MODEL_NAME> - The model name to use, if you don't want to rely on auto-discovery from your endpoint
Endpoint URL Formats
The tool accepts various URL formats and normalizes them to /v1/:
# All of these are valid and equivalent:
tmesh-cli benchmark --endpoint "http://localhost:8000" --api-key "sk-123"
tmesh-cli benchmark --endpoint "http://localhost:8000/" --api-key "sk-123"
tmesh-cli benchmark --endpoint "http://localhost:8000/v1" --api-key "sk-123"
tmesh-cli benchmark --endpoint "http://localhost:8000/v1/" --api-key "sk-123"
tmesh-cli benchmark --endpoint "localhost:8000" --api-key "sk-123"
Note: HTTPS URLs are automatically converted to HTTP.
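For reference, the normalization described above behaves roughly like the sketch below. normalize_endpoint is an illustrative helper, not part of the tmesh-cli API, and the tool's actual implementation may differ.
from urllib.parse import urlparse

# Illustrative only; the actual normalization inside tmesh-cli may differ.
def normalize_endpoint(raw: str) -> str:
    if "://" not in raw:               # e.g. "localhost:8000"
        raw = "http://" + raw
    host = urlparse(raw).netloc
    return f"http://{host}/v1/"        # force HTTP and a trailing /v1/

assert normalize_endpoint("https://localhost:8000/v1") == "http://localhost:8000/v1/"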
Example
tmesh-cli benchmark \
--endpoint "http://89.169.111.29:30080/" \
--api-key "vllm_sk_555a1b7ff3e0f617b1240375ea411c2a5f08d2666fcdc718075f66c9"
How It Works
1. Model Discovery
The tool automatically discovers the model name from your endpoint:
endpoint: http://localhost:8000/v1/
found model: Qwen/Qwen3-30B-A3B-Instruct-2507
2. Workload Calculation
Based on the discovered model (matched against model_configs.json), the tool calculates optimal workload parameters:
offload_size: 100
Workload Specifications:
Model: Qwen/Qwen3-30B-A3B-Instruct-2507
Number of Contexts: 30
Number of Questions per Context: 30
Max Inflight Requests (Load-Balancing): 10
Input Length: 32000
Output Length: 100
The workload is designed to stress-test the KV cache offloading buffer by:
- Creating multiple long contexts (32k tokens each)
- Reusing contexts with different questions
- Managing concurrent requests with load balancing
3. Continuous Benchmarking
The tool sends requests continuously using a tiling pattern:
- Cycles through all contexts sequentially
- Appends random questions to each context
- Maintains max inflight requests for load balancing
- Maximizes cache evictions to test offloading
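A minimal sketch of this request loop, assuming an async send_request coroutine and a semaphore to cap inflight requests; the actual implementation in tmesh-cli may differ.
import asyncio
import itertools
import random

# Sketch of the tiling pattern; send_request stands in for the streaming
# completion call and is an assumed stub, not part of the real tool.
async def run_tiling(contexts, questions, max_inflight, send_request):
    semaphore = asyncio.Semaphore(max_inflight)   # load-balancing cap
    inflight = set()

    async def one_request(context):
        try:
            prompt = context + "\n" + random.choice(questions)
            await send_request(prompt)            # stream the completion
        finally:
            semaphore.release()

    # Cycle through contexts sequentially, appending a random question each time.
    for context in itertools.cycle(contexts):
        await semaphore.acquire()                 # wait for a free slot
        task = asyncio.create_task(one_request(context))
        inflight.add(task)
        task.add_done_callback(inflight.discard)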
4. Real-time Metrics
Every 5 seconds, the tool displays performance metrics:
Elapsed Time: 165.07
Total Number of Requests Processed: 280
QPS: 1.70
Global Average TTFT: 3.77
Global Average ITL: 0.015
Global Average Prefill Throughput: 22871.74
Global Average Decode Throughput: 104.12
Requests Processed in Last 5 second Interval: 56
Interval Average TTFT: 2.60
Interval Average ITL: 0.012
Interval Average Prefill Throughput: 25740.82
Interval Average Decode Throughput: 90.82
Metrics Explained
Global Metrics (Cumulative)
- Elapsed Time: Total time since benchmark started (seconds)
- Total Number of Requests Processed: All completed requests
- QPS (Queries Per Second): Average throughput over entire run
- Global Average TTFT: Average time to first token (seconds)
- Global Average ITL: Average inter-token latency (seconds per token)
- Global Average Prefill Throughput: Average input tokens processed per second
- Global Average Decode Throughput: Average output tokens generated per second
Interval Metrics (Last 5 Seconds)
- Requests Processed in Last 5 second Interval: Requests completed in the last interval
- Interval Average TTFT: TTFT for recent requests only
- Interval Average ITL: ITL for recent requests only
- Interval Average Prefill Throughput: Recent prefill performance
- Interval Average Decode Throughput: Recent decode performance
Global metrics provide overall performance, while interval metrics show current behavior and help detect performance changes over time.
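As a point of reference, TTFT and ITL can be derived from per-token arrival timestamps roughly as follows. This is an illustration of the definitions above, not the tool's exact accounting.
def ttft_and_itl(request_start: float, token_times: list[float]) -> tuple[float, float]:
    # Time from sending the request to receiving the first token (seconds).
    ttft = token_times[0] - request_start
    # Mean gap between consecutive tokens (seconds per token).
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, itl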
Workload Configuration
Hardcoded Parameters
- Input Length: 32,000 tokens per context
- Output Length: 100 tokens per completion
Auto-calculated Parameters
Based on model specifications in model_configs.json:
- Number of Contexts: Calculated from offload buffer size
- Number of Questions per Context: Equal to number of contexts
- Max Inflight Requests: 1/3 of number of contexts
Formula for number of contexts:
num_contexts = (0.9 × offload_size) / (bytes_per_token × input_length / 1024³)
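Worked example for Qwen/Qwen3-30B-A3B-Instruct-2507 using the values shown earlier (interpreting offload_size as GiB, which the 1024³ divisor implies):
offload_size = 100           # offload buffer size in GiB (implied by the 1024**3 divisor)
bytes_per_token = 98_304     # Qwen/Qwen3-30B-A3B-Instruct-2507, from model_configs.json
input_length = 32_000        # tokens per context (hardcoded)

gib_per_context = bytes_per_token * input_length / 1024**3   # ≈ 2.93 GiB
num_contexts = int(0.9 * offload_size / gib_per_context)     # 30
max_inflight = num_contexts // 3                              # 10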
Supported Models
The tool supports these pre-configured models:
- openai/gpt-oss-120b (36 layers, 73,728 bytes/token)
- openai/gpt-oss-20b (24 layers, 49,152 bytes/token)
- Qwen/Qwen3-235B-A22B-Instruct-2507-FP8 (94 layers, TP=8, 96,256 bytes/token)
- Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 (62 layers, TP=8, 253,952 bytes/token)
- Qwen/Qwen3-30B-A3B-Instruct-2507 (48 layers, TP=1, 98,304 bytes/token)
To add support for additional models, edit model_configs.json with the model's specifications.
Troubleshooting
Connection Errors
If you see:
[ERROR] Could not connect to endpoint: http://localhost:8000/v1/
Make sure a model server is running and accessible.
Try: curl http://localhost:8000/v1/models
Solutions:
- Verify the endpoint URL is correct
- Check that the model server is running
- Test connectivity with: curl <endpoint>/v1/models
- Ensure firewall/network allows connections
Model Not Found
If you see:
[ERROR] model <model_name> not found in model_configs.json
Solutions:
- Check if your model is supported (see Supported Models section)
- Add your model's configuration to model_configs.json
- Ensure the model name returned by /v1/models matches exactly (a quick check is sketched below)
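The sketch below compares the names your endpoint reports against model_configs.json. It assumes the file uses model names as its top-level keys, which may not match your copy; the endpoint URL and key are placeholders.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="sk-123")
served = [m.id for m in client.models.list().data]       # names the endpoint reports

with open("model_configs.json") as f:
    configured = set(json.load(f))                        # assumes top-level keys are model names

for name in served:
    print(name, "->", "found" if name in configured else "NOT in model_configs.json")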
Stopping the Benchmark
The benchmark runs indefinitely. To stop:
- Press Ctrl+C (or send SIGINT/SIGTERM)
- The tool will shut down gracefully and display final metrics
Advanced Usage
Testing Different Endpoints
Compare performance across different deployments:
# Test local deployment
tmesh-cli benchmark --endpoint "http://localhost:8000" --api-key "test"
# Test remote deployment
tmesh-cli benchmark --endpoint "http://production.example.com" --api-key "prod-key"
Long-running Tests
For extended testing, run in the background and redirect output:
nohup tmesh-cli benchmark \
--endpoint "http://localhost:8000" \
--api-key "sk-123" \
> benchmark.log 2>&1 &
Monitor with:
tail -f benchmark.log
Analyzing Results
The metrics help identify:
- Throughput bottlenecks: Low QPS or decode throughput
- Latency issues: High TTFT or ITL
- Cache performance: Changes in interval metrics when contexts rotate
- Resource constraints: Degrading metrics over time
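The metrics can also be pulled out of benchmark.log for offline analysis. The snippet below extracts the interval TTFT values, assuming the log lines follow the sample output shown earlier:
import re

ttfts = []
with open("benchmark.log") as log:
    for line in log:
        match = re.search(r"Interval Average TTFT:\s*([\d.]+)", line)
        if match:
            ttfts.append(float(match.group(1)))

if ttfts:
    print(f"intervals={len(ttfts)}  min={min(ttfts):.2f}s  max={max(ttfts):.2f}s")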
API Compatibility
The tool uses the OpenAI Python SDK and expects these endpoints:
- GET /v1/models - List available models
- POST /v1/completions - Streaming completions API
Your endpoint must support:
- Streaming responses (stream=True)
- The max_tokens parameter
- Standard OpenAI response format
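A quick way to confirm compatibility before benchmarking is to exercise both endpoints with the OpenAI Python SDK directly; the endpoint URL and API key below are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="sk-123")

model = client.models.list().data[0].id        # GET /v1/models
print("model:", model)

stream = client.completions.create(            # POST /v1/completions, streamed
    model=model,
    prompt="Hello",
    max_tokens=16,
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].text, end="", flush=True)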
Best Practices
- Warm-up period: Ignore first 30-60 seconds of metrics (cold start effects)
- Steady state: Look at metrics after several full context rotations
- Compare intervals: Watch for performance degradation over time
- Resource monitoring: Monitor CPU, memory, GPU usage alongside tmesh metrics
- Network stability: Run from stable network connection for accurate latency measurements
Support
For issues, questions, or contributions, visit the project repository.