Performance Optimization Guide

This guide provides performance optimization tips for running Webis on local or cloud servers. Because Webis serves large language models through vLLM, GPU memory and CPU throughput have a large impact on performance.

Hardware Recommendations

  • GPU: NVIDIA GPU supporting CUDA
    • Recommended: ≥ 8GB memory
    • Minimum: 6GB memory (requires lowering --memory-limit and using reduced precision)
  • CPU: ≥ 4 physical cores
  • System RAM: ≥ 16GB
  • Hard Drive: SSD (speeds up model loading)

GPU Memory Optimization

  1. Adjust Memory Utilization
    • By default, vLLM reserves 90% of GPU memory (gpu_memory_utilization=0.9)
    • Lower the limit when GPU memory is tight:

```bash
python scripts/start_model_server.py --memory-limit 0.6
```
  2. Use Lower Precision

    • float16: significantly reduces memory usage with minimal precision loss
    • int8 quantization: reduces memory further, but may degrade output quality

    Example (modify model loading in start_model_server.py):
```python
from vllm import LLM

model = LLM(
    model=model_path,
    tensor_parallel_size=1,
    gpu_memory_utilization=gpu_memory_utilization,
    trust_remote_code=True,
    dtype="float16",  # halves weight memory relative to float32
)
```
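
    If float16 alone is not enough, vLLM can also load pre-quantized checkpoints via its quantization argument. A minimal sketch, assuming an AWQ-quantized variant of the model exists (the path below is a placeholder):

```python
from vllm import LLM

# Load an AWQ-quantized checkpoint; the path is a placeholder and must
# point to a model that was actually quantized with AWQ.
model = LLM(
    model="/path/to/model-awq",
    tensor_parallel_size=1,
    gpu_memory_utilization=0.6,
    trust_remote_code=True,
    quantization="awq",
)
```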
  3. Free GPU Memory Before Starting

```bash
# List processes currently holding GPU memory
nvidia-smi
# Terminate a stale process by its PID (be careful not to kill active jobs)
kill -9 <PID>
```

CPU and Concurrency Optimization

  1. Increase Concurrent Requests

```bash
uvicorn scripts.start_model_server:app --host 0.0.0.0 --port 8000 --workers 2
```

    Note: uvicorn workers are separate processes, so if the model is loaded at startup, each worker holds its own copy; check GPU memory before adding workers.
  2. Batch Requests
    • vLLM can process multiple prompts in a single batch
    • Where possible, merge small requests into one API call
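
    As an illustration, vLLM's offline LLM.generate API accepts a list of prompts and batches them internally; a minimal sketch with a placeholder model path and prompts:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="/path/to/model")  # placeholder: local path or HF repo id
params = SamplingParams(temperature=0.7, max_tokens=128)

prompts = [
    "Summarize document A.",
    "Summarize document B.",
    "Summarize document C.",
]
# Passing a list lets vLLM schedule all prompts in one batch instead of
# paying per-request overhead for each one.
outputs = llm.generate(prompts, params)
for out in outputs:
    print(out.outputs[0].text)
```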

Disk and Model Loading

  1. Cache Models
    Downloaded models are cached in the Hugging Face cache directory:

```bash
~/.cache/huggingface/hub
```

  2. Use SSD Storage
    Loading a model from an SSD is much faster than from an HDD.
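
    If the default cache sits on a slow disk, the HF_HOME environment variable can redirect it to an SSD. A minimal sketch, assuming /mnt/ssd is a placeholder SSD mount point:

```python
import os

# Point the Hugging Face cache at an SSD; /mnt/ssd is a placeholder mount
# point. This must run before transformers/vLLM are imported so the new
# location is picked up.
os.environ["HF_HOME"] = "/mnt/ssd/huggingface"

from vllm import LLM  # imported after setting the env var on purpose

llm = LLM(model="/path/to/model")  # placeholder path
```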

Network Optimization

  1. Deploy cloud servers in the region closest to your users
  2. Use reverse proxies like Nginx and enable HTTP keep-alive
  3. Enable response compression (gzip)
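
Keep-alive also pays off on the client side: reusing one HTTP session avoids a new TCP handshake per request. A sketch using requests; the /generate endpoint shown here is hypothetical and should be replaced with the server's actual route:

```python
import requests

# A Session reuses the underlying TCP connection (HTTP keep-alive),
# which matters when many small requests hit the API in sequence.
session = requests.Session()

for prompt in ["first prompt", "second prompt"]:
    resp = session.post(
        "http://localhost:8000/generate",  # hypothetical endpoint
        json={"prompt": prompt, "max_tokens": 64},
    )
    resp.raise_for_status()
    print(resp.json())
```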

Common Optimization Scenarios

  • Low-Memory GPU (6GB):
    • --memory-limit 0.6
    • dtype="float16"
    • Reduce max_tokens in API requests
  • High-Concurrency API:
    • Increase uvicorn workers
    • Use request batching
  • Slow Model Loading:
    • Keep the service running; avoid frequent restarts
    • Store models on an SSD

Troubleshooting

  • "No available memory for cache blocks": Reduce --memory-limit or free up memory
  • "Free memory on device ... less than desired": Reduce memory utilization or close other GPU processes
  • "CUDA not found": Install NVIDIA drivers and CUDA toolkit