</code>
To fit the model on two RTX 3090s I was using //--dtype float16//...

UPDATE: Roo Code requires complex reasoning and understanding. The closest model I've found that can keep up is //openai/gpt-oss-120b//, served with the compose file below:

<code yaml>
services:
  vllm:
    container_name: vllm
    image: vllm/vllm-openai:latest   # official OpenAI-compatible server image
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0", "1", "2", "3"]   # four GPUs to match --tensor-parallel-size 4; adjust to your machine
              capabilities: [gpu]
    runtime: nvidia
    ports:
      - "8000:8000"   # host:container, vLLM's default API port
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface   # reuse the host's Hugging Face cache
    environment:
      - HUGGING_FACE_HUB_TOKEN=HFTOKEN_HERE
      - TORCH_CUDA_ARCH_LIST=8.6
      - NCCL_IB_DISABLE=1
      - NCCL_P2P_DISABLE=0
      # vLLM stability/performance tweaks
      - VLLM_WORKER_MULTIPROC_METHOD=spawn
      - CUDA_DEVICE_MAX_CONNECTIONS=1
      - VLLM_ATTENTION_BACKEND=TRITON_ATTN_VLLM_V1
    command: >
      --model openai/gpt-oss-120b
      --tensor-parallel-size 4
      --gpu-memory-utilization 0.90
      --dtype auto
      --max-model-len 131072
      --allowed-origins [\"...\"]
      --disable-fastapi-docs
      --hf-overrides '...'
      --disable-custom-all-reduce
    ipc: host
</code>
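
Once the container is up you can sanity-check the endpoint with any OpenAI-compatible client. A minimal sketch with the ''openai'' Python package, assuming the server is published on port 8000 as in the compose file and registers the model id as //openai/gpt-oss-120b// (check ''/v1/models'' if yours differs):

<code python>
# Quick sanity check of the vLLM OpenAI-compatible endpoint.
# Assumptions: server reachable on localhost:8000, no --api-key set.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# List the served models; the id printed here is what Roo Code needs
# as its model name.
for model in client.models.list().data:
    print(model.id)

# Minimal chat completion to confirm generation works end to end.
resp = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Reply with the single word: ok"}],
)
print(resp.choices[0].message.content)
</code>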
+ | |||
+ | U can prob. push the GPU mem util past 0.90. I've made it as far as 0.95, and more memory means larger KV cache and larger throughput. In Roo I had to activate high (max) reasoning to make it understand the complex requests. The current issue is that all code that // |
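
Roo's reasoning level is set in its provider settings, but the same request can be sketched directly against the API. Treat this as a sketch only: it assumes the vLLM server accepts a ''reasoning_effort'' field for gpt-oss models, which I haven't verified here.

<code python>
# Sketch: request high reasoning effort for a single chat completion.
# Assumption: the running vLLM build honours "reasoning_effort" for gpt-oss;
# if it does not, the request may be rejected or the field ignored.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Refactor this function to be iterative: ..."}],
    extra_body={"reasoning_effort": "high"},  # assumed field, see note above
)
print(resp.choices[0].message.content)
</code>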