[[https://roocode.com|Roo Code]]

//Roo Code is an open-source, AI-powered autonomous coding agent that lives in your editor.//

I am very new to Roo Code, and frankly, my main interest is not in actually using Roo Code but rather in getting it up and running with locally hosted models. Roo Code supports Ollama, and that is likely the simplest way to get it working locally, since Ollama is super simple to get running. I have, however, chosen to go for vLLM because I'm aiming for the highest throughput and because I want to check out vLLM. My initial test is with a //deepseek-ai// model, using the Docker Compose file below:
<code yaml>
services:
  vllm:
    container_name: vllm
    image: vllm/vllm-openai:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0", "1"]
              capabilities: [gpu]
    runtime: nvidia
    ports:
      - "8000:8000"
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    environment:
      - HUGGING_FACE_HUB_TOKEN=<your_hf_token>
    command: >
      --model deepseek-ai/
      --tensor-parallel-size 2
      --gpu-memory-utilization 0.95
      --max-model-len 16384
      --allowed-origins [\"*\"]
      --dtype float16
    ipc: host
</code>
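
Before pointing Roo Code at the server, it's worth a quick sanity check that the OpenAI-compatible endpoint answers. Below is a minimal sketch using the ''openai'' Python client, assuming the server is exposed on localhost:8000 as in the compose file above; the served model name is read from the server rather than hard-coded.

<code python>
# Minimal sanity check of vLLM's OpenAI-compatible API.
# Assumes the compose file above publishes the server on localhost:8000.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",  # vLLM only checks the key if --api-key was set
)

# Ask the server which model it serves instead of hard-coding the name.
model = client.models.list().data[0].id
print(f"Serving: {model}")

reply = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
    max_tokens=32,
)
print(reply.choices[0].message.content)
</code>

In Roo Code the same endpoint is then configured under the //OpenAI Compatible// provider by entering the base URL (http://localhost:8000/v1), any dummy API key, and the model ID reported above.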

UPDATE: Roo Code requires complex reasoning and understanding. The closest I've come to getting it to (somewhat) do what it is supposed to is with OpenAI's //gpt-oss-120b//:

<code yaml>
services:
  vllm:
    container_name: vllm
    image: vllm/vllm-openai:latest
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0", "1", "2", "3"]
              capabilities: [gpu]
    runtime: nvidia
    ports:
      - "8000:8000"
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    environment:
      - HUGGING_FACE_HUB_TOKEN=HFTOKEN_HERE
      - TORCH_CUDA_ARCH_LIST=8.6
      - NCCL_IB_DISABLE=1
      - NCCL_P2P_DISABLE=0
      # vLLM stability/compatibility tweaks
      - VLLM_WORKER_MULTIPROC_METHOD=spawn
      - CUDA_DEVICE_MAX_CONNECTIONS=1
      - VLLM_ATTENTION_BACKEND=TRITON_ATTN_VLLM_V1
    command: >
      --model openai/gpt-oss-120b
      --tensor-parallel-size 4
      --gpu-memory-utilization 0.90
      --dtype auto
      --max-model-len 131072
      --allowed-origins [\"*\"]
      --disable-fastapi-docs
      --hf-overrides '
      --disable-custom-all-reduce
    ipc: host
</code>
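
The same kind of quick test works against the //gpt-oss-120b// server. A small sketch, again assuming port 8000; the ''reasoning_effort'' field is an assumption here: it is sent as an extra body parameter and only has an effect if the vLLM build honours it for gpt-oss (inside Roo Code the reasoning level is instead set in the provider/model settings).

<code python>
# Quick chat test against the gpt-oss-120b container defined above.
# Assumes the OpenAI-compatible endpoint is published on localhost:8000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

reply = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[
        {"role": "user", "content": "Outline the steps to add a /health endpoint to a small web service."},
    ],
    max_tokens=512,
    # Assumption: request high reasoning effort; drop this line if the
    # server rejects the extra field.
    extra_body={"reasoning_effort": "high"},
)
print(reply.choices[0].message.content)
</code>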

You can probably push the GPU memory utilization past 0.90. I've made it as far as 0.95, and more memory means a larger KV cache and higher throughput. In Roo I had to activate high (max) reasoning to make it understand the complex requests. The current issue is that all code that //gpt-oss-120b// ...