[[https://roocode.com|Roo Code]]
| + | |||
| + | //Roo Code is an open-source, | ||
| + | |||
I am very new to Roo Code, and frankly, my main interest is not in actually using Roo Code but rather in getting it up and running with locally hosted models. Roo Code supports Ollama, and that is likely the simplest way to get it working locally, as Ollama is super simple to get running. I have, however, chosen vLLM because I'm aiming for the highest throughput and because I want to check out vLLM. My initial test is with a DeepSeek model, using the Docker Compose file below.

| + | < | ||
| + | services: | ||
| + | vllm: | ||
| + | container_name: | ||
| + | image: vllm/ | ||
| + | deploy: | ||
| + | resources: | ||
| + | reservations: | ||
| + | devices: | ||
| + | - driver: nvidia | ||
| + | device_ids: [" | ||
| + | capabilities: | ||
| + | runtime: nvidia | ||
| + | ports: | ||
| + | - " | ||
| + | volumes: | ||
| + | - ~/ | ||
| + | environment: | ||
| + | - HUGGING_FACE_HUB_TOKEN=< | ||
| + | command: > | ||
| + | --model deepseek-ai/ | ||
| + | --tensor-parallel-size 2 | ||
| + | --gpu-memory-utilization 0.95 | ||
| + | --max-model-len 16384 | ||
| + | --allowed-origins [\" | ||
| + | --dtype float16 | ||
| + | ipc: host | ||
| + | </ | ||
| + | |||
UPDATE: Roo Code requires complex reasoning and understanding from the model. The closest I've come to getting it to do roughly what it is supposed to do is with OpenAI's //gpt-oss-120b//, using the Docker Compose file below.

| + | < | ||
| + | services: | ||
| + | vllm: | ||
| + | container_name: | ||
| + | image: vllm/ | ||
| + | restart: unless-stopped | ||
| + | deploy: | ||
| + | resources: | ||
| + | reservations: | ||
| + | devices: | ||
| + | - driver: nvidia | ||
| + | device_ids: [" | ||
| + | capabilities: | ||
| + | runtime: nvidia | ||
| + | ports: | ||
| + | - " | ||
| + | volumes: | ||
| + | - ~/ | ||
| + | environment: | ||
| + | - HUGGING_FACE_HUB_TOKEN=HFTOKEN_HERE | ||
| + | - TORCH_CUDA_ARCH_LIST=8.6 | ||
| + | - NCCL_IB_DISABLE=1 | ||
| + | - NCCL_P2P_DISABLE=0 | ||
| + | # vLLM stability/ | ||
| + | - VLLM_WORKER_MULTIPROC_METHOD=spawn | ||
| + | - CUDA_DEVICE_MAX_CONNECTIONS=1 | ||
| + | - VLLM_ATTENTION_BACKEND=TRITON_ATTN_VLLM_V1 | ||
| + | command: > | ||
| + | --model openai/ | ||
| + | --tensor-parallel-size 4 | ||
| + | --gpu-memory-utilization 0.90 | ||
| + | --dtype auto | ||
| + | --max-model-len 131072 | ||
| + | --allowed-origins [\" | ||
| + | --disable-fastapi-docs | ||
| + | --hf-overrides ' | ||
| + | --disable-custom-all-reduce | ||
| + | |||
| + | ipc: host | ||
| + | |||
| + | </ | ||
| + | |||
You can probably push the GPU memory utilization past 0.90. I've made it as far as 0.95, and more memory means a larger KV cache and higher throughput. In Roo Code I had to activate high (max) reasoning to make it understand the complex requests. The current issue is that all code that ...
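
To see whether the larger KV cache actually translates into throughput, a rough check is to fire a batch of concurrent requests and compare tokens per second between two --gpu-memory-utilization settings. Below is a minimal sketch with the async OpenAI client, assuming the same local endpoint and model name as above; it is only a rough indication, not a proper benchmark.

<code python>
# Rough throughput check against the local vLLM server: send N identical
# requests concurrently and report completion tokens per second.
# Endpoint and model name are assumed to match the Compose file above.
import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
MODEL = "openai/gpt-oss-120b"
N_REQUESTS = 16

async def one_request() -> int:
    response = await client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": "Explain what a KV cache is in two sentences."}],
        max_tokens=128,
    )
    return response.usage.completion_tokens

async def main() -> None:
    start = time.perf_counter()
    counts = await asyncio.gather(*(one_request() for _ in range(N_REQUESTS)))
    elapsed = time.perf_counter() - start
    total = sum(counts)
    print(f"{total} completion tokens in {elapsed:.1f}s -> {total / elapsed:.1f} tok/s")

asyncio.run(main())
</code>

Running it once with the server started at 0.90 and once at 0.95 shows whether the extra memory pays off on your hardware.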