[[https://roocode.com|Roo Code]] is an extension for VS Code; quoting their own description verbatim:
//Roo Code is an open-source, AI-powered coding assistant that runs in VS Code. It goes beyond simple autocompletion by reading and writing across multiple files, executing commands, and adapting to your workflow—like having a whole dev team right inside your editor.//
I am very new to Roo Code, and frankly, my main interest is not in actually using Roo Code but rather in getting it up and running with locally hosted models. Roo Code supports Ollama, and that is probably the simplest path since Ollama is trivial to set up. I have, however, chosen vLLM because I'm aiming for the highest throughput and because I want to check out vLLM itself. My initial test is with //DeepSeek-R1-Distill-Qwen-14B//, and the following Docker Compose file is what I use to get vLLM up and running:
services:
  vllm:
    container_name: vllm
    image: vllm/vllm-openai:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0", "1"]
              capabilities: [gpu]
    runtime: nvidia
    ports:
      - "8001:8000" # vLLM's OpenAI-compatible API, exposed on host port 8001
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    environment:
      - HUGGING_FACE_HUB_TOKEN=
    command: >
      --model deepseek-ai/DeepSeek-R1-Distill-Qwen-14B
      --tensor-parallel-size 2
      --gpu-memory-utilization 0.95
      --max-model-len 16384
      --allowed-origins [\"*\"]
      --dtype float16
    ipc: host
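Once the container is up, vLLM serves an OpenAI-compatible API on host port 8001. Below is a minimal sketch for verifying the endpoint before pointing Roo Code's OpenAI-compatible provider at http://localhost:8001/v1. It assumes the official openai Python package is installed; the api_key value is arbitrary, since this compose file does not pass --api-key to vLLM.

# Minimal sketch: verify the vLLM OpenAI-compatible endpoint before wiring it into Roo Code.
# Assumes `pip install openai`; the api_key is a dummy value because no --api-key is configured.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8001/v1", api_key="not-needed")

# List the served models -- should show deepseek-ai/DeepSeek-R1-Distill-Qwen-14B.
for model in client.models.list():
    print(model.id)

# Send a small chat request to confirm generation works end to end.
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-14B",
    messages=[{"role": "user", "content": "Write a Python one-liner that reverses a string."}],
    max_tokens=256,
)
print(response.choices[0].message.content)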
UPDATE: Roo Code requires complex reasoning and understanding. The closest I've come to getting it to do roughly what it's supposed to is with OpenAI's //gpt-oss-120b//. Here's the compose file:
services:
  vllm:
    container_name: vllm-openai-oss
    image: vllm/vllm-openai:gptoss
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0", "1", "2", "3"]
              capabilities: [gpu]
    runtime: nvidia
    ports:
      - "8001:8000"
    volumes:
      - ~/.cache:/root/.cache
    environment:
      - HUGGING_FACE_HUB_TOKEN=HFTOKEN_HERE
      - TORCH_CUDA_ARCH_LIST=8.6 # compile CUDA extensions only for the 3090 architecture
      - NCCL_IB_DISABLE=1
      - NCCL_P2P_DISABLE=0
      # vLLM stability/perf knobs
      - VLLM_WORKER_MULTIPROC_METHOD=spawn
      - CUDA_DEVICE_MAX_CONNECTIONS=1
      - VLLM_ATTENTION_BACKEND=TRITON_ATTN_VLLM_V1 # <-- force Triton backend on Ampere
    command: >
      --model openai/gpt-oss-120b
      --tensor-parallel-size 4
      --gpu-memory-utilization 0.90
      --dtype auto
      --max-model-len 131072
      --allowed-origins [\"*\"]
      --disable-fastapi-docs
      --hf-overrides '{"sink_token_len": 0, "use_sliding_window": false}'
      --disable-custom-all-reduce
    ipc: host
You can probably push the GPU memory utilization past 0.90; I've gone as far as 0.95, and more memory means a larger KV cache and higher throughput. In Roo I had to set reasoning to high (max) to make it understand the complex requests. The current issue is that all the code //gpt-oss-120b// generates gets stuck inside the "thinking box" in the Roo panel: a lot of code is produced, but it is never actually written to any file... I have to think more about it.
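To narrow down whether that is a vLLM-side or a Roo-side problem, it helps to query the model directly and see where the answer actually lands. The sketch below reuses the same endpoint as above; note that a separate reasoning field is not part of the standard OpenAI schema, and whether vLLM returns one (and under what name) depends on its reasoning-parser configuration for gpt-oss, so treat "reasoning_content" here as an assumption.

# Minimal sketch: check whether gpt-oss-120b's answer ends up in message.content
# or only in a separate reasoning channel (field name is an assumption; it depends
# on how vLLM's reasoning parsing is configured).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8001/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Create a file hello.py that prints 'hello'."}],
    max_tokens=1024,
)

message = response.choices[0].message
print("content:  ", repr(message.content))
# Fall back gracefully if the server does not return a separate reasoning field.
print("reasoning:", repr(getattr(message, "reasoning_content", None)))

If the generated code shows up under the reasoning field but message.content stays empty, the model's output is being routed into the reasoning channel, which would explain why Roo only ever shows it inside the thinking box.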