Roo Code is an extension for VS Code; to quote them verbatim:
Roo Code is an open-source, AI-powered coding assistant that runs in VS Code. It goes beyond simple autocompletion by reading and writing across multiple files, executing commands, and adapting to your workflow—like having a whole dev team right inside your editor.
I am very new to Roo Code, and frankly, my main interest is not in using Roo Code itself but in getting it up and running with locally hosted models. Roo Code supports Ollama, and that is likely the simplest way to get it working locally, since Ollama is very easy to set up. I have, however, chosen vLLM because I'm aiming for the highest throughput and because I want to check out vLLM. My initial test is with DeepSeek-R1-Distill-Qwen-14B, and the following docker compose file is what I use to get vLLM up and running.
services:
  vllm:
    container_name: vllm
    image: vllm/vllm-openai:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0", "1"]
              capabilities: [gpu]
    runtime: nvidia
    ports:
      - "8001:8000"
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    environment:
      - HUGGING_FACE_HUB_TOKEN=<token here!>
    command: >
      --model deepseek-ai/DeepSeek-R1-Distill-Qwen-14B
      --tensor-parallel-size 2
      --gpu-memory-utilization 0.95
      --max-model-len 16384
      --allowed-origins [\"*\"]
      --dtype float16
    ipc: host
To fit the model on two RTX 3090s I'm using --dtype float16 and since I'm… TBC
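As a rough back-of-the-envelope check on why float16 and tensor parallelism are needed here (my own approximate numbers, not taken from the vLLM docs): at roughly 14.8 billion parameters and 2 bytes per parameter, the weights alone are close to 30 GB, which does not fit on a single 24 GB RTX 3090, while two cards at 95% memory utilization leave something like 16 GB for the KV cache and activations.

  # Rough VRAM estimate for serving DeepSeek-R1-Distill-Qwen-14B in float16
  # on two RTX 3090s. The parameter count is an approximation and the exact
  # KV-cache footprint depends on vLLM internals, so treat this as a sanity
  # check only, not an exact accounting.

  PARAMS = 14.8e9          # approximate parameter count of the 14B model
  BYTES_PER_PARAM = 2      # float16
  GPU_VRAM_GB = 24         # per RTX 3090
  NUM_GPUS = 2
  GPU_MEM_UTIL = 0.95      # matches --gpu-memory-utilization 0.95

  weights_gb = PARAMS * BYTES_PER_PARAM / 1e9
  budget_gb = GPU_VRAM_GB * NUM_GPUS * GPU_MEM_UTIL
  kv_cache_gb = budget_gb - weights_gb

  print(f"weights:   ~{weights_gb:.1f} GB (split across {NUM_GPUS} GPUs)")
  print(f"budget:    ~{budget_gb:.1f} GB total")
  print(f"left over: ~{kv_cache_gb:.1f} GB for KV cache and activations")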

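Once the container is up, the easiest way to check that Roo Code will have something to talk to is to hit the OpenAI-compatible API that vLLM exposes on the mapped port. Below is a minimal sketch using the openai Python client, assuming the compose file above is running; the localhost URL follows from the 8001:8000 port mapping, and the dummy API key is just a placeholder since vLLM does not check one unless you start it with --api-key.

  # Minimal smoke test against the vLLM OpenAI-compatible endpoint started above.
  from openai import OpenAI

  client = OpenAI(
      base_url="http://localhost:8001/v1",  # host port from the compose file
      api_key="not-needed",                 # no key required by default
  )

  response = client.chat.completions.create(
      model="deepseek-ai/DeepSeek-R1-Distill-Qwen-14B",
      messages=[{"role": "user", "content": "Write a one-line Python hello world."}],
      max_tokens=256,
  )
  print(response.choices[0].message.content)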