</code>
  
UPDATE: Roo Code requires complex reasoning and understanding. The closest I've come to getting it to work as intended is with OpenAI's //gpt-oss-120b//. Here's the compose:
  
<code>
services:
  vllm:
    container_name: vllm-openai-oss
    image: vllm/vllm-openai:gptoss
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0", "1", "2", "3"]
              capabilities: [gpu]
    runtime: nvidia
    ports:
      - "8001:8000"
    volumes:
      - ~/.cache:/root/.cache
    environment:
      - HUGGING_FACE_HUB_TOKEN=HFTOKEN_HERE
      - TORCH_CUDA_ARCH_LIST=8.6  # compile CUDA extensions only for the 3090 architecture
      - NCCL_IB_DISABLE=1
      - NCCL_P2P_DISABLE=0
      # vLLM stability/perf knobs
      - VLLM_WORKER_MULTIPROC_METHOD=spawn
      - CUDA_DEVICE_MAX_CONNECTIONS=1
      - VLLM_ATTENTION_BACKEND=TRITON_ATTN_VLLM_V1  # force the Triton attention backend on Ampere
    command: >
      --model openai/gpt-oss-120b
      --tensor-parallel-size 4
      --gpu-memory-utilization 0.90
      --dtype auto
      --max-model-len 131072
      --allowed-origins [\"*\"]
      --disable-fastapi-docs
      --hf-overrides '{"sink_token_len": 0, "use_sliding_window": false}'
      --disable-custom-all-reduce
    ipc: host
</code>
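Once the container is up and the model has finished loading, a quick sanity check confirms the OpenAI-compatible endpoint is serving the model. This is a minimal sketch, assuming you run it on the same host so the API is reachable at //localhost:8001// (the host port mapped in the compose above):

<code>
# List the served models; should return openai/gpt-oss-120b
curl http://localhost:8001/v1/models

# Minimal chat completion round-trip
curl http://localhost:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "openai/gpt-oss-120b",
        "messages": [{"role": "user", "content": "Reply with one word: ping"}],
        "max_tokens": 32
      }'
</code>

If the model list comes back but the chat call hangs, the weights are most likely still loading across the four GPUs; give it a few minutes and retry.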
You can probably push the GPU memory utilization past 0.90; I've gone as high as 0.95, and more memory means a larger KV cache and higher throughput. In Roo I had to enable high (max) reasoning to make it understand the complex requests. The current issue is that all code //gpt-oss-120b// generates gets stuck inside the "thinking box" in the Roo panel: a lot of code is generated, but it is never actually written to any file. Have to think more about it.
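For reference, the Roo Code side is just an "OpenAI Compatible" provider pointed at this endpoint. Roughly these settings match the compose above (field names vary a bit between Roo versions, so treat this as a sketch rather than exact UI labels):

<code>
API Provider : OpenAI Compatible
Base URL     : http://<vllm-host>:8001/v1
API Key      : any non-empty string (vLLM accepts anything unless started with --api-key)
Model ID     : openai/gpt-oss-120b
Reasoning    : high / max, as noted above
</code>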