</code>
  
UPDATE: Roo Code requires complex reasoning and understanding. The closest I've come to getting it to work as intended is with OpenAI's //gpt-oss-120b//. Here's the compose:
  
<code>
services:
  vllm:
    container_name: vllm-openai-oss
    image: vllm/vllm-openai:gptoss
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0", "1", "2", "3"]
              capabilities: [gpu]
    runtime: nvidia
    ports:
      - "8001:8000"
    volumes:
      - ~/.cache:/root/.cache
    environment:
      - HUGGING_FACE_HUB_TOKEN=HFTOKEN_HERE
      - TORCH_CUDA_ARCH_LIST=8.6  # compile CUDA extensions only for the 3090 architecture
      - NCCL_IB_DISABLE=1
      - NCCL_P2P_DISABLE=0
      # vLLM stability/perf knobs
      - VLLM_WORKER_MULTIPROC_METHOD=spawn
      - CUDA_DEVICE_MAX_CONNECTIONS=1
      - VLLM_ATTENTION_BACKEND=TRITON_ATTN_VLLM_V1  # force the Triton attention backend on Ampere
    command: >
      --model openai/gpt-oss-120b
      --tensor-parallel-size 4
      --gpu-memory-utilization 0.90
      --dtype auto
      --max-model-len 131072
      --allowed-origins [\"*\"]
      --disable-fastapi-docs
      --hf-overrides '{"sink_token_len": 0, "use_sliding_window": false}'
      --disable-custom-all-reduce
    ipc: host
</code>
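Once the container is up and the model has finished loading, a quick sanity check confirms the OpenAI-compatible endpoint is serving the model. This is a minimal sketch, assuming you run it on the same host so the API is reachable at //localhost:8001// (the host port mapped in the compose above):

<code>
# List the served models; should return openai/gpt-oss-120b
curl http://localhost:8001/v1/models

# Minimal chat completion round-trip
curl http://localhost:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "openai/gpt-oss-120b",
        "messages": [{"role": "user", "content": "Reply with one word: ping"}],
        "max_tokens": 32
      }'
</code>

If the model list comes back but the chat call hangs, the weights are most likely still loading across the four GPUs; give it a few minutes and retry.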
You can probably push the GPU memory utilization past 0.90; I've gone as high as 0.95, and more memory means a larger KV cache and higher throughput. In Roo I had to enable high (max) reasoning to make it understand the complex requests. The current issue is that all code //gpt-oss-120b// generates gets stuck inside the "thinking box" in the Roo panel: a lot of code is generated, but it is never actually written to any file. Have to think more about it.
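For reference, the Roo Code side is just an "OpenAI Compatible" provider pointed at this endpoint. Roughly these settings match the compose above (field names vary a bit between Roo versions, so treat this as a sketch rather than exact UI labels):

<code>
API Provider : OpenAI Compatible
Base URL     : http://<vllm-host>:8001/v1
API Key      : any non-empty string (vLLM accepts anything unless started with --api-key)
Model ID     : openai/gpt-oss-120b
Reasoning    : high / max, as noted above
</code>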