Unverified community knowledge from r/LocalLLaMA, generated by Nemotron 9B
Opus 4.6 Audit Insights (2026-03-16)
- The version pins (PyTorch 2.8.0 cu128, vLLM 0.10.2, FlashInfer 0.3.1) were accurate at the time, though newer versions are now available.
- The 80 tok/s throughput figure is consistent with other RTX Pro 6000 benchmarks (articles 12, 13).
- The warning not to rely on Claude or ChatGPT is a fair observation about the limits of AI assistants when configuring a cutting-edge stack.
Serving Qwen3‑Next‑80B‑A3B‑Instruct (FP8) on Windows 11 WSL2 with vLLM and FlashInfer — A Real‑World Tale
Score: 86 upvotes (Reddit r/LocalLLaMA, community post)
To run Qwen3‑Next‑80B‑A3B‑Instruct FP8 on a Blackwell GPU under WSL2, use PyTorch 2.8.0 (cu128), vLLM 0.10.2, and FlashInfer ≥ 0.3.1 (preferably 0.3.1). Pin the nightly cu128 vLLM image, expose /dev/dxg and /usr/lib/wsl/lib, and run a small run.sh that installs the precise userspace stack and starts the OpenAI‑compatible server. This configuration reaches 80 tokens per second on a single stream, leveraging both FlashInfer and CUDA graphs — a setup that contradicts common advice from some AI assistants.
The Reddit post from r/LocalLLaMA details a successful deployment of the massive Qwen3‑Next‑80B‑A3B‑Instruct model in FP8 format on a Blackwell‑based RTX PRO 6000 (96 GB VRAM) via WSL2, vLLM, and FlashInfer. The author shares a step‑by‑step Docker configuration that resolves several known pitfalls, including the "FlashInfer requires sm75+" crash, and achieves 80 tok/s throughput.
Below is a technical blog article summarizing the post's claims and instructions, staying close to the post's own wording, numbers, and terminology.
According to the post, the following stack must be pinned to avoid crashes and ensure compatibility:
The post stresses that FlashInfer must be upgraded because the default version triggers a crash on Blackwell GPUs. With FlashInfer 0.3.1, the system enables CUDA graphs and delivers 80 tokens per second in a single stream.
| Component | Specification |
|---|---|
| OS | Windows 11 + WSL2 (Ubuntu) |
| GPU | RTX PRO 6000 Blackwell (96 GB) |
| PyTorch | 2.8.0 (cu128) |
| vLLM | 0.10.2 (nightly cu128 image) |
| FlashInfer | ≥ 0.3.1 (0.3.1 preferred) |
| Serving engine | vLLM OpenAI‑compatible server |
| Model | TheClusterDev/Qwen3-Next-80B-A3B-Instruct-FP8-Dynamic (80 B total, ~3 B activated per token) |
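A quick back-of-envelope check (not from the post) shows why the 96 GB card is a comfortable fit for this model: at FP8, each parameter occupies roughly one byte, so the weights alone come to about 74 GiB, leaving headroom for the KV cache and CUDA graph buffers.

```shell
# Rough VRAM estimate for an 80B-parameter model in FP8.
# Assumption: 1 byte per parameter (FP8), ignoring KV cache and activations.
params=80000000000
bytes_per_param=1
gib=$(( params * bytes_per_param / 1024 / 1024 / 1024 ))
echo "Approx weight footprint: ${gib} GiB"  # ~74 GiB, within the 96 GB card
```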
The author shares a Docker run command that incorporates all required parameters:
```bash
docker run --rm --name vllm-qwen \
  --gpus all \
  --ipc=host \
  -p 8000:8000 \
  --entrypoint bash \
  --device /dev/dxg \
  -v /usr/lib/wsl/lib:/usr/lib/wsl/lib:ro \
  -e LD_LIBRARY_PATH="/usr/lib/wsl/lib:$LD_LIBRARY_PATH" \
  -e HUGGING_FACE_HUB_TOKEN="$HF_TOKEN" \
  -e HF_TOKEN="$HF_TOKEN" \
  -e VLLM_ATTENTION_BACKEND=FLASHINFER \
  -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
  -v "$HOME/.cache/torch:/root/.cache/torch" \
  -v "$HOME/.triton:/root/.triton" \
  -v /data/models/qwen3_next_fp8:/models \
  -v "$PWD/run-vllm-qwen.sh:/run.sh:ro" \
  lmcache/vllm-openai:latest-nightly-cu128 \
  -lc '/run.sh'
```
Each flag in the command serves a specific purpose:

- `--device /dev/dxg` – expose the WSL GPU device node so the container can access the NVIDIA driver.
- `-v /usr/lib/wsl/lib:/usr/lib/wsl/lib:ro` – mount the WSL CUDA stub directory, providing `libcuda.so.1` for PyTorch's `dlopen`.
- `-e LD_LIBRARY_PATH=...` – prepend the WSL library path so the container finds `libcuda.so.1`.
- `-p 8000:8000` – bind the OpenAI-compatible API port.
- `--entrypoint bash` with `-lc '/run.sh'` – execute the custom script that installs dependencies and starts vLLM.

The run.sh script

While the full script is truncated in the original post, its purpose is clear: it installs PyTorch 2.8.0 (cu128), FlashInfer 0.3.1, and any other runtime dependencies, then launches vLLM with the correct arguments. The script avoids relying on pre‑installed system packages that may differ across WSL2 installations.
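Because the original script is truncated, the following is only a plausible reconstruction based on the post's description, not the author's actual `run.sh`. The PyTorch wheel index URL, the `flashinfer-python` package name, and the `vllm serve` flags are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of run.sh -- NOT the author's actual script, which is
# truncated in the original post. Versions follow the post's prose.
set -euo pipefail

# Pin the userspace stack described in the post (index URL is an assumption).
pip install torch==2.8.0 --index-url https://download.pytorch.org/whl/cu128
pip install vllm==0.10.2 flashinfer-python==0.3.1

# Start the OpenAI-compatible server on the mounted model path.
exec vllm serve /models --host 0.0.0.0 --port 8000
```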
With the above configuration, the author reports roughly 80 tokens per second on a single stream, with both FlashInfer and CUDA graphs enabled.
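Once the server is up, a smoke test against the OpenAI-compatible endpoint might look like the following. This is a generic sketch, not from the post; the served model name is an assumption and should be confirmed via the `/v1/models` endpoint.

```shell
# List the models vLLM registered (confirms the server is reachable).
curl -s http://localhost:8000/v1/models

# Send a minimal chat completion; "model" must match a name from /v1/models
# (here assumed to be the mounted path /models).
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "/models",
        "messages": [{"role": "user", "content": "Say hello in one word."}],
        "max_tokens": 16
      }'
```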
This community‑contributed guide demonstrates that, with precise version pinning and careful WSL2 GPU exposure, it is possible to serve a massive 80 B FP8 MoE model on a Blackwell GPU via vLLM and FlashInfer. The key steps are:
- Pin PyTorch 2.8.0 (cu128), vLLM 0.10.2, and FlashInfer 0.3.1.
- Mount /dev/dxg and /usr/lib/wsl/lib to expose the WSL CUDA environment.
- Install the userspace stack from a custom run.sh to ensure consistent dependencies.

The post, with its 86 upvotes, offers one of the most detailed working configurations available for this niche deployment.
Disclaimer: This article summarizes unverified community information from a Reddit post (r/LocalLLaMA). The details are presented as‑is and may require further validation for production use.