Replace 80% of your Cursor / Copilot / Claude Code spend with a hybrid local + free-cloud AI stack for VS Code — without sacrificing quality where it matters.
A production-ready coding setup that runs autocomplete and embeddings locally on your GPU, routes routine agent work through NVIDIA NIM's free endpoint models, and reserves Claude Code for the hardest 5% of tasks. Includes start/stop scripts, configs, and tested model combos.
Quick Start · Hardware Requirements · Cline Combos · Troubleshooting · Contributing
| Capability | Tool | Cost |
|---|---|---|
| Tab autocomplete (sub-500ms) | Continue.dev + local Qwen 1.5B | Free |
| Quick "explain this code" chat | Continue.dev + NIM MiniMax M2.7 | Free (free endpoint) |
| Whole-codebase semantic search (@codebase) | Continue.dev + nomic embeddings | Free |
| Multi-file agent tasks (refactors, features) | Cline + NIM MiniMax M2.7 | Free (free endpoint) |
| Hard agentic tasks | Cline + NIM Kimi K2.6 / GLM-5.1 | Free credits (~10–50 per task) |
| Architectural / must-not-fail work | Claude Code | Paid (only when needed) |
Realistic monthly split: ~50% local · ~35% NIM free · ~10% NIM credits · ~5% Claude Code.
```
local-nim-coding-stack/
├── scripts/
│   ├── start-dev.ps1        # Start Ollama + pin models in VRAM
│   └── stop-dev.ps1         # Unload models + kill Ollama
├── configs/
│   ├── continue-config.yaml # Continue.dev config template
│   └── cline-config.md      # Cline settings + Plan/Act combos
└── docs/
    ├── HARDWARE.md          # VRAM tiers + Mac/Linux notes
    └── TROUBLESHOOTING.md   # Common errors and fixes
```
```powershell
# 1. Install Ollama
#    https://ollama.com/download/windows

# 2. Pull all models (~16 GB total)
ollama pull qwen2.5-coder:1.5b-base-q8_0
ollama pull qwen2.5-coder:7b
ollama pull qwen2.5-coder:14b
ollama pull nomic-embed-text:latest

# 3. Disable Ollama autostart
#    Ctrl+Shift+Esc → Startup apps → disable Ollama

# 4. Get a NIM key at https://build.nvidia.com (free)

# 5. Install VS Code extensions: Continue + Cline
#    Copy configs/continue-config.yaml to %USERPROFILE%\.continue\config.yaml
#    Replace nvapi-PASTE-YOUR-ACTUAL-KEY-HERE with your real key
#    Configure Cline (see configs/cline-config.md)

# 6. Add shell aliases to your PowerShell profile
notepad $PROFILE
```

Add to your `$PROFILE`:

```powershell
function dev-start {
    param([switch]$Heavy, [switch]$Agent)
    & "$HOME\path\to\scripts\start-dev.ps1" -Heavy:$Heavy -Agent:$Agent
}
function dev-stop { & "$HOME\path\to\scripts\stop-dev.ps1" }
```

Then reload it: `. $PROFILE`
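A quick smoke test confirms the wiring (hedged: assumes the script paths above point at the real files):

```powershell
dev-start     # should pin the 1.5B + embedding models in VRAM
ollama ps     # verify both show up as loaded
dev-stop      # and tear it all down again
```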
Targets 12 GB VRAM laptop GPUs (RTX 4070 Ti, 4080 mobile, 5070 Ti mobile). Also needs 32 GB RAM and ~16 GB disk.
| VRAM | Adjustments |
|---|---|
| 8 GB | Skip the 14B local agent; use 7B as fallback. |
| 12 GB | Use as written. Recommended. |
| 16 GB | Replace 14B with Qwen 2.5 Coder 32B Q4_K_M. |
| 24 GB+ | Replace with Qwen3-Coder-30B-A3B for near-NIM quality locally. |
Full details including Mac/Linux bash equivalents: docs/HARDWARE.md
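Not sure which tier applies to you? `nvidia-smi` ships with the NVIDIA driver and reports VRAM directly:

```powershell
# Total and currently free VRAM, per GPU
nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv
```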
- Ollama over LM Studio for daily work — CLI-driven, scriptable start/stop, lighter footprint.
- Continue.dev for autocomplete + chat — best free VS Code extension for inline AI; FIM-aware autocomplete; supports local + cloud through one config.
- Cline for agent tasks — autonomous coding agent with explicit per-step approval; OpenAI-compatible (works with NIM); Plan/Act mode enables the cost-saving combos below.
- NVIDIA NIM as default cloud — Free Endpoint models (no credit cost) covering 230B+ coders, plus 1,000 free credits for premium models. OpenAI-compatible.
- Local fallback — NIM policy can change anytime. Keeping local models loaded means you're never blocked.
- Claude Code reserved for premium — best-in-class for novel architecture and complex debugging; 5% of tasks, not 100%.
Download from https://ollama.com/download/windows, run the installer. Verify:
```powershell
ollama --version
```

```powershell
ollama pull qwen2.5-coder:1.5b-base-q8_0   # autocomplete (~1.6 GB)
ollama pull qwen2.5-coder:7b               # local chat fallback (~4.7 GB)
ollama pull qwen2.5-coder:14b              # local agent fallback (~9 GB)
ollama pull nomic-embed-text:latest        # embeddings (~274 MB)
```

Why `:1.5b-base-q8_0`? The `-base` suffix is FIM-trained for tab completion. Q8 quantization preserves quality on small models with negligible speed cost.
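To see FIM in action outside the editor, you can hit Ollama's generate API directly. A minimal sketch, assuming the model's template wires up the `suffix` field (it should for FIM-trained bases):

```powershell
# Ask the base model to fill in the middle between prompt and suffix
$body = @{
    model   = "qwen2.5-coder:1.5b-base-q8_0"
    prompt  = "def fib(n):`n    "     # code before the cursor
    suffix  = "`nprint(fib(10))"      # code after the cursor
    stream  = $false
    options = @{ num_predict = 64 }
} | ConvertTo-Json

(Invoke-RestMethod -Uri "http://localhost:11434/api/generate" `
    -Method Post -Body $body -ContentType "application/json").response
```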
You only want Ollama running while you work, not eating RAM 24/7.
- `Ctrl+Shift+Esc` → Startup apps tab → find Ollama → right-click → Disable
- System tray → right-click the llama icon → Quit Ollama
Sign up at https://build.nvidia.com → generate an API key (starts with `nvapi-`). You get 1,000 free credits plus access to Free Endpoint models.
- Continue (publisher: Continue)
- Cline (publisher: saoudrizwan)
- Copy `configs/continue-config.yaml` to `%USERPROFILE%\.continue\config.yaml`
- Replace `nvapi-PASTE-YOUR-ACTUAL-KEY-HERE` with your actual NIM key
- In VS Code: `Ctrl+Shift+P` → Developer: Reload Window

Security note: the key sits in plaintext in this config. The folder lives in your user home, not in a project repo, but make sure `~/.continue/` isn't synced via OneDrive/Dropbox.
See configs/cline-config.md for full settings. Quick summary: Cline icon → settings gear → fill in NIM endpoint, your API key, set Plan/Act combo.
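Before wiring the key into Cline, a 10-second smoke test against the OpenAI-compatible endpoint is worth it. A hedged sketch — the base URL is NIM's standard one, and `NVIDIA_API_KEY` is just a placeholder env var for your `nvapi-` key:

```powershell
$headers = @{ Authorization = "Bearer $env:NVIDIA_API_KEY" }
$body = @{
    model      = "minimaxai/minimax-m2.7"
    messages   = @(@{ role = "user"; content = "Reply with OK." })
    max_tokens = 8
} | ConvertTo-Json -Depth 5

(Invoke-RestMethod -Uri "https://integrate.api.nvidia.com/v1/chat/completions" `
    -Method Post -Headers $headers -Body $body -ContentType "application/json"
).choices[0].message.content
```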
- Save the `scripts/` folder anywhere (e.g. `~\custom-scripts\ollama-llms\`)
- Allow PowerShell scripts (run as admin, one-time): `Set-ExecutionPolicy -Scope CurrentUser -ExecutionPolicy RemoteSigned`
- Add the aliases to your profile (see Quick Start above); a sketch of what `start-dev.ps1` does follows below
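The core idea inside `start-dev.ps1` is small. This is a simplified sketch, not the shipped script: start the Ollama server, then "pin" each model in VRAM with a `keep_alive: -1` load request.

```powershell
param([switch]$Heavy, [switch]$Agent)

# Start the Ollama server if it isn't already running
if (-not (Get-Process ollama -ErrorAction SilentlyContinue)) {
    Start-Process ollama -ArgumentList "serve" -WindowStyle Hidden
    Start-Sleep -Seconds 3
}

# An empty generate request with keep_alive = -1 loads the model
# and keeps it resident until explicitly unloaded
function Pin-Model([string]$Name) {
    $body = @{ model = $Name; keep_alive = -1 } | ConvertTo-Json
    Invoke-RestMethod -Uri "http://localhost:11434/api/generate" `
        -Method Post -Body $body -ContentType "application/json" | Out-Null
    Write-Host "Pinned $Name"
}

Pin-Model "qwen2.5-coder:1.5b-base-q8_0"        # autocomplete, always on
Pin-Model "nomic-embed-text:latest"             # embeddings, always on
if ($Heavy) { Pin-Model "qwen2.5-coder:7b" }    # + local chat
if ($Agent) { Pin-Model "qwen2.5-coder:14b" }   # + local agent
```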
| Tool | Job |
|---|---|
| Continue.dev | Autocomplete + chat (you stay in control) |
| Cline | Agent that takes actions (it does the work, you supervise) |
| Claude Code | Hardest 5% of work — novel architecture, must-not-fail |
Continue helps you write code. Cline writes code for you. Claude Code does the hardest parts.
| Command | VRAM | What's loaded | When to use |
|---|---|---|---|
| `dev-start` | ~2.5 GB | autocomplete + embeddings | Default. NIM cloud handles chat + agent |
| `dev-start -Heavy` | ~7.5 GB | + local 7B chat | Offline / privacy / NIM rate-limited |
| `dev-start -Agent` | ~12 GB | + local 14B agent | NIM down/deprecated → local Cline fallback |
| `dev-stop` | 0 | nothing | End of work session |
You can escalate without restarting: running `dev-start -Heavy` while light mode is up just adds the 7B.
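The flip side, `dev-stop`, is equally small. A simplified sketch of its core idea (the shipped script may differ): unload every model immediately with `keep_alive: 0`, then kill the server.

```powershell
$models = "qwen2.5-coder:1.5b-base-q8_0", "nomic-embed-text:latest",
          "qwen2.5-coder:7b", "qwen2.5-coder:14b"

foreach ($m in $models) {
    # keep_alive = 0 tells Ollama to evict the model right away
    $body = @{ model = $m; keep_alive = 0 } | ConvertTo-Json
    try {
        Invoke-RestMethod -Uri "http://localhost:11434/api/generate" `
            -Method Post -Body $body -ContentType "application/json" | Out-Null
    } catch { }   # not loaded (or server already down) — ignore
}
Stop-Process -Name ollama -Force -ErrorAction SilentlyContinue
```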
| Action | Shortcut |
|---|---|
| Tab autocomplete (ghost text) | just type, Tab to accept |
| Open chat with selection | Ctrl+L |
| Inline edit (highlight + describe change) | Ctrl+I |
| Search whole repo semantically | @codebase your question |
| Add specific file to context | @file <name> |
| Add workspace errors to context | @problems |
| Add current diff | @diff |
| Add terminal output | @terminal |
Open the Cline panel → describe your goal → review its plan → approve → it executes (creates files, runs commands).
When to switch from Continue to Cline: if your task is 2+ sentences and touches multiple files.
Toggle "Use different models for Plan and Act modes" ON in Cline settings.
| Mode | Model ID | Cost |
|---|---|---|
| Plan | `z-ai/glm-5.1` | ~5–10 credits per plan |
| Act | `minimaxai/minimax-m2.7` | $0 (free endpoint) |
~5–10 credits per task. Smart plans, free execution. Use for ~80% of work.
| Mode | Model ID | Cost |
|---|---|---|
| Plan | `moonshotai/kimi-k2.6` | ~10–20 credits per plan |
| Act | `minimaxai/minimax-m2.7` | $0 |
When Combo 1's plans aren't cutting it. Kimi K2.6 (1T MoE) is the smartest open-weight model on NIM.
| Mode | Model ID | Cost |
|---|---|---|
| Plan | `deepseek-ai/deepseek-v4-pro` | credits |
| Act | `deepseek-ai/deepseek-v4-flash` | credits |
Both have 1M context. Use when refactoring across many files.
| Mode | Model ID | Cost |
|---|---|---|
| Plan | `minimaxai/minimax-m2.7` | $0 |
| Act | `minimaxai/minimax-m2.7` | $0 |
Pure free endpoint. Use for exploratory work.
| Mode | Model ID | Provider |
|---|---|---|
| Plan | `qwen2.5-coder:14b` | Ollama, `http://localhost:11434` |
| Act | `qwen2.5-coder:14b` | same |
Requires `dev-start -Agent` first.
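A quick way to confirm the 14B is actually loaded and answering before you point Cline at it (assumes the default Ollama port):

```powershell
ollama ps                                   # the 14B should be listed
ollama run qwen2.5-coder:14b "Say ready."   # one-shot generation from the CLI
```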
Full combo details + switching instructions: configs/cline-config.md
```
┌─ Tab autocomplete ─────────────► Continue + Local Qwen 1.5B
│
├─ Quick "what does this do?" ──► Continue + MiniMax M2.7 (Ctrl+L)
│
├─ Inline edit a function ──────► Continue + MiniMax M2.7 (Ctrl+I)
│
├─ Codebase question ───────────► Continue + @codebase
│
├─ Routine multi-file task ─────► Cline + Combo 1 (GLM plan / MiniMax act)
│
├─ Hard agent task ─────────────► Cline + Combo 2 (Kimi plan / MiniMax act)
│
├─ Massive context needed ──────► Cline + Combo 3 (DeepSeek V4)
│
├─ Offline / NIM down ──────────► dev-start -Agent
│                                  → Cline + Combo 5 (Local Qwen 14B)
│
└─ Hardest 5% / must-not-fail ──► Claude Code (terminal: claude)
```
```powershell
dev-start           # Default: 2.5 GB VRAM, NIM cloud workflow
dev-start -Heavy    # +7B local chat (offline/privacy)
dev-start -Agent    # +14B local agent (NIM fallback)
dev-stop            # Free all VRAM, kill Ollama
ollama ps           # See what's loaded right now
```

```
minimaxai/minimax-m2.7          ⭐ free endpoint, 230B coder
moonshotai/kimi-k2.6            premium, 1T MoE, best agentic
deepseek-ai/deepseek-v4-flash   1M context, fast
deepseek-ai/deepseek-v4-pro     1M context, deeper reasoning
z-ai/glm-5.1                    premium, agentic flagship
```
- Run `dev-start -Agent`
- Cline settings → API Provider: `Ollama`, Base URL: `http://localhost:11434`, Model ID: `qwen2.5-coder:14b`, Context: `32768`
See docs/TROUBLESHOOTING.md for the full table.
Quick fixes:
- Continue 401 → wrong NIM key in `~/.continue/config.yaml`; reload window
- `@codebase` not working → nomic not loaded; re-run `dev-start`
- Cline rate limit → MiniMax free endpoint is 40 req/min; wait or switch model
- API not responding → run `ollama serve` in a fresh terminal to see the real errors (port check below)
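One more check that helps when `ollama serve` complains the port is busy — find out what owns it (hedged: `Get-NetTCPConnection` is Windows-only):

```powershell
# Which process is listening on Ollama's default port?
Get-NetTCPConnection -LocalPort 11434 -State Listen -ErrorAction SilentlyContinue |
    ForEach-Object { Get-Process -Id $_.OwningProcess }
```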
Q: Why not just use Cursor / Copilot / Claude Code for everything?
A: Cost. A heavy Cursor user easily spends $20–50/month; Claude Code on serious agent work can run $100+/month. This setup gets you 80–90% of the same productivity for ~$5–10/month, often $0.
Q: Why local autocomplete instead of cloud?
A: Latency. Cloud autocomplete needs ~300ms round-trip; local Qwen 1.5B does ~150ms. The difference is felt on every keystroke. Plus it's free and private.
Q: Why MiniMax M2.7 over other free models?
A: It's a 230B coder with a Free Endpoint (no credit cost) on NIM. As of early 2026, nothing else in the free tier comes close on coding quality.
Q: What if NIM deprecates MiniMax M2.7?
A: That's why local fallback exists. Run dev-start -Agent, swap Cline config in 10 seconds, keep working. Or swap to a different free provider (Groq, Cerebras, etc.) — the YAML anchor pattern makes this a one-line change. Open an issue or PR with the replacement.
Q: Does this work for non-coding tasks?
A: Local models are coder-tuned, so they're meh for general writing. For mixed work, swap the local 7B for Qwen 3 8B or Gemma 4 E4B. Cloud models (MiniMax, Kimi, GLM) are general-purpose and work fine for any task.
Q: Is Continue.dev's @codebase actually useful?
A: Yes — semantic search across your repo using local nomic embeddings. First index takes 1–3 min for medium repos; after that it's incremental.
Built on top of:
- Ollama — local model runtime
- Continue.dev — VS Code AI assistant
- Cline — autonomous coding agent
- NVIDIA NIM — free model API
- Qwen, DeepSeek, Z.ai, Moonshot AI, MiniMax — open-weight models
- Anthropic Claude — premium tier
PRs welcome. See CONTRIBUTING.md for what we accept and how to submit.
MIT — see LICENSE.