Replace 80% of your Cursor / Copilot / Claude Code spend with a hybrid local + free-cloud AI stack for VS Code — without sacrificing quality where it matters.
A production-ready coding setup that runs autocomplete and embeddings locally on your GPU, routes routine agent work through NVIDIA NIM's free endpoint models, and reserves Claude Code for the hardest 5% of tasks. Includes start/stop scripts, configs, and tested model combos.
Quick Start · Hardware Requirements · Cline Combos · Troubleshooting · Contributing
| Capability | Tool | Cost |
|---|---|---|
| Tab autocomplete (sub-500ms) | Continue.dev + local Qwen 1.5B | Free |
| Quick "explain this code" chat | Continue.dev + NIM MiniMax M2.7 | Free (free endpoint) |
| Whole-codebase semantic search (@codebase) | Continue.dev + nomic embeddings | Free |
| Multi-file agent tasks (refactors, features) | Cline + NIM MiniMax M2.7 | Free (free endpoint) |
| Hard agentic tasks | Cline + NIM Kimi K2.6 / GLM-5.1 | Free credits (~10–50 per task) |
| Architectural / must-not-fail work | Claude Code | Paid (only when needed) |
Realistic monthly split: ~50% local · ~35% NIM free · ~10% NIM credits · ~5% Claude Code.
```
local-nim-coding-stack/
├── scripts/
│   ├── start-dev.ps1        # Start Ollama + pin models in VRAM
│   └── stop-dev.ps1         # Unload models + kill Ollama
├── configs/
│   ├── continue-config.yaml # Continue.dev config template
│   └── cline-config.md      # Cline settings + Plan/Act combos
└── docs/
    ├── HARDWARE.md          # VRAM tiers + Mac/Linux notes
    └── TROUBLESHOOTING.md   # Common errors and fixes
```
```powershell
# 1. Install Ollama
#    https://ollama.com/download/windows

# 2. Pull all models (~16 GB total)
ollama pull qwen2.5-coder:1.5b-base-q8_0
ollama pull qwen2.5-coder:7b
ollama pull qwen2.5-coder:14b
ollama pull nomic-embed-text:latest

# 3. Disable Ollama autostart
#    Ctrl+Shift+Esc → Startup apps → disable Ollama

# 4. Get a NIM key at https://build.nvidia.com (free)

# 5. Install VS Code extensions: Continue + Cline
#    Copy configs/continue-config.yaml to %USERPROFILE%\.continue\config.yaml
#    Replace nvapi-PASTE-YOUR-ACTUAL-KEY-HERE with your real key
#    Configure Cline (see configs/cline-config.md)

# 6. Add shell aliases to your PowerShell profile
notepad $PROFILE
```

Add to your `$PROFILE`:

```powershell
function dev-start {
    param([switch]$Heavy, [switch]$Agent)
    & "$HOME\path\to\scripts\start-dev.ps1" -Heavy:$Heavy -Agent:$Agent
}
function dev-stop { & "$HOME\path\to\scripts\stop-dev.ps1" }
```

Then reload it: `. $PROFILE`
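A quick smoke test confirms the wiring (hedged: assumes the script paths above point at the real files):

```powershell
dev-start     # should pin the 1.5B + embedding models in VRAM
ollama ps     # verify both show up as loaded
dev-stop      # and tear it all down again
```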
Targets 12 GB VRAM laptop GPUs (RTX 4070 Ti, 4080 mobile, 5070 Ti mobile). Also needs 32 GB RAM and ~16 GB disk.
| VRAM | Adjustments |
|---|---|
| 8 GB | Skip the 14B local agent; use 7B as fallback. |
| 12 GB | Use as written. Recommended. |
| 16 GB | Replace 14B with Qwen 2.5 Coder 32B Q4_K_M. |
| 24 GB+ | Replace with Qwen3-Coder-30B-A3B for near-NIM quality locally. |
Full details including Mac/Linux bash equivalents: docs/HARDWARE.md
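Not sure which tier applies to you? `nvidia-smi` ships with the NVIDIA driver and reports VRAM directly:

```powershell
# Total and currently free VRAM, per GPU
nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv
```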
- Ollama over LM Studio for daily work — CLI-driven, scriptable start/stop, lighter footprint.
- Continue.dev for autocomplete + chat — best free VS Code extension for inline AI; FIM-aware autocomplete; supports local + cloud through one config.
- Cline for agent tasks — autonomous coding agent with explicit per-step approval; OpenAI-compatible (works with NIM); Plan/Act mode enables the cost-saving combos below.
- NVIDIA NIM as default cloud — Free Endpoint models (no credit cost) covering 230B+ coders, plus 1,000 free credits for premium models. OpenAI-compatible.
- Local fallback — NIM policy can change anytime. Keeping local models loaded means you're never blocked.
- Claude Code reserved for premium — best-in-class for novel architecture and complex debugging; 5% of tasks, not 100%.
Download from https://ollama.com/download/windows, run the installer. Verify:
```powershell
ollama --version
```

```powershell
ollama pull qwen2.5-coder:1.5b-base-q8_0   # autocomplete (~1.6 GB)
ollama pull qwen2.5-coder:7b               # local chat fallback (~4.7 GB)
ollama pull qwen2.5-coder:14b              # local agent fallback (~9 GB)
ollama pull nomic-embed-text:latest        # embeddings (~274 MB)
```

Why `:1.5b-base-q8_0`? The `-base` suffix is FIM-trained for tab completion. Q8 quantization preserves quality on small models with negligible speed cost.
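To see FIM in action outside the editor, you can hit Ollama's generate API directly. A minimal sketch, assuming the model's template wires up the `suffix` field (it should for FIM-trained bases):

```powershell
# Ask the base model to fill in the middle between prompt and suffix
$body = @{
    model   = "qwen2.5-coder:1.5b-base-q8_0"
    prompt  = "def fib(n):`n    "     # code before the cursor
    suffix  = "`nprint(fib(10))"      # code after the cursor
    stream  = $false
    options = @{ num_predict = 64 }
} | ConvertTo-Json

(Invoke-RestMethod -Uri "http://localhost:11434/api/generate" `
    -Method Post -Body $body -ContentType "application/json").response
```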
You only want Ollama running while you work, not eating RAM 24/7.
- `Ctrl+Shift+Esc` → Startup apps tab → find Ollama → right-click → Disable
- System tray → right-click the llama icon → Quit Ollama
Sign up at https://build.nvidia.com → generate an API key (starts with `nvapi-`). You get 1,000 free credits plus access to Free Endpoint models.
- Continue (publisher: Continue)
- Cline (publisher: saoudrizwan)
- Copy `configs/continue-config.yaml` to `%USERPROFILE%\.continue\config.yaml`
- Replace `nvapi-PASTE-YOUR-ACTUAL-KEY-HERE` with your actual NIM key
- In VS Code: `Ctrl+Shift+P` → Developer: Reload Window

Security note: the key sits in plaintext in this config. The folder lives in your user home, not in a project repo, but make sure `~/.continue/` isn't synced via OneDrive/Dropbox.
See configs/cline-config.md for full settings. Quick summary: Cline icon → settings gear → fill in NIM endpoint, your API key, set Plan/Act combo.
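Before wiring the key into Cline, a 10-second smoke test against the OpenAI-compatible endpoint is worth it. A hedged sketch — the base URL is NIM's standard one, and `NVIDIA_API_KEY` is just a placeholder env var for your `nvapi-` key:

```powershell
$headers = @{ Authorization = "Bearer $env:NVIDIA_API_KEY" }
$body = @{
    model      = "minimaxai/minimax-m2.7"
    messages   = @(@{ role = "user"; content = "Reply with OK." })
    max_tokens = 8
} | ConvertTo-Json -Depth 5

(Invoke-RestMethod -Uri "https://integrate.api.nvidia.com/v1/chat/completions" `
    -Method Post -Headers $headers -Body $body -ContentType "application/json"
).choices[0].message.content
```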
- Save the `scripts/` folder anywhere (e.g. `~\custom-scripts\ollama-llms\`)
- Allow PowerShell scripts (run as admin, one-time): `Set-ExecutionPolicy -Scope CurrentUser -ExecutionPolicy RemoteSigned`
- Add the aliases to your profile (see Quick Start above); a sketch of what `start-dev.ps1` does follows below
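The core idea inside `start-dev.ps1` is small. This is a simplified sketch, not the shipped script: start the Ollama server, then "pin" each model in VRAM with a `keep_alive: -1` load request.

```powershell
param([switch]$Heavy, [switch]$Agent)

# Start the Ollama server if it isn't already running
if (-not (Get-Process ollama -ErrorAction SilentlyContinue)) {
    Start-Process ollama -ArgumentList "serve" -WindowStyle Hidden
    Start-Sleep -Seconds 3
}

# An empty generate request with keep_alive = -1 loads the model
# and keeps it resident until explicitly unloaded
function Pin-Model([string]$Name) {
    $body = @{ model = $Name; keep_alive = -1 } | ConvertTo-Json
    Invoke-RestMethod -Uri "http://localhost:11434/api/generate" `
        -Method Post -Body $body -ContentType "application/json" | Out-Null
    Write-Host "Pinned $Name"
}

Pin-Model "qwen2.5-coder:1.5b-base-q8_0"        # autocomplete, always on
Pin-Model "nomic-embed-text:latest"             # embeddings, always on
if ($Heavy) { Pin-Model "qwen2.5-coder:7b" }    # + local chat
if ($Agent) { Pin-Model "qwen2.5-coder:14b" }   # + local agent
```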
| Tool | Job |
|---|---|
| Continue.dev | Autocomplete + chat (you stay in control) |
| Cline | Agent that takes actions (it does the work, you supervise) |
| Claude Code | Hardest 5% of work — novel architecture, must-not-fail |
Continue helps you write code. Cline writes code for you. Claude Code does the hardest parts.
| Command | VRAM | What's loaded | When to use |
|---|---|---|---|
| `dev-start` | ~2.5 GB | autocomplete + embeddings | Default. NIM cloud handles chat + agent |
| `dev-start -Heavy` | ~7.5 GB | + local 7B chat | Offline / privacy / NIM rate-limited |
| `dev-start -Agent` | ~12 GB | + local 14B agent | NIM down/deprecated → local Cline fallback |
| `dev-stop` | 0 | nothing | End of work session |
You can escalate without restarting: running `dev-start -Heavy` while light mode is up just adds the 7B.
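The flip side, `dev-stop`, is equally small. A simplified sketch of its core idea (the shipped script may differ): unload every model immediately with `keep_alive: 0`, then kill the server.

```powershell
$models = "qwen2.5-coder:1.5b-base-q8_0", "nomic-embed-text:latest",
          "qwen2.5-coder:7b", "qwen2.5-coder:14b"

foreach ($m in $models) {
    # keep_alive = 0 tells Ollama to evict the model right away
    $body = @{ model = $m; keep_alive = 0 } | ConvertTo-Json
    try {
        Invoke-RestMethod -Uri "http://localhost:11434/api/generate" `
            -Method Post -Body $body -ContentType "application/json" | Out-Null
    } catch { }   # not loaded (or server already down) — ignore
}
Stop-Process -Name ollama -Force -ErrorAction SilentlyContinue
```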
| Action | Shortcut |
|---|---|
| Tab autocomplete (ghost text) | just type, Tab to accept |
| Open chat with selection | Ctrl+L |
| Inline edit (highlight + describe change) | Ctrl+I |
| Search whole repo semantically | @codebase your question |
| Add specific file to context | @file <name> |
| Add workspace errors to context | @problems |
| Add current diff | @diff |
| Add terminal output | @terminal |
Open the Cline panel → describe your goal → review its plan → approve → it executes (creates files, runs commands).
When to switch from Continue to Cline: if your task is 2+ sentences and touches multiple files.
Toggle "Use different models for Plan and Act modes" ON in Cline settings.
| Mode | Model ID | Cost |
|---|---|---|
| Plan | `z-ai/glm-5.1` | ~5–10 credits per plan |
| Act | `minimaxai/minimax-m2.7` | $0 (free endpoint) |
~5–10 credits per task. Smart plans, free execution. Use for ~80% of work.
| Mode | Model ID | Cost |
|---|---|---|
| Plan | `moonshotai/kimi-k2.6` | ~10–20 credits per plan |
| Act | `minimaxai/minimax-m2.7` | $0 |
When Combo 1's plans aren't cutting it. Kimi K2.6 (1T MoE) is the smartest open-weight model on NIM.
| Mode | Model ID | Cost |
|---|---|---|
| Plan | `deepseek-ai/deepseek-v4-pro` | credits |
| Act | `deepseek-ai/deepseek-v4-flash` | credits |
Both have 1M context. Use when refactoring across many files.
| Mode | Model ID | Cost |
|---|---|---|
| Plan | `minimaxai/minimax-m2.7` | $0 |
| Act | `minimaxai/minimax-m2.7` | $0 |
Pure free endpoint. Use for exploratory work.
| Mode | Model ID | Provider |
|---|---|---|
| Plan | `qwen2.5-coder:14b` | Ollama, `http://localhost:11434` |
| Act | `qwen2.5-coder:14b` | same |
Requires `dev-start -Agent` first.
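A quick way to confirm the 14B is actually loaded and answering before you point Cline at it (assumes the default Ollama port):

```powershell
ollama ps                                   # the 14B should be listed
ollama run qwen2.5-coder:14b "Say ready."   # one-shot generation from the CLI
```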
Full combo details + switching instructions: configs/cline-config.md
```
┌─ Tab autocomplete ─────────────► Continue + Local Qwen 1.5B
│
├─ Quick "what does this do?" ──► Continue + MiniMax M2.7 (Ctrl+L)
│
├─ Inline edit a function ──────► Continue + MiniMax M2.7 (Ctrl+I)
│
├─ Codebase question ───────────► Continue + @codebase
│
├─ Routine multi-file task ─────► Cline + Combo 1 (GLM plan / MiniMax act)
│
├─ Hard agent task ─────────────► Cline + Combo 2 (Kimi plan / MiniMax act)
│
├─ Massive context needed ──────► Cline + Combo 3 (DeepSeek V4)
│
├─ Offline / NIM down ──────────► dev-start -Agent
│                                  → Cline + Combo 5 (Local Qwen 14B)
│
└─ Hardest 5% / must-not-fail ──► Claude Code (terminal: claude)
```
```powershell
dev-start           # Default: 2.5 GB VRAM, NIM cloud workflow
dev-start -Heavy    # +7B local chat (offline/privacy)
dev-start -Agent    # +14B local agent (NIM fallback)
dev-stop            # Free all VRAM, kill Ollama
ollama ps           # See what's loaded right now
```

```
minimaxai/minimax-m2.7          ⭐ free endpoint, 230B coder
moonshotai/kimi-k2.6            premium, 1T MoE, best agentic
deepseek-ai/deepseek-v4-flash   1M context, fast
deepseek-ai/deepseek-v4-pro     1M context, deeper reasoning
z-ai/glm-5.1                    premium, agentic flagship
```
- Run `dev-start -Agent`
- Cline settings → API Provider: `Ollama`, Base URL: `http://localhost:11434`, Model ID: `qwen2.5-coder:14b`, Context: `32768`
See docs/TROUBLESHOOTING.md for the full table.
Quick fixes:
- Continue 401 → wrong NIM key in `~/.continue/config.yaml`; reload window
- `@codebase` not working → nomic not loaded; re-run `dev-start`
- Cline rate limit → MiniMax free endpoint is 40 req/min; wait or switch model
- API not responding → run `ollama serve` in a fresh terminal to see the real errors (port check below)
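One more check that helps when `ollama serve` complains the port is busy — find out what owns it (hedged: `Get-NetTCPConnection` is Windows-only):

```powershell
# Which process is listening on Ollama's default port?
Get-NetTCPConnection -LocalPort 11434 -State Listen -ErrorAction SilentlyContinue |
    ForEach-Object { Get-Process -Id $_.OwningProcess }
```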
Q: Why not just use Cursor / Copilot / Claude Code for everything?
A: Cost. A heavy Cursor user easily spends $20–50/month; Claude Code on serious agent work can run $100+/month. This setup gets you 80–90% of the same productivity for ~$5–10/month, often $0.
Q: Why local autocomplete instead of cloud?
A: Latency. Cloud autocomplete needs ~300ms round-trip; local Qwen 1.5B does ~150ms. The difference is felt on every keystroke. Plus it's free and private.
Q: Why MiniMax M2.7 over other free models?
A: It's a 230B coder with a Free Endpoint (no credit cost) on NIM. As of early 2026, nothing else in the free tier comes close on coding quality.
Q: What if NIM deprecates MiniMax M2.7?
A: That's why local fallback exists. Run dev-start -Agent, swap Cline config in 10 seconds, keep working. Or swap to a different free provider (Groq, Cerebras, etc.) — the YAML anchor pattern makes this a one-line change. Open an issue or PR with the replacement.
Q: Does this work for non-coding tasks?
A: Local models are coder-tuned, so they're meh for general writing. For mixed work, swap the local 7B for Qwen 3 8B or Gemma 4 E4B. Cloud models (MiniMax, Kimi, GLM) are general-purpose and work fine for any task.
Q: Is Continue.dev's @codebase actually useful?
A: Yes — semantic search across your repo using local nomic embeddings. First index takes 1–3 min for medium repos; after that it's incremental.
Built on top of:
- Ollama — local model runtime
- Continue.dev — VS Code AI assistant
- Cline — autonomous coding agent
- NVIDIA NIM — free model API
- Qwen, DeepSeek, Z.ai, Moonshot AI, MiniMax — open-weight models
- Anthropic Claude — premium tier
PRs welcome. See CONTRIBUTING.md for what we accept and how to submit.
MIT — see LICENSE.