VISHWAJ33T/local-nim-coding-stack
Local + NIM Coding Stack

Replace 80% of your Cursor / Copilot / Claude Code spend with a hybrid local + free-cloud AI stack for VS Code — without sacrificing quality where it matters.

A production-ready coding setup that runs autocomplete and embeddings locally on your GPU, routes routine agent work through NVIDIA NIM's free endpoint models, and reserves Claude Code for the hardest 5% of tasks. Includes start/stop scripts, configs, and tested model combos.

Quick Start · Hardware Requirements · Cline Combos · Troubleshooting · Contributing


TL;DR — What you get

| Capability | Tool | Cost |
|---|---|---|
| Tab autocomplete (sub-500ms) | Continue.dev + local Qwen 1.5B | Free |
| Quick "explain this code" chat | Continue.dev + NIM MiniMax M2.7 | Free (free endpoint) |
| Whole-codebase semantic search (@codebase) | Continue.dev + nomic embeddings | Free |
| Multi-file agent tasks (refactors, features) | Cline + NIM MiniMax M2.7 | Free (free endpoint) |
| Hard agentic tasks | Cline + NIM Kimi K2.6 / GLM-5.1 | Free credits (~10–50 per task) |
| Architectural / must-not-fail work | Claude Code | Paid (only when needed) |

Realistic monthly split: ~50% local · ~35% NIM free · ~10% NIM credits · ~5% Claude Code.


Repo structure

```
local-nim-coding-stack/
├── scripts/
│   ├── start-dev.ps1       # Start Ollama + pin models in VRAM
│   └── stop-dev.ps1        # Unload models + kill Ollama
├── configs/
│   ├── continue-config.yaml  # Continue.dev config template
│   └── cline-config.md       # Cline settings + Plan/Act combos
└── docs/
    ├── HARDWARE.md           # VRAM tiers + Mac/Linux notes
    └── TROUBLESHOOTING.md    # Common errors and fixes
```

Quick Start

```shell
# 1. Install Ollama
#    https://ollama.com/download/windows

# 2. Pull all models (~16 GB total)
ollama pull qwen2.5-coder:1.5b-base-q8_0
ollama pull qwen2.5-coder:7b
ollama pull qwen2.5-coder:14b
ollama pull nomic-embed-text:latest

# 3. Disable Ollama autostart
#    Ctrl+Shift+Esc → Startup apps → disable Ollama

# 4. Get NIM key at https://build.nvidia.com (free)

# 5. Install VS Code extensions: Continue + Cline
#    Copy configs/continue-config.yaml to %USERPROFILE%\.continue\config.yaml
#    Replace nvapi-PASTE-YOUR-ACTUAL-KEY-HERE with your real key
#    Configure Cline (see configs/cline-config.md)

# 6. Add shell aliases to your PowerShell profile
#    notepad $PROFILE
```

Add to your $PROFILE:

```powershell
function dev-start {
    param([switch]$Heavy, [switch]$Agent)
    & "$HOME\path\to\scripts\start-dev.ps1" -Heavy:$Heavy -Agent:$Agent
}
function dev-stop { & "$HOME\path\to\scripts\stop-dev.ps1" }
```

Then: . $PROFILE
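On Mac/Linux, a rough bash equivalent for ~/.bashrc or ~/.zshrc looks like this (a sketch with placeholder paths; the repo ships PowerShell scripts, and docs/HARDWARE.md covers the bash ports — underscores keep the function names POSIX-safe):

```shell
# Hypothetical bash counterparts of the PowerShell aliases above.
# start-dev.sh / stop-dev.sh are placeholder names, not shipped by the repo.
dev_start() { "$HOME/path/to/scripts/start-dev.sh" "$@"; }
dev_stop()  { "$HOME/path/to/scripts/stop-dev.sh"; }
```

Reload with `source ~/.bashrc` (the bash analogue of `. $PROFILE`).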


Hardware Requirements

Targets 12 GB VRAM laptop GPUs (RTX 4070 Ti, 4080 mobile, 5070 Ti mobile). Also needs 32 GB RAM and ~16 GB disk.

| VRAM | Adjustment |
|---|---|
| 8 GB | Skip the 14B local agent; use 7B as fallback. |
| 12 GB | Use as written. Recommended. |
| 16 GB | Replace 14B with Qwen 2.5 Coder 32B Q4_K_M. |
| 24 GB+ | Replace with Qwen3-Coder-30B-A3B for near-NIM quality locally. |

Full details including Mac/Linux bash equivalents: docs/HARDWARE.md
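Not sure which tier you're in? On NVIDIA GPUs you can check total VRAM with nvidia-smi (a minimal sketch; nvidia-smi ships with the driver):

```shell
# Report GPU name and total VRAM, or a fallback message if no NVIDIA driver
# is on PATH.
if command -v nvidia-smi >/dev/null 2>&1; then
  GPU_INFO=$(nvidia-smi --query-gpu=name,memory.total --format=csv,noheader)
else
  GPU_INFO="nvidia-smi not found (no NVIDIA driver on PATH)"
fi
echo "$GPU_INFO"
```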


Why this stack?

  • Ollama over LM Studio for daily work — CLI-driven, scriptable start/stop, lighter footprint.
  • Continue.dev for autocomplete + chat — best free VS Code extension for inline AI; FIM-aware autocomplete; supports local + cloud through one config.
  • Cline for agent tasks — autonomous coding agent with explicit per-step approval; OpenAI-compatible (works with NIM); Plan/Act mode enables the cost-saving combos below.
  • NVIDIA NIM as default cloud — Free Endpoint models (no credit cost) covering 230B+ coders, plus 1,000 free credits for premium models. OpenAI-compatible.
  • Local fallback — NIM policy can change anytime. Keeping local models loaded means you're never blocked.
  • Claude Code reserved for premium — best-in-class for novel architecture and complex debugging; 5% of tasks, not 100%.

Detailed Setup

Step 1 — Install Ollama

Download from https://ollama.com/download/windows, run the installer. Verify:

```shell
ollama --version
```

Step 2 — Pull all models

```shell
ollama pull qwen2.5-coder:1.5b-base-q8_0   # autocomplete (~1.6 GB)
ollama pull qwen2.5-coder:7b               # local chat fallback (~4.7 GB)
ollama pull qwen2.5-coder:14b              # local agent fallback (~9 GB)
ollama pull nomic-embed-text:latest        # embeddings (~274 MB)
```

Why :1.5b-base-q8_0? The -base variant is trained for fill-in-the-middle (FIM) completion, which is exactly what tab autocomplete needs. Q8 quantization preserves quality on small models at negligible speed cost.
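For context, this is roughly the FIM prompt shape such -base models complete: the code before and after the cursor, with a sentinel marking the gap (token names follow Qwen2.5-Coder's convention; Continue assembles this for you):

```shell
# Sketch of a fill-in-the-middle prompt: the model generates the text that
# belongs between PREFIX and SUFFIX, i.e. after <|fim_middle|>.
PREFIX='def add(a, b):'
SUFFIX=''
PROMPT="<|fim_prefix|>${PREFIX}<|fim_suffix|>${SUFFIX}<|fim_middle|>"
echo "$PROMPT"
```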

Step 3 — Disable Ollama autostart

You only want Ollama running while you work, not eating RAM 24/7.

  • Ctrl+Shift+Esc → Startup apps tab → find Ollama → right-click → Disable
  • System tray → right-click llama icon → Quit Ollama

Step 4 — Get NIM API key

Sign up at https://build.nvidia.com → generate API key (starts with nvapi-). Free 1,000 credits + access to Free Endpoint models.
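To sanity-check the key: NIM speaks the standard OpenAI chat-completions API, so a raw request looks roughly like this (a hedged sketch; the model ID is one of the free-endpoint IDs listed later in this README, and the endpoint is NVIDIA's OpenAI-compatible API):

```shell
# Minimal chat-completions request against NIM. Export NIM_API_KEY yourself;
# without it the script just prints a reminder instead of calling out.
BODY='{"model":"minimaxai/minimax-m2.7","messages":[{"role":"user","content":"hello"}]}'
if [ -n "${NIM_API_KEY:-}" ]; then
  curl -s https://integrate.api.nvidia.com/v1/chat/completions \
    -H "Authorization: Bearer $NIM_API_KEY" \
    -H "Content-Type: application/json" \
    -d "$BODY"
else
  echo "Set NIM_API_KEY first"
fi
```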

Step 5 — Install VS Code extensions

  • Continue (publisher: Continue)
  • Cline (publisher: saoudrizwan)

Step 6 — Configure Continue.dev

  1. Copy configs/continue-config.yaml to %USERPROFILE%\.continue\config.yaml
  2. Replace nvapi-PASTE-YOUR-ACTUAL-KEY-HERE with your actual NIM key
  3. In VS Code: Ctrl+Shift+P → Developer: Reload Window

Security note: The key is stored in plaintext in this config file. The folder lives in your user home, not inside a project repo, so it won't be committed by accident; still, make sure ~/.continue/ isn't synced via OneDrive/Dropbox.
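For orientation, the template boils down to something like this sketch (field names follow Continue's config.yaml schema, but treat the repo's configs/continue-config.yaml as the source of truth):

```yaml
# Illustrative excerpt only — not copied from the repo's template.
models:
  - name: MiniMax M2.7 (NIM)
    provider: openai
    apiBase: https://integrate.api.nvidia.com/v1
    apiKey: nvapi-PASTE-YOUR-ACTUAL-KEY-HERE
    model: minimaxai/minimax-m2.7
    roles: [chat, edit]
  - name: Qwen 1.5B (local autocomplete)
    provider: ollama
    model: qwen2.5-coder:1.5b-base-q8_0
    roles: [autocomplete]
  - name: Nomic (local embeddings)
    provider: ollama
    model: nomic-embed-text:latest
    roles: [embed]
```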

Step 7 — Configure Cline

See configs/cline-config.md for full settings. Quick summary: Cline icon → settings gear → fill in NIM endpoint, your API key, set Plan/Act combo.

Step 8 — Install scripts

  1. Save scripts/ folder anywhere (e.g. ~\custom-scripts\ollama-llms\)
  2. Allow PowerShell scripts (run as admin, one-time):
    Set-ExecutionPolicy -Scope CurrentUser -ExecutionPolicy RemoteSigned
  3. Add aliases to your profile (see Quick Start above)

Daily Usage

Three tools, one job each

| Tool | Job |
|---|---|
| Continue.dev | Autocomplete + chat (you stay in control) |
| Cline | Agent that takes actions (it does the work, you supervise) |
| Claude Code | Hardest 5% of work — novel architecture, must-not-fail |

Continue helps you write code. Cline writes code for you. Claude Code does the hardest parts.

Start / Stop modes

| Command | VRAM | What's loaded | When to use |
|---|---|---|---|
| dev-start | ~2.5 GB | autocomplete + embeddings | Default. NIM cloud handles chat + agent |
| dev-start -Heavy | ~7.5 GB | + local 7B chat | Offline / privacy / NIM rate-limited |
| dev-start -Agent | ~12 GB | + local 14B agent | NIM down/deprecated → local Cline fallback |
| dev-stop | 0 | nothing | End of work session |

You can escalate without restarting: running dev-start -Heavy while light mode is up just adds the 7B.

Continue.dev — when YOU write the code

| Action | Shortcut |
|---|---|
| Tab autocomplete (ghost text) | just type, Tab to accept |
| Open chat with selection | Ctrl+L |
| Inline edit (highlight + describe change) | Ctrl+I |
| Search whole repo semantically | @codebase your question |
| Add specific file to context | @file <name> |
| Add workspace errors to context | @problems |
| Add current diff | @diff |
| Add terminal output | @terminal |

Cline — when the AGENT writes the code

Open the Cline panel → describe your goal → review its plan → approve → Cline executes (creates files, runs commands).

When to switch from Continue to Cline: if your task is 2+ sentences and touches multiple files.


Cline Plan/Act Combos

Toggle "Use different models for Plan and Act modes" ON in Cline settings.

Combo 1 — Daily Driver ⭐ (recommended)

| Mode | Model ID | Cost |
|---|---|---|
| Plan | z-ai/glm-5.1 | ~5–10 credits per plan |
| Act | minimaxai/minimax-m2.7 | $0 (free endpoint) |

~5–10 credits per task. Smart plans, free execution. Use for ~80% of work.

Combo 2 — Highest Quality

| Mode | Model ID | Cost |
|---|---|---|
| Plan | moonshotai/kimi-k2.6 | ~10–20 credits per plan |
| Act | minimaxai/minimax-m2.7 | $0 |

When Combo 1's plans aren't cutting it. Kimi K2.6 (1T MoE) is the smartest open-weight model on NIM.

Combo 3 — Entire-Repo Refactors

| Mode | Model ID | Cost |
|---|---|---|
| Plan | deepseek-ai/deepseek-v4-pro | credits |
| Act | deepseek-ai/deepseek-v4-flash | credits |

Both have 1M context. Use when refactoring across many files.

Combo 4 — Zero Credit Risk

| Mode | Model ID | Cost |
|---|---|---|
| Plan | minimaxai/minimax-m2.7 | $0 |
| Act | minimaxai/minimax-m2.7 | $0 |

Pure free endpoint. Use for exploratory work.

Combo 5 — Local Fallback (NIM down/dead)

| Mode | Model ID | Provider |
|---|---|---|
| Plan | qwen2.5-coder:14b | Ollama, http://localhost:11434 |
| Act | qwen2.5-coder:14b | same |

Requires dev-start -Agent first.

Full combo details + switching instructions: configs/cline-config.md


Decision Tree — Which Tool, Which Model

```
┌─ Tab autocomplete ─────────────► Continue + Local Qwen 1.5B
│
├─ Quick "what does this do?" ──► Continue + MiniMax M2.7 (Ctrl+L)
│
├─ Inline edit a function ──────► Continue + MiniMax M2.7 (Ctrl+I)
│
├─ Codebase question ───────────► Continue + @codebase
│
├─ Routine multi-file task ─────► Cline + Combo 1 (GLM plan / MiniMax act)
│
├─ Hard agent task ─────────────► Cline + Combo 2 (Kimi plan / MiniMax act)
│
├─ Massive context needed ──────► Cline + Combo 3 (DeepSeek V4)
│
├─ Offline / NIM down ──────────► dev-start -Agent
│                                 → Cline + Combo 5 (Local Qwen 14B)
│
└─ Hardest 5% / must-not-fail ──► Claude Code (terminal: `claude`)
```

Quick Reference Card

Script commands

```shell
dev-start           # Default: 2.5 GB VRAM, NIM cloud workflow
dev-start -Heavy    # +7B local chat (offline/privacy)
dev-start -Agent    # +14B local agent (NIM fallback)
dev-stop            # Free all VRAM, kill Ollama
ollama ps           # See what's loaded right now
```

NIM Model IDs

```
minimaxai/minimax-m2.7              ⭐ free endpoint, 230B coder
moonshotai/kimi-k2.6                premium, 1T MoE, best agentic
deepseek-ai/deepseek-v4-flash       1M context, fast
deepseek-ai/deepseek-v4-pro         1M context, deeper reasoning
z-ai/glm-5.1                        premium, agentic flagship
```

Cline emergency swap to local (when NIM is dead)

  1. dev-start -Agent
  2. Cline settings → API Provider: Ollama, Base URL: http://localhost:11434, Model ID: qwen2.5-coder:14b, Context: 32768

Troubleshooting

See docs/TROUBLESHOOTING.md for the full table.

Quick fixes:

  • Continue 401 → wrong NIM key in ~/.continue/config.yaml, reload window
  • @codebase not working → nomic not loaded, re-run dev-start
  • Cline rate limit → MiniMax free endpoint is 40 req/min, wait or switch model
  • API not responding → run ollama serve in a fresh terminal to see real errors
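A quick way to triage the last two: check whether Ollama is answering on its default port (a minimal sketch assuming the default localhost:11434 binding):

```shell
# Probe Ollama's version endpoint; prints "up" when the server responds,
# "down" otherwise (also "down" if curl itself is missing).
ollama_status() {
  if curl -fsS http://localhost:11434/api/version >/dev/null 2>&1; then
    echo up
  else
    echo down
  fi
}
STATUS=$(ollama_status)
echo "Ollama: $STATUS"
```

If it reports down, start the server with `ollama serve` or re-run `dev-start`.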

FAQ

Q: Why not just use Cursor / Copilot / Claude Code for everything?
A: Cost. A heavy Cursor user easily spends $20–50/month; Claude Code on serious agent work can run $100+/month. This setup gets you 80–90% of the same productivity for ~$5–10/month, often $0.

Q: Why local autocomplete instead of cloud?
A: Latency. Cloud autocomplete needs ~300ms round-trip; local Qwen 1.5B does ~150ms. The difference is felt on every keystroke. Plus it's free and private.

Q: Why MiniMax M2.7 over other free models?
A: It's a 230B coder with a Free Endpoint (no credit cost) on NIM, and as of early 2026 nothing else in the free tier comes close on coding quality.

Q: What if NIM deprecates MiniMax M2.7?
A: That's why local fallback exists. Run dev-start -Agent, swap Cline config in 10 seconds, keep working. Or swap to a different free provider (Groq, Cerebras, etc.) — the YAML anchor pattern makes this a one-line change. Open an issue or PR with the replacement.

Q: Does this work for non-coding tasks?
A: Local models are coder-tuned, so they're meh for general writing. For mixed work, swap the local 7B for Qwen 3 8B or Gemma 4 E4B. Cloud models (MiniMax, Kimi, GLM) are general-purpose and work fine for any task.

Q: Is Continue.dev's @codebase actually useful?
A: Yes — semantic search across your repo using local nomic embeddings. First index takes 1–3 min for medium repos; after that it's incremental.


Acknowledgments

Built on top of:


Contributing

PRs welcome. See CONTRIBUTING.md for what we accept and how to submit.

License

MIT — see LICENSE.
