Local LLMs as coding sub-agents: notes from a home-lab rabbit hole
A few months ago I started an evening project: running local language models on my own hardware for routine coding tasks, while delegating the harder architectural thinking to cloud models. The theory was nice — cheap inference for the easy stuff, smart inference only when you need it. The practice has been an education in hardware trade-offs, quantization artifacts, and just how much work is hiding behind the word "routine".
The premise
A decent chunk of day-to-day coding is mechanical. Rename this field across 40 files. Add logging to these three methods. Convert this Scala 2 code to Scala 3 syntax. Generate boilerplate for a new data class. These are tasks where I don't need a 200B-parameter frontier model — I need a competent tool that can read code, understand intent, and generate correct changes.
Meanwhile, the genuinely hard stuff — "should this be one service or two?", "what's the right API shape for this flow?", "is this bug a symptom of a deeper problem?" — benefits from the best available model, even if it costs more per token. The natural split: local for the boring, cloud for the interesting.
The hardware rabbit hole
This is where things got complicated. For local inference, you're basically optimizing three things at once: model quality (more parameters generally means better output), inference speed (tokens/sec), and memory footprint (can the quantized weights even fit). On consumer hardware in 2025, the trade-offs look roughly like this, with a back-of-the-envelope sizing sketch after the list:
- 24 GB VRAM (RTX 4090, 3090): comfortable for a 7B–14B model at reasonable quantization (Q4_K_M, Q5_K_M). Runs Qwen Coder 14B well, and 32B models at Q3 if you're patient.
- 48 GB VRAM (RTX 6000 Ada, or 2×3090): opens up 32B–34B models at better quantization and some 70B at Q2/Q3. Coding-specific 32B models are the sweet spot here.
- Apple Silicon (M2/M3 Max with 64+ GB): surprisingly capable thanks to unified memory, slower than consumer NVIDIA but no CUDA pain.
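The memory column of that list is mostly arithmetic: weights take parameters × bits-per-weight / 8 bytes, plus a few GB of KV cache on top. Here's a rough sizing sketch in Scala; the ~4.7 bits/weight for Q4_K_M and ~3.5 for Q3-class quants are approximations, and KV-cache overhead is deliberately left out because it varies with context length and architecture.

```scala
// Back-of-the-envelope VRAM needed for quantized weights alone.
// Assumptions: Q4_K_M averages ~4.7 bits per weight, Q3-class ~3.5;
// KV cache (a few extra GB, growing with context) is ignored here.
object VramSketch {
  def weightsGiB(billionsOfParams: Double, bitsPerWeight: Double): Double =
    billionsOfParams * 1e9 * bitsPerWeight / 8 / math.pow(1024, 3)

  def main(args: Array[String]): Unit = {
    val cases = Seq(
      ("14B @ Q4_K_M", 14.0, 4.7),
      ("32B @ Q4_K_M", 32.0, 4.7),
      ("32B @ Q3",     32.0, 3.5)
    )
    for ((label, params, bpw) <- cases)
      println(f"$label%-14s ~ ${weightsGiB(params, bpw)}%.1f GiB of weights")
  }
}
```

By these numbers a 14B model at Q4_K_M is roughly 8 GiB of weights, which is why it's comfortable on a 24 GB card, while a 32B model at Q4-class quantization only gets comfortable once you have 48 GB to play with.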
I've been going back and forth on whether to build a CUDA desktop or stay portable with a mini-PC form factor. The answer changes depending on whether fine-tuning is a regular part of the workflow: CUDA matters a lot for fine-tuning, much less for pure inference at batch size 1.
What actually works
After a few months of trial, the setup that's earned its keep:
A local coding model (~14–32B) running in a background daemon, exposed over an OpenAI-compatible HTTP API. I use llama.cpp's server mode for this — stable, easy to monitor, decent throughput. The daemon stays loaded so there's no cold-start cost when I hit it from an editor plugin.
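For the plumbing, a minimal client looks like the sketch below. It assumes the daemon is listening on localhost:8080 and speaking llama.cpp's OpenAI-compatible /v1/chat/completions route; the port, the placeholder model name, and the helper names are mine, not anything the server dictates.

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

// Minimal client for the local daemon. Assumes llama.cpp's server is running
// on localhost:8080 with its OpenAI-compatible /v1/chat/completions endpoint;
// "local" is a placeholder model name.
object LocalLlm {
  private val client = HttpClient.newHttpClient()

  def complete(prompt: String): String = {
    val body =
      s"""{"model":"local","messages":[{"role":"user","content":${escape(prompt)}}],"temperature":0.2}"""
    val req = HttpRequest.newBuilder()
      .uri(URI.create("http://localhost:8080/v1/chat/completions"))
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString(body))
      .build()
    // Returns the raw JSON; a real plugin would parse out choices[0].message.content.
    client.send(req, HttpResponse.BodyHandlers.ofString()).body()
  }

  // Crude JSON string escaping, enough for a sketch.
  private def escape(s: String): String =
    "\"" + s.flatMap {
      case '"'  => "\\\""
      case '\\' => "\\\\"
      case '\n' => "\\n"
      case c    => c.toString
    } + "\""
}
```

Keeping the client this dumb is deliberate: all the interesting decisions live in the routing layer, not here.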
A routing layer in my editor that picks which model to hit. Small, mechanical requests go local; anything that needs chain-of-thought or broad context goes to a cloud frontier model. The heuristic is crude — prompt length, presence of certain keywords, explicit user choice — but crude works fine for an evening-project tool.
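To give a sense of how crude "crude" is, here's roughly the shape of that heuristic. The length threshold and keyword list are illustrative stand-ins rather than my exact values, and an explicit user choice always overrides them.

```scala
// Roughly the routing heuristic. The threshold and keyword list are
// illustrative stand-ins; an explicit user choice always wins.
sealed trait Target
case object Local extends Target
case object Cloud extends Target

object Router {
  private val cloudKeywords =
    Seq("architecture", "design", "why does", "trade-off", "api shape")

  def route(prompt: String, forced: Option[Target] = None): Target =
    forced.getOrElse {
      val lower      = prompt.toLowerCase
      val tooLong    = prompt.length > 8000                  // big context -> cloud
      val needsDepth = cloudKeywords.exists(k => lower.contains(k))
      if (tooLong || needsDepth) Cloud else Local
    }
}
```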
A prompt cache with an LLM-aware hash. Routine refactors tend to repeat; caching responses for identical prompts (normalized to ignore whitespace and irrelevant context) saves a surprising amount of compute on the local model.
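A sketch of the cache, under the assumption that "normalized" means collapsing whitespace and dropping volatile context lines; the real rules are fussier, and the // context: marker below is a made-up convention for illustration.

```scala
import java.security.MessageDigest
import scala.collection.concurrent.TrieMap

// Prompt cache keyed on a normalized hash. The normalization (collapse
// whitespace, strip hypothetical "// context:" preamble lines) is a sketch
// of the idea, not the exact rules the plugin applies.
object PromptCache {
  private val cache = TrieMap.empty[String, String]

  private def normalize(prompt: String): String =
    prompt.linesIterator
      .filterNot(_.trim.startsWith("// context:"))   // drop volatile context lines
      .mkString("\n")
      .replaceAll("\\s+", " ")
      .trim

  private def key(prompt: String): String =
    MessageDigest.getInstance("SHA-256")
      .digest(normalize(prompt).getBytes("UTF-8"))
      .map("%02x".format(_)).mkString

  def getOrCompute(prompt: String)(run: String => String): String =
    cache.getOrElseUpdate(key(prompt), run(prompt))
}
```

Wired together, a local request becomes PromptCache.getOrCompute(prompt)(LocalLlm.complete), and repeat refactors stop costing anything at all.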
What doesn't
Local models are genuinely bad at a few things, and pretending otherwise led to wasted evenings:
Long context. Local 14–32B models degrade noticeably past ~32k tokens. If I need to reason about a repository-sized context, I send it to the cloud. Usable context lengths are improving fast, but as of now this isn't an area where local is competitive.
Novel reasoning. Anything that's basically a math problem dressed as code — "figure out why this recursive algorithm is off by one" — is where frontier cloud models still substantially outperform anything I can run locally. Not even close, honestly.
"Pretending to be a senior engineer." Local models are OK at generating code. They're bad at pushing back when my request is wrong, and bad at suggesting I'm solving the wrong problem. I've stopped relying on them for anything adjacent to design decisions.
The meta-lesson
The thing I didn't expect: having a local model available at zero marginal cost changes my habits. I'll ask for things I wouldn't have bothered asking a cloud model for — "here are three functions, find the subtle bug" — just because the round trip is free. That's been a quiet productivity win, independent of model quality.
Is it worth the setup cost? For someone who enjoys home-lab tinkering, yes. For someone who just wants working AI coding assistance, probably not — the cloud-only path is still the right default in 2025. But the gap is narrowing, and I expect this to flip within a few generations of hardware and open-model releases.