← All posts

Running LLMs Locally: No GPU vs Integrated GPU vs Dedicated GPU

What kind of computer do you actually need to run LLMs locally? That was the question I wanted to answer, not with theory, but with real machines, real models, and real tokens per second.

I ran the same local language models on three very different setups:

  1. A regular laptop with no dedicated GPU
  2. A mini PC with an integrated GPU
  3. A gaming laptop with a dedicated NVIDIA RTX GPU

The idea was simple: follow the progression no GPU → integrated GPU → dedicated GPU and see what each tier actually feels like in daily use. This is not a lab-grade benchmark. I was not trying to publish a perfect scientific comparison. I wanted something more practical: run real local models on real hardware and answer the question that matters most to me: can this machine run the model in a way that feels usable? Because technically running a model is one thing; actually wanting to use it every day is another.

The Machines

ComputerHardware TypeRole in the Test
Dell XPS 13 Plus 9320CPU/RAM onlyBaseline machine
GMKtec K8 PlusIntegrated GPUSmall local AI / home server machine
ASUS ROG G16Dedicated NVIDIA RTX GPUPerformance machine

The Dell is the kind of laptop a lot of people already own: thin, portable, no discrete graphics card. The GMKtec K8 Plus is a compact mini PC with a modern Ryzen CPU, 32 GB of RAM, and integrated graphics. It represents the “small always-on local AI box” idea that keeps popping up in home lab conversations. The ASUS ROG G16 is the dedicated GPU option, where inference can lean on NVIDIA acceleration and actual VRAM instead of fighting for system RAM.

The Models

I tested three Qwen models:

ModelSize CategoryWhy I Tested It
Qwen3 4BSmall modelPractical everyday local model
Qwen3 14BMedium modelTests memory and usability limits
Qwen3 30B A3B MoELarger MoE modelStress test with a bigger architecture

All models ran through LM Studio using GGUF weights. The prompt was the same across every test: “Explain local AI tradeoffs in 5 short paragraphs.”

I kept the prompt controlled on purpose. If one run produces a wall of text and another produces two sentences, the tokens/sec numbers become harder to compare. I wanted each machine doing roughly the same work.

What I Measured

The main metric was tokens per second. It is not the only thing that matters, but it is a very practical signal, a quick way to understand whether the model feels snappy or painful. I also paid attention to CPU usage, RAM usage, GPU or iGPU usage, and whether the machine stayed responsive while the model was running.

That last part matters more than people think. A model can technically run while the rest of the system crawls. That might be fine for a one-off experiment, but it is not fine if you want local AI to become part of your workflow.

The Results

ModelDell XPS 13 Plus (CPU/RAM only)GMKtec K8 Plus (iGPU)ASUS ROG G16 (RTX GPU)
Qwen3 4B5 tokens/sec21 tokens/sec74 tokens/sec
Qwen3 14B1 token/sec6 tokens/sec14 tokens/sec
Qwen3 30B A3B MoE4 tokens/sec11 tokens/sec31 tokens/sec

The pattern is pretty clear. The Dell can run the models, but it is clearly limited by CPU and system RAM. The K8 Plus is a major step up, especially on the 4B and 30B A3B MoE runs. The ASUS with a dedicated RTX GPU wins every test, and more importantly, it is the only setup where larger models start to feel like something you would actually keep open all day.

CPU/RAM Only: It Works, But Stay Small

The Dell XPS 13 Plus was my baseline. With Qwen3 4B, it hit around 5 tokens/sec, not fast, but usable enough for basic experimentation. Summarizing a paragraph, rewriting text, brainstorming ideas, testing a local workflow without sending data to the cloud: all of that works.

Then I tried Qwen3 14B, and the experience changed completely. About 1 token/sec. The model runs, technically, but waiting that long for every response changes how you interact with it. You stop treating it like a tool and start treating it like a patience test.

The interesting result was Qwen3 30B A3B MoE, which landed around 4 tokens/sec, faster than the dense 14B model in this test. That does not mean the Dell is suddenly a great machine for bigger models. It still chews through memory, and the overall experience is limited. But it is a good reminder that model architecture matters, not just the headline parameter count.

My takeaway for CPU/RAM-only machines is simple: a regular laptop can run local LLMs, but the sweet spot is small models. I would focus on something in the 3B–4B range, maybe 7B depending on quantization and how much RAM you have. Once you push into larger dense models, it becomes more of a technical experiment than a comfortable daily workflow.

Integrated GPU: A Practical Middle Ground

The GMKtec K8 Plus was the most interesting machine in this test for me. It is not a gaming PC and it does not have a dedicated GPU, but it has a modern Ryzen chip, 32 GB of RAM, and integrated graphics, and that combination punched above what I expected from a box this small.

At 21 tokens/sec on Qwen3 4B, the experience feels genuinely practical. At 6 tokens/sec on the 14B model, it is still slow, but much more usable than the Dell. And the 30B A3B MoE result at 11 tokens/sec was the one that really made me pause. For a mini PC with integrated graphics, that is a useful number.

This is where the K8 Plus starts to make sense beyond raw speed. It is compact, quiet enough to leave on, and well suited for things like LM Studio, Ollama, Open WebUI, n8n, local agents, personal automation, and private workflows you want running at home.

The catch is that an integrated GPU is not a dedicated GPU. The iGPU shares system memory and does not get its own VRAM pool like an NVIDIA card, so memory pressure matters a lot, especially if the machine is also running other services. Still, for small and medium local models, home automation, and always-on AI experiments, the K8 Plus sits in a really interesting middle ground: much better than CPU-only inference, not in the same class as a dedicated RTX GPU, but compelling on its own terms.

Dedicated GPU: Performance Still Wins

The ASUS ROG G16 with a dedicated NVIDIA RTX GPU was the performance machine, and the numbers made that obvious: 74 tokens/sec on Qwen3 4B, 14 tokens/sec on Qwen3 14B, and 31 tokens/sec on Qwen3 30B A3B MoE.

At 74 tokens/sec on the 4B model, local AI stops feeling like a demo and starts feeling like a real tool. The 14B model at 14 tokens/sec is more than twice as fast as the K8 Plus. The 30B A3B MoE model at 31 tokens/sec is the kind of result that makes larger local models actually viable for longer sessions. This is the biggest practical difference with a dedicated GPU: the model does not just run, it feels responsive enough that you keep coming back to it.

If a local model is too slow, you might try it once or twice. If it answers quickly, it starts to feel like something worth building workflows around: coding help, faster iteration, longer chat sessions, more ambitious local experiments. That does not mean everyone needs a dedicated GPU, but if performance is the goal, especially with larger models, an RTX card is still the clearest answer.

Why Was the 30B MoE Faster Than the 14B Model?

One result surprised me: Qwen3 30B A3B MoE was faster than Qwen3 14B on all three machines. At first glance, that looks backwards. Thirty billion sounds much bigger than fourteen billion.

The key detail is that this is a Mixture-of-Experts model. A MoE model can have a large total parameter count while only activating a subset of those parameters during generation. The “A3B” part means the active parameter count is much smaller than the full model size. That does not make the model free to run. Loading it can still be heavy, and memory usage still matters, but it helps explain why tokens/sec can beat expectations compared with a dense model of similar “headline” size.

Model size alone does not tell the full story. Architecture, quantization, memory, the inference backend, and hardware all play a role. If you are still getting oriented on how these models work under the hood, my earlier posts on what is actually happening inside an LLM and how Transformers scaled into today’s AI boom are good background reading.

Practical Takeaways

Hardware TypeBest ForMain Limitation
CPU/RAM onlyLearning, testing, small modelsSlow with larger models
Integrated GPUHome server, automation, small/medium modelsShared memory, limited GPU power
Dedicated RTX GPUBest performance, larger models, coding workflowsMore expensive, higher power draw

If you only have a regular laptop, you can still start. Use small models and keep expectations realistic. If you want a compact machine that can stay on and run local AI tools, something like the K8 Plus is a genuinely interesting option. If you want the best performance, especially with larger models, a dedicated GPU is still the better choice.

Local LLMs vs Frontier Models

One thing worth saying clearly: local LLMs are useful, but they are not the same thing as frontier models from OpenAI, Anthropic, or Google. Small local models can do a lot of practical work: summarization, rewriting, brainstorming, simple coding help, automation, private workflows, learning, experimentation. For many day-to-day tasks, they are already good enough.

But when we talk about very complex reasoning, very long context, advanced tool use, and frontier-level intelligence, that is a different scale entirely. Those models run on serious infrastructure: large GPU clusters and model sizes far beyond what a normal laptop or mini PC can realistically handle.

I do not see local LLMs as a full replacement for cloud AI. I see them as a different tool, valuable for privacy, learning, experimentation, automation, and smaller practical tasks, while frontier cloud models still have a big advantage on the hardest reasoning problems.

Final Thoughts

So what kind of computer do you actually need to run LLMs locally? It depends on what you want to do. If you want to learn and experiment, a regular laptop is enough. Start with small models. If you want a compact always-on local AI machine, a mini pc K8 Plus-style setup is a compelling middle ground. If you want the best performance and larger models, a dedicated GPU is still the clear winner.

The biggest lesson for me is that “running LLMs locally” is not one single experience. The model, RAM, GPU, VRAM, software stack, and use case all matter, and your use case matters most of all. Small models are already useful. Medium models start to expose the limits of your hardware. Bigger models quickly explain why GPU power matters, and why the gap between a laptop, a mini PC, and a dedicated GPU machine is so much larger than the spec sheets suggest.