Building the right hardware for running powerful AI models locally is the single most consequential technical decision you’ll make as an AI practitioner in 2026. The difference between a system that handles a 70B parameter model at a usable 25 tokens per second and one that crawls at 3 tokens per second with constant RAM swapping isn’t luck — it’s component selection. This guide on hardware for running powerful AI models locally dissects every layer of the stack: GPUs, CPUs, motherboards, RAM, storage, and power delivery, with real benchmarks, real build costs, and documented real-world setups from researchers and engineers who’ve already done this.
We are not talking about running Llama 3 8B on a laptop with 16GB of RAM. We’re talking about locally hosting 70B, 405B, and 671B parameter models with enough throughput to be genuinely productive — or even serve multiple users simultaneously. The hardware for running powerful AI models locally that we cover here is available for purchase today, without institutional procurement, and this guide will show you exactly what to buy.
Why Run Powerful AI Models Locally?
Before specifying hardware, the question of why deserves a direct answer. Local AI inference is not a hobbyist exercise. The reasons professionals and organizations run powerful AI models on local hardware are substantive:
- Data privacy and sovereignty: No tokens, prompts, or completions leave your physical premises. For legal, medical, financial, and government workloads, this is non-negotiable.
- Cost at scale: Running GPT-4-class models via API at production volumes — thousands of calls per day — costs hundreds to thousands of dollars monthly. A one-time hardware investment amortizes over 3–5 years.
- Latency control: With local inference, your network RTT is your LAN latency. Streaming responses begin in milliseconds, not 300–2,000ms round-trips to API endpoints.
- Model customization: Fine-tuning, LoRA adapters, quantization to specific bpw targets, GGUF format selection — all of these require direct hardware access.
- Capability ceiling: As of April 2026, the open-weight models available (Llama 3.1 405B, DeepSeek R1 671B, Qwen 2.5 72B) approach or match GPT-4-level performance on many benchmarks. The argument that “cloud models are better” no longer holds categorically.
The Physics of Local AI: What Actually Limits Performance
Understanding why certain components matter more than others for hardware for running powerful AI models locally requires understanding the physical bottleneck in large model inference. Unlike gaming, where GPU compute (TFLOPS) is the primary limiter, LLM inference at typical batch sizes of 1–4 is almost always memory bandwidth bound, not compute bound.
When you generate a single token with a 70B model, the GPU must stream approximately 40GB of model weights from VRAM through the arithmetic units. At the RTX 5090’s 1,792 GB/s bandwidth, this takes roughly 22ms — a theoretical ceiling of ~45 tokens per second before any overhead. At the RTX 4090’s 1,008 GB/s, the ceiling drops to ~25 tokens per second. This is why VRAM bandwidth often predicts LLM inference speed better than TFLOPS.
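This back-of-envelope ceiling is easy to reproduce: divide weight size by memory bandwidth. The sketch below does exactly that and ignores kernel overhead, KV-cache reads, and prompt processing, so real-world numbers land somewhat lower.

```python
# Rough decode-speed ceiling for memory-bandwidth-bound inference: every
# generated token streams the full weight set from VRAM once.

def decode_ceiling_tps(weights_gb: float, bandwidth_gbps: float) -> float:
    """Theoretical upper bound on tokens/sec for batch-1 decoding."""
    return bandwidth_gbps / weights_gb

if __name__ == "__main__":
    q4_70b_gb = 40  # ~40 GB of Q4 weights for a 70B model
    for card, bw in [("RTX 5090", 1792), ("RTX 4090", 1008), ("RTX 3090", 936)]:
        print(f"{card}: ~{decode_ceiling_tps(q4_70b_gb, bw):.0f} tok/s ceiling")
    # RTX 5090: ~45, RTX 4090: ~25, RTX 3090: ~23 tok/s before any overhead
```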
The second physical constraint is VRAM capacity. If the model doesn’t fit in VRAM, it must be partially offloaded to system RAM. The RTX 5090’s GDDR7 delivers 1,792 GB/s; DDR5 system RAM delivers roughly 90–150 GB/s. A model that’s 20% offloaded to RAM can drop inference speed by 40–60% — not 20%. The penalty is disproportionate because the bottleneck shifts from VRAM to DRAM for every layer that lands in system memory.
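The disproportionate penalty falls out of the same arithmetic: per-token time is the VRAM-resident portion streamed at GPU bandwidth plus the offloaded portion streamed at DRAM bandwidth. A minimal sketch with illustrative bandwidth figures — note this strictly serial model is pessimistic, since llama.cpp overlaps some CPU and GPU work, which is why measured penalties land closer to the 40–60% range:

```python
# Per-token latency with partial offloading: a serial two-pass model where the
# resident fraction streams at VRAM bandwidth and the rest at DRAM bandwidth.

def offloaded_tps(weights_gb: float, offload_frac: float,
                  vram_bw: float = 1792.0, dram_bw: float = 100.0) -> float:
    """Approximate tokens/sec when offload_frac of the weights sit in system RAM."""
    t_gpu = weights_gb * (1.0 - offload_frac) / vram_bw
    t_cpu = weights_gb * offload_frac / dram_bw
    return 1.0 / (t_gpu + t_cpu)

if __name__ == "__main__":
    for frac in (0.0, 0.1, 0.2):
        print(f"{frac:.0%} offloaded: ~{offloaded_tps(40, frac):.0f} tok/s")
    # 0%: ~45 tok/s, 10%: ~17 tok/s, 20%: ~10 tok/s on a 5090-class card
```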
The third constraint, often underestimated, is inter-GPU bandwidth in multi-GPU setups. PCIe 5.0 x16 provides ~64 GB/s of bandwidth. NVLink (available on the RTX 3090 and Ampere-era professional GPUs) provides 112 GB/s on the 3090. For models split across two GPUs, the communication overhead between cards during the attention mechanism can account for 30–50% of total latency if bandwidth is insufficient.
This framework — bandwidth first, capacity second, inter-GPU communication third — should guide every component decision in this guide.
The GPU Tier List: Hardware for Running Powerful AI Models Locally (2026)
Hardware for running powerful AI models locally starts and ends with the GPU. No other component decision will have as much impact on your inference speed and maximum model size. Here is the complete tier list of what’s available for purchase today.
Tier S: NVIDIA RTX 5090 — The Consumer King
The RTX 5090 (Blackwell architecture, January 2025) is the undisputed single-card champion for local AI inference in 2026. Its specifications set a new ceiling for consumer hardware:
| Specification | RTX 5090 | RTX 4090 | Delta |
|---|---|---|---|
| Architecture | Blackwell (GB202) | Ada Lovelace (AD102) | — |
| CUDA Cores | 21,760 | 16,384 | +33% |
| Tensor Cores (5th Gen) | 680 | 512 (4th Gen) | +33% |
| VRAM | 32 GB GDDR7 | 24 GB GDDR6X | +33% |
| Memory Bandwidth | 1,792 GB/s | 1,008 GB/s | +78% |
| FP16 TFLOPS | 838 | 330 | +154% |
| TDP | 575W | 450W | +28% |
| PCIe Interface | PCIe 5.0 x16 | PCIe 4.0 x16 | Gen+1 |
| MSRP | $1,999 | $1,599 | +$400 |
| Street Price (Apr 2026) | $3,500–$4,200 | $1,400–$1,700 | +~$2,000 |
Real inference numbers (April 2026, llama.cpp, Q4_K_M quantization):
- Llama 3 8B: ~213 tokens/sec (prompt processing: 11,796 tokens/sec)
- Qwen 2.5 7B: ~274 tokens/sec (decode)
- Llama 3 70B (quantized aggressively to fit in 32GB): ~45–61 tokens/sec
- DeepSeek R1 32B: ~65 tokens/sec (Q4_K_M, easily fits in 32GB)
- Qwen2.5-Coder-7B at batch-8: 5,841 tokens/sec — 2.6× faster than an A100 80GB
Key limitation: The RTX 5090 does not support NVLink in its consumer form factor. Multi-GPU setups require PCIe communication only, which limits scaling efficiency. The card also requires a PCIe 5.0 16-pin (12V-2×6) connector and a PSU rated for at least 575W GPU draw plus the rest of the system.
Who should buy it: Anyone who needs maximum single-card performance without the cost and complexity of a multi-GPU professional workstation. The 32GB VRAM is sufficient for most 70B quantized models at Q3–Q4 level with minimal context overhead.
Tier A: NVIDIA RTX 4090 — The Proven Workhorse
The RTX 4090 remains the best value proposition for local AI inference in 2026 when purchased at its current market price of $1,400–$1,700. It has an enormous ecosystem of tested configurations, proven stability over 18+ months of AI workload operation, and the 24GB GDDR6X is sufficient for the vast majority of practical use cases below 70B parameters.
| Model | Quantization | Fits in 24GB? | Speed (tok/s) |
|---|---|---|---|
| Llama 3 8B | Q4_K_M | ✅ Yes (5.5GB) | ~118–130 |
| Mistral 7B | Q4_K_M | ✅ Yes (4.4GB) | ~125–140 |
| DeepSeek R1 32B | Q4_K_M | ✅ Yes (18.5GB) | ~35–45 |
| Llama 3 70B | Q4_K_M | ❌ 43GB required | ~10–15 (with CPU offload) |
| Llama 3 70B | Q2_K | ⚠️ Tight (21GB) | ~18–25 (with quality loss) |
| DeepSeek R1 70B | Q4_K_M | ❌ 43GB required | ~8–12 (with offload) |
The 24GB VRAM ceiling on the RTX 4090 means that 70B models at useful quality levels require either significant quantization quality compromise, CPU offloading with major throughput penalties, or a second GPU. For users primarily working with models up to 34B parameters, the RTX 4090 is fully adequate and represents better cost-per-token than the RTX 5090 at street prices.
Tier A: Dual RTX 3090 with NVLink — The Budget 48GB Solution
The RTX 3090 is the last consumer NVIDIA GPU to support NVLink, and this detail transforms what would otherwise be an aging card into a compelling 2026 option for specific workloads. A dual RTX 3090 NVLink setup provides:
- 48GB combined VRAM with 112 GB/s inter-GPU bandwidth via NVLink bridge
- Comfortable full-quality Q4 inference of 70B models at 25–35 tokens/sec
- Used card prices of $550–$850 each (April 2026), making the GPU pair cost $1,100–$1,700
- Full compatibility with llama.cpp tensor parallelism (`--tensor-split` flag)
The NVLink bridge for RTX 3090 costs approximately $100–$150 and physically connects two cards in adjacent slots. Without NVLink, dual RTX 3090s would communicate via PCIe (up to ~32 GB/s), which introduces significant bottleneck for large matrix operations during attention computation. With NVLink at 112 GB/s, the inter-GPU penalty is dramatically reduced.
Ahmad Osman’s 8×RTX 3090 basement server (documented July 2024) is the most extreme public example of this architecture. Running 8 RTX 3090s with 192GB total VRAM on an ASRock Rack ROMED8-2T with AMD EPYC Milan 7713 (64 cores), 512GB DDR4 RAM, and three 1600W PSUs, this system can host Llama 3.1 405B at Q4 quantization with real throughput. Total build cost: approximately $12,000–$15,000.
Tier A+: NVIDIA RTX 6000 Ada Generation — The Professional 48GB Card
The RTX 6000 Ada is NVIDIA’s workstation GPU based on the same Ada Lovelace die as the RTX 4090, but with critical differences that matter for sustained AI workloads:
| Feature | RTX 6000 Ada | RTX 4090 | Advantage |
|---|---|---|---|
| VRAM | 48 GB GDDR6 ECC | 24 GB GDDR6X | 2× capacity, ECC protection |
| Memory Bandwidth | 960 GB/s | 1,008 GB/s | RTX 4090 slightly faster |
| TDP | 300W | 450W | RTX 6000 Ada (33% less power) |
| FP32 TFLOPS | 91.1 | 82.6 | RTX 6000 Ada (+10%) |
| L2 Cache | 96 MB | 72 MB | RTX 6000 Ada (+33%) |
| vGPU Support | Yes | No | RTX 6000 Ada |
| NVLink | No (dropped for the Ada generation) | No | — |
| Thermal design | Blower (dual-slot, rear exhaust) | Triple-fan (open air) | Workstation cases: RTX 6000 Ada |
| Purchase price | ~$6,500–$8,000 | ~$1,400–$1,700 | RTX 4090 (4–5× cheaper) |
LLM inference benchmarks (Q4_K_M, llama.cpp):
- Llama 3 8B: 131 tokens/sec (RTX 6000 Ada) vs 113 (L40S) vs 110 (RTX 4090)
- Llama 3 70B: 18.4 tokens/sec (RTX 6000 Ada) — fully in VRAM, no offloading needed
- At FP16 precision (Llama 3 8B): 52 tokens/sec — superior to RTX 4090 at full precision
The RTX 6000 Ada’s 48GB with ECC enables reliable operation for 70B models at Q4 quality without any CPU offloading, while the RTX 4090 must offload or use more aggressive quantization. For teams running 70B inference continuously on professional workloads, the premium may justify itself through reliability (ECC protects against bit-flip corruption) and lower power draw.
Tier B: NVIDIA L40S — The Datacenter Card That Fits at a Desk
The L40S is NVIDIA’s inference-optimized Ada Lovelace card, positioned between the RTX 6000 Ada and A100 in the product stack. With 48GB GDDR6 and 864 GB/s bandwidth, it’s slightly slower than the RTX 6000 Ada for LLM inference per token but shares the same VRAM capacity tier.
- Purchase price: $7,000–$12,000 (new), $4,000–$7,000 (used/refurbished)
- Llama 3 70B (Q4_K_M): 15.3 tokens/sec (vs 18.4 on RTX 6000 Ada)
- Dual L40S (96GB combined): Runs Q8 Llama 3 70B fully in VRAM at ~22 tokens/sec
- 4× L40S (192GB): Can run full FP16 Llama 3 70B or quantized 405B models
- Passive cooling; requires workstation chassis with proper airflow management
- No NVLink (dropped for the Ada generation); multi-card configurations communicate over PCIe
The L40S is the workhorse of enterprise AI inference deployments. For individual buyers willing to spend $8,000–$15,000 on a single GPU card, it offers production-grade reliability and the expanded VRAM headroom that consumer cards cannot match.
Tier B: AMD Radeon RX 7900 XTX — The Open-Source Alternative
AMD’s flagship consumer card offers 24GB GDDR6 at $800–$950 retail, making it the most affordable 24GB option available. The RX 7900 XTX runs LLM inference via ROCm (AMD’s CUDA equivalent) through llama.cpp, Ollama, and vLLM with AMD ROCm support.
Performance caveats are significant: ROCm support remains less mature than CUDA, optimization paths such as FlashAttention builds and custom kernels for quantized inference lag behind or are unavailable, and real-world LLM throughput on the 7900 XTX is typically 40–60% of what an RTX 4090 achieves despite similar VRAM capacity. As of vLLM v0.16 (March 2026), AMD ROCm support has become “first-class,” but CUDA still leads in optimized inference kernels for most quantization formats.
For users committed to open-source, privacy-first toolchains who are also running Linux, the RX 7900 XTX provides a cost-effective entry into 24GB VRAM inference without NVIDIA’s driver ecosystem.
Multi-GPU Architectures: When One Card Is Not Enough
Running hardware for powerful AI models locally at 70B+ parameters at acceptable quality and speed often requires multiple GPUs. There are three fundamentally different multi-GPU architectures, each with different trade-offs in complexity, cost, and performance.
Architecture 1: NVLink Consumer (2×RTX 3090)
The RTX 3090 is unique in the consumer GPU market: it’s the only modern consumer card supporting NVLink. The NVLink HB bridge for dual RTX 3090 provides 112 GB/s bidirectional bandwidth — 3.5× the bandwidth of PCIe 4.0 x16.
Why this matters for LLMs: During transformer attention computation, model layers are split across GPUs. Each forward pass requires exchanging activations between cards. At 112 GB/s vs 32 GB/s, NVLink reduces communication bottleneck by 3.5×, which translates directly to higher tokens per second for models that don’t fit in a single card’s VRAM.
Practical build constraints:
- Requires a motherboard with two PCIe x16 slots physically close enough for the NVLink bridge (typically 2–3 slots apart)
- Both GPUs must be the same model (an RTX 3090 cannot be NVLink-bridged to a 3090 Ti)
- Consumes 2× 350W TDP = 700W GPU draw minimum; plan for a 1,200W+ PSU once CPU and platform draw are included
- Used card availability declining as supply ages out of the market
Expected performance on 70B models (Q4_K_M): 28–35 tokens/sec — better than a single RTX 6000 Ada, at a fraction of the GPU cost.
Architecture 2: PCIe Multi-GPU (2–4× RTX 4090 or RTX 5090)
Without NVLink, modern consumer GPUs (RTX 4090, RTX 5090) communicate via PCIe only. PCIe 5.0 x16 provides ~64 GB/s bidirectional bandwidth — a significant improvement over PCIe 4.0’s 32 GB/s, but still well below NVLink. llama.cpp supports tensor parallelism across PCIe-only multi-GPU setups via the --tensor-split parameter.
The efficiency of PCIe multi-GPU for LLM inference depends heavily on how the model’s layers are distributed. For 70B models across 2× RTX 5090 (64GB combined VRAM), you can avoid any CPU offloading and run the full model in GPU memory. The PCIe communication overhead for attention is real but acceptable — typically 15–25% throughput reduction vs. theoretical peak.
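A minimal launch sketch for a PCIe-only dual-GPU split — assuming a recent llama.cpp build whose server binary is named `llama-server`, and a hypothetical GGUF path; weight the split toward the card with more free VRAM if the GPUs differ:

```python
# Sketch: launch a llama.cpp server across two PCIe-connected GPUs.
# Assumes llama-server is on PATH; the GGUF path below is hypothetical.
import subprocess

MODEL = "/models/llama-3-70b-instruct.Q4_K_M.gguf"  # hypothetical path

cmd = [
    "llama-server",
    "-m", MODEL,
    "--n-gpu-layers", "999",       # push all layers onto the GPUs
    "--tensor-split", "0.5,0.5",   # even split across two identical cards
    "--ctx-size", "8192",
    "--host", "127.0.0.1",
    "--port", "8080",
]
subprocess.run(cmd, check=True)
```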
2× RTX 5090 (64GB combined VRAM) expected performance:
- Llama 3 70B (Q4_K_M, full in VRAM): 70–85 tokens/sec (combined throughput)
- DeepSeek R1 70B (Q4_K_M): 65–80 tokens/sec
- Llama 3.1 405B (Q2_K, ~115GB): Cannot fit — requires additional GPUs or heavy CPU offload
- Multi-user serving with vLLM: 4–6 concurrent users at 15–20 tok/s each
Cost: 2× RTX 5090 at street prices costs $7,000–$8,400 in GPUs alone. Add motherboard, Threadripper CPU, RAM, and PSU and a dual RTX 5090 workstation approaches $12,000–$15,000 total.
Architecture 3: Professional GPU Arrays (4–8× Datacenter Cards)
At the highest tier, 4–8 professional GPUs (L40S, RTX 6000 Ada, A100) provide the VRAM pool necessary to run the largest open-weight models at full quality. An 8× L40S configuration provides 384GB VRAM — enough for Llama 3.1 405B at Q4 without any offloading, or DeepSeek R1 671B at aggressive quantization.
These configurations require server-grade hardware: EPYC or Threadripper Pro platforms with sufficient PCIe lanes, server cases with proper airflow for blower-fan datacenter cards, and three or more 1600W PSUs. The documented 8× RTX 3090 basement server (Ahmad Osman, 2024) cost approximately $12,000–$15,000 using used consumer cards. A 4× L40S new configuration would cost $28,000–$48,000 in GPUs alone.
GPU Comparison Matrix: At a Glance
| GPU | VRAM | Bandwidth | 70B Q4 Speed | NVLink | Price (2026) | Best For |
|---|---|---|---|---|---|---|
| RTX 5090 | 32GB GDDR7 | 1,792 GB/s | ~45 tok/s* | No | $3,500–4,200 | Up to 32B optimal, 70B with heavy quant |
| RTX 4090 | 24GB GDDR6X | 1,008 GB/s | ~10 tok/s** | No | $1,400–1,700 | Up to 34B optimal |
| 2× RTX 3090 NVLink | 48GB combined | 936 GB/s each | 28–35 tok/s | Yes (112 GB/s) | $1,100–1,700 | 70B sweet-spot value |
| RTX 6000 Ada | 48GB GDDR6 | 960 GB/s | 18.4 tok/s | No | $6,500–8,000 | Professional 70B inference, ECC |
| L40S | 48GB GDDR6 | 864 GB/s | 15.3 tok/s | No | $7,000–12,000 | Sustained production inference |
| 2× RTX 5090 | 64GB (PCIe) | 1,792 GB/s ea | 70–85 tok/s | No | $7,000–8,400 | 70B high-speed, 405B partial |
| RX 7900 XTX | 24GB GDDR6 | 960 GB/s | ~6 tok/s** | No | $800–950 | Budget, Linux-native, ROCm |
* With Q3_K_S quantization; 70B at Q4 requires ~43GB which exceeds 32GB VRAM, necessitating CPU offloading that significantly reduces speed.
** With CPU offloading of layers that don’t fit in 24GB VRAM.
CPUs for Local AI Workstations: The Foundation That Determines Your Ceiling
When choosing hardware for running powerful AI models locally, the CPU’s role in local AI inference is often misunderstood. For pure GPU inference where the entire model fits in VRAM, the CPU is largely idle — a modern consumer CPU is sufficient. The CPU becomes critical in three specific scenarios:
- CPU offloading: When model layers must be computed in system RAM instead of VRAM, the CPU processes those layers. Cache size, memory bandwidth, and core count all directly impact offloaded-layer throughput.
- Multi-GPU setups: PCIe lane count determines how many GPUs can operate at full x16 bandwidth simultaneously. Consumer CPUs (Core i9, Ryzen 9) typically offer 24–48 usable PCIe lanes. Running 4 GPUs requires a HEDT or server platform.
- Model preparation: Quantization, format conversion (GGUF, EXL2), and fine-tuning are CPU-intensive preprocessing tasks that benefit from high core count and large cache.
AMD Threadripper Pro 9995WX — The Ultimate AI Workstation CPU
| Specification | Value |
|---|---|
| Architecture | Zen 5, 4nm TSMC |
| Cores / Threads | 96 / 192 |
| Base / Boost Clock | 2.5 GHz / 5.4 GHz |
| L3 Cache | 384 MB |
| TDP | 350W |
| Memory Channels | 8-channel DDR5 |
| Memory Bandwidth | 409.6 GB/s (DDR5-6400) |
| PCIe Lanes (CPU) | 128 lanes PCIe 5.0 |
| Socket | sTR5 |
| Max Memory | 2TB ECC RDIMM DDR5 |
The Threadripper Pro 9995WX is the most capable consumer-purchasable CPU for local AI workloads in 2026. Its 128 PCIe 5.0 lanes allow up to 8 GPUs at full x16 bandwidth simultaneously — no lane sharing, no bifurcation compromises. The 409.6 GB/s memory bandwidth (with DDR5-6400 in 8-channel configuration) is critical for CPU offloading: when model layers hit system RAM, that bandwidth directly determines throughput.
AMD claimed roughly 49% higher tokens-per-second for 32B DeepSeek R1 inference versus the Intel Xeon W9-3595X, attributing the gap to Zen 5’s superior cache hierarchy and memory bandwidth. The 384MB L3 cache cannot hold a multi-billion-parameter model outright, but it is large enough to keep hot weight blocks and KV-cache working sets resident, which measurably accelerates offloaded-layer throughput.
Price: Threadripper Pro 9995WX pricing is not publicly listed at MSRP — it is sold through authorized resellers and system integrators. Expect $5,000–$8,000 for the CPU alone, with complete workstation builds starting at $15,000.
AMD Threadripper Pro 7995WX — Proven Zen 4 Workhorse
The previous-generation Threadripper Pro 7995WX (Zen 4, 96 cores) remains a top choice for users who need the sTR5 platform’s lane count but want a slightly more accessible price point. It offers 128 PCIe 5.0 lanes, 8-channel DDR5 up to DDR5-5200, and is available from major system integrators.
CPU price: approximately $2,800–$4,500 in April 2026. Compared to the 9995WX, expect 10–15% lower memory bandwidth and single-thread performance but otherwise similar multi-GPU support capability.
Intel Core Ultra 9 285K — Consumer Tier, 4× GPU Capable
| Specification | Core Ultra 9 285K |
|---|---|
| Architecture | Arrow Lake-S |
| Cores / Threads | 8P + 16E (24 cores / 24 threads) |
| L3 Cache | 36 MB |
| Memory | DDR5-6400, 2-channel |
| Memory Bandwidth | ~102 GB/s |
| PCIe Lanes | 24 (CPU) + 20 (chipset) |
| TDP | 125W (base) / 250W (boost) |
| Socket | LGA 1851 |
| Retail Price | ~$580–$620 |
For single-GPU or dual-GPU setups (2× GPUs at x8/x8 or x16/x4), the Core Ultra 9 285K is the price/performance champion for consumer budgets. Its 24 CPU PCIe lanes can support two GPUs at PCIe 5.0 x16/x4 (not ideal) or x8/x8, which still provides ample bandwidth for GPU communication and NVMe storage.
The 2-channel DDR5 memory limits system RAM bandwidth to approximately 102 GB/s — a significant constraint for heavy CPU offloading workloads. For single-GPU setups with no offloading, this is irrelevant. For mixed GPU+CPU inference, it becomes a bottleneck.
AMD Ryzen 9 9950X — The Consumer 16-Core Option
| Specification | Ryzen 9 9950X |
|---|---|
| Architecture | Zen 5, 4nm TSMC |
| Cores / Threads | 16 / 32 |
| Boost Clock | 5.7 GHz |
| L3 Cache | 64 MB |
| Memory | DDR5-5600, 2-channel |
| PCIe Lanes | 24 (CPU) + 28 (chipset, X870E) |
| TDP | 170W |
| Retail Price | ~$550–$650 |
The Ryzen 9 9950X paired with an X870E motherboard is the sweet spot for single-RTX 5090 or dual-RTX 4090 builds. The 24 CPU PCIe lanes support one GPU at full PCIe 5.0 x16, with storage on additional lanes. For CPU offloading, the Zen 5 architecture’s improved IPC and 64MB L3 cache outperform the Core Ultra 9 285K in LLM-specific benchmarks by approximately 8–12%.
CPU Comparison for AI Workstations
| CPU | PCIe Lanes | Max GPUs (x16) | RAM BW | Max RAM | Price | Best Use |
|---|---|---|---|---|---|---|
| TR Pro 9995WX | 128 (PCIe 5) | 8 | 409.6 GB/s | 2TB ECC | $5,000–8,000 | 4–8 GPU professional |
| TR Pro 7995WX | 128 (PCIe 5) | 8 | 307.2 GB/s | 2TB ECC | $2,800–4,500 | 4–8 GPU workstation |
| Ryzen 9 9950X | 24 (PCIe 5) | 1–2 | ~100 GB/s | 192GB DDR5 | $550–650 | 1–2 GPU consumer |
| Core Ultra 9 285K | 24 (PCIe 5) | 1–2 | ~102 GB/s | 192GB DDR5 | $580–620 | 1–2 GPU consumer |
| EPYC 9654 | 128 (PCIe 5) | 8 | 460 GB/s | 6TB ECC | $8,000–12,000 | Server 8+ GPU |
Motherboards: The Platform That Defines What’s Possible
The motherboard determines your PCIe lane topology, maximum GPU count, memory channel configuration, and upgrade headroom. For local AI workstations, this decision is platform-defining.
ASUS Pro WS WRX90E-SAGE SE — The Multi-GPU Pinnacle
This board represents the most capable consumer-purchasable motherboard for multi-GPU AI workstations. Designed specifically for the Threadripper Pro 7000/9000 WX-Series on socket sTR5:
- PCIe slots: 7× PCIe 5.0 x16 — all capable of running at full x16 bandwidth from the CPU’s 128 PCIe lanes
- Memory: 8 DIMM slots, up to 2TB ECC RDIMM DDR5, 8-channel memory architecture
- Storage: 4× M.2 slots (PCIe 5.0 compatible), multiple SATA ports
- Networking: Dual 10 GbE Intel LAN ports — critical for distributed inference or serving models over fast LAN
- Form factor: EEB (Extended ATX) — requires full-tower or server chassis
- Remote management: Integrated IPMI for headless server operation
- Price: $1,247–$1,291 (Newegg, April 2026)
With 7 physical x16 slots all fed by 128 PCIe 5.0 lanes from the CPU directly (no lane sharing through chipset), this board can support configurations impossible on any consumer desktop platform: 4× RTX 6000 Ada all at PCIe 5.0 x16, or even 6× RTX 4090 with x16 each plus NVMe storage.
ASRock WRX90 WS EVO — The Alternative Threadripper Pro Platform
The ASRock WRX90 WS EVO offers similar capabilities to the ASUS Pro WS at a marginally lower price point. Key differences: the ASUS has a 32-phase VRM (vs ASRock’s 18-phase), which matters for the highest TDP Threadripper Pro CPUs under heavy sustained load. The ASRock offers 7× PCIe 5.0 x16 slots, identical DDR5 8-channel support, and dual Intel 10G LAN. Both boards are compatible with both the 7000WX and 9000WX Threadripper Pro series.
ASUS Pro WS TRX50-SAGE WIFI — Threadripper (Non-Pro) Option
For users who want a multi-GPU platform without the Threadripper Pro price premium, the non-Pro Threadripper 7000 series (socket TRX50) offers up to 64 PCIe 5.0 lanes — enough for 4 GPUs at x16 simultaneously. The ASUS Pro WS TRX50-SAGE WIFI supports the Ryzen Threadripper 7960X (24 cores, ~$1,400) or 7980X (64 cores, ~$2,800):
- 6× PCIe 5.0 x16 slots (4 at CPU, 2 at chipset)
- 8 DIMM slots (quad-channel DDR5 ECC RDIMM, up to 1TB)
- Multiple M.2 PCIe 5.0 slots
- Wi-Fi 7, 10 GbE LAN
- Price: ~$800–$900
Consumer Z890 / X870E Boards for Single or Dual GPU
For builds centered on one or two GPUs, consumer AM5 (Ryzen 9000) or LGA 1851 (Intel Core Ultra) platforms are fully adequate. Top boards for AI use cases:
- ASUS ROG Maximus Z890 Apex: $850, dual PCIe 5.0 x16 slots (running x8/x8 when both are populated) with the Core Ultra 9 285K
- MSI MEG X870E ACE: $700, AM5 socket (Ryzen 9000), PCIe 5.0 x16 primary, PCIe 4.0 x16 secondary
- ASUS ProArt X870E-Creator WiFi: $550, content creator focus with two full-bandwidth PCIe 5.0 slots and extensive connectivity
- Gigabyte X870E Aorus Master: $480, excellent VRM for heavy CPU loads, dual M.2 PCIe 5.0
RAM: System Memory as the AI Model’s Secondary Layer
System RAM (DRAM) plays two distinct roles in local AI workstations:
- CPU offloading layer: Layers of the model that don’t fit in VRAM are stored and computed in system RAM. RAM bandwidth and capacity directly determine how much model can be offloaded and at what speed.
- KV Cache and context: The key-value cache for transformer attention grows with context length. At 128K context with a 70B model, KV cache alone can reach 10–20GB. This typically stays in VRAM but overflows to RAM for very long contexts.
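A rough KV-cache sizing sketch using the standard keys-plus-values formula; the dimensions below (80 layers, 8 KV heads with GQA, head dimension 128) are assumptions in the Llama-3-70B style, not exact figures for any specific checkpoint.

```python
# Rough KV-cache size: 2 (keys + values) x layers x KV heads x head dim x
# context length x bytes per element, for a single sequence.

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context: int, bytes_per_elem: float) -> float:
    return 2 * layers * kv_heads * head_dim * context * bytes_per_elem / 1024**3

if __name__ == "__main__":
    for ctx in (8_192, 32_768, 131_072):
        fp16 = kv_cache_gb(80, 8, 128, ctx, 2.0)   # assumed 70B-class dimensions
        q8 = kv_cache_gb(80, 8, 128, ctx, 1.0)     # 8-bit KV cache
        print(f"{ctx:>7} tokens: ~{fp16:.1f} GB FP16 KV / ~{q8:.1f} GB 8-bit KV")
    # 128K tokens: ~40 GB FP16, ~20 GB at 8 bits — why long-context runs often
    # quantize the KV cache or let it spill toward system RAM.
```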
How Much RAM Do You Need?
| Use Case | Minimum RAM | Recommended RAM | Notes |
|---|---|---|---|
| Single GPU, no offloading (model fits in VRAM) | 32 GB | 64 GB | OS + apps + headroom |
| 70B model with partial CPU offloading | 64 GB | 128 GB | Offloaded layers need ~40–60GB headroom |
| 405B model (full CPU offload, Q4) | 256 GB | 512 GB | ~115GB model + OS + KV cache |
| Multi-user vLLM serving (70B, 8 users) | 128 GB | 256 GB | Parallel KV caches multiply |
| Fine-tuning 70B (QLoRA) | 128 GB | 256 GB | Gradient checkpoints and optimizer states |
Consumer DDR5: The Right Kits for AI Builds
For consumer AM5 / LGA 1851 (max 128–192GB, dual-channel):
- G.SKILL Trident Z5 Neo 128GB (2×64GB) DDR5-6000 CL30: ~$280. Best performance-per-dollar for Ryzen 9 9950X builds. Dual-channel limits bandwidth to ~96 GB/s but this is the platform ceiling.
- Kingston FURY Beast 128GB (4×32GB) DDR5-5600 CL40: ~$240. More slots occupied but lower latency than many kits at this speed.
For Threadripper / Threadripper Pro (8-channel ECC RDIMM):
- Kingston FURY Renegade Pro 256GB (8×32GB) DDR5-5600 ECC RDIMM CL36: ~$1,200. The standard choice for Threadripper Pro 7000 workstations. 8-channel DDR5 provides up to 307 GB/s bandwidth — critical for CPU offloading workloads.
- G.SKILL G5 Series 256GB (8×32GB) DDR5-6000 ECC RDIMM CL30: ~$1,400–$1,800. Higher speed with excellent stability. XMP 3.0 profiles tested with ASUS WRX90 boards.
- A-Tech 512GB (8×64GB) DDR5-6400 ECC RDIMM: ~$17,500. For the 405B use case — 512GB system RAM enables full Q4 inference of Llama 3.1 405B or DeepSeek R1 671B in system memory when VRAM is insufficient.
- V-Color 2TB RDIMM kit (256GB per DIMM) for Threadripper Pro 9000: Announced in 2025, available through specialty channels. Enables 2TB system RAM on a single WRX90 platform — theoretical capability to run even the 671B DeepSeek in full Q4 precision from system RAM.
Important note on RAM bandwidth and CPU offloading speed: When model layers are offloaded to system RAM, throughput scales almost linearly with memory bandwidth. A Threadripper Pro with 409.6 GB/s bandwidth processes offloaded layers approximately 4× faster than a dual-channel consumer platform at 100 GB/s. This makes the platform choice critical when you know you’ll be offloading.
Storage: Where Your Models Live Between Sessions
LLM model files are substantial: a 7B model in Q4_K_M format is approximately 4–5GB; a 70B model is 40–43GB; the 405B model in Q4 is approximately 230GB. Selecting the right storage ensures fast model loading and smooth switching between models.
What Storage Performance Actually Means for AI Workloads
When you launch llama.cpp, Ollama, or vLLM with a 70B model, the model weights are loaded from disk into VRAM (or system RAM for offloading). With a PCIe 4.0 NVMe drive reading at 7,000 MB/s, a 43GB model loads in approximately 6–7 seconds. With a PCIe 3.0 drive at 3,500 MB/s, the same model takes 12–14 seconds. For inference sessions that run for hours once loaded, the difference is startup time only — after loading, disk speed is irrelevant unless you’re frequently switching models.
PCIe 5.0 drives (14,000+ MB/s) cut load time to 3–4 seconds for 70B models but carry a 40–60% price premium and run significantly hotter, requiring active heatsinks. The consensus among practitioners is: PCIe 4.0 is the optimal tier for model storage — fast enough that loading isn’t a friction point, without the premium and heat of PCIe 5.0.
Top NVMe SSDs for AI Model Storage (2026)
| Drive | Capacity | Interface | Sequential Read | Price (Apr 2026) | Endurance (TBW) |
|---|---|---|---|---|---|
| Samsung 990 Pro | 2TB / 4TB | PCIe 4.0 NVMe | 7,450 MB/s | $150 / $280 | 1,200 / 2,400 |
| WD Black SN850X | 2TB / 4TB | PCIe 4.0 NVMe | 7,300 MB/s | $140 / $260 | 1,200 / 2,400 |
| Kingston KC3000 | 2TB / 4TB | PCIe 4.0 NVMe | 7,000 MB/s | $120 / $230 | 1,600 / 3,200 |
| Crucial T705 | 2TB / 4TB | PCIe 5.0 NVMe | 14,500 MB/s | $220 / $390 | 1,200 / 2,400 |
| Seagate FireCuda 530 | 2TB / 4TB | PCIe 4.0 NVMe | 7,300 MB/s | $145 / $270 | 1,275 / 2,550 |
Recommended minimum configuration: 2× Samsung 990 Pro 2TB in separate M.2 slots — one for the OS and applications, one dedicated to model storage. Total: 4TB capacity, $300 in drives, and the ability to keep 5–8 medium-large models (7B–34B) on the model drive simultaneously.
For 405B+ model users: A dedicated 4TB or larger drive is necessary. The Llama 3.1 405B in Q4_K_M format occupies 232GB; keeping multiple large models requires 2–4TB minimum. Consider a 4TB Samsung 990 Pro ($280) or a 4TB Seagate IronWolf Pro NAS drive ($120) for slower but high-capacity model archive storage, with a fast NVMe for the active model.
Power Supply Units: The Foundation You Cannot Compromise
GPU TDP numbers in the RTX 5090 generation are not suggestions — they’re sustained power draws under AI inference load, which is often more demanding than gaming workloads. A PSU that’s undersized will either throttle your GPU or fail under sustained load. Calculate your power budget correctly.
PSU Sizing Calculator
| Component | Typical Peak Draw |
|---|---|
| RTX 5090 | 575W (TDP, sustained at 100%) |
| RTX 4090 | 450W (TDP, sustained) |
| RTX 3090 | 350W (TDP, sustained) |
| Threadripper Pro 9995WX | 350W (TDP) |
| Ryzen 9 9950X | 170W (PBO off) / 240W (PBO2 sustained) |
| Core Ultra 9 285K | 253W (PL2 sustained) |
| Motherboard + RAM + Storage | 50–100W |
| Cooling (AIOs, fans) | 30–60W |
Single RTX 5090 build (consumer CPU):
575W (GPU) + 250W (CPU) + 100W (rest) = 925W. Use a 1,200W PSU minimum for headroom. A 1,200W 80+ Platinum unit running near 75–80% load stays cooler and more efficient than a 1,000W unit pushed past 90%.
Dual RTX 5090 build (consumer CPU):
1,150W (2× GPU) + 250W (CPU) + 100W (rest) = 1,500W. Use a 1,600W+ PSU. At this power level, ATX 3.1 compliance and PCIe 5.1 12V-2×6 connectors are required for each GPU.
4× GPU professional build (Threadripper Pro):
1,600W+ (4× GPU) + 350W (CPU) + 150W (rest) = 2,100W+. Requires dual PSU configuration (two 1,200–1,600W units) or a single server PSU rated for 2,000W+.
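The sizing above is simple addition; the sketch below encodes the sustained-draw figures from the table, with PSU suggestions in the comments mirroring the recommendations in this section.

```python
# Power-budget arithmetic: sum sustained component draw, then choose a PSU
# with headroom (20-30% where the budget allows; quality ATX 3.1 units are
# also rated for short GPU power excursions on top of this).

DRAW_W = {  # sustained figures from the table above
    "RTX 5090": 575, "RTX 4090": 450, "RTX 3090": 350,
    "Ryzen 9 9950X": 240, "Core Ultra 9 285K": 253, "TR Pro 9995WX": 350,
}

def sustained_watts(parts: list[str], platform_w: int = 100) -> int:
    """GPUs + CPU plus an allowance for board, RAM, storage, and fans."""
    return sum(DRAW_W[p] for p in parts) + platform_w

print(sustained_watts(["RTX 5090", "Ryzen 9 9950X"]))              # ~915W  -> 1,200W PSU
print(sustained_watts(["RTX 5090"] * 2 + ["Ryzen 9 9950X"]))       # ~1,490W -> 1,600W+ PSU
print(sustained_watts(["RTX 4090"] * 4 + ["TR Pro 9995WX"], 150))  # ~2,300W -> dual PSU
```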
Top PSU Recommendations for AI Workstations
1,200W Tier — Single High-End GPU (RTX 5090) or Dual RTX 4090
- Seasonic PRIME TX-1300 (1,300W, 80+ Titanium): ~$360. Industry benchmark for quality and reliability. 12-year warranty, zero-RPM fan mode, fully modular. The choice for builds where PSU failure is not an option.
- Corsair HX1500i (1,500W, 80+ Platinum): ~$280. Fully modular, i-series digital monitoring via USB (iCUE), excellent sustained efficiency. 10-year warranty.
- be quiet! Dark Power 13 (1,000W, 80+ Titanium): ~$250. For single GPU builds. Silent operation, exceptional build quality, 10-year warranty.
1,600W Tier — Dual RTX 5090 or 3×RTX 4090
- MSI MEG Ai1600T PCIE5 (1,600W, 80+ Titanium): ~$695. ATX 3.1 and PCIe 5.1 compliant with dual 12V-2×6 connectors built-in. Designed specifically for the dual RTX 5090 configuration. 12-year warranty.
- Corsair AX1600i Digital (1,600W, 80+ Titanium): ~$610. GaN switching transistors, fully modular, 10-year warranty. The enthusiast’s choice for fully digital power monitoring.
- be quiet! Dark Power Pro 13 (1,600W, 80+ Titanium): ~$580. Up to 94.5% efficiency, ATX 3.1 compliant, frameless Silent Wings 4 fan, 10-year warranty. Best acoustics in its class.
- ASUS Pro WS 1600W Platinum (ATX 3.1): ~$400. Purpose-engineered for AI workstations; supports two RTX 5090s on the same unit.
- Seasonic PRIME PX-1600 (1,600W, 80+ Platinum): ~$380. Premium Japanese capacitors, 12-year warranty, ATX 3.1 and PCIe 5.1 ready.
Dual PSU for 4+ GPU Builds
For systems exceeding 2,000W sustained draw, two synchronized PSUs are the practical solution on consumer/prosumer hardware. The Seasonic PRIME TX-1300 (×2) combination provides 2,600W at Titanium efficiency. An add-2-PSU (Add2PSU) adapter module ($20–$40) synchronizes the two units’ power-on sequencing from a single motherboard power-good signal.
Cooling: Managing Thermal Loads That Would Melt Budget Hardware
AI inference is one of the most thermally demanding consumer workloads. Unlike gaming, which involves frame-to-frame variation in GPU utilization, LLM inference runs the GPU at 95–100% utilization continuously for as long as the inference session is active. A model generating responses for two hours is putting sustained thermal stress on GPU, VRM, and memory for two continuous hours.
GPU Cooling Options
Triple-fan open-air (consumer cards: RTX 5090, 4090, 3090): These coolers are designed around bursty gaming loads but handle sustained AI inference adequately provided your case has strong positive airflow — at least three 120mm or two 140mm intake fans feeding the GPU, with a clear exhaust path behind and above it.
Custom water cooling loops: For dual-GPU consumer builds, custom loops with full-cover GPU waterblocks eliminate thermal throttling and reduce noise significantly. EKWB, Alphacool, and Bykski offer waterblocks for RTX 5090 and RTX 4090. A basic custom loop adds $400–$800 to the build cost but provides the best sustained thermals achievable.
Blower-fan professional cards (L40S, RTX 6000 Ada): These cards use a single blower fan that exhausts heat directly out the rear of the card and the case — ideal for multi-GPU rack configurations where open-air coolers would recirculate each other’s heat. In desktop cases, blower cards can be noisier but thermally better suited for multi-GPU arrays.
CPU Cooling
For Threadripper Pro systems at 350W TDP, air cooling is marginal. The recommended options:
- Noctua NH-U14S TR5-SP6: ~$100. Noctua’s best air cooler specifically designed for sTR5 socket. Handles Threadripper Pro 7995WX comfortably at stock settings. Quiet, reliable, no liquid risk.
- ASUS ROG Ryujin III 360 ARGB: ~$250. 360mm AIO, Gen 4 Asetek pump, for users pushing Threadripper Pro to maximum sustained performance with PBO enabled.
- Custom loop with Heatkiller IV TR5 waterblock: For the 9995WX at full 350W sustained, a custom loop is the only way to maintain safe temperatures without throttling. Budget $800–$1,200 for a proper custom CPU block + 480mm radiator setup.
The NVIDIA DGX Spark: A New Category of Local AI Computer
In January 2025, NVIDIA announced the DGX Spark — a personal AI supercomputer powered by the GB10 Grace Blackwell Superchip. This is not a traditional desktop computer. It’s a compact, purpose-built AI inference device available for direct consumer purchase, and it deserves specific coverage in any guide on hardware for running powerful AI models locally.
| Specification | DGX Spark |
|---|---|
| Chip | NVIDIA GB10 Grace Blackwell Superchip |
| CPU | 20-core ARM (10× Cortex-X925 @ 4GHz + 10× Cortex-A725 @ 2.8GHz) |
| GPU Cores | 6,144 CUDA cores (Blackwell) |
| Memory | 128GB LPDDR5x unified (CPU + GPU share) |
| Memory Bandwidth | 273 GB/s (unified, not segregated) |
| AI Performance | 1 PFLOP at FP4 with sparsity |
| Storage | 1TB or 4TB NVMe |
| Networking | 200Gbps QSFP (2×), 10GbE, Wi-Fi 7 |
| Power | 240W total (entire system) |
| Size | 150 × 150 × 50.5mm — smaller than a Mac Mini |
| Price | $4,699 (Founder’s Edition, April 2026) |
The DGX Spark’s key capability: it can run models up to 200 billion parameters natively, or up to 405B when two units are connected via QSFP interconnect. The 128GB unified memory pool eliminates the VRAM vs. system RAM dichotomy that constrains traditional GPU builds — the entire memory pool is accessible at GPU bandwidth.
However, at 273 GB/s bandwidth (shared CPU/GPU), it is significantly slower than discrete GPU setups for models that fit in VRAM. A single RTX 5090 with 1,792 GB/s GPU bandwidth will generate tokens 5–6× faster for 7B–32B models. The DGX Spark’s advantage is in handling 100B–200B models that a single RTX 5090 cannot address without heavy quantization and CPU offloading.
Who should buy the DGX Spark: Organizations and individual researchers who need 100B–200B model inference locally, without building and maintaining a multi-GPU x86 workstation. The simplicity of the DGX Spark (plug in, install software, run models) versus the complexity of a multi-GPU Threadripper Pro build is a legitimate consideration for teams where hardware administration isn’t a core competency.
Apple Silicon: The Unified Memory Alternative
No guide on local AI hardware is complete without addressing Apple Silicon’s unique architecture. The Mac Studio M4 Ultra and Mac Pro M4 Ultra offer a fundamentally different approach to the VRAM bottleneck: unified memory that serves both CPU and GPU from a single high-bandwidth pool, with no data transfer penalty between compute units.
| Specification | Mac Studio M4 Ultra | Mac Pro M4 Ultra | Mac Mini M4 Pro |
|---|---|---|---|
| Unified Memory | Up to 192GB | Up to 192GB | Up to 64GB |
| Memory Bandwidth | 800 GB/s | 800 GB/s | 273 GB/s |
| GPU Cores | 80 | 80 | 20 |
| Price | $4,999–$9,000+ | $9,999–$15,000+ | $1,399–$1,999 |
How Apple Silicon compares for LLM inference:
- Models under 32B: An RTX 5090 at 1,792 GB/s bandwidth produces 2–3× more tokens per second than a Mac Studio M4 Ultra at 800 GB/s for small-to-medium models that fit in the GPU’s VRAM.
- 70B models: The Mac Studio M4 Ultra with 192GB unified memory can run Llama 3 70B at Q8 quality level (70GB) without any offloading, at approximately 15–25 tokens/sec. An RTX 5090 trying the same model must offload heavily, dropping to 3–5 tokens/sec. The Mac wins decisively here.
- 405B models: Apple’s M4 Ultra at 192GB can run Llama 3.1 405B at Q2 quantization (approximately 100–115GB). This is an extraordinary capability for a single-unit $9,000 machine. Token generation speed is ~2–5 tok/s, which is slow but functional for research and evaluation purposes.
The framework that emerges: if you primarily work with models under 32B parameters, a discrete NVIDIA GPU workstation is faster and cheaper. If you primarily work with 70B–200B models and need high-quality output without heavy quantization, Apple Silicon’s unified memory architecture is currently unmatched in the consumer market.
Hardware for Running Powerful AI Models Locally: Three Complete Build Tiers
The following builds represent complete, purchasable hardware for running powerful AI models locally, targeting specific use cases and budgets. Prices reflect April 2026 market rates.
Build Tier 1 — “The Practitioner” (~$5,000–$7,000)
Best for: Professionals running models up to 34B parameters. Daily LLM coding assistant, local RAG, fine-tuning small models.
| Component | Selection | Price | Notes |
|---|---|---|---|
| GPU | NVIDIA RTX 5090 (ASUS ROG Strix OC or MSI Suprim X) | $3,500–$4,200 | 32GB GDDR7, 575W TDP |
| CPU | AMD Ryzen 9 9950X | $580–$650 | 16C/32T, Zen 5, excellent IPC |
| Motherboard | ASUS ProArt X870E-Creator WiFi | $520–$560 | PCIe 5.0 x16 GPU slot, robust VRM |
| RAM | G.SKILL Trident Z5 Neo 128GB (2×64GB) DDR5-6000 | $280–$320 | Max supported, dual-channel |
| Storage (OS) | Samsung 990 Pro 2TB | $145–$160 | PCIe 4.0, 7,450 MB/s read |
| Storage (Models) | WD Black SN850X 4TB | $260–$280 | Dedicated model storage |
| PSU | Corsair HX1500i 1,500W Platinum | $270–$300 | Headroom for GPU+CPU peak draw |
| CPU Cooler | Noctua NH-D15 G2 (or NZXT Kraken 360 AIO) | $120–$200 | 9950X has high boost power draw |
| Case | Fractal Design Torrent XL or Lian Li PC-O11 Dynamic EVO XL | $170–$220 | Excellent airflow for 575W GPU |
| Total | ~$5,845–$6,890 |
Expected performance:
- Llama 3 8B (Q4_K_M): ~200–213 tokens/sec
- DeepSeek R1 32B (Q4_K_M, fits comfortably in 32GB): ~65 tokens/sec
- Llama 3 70B (Q3_K_M, ~32GB, minimal CPU offload): ~30–40 tokens/sec
- Qwen 2.5 72B (Q3_K_S, ~27GB): ~50–60 tokens/sec
Who uses a setup like this: Independent AI researchers, senior ML engineers running local coding assistants with 32B+ models, privacy-first law firms running document analysis workflows, and developers fine-tuning models on 7B–13B architectures with custom datasets.
Build Tier 2 — “The Power User” (~$12,000–$16,000)
Best for: Running 70B models at high quality without compromise. Small-team model serving. Local research with 405B class models at acceptable quantization.
| Component | Selection | Price | Notes |
|---|---|---|---|
| GPU ×2 | 2× NVIDIA RTX 5090 (reference or AIB partner) | $7,000–$8,400 | 64GB combined GDDR7 via PCIe |
| CPU | AMD Threadripper 7960X (24C) or 7980X (64C) | $1,400–$2,800 | TRX50 PCIe 5.0 lanes: both GPUs at full x16 |
| Motherboard | ASUS Pro WS TRX50-SAGE WIFI | $800–$900 | 4× PCIe 5.0 x16 slots from CPU |
| RAM | 64GB DDR5-5600 (4×16GB) or 128GB DDR5-4800 (4×32GB) | $200–$450 | Quad-channel on TRX50 |
| Storage (OS) | Samsung 990 Pro 2TB | $150 | |
| Storage (Models) | 2× Samsung 990 Pro 4TB (RAID-0 stripe) | $560 | 8TB, ~13,000 MB/s combined read |
| PSU | MSI MEG Ai1600T PCIE5 1,600W Titanium | $695 | ATX 3.1, dual 12V-2×6 native |
| CPU Cooler | ASUS ROG Ryujin III 360 ARGB | $250 | TRX50 platform supported |
| Case | Thermaltake Core P8 TG ATX Full Tower | $250–$300 | Supports dual-GPU full-length cards with clearance |
| Total | ~$11,305–$14,505 |
Expected performance:
- Llama 3 70B (Q4_K_M, full in 64GB combined VRAM): ~70–85 tokens/sec
- DeepSeek R1 70B (Q5_K_M, ~50GB, full in VRAM): ~55–65 tokens/sec
- Llama 3.1 405B (Q2_K, ~115GB, needs CPU offload): ~10–18 tokens/sec
- Multi-user vLLM (70B Q4, 4 concurrent users): ~18–22 tokens/sec each
Real-world comparison: This build profile mirrors the dual RTX 5090 air-gapped setup documented by CraftRigs for legal and compliance teams in early 2026, described as “enterprise-class local AI without enterprise procurement.” The use case: a legal firm processing confidential client documents through 70B-class models with absolute certainty that no data leaves the premises.
Build Tier 3 — “The Research Station” (~$25,000–$45,000+)
Best for: Running 70B at FP16, quantized 405B and 671B at respectable speeds. Multi-user research team serving. The closest consumer-purchasable equivalent to a small inference cluster.
| Component | Selection | Price | Notes |
|---|---|---|---|
| GPU ×4 | 4× NVIDIA L40S 48GB (or 4× RTX 6000 Ada) | $28,000–$32,000 | 192GB combined VRAM via PCIe 4.0 |
| CPU | AMD Threadripper Pro 7995WX (96C) | $3,500–$4,500 | 128 PCIe 5.0 lanes for 4×GPU at x16 |
| Motherboard | ASUS Pro WS WRX90E-SAGE SE | $1,247–$1,291 | 7× PCIe 5.0 x16, 2TB ECC RAM support |
| RAM | Kingston FURY Renegade Pro 256GB (8×32GB) DDR5-5600 ECC RDIMM | $1,200–$1,400 | 8-channel, 307 GB/s bandwidth |
| Storage (OS) | Samsung 990 Pro 2TB | $150 | |
| Storage (Models) | 4× WD Black SN850X 4TB + 4TB HDD archive | $1,040+$120 | 16TB NVMe fast storage |
| PSU ×2 | 2× Seasonic PRIME TX-1300 + Add2PSU | $720 | 2,600W combined at Titanium efficiency |
| CPU Cooler | Custom loop: EKWB Quantum Magnitude sTR5 + 480mm rad | $500–$800 | Mandatory for TR Pro at sustained 350W |
| Case | Lian Li O11 Vision XL or server chassis (4U rackmount) | $300–$600 | Must accommodate 4× dual-slot blower cards |
| Total | ~$36,777–$42,631 |
Expected performance (4× L40S, 192GB total VRAM):
- Llama 3 70B (Q8, ~74GB, full in VRAM): ~35–45 tokens/sec
- Llama 3 70B (FP16, ~140GB, full in 192GB): ~18–25 tokens/sec
- Llama 3.1 405B (Q4, ~230GB — requires CPU offload): ~8–15 tokens/sec
- DeepSeek R1 671B (Q2, ~350GB — requires CPU offload to 256GB RAM): Functional but slow (~2–4 tok/s)
- Multi-user serving: 10–20 concurrent researchers at 10–15 tok/s each
Documented parallel: Ahmad Osman’s 8× RTX 3090 basement server (192GB total VRAM via consumer GPUs and NVLink pairs) documented in July 2024 is the closest public example to Tier 3 functionality at a lower budget. It runs on an ASRock Rack ROMED8-2T motherboard with AMD EPYC Milan 7713, 512GB DDR4 RAM, and three 1600W PSUs. The total cost was approximately $12,000–$15,000 using secondhand RTX 3090 cards — a testament to the fact that used professional configurations can dramatically undercut new build costs for research teams willing to accept used-hardware risk.
Software Stack: Maximizing Your Hardware’s Potential
Choosing the right inference framework is as important as the hardware configuration. The software layer determines how efficiently your GPUs are utilized, what quantization formats are supported, and how many concurrent users you can serve.
llama.cpp — The Universal Baseline
llama.cpp is the foundational open-source inference library that runs quantized models across NVIDIA (CUDA), AMD (ROCm), Apple Silicon, and even CPU-only configurations. It’s the engine underneath Ollama and many other user-facing tools.
Key characteristics:
- Supports GGUF quantization formats (Q2_K through Q8_0, IQ1 through IQ4)
- Multi-GPU tensor split via the `--tensor-split` flag (PCIe multi-GPU without NVLink)
- CPU offloading via the `--n-gpu-layers` parameter (precise control over what goes to VRAM vs. RAM)
- MCP (Model Context Protocol) client support added March 2026
- Best for: single-user, maximum performance, all hardware platforms
Performance benchmark (13B model, RTX 4090): ~75–85 tokens/sec at Q4_K_M via llama.cpp CUDA.
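For scripted use, the same knobs are exposed through the llama-cpp-python bindings. A minimal sketch, assuming the package is installed with CUDA support and using a hypothetical local GGUF path:

```python
# Minimal llama-cpp-python sketch: n_gpu_layers controls the VRAM/RAM split,
# tensor_split distributes weights across multiple GPUs.
from llama_cpp import Llama

llm = Llama(
    model_path="/models/deepseek-r1-32b.Q4_K_M.gguf",  # hypothetical path
    n_gpu_layers=-1,          # -1 = offload every layer to GPU if it fits
    n_ctx=8192,               # context window; KV cache grows with this
    tensor_split=[0.5, 0.5],  # only relevant with two visible GPUs
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize PCIe vs NVLink in one sentence."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```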
Ollama — The Practitioner’s Interface
Ollama wraps llama.cpp with a clean CLI, REST API, and model management system. For practitioners who want to switch between models quickly without managing GGUF files manually, Ollama reduces friction significantly. Performance is essentially identical to llama.cpp for single-user inference (same engine underneath).
Best for: Individual practitioners, local coding assistant integration (Cursor, VS Code, Continue.dev), non-technical team members who need model access without CLI expertise.
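Ollama’s local REST API (port 11434 by default) makes those models scriptable from any language. A minimal sketch, assuming the Ollama daemon is running and a 70B-class model has already been pulled:

```python
# Minimal Ollama REST sketch: non-streaming generation against the local daemon.
# Assumes something like `ollama pull llama3:70b` has been run beforehand.
import json
import urllib.request

payload = {
    "model": "llama3:70b",
    "prompt": "List three risks of sending client documents to a hosted LLM API.",
    "stream": False,
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```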
ExLlamaV2 — Maximum Speed on NVIDIA
ExLlamaV2 is the fastest inference solution available for NVIDIA GPUs, using custom CUDA kernels that bypass PyTorch overhead. Benchmarks consistently show 50–85% faster token generation than llama.cpp for equivalent quantization levels.
| Framework | Model | GPU | Speed (tok/s) |
|---|---|---|---|
| ExLlamaV2 (EXL2 4.25 bpw) | Llama 2 13B | RTX 3090 | ~57 |
| llama.cpp (Q4_K_M) | Llama 2 13B | RTX 3090 | ~31 |
| ExLlamaV2 | Mistral 7B | RTX 4070 | ~118 |
| Ollama | Mistral 7B | RTX 4070 | ~52 |
Limitation: ExLlamaV2 requires NVIDIA GPUs (RTX 2000-series or newer) and uses EXL2 format rather than GGUF. It does not support CPU offloading, so the entire model must fit in VRAM. For VRAM-constrained setups, llama.cpp’s flexibility may be worth the throughput trade-off.
vLLM — Production Multi-User Serving
vLLM is the standard production inference server for multi-user LLM deployment. Its PagedAttention mechanism efficiently manages KV cache for multiple concurrent requests, enabling dramatically better throughput under concurrent load compared to llama.cpp:
- Single user: llama.cpp slightly faster (lower overhead)
- 5 concurrent users: vLLM ~80 tok/s each vs llama.cpp ~60 tok/s each
- Batch processing: vLLM achieves ~2,000 tok/s total vs llama.cpp’s ~300 tok/s
- vLLM v0.17.0 (March 2026): PyTorch 2.10, FlashAttention 4, AMD ROCm first-class support
Best for: Research teams serving 5+ concurrent users, API-based access to local models, organizations running 70B models as a shared internal resource.
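A minimal vLLM sketch for the shared-model use case, assuming two GPUs and an AWQ-quantized 70B checkpoint (the model name below is a placeholder — substitute your own); `tensor_parallel_size` shards the weights across the cards:

```python
# vLLM offline-batching sketch: PagedAttention manages the KV cache across
# requests, tensor_parallel_size=2 shards the model across two GPUs.
from vllm import LLM, SamplingParams

llm = LLM(
    model="example-org/Llama-3.1-70B-Instruct-AWQ",  # placeholder checkpoint name
    tensor_parallel_size=2,
    gpu_memory_utilization=0.90,
)
params = SamplingParams(temperature=0.2, max_tokens=256)

prompts = [f"Summarize clause {i} of the sample agreement." for i in range(8)]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text[:80])
```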
Model Selection: What Can Your Hardware Actually Run?
With hardware for running powerful AI models locally specified across three build tiers, the practical question is: which models can each tier actually run, and at what quality level?
The Quantization Trade-Off
Modern quantization reduces model size by storing weights at lower precision than the training FP16/BF16. The quality cost varies by quantization level:
| Quantization | Bits per Weight | Quality vs FP16 | Size (70B model) | VRAM Needed |
|---|---|---|---|---|
| FP16 | 16 | 100% (baseline) | ~140GB | 140GB+ |
| Q8_0 | 8 | ~99% | ~74GB | 76GB+ |
| Q5_K_M | 5 | ~97–98% | ~48GB | 50GB+ |
| Q4_K_M | 4 | ~95–96% | ~43GB | 45GB+ |
| Q3_K_M | 3 | ~92–93% | ~32GB | 34GB+ |
| Q2_K | 2 | ~85–88% | ~21GB | 23GB+ |
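The size column follows directly from bits-per-weight. A quick estimator for whether a given model and quantization fit a VRAM pool — the effective bits-per-weight values and the fixed overhead allowance below are rough assumptions, not exact GGUF figures:

```python
# Rough fit check: weight bytes ~= parameters x bits-per-weight / 8, plus an
# assumed allowance for runtime buffers and KV cache.

def fits_in_vram(params_b: float, bits_per_weight: float,
                 vram_gb: float, overhead_gb: float = 2.0) -> tuple[float, bool]:
    weights_gb = params_b * bits_per_weight / 8  # 1B params at 8 bpw ~= 1 GB
    needed = weights_gb + overhead_gb
    return needed, needed <= vram_gb

for name, params, bpw in [("Llama 3 70B", 70, 4.85),    # Q4_K_M ~4.85 bpw
                          ("Llama 3 70B", 70, 2.6),     # Q2_K   ~2.6 bpw
                          ("DeepSeek R1 32B", 32, 4.85)]:
    needed, ok = fits_in_vram(params, bpw, vram_gb=32)
    print(f"{name} @ {bpw} bpw: ~{needed:.0f} GB needed, fits in 32 GB: {ok}")
```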
Model Compatibility by Build Tier
| Model | Params | Tier 1 (RTX 5090, 32GB) | Tier 2 (2×RTX 5090, 64GB) | Tier 3 (4×L40S, 192GB) |
|---|---|---|---|---|
| Llama 3 8B | 8B | ✅ Q8 native (~5GB) | ✅ FP16 native (~16GB) | ✅ FP16 native |
| Qwen 2.5 14B | 14B | ✅ Q8 native (~14GB) | ✅ FP16 native (~28GB) | ✅ FP16 native |
| DeepSeek R1 32B | 32B | ✅ Q4 native (~19GB) | ✅ Q8 native (~32GB) | ✅ FP16 native |
| Llama 3 70B | 70B | ⚠️ Q3 partial (~34GB, some CPU offload) | ✅ Q4 native (~43GB) | ✅ Q8 native (~74GB) |
| DeepSeek R1 70B | 70B | ⚠️ Q3 partial, CPU offload | ✅ Q4 native (~43GB) | ✅ Q8 native (~74GB) |
| Qwen 2.5 72B | 72B | ✅ Q3_K_S native (~27GB) | ✅ Q4 native (~44GB) | ✅ Q8 (~75GB) |
| Llama 3.1 405B | 405B | ❌ Not practical | ⚠️ Q2 with heavy CPU offload | ⚠️ Q4 with partial CPU offload |
| DeepSeek R1 671B | 671B | ❌ Not possible | ❌ Not possible | ⚠️ Q2 with 256GB RAM offload |
Buying Guide: Where to Purchase and What to Watch For
RTX 5090 Supply Situation (April 2026)
The RTX 5090 launched at $1,999 MSRP in January 2025 but has experienced persistent supply constraints. As of April 2026, street prices remain $3,500–$4,200 at major US retailers. Supply arrives at Newegg, B&H, Adorama, and Best Buy in unpredictable batches. The most reliable method for purchasing at or near MSRP:
- Set up stock alerts via NowInStock.net for RTX 5090 at all major retailers
- NVIDIA’s own store (store.nvidia.com) offers Founders Edition drops with purchase limits
- AIB partner cards (ASUS, EVGA, MSI, Gigabyte) often appear at slightly above MSRP from their own stores
RTX 4090 Value Assessment
At $1,400–$1,700, the RTX 4090 provides approximately 56% of the RTX 5090’s memory bandwidth at roughly 40% of its street price. For the vast majority of practical local AI use cases below 70B parameters, the RTX 4090 remains the rational choice. Check B&H, Newegg, and Amazon Warehouse Deals for open-box units at $1,300–$1,500.
Used RTX 3090 for NVLink Builds
If the dual-NVLink 3090 architecture fits your use case, eBay and local Craigslist/Facebook Marketplace listings regularly offer RTX 3090s at $550–$850. Verify the NVLink connector is not damaged before purchase — it’s a small gold contact strip on the top edge of the card. The NVLink HB bridge accessory (required) sells new for $100–$150 from NVIDIA-authorized resellers.
Professional GPU Channels
L40S and RTX 6000 Ada cards are sold through NVIDIA’s professional reseller network, not consumer channels. Major suppliers include Microway, Silicon Mechanics, and Puget Systems (for complete system builds). Expect lead times of 2–6 weeks for new stock. The used market for professional GPUs is active on eBay — L40S cards have appeared at $4,500–$6,500 from data center liquidations.
Complete System Builders
For organizations that want a single-vendor solution with professional support:
- Puget Systems: Specialized in video production and deep learning workstations. Pre-configured AI systems starting at $8,000. Exceptional documentation and customer support for workstation AI builds.
- Lambda Labs: Offers “GPU Cloud for Research” workstation systems for sale, in addition to their cloud service. GPU workstations priced $12,000–$65,000+ depending on GPU configuration.
- NVIDIA DGX Station A100: The institutional equivalent — 4× A100 80GB GPUs in a tower chassis. Pricing available through NVIDIA enterprise sales (typically $80,000–$150,000). Not consumer-purchasable in the traditional sense.
Real-World Use Cases: Who Is Actually Running This Hardware?
Beyond specifications, the strongest validation for these hardware configurations comes from documented real-world deployments by engineers and researchers who have published their setups.
The 8-GPU RTX 3090 Basement AI Server
Ahmad Osman’s documented basement server build (published July 2024) remains one of the most comprehensive public examples of a high-VRAM local AI cluster. The configuration:
- 8× NVIDIA RTX 3090 (192GB total VRAM via 4 NVLink pairs)
- ASRock Rack ROMED8-2T motherboard with AMD EPYC Milan 7713 (64 cores, 128 threads)
- 512GB DDR4-3200 ECC RAM
- Three 1,600W power supplies
- Primary purpose: running Meta’s Llama 3.1 405B for research applications
The NVLink pairs are critical: 4 bridges creating 4 GPU pairs, each pair sharing 48GB of NVLink-bonded VRAM. For model layers distributed across all 8 GPUs, PCIe 4.0 x16 handles inter-pair communication while NVLink handles intra-pair communication. The result is Llama 3.1 405B at Q4_K_M inference at functional (if not fast) throughput.
The Andrej Karpathy Single-GPU Autoresearch Setup
Andrej Karpathy (former Tesla AI director, OpenAI co-founder) released “autoresearch” in March 2026 — an AI agent that autonomously runs model training experiments. His documented setup targets RTX 3090 and RTX 4090 single-GPU configurations, with specific notes:
- 24GB VRAM (RTX 3090/4090) is the standard target — sufficient for continuous automated fine-tuning experiments
- For 12–16GB cards (RTX 3060/4060 Ti): scaled-down configurations require reducing model depth, vocab size, and sequence length
- CUDA 12.8+ required, Python 3.10+
This use case exemplifies single-GPU productivity: not the largest models, but continuous, automated, low-overhead experimentation that runs overnight and produces results by morning.
The Compliance Team Air-Gapped Dual RTX 5090 Setup
CraftRigs documented a dual RTX 5090 workstation deployed by a legal compliance team in early 2026. The configuration (2× RTX 5090, 64GB combined VRAM, air-gapped network) runs a 70B model on a dedicated workstation with zero internet connectivity — a hard requirement for attorney-client privilege in document review workflows. The team serves 4–6 concurrent attorneys at 20+ tokens/sec each, with 100% data locality guarantees.
This is the archetype for the Tier 2 build: enterprise-tier privacy requirements met with consumer-purchasable hardware at a fraction of NVIDIA enterprise hardware costs.
Decision Framework: Which Build Is Right for You?
After covering the full landscape of hardware for running powerful AI models locally, the practical question is how to navigate this decision. Here is a direct framework:
Choose Tier 1 (RTX 5090 single GPU) if:
- Your primary models are 7B–34B parameters
- You occasionally need 70B access and can tolerate Q3 quantization with some CPU offloading
- Single-user inference (you’re the only person running the model)
- Budget ceiling of $7,000–$8,000 total
- You want a workstation that also handles other GPU workloads (video, gaming, stable diffusion)
Choose 2× RTX 3090 NVLink if:
- 70B models at Q4 quality are your primary use case
- Budget is under $3,000 for the GPU pair
- You’re comfortable with used hardware and the associated risk
- You understand that RTX 3090 is end-of-life from NVIDIA’s perspective (no further driver feature development)
Choose Tier 2 (2× RTX 5090 + Threadripper) if:
- 70B models at Q4 or higher quality without CPU offloading are required
- Multi-user serving (2–6 concurrent users) is needed
- Privacy requirements mandate local deployment of larger models
- Budget can accommodate $12,000–$16,000 total investment
Choose Tier 3 (4× L40S + Threadripper Pro) if:
- Your team regularly works with 70B models at FP16 or Q8 quality
- Research workflows require 405B class inference, even at slow speeds
- 10–20 concurrent users need model access
- You have an institutional budget and need professional-grade reliability (ECC, blower cards, IPMI management)
Consider DGX Spark ($4,699) instead of Tier 1 if:
- Models from 70B–200B are your primary use case and low token speed (2–10 tok/s) is acceptable
- You want turnkey simplicity over flexibility
- Silence and compact form factor matter
- You’re not planning to use the hardware for anything other than AI inference
Consider Apple Silicon (Mac Studio M4 Ultra) instead of GPU build if:
- macOS is your primary environment (integrated ecosystem with Xcode, ML frameworks)
- 70B–192B models at moderate speed are acceptable (15–25 tok/s at Q4)
- Single-vendor warranty and support matter
- Power consumption and noise are priorities (Mac Studio draws 60–70W vs 700W+ for GPU builds)
Power and Electrical Considerations
Tier 2 and Tier 3 builds consume power at levels that require attention to electrical infrastructure beyond just the PSU selection.
- Tier 1 (~900–1,100W sustained): A standard 15A/120V household circuit is marginally sufficient. Running on a dedicated 20A circuit is strongly recommended to avoid breaker trips during sustained inference, since transient current spikes can momentarily exceed the 15A rating.
- Tier 2 (~1,400–1,600W sustained): Requires a dedicated 20A/120V circuit or a 15A/240V circuit. Many home offices do not have 20A circuits; electrician consultation is warranted before building.
- Tier 3 (2,000W+ sustained): Requires a 240V 20A dedicated circuit or two separate 20A circuits. Server room or dedicated electrical infrastructure may be necessary. At 24/7 operation, annual electricity cost at $0.15/kWh is approximately $2,628/year — a real operational cost to factor into the ROI calculation.
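The operating-cost arithmetic is worth rerunning for your own duty cycle and local rate; a one-function sketch:

```python
# Annual electricity cost for sustained draw: watts -> kWh -> dollars.
def annual_cost(watts: float, usd_per_kwh: float = 0.15, hours: float = 24 * 365) -> float:
    return watts / 1000 * hours * usd_per_kwh

for tier, watts in [("Tier 1", 1000), ("Tier 2", 1500), ("Tier 3", 2000)]:
    print(f"{tier} at {watts}W, 24/7: ~${annual_cost(watts):,.0f}/year")
# Tier 3 at 2,000W matches the ~$2,628/year figure above; most workstations idle
# well below peak, so real bills are usually a fraction of this.
```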
Conclusion: The Best Time to Build a Local AI Workstation Is Now
The argument for investing in the right hardware for running powerful AI models locally has never been stronger. The 2026 landscape offers a convergence of circumstances that didn’t exist 18 months ago:
- Open-weight models at the 70B parameter scale routinely match GPT-4-era performance on coding, reasoning, and analysis tasks
- Consumer hardware (RTX 5090) has crossed the 32GB VRAM threshold, enabling 70B inference at acceptable quality levels from a single card with aggressive quantization
- Software frameworks (llama.cpp, ExLlamaV2, vLLM) have matured to extract near-theoretical hardware efficiency
- Quantization techniques (EXL2, IQ quants, GGUF) have advanced to where Q4 inference on well-quantized models is difficult to distinguish from FP16 on most practical tasks
The Tier 1 build (~$6,000–$7,000) delivers LLM inference performance that would have required a $50,000+ enterprise server in 2022. The Tier 3 build (~$40,000) replicates capabilities of a small inference cluster. For the professionals and organizations for whom data privacy, latency, and cost at scale are genuinely important, these investments return their value within months of deployment.
For more on the cloud GPU alternatives — when renting beats building — see our comprehensive GPU VPS for AI roundup, our RunPod vs Vast.ai vs Lambda Labs comparison, and our regularly updated GPU cloud pricing comparison.
Sources and References
- NVIDIA GeForce RTX 5090 — Official Specifications
- NVIDIA DGX Spark — Official Product Page
- Tim Dettmers — Which GPU for Deep Learning (2023, updated reference)
- llama.cpp — GitHub Repository
- vLLM Documentation
- ASUS Pro WS WRX90E-SAGE SE — Newegg listing
- Ahmad Osman — Serving AI from the Basement: 8× RTX 3090 Build
- Lambda Labs — GPU Cloud and Workstations