Building the right hardware for running powerful AI models locally is the single most consequential technical decision you’ll make as an AI practitioner in 2026. The difference between a system that handles a 70B parameter model at a usable 25 tokens per second and one that crawls at 3 tokens per second with constant RAM swapping isn’t luck — it’s component selection. This guide on hardware for running powerful AI models locally dissects every layer of the stack: GPUs, CPUs, motherboards, RAM, storage, and power delivery, with real benchmarks, real build costs, and documented real-world setups from researchers and engineers who’ve already done this.
We are not talking about running Llama 3 8B on a laptop with 16GB of RAM. We’re talking about locally hosting 70B, 405B, and 671B parameter models with enough throughput to be genuinely productive — or even serve multiple users simultaneously. The hardware for running powerful AI models locally that we cover here is available for purchase today, without institutional procurement, and this guide will show you exactly what to buy.
Why Run Powerful AI Models Locally?
Before specifying hardware, the question of why deserves a direct answer. Local AI inference is not a hobbyist exercise. The reasons professionals and organizations run powerful AI models on local hardware are substantive:
- Data privacy and sovereignty: No tokens, prompts, or completions leave your physical premises. For legal, medical, financial, and government workloads, this is non-negotiable.
- Cost at scale: Running GPT-4-class models via API at production volumes — thousands of calls per day — costs hundreds to thousands of dollars monthly. A one-time hardware investment amortizes over 3–5 years.
- Latency control: With local inference, your network RTT is your LAN latency. Streaming responses begin in milliseconds, not 300–2,000ms round-trips to API endpoints.
- Model customization: Fine-tuning, LoRA adapters, quantization to specific bpw targets, GGUF format selection — all of these require direct hardware access.
- Capability ceiling: As of April 2026, the open-weight models available (Llama 3.1 405B, DeepSeek R1 671B, Qwen 2.5 72B) approach or match GPT-4-level performance on many benchmarks. The argument that “cloud models are better” no longer holds categorically.
The Physics of Local AI: What Actually Limits Performance
Understanding why certain components matter more than others for hardware for running powerful AI models locally requires understanding the physical bottleneck in large model inference. Unlike gaming, where GPU compute (TFLOPS) is the primary limiter, LLM inference at typical batch sizes of 1–4 is almost always memory bandwidth bound, not compute bound.
When you generate a single token with a 70B model, the GPU must stream approximately 40GB of model weights from VRAM through the arithmetic units. At the RTX 5090’s 1,792 GB/s bandwidth, this takes roughly 22ms — a theoretical ceiling of ~45 tokens per second before any overhead. At the RTX 4090’s 1,008 GB/s, the ceiling drops to ~25 tokens per second. This is why VRAM bandwidth often predicts LLM inference speed better than TFLOPS.
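This back-of-envelope ceiling is easy to reproduce: divide weight size by memory bandwidth. The sketch below does exactly that and ignores kernel overhead, KV-cache reads, and prompt processing, so real-world numbers land somewhat lower.

```python
# Rough decode-speed ceiling for memory-bandwidth-bound inference: every
# generated token streams the full weight set from VRAM once.

def decode_ceiling_tps(weights_gb: float, bandwidth_gbps: float) -> float:
    """Theoretical upper bound on tokens/sec for batch-1 decoding."""
    return bandwidth_gbps / weights_gb

if __name__ == "__main__":
    q4_70b_gb = 40  # ~40 GB of Q4 weights for a 70B model
    for card, bw in [("RTX 5090", 1792), ("RTX 4090", 1008), ("RTX 3090", 936)]:
        print(f"{card}: ~{decode_ceiling_tps(q4_70b_gb, bw):.0f} tok/s ceiling")
    # RTX 5090: ~45, RTX 4090: ~25, RTX 3090: ~23 tok/s before any overhead
```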
The second physical constraint is VRAM capacity. If the model doesn’t fit in VRAM, it must be partially offloaded to system RAM. The RTX 5090’s GDDR7 delivers 1,792 GB/s; DDR5 system RAM delivers roughly 90–150 GB/s. A model that’s 20% offloaded to RAM can drop inference speed by 40–60% — not 20%. The penalty is disproportionate because the bottleneck shifts from VRAM to DRAM for every layer that lands in system memory.
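The disproportionate penalty falls out of the same arithmetic: per-token time is the VRAM-resident portion streamed at GPU bandwidth plus the offloaded portion streamed at DRAM bandwidth. A minimal sketch with illustrative bandwidth figures — note this strictly serial model is pessimistic, since llama.cpp overlaps some CPU and GPU work, which is why measured penalties land closer to the 40–60% range:

```python
# Per-token latency with partial offloading: a serial two-pass model where the
# resident fraction streams at VRAM bandwidth and the rest at DRAM bandwidth.

def offloaded_tps(weights_gb: float, offload_frac: float,
                  vram_bw: float = 1792.0, dram_bw: float = 100.0) -> float:
    """Approximate tokens/sec when offload_frac of the weights sit in system RAM."""
    t_gpu = weights_gb * (1.0 - offload_frac) / vram_bw
    t_cpu = weights_gb * offload_frac / dram_bw
    return 1.0 / (t_gpu + t_cpu)

if __name__ == "__main__":
    for frac in (0.0, 0.1, 0.2):
        print(f"{frac:.0%} offloaded: ~{offloaded_tps(40, frac):.0f} tok/s")
    # 0%: ~45 tok/s, 10%: ~17 tok/s, 20%: ~10 tok/s on a 5090-class card
```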
The third constraint, often underestimated, is inter-GPU bandwidth in multi-GPU setups. PCIe 5.0 x16 provides ~64 GB/s of bandwidth. NVLink (available on the RTX 3090 and Ampere-era professional GPUs) provides 112 GB/s on the 3090. For models split across two GPUs, the communication overhead between cards during the attention mechanism can account for 30–50% of total latency if bandwidth is insufficient.
This framework — bandwidth first, capacity second, inter-GPU communication third — should guide every component decision in this guide.
The GPU Tier List: Hardware for Running Powerful AI Models Locally (2026)
Hardware for running powerful AI models locally starts and ends with the GPU. No other component decision will have as much impact on your inference speed and maximum model size. Here is the complete tier list of what’s available for purchase today.
Tier S: NVIDIA RTX 5090 — The Consumer King
The RTX 5090 (Blackwell architecture, January 2025) is the undisputed single-card champion for local AI inference in 2026. Its specifications set a new ceiling for consumer hardware:
| Specification | RTX 5090 | RTX 4090 | Delta |
|---|---|---|---|
| Architecture | Blackwell (GB202) | Ada Lovelace (AD102) | — |
| CUDA Cores | 21,760 | 16,384 | +33% |
| Tensor Cores (5th Gen) | 680 | 512 (4th Gen) | +33% |
| VRAM | 32 GB GDDR7 | 24 GB GDDR6X | +33% |
| Memory Bandwidth | 1,792 GB/s | 1,008 GB/s | +78% |
| FP16 TFLOPS | 838 | 330 | +154% |
| TDP | 575W | 450W | +28% |
| PCIe Interface | PCIe 5.0 x16 | PCIe 4.0 x16 | Gen+1 |
| MSRP | $1,999 | $1,599 | +$400 |
| Street Price (Apr 2026) | $3,500–$4,200 | $1,400–$1,700 | +~$2,000 |
Real inference numbers (April 2026, llama.cpp, Q4_K_M quantization):
- Llama 3 8B: ~213 tokens/sec (prompt processing: 11,796 tokens/sec)
- Qwen 2.5 7B: ~274 tokens/sec (decode)
- Llama 3 70B (quantized aggressively to fit in 32GB): ~45–61 tokens/sec
- DeepSeek R1 32B: ~65 tokens/sec (Q4_K_M, easily fits in 32GB)
- Qwen2.5-Coder-7B at batch-8: 5,841 tokens/sec — 2.6× faster than an A100 80GB
Key limitation: The RTX 5090 does not support NVLink in its consumer form factor. Multi-GPU setups require PCIe communication only, which limits scaling efficiency. The card also requires a PCIe 5.0 16-pin (12V-2×6) connector and a PSU rated for at least 575W GPU draw plus the rest of the system.
Who should buy it: Anyone who needs maximum single-card performance without the cost and complexity of a multi-GPU professional workstation. The 32GB VRAM is sufficient for most 70B quantized models at Q3–Q4 level with minimal context overhead.
Tier A: NVIDIA RTX 4090 — The Proven Workhorse
The RTX 4090 remains the best value proposition for local AI inference in 2026 when purchased at its current market price of $1,400–$1,700. It has an enormous ecosystem of tested configurations, proven stability over 18+ months of AI workload operation, and the 24GB GDDR6X is sufficient for the vast majority of practical use cases below 70B parameters.
| Model | Quantization | Fits in 24GB? | Speed (tok/s) |
|---|---|---|---|
| Llama 3 8B | Q4_K_M | ✅ Yes (5.5GB) | ~118–130 |
| Mistral 7B | Q4_K_M | ✅ Yes (4.4GB) | ~125–140 |
| DeepSeek R1 32B | Q4_K_M | ✅ Yes (18.5GB) | ~35–45 |
| Llama 3 70B | Q4_K_M | ❌ 43GB required | ~10–15 (with CPU offload) |
| Llama 3 70B | Q2_K | ⚠️ Tight (21GB) | ~18–25 (with quality loss) |
| DeepSeek R1 70B | Q4_K_M | ❌ 43GB required | ~8–12 (with offload) |
The 24GB VRAM ceiling on the RTX 4090 means that 70B models at useful quality levels require either significant quantization quality compromise, CPU offloading with major throughput penalties, or a second GPU. For users primarily working with models up to 34B parameters, the RTX 4090 is fully adequate and represents better cost-per-token than the RTX 5090 at street prices.
Tier A: Dual RTX 3090 with NVLink — The Budget 48GB Solution
The RTX 3090 is the last consumer NVIDIA GPU to support NVLink, and this detail transforms what would otherwise be an aging card into a compelling 2026 option for specific workloads. A dual RTX 3090 NVLink setup provides:
- 48GB combined VRAM with 112 GB/s inter-GPU bandwidth via NVLink bridge
- Comfortable full-quality Q4 inference of 70B models at 25–35 tokens/sec
- Used card prices of $550–$850 each (April 2026), making the GPU pair cost $1,100–$1,700
- Full compatibility with llama.cpp tensor parallelism (`--tensor-split` flag)
The NVLink bridge for RTX 3090 costs approximately $100–$150 and physically connects two cards in adjacent slots. Without NVLink, dual RTX 3090s would communicate via PCIe (up to ~32 GB/s), which introduces significant bottleneck for large matrix operations during attention computation. With NVLink at 112 GB/s, the inter-GPU penalty is dramatically reduced.
Ahmad Osman’s 8×RTX 3090 basement server (documented July 2024) is the most extreme public example of this architecture. Running 8 RTX 3090s with 192GB total VRAM on an ASRock Rack ROMED8-2T with AMD EPYC Milan 7713 (64 cores), 512GB DDR4 RAM, and three 1600W PSUs, this system can host Llama 3.1 405B at Q4 quantization with real throughput. Total build cost: approximately $12,000–$15,000.
Tier A+: NVIDIA RTX 6000 Ada Generation — The Professional 48GB Card
The RTX 6000 Ada is NVIDIA’s workstation GPU based on the same Ada Lovelace die as the RTX 4090, but with critical differences that matter for sustained AI workloads:
| Feature | RTX 6000 Ada | RTX 4090 | Advantage |
|---|---|---|---|
| VRAM | 48 GB GDDR6 ECC | 24 GB GDDR6X | 2× capacity, ECC protection |
| Memory Bandwidth | 960 GB/s | 1,008 GB/s | RTX 4090 slightly faster |
| TDP | 300W | 450W | RTX 6000 Ada (33% less power) |
| FP32 TFLOPS | 91.1 | 82.6 | RTX 6000 Ada (+10%) |
| L2 Cache | 96 MB | 72 MB | RTX 6000 Ada (+33%) |
| vGPU Support | Yes | No | RTX 6000 Ada |
| NVLink | No (dropped for the Ada generation) | No | — |
| Thermal design | Blower (dual-slot, rear exhaust) | Triple-fan (open air) | Workstation cases: RTX 6000 Ada |
| Purchase price | ~$6,500–$8,000 | ~$1,400–$1,700 | RTX 4090 (4–5× cheaper) |
LLM inference benchmarks (Q4_K_M, llama.cpp):
- Llama 3 8B: 131 tokens/sec (RTX 6000 Ada) vs 113 (L40S) vs 110 (RTX 4090)
- Llama 3 70B: 18.4 tokens/sec (RTX 6000 Ada) — fully in VRAM, no offloading needed
- At FP16 precision (Llama 3 8B): 52 tokens/sec — superior to RTX 4090 at full precision
The RTX 6000 Ada’s 48GB with ECC enables reliable operation for 70B models at Q4 quality without any CPU offloading, while the RTX 4090 must offload or use more aggressive quantization. For teams running 70B inference continuously on professional workloads, the premium may justify itself through reliability (ECC protects against bit-flip corruption) and lower power draw.
Tier B: NVIDIA L40S — The Datacenter Card That Fits at a Desk
The L40S is NVIDIA’s inference-optimized Ada Lovelace card, positioned between the RTX 6000 Ada and A100 in the product stack. With 48GB GDDR6 and 864 GB/s bandwidth, it’s slightly slower than the RTX 6000 Ada for LLM inference per token but shares the same VRAM capacity tier.
- Purchase price: $7,000–$12,000 (new), $4,000–$7,000 (used/refurbished)
- Llama 3 70B (Q4_K_M): 15.3 tokens/sec (vs 18.4 on RTX 6000 Ada)
- Dual L40S (96GB combined): Runs Q8 Llama 3 70B fully in VRAM at ~22 tokens/sec
- 4× L40S (192GB): Can run full FP16 Llama 3 70B or quantized 405B models
- Passive cooling; requires workstation chassis with proper airflow management
- No NVLink (dropped for the Ada generation); multi-card configurations communicate over PCIe
The L40S is the workhorse of enterprise AI inference deployments. For individual buyers willing to spend $8,000–$15,000 on a single GPU card, it offers production-grade reliability and the expanded VRAM headroom that consumer cards cannot match.
Tier B: AMD Radeon RX 7900 XTX — The Open-Source Alternative
AMD’s flagship consumer card offers 24GB GDDR6 at $800–$950 retail, making it the most affordable 24GB option available. The RX 7900 XTX runs LLM inference via ROCm (AMD’s CUDA equivalent) through llama.cpp, Ollama, and vLLM with AMD ROCm support.
Performance caveats are significant: ROCm support remains less mature than CUDA, optimization paths such as FlashAttention builds and custom kernels for quantized inference lag behind or are unavailable, and real-world LLM throughput on the 7900 XTX is typically 40–60% of what an RTX 4090 achieves despite similar VRAM capacity. As of vLLM v0.16 (March 2026), AMD ROCm support has become “first-class,” but CUDA still leads in optimized inference kernels for most quantization formats.
For users committed to open-source, privacy-first toolchains who are also running Linux, the RX 7900 XTX provides a cost-effective entry into 24GB VRAM inference without NVIDIA’s driver ecosystem.
Multi-GPU Architectures: When One Card Is Not Enough
Running hardware for powerful AI models locally at 70B+ parameters at acceptable quality and speed often requires multiple GPUs. There are three fundamentally different multi-GPU architectures, each with different trade-offs in complexity, cost, and performance.
Architecture 1: NVLink Consumer (2×RTX 3090)
The RTX 3090 is unique in the consumer GPU market: it’s the only modern consumer card supporting NVLink. The NVLink HB bridge for dual RTX 3090 provides 112 GB/s bidirectional bandwidth — 3.5× the bandwidth of PCIe 4.0 x16.
Why this matters for LLMs: During transformer attention computation, model layers are split across GPUs. Each forward pass requires exchanging activations between cards. At 112 GB/s vs 32 GB/s, NVLink reduces communication bottleneck by 3.5×, which translates directly to higher tokens per second for models that don’t fit in a single card’s VRAM.
Practical build constraints:
- Requires a motherboard with two PCIe x16 slots physically close enough for the NVLink bridge (typically 2–3 slots apart)
- Both GPUs must be the same model (an RTX 3090 cannot be NVLink-bridged to a 3090 Ti)
- Consumes 2× 350W TDP = 700W GPU draw minimum; plan for a 1,200W+ PSU once CPU and platform draw are included
- Used card availability declining as supply ages out of the market
Expected performance on 70B models (Q4_K_M): 28–35 tokens/sec — better than a single RTX 6000 Ada, at a fraction of the GPU cost.
Architecture 2: PCIe Multi-GPU (2–4× RTX 4090 or RTX 5090)
Without NVLink, modern consumer GPUs (RTX 4090, RTX 5090) communicate via PCIe only. PCIe 5.0 x16 provides ~64 GB/s bidirectional bandwidth — a significant improvement over PCIe 4.0’s 32 GB/s, but still well below NVLink. llama.cpp supports tensor parallelism across PCIe-only multi-GPU setups via the --tensor-split parameter.
The efficiency of PCIe multi-GPU for LLM inference depends heavily on how the model’s layers are distributed. For 70B models across 2× RTX 5090 (64GB combined VRAM), you can avoid any CPU offloading and run the full model in GPU memory. The PCIe communication overhead for attention is real but acceptable — typically 15–25% throughput reduction vs. theoretical peak.
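A minimal launch sketch for a PCIe-only dual-GPU split — assuming a recent llama.cpp build whose server binary is named `llama-server`, and a hypothetical GGUF path; weight the split toward the card with more free VRAM if the GPUs differ:

```python
# Sketch: launch a llama.cpp server across two PCIe-connected GPUs.
# Assumes llama-server is on PATH; the GGUF path below is hypothetical.
import subprocess

MODEL = "/models/llama-3-70b-instruct.Q4_K_M.gguf"  # hypothetical path

cmd = [
    "llama-server",
    "-m", MODEL,
    "--n-gpu-layers", "999",       # push all layers onto the GPUs
    "--tensor-split", "0.5,0.5",   # even split across two identical cards
    "--ctx-size", "8192",
    "--host", "127.0.0.1",
    "--port", "8080",
]
subprocess.run(cmd, check=True)
```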
2× RTX 5090 (64GB combined VRAM) expected performance:
- Llama 3 70B (Q4_K_M, full in VRAM): 70–85 tokens/sec (combined throughput)
- DeepSeek R1 70B (Q4_K_M): 65–80 tokens/sec
- Llama 3.1 405B (Q2_K, ~115GB): Cannot fit — requires additional GPUs or heavy CPU offload
- Multi-user serving with vLLM: 4–6 concurrent users at 15–20 tok/s each
Cost: 2× RTX 5090 at street prices costs $7,000–$8,400 in GPUs alone. Add motherboard, Threadripper CPU, RAM, and PSU and a dual RTX 5090 workstation approaches $12,000–$15,000 total.
Architecture 3: Professional GPU Arrays (4–8× Datacenter Cards)
At the highest tier, 4–8 professional GPUs (L40S, RTX 6000 Ada, A100) provide the VRAM pool necessary to run the largest open-weight models at full quality. An 8× L40S configuration provides 384GB VRAM — enough for Llama 3.1 405B at Q4 without any offloading, or DeepSeek R1 671B at aggressive quantization.
These configurations require server-grade hardware: EPYC or Threadripper Pro platforms with sufficient PCIe lanes, server cases with proper airflow for blower-fan datacenter cards, and three or more 1600W PSUs. The documented 8× RTX 3090 basement server (Ahmad Osman, 2024) cost approximately $12,000–$15,000 using used consumer cards. A 4× L40S new configuration would cost $28,000–$48,000 in GPUs alone.
GPU Comparison Matrix: At a Glance
| GPU | VRAM | Bandwidth | 70B Q4 Speed | NVLink | Price (2026) | Best For |
|---|---|---|---|---|---|---|
| RTX 5090 | 32GB GDDR7 | 1,792 GB/s | ~45 tok/s* | No | $3,500–4,200 | Up to 32B optimal, 70B with heavy quant |
| RTX 4090 | 24GB GDDR6X | 1,008 GB/s | ~10 tok/s** | No | $1,400–1,700 | Up to 34B optimal |
| 2× RTX 3090 NVLink | 48GB combined | 936 GB/s each | 28–35 tok/s | Yes (112 GB/s) | $1,100–1,700 | 70B sweet-spot value |
| RTX 6000 Ada | 48GB GDDR6 | 960 GB/s | 18.4 tok/s | No | $6,500–8,000 | Professional 70B inference, ECC |
| L40S | 48GB GDDR6 | 864 GB/s | 15.3 tok/s | No | $7,000–12,000 | Sustained production inference |
| 2× RTX 5090 | 64GB (PCIe) | 1,792 GB/s ea | 70–85 tok/s | No | $7,000–8,400 | 70B high-speed, 405B partial |
| RX 7900 XTX | 24GB GDDR6 | 960 GB/s | ~6 tok/s** | No | $800–950 | Budget, Linux-native, ROCm |
* With Q3_K_S quantization; 70B at Q4 requires ~43GB which exceeds 32GB VRAM, necessitating CPU offloading that significantly reduces speed.
** With CPU offloading of layers that don’t fit in 24GB VRAM.
CPUs for Local AI Workstations: The Foundation That Determines Your Ceiling
When choosing hardware for running powerful AI models locally, the CPU’s role in local AI inference is often misunderstood. For pure GPU inference where the entire model fits in VRAM, the CPU is largely idle — a modern consumer CPU is sufficient. The CPU becomes critical in three specific scenarios:
- CPU offloading: When model layers must be computed in system RAM instead of VRAM, the CPU processes those layers. Cache size, memory bandwidth, and core count all directly impact offloaded-layer throughput.
- Multi-GPU setups: PCIe lane count determines how many GPUs can operate at full x16 bandwidth simultaneously. Consumer CPUs (Core i9, Ryzen 9) typically offer 24–48 usable PCIe lanes. Running 4 GPUs requires a HEDT or server platform.
- Model preparation: Quantization, format conversion (GGUF, EXL2), and fine-tuning are CPU-intensive preprocessing tasks that benefit from high core count and large cache.
AMD Threadripper Pro 9995WX — The Ultimate AI Workstation CPU
| Specification | Value |
|---|---|
| Architecture | Zen 5, 4nm TSMC |
| Cores / Threads | 96 / 192 |
| Base / Boost Clock | 2.5 GHz / 5.4 GHz |
| L3 Cache | 384 MB |
| TDP | 350W |
| Memory Channels | 8-channel DDR5 |
| Memory Bandwidth | 409.6 GB/s (DDR5-6400) |
| PCIe Lanes (CPU) | 128 lanes PCIe 5.0 |
| Socket | sTR5 |
| Max Memory | 2TB ECC RDIMM DDR5 |
The Threadripper Pro 9995WX is the most capable consumer-purchasable CPU for local AI workloads in 2026. Its 128 PCIe 5.0 lanes allow up to 8 GPUs at full x16 bandwidth simultaneously — no lane sharing, no bifurcation compromises. The 409.6 GB/s memory bandwidth (with DDR5-6400 in 8-channel configuration) is critical for CPU offloading: when model layers hit system RAM, that bandwidth directly determines throughput.
AMD claimed roughly 49% higher tokens-per-second for 32B DeepSeek R1 inference versus the Intel Xeon W9-3595X, attributing the gap to Zen 5’s superior cache hierarchy and memory bandwidth. The 384MB L3 cache cannot hold a multi-billion-parameter model outright, but it is large enough to keep hot weight blocks and KV-cache working sets resident, which measurably accelerates offloaded-layer throughput.
Price: Threadripper Pro 9995WX pricing is not publicly listed at MSRP — it is sold through authorized resellers and system integrators. Expect $5,000–$8,000 for the CPU alone, with complete workstation builds starting at $15,000.
AMD Threadripper Pro 7995WX — Proven Zen 4 Workhorse
The previous-generation Threadripper Pro 7995WX (Zen 4, 96 cores) remains a top choice for users who need the sTR5 platform’s lane count but want a slightly more accessible price point. It offers 128 PCIe 5.0 lanes, 8-channel DDR5 up to DDR5-5200, and is available from major system integrators.
CPU price: approximately $2,800–$4,500 in April 2026. Compared to the 9995WX, expect 10–15% lower memory bandwidth and single-thread performance but otherwise similar multi-GPU support capability.
Intel Core Ultra 9 285K — Consumer Tier, 4× GPU Capable
| Specification | Core Ultra 9 285K |
|---|---|
| Architecture | Arrow Lake-S |
| Cores / Threads | 8P + 16E (24 cores / 24 threads) |
| L3 Cache | 36 MB |
| Memory | DDR5-6400, 2-channel |
| Memory Bandwidth | ~102 GB/s |
| PCIe Lanes | 24 (CPU) + 20 (chipset) |
| TDP | 125W (base) / 250W (boost) |
| Socket | LGA 1851 |
| Retail Price | ~$580–$620 |
For single-GPU or dual-GPU setups (2× GPUs at x8/x8 or x16/x4), the Core Ultra 9 285K is the price/performance champion for consumer budgets. Its 24 CPU PCIe lanes can support two GPUs at PCIe 5.0 x16/x4 (not ideal) or x8/x8, which still provides ample bandwidth for GPU communication and NVMe storage.
The 2-channel DDR5 memory limits system RAM bandwidth to approximately 102 GB/s — a significant constraint for heavy CPU offloading workloads. For single-GPU setups with no offloading, this is irrelevant. For mixed GPU+CPU inference, it becomes a bottleneck.
AMD Ryzen 9 9950X — The Consumer 16-Core Option
| Specification | Ryzen 9 9950X |
|---|---|
| Architecture | Zen 5, 4nm TSMC |
| Cores / Threads | 16 / 32 |
| Boost Clock | 5.7 GHz |
| L3 Cache | 64 MB |
| Memory | DDR5-5600, 2-channel |
| PCIe Lanes | 24 (CPU) + 28 (chipset, X870E) |
| TDP | 170W |
| Retail Price | ~$550–$650 |
The Ryzen 9 9950X paired with an X870E motherboard is the sweet spot for single-RTX 5090 or dual-RTX 4090 builds. The 24 CPU PCIe lanes support one GPU at full PCIe 5.0 x16, with storage on additional lanes. For CPU offloading, the Zen 5 architecture’s improved IPC and 64MB L3 cache outperform the Core Ultra 9 285K in LLM-specific benchmarks by approximately 8–12%.
CPU Comparison for AI Workstations
| CPU | PCIe Lanes | Max GPUs (x16) | RAM BW | Max RAM | Price | Best Use |
|---|---|---|---|---|---|---|
| TR Pro 9995WX | 128 (PCIe 5) | 8 | 409.6 GB/s | 2TB ECC | $5,000–8,000 | 4–8 GPU professional |
| TR Pro 7995WX | 128 (PCIe 5) | 8 | 307.2 GB/s | 2TB ECC | $2,800–4,500 | 4–8 GPU workstation |
| Ryzen 9 9950X | 24 (PCIe 5) | 1–2 | ~100 GB/s | 192GB DDR5 | $550–650 | 1–2 GPU consumer |
| Core Ultra 9 285K | 24 (PCIe 5) | 1–2 | ~102 GB/s | 192GB DDR5 | $580–620 | 1–2 GPU consumer |
| EPYC 9654 | 128 (PCIe 5) | 8 | 460 GB/s | 6TB ECC | $8,000–12,000 | Server 8+ GPU |
Motherboards: The Platform That Defines What’s Possible
The motherboard determines your PCIe lane topology, maximum GPU count, memory channel configuration, and upgrade headroom. For local AI workstations, this decision is platform-defining.
ASUS Pro WS WRX90E-SAGE SE — The Multi-GPU Pinnacle
This board represents the most capable consumer-purchasable motherboard for multi-GPU AI workstations. Designed specifically for the Threadripper Pro 7000/9000 WX-Series on socket sTR5:
- PCIe slots: 7× PCIe 5.0 x16 — all capable of running at full x16 bandwidth from the CPU’s 128 PCIe lanes
- Memory: 8 DIMM slots, up to 2TB ECC RDIMM DDR5, 8-channel memory architecture
- Storage: 4× M.2 slots (PCIe 5.0 compatible), multiple SATA ports
- Networking: Dual 10 GbE Intel LAN ports — critical for distributed inference or serving models over fast LAN
- Form factor: EEB (Extended ATX) — requires full-tower or server chassis
- Remote management: Integrated IPMI for headless server operation
- Price: $1,247–$1,291 (Newegg, April 2026)
With 7 physical x16 slots all fed by 128 PCIe 5.0 lanes from the CPU directly (no lane sharing through chipset), this board can support configurations impossible on any consumer desktop platform: 4× RTX 6000 Ada all at PCIe 5.0 x16, or even 6× RTX 4090 with x16 each plus NVMe storage.
ASRock WRX90 WS EVO — The Alternative Threadripper Pro Platform
The ASRock WRX90 WS EVO offers similar capabilities to the ASUS Pro WS at a marginally lower price point. Key differences: the ASUS has a 32-phase VRM (vs ASRock’s 18-phase), which matters for the highest TDP Threadripper Pro CPUs under heavy sustained load. The ASRock offers 7× PCIe 5.0 x16 slots, identical DDR5 8-channel support, and dual Intel 10G LAN. Both boards are compatible with both the 7000WX and 9000WX Threadripper Pro series.
ASUS Pro WS TRX50-SAGE WIFI — Threadripper (Non-Pro) Option
For users who want a multi-GPU platform without the Threadripper Pro price premium, the non-Pro Threadripper 7000 series (socket TRX50) offers up to 64 PCIe 5.0 lanes — enough for 4 GPUs at x16 simultaneously. The ASUS Pro WS TRX50-SAGE WIFI supports the Ryzen Threadripper 7960X (24 cores, ~$1,400) or 7980X (64 cores, ~$2,800):
- 6× PCIe 5.0 x16 slots (4 at CPU, 2 at chipset)
- 8 DIMM slots (quad-channel DDR5 ECC RDIMM, up to 1TB)
- Multiple M.2 PCIe 5.0 slots
- Wi-Fi 7, 10 GbE LAN
- Price: ~$800–$900
Consumer Z890 / X870E Boards for Single or Dual GPU
For builds centered on one or two GPUs, consumer AM5 (Ryzen 9000) or LGA 1851 (Intel Core Ultra) platforms are fully adequate. Top boards for AI use cases:
- ASUS ROG Maximus Z890 Apex: $850, dual PCIe 5.0 x16 slots (running x8/x8 when both are populated) with the Core Ultra 9 285K
- MSI MEG X870E ACE: $700, AM5 socket (Ryzen 9000), PCIe 5.0 x16 primary, PCIe 4.0 x16 secondary
- ASUS ProArt X870E-Creator WiFi: $550, content creator focus with two full-bandwidth PCIe 5.0 slots and extensive connectivity
- Gigabyte X870E Aorus Master: $480, excellent VRM for heavy CPU loads, dual M.2 PCIe 5.0
RAM: System Memory as the AI Model’s Secondary Layer
System RAM (DRAM) plays two distinct roles in local AI workstations:
- CPU offloading layer: Layers of the model that don’t fit in VRAM are stored and computed in system RAM. RAM bandwidth and capacity directly determine how much model can be offloaded and at what speed.
- KV Cache and context: The key-value cache for transformer attention grows with context length. At 128K context with a 70B model, KV cache alone can reach 10–20GB. This typically stays in VRAM but overflows to RAM for very long contexts.
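A rough KV-cache sizing sketch using the standard keys-plus-values formula; the dimensions below (80 layers, 8 KV heads with GQA, head dimension 128) are assumptions in the Llama-3-70B style, not exact figures for any specific checkpoint.

```python
# Rough KV-cache size: 2 (keys + values) x layers x KV heads x head dim x
# context length x bytes per element, for a single sequence.

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context: int, bytes_per_elem: float) -> float:
    return 2 * layers * kv_heads * head_dim * context * bytes_per_elem / 1024**3

if __name__ == "__main__":
    for ctx in (8_192, 32_768, 131_072):
        fp16 = kv_cache_gb(80, 8, 128, ctx, 2.0)   # assumed 70B-class dimensions
        q8 = kv_cache_gb(80, 8, 128, ctx, 1.0)     # 8-bit KV cache
        print(f"{ctx:>7} tokens: ~{fp16:.1f} GB FP16 KV / ~{q8:.1f} GB 8-bit KV")
    # 128K tokens: ~40 GB FP16, ~20 GB at 8 bits — why long-context runs often
    # quantize the KV cache or let it spill toward system RAM.
```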
How Much RAM Do You Need?
| Use Case | Minimum RAM | Recommended RAM | Notes |
|---|---|---|---|
| Single GPU, no offloading (model fits in VRAM) | 32 GB | 64 GB | OS + apps + headroom |
| 70B model with partial CPU offloading | 64 GB | 128 GB | Offloaded layers need ~40–60GB headroom |
| 405B model (full CPU offload, Q4) | 256 GB | 512 GB | ~115GB model + OS + KV cache |
| Multi-user vLLM serving (70B, 8 users) | 128 GB | 256 GB | Parallel KV caches multiply |
| Fine-tuning 70B (QLoRA) | 128 GB | 256 GB | Gradient checkpoints and optimizer states |
Consumer DDR5: The Right Kits for AI Builds
For consumer AM5 / LGA 1851 (max 128–192GB, dual-channel):
- G.SKILL Trident Z5 Neo 128GB (2×64GB) DDR5-6000 CL30: ~$280. Best performance-per-dollar for Ryzen 9 9950X builds. Dual-channel limits bandwidth to ~96 GB/s but this is the platform ceiling.
- Kingston FURY Beast 128GB (4×32GB) DDR5-5600 CL40: ~$240. More slots occupied but lower latency than many kits at this speed.
For Threadripper / Threadripper Pro (8-channel ECC RDIMM):
- Kingston FURY Renegade Pro 256GB (8×32GB) DDR5-5600 ECC RDIMM CL36: ~$1,200. The standard choice for Threadripper Pro 7000 workstations. 8-channel DDR5 provides up to 307 GB/s bandwidth — critical for CPU offloading workloads.
- G.SKILL G5 Series 256GB (8×32GB) DDR5-6000 ECC RDIMM CL30: ~$1,400–$1,800. Higher speed with excellent stability. XMP 3.0 profiles tested with ASUS WRX90 boards.
- A-Tech 512GB (8×64GB) DDR5-6400 ECC RDIMM: ~$17,500. For the 405B use case — 512GB system RAM enables full Q4 inference of Llama 3.1 405B or DeepSeek R1 671B in system memory when VRAM is insufficient.
- V-Color 2TB RDIMM kit (256GB per DIMM) for Threadripper Pro 9000: Announced in 2025, available through specialty channels. Enables 2TB system RAM on a single WRX90 platform — theoretical capability to run even the 671B DeepSeek in full Q4 precision from system RAM.
Important note on RAM bandwidth and CPU offloading speed: When model layers are offloaded to system RAM, throughput scales almost linearly with memory bandwidth. A Threadripper Pro with 409.6 GB/s bandwidth processes offloaded layers approximately 4× faster than a dual-channel consumer platform at 100 GB/s. This makes the platform choice critical when you know you’ll be offloading.
Storage: Where Your Models Live Between Sessions
LLM model files are substantial: a 7B model in Q4_K_M format is approximately 4–5GB; a 70B model is 40–43GB; the 405B model in Q4 is approximately 230GB. Selecting the right storage ensures fast model loading and smooth switching between models.
What Storage Performance Actually Means for AI Workloads
When you launch llama.cpp, Ollama, or vLLM with a 70B model, the model weights are loaded from disk into VRAM (or system RAM for offloading). With a PCIe 4.0 NVMe drive reading at 7,000 MB/s, a 43GB model loads in approximately 6–7 seconds. With a PCIe 3.0 drive at 3,500 MB/s, the same model takes 12–14 seconds. For inference sessions that run for hours once loaded, the difference is startup time only — after loading, disk speed is irrelevant unless you’re frequently switching models.
PCIe 5.0 drives (14,000+ MB/s) cut load time to 3–4 seconds for 70B models but carry a 40–60% price premium and run significantly hotter, requiring active heatsinks. The consensus among practitioners is: PCIe 4.0 is the optimal tier for model storage — fast enough that loading isn’t a friction point, without the premium and heat of PCIe 5.0.
Top NVMe SSDs for AI Model Storage (2026)
| Drive | Capacity | Interface | Sequential Read | Price (Apr 2026) | Endurance (TBW) |
|---|---|---|---|---|---|
| Samsung 990 Pro | 2TB / 4TB | PCIe 4.0 NVMe | 7,450 MB/s | $150 / $280 | 1,200 / 2,400 |
| WD Black SN850X | 2TB / 4TB | PCIe 4.0 NVMe | 7,300 MB/s | $140 / $260 | 1,200 / 2,400 |
| Kingston KC3000 | 2TB / 4TB | PCIe 4.0 NVMe | 7,000 MB/s | $120 / $230 | 1,600 / 3,200 |
| Crucial T705 | 2TB / 4TB | PCIe 5.0 NVMe | 14,500 MB/s | $220 / $390 | 1,200 / 2,400 |
| Seagate FireCuda 530 | 2TB / 4TB | PCIe 4.0 NVMe | 7,300 MB/s | $145 / $270 | 1,275 / 2,550 |
Recommended minimum configuration: 2× Samsung 990 Pro 2TB in separate M.2 slots — one for the OS and applications, one dedicated to model storage. Total: 4TB capacity, $300 in drives, and the ability to keep 5–8 medium-large models (7B–34B) on the model drive simultaneously.
For 405B+ model users: A dedicated 4TB or larger drive is necessary. The Llama 3.1 405B in Q4_K_M format occupies 232GB; keeping multiple large models requires 2–4TB minimum. Consider a 4TB Samsung 990 Pro ($280) or a 4TB Seagate IronWolf Pro NAS drive ($120) for slower but high-capacity model archive storage, with a fast NVMe for the active model.
Power Supply Units: The Foundation You Cannot Compromise
GPU TDP numbers in the RTX 5090 generation are not suggestions — they’re sustained power draws under AI inference load, which is often more demanding than gaming workloads. A PSU that’s undersized will either throttle your GPU or fail under sustained load. Calculate your power budget correctly.
PSU Sizing Calculator
| Component | Typical Peak Draw |
|---|---|
| RTX 5090 | 575W (TDP, sustained at 100%) |
| RTX 4090 | 450W (TDP, sustained) |
| RTX 3090 | 350W (TDP, sustained) |
| Threadripper Pro 9995WX | 350W (TDP) |
| Ryzen 9 9950X | 170W (PBO off) / 240W (PBO2 sustained) |
| Core Ultra 9 285K | 253W (PL2 sustained) |
| Motherboard + RAM + Storage | 50–100W |
| Cooling (AIOs, fans) | 30–60W |
Single RTX 5090 build (consumer CPU):
575W (GPU) + 250W (CPU) + 100W (rest) = 925W. Use a 1,200W PSU minimum for headroom. A 1,200W 80+ Platinum unit running near 75–80% load stays cooler and more efficient than a 1,000W unit pushed past 90%.
Dual RTX 5090 build (consumer CPU):
1,150W (2× GPU) + 250W (CPU) + 100W (rest) = 1,500W. Use a 1,600W+ PSU. At this power level, ATX 3.1 compliance and PCIe 5.1 12V-2×6 connectors are required for each GPU.
4× GPU professional build (Threadripper Pro):
1,600W+ (4× GPU) + 350W (CPU) + 150W (rest) = 2,100W+. Requires dual PSU configuration (two 1,200–1,600W units) or a single server PSU rated for 2,000W+.
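The sizing above is simple addition; the sketch below encodes the sustained-draw figures from the table, with PSU suggestions in the comments mirroring the recommendations in this section.

```python
# Power-budget arithmetic: sum sustained component draw, then choose a PSU
# with headroom (20-30% where the budget allows; quality ATX 3.1 units are
# also rated for short GPU power excursions on top of this).

DRAW_W = {  # sustained figures from the table above
    "RTX 5090": 575, "RTX 4090": 450, "RTX 3090": 350,
    "Ryzen 9 9950X": 240, "Core Ultra 9 285K": 253, "TR Pro 9995WX": 350,
}

def sustained_watts(parts: list[str], platform_w: int = 100) -> int:
    """GPUs + CPU plus an allowance for board, RAM, storage, and fans."""
    return sum(DRAW_W[p] for p in parts) + platform_w

print(sustained_watts(["RTX 5090", "Ryzen 9 9950X"]))              # ~915W  -> 1,200W PSU
print(sustained_watts(["RTX 5090"] * 2 + ["Ryzen 9 9950X"]))       # ~1,490W -> 1,600W+ PSU
print(sustained_watts(["RTX 4090"] * 4 + ["TR Pro 9995WX"], 150))  # ~2,300W -> dual PSU
```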
Top PSU Recommendations for AI Workstations
1,200W Tier — Single High-End GPU (RTX 5090) or Dual RTX 4090
- Seasonic PRIME TX-1300 (1,300W, 80+ Titanium): ~$360. Industry benchmark for quality and reliability. 12-year warranty, zero-RPM fan mode, fully modular. The choice for builds where PSU failure is not an option.
- Corsair HX1500i (1,500W, 80+ Platinum): ~$280. Fully modular, i-series digital monitoring via USB (iCUE), excellent sustained efficiency. 10-year warranty.
- be quiet! Dark Power 13 (1,000W, 80+ Titanium): ~$250. For single GPU builds. Silent operation, exceptional build quality, 10-year warranty.
1,600W Tier — Dual RTX 5090 or 3×RTX 4090
- MSI MEG Ai1600T PCIE5 (1,600W, 80+ Titanium): ~$695. ATX 3.1 and PCIe 5.1 compliant with dual 12V-2×6 connectors built-in. Designed specifically for the dual RTX 5090 configuration. 12-year warranty.
- Corsair AX1600i Digital (1,600W, 80+ Titanium): ~$610. GaN switching transistors, fully modular, 10-year warranty. The enthusiast’s choice for fully digital power monitoring.
- be quiet! Dark Power Pro 13 (1,600W, 80+ Titanium): ~$580. Up to 94.5% efficiency, ATX 3.1 compliant, frameless Silent Wings 4 fan, 10-year warranty. Best acoustics in its class.
- ASUS Pro WS 1600W Platinum (ATX 3.1): ~$400. Purpose-engineered for AI workstations; supports two RTX 5090s on the same unit.
- Seasonic PRIME PX-1600 (1,600W, 80+ Platinum): ~$380. Premium Japanese capacitors, 12-year warranty, ATX 3.1 and PCIe 5.1 ready.
Dual PSU for 4+ GPU Builds
For systems exceeding 2,000W sustained draw, two synchronized PSUs are the practical solution on consumer/prosumer hardware. The Seasonic PRIME TX-1300 (×2) combination provides 2,600W at Titanium efficiency. An add-2-PSU (Add2PSU) adapter module ($20–$40) synchronizes the two units’ power-on sequencing from a single motherboard power-good signal.
Cooling: Managing Thermal Loads That Would Melt Budget Hardware
AI inference is one of the most thermally demanding consumer workloads. Unlike gaming, which involves frame-to-frame variation in GPU utilization, LLM inference runs the GPU at 95–100% utilization continuously for as long as the inference session is active. A model generating responses for two hours is putting sustained thermal stress on GPU, VRM, and memory for two continuous hours.
GPU Cooling Options
Triple-fan open-air (consumer cards: RTX 5090, 4090, 3090): These coolers are designed around bursty gaming loads but handle sustained AI inference adequately provided your case has strong positive airflow — at least three 120mm or two 140mm intake fans feeding the GPU, with a clear exhaust path behind and above it.
Custom water cooling loops: For dual-GPU consumer builds, custom loops with full-cover GPU waterblocks eliminate thermal throttling and reduce noise significantly. EKWB, Alphacool, and Bykski offer waterblocks for RTX 5090 and RTX 4090. A basic custom loop adds $400–$800 to the build cost but provides the best sustained thermals achievable.
Blower-fan professional cards (L40S, RTX 6000 Ada): These cards use a single blower fan that exhausts heat directly out the rear of the card and the case — ideal for multi-GPU rack configurations where open-air coolers would recirculate each other’s heat. In desktop cases, blower cards can be noisier but thermally better suited for multi-GPU arrays.
CPU Cooling
For Threadripper Pro systems at 350W TDP, air cooling is marginal. The recommended options:
- Noctua NH-U14S TR5-SP6: ~$100. Noctua’s best air cooler specifically designed for sTR5 socket. Handles Threadripper Pro 7995WX comfortably at stock settings. Quiet, reliable, no liquid risk.
- ASUS ROG Ryujin III 360 ARGB: ~$250. 360mm AIO, Gen 4 Asetek pump, for users pushing Threadripper Pro to maximum sustained performance with PBO enabled.
- Custom loop with Heatkiller IV TR5 waterblock: For the 9995WX at full 350W sustained, a custom loop is the only way to maintain safe temperatures without throttling. Budget $800–$1,200 for a proper custom CPU block + 480mm radiator setup.
The NVIDIA DGX Spark: A New Category of Local AI Computer
In January 2025, NVIDIA announced the DGX Spark — a personal AI supercomputer powered by the GB10 Grace Blackwell Superchip. This is not a traditional desktop computer. It’s a compact, purpose-built AI inference device available for direct consumer purchase, and it deserves specific coverage in any guide on hardware for running powerful AI models locally.
| Specification | DGX Spark |
|---|---|
| Chip | NVIDIA GB10 Grace Blackwell Superchip |
| CPU | 20-core ARM (10× Cortex-X925 @ 4GHz + 10× Cortex-A725 @ 2.8GHz) |
| GPU Cores | 6,144 CUDA cores (Blackwell) |
| Memory | 128GB LPDDR5x unified (CPU + GPU share) |
| Memory Bandwidth | 273 GB/s (unified, not segregated) |
| AI Performance | 1 PFLOP at FP4 with sparsity |
| Storage | 1TB or 4TB NVMe |
| Networking | 200Gbps QSFP (2×), 10GbE, Wi-Fi 7 |
| Power | 240W total (entire system) |
| Size | 150 × 150 × 50.5mm — smaller than a Mac Mini |
| Price | $4,699 (Founder’s Edition, April 2026) |
The DGX Spark’s key capability: it can run models up to 200 billion parameters natively, or up to 405B when two units are connected via QSFP interconnect. The 128GB unified memory pool eliminates the VRAM vs. system RAM dichotomy that constrains traditional GPU builds — the entire memory pool is accessible at GPU bandwidth.
However, at 273 GB/s bandwidth (shared CPU/GPU), it is significantly slower than discrete GPU setups for models that fit in VRAM. A single RTX 5090 with 1,792 GB/s GPU bandwidth will generate tokens 5–6× faster for 7B–32B models. The DGX Spark’s advantage is in handling 100B–200B models that a single RTX 5090 cannot address without heavy quantization and CPU offloading.
Who should buy the DGX Spark: Organizations and individual researchers who need 100B–200B model inference locally, without building and maintaining a multi-GPU x86 workstation. The simplicity of the DGX Spark (plug in, install software, run models) versus the complexity of a multi-GPU Threadripper Pro build is a legitimate consideration for teams where hardware administration isn’t a core competency.
Apple Silicon: The Unified Memory Alternative
No guide on local AI hardware is complete without addressing Apple Silicon’s unique architecture. The Mac Studio M4 Ultra and Mac Pro M4 Ultra offer a fundamentally different approach to the VRAM bottleneck: unified memory that serves both CPU and GPU from a single high-bandwidth pool, with no data transfer penalty between compute units.
| Specification | Mac Studio M4 Ultra | Mac Pro M4 Ultra | Mac Mini M4 Pro |
|---|---|---|---|
| Unified Memory | Up to 192GB | Up to 192GB | Up to 64GB |
| Memory Bandwidth | 800 GB/s | 800 GB/s | 273 GB/s |
| GPU Cores | 80 | 80 | 20 |
| Price | $4,999–$9,000+ | $9,999–$15,000+ | $1,399–$1,999 |
How Apple Silicon compares for LLM inference:
- Models under 32B: An RTX 5090 at 1,792 GB/s bandwidth produces 2–3× more tokens per second than a Mac Studio M4 Ultra at 800 GB/s for small-to-medium models that fit in the GPU’s VRAM.
- 70B models: The Mac Studio M4 Ultra with 192GB unified memory can run Llama 3 70B at Q8 quality level (70GB) without any offloading, at approximately 15–25 tokens/sec. An RTX 5090 trying the same model must offload heavily, dropping to 3–5 tokens/sec. The Mac wins decisively here.
- 405B models: Apple’s M4 Ultra at 192GB can run Llama 3.1 405B at Q2 quantization (approximately 100–115GB). This is an extraordinary capability for a single-unit $9,000 machine. Token generation speed is ~2–5 tok/s, which is slow but functional for research and evaluation purposes.
The framework that emerges: if you primarily work with models under 32B parameters, a discrete NVIDIA GPU workstation is faster and cheaper. If you primarily work with 70B–200B models and need high-quality output without heavy quantization, Apple Silicon’s unified memory architecture is currently unmatched in the consumer market.
Hardware for Running Powerful AI Models Locally: Three Complete Build Tiers
The following builds represent complete, purchasable hardware for running powerful AI models locally, targeting specific use cases and budgets. Prices reflect April 2026 market rates.
Build Tier 1 — “The Practitioner” (~$5,000–$7,000)
Best for: Professionals running models up to 34B parameters. Daily LLM coding assistant, local RAG, fine-tuning small models.
| Component | Selection | Price | Notes |
|---|---|---|---|
| GPU | NVIDIA RTX 5090 (ASUS ROG Strix OC or MSI Suprim X) | $3,500–$4,200 | 32GB GDDR7, 575W TDP |
| CPU | AMD Ryzen 9 9950X | $580–$650 | 16C/32T, Zen 5, excellent IPC |
| Motherboard | ASUS ProArt X870E-Creator WiFi | $520–$560 | PCIe 5.0 x16 GPU slot, robust VRM |
| RAM | G.SKILL Trident Z5 Neo 128GB (2×64GB) DDR5-6000 | $280–$320 | Max supported, dual-channel |
| Storage (OS) | Samsung 990 Pro 2TB | $145–$160 | PCIe 4.0, 7,450 MB/s read |
| Storage (Models) | WD Black SN850X 4TB | $260–$280 | Dedicated model storage |
| PSU | Corsair HX1500i 1,500W Platinum | $270–$300 | Headroom for GPU+CPU peak draw |
| CPU Cooler | Noctua NH-D15 G2 (or NZXT Kraken 360 AIO) | $120–$200 | 9950X has high boost power draw |
| Case | Fractal Design Torrent XL or Lian Li PC-O11 Dynamic EVO XL | $170–$220 | Excellent airflow for 575W GPU |
| Total | ~$5,845–$6,890 |
Expected performance:
- Llama 3 8B (Q4_K_M): ~200–213 tokens/sec
- DeepSeek R1 32B (Q4_K_M, fits comfortably in 32GB): ~65 tokens/sec
- Llama 3 70B (Q3_K_M, ~32GB, minimal CPU offload): ~30–40 tokens/sec
- Qwen 2.5 72B (Q3_K_S, ~27GB): ~50–60 tokens/sec
Who uses a setup like this: Independent AI researchers, senior ML engineers running local coding assistants with 32B+ models, privacy-first law firms running document analysis workflows, and developers fine-tuning models on 7B–13B architectures with custom datasets.
Build Tier 2 — “The Power User” (~$12,000–$16,000)
Best for: Running 70B models at high quality without compromise. Small-team model serving. Local research with 405B class models at acceptable quantization.
| Component | Selection | Price | Notes |
|---|---|---|---|
| GPU ×2 | 2× NVIDIA RTX 5090 (reference or AIB partner) | $7,000–$8,400 | 64GB combined GDDR7 via PCIe |
| CPU | AMD Threadripper 7960X (24C) or 7980X (64C) | $1,400–$2,800 | TRX50 PCIe 5.0 lanes: both GPUs at full x16 |
| Motherboard | ASUS Pro WS TRX50-SAGE WIFI | $800–$900 | 4× PCIe 5.0 x16 slots from CPU |
| RAM | 64GB DDR5-5600 (4×16GB) or 128GB DDR5-4800 (4×32GB) | $200–$450 | Quad-channel on TRX50 |
| Storage (OS) | Samsung 990 Pro 2TB | $150 | |
| Storage (Models) | 2× Samsung 990 Pro 4TB (RAID-0 stripe) | $560 | 8TB, ~13,000 MB/s combined read |
| PSU | MSI MEG Ai1600T PCIE5 1,600W Titanium | $695 | ATX 3.1, dual 12V-2×6 native |
| CPU Cooler | ASUS ROG Ryujin III 360 ARGB | $250 | TRX50 platform supported |
| Case | Thermaltake Core P8 TG ATX Full Tower | $250–$300 | Supports dual-GPU full-length cards with clearance |
| Total | ~$11,305–$14,505 |
Expected performance:
- Llama 3 70B (Q4_K_M, full in 64GB combined VRAM): ~70–85 tokens/sec
- DeepSeek R1 70B (Q5_K_M, ~50GB, full in VRAM): ~55–65 tokens/sec
- Llama 3.1 405B (Q2_K, ~115GB, needs CPU offload): ~10–18 tokens/sec
- Multi-user vLLM (70B Q4, 4 concurrent users): ~18–22 tokens/sec each
Real-world comparison: This build profile mirrors the dual RTX 5090 air-gapped setup documented by CraftRigs for legal and compliance teams in early 2026, described as “enterprise-class local AI without enterprise procurement.” The use case: a legal firm processing confidential client documents through 70B-class models with absolute certainty that no data leaves the premises.
Build Tier 3 — “The Research Station” (~$25,000–$45,000+)
Best for: Running 70B at FP16, quantized 405B and 671B at respectable speeds. Multi-user research team serving. The closest consumer-purchasable equivalent to a small inference cluster.
| Component | Selection | Price | Notes |
|---|---|---|---|
| GPU ×4 | 4× NVIDIA L40S 48GB (or 4× RTX 6000 Ada) | $28,000–$32,000 | 192GB combined VRAM via PCIe 4.0 |
| CPU | AMD Threadripper Pro 7995WX (96C) | $3,500–$4,500 | 128 PCIe 5.0 lanes for 4×GPU at x16 |
| Motherboard | ASUS Pro WS WRX90E-SAGE SE | $1,247–$1,291 | 7× PCIe 5.0 x16, 2TB ECC RAM support |
| RAM | Kingston FURY Renegade Pro 256GB (8×32GB) DDR5-5600 ECC RDIMM | $1,200–$1,400 | 8-channel, 307 GB/s bandwidth |
| Storage (OS) | Samsung 990 Pro 2TB | $150 | |
| Storage (Models) | 4× WD Black SN850X 4TB + 4TB HDD archive | $1,040+$120 | 16TB NVMe fast storage |
| PSU ×2 | 2× Seasonic PRIME TX-1300 + Add2PSU | $720 | 2,600W combined at Titanium efficiency |
| CPU Cooler | Custom loop: EKWB Quantum Magnitude sTR5 + 480mm rad | $500–$800 | Mandatory for TR Pro at sustained 350W |
| Case | Lian Li O11 Vision XL or server chassis (4U rackmount) | $300–$600 | Must accommodate 4× dual-slot blower cards |
| Total | ~$36,777–$42,631 |
Expected performance (4× L40S, 192GB total VRAM):
- Llama 3 70B (Q8, ~74GB, full in VRAM): ~35–45 tokens/sec
- Llama 3 70B (FP16, ~140GB, full in 192GB): ~18–25 tokens/sec
- Llama 3.1 405B (Q4, ~230GB — requires CPU offload): ~8–15 tokens/sec
- DeepSeek R1 671B (Q2, ~350GB — requires CPU offload to 256GB RAM): Functional but slow (~2–4 tok/s)
- Multi-user serving: 10–20 concurrent researchers at 10–15 tok/s each
Documented parallel: Ahmad Osman’s 8× RTX 3090 basement server (192GB total VRAM via consumer GPUs and NVLink pairs) documented in July 2024 is the closest public example to Tier 3 functionality at a lower budget. It runs on an ASRock Rack ROMED8-2T motherboard with AMD EPYC Milan 7713, 512GB DDR4 RAM, and three 1600W PSUs. The total cost was approximately $12,000–$15,000 using secondhand RTX 3090 cards — a testament to the fact that used professional configurations can dramatically undercut new build costs for research teams willing to accept used-hardware risk.
Software Stack: Maximizing Your Hardware’s Potential
Choosing the right inference framework is as important as the hardware configuration. The software layer determines how efficiently your GPUs are utilized, what quantization formats are supported, and how many concurrent users you can serve.
llama.cpp — The Universal Baseline
llama.cpp is the foundational open-source inference library that runs quantized models across NVIDIA (CUDA), AMD (ROCm), Apple Silicon, and even CPU-only configurations. It’s the engine underneath Ollama and many other user-facing tools.
Key characteristics:
- Supports GGUF quantization formats (Q2_K through Q8_0, IQ1 through IQ4)
- Multi-GPU tensor split via the `--tensor-split` flag (PCIe multi-GPU without NVLink)
- CPU offloading via the `--n-gpu-layers` parameter (precise control over what goes to VRAM vs. RAM)
- MCP (Model Context Protocol) client support added March 2026
- Best for: single-user, maximum performance, all hardware platforms
Performance benchmark (13B model, RTX 4090): ~75–85 tokens/sec at Q4_K_M via llama.cpp CUDA.
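For scripted use, the same knobs are exposed through the llama-cpp-python bindings. A minimal sketch, assuming the package is installed with CUDA support and using a hypothetical local GGUF path:

```python
# Minimal llama-cpp-python sketch: n_gpu_layers controls the VRAM/RAM split,
# tensor_split distributes weights across multiple GPUs.
from llama_cpp import Llama

llm = Llama(
    model_path="/models/deepseek-r1-32b.Q4_K_M.gguf",  # hypothetical path
    n_gpu_layers=-1,          # -1 = offload every layer to GPU if it fits
    n_ctx=8192,               # context window; KV cache grows with this
    tensor_split=[0.5, 0.5],  # only relevant with two visible GPUs
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize PCIe vs NVLink in one sentence."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```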
Ollama — The Practitioner’s Interface
Ollama wraps llama.cpp with a clean CLI, REST API, and model management system. For practitioners who want to switch between models quickly without managing GGUF files manually, Ollama reduces friction significantly. Performance is essentially identical to llama.cpp for single-user inference (same engine underneath).
Best for: Individual practitioners, local coding assistant integration (Cursor, VS Code, Continue.dev), non-technical team members who need model access without CLI expertise.
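Ollama’s local REST API (port 11434 by default) makes those models scriptable from any language. A minimal sketch, assuming the Ollama daemon is running and a 70B-class model has already been pulled:

```python
# Minimal Ollama REST sketch: non-streaming generation against the local daemon.
# Assumes something like `ollama pull llama3:70b` has been run beforehand.
import json
import urllib.request

payload = {
    "model": "llama3:70b",
    "prompt": "List three risks of sending client documents to a hosted LLM API.",
    "stream": False,
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```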
ExLlamaV2 — Maximum Speed on NVIDIA
ExLlamaV2 is the fastest inference solution available for NVIDIA GPUs, using custom CUDA kernels that bypass PyTorch overhead. Benchmarks consistently show 50–85% faster token generation than llama.cpp for equivalent quantization levels.
| Framework | Model | GPU | Speed (tok/s) |
|---|---|---|---|
| ExLlamaV2 (EXL2 4.25 bpw) | Llama 2 13B | RTX 3090 | ~57 |
| llama.cpp (Q4_K_M) | Llama 2 13B | RTX 3090 | ~31 |
| ExLlamaV2 | Mistral 7B | RTX 4070 | ~118 |
| Ollama | Mistral 7B | RTX 4070 | ~52 |
Limitation: ExLlamaV2 requires NVIDIA GPUs (RTX 2000-series or newer) and uses EXL2 format rather than GGUF. It does not support CPU offloading, so the entire model must fit in VRAM. For VRAM-constrained setups, llama.cpp’s flexibility may be worth the throughput trade-off.
vLLM — Production Multi-User Serving
vLLM is the standard production inference server for multi-user LLM deployment. Its PagedAttention mechanism efficiently manages KV cache for multiple concurrent requests, enabling dramatically better throughput under concurrent load compared to llama.cpp:
- Single user: llama.cpp slightly faster (lower overhead)
- 5 concurrent users: vLLM ~80 tok/s each vs llama.cpp ~60 tok/s each
- Batch processing: vLLM achieves ~2,000 tok/s total vs llama.cpp’s ~300 tok/s
- vLLM v0.17.0 (March 2026): PyTorch 2.10, FlashAttention 4, AMD ROCm first-class support
Best for: Research teams serving 5+ concurrent users, API-based access to local models, organizations running 70B models as a shared internal resource.
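A minimal vLLM sketch for the shared-model use case, assuming two GPUs and an AWQ-quantized 70B checkpoint (the model name below is a placeholder — substitute your own); `tensor_parallel_size` shards the weights across the cards:

```python
# vLLM offline-batching sketch: PagedAttention manages the KV cache across
# requests, tensor_parallel_size=2 shards the model across two GPUs.
from vllm import LLM, SamplingParams

llm = LLM(
    model="example-org/Llama-3.1-70B-Instruct-AWQ",  # placeholder checkpoint name
    tensor_parallel_size=2,
    gpu_memory_utilization=0.90,
)
params = SamplingParams(temperature=0.2, max_tokens=256)

prompts = [f"Summarize clause {i} of the sample agreement." for i in range(8)]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text[:80])
```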
Model Selection: What Can Your Hardware Actually Run?
With hardware for running powerful AI models locally specified across three build tiers, the practical question is: which models can each tier actually run, and at what quality level?
The Quantization Trade-Off
Modern quantization reduces model size by storing weights at lower precision than the training FP16/BF16. The quality cost varies by quantization level:
| Quantization | Bits per Weight | Quality vs FP16 | Size (70B model) | VRAM Needed |
|---|---|---|---|---|
| FP16 | 16 | 100% (baseline) | ~140GB | 140GB+ |
| Q8_0 | 8 | ~99% | ~74GB | 76GB+ |
| Q5_K_M | 5 | ~97–98% | ~48GB | 50GB+ |
| Q4_K_M | 4 | ~95–96% | ~43GB | 45GB+ |
| Q3_K_M | 3 | ~92–93% | ~32GB | 34GB+ |
| Q2_K | 2 | ~85–88% | ~21GB | 23GB+ |
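The size column follows directly from bits-per-weight. A quick estimator for whether a given model and quantization fit a VRAM pool — the effective bits-per-weight values and the fixed overhead allowance below are rough assumptions, not exact GGUF figures:

```python
# Rough fit check: weight bytes ~= parameters x bits-per-weight / 8, plus an
# assumed allowance for runtime buffers and KV cache.

def fits_in_vram(params_b: float, bits_per_weight: float,
                 vram_gb: float, overhead_gb: float = 2.0) -> tuple[float, bool]:
    weights_gb = params_b * bits_per_weight / 8  # 1B params at 8 bpw ~= 1 GB
    needed = weights_gb + overhead_gb
    return needed, needed <= vram_gb

for name, params, bpw in [("Llama 3 70B", 70, 4.85),    # Q4_K_M ~4.85 bpw
                          ("Llama 3 70B", 70, 2.6),     # Q2_K   ~2.6 bpw
                          ("DeepSeek R1 32B", 32, 4.85)]:
    needed, ok = fits_in_vram(params, bpw, vram_gb=32)
    print(f"{name} @ {bpw} bpw: ~{needed:.0f} GB needed, fits in 32 GB: {ok}")
```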
Model Compatibility by Build Tier
| Model | Params | Tier 1 (RTX 5090, 32GB) | Tier 2 (2×RTX 5090, 64GB) | Tier 3 (4×L40S, 192GB) |
|---|---|---|---|---|
| Llama 3 8B | 8B | ✅ Q8 native (~5GB) | ✅ FP16 native (~16GB) | ✅ FP16 native |
| Qwen 2.5 14B | 14B | ✅ Q8 native (~14GB) | ✅ FP16 native (~28GB) | ✅ FP16 native |
| DeepSeek R1 32B | 32B | ✅ Q4 native (~19GB) | ✅ Q8 native (~32GB) | ✅ FP16 native |
| Llama 3 70B | 70B | ⚠️ Q3 partial (~34GB, some CPU offload) | ✅ Q4 native (~43GB) | ✅ Q8 native (~74GB) |
| DeepSeek R1 70B | 70B | ⚠️ Q3 partial, CPU offload | ✅ Q4 native (~43GB) | ✅ Q8 native (~74GB) |
| Qwen 2.5 72B | 72B | ✅ Q3_K_S native (~27GB) | ✅ Q4 native (~44GB) | ✅ Q8 (~75GB) |
| Llama 3.1 405B | 405B | ❌ Not practical | ⚠️ Q2 with heavy CPU offload | ⚠️ Q4 with partial CPU offload |
| DeepSeek R1 671B | 671B | ❌ Not possible | ❌ Not possible | ⚠️ Q2 with 256GB RAM offload |
Buying Guide: Where to Purchase and What to Watch For
RTX 5090 Supply Situation (April 2026)
The RTX 5090 launched at $1,999 MSRP in January 2025 but has experienced persistent supply constraints. As of April 2026, street prices remain $3,500–$4,200 at major US retailers. Supply arrives at Newegg, B&H, Adorama, and Best Buy in unpredictable batches. The most reliable method for purchasing at or near MSRP:
- Set up stock alerts via NowInStock.net for RTX 5090 at all major retailers
- NVIDIA’s own store (store.nvidia.com) offers Founders Edition drops with purchase limits
- AIB partner cards (ASUS, EVGA, MSI, Gigabyte) often appear at slightly above MSRP from their own stores
RTX 4090 Value Assessment
At $1,400–$1,700, the RTX 4090 provides approximately 56% of the RTX 5090’s memory bandwidth at roughly 40% of its street price. For the vast majority of practical local AI use cases below 70B parameters, the RTX 4090 remains the rational choice. Check B&H, Newegg, and Amazon Warehouse Deals for open-box units at $1,300–$1,500.
Used RTX 3090 for NVLink Builds
If the dual-NVLink 3090 architecture fits your use case, eBay and local Craigslist/Facebook Marketplace listings regularly offer RTX 3090s at $550–$850. Verify the NVLink connector is not damaged before purchase — it’s a small gold contact strip on the top edge of the card. The NVLink HB bridge accessory (required) sells new for $100–$150 from NVIDIA-authorized resellers.
Professional GPU Channels
L40S and RTX 6000 Ada cards are sold through NVIDIA’s professional reseller network, not consumer channels. Major suppliers include Microway, Silicon Mechanics, and Puget Systems (for complete system builds). Expect lead times of 2–6 weeks for new stock. The used market for professional GPUs is active on eBay — L40S cards have appeared at $4,500–$6,500 from data center liquidations.
Complete System Builders
For organizations that want a single-vendor solution with professional support:
- Puget Systems: Specialized in video production and deep learning workstations. Pre-configured AI systems starting at $8,000. Exceptional documentation and customer support for workstation AI builds.
- Lambda Labs: Offers “GPU Cloud for Research” workstation systems for sale, in addition to their cloud service. GPU workstations priced $12,000–$65,000+ depending on GPU configuration.
- NVIDIA DGX Station A100: The institutional equivalent — 4× A100 80GB GPUs in a tower chassis. Pricing available through NVIDIA enterprise sales (typically $80,000–$150,000). Not consumer-purchasable in the traditional sense.
Real-World Use Cases: Who Is Actually Running This Hardware?
Beyond specifications, the strongest validation for these hardware configurations comes from documented real-world deployments by engineers and researchers who have published their setups.
The 8-GPU RTX 3090 Basement AI Server
Ahmad Osman’s documented basement server build (published July 2024) remains one of the most comprehensive public examples of a high-VRAM local AI cluster. The configuration:
- 8× NVIDIA RTX 3090 (192GB total VRAM via 4 NVLink pairs)
- ASRock Rack ROMED8-2T motherboard with AMD EPYC Milan 7713 (64 cores, 128 threads)
- 512GB DDR4-3200 ECC RAM
- Three 1,600W power supplies
- Primary purpose: running Meta’s Llama 3.1 405B for research applications
The NVLink pairs are critical: 4 bridges creating 4 GPU pairs, each pair sharing 48GB of NVLink-bonded VRAM. For model layers distributed across all 8 GPUs, PCIe 4.0 x16 handles inter-pair communication while NVLink handles intra-pair communication. The result is Llama 3.1 405B at Q4_K_M inference at functional (if not fast) throughput.
The Andrej Karpathy Single-GPU Autoresearch Setup
Andrej Karpathy (former Tesla AI director, OpenAI co-founder) released “autoresearch” in March 2026 — an AI agent that autonomously runs model training experiments. His documented setup targets RTX 3090 and RTX 4090 single-GPU configurations, with specific notes:
- 24GB VRAM (RTX 3090/4090) is the standard target — sufficient for continuous automated fine-tuning experiments
- For 12–16GB cards (RTX 3060/4060 Ti): scaled-down configurations require reducing model depth, vocab size, and sequence length
- CUDA 12.8+ required, Python 3.10+
This use case exemplifies single-GPU productivity: not the largest models, but continuous, automated, low-overhead experimentation that runs overnight and produces results by morning.
The Compliance Team Air-Gapped Dual RTX 5090 Setup
CraftRigs documented a dual RTX 5090 workstation deployed by a legal compliance team in early 2026. The configuration (2× RTX 5090, 64GB combined VRAM, air-gapped network) runs a 70B model on a dedicated workstation with zero internet connectivity — a hard requirement for attorney-client privilege in document review workflows. The team serves 4–6 concurrent attorneys at 20+ tokens/sec each, with 100% data locality guarantees.
This is the archetype for the Tier 2 build: enterprise-tier privacy requirements met with consumer-purchasable hardware at a fraction of NVIDIA enterprise hardware costs.
Decision Framework: Which Build Is Right for You?
After covering the full landscape of hardware for running powerful AI models locally, the practical question is how to navigate this decision. Here is a direct framework:
Choose Tier 1 (RTX 5090 single GPU) if:
- Your primary models are 7B–34B parameters
- You occasionally need 70B access and can tolerate Q3 quantization with some CPU offloading
- Single-user inference (you’re the only person running the model)
- Budget ceiling of $7,000–$8,000 total
- You want a workstation that also handles other GPU workloads (video, gaming, stable diffusion)
Choose 2× RTX 3090 NVLink if:
- 70B models at Q4 quality are your primary use case
- Budget is under $3,000 for the GPU pair
- You’re comfortable with used hardware and the associated risk
- You understand that RTX 3090 is end-of-life from NVIDIA’s perspective (no further driver feature development)
Choose Tier 2 (2× RTX 5090 + Threadripper) if:
- 70B models at Q4 or higher quality without CPU offloading are required
- Multi-user serving (2–6 concurrent users) is needed
- Privacy requirements mandate local deployment of larger models
- Budget can accommodate $12,000–$16,000 total investment
Choose Tier 3 (4× L40S + Threadripper Pro) if:
- Your team regularly works with 70B models at FP16 or Q8 quality
- Research workflows require 405B class inference, even at slow speeds
- 10–20 concurrent users need model access
- You have an institutional budget and need professional-grade reliability (ECC, blower cards, IPMI management)
Consider DGX Spark ($4,699) instead of Tier 1 if:
- Models from 70B–200B are your primary use case and low token speed (2–10 tok/s) is acceptable
- You want turnkey simplicity over flexibility
- Silence and compact form factor matter
- You’re not planning to use the hardware for anything other than AI inference
Consider Apple Silicon (Mac Studio M4 Ultra) instead of GPU build if:
- macOS is your primary environment (integrated ecosystem with Xcode, ML frameworks)
- 70B–192B models at moderate speed are acceptable (15–25 tok/s at Q4)
- Single-vendor warranty and support matter
- Power consumption and noise are priorities (Mac Studio draws 60–70W vs 700W+ for GPU builds)
Power and Electrical Considerations
Tier 2 and Tier 3 builds consume power at levels that require attention to electrical infrastructure beyond just the PSU selection.
- Tier 1 (~900–1,100W sustained): A standard 15A/120V household circuit is marginally sufficient. Running on a dedicated 20A circuit is strongly recommended to avoid breaker trips during sustained inference, since transient current spikes can momentarily exceed the 15A rating.
- Tier 2 (~1,400–1,600W sustained): Requires a dedicated 20A/120V circuit or a 15A/240V circuit. Many home offices do not have 20A circuits; electrician consultation is warranted before building.
- Tier 3 (2,000W+ sustained): Requires a 240V 20A dedicated circuit or two separate 20A circuits. Server room or dedicated electrical infrastructure may be necessary. At 24/7 operation, annual electricity cost at $0.15/kWh is approximately $2,628/year — a real operational cost to factor into the ROI calculation.
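The operating-cost arithmetic is worth rerunning for your own duty cycle and local rate; a one-function sketch:

```python
# Annual electricity cost for sustained draw: watts -> kWh -> dollars.
def annual_cost(watts: float, usd_per_kwh: float = 0.15, hours: float = 24 * 365) -> float:
    return watts / 1000 * hours * usd_per_kwh

for tier, watts in [("Tier 1", 1000), ("Tier 2", 1500), ("Tier 3", 2000)]:
    print(f"{tier} at {watts}W, 24/7: ~${annual_cost(watts):,.0f}/year")
# Tier 3 at 2,000W matches the ~$2,628/year figure above; most workstations idle
# well below peak, so real bills are usually a fraction of this.
```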
Conclusion: The Best Time to Build a Local AI Workstation Is Now
The argument for investing in the right hardware for running powerful AI models locally has never been stronger. The 2026 landscape offers a convergence of circumstances that didn’t exist 18 months ago:
- Open-weight models at the 70B parameter scale routinely match GPT-4-era performance on coding, reasoning, and analysis tasks
- Consumer hardware (RTX 5090) has crossed the 32GB VRAM threshold, enabling 70B inference at acceptable quality levels from a single card with aggressive quantization
- Software frameworks (llama.cpp, ExLlamaV2, vLLM) have matured to extract near-theoretical hardware efficiency
- Quantization techniques (EXL2, IQ quants, GGUF) have advanced to where Q4 inference on well-quantized models is difficult to distinguish from FP16 on most practical tasks
The Tier 1 build (~$6,000–$7,000) delivers LLM inference performance that would have required a $50,000+ enterprise server in 2022. The Tier 3 build (~$40,000) replicates capabilities of a small inference cluster. For the professionals and organizations for whom data privacy, latency, and cost at scale are genuinely important, these investments return their value within months of deployment.
For more on the cloud GPU alternatives — when renting beats building — see our comprehensive GPU VPS for AI roundup, our RunPod vs Vast.ai vs Lambda Labs comparison, and our regularly updated GPU cloud pricing comparison.
Sources and References
- NVIDIA GeForce RTX 5090 — Official Specifications
- NVIDIA DGX Spark — Official Product Page
- Tim Dettmers — Which GPU for Deep Learning (2023, updated reference)
- llama.cpp — GitHub Repository
- vLLM Documentation
- ASUS Pro WS WRX90E-SAGE SE — Newegg listing
- Ahmad Osman — Serving AI from the Basement: 8× RTX 3090 Build
- Lambda Labs — GPU Cloud and Workstations