
How to Run Llama 3 8B and 70B Locally: A Complete Guide for Developers

Running Meta’s Llama 3 models (8B and 70B) locally gives you full privacy, zero API costs, instant offline access, and complete control over quantization and behavior. In 2025–2026, consumer hardware has advanced enough that even mid-range laptops can run 8B models at 50–100+ tokens/second, while 70B is now realistic on high-end desktops or RTX 4090-class GPUs.

This step-by-step guide covers everything: hardware requirements, best tools, quantization choices, setup steps, expected speeds, and common fixes.

Hardware Requirements (Realistic 2025–2026)

Llama 3 8B

  • Minimum: 8 GB RAM (CPU only) → 12–25 t/s
  • Recommended: 12–16 GB VRAM GPU (RTX 3060/4060/4070) → 50–110 t/s
  • Sweet spot: RTX 4070/4080 or M2/M3/M4 Pro/Max → 80–140+ t/s

Llama 3 70B (Q4_K_M or Q5_K_M)

  • Minimum: 24 GB VRAM (a single RTX 3090/4090, with part of the model offloaded to CPU) → low single-digit t/s; dual 3090s/4090s (48 GB) reach 18–35 t/s
  • Comfortable: 48 GB VRAM (A6000, 2×4090, Mac Studio M2 Ultra) → 30–55 t/s
  • Fast: Apple Silicon M4 Max (128 GB unified) or high-end multi-GPU setups → 40–70+ t/s

General tips: NVMe SSD for fast loading, 32 GB+ system RAM, good cooling to avoid thermal throttling.
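A quick sanity check before downloading anything: you can estimate whether a quantized model fits in memory from its parameter count and its bits per weight. The function below is an illustrative sketch; the 1.2× factor covering KV cache and runtime buffers is a rough assumption, not a measured figure.

```python
def model_memory_gb(params_billion: float, bits_per_weight: float,
                    overhead: float = 1.2) -> float:
    """Rough RAM/VRAM needed to run a quantized model.

    overhead ~1.2x is a ballpark allowance for KV cache and runtime buffers.
    """
    weight_gb = params_billion * bits_per_weight / 8  # 1e9 params * bits / 8 -> GB
    return weight_gb * overhead

# 8B at ~4.85 bits/weight (roughly Q4_K_M) comfortably fits a 12 GB GPU
print(f"8B Q4-ish:  ~{model_memory_gb(8, 4.85):.1f} GB")
# 70B at the same quant needs ~50 GB: more than any single consumer GPU
print(f"70B Q4-ish: ~{model_memory_gb(70, 4.85):.1f} GB")
```

This is why 8B is a laptop-class model while 70B pushes you toward multi-GPU rigs or high-memory Apple Silicon.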

Best Tools for Running Llama 3 Locally

  • Ollama — easiest one-command setup, great for beginners
  • LM Studio — polished GUI, easy model discovery and chat
  • text-generation-webui (oobabooga) — most customizable: extensions, LoRAs
  • llama.cpp — fastest raw performance, best for power users
  • KoboldCPP / SillyTavern — great for roleplay and storytelling

For most people in 2025, Ollama or LM Studio is the fastest way to get started. Want to see uncensored variants in action? Check this Dolphin Llama 3 guide for popular downloads and prompt examples.

Step-by-Step: Running Llama 3 8B & 70B

1. Download Quantized Models

Get GGUF files from trusted Hugging Face repos:

  • bartowski
  • lmstudio-community
  • NousResearch
  • QuantFactory

(TheBloke, the best-known quantizer of the Llama 2 era, stopped uploading before Llama 3 shipped, so look to the newer accounts above.)

Recommended starting points:

  • 8B → Q4_K_M or Q5_K_M
  • 70B → Q4_K_M (~42 GB file; needs ~48 GB VRAM or partial CPU offload) or IQ4_XS / Q3_K_M for tighter memory budgets
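To compare quants concretely, here is a quick sketch using approximate bits-per-weight figures for common GGUF quant types. These are rough community ballpark values, not official specs, but they predict file sizes well.

```python
# Approximate effective bits per weight for common GGUF quant types
# (rough community figures; actual files vary slightly).
BPW = {"Q3_K_M": 3.9, "IQ4_XS": 4.3, "Q4_K_M": 4.85, "Q5_K_M": 5.7, "Q8_0": 8.5}

for name, bpw in BPW.items():
    size_8b = 8 * bpw / 8    # billions of params * bits / 8 bits-per-byte -> GB
    size_70b = 70 * bpw / 8
    print(f"{name:7s}  8B ~{size_8b:4.1f} GB   70B ~{size_70b:5.1f} GB")
```

Note how 70B Q4_K_M lands around 42 GB, which is why dropping to IQ4_XS or Q3_K_M matters on memory-constrained setups.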

2. Choose & Install Your Tool

  • Ollama: ollama run llama3:8b or ollama run llama3:70b
  • LM Studio: Search “Llama 3”, download the desired quant, load, and chat
  • llama.cpp: Clone repo, build, run server with GGUF file
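Once your chosen tool is running, you can also drive it programmatically. Ollama, for example, exposes a local REST API on port 11434. The snippet below is a minimal Python sketch: it assumes you have already pulled llama3:8b, and it degrades gracefully when the server is not running.

```python
import json
import urllib.request

# Minimal request against Ollama's local REST API (default port 11434).
payload = {
    "model": "llama3:8b",
    "prompt": "Explain GGUF quantization in one sentence.",
    "stream": False,  # ask for one JSON object instead of a token stream
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
try:
    with urllib.request.urlopen(req, timeout=120) as resp:
        print(json.loads(resp.read())["response"])
except OSError:
    print("Ollama is not running; start it with `ollama serve` and retry.")
```

The same payload shape works for any model tag you have pulled, e.g. llama3:70b.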

For maximum speed, always offload as many layers as possible to the GPU (Ollama does this automatically in recent versions). Need more speed tweaks? See this Ollama performance guide.
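When the whole model does not fit in VRAM, llama.cpp-based tools let you offload a fixed number of transformer layers to the GPU and run the rest on CPU. The helper below is a rough illustration of how many layers might fit (Llama 3 has 32 layers at 8B and 80 at 70B; the 1.5 GB reserve for context and buffers is an assumption):

```python
def gpu_layers_that_fit(vram_gb: float, model_size_gb: float,
                        n_layers: int, reserve_gb: float = 1.5) -> int:
    """Estimate how many model layers fit in VRAM (rough heuristic)."""
    per_layer_gb = model_size_gb / n_layers
    usable = max(0.0, vram_gb - reserve_gb)
    return min(n_layers, int(usable / per_layer_gb))

# 8B Q4 (~4.9 GB, 32 layers) on a 12 GB card: the full model fits on the GPU
print(gpu_layers_that_fit(12, 4.9, 32))   # full offload: all 32 layers
# 70B Q4 (~42 GB, 80 layers) on a 24 GB card: only part of the model fits
print(gpu_layers_that_fit(24, 42, 80))
```

Any layer left on the CPU becomes the bottleneck, which is why a single 24 GB card runs 70B far slower than a setup that holds every layer in VRAM.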

3. Launch and Test

Start with short prompts and verify throughput:

  • 8B Q4_K_M on RTX 4070 → expect 60–100 t/s
  • 70B Q4_K_M fully on GPU (≈48 GB across two cards) → expect 20–40 t/s; a single RTX 4090 with partial CPU offload will be far slower

Adjust temperature (0.7–1.0), top-p (0.9), and context length (Llama 3's native window is 8k tokens; staying at or below 8k is fastest, and going beyond it requires RoPE scaling).
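Temperature controls how sharply the model's next-token distribution is peaked. A quick way to build intuition is to apply temperature scaling to a toy set of logits (the numbers below are made up for illustration):

```python
import math

def softmax_t(logits, temperature):
    """Softmax with temperature scaling: lower T sharpens, higher T flattens."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                        # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.2]                   # made-up next-token scores
for t in (0.7, 1.0, 1.3):
    probs = softmax_t(logits, t)
    print(f"T={t}: top-token probability = {probs[0]:.2f}")
```

Lower temperature concentrates probability on the top token (more deterministic output); higher temperature spreads it out (more varied output). Top-p then truncates the tail of this distribution before sampling.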

Expected Real-World Speeds (2025 Hardware)

  • Laptop RTX 4060 (8 GB) → 8B Q5: 45–70 t/s
  • Desktop RTX 4070 Ti (12–16 GB) → 8B Q5: 80–120 t/s
  • 2×RTX 4090 (48 GB) → 70B Q4: 25–45 t/s (a single 24 GB 4090 with CPU offload: low single digits)
  • M3 Max 64 GB → 8B Q4: 50–70 t/s, 70B Q4: 7–9 t/s

These are typical numbers after optimization — your mileage may vary.
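To translate tokens-per-second into wall-clock latency, a one-line calculation is enough. The ~300-token reply length below is an assumed figure for a typical chat answer:

```python
def reply_seconds(n_tokens: int, tokens_per_sec: float) -> float:
    """Wall-clock time to generate n_tokens at a steady decode rate."""
    return n_tokens / tokens_per_sec

# A ~300-token reply at various decode speeds
for tps in (10, 40, 100):
    print(f"{tps:3d} t/s -> {reply_seconds(300, tps):.1f} s per reply")
```

Anything above roughly 20 t/s feels interactive for chat; below 10 t/s you will notice the wait on long answers.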

Common Fixes & Troubleshooting

  • Slow loading → Use NVMe SSD
  • Out of memory → Lower quantization or reduce context
  • GPU not used → Force full layer offload
  • Choppy responses → Close background apps, update drivers

For a browser-based interface to your local Llama 3 instance, see this local web UI guide.

Want everything in one place (local models + cloud chat + image generation)? Explore Merlio’s full suite of AI tools for seamless workflows.

Turn your fast local text outputs into visuals instantly with dedicated text-to-image AI — perfect for storyboarding, concept art, or social posts.

Conclusion

Running Llama 3 8B is now very accessible: most modern laptops with a discrete GPU or Apple Silicon can handle it smoothly. Llama 3 70B is realistic on dual high-end consumer GPUs or Apple Silicon machines with enough unified memory. Choose Ollama or LM Studio for ease, or llama.cpp for maximum control and speed.

With privacy, no costs, and full control, local Llama 3 remains one of the best ways to use powerful AI.



Written by Merlio