Running a Local LLM for Threat Hunting: Setup, Models, and Real Workflows

14 May 2026 | 12 min read | justruss.tech

Running a large language model locally changes what is practical for a security analyst. A local LLM has no data retention policy to worry about, no API cost per query, no terms of service that restrict what you can paste into it, and no network dependency when you are working in an air-gapped environment or analysing something sensitive that should not leave your network. The practical use cases for threat hunting are real: parsing and summarising large event log exports, explaining unfamiliar code or shellcode, drafting Sigma rules from a plain English description of behaviour, and running a conversational analysis session against a memory dump or a packet capture.

This covers the full setup across three hardware tiers, the models that actually work for security work, and how to integrate a local LLM into a threat hunting workflow in ways that make a measurable difference to investigation efficiency.

Choosing the right model for security work

Not all models are equally useful for technical security work. The things that matter most are instruction following (does it actually do what you ask), code comprehension (can it read and explain a PowerShell script or a Yara rule), and context window size (can it hold a large event log or a full Sigma ruleset in memory at once). As of mid-2026, the models that perform best for security analyst workflows are in the 7B to 34B parameter range, with the largest models you can run at acceptable speed on your hardware generally being the best choice.

Models worth knowing about for this use case:

Qwen2.5 and Qwen2.5-Coder (Alibaba) are the standout recommendation for security analyst workflows as of 2026. Qwen2.5-Coder-32B in particular outperforms most similarly sized models on code comprehension, code generation, and structured output tasks. For explaining shellcode, deobfuscating PowerShell, and generating Sigma rules with correct syntax, it consistently produces better results than Llama or Mistral models of equivalent size. Qwen2.5-72B is the general-purpose variant for reasoning-heavy tasks. Both are available through Ollama.

Llama 3.1 8B and 70B (Meta) are reliable all-rounders. The 8B model is the right choice when speed matters more than depth. The 70B model produces noticeably better reasoning on complex analysis tasks but requires substantial hardware.

Mistral 7B and Mixtral 8x7B are excellent for structured output. Mixtral specifically tends to produce cleaner YAML and JSON than models of similar capability, which makes it useful for Sigma rule generation where formatting correctness matters.

Phi-3 Medium and Phi-4 (Microsoft) are surprisingly capable for their size and work well on CPU-only setups where the larger Qwen and Llama models are too slow to be practical.

The security-fine-tuned models like SecureBERT are narrower in scope and generally less useful than a well-prompted general-purpose model of similar size. Qwen2.5-Coder has effectively made CodeLlama redundant for most security code analysis tasks.
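
Once Ollama is installed (setup for each hardware tier is covered below), it is worth checking what you actually pulled before committing to a model: ollama show reports the parameter count, quantisation, and context length, which is where the trade-offs above become concrete.

# Compare pulled models: on-disk size, then parameters / quantisation / context length
ollama list
ollama show qwen2.5-coder:32b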

Tier 1: Apple Silicon (M1, M2, M3, M4)

Apple Silicon Macs are the best consumer hardware for local LLM work because the unified memory architecture allows the GPU and CPU to share the same memory pool. This means a MacBook Pro M3 Max with 128GB RAM can run a 70B parameter model at reasonable speed, something that would require a multi-GPU server setup on x86 hardware. Even an M1 MacBook Air with 16GB RAM runs 7B and 13B models comfortably.

## Setup on Apple Silicon
## Ollama handles model download, GPU/CPU allocation, and serving

# Install Ollama on macOS (the install.sh script is Linux-only)
brew install ollama
# Or download the Mac app from https://ollama.com/download/mac

# Pull a model - the default tags are 4-bit quantised builds; sizes below are approximate
ollama pull qwen2.5-coder:7b   # ~4.4GB, best for code/Sigma/VQL tasks
ollama pull qwen2.5-coder:32b  # ~19GB, excellent for deep code analysis
ollama pull qwen2.5:72b        # ~41GB, best general reasoning, needs 64GB+ RAM
ollama pull llama3.1:8b        # ~4.7GB, solid all-rounder
ollama pull mistral:7b         # ~4.1GB, excellent structured output
ollama pull phi3:medium        # ~8.5GB, best CPU-only option

# Test it immediately
ollama run llama3.1:8b

# Run the API server in the background (the Mac app does this automatically)
ollama serve &
# Or, if installed with Homebrew: brew services start ollama

# Check GPU utilisation during inference (confirm it is using the GPU)
sudo powermetrics --samplers gpu_power -i 1000 -n 3

Performance on Apple Silicon: an M3 Max running Llama 3.1 8B generates around 60-80 tokens per second. At that speed, a 500-token response (roughly a detailed analysis of a suspicious PowerShell script) takes about 7 seconds. The 70B model on the same hardware runs at around 8-12 tokens per second, which is slower but produces noticeably better reasoning on complex tasks.
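
Those numbers vary with quantisation, context size, and thermal conditions, so it is worth measuring on your own machine: the --verbose flag on ollama run prints prompt-evaluation and generation rates after each response.

# Measure real throughput on your own hardware - --verbose prints eval rates after the response
echo "Summarise the MITRE ATT&CK tactic categories in one sentence each." \
    | ollama run llama3.1:8b --verbose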

Tier 2: Consumer GPU (RTX 3080 / 4080 / 4090)

NVIDIA GPUs with CUDA are the most common setup for local LLM work on Windows and Linux. The limiting factor is VRAM: a 4090 has 24GB, which fits a 13-14B model comfortably at 8-bit or a 32B-class model with 4-bit quantisation. The RTX 3080 with 10GB VRAM fits a 7B model with room to spare but will need CPU offloading for anything larger.
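
A rough rule of thumb for whether a model fits, assuming a 4-bit quantisation at around 0.6 bytes per parameter plus one to three gigabytes for KV cache and runtime overhead:

# Back-of-the-envelope VRAM estimates for 4-bit quantised models (~0.6 bytes/parameter + overhead)
#   7B  -> roughly 4-5GB    fits a 10GB card with headroom
#   14B -> roughly 8-10GB   fits a 16GB card
#   32B -> roughly 18-20GB  fits a 24GB card
#   72B -> roughly 40-47GB  needs multiple GPUs or CPU offload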

## Setup on Linux with NVIDIA GPU (Ubuntu 22.04)

# Install CUDA toolkit (if not already installed)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring_1.0-1_all.deb
sudo apt update && sudo apt install cuda-toolkit-12-3 -y

# Install Ollama (automatically detects CUDA)
curl -fsSL https://ollama.com/install.sh | sh

# Verify GPU is being used
nvidia-smi  # Check VRAM usage while a model is running

# Pull models appropriate for your VRAM
# RTX 3080 10GB: use 7B models
ollama pull llama3.1:8b

# RTX 4080 16GB: 14B-class models work well
ollama pull qwen2.5-coder:14b

# RTX 4090 24GB: best-in-class for local security work
ollama pull qwen2.5-coder:32b  # Best code/security analysis at this tier
ollama pull qwen2.5:32b        # General-purpose alternative at the same size
# 70B-class models do not fit in 24GB even at 4-bit - they need CPU offload or a second GPU

# On Windows with WSL2:
# Install WSL2 with Ubuntu, then follow the Linux steps above
# NVIDIA drivers on the Windows host are automatically available in WSL2

## Optimising for speed on GPU

# Set the number of GPU layers (by default Ollama puts as many layers on the GPU as VRAM allows)
# If you get VRAM errors, lower num_gpu inside the session to offload some layers to CPU
ollama run qwen2.5-coder:32b
>>> /set parameter num_gpu 35

# Increase the context window for larger log files
# The default context is only a few thousand tokens - raise it for large inputs
>>> /set parameter num_ctx 8192

# Check what is currently loaded and VRAM usage
ollama ps
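
When you move from the interactive prompt to scripting, the same options can be set per request through Ollama's HTTP API, which listens on port 11434 by default; a minimal sketch:

# Set context size and GPU layer count per request via the API instead of the REPL
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "List the Windows event IDs most relevant to detecting pass-the-hash activity.",
  "stream": false,
  "options": {"num_ctx": 8192, "num_gpu": 35}
}'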

Tier 3: CPU-only (no GPU required)

A CPU-only setup is slower but completely accessible. A modern 8-core CPU with 32GB RAM can run a 7B model at around 8-15 tokens per second, which is usable for analysis tasks where you are not waiting in real time. The key is using aggressively quantised models (Q4 or Q5 format) which trade a small amount of quality for a large reduction in memory and compute requirements.
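
Ollama's default tags are already 4-bit builds; if you want to trade memory against quality explicitly, most models also publish per-quantisation tags. The tag names below are examples and vary by model, so confirm them against the model's page on ollama.com.

# Explicit quantisation tags (example names - check the model's Ollama page for the exact tags)
ollama pull llama3.1:8b-instruct-q4_K_M   # 4-bit, smallest practical footprint
ollama pull llama3.1:8b-instruct-q5_K_M   # 5-bit, slightly better quality for roughly 15% more memory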

## CPU-only setup - works on any modern Linux, Mac, or Windows machine

# Ollama works on CPU automatically when no GPU is detected
curl -fsSL https://ollama.com/install.sh | sh

# On Windows (no WSL required)
# Download Ollama installer from https://ollama.com/download/windows
# Run the installer - it handles everything

# Use smaller, more efficient models for CPU inference
ollama pull phi3:mini       # ~2.3GB, fastest, good for simple queries
ollama pull mistral:7b      # default 4-bit build, better quality at a similar speed

# llama.cpp directly gives more control on CPU - build with CMake
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j$(nproc)

# Download a GGUF model directly
wget https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf

# Run with llama.cpp (more control over threading and context)
./build/bin/llama-cli \
    -m Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
    -c 4096 \
    -t $(nproc) \
    --prompt "Analyse this PowerShell command for malicious indicators:"

Adding a web interface: Open WebUI

The command line interface works but a web interface makes longer analysis sessions significantly more comfortable. Open WebUI connects directly to Ollama and provides a chat interface with conversation history, model switching, and file upload.

## Install Open WebUI via Docker (easiest method)

# Requires Docker to be installed
# Connect to Ollama running on the same machine
docker run -d \
    -p 3000:8080 \
    --add-host=host.docker.internal:host-gateway \
    -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
    -v open-webui:/app/backend/data \
    --name open-webui \
    --restart always \
    ghcr.io/open-webui/open-webui:main

# Access at http://localhost:3000
# Create a local account (no external auth, fully local)

# Without Docker - pip install
pip install open-webui
open-webui serve
# Access at http://localhost:8080
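
If Open WebUI starts but shows no models, the usual culprit is that it cannot reach Ollama; a quick check from the host:

# Ollama's API should return a JSON list of every pulled model
curl http://localhost:11434/api/tags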

Practical threat hunting workflows

The value of a local LLM in a security workflow is not in replacing analytical judgment. It is in reducing the time spent on mechanical analysis tasks so more time is available for the parts that require human judgment. These are the workflows that actually save meaningful time.

Explaining unknown code: Paste a suspicious PowerShell script, obfuscated VBScript, or shellcode disassembly directly into the chat. A prompt like “Explain what this script does step by step, identify any suspicious behaviour, and list any network indicators or file paths it references” reliably produces useful output on most models of 13B or larger.

# Example: pipe a suspicious script directly to Ollama from the command line
cat suspicious_script.ps1 | ollama run qwen2.5-coder:14b \
    "Analyse this PowerShell script. Explain what it does step by step.
     Identify: (1) any download or execution behaviour, (2) any C2 indicators
     including URLs or IPs, (3) any persistence mechanisms, (4) any evasion
     techniques. Format your response with clear headings."
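
Encoded commands are worth decoding before they go to the model: -EncodedCommand payloads are base64-encoded UTF-16LE, and the model does a far better job on readable script than on a base64 blob. A quick decode on Linux or macOS (substitute the payload from your alert for the placeholder):

# Decode a PowerShell -EncodedCommand payload (base64 of UTF-16LE text), then analyse it
echo "<base64-payload-from-the-alert>" | base64 -d | iconv -f UTF-16LE -t UTF-8 > decoded.ps1
cat decoded.ps1 | ollama run qwen2.5-coder:14b \
    "Explain this PowerShell step by step and flag any malicious behaviour."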

Generating Sigma rules from behaviour descriptions: Describe what you observed in plain English and ask the model to produce a Sigma rule. This works well with Qwen2.5-Coder, Mistral, and Mixtral, all of which produce clean YAML output reliably.

# Sigma rule generation prompt
# Qwen2.5-Coder produces cleaner YAML output for Sigma rules than most alternatives
ollama run qwen2.5-coder:7b \
"Write a Sigma rule for the following behaviour:
 A PowerShell process spawned by winword.exe that uses the -EncodedCommand
 flag and makes an outbound network connection within 30 seconds of starting.
 The rule should target Windows Security Event ID 4688 and Sysmon Event ID 1.
 Use the standard Sigma YAML format with all required fields including
 title, id, status, logsource, detection, and falsepositives."
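
Treat the generated rule as a draft: save the YAML to a file, strip any commentary the model wraps around it, and validate the syntax before it goes near a detection pipeline. One option is sigma-cli from the pySigma project (the filename here is just an example):

# Validate the model-generated rule - catches YAML and schema errors, not bad detection logic
pip install sigma-cli
sigma check generated_rule.yml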

Parsing and summarising large event log exports: Export a filtered event log to CSV or JSON and feed it to the model in chunks. Ask for a timeline summary, anomaly identification, or correlation with a known attack pattern. The context window is the limiting factor here: an 8K context window holds roughly 6,000 words of log data, which is typically around 80-120 Windows security events.

## Python wrapper for chunked log analysis
import subprocess

def analyse_logs_with_llm(log_file, model="llama3.1:8b", chunk_size=50):
    with open(log_file) as f:
        lines = f.readlines()

    findings = []
    for i in range(0, len(lines), chunk_size):
        chunk = "".join(lines[i:i+chunk_size])
        prompt = f"""Analyse these Windows security events for suspicious activity.
Look for: unusual parent-child process relationships, credential access patterns,
lateral movement indicators, and persistence mechanisms.
Summarise findings in bullet points. If nothing suspicious, say CLEAN.

EVENTS:
{chunk}"""
        result = subprocess.run(
            ["ollama", "run", model, prompt],
            capture_output=True, text=True, timeout=120
        )
        output = result.stdout.strip()
        if "CLEAN" not in output.upper():
            findings.append({"chunk": i, "analysis": output})

    return findings

findings = analyse_logs_with_llm("security_events.csv")
for f in findings:
    print(f"Events {f['chunk']}-{f['chunk']+50}: {f['analysis']}\n")

VQL query assistance: Velociraptor’s VQL syntax is not widely documented enough to appear heavily in LLM training data, but models with strong SQL and Python foundations can help construct and debug VQL queries when given the plugin documentation as context in the prompt.

# Feed VQL plugin docs as context then ask for help
ollama run qwen2.5-coder:14b \
"Using Velociraptor VQL syntax, write a query that:
 1. Lists all running processes
 2. Filters to processes where the executable path matches AppData or Temp
 3. For each matching process, retrieves all VAD entries with EXECUTE permission
 4. Returns: Pid, Name, Exe, VAD.Start, VAD.Protection, VAD.Type
 VQL uses FROM, SELECT, WHERE, and foreach() similar to SQL.
 Plugin names: pslist(), vad(pid=X)"

Keeping the model current with security knowledge

A local model’s training data has a cutoff date and will not know about recent CVEs, new malware families, or updated attacker tooling. For current threat intelligence, use a retrieval-augmented generation (RAG) approach: feed current threat reports, blog posts, or MITRE ATT&CK updates into the model’s context as part of the prompt rather than relying on its base training.

## Simple RAG: include a threat report as context
# Download a recent threat report PDF and extract text
pip install pymupdf
python3 -c "
import fitz  # pymupdf
doc = fitz.open('threat_report_2026.pdf')
text = ' '.join([page.get_text() for page in doc])
print(text[:8000])  # First 8000 chars as context
" | head -200 > report_context.txt

# Include as context in your query
(cat report_context.txt; echo "---"; \
 echo "Based on this threat report, what MITRE ATT&CK techniques are described?
       Write a Velociraptor hunt query to detect the primary initial access technique.") \
| ollama run qwen2.5:14b

A local LLM is not a replacement for analytical skill. A model that hallucinates a nonexistent Windows event ID or writes a Sigma rule with incorrect field names will produce confident-sounding output that is wrong, and catching that requires knowing enough about the subject to verify the output. The right mental model is a capable but overconfident junior analyst whose work always needs a senior review pass. With that expectation set, the time savings on mechanical tasks are real and significant.