LLM Inference Provider Frameworks
Presenter: Dmitri Iourovitski
Date: October 1, 2025
Duration: 15 minutes
Overview
Dmitri Iourovitski provided a comprehensive comparison of local LLM inference providers, focusing on practical trade-offs between ease of use, performance, and production readiness.
Scope & Focus
Covered: Local inference providers
- Ollama / Llama.cpp
- Hugging Face Transformers
- VLLM
Not Covered: Cloud providers
- Reasoning: Cloud providers (OpenAI, Anthropic, etc.) have already solved deployment and scaling on their side
- Better to learn from provider-specific documentation and resources
Provider 1: Ollama / Llama.cpp
Overview
- Ollama is a wrapper around Llama.cpp
- Makes model downloading easier
- Simplified local deployment
Key Features
Focus: Local execution with ease
- Designed for quick demos
- GGUF format optimization
- CPU-friendly architecture
- Easy setup and use
GGUF Format
What it is: Binary format optimized for CPU execution
Advantages:
- Easy for CPUs to process
- Runs well on Apple Silicon (M-series) chips
- Supports AMD APUs (hybrid CPU+GPU chips)
- Accelerator-friendly
Use case: Quick prototyping and demos
Strengths
✅ Ease of Use:
- Simplest to install and run
- No GPU required
- Minimal configuration
- Great for beginners
✅ Platform Support:
- macOS (M1/M2/M3 chips)
- Windows
- Linux
- CPU-only systems
✅ Quick Demos:
- Fastest path to running models locally
- Immediate gratification
- Low barrier to entry
Limitations
❌ No Batch Processing:
- Cannot process multiple requests efficiently
- Requests are handled one at a time, sequentially
- Poor fit for production workflows
- Example: Processing 100 PDF pages means 100 sequential requests (see the sketch below)
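A minimal sketch of what that looks like in practice, using the `ollama` Python client (`pip install ollama`); the model name and prompts are illustrative assumptions:

```python
# Sketch: Ollama exposes no batch API, so a 100-page job becomes
# 100 sequential requests (model name is an assumption).
import ollama

pages = [f"...text of page {i}..." for i in range(100)]  # placeholder inputs
for page in pages:
    # Each call blocks until the previous one finishes -- no batching.
    resp = ollama.generate(model="llama3.2", prompt=f"Extract the text:\n{page}")
    print(resp["response"][:80])
```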
❌ Opaque Memory Management:
ollama ps # Check model loading and memory usage
Problems:
- Automatic CPU/GPU memory splitting
- No control over placement
- Unpredictable behavior as context grows
- Models silently move between CPU/GPU
Example Issue:
- You have a GPU and the model initially loads (at least partly) onto it
- As the max sequence length grows, the context (KV cache) needs more memory
- Ollama silently offloads part of the model to the CPU
- Performance degrades unexpectedly, with no warning
❌ Not Production-Tuned:
- Limited concurrent request handling
- No proper batching
- Can quickly run out of memory
- Server settings insufficient for real workloads
Configuration Options (exist but limited):
- Number of models loaded simultaneously (the OLLAMA_MAX_LOADED_MODELS environment variable)
- Parallel requests per model (OLLAMA_NUM_PARALLEL)
- But: without true batching, memory issues remain
When to Use Ollama
✅ Good for:
- Local development
- Quick demos
- Testing models
- Learning LLMs
- macOS development
❌ Avoid for:
- Production deployments
- Batch processing
- High-throughput needs
- Critical applications
Provider 2: Hugging Face Transformers
Overview
Amazing inference stack with fine-grained control over every aspect of model execution.
Key Features
Fine-Grained Control:
- Layer-by-layer device assignment
- Accelerator selection per layer
- Partial layer sharding across GPUs
- Complete memory management control
Example Capabilities (see the sketch below):
- "Layers 1-10 on GPU 1"
- "Layers 11-20 on GPU 2"
- "Layer 21 split 50/50 across GPUs 1 & 2"
Strengths
✅ Cutting-Edge Features:
- First to implement new techniques
- Flash Attention support (since 2023!)
- Latest model architectures
- Immediate access to new releases
Flash Attention:
- Ollama: Added only recently, and for a limited set of models
- Transformers: Easy to enable since 2023 (see the sketch below)
- Significant performance improvement
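Enabling it in Transformers is a one-line opt-in, sketched below (assumes the `flash-attn` package, a supported NVIDIA GPU, and an illustrative model id):

```python
# Sketch: opting into FlashAttention-2 in Transformers.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    torch_dtype=torch.bfloat16,               # FA2 requires fp16/bf16
    attn_implementation="flash_attention_2",  # the one-line opt-in
    device_map="auto",
)
```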
✅ Feature-Rich:
- Advanced quantization (AWQ, etc.)
- On-the-fly dequantization during inference
- Near full-precision quality with quantized weights
- Extensive optimization options
✅ Model Architecture Flexibility:
- Non-standardized approach
- Each model has own architecture
- Can use models before official release
- Architecture available before weights
Example: Qwen3-Next
- Architecture released early
- Transformers supported it immediately
- Run the model on day one, as soon as the weights drop
- No waiting for provider support
✅ Production Ready:
- Used in production successfully
- Reliable and stable
- Extensive tooling
- Strong community support
Performance Example
Qwen2.5-VL (7B model) - OCR on 20 pages of images:
Provider | Time | Quantization |
---|---|---|
Ollama (3090) | 35 minutes | GGUF standard |
Transformers (3090) | 8 minutes | AWQ (aggressive) |
Note: Transformers was faster despite MORE aggressive quantization due to Flash Attention and optimization
Quantization Options
AWQ (Activation-aware Weight Quantization):
- Weights quantized offline using activation statistics, dequantized on the fly during inference
- Maintains near-full-precision quality
- Better quality than static GGUF quantization at comparable size
- Performance + quality (loading sketch below)
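A sketch of loading a pre-quantized AWQ checkpoint with Transformers; the quantization config ships inside the checkpoint, so no extra configuration is needed (assumes the `autoawq` package; the model id is illustrative):

```python
# Sketch: loading an AWQ-quantized checkpoint with Transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-7B-Instruct-AWQ"  # illustrative AWQ checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
```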
Limitations
❌ Developer-Centric:
- Requires understanding of:
- Layer offloading
- Sharding parameters
- Memory management
- Model architecture
Risk: Can misconfigure and severely impact performance
❌ Steep Learning Curve:
- Not standardized across models
- Each model = different architecture
- Manual configuration required
- Need to understand internals
Architecture Variability:
- Llama: Own architecture
- BERT: Own architecture
- Qwen: Own architecture
- No universal interface
❌ Manual Memory Management:
- You control everything
- Complexity increases
- Easy to make mistakes
- Requires expertise
When to Use Transformers
✅ Good for:
- Production deployments (with expertise)
- Maximum performance needs
- Latest model access
- Custom optimization
- Advanced use cases
- GPU-equipped systems
✅ Reasonable Prerequisites:
- Single GPU (12GB+)
- Developer expertise
- Time to optimize
- Performance critical
❌ Avoid for:
- Beginners
- Quick prototypes
- Limited GPU memory (<12GB)
- CPU-only systems
Developer Tooling
Advantages:
- Extensive debugging tools
- Profiling capabilities
- Memory analyzers
- Performance metrics
PyTorch Backend:
- Uses PyTorch underneath
- Can use other backends
- Flexible architecture
- Strong ecosystem
Provider 3: VLLM
Overview
Optimized for high-throughput production inference
Key Innovation: Paged Attention
Problem: Traditional attention keeps each sequence's entire KV cache in one contiguous allocation on one device
- Device overload = crash
- Memory fragments and overflows as context grows
Flash Attention: Tiles the attention computation so the full attention matrix is never materialized
- Smart kernel-level optimization
- Does not address KV-cache memory management
Paged Attention (VLLM):
- Pages the KV cache in fixed-size blocks, like OS virtual memory
- Prevents memory overflow and fragmentation
- Memory grows with actual usage, not max sequence length
- Production-grade reliability (conceptual sketch below)
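A purely conceptual sketch of the idea (not VLLM's actual internals): a KV cache carved into fixed-size blocks plus a per-sequence block table, the virtual-memory analogy in miniature:

```python
# Conceptual sketch only: paged KV-cache bookkeeping.
BLOCK_SIZE = 16  # tokens per physical block

class PagedKVCache:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))    # pool of physical blocks
        self.block_tables: dict[int, list[int]] = {}  # seq id -> block list
        self.lengths: dict[int, int] = {}             # seq id -> token count

    def append_token(self, seq_id: int) -> None:
        table = self.block_tables.setdefault(seq_id, [])
        self.lengths[seq_id] = self.lengths.get(seq_id, 0) + 1
        # Memory grows with tokens actually generated, one block at a time,
        # instead of reserving space for the maximum sequence length.
        if self.lengths[seq_id] > len(table) * BLOCK_SIZE:
            table.append(self.free_blocks.pop())

    def release(self, seq_id: int) -> None:
        # Finished sequences return their blocks to the pool immediately,
        # so concurrent requests can reuse them.
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```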
Strengths
✅ Production Ready:
- Out-of-the-box production optimization
- Designed for scale
- Robust handling
- Enterprise-grade
✅ High Throughput:
- Efficient batching
- Concurrent request handling
- Maximum GPU utilization
- Optimized for volume
✅ First-Class Batching:
- Batching as core feature
- Extensive documentation
- Request batching controls
- Token allocation per request
- Seamless to implement (see the sketch below)
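A sketch of VLLM's offline batched API: one call, many prompts, and the engine handles batching and scheduling itself (the model id is illustrative; assumes roughly a 12GB+ GPU):

```python
# Sketch: vLLM offline batched inference.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")  # illustrative model id
params = SamplingParams(temperature=0.7, max_tokens=256)

prompts = [f"Summarize page {i} of the report." for i in range(100)]
outputs = llm.generate(prompts, params)  # batched, not sequential
for out in outputs:
    print(out.outputs[0].text[:80])
```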
✅ Enterprise Use Cases:
- PDF processing at scale
- RAG implementations
- Multiple concurrent services
- Production workloads
Limitations
❌ Heavy Setup:
- Complex installation
- Configuration intensive
- Steep learning curve
Installation Tip: Use Docker for easier setup
❌ GPU Requirements:
- Minimum: ~12GB GPU
- Optimal: High-end GPUs (3090, 4090, 5090)
- Not suitable for smaller GPUs
- Won't work on macOS
Under 12GB: Limited value, constant paging overhead
❌ Overkill for Small Scale:
- Not worth it for single requests
- Overhead for simple use cases
- Better alternatives for low volume
When to Use VLLM
✅ Perfect for:
- Production GPU clusters
- High-throughput needs
- Multiple concurrent users
- Batch processing
- Enterprise deployments
✅ Hardware Requirements:
- 3090, 4090, 5090 GPUs
- 12GB+ VRAM
- Multiple GPUs (ideal)
✅ Use Cases:
- PDF processing pipelines
- RAG at scale
- Open WebUI + background processing
- Multiple simultaneous services
Example Scenario (client sketch below):
- Open WebUI for chatting
- Background RAG processing
- PDF analysis running
- All sharing same GPU efficiently
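A sketch of how one of those services would talk to a shared local VLLM server through its OpenAI-compatible endpoint (the port, model name, and prompt are assumptions; the model must match whatever the server loaded):

```python
# Sketch: client of a local vLLM OpenAI-compatible server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",  # must match the server's loaded model
    messages=[{"role": "user", "content": "Summarize this PDF page: ..."}],
)
print(resp.choices[0].message.content)
```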
Concurrent Request Problem (Ollama)
Issue: Using Ollama + Open WebUI + background tasks
- Processes wait on each other
- Memory duplication
- Hardware throttling (especially M1 chips)
- Performance degradation
VLLM Solution:
- Proper request queuing
- Efficient batching
- No duplication
- Optimized resource use
Comparison Matrix
Feature | Ollama/Llama.cpp | HF Transformers | VLLM |
---|---|---|---|
Ease of Use | ⭐⭐⭐⭐⭐ Easiest | ⭐⭐⭐ Moderate | ⭐ Difficult |
Performance | ⭐⭐ Low | ⭐⭐⭐⭐⭐ Excellent | ⭐⭐⭐⭐⭐ Maximum |
Batching | ❌ No | ✅ Manual | ✅ Automatic |
Production | ❌ No | ✅ Yes | ✅ Yes |
GPU Needs | None (CPU OK) | 1 GPU (12GB+) | 1+ high-end GPUs |
Learning Curve | Low | Steep | Very Steep |
Platform | Any (Mac OK) | GPU/PyTorch | GPU only |
Setup Time | Minutes | Hours | Hours/Days |
Batching Docs | N/A | Manual | Extensive |
Key Takeaways
1. Choose Based on Use Case
Ollama/Llama.cpp:
- 📱 Local development
- 🎓 Learning
- 🔬 Prototyping
- 💻 macOS development
Hugging Face Transformers:
- 🔧 Flexible production
- 🚀 Latest models
- ⚡ Performance tuning
- 🎯 Single GPU optimization
VLLM:
- 🏢 Production at scale
- 📊 High throughput
- 🖥️ GPU clusters
- 🔄 Concurrent requests
2. Hardware Matters
Hardware | Recommendation |
---|---|
CPU only | Ollama |
Single GPU (<12GB) | Ollama or Transformers |
Single GPU (12-24GB) | Transformers |
Single High-End GPU | VLLM or Transformers |
Multiple GPUs | VLLM |
macOS M-series | Ollama |
3. Memory Management Philosophy
Ollama: Automatic (opaque)
- ➕ Easy
- ➖ Unpredictable
Transformers: Manual (transparent)
- ➕ Complete control
- ➖ Complex
VLLM: Managed (optimized)
- ➕ Production-grade
- ➖ Heavy
4. Flash Attention Importance
Why it matters:
- Significant performance boost
- Memory efficiency
- Should be standard (2023 tech!)
Support:
- ❌ Ollama: Limited/recent
- ✅ Transformers: Full (since 2023)
- ✅ VLLM: Full
5. Batching is Critical
For production:
- Required for efficiency
- Cost optimization
- Resource utilization
- User experience
Provider Support:
- ❌ Ollama: None
- ⚠️ Transformers: Manual
- ✅ VLLM: First-class
Real-World Scenarios
Scenario 1: Personal AI Assistant
Requirements:
- Single user
- Occasional queries
- macOS laptop
- No GPU
Recommendation: Ollama
- Easy setup
- Sufficient performance
- No GPU needed
- Perfect for personal use
Scenario 2: Development Team RAG
Requirements:
- 10-20 developers
- Document processing
- Single 3090 GPU
- Custom models
Recommendation: Hugging Face Transformers
- Manual batching acceptable
- Latest models access
- Custom optimization
- Developer flexibility
Scenario 3: Production SaaS
Requirements:
- 1000s of users
- 24/7 availability
- Multiple GPUs
- Cost optimization critical
Recommendation: VLLM
- Production-grade reliability
- Automatic batching
- Maximum throughput
- Proven at scale
Scenario 4: Research Lab
Requirements:
- Latest models immediately
- Custom architectures
- Experimentation
- Multiple GPUs
Recommendation: Hugging Face Transformers
- Day-one model access
- Architecture flexibility
- Research features
- Complete control
Advanced Topics
Combining Providers
VLLM + Transformers:
- VLLM can use Transformers as backend
- Best of both worlds
- Production + flexibility
SGLang
Mentioned but not covered:
- Alternative to VLLM
- Worth investigating
- Similar use cases
Architecture Standardization
PyTorch vs. Transformers:
- PyTorch: Pushes for standardized model definitions
- Transformers: Embraces per-model variability
- Trade-offs in flexibility vs. consistency
Q&A Highlights
Q: Llama.cpp has serving capabilities - have you tried?
A: Yes, tried it. Fundamental limitations:
- Not optimized for dynamic batching
- Core design choices prevent deeper serving optimizations
- CPU-oriented design adds overhead
- High-throughput serving is not the project's primary goal
Resources
Official Documentation
- Ollama: https://github.com/ollama/ollama
- Transformers: https://huggingface.co/docs/transformers
- vLLM: https://docs.vllm.ai
Installation
Ollama:
# macOS
brew install ollama
# Linux
curl -fsSL https://ollama.com/install.sh | sh
Transformers:
pip install transformers torch
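A minimal quick-start sketch to verify the install (the model id is an illustrative small model; weights download on first run):

```python
# Sketch: minimal Transformers text-generation quick start.
from transformers import pipeline

generator = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")
print(generator("Local LLM inference is", max_new_tokens=40)[0]["generated_text"])
```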
VLLM (Docker recommended):
docker pull vllm/vllm-openai:latest
Community Resources
- AIMUG Discord
- Hugging Face Forums
- VLLM GitHub discussions
Future Considerations
Monitoring Trends
- New providers emerging
- Protocol standardization
- Performance improvements
- Hardware evolution
Optimization Techniques
- Quantization advances
- Attention mechanisms
- Distributed inference
- Memory efficiency
Related Sessions:
Video: Watch the full presentation in the October 2025 showcase recording