The Complete AI Deployment Decision Guide: Local, Cloud, or Edge?

You've built an AI-powered feature. Now the hardest question: where does the model actually run?
This isn't a technology decision -- it's a business decision. Cost, privacy, latency, quality, and team capability all factor in. Here's the framework we use to help teams decide.
The Three Deployment Models
Cloud API (OpenAI, Anthropic, Google)
You rent someone else's model and infrastructure.
- Zero infrastructure to manage
- Access to the most capable models (GPT-4, Claude, Gemini Pro)
- Pay per token, scales automatically
- Model updates happen automatically
Best for: Prototyping, complex reasoning tasks, teams without ML infrastructure experience.
Real cost at scale: 1 million tokens/day at GPT-4 pricing comes to roughly $900/month -- about one senior engineer's weekly salary for a substantial daily reasoning budget.
Self-Hosted (vLLM, TGI on your cloud/on-prem)
You run open-source models on your own infrastructure.
- Full control over data and model behavior
- Predictable costs at scale (GPU rental, not per-token)
- Customizable through fine-tuning
- No vendor lock-in
Best for: High-volume production, privacy-sensitive industries, teams that need customization.
Real cost at scale: An A100 instance on AWS costs ~$3/hour. Running 24/7 = ~$2,200/month. But it handles millions of tokens per day -- 10-100x cheaper than API at volume.
Edge / On-Device
The model runs on the end user's device.
- Zero server costs per user
- Complete privacy
- Works offline
- Limited to smaller models (1-7B parameters)
Best for: Consumer apps, privacy-first products, offline scenarios, latency-critical applications.
Real cost at scale: $0 per query after initial model distribution. The cost is in developing and optimizing the edge model.

The Decision Matrix

Factor 1: Volume and Quality Requirements (Start Here)
- Daily token volume < 100K? -> Cloud API (simplest, cheapest at low volume)
- Daily token volume 100K - 10M? -> Cloud API if quality is paramount; self-hosted if cost is paramount
- Daily token volume > 10M? -> Self-hosted (cloud API costs become prohibitive)
- Need offline operation or <50ms latency? -> Edge deployment
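The volume-and-latency branch above can be sketched as a small helper. This is an illustrative decision function, not a library API -- the thresholds are the ones from this guide, and the function name is made up for the example:

```python
def suggest_deployment(tokens_per_day, needs_offline=False, max_latency_ms=None):
    """Map daily token volume and latency needs to a starting deployment model.

    Thresholds follow the decision matrix in this guide; treat the output
    as a starting point, not a verdict.
    """
    # Offline or hard sub-50ms latency requirements force the edge.
    if needs_offline or (max_latency_ms is not None and max_latency_ms < 50):
        return "edge"
    if tokens_per_day < 100_000:
        return "cloud-api"
    if tokens_per_day <= 10_000_000:
        return "cloud-api or self-hosted (depends on cost vs. quality)"
    return "self-hosted"
```

At very low volume the answer is almost always the cloud API; the interesting band is the middle, where the next three factors break the tie.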
Factor 2: Data Sensitivity
- Public data, no PII? -> Any option works
- Contains PII but cloud-processable? -> Cloud API with a BAA (Business Associate Agreement), or self-hosted in your VPC
- Highly regulated (healthcare, finance, government)? -> Self-hosted on-premises, or edge deployment
- Data must never leave the device (user requirement)? -> Edge only
Factor 3: Model Capability Needs
- Need state-of-the-art reasoning (complex analysis, creative writing)? -> Cloud API (GPT-4, Claude Opus)
- Need good general capability (chat, summarization, code)? -> Self-hosted 70B-class model (e.g., Llama 3.1 70B or Mixtral), or cloud Llama/Mistral endpoints (cheaper)
- Need basic capability (classification, extraction, simple Q&A)? -> Self-hosted 7B model, or edge 3B model
- Need domain-specific expertise? -> Fine-tuned self-hosted model + RAG
Factor 4: Team Capability
- No ML/DevOps team? -> Cloud API (let the provider handle everything)
- Small DevOps team (1-3 people)? -> Cloud API, or managed self-hosted (Anyscale, Together AI)
- Dedicated ML infrastructure team? -> Self-hosted (full control, maximum optimization)
Hybrid Architectures: The Real Answer
Most production systems in 2026 use a hybrid approach:
Pattern 1: Edge + Cloud Fallback
- Simple queries handled on-device (fast, free, private)
- Complex queries routed to cloud API (better quality)
- Example: Apple Intelligence routes between on-device and Private Cloud Compute
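One common way to implement the edge-first pattern is confidence-based escalation: answer locally, and only call the cloud when the on-device model isn't sure. This is a minimal sketch under an assumed interface where the local model returns a `(text, confidence)` pair -- real on-device runtimes differ, and the toy stand-in models exist only to make the flow concrete:

```python
def route_with_fallback(query, local_model, cloud_model, threshold=0.7):
    """Answer on-device when confident; otherwise escalate to the cloud API."""
    text, confidence = local_model(query)  # assumed (text, score) interface
    if confidence >= threshold:
        return text          # fast, free, private
    return cloud_model(query)  # better quality, costs tokens

# Toy stand-ins for real model calls (a real local model would report
# its own confidence; here short queries are treated as "easy"):
local = lambda q: ("local answer", 0.9 if len(q) < 40 else 0.3)
cloud = lambda q: "cloud answer"
```

The threshold is the tuning knob: raise it and more traffic escalates (higher quality, higher cost); lower it and more stays on-device.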
Pattern 2: Self-Hosted + Cloud Overflow
- Self-hosted handles baseline traffic (predictable cost)
- Cloud API handles traffic spikes (elastic scaling)
- Example: Run vLLM on reserved instances, overflow to OpenAI during peaks
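The overflow pattern reduces to a routing decision per request: fill the reserved fleet first, spill the excess to the API. A minimal sketch, assuming a hypothetical capacity figure and probabilistic shedding of only the excess load:

```python
import random

SELF_HOSTED_CAPACITY_RPS = 40  # hypothetical: what the reserved vLLM fleet sustains

def route_request(current_rps):
    """Send baseline traffic to the self-hosted fleet; spill spikes to a cloud API."""
    if current_rps <= SELF_HOSTED_CAPACITY_RPS:
        return "self-hosted"
    # Shed only the excess: the overflow probability grows with the spike size,
    # so the reserved fleet stays saturated while the API absorbs the rest.
    overflow_fraction = 1 - SELF_HOSTED_CAPACITY_RPS / current_rps
    return "cloud-api" if random.random() < overflow_fraction else "self-hosted"
```

In production the same idea usually lives in a gateway or load balancer (queue depth or GPU utilization instead of a fixed RPS number), but the economics are identical: pay fixed costs for the baseline, variable costs only for the spike.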
Pattern 3: Tiered Model Routing
- Small model classifies the query complexity
- Simple queries -> small fast model (self-hosted 7B)
- Complex queries -> large model (self-hosted 70B or cloud API)
- Example: Save 70% on inference costs by routing easy questions to cheap models
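Tiered routing can be as simple as a classifier in front of two model tiers. The sketch below uses a toy heuristic (query length plus a few analytical keywords) as a stand-in for the small classifier model; the tier names are illustrative, not real endpoints:

```python
def classify_complexity(query):
    """Toy stand-in for a small classifier model.

    Long or analytical-sounding queries count as complex; a real system
    would use a cheap fine-tuned classifier or the 7B model itself.
    """
    hard_markers = ("why", "compare", "analyze", "explain")
    if len(query.split()) > 30 or any(m in query.lower() for m in hard_markers):
        return "complex"
    return "simple"

def pick_model(query):
    """Route simple queries to the cheap tier, complex ones to the expensive tier."""
    if classify_complexity(query) == "complex":
        return "70b-or-cloud"
    return "self-hosted-7b"
```

The savings come entirely from the traffic mix: if 90% of queries are genuinely simple, 90% of your inference runs on the cheap tier.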
Pattern 4: RAG + Fine-Tuned Self-Hosted
- Fine-tuned model for domain tone and knowledge
- RAG pipeline for current, specific facts
- Self-hosted for privacy and cost control
- Example: Legal AI that speaks like a lawyer (fine-tuned) and cites current case law (RAG)
Cost Comparison: A Real Scenario
Scenario: AI customer support chatbot. 50,000 conversations/day, average 2,000 tokens each. Total: 100 million tokens/day.
| Approach | Monthly Cost | Quality | Latency | Privacy |
|----------|-------------|---------|---------|---------|
| GPT-4 API | ~$45,000 | Best | 1-3s | Cloud processed |
| GPT-4o Mini API | ~$4,500 | Very Good | 0.5-1s | Cloud processed |
| Self-hosted 70B (8x A100) | ~$17,600 | Good | 0.3-1s | Full control |
| Self-hosted 7B (2x A100) | ~$4,400 | Adequate | 0.1-0.5s | Full control |
| Hybrid (7B + GPT-4 fallback) | ~$6,000 | Good+ | 0.1-1s | Mostly private |
The hybrid approach (Pattern 3) routes 90% of conversations to the cheap 7B model and only escalates complex issues to GPT-4. Best of both worlds.
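The hybrid economics are simple to model: a fixed GPU bill plus a per-token API bill for whatever fraction escalates. The rates below are assumptions for illustration (a blended ~$15 per million tokens for the premium tier, the table's ~$4,400 for the 2x A100 fleet), not quoted prices:

```python
def hybrid_monthly_cost(tokens_per_day, premium_fraction,
                        gpu_monthly, api_cost_per_m_tokens, days=30):
    """Fixed self-hosted cost plus per-token API cost for escalated traffic."""
    api_m_tokens = tokens_per_day * premium_fraction / 1e6 * days
    return gpu_monthly + api_m_tokens * api_cost_per_m_tokens

# 100M tokens/day, 10% escalated to a premium API at an assumed $15/M tokens,
# on top of a ~$4,400/month self-hosted fleet:
cost = hybrid_monthly_cost(100e6, 0.10, 4_400, 15.0)
```

Your actual number depends heavily on the escalation rate: the router's accuracy at keeping easy traffic on the cheap tier is worth real money every month.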
The Five-Step Deployment Playbook
1. Prototype with Cloud API -- Validate the use case. Don't optimize prematurely.
2. Measure actual usage -- Track token volume, query complexity distribution, latency requirements, and privacy constraints. Data beats assumptions.
3. Identify the crossover point -- At what volume does self-hosting become cheaper than API? For most teams, it's around 5-10 million tokens/day.
4. Start self-hosting incrementally -- Run a self-hosted model alongside the API. Route easy queries to self-hosted, keep complex ones on the API. Gradually shift traffic.
5. Optimize continuously -- Quantize models, enable speculative decoding, tune batch sizes, implement caching. Each optimization compounds.
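Step 3's crossover point is a one-line calculation: the daily volume at which a fixed GPU bill equals per-token API spend. The inputs below are illustrative (the ~$4,400/month 2x A100 figure from the cost table, an assumed ~$15 per million tokens API rate):

```python
def crossover_tokens_per_day(gpu_monthly, api_cost_per_m_tokens, days=30):
    """Daily token volume above which a fixed GPU bill beats per-token API pricing.

    Solve: gpu_monthly = (tokens/day / 1e6) * days * api_cost_per_m_tokens
    """
    return gpu_monthly / (api_cost_per_m_tokens * days) * 1e6

# With a $4,400/month fleet and an assumed $15/M-token API rate,
# the break-even sits just under 10M tokens/day:
breakeven = crossover_tokens_per_day(4_400, 15.0)
```

That lands in the 5-10M tokens/day range cited above; a cheaper API tier pushes the crossover higher, a cheaper GPU reservation pulls it lower.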
Key Takeaway
There's no universally "right" deployment strategy. The right choice depends on your specific combination of volume, sensitivity, quality needs, and team capability.
But here's the trend: the center of gravity is moving toward self-hosted and edge. Open-source models are closing the quality gap with cloud APIs. Quantization makes them run on commodity hardware. And privacy regulations are pushing data processing closer to the user.
Start with the simplest option that works. Optimize when the data tells you to.
*This completes our six-part series on AI model optimization and deployment. From quantization basics to production serving to deployment strategy -- you now have the full picture for running AI in 2026.*