Phase 17: Infrastructure and Production
This phase contains 28 lessons.
Original course source: AI Engineering from Scratch (MIT License)
- Managed LLM Platforms — Bedrock, Vertex AI, Azure OpenAI
- Inference Platform Economics — Fireworks, Together, Baseten, Modal, Replicate, Anyscale
- GPU Autoscaling on Kubernetes — Karpenter, KAI Scheduler, Gang Scheduling
- vLLM Serving Internals: PagedAttention, Continuous Batching, Chunked Prefill
- EAGLE-3 Speculative Decoding in Production
- SGLang and RadixAttention for Prefix-Heavy Workloads
- TensorRT-LLM on Blackwell with FP8 and NVFP4
- Inference Metrics — TTFT, TPOT, ITL, Goodput, P99
- Production Quantization — AWQ, GPTQ, GGUF K-quants, FP8, MXFP4/NVFP4
- Cold Start Mitigation for Serverless LLMs
- Multi-Region LLM Serving and KV Cache Locality
- Edge Inference — Apple Neural Engine, Qualcomm Hexagon, WebGPU/WebLLM, Jetson
- LLM Observability Stack Selection
- Prompt Caching and Semantic Caching Economics
- Batch APIs — the 50% Discount as Industry Standard
- Model Routing as a Cost-Reduction Primitive
- Disaggregated Prefill/Decode — NVIDIA Dynamo and llm-d
- vLLM Production Stack with LMCache KV Offloading
- AI Gateways — LiteLLM, Portkey, Kong AI Gateway, Bifrost
- Shadow Traffic, Canary Rollout, and Progressive Deployment for LLMs
- A/B Testing LLM Features — GrowthBook, Statsig, and the Vibes Problem
- Load Testing LLM APIs — Why k6 and Locust Lie
- SRE for AI — Multi-Agent Incident Response, Runbooks, Predictive Detection
- Chaos Engineering for LLM Production
- Security — Secrets, API Key Rotation, Audit Logs, Guardrails
- Compliance — SOC 2, HIPAA, GDPR, PCI-DSS, EU AI Act, ISO 42001
- FinOps for LLMs — Unit Economics and Multi-Tenant Attribution
- Self-Hosted Serving Selection — llama.cpp, Ollama, TGI, vLLM, SGLang