Designing a Neural Network Inference Service
Build a production ML inference service — model versioning, batching, GPU optimization, latency-throughput tradeoffs, and serving frameworks.
Deploying neural networks in production requires balancing latency, throughput, cost, and model freshness.
Inference Workload Characteristics
Online (synchronous):
User waits for response
Latency SLA: <100ms
Batch size: 1-32
Use case: Recommendations, fraud detection, chatbots
Batch (asynchronous):
Process large datasets
No per-request latency SLA; optimize for throughput and cost
Batch size: 1000-10000+
Use case: Nightly scoring, image processing
Model Optimization Pipeline
Trained model (PyTorch/TensorFlow)
↓
Export to ONNX (framework-agnostic)
↓
Quantization (FP32 → INT8 or FP16)
→ 4x smaller, 2-4x faster, ~1% accuracy loss
↓
TensorRT / OpenVINO compilation
→ Hardware-specific optimizations
↓
Serving format: Triton, TorchServe
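A minimal sketch of the first two pipeline steps, assuming a small stand-in PyTorch model and onnxruntime's post-training dynamic quantization; the architecture, input shape, tensor names, and file names are placeholders:

import torch
from onnxruntime.quantization import quantize_dynamic, QuantType

# Stand-in for a real trained model (placeholder architecture).
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 8, kernel_size=3, stride=2),
    torch.nn.AdaptiveAvgPool2d(1),
    torch.nn.Flatten(),
    torch.nn.Linear(8, 10),
).eval()

# Export to ONNX with a dynamic batch dimension so the serving
# layer is free to batch requests.
torch.onnx.export(
    model,
    torch.randn(1, 3, 224, 224),
    "model_fp32.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
)

# Post-training dynamic quantization of weights to INT8.
quantize_dynamic("model_fp32.onnx", "model_int8.onnx", weight_type=QuantType.QInt8)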
Dynamic Batching
Problem:
Single GPU inference for 1 request: 10ms
Single GPU inference for 32 requests: 12ms
→ ~27x the throughput for only 20% more latency!
Dynamic batching:
Collect requests for max 5ms
Batch together → single GPU forward pass
Return responses to each requester
Result: Much higher GPU utilization
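A toy sketch of the mechanism (not Triton's actual implementation): requests land on a queue, the batcher waits up to 5 ms or until 32 requests have arrived, runs a single forward pass, and fans results back to the individual callers. The stand-in model, tensor shapes, and window length are assumptions.

import asyncio
import torch

MAX_BATCH = 32
MAX_WAIT_S = 0.005                       # 5 ms queue delay
queue: asyncio.Queue = asyncio.Queue()
model = torch.nn.Linear(16, 4).eval()    # stand-in for the real model

async def infer(x: torch.Tensor) -> torch.Tensor:
    # Per-request entry point: enqueue the input and await this request's row.
    fut = asyncio.get_running_loop().create_future()
    await queue.put((x, fut))
    return await fut

async def batcher() -> None:
    loop = asyncio.get_running_loop()
    while True:
        items = [await queue.get()]                  # block until the first request
        deadline = loop.time() + MAX_WAIT_S
        while len(items) < MAX_BATCH:
            remaining = deadline - loop.time()
            if remaining <= 0:
                break
            try:
                items.append(await asyncio.wait_for(queue.get(), remaining))
            except asyncio.TimeoutError:
                break
        xs, futs = zip(*items)
        with torch.no_grad():
            out = model(torch.stack(xs))             # one forward pass for the whole batch
        for fut, row in zip(futs, out):
            fut.set_result(row)

async def main() -> None:
    asyncio.get_running_loop().create_task(batcher())
    results = await asyncio.gather(*(infer(torch.randn(16)) for _ in range(100)))
    print(len(results))                              # 100 responses from a few batched passes

asyncio.run(main())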
Triton Inference Server
Triton features:
Dynamic batching
Model versioning
Concurrent model execution
GPU + CPU backends
ONNX, TensorRT, PyTorch, TensorFlow
Model config (config.pbtxt):
max_batch_size: 32
dynamic_batching {
  preferred_batch_size: [ 8, 16 ]
  max_queue_delay_microseconds: 5000
}
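Querying a model served behind that config, using the tritonclient HTTP library; the model name ("recommender"), the tensor names "input"/"output", the feature width, and the dtype are illustrative assumptions that must match the actual config.pbtxt:

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# One request; Triton's dynamic batcher merges concurrent requests
# like this into larger GPU batches on the server side.
features = np.random.rand(1, 128).astype(np.float32)
inp = httpclient.InferInput("input", list(features.shape), "FP32")
inp.set_data_from_numpy(features)

result = client.infer(model_name="recommender", inputs=[inp])
print(result.as_numpy("output").shape)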
Model Versioning
Model registry:
v1: accuracy 89%, latency 45ms
v2: accuracy 91%, latency 52ms ← current
v3: accuracy 93%, latency 85ms ← staging
Deployment:
Shadow mode: v3 receives a mirrored 5% of traffic; predictions are logged, never returned to users
A/B test: Compare v3 vs v2 business metrics
Canary: 10% → 50% → 100%
Rollback: Instant switch to v2 if issues
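A minimal sketch of the canary split at the routing layer, assuming the gateway picks a model version per user; the weights mirror the 10% stage above, sticky hashing keeps a user on one version, and rollback is simply setting the canary weight to zero:

import hashlib

WEIGHTS = {"v2": 0.90, "v3": 0.10}   # bump v3: 0.10 -> 0.50 -> 1.00 as metrics hold

def pick_version(user_id: str) -> str:
    # Hash the user id into [0, 1) so routing is deterministic per user.
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 10_000 / 10_000
    cumulative = 0.0
    for version, weight in WEIGHTS.items():
        cumulative += weight
        if bucket < cumulative:
            return version
    return "v2"                      # safe default / instant rollback target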
GPU Resource Management
GPU sharing:
MIG (Multi-Instance GPU): Partition an A100 into up to 7 isolated GPU instances
Time-slicing: Multiple models share GPU time
Auto-scaling:
Scale pods based on GPU utilization (target 70%)
Scale-to-zero for infrequent models (CPU fallback)
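A rough sketch of the scaling arithmetic, assuming GPU utilization is read via NVML (the pynvml package) and fed into the usual proportional-scaling formula against the 70% target; in practice a metrics pipeline (e.g. a DCGM exporter plus an HPA) handles this, so the code is only illustrative:

import math
import pynvml

TARGET_UTIL = 70                      # percent, from the target above

def desired_replicas(current_replicas: int) -> int:
    # Average utilization across visible GPUs -> desired pod count.
    pynvml.nvmlInit()
    try:
        count = pynvml.nvmlDeviceGetCount()
        utils = [
            pynvml.nvmlDeviceGetUtilizationRates(
                pynvml.nvmlDeviceGetHandleByIndex(i)
            ).gpu
            for i in range(count)
        ]
        avg_util = sum(utils) / max(len(utils), 1)
    finally:
        pynvml.nvmlShutdown()
    # Proportional scaling, same shape as the Kubernetes HPA formula.
    return max(1, math.ceil(current_replicas * avg_util / TARGET_UTIL))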
Latency Budget
End-to-end 100ms budget:
Network: 5ms
Pre-process: 10ms
Queue wait: 5ms
Inference: 50ms ← biggest cost
Post-process: 10ms
Response: 20ms
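To see where the budget actually goes, each stage can be timed explicitly; a small sketch with a context manager, where the stages mirror the line items above and the stage bodies are placeholder stubs:

import time
from contextlib import contextmanager

timings: dict[str, float] = {}

@contextmanager
def stage(name: str):
    # Record wall-clock milliseconds spent in one stage of the request.
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = (time.perf_counter() - start) * 1000

def preprocess(payload): return payload                  # placeholder stub
def run_model(batch): time.sleep(0.05); return batch     # placeholder: ~50 ms "inference"
def postprocess(raw): return raw                         # placeholder stub

def handle_request(payload):
    with stage("pre_process"):
        batch = preprocess(payload)
    with stage("inference"):
        raw = run_model(batch)
    with stage("post_process"):
        result = postprocess(raw)
    return result, timings

print(handle_request({"features": [1, 2, 3]})[1])        # per-stage milliseconds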
Conclusion
Production inference requires model optimization (quantization + TensorRT), dynamic batching for GPU efficiency, and progressive rollout for safe model updates. Triton has become the de facto standard for GPU serving.