
Designing a Neural Network Inference Service

Build a production ML inference service — model versioning, batching, GPU optimization, latency-throughput tradeoffs, and serving frameworks.

Deploying neural networks in production requires balancing latency, throughput, cost, and model freshness.

Inference Workload Characteristics

Online (synchronous):
  User waits for response
  Latency SLA: <100ms
  Batch size: 1-32
  Use case: Recommendations, fraud detection, chatbots

Batch (asynchronous):
  Process large datasets
  No latency requirement
  Batch size: 1000-10000+
  Use case: Nightly scoring, image processing

Model Optimization Pipeline

Trained model (PyTorch/TensorFlow)
    ↓
Export to ONNX (framework-agnostic)
    ↓
Quantization (FP32 → INT8 or FP16)
  → 4x smaller, 2-4x faster, ~1% accuracy loss
    ↓
TensorRT / OpenVINO compilation
  → Hardware-specific optimizations
    ↓
Serving format: Triton, TorchServe
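The quantization step above can be sketched in pure Python. This is an illustrative toy, not a real toolchain (production pipelines use ONNX Runtime, TensorRT, or OpenVINO calibration): it shows symmetric per-tensor INT8 quantization, where each 4-byte FP32 weight becomes 1 byte plus a shared scale, which is where the "4x smaller" figure comes from.

```python
def quantize_int8(weights):
    """Map FP32 weights to INT8 with a single symmetric scale factor."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate FP32 values from INT8 codes + scale."""
    return [qi * scale for qi in q]

weights = [0.82, -1.27, 0.03, 0.54]
q, scale = quantize_int8(weights)
approx = dequantize(q, scale)
# Reconstruction error per weight is bounded by scale/2, which is why
# well-calibrated INT8 typically costs only ~1% accuracy.
```

Per-channel scales and calibration on representative data reduce the error further; that is what the real tools automate.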

Dynamic Batching

Problem:
  Single GPU inference for 1 request: 10ms
  Single GPU inference for 32 requests: 12ms
  → 32x throughput for 20% more latency!

Dynamic batching:
  Collect requests for max 5ms
  Batch together → single GPU forward pass
  Return responses to each requester

Result: Much higher GPU utilization
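The collection step can be sketched with a stdlib queue: block for the first request, then keep draining until the batch is full or the window (here 5ms, matching the numbers above) expires. Names and parameters are illustrative, not any framework's API.

```python
import queue
import time

def collect_batch(requests, max_batch=32, max_delay_s=0.005):
    """Drain up to max_batch requests from a queue.Queue, waiting at
    most max_delay_s after the first arrival for the batch to fill."""
    batch = [requests.get()]              # block until at least one request
    deadline = time.monotonic() + max_delay_s
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break                         # window expired: ship a partial batch
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break
    return batch
```

The batch then goes through one GPU forward pass, and each response is routed back to its original requester.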

Triton Inference Server

Triton features:
  Dynamic batching
  Model versioning
  Concurrent model execution
  GPU + CPU backends
  ONNX, TensorRT, PyTorch, TensorFlow

Model config (config.pbtxt):
max_batch_size: 32
dynamic_batching {
  preferred_batch_size: [ 8, 16 ]
  max_queue_delay_microseconds: 5000
}

Model Versioning

Model registry:
  v1: accuracy 89%, latency 45ms
  v2: accuracy 91%, latency 52ms  ← current
  v3: accuracy 93%, latency 85ms  ← staging

Deployment:
  Shadow mode: v3 receives a mirrored 5% of traffic; predictions are logged but never returned to users
  A/B test: Compare v3 vs v2 business metrics
  Canary: 10% → 50% → 100%
  Rollback: Instant switch to v2 if issues
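The canary step above needs sticky routing: each user should see one model version consistently as the rollout ramps. A minimal sketch (hypothetical names, assuming v2 is stable and v3 is the canary) hashes the user id into a percentage bucket:

```python
import hashlib

def pick_version(user_id, canary_pct, canary="v3", stable="v2"):
    """Deterministically route canary_pct% of users to the canary model.
    Hashing the user id pins each user to one version, so ramping
    10% -> 50% -> 100% only ever moves users toward the canary, and
    rollback is just setting canary_pct back to 0."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return canary if bucket < canary_pct else stable
```

Because routing is a pure function of (user id, percentage), no per-user state needs to be stored to keep assignments stable.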

GPU Resource Management

GPU sharing:
  MIG (Multi-Instance GPU): Partition an A100 into up to 7 isolated instances
  Time-slicing: Multiple models share GPU time

Auto-scaling:
  Scale pods based on GPU utilization (target 70%)
  Scale-to-zero for infrequent models (CPU fallback)
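The utilization-target rule above has the same shape as the Kubernetes HPA formula: scale the replica count so the observed load would land on the target utilization. A minimal sketch (illustrative, not the HPA implementation):

```python
import math

def desired_replicas(current, observed_util_pct, target_util_pct=70,
                     min_replicas=0):
    """Scale replicas so observed GPU utilization moves toward target.
    min_replicas=0 permits scale-to-zero for infrequently used models."""
    if observed_util_pct == 0:
        return min_replicas           # idle: release the GPUs entirely
    return max(min_replicas,
               math.ceil(current * observed_util_pct / target_util_pct))
```

For example, 4 pods running at 90% utilization against a 70% target scale out to 6 pods; an idle model scales to zero and falls back to CPU on its next request.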

Latency Budget

End-to-end 100ms budget:
  Network:      5ms
  Pre-process:  10ms
  Queue wait:   5ms
  Inference:    50ms  ← biggest cost
  Post-process: 10ms
  Response:     20ms
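The budget above can be checked with simple arithmetic, which also makes the point that inference dominates:

```python
# Latency budget components from the breakdown above, in milliseconds.
budget = {
    "network": 5, "pre_process": 10, "queue_wait": 5,
    "inference": 50, "post_process": 10, "response": 20,
}
total = sum(budget.values())          # exactly fills the 100ms SLA
# Inference is half the budget, so model-level optimizations
# (quantization, TensorRT) buy far more headroom than shaving
# milliseconds off networking or serialization.
```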

Conclusion

Production inference requires model optimization (quantization plus a compiler such as TensorRT), dynamic batching for GPU efficiency, and progressive rollout for safe model updates. Triton has become a de facto standard for GPU serving.
