Edge Inference Without the Cloud Round-Trip

Running inference on a camera chip is a different discipline from running it in a data center. You have no GPU, limited DRAM, and a hard real-time deadline imposed by the video frame rate.

The target

Sub-100ms end-to-end latency on a Panasonic i-PRO S-series camera. Pre-processing, inference, and post-processing together have to fit inside that 100ms. To keep pace with a 30fps stream, the model alone has to finish within one frame period, roughly 33ms.
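
The numbers above are simple arithmetic: one frame period at 30fps is 1000 / 30 ms, and the stages have to sum to less than the end-to-end target. A minimal sketch of that budget check follows. The per-stage timings are hypothetical values chosen for illustration, not measurements from the camera.

```python
def frame_period_ms(fps: float) -> float:
    """Time between consecutive frames, in milliseconds."""
    return 1000.0 / fps


def fits_budget(stage_ms: dict[str, float], budget_ms: float) -> bool:
    """True if the summed stage latencies stay under the budget."""
    return sum(stage_ms.values()) < budget_ms


# Hypothetical stage timings, for illustration only.
stages = {"preprocess": 20.0, "inference": 30.0, "postprocess": 10.0}

print(round(frame_period_ms(30), 1))               # 33.3
print(stages["inference"] < frame_period_ms(30))   # True
print(fits_budget(stages, budget_ms=100.0))        # True
```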

Getting there

INT8 quantization was non-negotiable. FP32 models were 4x too slow. The accuracy drop was acceptable for anomaly detection use cases where false positives are cheap to review.
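
To make the accuracy trade-off concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. It only shows the underlying arithmetic. The actual toolchain, calibration data, and quantization scheme used on the camera are not described here, and the weight values are made up.

```python
def quantize_int8(values: list[float]) -> tuple[list[int], float]:
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate float values from the INT8 codes."""
    return [q * scale for q in quantized]


weights = [0.02, -0.5, 0.31, 1.27]  # made-up values
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

# Rounding error per value is bounded by half a quantization step.
worst_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, worst_error <= scale / 2 + 1e-12)
```

The error bound of half a step is why the accuracy drop is usually small, and also why a tensor with a few large outliers quantizes badly: the outliers inflate the scale for every other value.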

Custom ONNX export. The vendor SDK expected a specific input format. Exporting directly from PyTorch to ONNX with static shapes cut 15ms off preprocessing.

Frame skipping. Not every frame needs inference. A lightweight motion detector triggers the model only on activity, cutting average compute by ~70%.
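
A minimal sketch of that gating, assuming frames arrive as flat grayscale byte buffers and using plain frame differencing as the motion detector. The detector actually used is not specified here; the `threshold` value and the function names are hypothetical.

```python
from typing import Callable, Iterable, Iterator, Optional


def mean_abs_diff(prev: bytes, cur: bytes) -> float:
    """Average per-pixel absolute difference between two grayscale frames."""
    return sum(abs(a - b) for a, b in zip(prev, cur)) / len(cur)


def gated_inference(
    frames: Iterable[bytes],
    run_model: Callable[[bytes], object],
    threshold: float = 8.0,
) -> Iterator[Optional[object]]:
    """Yield a model result for frames with motion, None for skipped frames."""
    prev = None
    for frame in frames:
        # The first frame has no reference, so it always runs.
        moving = prev is None or mean_abs_diff(prev, frame) > threshold
        yield run_model(frame) if moving else None
        prev = frame


still = bytes([10] * 16)
moved = bytes([10] * 8 + [200] * 8)
results = list(gated_inference([still, still, moved, moved], run_model=len))
print(results)  # [16, None, 16, None]
```

Comparing against the previous frame is the cheapest option, but a scene that changes slowly can stay under the threshold indefinitely. Comparing against a periodically refreshed reference frame is one way around that.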

The real lesson

Edge deployment is a packaging problem as much as a modeling problem. Getting the model right is half the job; getting it onto the device without bitrot, version mismatches, or silent quantization errors is the other half.
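
One way to catch version mismatches before they reach the device is to ship a small manifest with each model file and verify it at load time. The sketch below assumes a hypothetical manifest with `sha256` and `runtime_version` fields. It is not a description of any particular deployment pipeline.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex digest of the model file contents."""
    return hashlib.sha256(data).hexdigest()


def check_artifact(model_bytes: bytes, manifest: dict, runtime_version: str) -> list[str]:
    """Return a list of problems; an empty list means the artifact matches its manifest."""
    problems = []
    if sha256_hex(model_bytes) != manifest["sha256"]:
        problems.append("model file does not match the recorded hash")
    if runtime_version != manifest["runtime_version"]:
        problems.append(
            f"runtime {runtime_version} != expected {manifest['runtime_version']}"
        )
    return problems


model = b"placeholder model bytes"  # stands in for the real file contents
manifest = {"sha256": sha256_hex(model), "runtime_version": "1.2.0"}

print(check_artifact(model, manifest, "1.2.0"))              # []
print(len(check_artifact(model + b"x", manifest, "1.3.0")))  # 2
```

A hash check does not catch silent quantization errors by itself. For those, the same manifest could carry a few reference inputs with their expected outputs, compared on the device within a tolerance.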