Friday, July 12 • 11:50am - 12:10pm

The popularity of Convolutional Neural Network (CNN) models and the ubiquity of CPUs imply that better performance of CNN model inference on CPUs can deliver significant gains to a large number of users. To improve the performance of CNN inference on CPUs, current approaches such as MXNet and Intel OpenVINO usually treat the model as a graph and use high-performance libraries such as Intel MKL-DNN to implement the operations of the graph. While the off-the-shelf libraries achieve reasonable performance on individual operations, this solution makes it inflexible to conduct optimizations at the graph level, because the local operation-level optimizations are predefined. It is therefore restrictive and misses the opportunity to optimize the end-to-end inference pipeline as a whole. This paper presents NeoCPU, a comprehensive approach to CNN model inference on CPUs that employs a full-stack and systematic scheme of optimizations. NeoCPU optimizes the operations as templates without relying on third-party libraries, which enables further performance improvement via joint operation- and graph-level optimization. Experiments show that NeoCPU achieves up to 3.45× lower latency for CNN model inference than the current state-of-the-art implementations on various kinds of popular CPUs.

Speakers
YL

Amazon
YW

Amazon
RY

Amazon
ML

Amazon
VS

Amazon

## Yida Wang

Amazon

Friday July 12, 2019 11:50am - 12:10pm PDT
USENIX ATC Track II: Grand Ballroom VII–IX