Scheduling Computation Graphs of Deep Learning Models on Manycore CPUs
For a deep learning model, efficient execution of its computation graph is key to achieving high performance. Previous work has focused on improving the performance of individual nodes of the computation graph while ignoring the parallelization of the graph as a whole. However, we observe that running multiple operations simultaneously without interference is critical to executing small, parallelizable operations efficiently. Attempts to execute the computation graph in parallel in existing deep learning frameworks usually incur heavy resource contention among concurrent operations, leading to inferior performance on manycore CPUs. To address these issues, we propose Graphi, a generic and high-performance execution engine that efficiently executes a computation graph in parallel on manycore CPUs. Specifically, Graphi minimizes interference on both software and hardware resources, discovers the best parallel setting with a profiler, and further optimizes graph execution with critical-path-first scheduling. Our experiments show that parallel execution consistently outperforms sequential execution. Training four different neural networks with Graphi is 2.1x to 9.5x faster than with TensorFlow on a 68-core Intel Xeon Phi processor.
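To make the idea of critical-path-first scheduling concrete, the following is a minimal sketch of generic list scheduling over a computation graph. It is not Graphi's implementation. It assumes that each operation occupies exactly one worker, that per-operation costs are already known (for example, from profiling), and that the graph is a small acyclic adjacency map. The names `critical_path_lengths`, `schedule`, `graph`, `cost`, and `num_workers` are hypothetical and chosen only for illustration.

```python
import heapq


def critical_path_lengths(graph, cost):
    """Return, for each node, the longest cost-weighted path from it to a sink.

    graph: dict mapping node -> list of successor nodes (assumed acyclic).
    cost:  dict mapping every node -> estimated execution time (assumed known).
    Recursive, so only suitable for small illustrative graphs.
    """
    memo = {}

    def visit(node):
        if node not in memo:
            tail = max((visit(s) for s in graph.get(node, ())), default=0)
            memo[node] = cost[node] + tail
        return memo[node]

    for node in cost:
        visit(node)
    return memo


def schedule(graph, cost, num_workers):
    """Simulate list scheduling that prefers the node with the longest remaining path.

    Returns (start_times, makespan) for the simulated run.
    """
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")

    rank = critical_path_lengths(graph, cost)

    # Count unfinished predecessors of every node.
    indegree = {node: 0 for node in cost}
    for succs in graph.values():
        for s in succs:
            indegree[s] += 1

    # Max-heap on rank (negated, since heapq is a min-heap).
    ready = [(-rank[n], n) for n in cost if indegree[n] == 0]
    heapq.heapify(ready)

    running = []  # min-heap of (finish_time, node)
    start_times = {}
    now = 0

    while ready or running:
        # Fill idle workers with the highest-ranked ready operations.
        while ready and len(running) < num_workers:
            _, node = heapq.heappop(ready)
            start_times[node] = now
            heapq.heappush(running, (now + cost[node], node))

        # Advance simulated time to the next completion and release successors.
        now, done = heapq.heappop(running)
        for s in graph.get(done, ()):
            indegree[s] -= 1
            if indegree[s] == 0:
                heapq.heappush(ready, (-rank[s], s))

    return start_times, now


# Hypothetical five-operation graph; costs are made-up time units.
example_graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "e": []}
example_cost = {"a": 1, "b": 5, "c": 1, "d": 1, "e": 2}
start_times, makespan = schedule(example_graph, example_cost, num_workers=2)
print(start_times, makespan)
```

In this sketch, the priority of a ready operation is the length of the longest path from it to the end of the graph, so operations on the critical path are started as early as possible and shorter side branches fill the remaining workers. A real engine would also have to decide how many threads each operation receives and how to avoid contention between concurrent operations, which this simulation deliberately leaves out.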