TY - JOUR
T1 - Accelerating Deep Learning Inference via Model Parallelism and Partial Computation Offloading
AU - Zhou, Huan
AU - Li, Mingze
AU - Wang, Ning
AU - Min, Geyong
AU - Wu, Jie
N1 - Funding Information:
This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grants 62172255 and 61872221, and in part by the EU Horizon 2020 Research and Innovation Programme under Marie Sklodowska-Curie Grant 101008297.
Publisher Copyright:
© 1990-2012 IEEE.
PY - 2023/2/1
Y1 - 2023/2/1
AB - With the rapid development of the Internet of Things (IoT) and the explosive advance of deep learning, there is an urgent need to enable deep learning inference on IoT devices in Mobile Edge Computing (MEC). To address the limited computation capability of IoT devices in processing complex Deep Neural Networks (DNNs), computation offloading has been proposed as a promising approach. Recently, partial computation offloading has been developed to dynamically adjust the task assignment strategy under different channel conditions for better performance. In this paper, we take advantage of intrinsic DNN computation characteristics and propose a novel Fused-Layer-based (FL-based) DNN model parallelism method to accelerate inference. The key idea is that a DNN layer can be converted into several smaller layers to increase the flexibility of partial computation offloading and thus enable better offloading solutions. However, there is a trade-off between computation offloading flexibility and model parallelism overhead. We therefore investigate the optimal DNN model parallelism and the corresponding scheduling and offloading strategies in partial computation offloading. In particular, we propose a Particle Swarm Optimization with Minimizing Waiting (PSOMW) method, which explores and updates the FL strategy, path scheduling strategy, and path offloading strategy to reduce time complexity and avoid invalid solutions. Finally, we validate the effectiveness of the proposed method on commonly used DNNs. The results show that the proposed method reduces DNN inference time by an average of 12.75 times compared to the legacy No FL (NFL) algorithm, and comes within 0.04% of the optimal solution achieved by the Brute Force (BF) algorithm.
UR - http://www.scopus.com/inward/record.url?scp=85142814359&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85142814359&partnerID=8YFLogxK
U2 - 10.1109/TPDS.2022.3222509
DO - 10.1109/TPDS.2022.3222509
M3 - Article
AN - SCOPUS:85142814359
SN - 1045-9219
VL - 34
SP - 475
EP - 488
JO - IEEE Transactions on Parallel and Distributed Systems
JF - IEEE Transactions on Parallel and Distributed Systems
IS - 2
ER -