TY - GEN
T1 - Reducing makespans of DAG scheduling through interleaving overlapping resource utilization
AU - Duan, Yubin
AU - Wang, Ning
AU - Jie, W.
N1 - Publisher Copyright:
© 2020 IEEE.
PY - 2020/12
Y1 - 2020/12
N2 - As data center clusters need to process quintillion bytes of data per day, it becomes a critical problem that efficiently scheduling jobs to improve resource utilization. However, the data analysis job usually contains multiple stages with dependent relationships, which brings challenges for scheduling. Those stages are modeled as Directed Acyclic Graphs (DAGs) and the general DAG scheduling problem is NP-hard. In this paper, we notice that in some parallel computing frameworks such as Spark, the execution of each stage could be divided into multiple phases that use different resources. We observe that interleaving different resources in a pipelined manner could improve resource utilization. Based on this observation, we propose to minimize the job makespan by exploiting resource pipeline. We first theoretically analyze the scheduling for perfectly parallel stages. In this case, our scheduling problem is equivalent to a DAG shop problem which is NP-hard. A contention-free scheduler is proposed and its approximation properties are analyzed. Stages of real-world jobs are usually not perfectly parallel. For general jobs, a reinforcement learning (RL) based scheduler is proposed to adaptively adjust the resource contention. We evaluate our contention-free and RL-based schedulers on a Spark cluster deployed on the Amazon EC2. Experiments on real-world and synthetic datasets show our RL-based scheduler can improve the CPU and network utilization by 33.0% and 29.7%, respectively.
AB - As data center clusters need to process quintillion bytes of data per day, it becomes a critical problem that efficiently scheduling jobs to improve resource utilization. However, the data analysis job usually contains multiple stages with dependent relationships, which brings challenges for scheduling. Those stages are modeled as Directed Acyclic Graphs (DAGs) and the general DAG scheduling problem is NP-hard. In this paper, we notice that in some parallel computing frameworks such as Spark, the execution of each stage could be divided into multiple phases that use different resources. We observe that interleaving different resources in a pipelined manner could improve resource utilization. Based on this observation, we propose to minimize the job makespan by exploiting resource pipeline. We first theoretically analyze the scheduling for perfectly parallel stages. In this case, our scheduling problem is equivalent to a DAG shop problem which is NP-hard. A contention-free scheduler is proposed and its approximation properties are analyzed. Stages of real-world jobs are usually not perfectly parallel. For general jobs, a reinforcement learning (RL) based scheduler is proposed to adaptively adjust the resource contention. We evaluate our contention-free and RL-based schedulers on a Spark cluster deployed on the Amazon EC2. Experiments on real-world and synthetic datasets show our RL-based scheduler can improve the CPU and network utilization by 33.0% and 29.7%, respectively.
UR - http://www.scopus.com/inward/record.url?scp=85102204473&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85102204473&partnerID=8YFLogxK
U2 - 10.1109/MASS50613.2020.00055
DO - 10.1109/MASS50613.2020.00055
M3 - Conference contribution
AN - SCOPUS:85102204473
T3 - Proceedings - 2020 IEEE 17th International Conference on Mobile Ad Hoc and Smart Systems, MASS 2020
SP - 392
EP - 400
BT - Proceedings - 2020 IEEE 17th International Conference on Mobile Ad Hoc and Smart Systems, MASS 2020
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 17th IEEE International Conference on Mobile Ad Hoc and Smart Systems, MASS 2020
Y2 - 10 December 2020 through 13 December 2020
ER -