TY - GEN
T1 - Distributed Task-Based Training of Tree Models
AU - Yan, Da
AU - Chowdhury, Md Mashiur Rahman
AU - Guo, Guimu
AU - Khalil, Jalal
AU - Jiang, Zhe
AU - Prasad, Sushil K.
N1 - Publisher Copyright:
© 2022 IEEE.
PY - 2022
Y1 - 2022
AB - Decision trees and tree ensembles are popular supervised learning models on tabular data. Two recent research trends on tree models stand out: (1) bigger and deeper models with many trees, and (2) scalable distributed training frameworks. However, existing implementations on distributed systems are IO-bound, leaving CPU cores underutilized. They also only find the best node-splitting conditions approximately due to their row-based data partitioning scheme. In this paper, we target the exact training of tree models by effectively utilizing the available CPU cores. The resulting system, called TreeServer, adopts a column-based data partitioning scheme to minimize communication, and a node-centric task-based engine to fully exploit CPU parallelism. Experiments show that TreeServer is up to 10× faster than models in Spark MLlib. We also showcase TreeServer's high training throughput by using it to build big 'deep forest' models.
UR - http://www.scopus.com/inward/record.url?scp=85136379193&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85136379193&partnerID=8YFLogxK
U2 - 10.1109/ICDE53745.2022.00213
DO - 10.1109/ICDE53745.2022.00213
M3 - Conference contribution
AN - SCOPUS:85136379193
T3 - Proceedings - International Conference on Data Engineering
SP - 2237
EP - 2249
BT - Proceedings - 2022 IEEE 38th International Conference on Data Engineering, ICDE 2022
PB - IEEE Computer Society
T2 - 38th IEEE International Conference on Data Engineering, ICDE 2022
Y2 - 9 May 2022 through 12 May 2022
ER -