Distributed Task-Based Training of Tree Models

Da Yan, Md Mashiur Rahman Chowdhury, Guimu Guo, Jalal Kahlil, Zhe Jiang, Sushil Prasad

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

1 Scopus citation

Abstract

Decision trees and tree ensembles are popular supervised learning models on tabular data. Two recent research trends on tree models stand out: (1) bigger and deeper models with many trees, and (2) scalable distributed training frameworks. However, existing implementations on distributed systems are IO-bound, leaving CPU cores underutilized. They also only find best node-splitting conditions approximately due to their row-based data partitioning scheme. In this paper, we target the exact training of tree models by effectively utilizing the available CPU cores. The resulting system, called TreeServer, adopts a column-based data partitioning scheme to minimize communication, and a node-centric task-based engine to fully exploit CPU parallelism. Experiments show that TreeServer is up to 10× faster than models in Spark MLlib. We also showcase TreeServer's high training throughput by using it to build big 'deep forest' models.
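The abstract's contrast between approximate split finding under row-based partitioning and exact split finding under column-based partitioning can be illustrated with a minimal sketch. The code below is not from the paper and does not reflect TreeServer's actual implementation; the function names (`best_split_for_column`, `gini_impurity`) are hypothetical. It only shows the idea that a worker holding an entire feature column can scan every candidate threshold exactly, and then only a tiny (threshold, score) pair needs to be communicated rather than raw rows.

```python
# Illustrative sketch only (assumption, not TreeServer's code): exact best-split
# search over one feature column that a worker owns in full.
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split_for_column(values, labels):
    """Scan all thresholds of one feature column and return the
    (threshold, weighted Gini impurity) pair with the lowest impurity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    n = len(values)
    best = (None, float("inf"))
    for k in range(1, n):
        lo, hi = order[k - 1], order[k]
        if values[lo] == values[hi]:
            continue  # no valid threshold between equal feature values
        left = [labels[order[j]] for j in range(k)]
        right = [labels[order[j]] for j in range(k, n)]
        score = (len(left) * gini_impurity(left)
                 + len(right) * gini_impurity(right)) / n
        if score < best[1]:
            best = ((values[lo] + values[hi]) / 2.0, score)
    return best


if __name__ == "__main__":
    # Toy column: splitting at threshold 1.5 perfectly separates the labels.
    feature = [2.0, 1.0, 3.5, 0.5, 2.5]
    label = [1, 0, 1, 0, 1]
    print(best_split_for_column(feature, label))  # -> (1.5, 0.0)
```

This quadratic scan is deliberately simplified; a practical engine would maintain pre-sorted columns and running class counts, but the communication pattern it illustrates (exchange split candidates, not data rows) is the point of a column-based partitioning.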

Original language: English (US)
Title of host publication: Proceedings - 2022 IEEE 38th International Conference on Data Engineering, ICDE 2022
Publisher: IEEE Computer Society
Pages: 2237-2249
Number of pages: 13
ISBN (Electronic): 9781665408837
DOIs
State: Published - 2022
Externally published: Yes
Event: 38th IEEE International Conference on Data Engineering, ICDE 2022 - Virtual, Online, Malaysia
Duration: May 9 2022 - May 12 2022

Publication series

Name: Proceedings - International Conference on Data Engineering
Volume: 2022-May
ISSN (Print): 1084-4627

Conference

Conference: 38th IEEE International Conference on Data Engineering, ICDE 2022
Country/Territory: Malaysia
City: Virtual, Online
Period: 5/9/22 - 5/12/22

All Science Journal Classification (ASJC) codes

  • Software
  • Signal Processing
  • Information Systems
