Minimizing Training Time of Distributed Machine Learning by Reducing Data Communication

Yubin Duan, Ning Wang, Jie Wu

Research output: Contribution to journal › Article › peer-review

13 Scopus citations


Due to the additive property of most machine learning objective functions, training can be distributed across multiple machines. Distributed machine learning is an efficient way to cope with the rapid growth of data volume, at the cost of extra inter-machine communication. One common implementation is the parameter server system, which contains two types of nodes: worker nodes, which compute updates, and server nodes, which maintain parameters. We observe that inefficient communication between workers and servers can slow down the system. Therefore, we formulate a graph partition problem that partitions data among workers and parameters among servers such that the total training time is minimized. The problem is NP-complete. We investigate a two-step heuristic approach that first partitions data and then partitions parameters, and we consider the trade-off between partition time and the savings in training time. In addition, we adapt a multilevel graph partition approach to bipartite graph partitioning. We implement both approaches on an open-source parameter server platform, PS-lite. Experimental results on synthetic and real-world datasets show that both approaches improve communication efficiency by up to 14 times compared with random partitioning.
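To illustrate why the data/parameter placement matters, the toy sketch below (an assumption for illustration, not the paper's algorithm) models the bipartite access graph between data samples and parameters: each (sample, parameter) edge requires the worker holding the sample to exchange that parameter with the server holding it, and a worker aggregates its local updates, so the message count is the number of distinct (worker, parameter) pairs. Co-locating samples that share parameters on the same worker lowers this count.

```python
# Illustrative model of communication volume in a parameter server system.
# Each edge (sample, param) in the bipartite access graph means the worker
# holding `sample` must pull/push `param` from its server. Workers aggregate
# local gradients, so cost = number of distinct (worker, param) pairs.

def comm_cost(edges, worker_of):
    """edges: list of (sample, param); worker_of: sample -> worker id."""
    return len({(worker_of[s], p) for s, p in edges})

# Tiny example: samples 0,1 both touch parameter 'A'; samples 2,3 touch 'B'.
edges = [(0, "A"), (1, "A"), (2, "B"), (3, "B")]

# Locality-aware partition: samples sharing a parameter share a worker.
good = {0: 0, 1: 0, 2: 1, 3: 1}
# Random-style partition: shared parameters are split across workers.
bad = {0: 0, 1: 1, 2: 0, 3: 1}

print(comm_cost(edges, good))  # 2 worker-parameter exchanges
print(comm_cost(edges, bad))   # 4 worker-parameter exchanges
```

Under this simplified cost model the locality-aware partition halves the communication; the paper's heuristic pursues the same objective on the full bipartite graph while also balancing load across workers and servers.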

Original language: English (US)
Article number: 9406385
Pages (from-to): 1802-1814
Number of pages: 13
Journal: IEEE Transactions on Network Science and Engineering
Issue number: 2
State: Published - Apr 1 2021

All Science Journal Classification (ASJC) codes

  • Control and Systems Engineering
  • Computer Science Applications
  • Computer Networks and Communications


