Sangam: A Confluence of Knowledge Streams

HARP: A MACHINE LEARNING FRAMEWORK ON TOP OF THE COLLECTIVE COMMUNICATION LAYER FOR THE BIG DATA SOFTWARE STACK


dc.contributor Qiu, Judy
dc.creator Zhang, Bingjing
dc.date 2017-05-16T19:50:38Z
dc.date 2017-05
dc.date.accessioned 2023-02-21T11:20:47Z
dc.date.available 2023-02-21T11:20:47Z
dc.identifier http://hdl.handle.net/2022/21445
dc.description Thesis (Ph.D.) - Indiana University, Informatics and Computing, 2017
dc.description Almost every field of science is now undergoing a data-driven revolution that requires analyzing massive datasets. Machine learning algorithms are widely used to find meaning in a given dataset and to discover properties of complex systems. At the same time, the landscape of computing has evolved toward computers with increasingly complex many-core architectures. However, no simple, unified programming framework lets machine learning applications exploit the parallel computing capability of these new machines; instead, many efforts focus on specialized ways to speed up individual algorithms. In this thesis, the Harp framework is prototyped: it uses collective communication techniques to improve the performance of data movement and provides high-level APIs for various synchronization patterns in iterative computation. In contrast to traditional parallelization strategies that focus on handling high-volume training data, a lesser-known challenge is that the high-dimensional model is also high in volume and difficult to synchronize. As an extension of the Hadoop MapReduce system, Harp includes a collective communication layer and a set of programming interfaces. Iterative machine learning algorithms can be parallelized through efficient synchronization methods that exploit both inter-node and intra-node parallelism. The usability and efficiency of Harp's approach are validated on applications such as K-means Clustering, Multi-Dimensional Scaling, Latent Dirichlet Allocation, and Matrix Factorization. The results show that these machine learning applications achieve high parallel performance on Harp.
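The synchronization pattern the abstract describes, collectively combining per-worker partial results so every worker holds the same updated model, can be sketched as below. This is an illustrative sketch only, not Harp's actual API: the class and method names are hypothetical, and it shows an allreduce-style merge of per-worker K-means centroid partial sums in plain Java.

```java
import java.util.Arrays;

// Hypothetical sketch of an allreduce-style model synchronization step,
// as used in parallel K-means: each worker holds partial centroid sums
// and point counts for its data split; the collective step merges them
// into one global set of centroids shared by all workers.
public class AllreduceSketch {
    // partialSums[w][c][d]: worker w's coordinate sums for cluster c, dim d.
    // counts[w][c]: number of points worker w assigned to cluster c.
    static double[][] allreduceCentroids(double[][][] partialSums, int[][] counts) {
        int workers = partialSums.length;
        int k = partialSums[0].length;
        int dim = partialSums[0][0].length;
        double[][] centroids = new double[k][dim];
        int[] total = new int[k];
        // Reduce: sum contributions from every worker.
        for (int w = 0; w < workers; w++) {
            for (int c = 0; c < k; c++) {
                total[c] += counts[w][c];
                for (int d = 0; d < dim; d++) {
                    centroids[c][d] += partialSums[w][c][d];
                }
            }
        }
        // Finalize: divide sums by counts to get the new centroids.
        for (int c = 0; c < k; c++) {
            for (int d = 0; d < dim; d++) {
                if (total[c] > 0) centroids[c][d] /= total[c];
            }
        }
        return centroids; // in a real collective, broadcast back to all workers
    }

    public static void main(String[] args) {
        // Two workers, two clusters, 1-D points.
        double[][][] partial = {
            {{4.0}, {10.0}},  // worker 0: per-cluster coordinate sums
            {{2.0}, {20.0}}   // worker 1
        };
        int[][] counts = {{2, 1}, {1, 2}};
        double[][] model = allreduceCentroids(partial, counts);
        System.out.println(Arrays.deepToString(model)); // prints [[2.0], [10.0]]
    }
}
```

In a real collective communication layer, the reduction would happen across processes over the network rather than in one loop, but the data flow, partial model in, globally combined model out, is the same.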
dc.language en
dc.publisher [Bloomington, Ind.] : Indiana University
dc.subject MACHINE LEARNING
dc.subject COLLECTIVE COMMUNICATION
dc.subject BIG DATA
dc.title HARP: A MACHINE LEARNING FRAMEWORK ON TOP OF THE COLLECTIVE COMMUNICATION LAYER FOR THE BIG DATA SOFTWARE STACK
dc.type Doctoral Dissertation


Files in this item

File Size Format
Zhangthesis.pdf 3.422 MB application/pdf
