A Novel Compression Algorithm Decision Method for Spark Shuffle Process

Author：

Huang, Shanshan | Xu, Jungang | Liu, Renfeng | Liao, Husheng (Scholars：廖湖声)

Indexed by：

CPCI-S

Abstract：

With the wide adoption of the Spark big data platform, several problems have surfaced in practice, and performance optimization is one of the main ones. The Shuffle module is one of Spark's core modules, and it is also an important module in other distributed big data computing frameworks. Its design is a key factor that directly determines the performance of a big data computing framework. The main optimization parameters of the Shuffle process involve CPU utilization, I/O read/write rate, and network transmission rate, and any one of these is likely to become the bottleneck while an application runs. Network transmission time, I/O read and write time, and CPU utilization are all closely related to the size of the data being processed. For this reason, Spark provides compression configuration options and several compression algorithms for users to choose from. Different compression algorithms differ in compression speed and compression ratio, but users usually keep the default configuration even though they run different applications, so the optimal configuration is not achieved. To find the optimal compression algorithm configuration for the Shuffle process, this paper proposes a cost optimization model for the Spark Shuffle process that lets users obtain the best compression configuration before an application executes. The experimental results show that the prediction model for compression configuration has an accuracy of 58.3%, and the proposed cost optimization model can improve performance by 48.9%.
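The trade-off described in the abstract can be illustrated with a toy cost function. This is a minimal sketch and not the model proposed in the paper: the `Codec` class, the `shuffle_cost` and `pick_codec` functions, the codec names, and every throughput figure are hypothetical and chosen only for illustration. It assumes shuffle cost is the sum of CPU time for compression and decompression, disk time for one write and one read of the compressed data, and network time for one transfer. In a real deployment the choice would be applied through Spark settings such as `spark.shuffle.compress` and `spark.io.compression.codec`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Codec:
    """Hypothetical codec profile; all numbers are illustrative assumptions."""
    name: str
    ratio: float            # compressed size / original size (smaller is denser)
    compress_mb_s: float    # compression throughput in MB/s
    decompress_mb_s: float  # decompression throughput in MB/s


def shuffle_cost(codec, data_mb, disk_mb_s, net_mb_s):
    """Estimated seconds to shuffle data_mb under a simple additive model."""
    compressed_mb = data_mb * codec.ratio
    # CPU time: compress once on the map side, decompress once on the reduce side.
    cpu = data_mb / codec.compress_mb_s + data_mb / codec.decompress_mb_s
    # Disk time: the compressed shuffle file is written once and read once.
    disk = 2 * compressed_mb / disk_mb_s
    # Network time: the compressed data crosses the network once.
    net = compressed_mb / net_mb_s
    return cpu + disk + net


def pick_codec(codecs, data_mb, disk_mb_s, net_mb_s):
    """Return the codec with the lowest estimated shuffle cost."""
    return min(codecs, key=lambda c: shuffle_cost(c, data_mb, disk_mb_s, net_mb_s))


# Hypothetical profiles: no compression, a fast light codec, a slow dense codec.
CODECS = [
    Codec("none", 1.0, float("inf"), float("inf")),
    Codec("fast", 0.5, 400.0, 1000.0),
    Codec("dense", 0.3, 100.0, 400.0),
]

if __name__ == "__main__":
    scenarios = [("slow network", 100, 10),
                 ("balanced", 100, 50),
                 ("fast hardware", 2000, 5000)]
    for label, disk, net in scenarios:
        best = pick_codec(CODECS, data_mb=1000, disk_mb_s=disk, net_mb_s=net)
        print(f"{label}: {best.name}")
```

Under these assumed figures the preferred codec changes with the hardware: a slow network favors the denser codec, while very fast disk and network make compression a net loss. That sensitivity is why a single default configuration is unlikely to suit every application.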

Keyword：

Shuffle process; Spark; cost model; compression configuration

Author Affiliations：

[ 1 ] [Huang, Shanshan]Univ Chinese Acad Sci, Sch Comp & Control Engn, Beijing, Peoples R China
[ 2 ] [Xu, Jungang]Univ Chinese Acad Sci, Sch Comp & Control Engn, Beijing, Peoples R China
[ 3 ] [Liu, Renfeng]Univ Chinese Acad Sci, Sch Comp & Control Engn, Beijing, Peoples R China
[ 4 ] [Huang, Shanshan]Beijing Univ Technol, Fac Informat Technol, Beijing, Peoples R China
[ 5 ] [Liao, Husheng]Beijing Univ Technol, Fac Informat Technol, Beijing, Peoples R China

Reprint Author's Address：

[Huang, Shanshan]Univ Chinese Acad Sci, Sch Comp & Control Engn, Beijing, Peoples R China; [Huang, Shanshan]Beijing Univ Technol, Fac Informat Technol, Beijing, Peoples R China

Email：

huangss118@emails.bjut.edu.cn |
xujg@ucas.ac.cn |
liurenfeng16@mails.ucas.ac.cn |
liaohs@bjut.edu.cn

Source ：

2017 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA)

ISSN： 2639-1589

Year： 2017

Page： 2931-2940

Language： English

Cited Count：

WoS CC Cited Count： 4

ESI Highly Cited Papers on the List： 0

Affiliated Colleges：

Faculty of Information Technology