
A Simple Approach for Data Cleansing on Hadoop Framework using File Merging Technique

  • Al Ain University of Science and Technology
  • United Arab Emirates University
  • Universiti Sains Malaysia

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

1 Scopus citations

Abstract

The Hadoop framework is known for efficiently processing huge files and extracting useful data from them. In a scenario with many small files, however, the framework is inefficient and fails to deliver. These small files cause many issues for the framework's processing criteria and performance levels. Moreover, they contain content that is useless or provides no benefit in key-value decision-making. To overcome the problems of small files and unnecessary content, this paper proposes a simple data cleansing and file merging approach, based on specific type and size, that is not only effective but also increases the framework's performance by approximately 68%. The algorithm ensures that the output is a few huge files containing only essential data. The results show that the proposed system not only improves the framework's performance, by approximately 68% over base Hadoop framework processing, but also reduces deadlocks in the framework's processes.
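The abstract does not give the algorithm itself, so the following is only a rough sketch of the general idea of merging small files selected by type and size while discarding unneeded content. It is not the authors' method. The function name `merge_small_files`, the `suffix` and `max_bytes` parameters, and the cleansing rule (dropping blank lines) are all assumptions made for illustration, and the sketch works on a local directory rather than on HDFS.

```python
from pathlib import Path


def merge_small_files(src_dir, out_path, suffix=".txt", max_bytes=1024):
    """Merge small files of one type into a single output file.

    Hypothetical illustration only: files are selected by extension
    (`suffix`) and size (`max_bytes`), and the assumed cleansing rule
    is simply to drop blank or whitespace-only lines.
    Returns the number of input files that were merged.
    """
    out_path = Path(out_path)
    merged = 0
    with open(out_path, "w", encoding="utf-8") as out:
        # Sort for a deterministic merge order.
        for path in sorted(Path(src_dir).iterdir()):
            # Select by type: regular files with the requested suffix.
            if not path.is_file() or path.suffix != suffix:
                continue
            # Never read the output file back into itself.
            if path.resolve() == out_path.resolve():
                continue
            # Select by size: only files at or below the threshold.
            if path.stat().st_size > max_bytes:
                continue
            # Cleansing step (assumed): keep only non-blank lines.
            for line in path.read_text(encoding="utf-8").splitlines():
                if line.strip():
                    out.write(line.strip() + "\n")
            merged += 1
    return merged
```

A call such as `merge_small_files("input_dir", "merged.txt", suffix=".txt", max_bytes=1024)` would produce one larger file from the qualifying small files. A real deployment would need a cleansing rule and size threshold suited to the data and to the HDFS block size.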

Original language: English
Title of host publication: 2022 9th International Conference on Software Defined Systems, SDS 2022
Editors: Larbi Boubshir, Boubaker Daachi, Abdellah Mokrane, Yaser Jararweh, Benkhelifa Elhadj
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9798350346718
State: Published - 2022
Externally published: Yes
Event: 9th International Conference on Software Defined Systems, SDS 2022 - Paris, France
Duration: 12 Dec 2022 → 15 Dec 2022

Publication series

Name: 2022 9th International Conference on Software Defined Systems, SDS 2022

Conference

Conference: 9th International Conference on Software Defined Systems, SDS 2022
Country/Territory: France
City: Paris
Period: 12/12/22 → 15/12/22

Keywords

  • Big Data
  • Data Cleansing
  • HDFS
  • Hadoop
