A Hybrid Approach To Data Pre-Processing Methods
Abstract: This is the era of big data: data is growing exponentially while the infrastructure available to accommodate all the data that gets generated is running out. We collect data in enormous amounts to derive meaningful conclusions, perform effective data analytics and improve decision making. Since we do not have enough infrastructure to store such huge volumes, cleaning the data becomes a necessity, and it is a mandatory step to carry out before doing anything else with the data. We call this step pre-processing of data, and it is carried out in several stages, including data cleaning, data integration, data filtering and data transformation. As such, pre-processing is not limited to a fixed number of steps or a definitive set of methods; we must pre-process the data innovatively before it is consumed for data analytics. It has become the responsibility of every data analyst and big data researcher to handpick data for his or her analytics. With these techniques in mind, we propose a hybrid technique that leverages the various algorithms available to pre-process data, with minor modifications such as choosing an algorithm or technique wisely at run time based on the data that we have.

Key words: Big Data, Data Pre-processing, Data Quality checks.
I. INTRODUCTION
Big Data has evolved enormously over the last two decades. Data is always analysed and viewed with the three V's in mind: volume, variety and velocity. In recent years, with new apps released on the Google Play store every day, upcoming start-ups and the huge number of messages exchanged over social networks, these developments have been a boon to humankind, but they all contribute to an important V of Big Data, namely volume. The same holds for the second V, velocity: if we analyse the statistics of the last five years, we notice that the amount of data has doubled, and it is predicted to grow to exponential volumes in the next few years. These two problems can be handled by increasing the infrastructure; what is more challenging is the third V, variety. The various systems generating data every day each produce it in a different format, it is complex to bring all of this data into a single format, and even when we do so, it is very difficult to know when data starts arriving in a new format that has never been seen before. If we do not clean our data by pre-processing it before it is fed to analytics, it adds extra complications on top of those that already exist. There exist hundreds of techniques and thousands of ways to pre-process big data brought in from various sources in various formats, and it is always mandatory to leverage these techniques and algorithms to pre-process all the big data in order to carry out effective data analytics. Let us now discuss the importance of data quality.
A. Dimensions of data quality
The most critical dimensions are that data should be accurate, consistent, unique, timely, relevant, valid and, finally, complete. A small sketch of how a few of these dimensions can be checked follows below.
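As a minimal sketch of how a few of these dimensions can be checked on a tabular dataset, the Python snippet below scores a pandas DataFrame on completeness, uniqueness and validity; the column names and the allowed value ranges are illustrative assumptions, not taken from this paper.

import pandas as pd

def quality_dimensions(df: pd.DataFrame, valid_ranges: dict) -> dict:
    """Score a DataFrame on completeness, uniqueness and validity."""
    scores = {
        # Completeness: share of cells that are not missing.
        "completeness": 1.0 - df.isna().sum().sum() / df.size,
        # Uniqueness: share of rows that are not exact duplicates.
        "uniqueness": 1.0 - df.duplicated().mean(),
    }
    # Validity: share of non-null values inside an allowed (low, high) range.
    valid, checked = 0, 0
    for col, (low, high) in valid_ranges.items():
        values = df[col].dropna()
        valid += int(values.between(low, high).sum())
        checked += len(values)
    scores["validity"] = valid / checked if checked else 1.0
    return scores

if __name__ == "__main__":
    # Hypothetical stock-style columns; the ranges below are assumptions.
    sample = pd.DataFrame({"price": [10.5, 11.0, None, 10.5, 250.0],
                           "volume": [100, 120, 90, 100, -5]})
    print(quality_dimensions(sample, {"price": (0, 100), "volume": (0, 1e9)}))

Accuracy, consistency and timeliness usually require reference data or timestamps and are therefore not scored in this sketch.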
B. Profiling of data sets or information
Data profiling revolves around establishing a standard structure that allows the quality of the data to be assessed efficiently, supported by a definition of, and attributes describing, the nature of the data. To guarantee data quality in a Big Data framework, the data may be assessed and transformed over several iterations in an effort to cleanse it and to move it from an unstructured to a progressively more structured state. A minimal profiling pass is sketched below.
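The following sketch, assuming pandas and purely illustrative column names, collects per column the basic attributes on which such a quality assessment of the data could be based.

import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Collect per-column attributes that a quality assessment can build on."""
    rows = []
    for col in df.columns:
        s = df[col]
        numeric = pd.api.types.is_numeric_dtype(s)
        rows.append({
            "column": col,
            "dtype": str(s.dtype),
            "missing": int(s.isna().sum()),
            "distinct": int(s.nunique(dropna=True)),
            "min": s.min() if numeric else None,
            "max": s.max() if numeric else None,
        })
    return pd.DataFrame(rows)

if __name__ == "__main__":
    # Hypothetical columns; in practice profiling is repeated as the data is cleansed.
    sample = pd.DataFrame({"ticker": ["A", "B", "B", None],
                           "close": [10.2, 11.5, 11.5, 9.8]})
    print(profile(sample))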
C. Framework of Big Data quality
Building up a data quality framework is what determines and guarantees data of improved quality. Procedures to scrub the data, de-duplicate it, remove corrupted samples and many more sub-processes form part of this quality framework, as the sketch after this paragraph illustrates.
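The sketch below strings a few such sub-processes together (scrubbing text fields, coercing and dropping corrupted numeric values, and de-duplicating records); it is only an assumed shape for one slice of such a framework, with hypothetical column names.

import pandas as pd

def cleanse(df: pd.DataFrame, numeric_cols: list) -> pd.DataFrame:
    """Scrub, remove corrupted samples and de-duplicate a DataFrame."""
    out = df.copy()
    # Scrub: normalise free-text columns (trim whitespace, unify case).
    for col in out.select_dtypes(include="object").columns:
        out[col] = out[col].str.strip().str.upper()
    # Corrupted samples: coerce numeric columns and drop rows that fail.
    for col in numeric_cols:
        out[col] = pd.to_numeric(out[col], errors="coerce")
    out = out.dropna(subset=numeric_cols)
    # De-dupe: keep only the first occurrence of identical records.
    return out.drop_duplicates().reset_index(drop=True)

if __name__ == "__main__":
    raw = pd.DataFrame({"ticker": [" abc ", "ABC", "xyz"],
                        "close": ["10.5", "10.5", "oops"]})
    print(cleanse(raw, numeric_cols=["close"]))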
D. Quality of Input Data
The quality of input data for the still advancing and developing field of Big Data is an exceptionally complex subject. Large organisations long assumed that capturing data from different business processes, departments, sales, profits, geographies, location parameters and so on would enable them to amplify their business and expand deliberately into more territories. The challenge, however, remained how to tap and mine these colossal volumes of data productively and qualitatively, using standardised quality procedures that also cleanse and improve data quality as part of the Big Data life cycle. Ensuring the data is of high quality, and transforming it into that state, is critical for any industry's Big Data platform to be able to analyse the data accurately and to evaluate patterns that help in formulating future strategies as precisely as possible.
II. LITERATURE SURVEY
Data is gathered from numerous sources. This raw data may be riddled with anomalies, including corrupted values, and may be badly formatted and unsuitable for consumption by the Big Data application; it is typically a mix of structured, semi-structured and unstructured data. Such data needs to be sifted and cleansed, reformatted and structured, de-duplicated, stripped of illegal values and compressed. These pre-processing steps are significant to change the
V. RESULTS AND DISCUSSIONS
A. Comparison Study and Results
Table II compares various data pre-processing techniques in terms of error rate and execution time, measured on a dataset of 50,000 records of stock data with 10 columns. A sketch of how such a comparison can be run appears after the table.
TABLE II. COMPARISON STUDY AND RESULTS

Data Pre-Processing Step   Algorithm/Technique                                        Error Rate (in %)   Execution Time (in Seconds)
Data Imputation            Hot deck imputation                                        0.45                0.078
                           Cold deck imputation                                       0.37                0.082
                           Regression imputation                                      0.25                0.113
                           Stochastic regression imputation                           0.21                0.124
                           Interpolation and extrapolation                            0.41                0.119
                           Mean imputation                                            0.39                0.085
                           KNN                                                        0.33                0.124
                           Fuzzy K-Means                                              0.27                0.168
                           Singular value decomposition                               0.43                0.107
                           Bayesian principal component analysis                      0.24                0.119
                           Data Cube Aggregation                                      0.2                 0.012
                           Decremental Reduction Optimization Procedure 2 (DROP2)     0.268 +/- 3.5       0.0055
                           Decremental Reduction Optimization Procedure 4 (DROP4)     0.303 +/- 5.1       0.0070
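As a hedged sketch of how a comparison of this kind could be reproduced, the snippet below hides a fraction of known values in a synthetic numeric table, fills them back in with mean imputation and KNN imputation from scikit-learn, and reports the resulting error and wall-clock time. The synthetic data, the 5% masking rate, the down-scaled row count and the mean-absolute-error definition are all assumptions rather than the paper's exact protocol.

import time

import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

rng = np.random.default_rng(0)

# Down-scaled synthetic stand-in for the 50,000-record, 10-column stock
# dataset (kept at 5,000 rows so KNN imputation finishes quickly).
X_true = rng.normal(loc=100.0, scale=15.0, size=(5_000, 10))

# Hide 5% of the values so the imputed estimates can be scored (assumption).
mask = rng.random(X_true.shape) < 0.05
X_missing = X_true.copy()
X_missing[mask] = np.nan

imputers = {
    "Mean imputation": SimpleImputer(strategy="mean"),
    "KNN imputation": KNNImputer(n_neighbors=5),
}

for name, imputer in imputers.items():
    start = time.perf_counter()
    X_filled = imputer.fit_transform(X_missing)
    elapsed = time.perf_counter() - start
    # "Error rate" here is the mean absolute error on the hidden cells only.
    error = np.abs(X_filled[mask] - X_true[mask]).mean()
    print(f"{name}: error={error:.3f}, time={elapsed:.3f}s")

On synthetic data the ranking of the techniques will generally not match Table II exactly; the point of the sketch is only the measurement harness (mask, impute, score, time).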