2020 IEEE International Conference for Innovation in Technology (INOCON)
Bengaluru, India, Nov 6-8, 2020

A Hybrid Approach to Data Pre-processing Methods

Vinod Desai, Dept. of CSE, AITM Belagavi, [email protected]
Dr. Dinesha H A, Professor and Head, Dept. of CSE, SGBIT Belagavi, [email protected]

Abstract: This is the era of big data: data is growing exponentially while infrastructure resources struggle to accommodate everything that gets generated. We collect data in enormous amounts to derive meaningful conclusions, perform effective data analytics, and improve decision making. Because we do not have enough infrastructure to store such huge volumes, cleaning the data is a necessity, and a mandatory step must be carried out before doing anything with the data. We call this step data pre-processing, and it is performed in several stages, including data cleaning, data integration, data filtering, and data transformation. Pre-processing is not limited to a fixed number of steps or a definitive set of methods; we must pre-process the data innovatively before it is consumed for analytics, and it has become the responsibility of every data analyst and big data researcher to handpick the data used in their analyses. With these techniques in mind, we propose a hybrid technique that leverages the various available pre-processing algorithms, with minor modifications such as choosing an algorithm or technique at run time based on the data at hand.

Keywords: Big Data, Data Pre-processing, Data Quality Checks.

I. INTRODUCTION

Big Data has evolved enormously over the last two decades. It is conventionally analysed with three V's in mind: volume, variety, and velocity. In recent years, new apps have been released on the Google Play store every day, start-ups keep emerging, and vast numbers of messages are exchanged over social networks. This is a boon to humankind, but it also feeds an important V of Big Data: volume. The same holds for the second V, velocity. If we analyse the statistics of the last five years, data volume has doubled, and it is predicted to grow exponentially in the next few years. These two problems can be handled by scaling up infrastructure; what is more challenging is the third V, variety. Different systems generate data in different formats every day, it is complex to bring all of it into a single format, and even after doing so it is difficult to detect when data arrives in a new, previously unseen format. If we do not clean our data by pre-processing it before it is fed into analytics, it adds extra complications on top of those that already exist. There are hundreds of techniques and thousands of ways to pre-process big data gathered from various sources in various formats, and it is essential to leverage these techniques and algorithms to pre-process all of it in order to carry out effective data analytics.

Let us discuss the importance of data quality.

A. Dimensions of data quality
The most critical dimensions are that data should be accurate, consistent, unique, timely, relevant, valid, and complete.

B. Profiling of data sets
Data profiling revolves around establishing a standard structure, supported by a definition and attributes describing the nature of the data, to enable an efficient assessment of data quality.

To guarantee data quality in a Big Data framework, the data may be assessed and transformed over several iterations, with the goal of cleansing it and progressing from an unstructured to a more structured state.

C. Framework for Big Data quality
Establishing a data quality framework determines and assures improved data quality. Procedures to scrub data, de-duplicate it, remove corrupted records, and many more sub-processes form part of such a quality framework.

D. Quality of input data
Data quality for the still-evolving field of Big Data is a highly complex subject. Large organisations once assumed that capturing data from different business processes, departments, sales, profits, geographies, and location parameters would enable them to amplify their business and spread strategically into more territories. The challenge, however, remained how to tap and mine these colossal volumes of data productively and qualitatively, using standardised quality procedures that also cleanse and improve data quality as part of the Big Data life cycle. Ensuring and transforming data into quality data is critical for any industry's Big Data platform to analyse the data accurately and extract patterns that help formulate future strategies as precisely as possible.
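As a concrete illustration of these dimensions (ours, not the paper's), the sketch below profiles completeness, uniqueness, and validity for a small pandas frame; the column names and the positive-price validity rule are assumptions.

```python
import pandas as pd

# Assumed toy stock table with a deliberate gap, duplicate, and invalid price.
df = pd.DataFrame({
    "ticker": ["A", "A", "B", "B"],
    "date": ["2020-11-06", "2020-11-06", "2020-11-06", "2020-11-07"],
    "price": [101.5, 101.5, None, -3.0],
})

profile = {
    "completeness": 1 - df.isna().mean().mean(),    # share of non-missing cells
    "uniqueness": 1 - df.duplicated().mean(),       # share of non-duplicate rows
    "validity": (df["price"].dropna() > 0).mean(),  # prices must be positive (assumed rule)
}
print(profile)
```

Each score lands in [0, 1], so such a profile can feed directly into threshold-based quality rules of the kind discussed later in the paper.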
II. LITERATURE SURVEY

Data is gathered from numerous sources. This raw data may be riddled with anomalies, including corrupted values, and may be badly formatted and unsuitable for consumption by a Big Data application: a mix of structured, semi-structured, and unstructured data. Such data needs to be sifted and cleansed, reformatted and structured, de-duplicated, stripped of illegal values, and compressed. These pre-processing steps are essential to transform the data to a level appropriate and meaningful for analysis.

Data, as we now understand, must be coherent and fit for use by its consumers. The suitability of data attributes such as format, structure, timing, and data-type variations must also be addressed from a quality standpoint. The better the data source, the better the output quality of the pre-processed data; that is, data quality is relative to its source quality. Consequently, this quality management and the repair of data distortions and complexities carry an indirect effort and cost in delivering quality data for analysis in Big Data frameworks and applications.

Data pre-processing is a data mining approach used to transform raw data into an efficient and useful format.
The commonly used techniques are described below.

A. Integration and Consolidation of Data
Big Data sourced from numerous locations can arrive in various formats and structures: structured, semi-structured, or unstructured, with varying layouts, garbage values, and so on. Data from all of these sources needs to be combined homogeneously to form a single, final source of truth for the data used in the Big Data framework. Technologies such as ETL (Extract, Transform and Load) are well-known, established mechanisms for this.

B. Enrichment and Enhancement of Big Data or Input Data
Input Big Data sets from different sources are consolidated, and during that process their details are refreshed using additional data obtained from other reliable sources, producing fused data that is enriched with more information and possibly improved in quality at the same time.

C. Transformation of Big Data
Data transformation involves numerous stages or sub-processes, such as capturing or pulling data from various sources; the data may be reformatted, standardised, aggregated, and even refreshed according to regulatory norms.

D. Reduction of Data
Data reduction is the process of shrinking the amount of data so that it becomes non-redundant. This helps increase storage efficiency and reduce costs by removing data that is not significant and retaining only the parts relevant to the specific task.

E. Discretization of Data
This procedure separates data into intervals so that it can be used efficiently by the available mining algorithms and methods.

F. Cleansing of Data
This is a procedure that improves data quality by removing the data that reduces its usability. The steps involved are removing incorrect, incomplete, or irrelevant records from the acquired data so that it can be processed and analysed to extract useful value.
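To make a few of these techniques concrete, the following minimal sketch (an illustration of ours, not from the paper; the data frames, column names, and bin edges are assumed) shows integration, cleansing, and discretization with pandas.

```python
import pandas as pd

# Two assumed source feeds sharing the same logical schema ("date", "price").
feed_a = pd.DataFrame({"date": ["2020-11-06", "2020-11-06"], "price": [101.5, 101.5]})
feed_b = pd.DataFrame({"date": ["2020-11-07"], "price": [103.2]})

# A. Integration and consolidation: combine the sources into one table.
merged = pd.concat([feed_a, feed_b], ignore_index=True)

# F. Cleansing: de-duplicate and drop incomplete records.
cleaned = merged.drop_duplicates().dropna()

# E. Discretization: bucket the continuous price column into intervals.
cleaned["price_band"] = pd.cut(cleaned["price"],
                               bins=[0, 100, 105, float("inf")],
                               labels=["low", "mid", "high"])
print(cleaned)
```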

III. PROPOSED WORK

Considering all these techniques, we propose a hybrid technique that leverages multiple algorithms across the various pre-processing steps. We bring in some novelty by understanding the data before pre-processing it and choosing the techniques wisely based on the data we have. It is more effective to keep pre-processing specific to your data than to generalise it. Pre-processing and data are always a one-to-one mapping; we cannot apply every pre-processing step to every available data set. It varies with the use case or the problem we are trying to solve. It is better to consider and combine the available techniques as needed, with minor modifications such as precision considerations, data ranging, and so on, to improve effectiveness wherever required. In most approaches, pre-processing is applied to the entire data set; we can instead filter and handpick data for pre-processing to make it more effective, while respecting the limitation that a minimum volume of data is required to carry out any analytics. As our use case is to develop an end-to-end system for stock prediction, data will arrive from a number of sources in various formats, which makes it all the more urgent to bring all the data into the same form, normalise it, and integrate it, all of which is done in multiple steps as part of data pre-processing.

IV. METHODOLOGY

Here we propose an innovative data evaluation step before the actual data pre-processing starts. In this evaluation, we conduct data quality (DQ) checks and numerous tests to determine how good or bad the data is. Based on the results of these DQ checks, we dynamically define our hybrid pre-processing technique, using only the pre-processing steps that the data and the DQ results call for. Each DQ check result feeds into choosing a pre-processing pipeline that is a hybrid combination of the various pre-processing steps. A few rules are defined to categorise or classify the data based on its quality, and the pre-processing steps are applied according to the class the data falls into. For example, if we learn that the data has no missing values, we need not take care of replacing or filling missing values, which saves pre-processing time. We can also define thresholds for some of the DQ checks; for example, a particular use case may tolerate 5 to 10% erroneous or skewed values without the results being affected, so the time needed to correct or remove them is saved.

As stock market data is a time series, let s(i, j, k) be the stock price on the i-th day of the j-th month of the k-th year. The following then represents the time series from the first day of the first month of year k to the last day of the twelfth month of year k+4, i.e. five years of data:

S = ( s(1, 1, k), s(2, 1, k), …, s(31, 1, k), s(1, 2, k), …, s(31, 12, k), s(1, 1, k+1), …, s(31, 12, k+4) )
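A minimal sketch of how this evaluation-driven selection could look in Python; the check names, thresholds, and step functions are illustrative assumptions on our part, not the authors' implementation.

```python
import pandas as pd

# Illustrative tolerance, mirroring the idea of per-check thresholds (assumed value).
MISSING_TOLERANCE = 0.0  # impute only if any values are actually missing

def evaluate_quality(df: pd.DataFrame) -> dict:
    """Run simple DQ checks and summarise the results."""
    return {
        "missing_ratio": df.isna().mean().max(),  # worst column's missing fraction
        "duplicate_ratio": df.duplicated().mean(),
        "n_rows": len(df),
    }

def build_pipeline(dq: dict) -> list:
    """Choose pre-processing steps dynamically from the DQ results."""
    steps = []
    if dq["missing_ratio"] > MISSING_TOLERANCE:
        steps.append(lambda df: df.fillna(df.mean(numeric_only=True)))
    if dq["duplicate_ratio"] > 0:
        steps.append(lambda df: df.drop_duplicates())
    return steps

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Apply only the steps the DQ evaluation selected."""
    for step in build_pipeline(evaluate_quality(df)):
        df = step(df)
    return df

raw = pd.DataFrame({"price": [100.0, None, 101.0, 101.0], "volume": [10, 12, 12, 12]})
print(preprocess(raw))
```

The point of the design is that a data set with no missing values never pays for an imputation pass, which is exactly the time saving the methodology describes.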
V. RESULTS AND DISCUSSIONS

TABLE I. POSSIBLE DQ CHECKS AND THEIR PRE-PROCESSING MAPPINGS

Data Quality Check            | Pre-Processing Step
Number of columns in data set | Remove unnecessary and unwanted columns if the number of columns > 100
Data skewness                 | If skewness < 5%, ignore; else remove skewed records from the dataset
Data completeness             | Missing values for an attribute should not exceed 25%; remove attributes with missing values/NaN > 25%
Data volume                   | The number of records should be between 1,000 and 10,000 for better analysis; correct the data accordingly (generate records for a low count, aggregate records for a high count)
Data distribution             | Analyse the data distribution and normalise the data
Data randomness               | Adjust data randomness to maintain data uniformity
Data uniqueness               | Remove duplicate records
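Two of these mappings translate directly into code. The sketch below implements the completeness and uniqueness rows; the 25% threshold comes from Table I, while the frame and column names are assumed for demonstration.

```python
import pandas as pd

MAX_MISSING = 0.25  # completeness threshold from Table I

def apply_completeness_check(df: pd.DataFrame) -> pd.DataFrame:
    """Drop attributes whose missing/NaN ratio exceeds 25%."""
    keep = df.columns[df.isna().mean() <= MAX_MISSING]
    return df[keep]

def apply_uniqueness_check(df: pd.DataFrame) -> pd.DataFrame:
    """Remove duplicate records."""
    return df.drop_duplicates()

# Assumed toy frame: "open" is 75% missing, and two rows are duplicates.
raw_df = pd.DataFrame({"open": [1.0, None, None, None],
                       "close": [1.1, 1.2, 1.2, 1.1]})
df = apply_uniqueness_check(apply_completeness_check(raw_df))
print(df)
```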

A. Comparison Study and Results
Below is a comparison study of various data pre-processing techniques and their results, in terms of error rate and execution time. It was carried out on a dataset of 50,000 stock data records with 10 columns.
TABLE II. COMPARISON STUDY AND RESULTS

Data Pre-Processing Step | Algorithm/Technique | Error Rate (%) | Execution Time (s)
Data Imputation | Hot deck imputation | 0.45 | 0.078
Data Imputation | Cold deck imputation | 0.37 | 0.082
Data Imputation | Regression imputation | 0.25 | 0.113
Data Imputation | Stochastic regression imputation | 0.21 | 0.124
Data Imputation | Interpolation and extrapolation | 0.41 | 0.119
Data Imputation | Mean imputation | 0.39 | 0.085
Data Imputation | KNN | 0.33 | 0.124
Data Imputation | Fuzzy K-Means | 0.27 | 0.168
Data Imputation | Singular value decomposition | 0.43 | 0.107
Data Imputation | Bayesian principal component analysis | 0.24 | 0.119
Data Ranging/Reduction | Data cube aggregation | 0.2 | 0.012
Data Ranging/Reduction | Dimensionality reduction | 0.5 | 0.0117
Data Ranging/Reduction | Data compression | 0.8 | 0.0123
Data Ranging/Reduction | Numerosity reduction | 0.65 | 0.0112
Data Ranging/Reduction | Discretisation and concept hierarchy generation | 0.4 | 0.0125
Data Ranging/Reduction | Data sampling | 0.3 | 0.0127
Data Ranging/Reduction | Feature selection | 0.9 | 0.0130
Data Distribution | Uniform | 0.2 | 0.0143
Data Distribution | Normal | 0.4 | 0.0157
Data Distribution | Uneven | 0.5 | 0.0135
Data Normalisation | Nominal scale variables | 0.23 | 0.010
Data Normalisation | Min-max normalisation | 0.27 | 0.075
Data Normalisation | Zero-mean normalisation | 0.32 | 0.015
Data Normalisation | Decimal scaling | 0.35 | 0.05
Data Normalisation | Bias and variance | 0.45 | 0.055
Data Normalisation | Selection of housekeeping genes (HG) | 0.50 | 0.06
Data Normalisation | Differential expression analysis | 0.30 | 0.05
Data Normalisation | Prediction errors | 0.42 | 0.083
Data Normalisation | Trimmed mean of M-values | 0.37 | 0.076
Data Normalisation | Upper quartile | 0.40 | 0.069
Noise Removal | Edited Nearest Neighbour (ENN) | 0.65 ± 1.2 | 0.0148
Noise Removal | Repeated ENN (RENN) | 0.65 ± 1.2 | 0.0150
Noise Removal | All-KNN | 0.65 ± 1.2 | 0.0080
Noise Removal | Decremental Reduction Optimization Procedure 2 (DROP2) | 0.268 ± 3.5 | 0.0055
Noise Removal | Decremental Reduction Optimization Procedure 4 (DROP4) | 0.303 ± 5.1 | 0.0070
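A comparison of this kind can be harnessed roughly as follows. This sketch of ours times one imputation technique and measures error on artificially masked values; the synthetic dataset, masking scheme, and error metric are assumptions, not the paper's protocol.

```python
import time
import numpy as np
import pandas as pd

def benchmark_imputation(df, impute_fn, mask_frac=0.1, seed=0):
    """Mask known values, impute them back, and report error rate and runtime."""
    rng = np.random.default_rng(seed)
    mask = rng.random(df.shape) < mask_frac  # hide ~10% of cells
    masked = df.mask(pd.DataFrame(mask, index=df.index, columns=df.columns))

    start = time.perf_counter()
    imputed = impute_fn(masked)
    elapsed = time.perf_counter() - start

    # Mean relative error on the hidden cells, as a percentage.
    err = (np.abs(imputed.values[mask] - df.values[mask])
           / np.abs(df.values[mask])).mean() * 100
    return err, elapsed

# Example: mean imputation on a synthetic 50,000 x 10 stock-like numeric table.
df = pd.DataFrame(np.random.default_rng(1).normal(100, 5, size=(50_000, 10)))
err, secs = benchmark_imputation(df, lambda d: d.fillna(d.mean()))
print(f"error rate: {err:.2f}%  execution time: {secs:.4f}s")
```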

TABLE III. RESULTS OF HYBRID TECHNIQUE (PROPOSED METHOD)

Data Pre-Processing Step | Algorithms/Techniques | Error Rate (%) | Execution Time (s)
Data Imputation | Mean + KNN | 0.23 | 0.022
Data Ranging/Reduction | Dimensionality reduction + Data sampling | 0.35 | 0.0023
Data Distribution | Uniform | 0.2 | 0.0048
Data Normalisation | Zero-mean normalisation + Nominal scale variables | 0.21 | 0.0089
Noise Removal | Drop NaN values | 0.22 | 0.0046
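The Mean + KNN imputation row could be realised in several ways; one plausible sketch, assuming a cheap mean-imputation pass on heavily gapped columns followed by KNN imputation of the remaining gaps (the split rule, threshold, and parameters are our assumptions, not the authors' stated design):

```python
import pandas as pd
from sklearn.impute import KNNImputer

def hybrid_mean_knn(df: pd.DataFrame, heavy_gap: float = 0.25, k: int = 5) -> pd.DataFrame:
    """Mean-impute numeric columns with many gaps, then KNN-impute what remains."""
    out = df.copy()
    ratios = out.isna().mean()
    heavy = ratios[ratios > heavy_gap].index
    out[heavy] = out[heavy].fillna(out[heavy].mean())       # coarse, cheap first pass
    imputed = KNNImputer(n_neighbors=k).fit_transform(out)  # refine remaining gaps
    return pd.DataFrame(imputed, columns=df.columns, index=df.index)
```

The intuition behind such a hybrid is that the mean pass limits how much work the more expensive KNN step must do, which is consistent with the lower execution time reported in Table III relative to pure KNN in Table II.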

VI. CONCLUSION

The results of the proposed method show that every pre-processing algorithm has its own pros and cons, so we have to pick the algorithm according to our data, use case, or problem statement. As stated before, no single algorithm fits every data set or every problem statement, and this held true throughout our analysis and experiments. After trying various combinations of pre-processing algorithms on stock market data, both historical and current transactional data, the results reveal that the combination of the mean and KNN algorithms gives better execution time and a minimal error rate in the data imputation step. Along similar lines, sampling the data and then carrying out dimensionality reduction, as part of data reduction, produces better execution time and a lower error rate. Stock data also fits and works very well when the data is uniform. Applying the nominal scale variables and zero-mean normalisation algorithms in sequence provides effective results for data normalisation. Hence, we could not gain efficiency in every pre-processing step: three of the five pre-processing steps give better results after applying hybrid techniques. Our research continues from this paper, and we will pursue further effectiveness in the data pre-processing steps in our upcoming work.
Example),” International Journal of Computer Applications (0975-
REFERENCES 8887), International Conference on Advancement in Engineering and
Technology (ICAET 2015).
[1] Nair, P., & Kashyap, I.,“Hybrid Preprocessing Technique for
Handling Imbalanced Data and Detecting Outliers for KNN [13] “Data pre-processing”, [online]. Available: http://
Classifier”, International Conference on Machine Learning, Big Data, www.techopedia.com/ definition/14650/data pre-processing.
Cloud and Parallel Computing (COMITCon),2019. [Accessed Mar. 20, 2018].
[2] P. Nair, and I. Kashyap,“ Optimization of kNN Classifier Using [14] Aneyrao, T. A., & Fadnavis, R. A., “Analysis for data pre-processing
Hybrid Pre-processing Model for Handling Imbalanced Data”, to prevent direct discrimination in data mining”, World Conference
unpublished. on Futuristic Trends in Research and Innovation for Social Welfare
(Start-up Conclave), 2016.
[3] Q. Kang, X. Chen,S Li, and M. Zhou. “A Noise-Filtered Under
Sampling Scheme for Imbalanced Classification”. IEEE [15] Hajian, Sara, and Josep Domingo-Ferrer. "A methodology for direct
TRANSACTIONS ON CYBERNETICS, Vol: 47, Issue: 12,Pages: and indirect discrimination prevention in data mining" Knowledge
4263 – 4274,2017. and Data Engineering, IEEE Transactions on 25.7: 1445-1459,2013.
[4] L. Wang, M. Zengd, B. Zou,W.Faran and X. Liu. “Effective
prediction of three common diseases by combining SMOTE with

4
Authorized licensed use limited to: Somaiya University. Downloaded on October 03,2024 at 02:34:51 UTC from IEEE Xplore. Restrictions apply.

You might also like