2020 IEEE International Conference for Innovation in Technology (INOCON)
Bengaluru, India, Nov 6-8, 2020

A Hybrid Approach to Data Pre-processing Methods

Vinod Desai, Dept. of CSE, AITM Belagavi, [email protected]
Dr. Dinesha H A, Professor and Head, Dept. of CSE, SGBIT Belagavi, [email protected]

Abstract: This is the era of big data: data is growing exponentially while infrastructure resources struggle to accommodate everything that gets generated. We collect data in enormous amounts to derive meaningful conclusions, perform effective data analytics, and improve decision making. Because we do not have enough infrastructure to store such huge volumes, cleaning the data is a necessity, and a mandatory step must be carried out before doing anything with the data. We call this step data pre-processing, and it is performed in several stages, including data cleaning, data integration, data filtering, and data transformation. Pre-processing is not limited to a fixed number of steps or a definitive set of methods; we must pre-process the data innovatively before it is consumed for analytics, and it has become the responsibility of every data analyst and big data researcher to handpick the data used in their analyses. With these techniques in mind, we propose a hybrid technique that leverages the various available pre-processing algorithms, with minor modifications such as choosing an algorithm or technique at run time based on the data at hand.

Keywords: Big Data, Data Pre-processing, Data Quality Checks.

I. INTRODUCTION

Big Data has evolved enormously over the last two decades. It is conventionally analysed with three V's in mind: volume, variety, and velocity. In recent years, new apps have been released on the Google Play store every day, start-ups keep emerging, and vast numbers of messages are exchanged over social networks. This is a boon to humankind, but it also feeds an important V of Big Data: volume. The same holds for the second V, velocity. If we analyse the statistics of the last five years, data volume has doubled, and it is predicted to grow exponentially in the next few years. These two problems can be handled by scaling up infrastructure; what is more challenging is the third V, variety. Different systems generate data in different formats every day, it is complex to bring all of it into a single format, and even after doing so it is difficult to detect when data arrives in a new, previously unseen format. If we do not clean our data by pre-processing it before it is fed into analytics, it adds extra complications on top of those that already exist. There are hundreds of techniques and thousands of ways to pre-process big data gathered from various sources in various formats, and it is essential to leverage these techniques and algorithms to pre-process all of it in order to carry out effective data analytics.

Let us discuss the importance of data quality.

A. Dimensions of data quality
The most critical dimensions are that data should be accurate, consistent, unique, timely, relevant, valid, and complete.

B. Profiling of data sets
Data profiling revolves around establishing a standard structure, supported by a definition and attributes describing the nature of the data, to enable an efficient assessment of data quality.

To guarantee data quality in a Big Data framework, the data may be assessed and transformed over several iterations, with the goal of cleansing it and progressing from an unstructured to a more structured state.

C. Framework for Big Data quality
Establishing a data quality framework determines and assures improved data quality. Procedures to scrub data, de-duplicate it, remove corrupted records, and many more sub-processes form part of such a quality framework.

D. Quality of input data
Data quality for the still-evolving field of Big Data is a highly complex subject. Large organisations once assumed that capturing data from different business processes, departments, sales, profits, geographies, and location parameters would enable them to amplify their business and spread strategically into more territories. The challenge, however, remained how to tap and mine these colossal volumes of data productively and qualitatively, using standardised quality procedures that also cleanse and improve data quality as part of the Big Data life cycle. Ensuring and transforming data into quality data is critical for any industry's Big Data platform to analyse the data accurately and extract patterns that help formulate future strategies as precisely as possible.
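As a concrete illustration of these dimensions (ours, not the paper's), the sketch below profiles completeness, uniqueness, and validity for a small pandas frame; the column names and the positive-price validity rule are assumptions.

```python
import pandas as pd

# Assumed toy stock table with a deliberate gap, duplicate, and invalid price.
df = pd.DataFrame({
    "ticker": ["A", "A", "B", "B"],
    "date": ["2020-11-06", "2020-11-06", "2020-11-06", "2020-11-07"],
    "price": [101.5, 101.5, None, -3.0],
})

profile = {
    "completeness": 1 - df.isna().mean().mean(),    # share of non-missing cells
    "uniqueness": 1 - df.duplicated().mean(),       # share of non-duplicate rows
    "validity": (df["price"].dropna() > 0).mean(),  # prices must be positive (assumed rule)
}
print(profile)
```

Each score lands in [0, 1], so such a profile can feed directly into threshold-based quality rules of the kind discussed later in the paper.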
II. LITERATURE SURVEY

Data is gathered from numerous sources. This raw data may be riddled with anomalies, including corrupted values, and may be badly formatted and unsuitable for consumption by a Big Data application: a mix of structured, semi-structured, and unstructured data. Such data needs to be sifted and cleansed, reformatted and structured, de-duplicated, stripped of illegal values, and compressed. These pre-processing steps are essential to transform the data to a level appropriate and meaningful for analysis.

Data, as we now understand, must be coherent and fit for use by its consumers. The suitability of data attributes such as format, structure, timing, and data-type variations must also be addressed from a quality standpoint. The better the data source, the better the output quality of the pre-processed data; that is, data quality is relative to its source quality. Consequently, this quality management and the repair of data distortions and complexities carry an indirect effort and cost in delivering quality data for analysis in Big Data frameworks and applications.

Data pre-processing is a data mining approach used to transform raw data into an efficient and useful format.
The commonly used techniques are described below.

A. Integration and Consolidation of Data
Big Data sourced from numerous locations can arrive in various formats and structures: structured, semi-structured, or unstructured, with varying layouts, garbage values, and so on. Data from all of these sources needs to be combined homogeneously to form a single, final source of truth for the data used in the Big Data framework. Technologies such as ETL (Extract, Transform and Load) are well-known, established mechanisms for this.

B. Enrichment and Enhancement of Big Data or Input Data
Input Big Data sets from different sources are consolidated, and during that process their details are refreshed using additional data obtained from other reliable sources, producing fused data that is enriched with more information and possibly improved in quality at the same time.

C. Transformation of Big Data
Data transformation involves numerous stages or sub-processes, such as capturing or pulling data from various sources; the data may be reformatted, standardised, aggregated, and even refreshed according to regulatory norms.

D. Reduction of Data
Data reduction is the process of shrinking the amount of data so that it becomes non-redundant. This helps increase storage efficiency and reduce costs by removing data that is not significant and retaining only the parts relevant to the specific task.

E. Discretization of Data
This procedure separates data into intervals so that it can be used efficiently by the available mining algorithms and methods.

F. Cleansing of Data
This is a procedure that improves data quality by removing the data that reduces its usability. The steps involved are removing incorrect, incomplete, or irrelevant records from the acquired data so that it can be processed and analysed to extract useful value.
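To make a few of these techniques concrete, the following minimal sketch (an illustration of ours, not from the paper; the data frames, column names, and bin edges are assumed) shows integration, cleansing, and discretization with pandas.

```python
import pandas as pd

# Two assumed source feeds sharing the same logical schema ("date", "price").
feed_a = pd.DataFrame({"date": ["2020-11-06", "2020-11-06"], "price": [101.5, 101.5]})
feed_b = pd.DataFrame({"date": ["2020-11-07"], "price": [103.2]})

# A. Integration and consolidation: combine the sources into one table.
merged = pd.concat([feed_a, feed_b], ignore_index=True)

# F. Cleansing: de-duplicate and drop incomplete records.
cleaned = merged.drop_duplicates().dropna()

# E. Discretization: bucket the continuous price column into intervals.
cleaned["price_band"] = pd.cut(cleaned["price"],
                               bins=[0, 100, 105, float("inf")],
                               labels=["low", "mid", "high"])
print(cleaned)
```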

III. PROPOSED WORK

Considering all these techniques, we propose a hybrid technique that leverages multiple algorithms across the various pre-processing steps. We bring in some novelty by understanding the data before pre-processing it and choosing the techniques wisely based on the data we have. It is more effective to keep pre-processing specific to your data than to generalise it. Pre-processing and data are always a one-to-one mapping; we cannot apply every pre-processing step to every available data set. It varies with the use case or the problem we are trying to solve. It is better to consider and combine the available techniques as needed, with minor modifications such as precision considerations, data ranging, and so on, to improve effectiveness wherever required. In most approaches, pre-processing is applied to the entire data set; we can instead filter and handpick data for pre-processing to make it more effective, while respecting the limitation that a minimum volume of data is required to carry out any analytics. As our use case is to develop an end-to-end system for stock prediction, data will arrive from a number of sources in various formats, which makes it all the more urgent to bring all the data into the same form, normalise it, and integrate it, all of which is done in multiple steps as part of data pre-processing.

IV. METHODOLOGY

Here we propose an innovative data evaluation step before the actual data pre-processing starts. In this evaluation, we conduct data quality (DQ) checks and numerous tests to determine how good or bad the data is. Based on the results of these DQ checks, we dynamically define our hybrid pre-processing technique, using only the pre-processing steps that the data and the DQ results call for. Each DQ check result feeds into choosing a pre-processing pipeline that is a hybrid combination of the various pre-processing steps. A few rules are defined to categorise or classify the data based on its quality, and the pre-processing steps are applied according to the class the data falls into. For example, if we learn that the data has no missing values, we need not take care of replacing or filling missing values, which saves pre-processing time. We can also define thresholds for some of the DQ checks; for example, a particular use case may tolerate 5 to 10% erroneous or skewed values without the results being affected, so the time needed to correct or remove them is saved.

As stock market data is a time series, let s(i, j, k) be the stock price on the i-th day of the j-th month of the k-th year. The following then represents the time series from the first day of the first month of year k to the last day of the twelfth month of year k+4, i.e. five years of data:

S = ( s(1, 1, k), s(2, 1, k), …, s(31, 1, k), s(1, 2, k), …, s(31, 12, k), s(1, 1, k+1), …, s(31, 12, k+4) )
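A minimal sketch of how this evaluation-driven selection could look in Python; the check names, thresholds, and step functions are illustrative assumptions on our part, not the authors' implementation.

```python
import pandas as pd

# Illustrative tolerance, mirroring the idea of per-check thresholds (assumed value).
MISSING_TOLERANCE = 0.0  # impute only if any values are actually missing

def evaluate_quality(df: pd.DataFrame) -> dict:
    """Run simple DQ checks and summarise the results."""
    return {
        "missing_ratio": df.isna().mean().max(),  # worst column's missing fraction
        "duplicate_ratio": df.duplicated().mean(),
        "n_rows": len(df),
    }

def build_pipeline(dq: dict) -> list:
    """Choose pre-processing steps dynamically from the DQ results."""
    steps = []
    if dq["missing_ratio"] > MISSING_TOLERANCE:
        steps.append(lambda df: df.fillna(df.mean(numeric_only=True)))
    if dq["duplicate_ratio"] > 0:
        steps.append(lambda df: df.drop_duplicates())
    return steps

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Apply only the steps the DQ evaluation selected."""
    for step in build_pipeline(evaluate_quality(df)):
        df = step(df)
    return df

raw = pd.DataFrame({"price": [100.0, None, 101.0, 101.0], "volume": [10, 12, 12, 12]})
print(preprocess(raw))
```

The point of the design is that a data set with no missing values never pays for an imputation pass, which is exactly the time saving the methodology describes.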
V. RESULTS AND DISCUSSIONS

TABLE I. POSSIBLE DQ CHECKS AND THEIR PRE-PROCESSING MAPPINGS

Data Quality Check            | Pre-Processing Step
Number of columns in data set | Remove unnecessary and unwanted columns if the number of columns > 100
Data skewness                 | If skewness < 5%, ignore; else remove skewed records from the dataset
Data completeness             | Missing values for an attribute should not exceed 25%; remove attributes with missing values/NaN > 25%
Data volume                   | The number of records should be between 1,000 and 10,000 for better analysis; correct the data accordingly (generate records for a low count, aggregate records for a high count)
Data distribution             | Analyse the data distribution and normalise the data
Data randomness               | Adjust data randomness to maintain data uniformity
Data uniqueness               | Remove duplicate records
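Two of these mappings translate directly into code. The sketch below implements the completeness and uniqueness rows; the 25% threshold comes from Table I, while the frame and column names are assumed for demonstration.

```python
import pandas as pd

MAX_MISSING = 0.25  # completeness threshold from Table I

def apply_completeness_check(df: pd.DataFrame) -> pd.DataFrame:
    """Drop attributes whose missing/NaN ratio exceeds 25%."""
    keep = df.columns[df.isna().mean() <= MAX_MISSING]
    return df[keep]

def apply_uniqueness_check(df: pd.DataFrame) -> pd.DataFrame:
    """Remove duplicate records."""
    return df.drop_duplicates()

# Assumed toy frame: "open" is 75% missing, and two rows are duplicates.
raw_df = pd.DataFrame({"open": [1.0, None, None, None],
                       "close": [1.1, 1.2, 1.2, 1.1]})
df = apply_uniqueness_check(apply_completeness_check(raw_df))
print(df)
```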

A. Comparison Study and Results
Below is a comparison study of various data pre-processing techniques and their results, in terms of error rate and execution time. It was carried out on a dataset of 50,000 stock data records with 10 columns.
TABLE II. COMPARISON STUDY AND RESULTS

Data Pre-Processing Step | Algorithm/Technique | Error Rate (%) | Execution Time (s)
Data Imputation | Hot deck imputation | 0.45 | 0.078
Data Imputation | Cold deck imputation | 0.37 | 0.082
Data Imputation | Regression imputation | 0.25 | 0.113
Data Imputation | Stochastic regression imputation | 0.21 | 0.124
Data Imputation | Interpolation and extrapolation | 0.41 | 0.119
Data Imputation | Mean imputation | 0.39 | 0.085
Data Imputation | KNN | 0.33 | 0.124
Data Imputation | Fuzzy K-Means | 0.27 | 0.168
Data Imputation | Singular value decomposition | 0.43 | 0.107
Data Imputation | Bayesian principal component analysis | 0.24 | 0.119
Data Ranging/Reduction | Data cube aggregation | 0.2 | 0.012
Data Ranging/Reduction | Dimensionality reduction | 0.5 | 0.0117
Data Ranging/Reduction | Data compression | 0.8 | 0.0123
Data Ranging/Reduction | Numerosity reduction | 0.65 | 0.0112
Data Ranging/Reduction | Discretisation and concept hierarchy generation | 0.4 | 0.0125
Data Ranging/Reduction | Data sampling | 0.3 | 0.0127
Data Ranging/Reduction | Feature selection | 0.9 | 0.0130
Data Distribution | Uniform | 0.2 | 0.0143
Data Distribution | Normal | 0.4 | 0.0157
Data Distribution | Uneven | 0.5 | 0.0135
Data Normalisation | Nominal scale variables | 0.23 | 0.010
Data Normalisation | Min-max normalisation | 0.27 | 0.075
Data Normalisation | Zero-mean normalisation | 0.32 | 0.015
Data Normalisation | Decimal scaling | 0.35 | 0.05
Data Normalisation | Bias and variance | 0.45 | 0.055
Data Normalisation | Selection of housekeeping genes (HG) | 0.50 | 0.06
Data Normalisation | Differential expression analysis | 0.30 | 0.05
Data Normalisation | Prediction errors | 0.42 | 0.083
Data Normalisation | Trimmed mean of M-values | 0.37 | 0.076
Data Normalisation | Upper quartile | 0.40 | 0.069
Noise Removal | Edited Nearest Neighbour (ENN) | 0.65 ± 1.2 | 0.0148
Noise Removal | Repeated ENN (RENN) | 0.65 ± 1.2 | 0.0150
Noise Removal | All-KNN | 0.65 ± 1.2 | 0.0080
Noise Removal | Decremental Reduction Optimization Procedure 2 (DROP2) | 0.268 ± 3.5 | 0.0055
Noise Removal | Decremental Reduction Optimization Procedure 4 (DROP4) | 0.303 ± 5.1 | 0.0070
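A comparison of this kind can be harnessed roughly as follows. This sketch of ours times one imputation technique and measures error on artificially masked values; the synthetic dataset, masking scheme, and error metric are assumptions, not the paper's protocol.

```python
import time
import numpy as np
import pandas as pd

def benchmark_imputation(df, impute_fn, mask_frac=0.1, seed=0):
    """Mask known values, impute them back, and report error rate and runtime."""
    rng = np.random.default_rng(seed)
    mask = rng.random(df.shape) < mask_frac  # hide ~10% of cells
    masked = df.mask(pd.DataFrame(mask, index=df.index, columns=df.columns))

    start = time.perf_counter()
    imputed = impute_fn(masked)
    elapsed = time.perf_counter() - start

    # Mean relative error on the hidden cells, as a percentage.
    err = (np.abs(imputed.values[mask] - df.values[mask])
           / np.abs(df.values[mask])).mean() * 100
    return err, elapsed

# Example: mean imputation on a synthetic 50,000 x 10 stock-like numeric table.
df = pd.DataFrame(np.random.default_rng(1).normal(100, 5, size=(50_000, 10)))
err, secs = benchmark_imputation(df, lambda d: d.fillna(d.mean()))
print(f"error rate: {err:.2f}%  execution time: {secs:.4f}s")
```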

TABLE III. RESULTS OF HYBRID TECHNIQUE (PROPOSED METHOD)

Data Pre-Processing Step | Algorithms/Techniques | Error Rate (%) | Execution Time (s)
Data Imputation | Mean + KNN | 0.23 | 0.022
Data Ranging/Reduction | Dimensionality reduction + Data sampling | 0.35 | 0.0023
Data Distribution | Uniform | 0.2 | 0.0048
Data Normalisation | Zero-mean normalisation + Nominal scale variables | 0.21 | 0.0089
Noise Removal | Drop NaN values | 0.22 | 0.0046
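The Mean + KNN imputation row could be realised in several ways; one plausible sketch, assuming a cheap mean-imputation pass on heavily gapped columns followed by KNN imputation of the remaining gaps (the split rule, threshold, and parameters are our assumptions, not the authors' stated design):

```python
import pandas as pd
from sklearn.impute import KNNImputer

def hybrid_mean_knn(df: pd.DataFrame, heavy_gap: float = 0.25, k: int = 5) -> pd.DataFrame:
    """Mean-impute numeric columns with many gaps, then KNN-impute what remains."""
    out = df.copy()
    ratios = out.isna().mean()
    heavy = ratios[ratios > heavy_gap].index
    out[heavy] = out[heavy].fillna(out[heavy].mean())       # coarse, cheap first pass
    imputed = KNNImputer(n_neighbors=k).fit_transform(out)  # refine remaining gaps
    return pd.DataFrame(imputed, columns=df.columns, index=df.index)
```

The intuition behind such a hybrid is that the mean pass limits how much work the more expensive KNN step must do, which is consistent with the lower execution time reported in Table III relative to pure KNN in Table II.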

VI. CONCLUSION

The results of the proposed method show that every pre-processing algorithm has its own pros and cons, so we have to pick the algorithm according to our data, use case, or problem statement. As stated before, no single algorithm fits every data set or every problem statement, and this held true throughout our analysis and experiments. After trying various combinations of pre-processing algorithms on stock market data, both historical and current transactional data, the results reveal that the combination of the mean and KNN algorithms gives better execution time and a minimal error rate in the data imputation step. Along similar lines, sampling the data and then carrying out dimensionality reduction, as part of data reduction, produces better execution time and a lower error rate. Stock data also fits and works very well when the data is uniform. Applying the nominal scale variables and zero-mean normalisation algorithms in sequence provides effective results for data normalisation. Hence, we could not gain efficiency in every pre-processing step: three of the five pre-processing steps give better results after applying hybrid techniques. Our research continues from this paper, and we will pursue further effectiveness in the data pre-processing steps in our upcoming work.
Example),” International Journal of Computer Applications (0975-
REFERENCES 8887), International Conference on Advancement in Engineering and
Technology (ICAET 2015).
[1] Nair, P., & Kashyap, I.,“Hybrid Preprocessing Technique for
Handling Imbalanced Data and Detecting Outliers for KNN [13] “Data pre-processing”, [online]. Available: http://
Classifier”, International Conference on Machine Learning, Big Data, www.techopedia.com/ definition/14650/data pre-processing.
Cloud and Parallel Computing (COMITCon),2019. [Accessed Mar. 20, 2018].
[2] P. Nair, and I. Kashyap,“ Optimization of kNN Classifier Using [14] Aneyrao, T. A., & Fadnavis, R. A., “Analysis for data pre-processing
Hybrid Pre-processing Model for Handling Imbalanced Data”, to prevent direct discrimination in data mining”, World Conference
unpublished. on Futuristic Trends in Research and Innovation for Social Welfare
(Start-up Conclave), 2016.
[3] Q. Kang, X. Chen,S Li, and M. Zhou. “A Noise-Filtered Under
Sampling Scheme for Imbalanced Classification”. IEEE [15] Hajian, Sara, and Josep Domingo-Ferrer. "A methodology for direct
TRANSACTIONS ON CYBERNETICS, Vol: 47, Issue: 12,Pages: and indirect discrimination prevention in data mining" Knowledge
4263 – 4274,2017. and Data Engineering, IEEE Transactions on 25.7: 1445-1459,2013.
[4] L. Wang, M. Zengd, B. Zou,W.Faran and X. Liu. “Effective
prediction of three common diseases by combining SMOTE with

4
Authorized licensed use limited to: Somaiya University. Downloaded on October 03,2024 at 02:34:51 UTC from IEEE Xplore. Restrictions apply.

You might also like