
PRINCIPLE OF DATA ANALYTICS

III YEAR II SEM
1. DATA MANAGEMENT

Syllabus:
Design Data Architecture and manage the data for analysis
Understand various sources of Data like Sensors/Signals/GPS etc.
Data Management
Data Quality (noise, outliers, missing values, duplicate data)
Data Processing

TEXTBOOKS:
1. Student's Handbook for Associate Analytics – II, III.
2. Data Mining Concepts and Techniques, Han, Kamber, 3rd Edition, Morgan Kaufmann Publishers.

REFERENCE BOOKS:
1. Introduction to Data Mining, Tan, Steinbach and Kumar, Addison Wesley, 2006.
2. Data Mining Analysis and Concepts, M. Zaki and W. Meira

INTRODUCTION

Data: Data is a collection of data objects and their attributes.

Attribute: An attribute is a property or characteristic of an object; a collection of attributes describes an object. An object is also known as an instance, record or entity. Examples: eye color of a person, temperature, etc.
An attribute is also known as a variable, field, characteristic, dimension, or feature.

Attributes can be categorical or quantitative.
Categorical attributes have a finite number of possible values, with no ordering among the values (e.g., occupation, brand, color).
Categorical attributes are also called nominal attributes, because their values are "names of things."
Quantitative attributes are numeric and have an implicit ordering among values (e.g., age, income, price).

DATA ARCHITECTURE:

Data architecture is a set of rules, policies, standards and models that govern and define the type of data collected and how it is used, stored, managed and integrated within an organization and its database systems.
It provides a formal approach to creating and managing the flow of data and how it is processed across an organization's IT systems and applications.

Data architecture describes the structure of an organization's logical and physical data assets and data management resources.

It is an offshoot of enterprise architecture that comprises the models, policies, rules, and standards that govern the collection, storage, arrangement, integration, and use of data in organizations.

An organization's data architecture is the purview of data architects.

Data architecture design is important for creating a vision of the interactions occurring between data systems.
Example: if a data architect wants to implement data integration, interaction between two systems is needed, and by using data architecture the visionary model of data interaction during the process can be achieved.

Data architecture also describes the type of data structures applied to manage data, and it provides an easy way for data preprocessing.
The data architecture is formed by dividing it into three essential models, which are then combined:

 Conceptual model:
It is a business model which uses the Entity Relationship (ER) model to represent the relationships between entities and their attributes.
 Logical model:
It is a model where the problem is represented in the form of logic such as rows and columns of data, classes, XML tags and other DBMS techniques.
 Physical model:
The physical model holds the database design, such as which type of database technology will be suitable for the architecture.
A data architect is responsible for the design, creation, management and deployment of the data architecture and defines how data is to be stored and retrieved; other decisions are made by internal bodies.
Factors that influence Data Architecture:
A few influences that can have an effect on data architecture are business policies, business requirements, technology used, economics, and data processing needs.

 Business requirements –
These include factors such as the expansion of business, the performance of system access, data management, transaction management, making use of raw data by converting it into image files and records, and then storing it in data warehouses. Data warehouses are the main aspect of storing transactions in business.

 Business policies –
The policies are rules that describe the way of processing data. These policies are made by internal organizational bodies and other government agencies.

 Technology in use –
This includes using examples of previously completed data architecture designs and also using existing licensed software purchases and database technology.

 Business economics –
Economic factors such as business growth and loss, interest rates, loans, the condition of the market, and the overall cost will also have an effect on the design architecture.

 Data processing needs –
These include factors such as mining of the data, large continuous transactions, database management, and other data pre-processing needs.

DATA ANALYTICS METHODS:

Data analysts use a number of methods and techniques to analyse data.

 Regression analysis: Regression analysis is a set of statistical processes used to estimate the relationships between variables to determine how changes to one or more variables might affect another. For example, how might social media spending affect sales? (A small sketch follows this list.)

 Factor analysis: Factor analysis is a statistical method for taking a massive data set and reducing it to a smaller, more manageable one. This has the added benefit of often uncovering hidden patterns. In a business setting, factor analysis is often used to explore things like customer loyalty.
 Cohort analysis: Cohort analysis is used to break a dataset down into groups that share common characteristics, or cohorts, for analysis. This is often used to understand customer segments.
 Cluster analysis: Statistics Solutions defines cluster analysis as "a class of techniques that are used to classify objects or cases into relative groups called clusters." It can be used to reveal structures in data — insurance firms might use cluster analysis to investigate why certain locations are associated with particular insurance claims, for instance.
 Time series analysis: Statistics Solutions defines time series analysis as "a statistical technique that deals with time series data, or trend analysis." Time series data means that data is in a series of particular time periods or intervals. Time series analysis can be used to identify trends and cycles over time, e.g., weekly sales numbers. It is frequently used for economic and sales forecasting.
 Sentiment analysis: Sentiment analysis uses tools such as natural language processing, text analysis, computational linguistics, and so on, to understand the feelings expressed in the data. While the previous methods seek to analyse quantitative data (data that can be measured), sentiment analysis seeks to interpret and classify qualitative data by organizing it into themes. It is often used to understand how customers feel about a brand, product, or service.
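To make the regression idea above concrete, here is a minimal Python sketch (not part of the syllabus) that fits a straight line to a small invented table of social media spend versus sales; the variable names and numbers are made up purely for illustration.

```python
# Minimal sketch: simple linear regression between two made-up variables.
import numpy as np

social_media_spend = np.array([10, 15, 20, 25, 30, 35], dtype=float)  # e.g. in thousands
sales = np.array([120, 150, 155, 180, 210, 230], dtype=float)

# Fit a degree-1 polynomial (a straight line): sales ~ slope * spend + intercept
slope, intercept = np.polyfit(social_media_spend, sales, deg=1)

# Correlation coefficient gives a rough idea of how strong the linear relation is.
r = np.corrcoef(social_media_spend, sales)[0, 1]

print(f"sales = {slope:.2f} * spend + {intercept:.2f} (r = {r:.3f})")
```

A positive slope here would suggest that higher spend is associated with higher sales in this made-up data; a real analysis would of course need real data and further checks.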

UNDERSTANDING DIFFERENT SOURCES OF DATA:

Data collection is the process of acquiring, collecting, extracting, and storing the voluminous amount of data, which may be in structured or unstructured form like text, video, audio, XML files, records, or other image files, used in later stages of data analysis. In the process of big data analysis, "data collection" is the initial step before starting to analyze the patterns or useful information in data. The data which is to be analyzed must be collected from different valid sources.

The data which is collected is known as raw data; it is not useful as such, but after cleaning the impure data and utilizing it for further analysis it becomes information, and the information obtained is known as "knowledge".
Knowledge has many meanings, like business knowledge, sales of enterprise products, disease treatment, etc. The main goal of data collection is to collect information-rich data.
Data collection starts with asking some questions, such as what type of data is to be collected and what is the source of collection.

Most of the data collected is of two types: "qualitative data", which is a group of non-numerical data such as words and sentences that mostly focus on the behaviour and actions of the group, and "quantitative data", which is in numerical form and can be calculated using different scientific tools and sampling data.
The actual data is then further divided mainly into two types known as:
1. Primary data
2. Secondary data

1. Primary data:

The data which is raw, original, and extracted directly from official sources is known as primary data. This type of data is collected directly by performing techniques such as questionnaires, interviews, and surveys. The data collected must be according to the demand and requirements of the target audience on which the analysis is performed; otherwise it would be a burden in data processing.
A few methods of collecting primary data:
1. Interview method:
The data collected during this process is obtained by interviewing the target audience by a person called the interviewer, and the person who answers the interview is known as the interviewee. Some basic business or product related questions are asked and noted down in the form of notes, audio, or video, and this data is stored for processing. These can be both structured and unstructured, like personal interviews or formal interviews through telephone, face to face, email, etc.
2. Survey method:
The survey method is the process of research where a list of relevant questions is asked and answers are noted down in the form of text, audio, or video. The survey method can be carried out in both online and offline mode, for example through website forms and email. The survey answers are then stored for analysing the data. Examples are online surveys or surveys through social media polls.
3. Observation method:
The observation method is a method of data collection in which the researcher keenly observes the behaviour and practices of the target audience using some data collecting tool and stores the observed data in the form of text, audio, video, or any raw format. In this method, the data is collected directly by posing a few questions to the participants. For example, observing a group of customers and their behaviour towards the products. The data obtained is then sent for processing.
4. Experimental method:
The experimental method is the process of collecting data through performing experiments, research, and investigation. The most frequently used experimental designs are CRD, RBD, LSD and FD.

 CRD – Completely Randomized Design is a simple experimental design used in data analytics which is based on randomization and replication. It is mostly used for comparing experiments.
 RBD – Randomized Block Design is an experimental design in which the experiment is divided into small units called blocks. Random experiments are performed on each of the blocks and results are drawn using a technique known as analysis of variance (ANOVA). RBD originated in the agriculture sector.
 LSD – Latin Square Design is an experimental design that is similar to CRD and RBD but contains rows and columns. It is an arrangement of N x N squares with an equal number of rows and columns which contain letters that occur only once in each row and column. Hence the differences can be found easily with fewer errors in the experiment. A Sudoku puzzle is an example of a Latin square design.
 FD – Factorial Design is an experimental design where each experiment has two or more factors, each with possible values, and on performing trials other combinations of factors are derived.

2. Secondary data:

Secondary data is data which has already been collected and is reused again for some valid purpose. This type of data was previously recorded from primary data and it has two types of sources, named internal source and external source.
Internal source:
These types of data can easily be found within the organization, such as market records, sales records, transactions, customer data, accounting resources, etc. The cost and time consumption is less in obtaining data from internal sources.
External source:
The data which can't be found in internal organizations and can be gained through external third party resources is external source data. The cost and time consumption is more because this contains a huge amount of data. Examples of external sources are government publications, news publications, the Registrar General of India, the Planning Commission, the International Labour Bureau, syndicate services, and other non-governmental publications.
Other sources:
 Sensor data: With the advancement of IoT devices, the sensors of these devices collect data which can be used for sensor data analytics to track the performance and usage of products.
Sensor data is the output of a device that detects and responds to some type of input from the physical environment. The output may be used to provide information or input to another system or to guide a process.
 Satellite data: Satellites collect a lot of images and data, in terabytes, on a daily basis through surveillance cameras, which can be used to collect useful information.
 Web traffic: Due to fast and cheap internet facilities, many formats of data uploaded by users on different platforms can be collected, with their permission, for data analysis. Search engines also provide their data through the keywords and queries searched most often.

Data Management:

 Data management is the process of managing tasks like extracting data, storing data, transferring data, processing data, and then securing data with low-cost consumption.
 The main motive of data management is to manage and safeguard people's and organizations' data in an optimal way so that they can easily create, access, delete, and update the data.
 Data management is an essential process in each and every enterprise's growth, without which policies and decisions can't be made for business advancement. The better the data management, the better the productivity of the business.

 Large volumes of data like big data are harder to manage traditionally, so optimal technologies and tools for data management such as Hadoop, Scala, Tableau, AWS, etc. must be used, which can further be used for big data analysis and for achieving improvements in patterns.
 Data management can be achieved by training the employees as necessary and through maintenance by DBAs, data analysts, and data architects.

Managing digital data in an organization involves a broad range of tasks, policies, procedures, and practices. The work of data management has a wide scope, covering factors such as how to:

 Create, access, and update data across a diverse data tier
 Store data across multiple clouds and on premises
 Provide high availability and disaster recovery
 Use data in a growing variety of apps, analytics, and algorithms
 Ensure data privacy and security
 Archive and destroy data in accordance with retention schedules and compliance requirements

A data management platform is the foundational system for collecting and analyzing large volumes of data across an organization. Commercial data platforms typically include software tools for management, developed by the database vendor or by third-party vendors. These data management solutions help IT teams and DBAs perform typical tasks such as:
 Identifying, alerting, diagnosing, and resolving faults in the database system or underlying infrastructure
 Allocating database memory and storage resources
 Making changes in the database design
 Optimizing responses to database queries for faster application performance

The increasingly popular cloud data platforms allow businesses to scale up or down quickly and cost-effectively. Some are available as a service, allowing organizations to save even more.

Remote sensing
Remote sensing is the acquisition of information about an object or phenomenon without making physical contact with it.

 A source of radiation (A). It can be natural or artificial. Radiation emitted by the source reaches the Earth's surface and is altered by the presence of objects on that surface. Remote sensing studies that alteration. Objects themselves can emit radiation as well.

 Objects (B) that interact with radiation or can emit it, as mentioned above.

 An atmosphere (C) through which radiation moves from the source to the objects. The atmosphere also interacts with the radiation and alters it.

 A receiver (D) which receives the radiation once it has been emitted or altered by the objects. The receiver measures the intensity of the radiation coming from different points in the area being studied and, with these measurements, generates its final product (in most cases, an image).
For describing the receivers that form part of a remote sensing system, we separate them into two components: sensors and platforms.

The sensor is the element that can read the electromagnetic radiation and register its intensity for a given zone of the spectrum. It can be a simple photographic camera or a more specialized sensor.
Passive sensors use a natural source of radiation (in most cases, sunlight) and just measure that radiation as it is reflected from the Earth's surface. Active sensors emit their own radiation, and then collect it back after it has been reflected.
Sensor data is the output of a device that detects and responds to some type of input from the physical environment. The output may be used to provide information or input to another system or to guide a process.

Dimensionality Reduction:

Dimensionality reduction is the process of reducing the number of random variables or attributes under consideration. High-dimensionality reduction has emerged as one of the significant tasks in data mining applications.

In machine learning classification problems, there are often too many factors on the basis of which the final classification is done. These factors are basically variables called features. The higher the number of features, the harder it gets to visualize the training set and then work on it. Sometimes, most of these features are correlated, and hence redundant. This is where dimensionality reduction algorithms come into play. Dimensionality reduction is the process of reducing the number of random variables under consideration by obtaining a set of principal variables. It can be divided into feature selection and feature extraction.

There are two components of dimensionality reduction:
 Feature selection: In this, we try to find a subset of the original set of variables, or features, to get a smaller subset which can be used to model the problem. It usually involves three ways:

1. Filter
2. Wrapper
3. Embedded

 Feature extraction: This reduces the data in a high dimensional space to a lower dimensional space, i.e. a space with a smaller number of dimensions (a small sketch follows this list).
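As a hedged illustration of feature extraction, the sketch below uses principal component analysis (PCA) from scikit-learn, assuming that library is available; the random 5-feature data and the choice of 2 components are arbitrary and for demonstration only.

```python
# Minimal feature-extraction sketch using PCA (assumes scikit-learn is installed).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # 100 samples, 5 original features (made up)

pca = PCA(n_components=2)              # keep 2 principal components
X_reduced = pca.fit_transform(X)       # shape becomes (100, 2)

print(X_reduced.shape)
print(pca.explained_variance_ratio_)   # share of variance kept by each component
```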

Common Methods to Perform Dimensionality Reduction:

a. Missing Values
While exploring data, if we encounter missing values, what do we do? Our first step should be to identify the reason. Then we need to impute the missing values or drop the variables using appropriate methods. But what if we have too many missing values? Should we impute the missing values or drop the variables?

b. Low Variance
Let's think of a scenario where we have a constant variable (all observations have the same value, 5) in our data set. Do you think it can improve the power of a model? Of course NOT, because it has zero variance. A small sketch of this check is shown below.
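A minimal sketch of the low-variance check, assuming pandas is available; the DataFrame and the zero-variance column are invented for illustration.

```python
# Minimal sketch: drop columns whose variance is (near) zero.
import pandas as pd

df = pd.DataFrame({
    "age":      [23, 45, 31, 52, 40],
    "constant": [5, 5, 5, 5, 5],       # zero variance: carries no information
    "income":   [30, 80, 55, 90, 60],
})

variances = df.var()
low_variance_cols = variances[variances < 1e-8].index
df = df.drop(columns=low_variance_cols)
print(df.columns.tolist())             # 'constant' has been removed
```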

c. Decision Trees
This is one of my favourite techniques. We can use it as an all-round solution to tackle multiple challenges, such as missing values, outliers and identifying significant variables. Several data scientists have used decision trees and it worked well for them.

d. Random Forest
Random Forest is similar to a decision tree. Just be careful that random forests have a tendency to bias towards variables that have a larger number of distinct values, i.e. they favor numeric variables over binary/categorical values.

e. High Correlation
Dimensions exhibiting higher correlation can lower the performance of a model. Moreover, it is not good to have multiple variables carrying similar information. You can use a Pearson correlation matrix to identify the variables with high correlation and select one of them using the VIF (Variance Inflation Factor), as sketched below.
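A minimal sketch of the high-correlation check, assuming pandas and NumPy; the data is invented, and one feature of each highly correlated pair is dropped with a simple 0.95 threshold rather than the VIF-based choice mentioned above.

```python
# Minimal sketch: find pairs of highly correlated features and flag one of each pair.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
x = rng.normal(size=200)
df = pd.DataFrame({
    "x":      x,
    "x_copy": x + rng.normal(scale=0.01, size=200),  # almost identical to x
    "noise":  rng.normal(size=200),
})

corr = df.corr(method="pearson").abs()
# Keep only the upper triangle so each pair is considered once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
print(to_drop)                       # 'x_copy' is flagged for removal
```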

f. Backward Feature Elimination
In this method, we start with all n dimensions. Compute the sum of squared residuals (SSR) after eliminating each variable (n times). Then identify the variable whose removal has produced the smallest increase in the SSR and remove it, leaving us with n-1 input features.

Repeat this process until no other variables can be dropped; one elimination round is sketched below.
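The procedure can be sketched directly with NumPy; one elimination round is shown on randomly generated data, dropping the feature whose removal increases the sum of squared residuals the least. This is an illustrative sketch under made-up data, not a prescribed implementation.

```python
# Minimal sketch of one round of backward feature elimination (made-up data).
import numpy as np

def ssr(X, y):
    """Sum of squared residuals of an ordinary least-squares fit."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ coef
    return float(residuals @ residuals)

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4))                                    # 4 candidate features
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)  # features 2, 3 irrelevant

# Try removing each feature in turn and record the resulting SSR.
scores = {j: ssr(np.delete(X, j, axis=1), y) for j in range(X.shape[1])}
drop = min(scores, key=scores.get)     # removal causing the smallest SSR increase
print("drop feature", drop)            # expected: 2 or 3 (an irrelevant feature)
X = np.delete(X, drop, axis=1)         # repeat until no variable can be dropped
```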

DATA QUALITY:

Data quality is a measure of the condition of data based on factors such as accuracy, completeness, consistency, reliability and whether it is up to date.

Several key factors to examine data quality:

 Existence
o Is there data to work with?
o Example: Did the organization actually collect data on sales performance in China?
 Consistency
o If a data point appears in multiple locations, does it bear the same meaning?
o Example: In data sets that contain revenue by store for a given week, is the same number associated with a particular store in all data sets?
 Accuracy
o Does the data represent real facts and properties?
o Example: Are reported sales representative of what actually happened in the store?
 Integrity
o Does the data depict genuine relationships?
o Example: In a report of customers and billing addresses, is each customer linked to the right billing address?
 Validity
o Do the data entries make sense?
o Example: If data in a column "location" is linked to data "price," are the related values consistent with allowable values in the data set and when compared with external benchmarks?

Organizations with good data quality practices will have a process for automating data collection and entry (since many mistakes are caused by human error), user profiles defining who should be able to access different data types, and a dashboard to monitor data quality changes over time.

Poor data quality negatively affects many data processing efforts. Examples of data quality problems:

 Noise and outliers
 Missing values
 Duplicate data

 Wrong data
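A minimal pandas sketch that surfaces several of the problems listed above (missing values, duplicate rows, and a crude outlier check) on an invented table:

```python
# Minimal sketch: quick data-quality report for missing values, duplicates and outliers.
# The DataFrame below is made up and contains deliberate problems.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "customer": ["A", "B", "B", "C", "D", "E"],
    "age":      [25, 32, 32, np.nan, 300, 28],   # NaN = missing value, 300 = outlier
})

print(df.isna().sum())            # missing values per column
print(df.duplicated().sum())      # number of exact duplicate rows

# Crude outlier check using the 1.5 * IQR rule on the numeric column.
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)]
print(outliers)                   # the row with age 300 is flagged
```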

Handling noisy or incomplete data:

The data stored in a database may reflect noise, exceptional cases, or incomplete data objects. When mining data regularities, these objects may confuse the process, causing the knowledge model constructed to overfit the data. As a result, the accuracy of the discovered patterns can be poor. Data cleaning methods and data analysis methods that can handle noise are required, as well as outlier mining methods for the discovery and analysis of exceptional cases.

DATA CLEANING AS A PROCESS

(I) Missing Values:

Missing values can be filled in by the following methods:

1. Ignore the tuple: This is usually done when the class label is missing (assuming the mining task involves classification). This method is not very effective unless the tuple contains several attributes with missing values. It is especially poor when the percentage of missing values per attribute varies considerably.
2. Fill in the missing value manually: In general, this approach is time-consuming and may not be feasible given a large data set with many missing values.
3. Use a global constant to fill in the missing value: Replace all missing attribute values by the same constant, such as a label like "Unknown". If missing values are replaced by, say, "Unknown", then the mining program may mistakenly think that they form an interesting concept, since they all have a value in common — that of "Unknown". Hence, although this method is simple, it is not foolproof.
4. Use the attribute mean to fill in the missing value: For example, suppose that the average income of AllElectronics customers is $56,000. Use this value to replace the missing value for income.
5. Use the attribute mean for all samples belonging to the same class as the given tuple: For example, if classifying customers according to credit risk, replace the missing value with the average income value for customers in the same credit risk category as that of the given tuple.
6. Use the most probable value to fill in the missing value: This may be determined with regression, inference-based tools using a Bayesian formalism, or decision tree induction. For example, using the other customer attributes in your data set, you may construct a decision tree to predict the missing values for income.
A small sketch of methods 4 and 5 follows this list.
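Methods 4 and 5 can be sketched in a few lines of pandas; the column names, values, and the credit-risk grouping below are invented for illustration.

```python
# Minimal sketch of mean imputation (method 4) and class-wise mean imputation (method 5).
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "credit_risk": ["low", "low", "high", "high", "low"],
    "income":      [56000, 48000, np.nan, 30000, np.nan],
})

# Method 4: fill missing income with the overall attribute mean.
overall_fill = df["income"].fillna(df["income"].mean())

# Method 5: fill missing income with the mean of the same credit-risk class.
class_fill = df.groupby("credit_risk")["income"].transform(lambda s: s.fillna(s.mean()))

print(overall_fill.tolist())
print(class_fill.tolist())
```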

(II) Noisy Data

Noise is a random error or variance in a measured variable. The following are data smoothing techniques:

1. Binning: Binning methods smooth a sorted data value by consulting its "neighborhood," that is, the values around it. The sorted values are distributed into a number of "buckets," or bins.
In smoothing by bin means, each value in a bin is replaced by the mean value of the bin. For example, the mean of the values 4, 8, and 15 in Bin 1 is 9. Therefore, each original value in this bin is replaced by the value 9.
Similarly, smoothing by bin medians can be employed, in which each bin value is replaced by the bin median. In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin boundaries. Each bin value is then replaced by the closest boundary value.

Binning methods for data smoothing (figure)
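A minimal sketch of smoothing by bin means, using the values 4, 8 and 15 mentioned above together with a few additional made-up values, split into equal-frequency ("equal-depth") bins:

```python
# Minimal sketch: smoothing by bin means on a small sorted list of values.
import numpy as np

values = np.array(sorted([4, 8, 15, 21, 21, 24, 25, 28, 34]), dtype=float)
n_bins = 3
bins = np.array_split(values, n_bins)          # equal-frequency bins of 3 values each

smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])
print(smoothed)     # Bin 1 (4, 8, 15) -> 9, 9, 9; the other bins likewise
```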

2. Regression: Data can be smoothed by fitting the data to a function, such as with regression. Linear regression involves finding the "best" line to fit two attributes so that one attribute can be used to predict the other.
Multiple linear regression is an extension of linear regression, where more than two attributes are involved and the data are fit to a multidimensional surface.

3. Clustering: Outliers may be detected by clustering, where similar values are organized into groups, or "clusters." Intuitively, values that fall outside of the set of clusters may be considered outliers.
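As a small illustration of the clustering idea, the sketch below uses DBSCAN from scikit-learn (one possible choice, assumed to be installed), which labels points that fall outside every cluster as -1; the 2-D points are invented.

```python
# Minimal sketch: clustering-based outlier detection with DBSCAN (assumes scikit-learn).
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(3)
cluster_a = rng.normal(loc=[0, 0], scale=0.3, size=(50, 2))   # made-up dense cluster
cluster_b = rng.normal(loc=[5, 5], scale=0.3, size=(50, 2))   # another dense cluster
far_points = np.array([[10.0, -10.0], [-8.0, 9.0]])           # points far from both
X = np.vstack([cluster_a, cluster_b, far_points])

labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)
print(X[labels == -1])     # points labelled -1 fall outside all clusters (outliers)
```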

(III) Data Cleaning as a Process

The first step in data cleaning as a process is discrepancy detection.

Discrepancies can be caused by several factors, including poorly designed data entry forms that have many optional fields, human error in data entry, deliberate errors (e.g., respondents not wanting to divulge information about themselves), and data decay (e.g., outdated addresses). Discrepancies may also arise from system errors, errors during processing, and inconsistent data representation.
By using metadata, which is data about data, we can detect and avoid discrepancies and filter out unnecessary data from data sets.
Field overloading is another source of errors; it occurs when new attributes are defined in unused portions of already defined attributes.

The data should also be examined regarding unique rules, consecutive rules and null rules:

A unique rule says that each value of the given attribute must be different from all other values for that attribute.
A consecutive rule says that there can be no missing values between the lowest and highest values for the attribute, and that all values must also be unique (e.g., as in check numbers).
A null rule specifies the use of blanks, question marks, special characters, or other strings that may indicate the null condition (e.g., where a value for a given attribute is not available).

The tools that can aid in the step of discrepancy detection:

Data scrubbing tools use simple domain knowledge (e.g., knowledge of postal addresses, and spell-checking) to detect errors and make corrections in the data.
Data auditing tools find discrepancies by analyzing the data to discover rules and relationships, and detecting data that violate such conditions. Some data inconsistencies may be corrected manually. Data transformations are also used for discrepancies, and these transforms are defined to correct them.
Data migration tools allow simple transformations to be specified, such as replacing the string "gender" by "sex".
ETL (extraction/transformation/loading) tools allow users to specify transforms through a graphical user interface (GUI).

PREPROCESSING THE DATA

Data preprocessing techniques are used to deal with incomplete, noisy and inconsistent data. Before the mining process, we have to check whether any outliers, noisy values or inconsistencies are present in the data; if the data to be mined is noisy, incomplete or inconsistent, these problems are to be eliminated by using preprocessing techniques.

If the data supplied to mining is full of outliers and errors, the resulting patterns will also be of low quality. Incomplete data can occur for a number of reasons: attributes of interest may not always be available, values may be misunderstood while the record is being entered, or equipment may malfunction.

Noisy data can occur for a number of reasons: the instruments that collect the data may be faulty, there may be human or computer errors while entering the data, and there may also be errors while transforming the data.

To avoid all these problems, the following data preprocessing techniques are used:
 Data Cleaning
 Data Integration
 Data Transformation
 Data Reduction

1. Data Cleaning is a process which is used to clean dirty or noisy data, remove outliers and resolve inconsistencies. Dirty data may lead to confusion and unreliable output.

2. Data Integration is a technique or process which is used to integrate or consolidate a number of files, databases, data cubes, etc. The attributes representing a particular concept may have distinct names in distinct databases, which may lead to inconsistency and redundancy.

3. Data Transformation is a process which is used to convert the data into an appropriate form for mining. For instance, the best examples are normalization and aggregation (see the sketch after this list).

4. Data Reduction is a process which is used to remove irrelevant and unused data from the data sets (i.e. to reduce them into smaller volumes).
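A minimal sketch of the normalization mentioned under Data Transformation, using simple min-max scaling on an invented column of values:

```python
# Minimal sketch: min-max normalization (a common data transformation) on made-up values.
import numpy as np

income = np.array([12000, 35000, 47000, 85000, 60000], dtype=float)

# Rescale to the range [0, 1]: (x - min) / (max - min)
income_scaled = (income - income.min()) / (income.max() - income.min())
print(income_scaled)
```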

Forms of data preprocessing (figure)

Glossary

Sensor data: Sensor data is the output of a device that detects and responds to some type of input from the physical environment. The output may be used to provide information or input to another system or to guide a process.

Data discretization is defined as a process of converting continuous data attribute values into a finite set of intervals and associating with each interval some specific data value. If discretization leads to an unreasonably small number of data intervals, then it may result in significant information loss.

Dimensionality reduction is the process of reducing the number of random variables or attributes under consideration. High-dimensionality data reduction, as part of a data pre-processing step, is extremely important in many real-world applications.

Attribute subset selection is a technique which is used for data reduction in the data mining process. Data reduction reduces the size of data so that it can be used for analysis purposes more efficiently. The need for attribute subset selection arises because a data set may have a large number of attributes.

Data quality refers to the overall utility of a dataset(s) as a function of its ability to be easily processed and analyzed for other uses, usually by a database, data warehouse, or data analytics system.

Previous JNTUH Exam Questions

1. What is an attribute? Explain different types of attributes.
2. What is data pre-processing?
3. Explain various methods for handling missing values.
4. What is data analytics? What is the need of data analytics?
5. Explain different types of data analytics.
6. What is data architecture? Explain data integration.
7. Explain the difference between information and data.
8. What is sensor data?
9. How are missing values handled?
10. How to handle noisy data in data mining?

Important Questions:

1. What is data architecture?
2. Differentiate between data and information.
3. Write a note on attribute subset selection for data reduction.
4. Explain data quality in detail.
5. What are outliers? Explain with an example.
6. What do you mean by data preprocessing? Why is it needed?
7. Explain various methods for handling missing data values.
8. Explain sampling methods for data reduction.
9. How to handle noisy data?
10. Define the following:
(a) Data Cleaning
(b) Data preprocessing
(c) Data integration
(d) Data transformation
(e) Cluster
(f) Outlier
(g) Data quality
