
PRINCIPLE OF DATA ANALYTICS

III YEAR II SEM
1. DATA MANAGEMENT

Syllabus:
Design Data Architecture and manage the data for analysis
Understand various sources of Data like Sensors/Signals/GPS etc.
Data Management
Data Quality (noise, outliers, missing values, duplicate data)
Data Processing

TEXTBOOKS:
1. Student's Handbook for Associate Analytics – II, III.
2. Data Mining Concepts and Techniques, Han, Kamber, 3rd Edition, Morgan Kaufmann Publishers.

REFERENCE BOOKS:
1. Introduction to Data Mining, Tan, Steinbach and Kumar, Addison Wesley, 2006.
2. Data Mining Analysis and Concepts, M. Zaki and W. Meira

INTRODUCTION

Data: Data is a collection of data objects and their attributes.

Attribute: An attribute is a property or characteristic of an object; a collection of attributes describes an object. An object is also known as an instance, record or entity. Examples: eye color of a person, temperature, etc.
An attribute is also known as a variable, field, characteristic, dimension, or feature.

Attributes can be categorical or quantitative.
Categorical attributes have a finite number of possible values, with no ordering among the values (e.g., occupation, brand, color).
Categorical attributes are also called nominal attributes, because their values are "names of things."
Quantitative attributes are numeric and have an implicit ordering among values (e.g., age, income, price).

DATA ARCHITECTURE:

Data architecture is a set of rules, policies, standards and models that govern and define the type of data collected and how it is used, stored, managed and integrated within an organization and its database systems.
It provides a formal approach to creating and managing the flow of data and how it is processed across an organization's IT systems and applications.

Data architecture describes the structure of an organization's logical and physical data assets and data management resources.

It is an offshoot of enterprise architecture that comprises the models, policies, rules, and standards that govern the collection, storage, arrangement, integration, and use of data in organizations.

An organization's data architecture is the purview of data architects.

Data architecture design is important for creating a vision of the interactions occurring between data systems.
Example: if a data architect wants to implement data integration, interaction between two systems is needed, and by using data architecture the visionary model of data interaction during the process can be achieved.

Data architecture also describes the type of data structures applied to manage data, and it provides an easy way for data preprocessing.
The data architecture is formed by dividing it into three essential models, which are then combined:

 Conceptual model:
It is a business model which uses the Entity Relationship (ER) model to represent the relationships between entities and their attributes.
 Logical model:
It is a model where the problem is represented in the form of logic such as rows and columns of data, classes, XML tags and other DBMS techniques.
 Physical model:
The physical model holds the database design, such as which type of database technology will be suitable for the architecture.
A data architect is responsible for the design, creation, management and deployment of the data architecture and defines how data is to be stored and retrieved; other decisions are made by internal bodies.
Factors that influence Data Architecture:
A few influences that can have an effect on data architecture are business policies, business requirements, technology used, economics, and data processing needs.

 Business requirements –
These include factors such as the expansion of business, the performance of system access, data management, transaction management, making use of raw data by converting it into image files and records, and then storing it in data warehouses. Data warehouses are the main aspect of storing transactions in business.

 Business policies –
The policies are rules that describe the way of processing data. These policies are made by internal organizational bodies and other government agencies.

 Technology in use –
This includes using examples of previously completed data architecture designs and also using existing licensed software purchases and database technology.

 Business economics –
Economic factors such as business growth and loss, interest rates, loans, the condition of the market, and the overall cost will also have an effect on the design architecture.

 Data processing needs –
These include factors such as mining of the data, large continuous transactions, database management, and other data pre-processing needs.

DATA ANALYTICS METHODS:

Data analysts use a number of methods and techniques to analyse data.

 Regression analysis: Regression analysis is a set of statistical processes used to estimate the relationships between variables to determine how changes to one or more variables might affect another. For example, how might social media spending affect sales? (A small sketch follows this list.)

 Factor analysis: Factor analysis is a statistical method for taking a massive data set and reducing it to a smaller, more manageable one. This has the added benefit of often uncovering hidden patterns. In a business setting, factor analysis is often used to explore things like customer loyalty.
 Cohort analysis: Cohort analysis is used to break a dataset down into groups that share common characteristics, or cohorts, for analysis. This is often used to understand customer segments.
 Cluster analysis: Statistics Solutions defines cluster analysis as "a class of techniques that are used to classify objects or cases into relative groups called clusters." It can be used to reveal structures in data — insurance firms might use cluster analysis to investigate why certain locations are associated with particular insurance claims, for instance.
 Time series analysis: Statistics Solutions defines time series analysis as "a statistical technique that deals with time series data, or trend analysis." Time series data means that data is in a series of particular time periods or intervals. Time series analysis can be used to identify trends and cycles over time, e.g., weekly sales numbers. It is frequently used for economic and sales forecasting.
 Sentiment analysis: Sentiment analysis uses tools such as natural language processing, text analysis, computational linguistics, and so on, to understand the feelings expressed in the data. While the previous methods seek to analyse quantitative data (data that can be measured), sentiment analysis seeks to interpret and classify qualitative data by organizing it into themes. It is often used to understand how customers feel about a brand, product, or service.
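To make the regression idea above concrete, here is a minimal Python sketch (not part of the syllabus) that fits a straight line to a small invented table of social media spend versus sales; the variable names and numbers are made up purely for illustration.

```python
# Minimal sketch: simple linear regression between two made-up variables.
import numpy as np

social_media_spend = np.array([10, 15, 20, 25, 30, 35], dtype=float)  # e.g. in thousands
sales = np.array([120, 150, 155, 180, 210, 230], dtype=float)

# Fit a degree-1 polynomial (a straight line): sales ~ slope * spend + intercept
slope, intercept = np.polyfit(social_media_spend, sales, deg=1)

# Correlation coefficient gives a rough idea of how strong the linear relation is.
r = np.corrcoef(social_media_spend, sales)[0, 1]

print(f"sales = {slope:.2f} * spend + {intercept:.2f} (r = {r:.3f})")
```

A positive slope here would suggest that higher spend is associated with higher sales in this made-up data; a real analysis would of course need real data and further checks.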

UNDERSTANDING DIFFERENT SOURCES OF DATA:

Data collection is the process of acquiring, collecting, extracting, and storing the voluminous amount of data, which may be in structured or unstructured form like text, video, audio, XML files, records, or other image files, used in later stages of data analysis. In the process of big data analysis, "data collection" is the initial step before starting to analyze the patterns or useful information in data. The data which is to be analyzed must be collected from different valid sources.

The data which is collected is known as raw data; it is not useful as such, but after cleaning the impure data and utilizing it for further analysis it becomes information, and the information obtained is known as "knowledge".
Knowledge has many meanings, like business knowledge, sales of enterprise products, disease treatment, etc. The main goal of data collection is to collect information-rich data.
Data collection starts with asking some questions, such as what type of data is to be collected and what is the source of collection.

Most of the data collected is of two types: "qualitative data", which is a group of non-numerical data such as words and sentences that mostly focus on the behaviour and actions of the group, and "quantitative data", which is in numerical form and can be calculated using different scientific tools and sampling data.
The actual data is then further divided mainly into two types known as:
1. Primary data
2. Secondary data

1. Primary data:

The data which is raw, original, and extracted directly from official sources is known as primary data. This type of data is collected directly by performing techniques such as questionnaires, interviews, and surveys. The data collected must be according to the demand and requirements of the target audience on which the analysis is performed; otherwise it would be a burden in data processing.
A few methods of collecting primary data:
1. Interview method:
The data collected during this process is obtained by interviewing the target audience by a person called the interviewer, and the person who answers the interview is known as the interviewee. Some basic business or product related questions are asked and noted down in the form of notes, audio, or video, and this data is stored for processing. These can be both structured and unstructured, like personal interviews or formal interviews through telephone, face to face, email, etc.
2. Survey method:
The survey method is the process of research where a list of relevant questions is asked and answers are noted down in the form of text, audio, or video. The survey method can be carried out in both online and offline mode, for example through website forms and email. The survey answers are then stored for analysing the data. Examples are online surveys or surveys through social media polls.
3. Observation method:
The observation method is a method of data collection in which the researcher keenly observes the behaviour and practices of the target audience using some data collecting tool and stores the observed data in the form of text, audio, video, or any raw format. In this method, the data is collected directly by posing a few questions to the participants. For example, observing a group of customers and their behaviour towards the products. The data obtained is then sent for processing.
4. Experimental method:
The experimental method is the process of collecting data through performing experiments, research, and investigation. The most frequently used experimental designs are CRD, RBD, LSD and FD.

 CRD – Completely Randomized Design is a simple experimental design used in data analytics which is based on randomization and replication. It is mostly used for comparing experiments.
 RBD – Randomized Block Design is an experimental design in which the experiment is divided into small units called blocks. Random experiments are performed on each of the blocks and results are drawn using a technique known as analysis of variance (ANOVA). RBD originated in the agriculture sector.
 LSD – Latin Square Design is an experimental design that is similar to CRD and RBD but contains rows and columns. It is an arrangement of N x N squares with an equal number of rows and columns which contain letters that occur only once in each row and column. Hence the differences can be found easily with fewer errors in the experiment. A Sudoku puzzle is an example of a Latin square design.
 FD – Factorial Design is an experimental design where each experiment has two or more factors, each with possible values, and on performing trials other combinations of factors are derived.

2. Secondary data:

Secondary data is data which has already been collected and is reused again for some valid purpose. This type of data was previously recorded from primary data and it has two types of sources, named internal source and external source.
Internal source:
These types of data can easily be found within the organization, such as market records, sales records, transactions, customer data, accounting resources, etc. The cost and time consumption is less in obtaining data from internal sources.
External source:
The data which can't be found in internal organizations and can be gained through external third party resources is external source data. The cost and time consumption is more because this contains a huge amount of data. Examples of external sources are government publications, news publications, the Registrar General of India, the Planning Commission, the International Labour Bureau, syndicate services, and other non-governmental publications.
Other sources:
 Sensor data: With the advancement of IoT devices, the sensors of these devices collect data which can be used for sensor data analytics to track the performance and usage of products.
Sensor data is the output of a device that detects and responds to some type of input from the physical environment. The output may be used to provide information or input to another system or to guide a process.
 Satellite data: Satellites collect a lot of images and data, in terabytes, on a daily basis through surveillance cameras, which can be used to collect useful information.
 Web traffic: Due to fast and cheap internet facilities, many formats of data uploaded by users on different platforms can be collected, with their permission, for data analysis. Search engines also provide their data through the keywords and queries searched most often.

Data Management:

 Data management is the process of managing tasks like extracting data, storing data, transferring data, processing data, and then securing data with low-cost consumption.
 The main motive of data management is to manage and safeguard people's and organizations' data in an optimal way so that they can easily create, access, delete, and update the data.
 Data management is an essential process in each and every enterprise's growth, without which policies and decisions can't be made for business advancement. The better the data management, the better the productivity of the business.

 Large volumes of data like big data are harder to manage traditionally, so optimal technologies and tools for data management such as Hadoop, Scala, Tableau, AWS, etc. must be used, which can further be used for big data analysis and for achieving improvements in patterns.
 Data management can be achieved by training the employees as necessary and through maintenance by DBAs, data analysts, and data architects.

Managing digital data in an organization involves a broad range of tasks, policies, procedures, and practices. The work of data management has a wide scope, covering factors such as how to:

 Create, access, and update data across a diverse data tier
 Store data across multiple clouds and on premises
 Provide high availability and disaster recovery
 Use data in a growing variety of apps, analytics, and algorithms
 Ensure data privacy and security
 Archive and destroy data in accordance with retention schedules and compliance requirements

A data management platform is the foundational system for collecting and analyzing large volumes of data across an organization. Commercial data platforms typically include software tools for management, developed by the database vendor or by third-party vendors. These data management solutions help IT teams and DBAs perform typical tasks such as:
 Identifying, alerting, diagnosing, and resolving faults in the database system or underlying infrastructure
 Allocating database memory and storage resources
 Making changes in the database design
 Optimizing responses to database queries for faster application performance

The increasingly popular cloud data platforms allow businesses to scale up or down quickly and cost-effectively. Some are available as a service, allowing organizations to save even more.

Remote sensing
Remote sensing is the acquisition of information about an object or phenomenon without making physical contact with it.

 A source of radiation (A). It can be natural or artificial. Radiation emitted by the source reaches the Earth's surface and is altered by the presence of objects on that surface. Remote sensing studies that alteration. Objects themselves can emit radiation as well.

 Objects (B) that interact with radiation or can emit it, as mentioned above.

 An atmosphere (C) through which radiation moves from the source to the objects. The atmosphere also interacts with the radiation and alters it.

 A receiver (D) which receives the radiation once it has been emitted or altered by the objects. The receiver measures the intensity of the radiation coming from different points in the area being studied and, with these measurements, generates its final product (in most cases, an image).
For describing the receivers that form part of a remote sensing system, we separate them into two components: sensors and platforms.

The sensor is the element that can read the electromagnetic radiation and register its intensity for a given zone of the spectrum. It can be a simple photographic camera or a more specialized sensor.
Passive sensors use a natural source of radiation (in most cases, sunlight) and just measure that radiation as it is reflected from the Earth's surface. Active sensors emit their own radiation, and then collect it back after it has been reflected.
Sensor data is the output of a device that detects and responds to some type of input from the physical environment. The output may be used to provide information or input to another system or to guide a process.

Dimensionality Reduction:

Dimensionality reduction is the process of reducing the number of random variables or attributes under consideration. High-dimensionality reduction has emerged as one of the significant tasks in data mining applications.

In machine learning classification problems, there are often too many factors on the basis of which the final classification is done. These factors are basically variables called features. The higher the number of features, the harder it gets to visualize the training set and then work on it. Sometimes, most of these features are correlated, and hence redundant. This is where dimensionality reduction algorithms come into play. Dimensionality reduction is the process of reducing the number of random variables under consideration by obtaining a set of principal variables. It can be divided into feature selection and feature extraction.

There are two components of dimensionality reduction:
 Feature selection: In this, we try to find a subset of the original set of variables, or features, to get a smaller subset which can be used to model the problem. It usually involves three ways:

1. Filter
2. Wrapper
3. Embedded

 Feature extraction: This reduces the data in a high dimensional space to a lower dimensional space, i.e. a space with a smaller number of dimensions (a small sketch follows this list).
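As a hedged illustration of feature extraction, the sketch below uses principal component analysis (PCA) from scikit-learn, assuming that library is available; the random 5-feature data and the choice of 2 components are arbitrary and for demonstration only.

```python
# Minimal feature-extraction sketch using PCA (assumes scikit-learn is installed).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # 100 samples, 5 original features (made up)

pca = PCA(n_components=2)              # keep 2 principal components
X_reduced = pca.fit_transform(X)       # shape becomes (100, 2)

print(X_reduced.shape)
print(pca.explained_variance_ratio_)   # share of variance kept by each component
```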

Common Methods to Perform Dimensionality Reduction:

a. Missing Values
While exploring data, if we encounter missing values, what do we do? Our first step should be to identify the reason. Then we need to impute the missing values or drop the variables using appropriate methods. But what if we have too many missing values? Should we impute the missing values or drop the variables?

b. Low Variance
Let's think of a scenario where we have a constant variable (all observations have the same value, 5) in our data set. Do you think it can improve the power of a model? Of course NOT, because it has zero variance. A small sketch of this check is shown below.
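A minimal sketch of the low-variance check, assuming pandas is available; the DataFrame and the zero-variance column are invented for illustration.

```python
# Minimal sketch: drop columns whose variance is (near) zero.
import pandas as pd

df = pd.DataFrame({
    "age":      [23, 45, 31, 52, 40],
    "constant": [5, 5, 5, 5, 5],       # zero variance: carries no information
    "income":   [30, 80, 55, 90, 60],
})

variances = df.var()
low_variance_cols = variances[variances < 1e-8].index
df = df.drop(columns=low_variance_cols)
print(df.columns.tolist())             # 'constant' has been removed
```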

c. Decision Trees
This is one of my favourite techniques. We can use it as an all-round solution to tackle multiple challenges, such as missing values, outliers and identifying significant variables. Several data scientists have used decision trees and it worked well for them.

d. Random Forest
Random Forest is similar to a decision tree. Just be careful that random forests have a tendency to bias towards variables that have a larger number of distinct values, i.e. they favor numeric variables over binary/categorical values.

e. High Correlation
Dimensions exhibiting higher correlation can lower the performance of a model. Moreover, it is not good to have multiple variables carrying similar information. You can use a Pearson correlation matrix to identify the variables with high correlation and select one of them using the VIF (Variance Inflation Factor), as sketched below.
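A minimal sketch of the high-correlation check, assuming pandas and NumPy; the data is invented, and one feature of each highly correlated pair is dropped with a simple 0.95 threshold rather than the VIF-based choice mentioned above.

```python
# Minimal sketch: find pairs of highly correlated features and flag one of each pair.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
x = rng.normal(size=200)
df = pd.DataFrame({
    "x":      x,
    "x_copy": x + rng.normal(scale=0.01, size=200),  # almost identical to x
    "noise":  rng.normal(size=200),
})

corr = df.corr(method="pearson").abs()
# Keep only the upper triangle so each pair is considered once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
print(to_drop)                       # 'x_copy' is flagged for removal
```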

f. Backward Feature Elimination
In this method, we start with all n dimensions. Compute the sum of squared residuals (SSR) after eliminating each variable (n times). Then identify the variable whose removal has produced the smallest increase in the SSR and remove it, leaving us with n-1 input features.

Repeat this process until no other variables can be dropped; one elimination round is sketched below.
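The procedure can be sketched directly with NumPy; one elimination round is shown on randomly generated data, dropping the feature whose removal increases the sum of squared residuals the least. This is an illustrative sketch under made-up data, not a prescribed implementation.

```python
# Minimal sketch of one round of backward feature elimination (made-up data).
import numpy as np

def ssr(X, y):
    """Sum of squared residuals of an ordinary least-squares fit."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ coef
    return float(residuals @ residuals)

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4))                                    # 4 candidate features
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)  # features 2, 3 irrelevant

# Try removing each feature in turn and record the resulting SSR.
scores = {j: ssr(np.delete(X, j, axis=1), y) for j in range(X.shape[1])}
drop = min(scores, key=scores.get)     # removal causing the smallest SSR increase
print("drop feature", drop)            # expected: 2 or 3 (an irrelevant feature)
X = np.delete(X, drop, axis=1)         # repeat until no variable can be dropped
```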

DATA QUALITY:

Data quality is a measure of the condition of data based on factors such as accuracy, completeness, consistency, reliability and whether it is up to date.

Several key factors to examine data quality:

 Existence
o Is there data to work with?
o Example: Did the organization actually collect data on sales performance in China?
 Consistency
o If a data point appears in multiple locations, does it bear the same meaning?
o Example: In data sets that contain revenue by store for a given week, is the same number associated with a particular store in all data sets?
 Accuracy
o Does the data represent real facts and properties?
o Example: Are reported sales representative of what actually happened in the store?
 Integrity
o Does the data depict genuine relationships?
o Example: In a report of customers and billing addresses, is each customer linked to the right billing address?
 Validity
o Do the data entries make sense?
o Example: If data in a column "location" is linked to data "price," are the related values consistent with allowable values in the data set and when compared with external benchmarks?

Organizations with good data quality practices will have a process for automating data collection and entry (since many mistakes are caused by human error), user profiles defining who should be able to access different data types, and a dashboard to monitor data quality changes over time.

Poor data quality negatively affects many data processing efforts. Examples of data quality problems:

 Noise and outliers
 Missing values
 Duplicate data

 Wrong data
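A minimal pandas sketch that surfaces several of the problems listed above (missing values, duplicate rows, and a crude outlier check) on an invented table:

```python
# Minimal sketch: quick data-quality report for missing values, duplicates and outliers.
# The DataFrame below is made up and contains deliberate problems.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "customer": ["A", "B", "B", "C", "D", "E"],
    "age":      [25, 32, 32, np.nan, 300, 28],   # NaN = missing value, 300 = outlier
})

print(df.isna().sum())            # missing values per column
print(df.duplicated().sum())      # number of exact duplicate rows

# Crude outlier check using the 1.5 * IQR rule on the numeric column.
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)]
print(outliers)                   # the row with age 300 is flagged
```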

Handling noisy or incomplete data:

The data stored in a database may reflect noise, exceptional cases, or incomplete data objects. When mining data regularities, these objects may confuse the process, causing the knowledge model constructed to overfit the data. As a result, the accuracy of the discovered patterns can be poor. Data cleaning methods and data analysis methods that can handle noise are required, as well as outlier mining methods for the discovery and analysis of exceptional cases.

DATA CLEANING AS A PROCESS

(I) Missing Values:

Missing values can be filled in by the following methods:

1. Ignore the tuple: This is usually done when the class label is missing (assuming the mining task involves classification). This method is not very effective unless the tuple contains several attributes with missing values. It is especially poor when the percentage of missing values per attribute varies considerably.
2. Fill in the missing value manually: In general, this approach is time-consuming and may not be feasible given a large data set with many missing values.
3. Use a global constant to fill in the missing value: Replace all missing attribute values by the same constant, such as a label like "Unknown". If missing values are replaced by, say, "Unknown", then the mining program may mistakenly think that they form an interesting concept, since they all have a value in common — that of "Unknown". Hence, although this method is simple, it is not foolproof.
4. Use the attribute mean to fill in the missing value: For example, suppose that the average income of AllElectronics customers is $56,000. Use this value to replace the missing value for income.
5. Use the attribute mean for all samples belonging to the same class as the given tuple: For example, if classifying customers according to credit risk, replace the missing value with the average income value for customers in the same credit risk category as that of the given tuple.
6. Use the most probable value to fill in the missing value: This may be determined with regression, inference-based tools using a Bayesian formalism, or decision tree induction. For example, using the other customer attributes in your data set, you may construct a decision tree to predict the missing values for income.
A small sketch of methods 4 and 5 follows this list.
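Methods 4 and 5 can be sketched in a few lines of pandas; the column names, values, and the credit-risk grouping below are invented for illustration.

```python
# Minimal sketch of mean imputation (method 4) and class-wise mean imputation (method 5).
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "credit_risk": ["low", "low", "high", "high", "low"],
    "income":      [56000, 48000, np.nan, 30000, np.nan],
})

# Method 4: fill missing income with the overall attribute mean.
overall_fill = df["income"].fillna(df["income"].mean())

# Method 5: fill missing income with the mean of the same credit-risk class.
class_fill = df.groupby("credit_risk")["income"].transform(lambda s: s.fillna(s.mean()))

print(overall_fill.tolist())
print(class_fill.tolist())
```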

(II) Noisy Data

Noise is a random error or variance in a measured variable. The following are data smoothing techniques:

1. Binning: Binning methods smooth a sorted data value by consulting its "neighborhood," that is, the values around it. The sorted values are distributed into a number of "buckets," or bins.
In smoothing by bin means, each value in a bin is replaced by the mean value of the bin. For example, the mean of the values 4, 8, and 15 in Bin 1 is 9. Therefore, each original value in this bin is replaced by the value 9.
Similarly, smoothing by bin medians can be employed, in which each bin value is replaced by the bin median. In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin boundaries. Each bin value is then replaced by the closest boundary value.

Binning methods for data smoothing (figure)
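A minimal sketch of smoothing by bin means, using the values 4, 8 and 15 mentioned above together with a few additional made-up values, split into equal-frequency ("equal-depth") bins:

```python
# Minimal sketch: smoothing by bin means on a small sorted list of values.
import numpy as np

values = np.array(sorted([4, 8, 15, 21, 21, 24, 25, 28, 34]), dtype=float)
n_bins = 3
bins = np.array_split(values, n_bins)          # equal-frequency bins of 3 values each

smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])
print(smoothed)     # Bin 1 (4, 8, 15) -> 9, 9, 9; the other bins likewise
```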

2. Regression: Data can be smoothed by fitting the data to a function, such as with regression. Linear regression involves finding the "best" line to fit two attributes so that one attribute can be used to predict the other.
Multiple linear regression is an extension of linear regression, where more than two attributes are involved and the data are fit to a multidimensional surface.

3. Clustering: Outliers may be detected by clustering, where similar values are organized into groups, or "clusters." Intuitively, values that fall outside of the set of clusters may be considered outliers.
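As a small illustration of the clustering idea, the sketch below uses DBSCAN from scikit-learn (one possible choice, assumed to be installed), which labels points that fall outside every cluster as -1; the 2-D points are invented.

```python
# Minimal sketch: clustering-based outlier detection with DBSCAN (assumes scikit-learn).
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(3)
cluster_a = rng.normal(loc=[0, 0], scale=0.3, size=(50, 2))   # made-up dense cluster
cluster_b = rng.normal(loc=[5, 5], scale=0.3, size=(50, 2))   # another dense cluster
far_points = np.array([[10.0, -10.0], [-8.0, 9.0]])           # points far from both
X = np.vstack([cluster_a, cluster_b, far_points])

labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)
print(X[labels == -1])     # points labelled -1 fall outside all clusters (outliers)
```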

(III) Data Cleaning as a Process

The first step in data cleaning as a process is discrepancy detection.

Discrepancies can be caused by several factors, including poorly designed data entry forms that have many optional fields, human error in data entry, deliberate errors (e.g., respondents not wanting to divulge information about themselves), and data decay (e.g., outdated addresses). Discrepancies may also arise from system errors, errors during processing, and inconsistent data representation.
By using metadata, which is data about data, we can detect and avoid discrepancies and filter out unnecessary data from data sets.
Field overloading is another source of errors; it occurs when new attributes are defined in unused portions of already defined attributes.

The data should also be examined regarding unique rules, consecutive rules and null rules:

A unique rule says that each value of the given attribute must be different from all other values for that attribute.
A consecutive rule says that there can be no missing values between the lowest and highest values for the attribute, and that all values must also be unique (e.g., as in check numbers).
A null rule specifies the use of blanks, question marks, special characters, or other strings that may indicate the null condition (e.g., where a value for a given attribute is not available).

The tools that can aid in the step of discrepancy detection:

Data scrubbing tools use simple domain knowledge (e.g., knowledge of postal addresses, and spell-checking) to detect errors and make corrections in the data.
Data auditing tools find discrepancies by analyzing the data to discover rules and relationships, and detecting data that violate such conditions. Some data inconsistencies may be corrected manually. Data transformations are also used for discrepancies, and these transforms are defined to correct them.
Data migration tools allow simple transformations to be specified, such as replacing the string "gender" by "sex".
ETL (extraction/transformation/loading) tools allow users to specify transforms through a graphical user interface (GUI).

PREPROCESSING THE DATA

Data preprocessing techniques are used to deal with incomplete, noisy and inconsistent data. Before the mining process, we have to check whether any outliers, noisy values or inconsistencies are present in the data; if the data to be mined is noisy, incomplete or inconsistent, these problems are to be eliminated by using preprocessing techniques.

If the data supplied to mining is full of outliers and errors, the resulting patterns will also be of low quality. Incomplete data can occur for a number of reasons: attributes of interest may not always be available, values may be misunderstood while the record is being entered, or equipment may malfunction.

Noisy data can occur for a number of reasons: the instruments that collect the data may be faulty, there may be human or computer errors while entering the data, and there may also be errors while transforming the data.

To avoid all these problems, the following data preprocessing techniques are used:
 Data Cleaning
 Data Integration
 Data Transformation
 Data Reduction

1. Data Cleaning is a process which is used to clean dirty or noisy data, remove outliers and resolve inconsistencies. Dirty data may lead to confusion and unreliable output.

2. Data Integration is a technique or process which is used to integrate or consolidate a number of files, databases, data cubes, etc. The attributes representing a particular concept may have distinct names in distinct databases, which may lead to inconsistency and redundancy.

3. Data Transformation is a process which is used to convert the data into an appropriate form for mining. For instance, the best examples are normalization and aggregation (see the sketch after this list).

4. Data Reduction is a process which is used to remove irrelevant and unused data from the data sets (i.e. to reduce them into smaller volumes).
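A minimal sketch of the normalization mentioned under Data Transformation, using simple min-max scaling on an invented column of values:

```python
# Minimal sketch: min-max normalization (a common data transformation) on made-up values.
import numpy as np

income = np.array([12000, 35000, 47000, 85000, 60000], dtype=float)

# Rescale to the range [0, 1]: (x - min) / (max - min)
income_scaled = (income - income.min()) / (income.max() - income.min())
print(income_scaled)
```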

Forms of data preprocessing (figure)

Glossary

Sensor data: Sensor data is the output of a device that detects and responds to some type of input from the physical environment. The output may be used to provide information or input to another system or to guide a process.

Data discretization is defined as a process of converting continuous data attribute values into a finite set of intervals and associating with each interval some specific data value. If discretization leads to an unreasonably small number of data intervals, then it may result in significant information loss.

Dimensionality reduction is the process of reducing the number of random variables or attributes under consideration. High-dimensionality data reduction, as part of a data pre-processing step, is extremely important in many real-world applications.

Attribute subset selection is a technique which is used for data reduction in the data mining process. Data reduction reduces the size of data so that it can be used for analysis purposes more efficiently. The need for attribute subset selection arises because a data set may have a large number of attributes.

Data quality refers to the overall utility of a dataset(s) as a function of its ability to be easily processed and analyzed for other uses, usually by a database, data warehouse, or data analytics system.

Previous JNTUH Exam Questions

1. What is an attribute? Explain different types of attributes.
2. What is data pre-processing?
3. Explain various methods for handling missing values.
4. What is data analytics? What is the need of data analytics?
5. Explain different types of data analytics.
6. What is data architecture? Explain data integration.
7. Explain the difference between information and data.
8. What is sensor data?
9. How are missing values handled?
10. How to handle noisy data in data mining?

Important Questions:

1. What is data architecture?
2. Differentiate between data and information.
3. Write a note on attribute subset selection for data reduction.
4. Explain data quality in detail.
5. What are outliers? Explain with an example.
6. What do you mean by data preprocessing? Why is it needed?
7. Explain various methods for handling missing data values.
8. Explain sampling methods for data reduction.
9. How to handle noisy data?
10. Define the following:
(a) Data Cleaning
(b) Data preprocessing
(c) Data integration
(d) Data transformation
(e) Cluster
(f) Outlier
(g) Data quality
