UNIT-2 AI CYCLE
SESSION 3: DATA ACQUISITION
INTRODUCTION
This stage is about acquiring data for the project.
Data can be pieces of information, or facts and statistics collected
together for reference or analysis.
Whenever we want an AI project to be able to predict an output, we
need to train it.
SIGNIFICANCE OF DATA
An AI project means an artificially intelligent project that is capable
of making decisions or performing some intelligent tasks.
To build an AI system, you need to source large amounts of data to
create data sets for training, testing and evaluation, and then
deployment of the AI project.
TWO TYPES OF DATA SETS
TRAINING DATA                           TESTING DATA
It is used for training the model.      It is used for testing the model after it is trained.
It is the larger part of the data,      It is smaller than the training set,
about 70% to 80%.                       about 20% to 30%.
E.g. the whole year's exam syllabus     E.g. the exam paper is the testing data.
taught by the teacher.
Example:
If you want to make an artificially intelligent system which can
predict the salary of any employee based on his previous salaries,
you would feed the data of his previous salaries into the machine.
This is the data with which the machine can be trained.
Once it is ready, it will predict his next salary efficiently.
The previous salary data here is known as training data, while the
data set used to check the next-salary prediction is known as
testing data.
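The 70-80% / 20-30% split described above can be sketched in Python as follows. This is only a minimal illustration, not a prescribed method: the salary figures, the function name split_dataset and the 80% ratio are assumptions made up for the example.

```python
import random

def split_dataset(records, train_fraction=0.8, seed=0):
    """Shuffle a copy of the records and split it into training and testing parts."""
    if not 0 < train_fraction < 1:
        raise ValueError("train_fraction must be between 0 and 1")
    shuffled = list(records)
    # A fixed seed keeps the shuffle repeatable from run to run.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical yearly salaries of one employee (illustrative values only).
salaries = [30000, 32000, 34500, 36000, 39000, 41000, 44000, 46500, 50000, 53000]

training_data, testing_data = split_dataset(salaries, train_fraction=0.8)
print(len(training_data), "training records,", len(testing_data), "testing records")
```

With ten records and a fraction of 0.8, eight records go to training and two to testing. Every record lands in exactly one of the two parts, so the model is never tested on data it was trained on.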
TYPES OF DATA USED IN AI PROJECT
STRUCTURED DATA                               UNSTRUCTURED DATA
It conforms to some existing data model.      It does not conform to any pre-existing data model.
It has well-defined relationships among       It has undefined relationships among its elements.
its elements.
It does not require extra pre-processing      It requires more pre-processing before being
before being analyzed or searched.            analyzed or searched.
It is often quantitative data.                It is often qualitative data.
QUALITY DATA CHARACTERISTICS
Data Features
Both structured and unstructured data have certain data features.
Data features refer to the type of data you want to collect. For
example, for an AI system used for predicting average savings, the
data features could be salary/income, inflation rate, average
spending, etc.
For an AI system analysing social media posts, the data features
required would be the social media post, platform, time posted, etc.
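As a rough sketch, data features can be thought of as the named fields that every collected record should contain. The feature names below follow the average-savings example; the record values and the helper missing_features are hypothetical and exist only for illustration.

```python
# Data features for the average-savings example (names follow the text).
SAVINGS_FEATURES = ["income", "inflation_rate", "average_spending"]

def missing_features(record, features):
    """Return the features that a collected record does not provide."""
    return [name for name in features if record.get(name) is None]

# Hypothetical records; the numbers are made up for illustration.
complete_record = {"income": 50000, "inflation_rate": 5.1, "average_spending": 32000}
partial_record = {"income": 42000, "average_spending": 30000}

print(missing_features(complete_record, SAVINGS_FEATURES))
print(missing_features(partial_record, SAVINGS_FEATURES))
```

A check like this shows early whether the data being collected actually covers the features the AI system needs.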
DATA ACQUISITION
Data acquisition begins when you acquire the required data in quality
form. Before the actual data acquisition happens, some processing is
required, in which the following questions are answered.
➢ What data features are needed?
➢ How frequently do you have to collect the data?
➢ What happens if you don't have enough data?
➢ What kind of analysis needs to be done?
➢ How does the analysis inform the action?
IDENTIFYING DATA REQUIREMENTS
After answering all the questions mentioned above, you can finalize
the data requirements by doing the following:
➢ Group together the relevant data features in logically related
structures.
➢ Be clear about the relationships of data inside and outside the
logical data structure.
➢ Use consistent and standardized terminology and format.
FINDING RELIABLE DATA SOURCES
Interview
It is one of the most effective sources of data gathering.
An interview refers to a one-on-one conversation between an analyst
and the users and clients to find out about a system, its functions,
shortcomings and flaws.
Survey
In surveys, first the goal of the survey is ascertained and
thereafter the questionnaires are drafted accordingly.
A survey refers to the study of the opinions, responses, etc. of a
group of stakeholders.
Observation
Under the observation method, the responsible person observes the
team in a real working environment, gets ideas about the required
data and its form, and subsequently documents the observation.
The observation method refers to human or mechanical watching,
noticing, or perceiving of what people actually do or what events
take place in a specific working environment.
API (Application Programming Interface)
An API is a specialized technique in which a specific type of data is
collected through the use of a programming interface. For example,
using a social media program's interface, data like people's most
preferred game, most liked post, most used time, etc. may be gathered.
An API refers to an Application Programming Interface that works
behind a popular software program or game to collect a specific type
of data pertaining to the users' way of using that program.
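The idea of collecting data through a programming interface can be sketched as below. This is an assumption-heavy illustration: the endpoint https://api.example.com/v1/posts, its parameters and the shape of the response are all hypothetical, and no network request is actually made. A real platform documents its own URL, parameters and access rules.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint, used only to show the shape of a request.
BASE_URL = "https://api.example.com/v1/posts"

def build_request_url(user_id, limit):
    """Build the URL a program would request to ask the API for a user's posts."""
    return BASE_URL + "?" + urlencode({"user_id": user_id, "limit": limit})

def most_liked_post(response_text):
    """Parse a JSON response and return the post with the most likes."""
    posts = json.loads(response_text)["posts"]
    return max(posts, key=lambda post: post["likes"])

# Made-up response standing in for what such an API might return.
sample_response = '{"posts": [{"id": 1, "likes": 12}, {"id": 2, "likes": 40}, {"id": 3, "likes": 7}]}'

print(build_request_url(user_id=42, limit=3))
print(most_liked_post(sample_response))
```

The program asks for data in a fixed format and receives structured data back, which is why API data usually needs less pre-processing than scraped pages.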
Web Scraping
Web scraping, also called web harvesting or web data extraction, is
data scraping used for extracting data from websites. A web scraper
is a specialized tool designed to carry out web scraping.
Web scraping refers to a data collection technique using a tool
called a web scraper that extracts data from websites.
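A very small web scraper can be sketched with Python's standard html.parser module. The page content below is made up, and the class name HeadingScraper is hypothetical. A real scraper would first download the HTML from a website, and should respect that site's terms of use.

```python
from html.parser import HTMLParser

class HeadingScraper(HTMLParser):
    """Collect the text of every <h2> element on a page."""

    def __init__(self):
        super().__init__()
        self.in_heading = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_heading = True
            self.headings.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_heading = False

    def handle_data(self, data):
        # Only keep text that appears between <h2> and </h2>.
        if self.in_heading:
            self.headings[-1] += data

# Made-up page content standing in for a downloaded web page.
sample_html = """
<html><body>
  <h2>Rainfall report</h2><p>Monthly totals by district.</p>
  <h2>Crop prices</h2><p>Weekly market rates.</p>
</body></html>
"""

scraper = HeadingScraper()
scraper.feed(sample_html)
headings = [text.strip() for text in scraper.headings]
print(headings)
```

The scraper walks through the page markup and keeps only the parts of interest, here the section headings.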
Sensors
Electronic sensors can measure various parameters such as weather,
humidity, body temperature, blood pressure, heartbeat, weight and
many more. For instance, modern medical diagnosis and wearables like
Fitbit and Apple Watch make good use of sensors.
The Internet of Things (IoT) cannot function without sensors.
Cameras
Cameras, because of their video recording and image capturing
features, have proven to be good data collection tools in various
situations such as traffic rule violations and automatic detection of
flaws in the design and outlook of products, places, buildings, etc.
The method of data collection using cameras is a way to collect data
in graphic or video form about the look, design or action, as per
the requirements.
The Internet
Searching the Internet for data as per one's requirements is a
commonly used technique.
You should not take data directly from the Internet for the following
two reasons.
• The data might not be authentic, might be inaccurate, or might come
from unreliable sources.
• Even if the data is reliable, it cannot be taken directly if it is
copyright protected under IPR (Intellectual Property Rights).
You can take data from the Internet only after ensuring the
following two things.
• The source of the data is authentic and reliable.
• The data has been licensed for public use through licences like
Creative Commons, copyleft and other open-source licences, or
through personally obtained permission from the copyright owner.
You can also collect data from government-hosted websites like
data.gov.in, india.gov.in, mospi.nic.in, etc.
There are two terms for data, based on who collects it.
• Primary data: It is the type of data that you gather by yourself.
It means you are actively involved in the sourcing of information.
• Secondary data: It is all around us. It is easily accessible on the
Internet and requires fewer resources to gather, unlike primary
data. In this case the collection of the primary data has been done
by someone else before it was uploaded to the Internet. Secondary
data comes in the form of search results.
Acquiring Data
After identifying the data requirements, the required data features
and appropriate and reliable data sources, the data is finally
collected in the required form.
That is, in data acquisition, data is understood, gathered, filtered,
cleaned and finally stored in a data storage system.
Data acquisition refers to understanding, gathering, filtering and
cleaning data as per the requirements of the AI system so as to train
it using the collected data.
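The gather, filter, clean and store steps can be sketched end to end as below. This is a simplified illustration under stated assumptions: the temperature readings are made up, the function names are hypothetical, and an in-memory SQLite database stands in for the data storage system.

```python
import sqlite3

# Gathering: made-up raw readings as they might arrive from a source.
raw_rows = [
    {"city": " Delhi ", "temperature": "31.5"},
    {"city": "Mumbai", "temperature": ""},        # missing value
    {"city": "chennai", "temperature": "33.0"},
    {"city": " Delhi ", "temperature": "31.5"},   # duplicate
]

def filter_rows(rows):
    """Filtering: drop rows that lack a temperature reading."""
    return [row for row in rows if row["temperature"].strip()]

def clean_rows(rows):
    """Cleaning: tidy the text, convert numbers and remove duplicates."""
    cleaned = []
    for row in rows:
        item = (row["city"].strip().title(), float(row["temperature"]))
        if item not in cleaned:
            cleaned.append(item)
    return cleaned

def store_rows(rows):
    """Storing: save the cleaned rows in an in-memory SQLite database."""
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE readings (city TEXT, temperature REAL)")
    connection.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    connection.commit()
    return connection

cleaned_rows = clean_rows(filter_rows(raw_rows))
database = store_rows(cleaned_rows)
stored = database.execute("SELECT city, temperature FROM readings ORDER BY city").fetchall()
print(stored)
```

Of the four raw rows, only two survive: the row with the missing value is filtered out and the duplicate is removed during cleaning.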
Assignment
• What is a data set?
• What are data features? Give an example.
• What is data acquisition?
• List some commonly used sources of data collection.
• Discuss briefly the following methods of data collection:
1) Interview, 2) Survey, 3) Observation, 4) API, 5) Web scraping,
6) Sensors, 7) The Internet.
• Why should the Internet be avoided as a data collection source?
• List some government sites that can be used for data collection.
• Collect the data for the Session 2 (problem scoping) assignment
project.