DATA SCIENCE PROCESS
What is Data Science?
Data Science is a field that combines math, programming, and subject
knowledge to study data. It works with both organized data (like tables
and numbers) and unorganized data (like text, images, or videos). The
main purpose is to find useful information, make predictions, and create
smart systems that can work automatically.
Why is it important?
• Businesses make better, faster, and more informed decisions.
• Applications: recommendation systems, fraud detection, customer
analytics, healthcare predictions, etc.
Data Process Overview
• Define Problem
• Data Collection
• Data Cleaning & Preparation
• Exploratory Data Analysis (EDA)
• Modeling
• Evaluation
• Deployment
• Communication
• Monitoring & Maintenance
Step 1: Define the Problem
• Understand what question you are trying to answer.
• Make the problem clear and specific.
• Decide the scope of the project (what’s included or excluded).
• Identify who will use the results.
• Set success criteria (how you will measure if the solution works).
• Example: Predict tomorrow’s weather using past climate data.
Step 2: Data Collection
• Sources of data:
1. Databases and spreadsheets
2. Online sources and APIs (e.g., weather, social media, maps)
3. Sensors and smart devices (IoT)
4. Surveys, experiments, or manual records
5. Web data (collected through scraping or downloads)
• Challenges:
1. Missing or incomplete information
2. Errors and inconsistencies
3. Very large datasets that are hard to manage
4. Privacy and security concerns when handling sensitive data
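As a rough illustration of the first source type and the first challenge, the sketch below reads spreadsheet-style data and counts empty fields per column. The CSV text, the column names (date, temp_c, humidity), and the helper names load_rows and count_missing are all made up for this example; real data would come from a file, database, or API.

```python
import csv
import io

# Hypothetical export from a spreadsheet; column names and values are assumptions.
RAW_CSV = """date,temp_c,humidity
2024-01-01,4.5,81
2024-01-02,,79
2024-01-03,5.1,
2024-01-04,6.0,75
"""


def load_rows(text):
    """Parse CSV text into a list of dicts, one per data row."""
    return list(csv.DictReader(io.StringIO(text)))


def count_missing(rows):
    """Count empty fields per column as a first look at data quality."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            if value == "":
                counts[column] = counts.get(column, 0) + 1
    return counts


rows = load_rows(RAW_CSV)
missing = count_missing(rows)
print(len(rows), "rows loaded; missing values per column:", missing)
```

Checking for gaps at collection time makes the later cleaning step easier to plan.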
Step 3: Data Cleaning & Preparation
• Raw data is usually messy and not ready to use.
• Problems often found:
• Missing values
• Duplicate records
• Wrong or inconsistent formats (like dates, units, or text)
• Outliers that don’t fit the pattern
• Cleaning makes the data accurate, consistent, and reliable.
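The four problems listed above can each be handled with a small, explicit rule. The sketch below is one possible approach, not a standard recipe: the records are invented, the two date formats are assumed, outliers are detected with a simple plausible-range check (low and high are arbitrary limits), and gaps are filled with the median.

```python
from datetime import datetime
from statistics import median

# Hypothetical raw records: (date string, temperature in Celsius or None).
raw = [
    ("2024-01-01", 4.5),
    ("2024-01-01", 4.5),   # duplicate record
    ("02/01/2024", None),  # different date format and a missing value
    ("2024-01-03", 5.1),
    ("2024-01-04", 60.0),  # implausible reading (outlier)
    ("2024-01-05", 6.0),
]


def parse_date(text):
    """Accept the two date formats assumed to appear in this data."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")


def clean(records, low=-50.0, high=50.0):
    # 1. Normalize date formats and drop exact duplicates (keep the first copy).
    seen, rows = set(), []
    for date_text, temp in records:
        row = (parse_date(date_text), temp)
        if row not in seen:
            seen.add(row)
            rows.append(row)
    # 2. Treat readings outside the plausible range as missing.
    rows = [(d, t if t is not None and low <= t <= high else None) for d, t in rows]
    # 3. Fill missing values with the median of the remaining readings.
    fill = median(t for _, t in rows if t is not None)
    return [(d, fill if t is None else t) for d, t in rows]


cleaned = clean(raw)
for day, temp in cleaned:
    print(day.isoformat(), temp)
```

Whether to fill, drop, or keep unusual values depends on the problem; the median fill here is only one option.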
Step 4: Exploratory Data Analysis (EDA)
• Look closely at the data to understand it better.
• Find patterns, trends, and unusual values.
• Use visuals like charts and graphs:
1. Histograms → show distribution
2. Scatter plots → show relationships
3. Heatmaps → show correlations
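To make the ideas of distribution and correlation concrete, here is a minimal sketch that prints a text histogram and computes a Pearson correlation coefficient by hand. The readings are invented, and the helper names histogram and pearson are illustrative; in practice a plotting library would usually draw these charts.

```python
from collections import Counter
from math import sqrt

# Hypothetical daily readings; the values are made up for illustration.
temps = [4.5, 5.1, 5.1, 6.0, 7.2, 7.9, 9.4]
humidity = [81, 79, 80, 75, 72, 70, 64]


def histogram(values, bin_width):
    """Count values per bin; each bin is labeled by its lower edge."""
    return Counter(int(v // bin_width) * bin_width for v in values)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Distribution: one row of '#' marks per temperature bin.
for edge, count in sorted(histogram(temps, 2).items()):
    print(f"{edge:>2}-{edge + 2:<2} {'#' * count}")

# Relationship: a value near -1 means humidity falls as temperature rises.
r = pearson(temps, humidity)
print(f"correlation(temp, humidity) = {r:.2f}")
```

A heatmap is essentially a grid of such correlation values, one for each pair of columns.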
Step 5: Data Analysis / Modeling
• Use data to answer questions or make predictions.
• Types of analysis:
1. Descriptive → What happened?
2. Diagnostic → Why did it happen?
3. Predictive → What might happen next?
4. Prescriptive → What should be done?
• Methods used:
1. Statistical tests
2. Regression models
3. Forecasting techniques
4. Grouping or clustering data
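As one example of a regression model used for prediction, the sketch below fits a straight line to a few invented temperature readings with ordinary least squares and uses it to estimate the next day. The data and the function names fit_line and predict are assumptions for illustration; real weather forecasting uses far richer models.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = my - slope * mx
    return intercept, slope


def predict(intercept, slope, x):
    """Evaluate the fitted line at a new x value."""
    return intercept + slope * x


# Hypothetical data: day index and temperature in Celsius.
days = [0, 1, 2, 3, 4]
temps = [4.0, 5.5, 5.0, 6.5, 7.0]

intercept, slope = fit_line(days, temps)
tomorrow = predict(intercept, slope, 5)
print(f"temp = {intercept:.2f} + {slope:.2f} * day; day 5 estimate: {tomorrow:.2f}")
```

This is predictive analysis in its simplest form: a trend learned from past values is extended one step ahead.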
Step 6: Evaluation
In this step, we test how well the model performs and whether it can
make reliable predictions. The model’s output is compared with actual
results, and different measures are used depending on the type of
problem. The goal is to make sure the model is accurate, generalizes
well to new data, and is suitable for real use.
• Key points:
1. Compare predictions with actual results.
2. Use measures (metrics) based on problem type.
3. Ensure the model works on unseen data.
4. Select the best-performing model.
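The key points above can be sketched in a few lines: hold back the most recent data as unseen test data, score each candidate with a metric suited to the problem, and keep the best one. Everything here is illustrative: the readings are invented, mean absolute error is chosen because the target is a number (a classification problem would use something like accuracy instead), and the two candidate models are deliberately simple.

```python
def mean_absolute_error(actual, predicted):
    """Average size of the prediction errors, in the target's own units."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def persistence(history, steps):
    """Candidate 1: predict that the last observed value continues."""
    return [history[-1]] * steps


def mean_model(history, steps):
    """Candidate 2: predict the average of everything seen so far."""
    average = sum(history) / len(history)
    return [average] * steps


# Hypothetical daily temperatures. Earlier days train, later days test,
# so the models are scored on data they have not seen.
temps = [4.0, 5.5, 5.0, 6.5, 7.0, 7.5, 8.5, 8.0]
train, test = temps[:6], temps[6:]

candidates = {"persistence": persistence, "mean": mean_model}
scores = {
    name: mean_absolute_error(test, model(train, len(test)))
    for name, model in candidates.items()
}
best = min(scores, key=scores.get)  # lower error is better

for name, score in scores.items():
    print(f"{name}: MAE = {score:.2f}")
print("selected:", best)
```

Splitting by time rather than at random matters for forecasting, since a random split would let the model see the future.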
Step 7: Deployment
Once the model is ready and tested, it is put into real use so others
can benefit from it. Deployment means making the model accessible
through tools, apps, or systems where it can give predictions or
insights in real time or on demand.
• Key points:
1. Integrate the model into actual systems or applications.
2. Can be used through APIs, dashboards, or apps.
3. Should be scalable (handle more data), reliable, and secure.
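One way to expose a model through an API is a small HTTP endpoint that accepts JSON and returns a prediction. The sketch below uses only Python's standard library and is not production-ready: the MODEL coefficients, the request shape {"day": 5}, and the names handle_payload, PredictHandler, and serve are assumptions for illustration, and a real deployment would add authentication, logging, and a proper web server.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical coefficients, assumed to come from an earlier fitting step.
MODEL = {"intercept": 4.2, "slope": 0.7}


def predict(day):
    """Apply the (assumed) fitted line to a day index."""
    return MODEL["intercept"] + MODEL["slope"] * day


def handle_payload(payload):
    """Validate a JSON request body and return (status code, response dict)."""
    try:
        day = float(json.loads(payload)["day"])
    except (ValueError, KeyError, TypeError):
        return 400, {"error": 'expected JSON like {"day": 5}'}
    return 200, {"day": day, "predicted_temp_c": round(predict(day), 2)}


class PredictHandler(BaseHTTPRequestHandler):
    """Answer POST requests with a JSON prediction."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status, body = handle_payload(self.rfile.read(length).decode("utf-8"))
        data = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(port=8000):
    """Run the endpoint locally until interrupted (not called here)."""
    HTTPServer(("127.0.0.1", port), PredictHandler).serve_forever()


# The request logic can be exercised without starting a server.
print(handle_payload('{"day": 5}'))
print(handle_payload("not json"))
```

Keeping the validation and prediction logic in handle_payload, separate from the network code, makes it easier to test and to reuse behind a dashboard or app.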
Step 8: Communication
After building and testing a model, the results need to be shared in a
way that is easy to understand. This step is about turning technical
outputs into clear insights that people can use. Visuals like charts,
graphs, and dashboards make it easier to explain findings.
• Key points:
1. Present results clearly with visuals and reports.
2. Keep explanations simple and easy to understand.
3. Focus on insights, not technical jargon.
Step 9: Monitoring & Maintenance
Even after deployment, models need to be watched and updated. Over
time, data can change, and the model may become less accurate. Regular
monitoring ensures the model continues to perform well.
• Key points:
1. Track performance over time.
2. Retrain models with new data.
3. Update features if conditions change.
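Tracking performance over time can be as simple as keeping a rolling error over recent predictions and raising a flag when it grows too large. The sketch below assumes a window of three predictions and an error threshold of 1.0 degree, both arbitrary; the PerformanceMonitor class and the stream of (predicted, actual) pairs are hypothetical.

```python
from collections import deque


class PerformanceMonitor:
    """Track mean absolute error over the most recent predictions."""

    def __init__(self, window, threshold):
        self.errors = deque(maxlen=window)  # old errors drop off automatically
        self.threshold = threshold

    def record(self, predicted, actual):
        self.errors.append(abs(predicted - actual))

    def rolling_mae(self):
        return sum(self.errors) / len(self.errors)

    def needs_retraining(self):
        # Judge only once the window is full, so one bad day does not trigger it.
        window_full = len(self.errors) == self.errors.maxlen
        return window_full and self.rolling_mae() > self.threshold


monitor = PerformanceMonitor(window=3, threshold=1.0)

# Hypothetical (predicted, actual) pairs; the last two show the model drifting.
stream = [(7.0, 7.2), (7.5, 7.1), (8.0, 8.3), (8.5, 10.4), (9.0, 11.2)]

alerts = []
for predicted, actual in stream:
    monitor.record(predicted, actual)
    alerts.append(monitor.needs_retraining())

print(alerts)
```

When the flag turns true, the usual response is to retrain on newer data or revisit the features, as listed in the key points.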
Conclusion
The Data Science process is not only about creating models, but about
following a full set of steps to turn data into useful knowledge. It
goes from defining the problem to collecting, cleaning, exploring,
modeling, evaluating, deploying, communicating, and monitoring.