Unit-2 ML
Uploaded by Sunil Sai

UNIT-2

Concept Learning: Introduction,


Concept Learning Task- Notation, Concept Learning Search, Version
spaces, Candidate Elimination Algorithm, Inductive Bias, Biased
hypothesis Space, Unbiased Learner, Bias-free Learning, Active
queries, Mistake bound/PAC model – basic results. Overview of
issues regarding data sources, success criteria
1. Introduction to Concept Learning
Concept learning is the process of acquiring knowledge about a
concept (a category or class) from examples provided by a
knowledgeable external source or through observations. The goal is to
build a model (hypothesis) that can accurately classify new examples
into appropriate categories.
2. Concept Learning Task – Notation
In concept learning, we typically use the following notation:
Instance: An object to be classified. Denoted as x.
Concept: A function or rule that assigns a label (category) to each
instance. Denoted as c.
Hypothesis: A proposed concept that approximates the true concept.
Denoted as h.
Target Concept: The true concept we aim to learn. Denoted as c*.
3. Concept Learning Search
The concept learning search refers to the process of finding a
hypothesis (concept) that best fits the training examples D. This
search is typically done within a hypothesis space H, which is the
set of all possible hypotheses.
Steps in Concept Learning Search:

1. Initialization: Start with an initial hypothesis. This could be the
most specific hypothesis (e.g., covering no instances) or the most
general hypothesis (e.g., covering all instances).
2. Testing Hypotheses: Evaluate hypotheses against the training
examples by checking whether the hypothesis correctly classifies all
instances in D.
3. Refinement: Modify the current hypothesis to better fit the data.
This can involve specializing (narrowing the conditions under which
the concept holds) or generalizing (relaxing conditions to cover more
instances).
4. Stopping Criterion: Decide when to stop refining the hypothesis,
often when a satisfactory level of accuracy is reached or no further
improvement is possible.
5. Output: Once the search concludes, output the final hypothesis,
which is the learned concept.
Concept learning involves systematically searching for a hypothesis
within a predefined hypothesis space that best fits a set of training
examples. The goal is to generalize from the examples to accurately
classify new instances. It's a foundational task in machine learning
and is crucial for various applications, from pattern recognition to
natural language understanding.
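The specific-to-general search sketched in these steps is exactly what the classic Find-S procedure does. Below is a minimal sketch; the data set is the EnjoySport data used for candidate elimination later in this unit, and the attribute-vector and '?' conventions follow the notes:

```python
def find_s(examples):
    """Find-S: start with the most specific hypothesis and minimally
    generalize it over every positive example (negatives are ignored)."""
    positives = [x for x, label in examples if label == 'yes']
    if not positives:
        return None                      # the null (most specific) hypothesis
    h = list(positives[0])
    for x in positives[1:]:
        # Keep attribute values that agree; generalize disagreements to '?'.
        h = [a if a == v else '?' for a, v in zip(h, x)]
    return h

data = [
    (('sunny', 'warm', 'normal', 'strong', 'warm', 'same'), 'yes'),
    (('sunny', 'warm', 'high', 'strong', 'warm', 'same'), 'yes'),
    (('rainy', 'cold', 'high', 'strong', 'warm', 'change'), 'no'),
    (('sunny', 'warm', 'high', 'strong', 'cool', 'change'), 'yes'),
]
print(find_s(data))  # ['sunny', 'warm', '?', 'strong', '?', '?']
```

Note that Find-S returns only the specific boundary; the candidate elimination algorithm of Section 5 tracks the general boundary as well.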
4. Version Spaces
Version spaces are a key concept in concept learning, particularly in
the framework of version space learning. The version space represents
the set of all hypotheses that are consistent with the observed training
data. It helps to reduce the search space for the correct hypothesis by
narrowing down the set of possible concepts that could explain the
data.
General Hypothesis (G): The set of most general hypotheses that are
consistent with the training data.

Specific Hypothesis (S): The set of most specific hypotheses that are
consistent with the training data.
Version Space (V): The set of hypotheses lying between S and G
(inclusive) in the general-to-specific ordering; it contains all
Version spaces are crucial because they provide a structured way to
navigate the hypothesis space and converge towards the correct
concept efficiently.
In short, concept learning is about learning to classify or predict
based on examples, using a structured approach involving hypotheses,
version spaces, and a search through the hypothesis space to find the
best-fitting concept.
5. Candidate Elimination Algorithm
The candidate elimination algorithm incrementally builds the version
space given a hypothesis space H and a set E of examples. The
examples are added one by one; each example possibly shrinks the
version space by removing the hypotheses that are inconsistent with
the example. The candidate elimination algorithm does this by
updating the general and specific boundary for each new example.
You can consider this an extended form of the Find-S algorithm, one
that considers both positive and negative examples. Positive examples
are handled as in Find-S (they generalize the specific boundary),
while negative examples are used to specialize the general boundary.
Terms Used:
Concept learning: The learning task of the machine (learning from
training data).
General Hypothesis: Specifies no features; it covers every instance.
G = {'?', '?', '?', '?', ...}: one '?' per attribute.
Specific Hypothesis: Specifies concrete feature values.
S = {'pi', 'pi', 'pi', ...}: the number of entries equals the number
of attributes.
Version Space: An intermediate between the general and the specific
hypothesis. It is not a single hypothesis but the set of all possible
hypotheses consistent with the training data-set.
Algorithm:
Step 1: Load the data set.
Step 2: Initialize the General Hypothesis G and the Specific Hypothesis S.
Step 3: For each training example:
Step 4: If the example is positive, then for each attribute:
            if attribute_value == hypothesis_value:
                do nothing
            else:
                replace the attribute value in S with '?' (generalizing it)
Step 5: If the example is negative:
            make the general hypotheses more specific.
Consider the dataset given below (the EnjoySport data):

Sky    AirTemp  Humidity  Wind    Water  Forecast  EnjoySport
sunny  warm     normal    strong  warm   same      yes
sunny  warm     high      strong  warm   same      yes
rainy  cold     high      strong  warm   change    no
sunny  warm     high      strong  cool   change    yes
Algorithmic steps:
Initially: G = [[?, ?, ?, ?, ?, ?], [?, ?, ?, ?, ?, ?], [?, ?, ?, ?, ?, ?],
                [?, ?, ?, ?, ?, ?], [?, ?, ?, ?, ?, ?], [?, ?, ?, ?, ?, ?]]
           S = [Null, Null, Null, Null, Null, Null]

For instance 1: <'sunny','warm','normal','strong','warm','same'> and
positive output.
G1 = G
S1 = ['sunny','warm','normal','strong','warm','same']

For instance 2: <'sunny','warm','high','strong','warm','same'> and
positive output.
G2 = G
S2 = ['sunny','warm',?,'strong','warm','same']

For instance 3: <'rainy','cold','high','strong','warm','change'> and
negative output.
G3 = [['sunny', ?, ?, ?, ?, ?], [?, 'warm', ?, ?, ?, ?], [?, ?, ?, ?, ?, ?],
      [?, ?, ?, ?, ?, ?], [?, ?, ?, ?, ?, ?], [?, ?, ?, ?, ?, 'same']]
S3 = S2

For instance 4: <'sunny','warm','high','strong','cool','change'> and
positive output.
G4 = G3
S4 = ['sunny','warm',?,'strong', ?, ?]

At last, by synchronizing G4 and S4, the algorithm produces the output.
Output:
G = [['sunny', ?, ?, ?, ?, ?], [?, 'warm', ?, ?, ?, ?]]
S = ['sunny','warm',?,'strong', ?, ?]
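The trace above can be reproduced with a short implementation. This is a sketch that follows the conventions used here: '0' plays the role of Null in S, and negative examples specialize G using only attribute values taken from S:

```python
def candidate_elimination(examples):
    """Candidate Elimination: maintain the specific boundary S and the
    general boundary G, shrinking the version space with each example."""
    n = len(examples[0][0])

    def covers(h, x):
        return all(a == '?' or a == v for a, v in zip(h, x))

    S = ['0'] * n                        # most specific ('0' = Null)
    G = [['?'] * n]                      # most general
    for x, label in examples:
        if label == 'yes':
            # Generalize S minimally so that it covers x.
            S = [v if s in ('0', v) else '?' for s, v in zip(S, x)]
            # Drop general hypotheses that reject the positive example.
            G = [g for g in G if covers(g, x)]
        else:
            # Specialize every general hypothesis that covers the negative
            # example, using attribute values appearing in S.
            new_G = []
            for g in G:
                if not covers(g, x):
                    new_G.append(g)
                    continue
                for i in range(n):
                    if g[i] == '?' and S[i] not in ('0', '?') and S[i] != x[i]:
                        specialized = list(g)
                        specialized[i] = S[i]
                        new_G.append(specialized)
            G = new_G
    return S, G

data = [
    (('sunny', 'warm', 'normal', 'strong', 'warm', 'same'), 'yes'),
    (('sunny', 'warm', 'high', 'strong', 'warm', 'same'), 'yes'),
    (('rainy', 'cold', 'high', 'strong', 'warm', 'change'), 'no'),
    (('sunny', 'warm', 'high', 'strong', 'cool', 'change'), 'yes'),
]
S, G = candidate_elimination(data)
print(S)  # ['sunny', 'warm', '?', 'strong', '?', '?']
print(G)  # [['sunny', '?', '?', '?', '?', '?'], ['?', 'warm', '?', '?', '?', '?']]
```

The final S and G match the hand-worked output: the hypothesis [?, ?, ?, ?, ?, 'same'] created by instance 3 is dropped when instance 4 (a positive example with Forecast = change) arrives.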
The Candidate Elimination Algorithm (CEA) is an improvement over
the Find-S algorithm for classification tasks. While CEA shares some
similarities with Find-S, it also has some essential differences that
offer advantages and disadvantages. Here are some advantages and
disadvantages of CEA in comparison with Find-S:
Advantages of CEA over Find-S:
Improved accuracy: CEA considers both positive and negative
examples to generate the hypothesis, which can result in higher
accuracy when dealing with noisy or incomplete data.
Flexibility: CEA can handle more complex classification tasks, such
as those with multiple classes or non-linear decision boundaries.
More efficient: CEA reduces the number of hypotheses by generating
a set of general hypotheses and then eliminating them one by one.
This can result in faster processing and improved efficiency.
Better handling of continuous attributes: CEA can handle continuous
attributes by creating boundaries for each attribute, which makes it
more suitable for a wider range of datasets.
Disadvantages of CEA in comparison with Find-S:
More complex: CEA is a more complex algorithm than Find-S, which
may make it more difficult for beginners or those without a strong
background in machine learning to use and understand.
Higher memory requirements: CEA requires more memory to store
the set of hypotheses and boundaries, which may make it less suitable
for memory-constrained environments.
Slower processing for large datasets: CEA may become slower for
larger datasets due to the increased number of hypotheses generated.
Higher potential for overfitting: The increased complexity of CEA
may make it more prone to overfitting on the training data, especially
if the dataset is small or has a high degree of noise.
6. Biased Hypothesis Space

A biased hypothesis space refers to a restricted set of hypotheses that
a learning algorithm considers during the learning process. This
restriction can be due to several factors:
Representation Limitations: The hypothesis space may be limited by
the choice of representation used to define hypotheses. For example,
in decision tree learning, the structure and depth of the tree determine
the hypotheses that can be represented.
Computational Constraints: Some algorithms may impose restrictions
on the complexity or size of hypotheses due to computational
limitations. Neural networks, for instance, may restrict the number of
hidden layers or neurons to manage computational resources.
Inductive Bias: Bias can also refer to the assumptions or preferences
built into the learning algorithm itself. This bias influences the types
of hypotheses considered more plausible or preferred over others,
based on prior knowledge or assumptions about the data.
Biased hypothesis spaces can expedite learning by focusing on more
likely hypotheses but may risk missing the true concept if it lies
outside the predefined biases.
7. Unbiased Learner
An unbiased learner, on the other hand, is one that does not favor any
particular hypothesis or hypothesis space over others without strong
empirical support from the data. Key characteristics of an unbiased
learner include:
Exploratory: It explores a wide range of hypotheses or hypothesis
spaces without prematurely discarding options that could potentially
fit the data well.
Data-Driven: It relies heavily on empirical evidence from the data to
determine the validity and accuracy of hypotheses, rather than relying
on preconceived biases or assumptions.

Flexible: It can adapt to different types of data distributions and
complexities without being overly constrained by predefined
structures or assumptions.
Unbiased learners are often desirable in scenarios where the true
underlying concept is not well-understood or where the data may
exhibit unexpected patterns that do not conform to initial
assumptions.
Relationship
The relationship between biased hypothesis spaces and unbiased
learners highlights a balance in machine learning. While biased
hypothesis spaces can provide computational efficiency and faster
convergence to solutions, unbiased learners can potentially discover
more diverse and accurate hypotheses that better capture the
underlying data distribution.
In practice, many machine learning algorithms exhibit some degree of
bias in their hypothesis spaces to manage computational complexity
and improve performance. However, the goal is often to strike a
balance that allows for effective learning while remaining open to
discovering new, unexpected insights from the data.
8. Bias-free Learning
Bias-free learning aims to eliminate or minimize biases that can arise
in the learning process. This includes biases introduced by the choice
of hypothesis space, feature selection, or algorithm design. Key
principles of bias-free learning include:
Equal Treatment: Ensuring that all relevant features or aspects of the
data are considered without favoring certain attributes over others
based on subjective judgments or assumptions.
Fairness: Ensuring fairness in the learning process by avoiding
discrimination against certain groups or classes within the data. This
is particularly important in applications involving sensitive attributes
such as race, gender, or socioeconomic status.
Transparency: Making the learning process transparent and
understandable, so that biases can be identified and mitigated
effectively.
Bias-free learning is crucial in domains where fairness, transparency,
and accountability are paramount, such as in healthcare, finance, and
social sciences.
9. Active Queries
Active learning refers to a learning paradigm where the algorithm
actively selects which data instances (queries) to label or query from
an oracle (typically a human annotator or domain expert). Active
queries are particularly useful in scenarios where labeled data is
scarce or expensive to obtain.
Query Selection: Active queries are chosen strategically to maximize
learning efficiency. This may involve selecting instances that the

learner is uncertain about or instances that are likely to reduce
uncertainty in the learned model.
Improving Performance: By actively choosing which data to query,
active learning can significantly reduce the number of labeled
instances needed to achieve a certain level of performance compared
to passive learning approaches.
Applications: Active learning is applied in various domains such as
document classification, image recognition, and medical diagnosis,
where labeled data is limited or costly.
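The query-selection idea above is most often realized as uncertainty sampling: ask for labels on the instances the current model is least sure about. A minimal sketch, assuming a binary model whose predict_proba returns P(positive) for an instance (the pool contents and scores below are illustrative):

```python
def uncertainty_sample(pool, predict_proba, k=1):
    """Pick the k unlabeled instances the model is least certain about:
    in the binary case, those with probability closest to 0.5."""
    return sorted(pool, key=lambda x: abs(predict_proba(x) - 0.5))[:k]

# Toy "model": the probability for each point is just a stored score.
pool = {"a": 0.95, "b": 0.52, "c": 0.10, "d": 0.48}
picked = uncertainty_sample(list(pool), lambda x: pool[x], k=2)
print(picked)  # the two most uncertain instances
```

The chosen instances ("b" and "d" here, with scores near 0.5) would be sent to the oracle for labeling before retraining.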
Integration and Benefits
Integrating unbiased learning principles with active queries can
enhance the effectiveness and fairness of machine learning systems:
Reduced Bias: By actively seeking diverse instances for labeling,
active learning can help mitigate biases that may arise from
imbalanced data or biased training samples.
Efficient Learning: Active learning can accelerate the learning
process by focusing on informative data points, while unbiased
learning principles ensure that the learned model remains objective
and adaptable to different data distributions.
In summary, unbiased learners and bias-free learning aim to reduce bias and
promote fairness in machine learning, while active queries through
active learning strategies can enhance learning efficiency and
effectiveness by judiciously selecting data for labeling. Integrating
these concepts can lead to more robust and equitable machine
learning systems.
10. Mistake Bound Model / PAC Model
In the mistake-bound model, an online learning algorithm receives a
series of examples one by one. Each time it makes an error, it is
informed and must correct itself. The goal is to minimize the total
number of mistakes made.

Specifically, the learner aims to make a finite number of mistakes
(denoted as M) for any sequence of examples.
The mistake-bound model is useful for scenarios where immediate
feedback is available, such as online classification tasks.
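The mistake-bound setting is often illustrated with the Halving algorithm, which predicts by majority vote over the surviving version space and makes at most log2 |H| mistakes, since every mistake eliminates at least half of the hypotheses. A toy sketch (the threshold class and example stream below are illustrative choices):

```python
def halving_predict_and_update(version_space, x, true_label):
    """One round of the Halving algorithm: predict the majority vote of the
    surviving hypotheses, then eliminate those that answered wrongly."""
    votes = sum(h(x) for h in version_space)
    prediction = votes * 2 >= len(version_space)
    surviving = [h for h in version_space if h(x) == true_label]
    return prediction, surviving

# Toy class H: threshold functions h_t(x) = (x >= t) for t in 0..7;
# the hidden target is t = 5, so log2(|H|) = 3 bounds the mistakes.
H = [(lambda x, t=t: x >= t) for t in range(8)]
mistakes = 0
for x, y in [(3, False), (6, True), (4, False), (5, True)]:
    pred, H = halving_predict_and_update(H, x, y)
    mistakes += pred != y
print(mistakes, len(H))
```

On this stream only one mistake is made (well under the log2 8 = 3 bound), and the version space shrinks to the single correct threshold.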
PAC Learning Model:
The PAC learning model focuses on generalization performance. It
aims to find a hypothesis that approximates the target concept well
based on a finite sample of labeled examples.
In PAC learning:
Probably: The hypothesis should be accurate with high probability.
Approximately: The hypothesis should approximate the target
concept.
Correct: The hypothesis should have low error on unseen examples.
The PAC learning strategy places an upper bound on the probability
of making an error by ensuring a minimum number of examples are
used for learning.
Relating the Models:
Theorem: If an algorithm learns a concept class C in the mistake-
bound model, then it also learns C in the PAC model.
The proof involves assuming that the algorithm is conservative
(changes its hypothesis only when it makes a mistake) and
constructing an algorithm that behaves similarly in both models.
While transforming an algorithm into a lazy one (changing hypotheses
only on mistakes) is not strictly necessary, it simplifies the proof.
PAC Learning Model: Definition
Given a probability distribution P on X, a concept C and a hypothesis H,
define the error of H:
err(H) = P(C Δ H) = P(c(x) ≠ h(x))
Formally: err(h) = err(H) (h is the description of H).
We say that an algorithm PAC-learns concept class C if, for any C ∈ C,
an arbitrary distribution P on X, and arbitrary numbers 0 < ε, δ < 1,
the algorithm, which receives a poly(1/ε, 1/δ, n) number of i.i.d.
examples from P(X), outputs with probability at least 1 − δ a
hypothesis h such that err(h) ≤ ε.
If such an algorithm exists, we call C PAC-learnable.
If an algorithm PAC-learns C and runs in poly(1/ε, 1/δ, n) time, we say
it PAC-learns C efficiently and we call C efficiently PAC-learnable.
PAC Learning Conjunctions (Computational Learning Theory: PAC Model)
Use the generalization algorithm for PAC learning: provide m examples
to it, run it as if online, keep the last h.
Let Pic(z) be the probability that literal z
(z ∈ {x1, ¬x1, x2, ¬x2, ..., xn, ¬xn}) is inconsistent with a random
example drawn from P(X). Then
err(h) = P(at least one literal in h is inconsistent) ≤ Σz Pic(z).
Call z bad if Pic(z) ≥ ε/2n. So if h has no bad literals, then
err(h) ≤ Σz ε/2n = 2n · ε/2n = ε.
The probability that a bad literal z "survived" (was consistent with)
one random example is 1 − Pic(z) ≤ 1 − ε/2n. The probability that z
survived m such i.i.d. examples is thus at most (1 − ε/2n)^m. So the
probability that one of the 2n possible bad literals survived m i.i.d.
examples is at most
2n(1 − ε/2n)^m ≤ 2n e^(−mε/2n),
because of the general inequality 1 − x ≤ e^(−x) for x ≥ 0.
To satisfy the PAC-learning conditions, we need 2n e^(−mε/2n) < δ;
after rearrangement:
m ≥ (2n/ε)(ln 2n + ln(1/δ)).
Thus m ≤ poly(1/ε, 1/δ, n) examples suffice to make err(h) ≤ ε with
probability at least 1 − δ. So the generalization algorithm PAC-learns
conjunctions.
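The elimination-style generalization algorithm for conjunctions can be simulated to see this in action. A minimal sketch; the target concept, sample sizes, and seed below are illustrative choices, not from the source:

```python
import random

def learn_conjunction(examples, n):
    """Generalization algorithm: start with all 2n literals and delete every
    literal that a positive example violates (negatives are ignored)."""
    h = {(i, b) for i in range(n) for b in (True, False)}
    for x, label in examples:
        if label:
            h = {(i, b) for (i, b) in h if x[i] == b}
    return h

def predict(h, x):
    return all(x[i] == b for (i, b) in h)

random.seed(0)
n = 10
target = {(0, True), (3, False)}                 # target: x0 AND (not x3)
def true_label(x): return all(x[i] == b for (i, b) in target)

train = [tuple(random.random() < 0.5 for _ in range(n)) for _ in range(500)]
h = learn_conjunction([(x, true_label(x)) for x in train], n)

test = [tuple(random.random() < 0.5 for _ in range(n)) for _ in range(2000)]
err = sum(predict(h, x) != true_label(x) for x in test) / len(test)
print(err)
```

Because deleted literals are exactly those violated by some positive example, the learned h always contains the target's literals, so it can only err on false negatives; with a few hundred examples the measured error is near zero, consistent with the m ≥ (2n/ε)(ln 2n + ln(1/δ)) bound.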
Mistake-Bound Learnability Implies PAC-Learnability
Any mistake-bound learner L can be transformed into a PAC-learner.
Let M ≤ poly(n) be the mistake bound of L. Call L lazy if it changes
its hypothesis h only following a mistake. If L is not lazy, make it
lazy (prevent changing h after correct decisions). Run L on the example
set but halt if any hypothesis h survives more than (1/ε) ln(M/δ)
consecutive examples. Output h. Observe that this will terminate within
m = (M/ε) ln(M/δ) examples. (Otherwise more than M mistakes would be
made.)
The probability that err(h) > ε is at most
M(1 − ε)^((1/ε) ln(M/δ)) < M e^(−ln(M/δ)) = M · δ/M = δ.
Since M ≤ poly(n) (a condition of MB learning), also
m = (M/ε) ln(M/δ) ≤ poly(1/ε, 1/δ, n).
So all PAC-learning conditions are satisfied: we have
m ≤ poly(1/ε, 1/δ, n), and err(h) ≤ ε with probability at least 1 − δ.
PAC-Learning Implies Consistency
Although err(h) > 0 is allowed, the output h of a PAC-learner is
necessarily consistent with all the training examples (zero "training
error").
Assume that, given training set {x1, x2, ..., xm}, the algorithm
outputs an h inconsistent with some xj (1 ≤ j ≤ m). The distribution
P(x) and the numbers ε, δ are arbitrary, so set them such that
∏(i=1..m) P(xi) > δ (implying that P(xj) > 0);
ε < P(xj) (which can be done because P(xj) > 0).
So with probability > δ the algorithm will output h such that
err(h) ≥ P(xj) > ε, i.e., it does not PAC-learn.
Consistency + Polynomial ln |H| Imply PAC-Learning
An algorithm using hypothesis class H is C-consistent if, given an
arbitrary example set from an arbitrary concept C ∈ C, it returns an
h ∈ H consistent with the example set. H ⊇ C is a necessary condition
for C-consistency.
A C-consistent algorithm using H PAC-learns C if ln |H| ≤ poly(n).
Why? The probability that a given bad h (err(h) > ε) survives (i.e.,
is consistent with) a random example is at most (1 − ε). The
probability that h survives m i.i.d. examples is at most (1 − ε)^m,
so the probability that one of the bad hypotheses h ∈ H survives is at
most
|H|(1 − ε)^m ≤ |H| e^(−mε).
To make this smaller than δ, it suffices to set the number of examples
to
m = (1/ε) ln(|H|/δ),
which is ≤ poly(1/ε, 1/δ, n) iff ln |H| ≤ poly(n). Compare this to the
similar result in the mistake-bound model (Halving algorithm).
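The finite-|H| sample bound above is easy to evaluate numerically. A sketch; the class size 3^n + 1 for conjunctions and the ε, δ values below are illustrative choices:

```python
from math import ceil, log

def consistent_learner_bound(ln_H, eps, delta):
    """m = (1/eps) * ln(|H|/delta): enough i.i.d. examples for any
    consistent learner over a finite class H to be (eps, delta)-PAC."""
    return ceil((ln_H + log(1.0 / delta)) / eps)

# Conjunctions over n boolean variables: |H| = 3^n + 1 (each variable
# appears positively, negatively, or not at all), so ln|H| ≈ n ln 3.
n = 10
m = consistent_learner_bound(n * log(3), eps=0.1, delta=0.05)
print(m)  # 140
```

So for 10 boolean attributes, 140 examples already suffice for any consistent learner to reach error at most 0.1 with probability at least 0.95 under this bound.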
Consistency + Polynomial VC(H) Imply PAC-Learning
Using VC(H), a bound can be established even for |H| = ∞: with
probability at least 1 − δ, no bad hypothesis h ∈ H survives m i.i.d.
examples, where
m ≥ (8/ε)(VC(H) ln(16/ε) + ln(2/δ)).
(We omit the proof.) Thus a C-consistent algorithm using H PAC-learns
C if VC(H) ≤ poly(n). For example, let C = half-planes in R^n:
|H| = ∞ but VC(H) = n + 1 ≤ poly(n).
k-Decision Trees
A (binary) decision tree is a binary tree-graph: non-leaf vertices are
binary variables, leaves are class indicators. Classification: go from
the root to a leaf, choosing the path according to the truth-values of
the variables. k-DT = decision trees of maximum depth k. Like k-term
DNF, finding a consistent k-DT is NP-hard (proof omitted).

k-DT thus cannot be PAC-learned both efficiently and properly.
[Figure: an example 3-decision tree over variables v3 and v5.]
PAC-Learning k-Decision Trees Efficiently
Every k-DT has an equivalent k-DNF: for every path going from the root
to a 1-leaf, add to the DNF a k-conjunction of all variables on the
path (v3 ∨ ¬v3 v5 for the example). Thus k-DT ⊆ k-DNF, and C = k-DT
can be efficiently (but not properly) PAC-learned using H = k-DNF.
Note that also k-DT ⊆ k-CNF: create a clause for each path to a 0-leaf
(v3 ∨ v5 for the example).
PAC-Learning k-Decision Trees Properly
We will show that lg |k-DT| ≤ poly(n). Denote c_k = |k-DT|.
c_1 = 2 (two options for the single vertex = leaf), so lg c_1 = 1. (1)
c_{k+1} = n c_k² (n options for the root vertex, c_k options for each
of the 2 subtrees), so lg c_{k+1} = lg n + 2 lg c_k. (2)
(1) and (2) form a recursive formula for a geometric series in the
variable lg c_k = lg |k-DT|. The solution is exponential in k but
polynomial in n. So C = k-DT can be properly (but not efficiently)
PAC-learned by a C-consistent algorithm.
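The recursion for lg c_k can be checked directly. A small sketch; the n and k values below are illustrative:

```python
from math import log2

def lg_num_kdt(n, k):
    """Iterate lg c_1 = 1 and lg c_{k+1} = lg n + 2 lg c_k. The closed
    form lg c_k = 2^(k-1) + (2^(k-1) - 1) lg n is exponential in k but
    polynomial in n for fixed k."""
    lg_c = 1.0
    for _ in range(k - 1):
        lg_c = log2(n) + 2 * lg_c
    return lg_c

print(lg_num_kdt(2, 3))   # closed form: 4 + 3*1  = 7.0
print(lg_num_kdt(16, 4))  # closed form: 8 + 7*4  = 36.0
```

Since the consistent-learner sample bound needs only ln |H| ≤ poly(n), this count is what makes proper (if inefficient) PAC-learning of k-DT possible.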
Inconsistent Learning
Returning a hypothesis consistent with the training set may not be
possible, for reasons such as: H ⊉ C; C is not known ('agnostic
learning'), so H ⊉ C cannot be excluded; there is 'noise' in the data,
so the training set may include the same instance as both a positive
and a negative example.
Define the training error êrr(h) as the proportion of training
examples inconsistent with h. êrr(h) is also called the empirical
risk. We are interested in the relationship between err(h) and êrr(h).

Hoeffding Inequality
Hoeffding: let {z1, z2, ..., zm} be a set of i.i.d. samples from P(z)
on {0, 1}. The probability that |P(1) − (1/m) Σ(i=1..m) zi| > ε is at
most 2e^(−2ε²m).
Let zi = 1 iff i.i.d. example xi is misclassified by h. So
P(1) = err(h) and (1/m) Σ(i=1..m) zi = êrr(h). Thus for a given h,
|err(h) − êrr(h)| > ε with probability at most 2e^(−2ε²m).
Error Bound for Inconsistent Learning
For a finite H, the probability that |err(h) − êrr(h)| > ε for some
h ∈ H is at most |H| · 2e^(−2ε²m). We want to make this no greater
than δ. Solving δ = |H| · 2e^(−2ε²m) for ε gives
ε = sqrt((1/(2m)) ln(2|H|/δ)).
So with probability at least 1 − δ, the difference between err(h) and
êrr(h) is at most the value above for all h ∈ H. Dilemma: a large H
makes it possible to achieve a small êrr(h) but means a loose bound on
err(h).
Sample Complexity for Inconsistent Learning
Solving δ = |H| · 2e^(−2ε²m) instead for m gives
m = (1/(2ε²)) ln(2|H|/δ),
which is thus a number of examples sufficient to make
|err(h) − êrr(h)| ≤ ε with probability at least 1 − δ for all h ∈ H.
Again m ≤ poly(1/ε, 1/δ, n) iff ln |H| ≤ poly(n).
Error Bound for ERM
Assume the learner returns h = argmin over h ∈ H of êrr(h). This is
called empirical risk minimization (the ERM principle). Let
h* = argmin over h ∈ H of err(h), i.e., h* is the best hypothesis, and
let m = (1/(2ε²)) ln(2|H|/δ). Then with probability at least 1 − δ:
err(h) ≤ êrr(h) + ε      (which we just proved)
       ≤ êrr(h*) + ε     (because h minimizes êrr)
       ≤ err(h*) + 2ε    (because êrr(h*) ≤ err(h*) + ε)
Bias-Variance Trade-Off
Put differently, with probability at least 1 − δ:
err(h) ≤ min over h ∈ H of err(h) + 2 · sqrt((1/(2m)) ln(2|H|/δ)).
A large H means large variance and small bias: the first summand is
lower, the second larger. Too large an H leads to overfitting; too
small an H to underfitting. The more training data (m), the larger an
H can be 'afforded'.
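The uniform two-sided bound is easy to evaluate numerically. A sketch; the class size and sample count below are illustrative choices:

```python
from math import log, sqrt

def generalization_gap(H_size, m, delta):
    """With probability at least 1 - delta, |err(h) - training error(h)| is
    at most this bound simultaneously for all h in a finite class of size
    H_size (from the Hoeffding + union bound argument)."""
    return sqrt(log(2 * H_size / delta) / (2 * m))

# |H| = 3^10 (roughly, conjunctions over 10 variables), 1000 examples:
gap = generalization_gap(3 ** 10, m=1000, delta=0.05)
print(round(gap, 3))  # 0.086
```

Doubling m shrinks the gap only by a factor of sqrt(2), while |H| enters only through its logarithm, which is the bias-variance trade-off in numbers.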

11. Overview of issues regarding data sources, success criteria.
Lack of Record Uniqueness: Organizations often end up storing
multiple records for the same entity due to the vast number and
variety of applications used to capture, manage, store, and use data.
For instance, customer interactions across websites, social media,
sales, billing, and marketing can lead to duplicate records. Without
systematic identification and merging of customer identities, datasets
become cluttered with duplicates.
Inaccurate Data: Human error, outdated information, and incorrect
data contribute to data quality issues. Ensuring accurate data entry and
regular updates is crucial to maintaining reliable datasets.
Data Literacy Skills Gap: Organizations sometimes lack the necessary
data literacy skills. Employees need to understand data quality
concepts and practices to prevent errors and improve data reliability.
Incomplete Data: Missing or incomplete data can hinder decision-
making. Organizations must address gaps in their datasets to ensure
comprehensive and reliable information.
Data Consistency: Inconsistencies arise when the same data element
is represented differently across various sources. Standardizing
formats, units, and terminology helps maintain consistency.
Data Integrity: Ensuring data integrity involves preventing
unauthorized changes, deletions, or corruption. Implementing access
controls and audit trails is essential.
Timeliness: Outdated data can lead to poor decision-making. Regular
updates and real-time data integration are critical for maintaining data
relevance.
Data Validity: Validating data against predefined rules or constraints
helps identify invalid entries. For example, ensuring that birthdates
fall within a reasonable range.
Data Completeness: Missing values or incomplete records impact data
quality. Organizations should validate and fill in missing information.

Duplicate Data: Duplicate records waste storage space and create
confusion. Detecting and merging duplicates is essential.
Data Accuracy: Accuracy refers to how closely data reflects reality.
Errors in measurements, calculations, or data entry affect accuracy.
Data Relevance: Irrelevant data adds noise to datasets. Regularly
reviewing and removing irrelevant information is crucial.
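The duplicate-detection point above can be sketched as key-based merging. This is a minimal illustration; the field names and normalization rules are hypothetical, and production entity resolution typically adds fuzzy matching and survivorship rules:

```python
def dedupe(records, key_fields):
    """Merge records that share a normalized key (lowercased, stripped
    values of the chosen fields); later non-empty values fill in gaps."""
    merged = {}
    for rec in records:
        key = tuple(str(rec.get(f, "")).strip().lower() for f in key_fields)
        merged.setdefault(key, {}).update({k: v for k, v in rec.items() if v})
    return list(merged.values())

customers = [
    {"name": "Ada Lovelace", "email": "ADA@example.com", "phone": ""},
    {"name": "ada lovelace ", "email": "ada@example.com", "phone": "555-0100"},
]
result = dedupe(customers, ["name", "email"])
print(result)  # one merged record, with the phone number filled in
```

Normalizing the key before comparison is what collapses the two spellings of the same customer into a single record.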
