Unit III Notes
The Origins of Machine Learning. Uses and Abuses of Machine Learning. How do
Machines Learn? - Abstraction and Knowledge Representation, Generalization.
Assessing the Success of Learning. 4 Steps to Apply Machine Learning to Data.
Choosing a Machine Learning Algorithm - Thinking about the Input Data,
Thinking about Types of Machine Learning Algorithms, Matching Data to an
Appropriate Algorithm.
Machine Learning: the study of algorithms that improve their performance at some task with experience.
• Role of Computer Science: efficient algorithms to solve the optimization problem and to represent and evaluate the model for inference.
● Modern ML (2000s–present):
Uses (Benefits)
Risks / Misuses
● Bias and Discrimination: Models can reflect or amplify biases in training data.
● Overfitting & Misinterpretation: Models perform well on training data but fail in
real-world settings.
Machines learn by identifying patterns and relationships in data through algorithms. One
common approach is supervised learning, where a model learns from labeled data to map
inputs to outputs—for instance, predicting house prices using historical records. In contrast,
unsupervised learning deals with unlabeled data, where the model discovers hidden
structures such as customer groups through clustering. Another important method is
reinforcement learning, where a model interacts with its environment and improves its
performance based on rewards or penalties, as seen in AlphaGo mastering the game of Go.
A powerful subset of machine learning is deep learning, which employs multi-layered
neural networks to capture high-level abstractions, making it effective in tasks like speech
recognition and image classification. Overall, the process of machine learning involves
feeding input data into the system, extracting meaningful features, training a model,
evaluating its performance, and finally using it for prediction or decision-making.
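As a concrete illustration of this workflow, here is a minimal supervised-learning sketch in Python with scikit-learn. The "house price" data, feature names, and model choice are invented assumptions for demonstration, not part of these notes.

```python
# Minimal supervised-learning sketch: input data -> features -> train -> evaluate -> predict.
# The synthetic housing data below is purely illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
area = rng.uniform(50, 250, size=200)          # square metres (synthetic)
bedrooms = rng.integers(1, 6, size=200)        # number of bedrooms (synthetic)
price = 1500 * area + 10000 * bedrooms + rng.normal(0, 5000, size=200)

X = np.column_stack([area, bedrooms])          # input features
y = price                                      # labels (target values)

# Train on part of the data, hold the rest out for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_train, y_train)                  # training step
rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5   # evaluation step
print(f"Test RMSE: {rmse:.0f}")

# Prediction / decision-making step on a new, unseen example
print(model.predict([[120.0, 3]]))
```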
1. Abstraction
The process of reducing complexity by focusing on the essential features of the data or problem while ignoring irrelevant details.
Example: Representing an image as “edges, corners, and objects” instead of raw pixels.
Together, abstraction and knowledge representation (KR, i.e., the way learned information is structured and stored, for example as rules, trees, or networks) help AI systems reason, learn, and make decisions effectively.
2. Generalization
The ability of a machine learning model to apply what it learned from training data to new,
unseen data. Good generalization means the model captures underlying patterns rather than
memorizing (overfitting).
Example: A sentiment analysis model trained on movie reviews should correctly classify
unseen product reviews.
Evaluating how well a machine learning system has learned is a crucial step to ensure its
effectiveness. The process begins with measuring training performance, where the model is
checked to see how well it fits the given training data. However, good training results alone
are not enough, so the next step is to evaluate validation accuracy using unseen validation
data; this helps in fine-tuning hyperparameters and preventing overfitting. After validation,
the model’s testing or deployment performance is assessed on completely independent test
data using metrics such as accuracy, precision, recall, F1-score, or RMSE, depending on the
type of problem. Finally, a generalization check is performed to confirm that the model can
handle real-world data or cross-domain scenarios effectively, rather than only performing
well on the dataset it was trained on. Together, these steps ensure that the learning process
results in a robust and reliable machine learning model.
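The following sketch walks through these assessment steps on a built-in scikit-learn dataset. The dataset, the train/validation/test split sizes, and the choice of logistic regression are illustrative assumptions.

```python
# Assessment sketch: training, validation-based tuning, and independent test metrics.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

X, y = load_breast_cancer(return_X_y=True)

# Split into training, validation, and test sets
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_hold, y_hold, test_size=0.5, random_state=0)

def build(C):
    # Scale the features, then fit a logistic-regression classifier
    return make_pipeline(StandardScaler(), LogisticRegression(C=C, max_iter=1000))

# Validation accuracy guides the choice of the regularisation strength C
best_C = max((0.01, 0.1, 1.0, 10.0),
             key=lambda C: accuracy_score(y_val, build(C).fit(X_train, y_train).predict(X_val)))

# Final assessment on completely independent test data
y_pred = build(best_C).fit(X_train, y_train).predict(X_test)
print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("F1-score :", f1_score(y_test, y_pred))
```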
o Gather raw data, clean it (remove noise, handle missing values), and transform it into a usable format.
o Assess model performance using metrics, optimize it, and deploy it into real-world systems for use.
Input Data (a loading sketch in Python follows this list):
• Database Data
• Other kinds of data, say, time-related or sequence data, data streams, spatial data, engineering design data, hypertext and multimedia data, graph and network data, the Web
• CSV Files
• Excel Files
• JSON Files
• SQL Databases
• Web APIs
• Web Scraping
• Streaming Data
• Pie Charts: Show parts of a whole (use with caution due to limited accuracy).
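As a rough illustration of reading the data sources listed above, the sketch below uses pandas. All file names, the table name, and the API URL are hypothetical placeholders, and streaming data and web scraping are omitted.

```python
# Loading common input-data formats with pandas; all paths/URLs are placeholders.
import pandas as pd
import sqlite3
import requests  # assumed to be installed for the Web API example

df_csv = pd.read_csv("data.csv")        # CSV file
df_xls = pd.read_excel("data.xlsx")     # Excel file (needs an Excel engine such as openpyxl)
df_json = pd.read_json("data.json")     # JSON file

# SQL database (a local SQLite file as a stand-in for any SQL source)
conn = sqlite3.connect("example.db")
df_sql = pd.read_sql("SELECT * FROM customers", conn)

# Web API returning JSON records
payload = requests.get("https://api.example.com/records").json()
df_api = pd.DataFrame(payload)
```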
Data processing is defined as the procedure of extracting information from huge sets of data. In other words, we can say that data processing is the process of extracting knowledge from data.
• Data processing is one of the most useful techniques that help entrepreneurs, researchers, and individuals extract valuable information from huge sets of data. The knowledge discovery process includes data cleaning, data integration, data selection, data transformation, data processing, pattern evaluation, and knowledge presentation.
• Data processing is the act of automatically searching large stores of information to find trends and patterns that go beyond simple analysis procedures.
• Data processing utilizes complex mathematical algorithms to segment the data and evaluate the probability of future events. Data processing is also called Knowledge Discovery of Data (KDD).
Business understanding:
• First, you need to understand the business and client objectives. You need to define what your client wants (which, many times, even they do not know themselves).
• Using the business objectives and the current scenario, define your data processing goals.
• A good data processing plan is very detailed and should be developed to accomplish both the business and data processing goals.
Data understanding:
In this phase, a sanity check on the data is performed to check whether it is appropriate for the data processing goals.
• First, data is collected from multiple data sources available in the organization.
• These data sources may include multiple databases, flat files, or data cubes. Issues like object matching and schema integration can arise during the data integration process. It is a quite complex and tricky process, as data from various sources is unlikely to match easily. For example, table A contains an entity named cust_no, whereas another table B contains an entity named cust-id.
• Therefore, it is quite difficult to ensure whether both of these objects refer to the same value. Metadata should be used to reduce errors in the data integration process (a small integration sketch follows this list).
• Next, the step is to search for the properties of the acquired data. A good way to explore the data is to answer the data processing questions (decided in the business phase) using query, reporting, and visualization tools.
• Based on the results of the queries, the data quality should be ascertained. Any missing data should be acquired.
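A small pandas sketch of the object-matching issue mentioned above (cust_no vs. cust-id): the two toy tables and their values are invented, and the merge simply assumes metadata has told us the two columns describe the same customer key.

```python
# Integrating two sources whose key columns are named differently.
import pandas as pd

table_a = pd.DataFrame({"cust_no": [101, 102, 103], "city": ["Pune", "Delhi", "Chennai"]})
table_b = pd.DataFrame({"cust-id": [101, 103, 104], "total_spend": [2500, 400, 980]})

# Metadata tells us cust_no and cust-id refer to the same entity,
# so we integrate the sources by joining on them explicitly.
merged = table_a.merge(table_b, left_on="cust_no", right_on="cust-id", how="inner")
print(merged)
```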
Data preparation:
• The data preparation process consumes about 90% of the time of the project.
• The data from different sources should be selected, cleaned, transformed, formatted,
anonymized, and constructed (if required).
• Data cleaning is a process to “clean” the data by smoothing noisy data and filling in
missing values.
• For example, in a customer demographics profile, the age data may be missing. The data is incomplete and should be filled in. In some cases, there could be data outliers; for instance, age has a value of 300. Data could also be inconsistent; for instance, the name of the same customer appears differently in different tables. (A combined cleaning-and-transformation sketch follows the transformation list below.)
• Data transformation operations change the data to make it useful for data processing. The following transformations can be applied:
Data transformation:
Data transformation operations contribute toward the success of the data processing process.
Aggregation: Summary or aggregation operations are applied to the data, e.g., weekly sales data is aggregated to calculate monthly and yearly totals.
Generalization: In this step, low-level data is replaced by higher-level concepts with the help of concept hierarchies. For example, the city is replaced by the county.
Normalization: Normalization is performed when the attribute data are scaled up or scaled down. Example: data should fall in the range -2.0 to 2.0 post-normalization.
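The sketch below combines the cleaning and transformation ideas above: filling a missing age, treating the impossible outlier (age = 300), and normalizing the attribute into the range -2.0 to 2.0. The tiny table and the cap of 100 years are illustrative assumptions.

```python
# Cleaning (missing value, outlier) followed by normalization to [-2.0, 2.0].
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

customers = pd.DataFrame({"age": [25, np.nan, 41, 300, 33]})

# Data cleaning: fill the missing value and clip the out-of-range outlier
customers["age"] = customers["age"].fillna(customers["age"].median())
customers["age"] = customers["age"].clip(upper=100)

# Normalization: scale the attribute so it falls within -2.0 to 2.0
scaler = MinMaxScaler(feature_range=(-2.0, 2.0))
customers["age_scaled"] = scaler.fit_transform(customers[["age"]]).ravel()
print(customers)
```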
Modeling
• Based on the business objectives, suitable modeling techniques should be selected for
the prepared dataset.
• Create a scenario to test and check the quality and validity of the model.
Evaluation:
In this phase, patterns identified are evaluated against the business objectives.
• Results generated by the data processing model should be evaluated against the business objectives.
Deployment:
Data Processing Techniques:
1. Classification:
2. Clustering:
3. Regression:
4. Association Rules:
5. Outlier detection:
This type of data processing technique refers to the observation of data items in the dataset which do not match an expected pattern or expected behavior. The technique can be used in a variety of domains, such as intrusion detection, fraud or fault detection, etc. Outlier detection is also called Outlier Analysis or Outlier Processing (a brief sketch follows this list).
6. Sequential Patterns:
7. Prediction:
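As a brief illustration of outlier detection (technique 5 above), the sketch below applies scikit-learn's IsolationForest to synthetic transaction amounts; the data and the contamination setting are assumptions made for the example.

```python
# Outlier detection sketch: flag anomalous transaction amounts.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)
amounts = rng.normal(loc=50, scale=10, size=(200, 1))     # typical transactions
amounts = np.vstack([amounts, [[500.0], [750.0]]])        # two suspicious outliers

detector = IsolationForest(contamination=0.01, random_state=1).fit(amounts)
labels = detector.predict(amounts)                        # -1 marks outliers
print("flagged as outliers:", amounts[labels == -1].ravel())
```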
The significant components of data processing systems are a data source, a data processing engine, a data warehouse server, a pattern evaluation module, a graphical user interface, and a knowledge base.
Data Source:
• The actual sources of data are databases, data warehouses, the World Wide Web (WWW), text files, and other documents. You need a huge amount of historical data for data processing to be successful. Organizations typically store data in databases or data warehouses.
• Data warehouses may comprise one or more databases, text files, spreadsheets, or other repositories of data. Sometimes, even plain text files or spreadsheets may contain information. Another primary source of data is the World Wide Web, or the internet.
Different processes:
• Before passing the data to the database or data warehouse server, the data must be cleaned, integrated, and selected. As the information comes from various sources and in different formats, it cannot be used directly for the data processing procedure, because the data may not be complete and accurate.
• So, the data first needs to be cleaned and unified. More information than needed will be collected from various data sources, and only the data of interest has to be selected and passed to the server.
• These procedures are not as easy as we might think. Several methods may be performed on the data as part of selection, integration, and cleaning.
• The data processing engine is a major component of any data processing system. It contains several modules for performing data processing tasks, including association, characterization, classification, clustering, prediction, and time-series analysis.
• In other words, we can say that the data processing engine is the root of the data processing architecture. It comprises the instruments and software used to obtain insights and knowledge from data collected from various data sources and stored within the data warehouse.
Steps in the KDD process:
1. Building up an Understanding of the Application Domain
This is the initial preliminary step. It sets the scene for understanding what should be done with the various decisions (transformation, algorithms, representation, etc.). The people in charge of the KDD project need to understand and define the goals of the end-user and the environment in which the knowledge discovery process will occur (this involves relevant prior knowledge).
2. Selecting and Creating the Data Set on which Discovery will be Performed
Once the objectives are defined, the data that will be utilized for the knowledge discovery process should be determined. This incorporates discovering what data is accessible, obtaining additional important data, and then integrating all of it into one data set, including the attributes that will be considered for the process. This step is important because data processing learns and discovers from the accessible data; this is the evidence base for building the models. If some significant attributes are missing, the entire study may be unsuccessful; in this respect, the more attributes that are considered, the better. On the other hand, organizing, collecting, and operating advanced data repositories is expensive, so there is a trade-off with the opportunity for best understanding the phenomena. This trade-off is an aspect where the interactive and iterative nature of KDD takes place: it begins with the best available data sets and later expands them, observing the effect in terms of knowledge discovery and modeling.
3. Preprocessing and Cleaning
In this step, data reliability is improved. It incorporates data cleaning, for example handling missing values and removing noise or outliers. It might involve complex statistical techniques, or the use of a data processing algorithm in this context. For example, when one suspects that a specific attribute is of insufficient reliability or has many missing values, this attribute could become the target of a supervised data processing algorithm: a prediction model for the attribute is created, and the missing values can then be predicted (a sketch follows below). The extent to which one pays attention to this level depends on numerous factors. Regardless, studying these aspects is significant and often revealing in itself with respect to enterprise data frameworks.
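A sketch of the idea just described, predicting missing values of an unreliable attribute with a supervised model: the synthetic columns (income, tenure_years, credit_score) and the random-forest choice are illustrative assumptions.

```python
# Train on complete rows, then predict the attribute for rows where it is missing.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
df = pd.DataFrame({
    "income": rng.normal(50000, 12000, size=300),
    "tenure_years": rng.uniform(0, 30, size=300),
})
df["credit_score"] = 300 + 0.005 * df["income"] + 8 * df["tenure_years"] + rng.normal(0, 20, 300)
df.loc[rng.choice(300, size=60, replace=False), "credit_score"] = np.nan  # inject missing data

known = df[df["credit_score"].notna()]
missing = df[df["credit_score"].isna()]

model = RandomForestRegressor(random_state=2).fit(
    known[["income", "tenure_years"]], known["credit_score"])
df.loc[missing.index, "credit_score"] = model.predict(missing[["income", "tenure_years"]])
print("remaining missing values:", df["credit_score"].isna().sum())
```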
4. Data Transformation
In this stage, appropriate data for data processing is prepared and developed. Techniques here incorporate dimension reduction (for example, feature selection and extraction, and record sampling) and attribute transformation (for example, discretization of numerical attributes and functional transformations). This step can be essential for the success of the entire KDD project, and it is typically very project-specific. For example, in medical assessments, the quotient of attributes may often be the most significant factor, and not each attribute by itself. In business, we may need to consider effects beyond our control as well as efforts and transient issues, for example when studying the effect of advertising accumulation. However, if we do not use the right transformation at the start, we may obtain a surprising effect that hints at the transformation required in the next iteration. Thus, the KDD process feeds back on itself and prompts an understanding of the transformation required.
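The sketch below illustrates two of the attribute transformations mentioned above: deriving a quotient of attributes and discretizing a numerical attribute. The medical-style columns (weight, height, BMI bands) are hypothetical examples.

```python
# Attribute transformation sketch: a ratio feature plus discretization into bands.
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
patients = pd.DataFrame({
    "weight_kg": rng.uniform(50, 110, size=100),
    "height_m": rng.uniform(1.5, 1.95, size=100),
})

# Quotient of attributes: the ratio may matter more than either value alone
patients["bmi"] = patients["weight_kg"] / patients["height_m"] ** 2

# Discretization: replace the continuous value with interval labels
patients["bmi_band"] = pd.cut(patients["bmi"],
                              bins=[0, 18.5, 25, 30, np.inf],
                              labels=["under", "normal", "over", "obese"])
print(patients.head())
```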
5. Choosing the Appropriate Data Processing Task
We are now prepared to decide on which kind of data processing task to use, for example classification, regression, or clustering. Most data processing techniques depend on inductive learning, where a model is built explicitly or implicitly by generalizing from an adequate number of training examples. The fundamental assumption of the inductive approach is that the trained model applies to future cases. The technique also takes into account the level of meta-learning for the specific set of accessible data.
6. Choosing the Data Processing Algorithm
Having chosen the task, we now decide on the strategy. This stage incorporates choosing a particular method to be used for searching for patterns, which may include multiple inducers. For example, considering precision versus understandability, the former is better with neural networks, while the latter is better with decision trees. For each strategy of meta-learning, there are several possibilities for how it can be applied. Meta-learning focuses on clarifying what causes a data processing algorithm to be successful or not on a specific problem. Thus, this methodology attempts to understand the conditions under which a data processing algorithm is most suitable. Each algorithm has parameters and strategies of learning, such as ten-fold cross-validation or another division of the data for training and testing.
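As one way to act on this step, the sketch below compares a decision tree (more understandable) with a small neural network (often more precise) using ten-fold cross-validation; the dataset and model settings are illustrative assumptions.

```python
# Compare two candidate algorithms with ten-fold cross-validation.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

tree = DecisionTreeClassifier(random_state=0)
net = make_pipeline(StandardScaler(), MLPClassifier(max_iter=2000, random_state=0))

for name, model in [("decision tree", tree), ("neural network", net)]:
    scores = cross_val_score(model, X, y, cv=10)        # ten-fold cross-validation
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```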
7. Applying the Data Processing Algorithm
At last, the implementation of the algorithm is reached. In this stage, we may need to run the algorithm several times until a satisfying outcome is obtained, for example by tuning the algorithm's control parameters, such as the minimum number of instances in a single leaf of a decision tree.
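A sketch of this tuning loop, trying several values of a decision tree's minimum samples per leaf with a grid search; the dataset and the parameter grid are illustrative choices.

```python
# Tune a decision tree's control parameter by running it several times via grid search.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"min_samples_leaf": [1, 2, 5, 10, 20]},
    cv=5,
)
search.fit(X, y)
print("best min_samples_leaf:", search.best_params_["min_samples_leaf"])
print("best cross-validated accuracy:", round(search.best_score_, 3))
```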
8. Evaluation
In this step, we assess and interpret the mined patterns and rules, and their reliability, with respect to the objectives defined in the first step. Here we consider the preprocessing steps with respect to their impact on the data processing algorithm's results; for example, we may add a feature in step 4 and repeat from there. This step focuses on the comprehensibility and utility of the induced model. The identified knowledge is also recorded for further use. The last step is the use of, and overall feedback on, the discovery results acquired through data processing.
• Misuse Detection
• Anomaly Detection
• Stock trading
• Customer segmentation
• Churn prediction (Churn prediction is one of the most popular Big Data use cases in
business)
• Data processing concepts are used in sales and marketing to provide better customer service, improve cross-selling opportunities, and increase direct-mail response rates.
• The risk assessment and fraud area also uses data processing concepts for identifying inappropriate or unusual behavior, etc.
Education: For analyzing the education sector, data processing uses Educational Data Processing (EDM) methods.
• Curriculum development
• Retail Industry
• Telecommunication Industry
• Intrusion Detection
Banking and Finance
The financial data in the banking and financial industry is generally reliable and of high quality, which facilitates systematic data analysis and data processing. Some of the typical use cases are as follows −
• Design and construction of data warehouses for multidimensional data analysis and data processing.
Retail Industry
Data processing has great application in the retail industry because the industry collects large amounts of data on sales, customer purchasing history, goods transportation, consumption, and services. It is natural that the quantity of data collected will continue to expand rapidly because of the increasing ease, availability, and popularity of the web.
Data processing in the retail industry helps in identifying customer buying patterns and trends, which leads to improved quality of customer service and good customer retention and satisfaction. Here is a list of examples of data processing in the retail industry −
• Customer Retention.
Telecommunication Industry
Today the telecommunication industry is one of the most rapidly emerging industries, providing various services such as fax, pager, cellular phone, internet messenger, images, e-mail, and web data transmission.
Biological Data Analysis
In recent times, we have seen tremendous growth in the field of biology, in areas such as genomics, proteomics, functional genomics, and biomedical research. Biological data processing is a very important part of bioinformatics. Following are the aspects in which data processing contributes to biological data analysis −
Scientific Applications
The applications discussed above tend to handle relatively small and homogeneous data sets for which statistical techniques are appropriate. Huge amounts of data have also been collected from scientific domains such as geosciences and astronomy, and large data sets are being generated by fast numerical simulations in fields such as climate and ecosystem modeling, chemical engineering, and fluid dynamics. Following are the applications of data processing in the field of scientific applications −
• Graph-based Processing.
Intrusion Detection
Intrusion refers to any kind of action that threatens the integrity, confidentiality, or availability of network resources. In this world of connectivity, security has become a major issue. With the increased usage of the internet and the availability of tools and tricks for intruding into and attacking networks, intrusion detection has become increasingly important.
User Interface: The knowledge discovered using data processing tools is useful only if it is interesting and, above all, understandable by the user. Good visualization eases the interpretation of data processing results and helps users better understand their requirements. To obtain good visualization, much research is being carried out for big data sets so that mined knowledge can be displayed and manipulated.
(iii) Processing Methodology Challenges: These challenges are related to data processing approaches and their limitations. Processing approaches that cause problems include the following.
Different approaches may perform differently depending on the data under consideration. Some algorithms require noise-free data, yet most data sets contain exceptions and invalid or incomplete information, which leads to complications in the analysis process and in some cases compromises the precision of the results.
(i) Complex data types: The database can include complex data elements, objects with graphical data, spatial data, and temporal data. Processing all these kinds of data is not practical on a single device.
(ii) Processing from Varied Sources: The data is gathered from different sources on a network. The data sources may be of different kinds depending on how the data is stored, such as structured, semi-structured, or unstructured.
Performance: The performance of the data processing system depends on the efficiency of the algorithms and techniques being used. If the algorithms and techniques designed are not up to the mark, the performance of the data processing process suffers.
(i) Efficiency and Scalability of the Algorithms: The data processing algorithms must be efficient and scalable to extract information from huge amounts of data in the database.
(ii) Improvement of Processing Algorithms: Factors such as the enormous size of the database, the entire data flow, and the difficulty of data processing approaches inspire the creation of parallel and distributed data processing algorithms.
Algorithm choice for classification:
• If the data is linearly separable → Logistic Regression or a Linear SVM works well.
• If the data has non-linear boundaries → Decision Trees or Random Forests perform better.
• If the dataset is very large and complex → Neural Networks (Deep Learning) may be chosen.
Algorithm choice for regression:
• If the relationship is linear → Linear Regression suffices.
• If the data shows non-linear trends → Polynomial Regression or Support Vector Regression (SVR).
• If the data is high-dimensional and complex → Ensemble methods (Gradient Boosting, Random Forest Regression).
Algorithm choice for clustering:
• If the clusters are spherical and of similar size → K-Means is effective.
• If the clusters have arbitrary shapes → DBSCAN is better.
• If the data is hierarchical in nature → Hierarchical Clustering gives better insight.
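A short sketch of the clustering guidance above: on two-moons data, whose clusters have arbitrary shapes, DBSCAN typically recovers the structure that K-Means misses. The dataset and DBSCAN's eps value are illustrative assumptions.

```python
# Compare K-Means and DBSCAN on non-spherical (two-moons) clusters.
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import adjusted_rand_score

X, y_true = make_moons(n_samples=400, noise=0.05, random_state=0)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Agreement with the true moon membership (1.0 = perfect recovery)
print("K-Means ARI:", round(adjusted_rand_score(y_true, kmeans_labels), 3))
print("DBSCAN  ARI:", round(adjusted_rand_score(y_true, dbscan_labels), 3))
```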
Remaining topics: refer to the handwritten material.