Ai2 Lecture Notes
By
H. Ateeq Ahmed, [Link], (Ph.D)
Asst. Professor of CSE,
Kurnool.
NOT FOR
SALE
Artificial Intelligence 1
UNIT- I
FOUNDATIONS OF AI
WHAT IS AI?
Artificial Intelligence (AI) is a branch of science which deals with helping machines
find solutions to complex problems in a more human-like fashion. This generally involves
borrowing characteristics from human intelligence and applying them as algorithms in a
computer-friendly way. A more or less flexible or efficient approach can be taken depending
on the requirements established, which influences how artificial the intelligent behaviour
appears. AI is generally associated with Computer Science, but it has many important links
with other fields such as Maths, Psychology, Cognition, Biology and Philosophy, among
many others. Our ability to combine knowledge from all these fields will ultimately benefit
our progress in the quest to create an intelligent artificial being.
AI currently encompasses a huge variety of subfields, from general-purpose areas
such as perception and logical reasoning, to specific tasks such as playing chess, proving
mathematical theorems, writing poetry, and diagnosing diseases. Often, scientists in other
fields move gradually into artificial intelligence, where they find the tools and vocabulary to
systematize and automate the intellectual tasks on which they have been working all their
lives. Similarly, workers in AI can choose to apply their methods to any area of human
intellectual endeavour. In this sense, it is truly a universal field.
HISTORY OF AI
The origin of artificial intelligence lies in the earliest days of machine computation.
During the 1940s and 1950s, AI began to grow with the emergence of the modern computer.
Among the first researchers to attempt to build intelligent programs were Newell and Simon.
Their first well-known program, the Logic Theorist, proved statements using
the accepted rules of logic and a problem-solving procedure of their own design. By the late
fifties, programs existed that could do a passable job of translating technical documents, and it
was seen as only a matter of extra databases and more computing power to apply the
techniques to less formal, more ambiguous texts. Most problem-solving work revolved
around the work of Newell, Shaw and Simon on the General Problem Solver (GPS).
Unfortunately, GPS did not fulfil its promise, and not because of some simple lack of
computing capacity. In the 1970s the most important concept of AI was developed, known as
the Expert System, which represents the knowledge of an expert as a set of rules. The application area
of expert systems is very large. The 1980s saw the development of neural networks as a
method of learning from examples.
Prof. Peter Jackson (University of Edinburgh) classified the history of AI into three
periods as:
1. Classical
2. Romantic
3. Modern
1. Classical Period:
It started in 1950. In 1956, the concept of Artificial Intelligence came into existence.
During this period, the main research work carried out included game playing, theorem
proving and the state space approach for solving a problem.
2. Romantic Period:
It started in the mid 1960s and continued until the mid 1970s. During this period people
were interested in making machines understand, by which is usually meant the understanding of
natural language. During this period the knowledge representation technique "semantic net"
was developed.
3. Modern Period:
It started in 1970 and continues to the present day. In this period, work turned to solving
more complex problems. It includes research on both the theoretical and practical
aspects of Artificial Intelligence, and saw the birth of concepts like Expert
Systems, Artificial Neurons, Pattern Recognition etc. Research on the various advanced
concepts of Pattern Recognition and Neural Networks is still going on.
Components of AI
There are three types of components in AI
1) Hardware Components of AI
a) Pattern Matching
b) Logic Representation
c) Symbolic Processing
d) Numeric Processing
e) Problem Solving
f) Heuristic Search
g) Natural Language processing
h) Knowledge Representation
i) Expert System
j) Neural Network
k) Learning
l) Planning
m) Semantic Network
2) Software Components
a) Machine Language
b) Assembly language
c) High level Language
d) LISP Language
e) Fourth generation Language
f) Object Oriented Language
g) Distributed Language
h) Natural Language
i) Particular Problem Solving Language
YouTube: Engineering Drive By: H. Ateeq Ahmed
3) Architectural Components
a) Uniprocessor
b) Multiprocessor
c) Special Purpose Processor
d) Array Processor
e) Vector Processor
f) Parallel Processor
g) Distributed Processor
DEFINITIONS OF AI
1. AI is a field of study that encompasses computational techniques for performing tasks that
apparently require intelligence when performed by humans.
2. AI is the branch of computer science that is concerned with the automation of intelligent
behaviour. AI is based upon the principles of computer science, namely the data structures used
in knowledge representation, the algorithms needed to apply that knowledge, and the
languages and programming techniques used in their implementation.
3. AI is the field of study that seeks to explain and emulate intelligent behaviour in terms of
computational processes.
4. AI is the part of computer science concerned with designing intelligent computer systems,
that is, computer systems that exhibit the characteristics we associate with intelligence in
human behaviour, such as understanding language, learning, reasoning and solving problems.
5. AI is the study of the computations that make it possible to perceive, reason, and act.
6. AI is the exciting new effort to make computers think: machines with minds, in the full and
literal sense.
7. AI is concerned with developing computer systems that can store knowledge and
effectively use that knowledge to help solve problems and accomplish tasks. This brief
statement sounds a lot like one of the commonly accepted goals in the education of humans:
we want students to learn (gain knowledge) and to learn to use this knowledge to help solve
problems and accomplish tasks.
Deep Learning
Also within ML, deep learning takes inspiration from the activity of neurons within the brain
to learn how to recognize complex patterns in data. This is done through the use of
algorithms based mainly on statistical calculations. The word 'deep' refers to the large number of
layers of neurons that the model learns simultaneously, which helps it acquire rich representations of
the data and obtain performance gains.
Natural Language Processing (NLP)
Natural language processing is the mechanism by which machines acquire the ability to
analyze, understand, and manipulate textual data. 2019 was a great year for NLP, with Google
AI's BERT and Transformer, the Allen Institute's ELMo, OpenAI's Transformer, Ruder and
Howard's ULMFiT, and finally Microsoft's MT-DNN. All of these have shown that pre-trained
language models can substantially improve performance on a wide variety of NLP
tasks.
INTELLIGENT AGENTS
AGENTS
An AI system is composed of an agent and its environment. The agents act in their
environment. The environment may contain other agents.
An agent is anything that can perceive its environment through sensors and act upon that
environment through effectors.
• A human agent has sensory organs such as eyes, ears, nose, tongue and skin for the
sensors, and other organs such as hands, legs and mouth for effectors.
• A robotic agent substitutes cameras and infrared range finders for the sensors, and
various motors and actuators for effectors.
• A software agent has encoded bit strings as its percepts and actions.
Agent Terminology
• Performance Measure of Agent − the criterion that determines how
successful an agent is.
• Behavior of Agent − the action that the agent performs after any given sequence of
percepts.
• Percept − the agent's perceptual input at a given instant.
• Percept Sequence − the history of all that the agent has perceived to date.
• Agent Function − a map from the percept sequence to an action.
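The terminology above can be made concrete with a small sketch. The code below is a hypothetical illustration (the table contents and names are invented, not from the text) of an agent function realized as a table lookup from percept sequences to actions:

```python
# Hypothetical sketch of the agent-function idea: a table-driven agent
# that maps the percept sequence seen so far to an action.

def make_table_driven_agent(table):
    percepts = []                        # the percept sequence (history so far)

    def agent(percept):
        percepts.append(percept)
        # agent function: percept sequence -> action ("NoOp" if not in table)
        return table.get(tuple(percepts), "NoOp")

    return agent

# Illustrative vacuum-world table: (location, status) percepts -> action
table = {
    (("A", "Dirty"),): "Suck",
    (("A", "Clean"),): "Right",
    (("A", "Clean"), ("B", "Dirty")): "Suck",
}
agent = make_table_driven_agent(table)
```

Each call supplies one percept; the agent itself accumulates the percept sequence, exactly the distinction drawn later between the agent mapping and the agent program.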
This leads to a definition of an ideal rational agent: For each possible percept sequence,
an ideal rational agent should do whatever action is expected to maximize its performance
measure, on the basis of the evidence provided by the percept sequence and whatever built-in
knowledge the agent has.
ENVIRONMENTS
In this section, we will see how to couple an agent to an environment. In all cases,
however, the nature of the connection between them is the same: actions are done by the
agent on the environment, which in turn provides percepts to the agent. First, we will describe
the different types of environments and how they affect the design of agents. Then we will
describe environment programs that can be used as testbeds for agent programs.
Properties of environments
Environments come in several flavors. The principal distinctions to be made are as
follows:
Environment programs
The generic environment program in Figure 2.14 illustrates the basic relationship
between agents and environments. In this book, we will find it convenient for many of the
examples and exercises to use an environment simulator that follows this program structure.
The simulator takes one or more agents as input and arranges to repeatedly give each agent
the right percepts and receive back an action. The simulator then updates the environment
based on the actions, and possibly other dynamic processes in the environment that are not
considered to be agents (rain, for example). The environment is therefore defined by the
initial state and the update function. Of course, an agent that works in a simulator ought also
to work in a real environment that provides the same kinds of percepts and accepts the same
kinds of actions.
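The simulator structure described above can be sketched in a few lines. The function and parameter names below are illustrative placeholders, not the book's actual Figure 2.14 code:

```python
def run_environment(state, update_fn, agents, percept_fn, steps):
    """Generic environment simulator sketch: repeatedly give each agent
    its percept, collect the actions, and update the environment state.
    All names here are illustrative assumptions."""
    for _ in range(steps):
        # each agent receives its percept and returns an action
        actions = [agent(percept_fn(state, i)) for i, agent in enumerate(agents)]
        # the environment is defined by the initial state and the update function
        state = update_fn(state, actions)
    return state

# Toy example: one agent increments a counter until it reaches 3
agent = lambda percept: "inc" if percept < 3 else "stop"
final = run_environment(
    state=0,
    update_fn=lambda s, acts: s + 1 if acts[0] == "inc" else s,
    agents=[agent],
    percept_fn=lambda s, i: s,
    steps=10,
)
```

Dynamic processes that are not agents (such as rain) would be folded into `update_fn` in this sketch.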
Turing Test
• The success of the intelligent behavior of a system can be measured with the Turing Test.
• Two persons and the machine to be evaluated participate in the test. One of the two
persons plays the role of the tester. Each of them sits in a different room. The
tester is unaware of who is the machine and who is the human. He interrogates them
by typing questions and sending them to both intelligences, from which he receives
typed responses.
• The test aims at fooling the tester. If the tester fails to distinguish the machine's response
from the human response, then the machine is said to be intelligent.
So far, we have talked about agents by describing their behavior—the action that is
performed after any given sequence of percepts. Now, we will have to bite the bullet and talk
about how the insides work.
The job of AI is to design the agent program: a function that implements the agent
mapping from percepts to actions. We assume this program will run on some sort of
computing device, which we will call the architecture. Obviously, the program we choose has
to be one that the architecture will accept and run. The architecture might be a plain
computer, or it might include special-purpose hardware for certain tasks, such as processing
camera images or filtering audio input. It might also include software that provides a degree
of insulation between the raw computer and the agent program, so that we can program at a
higher level. In general, the architecture makes the percepts from the sensors available to the
program, runs the program, and feeds the program's action choices to the effectors as they are
generated.
The relationship among agents, architectures, and programs can be summed up as
follows:
agent = architecture + program
Before we design an agent program, we must have a pretty good idea of the possible
percepts and actions, what goals or performance measure the agent is supposed to achieve,
and what sort of environment it will operate in. These come in a wide variety. Figure 2.3
shows the basic elements for a selection of agent types.
It may come as a surprise to some readers that we include in our list of agent types
programs that seem to operate in the entirely artificial environment defined by keyboard input
and character output on a screen. "Surely," one might say, "this is not a real environment, is
it?" In fact, what matters is not the distinction between "real" and "artificial" environments,
but the complexity of the relationship among the behavior of the agent, the percept sequence
generated by the environment, and the goals that the agent is supposed to achieve. Some
"real" environments are actually quite simple. For example, a robot designed to inspect parts
as they come by on a conveyer belt can make use of a number of simplifying assumptions:
that the lighting is always just so, that the only thing on the conveyer belt will be parts of a
certain kind, and that there are only two actions—accept the part or mark it as a reject.
Agent Programs
Intelligent Agents will all have the same skeleton, namely, accepting percepts from an
environment and generating actions. The early versions of agent programs will have a very
simple form (Figure 2.4). Each will use some internal data structures that will be updated as
new percepts arrive. These data structures are operated on by the agent's decision-making
procedures to generate an action choice, which is then passed to the architecture to be
executed.
There are two things to note about this skeleton program. First, even though we
defined the agent mapping as a function from percept sequences to actions, the agent program
receives only a single percept as its input. It is up to the agent to build up the percept
sequence in memory, if it so desires. In some environments, it is possible to be quite
successful without storing the percept sequence, and in complex domains, it is infeasible to
store the complete sequence.
Second, the goal or performance measure is not part of the skeleton program. This is
because the performance measure is applied externally to judge the behavior of the agent, and
it is often possible to achieve high performance without explicit knowledge of the
performance measure.
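The skeleton described above can be sketched as follows. The memory structure and decision procedure here are illustrative placeholders, not the book's actual Figure 2.4 code:

```python
# Sketch of the skeleton agent program: the program receives a single
# percept, folds it into internal memory, and the decision-making
# procedure produces an action choice. Helper names are hypothetical.

def make_agent(update_memory, choose_action):
    memory = {"percepts": []}            # internal data structures

    def agent_program(percept):
        update_memory(memory, percept)   # update memory with the new percept
        return choose_action(memory)     # decision procedure -> action choice

    return agent_program

# Example: store the full percept sequence but act only on the latest percept
agent = make_agent(
    update_memory=lambda m, p: m["percepts"].append(p),
    choose_action=lambda m: "Suck" if m["percepts"][-1] == "Dirty" else "Right",
)
```

Note that, as the text says, the program receives only a single percept; whether the whole sequence is stored (as here) or discarded is up to the agent designer.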
Example
At this point, it will be helpful to consider a particular environment, so that our
discussion can become more concrete. Mainly because of its familiarity, and because it
involves a broad range of skills, we will look at the job of designing an automated taxi driver.
We must first think about the percepts, actions, goals and environment for the taxi.
They are summarized in Figure 2.6 and discussed in turn.
The taxi will need to know where it is, what else is on the road, and how fast it is
going. This information can be obtained from the percepts provided by one or more
controllable TV cameras, the speedometer, and odometer. To control the vehicle properly,
especially on curves, it should have an accelerometer; it will also need to know the
mechanical state of the vehicle, so it will need the usual array of engine and electrical system
sensors. It might have instruments that are not available to the average human driver: a
satellite global positioning system (GPS) to give it accurate position information with respect
to an electronic map; or infrared or sonar sensors to detect distances to other cars and
obstacles. Finally, it will need a microphone or keyboard for the passengers to tell it their
destination.
The actions available to a taxi driver will be more or less the same ones available to a
human driver: control over the engine through the gas pedal and control over steering and
braking. In addition, it will need output to a screen or voice synthesizer to talk back to the
passengers, and perhaps some way to communicate with other vehicles.
What performance measure would we like our automated driver to aspire to?
Desirable qualities include getting to the correct destination; minimizing fuel consumption
and wear and tear; minimizing the trip time and/or cost; minimizing violations of traffic laws
and disturbances to other drivers; maximizing safety and passenger comfort; maximizing
profits. Obviously, some of these goals conflict, so there will be trade-offs involved.
Finally, were this a real project, we would need to decide what kind of driving
environment the taxi will face. Should it operate on local roads, or also on freeways? Will it
be in Southern California, where snow is seldom a problem, or in Alaska, where it seldom is
not? Will it always be driving on the right, or might we want it to be flexible enough to drive
on the left in case we want to operate taxis in Britain or Japan? Obviously, the more restricted
the environment, the easier the design problem.
*******
UNIT- II
SOLVING PROBLEMS BY SEARCHING
In the previous chapter, we saw that simple reflex agents are unable to plan ahead. They
are limited in what they can do because their actions are determined only by the current
percept. Furthermore, they have no knowledge of what their actions do nor of what they are
trying to achieve.
In this chapter, we describe one kind of goal-based agent called a problem-solving
agent. Problem-solving agents decide what to do by finding sequences of actions that lead to
desirable states. We discuss informally how the agent can formulate an appropriate view of
the problem it faces. The problem type that results from the formulation process will depend
on the knowledge available to the agent: principally, whether it knows the current state and
the outcomes of actions. We then define more precisely the elements that constitute a
"problem" and its "solution," and give several examples to illustrate these definitions. Given
precise definitions of problems, it is relatively straightforward to construct a search process
for finding solutions.
EXAMPLE PROBLEMS
The range of task environments that can be characterized by well-defined problems is
vast. We can distinguish between so-called toy problems, which are intended to illustrate or
exercise various problem-solving methods, and so-called real-world problems, which tend to
be more difficult and whose solutions people actually care about. In this section, we will give
examples of both. By nature, toy problems can be given a concise, exact description. This
means that they can be easily used by different researchers to compare the performance of
algorithms. Real-world problems, on the other hand, tend not to have a single agreed-upon
description, but we will attempt to give the general flavor of their formulations.
1. Toy Problems
The 8-puzzle
The 8-puzzle, an instance of which is shown in Figure 3.4, consists of a 3x3 board
with eight numbered tiles and a blank space. A tile adjacent to the blank space can slide into
the space. The object is to reach the configuration shown on the right of the figure. One
important trick is to notice that rather than use operators such as "move the 3 tile into the
blank space," it is more sensible to have operators such as "the blank space changes places
with the tile to its left." This is because there are fewer of the latter kind of operator.
This leads us to the following formulation:
¥ States: a state description specifies the location of each of the eight tiles in one of the
nine squares. For efficiency, it is useful to include the location of the blank.
¥ Operators: blank moves left, right, up, or down.
¥ Goal test: state matches the goal configuration shown in Figure 3.4.
¥ Path cost: each step costs 1, so the path cost is just the length of the path.
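The formulation above can be encoded compactly. In this sketch (the encoding is one reasonable choice, not the book's), a state is a tuple of nine entries read row by row, with 0 standing for the blank:

```python
# 8-puzzle sketch: states are 9-tuples (0 = the blank), and the
# operators move the blank left, right, up, or down.

def successors(state):
    i = state.index(0)                  # location of the blank
    row, col = divmod(i, 3)
    moves = []
    if col > 0: moves.append(i - 1)     # blank moves left
    if col < 2: moves.append(i + 1)     # blank moves right
    if row > 0: moves.append(i - 3)     # blank moves up
    if row < 2: moves.append(i + 3)     # blank moves down
    result = []
    for j in moves:
        s = list(state)
        s[i], s[j] = s[j], s[i]         # swap the blank with the adjacent tile
        result.append(tuple(s))
    return result

start = (1, 2, 3, 4, 0, 5, 6, 7, 8)    # blank in the centre: four operators apply
```

Formulating the operators as blank moves keeps the branching factor small: a centre blank has four successors, an edge blank three, and a corner blank only two.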
The 8-puzzle belongs to the family of sliding-block puzzles. This general class is known
to be NP-complete, so one does not expect to find methods significantly better than the search
algorithms described in this chapter and the next. The 8-puzzle and its larger cousin, the 15-
puzzle, are the standard test problems for new search algorithms in AI.
The 8-queens problem can be defined as follows: Place 8 queens on an (8 by 8) chess board
such that none of the queens attacks any of the others. A configuration of 8 queens on the
board is shown in figure 1, but this does not represent a solution as the queen in the first
column is on the same diagonal as the queen in the last column.
Although efficient special-purpose algorithms exist for this problem and the whole n
queens family, it remains an interesting test problem for search algorithms. There are two
main kinds of formulation. The incremental formulation involves placing queens one by one,
whereas the complete-state formulation starts with all 8 queens on the board and moves them
around. In either case, the path cost is of no interest because only the final state counts;
algorithms are thus compared only on search cost. Thus, we have the following goal test and
path cost:
¥ Goal test: 8 queens on board, none attacked.
¥ Path cost: zero.
There are also different possible states and operators. Consider the following simple-minded
formulation:
¥ States: any arrangement of 0 to 8 queens on board.
¥ Operators: add a queen to any square.
In this formulation, we have 64^8 possible sequences to investigate. A more sensible
choice would use the fact that placing a queen where it is already attacked cannot work,
because subsequent placings of other queens will not undo the attack. So we might try the
following:
¥ States: arrangements of 0 to 8 queens with none attacked.
¥ Operators: place a queen in the left-most empty column such that it is not attacked by
any other queen.
It is easy to see that the actions given can generate only states with no attacks; but
sometimes no actions will be possible. For example, after making the first seven choices (left-
to-right) in Figure 1, there is no action available in this formulation. The search process must
try another choice. A quick calculation shows that there are only 2057 possible sequences to
investigate. The right formulation makes a big difference to the size of the search space.
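The improved incremental formulation above is easy to implement. The sketch below (names are my own) generates only non-attacking states and always places the next queen in the left-most empty column:

```python
# Sketch of the improved incremental 8-queens formulation: a state is a
# tuple of row positions, one per filled column, with no queen attacked.

def not_attacked(placed, row):
    col = len(placed)                   # the left-most empty column
    return all(r != row and abs(r - row) != abs(c - col)
               for c, r in enumerate(placed))

def count_solutions(placed=(), n=8):
    if len(placed) == n:
        return 1                        # all n queens placed, none attacked
    return sum(count_solutions(placed + (row,), n)
               for row in range(n) if not_attacked(placed, row))
```

Because attacked squares are pruned as soon as a queen is placed, the search explores only the small space of non-attacking partial placements rather than 64^8 sequences.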
Similar considerations apply for a complete-state formulation. For example, we could set the
problem up as follows:
¥ States: arrangements of 8 queens, one in each column.
¥ Operators: move any attacked queen to another square in the same column.
This formulation would allow the algorithm to find a solution eventually, but it would be
better to move to an unattacked square if possible.
2. Real-world problems
Route finding
We have already seen how route finding is defined in terms of specified locations and
transitions along links between them. Route-finding algorithms are used in a variety of
applications, such as routing in computer networks, automated travel advisory systems, and
airline travel planning systems. The last application is somewhat more complicated, because
airline travel has a very complex path cost, in terms of money, seat quality, time of day, type
of airplane, frequent-flyer mileage awards, and so on. Furthermore, the actions in the problem
do not have completely known outcomes: flights can be late or overbooked, connections can
be missed, and fog or emergency maintenance can cause delays.
VLSI Layout
The design of silicon chips is one of the most complex engineering design tasks
currently undertaken, and we can give only a brief sketch here. A typical VLSI chip can have
as many as a million gates, and the positioning and connections of every gate are crucial to
the successful operation of the chip. Computer-aided design tools are used in every phase of
the process. Two of the most difficult tasks are cell layout and channel routing. These come
after the components and connections of the circuit have been fixed; the purpose is to lay out
the circuit on the chip so as to minimize area and connection lengths, thereby maximizing
speed. In cell layout, the primitive components of the circuit are grouped into cells, each of
which performs some recognized function. Each cell has a fixed footprint (size and shape)
and requires a certain number of connections to each of the other cells. The aim is to place
the cells on the chip so that they do not overlap and so that there is room for the connecting
wires to be placed between the cells. Channel routing finds a specific route for each wire
using the gaps between the cells. These search problems are extremely complex, but
definitely worth solving.
Robot navigation
Robot navigation is a generalization of the route-finding problem described earlier.
Rather than a discrete set of routes, a robot can move in a continuous space with (in principle)
an infinite set of possible actions and states. For a simple, circular robot moving on a flat
surface, the space is essentially two-dimensional. When the robot has arms and legs that must
also be controlled, the search space becomes many-dimensional. Advanced techniques are
required just to make the search space finite.
SEARCH STRATEGIES
Search Algorithm Terminologies
Search:
Searching is a step-by-step procedure to solve a search problem in a given search space.
A search problem can have three main factors:
1. Search Space: the set of possible solutions which a system may have.
2. Start State: the state from which the agent begins the search.
3. Goal Test: a function which observes the current state and returns whether
the goal state has been achieved or not.
• Search Tree: a tree representation of a search problem is called a search tree. The
root of the search tree is the root node, which corresponds to the initial state.
• Actions: a description of all the actions available to the agent.
• Transition Model: a description of what each action does can be represented as a
transition model.
• Path Cost: a function which assigns a numeric cost to each path.
• Solution: an action sequence which leads from the start node to the goal
node.
• Optimal Solution: a solution that has the lowest cost among all solutions.
1. Breadth-first Search:
o Breadth-first search is the most common search strategy for traversing a tree or graph.
This algorithm searches breadthwise in a tree or graph, so it is called breadth-first
search.
o BFS algorithm starts searching from the root node of the tree and expands all
successor node at the current level before moving to nodes of next level.
o The breadth-first search algorithm is an example of a general-graph search algorithm.
o Breadth-first search is implemented using a FIFO queue data structure.
Advantages:
o BFS will provide a solution if any solution exists.
o If there is more than one solution for a given problem, then BFS will provide the
minimal solution, which requires the least number of steps.
Disadvantages:
o It requires lots of memory, since each level of the tree must be saved in memory to
expand the next level.
o BFS needs lots of time if the solution is far away from the root node.
Example
In the tree structure below, we show the traversal of the tree using the BFS algorithm
from the root node S to the goal node K. The BFS algorithm traverses in layers, so it will follow
the path shown by the dotted arrow, and the traversed path will be:
S---> A--->B---->C--->D---->G--->H--->E---->F---->I---->K
Time Complexity:
The time complexity of BFS can be obtained from the number of nodes traversed by BFS
until the shallowest goal node, where d = the depth of the shallowest solution and b = the
branching factor (the maximum number of successors of any node):
T(b) = 1 + b + b^2 + b^3 + ... + b^d = O(b^d)
Space Complexity:
The space complexity of BFS is given by the memory size of the frontier, which is O(b^d).
Completeness:
BFS is complete, which means if the shallowest goal node is at some finite depth, then BFS
will find a solution.
Optimality:
BFS is optimal if path cost is a non-decreasing function of the depth of the node.
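The BFS strategy above can be sketched directly with a FIFO queue. The graph below is illustrative (it is not the figure's tree, which is unavailable here):

```python
from collections import deque

# Breadth-first search sketch: the frontier is a FIFO queue of paths,
# so the shallowest goal node is found first.

def bfs(graph, start, goal):
    frontier = deque([[start]])         # FIFO queue of paths
    visited = {start}
    while frontier:
        path = frontier.popleft()       # expand the oldest (shallowest) path
        node = path[-1]
        if node == goal:
            return path                 # shallowest solution
        for succ in graph.get(node, []):
            if succ not in visited:
                visited.add(succ)
                frontier.append(path + [succ])
    return None

graph = {"S": ["A", "B"], "A": ["C", "D"], "B": ["E"], "E": ["K"]}
```

Because whole levels are expanded before moving deeper, the returned path always has the fewest steps, matching the optimality condition stated above.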
2. Depth-first Search
o Depth-first search is a recursive algorithm for traversing a tree or graph data structure.
o It is called the depth-first search because it starts from the root node and follows each
path to its greatest depth node before moving to the next path.
o DFS uses a stack data structure for its implementation.
o The process of the DFS algorithm is similar to the BFS algorithm.
Advantage:
o DFS requires much less memory, as it only needs to store the stack of nodes on the
path from the root node to the current node.
o It takes less time than BFS to reach the goal node (if it traverses the
right path).
Disadvantage:
o There is the possibility that many states keep re-occurring, and there is no guarantee
of finding a solution.
o The DFS algorithm goes deep down in its search, and may sometimes descend into an
infinite loop.
Example
In the below search tree, we have shown the flow of depth-first search, and it will follow the
order as:
Root node--->Left node ----> right node.
It will start searching from root node S and traverse A, then B, then D and E; after traversing
E, it will backtrack the tree, as E has no other successor and the goal node has not yet been
found. After backtracking, it will traverse node C and then G, where it will terminate, as it
has found the goal node.
Completeness: DFS search algorithm is complete within finite state space as it will expand
every node within a limited search tree.
Time Complexity: The time complexity of DFS is equivalent to the number of nodes
traversed by the algorithm. It is given by:
T(b) = 1 + b + b^2 + b^3 + ... + b^m = O(b^m)
where m = the maximum depth of any node, which can be much larger than d (the depth of
the shallowest solution).
Space Complexity: The DFS algorithm needs to store only a single path from the root node,
hence the space complexity of DFS is equivalent to the size of the fringe set, which is O(bm).
Optimal: The DFS algorithm is non-optimal, as it may take a large number of steps or incur a
high cost to reach the goal node.
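The DFS strategy above can be sketched with an explicit stack, following the root → left → right order described earlier. The graph is illustrative:

```python
# Depth-first search sketch: the frontier is a LIFO stack of paths, so
# each path is followed to its greatest depth before backtracking.

def dfs(graph, start, goal):
    stack = [[start]]                   # LIFO stack of paths
    visited = set()
    while stack:
        path = stack.pop()              # expand the newest (deepest) path
        node = path[-1]
        if node == goal:
            return path
        if node in visited:
            continue
        visited.add(node)
        # push children in reverse so the left-most child is expanded first
        for succ in reversed(graph.get(node, [])):
            stack.append(path + [succ])
    return None

graph = {"S": ["A", "C"], "A": ["B"], "B": ["D", "E"], "C": ["G"]}
```

On this graph the search descends S, A, B, D, E, backtracks, and only then reaches C and the goal G, mirroring the backtracking behaviour described in the example.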
3. Depth-limited Search
Advantages:
o Depth-limited search is memory efficient.
Disadvantages:
o Depth-limited search also has a disadvantage of incompleteness.
o It may not be optimal if the problem has more than one solution.
Example
Completeness: DLS search algorithm is complete if the solution is above the depth-limit.
Time Complexity: The time complexity of the DLS algorithm is O(b^ℓ).
Space Complexity: The space complexity of the DLS algorithm is O(b×ℓ).
Optimal: Depth-limited search can be viewed as a special case of DFS, and it is also not
optimal even if ℓ>d.
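Depth-limited search is plain DFS with a cutoff, as sketched below on an illustrative graph:

```python
# Depth-limited search sketch: recursive DFS that refuses to expand
# nodes below the depth limit.

def dls(graph, node, goal, limit):
    if node == goal:
        return [node]
    if limit == 0:
        return None                     # cut off: the depth limit is reached
    for succ in graph.get(node, []):
        result = dls(graph, succ, goal, limit - 1)
        if result is not None:
            return [node] + result
    return None

graph = {"A": ["B", "C"], "B": ["D"], "C": ["E"]}
```

With limit 2 the goal D (at depth 2) is found; with limit 1 the search is cut off first, illustrating the incompleteness noted above when the solution lies below the limit.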
4. Uniform-cost Search
Advantages:
o Uniform-cost search is optimal, because at every state the path with the least cost is
chosen.
Disadvantages:
o It does not care about the number of steps involved in the search; it is concerned only
with path cost, due to which this algorithm may get stuck in an infinite loop.
Example
Completeness:
Uniform-cost search is complete: if there is a solution, UCS will find it.
Time Complexity:
Let C* be the cost of the optimal solution, and ε the cost of each step toward the goal node.
Then the number of steps is C*/ε + 1 (we add +1 because we start from state 0 and end at
C*/ε).
Hence, the worst-case time complexity of uniform-cost search is O(b^(1 + [C*/ε])).
Space Complexity:
By the same logic, the worst-case space complexity of uniform-cost search
is O(b^(1 + [C*/ε])).
Optimal:
Uniform-cost search is always optimal as it only selects a path with the lowest path cost.
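Uniform-cost search replaces the FIFO queue with a priority queue ordered by path cost, as in this sketch on an illustrative weighted graph:

```python
import heapq

# Uniform-cost search sketch: the frontier is a priority queue ordered
# by the path cost g(n), so the cheapest path is expanded first.

def ucs(graph, start, goal):
    frontier = [(0, start, [start])]    # (cost so far, node, path)
    explored = set()
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return cost, path           # guaranteed cheapest path to the goal
        if node in explored:
            continue
        explored.add(node)
        for succ, step in graph.get(node, []):
            heapq.heappush(frontier, (cost + step, succ, path + [succ]))
    return None

# Edges carry step costs; the 3-step route S-A-B-G is cheaper than S-B-G.
graph = {"S": [("A", 1), ("B", 5)], "A": [("B", 2)], "B": [("G", 1)]}
```

Note that UCS returns the cheaper 4-cost route even though it has more steps, matching the remark above that the algorithm cares only about path cost.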
5. Iterative Deepening Depth-first Search
Advantages:
o It combines the benefits of the BFS and DFS search algorithms in terms of fast search
and memory efficiency.
Disadvantages:
o The main drawback of IDDFS is that it repeats all the work of the previous phase.
Example
The following tree structure shows the iterative deepening depth-first search.
The IDDFS algorithm performs successive iterations until it finds the goal node.
The iterations performed by the algorithm are given as:
1st iteration: A
2nd iteration: A, B, C
3rd iteration: A, B, D, E, C, F, G
4th iteration: A, B, D, H, I, E, C, F, K, G
In the fourth iteration, the algorithm will find the goal node.
Completeness:
This algorithm is complete if the branching factor is finite.
Time Complexity:
Suppose b is the branching factor and d is the depth; then the worst-case time complexity
is O(b^d).
Space Complexity:
The space complexity of IDDFS will be O(bd).
Optimal:
The IDDFS algorithm is optimal if the path cost is a non-decreasing function of the depth of
the node.
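A minimal Python sketch of IDDFS, with the tree links reconstructed from the iteration listing above (A's children are B and C, B's are D and E, and so on, with goal K a child of F); this is an illustrative assumption about the figure, not code from the notes:

```python
def depth_limited_search(graph, node, goal, limit, path=None):
    """DFS that stops expanding below the given depth limit."""
    if path is None:
        path = [node]
    if node == goal:
        return path
    if limit == 0:          # depth limit reached: do not expand further
        return None
    for child in graph.get(node, []):
        result = depth_limited_search(graph, child, goal, limit - 1, path + [child])
        if result is not None:
            return result
    return None

def iterative_deepening_search(graph, start, goal, max_depth=20):
    """Run DLS with limits 0, 1, 2, ... until the goal is found."""
    for limit in range(max_depth + 1):
        result = depth_limited_search(graph, start, goal, limit)
        if result is not None:
            return result
    return None

# Tree mirroring the iterations above; goal K sits at depth 3
tree = {
    'A': ['B', 'C'],
    'B': ['D', 'E'],
    'C': ['F', 'G'],
    'D': ['H', 'I'],
    'F': ['K'],
}
```

`iterative_deepening_search(tree, 'A', 'K')` repeats the shallow levels on every iteration (the drawback noted above) but finds K on the fourth iteration, returning the path A-C-F-K.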
Bidirectional Search:
Advantages:
o Bidirectional search is fast.
o Bidirectional search requires less memory.
Disadvantages:
o Implementation of the bidirectional search tree is difficult.
o In bidirectional search, one should know the goal state in advance.
Example
In the below search tree, the bidirectional search algorithm is applied. The algorithm divides
the graph/tree into two sub-graphs: it starts traversing from node 1 in the forward direction
and from goal node 16 in the backward direction.
The algorithm terminates at node 9, where the two searches meet.
Heuristic function:
A heuristic is a function used in informed search to find the most promising path. It takes the
current state of the agent as input and produces an estimate of how close the agent is to the
goal. The heuristic method might not always give the best solution, but it is guaranteed to
find a good solution in reasonable time. The heuristic function estimates how close a state is
to the goal. It is represented by h(n), and it estimates the cost of an optimal path between a
pair of states. The value of the heuristic function is always non-negative.
Hence the heuristic cost should be less than or equal to the actual cost: h(n) ≤ h*(n).
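As an illustrative example (the grid setting is an assumption, not from the notes): on a grid where the agent moves one cell at a time, the Manhattan distance is a common admissible heuristic, since it never overestimates the actual number of moves required:

```python
def manhattan_distance(state, goal):
    """h(n): estimated cost from state to goal on a 4-connected grid.
    It never overestimates the true cost, so h(n) <= h*(n)."""
    (x1, y1), (x2, y2) = state, goal
    return abs(x1 - x2) + abs(y1 - y2)
```

For example, from cell (0, 0) to cell (3, 4) the heuristic returns 7, which equals the true cost when no obstacles intervene and is an underestimate otherwise.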
Advantages:
o Best-first search can switch between BFS and DFS, gaining the advantages of both
algorithms.
o This algorithm is more efficient than the BFS and DFS algorithms.
Disadvantages:
o It can behave as an unguided depth-first search in the worst-case scenario.
o It can get stuck in a loop, like DFS.
o This algorithm is not optimal.
Example
Consider the below search problem; we will traverse it using greedy best-first search. At
each iteration, each node is expanded using the evaluation function f(n) = h(n), which is
given in the below table.
In this search example, we use two lists: the OPEN and CLOSED lists.
Following are the iterations for traversing the above example.
2. A* Search Algorithm:
A* search is the most commonly known form of best-first search. It uses the heuristic
function h(n) and the cost g(n) to reach node n from the start state. It combines the features
of UCS and greedy best-first search, which lets it solve the problem efficiently. The A*
search algorithm finds the shortest path through the search space using the heuristic
function. This search algorithm expands a smaller search tree and provides an optimal result
faster. The A* algorithm is similar to UCS except that it uses g(n) + h(n) instead of g(n).
In the A* search algorithm, we use the search heuristic as well as the cost to reach the node.
Hence we can combine both costs as f(n) = g(n) + h(n); this sum is called the fitness number.
Algorithm of A* search:
Step 1: Place the starting node in the OPEN list.
Step 2: Check if the OPEN list is empty or not; if the list is empty, then return failure and
stop.
Step 3: Select the node from the OPEN list which has the smallest value of the evaluation
function (g+h). If node n is the goal node, then return success and stop; otherwise, go to
Step 4.
Step 4: Expand node n, generate all of its successors, and put n into the CLOSED list. For
each successor n', check whether n' is already in the OPEN or CLOSED list; if not, then
compute the evaluation function for n' and place it into the OPEN list.
Step 5: Else, if node n' is already in OPEN or CLOSED, it should be attached to the back
pointer which reflects the lowest g(n') value.
Step 6: Return to Step 2.
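The steps above can be sketched in Python; the example graph and heuristic values are illustrative assumptions, not from the notes, and for simplicity Step 5's back-pointer update is handled by allowing a node to be re-inserted into the OPEN list with a better g value:

```python
import heapq

def a_star_search(graph, h, start, goal):
    """Expand the node with the smallest f(n) = g(n) + h(n)."""
    open_list = [(h(start), 0, start, [start])]  # entries: (f, g, node, path)
    closed = set()
    while open_list:
        f, g, node, path = heapq.heappop(open_list)
        if node == goal:
            return g, path
        if node in closed:
            continue
        closed.add(node)
        for succ, step_cost in graph.get(node, []):
            if succ not in closed:
                g2 = g + step_cost
                heapq.heappush(open_list, (g2 + h(succ), g2, succ, path + [succ]))
    return None

# Illustrative graph and admissible heuristic values (assumed, not from the notes)
graph = {'S': [('A', 1), ('B', 4)], 'A': [('C', 2)], 'B': [('C', 1)], 'C': [('G', 3)]}
h_values = {'S': 5, 'A': 4, 'B': 3, 'C': 2, 'G': 0}
```

`a_star_search(graph, lambda n: h_values[n], 'S', 'G')` finds the least-cost path S-A-C-G with cost 6; because the heuristic never overestimates, the result is optimal, as stated below.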
Advantages:
o The A* search algorithm performs better than other search algorithms.
o A* search algorithm is optimal and complete.
o This algorithm can solve very complex problems.
Disadvantages:
o It does not always produce the shortest path, as it is mostly based on heuristics and
approximation.
o A* search algorithm has some complexity issues.
o The main drawback of A* is memory requirement as it keeps all generated nodes in
the memory, so it is not practical for various large-scale problems.
Example
In this example, we will traverse the given graph using the A* algorithm. The heuristic value
of all states is given in the below table so we will calculate the f(n) of each state using the
formula f(n)= g(n) + h(n), where g(n) is the cost to reach any node from start state.
Here we will use OPEN and CLOSED list.
Solution:
If the heuristic function is admissible, then A* tree search will always find the least cost path.
*******
UNIT- III
KNOWLEDGE REPRESENTATION
Humans are best at understanding, reasoning, and interpreting knowledge. Humans know
things, which is knowledge, and as per their knowledge they perform various actions in the
real world. But how machines do all these things comes under knowledge representation
and reasoning. Hence we can describe knowledge representation as follows:
o Knowledge representation and reasoning (KR, KRR) is the part of Artificial
Intelligence concerned with how AI agents think and how thinking contributes to the
intelligent behavior of agents.
o It is responsible for representing information about the real world so that a computer
can understand it and can utilize this knowledge to solve complex real-world
problems such as diagnosing a medical condition or communicating with humans in
natural language.
o It is also a way which describes how we can represent knowledge in artificial
intelligence. Knowledge representation is not just storing data into some database, but
it also enables an intelligent machine to learn from that knowledge and experiences so
that it can behave intelligently like a human.
What to Represent:
Following are the kinds of knowledge which need to be represented in AI systems:
o Object: All the facts about objects in our world domain. E.g., guitars contain
strings, trumpets are brass instruments.
o Events: Events are the actions which occur in our world.
o Performance: It describes behavior which involves knowledge about how to do
things.
o Meta-knowledge: It is knowledge about what we know.
o Facts: Facts are the truths about the real world and what we represent.
o Knowledge-Base: The central component of knowledge-based agents is the
knowledge base, represented as KB. The knowledge base is a group of sentences
(here, "sentence" is used as a technical term and is not identical to a sentence of the
English language).
ONTOLOGICAL ENGINEERING
Ontology engineering is a set of tasks related to the development of ontologies for a
particular domain. ... One of the approaches for the formal conceptualization of represented
knowledge domains is the use of machine-interpretable ontologies, which provide structured
data in, or based on, RDF, RDFS, and OWL.
The process of building a knowledge base is called knowledge engineering. A
knowledge engineer is someone who investigates a particular domain, determines what
concepts are important in that domain, and creates a formal representation of the objects and
relations in the domain. Often, the knowledge engineer is trained in representation but is not
an expert in the domain at hand, be it circuit design, space station mission scheduling, or
whatever. The knowledge engineer will usually interview the real experts to become educated
about the domain and to elicit the required knowledge, in a process called knowledge
acquisition. This occurs prior to, or interleaved with, the process of creating formal
representations. In this chapter, we will use domains that should already be fairly familiar, so
that we can concentrate on the representational issues involved.
One does not become a proficient knowledge engineer just by studying the syntax and
semantics of a representation language. It takes practice and exposure to lots of examples
before one can develop a good style in any language, be it a language for programming,
reasoning, or communicating. Sections 8.1 and 8.2 discuss the principles and pitfalls of
knowledge engineering. We then show how to represent knowledge in the fairly narrow
domain of electronic circuits in Section 8.3. A number of narrow domains can be tackled by
similar techniques, but domains such as shopping in a supermarket seem to require much
more general representations. In Section 8.4, we discuss ways to represent time, change,
objects, substances, events, actions, money, measures, and so on. These are important
because they show up in one form or another in every domain. Representing these very
general concepts is sometimes called ontological engineering.
systematic; biology aims to provide a taxonomy of all living and extinct species; library
science has developed a taxonomy of all fields of knowledge, encoded as the Dewey Decimal
system; tax authorities and other government departments have developed extensive
taxonomies of occupations and commercial products. Taxonomies are also an important
aspect of general common sense knowledge, as we will see in our investigations that follow.
First-order logic makes it easy to state facts about categories, either by relating objects
to categories or by quantifying over their members:
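For instance, category facts can be written in the standard set-membership notation (the specific objects and predicates here are illustrative assumptions, reconstructed in the style of the omitted formulas):

```latex
% An object is a member of a category
BB_9 \in Basketballs
% One category is a subcategory of another
Basketballs \subset Balls
% All members of a category share a property
\forall x \; (x \in Basketballs \Rightarrow Spherical(x))
```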
EVENTS
Situation calculus is perfect for the vacuum world, the wumpus world, or any world in
which a single agent takes discrete actions. Unfortunately, situation calculus has two
problems that limit its applicability. First, situations are instantaneous points in time, which
are not very useful for describing the gradual growth of a kitten into a cat, the flow of
electrons along a wire, or any other process where change occurs continuously over time.
Second, situation calculus works best when only one action happens at a time. When there
are multiple agents in the world, or when the world can change spontaneously, situation
calculus begins to break down. It is possible to prop it back up for a while by defining
composite actions, as in Exercise 7.12. If there are actions that have different durations, or
whose effects depend on duration, then situation calculus in its intended form cannot be used
at all.
Because of these limitations, we now turn to a different approach toward representing
change, which we call the event calculus, although the name is not standard. Event calculus is
rather like a continuous version of the situation-calculus "movie" shown in Figure 7.3. We
think of a particular universe as having both a "spatial" and a temporal dimension. The
"spatial" dimension ranges over all of the objects in an instantaneous "snapshot" or "cross-
section" of the universe. The temporal dimension ranges over time. An event is, informally,
just a "chunk" of this universe with both temporal and spatial extent. Figure 8.3 gives the
general idea.
well as some knowledge of its own knowledge, lack of knowledge, and inference procedures.
In effect, we want to have a model of the mental objects that are in someone's head (or
something's knowledge base) and of the mental processes that manipulate those mental
objects.
The model should be faithful, but it does not have to be detailed. We do not have to be
able to predict how many milliseconds it will take for a particular agent to make a deduction,
nor do we have to predict what neurons will fire when an animal is faced with a particular
visual stimulus. But we do want an abstract model that says that if a logical agent believes
P ∨ Q and it learns ¬P, then it should come to believe Q.
The first step is to ask how mental objects are represented. That is, if we have a
relation Believes(Agent, x), what kind of thing is x? First of all, it's clear that x cannot be a
logical sentence. If Flies(Superman) is a logical sentence, we can't say Believes(Agent,
Flies(Superman)) because only terms (not sentences) can be arguments of relations. But if
Flies(Superman) is reified as a fluent, then it is a candidate for being a mental object, and
Believes can be a relation that takes an agent and a propositional fluent that the agent believes
in. We could define other relations such as Knows and Wants to express other relationships
between agents and propositions. Relations of this kind are called propositional attitudes.
This appears to give us what we want: the ability for an agent to reason about the
beliefs of agents. Unfortunately, there is a problem with this approach. If Clark and
Superman are one and the same (i.e., Clark = Superman) then Clark flying and Superman
flying are one and the same event. Thus, if the object of propositional attitudes are reified
events, we must conclude that if Lois believes that Superman can fly, she also believes that
Clark can fly, even if she doesn't believe that Clark is Superman. That is, Believes(Lois,
Flies(Superman)) implies Believes(Lois, Flies(Clark)).
There is a sense in which this is right: Lois does believe of a certain person, who
happens to be called Clark sometimes, that that person can fly. But there is another sense in
which this is wrong: if you asked Lois "Can Clark fly?" she would certainly say no. Reified
objects and events work fine for the first sense of Believes, but for the second sense we need
to reify descriptions of those objects and events, so that Clark and Superman can be different
descriptions (even though they refer to the same object).
Technically, the property of being able to freely substitute a term for an equal term is
called referential transparency. In first-order logic, every relation is referentially transparent.
We would like to define Believes (and the other propositional attitudes) as relations whose
second argument is referentially opaque—that is, one cannot substitute an equal term for the
second argument without changing the meaning.
make it easier to describe definitions and properties of categories. It evolved from semantic
networks. The principal inference task for description logics is checking whether one
category is a subset of another by comparing their definitions, that is, by checking whether
the membership criteria are logically certifiable.
as "threshold probability" statements. For example, the default rule "My brakes are always
OK" really means "The probability that my brakes are OK, given no other information, is
sufficiently high that the optimal decision is for me to drive without checking them." When
the decision context changes—for example, when one is driving a heavily laden truck down a
steep mountain road—the default rule suddenly becomes inappropriate, even though there is
no new evidence to suggest that the brakes are faulty.
To date, no default reasoning system has successfully addressed all of these issues.
Furthermore, most systems are formally undecidable, and very slow in practice. There have
been several attempts to subsume default reasoning in a probabilistic system, using the idea
that a default rule is basically a conditional probability of 1 − ε. For reasons already
mentioned, such an approach is likely to require a full integration of decision making before
it fully captures the desirable features of default reasoning.
There are some retail companies that have claimed an amount of success in sales after
deployment of advances in search. Well-known American luxury department store Neiman
Marcus, for example, has increased app usage and customer engagement after implementing
visual search. Other retail brands that are successfully using visual search include [Link],
Nordstrom, and Urban Outfitters.
*******
UNIT- IV
LEARNING FROM EXAMPLES
Learning involves generalization from experience. A computer system is said to learn if it is
able not only to repeat the same task more effectively, but also to perform similar tasks in
the related domain. Learning is possible due to factors like skill refinement and knowledge
acquisition. Skill refinement refers to improving a skill by performing the same task again
and again. If machines are able to improve their skills while handling tasks, they can be said
to have the skill of learning.
What is learning?
o According to Herbert Simon, learning denotes changes in a system that enable the
system to do the same task more efficiently the next time.
FORMS OF LEARNING
1. Rote learning
o Rote learning is possible on the basis of memorization.
o This technique mainly focuses on memorization, avoiding the inner complexities,
so it becomes possible for the learner to recall the stored knowledge.
For example: When a learner learns a poem or song by reciting or repeating it,
without knowing the actual meaning of the poem or song.
o So, in the following fig-a, points (x, y) are given in the plane such that y = f(x), and
the task is to find a function h(x) that fits the points well.
o In fig-b, a piecewise-linear function h is given, while fig-c shows a more
complicated function h.
o Both functions agree with the example points, but differ in the values of y
assigned to other x inputs.
o As shown in fig-d, we have a function that apparently ignores one of the example
points but fits the others with a simple function. The true f is unknown, so there are
many choices for h, but without further knowledge, we have no way to prefer (b),
(c), or (d).
SUPERVISED LEARNING
Supervised learning is the type of machine learning in which machines are trained
using well-"labelled" training data, and on the basis of that data, machines predict the
output. Labelled data means some input data is already tagged with the correct output.
In supervised learning, the training data provided to the machines works as the
supervisor that teaches the machines to predict the output correctly. It applies the same
concept as a student learning under the supervision of a teacher.
Supervised learning is a process of providing input data as well as correct output data
to the machine learning model. The aim of a supervised learning algorithm is to find a
mapping function to map the input variable(x) with the output variable(y).
In the real-world, supervised learning can be used for Risk Assessment, Image
classification, Fraud Detection, spam filtering, etc.
Suppose we have a dataset of different types of shapes, which includes squares, rectangles,
triangles, and polygons. Now the first step is that we need to train the model for each shape.
o If the given shape has four sides, and all the sides are equal, then it will be labelled as
a Square.
o If the given shape has three sides, then it will be labelled as a triangle.
o If the given shape has six equal sides, then it will be labelled as a hexagon.
Now, after training, we test our model using the test set, and the task of the model is to
identify the shape.
The machine is already trained on all types of shapes, and when it finds a new shape, it
classifies the shape on the basis of the number of sides, and predicts the output.
1. Regression
Regression algorithms are used if there is a relationship between the input variable and the
output variable. It is used for the prediction of continuous variables, such as Weather
forecasting, Market Trends, etc. Below are some popular Regression algorithms which come
under supervised learning:
o Linear Regression
o Regression Trees
o Non-Linear Regression
o Bayesian Linear Regression
o Polynomial Regression
2. Classification
Classification algorithms are used when the output variable is categorical, which means there
are two classes such as Yes-No, Male-Female, True-False, etc. Spam filtering is an example
of classification. Below are some popular classification algorithms which come under
supervised learning:
o Random Forest
o Decision Trees
o Logistic Regression
o Support vector Machines
Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes
according to the given conditions.
Branch/Sub Tree: A tree formed by splitting the tree.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
Parent/Child node: The root node of the tree is called the parent node, and other nodes
are called the child nodes.
Example
Hypothesis (h):
A hypothesis is a function that best describes the target in supervised machine learning. The
hypothesis that an algorithm comes up with depends upon the data and also upon the
restrictions and bias that we have imposed on the data. To better understand the hypothesis
space and hypothesis, consider the following coordinate plane that shows the distribution of
some data:
Suppose we have test data for which we have to determine the outputs or results. The test
data is as shown below:
But note here that we could have divided the coordinate plane as:
The way in which the coordinate plane is divided depends on the data, the algorithm, and the
constraints.
All these legal possible ways in which we can divide the coordinate plane to predict the
outcome of the test data compose the hypothesis space.
o Each individual possible way is known as a hypothesis.
o Hence, in this example the hypothesis space would be like:
from the data, seem to be the best approaches to solving these problems. Furthermore, we
need systems that can adapt to changing conditions, that can be user-friendly by adapting to
needs of their individual users, and that can improve performance over time.
Classification
Classification is a process of finding a function which helps in dividing the dataset
into classes based on different parameters. In Classification, a computer program is trained on
the training dataset and based on that training, it categorizes the data into different classes.
The task of the classification algorithm is to find the mapping function to map the
input(x) to the discrete output(y).
Example:
The best example to understand the Classification problem is Email Spam Detection. The
model is trained on the basis of millions of emails on different parameters, and whenever it
receives a new email, it identifies whether the email is spam or not. If the email is spam, then
it is moved to the Spam folder.
Regression
Regression is a process of finding the correlations between dependent and
independent variables. It helps in predicting the continuous variables such as prediction
of Market Trends, prediction of House prices, etc.
The task of the Regression algorithm is to find the mapping function to map the input
variable(x) to the continuous output variable(y).
Example:
Suppose we want to do weather forecasting, so for this, we will use the Regression algorithm.
In weather prediction, the model is trained on the past data, and once the training is
completed, it can easily predict the weather for future days.
Regression vs. Classification:
o In Regression, the output variable must be of continuous nature or real value; in
Classification, the output variable must be a discrete value.
o The task of the regression algorithm is to map the input value (x) to the continuous
output variable (y); the task of the classification algorithm is to map the input value
(x) to the discrete output variable (y).
o Regression algorithms are used with continuous data; classification algorithms are
used with discrete data.
o In Regression, we try to find the best-fit line, which can predict the output more
accurately; in Classification, we try to find the decision boundary, which can divide
the dataset into different classes.
o Regression algorithms can be further divided into Linear and Non-linear Regression;
classification algorithms can be divided into Binary and Multi-class classifiers.
NONPARAMETRIC MODELS
Machine learning algorithms are classified into two distinct groups: parametric and
nonparametric models.
b0 + b1*x1 + b2*x2 = 0
where,
b0, b1, b2 → the coefficients of the line that control the intercept and slope
x1, x2 → input variables
The assumed functional form is always a linear combination of input variables and as such
parametric machine learning algorithms are also frequently referred to as ‘linear machine
learning algorithms.’
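A parametric model of this form can be evaluated in a few lines; the coefficient values below are illustrative assumptions, since in practice they would be learned from training data:

```python
def linear_decision(x1, x2, b0=-1.0, b1=1.0, b2=1.0):
    """Classify a point by which side of the line b0 + b1*x1 + b2*x2 = 0 it falls on."""
    score = b0 + b1 * x1 + b2 * x2
    return 1 if score >= 0 else 0    # class 1 above the line, class 0 below
```

Whatever the training data looks like, this model can only ever represent a straight-line boundary; that fixed functional form is exactly what makes it parametric.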
SVM chooses the extreme points/vectors that help in creating the hyperplane. These
extreme cases are called support vectors, and hence the algorithm is termed a Support
Vector Machine.
Consider the below diagram in which there are two different categories that are
classified using a decision boundary or hyperplane:
Example
SVM can be understood with the example that we used in the KNN classifier. Suppose we
see a strange cat that also has some features of dogs; if we want a model that can accurately
identify whether it is a cat or a dog, such a model can be created by using the SVM
algorithm. We will first train our model with lots of images of cats and dogs so that it can
learn the different features of cats and dogs, and then we test it with this strange creature.
The support vector machine creates a decision boundary between these two classes (cat and
dog) and chooses the extreme cases (support vectors), so it will see the extreme cases of cat
and dog. On the basis of the support vectors, it will classify it as a cat. Consider the below
diagram:
SVM algorithm can be used for Face detection, image classification, text categorization,
etc.
Types of SVM
SVM can be of two types:
o Linear SVM: Linear SVM is used for linearly separable data, which means that if a
dataset can be classified into two classes by using a single straight line, then such data
is termed linearly separable data, and the classifier used is called a Linear SVM
classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separable data, which
means that if a dataset cannot be classified by using a straight line, then such data is
termed non-linear data, and the classifier used is called a Non-linear SVM classifier.
Support Vectors:
The data points or vectors that are closest to the hyperplane and which affect the position
of the hyperplane are termed support vectors. Since these vectors support the hyperplane,
they are called support vectors.
Linear SVM:
The working of the SVM algorithm can be understood by using an example. Suppose we
have a dataset that has two tags (green and blue), and the dataset has two features x1 and x2.
We want a classifier that can classify the pair(x1, x2) of coordinates in either green or blue.
Since this is a 2-d space, by just using a straight line we can easily separate these two
classes. But there can be multiple lines that can separate these classes. Consider the below
image:
Hence, the SVM algorithm helps to find the best line or decision boundary; this best
boundary or region is called a hyperplane. The SVM algorithm finds the closest points of
the lines from both classes. These points are called support vectors. The distance between
the vectors and the hyperplane is called the margin, and the goal of SVM is to maximize
this margin. The hyperplane with the maximum margin is called the optimal hyperplane.
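The margin can be computed directly: the distance from a point x to the hyperplane w·x + b = 0 is |w·x + b| / ||w||, and the margin is the smallest such distance over the support vectors (a sketch with assumed weight values, not a trained SVM):

```python
import math

def distance_to_hyperplane(point, w, b):
    """Distance from a point to the hyperplane w.x + b = 0: |w.x + b| / ||w||."""
    dot = sum(wi * xi for wi, xi in zip(w, point))
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(dot + b) / norm

def margin(support_vectors, w, b):
    """The margin: the smallest distance from any support vector to the hyperplane."""
    return min(distance_to_hyperplane(p, w, b) for p in support_vectors)
```

Training an SVM amounts to choosing w and b so that this margin is as large as possible while still separating the two classes.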
Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight line, but for non-linear
data, we cannot draw a single straight line.
Consider the below image:
So to separate these data points, we need to add one more dimension. For linear data, we have
used two dimensions x and y, so for non-linear data, we will add a third dimension z. It can
be calculated as:
z = x² + y²
By adding the third dimension, the sample space will become as below image:
So now, SVM will divide the datasets into classes in the following way.
Consider the below image:
Since we are in 3-d space, the boundary looks like a plane parallel to the x-axis. If we
convert it to 2-d space with z = 1, then it becomes:
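This lifting can be checked numerically: points inside the circle get small z values and points outside get large ones, so a single plane z = constant separates the two classes (the sample points below are assumptions for illustration):

```python
def lift(point):
    """Map a 2-d point (x, y) to 3-d by adding the dimension z = x**2 + y**2."""
    x, y = point
    return (x, y, x * x + y * y)

def separable_by_z(inner, outer, threshold):
    """True if the plane z = threshold separates the two lifted point sets."""
    return (all(lift(p)[2] < threshold for p in inner) and
            all(lift(p)[2] > threshold for p in outer))

inner = [(0.0, 0.5), (0.3, -0.2)]   # points inside the unit circle
outer = [(2.0, 0.0), (-1.5, 1.5)]   # points outside it
```

No straight line separates these classes in 2-d, yet after lifting, the plane z = 1 does, which is the idea behind non-linear SVM kernels.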
ENSEMBLE LEARNING
Ensemble learning is the process by which multiple models, such as classifiers or
experts, are strategically generated and combined to solve a particular computational
intelligence problem. Ensemble learning is primarily used to improve the (classification,
prediction, function approximation, etc.) performance of a model, or reduce the likelihood of
an unfortunate selection of a poor one.
An ensemble-based system is obtained by combining diverse models (henceforth
classifiers). Therefore, such systems are also known as multiple classifier systems, or just
ensemble systems. There are several scenarios where using an ensemble based system makes
statistical sense, which are discussed below in detail. However, in order to fully and
practically appreciate the importance of using multiple classifier systems, it is perhaps
instructive to look at a psychological backdrop to this otherwise statistically sound argument:
we use such an approach routinely in our daily lives by asking the opinions of several experts
before making a decision. For example, we typically ask the opinions of several doctors
before agreeing to a medical procedure, we read user reviews before purchasing an item
(particularly big ticket items), we evaluate future employees by checking their references, etc.
In fact, even this article is reviewed by several experts before being accepted for publication.
In each case, a final decision is made by combining the individual decisions of several
experts. In doing so, the primary goal is to minimize the unfortunate selection of an
unnecessary medical procedure, a poor product, an unqualified employee or even a poorly
written and misguiding article.
Model Selection
This is perhaps the primary reason why ensemble based systems are used in practice: what is
the most appropriate classifier for a given classification problem? This question can be
interpreted in two different ways: i) what type of classifier should be chosen among many
competing models, such as multilayer perceptron (MLP), support vector machines (SVM),
decision trees, naive Bayes classifier, etc; ii) given a particular classification algorithm,
which realization of this algorithm should be chosen - for example, different initializations of
MLPs can give rise to different decision boundaries, even if all other parameters are kept
constant.
Data Fusion
In many applications that call for automated decision making, it is not unusual to receive data
obtained from different sources that may provide complementary information. A suitable
combination of such information is known as data or information fusion, and can lead to
improved accuracy of the classification decision compared to a decision based on any of the
individual data sources alone.
Confidence Estimation
The very structure of an ensemble based system naturally allows assigning a confidence to
the decision made by such a system. Consider having an ensemble of classifiers trained on a
classification problem. If a vast majority of the classifiers agree with their decisions, such an
outcome can be interpreted as the ensemble having high confidence in its decision. If,
however, half the classifiers make one decision and the other half make a different decision,
this can be interpreted as the ensemble having low confidence in its decision. It should be
noted that an ensemble having high confidence in its decision does not mean that decision is
correct, and conversely, a decision made with low confidence need not be incorrect.
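The agreement-based confidence described above can be sketched in a few lines. This is a minimal illustration, with a hypothetical `ensemble_decision` helper and made-up labels, not a production combiner:

```python
from collections import Counter

def ensemble_decision(predictions):
    """Combine individual classifier predictions by majority vote.

    Returns the winning label and a confidence score, taken here as the
    fraction of classifiers that agree with the majority decision."""
    counts = Counter(predictions)
    label, votes = counts.most_common(1)[0]
    confidence = votes / len(predictions)
    return label, confidence

# High agreement -> high confidence (which still does not imply correctness).
print(ensemble_decision(["spam", "spam", "spam", "ham", "spam"]))  # ('spam', 0.8)
# A near split -> low confidence.
print(ensemble_decision(["spam", "ham", "spam", "ham"]))
```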
• Face Detection: Given a digital photo album of many photographs, identify those
that contain a given person. A model of this decision process would allow a program
to organize photos by person. Some cameras and software such as iPhoto have this capability.
• Product Recommendation: Given a purchase history for a customer and a large
inventory of products, identify those products in which that customer will be
interested and likely to purchase. A model of this decision process would allow a
program to make recommendations to a customer and motivate product purchases.
Amazon has this capability. Also think of Facebook, Google+ and LinkedIn, which
recommend people for you to connect with after you sign up.
• Medical Diagnosis: Given the symptoms exhibited in a patient and a database of
anonymized patient records, predict whether the patient is likely to have an illness. A
model of this decision problem could be used by a program to provide decision
support to medical professionals.
• Stock Trading: Given the current and past price movements for a stock, determine
whether the stock should be bought, held or sold. A model of this decision problem
could provide decision support to financial analysts.
• Customer Segmentation: Given the pattern of behaviour by a user during a trial
period and the past behaviours of all users, identify those users that will convert to the
paid version of the product and those that will not. A model of this decision problem
would allow a program to trigger customer interventions to persuade the customer to
convert early or better engage with the trial.
• Shape Detection: Given a user hand-drawing a shape on a touch screen and a
database of known shapes, determine which shape the user was trying to draw. A
model of this decision would allow a program to show the platonic version of the
shape the user drew, making crisp diagrams. The Instaviz iPhone app does this.
UNIT- V
LEARNING PROBABILISTIC MODELS
Most real-world data is stored in relational form. In contrast, most statistical learning
methods work with “flat” data representations, forcing us to convert our data into a form that
loses much of the relational structure. The recently introduced framework of probabilistic
relational models (PRMs) allows us to represent probabilistic models over multiple entities
that utilize the relations between them.
STATISTICAL LEARNING
Statistical Learning is a set of tools for understanding data. These tools broadly come under
two classes: supervised learning & unsupervised learning. Generally, supervised learning
refers to predicting or estimating an output based on one or more inputs. Unsupervised
learning, on the other hand, provides a relationship or finds a pattern within the given data
without a supervised output.
E(Y − Ŷ)² = [f(X) − f̂(X)]² + Var(ε)
where,
• E(Y − Ŷ)² represents the expected value of the squared difference between the actual
and the predicted result.
• [f(X) − f̂(X)]² represents the reducible error. It is reducible because we can
potentially improve the accuracy of f̂ by better modeling.
• Var(ε) represents the irreducible error. It is irreducible because no matter how
well we estimate f, we cannot reduce the error introduced by the variance of ε.
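The decomposition above can be checked numerically. The following sketch is not from the text: it assumes a made-up true function f, a deliberately imperfect estimate f̂, and Gaussian noise, and verifies by Monte Carlo that the expected squared error comes out close to the reducible plus irreducible parts.

```python
import random

random.seed(0)

def f(x):           # the true (in practice unknown) regression function
    return 2.0 * x

def f_hat(x):       # an imperfect estimate of f; its error is reducible
    return 1.8 * x

sigma = 0.5         # std of the noise term eps, so Var(eps) = 0.25 is irreducible
x = 1.0
n = 100_000

# Monte Carlo estimate of E(Y - Yhat)^2 at the point x.
total = 0.0
for _ in range(n):
    y = f(x) + random.gauss(0.0, sigma)
    total += (y - f_hat(x)) ** 2
mse = total / n

reducible = (f(x) - f_hat(x)) ** 2   # (2.0 - 1.8)^2 = 0.04
irreducible = sigma ** 2             # Var(eps)      = 0.25
print(mse, reducible + irreducible)  # both close to 0.29
```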
This section shows how to learn the parameters of a complex model. We will also look
briefly at the problem of learning structure.
The maximum-likelihood hypothesis is given by the value of θ that maximizes this expression.
The same value is obtained by maximizing the log likelihood,
L(d | h_θ) = log P(d | h_θ) = c log θ + ℓ log(1 − θ),
where c is the number of cherry candies unwrapped so far and ℓ the number of limes.
(By taking logarithms, we reduce the product to a sum over the data, which is usually easier
to maximize.) To find the maximum-likelihood value of θ, we differentiate L with respect to θ
and set the resulting expression to zero:
dL/dθ = c/θ − ℓ/(1 − θ) = 0, which gives θ = c/(c + ℓ).
In English, then, the maximum-likelihood hypothesis h_ML asserts that the actual proportion of
cherries in the bag is equal to the observed proportion in the candies unwrapped so far! It
appears that we have done a lot of work to discover the obvious. In fact, though, we have laid
out one standard method for maximum-likelihood parameter learning:
1. Write down an expression for the likelihood of the data as a function of the parameter(s).
2. Write down the derivative of the log likelihood with respect to each parameter.
3. Find the parameter values such that the derivatives are zero.
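For the candy example, the three steps can be carried out in a few lines. This is a sketch with my own function names: given c cherry and ℓ lime observations, it checks numerically that θ = c/(c + ℓ) maximizes the log likelihood.

```python
import math

def log_likelihood(theta, c, l):
    """Steps 1-2: the log likelihood of observing c cherry and l lime
    candies when the bag's cherry proportion is theta."""
    return c * math.log(theta) + l * math.log(1.0 - theta)

def theta_ml(c, l):
    """Step 3: setting dL/dtheta = c/theta - l/(1 - theta) = 0
    gives theta = c / (c + l), the observed proportion."""
    return c / (c + l)

c, l = 3, 7
t = theta_ml(c, l)   # 0.3
# Sanity check: no other candidate value beats the closed-form maximizer.
assert all(log_likelihood(t, c, l) >= log_likelihood(x, c, l)
           for x in [0.1, 0.2, 0.4, 0.5, 0.9])
print(t)
```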
Figure 20.2 (a) Bayesian network model for the case of candies with an unknown
proportion of cherries and limes. (b) Model for the case where the wrapper color
depends (probabilistically) on the candy flavor.
The trickiest step is usually the last. In our example, it was trivial, but we will see that
in many cases we need to resort to iterative solution algorithms or other numerical
optimization techniques, as described in Chapter 4. The example also illustrates a significant
problem with maximum-likelihood learning in general: when the data set is small enough that
some events have not yet been observed—for instance, no cherry candies—the maximum
likelihood hypothesis assigns zero probability to those events. Various tricks are used to
avoid this problem, such as initializing the counts for each event to 1 instead of zero.
These results are very comforting, and it is easy to see that they can be extended to
any Bayesian network whose conditional probabilities are represented as tables. The most
important point is that, with complete data, the maximum-likelihood parameter learning
problem for a Bayesian network decomposes into separate learning problems, one for each
parameter. The second point is that the parameter values for a variable, given its parents, are
just the observed frequencies of the variable values for each setting of the parent values. As
before, we must be careful to avoid zeroes when the data set is small.
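A minimal sketch of this frequency-counting scheme, with add-one (Laplace) smoothing so unseen events never get probability zero. The network, variable names and data below are hypothetical:

```python
from collections import Counter
from itertools import product

def estimate_cpt(data, child, parents, values=(0, 1), pseudo=1):
    """Estimate P(child | parents) from complete data as observed
    frequencies, initializing every count to `pseudo` (Laplace
    smoothing) to avoid assigning zero probability to unseen events."""
    counts = Counter()
    for row in data:
        counts[(row[child], tuple(row[p] for p in parents))] += 1
    cpt = {}
    for pa in product(values, repeat=len(parents)):
        total = sum(counts[(v, pa)] + pseudo for v in values)
        for v in values:
            cpt[(v, pa)] = (counts[(v, pa)] + pseudo) / total
    return cpt

# Made-up complete data over binary variables A and C, where A is C's parent.
data = [{"A": 1, "C": 1}, {"A": 1, "C": 1}, {"A": 1, "C": 0}, {"A": 0, "C": 0}]
cpt = estimate_cpt(data, "C", ["A"])
print(cpt[(1, (1,))])  # smoothed P(C=1 | A=1) = (2+1)/(3+2) = 0.6
```

With complete data each conditional table is learned independently in this way, which is exactly the decomposition described above.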
Example
Figure 10.8 shows a typical case. Assume that all the variables are binary. The model
contains a hidden variable E that appears in the model but not in the data set. The aim is to
learn the parameters of the model that includes the hidden variable E. There are 10 parameters to learn.
Note that, if E were not part of the model, the algorithm would have to learn P(A), P(B),
P(C ∣AB), P(D ∣ABC), which has 1 + 1 + 4 + 8 = 14 parameters. The reason for introducing hidden variables
is, paradoxically, to make the model simpler and, therefore, less prone to overfitting.
THE EM ALGORITHM
In real-world applications of machine learning, it is very common that many relevant
features are available for learning, but only a small subset of them is observable.
For a variable that is sometimes observable and sometimes not, we can use the instances
in which it is observed to learn its behaviour, and then predict its value in the
instances in which it is not observed.
The Expectation-Maximization (EM) algorithm, on the other hand, can also handle latent
variables (variables that are never directly observed and are instead inferred from the values
of the other, observed variables), predicting their values provided that the general form of
the probability distribution governing those latent variables is known to us. This algorithm
is at the base of many unsupervised clustering algorithms in the field of machine learning.
It was proposed, explained and given its name in a paper published in 1977 by Arthur
Dempster, Nan Laird, and Donald Rubin. It is used to find the local maximum likelihood
parameters of a statistical model in the cases where latent variables are involved and the data
is missing or incomplete.
Algorithm:
1. Given a set of incomplete data, consider a set of starting parameters.
2. Expectation step (E – step): Using the observed available data of the dataset,
estimate (guess) the values of the missing data.
3. Maximization step (M – step): Complete data generated after the expectation
(E) step is used in order to update the parameters.
4. Repeat step 2 and step 3 until convergence.
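The four steps can be illustrated on the classic two-coin mixture problem: each row of flips comes from one of two biased coins, and which coin was used is the hidden variable. This is a rough sketch; the data and starting parameters are made up.

```python
def em_two_coins(flips, theta=(0.6, 0.5), iters=20):
    """EM for a mixture of two coins with unknown head probabilities.
    Each row of `flips` is (heads, tails) from one coin chosen at random;
    which coin produced each row is the hidden variable."""
    tA, tB = theta                                  # step 1: starting parameters
    for _ in range(iters):
        # E-step: estimate (guess) each row's hidden coin label as a
        # posterior probability under the current parameters.
        headsA = tailsA = headsB = tailsB = 0.0
        for h, t in flips:
            lA = tA ** h * (1 - tA) ** t            # likelihood under coin A
            lB = tB ** h * (1 - tB) ** t            # likelihood under coin B
            wA = lA / (lA + lB)                     # P(coin = A | row, theta)
            headsA += wA * h
            tailsA += wA * t
            headsB += (1 - wA) * h
            tailsB += (1 - wA) * t
        # M-step: re-estimate the parameters from the completed
        # (fractionally counted) data.
        tA = headsA / (headsA + tailsA)
        tB = headsB / (headsB + tailsB)
    return tA, tB

# Made-up observations: 10 flips per row; some rows look heads-heavy.
flips = [(9, 1), (8, 2), (4, 6), (9, 1), (3, 7)]
tA, tB = em_two_coins(flips)
print(round(tA, 2), round(tB, 2))
```

After a few iterations the estimates separate, with the heads-heavy rows attributed mostly to one coin and the rest to the other.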
Usage of EM algorithm –
• It can be used to fill in missing data in a sample.
• It can be used as the basis of unsupervised learning of clusters.
• It can be used for estimating the parameters of a Hidden Markov
Model (HMM).
• It can be used for discovering the values of latent variables.
Advantages of EM algorithm –
• The likelihood is guaranteed not to decrease with each iteration.
• The E-step and M-step are often straightforward to implement for many
problems.
• Solutions to the M-step often exist in closed form.
Disadvantages of EM algorithm –
• It has slow convergence.
• It converges only to a local optimum.
• It requires both the forward and backward probabilities (numerical optimization
requires only the forward probability).
Example
The expectation maximization or EM algorithm for learning belief networks with hidden
variables is essentially the same as the EM algorithm for clustering. The E step, shown
in Figure 10.9, involves probabilistic inference for each example to infer the probability
distribution of the hidden variable(s) given the observed variables for that example. The M
step of inferring the probabilities of the model from the augmented data is the same as in the
fully observable case discussed in the previous section, but, in the augmented data, the counts
are not necessarily integers.
Figure 10.9: EM algorithm for belief networks with hidden variables; E is a hidden variable.
The E-step computes P(E ∣ A,B,C,D) for each example, and the M-step learns probabilities
from complete data.
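For a network this small, the E-step is just Bayes' rule. The following sketch uses an even simpler hypothetical network E → A (with made-up parameters, not those of Figure 10.9) to show how each example contributes fractional counts to the augmented data the M-step then uses:

```python
def e_step_posterior(p_e, p_a_given_e, a):
    """E-step for a tiny network E -> A with E hidden and A observed:
    compute P(E=1 | A=a) by Bayes' rule. Parameters are hypothetical;
    p_a_given_e[e] gives P(A=1 | E=e)."""
    like1 = p_a_given_e[1] if a == 1 else 1 - p_a_given_e[1]
    like0 = p_a_given_e[0] if a == 1 else 1 - p_a_given_e[0]
    joint1 = p_e * like1
    joint0 = (1 - p_e) * like0
    return joint1 / (joint1 + joint0)

# Each observed example contributes fractional counts w (for E=1) and
# 1-w (for E=0) to the augmented data, so the counts need not be integers.
w = e_step_posterior(p_e=0.5, p_a_given_e={0: 0.2, 1: 0.9}, a=1)
print(w)  # 0.9*0.5 / (0.9*0.5 + 0.2*0.5) ≈ 0.818
```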
*******