Unit 5: Computer Vision
● Computer vision is the process of extracting information from visual data such as images and videos.
● It aims to build systems that can process, analyse and make sense of visual data in the same way humans do.
Computer Vision and Artificial Intelligence
● Computer vision is a field of artificial intelligence (AI).
● AI enables computers to think, and computer vision enables AI to see, observe and make sense
of visual data (like images & videos).
Applications of Computer Vision
Facial Recognition:
● Security is one of the most important applications of Computer Vision, and it involves
the use of facial recognition.
● It can be used for guest recognition or for maintaining a log of visitors.
● It also finds application in schools, in attendance systems based
on facial recognition of students.
Face Filters:
● Modern-day apps like Instagram and Snapchat have a lot of features
based on computer vision. The application of face filters
is one among them.
● Through the camera, the algorithm identifies
the facial dynamics of the person and applies the selected face filter.
Google's Search by Image:
● Most searching for data on Google's search
engine is done with text, but it also has an
interesting feature of getting search results through an image.
● This uses Computer Vision: it analyses various features of the
input image, compares them against a database of images, and
gives us the search results.
Computer Vision in Retail:
Retail has been one of the fastest-growing fields and at the same
time is using Computer Vision to make the user experience more
fruitful.
Retailers can use Computer Vision techniques to track customers'
movements through stores, analyse navigational routes and detect
walking patterns.
Inventory Management is another such application. Through security camera image analysis, a
Computer Vision algorithm can generate a very accurate estimate of the items available in the store.
Also, it can analyse the use of shelf space to identify suboptimal configurations and suggest better item
placement.
Self-Driving Cars:
● Computer Vision is the fundamental technology behind the
development of autonomous vehicles.
● Most leading car manufacturers in the world are reaping the
benefits of investing in artificial intelligence for developing
on-road versions of hands-free technology.
● This involves identifying objects, finding navigational
routes and monitoring the environment, all at the same
time.
Google Translate App:
● All you need to do to read signs in a foreign language is point
your phone's camera at the words and let the Google Translate
app tell you what they mean in your preferred language, almost
instantly.
● By using optical character recognition to see the image and
augmented reality to overlay an accurate translation, this is a
convenient tool that uses Computer Vision.
Computer Vision Tasks
The various applications of Computer Vision are based on a certain number of tasks that are performed
to get certain information from the input image which can be directly used for prediction or forms the
base for further analysis. The tasks used in a computer vision application are:
Classification
The image Classification problem is the task of assigning an input image one label from a fixed set of
categories. This is one of the core problems in CV that, despite its simplicity, has a large variety of
practical applications.
Classification + Localisation
This task involves both identifying what object is present in the image and, at the
same time, identifying where in the image that object is located. It is used only for single
objects.
Object Detection
● Object detection is the process of finding instances of real-world objects such as faces, bicycles,
and buildings in images or videos.
● Object detection algorithms typically use extracted features and learning algorithms to recognize
instances of an object category.
● It is commonly used in applications such as image retrieval and automated vehicle parking
systems.
Instance Segmentation
● Instance Segmentation is the process of detecting instances of the objects, giving them a
category, and then giving each pixel a label based on that.
● A segmentation algorithm takes an image as input and outputs a collection of regions (or
segments).
Basics of Images
We all see a lot of images around us and use them daily, either through our mobile phones or computer
systems. But do we ever ask ourselves some basic questions about them while we use them on a regular basis?
Basics of Pixels
● The word "pixel" means a picture element.
● Every photograph, in digital form, is made up of pixels. They are the smallest unit of information
that make up a picture.
● Usually round or square, they are typically arranged in a 2-dimensional grid.
● In the image below, one portion has been magnified many times over so that you can see its
composition in pixels.
As you can see, the pixels approximate the actual image. The more pixels you have, the more closely the
image resembles the original.
Resolution
● The number of pixels in an image is sometimes called the resolution.
● When the term is used to describe pixel count, one convention is to express resolution as the
width by the height, for example, a monitor resolution of 1280x1024.
● This means there are 1280 pixels from one side to the other, and 1024 from top to bottom.
● Another convention is to express the number of pixels as a single number, like a 5-megapixel
camera (a megapixel is a million pixels).
● This means the pixels along the width multiplied by the pixels along the height of the image
taken by the camera equals 5 million pixels.
● In the case of our 1280x1024 monitor, it could also be expressed as 1280 x 1024 = 1,310,720,
or 1.31 megapixels.
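
This arithmetic can be checked in a couple of lines of Python:

```python
# Pixel count of a 1280x1024 display, expressed in megapixels.
width, height = 1280, 1024
total_pixels = width * height      # 1,310,720 pixels
print(total_pixels / 1_000_000)    # 1.31072 -> about 1.31 megapixels
```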
Pixel value
● Each of the pixels that represent an image
stored inside a computer has a pixel value that
describes how bright that pixel is, and/or what
colour it should be.
● The most common pixel format is the byte image, where this number is stored as an 8-bit
integer giving a range of possible values from 0 to 255.
● Typically, zero is taken to be no colour, or black, and 255 is taken to be full colour, or
white. Why do we have a value of 255?
● In computer systems, computer data is in the form of ones and zeros, which we call the binary
system. Each bit in a computer system can have either a zero or a one.
● Each pixel of an image uses 1 byte, which is equivalent to 8 bits of data. Since each bit can
have two possible values, the 8 bits can represent 2^8 = 256 possible values, starting
from 0 and ending at 255.
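
A quick Python check of this count:

```python
# Each of the 8 bits in a byte can be 0 or 1,
# so a byte can hold 2**8 = 256 distinct values: 0, 1, ..., 255.
num_values = 2 ** 8
print(num_values)                         # 256
print(min(range(256)), max(range(256)))   # 0 255
```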
Grayscale Images
● Grayscale images are images that have a range of shades of gray without apparent colour.
● The darkest possible shade is black, which is the total absence of colour, or a pixel value of zero.
● The lightest possible shade is white, which is the total presence of colour, or a pixel value of 255.
Intermediate shades of gray are represented by equal brightness levels of the three primary
colours.
● A grayscale image has a single plane, a 2D array of pixels, with each pixel of size 1 byte.
● The size of a grayscale image is defined as the Height x Width of that image.
Let us look at an image to understand grayscale images.
Here is an example of a grayscale image. As you can see, the value of each pixel lies within the range of 0-255.
Computers store the images we see in the form of these numbers.
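
To make this concrete, here is a minimal sketch, assuming the NumPy library, of a grayscale image as a 2D array of 8-bit values:

```python
import numpy as np

# A tiny 3x4 grayscale image: a single 2D plane of 8-bit pixel values.
# 0 is black, 255 is white, and values in between are shades of grey.
gray = np.array([[  0,  64, 128, 255],
                 [ 32,  96, 160, 224],
                 [255, 200, 100,   0]], dtype=np.uint8)

print(gray.shape)   # (3, 4) -> Height x Width
print(gray.size)    # 12 pixels, at 1 byte each
```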
RGB Images
● All the images that we see around us are coloured images.
These images are made up of three primary colours Red,
Green, and Blue.
● All the colours that are present can be made by combining
different intensities of red, green, and blue.
1) What is the output colour when you put R=G=B=255?
2) What is the output colour when you put R=G=B=0?
Now the question arises, how do computers store RGB images?
● Every RGB image is stored in the form of three different channels called the R channel, G
channel, and the B channel.
● Each plane separately has many pixels, with each pixel value varying from 0 to 255. All three
planes when combined form a colour image.
● This means that in an RGB image, each pixel has a set of three different values which together
give colour to that particular pixel.
● As you can see, each colour image is stored in the form of three different channels, each having
different intensity. All three channels combine to form a colour we see.
● In the above given image, if we split the image into three different channels, namely Red (R),
Green (G) and Blue (B), the individual layers will have the following intensities of colour at the
individual pixels. These individual layers, when stored in memory, look like the image on the
extreme right.
● The individual layers look like grayscale images because each pixel has an intensity value of 0 to 255 and,
as studied earlier, 0 is considered black, or no presence of colour, and 255 means white, or full
presence of colour. These three individual RGB values, when combined, form the colour of each
pixel.
● Therefore, each pixel in the RGB image has three values to form the complete colour.
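
A matching NumPy sketch (with made-up pixel values) shows the three stacked channels, and also answers the two questions asked above:

```python
import numpy as np

# A 2x2 RGB image: shape is Height x Width x 3, one value per channel.
rgb = np.array([[[255,   0,   0], [  0, 255,   0]],
                [[  0,   0, 255], [255, 255, 255]]], dtype=np.uint8)

r_channel = rgb[:, :, 0]   # the R plane: a grayscale-like 2D array
g_channel = rgb[:, :, 1]   # the G plane
b_channel = rgb[:, :, 2]   # the B plane

print(rgb.shape)    # (2, 2, 3)
print(rgb[1, 1])    # [255 255 255] -> R=G=B=255 gives white
print(rgb[0, 0])    # [255   0   0] -> pure red; R=G=B=0 would give black
```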
No-Code AI Tools:
Introduction to Lobe
• Lobe.ai is an Auto-ML tool, which means that it is a no-code AI tool.
• It works with image classification: it takes a set of images with labels and automatically finds
the most optimal model to classify the images.
Introduction to Teachable Machine
• Teachable Machine is an AI, Machine Learning, and Deep Learning tool that was developed by Google
in 2017
• It runs on top of tensorflow.js, which was also developed by the same company.
• It is a web-based tool that allows training of a model based on different images, audio, or poses given
as input through webcam or pictures.
Image Features
● In computer vision and image processing, a feature is a piece of information that is relevant for
solving the computational task related to a certain application.
● Features may be specific structures in the image such as points, edges, or objects.
● For example: Imagine that your security camera is capturing an image. At the top of the image,
we are given six small patches of images.
● Our task is to find the exact location of those image patches in the image.
➢ Were you able to find the exact location of all the patches?
➢ Which one was the most difficult to find?
➢ Which one was the easiest to find?
Let us take individual patches into account at once and then check the exact location of those patches.
For Patch A and B: Patches A and B are flat surfaces in the image and are spread over a lot of area.
They can be present at any location in a given area in the image.
For Patch C and D: The patches C and D are simpler as compared to A and B. They are edges of a
building and we can find an approximate location of these patches but finding the exact location is still
difficult. This is because the pattern is the same everywhere along the edge.
For Patch E and F: The patches E and F are the easiest to find in the image. The reason is that E and F
are some corners of the building. This is because at the corners, wherever we move this patch it will
look different.
Conclusion
In image processing, we can get a lot of features from the image. It can be either a blob, an edge, or a
corner. These features help us to perform various tasks and then get the analysis done based on the
application. Now the question that arises is which of the following are good features to be used?
As you saw in the previous activity, features containing corners are easy to find, as they occur
only at a particular location in the image, whereas edges, which are spread over a line, look the
same all along. This tells us that corners are always good features to extract from an image, followed
by edges.
Let's look at another example to understand this.
Consider the images given below and apply the concept of good features for the following.
● In the above image how would we determine the exact location of each patch?
● The blue patch is a flat area and difficult to find and track. Wherever you move the blue patch it
looks the same.
● The black patch has an edge. Moved along the edge (parallel to edge), it looks the same.
● The red patch is a corner. Wherever you move the patch, it looks different, therefore it is unique.
● Hence, corners are considered to be good features in an image.
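
To see this idea in practice, here is a short sketch assuming the OpenCV library (cv2) and its Harris corner detector; "building.jpg" is a placeholder filename, and the threshold is a tunable choice:

```python
import cv2
import numpy as np

# Load an image in grayscale ("building.jpg" is a placeholder filename).
img = cv2.imread("building.jpg", cv2.IMREAD_GRAYSCALE)
gray = np.float32(img)                 # cornerHarris expects float32 input

# Arguments: neighbourhood size, Sobel aperture size, Harris parameter k.
response = cv2.cornerHarris(gray, 2, 3, 0.04)

# Keep pixels with a strong corner response (threshold is a design choice).
corners = response > 0.01 * response.max()
print("Corner pixels found:", int(corners.sum()))
```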
Convolution
● We have learnt that computers store images as numbers and that pixels are arranged in a
particular manner to create the picture we can recognize. These pixels have values varying from
0 to 255, and the value of a pixel determines the colour of that pixel.
● But what if we edit these numbers, will it bring a change to the image? The answer is yes.
● As we change the values of these pixels, the image changes. This process of changing pixel
values is the base of image editing.
● We all use a lot of image editing software like Photoshop, and at the same time use apps like
Instagram and Snapchat, which apply filters to an image to enhance its quality.
● As you can see, different filters applied to an image change the pixel values evenly throughout
the image. How does this happen?
● This is done with the help of the process of convolution and the convolution operator which is
commonly used to create these effects.
● Before we understand how the convolution operation works, let us try and create a theory for the
convolution operator by experiencing it using an online application.
Let us now see how the convolution operator works.
Convolution: Explained
● Convolution is a simple mathematical operation that is fundamental to many common image
processing operators.
● Convolution provides a way of multiplying together two arrays of numbers, generally of
different sizes, but of the same dimensionality, to produce a third array of numbers of the same
dimensionality.
● In image processing, convolution is the process of transforming an image by applying a kernel
over each pixel and its local neighbors across the entire image. The kernel is a matrix of
values whose size and values determine the transformation effect of the convolution process.
● An (image) convolution is simply an element-wise multiplication of the image array with another
array, called the kernel, followed by a sum.
As you can see here,
I= Image Array
K = Kernel Array
I * K = Resulting array after performing the convolution operator
Note: The Kernel is passed over the whole image to get the resulting array after convolution.
CONVOLUTION - It is the concept of filtering an image with a kernel matrix, where a matrix is a set of numbers arranged in
rows and columns.
The KERNEL matrix has the same number of rows and columns (it is square) and is initiated with random values.
Convolution thus involves two things: the change in the pixels of an input image,
and the convolution operation, which is the processing of the data that produces that change.
IMAGE ARRAY – 1) It represents the input image data.
2) Each element corresponds to a pixel intensity value.
For grayscale images – pixel values range from 0 to 255.
For RGB images – each channel has its own array of pixel intensity values.
KERNEL ARRAY (filter) – A small matrix of numbers designed to perform a specific operation on an
image, such as blurring, edge detection, or sharpening; in a CNN, kernels are initialised randomly with any values.
Every kernel is a small matrix that extends through the full depth of the input volume. During the forward pass,
we convolve each kernel across the width and height of the input image and compute dot products between
the pixel values of the source and kernel at corresponding positions.
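
For illustration, here are a few classic 3x3 kernels; the exact values are conventional choices, not the only possible ones:

```python
import numpy as np

# Box blur: every output pixel is the average of its 3x3 neighbourhood.
blur = np.ones((3, 3)) / 9.0

# Sharpen: boosts the centre pixel relative to its neighbours.
sharpen = np.array([[ 0, -1,  0],
                    [-1,  5, -1],
                    [ 0, -1,  0]])

# Edge detection: responds strongly where the intensity changes.
edge = np.array([[-1, -1, -1],
                 [-1,  8, -1],
                 [-1, -1, -1]])
```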
The Convolution Process involves these steps.
(1) It places the Kernel Matrix over each pixel of the image (ensuring that the full Kernel is within the
image) and multiplies each value of the Kernel with the corresponding pixel it is over.
(2) Then, it sums the resulting multiplied values and returns the result as the new value of the centre
pixel.
(3) This process is repeated across the entire image.
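
The three steps above can be written out directly as code. Below is a minimal NumPy sketch; as in CNNs, the kernel is applied without flipping (strictly speaking this is cross-correlation), and no padding is used, so the output is smaller than the input:

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel over the image; at each position, multiply
    element-wise and sum to get the new centre-pixel value."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    oh, ow = ih - kh + 1, iw - kw + 1          # output shrinks without padding
    out = np.zeros((oh, ow))
    for y in range(oh):
        for x in range(ow):
            patch = image[y:y + kh, x:x + kw]   # step 1: overlay the kernel
            out[y, x] = np.sum(patch * kernel)  # step 2: multiply, then sum
    return out                                  # step 3: repeated everywhere

image = np.arange(25, dtype=float).reshape(5, 5)  # a toy 5x5 "image"
kernel = np.ones((3, 3)) / 9.0                    # the box-blur kernel
print(convolve2d(image, kernel))                  # a 3x3 result
```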
What is a Kernel?
A Kernel is a matrix, which is slid across the image and multiplied with the input such that the output
is enhanced in a certain desirable manner.
● Each kernel has a different value for different kinds of effects that we want to apply to an image.
● In Image processing, we use the convolution operation to extract the features from the images
which can be later used for further processing especially in Convolution Neural Network (CNN),
which we will study later in the chapter.
● In this process, we overlap the centre of the image with the centre of the kernel to obtain the
convolution output. In the process of doing it, the output image becomes smaller as the
overlapping is done at the edge row and column of the image.
What if we want the output image to be of the exact size of the input image, how can we achieve this?
To achieve this, we need to extend the edge values out by one in the original image while overlapping
the centres and performing the convolution. This will help us keep the input and output image of the
same size. While extending the edges, the pixel values are considered zero.
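
In code, this edge extension is zero padding; a minimal NumPy sketch, assuming a 3x3 kernel:

```python
import numpy as np

image = np.arange(9, dtype=float).reshape(3, 3)

# Add a one-pixel border of zeros around the image, so a 3x3 kernel
# can be centred on every original pixel.
padded = np.pad(image, pad_width=1, mode="constant", constant_values=0)

print(padded.shape)   # (5, 5): a 3x3 convolution now yields a 3x3 output
```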
Summary
1. Convolution is a common tool used for image editing.
2. It is an element-wise multiplication of an image and a kernel to get the desired output.
3. In computer vision applications, it is used in Convolutional Neural Network (CNN) to extract image
features.
Convolution Neural Networks (CNN)
A Convolutional Neural Network (CNN) is a Deep Learning algorithm that can take in an input
image, assign importance (learnable weights and biases) to various aspects/objects in the image, and
be able to differentiate one from the other.
Convolution is the key concept in Convolutional Neural Networks. Convolutional Neural Networks (CNN) are a type
of Deep Neural Network.
A CNN comprises a Convolutional Layer, a Pooling Layer, and a Fully Connected Layer.
The process of deploying a CNN is as follows:
● In the above diagram, we give an input image, which is processed through the CNN, and the
network then gives predictions based on the labels in the particular dataset.
The different layers of a Convolutional Neural Network (CNN) are as follows:
● Convolution Layer
● Rectified linear unit (ReLU)
● Pooling Layer
● Fully Connected Layer
Convolution Layer:
● It is the first layer of a CNN. The objective of the Convolution Operation is to extract
features such as edges from the input image.
● At the Convolution layer, a CNN applies convolution on to its inputs using a Kernel
Matrix that it calibrates through training.
● For this reason, CNNs are very good at feature matching in images and object
classification.
● A CNN need not be limited to only one Convolutional Layer.
● Conventionally, the first Convolution Layer is responsible for capturing the Low-Level features
such as edges, colour, gradient orientation, etc.
● With added layers, the architecture adapts to the High-Level features as well, giving us a network
that has a wholesome understanding of images in the dataset.
● It uses the convolution operation on the images. In the convolution layer, several kernels are used to
produce several features. The output of this layer is called the feature map.
● A feature map is also called an activation map. We can use these terms interchangeably.
There are several uses we derive from the feature map:
• We reduce the image size so that it can be processed more efficiently.
• We only focus on the features of the image that can help us in processing the image further.
For example, you might only need to recognize someone's eyes, nose, and mouth to recognize the
person. You might not need to see the whole face.
Rectified Linear Unit Function
● The next layer in the Convolution Neural Network is the Rectified Linear Unit function or the
ReLU layer.
● After we get the feature map, it is passed on to the ReLU layer. This layer simply gets rid of
all the negative numbers in the feature map and lets the positive numbers stay as they are. Passing
the feature map through the ReLU layer introduces non-linearity.
Let us see it through a graph.
If we see the two graphs side by side, the one on the left is a linear graph. This graph, when passed
through the ReLU layer, gives the one on the right. The ReLU graph starts with a horizontal straight line
at zero and then increases linearly once the input becomes positive.
Now the question arises, why do we pass the feature map to the ReLU layer?
It is to make the colour changes more obvious and more abrupt.
As shown in the above convolved image, there is a smooth grey gradient change from black to white.
After applying the ReLU function, we can see a more abrupt colour change, which makes the edges more
obvious and acts as a better feature for the further layers in a CNN, as it enhances the activation map.
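
In code, ReLU is a single element-wise operation, as this minimal NumPy sketch shows:

```python
import numpy as np

# ReLU: negative values become 0, positive values pass through unchanged.
feature_map = np.array([[-3.0,  1.5],
                        [ 0.0, -0.5]])

relu_output = np.maximum(0, feature_map)
print(relu_output)   # [[0.  1.5]
                     #  [0.  0. ]]
```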
Pooling Layer
Similar to the Convolutional Layer, the Pooling layer is responsible for reducing the spatial size of the
Convolved Feature while still retaining the important features.
Two types of pooling can be performed on an image.
1. Max Pooling: Max Pooling returns the maximum value from the portion of the image covered by the
Kernel.
2. Average Pooling: Average Pooling returns the average of all the values from the portion of the image
covered by the Kernel.
The pooling layer is important in the CNN as it performs a series of tasks, which are as follows:
1. Makes the image smaller and more manageable
2. Makes the image more resistant to small transformations, distortions, and translations in the input
image.
A small difference in the input image will create a very similar pooled image.
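
Both kinds of pooling can be sketched in a few lines of NumPy; a 2x2 window with a stride of 2 is assumed here:

```python
import numpy as np

def pool2d(feature_map, size=2, mode="max"):
    """Reduce each non-overlapping size x size window to one value:
    its maximum (max pooling) or its mean (average pooling)."""
    h, w = feature_map.shape
    out = np.zeros((h // size, w // size))
    for y in range(0, h - h % size, size):
        for x in range(0, w - w % size, size):
            window = feature_map[y:y + size, x:x + size]
            out[y // size, x // size] = window.max() if mode == "max" else window.mean()
    return out

fm = np.array([[1., 3., 2., 4.],
               [5., 6., 1., 2.],
               [7., 2., 9., 0.],
               [4., 8., 3., 5.]])
print(pool2d(fm, mode="max"))       # [[6. 4.] [8. 9.]]
print(pool2d(fm, mode="average"))   # [[3.75 2.25] [5.25 4.25]]
```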
Fully Connected Layer
● The final layer in the CNN is the Fully Connected Layer (FC layer).
● The objective of a fully connected layer is to take the results of the convolution/pooling process
and use them to classify the image into a label (in a simple classification example).
● The output of convolution/pooling is flattened into a single vector of values, each representing a
probability that a certain feature belongs to a label.
For example, if the image is of a cat, features representing things like whiskers or fur should have high
probabilities for the label "cat".
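
Putting all four layers together, here is a minimal sketch of such a network, assuming TensorFlow/Keras is available; the 64x64x3 input size, layer widths, and the two output labels are illustrative choices, not values from this chapter:

```python
import tensorflow as tf

# Convolution (+ ReLU) -> Pooling -> Convolution (+ ReLU) -> Pooling
# -> Flatten -> Fully Connected -> output probabilities per label.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, (3, 3), activation="relu",
                           input_shape=(64, 64, 3)),    # 64x64 RGB (assumed)
    tf.keras.layers.MaxPooling2D((2, 2)),               # pooling layer
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu"),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),                          # single vector of values
    tf.keras.layers.Dense(64, activation="relu"),       # fully connected layer
    tf.keras.layers.Dense(2, activation="softmax"),     # e.g. "cat" vs "not cat"
])
model.summary()
```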