Advanced Computer Vision – Guess Paper 1 with Solutions
CLO #1: Understand the fundamental concepts of Computer Vision, Image Formation, and filtering.
Q1.
a. Why is computer vision considered a challenging problem, even though human vision appears natural
and effortless? Identify three key factors that contribute to its complexity.
2 Points
Answer:
1. Variability in Image Acquisition: Images can vary due to changes in lighting, viewpoint, scale,
occlusion, and noise, making it hard for algorithms to generalize.
2. Complexity of Visual Scenes: Real-world scenes are cluttered with overlapping objects, textures, and
shadows, increasing the difficulty of object recognition.
3. Ambiguity and Context Dependence: Images often have ambiguous or incomplete information
requiring contextual understanding beyond raw pixel data.
Rationale: Human vision is a result of biological evolution and contextual cognition, while computer vision
must infer meaning solely from pixel values under varied conditions.
b. The images shown below are quite different, but their histograms are the same. Suppose that each
image is blurred with a 3 x 3 averaging mask. Would the histograms of the blurred images still be equal?
Explain.
2 Points
If your answer is no, sketch the two histograms.
2 Points
Answer:
No, after blurring, the histograms will not remain equal.
Although the original histograms are identical, the spatial arrangement of pixel intensities is different.
Blurring smooths each pixel based on its neighbours, so differences in the local spatial patterns produce different post-blur histograms.
Sketch explanation: an image whose equal intensities are finely interleaved (e.g., a checkerboard) blurs toward a histogram concentrated at mid-grey values, while an image with the same intensities in large uniform regions keeps most of its original peaks, adding only a narrow band of intermediate values at region boundaries.
Rationale: Histogram is a frequency distribution of intensities, which ignores spatial info. Blurring modifies
intensities based on local neighbourhoods, which differ between images with the same histogram.
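This can be checked numerically. Below is a minimal sketch; the checkerboard and half-black/half-white images are assumed stand-ins for the two equal-histogram inputs (the exam's actual images are not reproduced here), and zero-padded borders are assumed for the 3 x 3 average:

```python
import numpy as np

def mean3x3(img):
    """Apply a 3x3 averaging mask with zero-padded borders."""
    p = np.pad(img.astype(float), 1)
    out = np.zeros(img.shape)
    for dy in (0, 1, 2):
        for dx in (0, 1, 2):
            out += p[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / 9.0

# Two images with identical histograms but different spatial layouts:
a = np.indices((8, 8)).sum(axis=0) % 2 * 255   # 0/255 checkerboard
b = np.zeros((8, 8), dtype=int)
b[:, 4:] = 255                                  # half black, half white

# Same histogram before blurring...
print(np.array_equal(np.bincount(a.ravel(), minlength=256),
                     np.bincount(b.ravel(), minlength=256)))   # True

# ...but different intensity distributions after the 3x3 average:
# the checkerboard collapses toward mid-grey, while the split image
# keeps large runs of 0 and 255 plus a narrow transition band.
print(np.array_equal(np.sort(mean3x3(a).ravel()),
                     np.sort(mean3x3(b).ravel())))             # False
```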
c. In each application an averaging mask is applied to input images to reduce noise, and then a Laplacian
mask is applied to enhance small details. Would the result be the same if the order of these operations
were reversed?
2 Points
Answer:
In the idealized linear case, yes, the result would be the same. Both the averaging mask and the Laplacian mask are linear, shift-invariant filters, and convolution is commutative and associative, so smoothing followed by the Laplacian yields the same output as the reverse order (away from border effects).
In practice the results can differ: if the intermediate image is rounded or clipped to a valid intensity range (e.g., 8-bit) after the first operation, that nonlinearity breaks the equivalence. Applying the Laplacian first is worse off here, since its strong negative and out-of-range responses to noise are destroyed by clipping before the averaging step.
Rationale: Linear filtering commutes; the order matters only when nonlinear intermediate steps (quantization, clipping) or boundary handling intervene.
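A small experiment illustrates both regimes. It assumes zero-padded 3 x 3 filters and the standard 4-neighbour Laplacian kernel; in pure floating point the two orders agree away from the borders, while clipping/rounding the intermediate result to 8-bit range makes them diverge:

```python
import numpy as np

def conv2(img, k):
    """Same-size 2-D filtering with zero padding (both kernels below are
    symmetric, so correlation and convolution coincide)."""
    kh, kw = k.shape
    p = np.pad(img.astype(float), ((kh // 2,) * 2, (kw // 2,) * 2))
    out = np.zeros(img.shape)
    for i in range(kh):
        for j in range(kw):
            out += k[i, j] * p[i:i + img.shape[0], j:j + img.shape[1]]
    return out

avg = np.full((3, 3), 1 / 9)                                  # averaging mask
lap = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], float)     # Laplacian mask

rng = np.random.default_rng(0)
img = rng.integers(0, 256, (16, 16)).astype(float)

# Pure floating-point filtering: the two orders agree away from the borders.
smooth_then_lap = conv2(conv2(img, avg), lap)
lap_then_smooth = conv2(conv2(img, lap), avg)
print(np.allclose(smooth_then_lap[2:-2, 2:-2],
                  lap_then_smooth[2:-2, 2:-2]))                 # True

# With the intermediate result rounded and clipped to 8-bit range, they differ:
clipped1 = conv2(np.clip(conv2(img, avg), 0, 255).round(), lap)
clipped2 = conv2(np.clip(conv2(img, lap), 0, 255).round(), avg)
print(np.allclose(clipped1[2:-2, 2:-2], clipped2[2:-2, 2:-2]))  # False
```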
d. Consider a horizontal intensity profile I(x) of ten pixels: I = (10, 12, 15, 25, 45, 50, 48, 30, 20, 15), where x = 0, 1, …, 9.
1. Compute the first derivative using the forward difference method for the first five pixels.
1 Point
2. Compute the second derivative using the central difference method for the first five pixels.
1 Point
Answer:
1. First derivative (forward difference):
f′(x)=I(x+1)−I(x)
x   Calculation   Result
0   12 - 10        2
1   15 - 12        3
2   25 - 15       10
3   45 - 25       20
4   50 - 45        5
2. Second derivative (central difference):
f′′(x)=I(x+1)−2I(x)+I(x−1)
For x=1 to 4 (central difference requires neighbours):
x   Calculation        Result
1   15 - 2(12) + 10      1
2   25 - 2(15) + 12      7
3   45 - 2(25) + 15     10
4   50 - 2(45) + 25    -15
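Both tables can be reproduced in a few lines of NumPy (a quick sketch; the slicing just implements the two difference formulas above):

```python
import numpy as np

I = np.array([10, 12, 15, 25, 45, 50, 48, 30, 20, 15])

# Forward difference:  f'(x)  = I(x+1) - I(x)
first = I[1:] - I[:-1]
# Central difference:  f''(x) = I(x+1) - 2*I(x) + I(x-1), defined for x = 1..8
second = I[2:] - 2 * I[1:-1] + I[:-2]

print(first[:5])    # [ 2  3 10 20  5]   -> x = 0..4
print(second[:4])   # [  1   7  10 -15]  -> x = 1..4
```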
Q2.
The matrices in the left column are the output of applying Gaussian filters with different bandwidths for
a single octave in the SIFT detection algorithm. On the right, we have the Difference of Gaussian images.
Fill in the blank areas in the Gaussian filtered images so that there are only 2 SIFT keypoints located at
(x=2, y=scale=2), and (x=1, y=scale=3), as marked by "X" in the difference of Gaussian images. This
is before removing edges and low contrast points, and sub-pixel tuning. Also, fill in the Difference of
Gaussian values.
10 Points
Explain why we have key points in the above-mentioned locations, and why we do not have keypoints
in other locations.
5 Points
Answer:
Filling in the Gaussian filtered images: Values are chosen so that subtracting adjacent scales gives
positive or negative extrema exactly at those points (x=2, scale=2) and (x=1, scale=3). This creates
local maxima or minima in DoG.
Filling DoG: Difference of Gaussian images are computed by subtracting Gaussian images at adjacent
scales. Values at keypoints are significantly higher or lower than neighbors.
Why keypoints at these locations: Keypoints are detected as local extrema in scale-space (in both
space and scale dimensions), representing stable, repeatable features invariant to scale and rotation.
No keypoints elsewhere: Because those points are not local extrema; they may be flat, edges, or low
contrast points rejected to improve robustness.
Rationale: SIFT keypoints correspond to scale-space extrema of DoG, which helps detect distinctive, stable
points invariant to image transformations.
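The extremum test itself is easy to sketch. The matrix below is an invented stand-in for the exam's Gaussian-filtered rows (one row per scale of a 1-D octave), so the keypoint coordinates here are illustrative only, not the exam's answer:

```python
import numpy as np

# Hypothetical Gaussian-filtered rows for one octave (values invented purely
# to illustrate the extremum test; they are not the exam's actual matrices).
G = np.array([
    [10, 10, 10, 10],
    [10, 10, 10, 10],
    [10, 10, 40, 10],
    [10, 10, 10, 10],
    [10, 10, 10, 10],
], dtype=float)

D = G[1:] - G[:-1]   # DoG rows: differences of adjacent Gaussian scales

# A keypoint is a strict extremum of D over its 3x3 (x, scale) neighbourhood,
# checked at interior scales only (before edge/contrast rejection).
keypoints = []
for s in range(1, D.shape[0] - 1):
    for x in range(1, D.shape[1] - 1):
        patch = D[s - 1:s + 2, x - 1:x + 2].ravel()
        others = np.delete(patch, 4)          # the 8 neighbours
        if D[s, x] > others.max() or D[s, x] < others.min():
            keypoints.append((x, s))

print(keypoints)   # [(2, 1), (2, 2)]: a maximum and a minimum straddling the blob
```

Every other (x, scale) cell fails the test because some neighbour is at least as extreme, which is exactly why flat or non-extremal locations yield no keypoints.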
CLO #2: Understand the state-of-the-art architecture of computer vision.
Q3.
a. Ali and Bilal are trying to redesign the LeNet conv net architecture to reduce the number of weights.
Ali wants to reduce the number of feature maps in the first convolution layer. Bilal wants to reduce the
number of hidden units in the last layer before the output. Briefly explain whose approach is better?
Why?
2 Points
Answer:
Bilal’s approach is better. In LeNet-style architectures the vast majority of weights sit in the fully connected layers, not in the convolutions: the first convolution layer has only a few hundred weights (e.g., 6 filters of 5 x 5 is 150 weights), while the fully connected layers hold tens of thousands. Reducing the hidden units in the last layer before the output therefore removes far more weights than trimming first-layer feature maps.
Ali’s change saves comparatively little: even though fewer first-layer maps also shrink the second convolution layer, all the convolutional layers together account for only a small fraction of LeNet’s weights.
Rationale: Fully connected layers dominate the parameter count in LeNet-era architectures, so the biggest weight savings come from shrinking them.
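The weight counts can be tallied directly. The layer sizes below are assumed from the classic LeNet-5 layout (6 -> 16 -> 120 -> 84 -> 10), with biases ignored for simplicity:

```python
# Weight counts for the classic LeNet-5 layout, biases ignored.
conv1 = 6 * (5 * 5 * 1)      # 150 weights    (first conv layer)
conv2 = 16 * (5 * 5 * 6)     # 2,400 weights
fc1   = 120 * (16 * 5 * 5)   # 48,000 weights
fc2   = 84 * 120             # 10,080 weights (into the 84 hidden units)
out   = 10 * 84              # 840 weights    (out of the 84 hidden units)
total = conv1 + conv2 + fc1 + fc2 + out

print(total)                     # 61470
print((fc2 + out) / total)       # shrinking the 84 hidden units touches ~18%
print((conv1 + conv2) / total)   # all conv weights together are only ~4%
```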
b. One possible way to address the vanishing gradient problem in deep networks is to use the tanh
activation function. However, during lectures, it was discussed that in classification tasks neural networks
output the probability of each class. Given probabilities must always be non-negative, do you think using
tanh could distort these probabilities? Briefly explain your reasoning.
2 Points
Answer:
Yes. tanh outputs values in [-1, 1], which includes negative values; this is incompatible with the requirement that probabilities be non-negative and lie between 0 and 1 (and sum to 1 across classes).
This can distort the interpretation of outputs as probabilities. Instead, softmax or sigmoid activations
are used to generate valid probabilities.
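A quick numerical illustration (the logit values are made up for the example):

```python
import numpy as np

logits = np.array([2.0, -1.0, 0.5])   # illustrative pre-activation scores

tanh_out = np.tanh(logits)
softmax_out = np.exp(logits) / np.exp(logits).sum()

print(tanh_out)                 # second entry is negative (~ -0.762):
                                # not interpretable as a probability
print(softmax_out.min() >= 0,   # True: all entries are non-negative...
      np.isclose(softmax_out.sum(), 1.0))   # ...and they sum to 1
```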
c. In a multi-label classification task, each instance may belong to multiple categories simultaneously.
1. Which activation function should be used in the output layer for multi-label classification, and why is it
preferred over softmax?
2 Points
2. Explain why Categorical Cross-Entropy is not suitable for multi-label classification and which loss
function should be used instead.
2 Points
3. Suppose your dataset has N classes — what would be the appropriate architecture for the final layer of
your neural network?
2 Points
Answer:
1. Use sigmoid activation in each output neuron independently because classes are not mutually
exclusive. Softmax enforces exclusivity, which is inappropriate for multi-label.
2. Use Binary Cross-Entropy (BCE) loss instead of Categorical Cross-Entropy because BCE treats each
class independently as a binary classification, suitable for multi-label scenarios.
3. The output layer should have N neurons, each with sigmoid activation, to independently predict
presence/absence of each class.
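The three answers fit together in a few lines; the logits and targets below are invented for illustration (N = 4, with the instance belonging to classes 0 and 2):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Final layer for N = 4 classes: one sigmoid per class; the target vector is
# NOT one-hot, since the instance belongs to classes 0 and 2 simultaneously.
logits = np.array([2.0, -1.0, 0.3, -2.5])   # illustrative network outputs
y_true = np.array([1.0, 0.0, 1.0, 0.0])

p = sigmoid(logits)
# Binary cross-entropy averaged over the N independent class decisions:
bce = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

print(p.round(3))     # [0.881 0.269 0.574 0.076]
print(round(bce, 3))  # 0.268
```

Note that softmax would force the four outputs to compete for a single unit of probability mass, which is exactly what the independent sigmoids avoid.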
Q4.
a. Explain how the generator and discriminator in a cGAN differ from those in a standard GAN, in terms
of their inputs/outputs, architecture, and loss function.
2 Points
Answer:
Generator: In cGAN, the input includes both the noise vector z and the conditional label y; the output is an image conditioned on y. In a standard GAN, the input is noise only.
Discriminator: In cGAN, the input is an image together with the label y, and it learns to judge whether the image is real and matches the label. In a standard GAN, the input is the image only.
Loss function: cGAN incorporates the conditioning y into both the generator and discriminator terms of the loss, enforcing class-conditional generation.
b. Suppose a GAN is trained to generate images of animals, but after training, it only produces images
resembling a few specific types of animals.
What could be causing this issue?
2 Points
How would you diagnose/detect this particular phenomenon?
2 Points
How would you modify the training process to ensure more diverse outputs?
2 Points
Answer:
The problem is mode collapse, where the generator produces limited output modes that fool the
discriminator.
Diagnose by observing lack of variety in generated samples, and by tracking metrics like diversity
score or latent space coverage.
Modify training by adding techniques like minibatch discrimination, feature matching, unrolled GANs,
or adding noise and regularization to encourage diversity.
c. The loss function for the cGAN is given as:
min_G max_D V(D, G) = E_{x ~ p_data(x)}[ log D(x | y) ] + E_{z ~ p_z(z)}[ log(1 - D(G(z | y))) ]
Interpret the role of the label y in both terms of the loss function.
2 Points
Why is conditioning on y important for generating meaningful outputs?
2 Points
Answer:
Label y provides conditioning information: in the discriminator, it judges whether the image matches
the label y; in the generator, it guides generation towards images corresponding to y.
Conditioning enables control over the class/type of generated images, allowing targeted generation
rather than random sampling.
Q5.
a. Self-attention computes dot products between keys and queries. If a given self-attention has 4 attention
heads and its one-dimensional input size is 3, how many dot products will it compute? Show your
calculations.
2 Points
Answer:
For a sequence length L = 3, each head dots every query with every key, giving L x L = 9 dot products per head.
For 4 heads, the total is 4 x 9 = 36 dot products.
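The count follows directly from the shape of the score tensor. In the sketch below, the per-head dimension d = 8 is an assumed value (it does not change the count, which depends only on heads and sequence length):

```python
import numpy as np

L, heads, d = 3, 4, 8   # sequence length 3, 4 heads; d = 8 is an assumed head dim
rng = np.random.default_rng(0)
Q = rng.normal(size=(heads, L, d))   # queries, one set per head
K = rng.normal(size=(heads, L, d))   # keys

scores = Q @ K.transpose(0, 2, 1)    # one dot product per (head, query, key)
print(scores.shape, scores.size)     # (4, 3, 3) 36
```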
b. Which token is used as a feature representation of the input Image/video in vision transformer (ViT)?
2 Points
Answer:
The [CLS] token (classification token) is used as the global feature representation for downstream
tasks.
c. Vision Transformers often require large-scale datasets for effective training. Suggest one strategy to
improve the performance of ViTs when training data is limited.
2 Points
Answer:
Use transfer learning with pretrained ViT weights on large datasets and fine-tune on the smaller
dataset.
Alternatively, use data augmentation or regularization techniques.