What does BERT stand for?
A) Basic Encoder for Robust Transformers
B) Bidirectional Encoder Representations from Transformers
C) Binary Encoded Recursive Transformers
D) Balanced Embedding Representation Technology
Which of the following is NOT a component of the transformer architecture?
A) Multi-head attention
B) Feed-forward neural networks
C) Positional encoding
D) Convolutional layers
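For context on options A-C, here is a minimal sketch of a single transformer encoder block in PyTorch (the dimensions and layer sizes are illustrative assumptions); note that it contains multi-head attention and a feed-forward network, but no convolutional layers:

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    # Illustrative only: multi-head self-attention + position-wise feed-forward,
    # each wrapped with a residual connection and layer normalization.
    def __init__(self, d_model=64, n_heads=4, d_ff=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)   # multi-head self-attention
        x = self.norm1(x + attn_out)       # residual + layer norm
        x = self.norm2(x + self.ff(x))     # feed-forward sublayer
        return x

x = torch.randn(2, 10, 64)                 # (batch, seq_len, d_model)
print(EncoderBlock()(x).shape)             # torch.Size([2, 10, 64])
```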
What is the primary advantage of the transformer architecture over RNNs?
A) Lower computational complexity
B) Ability to handle variable-length sequences
C) Parallel processing of input sequences
D) Smaller model size
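A small sketch contrasting the two computation patterns behind option C, using assumed toy shapes: an RNN must step through the sequence one position at a time, while self-attention processes all positions in a single matrix multiplication:

```python
import torch

x = torch.randn(8, 128, 64)             # (batch, seq_len, d_model)

# RNN-style processing: an explicit loop over time steps (inherently sequential).
rnn = torch.nn.RNN(64, 64, batch_first=True)
h = torch.zeros(1, 8, 64)
for t in range(x.size(1)):
    _, h = rnn(x[:, t:t+1, :], h)       # each step depends on the previous hidden state

# Attention-style processing: every position attends to every other in one matmul (parallel).
scores = x @ x.transpose(1, 2) / 64 ** 0.5
out = torch.softmax(scores, dim=-1) @ x
print(out.shape)                         # torch.Size([8, 128, 64])
```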
What pre-training task does BERT use to learn bidirectional context?
A) Next Sentence Prediction
B) Masked Language Modeling
C) Machine Translation
D) Both A and B
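A toy sketch of masked language modeling, the idea behind option B; the 15% rate and the [MASK] token id are illustrative assumptions, and BERT's full recipe also sometimes keeps or randomly replaces the selected tokens:

```python
import torch

MASK_ID, VOCAB = 103, 30522                        # assumed ids for illustration
token_ids = torch.randint(1000, VOCAB, (4, 16))    # pretend batch of token ids
labels = token_ids.clone()

mask = torch.rand(token_ids.shape) < 0.15          # select ~15% of positions
token_ids[mask] = MASK_ID                          # replace them with [MASK]
labels[~mask] = -100                               # loss is computed only on masked positions

print(mask.float().mean())                         # fraction of masked tokens
```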
What is the purpose of the [CLS] token in BERT?
A) To mark the end of a sentence
B) To represent the entire sequence for classification tasks
C) To separate two sentences in the input
D) To mask random words in the input
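A minimal sketch of option B with assumed shapes: the final hidden state at position 0 (the [CLS] token) is taken as a summary of the whole sequence and fed to a small classification head:

```python
import torch
import torch.nn as nn

hidden_states = torch.randn(8, 128, 768)   # (batch, seq_len, hidden) from a BERT encoder
cls_vector = hidden_states[:, 0, :]        # [CLS] sits at position 0
classifier = nn.Linear(768, 2)             # e.g. a binary sentiment head
logits = classifier(cls_vector)
print(logits.shape)                        # torch.Size([8, 2])
```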
How do transformer models handle out-of-vocabulary words?
A) Ignore them
B) Use sub-word tokenization
C) Assign them a random embedding
D) Assign the embedding of the closest in-vocabulary word
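A quick illustration of option B using the Hugging Face transformers tokenizer for bert-base-uncased (assuming the package and model files are available): a word missing from the vocabulary is split into known sub-word pieces rather than dropped.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# A rare or made-up word is decomposed into sub-word pieces ("##" marks continuations),
# so the model never sees a truly out-of-vocabulary token. Exact splits depend on the vocab.
print(tokenizer.tokenize("transformers handle quokkaification"))
```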
What is the key difference between BERT and GPT models?
A) BERT uses encoders only while GPT uses decoders only
B) BERT is bidirectional while GPT is unidirectional
C) BERT is for classification tasks only, while GPT is for generation
D) Both A and B
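A toy contrast behind option B: GPT-style decoders apply a causal mask so each position attends only to earlier positions, while BERT-style encoders let every position attend to the full sequence (mask values below use the usual 1 = "may attend" convention):

```python
import torch

seq_len = 5
causal = torch.tril(torch.ones(seq_len, seq_len))   # GPT: lower-triangular, unidirectional
bidirectional = torch.ones(seq_len, seq_len)        # BERT: full matrix, bidirectional
print(causal)
print(bidirectional)
```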
What is the primary purpose of self-attention in transformer models?
A) To reduce the model size
B) To speed up training
C) To eliminate the need for positional encoding
D) To capture dependencies between different positions in a sequence
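A minimal single-head self-attention sketch, with assumed toy shapes and no learned projections, showing how each position's output mixes information from every other position (option D):

```python
import torch

x = torch.randn(1, 6, 32)                                  # (batch, seq_len, d_model)
q, k, v = x, x, x                                          # single head, identity projections
weights = torch.softmax(q @ k.transpose(1, 2) / 32 ** 0.5, dim=-1)
out = weights @ v                                          # each row mixes all positions
print(weights[0])                                          # 6x6 position-to-position attention
```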
What is the purpose of the scaling factor in the scaled dot-product attention?
A) To normalize the input
B) To prevent vanishing gradients
C) To stabilize the gradients, especially for inputs with large dimension
D) To increase the model's capacity
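A small numeric check of option C, with an assumed dimension of 512: unscaled dot products grow with the key dimension, so the softmax saturates into a near-one-hot distribution and the gradients through it vanish; dividing by sqrt(d_k) keeps the logits in a stable range.

```python
import torch

torch.manual_seed(0)
d_k = 512
q = torch.randn(1, d_k)
k = torch.randn(10, d_k)

raw = q @ k.t()                              # dot products have std ~sqrt(d_k)
scaled = raw / d_k ** 0.5                    # scaled dot-product attention logits

print(torch.softmax(raw, dim=-1).max())      # close to 1.0: saturated, tiny gradients
print(torch.softmax(scaled, dim=-1).max())   # noticeably softer distribution
```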
What is the purpose of positional encoding in transformer models?
A) To add information about the order of the sequence
B) To increase the model's vocabulary
C) To reduce computational complexity
D) To enable multi-head attention
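For option A, a sketch of the sinusoidal positional encoding from the original transformer paper (BERT itself learns its position embeddings, so this is one concrete example rather than BERT's exact mechanism):

```python
import torch

# PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
def positional_encoding(seq_len, d_model):
    pos = torch.arange(seq_len).unsqueeze(1).float()
    i = torch.arange(0, d_model, 2).float()
    angles = pos / (10000 ** (i / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

pe = positional_encoding(50, 64)
print(pe.shape)        # torch.Size([50, 64]); added to token embeddings to inject order
```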