Introduction to Large Language Models
Large Language Models and Prompt Engineering
Ram N Sangwan
• Overview of LLMs and their Significance
• Encoders and Decoders
• Understanding the architecture
• Components of LLMs
LLMs – What You Need to Know
• LLM Architectures
• What else can LLMs do?
• Prompting and Training
• How do we affect the distribution over the vocabulary?
• Decoding
• How do LLMs generate text using these distributions?
Encoders and Decoders
Multiple architectures focus on encoding and decoding, i.e., embedding and text generation.
All models are built on the Transformer architecture.
Each type of model has different capabilities (embedding / generation).
Models of each type come in a variety of sizes (number of parameters).
Transformers
Encoders
Encoder – models that convert a sequence of words to an embedding (vector representation)
[Figure: the tokens "They", "sent", "me", "a" each mapped to an embedding vector, e.g., <-0.27,…,4.31>]
Examples: Embed-light, BERT, RoBERTa, DistilBERT, SBERT, …
Primary uses: embedding tokens, sentences, & documents
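As an aside (not on the original slide), here is a minimal sketch of the embedding use case, assuming the Hugging Face transformers library and a BERT checkpoint; the model name and the mean-pooling step are illustrative choices, not the course's prescribed method.

```python
# Minimal sketch: sentence embedding with an encoder-only model (BERT).
# Assumes the Hugging Face `transformers` library and PyTorch are installed.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("They sent me a", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One embedding vector per token: shape (1, seq_len, 768)
token_embeddings = outputs.last_hidden_state
# A simple sentence embedding: average the token vectors (mean pooling)
sentence_embedding = token_embeddings.mean(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 768])
```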
Decoders
Decoder – models that take a sequence of words and output the next word
[Figure: the input tokens "They", "sent", "me", "a" producing the next word "lion"]
Examples: GPT-4, Llama, BLOOM, Falcon, …
Primary uses: text generation, chat-style models (including QA, etc.)
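A minimal sketch of the generation use case (not on the original slide), again assuming the Hugging Face transformers library; GPT-2 stands in for the larger decoders listed above simply because it is small and freely downloadable.

```python
# Minimal sketch: next-word generation with a decoder-only model (GPT-2).
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "I wrote to the zoo to send me a pet. They sent me a"
inputs = tokenizer(prompt, return_tensors="pt")

# Greedily generate up to 5 new tokens (decoding strategies are covered later)
output_ids = model.generate(**inputs, max_new_tokens=5, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```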
Encoder-Decoders
Encoder-decoder – encodes a sequence of words and uses the encoding to output the next word
[Figure: the English tokens "They", "sent", "me", "a" encoded into vectors and decoded into the Hebrew translation "הם שלחו לי" ("They sent me")]
Examples: T5, UL2, BART, …
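A minimal sketch of the encoder-decoder use case (not on the original slide), assuming the Hugging Face transformers library. T5's public checkpoints were trained on English-German/French/Romanian rather than Hebrew, so German is used here purely for illustration.

```python
# Minimal sketch: translation with an encoder-decoder model (T5).
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# The encoder embeds the English input; the decoder generates the translation
inputs = tokenizer("translate English to German: They sent me a dog.",
                   return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```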
Transformers Architecture
The Transformer architecture eliminates the need for recurrent or convolutional layers.
The Encoder stack contains multiple Encoders.
Each Encoder contains:
• Multi-Head Attention layer
• Feed-forward layer
The Decoder stack contains many Decoders. Each
Decoder contains:
• Two Multi-Head Attention layers
• Feed-forward layer
The Output layer generates the final output and contains:
• Linear layer
• Softmax function
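To make this stacked structure concrete, here is a minimal sketch (not on the original slide) using PyTorch's built-in Transformer layers; the dimensions and layer counts are illustrative assumptions, not the values of any particular model.

```python
# Minimal sketch: encoder stack, decoder stack, and output head in PyTorch.
import torch
import torch.nn as nn

d_model, n_heads, n_layers, vocab_size = 512, 8, 6, 32000  # illustrative sizes

# Encoder stack: each layer = multi-head self-attention + feed-forward
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)

# Decoder stack: each layer = two multi-head attention blocks + feed-forward
decoder_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=n_layers)

# Output: linear projection to the vocabulary followed by softmax
output_head = nn.Sequential(nn.Linear(d_model, vocab_size), nn.Softmax(dim=-1))

src = torch.rand(1, 10, d_model)  # embedded input sequence (batch, seq, d_model)
tgt = torch.rand(1, 7, d_model)   # embedded output-so-far

memory = encoder(src)                      # deep representation of the input
probs = output_head(decoder(tgt, memory))  # distribution over the vocabulary
print(probs.shape)                         # torch.Size([1, 7, 32000])
```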
Transformers Architecture
• The data that leaves the encoder is a deep representation of the structure and meaning of the input sequences.
• This representation is inserted into the middle of each decoder, where it influences the decoder's attention over the input (the encoder-decoder cross-attention), as sketched below.
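A minimal sketch (not on the original slide) of that hand-off, assuming PyTorch: the decoder's second attention block takes the encoder's output ("memory") as keys and values, while the decoder's own states act as queries.

```python
# Minimal sketch: the encoder output feeding the decoder's cross-attention.
import torch
import torch.nn as nn

d_model, n_heads = 512, 8  # illustrative sizes
cross_attention = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

decoder_states = torch.rand(1, 7, d_model)   # queries: the decoder's positions
encoder_memory = torch.rand(1, 10, d_model)  # keys/values: the encoder's output

# Each decoder position attends over the full encoder representation
attended, weights = cross_attention(query=decoder_states,
                                    key=encoder_memory,
                                    value=encoder_memory)
print(attended.shape)  # torch.Size([1, 7, 512]) – encoder-informed decoder states
```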
Decoding
• The process of generating text with an LLM
I wrote to the zoo to send me a pet. They sent me a ________
Word:        lion   elephant   dog    cat    panther   alligator
Probability: 0.03   0.02       0.45   0.40   0.05      0.01
• Decoding happens iteratively, one word at a time.
• At each step of decoding, we use the distribution over the vocabulary and select one word to emit.
• The word is appended to the input, and the decoding process continues (a sketch of this loop follows below).
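A minimal sketch of this loop (not on the original slide); next_word_distribution and select_word are hypothetical stand-ins for the model's forward pass and for the chosen decoding strategy.

```python
# Minimal sketch: the iterative decoding loop.
def decode(prompt, next_word_distribution, select_word, max_steps=20):
    text = prompt
    for _ in range(max_steps):
        distribution = next_word_distribution(text)  # word -> probability
        word = select_word(distribution)             # strategy picks one word
        if word == "EOS":                            # end-of-sequence token stops decoding
            break
        text = text + " " + word                     # append and continue
    return text
```

The two strategies on the following slides differ only in how select_word is implemented.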
Greedy Decoding
• Pick the highest probability word at each step
I wrote to the zoo to send me a pet. They sent me a ________
Word:        lion   elephant   dog    cat    panther   alligator
Probability: 0.03   0.02       0.45   0.40   0.05      0.01
I wrote to the zoo to send me a pet. They sent me a dog _______
Word:        EOS    elephant   dog    cat    panther   alligator
Probability: 0.99   0.02       0.45   0.40   0.05      0.01
Output: I wrote to the zoo to send me a pet. They sent me a dog.
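A minimal sketch (not on the original slide) of greedy selection applied to the first distribution above.

```python
# Minimal sketch: greedy decoding picks the single highest-probability word.
distribution = {"lion": 0.03, "elephant": 0.02, "dog": 0.45,
                "cat": 0.40, "panther": 0.05, "alligator": 0.01}

greedy_word = max(distribution, key=distribution.get)
print(greedy_word)  # dog
```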
Non-Deterministic Decoding
• Pick randomly among high probability candidates at each step.
I wrote to the zoo to send me a pet. They sent me a ________
Word:        small   elephant   dog    cat    panda   alligator
Probability: 0.01    0.02       0.25   0.40   0.05    0.01
I wrote to the zoo to send me a pet. They sent me a small ________
Word:        small   elephant   dog    cat    panda   red
Probability: 0.01    0.02       0.25   0.40   0.05    0.01
I wrote to the zoo to send me a pet. They sent me a small red ________
Word:        small   elephant   dog    cat    panda   red
Probability: 0.01    0.02       0.25   0.40   0.05    0.01
Output: I wrote to the zoo to send me a pet. They sent me a small red panda.
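A minimal sketch (not on the original slide) of non-deterministic decoding: the next word is sampled in proportion to its probability, so different runs can emit different words, which is how the run above ended with "small red panda".

```python
# Minimal sketch: sample the next word instead of taking the argmax.
import random

distribution = {"small": 0.01, "elephant": 0.02, "dog": 0.25,
                "cat": 0.40, "panda": 0.05, "alligator": 0.01}

words = list(distribution)
weights = list(distribution.values())  # relative weights; need not sum to 1

sampled_word = random.choices(words, weights=weights, k=1)[0]
print(sampled_word)  # "cat" most often, but sometimes "small", "panda", ...
```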
Thank You