Introduction to Large Language Model (LLM)
AI in Physics Workshop – April 11, 2025
Ekta Vats
Assistant Professor, Docent
Department of Information Technology, Uppsala University (UU)
Beijer Researcher, Beijer Laboratory for AI Research, Beijerstiftelsen
Navigating in the Artificial Intelligence Era
The future?
Image generated using ChatGPT
Holographic displays, autonomous bicycles, drone deliveries
Ångström → UNGSTROM
Generative AI – Generating new content!
Learn patterns and structure of input data, and generate new samples that exhibit similar characteristics.
AI Taxonomy
• Artificial Intelligence: Technology that enables computers and machines to simulate human intelligence and problem-solving capabilities
• Generative AI
AI Taxonomy
• Artificial Intelligence
• Machine Learning: Subset of AI, focuses on developing systems that can learn from data and make decisions based on data
• Deep Learning
• Generative AI
AI Evolution Through Decades
Source: sk hynix newsroom
2024: The Nobel Prizes in Physics and Chemistry go to AI!
Synthetic image generation
SORA: text-to-video model
Text: Turn the library into a spaceship.
Output: Video generation!
Sources: leewayhertz; OpenAI, SORA
Image generation!
Text-to-image models: DALL·E, Midjourney
Example prompts:
• Two dogs dressed like Roman soldiers on a pirate ship looking at New York City through a spyglass
• An image of a cat enjoying sprinklers.
Sources: OpenAI, DALL·E 2; Twitter, Benjamin Hilton; Microsoft Copilot
TTS: Text-to-song; STS: Speech-to-singing
Source – leewayhertz
AI Taxonomy
• Artificial Intelligence
• Machine Learning
• Deep Learning: Subset of ML, uses Artificial Neural Networks to learn from data
• Generative AI: Subset of AI, focuses on generating new content
• LLMs: Specific type of Generative AI model that focuses on generating human-like text
LLMs: Large Language Models
Large Language Model (LLM)
• A type of language model
Language Model
• Type of machine learning model trained to predict a probability distribution over words
• Predicts the probability of a word in a sequence based on the previous words
• Example: after “Vi ses” (Swedish for “See you”), the next word might be:
imorgon (“tomorrow”) 0.5
snart (“soon”) 0.3
på (“on”) 0.2
• Examples: Google search and advanced language models
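As a rough illustration of how such next-word probabilities can arise, the sketch below counts word pairs in a tiny corpus and normalises the counts. The corpus is hypothetical and was chosen only so that the counts reproduce the 0.5 / 0.3 / 0.2 values from the "Vi ses" example; real language models are neural networks trained on far larger data, and the function names here are illustrative.

```python
from collections import Counter, defaultdict

def train_bigram_model(sentences):
    """Count how often each word follows each previous word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def next_word_distribution(counts, prev_word):
    """Turn the raw counts for `prev_word` into probabilities that sum to 1."""
    following = counts[prev_word.lower()]
    total = sum(following.values())
    return {word: n / total for word, n in following.items()}

# Hypothetical toy corpus (an assumption, not real training data).
corpus = ["Vi ses imorgon"] * 5 + ["Vi ses snart"] * 3 + ["Vi ses på"] * 2

model = train_bigram_model(corpus)
distribution = next_word_distribution(model, "ses")
print(distribution)
```

This bigram model conditions only on the single previous word; neural language models condition on a much longer context.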
Language Model
• Classic definition: Probability distribution over a sequence of tokens
• For example, if the vocabulary is the set of tokens V = {ate, ball, cheese, mouse, the}, a language model p might assign:
p(the, mouse, ate, the, cheese) = 0.02,
p(the, cheese, ate, the, mouse) = 0.01,
p(mouse, the, the, cheese, ate) = 0.0001.
Source: https://stanford-cs324.github.io/winter2022/lectures/introduction/
• Example: Neural language models – RNNs (including LSTMs), Transformers
p(cheese | ate, the) = some-neural-network(ate, the, cheese)
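The joint probability of a sequence and the conditional next-token probabilities are linked by the chain rule of probability: the joint is the product of each token's probability given the tokens before it. The sketch below shows this with a placeholder in place of the neural network. `conditional_prob` is a hypothetical stand-in that simply returns a uniform distribution over V, so the resulting numbers are not those of any real model.

```python
import math

VOCAB = ["ate", "ball", "cheese", "mouse", "the"]

def conditional_prob(token, context):
    """Hypothetical stand-in for some-neural-network: uniform over the vocabulary."""
    return 1.0 / len(VOCAB)

def sequence_log_prob(tokens, cond=conditional_prob):
    """Chain rule: log p(x1, ..., xL) = sum over i of log p(x_i | x_1, ..., x_{i-1})."""
    total = 0.0
    for i, token in enumerate(tokens):
        total += math.log(cond(token, tokens[:i]))
    return total

sentence = ["the", "mouse", "ate", "the", "cheese"]
joint = math.exp(sequence_log_prob(sentence))
print(joint)
```

Summing log-probabilities rather than multiplying probabilities avoids numerical underflow on long sequences.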
Large Language Model (LLM)
• A type of language model
• Why large:
• Trained using massive datasets
• With the rise of deep learning and availability of large computational resources, the size of
neural language models has increased.
Berzelius!
Image source: nsc.liu.se
Large Language Model (LLM)
Scaling-up – bigger is better?
Timeline shown: 2018 to 2023
2025: GPT-5?
Source – cobusgreyling.medium.com
Architecture
• LLM is a type of transformer model
• Transformer:
• A neural network that learns context and meaning by tracking relationships in sequential data
• Encoder-decoder architecture
• Encoder extracts features from an input sequence
• Decoder uses the features to produce an output sentence
• Encoder and decoder can be used independently!
Case: translation
Hur mår du? (input) → TRANSFORMER → How are you? (output)
Transformer – Recommended Reading
• Transformers by Lucas Beyer
• Link: https://www.youtube.com/watch?v=EixI6t5oif0
• Deep Learning course (1RT720)
• Given in period 3
• Course responsible: Niklas Wahlström
• LLMs & Societal Consequences of AI (1RT730)
• Given in period 1
• Course responsible: Ekta Vats
Types of LLMs
• Encoder only:
• Suited for tasks that require understanding language, such as toxicity classification and sentiment analysis (e.g. sentiment analysis of tweets).
• Example: BERT (Bidirectional Encoder Representations from Transformers)
• Decoder only:
• Suited for generating language and content, such as story writing, blog generation, open-domain Q/A and virtual assistants (chatbots).
• Example: GPT-3 (Generative Pre-trained Transformer 3)
• Encoder-decoder:
• Combines the encoder and decoder components of the transformer architecture
• Suited for both understanding and generating content, such as translation, text-to-code and summarisation.
• Example: T5 (Text-to-Text Transfer Transformer) by Google, available via Hugging Face
“Translate English to Swedish: Thank you very much.”
T5
“Tack så mycket.”
“summarize: Sweden became NATO’s newest member on Thursday (7 March 2024),
upon depositing its instrument of accession to the North Atlantic Treaty with the
Government of the United States in Washington DC. With Sweden’s accession, NATO
now counts 32 countries among its members.”
T5
“Sweden joined NATO on March 7, 2024, becoming its 32nd member.”
LLM Lifecycle
1. Collect training data – e.g. Common Crawl
2. Train an LLM – e.g. GPT-3
3. Adapt it for downstream tasks – e.g. Q/A
4. Deploy LLM to users – e.g. Chatbot
Data is the fuel! AI is the engine!
Important
• To understand and document the composition of your training dataset
• To be aware of copyright law (IPR, licenses), privacy law, high-risk applications
• Is training an LLM on this data a copyright/privacy violation?
• Can it cause intentional harm – spam, harassment, disinformation, phishing attacks?
• Are you deploying an LLM in healthcare or education?
• GitHub code and licensing
Further reading: https://stanford-cs324.github.io/winter2022/lectures/legality/
Multimodal LLMs
• Unimodal LLMs are trained on a single modality of data, such as text
• Lack visual context
• Multimodal LLMs:
• Understand and generate content across multiple modalities,
• such as text, images, audio, video, or other sensor data
• Vision-language models:
• Multimodal models that can learn from images and text
Vision-Language Models
• Main idea: Unify the image and text representation, and feed it to a textual decoder for
generation
• Typically consists of two main components:
• Vision encoder: extracts image features and encodes them into a format that can be understood
by the language decoder. Example: ResNet, ViT
• Language decoder: takes these encoded visual features along with any textual input and
generates descriptions, or captions, etc. Example: GPT-2, GPT-3
Vision-Language Models
https://huggingface.co/blog/vlms
Contrastive Language-Image Pretraining (CLIP)
• CLIP differs from traditional vision-language models
• It does not generate text descriptions or captions for images
• Focuses on learning a joint representation space where images and text can be compared directly
• Enables various downstream tasks, such as
• Zero-shot image classification: classify images into one of several classes, without any prior
training or knowledge of the classes
• Zero-shot image retrieval, text-based image generation
Radford, Alec, et al. "Learning transferable visual models from natural language supervision." International conference on machine learning. PMLR, 2021.
Contrastive Language-Image Pretraining (CLIP)
1. Jointly trains a text encoder and an image encoder to predict the correct image-text pair
Contrastive pre-training (contrastive learning framework)
• Text encoder (Transformer) → text embeddings
• Image encoder (ResNet or ViT) → image embeddings
• Maximize the cosine similarities between correct image-text pairs
• Minimize the cosine similarities for dissimilar pairs (non-diagonal elements)
2. Converts training dataset classes into captions
3. Estimates the best caption for the given input image for zero-shot prediction
• Calculate similarity between an image vector and multiple caption vectors, selecting the
one with the highest score
• Trained on 400M image-text pairs
• CLIP can perform image tasks using only text, no extra training needed
• CLIP encodings + decoder models (e.g. GPT-2) => image captioning
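The contrastive objective and the zero-shot prediction step can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not CLIP's actual implementation: the 3-dimensional embeddings are hypothetical stand-ins for real encoder outputs, the fixed temperature of 0.07 is an assumed value (CLIP learns its temperature during training), and the "a photo of a ..." captions are illustrative.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_matrix(image_embs, text_embs):
    """Rows: images, columns: texts. Entry (i, j) is cos(image_i, text_j)."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) for t in txts] for i in imgs]

def cross_entropy(logits, target):
    """Negative log-softmax of the target entry (numerically stabilised)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric loss: correct pairs sit on the diagonal of the similarity matrix."""
    sims = cosine_matrix(image_embs, text_embs)
    n = len(sims)
    logits = [[s / temperature for s in row] for row in sims]
    image_to_text = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    text_to_image = sum(cross_entropy(columns[j], j) for j in range(n)) / n
    return (image_to_text + text_to_image) / 2

def zero_shot_predict(image_emb, caption_embs, captions):
    """Pick the caption whose embedding is most similar to the image embedding."""
    scores = cosine_matrix([image_emb], caption_embs)[0]
    best = max(range(len(captions)), key=lambda j: scores[j])
    return captions[best], scores

# Hypothetical embeddings standing in for real encoder outputs.
image_embs = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.1], [0.1, 0.0, 1.0]]
text_embs = [[0.9, 0.0, 0.1], [0.1, 0.8, 0.0], [0.0, 0.1, 0.9]]
captions = ["a photo of a dog", "a photo of a cat", "a photo of a car"]

aligned_loss = contrastive_loss(image_embs, text_embs)
shuffled_loss = contrastive_loss(image_embs, text_embs[1:] + text_embs[:1])
label, scores = zero_shot_predict(image_embs[1], text_embs, captions)
print(aligned_loss, shuffled_loss, label)
```

The loss is low when each image is most similar to its own caption and high when the pairs are shuffled, which is the signal that pulls matching embeddings together during training.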
Some Interesting Examples
Modern Handwriting Recognition using ChatGPT
Prompt: Can you recognise this text?
Source: Medium article on Exploring Multimodal Large Language Models: A Step Forward in AI
Old Handwriting Recognition using ChatGPT
Prompt: Can you read this?
Source: Medium article on Exploring Multimodal Large Language Models: A Step Forward in AI
Ancient Handwriting Recognition using Microsoft Copilot
Nota bene is a Latin phrase meaning “note well”
OCR and Document Question Answering
• Text…
Question: Who is in cc in this letter?
Answer: T.F. Riehl
https://huggingface.co/docs/transformers/main/en/tasks/document_question_answering
Understanding Complex Parking Signs – ChatGPT
Anyone?
Prompt: Suppose it is Wednesday and the time is 4PM. Am I allowed to park my car at this spot?
Source: Medium article on Exploring Multimodal Large Language Models: A Step Forward in AI
Understanding Complex Parking Signs – GPT-4V
Prompt: Suppose it is Wednesday and the time is 4PM. Am I allowed to park my car at this spot?
Source: Medium article on Exploring Multimodal Large Language Models: A Step Forward in AI
Visual Question Answering
Marino, Kenneth, et al. "OK-VQA: A visual question answering benchmark requiring external knowledge." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.
Visual Question Answering
LLaVA: Large Language and Vision Assistant - CLIP visual encoder + Vicuna LLM
Liu, Haotian, et al. "Visual instruction tuning." Advances in neural information processing systems 36 (2024).
Source: Medium article on Exploring Multimodal Large Language Models: A Step Forward in AI
Whisper by OpenAI
• Speech-to-text model, performs:
• Speech recognition
• Speech translation
• Spoken language identification
• Voice activity detection
Focus: Swedish speech
Image source: LinkedIn
LLMs and Video Analysis
Platform: Azure AI
Challenges and Limitations
• Out-of-date training data
This example was tested on 10 March, 2025
• Hallucinations
• Facts are sometimes extrapolated: LLMs try to invent facts, articulating the inaccurate information in a convincing way.
This example was tested in Spring 2024
Tested: October 2024
Tested: March 2025
What is the correct answer?
5,162,060
• It is a language model!
• Training data lacks focus on math concepts and problem-solving.
Hallucinations, Case – Law
This example was shared by a lawyer
Microsoft Copilot did not hallucinate!
Hallucinations, Case – MiniGPT-4
Source: MiniGPT-4 by HuggingFace
Hallucinations, Case - HTRFlow
HTRFlow model Bevittna
(Bevittna translates to witness)
Source: https://huggingface.co/spaces/Riksarkivet/htr_demo (2023)
Potential Solution: Retrieval Augmented Generation (RAG)
• Helps address both hallucinations and out-of-date training data issues
Retrieval Augmented Generation (RAG)
• Build LLMs with current and reliable information
• Extends their utility to specific data sources
• Allows LLMs to go beyond their knowledge base, enabling them to access real-time data and provide up-to-date responses
• Example use case: Improve Math Q/A
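The retrieve-then-generate idea can be sketched as follows. This is a toy illustration under stated assumptions: the retriever ranks documents by simple word overlap (real systems typically use embedding-based search), the knowledge base is hypothetical, and the final call to an LLM is left out, so the code only builds the augmented prompt. The function names are illustrative.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of words."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve(query, documents, top_k=1):
    """Rank documents by word overlap with the query (stand-in for embedding search)."""
    query_words = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:top_k]

def build_rag_prompt(query, documents, top_k=1):
    """Place the retrieved passages in front of the question."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents, top_k))
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {query}\n"
        "Answer:"
    )

# Hypothetical knowledge base (an assumption for illustration).
knowledge_base = [
    "Sweden became NATO's newest member on 7 March 2024, bringing the alliance to 32 countries.",
    "Whisper is a speech-to-text model by OpenAI.",
    "CLIP was trained on 400M image-text pairs.",
]

query = "How many countries are members of NATO?"
prompt = build_rag_prompt(query, knowledge_base)
print(prompt)
```

Because the answer is read from the retrieved passage rather than recalled from model weights, updating the knowledge base is enough to keep responses current.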
Retrieval Augmented Generation (RAG)
This example was tested on 10 March, 2025
RAG Example Projects from UU course on LLMs
• Teaching LLM to teach AI
• Chat with 1177.se
• LLM powered teaching assistant for Smarter Education
• Exercise sheet generation
• Based on age, grade and interest of the child
Challenges and Limitations
• Bias and misinformation as ethical concerns
Social Biases in Language Models: http://uu.diva-portal.org/smash/get/diva2:1696604/FULLTEXT01.pdf
• Key pointer:
• LLMs inherit society's stereotypes from the data they are trained on.
• Other cases: voice assistants and FaceID
Voice Assistants and Accent Bias
• Data problem!
• The higher the quantity and diversity of speech samples in a corpus, the more accurate
the resulting model
Amazon researchers improved Irish-accented training data by using a voice conversion model
Source: https://www.amazon.science/blog/how-alexa-learned-to-speak-with-an-irish-accent (2023)
Face ID and Limitations
Source: https://news.mit.edu/2018/study-finds-gender-skin-type-bias-artificial-intelligence-systems-0212 (2018)
• This bias arises from the imbalance in the training data
• Lighting conditions: Darker skin tones might reflect less light, potentially affecting
recognition accuracy
• Modern face recognition systems:
• Diverse training data
• Algorithmic improvements
• Aim to achieve robust performance, regardless of skin tone or lighting conditions.
• Biases can still persist
Future Insights!
Scaling-up – bigger is better?
• Quantitatively: different capabilities
• Qualitatively: different societal impact
Scaling-up – bigger is better? In-Context Learning
• Zero-shot
• One-shot
• Few-shot
https://www.cs.princeton.edu/courses/archive/fall22/cos597G/lectures/lec04.pdf
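In in-context learning the model's weights are not updated; the task is specified entirely in the prompt, with zero, one, or a few worked examples placed before the query. The sketch below builds the three prompt variants. The `=>` separator and the helper name `build_icl_prompt` are illustrative assumptions, and the demonstration pairs are chosen for illustration only.

```python
def build_icl_prompt(task, query, examples=()):
    """Zero-shot if `examples` is empty, one-shot with one pair, few-shot with several."""
    lines = [task]
    for source, target in examples:
        lines.append(f"{source} => {target}")
    lines.append(f"{query} =>")
    return "\n".join(lines)

task = "Translate English to Swedish:"
# Illustrative demonstration pairs (assumed for this sketch).
demos = [
    ("Thank you very much.", "Tack så mycket."),
    ("How are you?", "Hur mår du?"),
    ("See you tomorrow.", "Vi ses imorgon."),
]

zero_shot = build_icl_prompt(task, "Good morning.")
one_shot = build_icl_prompt(task, "Good morning.", demos[:1])
few_shot = build_icl_prompt(task, "Good morning.", demos)
print(few_shot)
```

The only difference between the three settings is how many demonstrations precede the query; the model is expected to continue the pattern.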
AI-powered precision in healthcare
Analyse medical images and use LLMs to correlate medical findings with patient history, delivering comprehensive diagnostics and potential treatment options.
Also… Applications in Physics!
Advanced Multimodal LLMs
• Example: NExT-GPT (Any-to-Any Multimodal LLM)
• Text + Video → Text + Image
• ImageBind by Meta (6 modalities: images/videos, audio, text, depth, thermal, inertial measurement units (IMUs)) + Diffusion Models (generation)
https://next-gpt.github.io
You can select “off” for this option in your ChatGPT settings.
We conducted an exam for ChatGPT and Copilot
• Total 10 questions
• Following are our findings:
• Both excel in creative thinking, image generation and summarization.
• ChatGPT struggles with legal reasoning but excels in math.
• Neither handles old handwriting well.
• Overall, score for Copilot was higher.
• Curious to know how it works in Physics…
Thank you!
Email: [email protected]
Webpage: https://www.ektavats.se