
Introduction to Large Language Models (LLMs)

AI in Physics Workshop – April 11, 2025

Ekta Vats

Assistant Professor, Docent

Department of Information Technology, Uppsala University (UU)

Beijer Researcher, Beijer Laboratory for AI Research, Beijerstiftelsen
Navigating in the Artificial Intelligence Era
The future?

Image generated using ChatGPT


Navigating in the Artificial Intelligence Era
The future?

Holographic displays, autonomous bicycles, drone deliveries


Ångström → UNGSTROM
Navigating in the Artificial Intelligence Era

Generative AI – Generating new content!

Learn patterns and structure of input data, and generate new samples that exhibit similar characteristics.
AI Taxonomy

• Artificial Intelligence: technology that enables computers and machines to simulate human intelligence and problem-solving capabilities

• Generative AI
AI Taxonomy

• Artificial Intelligence

• Machine Learning: subset of AI, focuses on developing systems that can learn from data and make decisions based on data

• Deep Learning

• Generative AI
AI Evolution Through Decades

Source: sk hynix newsroom


AI Evolution Through Decades

2024: The Nobel Prizes in Physics and Chemistry go to AI!
Synthetic image generation

SORA: text-to-video model

Text:
Turn the library into a spaceship.

Output:

Video generation!

Source: leewayhertz; Source: OpenAI, SORA


Synthetic image generation

Source: OpenAI, DALL·E 2; Source: Twitter, Benjamin Hilton; Source: Microsoft Copilot

Image generation!

"Two dogs dressed like Roman soldiers on a pirate ship looking at New York City through a spyglass"

"An image of a cat enjoying sprinklers."

Text-to-image models: DALL·E, Midjourney
Synthetic image generation

Source: leewayhertz. TTS: text-to-song; STS: speech-to-singing


AI Taxonomy

• Artificial Intelligence

• Machine Learning

• Deep Learning: subset of ML, uses artificial neural networks to learn from data

• Generative AI: subset of AI, focuses on generating new content

• LLMs: a specific type of generative AI model that focuses on generating human-like text

LLMs: Large Language Models


Large Language Model (LLM)

• A type of language model


Language Model

• Type of machine learning model trained to predict a probability distribution over words

Example (Swedish): "Vi ses ..." → imorgon: 0.5, snart: 0.3, på: 0.2

• Predicts the probability of a word in a sequence based on the previous words


Language Model

• Type of machine learning model trained to predict a probability distribution over words

Example (Swedish): "Vi ses ..." → imorgon: 0.5, snart: 0.3, på: 0.2

Used in Google search and advanced language models
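As a concrete illustration of "predicting a probability distribution over words", the sketch below asks a small open causal language model for its next-token distribution. The model choice (GPT-2) and the Swedish prompt are illustrative assumptions, not part of the slides; GPT-2 is trained mostly on English.

```python
# Minimal sketch: next-token probability distribution from a small causal LM.
# Model choice (gpt2) and the prompt are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "Vi ses"  # Swedish for "See you"; treat this as a toy example
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, sequence_length, vocab_size)

# Softmax over the vocabulary at the last position gives p(next word | prompt)
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(token_id))!r}: {prob.item():.3f}")
```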
Language Model

• Classic definition: Probability distribution over a sequence of tokens

• For example, if the vocabulary is the set of tokens V = {ate, ball, cheese, mouse, the},

a language model p might assign:

p(the, mouse, ate, the, cheese) = 0.02

p(the, cheese, ate, the, mouse) = 0.01

p(mouse, the, the, cheese, ate) = 0.0001

Source: https://stanford-cs324.github.io/winter2022/lectures/introduction/
Language Model

• Classic definition: Probability distribution over a sequence of tokens

• For example, if the vocabulary is the set of tokens V = {ate, ball, cheese, mouse, the},

a language model p might assign:

p(the, mouse, ate, the, cheese) = 0.02

p(the, cheese, ate, the, mouse) = 0.01

p(mouse, the, the, cheese, ate) = 0.0001

• Example: neural language models – RNNs (including LSTMs), Transformers

p(cheese | ate, the) = some-neural-network(ate, the, cheese)
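The conditional in the last line is one factor in the chain-rule factorisation that autoregressive (neural) language models rely on; a standard way to write it is:

```latex
% Chain rule: the joint probability of a token sequence factorises into
% next-token conditionals, each of which the neural network computes.
p(x_1, x_2, \dots, x_L) = \prod_{i=1}^{L} p(x_i \mid x_1, \dots, x_{i-1})
% e.g. p(the, mouse, ate, the, cheese) =
%      p(the) \cdot p(mouse \mid the) \cdot p(ate \mid the, mouse)
%      \cdot p(the \mid the, mouse, ate) \cdot p(cheese \mid the, mouse, ate, the)
```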


Large Language Model (LLM)

• A type of language model

• Why large:

• Trained using massive datasets

• With the rise of deep learning and availability of large computational resources, the size of

neural language models has increased.


Berzelius!

Image source: nsc.liu.se


Large Language Model (LLM)

Scaling-up – bigger is better?

[Figure: growth in model size from 2018 to 2023]

2025: GPT-5?

Source: cobusgreyling.medium.com
Architecture

• LLM is a type of transformer model

• Transformer:

• A neural network that learns context and meaning

by tracking relationships in sequential data


Architecture

• LLM is a type of transformer model

• Transformer:

• A neural network that learns context and meaning by tracking relationships in sequential data

• Encoder-decoder architecture

• Encoder extracts features from an input sequence

• Decoder uses the features to produce an output sentence

Case: translation – input: "Hur mår du?" → output: "How are you?"


Architecture

• LLM is a type of transformer model

• Transformer:

• A neural network that learns context and meaning by tracking relationships in sequential data

• Encoder-decoder architecture (the encoder and decoder can also be used independently!)

• Encoder extracts features from an input sequence

• Decoder uses the features to produce an output sentence

Case: translation – input: "Hur mår du?" → output: "How are you?"
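A minimal sketch of the translation case above with an off-the-shelf encoder-decoder checkpoint; the model name (Helsinki-NLP/opus-mt-sv-en, a MarianMT Swedish-to-English model on the Hugging Face Hub) is an assumption for illustration, not something from the slides.

```python
# Minimal encoder-decoder translation sketch (Swedish -> English).
# Assumes the Helsinki-NLP/opus-mt-sv-en checkpoint is available on the Hugging Face Hub.
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-sv-en")

# Encoder reads the Swedish input; decoder generates the English output.
result = translator("Hur mår du?")
print(result[0]["translation_text"])  # expected output along the lines of "How are you?"
```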


Transformer – Recommended Reading

• Transformers by Lucas Beyer

• Link: https://www.youtube.com/watch?v=EixI6t5oif0

• Deep Learning course (1RT720)

• Given in period 3

• Course responsible: Niklas Wahlström

• LLMs & Societal Consequences of AI (1RT730)

• Given in period 1

• Course responsible: Ekta Vats


Types of LLMs
Types of LLMs

• Encoder only:

• Suited for tasks that require understanding language,

• such as toxicity classification and sentiment analysis.

Sentiment analysis of tweets

• Example: BERT (Bidirectional Encoder Representations from Transformers)
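A hedged sketch of the encoder-only use case: the default sentiment-analysis pipeline in the transformers library loads a small BERT-style encoder fine-tuned for English sentiment (which exact default checkpoint gets downloaded is an assumption here).

```python
# Encoder-only example: sentiment analysis with a BERT-style encoder.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
tweets = [
    "I love the new AI in Physics workshop!",
    "This lecture hall is way too cold.",
]
for tweet, prediction in zip(tweets, classifier(tweets)):
    print(tweet, "->", prediction["label"], round(prediction["score"], 3))
```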


Types of LLMs

• Decoder only:

• Suited for generating language and content,

• such as story writing, blog generation, open-domain Q/A and virtual assistants.

ChatBots

• Example: GPT-3 (Generative Pretrained Transformer 3)
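GPT-3 itself is only available through an API, so the sketch below uses GPT-2, a smaller open decoder-only model, to illustrate the same left-to-right generation idea.

```python
# Decoder-only example: open-ended text generation with GPT-2 (an open stand-in for GPT-3).
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

story = generator(
    "Once upon a time in Uppsala,",  # prompt; the model continues it token by token
    max_new_tokens=40,
    do_sample=True,   # sample instead of greedy decoding, for more varied stories
)
print(story[0]["generated_text"])
```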


Types of LLMs

• Encoder-decoder:

• Combine the encoder and decoder components of the transformer architecture

• Suited for both understanding and generating content,

• such as translation, text-to-code and summarisation.

• Example: T5 (Text-to-Text Transfer Transformer) by Google, available via Hugging Face


Types of LLMs

• Encoder-decoder:

• Example: T5 (Text-to-Text Transfer Transformer) by Google, available via Hugging Face.

“Translate English to Swedish: Thank you very much.”

T5

“Tack så mycket.”
Types of LLMs

• Encoder-decoder:

• Example: T5 (Text-to-Text Transfer Transformer) by Google, available via Hugging Face.

“summarize: Sweden became NATO’s newest member on Thursday (7 March 2024), upon depositing its instrument of accession to the North Atlantic Treaty with the Government of the United States in Washington DC. With Sweden’s accession, NATO now counts 32 countries among its members.”

T5

“Sweden joined NATO on March 7, 2024, becoming its 32nd member.”
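A sketch of the text-to-text interface behind this example, assuming the public t5-small checkpoint; a model this small will produce a rougher summary than the one shown on the slide.

```python
# Encoder-decoder example: T5 summarisation via the "summarize:" task prefix.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

text = ("summarize: Sweden became NATO's newest member on Thursday (7 March 2024), "
        "upon depositing its instrument of accession to the North Atlantic Treaty with "
        "the Government of the United States in Washington DC. With Sweden's accession, "
        "NATO now counts 32 countries among its members.")

inputs = tokenizer(text, return_tensors="pt")
summary_ids = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```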


LLM Lifecycle
LLM Lifecycle

1. Collect training data – e.g. Common Crawl

2. Train an LLM – e.g. GPT-3

3. Adapt it for downstream tasks – e.g. Q/A

4. Deploy LLM to users – e.g. Chatbot


Important

• To understand and document the composition of your training dataset

Data is the fuel! AI is the engine!

Further reading: https://stanford-cs324.github.io/winter2022/lectures/legality/


Important

• To understand and document the composition of your training dataset

• To be aware of copyright law (IPR, licenses), privacy law, high-risk applications

• Is training LLM on this data a copyright/privacy violation?

• Can it cause intentional harm – spam, harassment, disinformation, phishing attacks?

• Are you deploying LLM in healthcare or education?

• GitHub code and licensing

Further reading: https://stanford-cs324.github.io/winter2022/lectures/legality/


Multimodal LLMs
Multimodal LLMs

• Unimodal LLMs are trained on a single modality of data, such as text

• Lack visual context


Multimodal LLMs

• Unimodal LLMs are trained on a single modality of data, such as text

• Lack visual context

• Multimodal LLMs:

• Understand and generate content across multiple modalities,

• such as text, images, audio, video, or other sensor data

• Vision-language models:

• Multimodal models that can learn from images and text


Vision-Language Models

• Main idea: Unify the image and text representation, and feed it to a textual decoder for generation
Vision-Language Models

• Main idea: Unify the image and text representation, and feed it to a textual decoder for generation

• Typically consists of two main components:

• Vision encoder: extracts image features and encodes them into a format that can be understood by the language decoder. Example: ResNet, ViT

• Language decoder: takes these encoded visual features along with any textual input and generates descriptions, captions, etc. Example: GPT-2, GPT-3
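A minimal sketch of this two-component pattern using an open captioning model; the checkpoint name (Salesforce/blip-image-captioning-base) and the local image path are assumptions for illustration. BLIP uses a ViT vision encoder and a text decoder rather than GPT-2/GPT-3, but the division of labour is the same.

```python
# Vision encoder + language decoder sketch: image captioning with BLIP.
# Checkpoint name and image path are illustrative assumptions.
from PIL import Image
from transformers import BlipForConditionalGeneration, BlipProcessor

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("photo.jpg").convert("RGB")          # hypothetical local image
inputs = processor(images=image, return_tensors="pt")   # vision encoder input

caption_ids = model.generate(**inputs, max_new_tokens=30)  # language decoder writes the caption
print(processor.decode(caption_ids[0], skip_special_tokens=True))
```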


Vision-Language Models

https://huggingface.co/blog/vlms
Contrastive Language-Image Pretraining (CLIP)

• CLIP differs from traditional vision-language models

Radford, Alec, et al. "Learning transferable visual models from natural language supervision." International conference on machine learning. PMLR, 2021.
Contrastive Language-Image Pretraining (CLIP)

• CLIP differs from traditional vision-language models

• It does not generate text descriptions or captions for images

• Focuses on learning a joint representation space where images and text can be compared directly

Radford, Alec, et al. "Learning transferable visual models from natural language supervision." International conference on machine learning. PMLR, 2021.
Contrastive Language-Image Pretraining (CLIP)

• CLIP differs from traditional vision-language models

• It does not generate text descriptions or captions for images

• Focuses on learning a joint representation space where images and text can be compared directly

• Enables various downstream tasks, such as

• Zero-shot image classification: classify images into one of several classes, without any prior

training or knowledge of the classes

• Zero-shot image retrieval, text-based image generation

Radford, Alec, et al. "Learning transferable visual models from natural language supervision." International conference on machine learning. PMLR, 2021.
Contrastive Language-Image Pretraining (CLIP)

1. Jointly trains a text encoder and an image encoder to predict the correct image-text pair

Radford, Alec, et al. "Learning transferable visual models from natural language supervision." International conference on machine learning. PMLR, 2021.
Contrastive Language-Image Pretraining (CLIP)

1. Jointly trains a text encoder and an image encoder to predict the correct image-text pair

Contrastive pre-training (contrastive learning framework):

• A Transformer text encoder produces text embeddings; an image encoder (ResNet or ViT) produces image embeddings

• Maximize the cosine similarities between correct image-text pairs

• Minimize the cosine similarities for dissimilar pairs (non-diagonal elements)
Radford, Alec, et al. "Learning transferable visual models from natural language supervision." International conference on machine learning. PMLR, 2021.
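A minimal sketch of the symmetric contrastive objective described above, written directly from the bullet points (maximise similarity on the diagonal, minimise it off the diagonal); the temperature value is an assumption.

```python
# Sketch of a CLIP-style symmetric contrastive loss over a batch of embeddings.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor, text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    # Normalise so that dot products are cosine similarities
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix; entry (i, j) compares image i with caption j
    logits = image_emb @ text_emb.t() / temperature

    # Matching image-text pairs sit on the diagonal
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions pulls correct pairs together
    # and pushes the mismatched (non-diagonal) pairs apart
    loss_images = F.cross_entropy(logits, targets)
    loss_texts = F.cross_entropy(logits.t(), targets)
    return (loss_images + loss_texts) / 2

# Toy usage with random embeddings for a batch of 8 pairs
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```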
Contrastive Language-Image Pretraining (CLIP)

2. Converts training dataset classes into captions

Radford, Alec, et al. "Learning transferable visual models from natural language supervision." International conference on machine learning. PMLR, 2021.
Contrastive Language-Image Pretraining (CLIP)

3. Estimates the best caption for the given input image for zero-shot prediction

• Calculate similarity between an image vector and multiple caption vectors, selecting the

one with the highest score

Radford, Alec, et al. "Learning transferable visual models from natural language supervision." International conference on machine learning. PMLR, 2021.
Contrastive Language-Image Pretraining (CLIP)

3. Estimates the best caption for the given input image for zero-shot prediction

• Calculate similarity between an image vector and multiple caption vectors, selecting the

one with the highest score

• Trained on 400M image-text pairs

• CLIP can perform image tasks using only text, no extra training needed

• CLIP encodings + decoder models (e.g. GPT-2) => image captioning
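A sketch of the zero-shot prediction step using the openly released CLIP weights on the Hugging Face Hub; the candidate captions and the image path are made up for illustration.

```python
# Zero-shot image classification sketch with CLIP: pick the caption most similar to the image.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")  # hypothetical input image
captions = ["a photo of a cat", "a photo of a dog", "a photo of a bicycle"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them into class probabilities
probs = outputs.logits_per_image.softmax(dim=-1)[0]
for caption, p in zip(captions, probs):
    print(f"{caption}: {p.item():.3f}")
```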


Some Interesting Examples
Modern Handwriting Recognition using ChatGPT

Prompt: Can you recognise this text?

Source: Medium article on Exploring Multimodal Large Language Models: A Step Forward in AI
Old Handwriting Recognition using ChatGPT

Prompt: Can you read this?

Source: Medium article on Exploring Multimodal Large Language Models: A Step Forward in AI
Ancient Handwriting Recognition using Microsoft Copilot

Nota bene is the Latin phrase meaning note well


OCR and Document Question Answering

• Text…

Question: Who is in cc in this letter?

Answer: T.F. Riehl

https://huggingface.co/docs/transformers/main/en/tasks/document_question_answering
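A sketch following the Hugging Face document question answering task guide linked above; the checkpoint name (impira/layoutlm-document-qa), the OCR dependency (pytesseract), and the scanned-letter filename are assumptions about that setup.

```python
# Document question answering sketch over a scanned letter.
# Requires an OCR backend (e.g. pytesseract) installed alongside transformers.
from transformers import pipeline

doc_qa = pipeline("document-question-answering", model="impira/layoutlm-document-qa")

result = doc_qa(image="scanned_letter.png",          # hypothetical scan of the letter
                question="Who is in cc in this letter?")
print(result)  # e.g. a list of answers such as [{'answer': 'T.F. Riehl', ...}]
```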
Understanding Complex Parking Signs – ChatGPT

Anyone?

Prompt: Suppose it is Wednesday and the time is 4PM. Am I allowed to park my car at this spot?

Source: Medium article on Exploring Multimodal Large Language Models: A Step Forward in AI
Understanding Complex Parking Signs – GPT-4V

Prompt: Suppose it is Wednesday and the time is 4PM. Am I allowed to park my car at this spot?

Source: Medium article on Exploring Multimodal Large Language Models: A Step Forward in AI
Visual Question Answering

Marino, Kenneth, et al. "Ok-vqa: A visual question answering benchmark requiring external knowledge." Proceedings of the IEEE/cvf conference on computer vision and pattern recognition. 2019.
Visual Question Answering
LLaVA: Large Language and Vision Assistant - CLIP visual encoder + Vicuna LLM

Liu, Haotian, et al. "Visual instruction tuning." Advances in neural information processing systems 36 (2024).
Visual Question Answering
LLaVA: Large Language and Vision Assistant - CLIP visual encoder + Vicuna LLM

Source: Medium article on Exploring Multimodal Large Language Models: A Step Forward in AI
Whisper by OpenAI

• Speech-to-text model, performs:

• Speech recognition

• Speech translation

• Spoken language identification

• Voice activity detection


Focus: Swedish speech

Image source: LinkedIn
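A minimal transcription sketch with the openly released whisper-small checkpoint; the Swedish audio filename is a placeholder.

```python
# Speech-to-text sketch with Whisper via the automatic speech recognition pipeline.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

result = asr("svenskt_tal.wav")  # hypothetical recording of Swedish speech
print(result["text"])            # transcribed text; Whisper also detects the spoken language
```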


LLMs and Video Analysis

Platform: Azure AI
Challenges and Limitations
Challenges and Limitations

• Out-of-date training data

This example was tested on 10 March, 2025


Challenges and Limitations

• Hallucinations

• Facts are sometimes extrapolated: LLMs may invent facts, articulating the inaccurate information in a convincing way.


Challenges and Limitations

• Hallucinations

• Facts are sometimes extrapolated: LLMs may invent facts, articulating the inaccurate information in a convincing way.

This example was tested in Spring 2024


Challenges and Limitations

• Hallucinations

• Facts are sometimes extrapolated: LLMs may invent facts, articulating the inaccurate information in a convincing way.

Tested: October 2024


Challenges and Limitations

Tested: March 2025


Challenges and Limitations

• Hallucinations
Challenges and Limitations

• Hallucinations

What is the correct answer?


Challenges and Limitations

• Hallucinations

5,162,060

• It is a language model!

• Training data lacks focus on math concepts and problem-solving.


Hallucinations, Case – Law

This example was shared by a lawyer


Hallucinations, Case – Law

This example was shared by a lawyer


Hallucinations, Case – Law

Microsoft Copilot did not hallucinate!


Hallucinations, Case – MiniGPT-4

Source: MiniGPT-4 by HuggingFace


Hallucinations, Case – MiniGPT-4

Source: MiniGPT-4 by HuggingFace


Hallucinations, Case - HTRFlow

HTRFlow model Bevittna

(Bevittna translates to witness)

Source: https://huggingface.co/spaces/Riksarkivet/htr_demo (2023)


Potential Solution: Retrieval Augmented Generation (RAG)

• Helps address both hallucinations and out-of-date training data issues


Retrieval Augmented Generation (RAG)

• Build LLMs with current and reliable information, extending their utility to specific data sources

• Allows LLMs to go beyond their knowledge base, enabling them to access real-time data and provide up-to-date responses

• Example use case: improve Math Q/A (see the sketch below)

Retrieval Augmented Generation (RAG)
Retrieval Augmented Generation (RAG)

This example was tested on 10 March 2025


RAG Example Projects from UU course on LLMs

• Teaching LLM to teach AI


RAG Example Projects from UU course on LLMs

• Teaching LLM to teach AI

• Chat with 1177.se


RAG Example Projects from UU course on LLMs

• Teaching LLM to teach AI

• Chat with 1177.se

• LLM powered teaching assistant for Smarter Education

• Exercise sheet generation

• Based on age, grade and interest of the child


Challenges and Limitations

• Bias and misinformation as ethical concerns

Social Biases in Language Models: http://uu.diva-portal.org/smash/get/diva2:1696604/FULLTEXT01.pdf


Challenges and Limitations

• Bias and misinformation as ethical concerns

• Key pointer:

• LLMs inherit society's stereotypes from the training data fed into them.

• Other cases: voice assistants and FaceID


Voice Assistants and Accent Bias

• Data problem!

• The higher the quantity and diversity of speech samples in a corpus, the more accurate the resulting model

Amazon researchers improved Irish-accented training data by using a voice conversion model
Source: https://www.amazon.science/blog/how-alexa-learned-to-speak-with-an-irish-accent (2023)
Face ID and Limitations

Source: https://news.mit.edu/2018/study-finds-gender-skin-type-bias-artificial-intelligence-systems-0212 (2018)




Face ID and Limitations

• This bias arises from the imbalance in the training data

• Lighting conditions: darker skin tones might reflect less light, potentially affecting recognition accuracy
Face ID and Limitations

• This bias arises from the imbalance in the training data

• Lighting conditions: darker skin tones might reflect less light, potentially affecting recognition accuracy

• Modern face recognition systems:

• Diverse training data

• Algorithmic improvements

• Aim to achieve robust performance, regardless of skin tone or lighting conditions.

• Biases can still persist


Future Insights!
Scaling-up – bigger is better?

• Quantitatively: different capabilities

• Qualitatively: different societal impact

Future Insights!
Scaling-up – bigger is better? In-Context Learning

• Zero-shot

• One-shot

• Few-shot

Source: https://www.cs.princeton.edu/courses/archive/fall22/cos597G/lectures/lec04.pdf
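A small sketch of what zero-shot versus few-shot prompting looks like in practice: only the prompt changes, the model weights stay fixed. The task and wording are made up for illustration.

```python
# In-context learning sketch: zero-shot vs few-shot prompts for the same frozen model.
zero_shot_prompt = (
    "Classify the sentiment of this review as positive or negative.\n"
    "Review: 'The workshop was fantastic.'\nSentiment:"
)

few_shot_prompt = (
    "Classify the sentiment of each review as positive or negative.\n"
    "Review: 'Terrible battery life.' Sentiment: negative\n"
    "Review: 'Works exactly as described.' Sentiment: positive\n"
    "Review: 'The workshop was fantastic.' Sentiment:"
)

# Both strings would be sent to the same pretrained LLM; the few-shot version simply
# includes worked examples in the context window instead of updating any weights.
print(zero_shot_prompt)
print(few_shot_prompt)
```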
Future Insights!

AI-powered precision in healthcare

Analyse medical images and, using LLMs, correlate medical findings with patient history, delivering comprehensive diagnostics and potential treatment options.

Also… Applications in Physics!


Advanced Multimodal LLMs

• Example: NExT-GPT (Any-to-Any Multimodal LLM)

• Text + Video → Text + Image


ImageBind by Meta
6 modalities:
Images/videos,
Audio, Text,
Depth, Thermal,
Inertial measurement units (IMUs)
+
Diffusion Models (generation)

https://next-gpt.github.io
You can select “off” for this option in your ChatGPT settings.
We conducted an exam for ChatGPT and Microsoft Copilot

• Total: 10 questions (Link)

• Following are our findings:

• Both excel in creative thinking, image generation and summarization.

• ChatGPT struggles with legal reasoning but excels in math.

• Neither handles old handwriting well.

• Overall, score for Copilot was higher.

• Curious to know how it works in Physics…


Thank you!
Email : [email protected]
Webpage: https://www.ektavats.se
