HCIA-AI V4.0 Training Material
Foreword
1. AI Overview
2. AI Technologies
4. AI Applications
"Artificial intelligence is the science of making machines do things that would require intelligence
if done by men."
-- Marvin Minsky
• "The branch of computer science concerned with making computers behave like humans."
-- A widespread early definition of AI proposed by John McCarthy at the Dartmouth
Conference in 1956. However, it seems that this definition ignores the possibility of strong
AI. According to another definition, AI is the intelligence (weak AI) demonstrated by
artificial machines.
• Alan Turing discussed the question "Can machines think?" in his 1950 paper Computing Machinery
and Intelligence.
What Is Intelligence?
⚫ Professor Howard Gardner proposed the theory of multi-intelligence, and listed eight
capabilities that reflect multi-intelligence:
Verbal/Linguistic
Logical/Mathematical
Visual/Spatial
Bodily/Kinesthetic
Musical/Rhythmic
Inter-personal/Social
Intra-personal/Introspective
Naturalist
6 Copyright © Huawei Technologies Co., Ltd. All rights reserved.
• 1. Linguistic intelligence
It refers to the ability to express thoughts and understand others orally and in writing, to master
speech, semantics, and grammar, to think and express in words, and to appreciate the deeper
meaning of language. People with this intelligence are well suited to jobs such as political
activist, presenter, lawyer, speaker, editor, writer, journalist, and teacher.
• 2. Logical-mathematical intelligence
It refers to the ability to calculate, measure, infer, conclude, classify, and carry out complex
mathematical operations. This type of intelligence possesses sensitivity to logical means and
relationships, statements and propositions, functions, and other related abstract concepts.
People with this intelligence are well suited to jobs such as scientist, accountant, statistician,
engineer, and computer software R&D engineer.
• 3. Spatial intelligence
It refers to the ability to accurately perceive the visual space and surroundings and present the
perception in the form of graphics. This type of intelligence possesses sensitivity to colors,
lines, shapes, forms, and spatial relationships. People with this intelligence are well suited to
jobs such as interior designer, architect, photographer, painter, and pilot.
• 4. Bodily-kinesthetic intelligence
It refers to the ability to express thoughts and feelings with the body and to make or operate
objects with hands. This type of intelligence possesses special physical skills such as balance,
coordination, agility, strength, flexibility and speed, and abilities triggered by tactile sensation.
People with this intelligence are well suited to jobs such as athlete, actor, dancer, surgeon,
jeweller, and mechanic.
What Is AI?
⚫ Artificial intelligence can be interpreted as "artificial" plus "intelligence". "Artificial" means being designed, created,
and made by humans. "Intelligence" means thinking and behaving like humans.
⚫ Artificial intelligence is a new technical science that deals with the research and development of theories, methods,
techniques, and application systems for simulating and extending human intelligence. In 1956, the term "Artificial
intelligence" was first coined by John McCarthy, who defined it as "the science and engineering of making intelligent
machines." The purpose of AI is to make machines intelligent and give them human thoughts.
⚫ AI has become an interdisciplinary subject that overlaps with various fields.
• Machine learning can be understood from multiple aspects. Tom Mitchell, known as the
"godfather of global machine learning", defined machine learning as: For a task T and
performance metric P, if the performance of a computer program measured by P on T self-improves
with experience E, the computer program is learning from experience E. This definition is simple
and abstract. However, as we deepen our understanding of machine learning, we will find that its
connotation and extension change over time. Because machine learning involves many fields and
applications and develops rapidly, it is not easy to define it simply and clearly.
Relationships Among AI, Machine Learning, and Deep Learning (1)
(Figure: nested circles - AI is the area of research; machine learning is a way to implement it;
deep learning is the current mainstream. Many schools of thought have emerged during the
development of AI.)
• The term "artificial intelligence" was previously used to describe machines that imitate and
demonstrate "human" cognitive skill related to human thinking.
• A machine learning algorithm builds a model based on sample data (referred to as training
data), so that predictions or decisions can be made without explicit programming.
• Deep learning is a type of machine learning, and machine learning is one way, though not the
only one, to implement artificial intelligence. The concept of deep learning originates from the study
of artificial neural networks. A multi-layer perceptron containing multiple hidden layers is a
deep learning structure. Deep learning uses higher level features derived from the lower
level features to form a hierarchical representation. The motivation of deep learning
research is to establish a neural network that simulates the human brain for analysis and
learning. The neural network simulates the mechanism of the human brain to interpret
data, such as images, sounds, and texts.
Major Schools of AI - Symbolism
⚫ Symbolism is also called Logicism, Psychologism, or Computerism.
⚫ Symbolism believes that AI is built upon mathematical logic. Followers of this school of thought
believe that symbols are the cognitive primitives of humans, and the human cognition process
is based on inferring and calculating a variety of symbols. In their opinion, both humans and
computers use physical symbol systems, so computers can be used to simulate the intelligent
behavior of humans.
Apple in the eyes of symbolism
• After 1956, symbolism developed heuristic algorithms, expert systems, and knowledge
engineering theories and technologies, and made significant progress in the 1980s.
• The successful launch and application of the expert system is of great significance for
leading AI to engineering application and linking AI theory with practice.
• Units in a network can represent neurons, and connections can represent synapses, just as
in the human brain.
• Behaviorism concerns more about application practices and how to learn from the
environment continuously to make corrections.
• School masterpiece: Brooks' hexapod walking robot, a control system that simulates insect
behavior based on the perception-action pattern.
• This school is somewhat similar to an adaptive control system, which collects data from the
environment using sensors and acts on the system.
Three Elements for AI Development
⚫ Independent of the school of thought, AI research needs three key elements.
▫ Computing power: the engine of AI and the driving force behind AI systems. Examples:
data centers, distributed computing, cloud computing, edge computing, and high-performance
computing (HPC).
▫ Data: the powerhouse of AI and the fuel for AI systems. Examples: data mining, data
analysis, data warehouses, data visualization, data security, and privacy protection.
▫ Algorithms: the brain of AI and the command center of AI systems. Examples: machine
learning (ML), deep learning (DL), natural language processing (NLP), computer vision (CV),
and recommendation systems.
⚫ Weak AI
The weak AI theory holds that it is impossible to create an intelligent machine that can truly
reason and solve problems. Such a machine only looks intelligent but does not have real human
intelligence or self-awareness.
• Strong AI refers to AI that can compete with humans in all aspects. Therefore, strong AI is
not limited to a specified field, but makes robots comparable to humans in all aspects.
Strong AI can think, plan, solve problems, think abstractly, understand complex concepts,
learn quickly, and learn from experience. Currently, it is believed that if the human brain
can be simulated and all neurons and synapses in the human brain can be imitated on the
same scale, strong AI will naturally occur.
• Now we are in the weak AI phase. Weak AI alleviates human intellectual labor, similar to
advanced bionics. AI outperforms humans only in some aspects.
Contents
1. AI Overview
2. AI Technologies
◼ AI Technologies
NLP
CV
Foundation Model
Multimodal
Other
1. AI Overview
2. AI Technologies
AI Technologies
◼ NLP
CV
Foundation Model
Many NLP tasks, such as Named Entity Recognition, can be viewed as sequence-to-sequence mapping.
1. AI Overview
2. AI Technologies
AI Technologies
NLP
◼ CV
Foundation Model
(Figure: object detection example - multiple cars in a scene, each located with a labeled
bounding box.)
• Due to different appearances, shapes, and arrangements of objects, as well as lighting and
shading during imaging, object detection has always been the largest challenge in the field
of computer vision.
• Despite using similar techniques, object recognition differs slightly from object detection.
Given a specific object, object recognition aims to find an instance of that object in an image.
The goal is not to classify, but to judge whether the object appears in the image and, if so, to
locate it. For example, a real-time feed from a security camera can be monitored to recognize a
particular person's face.
CV Task - Image Segmentation
⚫ Image segmentation is the process of partitioning an image into multiple segments based on
the problem to be solved.
⚫ There are many algorithms and application methods for image segmentation. Common ones
include connected component segmentation, motion segmentation, and object segmentation.
▫ Semantic segmentation refers to pixel-level image recognition, that is, marking the
object category to which each pixel in the image belongs.
▫ Input: image.
▫ Output: segmented images with the same resolution as the input image and label of
each pixel category.
CV Task - Object Tracking
⚫ Object tracking is a core research area in computer vision and has a wide range of applications,
such as intelligent transportation, security monitoring, human-machine interaction, and
autonomous driving.
⚫ Tracking algorithms obtain the trajectories of target objects over time to analyze their
movement behaviors.
Image inpainting
1. AI Overview
2. AI Technologies
AI Technologies
NLP
CV
◼ Foundation Model
(Figure: Transformer architecture - positional encoding, multi-head attention, masked multi-head
attention, and feed-forward layers.)
In 2014, researchers in machine translation proposed the Seq2Seq model. This was a new way to
implement end-to-end machine translation based on the RNN architecture.
• The core idea of the Seq2Seq model is to use an encoder network to encode an input
sequence into a vector in a fixed dimension or a series of hidden states, and then the
decoder network generates a target sequence word by word.
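The encoder-decoder idea above can be sketched as a toy forward pass in pure Python (random weights, no training; all sizes and names here are illustrative, not from the training material):

```python
import math, random

# Toy Seq2Seq sketch: an RNN encoder compresses an input token sequence into a
# fixed-size context vector; an RNN decoder then emits output tokens one by one.
random.seed(0)
V, H = 10, 8                                   # vocabulary size, hidden size
rand_mat = lambda r, c: [[random.gauss(0, 0.1) for _ in range(c)] for _ in range(r)]
E, Wxh, Whh, Why = rand_mat(V, H), rand_mat(H, H), rand_mat(H, H), rand_mat(H, V)

def step(h, tok, W_in, W_rec):
    """One RNN step: h' = tanh(E[tok] @ W_in + h @ W_rec)."""
    return [math.tanh(sum(E[tok][i] * W_in[i][j] for i in range(H)) +
                      sum(h[i] * W_rec[i][j] for i in range(H)))
            for j in range(H)]

def encode(tokens):
    """Run the encoder; the final hidden state is the fixed-size context vector."""
    h = [0.0] * H
    for t in tokens:
        h = step(h, t, Wxh, Whh)
    return h

def decode(context, max_len=5):
    """Greedy decoding: feed the previous output token back in at each step."""
    h, tok, out = context, 0, []               # token 0 acts as <start>
    for _ in range(max_len):
        h = step(h, tok, Wxh, Whh)
        logits = [sum(h[i] * Why[i][j] for i in range(H)) for j in range(V)]
        tok = logits.index(max(logits))        # pick the most probable next token
        out.append(tok)
    return out

context = encode([3, 1, 4])                    # encode a 3-token "source sentence"
target = decode(context)
print(len(context), len(target))               # 8 5
```

The key design point is the bottleneck: the entire input must pass through one fixed-size vector, which is exactly the limitation that later attention mechanisms address.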
Where Does a Foundation Model Come from? (2)
Algorithm: from millions of parameters (AlexNet, VGG, ResNet, ELMO) to billions of parameters
(Transformer-based models).
Small models are light and highly efficient. They are suitable when resources are limited and
only one task needs to be performed, for example, a watch that is only used to check the time.
Large models feature higher processing capabilities and accuracy. They are suitable when high
complexity and accuracy are required; for example, a smartwatch can be used to check the time,
heartbeat, sleep, and other parameters.
• Foundation models require massive data for learning. It is costly to train models for each
industry and scenario from scratch. Therefore, the learning of foundation models is based
on a basic model pre-trained on massive data and then fine-tuning is performed based on
industry data.
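The pretrain-then-fine-tune idea above can be sketched as a linear probe: a frozen "pretrained" feature extractor plus a small head fitted on new industry data. All data, names, and numbers below are made up for illustration; real fine-tuning updates neural network layers, not a hand-solved 2x2 system.

```python
# Toy sketch of pre-training + fine-tuning: the "pretrained" feature extractor
# is frozen, and only a small linear head is fitted on the new "industry" data
# via ordinary least squares (2-feature normal equations solved by hand).
def features(x):
    return [1.0, x]                      # frozen pretrained representation

xs = [0.1 * i for i in range(10)]        # industry fine-tuning inputs
ys = [1.0 + 2.0 * x for x in xs]         # industry targets: y = 1 + 2x

# Accumulate F^T F and F^T y for the 2-feature case.
s00 = s01 = s11 = b0 = b1 = 0.0
for x, y in zip(xs, ys):
    f0, f1 = features(x)
    s00 += f0 * f0; s01 += f0 * f1; s11 += f1 * f1
    b0 += f0 * y;   b1 += f1 * y

det = s00 * s11 - s01 * s01              # invert the 2x2 system by hand
w0 = (s11 * b0 - s01 * b1) / det
w1 = (s00 * b1 - s01 * b0) / det
print(round(w0, 2), round(w1, 2))        # head recovers 1.0 2.0
```

Only the head weights are learned; the expensive representation is reused, which is the economic argument the paragraph above makes.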
Foundation Models That Can Speak and Paint
⚫ Unlike conventional single-modal models (which process only one type of information such as text,
images, or audio), a foundation model can integrate multi-modal data (text, image, video, and audio) for
comprehensive understanding and inference.
⚫ Cross-modal understanding: The relationship between different types of data can be understood. For
example, information is extracted from an image and described in text, or an image or video is generated
based on a text description.
Foundation Model
• Emergence also exists in daily life, for example, snowflake formation, traffic jams, animal
migration, and vortex formation. Snowflakes are used as an example for explanation. A
snowflake is composed of small water molecules. However, if a large number of water
molecules interact with each other under the premise of external temperature change, a
regular, symmetric, and beautiful snowflake will be formed at the macro level.
• When the model size is not large enough, certain tasks cannot be handled properly. However,
once the model size exceeds a threshold, those tasks can suddenly be performed well.
Stepwise Thinking - CoT
⚫ Chain of thought (CoT) is an improved prompting strategy that helps LLMs perform better in complex inference
tasks, such as arithmetic, common-sense, and symbolic inference. Unlike ICL, where prompts are constructed using
simple input-output pairs, CoT adds the intermediate inference steps that lead to the final output into the prompts.
Model output (standard prompting): A: The answer is 27. ✗
Model output (CoT prompting): A: The cafeteria had 23 apples originally. They used 20 to make
lunch. So, they had 23 - 20 = 3. They bought 6 more apples, so they have 3 + 6 = 9. The answer
is 9. ✓
• Compared with the traditional in-context learning (which uses x1, y1, x2, y2, ... xtest as the
input to enable the foundation model to provide the output ytest), the chain of thought
introduces intermediate inference prompts.
• https://bbs.huaweicloud.com/blogs/406077.
Contents
1. AI Overview
2. AI Technologies
4. AI Applications
07/2023: Launch of DeepSeek. Liang Wenfeng, co-founder of High-Flyer, launched DeepSeek.
11/2023: DeepSeek Coder. A free and open-source code model was released, designed for coding tasks.
11/2023: DeepSeek-LLM Chat. DeepSeek-LLM Chat was developed with 67 billion parameters.
05/2024: DeepSeek-V2. DeepSeek-V2 was launched at a price point similar to mainstream LLMs,
securing seventh spot in a ranking by the University of Waterloo.
11/2024: DeepSeek-R1-Lite-Preview. DeepSeek claimed that its reasoning model exceeds OpenAI o1
in logical inference, mathematical reasoning, and real-time problem solving.
12/2024: DeepSeek-V3. Developed with 671 billion parameters, V3 was trained in just 55 days and
at one-tenth the cost of similar models, outperforming Llama 3.1 and Qwen 2.5 and rivaling GPT-4o.
01/2025: DeepSeek-R1. DeepSeek-R1 is positioned to provide performance on par with OpenAI o1 and
supports model distillation.
• DeepSeek provides performance that rivals GPT-4o, but at a much lower training cost. While
GPT-4 reportedly ran up training costs reaching US$78–100 million, DeepSeek-V3 was trained
for just US$5.576 million (2.788 million H800 GPU hours), or just 5–10% of the cost of similar
models.
• DeepSeek uses the new Mixture of Experts architecture, Multi-Head Latent Attention,
auxiliary loss-free load balancing, and Multi-Token Prediction.
• Challenging big tech monopoly: For a long time, the AI field has been dominated by a
few tech giants that boast countless resources and other advantages. With DeepSeek,
there is legitimate competition to the existing monopoly. DeepSeek boasts strong
capabilities and a unique open-source strategy that have combined into various
breakthroughs and widespread uptake in major markets. Many users and developers
have moved to DeepSeek for both personal and work issues, disrupting the previous
balance and prompting rival companies to reassess their market strategies.
• Intense industry competition: DeepSeek's success has inspired more companies and
research institutions to invest in AI. To gain a competitive edge, major companies are
investing more in R&D by introducing more cutting-edge technologies, which have not
only improved technical performance but also reduced costs and expanded application
scenarios. In the LLM market, this has sparked a price war, forcing other companies to
lower prices to remain competitive. This greater competition has catalyzed faster
technological iterations and ensured a baseline for AI development costs.
• Lower development barrier: Open-source models have reduced the barrier for
enterprises and developers to develop AI. Instead of starting from scratch, developers
instead have free tools to quickly build complex AI models or applications. Now, SMEs or
other small-scale teams can develop and train applications quicker and more
economically.
• More importance attached to inference: Since development costs are set to decrease
with each iteration, the focus will shift to balancing cost and performance in the inference
phase. This will likely lead to changes in computing power structures. The importance of
inference computing power will increase significantly, driven by user needs.
• Technology sharing: Open-source models can be shared and used by global developers,
encouraging wider technology sharing and collaboration. By open-sourcing code and
models, ecosystem users can work together to improve AI technology, while companies
can actively participate in open-source projects to share their achievements and hear
from others' experience.
• Secure and economical R&D: Open-source projects can lower R&D costs and risks. SMEs
and startups can leverage open-source AI to create their products or services without
substantial investment or uncertainty from starting from scratch. Further feedback from
the open-source community can help refine and enhance products for added market
competitiveness.
1. AI Overview
2. AI Technologies
4. AI Applications
◼ AI Application History
AI Application Cases
Other
Traditional DL model applications:
▫ Specific optimization tasks; single task types (e.g., object detection, machine translation,
image generation)
▫ Generalization: weak adaptation to new samples; single-type data inference
▫ Model capability: poor generation capabilities; mostly used for discrimination tasks
LLM applications (Transformer-based):
▫ A wide coverage of domains; diverse tasks
▫ Generalization: strong capabilities in cross-domain task processing, knowledge transfer, and
adaptation to new samples
▫ Model capability: strong creativity, zero-shot & few-shot learning, and logical reasoning
1. AI Overview
2. AI Technologies
4. AI Applications
AI Application History
◼ AI Application Cases
Scenario 2: Underground intelligent monitoring (belt tear monitoring, belt deviation monitoring)
Challenges
Coal conveyor belt: It's unsafe and time-consuming to inspect a coal conveyor belt longer than
20 km. Multiple belt segments need to collaborate to prevent coal accumulation.
Behavior protocol violations: not wearing safety helmets, smoking, passing under equipment,
walking in laneways, and sitting on belts.
Operational violations: improper operations during gas drainage and water inspection and
drainage.
• https://e.huawei.com/cn/material/wireless/4d008289c5424b31a055c71eaaed5790
AI Helps Protect Nature - Preserving Chile's Biodiversity
⚫ The Nature Guardian project uses Huawei Cloud and AI to research and protect Darwin's foxes in
the Nahuelbuta Mountains. The Nature Guardian is an acoustic monitoring system developed by
Rainforest Connection (RFCx) and has been effective in several projects.
⚫ It consists of solar devices equipped with microphones and antennas. These devices collect
sound data from the surrounding environment and transmit it to the cloud through wireless
networks for AI data analysis. Each device can cover three square kilometers around the clock.
⚫ The Nature Guardian can capture animal sounds as well as illegal noise made by poachers'
gunshots or trucks and electric saws of illegal loggers.
⚫ The trained AI model can identify the sounds of different animals, enabling experts to study the
distribution and behavior of species and helping with environmental protection through
adaptive management.
⚫ If a threat is identified, the system sends a real-time alarm to the ranger's mobile application for
a fast response.
Darwin's Fox
Source: https://www.huawei.com/cn/tech4all/stories/nature-guardians-for-biodiversity-in-chile
• https://www.huawei.com/cn/tech4all/stories/nature-guardians-for-biodiversity-in-chile
AI Safeguards Nature - Protecting Wildlife in Greece with a Shield of Sound
• https://www.huawei.com/cn/tech4all/stories/wildlife-greece
Scaling Law in AI for Science
⚫ Multi-disciplinary and multi-scale data are used to train four fundamental science models:
life, materials, fluids, and electromagnetism.
(Figure: AI4Science universal models - Universal Material Model: atomic orbitals, crystal
structures, small molecules; Universal Life Model: proteins, genes, cells, organs, brain;
Universal Fluid Model: turbine simulation, weather forecasting; Universal Electromagnetic Model:
electromagnetism.)
(Figure: drug discovery pipeline - a drug molecule model and database produce 100 million
structurally novel compounds; a compound property predictor and a compound optimizer screen the
compound library into a hit compound candidate library, then a lead compound candidate library,
and finally a screening result, making drug target prediction more accurate and shortening the
lead compound discovery cycle from years to 1 month.)
Government applications: event analysis, report generation, intelligent online services,
government service digital humans, government hotlines, document writing, government service
assistants, policy recommendation, and more.
1. AI Overview
2. AI Technologies
4. AI Applications
▫ Future of AI
• Controllability.
Is Seeing Still Believing?
⚫ As AI technologies develop, we begin to question the credibility of images, audio, and video.
Now, technologies such as multi-modal large models and generative adversarial network (GAN)
can be used to produce false images and videos, making it difficult to distinguish between what
is true and what is false.
For example, Lyrebird is a tool that can automatically imitate human voices from several minutes of
recording samples.
Deepfake can generate videos with fake faces.
• Confidential computing: Data transmission and computing are confidential, and privacy
protection is costly.
1. AI Overview
2. AI Technologies
4. AI Applications
▫ Future of AI
• Digital government: smart city, smart government, smart emergency response, sunny
kitchen, smart water conservancy, and the like. For example, sunny kitchen uses AI machine
vision to identify food types on meal plates and calculate prices and calories. Smart water
conservancy and management can solve problems in the following aspects: 1. black water
management and sewage monitoring (AI video algorithm for sewage identification, satellite
remote sensing, and water pump gate scheduling); 2. reservoir (enhanced video, intelligent
power supply, and integrated pole site for water conditions) 3. rivers and lakes (spectral
water quality analysis, and 5G access to drones/ships).
• Smart transportation: airports (IOC queuing management – sensing an extra long passenger
queue: AI + video analysis; stand scheduling: operation optimization AI algorithm + big data
+ IoT + GIS + video + simulation...), and highways (holographic intersection, unmanned
driving...).
Value - Growing Market
▫ Market scale: The global AI market was expected to reach USD638.2 billion in 2024 (USD538.1
billion in 2023).
▫ Foundation model solutions: In the first half of 2024, the market scale of AI foundation model
solutions in China was CNY1.38 billion. The estimated CAGR is 56.2% from 2024 to 2028, reaching
a market scale of CNY21.1 billion by 2028.
Trends: wider industry applications; multimodal and cross-modal learning; enhanced
personalization and adaptability; improved self-learning capabilities; stricter ethical and
security control; energy efficiency and sustainability; open-source and democratized AI
technologies.
• Wider industry applications: As technologies become more mature and adaptable, large
models are poised to find broader applications in traditional industries, such as
manufacturing, energy, and logistics, to support more efficient problem-solving and
decision-making.
• Multimodal and cross-modal learning: Future large models will not be limited to a single
domain such as text or images. Instead, they will be able to process and integrate
multiple types of data (such as text, images, videos, and sounds) to provide a more
comprehensive AI application experience.
• Enhanced personalization and adaptability: Large models will better adapt to specific
needs of individual users and provide customized services. For example, in the education
and health fields, models can provide personalized suggestions based on individuals'
learning progress or health status.
• Improved self-learning capabilities: Future large models will have stronger self-learning
capabilities, allowing them to learn new patterns and knowledge through continuous
interaction and feedback. This reduces the reliance on large-scale labeled data.
• Stricter ethical and security control: As large models become more widespread, ethical,
privacy, and security issues will receive more attention. Future development will focus
on ensuring the transparency, explainability, and fairness of models while strengthening
data protection.
• Energy efficiency and sustainability: Given the enormous energy consumption required to
train large models, future research will focus on improving the energy efficiency of
models and reducing the carbon footprint to support sustainable development.
Summary
⚫ This chapter covered the definition and history of AI, its applications and sectors, and
controversial topics and future trends in the field.
B. Semantic segmentation
C. Intelligent driving
D. Video analysis
• Answer:
▫ BCD
Recommendations
⚫ Huawei Cloud
https://www.huaweicloud.com/intl/en-us/
Understanding machine learning: data (experience 𝐸) is fed to a learning algorithm that performs
a task 𝑇, whose quality is assessed by a performance measure 𝑃.
• Deep learning is a sub-field of machine learning. To understand deep learning, you need to
first understand the fundamentals of machine learning.
▫ Task 𝑇 represents how the machine learning system should process a sample.
(Figure: analogy - historical data is summarized into experience; training data is used to train
a model with machine learning; new data is fed to the model for prediction.)
Machine learning suits scenarios where:
▫ Rules are complex or difficult to describe, for example, speech recognition.
▫ Task rules change over time, for example, part-of-speech tagging, in which new words or word
meanings can be generated at any time.
▫ Data distribution changes over time and programs need to adapt to new data constantly, for
example, sales trend forecasting.
(Figure: rule complexity vs. problem scale - simple questions: low complexity, small scale;
rule-based algorithms: low complexity, large scale; manual rules: high complexity, small scale;
machine learning algorithms: high complexity, large scale.)
Ideal: target function 𝑓: 𝑋 → 𝑌 (unknown)
Actual: training data 𝐷: {(𝑥1 , 𝑦1 ), ⋯ , (𝑥𝑛 , 𝑦𝑛 )} → learning algorithm → hypothesis
function 𝑔 ≈ 𝑓
⚫ The target function 𝑓 is unknown, and the learning algorithm cannot obtain a perfect
function 𝑓.
⚫ The hypothesis function 𝑔 approximates 𝑓, but may be different from 𝑓.
Supervised learning: each sample has features (feature 1 ... feature n) and a target (label) for
the algorithm to learn.

Weather | Temperature | Wind Speed | Suitable for Exercise
Sunny   | High        | High       | Yes
Rainy   | Low         | Medium     | No
Sunny   | Low         | Low        | Yes
Supervised Learning - Regression
⚫ Regression reflects the features of sample attributes in a dataset. A function is used to express
the sample mapping relationship and further discover the dependency between attributes.
Examples include:
◼ How much money can I make from stocks next week?
◼ What will the temperature be on Tuesday?
Monday: 38°. Tuesday: ?
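The temperature example can be sketched as a least-squares linear fit (all numbers below are illustrative, not real data):

```python
# Minimal linear regression: fit y = a*x + b to past daily temperatures by the
# closed-form least-squares slope and intercept, then extrapolate one day.
days  = [0, 1, 2, 3, 4]                 # Thursday .. Monday (x)
temps = [30.0, 32.0, 34.0, 36.0, 38.0]  # observed temperatures (y)

n = len(days)
mx = sum(days) / n
my = sum(temps) / n
a = sum((x - mx) * (y - my) for x, y in zip(days, temps)) / \
    sum((x - mx) ** 2 for x in days)
b = my - a * mx
print(round(a * 5 + b, 1))  # predicted Tuesday temperature -> 40.0
```

The fitted function expresses the mapping from day to temperature, which is exactly the "dependency between attributes" the slide describes.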
Semi-supervised learning: samples have features (feature 1 ... feature n), but only some have
known targets (the rest are unknown); the algorithm learns a model from both labeled and
unlabeled data.
(Figure: reinforcement learning loop - the agent acts on the environment, which returns reward
𝑟𝑡+1 and next state 𝑠𝑡+1.)
• Reinforcement learning uses a series of actions to maximize the reward function to learn a
model.
• Both good and bad behavior can help reinforcement learning models learn.
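The "maximize the reward through a series of actions" idea can be sketched as an epsilon-greedy agent on a two-armed bandit. The reward probabilities and hyperparameters below are made up for illustration:

```python
import random

# Epsilon-greedy bandit: explore occasionally, otherwise exploit the arm with
# the highest estimated value; both rewarded and unrewarded pulls update the
# estimates, so good and bad outcomes alike inform learning.
random.seed(0)
true_p = [0.3, 0.8]                 # arm 1 pays off more often (hidden from agent)
counts = [0, 0]
values = [0.0, 0.0]                 # running estimate of each arm's reward

for step in range(2000):
    if random.random() < 0.1:       # explore with probability 0.1
        a = random.randrange(2)
    else:                           # exploit the current best estimate
        a = values.index(max(values))
    r = 1.0 if random.random() < true_p[a] else 0.0
    counts[a] += 1
    values[a] += (r - values[a]) / counts[a]   # incremental mean update

print(values.index(max(values)))    # arm 1 should end up with the higher estimate
```

After enough interaction, the estimated values approach the true payoff rates and the agent concentrates its actions on the higher-reward arm.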
Overall process: data preparation → data cleansing → feature extraction and selection → model
training → model evaluation → model deployment and integration, with feedback and iteration
throughout.
Data preprocessing:
▫ Data cleansing: fill in missing values, and detect and eliminate noise and other abnormal
points (e.g., missing values, invalid values, misfielded values).
▫ Data standardization: standardize data to reduce noise and improve model accuracy.
▫ Data dimension reduction: simplify data attributes to avoid the curse of dimensionality.
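Two of these steps can be sketched in plain Python (a toy column of numbers, made up for illustration):

```python
# Preprocessing sketch: fill a missing value with the column mean, then
# z-score standardize the column ((v - mean) / standard deviation).
col = [4.0, 8.0, None, 6.0, 2.0]

known = [v for v in col if v is not None]
mean = sum(known) / len(known)
filled = [mean if v is None else v for v in col]          # missing-value fill

std = (sum((v - mean) ** 2 for v in filled) / len(filled)) ** 0.5
standardized = [round((v - mean) / std, 2) for v in filled]  # z-score
print(standardized)  # [-0.5, 1.5, 0.0, 0.5, -1.5]
```

After standardization the column has zero mean and unit variance, which puts differently scaled features on a comparable footing for training.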
Benefits of feature selection:
▫ Simplifies models for easy interpretation
▫ Shortens training time
▫ Avoids the curse of dimensionality
▫ Improves model generalization and avoids overfitting
Embedded method
• Common method: LASSO regression
• A penalty term is added to the model and used to eliminate insignificant features.
Overall Process of Building an AI Model
Model building process
• After data cleansing and feature extraction, we need to start building the model. The
general procedure for building a model is shown above (supervised learning).
Supervised Learning Example - Learning Phase
⚫ Use a classification model to determine whether a person is a basketball player based on
specific features.
Features (attributes) Target (label)
• Model example
Supervised Learning Example - Prediction Phase
Name     | City    | Age | Label
Marine   | Miami   | 45  | ?
Julien   | Miami   | 52  | ?
Fred     | Orlando | 20  | ?
Michelle | Boston  | 34  | ?
Nicolas  | Phoenix | 90  | ?
New (unknown) data: the label cannot be read off directly; the model must predict whether each
person is a basketball player.
Apply the model:
IF city = Miami → Probability = +0.7
IF city = Orlando → Probability = +0.2
IF age > 42 → Probability = +0.05*age + 0.06
IF age <= 42 → Probability = +0.01*age + 0.02
• Model example
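The rule model from the slide can be implemented directly. Note that the rules are additive and the resulting "Probability" can exceed 1, so it behaves as a score rather than a calibrated probability:

```python
# The slide's rule model: city and age each contribute to an additive score.
def score(city, age):
    p = 0.0
    if city == "Miami":
        p += 0.7
    elif city == "Orlando":
        p += 0.2
    if age > 42:
        p += 0.05 * age + 0.06
    else:
        p += 0.01 * age + 0.02
    return p

rows = [("Marine", "Miami", 45), ("Julien", "Miami", 52), ("Fred", "Orlando", 20),
        ("Michelle", "Boston", 34), ("Nicolas", "Phoenix", 90)]
for name, city, age in rows:
    print(name, round(score(city, age), 2))
```

Running this scores every row of the prediction table, e.g. Marine gets 0.7 + 0.05*45 + 0.06 = 3.01.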
• Generalization
The accuracy of predictions based on actual data
• Explainability
Predicted results are easy to explain
• Prediction speed
The time needed to make a prediction
• Once the form of a problem's hypothesis is given, all possible functions constitute a space,
which is hypothesis space. The problem of machine learning is searching for a suitable
fitting function in a hypothesis space.
• Underfitting: It occurs if the mathematical model is too simple or the training time is too
short. To solve the underfitting problem, use a more complex model or extend the training
time.
Model Effectiveness (2)
⚫ Model capacity, also known as model complexity, is the capability of the model to fit various functions.
◼ With sufficient capacity to handle task complexity and training data volumes, the algorithm results are optimal.
◼ Models with an insufficient capacity cannot handle complex tasks because underfitting may occur.
◼ Models with a large capacity can handle complex tasks, but overfitting may occur when the capacity is greater
than the amount required by a task.
(Figure: underfitting - features not learned; good fitting; overfitting - noise learned.)
▫ Theoretically, if there is infinite amount of data and a perfect model, the error can be
eliminated.
(Figure: as model complexity increases, training error decreases steadily while test error first
decreases and then increases.)
𝑅² = 1 − RSS/TSS = 1 − ∑ᵢ₌₁ᵐ (ŷᵢ − yᵢ)² / ∑ᵢ₌₁ᵐ (ȳ − yᵢ)²
where ŷᵢ is the predicted value, yᵢ the observed value, and ȳ the mean of the observed values.
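The R² formula can be checked with toy numbers (illustrative, not real data):

```python
# R-squared from its definition: 1 - RSS/TSS, where RSS is the residual sum of
# squares and TSS is the total sum of squares around the mean.
y     = [3.0, 5.0, 7.0, 9.0]        # observed values
y_hat = [2.8, 5.1, 7.2, 8.9]        # model predictions

mean_y = sum(y) / len(y)
rss = sum((yh - yi) ** 2 for yh, yi in zip(y_hat, y))
tss = sum((mean_y - yi) ** 2 for yi in y)
r2 = 1 - rss / tss
print(round(r2, 3))
```

An R² close to 1 means the model explains nearly all the variance around the mean; here RSS = 0.1 and TSS = 20, giving about 0.995.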
⚫ Gradient descent updates the weight along the negative gradient of the loss:
𝑤ₖ₊₁ = 𝑤ₖ − 𝜂∇𝑓𝑤ₖ(𝑥ⁱ)
𝜂 is the learning rate. 𝑖 indicates the 𝑖-th data sample.
• Typically, gradient descent decreases the loss steadily, but within a very small range around
the target point the gradient may fluctuate.
Machine Learning Training Methods - Gradient Descent (2)
⚫ Batch gradient descent (BGD) uses the sum of gradients of all 𝑚 samples of the dataset at the current
point to update the weight parameter.
w_{k+1} = w_k − (η/m)·Σᵢ₌₁ᵐ ∇f_{w_k}(xⁱ)
⚫ Stochastic gradient descent (SGD) uses the gradient of one randomly chosen sample of the dataset at the
current point to update the weight parameter.
w_{k+1} = w_k − η·∇f_{w_k}(xⁱ)
⚫ Mini-batch gradient descent (MBGD) combines the features of BGD and SGD, and chooses the gradients
of 𝑛 samples in a dataset each time to update the weight parameter.
w_{k+1} = w_k − (η/n)·Σᵢ₌ₜ^{t+n−1} ∇f_{w_k}(xⁱ)
Machine Learning Training Methods - Gradient Descent (3)
⚫ Comparison of gradient descent methods
◼ SGD randomly chooses samples for each training pass, causing instability. As a result, the loss
function fluctuates or even produces reverse displacement during the process of dropping to the
minimum.
◼ BGD is the most stable, but it consumes too many compute resources. MBGD is a balance between
BGD and SGD.
BGD
Use all training samples for each training pass.
SGD
One training sample is used for each training pass.
MBGD
A certain number of training samples are used for
each training pass.
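The three update rules can be compared on a toy one-weight linear model; the dataset, learning rate, and iteration count below are illustrative assumptions.

```python
import random

# Toy dataset generated from y = 2x, so the optimal weight is w = 2.
data = [(x, 2.0 * x) for x in [0.5, 1.0, 1.5, 2.0, 2.5]]
eta = 0.05  # learning rate (an illustrative choice)

def grad(w, sample):
    """Gradient of the squared error 0.5*(w*x - y)^2 with respect to w."""
    x, y = sample
    return (w * x - y) * x

def bgd_step(w, samples):
    # BGD: average the gradients of all m samples.
    return w - eta * sum(grad(w, s) for s in samples) / len(samples)

def sgd_step(w, samples, rng):
    # SGD: gradient of one randomly chosen sample.
    return w - eta * grad(w, rng.choice(samples))

def mbgd_step(w, samples, n, rng):
    # MBGD: average the gradients of a mini-batch of n samples.
    batch = rng.sample(samples, n)
    return w - eta * sum(grad(w, s) for s in batch) / n

rng = random.Random(0)
w_b = w_s = w_m = 0.0
for _ in range(200):
    w_b = bgd_step(w_b, data)
    w_s = sgd_step(w_s, data, rng)
    w_m = mbgd_step(w_m, data, 2, rng)
print(round(w_b, 2), round(w_s, 2), round(w_m, 2))  # all three approach 2.0
```

BGD moves smoothly, SGD fluctuates from step to step, and MBGD sits in between, matching the comparison above.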
(Figure: Model → Train, with hyperparameters used to control training.)
⚫ Hyperparameters are configurations outside the model.
Common hyperparameters
• Parameters are a part of a model that is learned from historical training data and key to
machine learning algorithms. They have the following characteristics:
• Examples:
Hyperparameter search general process:
1. Divide a dataset into a training set, validation set, and test set.
2. Optimize the model parameters using the training set based on the model performance metrics.
3. Search for model hyperparameters using the validation set based on model performance metrics.
4. Perform step 2 and step 3 alternately until model parameters and hyperparameters are determined, and assess the model using the test set.
Search algorithms (step 3):
•Grid search
•Random search
•Heuristic intelligent search
•Bayesian search
⚫ Note:
◼ In a random search, a search is first performed within a broad range, and
then the range is narrowed based on the location of the best result.
◼ Some hyperparameters are more important than others and affect random
search preferences.
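A minimal sketch of grid search versus random search over two hypothetical hyperparameters; the objective function and parameter names are made up for illustration.

```python
import random

# Hypothetical objective: validation score as a function of two
# hyperparameters (best at lr = 0.1, reg = 0.01 by construction).
def validation_score(lr, reg):
    return -((lr - 0.1) ** 2 + (reg - 0.01) ** 2)

rng = random.Random(42)

# Grid search: evaluate every combination on a fixed grid.
grid = [(lr, reg) for lr in (0.001, 0.01, 0.1, 1.0) for reg in (0.001, 0.01, 0.1)]
best_grid = max(grid, key=lambda p: validation_score(*p))

# Random search: sample the same budget of points at random; in practice
# the range is then narrowed around the best result and sampled again.
samples = [(rng.uniform(0.0, 1.0), rng.uniform(0.0, 0.1)) for _ in range(12)]
best_rand = max(samples, key=lambda p: validation_score(*p))

print(best_grid)  # (0.1, 0.01) on this grid
print(best_rand)
```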
• After a dataset is divided into a fixed training set and a fixed test set, a small test set is
problematic: it implies statistical uncertainty around the estimated mean test error, making
it difficult to claim that algorithm 𝐴 works better than algorithm 𝐵 on the given task.
When the dataset has hundreds of thousands of samples or
more, this is not a serious problem. When the dataset is too small, alternative procedures
enable one to use all the samples in the estimation of the mean test error, at the price of
increased computational workload.
• In k-fold CV, 𝑘 is typically greater than or equal to 2. It usually starts from 3 and is set to 2
only when the original dataset is small. k-fold CV can effectively avoid over-learning and
under-learning, and the final result is also persuasive.
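The k-fold splitting idea can be sketched as follows, assuming only that samples are addressed by index:

```python
# A minimal k-fold split: each fold serves once as the validation set
# while the remaining folds form the training set.
def k_fold_splits(n_samples, k):
    """Yield (train_indices, validation_indices) pairs for k-fold CV."""
    indices = list(range(n_samples))
    fold_size = n_samples // k
    for i in range(k):
        start = i * fold_size
        stop = (i + 1) * fold_size if i < k - 1 else n_samples
        val = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, val

for train, val in k_fold_splits(9, 3):
    print(val)  # [0, 1, 2] then [3, 4, 5] then [6, 7, 8]
```

Every sample appears in exactly one validation fold, so all data contributes to the estimated mean test error, at the price of training k models.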
Cross-Validation (2)
Full dataset
• Simple linear regression analysis refers to regression analysis where only one independent
variable and one dependent variable exist, and their relationship can be approximately
represented by a straight line. Multiple linear regression analysis involves two or more
independent variables and the relationship between independent variables and dependent
variables is linear. The linear regression model is a straight line only when the variable x is
one-dimensional. It is a hyperplane when this variable is multi-dimensional. For example,
the price of an apartment is determined by a variety of factors such as the area, layout, and
location. Prediction of the apartment price based on these factors can be abstracted into a
linear regression problem.
⚫ Loss function: 𝐽(𝑤) = (1/2𝑚)·Σᵢ₌₁ᵐ (h_𝑤(𝑥ⁱ) − 𝑦ⁱ)²
⚫ We want the predicted value to approach the actual value as closely as possible, that is, to minimize the loss value. We
can use a gradient descent algorithm to calculate the weight parameter 𝑤 at which the loss function reaches its
minimum, thereby completing model building.
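For simple linear regression, the minimizing weight and bias can also be computed in closed form; a minimal sketch with illustrative data:

```python
# Closed-form simple linear regression (one feature):
# w = cov(x, y) / var(x), b = mean(y) - w * mean(x).
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    w = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - w * mx
    return w, b

# Illustrative data lying exactly on y = 3x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [4.0, 7.0, 10.0, 13.0]
w, b = fit_line(xs, ys)
print(w, b)  # 3.0 1.0
```

Gradient descent reaches the same minimum iteratively, which is what makes it usable when the number of features is too large for a closed-form solve.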
Linear Regression Extension - Polynomial Regression
⚫ Polynomial regression is an extension of linear regression. Because the complexity of a dataset
exceeds the possibility of fitting performed using a straight line (obvious underfitting occurs if
the original linear regression model is used), polynomial regression is used.
h_w(x) = w₁x + w₂x² + ⋯ + wₙxⁿ + b
Here, the highest power 𝑛 indicates the degree of the polynomial.
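Polynomial regression can be sketched as linear regression over expanded features; the weights below are illustrative, not fitted:

```python
# Polynomial regression reduces to linear regression over expanded
# features: map x to [x, x^2, ..., x^n] and fit a linear model on them.
def poly_features(x, degree):
    return [x ** d for d in range(1, degree + 1)]

def predict(x, weights, bias):
    # h_w(x) = w1*x + w2*x^2 + ... + wn*x^n + b
    feats = poly_features(x, len(weights))
    return sum(w * f for w, f in zip(weights, feats)) + bias

# Illustrative degree-2 model: h(x) = 1*x + 2*x^2 + 0.5
print(predict(3.0, [1.0, 2.0], 0.5))  # 3 + 18 + 0.5 = 21.5
```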
(Figure: softmax classification examples: boy? girl?; apple? orange? banana? grape?)
p(y = k | x; w) = exp(wₖᵀx) / Σₗ₌₁ᴷ exp(wₗᵀx),  k = 1, 2, …, K
Example output (Class: Probability): Grape: 0.09; Banana: 0.01
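The softmax formula above can be sketched directly; the weight vectors and input are illustrative assumptions, and subtracting the maximum score is a standard numerical-stability trick:

```python
import math

# Softmax over class scores w_k^T x for K classes.
def softmax_probs(W, x):
    scores = [sum(wi * xi for wi, xi in zip(w, x)) for w in W]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

W = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # one weight vector per class
x = [2.0, 1.0]
probs = softmax_probs(W, x)
print([round(p, 3) for p in probs])
print(round(sum(probs), 6))  # the probabilities sum to 1.0
```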
(Figure: a decision-tree example that guesses an animal by asking small or large, short-nosed or long-nosed,
stays on land or stays in water, leading to "it could be a squirrel / rat / giraffe / elephant".)
• How to construct a decision tree is very important. We should determine the topological
structure of feature attributes based on attribute selection measures. The key step is to
split attributes. That is, different branches are constructed based on the differences of a
feature attribute on a node.
• The common learning algorithms used to generate a decision tree include ID3, C4.5, and
CART.
Structure of a Decision Tree
Root node
Subnode
Subnode
• Nodes that have no subnodes are called leaf nodes.
Key to Decision Tree Construction
⚫ A decision tree requires feature attributes and an appropriate tree structure. The key step of
constructing a decision tree is to divide data of all feature attributes, compare the result sets in terms of
purity, and select the attribute with the highest purity as the data point for dataset division.
⚫ Purity is measured mainly through the information entropy and GINI coefficient. The formula is as
follows:
H(X) = −Σₖ₌₁ᴷ pₖ·log₂(pₖ)        Gini = 1 − Σₖ₌₁ᴷ pₖ²
For CART regression trees, the optimal split (j, s) minimizes:
min_{j,s} [ min_{c₁} Σ_{xᵢ∈R₁(j,s)} (yᵢ − c₁)² + min_{c₂} Σ_{xᵢ∈R₂(j,s)} (yᵢ − c₂)² ]
⚫ 𝑝𝑘 indicates the probability that a sample belongs to category 𝑘 (in a total of K categories). A larger purity difference
between the sample before and after division indicates a better decision tree.
⚫ Common decision tree algorithms include ID3, C4.5, and CART.
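The two purity measures can be sketched as:

```python
import math

# Purity measures from the slide: information entropy and Gini coefficient,
# computed from the class probabilities p_k at a node.
def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def gini(probs):
    return 1.0 - sum(p * p for p in probs)

# A 50/50 node is the least pure; a 90/10 node is much purer.
print(entropy([0.5, 0.5]), gini([0.5, 0.5]))  # 1.0 0.5
print(round(entropy([0.9, 0.1]), 3), round(gini([0.9, 0.1]), 3))
```

A split is chosen by comparing the purity before the split with the weighted purity of the child nodes; a larger difference means a better split.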
Mapping
(Figure: a two-dimensional dataset with two sample categories. Both the division methods on the left and
right can divide the data, but which is correct?)
Maximize the
distance from each
support vector to
the line
Common kernel functions: linear kernel, polynomial kernel, Gaussian kernel, sigmoid kernel.
(Figure: mapping from the input space to a high-dimensional feature space.)
P(Cₖ | X₁, …, Xₙ) = P(X₁, …, Xₙ | Cₖ)·P(Cₖ) / P(X₁, …, Xₙ)
◼ 𝑋1 , 𝑋2 , … , 𝑋𝑛 are data features, which are usually described by m measurement values of the attribute set.
◼ For example, the attribute of the color feature may be red, yellow, and blue.
• Class conditional independence: The Bayes classifier assumes that the effect of an attribute
value on a given class is independent of the values of other attributes. This assumption is
made to simplify the calculation and becomes "naive" in this sense.
• Bayes classifiers show high accuracy and speed when applied to large databases.
Naive Bayes (2)
⚫ Feature independent hypothesis example:
◼ If a fruit is red, round, and about 10 cm in diameter, it can be considered an apple.
◼ A Naive Bayes classifier believes that each of these features independently contributes to the
probability of the fruit being an apple, regardless of any possible correlation between color,
roundness, and diameter features.
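The independence assumption can be sketched with made-up probability tables for the apple example (all numbers are illustrative):

```python
# Naive Bayes scoring under the feature-independence assumption:
# P(C_k | X) is proportional to P(C_k) * product over i of P(X_i | C_k).
prior = {"apple": 0.5, "other": 0.5}
likelihood = {
    "apple": {"red": 0.8, "round": 0.9, "10cm": 0.7},
    "other": {"red": 0.3, "round": 0.4, "10cm": 0.2},
}

def posterior(features):
    scores = {}
    for c in prior:
        score = prior[c]
        for f in features:
            score *= likelihood[c][f]  # independence: multiply per-feature terms
        scores[c] = score
    total = sum(scores.values())  # normalize by the evidence P(X)
    return {c: s / total for c, s in scores.items()}

p = posterior(["red", "round", "10cm"])
print(round(p["apple"], 3))  # a high probability for "apple"
```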
(Figure: ensemble learning: a large training set is sampled into subset 1 … subset n, each trains a model
producing prediction 1 … prediction n, and the predictions are combined into an ensemble model.)
k-means clustering automatically classifies unlabeled data.
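A minimal one-dimensional k-means sketch; the data and k = 2 are illustrative:

```python
import random

# k-means in one dimension: alternate an assignment step (each point to
# its nearest center) and an update step (each center to its cluster mean).
def kmeans_1d(xs, k, iters=20, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(xs, k)  # random initial centers
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in xs:  # assignment step
            i = min(range(k), key=lambda j: abs(x - centers[j]))
            clusters[i].append(x)
        for i, c in enumerate(clusters):  # update step
            if c:
                centers[i] = sum(c) / len(c)
    return sorted(centers)

data = [1.0, 1.2, 0.8, 10.0, 10.5, 9.5]
print(kmeans_1d(data, 2))  # centers near 1.0 and 10.0
```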
⚫ This course first describes the definition and types of machine learning, as well as
problems machine learning solves. Then, it introduces key knowledge points of
machine learning, including the overall procedure (data preparation, data cleansing,
feature selection, model evaluation, and model deployment), common algorithms
(including linear regression, logistic regression, decision tree, SVM, Naive Bayes, k-
NN, ensemble learning, and k-means clustering), and hyperparameters.
• Answers: 1. D 2. False
Recommendations
⚫ Huawei Talent
https://e.huawei.com/en/talent/portal/#/
1. Perceptron
5. Transformer Architecture
Problem analysis → Task locating → Machine learning algorithm
• Problem analysis and task locating: Locate the problem, convert the problem into a
machine learning problem, determine required data, and then collect the required data.
This part includes data exploration, scenario abstraction, and algorithm selection. For
example, logistic regression can be used to determine whether we can drink coffee, and the
Naive Bayes algorithm can be used to distinguish junk mails from normal ones. AI systems
do not detect these directly; instead, they gather data from doctors, such as the human
body's tolerance to caffeine and whether caffeine causes sleeplessness.
• Data cleansing: Data cleansing is the process of rechecking and validating data to remove
duplicate information, correct errors, and ensure data consistency. It also covers sampling
(random, systematic, and stratified), normalization, noise removal, and data filtering.
• Feature selection
• Model training
⚫ The figure on the left shows an example of a perceptron that receives two
input signals: x1 and x2 are the input signals, y is the output signal, and w1
and w2 are the weights.
⚫ The circles in the figure are neurons or nodes. When the input signals are
sent to the neuron, they are multiplied by fixed weights (w1·x1 and w2·x2).
The neuron calculates the sum of the input signals and outputs 1 only
when the sum exceeds a certain threshold. This is also called neuron
activation.
• In 1957, Frank Rosenblatt at the aviation lab of Cornell University invented the
perceptron artificial neural network. Considered the simplest form of a feedforward neural
network, it is a binary linear classifier whose activation function is the sign
function sign(x). The perceptron was the first practical application of the artificial neural
network, marking a new stage of neural network development. After implementing the
perceptron in software, Frank Rosenblatt started to build the hardware perceptron
Mark I. He used 400 optoelectronic devices as neurons and adjustable potentiometers as
synaptic weights; motors adjusted the potentiometers to implement weight changes
during learning. He built the hardware perceptron in this way and used it for image
recognition. Due to technical limitations at the time, physically implemented
perceptrons were rare. This image recognition system drew attention from many parties
and received a large amount of funding from the U.S. Navy.
• Signals mentioned here can be seen as something that has mobility like an electric current
or a river. Similar to an electric current in a wire sending electrons forward, perceptron
signals form a flow and transmit information forward. The difference from an electric
current is that a perceptron signal has only two values, current (1) and no current (0). In
this document, 0 indicates that no signal is transmitted, and 1 indicates that signals are
transmitted.
Mathematical Representation of a Perceptron
⚫ Input signals of the perceptron have their own inherent weights, which determine the
importance of each signal. A larger weight indicates that the signal is more important.
A weight is similar to the resistance in an electric circuit. Resistance determines how difficult it is for current to flow. The lower the
resistance, the larger the current flowing through. When it comes to the perceptron, a larger weight indicates a larger signal that
passes through the perceptron. Resistance and weight play the same role in controlling the signal flow difficulty (or ease).
(w1, w2, θ) = (0.5, 0.5, 0.7) (w1, w2, θ) = (−0.5, −0.5, −0.7) (w1, w2, θ) = ?
⚫ When the OR gate truth table is expressed in a two-dimensional coordinate system, the single-layer
perceptron linearly separates the two types of values.
⚫ The XOR gate is also referred to as a logical XOR circuit. The output is 1 only when exactly one of x1 and
x2 is 1.
⚫ Can we use a single-layer perceptron to distinguish between 0 and 1 in an XOR gate?
(Figure: OR gate vs. XOR gate in the x1-x2 plane.)
Replace the question marks (?) with the AND gate, NAND gate,
and OR gate to implement the XOR gate.
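A sketch using the slide's parameters for AND and NAND (the OR threshold is an illustrative choice), showing how a two-layer combination implements XOR:

```python
# Single-layer perceptrons for AND, NAND, and OR, then XOR built by
# combining them into two layers.
def perceptron(x1, x2, w1, w2, theta):
    return 1 if w1 * x1 + w2 * x2 > theta else 0

def AND(x1, x2):  return perceptron(x1, x2, 0.5, 0.5, 0.7)
def NAND(x1, x2): return perceptron(x1, x2, -0.5, -0.5, -0.7)
def OR(x1, x2):   return perceptron(x1, x2, 0.5, 0.5, 0.2)  # theta = 0.2 is an illustrative choice

def XOR(x1, x2):
    s1 = NAND(x1, x2)   # first layer
    s2 = OR(x1, x2)
    return AND(s1, s2)  # second layer

for a in (0, 1):
    for b in (0, 1):
        print(a, b, XOR(a, b))  # outputs 0, 1, 1, 0
```

No single linear boundary separates the XOR outputs, which is why a second layer is required.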
Implementation of the XOR Gate — Two-layer Perceptron
(Figure: x1 and x2 feed two first-layer gates producing s1 and s2, which feed the output gate;
truth table of the XOR gate implementation process.)
The XOR gate can be implemented by combining the AND, NAND, and OR gates.
⚫ Question: How does a multi-layer
perceptron transform the
perceptron function from linear to
nonlinear?
𝑠𝑖𝑔𝑛(𝑥) = 1 if 𝑥 > 0, −1 if 𝑥 ≤ 0 is applied at each neuron (a1, a2, and y) of the 3-layer perceptron,
with thresholds 𝜽𝟏 and 𝜽𝟐.
If a single-layer perceptron is used, the weighted sum is decisive for discrimination, and it is a linear classifier.
Without the activation function, the hidden outputs 𝑎1 = 𝑥1·𝑤1 + 𝑥2·𝑤2 + 𝜃1 and 𝑎2 = 𝑥1·𝑤3 + 𝑥2·𝑤4 + 𝜃1 are linear, and the output
𝑎1·𝑤5 + 𝑎2·𝑤6 + 𝜃2 = 𝑤5(𝑥1·𝑤1 + 𝑥2·𝑤2 + 𝜃1) + 𝑤6(𝑥1·𝑤3 + 𝑥2·𝑤4 + 𝜃1) + 𝜃2
is still linear in 𝑥1 and 𝑥2. It is the nonlinear sign activation applied between layers that makes the
multi-layer perceptron a nonlinear classifier.
(Figure: a deep neural network with an input layer, hidden layers, and an output layer.)
• What is deep learning? In the previous course, we have learned that deep
learning is a subset of machine learning.
• Generally, the deep learning architecture is a deep neural network. The depth
refers to the number of layers in the neural network.
• The network is built by simulating the human neural network.
(Figure: a feedforward network with inputs 𝑥1-𝑥3, biases 𝑏1-𝑏3, hidden layers 1 and 2, and outputs 𝑦ⱼ⁽ˡ⁾.)
⚫ X indicates the input data.
⚫ W indicates the weight. 𝑤ⱼᵢˡ is the weight connecting node 𝑗 at layer 𝐿 to node 𝑖 at
layer 𝐿 − 1 (for example, 𝑤₄₃¹).
⚫ B is the threshold for activating a neuron, and is referred to as a bias
in a feedforward neural network.
⚫ Y is the output of a neuron, and represents the output after the
input is processed by the activation function.
• The more hidden layers, the stronger the identification capability of the neural
network.
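The notation above can be sketched as a forward pass with a sigmoid activation; the layer sizes and values are illustrative assumptions:

```python
import math

# Each layer computes y = g(Wx + b), where g is the activation function.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(W, b, x):
    # W[j][i] connects node j of this layer to node i of the previous layer.
    return [sigmoid(sum(wji * xi for wji, xi in zip(row, x)) + bj)
            for row, bj in zip(W, b)]

x = [1.0, 0.5]
W1, b1 = [[0.2, -0.4], [0.7, 0.1]], [0.1, -0.2]  # hidden layer
W2, b2 = [[1.0, -1.0]], [0.0]                    # output layer
h = layer_forward(W1, b1, x)
y = layer_forward(W2, b2, h)
print([round(v, 3) for v in y])
```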
Adding Nodes at a Single Layer
⚫ As we shift from single-layer to the multi-layer perceptrons, we realize that we can increase
both the number of layers on a network and the number of network nodes at a single layer.
How to identify different handwritten 5s?
• Algorithms that humans think of → Answer
• Features that humans think of (SIFT, HOG, etc.) → Machine learning (SVM, KNN, etc.) → Answer
• Neural network (deep learning) → Answer
(Figure: training data → feature learning and self-optimization → established identification mode;
validation and test data → identification: flowers? leaves?)
The validation set is used to evaluate the model training at the end of a phase, so that we can make
adjustments based on the validation results.
The test set is generally data that has never been seen by a formally-working model. The results
obtained with the test set directly reflect the model's performance.
• https://3ms.huawei.com/km/static/image/detail.html?fid=57074
• https://3ms.huawei.com/km/static/image/detail.html?fid=56891
How Does a Neural Network Learn?
(Figure analogy: learning → simulation test → your answer vs. the true answer → reflection.
Neural network learning process: identification result vs. true result → calculate error → parameter adjustment.)
The gradient of 𝑓 at 𝑊′ = [𝑤0′, 𝑤1′, …, 𝑤𝑛′]ᵀ is as follows:
𝛻𝑓(𝑤0′, 𝑤1′, …, 𝑤𝑛′) = [∂𝑓/∂𝑤0, ∂𝑓/∂𝑤1, …, ∂𝑓/∂𝑤𝑛]ᵀ |_{𝑊=𝑊′}
• The BGD algorithm uses the entire training set for each learning. Therefore, each update is
made towards the correct direction, ensuring convergence to the extrema in the end. A
convex function converges to a global extremum, and a non-convex function may converge
to a local extremum. However, this algorithm requires longer learning time and more
memory resources.
Stochastic Gradient Descent (SGD) Algorithm
⚫ A common variant, Incremental Gradient Descent or Stochastic Gradient Descent, has
been developed to solve the issues found in the BGD algorithm. One of its
implementations is online learning, which updates the gradient based on each sample.
∆𝑤𝑖 = −(𝜂/𝑛)·Σ_{𝑑∈𝐷} ∂C(𝑡𝑑, 𝑜𝑑)/∂𝑤𝑖  ⟹  ∆𝑤𝑖 = −𝜂·∂C(𝑡𝑑, 𝑜𝑑)/∂𝑤𝑖 (for a single sample 𝑑)
⚫ SGD:
Initialize each 𝑤𝑖 to a random value with a smaller absolute value.
Before the termination condition is met, do as follows:
◼ Randomly select <X, t> in the training set:
− Input X to this unit and calculate the output o.
− For each 𝑤𝑖 in this unit: 𝑤𝑖 += −𝜂·∂C(𝑡, 𝑜)/∂𝑤𝑖
• This gradient descent algorithm updates the weight based on each sample. Most training
samples contain noise. As a result, when the extremum is approximated to, the gradient
direction is oriented up and down near the extremum but difficult to converge to the
extremum.
Mini-batch Gradient Descent (MBGD) Algorithm
⚫ The MBGD algorithm was designed to address the issues found in the previous two gradient descent
algorithms and is now most widely used. It uses a small fixed batch size (BS) of samples to compute ∆𝑤𝑖
and update weights.
Initialize each 𝑤𝑖 to a random value with a smaller absolute value.
Before the termination condition is met, do as follows:
◼ Initialize each ∆𝑤𝑖 to zero.
◼ For each <X, t> in the next batch of samples (number of samples = BS) obtained from the training set, do as
follows:
− Input X to this unit and calculate the output o.
− For each 𝑤𝑖 in this unit: ∆𝑤𝑖 += −(𝜂/BS)·∂C(𝑡, 𝑜)/∂𝑤𝑖
◼ For each 𝑤𝑖 in this unit: 𝑤𝑖 += ∆𝑤𝑖
• This gradient descent algorithm considers both efficiency and gradient stability, and its
remaining gradient noise can even help it escape shallow local minima. It is the most
commonly used gradient descent algorithm in actual work. The batch size varies with the
specific problem and is usually set to 32.
Network Training Process
⚫ Forward propagation: 𝐶(𝑊) = ½(𝑦 − 𝑎3)² = ½(𝑦 − 𝑔(𝑊2·𝑔(𝑊1·𝑋)))².
⚫ If the parameter 𝑊1 needs to be updated, then according to 𝑊𝑡+1 = 𝑊𝑡 − 𝜂·d𝐶/d𝑊,
compute: d𝐶/d𝑊1 = (∂𝐶/∂𝑎3)·(∂𝑎3/∂𝑧3)·(∂𝑧3/∂𝑎2)·(∂𝑎2/∂𝑧2)·(∂𝑧2/∂𝑊1).
Forward propagation
Back propagation
⚫ If 𝑧 = 𝑓(𝑡) and 𝑡 = 𝑔(𝑥), the derivative of 𝑧 with respect to 𝑥 may be expressed by (the derivative of 𝑧 with
respect to 𝑡) multiplied by (the derivative of 𝑡 with respect to 𝑥).
⚫ This is the chain rule.
• Neural networks propagate forward layer by layer. How can we calculate the gradients of
all parameters based on the gradient of the loss function?
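For a scalar version of this network, the chain-rule gradient can be checked against a numerical derivative; all values are illustrative:

```python
import math

# Scalar network: z2 = w1*x, a2 = g(z2), z3 = w2*a2, a3 = g(z3),
# C = 0.5*(y - a3)^2. The chain rule gives
# dC/dw1 = (dC/da3)(da3/dz3)(dz3/da2)(da2/dz2)(dz2/dw1).
def g(z):  return 1.0 / (1.0 + math.exp(-z))  # sigmoid
def dg(z): return g(z) * (1.0 - g(z))         # its derivative

def grad_w1(x, y, w1, w2):
    z2 = w1 * x; a2 = g(z2)
    z3 = w2 * a2; a3 = g(z3)
    return (a3 - y) * dg(z3) * w2 * dg(z2) * x  # chain rule, factor by factor

# Check against a central-difference numerical derivative.
x, y, w1, w2, eps = 0.5, 1.0, 0.3, -0.8, 1e-6
def cost(w1_):
    return 0.5 * (y - g(w2 * g(w1_ * x))) ** 2
numeric = (cost(w1 + eps) - cost(w1 - eps)) / (2 * eps)
print(abs(grad_w1(x, y, w1, w2) - numeric) < 1e-6)  # True
```

Backpropagation applies exactly this chain of local derivatives layer by layer, reusing intermediate values from the forward pass.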
(Figure: a chain of layers x → z1|y1 → z2|y2 → z3|y3 → z4|y4 → C with weights w1-w4.)
⚫ The derivative of the sigmoid function is at most 1/4, and the network weight |w| is usually less than 1.
Therefore, |σ'(z)·w| ≤ 1/4. When using the chain rule, the value of ∂C/∂𝑏1 becomes smaller as the
number of layers increases, resulting in gradient vanishing.
⚫ When the network weight |w| is large, such that |σ'(z)·w| > 1, the gradient explodes.
⚫ Solution: Use gradient clipping to prevent the gradient from exploding, and use the ReLU activation
function and LSTM neural network to mitigate gradient vanishing.
• Gradient exploding occurs when the value of w is large. However, it rarely occurs when
the sigmoid activation function is used, because the value of 𝜎′(𝑧) is also related to w
(since 𝑧 = 𝑤𝑥 + 𝑏, a large w pushes z toward the saturated region where 𝜎′(𝑧) is small),
unless the input value of this layer stays within a small range.
• Gradient exploding and vanishing are caused by large network depth and unstable network
weight update. In essence, they are caused by the chain rule in gradient backward
propagation.
• Gradient clipping is proposed to solve gradient exploding. The principle is to set a gradient
clipping threshold. When being updated, if the gradient exceeds the threshold, the gradient
must be limited under the threshold. This prevents gradient exploding.
• ReLU: The principle is simple. If the derivative of the activation function is 1, gradient
vanishing and exploding do not occur, and each network layer has the same update rate.
The ReLU function is developed for this purpose.
• For long short-term memory (LSTM) networks, gradient vanishing does not easily occur
thanks to the complex gates inside the LSTM, which can carry residual memory from
previous training steps through updates. Therefore, LSTM is often used for text
generation. CNN-based LSTM combinations are also available; try them if you are
interested.
Contents
1. Perceptron
2. Fully-connected Neural Network and How It Is Trained
Deep Neural Network
Common Activation Functions
Neural Network Training
◼ Optimizers
Regularization
3. Convolutional Neural Network
4. Model Architecture Based on the Recurrent Neural Network
5. Transformer Architecture
6. Basic Foundation Model Architecture
Faster Model Training
⚫ Neural network training is very slow
when using normal methods.
⚫ To speed up the process, we can use
special tools called optimizers.
• https://3ms.huawei.com/km/static/image/detail.html?fid=57497
Optimizers
⚫ There are various improved versions of gradient descent algorithms. In object-oriented
programming, different gradient descent algorithms are often encapsulated into an object
called an optimizer.
⚫ There are several reasons for improving algorithms, including:
Accelerating algorithm convergence
Helping the algorithm avoid getting stuck in, or escape from, local extrema
Simplifying manual parameter setting, especially the learning rate
⚫ Common optimizers include the common SGD optimizer, momentum optimizer, AdaGrad,
RMSprop, and Adam.
Momentum Optimizer
⚫ Adding the momentum term to ∆𝑤𝑗𝑖 is a basic way to improve an algorithm. Assume that the
weight correction of the 𝑛-th iteration is ∆𝑤𝑗𝑖 𝑛 . The weight correction rule is:
∆𝑤ⱼᵢˡ(𝑛) = −𝜂·𝛿ᵢˡ⁺¹·𝑥ⱼˡ(𝑛) + 𝛼·∆𝑤ⱼᵢˡ(𝑛 − 1)
Where α is a constant (0 ≤ 𝛼 < 1) called the momentum coefficient, and 𝛼·∆𝑤ⱼᵢˡ(𝑛 − 1) is the
momentum term.
⚫ Imagine that a small ball rolls down from a random point on the error surface. The momentum
term gives inertia to the ball.
• 𝜂𝛿𝑖𝑙+1 𝑥𝑗𝑙 𝑛 indicates the original gradient and direction, j represents the j-th neuron at the
l-th layer, and i represents the i-th neuron at the (l+1)-th layer.
Advantages and Disadvantages of the Momentum Optimizer
⚫ Advantages:
Enhances the stability of the gradient correction direction and reduces abrupt changes.
In areas where the gradient direction is stable, the ball rolls faster and faster (there is an upper speed
limit because 𝛼 < 1), which helps the ball quickly cross flat areas and accelerates
convergence.
A small ball with inertia is more likely to roll over some narrow local extrema.
⚫ Disadvantages:
The learning rate 𝜂 and momentum 𝛼 need to be manually set. Usually, additional experiments are
necessary to determine the appropriate values.
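The momentum update can be sketched on an illustrative quadratic loss:

```python
# Momentum update from the slide: delta(n) = -eta*grad + alpha*delta(n-1),
# applied to the illustrative loss f(w) = (w - 3)^2 with minimum at w = 3.
def grad(w):
    return 2.0 * (w - 3.0)

eta, alpha = 0.1, 0.9  # learning rate and momentum coefficient
w, delta = 0.0, 0.0
for _ in range(300):
    delta = -eta * grad(w) + alpha * delta  # the momentum term adds inertia
    w += delta
print(round(w, 3))  # converges to the minimum w = 3
```

With 𝛼 = 0.9 the iterate oscillates around the minimum like a ball with inertia before settling, which is exactly the behavior described above.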
𝑔𝑡 indicates the t-th gradient, and 𝑟𝑡 is the gradient accumulation variable. The initial value of 𝑟 is 0, and it increases
continuously. 𝜂 indicates the global learning rate, which needs to be set manually. 𝜀 is a small constant set to 10⁻⁷
for numerical stability.
⚫ 𝑔𝑡 indicates the t-th gradient, and 𝑟 is the gradient accumulation variable. The initial value of 𝑟 is 0;
instead of growing without bound, the accumulated value is decayed at each step by the attenuation
factor 𝛽. 𝜂 indicates the global learning rate, which needs to be set manually. 𝜀 is a small constant
set to 10⁻⁷ for numerical stability.
Adam Optimizer (1)
⚫ Adaptive Moment Estimation (Adam) was developed based on AdaGrad and AdaDelta. It
maintains two additional variables, 𝑚𝑡 and 𝑣𝑡 , for each variable to be trained.
𝑚𝑡 = 𝛽1 𝑚𝑡−1 + (1 − 𝛽1 )𝑔𝑡
𝑣𝑡 = 𝛽2 𝑣𝑡−1 + (1 − 𝛽2 )𝑔𝑡2
⚫ Where 𝑡 represents the 𝑡-th iteration, and 𝑔𝑡 is the calculated gradient. 𝑚𝑡 and 𝑣𝑡 are moving
averages of the gradient and square gradient, respectively. From a statistical perspective,
𝑚𝑡 and 𝑣𝑡 are estimates of the gradient's first moment (the average value) and the second
moment (the uncentered variance), respectively, hence the name of the method.
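The Adam update can be sketched for a single parameter on an illustrative quadratic loss; the hyperparameter values follow common defaults, which is an assumption here:

```python
import math

# Adam on the illustrative loss f(w) = (w - 3)^2, minimum at w = 3.
def grad(w):
    return 2.0 * (w - 3.0)

eta, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 501):
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g        # first moment (moving average)
    v = beta2 * v + (1 - beta2) * g * g    # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)           # bias correction
    v_hat = v / (1 - beta2 ** t)
    w -= eta * m_hat / (math.sqrt(v_hat) + eps)
print(round(w, 2))  # ≈ 3.0
```

The bias-correction terms compensate for 𝑚𝑡 and 𝑣𝑡 being initialized at zero, which would otherwise bias the early estimates toward zero.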
• https://3ms.huawei.com/km/static/image/detail.html?fid=64716
Overfitting
⚫ Problem description: The model performs well with the training set, but poorly with the test set.
⚫ Root cause: There are too many feature dimensions, model assumptions, and parameters, too much
noise, but very little training data. As a result, the fitting function predicts the training set almost
perfectly, but the prediction result for a new test data set is unsatisfactory. Training data is overfitted
without considering the model's generalization capability.
(Figure: three fits of y against x: underfitting (features not learned), good fitting, overfitting (noise learned).)
• Overfitting is caused by insufficient data and complex model. In deep learning, small
training data size and high model complexity are two common problems. They result in high
prediction accuracy with training data, but low accuracy with test data. This is how
overfitting occurs.
▫ Obtain more data: Obtain more data from data sources and perform data
augmentation.
▫ Use a proper model: Reduce the number of network layers and the number of
neurons to limit the fitting capability of the network.
▫ Dropout
▫ Using multiple models: Bagging uses diverse models to fit different parts of the
training set, while Boosting uses only simple models (for example, shallow neural networks).
𝐿1 Regularization
⚫ Add the norm constraint 𝐿1 to model parameters, that is,
𝐽ሚ 𝑤; 𝑋, 𝑦 = 𝐽 𝑤; 𝑋, 𝑦 + 𝛼 𝑤 1
⚫ If a gradient method is used to resolve the value, the parameter gradient is:
𝛻 𝐽ሚ 𝑤 = 𝛼𝑠𝑖𝑔𝑛 𝑤 + 𝛻𝐽 𝑤
⚫ The parameter optimization method is:
𝑤𝑡+1 = 𝑤 − 𝜀𝛼𝑠𝑖𝑔𝑛(𝑤) − 𝜀𝛻𝐽(𝑤)
𝐿1 regularization can reduce parameter values directly to 0, which makes it different from 𝐿2
regularization. The 𝐿2 optimization method does not directly reduce the parameter value to 0 but
to a value close to 0.
𝐿2 Regularization
⚫ Add norm penalty term 𝐿2 to prevent overfitting.
𝐽ሚ(𝑤; 𝑋, 𝑦) = 𝐽(𝑤; 𝑋, 𝑦) + (𝛼/2)·‖𝑤‖₂²
A parameter optimization method can be inferred using an optimization technology (such as a
gradient method):
𝑤 = 1 − 𝜀𝛼 𝑤 − 𝜀𝛻𝐽(𝑤)
where 𝜀 is the learning rate. Compared with a common gradient optimization formula, this
formula multiplies the parameter by a reduction factor.
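The reduction-factor effect can be sketched by comparing plain gradient descent with the L2 update on an illustrative one-parameter loss:

```python
# L2-regularized step from the slide: w <- (1 - eps*alpha)*w - eps*grad_J(w).
# Compared with plain gradient descent, the weight is first multiplied by
# the reduction factor (1 - eps*alpha), i.e. "weight decay".
def grad_J(w):
    return 2.0 * (w - 1.0)  # loss J(w) = (w - 1)^2, minimum at w = 1

eps_lr, alpha = 0.1, 0.5  # learning rate and regularization strength
w_plain, w_l2 = 3.0, 3.0
for _ in range(200):
    w_plain -= eps_lr * grad_J(w_plain)
    w_l2 = (1 - eps_lr * alpha) * w_l2 - eps_lr * grad_J(w_l2)
print(round(w_plain, 3), round(w_l2, 3))  # 1.0 0.8
```

The regularized weight settles at 0.8 rather than the unregularized minimum 1.0: the penalty pulls parameters toward the origin, trading a little training fit for lower model capacity.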
• Add a regular term Ω(θ) = ½‖𝑤‖₂² to the target function to make the weight closer to
the origin.
(Figure: validation-set and training-set error vs. training round.)
• Generally, a relatively large number (n) of iterations is set. First we need to save the current
model, including the network structure and weights, and train the model for num_batch
times (that is, an epoch) to obtain a new model. Then, use the test set as the input of the
new model for testing. If the test error is greater than the last test error, we do not stop the
test immediately. Instead, we should continue to conduct the training and test for several
epochs. If the error does not decrease, we stop the model training when we believe that
the minimum error is achieved in the last test.
• The common practice is to record the best test set accuracy p so far. If the accuracy does
not exceed p in m consecutive periods, it may be considered that p does not increase
anymore, and iteration may be stopped early (early stopping).
• As shown in the figure, the test error gradually decreases in the first several epochs.
However, it slightly increases in a certain epoch. This indicates that overfitting has occurred.
• Early stopping is to stop training before the test error starts to increase, even if the training
has not converged, that is, the training error has not reached the minimum.
Dropout
⚫ Dropout is a common simple regularization method, which has been widely used since 2014. Simply put,
dropout randomly discards some inputs during training. Parameters corresponding to the discarded
inputs are not updated.
⚫ Dropout is an integration method. It combines all sub-network results and obtains sub-networks by
randomly dropping inputs.
(Figure: the full network y = f(h₁, h₂; x₁, x₂) and the sub-networks obtained by randomly dropping
input and hidden units; dropout combines the results of all these sub-networks.)
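A sketch of dropout on one layer's activations, using the common "inverted dropout" form; scaling kept units by 1/p is an implementation convention, not stated in the slide:

```python
import random

# Inverted dropout: each unit is kept with probability p and rescaled by
# 1/p so the expected activation is unchanged between training and test.
def dropout(activations, p, rng):
    out = []
    for a in activations:
        if rng.random() < p:
            out.append(a / p)  # keep and rescale
        else:
            out.append(0.0)    # drop: this unit's parameters get no update
    return out

rng = random.Random(0)
h = [0.3, 1.2, -0.7, 0.5]
print(dropout(h, 0.5, rng))  # roughly half the units are zeroed
```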
• When the samples of each test round are input into the neural network for training, a
probability p is set, so that each neuron has a specific probability of dying and being
excluded from network training.
• The process is as follows:
1. Temporarily delete neurons in the hidden layer of the network with probability p,
keeping the input and output neurons unchanged, and back up the parameters of the
deleted neurons.
2. Propagate the input x forward through the modified network, and propagate the loss
back through the same modified network. After this is done for a small batch of
training samples, update the parameters (w, b) of the undeleted neurons with the SGD
algorithm; the parameters of the deleted neurons remain the same as before the
deletion.
3. Restore the deleted neurons, then repeat the process with a newly sampled subset of
temporarily deleted neurons.
• The sampling probability of each entry is 0.8 for the input and 0.5 for the hidden layers.
• Advantages:
▫ Compared with weight decay and norm constraints, this strategy is more effective.
▫ It is computationally cheap and simple and can be used in non-deep-learning models.
▫ However, it is less effective when the training data is insufficient.
▫ Stochasticity is neither necessary nor sufficient to achieve the regularizing effect of
dropout. Fixed masking parameters can be constructed to obtain good solutions.
• In addition to the preceding methods, we can also use semi-supervised learning, multi-task
learning, early stopping, parameter sharing, ensemble methods, and adversarial training.
Contents
1. Perceptron
5. Transformer Architecture
• A filter matrix is a set of fixed weights and can be seen as a constant filter (kernel). The
convolution (weighting each element by the kernel and summing) is performed between an image
(data in different data windows) and a kernel. This type of network is called a CNN.
• Local receptive field: It is generally considered that human perception of the outside world
is from local to global. Spatial correlations among local pixels of an image are closer than
those among pixels that are far away. Therefore, each neuron does not need to know the
global image. It only needs to know the local image and then the local information is
combined at a higher level to generate global information. The idea of local network
connection is also inspired by the biological visual system structure. The neurons in the
visual cortex receive local information (respond to stimuli of certain regions).
• Parameter sharing: One or more filters can be used to scan the input images. The
parameters of the filter are weights. At the layers scanned by the same filter, each filter
uses the same parameters to perform weighted computation. Weight sharing means that
the parameter values of each filter do not change when the filter scans the entire image.
For example, if we have three feature filters and each filter scans the entire image, the
parameter values of each filter do not change during the scanning process. In other words,
every position of the image is processed with the same weights.
Main CNN Concepts
⚫ Local receptive field: It is believed that humans perceive the outside world from local
to global. The spatial correlations among an image's local pixels are closer than those
among the pixels that are far away. As such, each neuron does not need to know the
global image; it only needs to know the local image. Then, the local information is
combined at a higher level to generate global information.
⚫ Parameter sharing: One or more convolution cores may be used to scan input images.
Parameters carried by the convolution cores are weights. In a layer scanned by
convolution cores, each core uses the same parameters for weighted computation.
Weight sharing means that, when each convolution core scans an entire image, its
parameters are fixed.
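Parameter sharing can be made concrete with a small sketch. The sizes below are illustrative; the point is that a convolutional layer's parameter count depends only on the kernel, not on the image size, because the same weights scan every position.

```python
import torch
import torch.nn as nn

# A single set of 3x3 kernels scans the whole image with shared weights:
# the parameter count is independent of the image size.
conv = nn.Conv2d(in_channels=1, out_channels=3, kernel_size=3, padding=1)

image = torch.randn(1, 1, 32, 32)   # [N, C, H, W]
features = conv(image)

print(features.shape)               # torch.Size([1, 3, 32, 32])
# 3 filters x (3*3 weights + 1 bias) = 30 parameters, regardless of H and W.
print(sum(p.numel() for p in conv.parameters()))  # 30
```

Doubling the image to 64x64 changes the output size but leaves the 30 shared parameters untouched.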
(Figure: the network output is vectorized into class scores, e.g., Pbird, Pfish, Pdog, Pcat.)
• Rectified linear units layer (ReLU layer): uses ReLU f(x) = max(0, x) as the activation
function.
• Pooling layer: partitions the features obtained from the convolutional layer into regions and outputs the maximum or average value of each region, generating new features with a smaller spatial size.
• Fully-connected layer: integrates all local features into global features to compute the final
scores for each type.
(Figure: a feed-forward network mapping an input tensor to an output tensor through layers F1 … Fn, each with weights W1, b1 … Wn, bn followed by an activation.)
• The actual classification networks are feed forward networks that are formed by
interconnected convolutional and pooling layers. The pooling layer has the following
functions:
▫ Invariance: Max pooling ensures invariance within a certain range, because the maximum value of a region is the output value regardless of where it appears within the region.
▫ Reducing the input size for the next layer: Pooling effectively reduces the size of the
input data for the next layer, the number of parameters, and computation workload.
▫ Obtaining fixed-length data: By properly setting the pooling window size and stride,
we can obtain fixed-length outputs from variable-length inputs.
▫ Increasing the scale: The features of the upper layer can be extracted from a larger
scale.
▫ Preventing overfitting: Pooling simplifies the network and reduces the fitting
precision. Therefore, it can prevent overfitting (note the possible underfitting).
Fully-connected Layer
⚫ The fully-connected layer is essentially a classifier. The features extracted on the convolutional
and pooling layers are straightened and placed at the fully-connected layer to output and
classify results.
Straighten
⚫ Generally, the Softmax function is used as the activation function of the final fully-connected
output layer to combine all local features into global ones and compute the score of each class.
softmax(z_j) = e^{z_j} / Σ_k e^{z_k}
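The softmax formula can be computed directly; the scores below are illustrative raw outputs of a final fully-connected layer.

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the result is unchanged.
    e = np.exp(z - np.max(z))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])   # raw class scores from the FC layer
probs = softmax(scores)
print(probs)            # ~[0.659, 0.242, 0.099]
print(probs.sum())      # 1.0 — a valid probability distribution
```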
1. Perceptron
2. Fully-connected Neural Network and How It Is Trained
3. Convolutional Neural Network
4. Model Architecture Based on the Recurrent Neural Network
◼ RNN
LSTM
Seq2Seq
5. Transformer Architecture
6. Basic Foundation Model Architecture
⚫ S_t traverses multiple hidden layers, and then traverses the fully-connected layer V to obtain the final output O_t at time t: O_t = softmax(S_t V).
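A single RNN time step can be sketched as follows. The slide gives only the output mapping O_t = softmax(S_t V); the tanh recurrence and the weight names U and W below are conventional assumptions, and all shapes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# One RNN time step: the hidden state S_t is produced from the input X_t
# and the previous state S_{t-1}, then mapped through the output matrix V.
U = rng.normal(size=(4, 8))   # input-to-hidden weights (assumed)
W = rng.normal(size=(8, 8))   # hidden-to-hidden weights (assumed)
V = rng.normal(size=(8, 3))   # hidden-to-output weights

x_t = rng.normal(size=(4,))
s_prev = np.zeros(8)

s_t = np.tanh(x_t @ U + s_prev @ W)   # assumed recurrence
o_t = softmax(s_t @ V)                # O_t = softmax(S_t V)
print(o_t.shape)  # (3,) — a distribution over 3 output classes
```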
• z_t is the value before the memory cell enters the activation function. The error term satisfies δ_t = ∂C/∂z_t, accumulated recursively across time steps through the recurrent weights W^T.
⚫ While the standard RNN architecture solves the problem of information memory, long-term information attenuates as it propagates through time.
Long-term information needs to be saved in many tasks. For example, a hint at the beginning of a speculative
fiction may not be explained until the end.
RNN may not be able to save long-term information due to the limited memory cell capacity.
We expect that memory cells can remember key information.
1. Perceptron
2. Fully-connected Neural Network and How It Is Trained
3. Convolutional Neural Network
4. Model Architecture Based on the Recurrent Neural Network
RNN
◼ LSTM
Seq2Seq
5. Transformer Architecture
6. Basic Foundation Model Architecture
⚫ The decision is made by the forget gate. This gate reads h_{t−1} and x_t, and outputs a value between 0 and 1 for each element of the cell state C_{t−1}. The value 1 means the information is completely retained, while the value 0 means it is completely discarded.
(Figure: the LSTM cell, shown repeatedly with each gate highlighted — inputs h_{t−1} and x_t, cell state C_{t−1} → C_t, forget gate f_t, input gate i_t, candidate state C̃_t, and output gate o_t.)
f_t = σ(W_f · [h_{t−1}, x_t] + b_f)
C̃_t = tanh(W_C · [h_{t−1}, x_t] + b_C)
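One LSTM step can be written out directly from the gate equations. This is a minimal sketch: the weight layout (all four gates packed into one matrix) and the tensor sizes are illustrative assumptions, not the slide's notation.

```python
import torch

torch.manual_seed(0)

# One LSTM time step, following the standard gate equations
# (f_t, i_t, o_t via sigmoid; candidate state via tanh).
def lstm_step(x_t, h_prev, c_prev, W, b):
    z = torch.cat([h_prev, x_t]) @ W + b           # all four gates at once
    f, i, o, g = z.chunk(4)
    f, i, o = torch.sigmoid(f), torch.sigmoid(i), torch.sigmoid(o)
    g = torch.tanh(g)                               # candidate state C~_t
    c_t = f * c_prev + i * g                        # forget old, add new
    h_t = o * torch.tanh(c_t)                       # expose part of the cell
    return h_t, c_t

hidden, inputs = 5, 3                               # illustrative sizes
W = torch.randn(hidden + inputs, 4 * hidden)
b = torch.zeros(4 * hidden)
h, c = torch.zeros(hidden), torch.zeros(hidden)
h, c = lstm_step(torch.randn(inputs), h, c, W, b)
print(h.shape, c.shape)  # torch.Size([5]) torch.Size([5])
```

In practice torch.nn.LSTM implements exactly this recurrence efficiently over whole sequences.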
(Figure: RNN input/output patterns — one to one, one to many, many to one, and two many-to-many variants.)
1. Perceptron
2. Fully-connected Neural Network and How It Is Trained
3. Convolutional Neural Network
4. Model Architecture Based on the Recurrent Neural Network
RNN
LSTM
◼ Seq2Seq
5. Transformer Architecture
6. Basic Foundation Model Architecture
⚫ Models such as RNN and LSTM can be selected for the encoder and decoder.
Seq2Seq
⚫ Sequence-to-sequence (Seq2Seq): A sequence is input and another sequence is output. This is a typical
encoder-decoder model. The encoder encodes the input data, and the decoder decodes the encoded
data. It is also a solution to the time sequence problem.
Encoder Decoder
⚫ The encoding part is marked blue. The first <eos> indicates the end of encoding. The decoding part is left
white. <bos> is the first input, indicating the start of decoding. The second <eos> indicates the end of
decoding.
The encoder converts the hidden states of all time steps into a context variable c by using a selected function.
⚫ After obtaining the decoder's hidden state, we can use the output layer and a softmax operation to calculate the conditional probability distribution P(y_{t′} | y_1, …, y_{t′−1}, c) of the output y_{t′} at time step t′.
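The encoder-decoder structure above can be sketched in a few lines. The GRU choice, vocabulary size, and dimensions are illustrative assumptions; the slide notes that plain RNNs or LSTMs work equally well.

```python
import torch
import torch.nn as nn

# Minimal Seq2Seq sketch: the encoder compresses the source sequence
# into a context variable; the decoder generates conditioned on it.
class Seq2Seq(nn.Module):
    def __init__(self, vocab=32, emb=16, hidden=32):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.decoder = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, src, tgt):
        _, context = self.encoder(self.embed(src))      # context variable c
        dec_out, _ = self.decoder(self.embed(tgt), context)
        return self.out(dec_out)                        # logits per time step

model = Seq2Seq()
src = torch.randint(0, 32, (2, 7))   # batch of 2 source sequences
tgt = torch.randint(0, 32, (2, 5))   # shifted target sequences
print(model(src, tgt).shape)         # torch.Size([2, 5, 32])
```

Applying softmax over the last dimension of the logits yields the conditional distribution over the next token at each step.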
1. Perceptron
5. Transformer Architecture
Attention Mechanism
Attention is derived from human cognitive capabilities. For example, when people observe a scene or deal with an event, they usually pay attention to the salient objects in the scene and try to grasp the main contradiction of the event. The attention mechanism enables people to focus on important aspects while ignoring minor ones, increasing their efficiency.
Nature of Attention
Attention Principle
⚫ Attention can be used without the encoder-decoder framework. The following figure shows the
principle of attention without the encoder-decoder framework.
(Figure: attention as weighted reading — "The Avengers movie is awesome. I want to learn more. Find related books. Learn about Marvel's background." A reader allocates most attention to Marvel movie books (read carefully) and little to World War II books (read quickly).)
• From the preceding modeling, we can see that the idea of attention is simple: weighted summation. As a rough analogy, humans learn a new language in four stages: rote learning (learning grammar by recitation) -> outlining points (grasping key words in sentences) -> generalizing (understanding the context and the relationships behind the language in complex conversations) -> reaching the peak of perfection (immersive practice).
• This is similar to the development of attention. The RNN era was a period of rote learning. The attention model then evolved into the Transformer, which has excellent representation learning capabilities, and later into GPT and BERT, which accumulated a wealth of practical experience through multi-task, large-scale learning.
Advantages of Attention
⚫ Fewer parameters: Compared with CNN and RNN, this model is less complex and has fewer parameters.
As such, it needs less computing power.
⚫ Higher speed: Attention addresses the fact that RNN cannot perform parallel computing. The steps in the
Attention mechanism do not depend on the previous step. As such, the Attention mechanism supports
parallel processing, just like CNN.
⚫ Better results: Before attention was introduced, models struggled to remember older information, just as people with short memories cannot recall the distant past.
⚫ The Transformer model uses a whole new formula and rejects the CNN
and RNN structures. It uses the Attention mechanism to automatically
capture the relative associations at different positions of the input
sequence, which is conducive to processing long texts. In addition, the
model enables a high degree of parallelism, speeding up training.
⚫ Transformer uses the self-attention mechanism. As the name implies, self-attention refers to
the attention mechanism used between source elements or target elements, rather than
between the source and the target. The two mechanisms use the same computing process, but
the object to be computed is different.
If you want to know what "its" refers to and which words are relevant to
"its" in this sentence, you can use "its" as a query, this sentence as a key
and a value to calculate the value of attention. Through self-attention,
we find that "its" is most relevant to "law" and "application."
⚫ The self-attention mechanism focuses more on its own relationships. Therefore, the encoder and decoder of the self-attention
mechanism can be used separately, which further generates different types of network structures.
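The self-attention computation described above can be sketched directly: queries, keys, and values are all projections of the same sequence, and the scaled dot-product scores weight the values. The sequence length and model dimension below are illustrative.

```python
import torch
import torch.nn.functional as F

def self_attention(x, Wq, Wk, Wv):
    # Q, K, V all come from the same sequence x — that is what makes
    # this "self"-attention rather than source-target attention.
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # position-to-position relevance
    weights = F.softmax(scores, dim=-1)            # attention distribution
    return weights @ V                             # weighted sum of values

torch.manual_seed(0)
seq_len, d_model = 6, 8                            # illustrative sizes
x = torch.randn(seq_len, d_model)
Wq, Wk, Wv = (torch.randn(d_model, d_model) for _ in range(3))
out = self_attention(x, Wq, Wk, Wv)
print(out.shape)  # torch.Size([6, 8])
```

In the "its" example, the attention weights row for the query position of "its" would concentrate on the positions of "law" and "application."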
100 Copyright © Huawei Technologies Co., Ltd. All rights reserved.
Contents
1. Perceptron
5. Transformer Architecture
• Transformer is used instead of LSTM because, although pre-training helps capture some language information, the LSTM model limits the prediction capability to the short term. Transformer enables us to capture longer-term language structures while achieving better generalization performance across different tasks.
• For positional encoding, GPT-1 differs sharply from the standard Transformer. The standard Transformer represents positions with sine and cosine functions, while GPT-1 uses random initialization similar to that of a word vector and updates it during training; that is, each position is treated as an embedding to be learned.
BERT
⚫ Released in 2018, BERT is a bidirectional model that analyzes the context of a complete sequence and then performs predictions. The model is trained on a plain text corpus and Wikipedia using 3.3 billion words and has 340 million parameters.
⚫ BERT can answer questions.
(Figure: BERT architecture — stacked Transformer (Trm) blocks built from multi-head attention, feed forward, and add & norm layers, with positional encoding at the input, producing outputs T1 T2 … Tn.)
(Figure: multi-task processing — instead of a separate model per task (Task A -> Model A -> Result A, Task B -> Model B, …, Task N), one comprehensive model processes multiple tasks and produces the multiple task results, Result A … Result N.)
• In actual data distribution, there are many natural subsets, such as different domains,
topics, languages, and modalities. When a single model is used for learning and the model
capacity is small, different subsets interfere with model fitting, causing slow model training
and difficult generalization.
• For a conventional learning model, training is intended to enable a final model to execute multiple tasks in different scenarios. However, when the model updates its weights for one scenario, the training also affects its weights for other scenarios. This mutual impact of the weights is called the interference effect. The stronger the interference effect, the slower the model learns and the poorer it generalizes.
MoE
⚫ Mixture-of-Experts (MoE) is a model design strategy that combines multiple models ("experts") to
achieve better prediction performance.
⚫ MoE can effectively improve the capacity and efficiency of a foundation model. Generally, MoE uses a gating mechanism to select specific experts for each input and to combine and balance the experts' outputs, ultimately determining the final prediction.
• Model size is one of the key factors for improving model performance, which is why today's foundation models are so effective. With a limited computing resource budget, training a larger model for fewer training steps is usually better than training a smaller model for more steps.
• A significant advantage of MoE is that effective pre-training can be performed with far fewer computing resources than a dense model requires. This means you can significantly scale up the model or dataset under the same computational budget. Especially in the pre-training phase, an MoE model can usually reach the same quality level faster than a dense model.
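The gating idea can be shown with a minimal sketch. This dense version weights every expert's output by the gate's softmax; production MoE layers (e.g., Switch-style) route each token to only the top-k experts for sparsity. All sizes and the class name are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    """Minimal MoE sketch: a gate weights the outputs of several experts."""
    def __init__(self, d_in=4, d_out=2, n_experts=3):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(d_in, d_out) for _ in range(n_experts))
        self.gate = nn.Linear(d_in, n_experts)

    def forward(self, x):
        weights = F.softmax(self.gate(x), dim=-1)                   # [batch, n_experts]
        outputs = torch.stack([e(x) for e in self.experts], dim=1)  # [batch, n_experts, d_out]
        # Combine expert predictions, weighted by the gate.
        return (weights.unsqueeze(-1) * outputs).sum(dim=1)

moe = TinyMoE()
y = moe(torch.randn(5, 4))
print(y.shape)  # torch.Size([5, 2])
```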
Quiz
1. (Single-choice question) Which of the following is not a deep learning neural network? ( )
A. CNN
B. RNN
C. LSTM
D. Logistic
• 1.D
Quiz
A. Activation function
B. Convolution kernel
C. Pooling
D. Fully-connected layer
• 2.ABCD
Quiz
3. (True or false) Compared with the RNN, the CNN is more suitable for image recognition. (
)
A. True
B. False
• 3.A
Quiz
A. True
B. False
• B
Summary
⚫ This chapter discusses the training process of deep neural networks, including
multiple network architectures developed from conventional artificial neural
networks. Finally, it describes the infrastructure for building large models.
⚫ Recent decades have seen explosive growth in the research and application of deep
learning. This has given rise to the third tide of Artificial Intelligence (AI)
development, achieving particular success in image recognition, speech recognition
and synthesis, autonomous driving, and machine vision. These advances require
more advanced algorithms and underlying frameworks.
⚫ Ongoing development in deep learning frameworks has made it easier to use
extensive computational resources to train neural network models on large datasets.
This section describes the use of AI frameworks and the process of developing AI
applications.
1. AI Framework
◼ Functions of AI Development Frameworks
Introduction to Mainstream AI Development Frameworks
3. PyTorch
• The emergence of deep learning frameworks lowers the bar for developers. Developers no longer need to write code for complex neural networks and backpropagation algorithms from scratch. Instead, they can use existing models and configure parameters as required, with the model parameters trained automatically. Moreover, you can add self-defined network layers to existing models, or select the required classifiers and optimization algorithms.
Functions of the AI Development Framework
⚫ Model development process: data preparation and processing -> model building -> model debugging -> model training -> accuracy evaluation -> model conversion -> inference deployment, with iterative optimization throughout.
⚫ Accelerates scientific research and development
▫ Provides easy-to-use and extensive programming APIs
▫ Allows developers to complete model development with ease
▫ Unifies data and other APIs, making scientific research and experiments easier
⚫ Enables stable and high-performance training
▫ Enables full-stack software-hardware synergy for high-performance training
▫ Provides functions such as automatic differentiation and automatic parallelization
▫ Optimizes model compilation and training
⚫ Makes inference deployment easier
▫ Provides hardware access capabilities
▫ Simplifies technical details of the underlying hardware
▫ Solves difficulties in model adaptation and deployment
(Figure: the AI development framework takes a model structure and handles forward computation, backward computation, graph optimization, and hardware-based training, so developers can quickly implement algorithms without worrying about the underlying logic.)
The AI framework usually further optimizes the computational graph to improve the
computational efficiency.
It also schedules the graph and manages the memory through the computational
graph execution process.
• After the model structure is built, the AI framework converts the model into a
computational graph and further optimizes the computational graph before execution.
Model Training and Deployment
1. AI Framework
Functions of AI Development Frameworks
◼ Mainstream AI Development Frameworks
3. PyTorch
⚫ Disadvantages:
Weak community ecosystem: Compared with deep learning frameworks such as PyTorch, JAX has a
relatively small community ecosystem and fewer documentation resources.
1. AI Framework
3. PyTorch
# Download and load the training set (when it is not stored locally).
# Transform is the transformation operation performed on data.
trainset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
Image data is usually a 4-dimensional tensor in the [N,H,W,C] format. N indicates the batch data size,
H and W indicate the pixel height and width of the image, and C indicates the number of channels.
◼ In PyTorch, the default format of received image data tensors is [N,C,H,W].
◼ In TensorFlow, the default format of received image tensors is [N,H,W,C].
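Converting between the two layouts is a single permute. The batch and image sizes below are illustrative.

```python
import torch

# Convert a batch of images between the two tensor layouts.
nhwc = torch.randn(8, 224, 224, 3)        # TensorFlow-style [N, H, W, C]
nchw = nhwc.permute(0, 3, 1, 2)           # PyTorch-style   [N, C, H, W]
print(nchw.shape)                          # torch.Size([8, 3, 224, 224])
print(nchw.permute(0, 2, 3, 1).shape)      # back to [8, 224, 224, 3]
```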
⚫ Some frameworks define data formats that can be linearly read. This method is particularly suitable for streaming data over the network, and is also useful for pre-processing when buffering any data.
▫ Example: TFRecord in TensorFlow.
▫ These data formats reduce disk I/O and network I/O overhead.
(Figure: a data file composed of scalar and block data pages, plus a separate index file.)
Inference model files support two types of data: training parameters and network models.
▫ Checkpoint (.ckpt): Uses the Protocol Buffers format and stores all parameter values of the network by default; it can also be configured to store the model structure and training status. Generally used to resume training after a training task is interrupted, or to fine-tune a task after training.
▫ ONNX (.onnx): The Open Neural Network Exchange (ONNX) is a general representation for machine learning models. Normally used for model migration between different frameworks or on an inference engine (e.g., TensorRT).
▫ bin (.bin): A common binary format used to save and load various models and data. Used when a PyTorch model needs to be converted into a common binary format.
▫ .pt or .pth: The default model file format of PyTorch, used to save and load a complete PyTorch model, including the model structure and parameters. Used, for example, to save the optimal model during training, or to load the trained model during deployment.
1. AI Framework
3. PyTorch
(Figure: a computational graph — Tensor X and Tensor W feed a Matmul node, whose result passes through Sigmoid to produce Tensor Y.)
• A node represents a variable or operation. The input nodes represent input data or
parameters, and the intermediate node and the output node represent the calculation
result.
• The nodes are connected with edges, indicating the data flow or computation
dependencies. Each edge usually connects output from one operation to input of another
operation.
From Front-end Language to a Computational Graph
⚫ A computational graph is a way to represent mathematical functions with the graph theory
language. It is a standardized method used by deep learning frameworks to express neural
network models.
Front-end programming languages and interfaces: build model
-> Intermediate model representation: computational graph IR (DAG)
-> Auto differentiation, graph optimization
-> Optimization, scheduling, and compilation

class _DenseLayer(nn.Module):
    def __init__(
        self, num_input_features: int, growth_rate: int,
        bn_size: int, drop_rate: float, memory_efficient: bool = False
    ) -> None:
        super().__init__()
        self.norm1 = nn.BatchNorm2d(num_input_features)
        self.relu1 = nn.ReLU(inplace=True)
        ...
    def forward(self, x):
        ...
# Data initialization
x = tf.Variable(tf.random.normal([512, 1024]))  # Batch size 512
W = tf.Variable(tf.random.normal([1024, 1024]))
iterations = 100
# Time the dynamic (eager) mode.
start_time = time.perf_counter()
for _ in range(iterations):
    _ = tf.matmul(x, W)  # Immediate execution mode
dynamic_duration = time.perf_counter() - start_time
# Time the static (graph) mode (TF 1.x style; assumes a Session `sess`
# and a graph output `y` were defined beforehand).
start_time = time.perf_counter()
for _ in range(iterations):
    sess.run(y, feed_dict={x: input_data.eval()})
static_duration = time.perf_counter() - start_time
PyTorch:
▫ The dynamic graph mode is used by default.
▫ You can use the torch.jit.trace and torch.jit.script functions to improve efficiency with static graphs.
TensorFlow:
▫ In TF 1.x, the static graph mode is used by default. You can run tf.enable_eager_execution() to manually enable the dynamic graph mode.
▫ In TF 2.x, the dynamic graph mode is used by default. You can use tf.function to implement static graphs.
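The torch.jit.trace path mentioned above can be sketched briefly: tracing records the operations executed on example inputs and freezes them into a static graph that can be optimized and serialized. The function and tensor sizes are illustrative.

```python
import torch

def f(x, w):
    # A tiny forward computation to trace.
    return torch.sigmoid(x @ w)

x = torch.randn(4, 8)
w = torch.randn(8, 8)

# trace runs f once on the example inputs and records the operations
# into a static graph; the traced module is callable like the original.
traced = torch.jit.trace(f, (x, w))
print(torch.allclose(traced(x, w), f(x, w)))  # True
```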
1. AI Framework
3. PyTorch
◼ Basic Modules
Implementation of the LeNet Network Structure
torch.autograd PyTorch's automatic differentiation engine, which supports gradient calculation and is mainly used for gradient
backpropagation after the loss function is obtained.
torch.optim Provides multiple optimization algorithms for model parameter updates and optimization.
torch.device Specifies the device on which tensors and models should run, allowing switching between the CPU and the GPU.
torch.distributions Provides a series of classes to enable PyTorch to sample different distributions and generate computational graphs for
the probabilistic sampling process.
torch.jit PyTorch's instant compiler module, which can convert dynamic graphs into static ones for optimization and
serialization.
torch.random Provides a series of methods to save and set the status of the random number generator, which helps debug the
structure and performance of neural networks.
torch.onnx Defines the deep learning model description files exported and loaded by PyTorch in the ONNX format, enabling
model exchanges with other deep learning frameworks.
# Define a Dataset. The following uses the open-source data in PyTorch as an example.
trainset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
# Use DataLoader to load the dataset.
trainloader = DataLoader(trainset, batch_size=64, shuffle=True)
class MLP(nn.Module):
    # Declare layers with model parameters. Here, two fully connected layers are declared.
    def __init__(self, **kwargs):
        super(MLP, self).__init__(**kwargs)
        self.hidden = nn.Linear(784, 256)
        self.act = nn.ReLU()
        self.output = nn.Linear(256, 10)

    # Define the forward computation: how the output is computed from the input.
    def forward(self, x):
        return self.output(self.act(self.hidden(x)))
⚫ You do not need to define a backpropagation function. The system automatically generates the
backward function required for backpropagation by automatically calculating the gradient.
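Automatic gradient generation can be seen on a one-line function: define the forward computation on a tensor that requires gradients, call backward(), and read the gradients that autograd computed.

```python
import torch

# Autograd builds the backward pass for us: no manual derivative code.
x = torch.tensor([2.0, 3.0], requires_grad=True)
y = (x ** 2).sum()          # y = x1^2 + x2^2
y.backward()                # autograd computes dy/dx
print(x.grad)               # tensor([4., 6.]) since dy/dx_i = 2 * x_i
```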
# Input data (Assume that the number of input images is batch_size, the number of input
channels is in_channels, and the size is height x width.)
input_tensor = torch.randn(batch_size, in_channels, height, width)
nonlinearity: nonlinear activation function. The value is tanh or relu. The default value is tanh.
batch_first: If the value is True, the first dimension of the input and output is the batch size.
1. AI Framework
3. PyTorch
Basic Modules
◼ Implementation of the LeNet Network Structure
◼ It has two convolutional layers. The first one uses # Define the LeNet network structure.
class LeNet(nn.Module):
six 5x5 convolution kernels with a stride of 1 and a
def __init__(self):
padding of 2, and outputs six feature maps. The super(LeNet, self).__init__()
second one uses 16 5x5 convolution kernels with a # Define a two-dimensional convolutional layer.
self.conv1 = nn.Conv2d(1, 6, kernel_size=5, stride=1, padding=2)
stride of 1 and no padding, and outputs 16 feature self.conv2 = nn.Conv2d(6, 16, kernel_size=5, stride=1)
maps. # Define the fully connected layer.
self.fc1 = nn.Linear(16*5*5, 120)
◼ The average pooling layer is used to reduce the self.fc2 = nn.Linear(120, 84)
dimensions of the feature maps. self.fc3 = nn.Linear(84, 10)
# Define the pooling layer.
◼ ReLU is used after the convolutional layers, and self.pool = nn.AvgPool2d(kernel_size=2, stride=2)
sigmoid or softmax (at the output layer) is used # Activation function
self.relu = nn.ReLU()
after the fully connected layer. self.sigmoid = nn.Sigmoid()
running_loss += loss.item()
if i % 100 == 99: # Print the loss every 100 batches.
print(f'[Epoch {epoch+1}, Batch {i+1}] loss: {running_loss/100:.3f}')
running_loss = 0.0
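The loss-printing snippet above sits inside a standard training loop. A minimal self-contained sketch of that loop follows; the model, random stand-in batch, and hyperparameters are illustrative assumptions, not the lab's exact setup.

```python
import torch
import torch.nn as nn

# Minimal training loop sketch around the standard four steps:
# zero gradients -> forward + loss -> backward -> optimizer step.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

images = torch.randn(64, 1, 28, 28)       # stand-in for one batch
labels = torch.randint(0, 10, (64,))

for epoch in range(2):
    optimizer.zero_grad()                  # clear old gradients
    loss = criterion(model(images), labels)
    loss.backward()                        # backpropagation
    optimizer.step()                       # parameter update
    print(f'[Epoch {epoch+1}] loss: {loss.item():.3f}')
```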
1. AI Framework
3. PyTorch
Requirement
analysis
https://PyTorch.org/get-started/locally/
File structure
Data Pre-processing
⚫ To improve the model's accuracy and ensure the generalization capability, you need to perform data
augmentation and standardization before using the data to train the model.
⚫ Load and process the dataset, including:
Reading the dataset
Defining parameters required for data augmentation and processing
Generating data augmentation operations according to the parameters
Processing the generated dataset
⚫ Parameter tuning: You can adjust different hyper-parameter combinations during training.
⚫ For details about the code, see the lab guide.
Model inference
• As the batch size and sequence length increase, the KV cache also increases, causing great
pressure on memory access.
Model Compression Method - Model Pruning
Pruning is an important model compression technology. The basic idea is to remove less
important weights and branches from a model and sparse the network structure to obtain a
model with fewer parameters. However, pruning may also cause model performance
deterioration. As such, you must find a balance between model size and performance. Neuron
connections in a neural network are mathematically represented as a weight matrix. This means
that pruning changes some elements in the weight matrix to zero elements. The following figure
shows the pruning process. The purpose is to remove less important synapses or neurons.
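Setting weight-matrix elements to zero can be sketched with simple magnitude pruning: zero out the half of the weights with the smallest absolute value. The matrix size and 50% sparsity target are illustrative.

```python
import torch

torch.manual_seed(0)
weight = torch.randn(4, 4)

# Magnitude pruning sketch: weights whose absolute value is at or below
# the median are considered "less important" and set to zero.
threshold = weight.abs().flatten().median()
mask = weight.abs() > threshold
pruned = weight * mask

print((pruned == 0).float().mean())  # 0.5 — half of the entries are now zero
```

Real pruning pipelines then fine-tune the sparse model to recover accuracy; torch.nn.utils.prune offers ready-made versions of this idea.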
1. (Multiple-choice question) Which of the following hardware devices can be used for PyTorch
training?
A. CPU
B. GPU
C. TensorFlow
D. CUDA
• Answer: AB
Recommendations
⚫ https://pytorch.org/
⚫ This chapter describes the AI business construction process, how to train a large
model (the entire training and fine-tuning process), and how to efficiently use a large
model.
1. AI Business Process
(Figure: AI business process — business requirements -> data collection -> data preprocessing -> model selection and design -> model training -> model evaluation -> model deployment.)
• https://support.huaweicloud.com/function-modelarts/index.html#
Requirement Analysis
• Personalized news push: news content push based on user interest. News reading habits
and user profiles are crucial. Common methods include content-based recommendations,
collaborative filtering, and deep learning sequence models.
• Security
• Target recognition, including identity authentication, social media filtering, and security
management. Data sets such as Labeled Faces in the Wild (LFW) and MS-Celeb-1M are
used. Algorithms include convolutional neural network (CNN), FaceNet, and ArcFace.
• Image recognition and classification, including object recognition and medical image analysis. Datasets such as ImageNet, Common Objects in Context (COCO), and ChestX-ray14 are used to train and test models. Deep learning frameworks such as ResNet, VGG, and the Inception series of models are used.
• Autonomous driving, including road sign recognition, pedestrian detection, and obstacle
avoidance. KITTI, Cityscapes, and Waymo Open Dataset are popular for autonomous
driving. These datasets use object detection methods like YOLO, SSD, and Faster R-CNN.
They also use 3D sensing methods like PointNet and RangeNet++, which work with lidar
data.
• Image generation
Natural Language Processing (NLP)
Conversational bot
Sentiment analysis
Machine translation
Speech synthesis
• Chatbot, including customer service assistant and smart speaker. The training dataset may
contain many dialog records and encyclopedia entries. Common algorithms are the
sequence-to-sequence (Seq2Seq) model, Transformer architectures like BERT and GPT, and
reinforcement learning methods.
• Sentiment analysis, including analysis of the sentiment tendency of user comments and
social media posts. The IMDb movie review dataset and Twitter sentiment analysis dataset
are commonly used benchmark datasets. Common algorithms include the bag of words
(BOW) model, TF-IDF, and deep learning models such as TextCNN, LSTM, and bidirectional
LSTM.
• Machine translation, that is, real-time translation between different languages. The
Workshop on Machine Translation (WMT) provides multiple language pair datasets for
training translation models. Google's transformer model and its variants, like Transformer-
XL and BERT-based architectures, are the mainstream algorithms in this field.
• Speech recognition and generation, including converting speech into text or text into
speech. They are the basis for voice assistants and automatic speech translation systems.
Data Preparation (1)
⚫ Data collection
Recommendation system: Collect users' historical behavior data, which may include users' purchase
history, browsing records, and rating data. Public datasets can be used, for example, the MovieLens
movie rating dataset.
CV: Obtain diversified image or video data in related fields to cover various objects, scenarios, and
conditions that the model should recognize. Consider using public datasets like ImageNet, COCO, or
PASCAL VOC as the starting point.
NLP: Use crawlers to obtain related text from the Internet, or obtain text from public datasets, books,
news, and social media platforms. In addition, questionnaires are also an effective way to collect
specific data.
Forward: z_2 = W_1 X, a_2 = g_1(z_2), z_3 = W_2 a_2, a_3 = g_2(z_3), C(W) = ½ (y − a_3)²
Backward propagation (chain rule):
∂C/∂W_1 = (∂z_2/∂W_1)(∂a_2/∂z_2)(∂z_3/∂a_2)(∂a_3/∂z_3)(∂C(W)/∂a_3)
∂C/∂W_2 = (∂z_3/∂W_2)(∂a_3/∂z_3)(∂C(W)/∂a_3)
Model Evaluation (1)
⚫ Classification task evaluation metrics
Accuracy: It is one of the most basic evaluation metrics. It measures the ratio of the number of samples that are
correctly predicted by the model to the total number of samples.
Precision: In a classification task, precision is the proportion of true positive predictions among all positive
predictions made by the model.
Recall: For binary or multi-class classification, recall is the proportion of actual positive samples that the model correctly predicts as positive, reflecting the model's ability to find positive samples.
F1-Score: The F1-Score is the harmonic mean of precision and recall. It combines precision and recall. A higher
F1 value indicates that the model achieves a better balance between precision and recall.
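The four metrics can be computed directly from a toy confusion matrix; the labels below are made up for illustration.

```python
# Toy binary labels: compute accuracy, precision, recall, and F1.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives: 3
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives: 1
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives: 1

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision = tp / (tp + fp)          # correct among predicted positives
recall = tp / (tp + fn)             # found among actual positives
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(accuracy, precision, recall, f1)  # 0.75 0.75 0.75 0.75
```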
• The trained model file format in PyTorch is PTH, while ONNX is an exchange format
developed by AI companies.
Contents
1. AI Business Process
▫ Model Fine-tuning
▫ RLHF
▫ Model Evaluation
(Figure: large model business process — business requirements -> data collection -> data preprocessing -> large model selection and design -> model deployment.)
• https://support.huaweicloud.com/intl/en-us/function-modelarts/index.html
Large Models Are More Complex with More Parameters
⚫ Large models often have billions or even more parameters and can be hundreds of gigabytes or larger.
Large models based on Transformer use the self-attention mechanism, which brings a large number of
parameters. The self-attention mechanism allows the model to independently consider each position in the
input sequence and flexibly capture long-distance dependencies, which further increases the demand for model
parameters.
For complex, high-dimensional data distribution, more parameters help the model fit better, reduce bias, and
improve training performance. In particular, when processing tasks such as natural language, the model needs
enough capacity to understand and generate various sentence structures and meanings, so as to better cope
with the richness and complexity of language.
⚫ For example, it is estimated based on documentation that GPT-3 has about 175 billion parameters.
LLaMA2 by Meta AI comes in three sizes: 7 billion (7B), 13 billion (13B), and 70 billion (70B) parameters.
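• As a rough sanity check on these numbers, a decoder-only Transformer's parameter count can be approximated from its layer count and hidden size. The 96-layer, 12288-hidden configuration below matches GPT-3's published one; the 12·d² per-layer rule of thumb and the vocabulary size are simplifying assumptions:

```python
def transformer_params(layers, hidden, vocab=50257):
    """Rough decoder-only parameter estimate: each layer has ~4*d^2
    (attention projections) + 8*d^2 (MLP with 4x expansion), plus an
    embedding matrix of vocab*d. Biases and norms are ignored."""
    per_layer = 12 * hidden * hidden
    return layers * per_layer + vocab * hidden

est = transformer_params(96, 12288)   # GPT-3-like configuration
print(f"~{est / 1e9:.0f}B parameters")  # close to the reported 175B
```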
(Table: training data by model)
Data processing: filtering and deduplication
Training data — ratio: 60%, 22%, 8%, 8%, and 3% | 67%, 15%, 4.5%, 4.5%, 2.5%, and 2.0% | ......
Training data — tokens: 300B | 1.4T | 400B
• Datasets need to be cleansed to remove errors and duplicates. The original GPT-3 data was
45 TB, but only 570 GB remained after filtering. The ratio of data used varies according to
the data quality.
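• A simplified sketch of such cleansing, assuming exact-hash deduplication and a length heuristic as the quality filter (real pipelines use far more sophisticated rules; the threshold below is invented for illustration):

```python
import hashlib

def clean_corpus(docs, min_words=5):
    """Drop exact duplicates (by content hash) and too-short documents."""
    seen = set()
    kept = []
    for doc in docs:
        text = " ".join(doc.split())          # normalize whitespace
        if len(text.split()) < min_words:     # quality filter (length heuristic)
            continue
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:                    # exact-duplicate filter
            continue
        seen.add(digest)
        kept.append(text)
    return kept

docs = ["the cat sat on the mat today",
        "the  cat sat on the mat today",   # duplicate after normalization
        "too short"]
print(clean_corpus(docs))  # only one document survives
```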
Diversity of Large Model Training Data
⚫ Training large models needs large-scale, high-quality, and multimodal datasets. This data typically needs
to be collected from various domains and sources, and can be texts, images, audio, and videos.
(Figure: the training stages — pre-training and supervised fine-tuning both use language
modeling to predict the next token; reward model training is a classification task over
(prompt, winning_response, losing_response) triples; reinforcement learning generates
responses that maximize the reward.)
• During pre-training, we use huge amounts of data from sources like web pages, Wikipedia,
books, GitHub, papers, and Q&A sites. This data contains hundreds of billions to trillions of
words. We train the model using a supercomputer with thousands of powerful GPUs and
fast networks. It takes dozens of days to complete this training and build the basic language
model (base model). The base model develops the capability to model long texts, enabling it
to generate language: based on the input prompt, it can produce text completions.
Pre-training Datasets
⚫ Data sources of large models are diverse, mainly including the following:
Public data on the Internet: The Internet holds vast amounts of public data, such as texts, images, and videos.
Developers can crawl these data on the Internet to build their datasets. Such data may come from sources like
news sites, social media, blogs, and forums.
Professional databases and data released by institutions: Many professional institutions, databases, and
academic organizations release organized and processed datasets. These datasets usually have high quality and
reliability. Large models can use these data for training to improve their performance and accuracy.
Internal enterprise data: For some customized large models, enterprises may use their internal data to build
datasets. Such data may include enterprise business data, user behavior data, and product data, helping the
model better adapt to the actual needs of enterprises.
User-generated content: With the popularity of social media and online platforms, user-generated content (such
as comments, ratings, and feedback) has also become an important source of large model datasets. Such data
reflects users' real needs and preferences, helping the model better understand and meet their needs.
▫ The number of computing units decides how the parameters are split. The data also
needs to be divided for matrix calculations.
▫ The number of computing units decides how the weight matrix w is split. In this case,
the input data X does not need to be split.
Pipeline Parallelism
⚫ The following figure shows pipeline parallelism. Assume that there are four NPUs. The parameters of the large
model are divided into four parts by layer and placed on devices 0, 1, 2, and 3 in sequence. For a minibatch of data,
forward propagation is first performed on device 0 (the gray block F0 in the figure). After the calculation is
complete, the result of device 0 is sent to device 1 to continue forward propagation (the yellow block F0). After
forward propagation completes on the last NPU, device 3, the loss is calculated there and backward propagation
begins (the purple block B0). Backward propagation proceeds back until device 0 completes, after which the
optimizer updates the model parameters on each card.
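• The schedule described above can be sketched as a toy timetable. The device count, micro-batch count, and the naive all-forwards-then-all-backwards (GPipe-style) ordering are illustrative assumptions, not any real framework's scheduler:

```python
def pipeline_schedule(devices=4, microbatches=2):
    """Return sorted (time_step, device, phase, microbatch) events for a
    naive GPipe-style schedule: all forwards first, then all backwards."""
    events = []
    # forward: micro-batch m reaches device d at step d + m
    for m in range(microbatches):
        for d in range(devices):
            events.append((d + m, d, "F", m))
    start_b = devices - 1 + microbatches  # backward begins after last forward
    # backward: micro-batch m runs on device d at step start_b + (devices-1-d) + m
    for m in range(microbatches):
        for d in reversed(range(devices)):
            events.append((start_b + (devices - 1 - d) + m, d, "B", m))
    return sorted(events)

for step, dev, phase, mb in pipeline_schedule():
    print(f"t={step}: device {dev} runs {phase}{mb}")
```

The idle steps visible in the printout (e.g. device 0 waits between F and B) are the "pipeline bubble" that more advanced schedules such as 1F1B try to shrink.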
1. AI Business Process
▫ Supervised Fine-tuning
▫ Reinforcement Learning
▫ Model Evaluation
1. AI Business Process
▫ Supervised Fine-tuning
▫ RLHF
▫ Model Evaluation
• Why use instruction tuning: Because GPT's outputs can be random, inaccurate,
uninteresting, or harmful due to its training on large datasets.
• RLHF is a standard technology used to make sure LLMs create real, harmless, and helpful
content. However, human communication is a personal and creative act, and the value of
LLM output largely depends on human values and preferences. Different models are
trained in various ways and use different human feedback, leading to varied results. How
well each model matches human values depends on its creator.
Reward Model (RM) Training
⚫ Randomly select questions from the question library. Use the model generated after
supervised instruction fine-tuning (the SFT model) to generate multiple responses for
each question.
⚫ Labeling personnel rank the responses based on a thorough evaluation. This process is
like getting guidance from a mentor or teacher.
⚫ Then, use the ranking results to train the reward model. Multiple ranking results are
paired to form multiple training data pairs.
⚫ The RM takes an input and gives a score to rate the response's quality. For each training
pair, its parameters are adjusted to make sure high-quality responses get higher scores
than low-quality ones.
(Figure: for the question "What is a banana?" selected from the question library, the SFT
model repetitively generates answers 4 times — A: A sour fruit... B: A piece of decoration...
C: Something monkeys love to eat... D: Bananas are yellow... — the answers are manually
sorted, and the sorting result is used to train the reward model.)
• Data preparation: First, prepare a batch of data for training the reward model. This data
should include text samples and their human evaluation scores. You can get these scores
from crowdsourcing platforms or expert reviews, as they reflect how people perceive text
quality.
• Model construction: Next, build a neural network as the reward model. You can use a
simple multilayer perceptron (MLP) or a more complex model like the Transformer. The key
is to ensure the model can effectively capture human evaluation criteria.
• Training process: Use the prepared data to train the reward model. The training target is to
minimize the error between the model's predicted scores and the actual human evaluation
scores. You can achieve this with optimization methods like gradient descent.
• Validation and tuning: Validate the model during training to ensure it can accurately
predict human evaluations on new data. If the model performs poorly, we may need to
adjust its structure or training method.
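• The pairwise training objective described above can be sketched as follows, assuming the commonly used Bradley-Terry-style ranking loss; the scalar scores below are placeholders for the outputs of a real scoring network:

```python
import math

def pairwise_loss(score_winner, score_loser):
    """-log(sigmoid(r_w - r_l)): near zero when the winning response
    already scores higher, large when the ranking is violated."""
    return -math.log(1.0 / (1.0 + math.exp(-(score_winner - score_loser))))

good = pairwise_loss(2.0, -1.0)   # ranking respected -> small loss
bad = pairwise_loss(-1.0, 2.0)    # ranking violated  -> large loss
assert good < bad
```

Minimizing this loss over many (winning_response, losing_response) pairs pushes the reward model to score preferred answers above rejected ones.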
Reinforcement Learning
⚫ Reinforcement learning considers the overall impact more than supervised learning.
Supervised learning gives feedback on individual tokens to help the model give the right answer. Reinforcement
learning gives feedback on the whole output instead of individual tokens. Reinforcement learning works better
for large language models due to the difference in feedback granularity.
⚫ Reinforcement learning is more effective at addressing hallucination issues.
In supervised learning, the model gives a result even if it doesn't know the answer. In reinforcement learning,
the model avoids low-scoring answers by using reward scores.
⚫ Reinforcement learning better handles reward accumulation issues in multi-round conversations.
Building effective multi-round dialog interactions and determining if the final goal is met is hard with supervised
learning. But using reinforcement learning, we can create a reward function that assesses the model's output
based on the dialog's coherence and context.
1. AI Business Process
▫ Supervised Fine-tuning
▫ RLHF
▫ Model Evaluation
• For example, when given only the LeetCode problem number, the model outputs the
correct solution. This suggests that the training data was contaminated.
• For example, our recent study, PromptBench, is the first evaluation benchmark for
prompt robustness in LLMs. We found that these models often struggle with interference
and lack stability. This led us to improve the system's fault tolerance through better
prompts.
• Large models keep evolving and becoming more powerful. Can we assess their abilities
from an evolutionary viewpoint using well-designed and scientific methods? How can we
predict potential risks in advance? These are important research topics.
Evaluation Dimensions
⚫ Large language models can handle many complex natural language tasks, unlike traditional algorithms.
Therefore, model evaluation must be performed from multiple dimensions. The evaluation dimensions
can be classified into the following three aspects:
Knowledge and capability: Large language models have rich knowledge and can handle many tasks, including
natural language processing (text classification and information extraction), knowledge Q&A (reading
comprehension and open-domain Q&A), natural language generation (text summarization and text creation),
logical reasoning, and code generation.
Ethics and security: Models are trained based on the 3H principles: helpfulness, honesty, and harmlessness.
Harmlessness ensures that the model's responses match human values.
Vertical domains, mainly covering complex reasoning capability (such as knowledge reasoning and mathematical
reasoning), environment interaction capability (for example, an agent generates actions based on language
instructions for home tasks), and problem solving capability in specific domains (such as finance, law, and
healthcare).
• https://3ms.huawei.com/km/blogs/details/14813656
Challenges in Large Model Evaluation (1)
⚫ Diverse foundation model capabilities make evaluation challenging
It is challenging to make comprehensive and fair evaluations on diverse foundation model
capabilities. For example, Stanford evaluates models across 42 scenario tasks in 7 dimensions,
SuperCLUE defines 70 scenario tasks, C-Eval selects 52 scenario tasks, and AGIEval chooses 20
standardized test tasks.
1. AI Business Process
...
• Provide as clear and specific instructions as possible to express what you want the model to
do. This will guide the model to create content as you expect and reduce the chance of
getting irrelevant or wrong answers.
• Most large models have emergent capabilities, can solve relatively complex problems, and
have a very broad knowledge coverage.
• They can understand and analyze natural language input from users and provide services
and assistance in different fields and tasks.
1. AI Business Process
Q: <Question>?
A:
Output: The description includes potential negative impacts of apples without directly mentioning any positive aspects of apples. The following is an analysis of
the positive and negative evaluations of the description:
1. Negative evaluation
• "Certain substances in apples may cause allergies in consumers": This points out the potential health risks apples may pose to certain groups, which
is a clear negative evaluation.
• "Leading to symptoms such as rashes and difficulty breathing": It lists the symptoms of allergic reactions in detail, further reinforcing the potential
negative health impacts of apples.
• "Excessive fructose and dietary fiber may cause digestive symptoms such as bloating and diarrhea": This again emphasizes the potential side effects
of apple consumption, which is a negative evaluation of the potential adverse effects of apples.
In this description, no positive attributes of apples, such as nutritional value or health benefits, are mentioned, so overall it leans towards a negative
evaluation.
2. Positive evaluation: Although there is no direct positive evaluation in the description, it can be inferred that, since apples are widely recognized as a
healthy food, the mentioned allergens and digestive issues usually do not prevent most people from consuming apples. However, this positive
perspective is based on general knowledge rather than what is provided in the description itself.
In summary, the description focuses on the potential risks of apples without explicitly expressing any positive features, so it can be judged as a negative
evaluation. Note that a comprehensive evaluation would usually also need to consider positive aspects such as the nutritional and health benefits of
apples.
• If only simple data judgment is required, the output is complex and contains a large amount
of unnecessary information.
• Models of the GPT series are all autoregressive language models. That is, such models
predict the next word based on the current input, then append the prediction to the input
to predict the following word. This process repeats cyclically.
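• The cyclic prediction loop can be sketched with a toy stand-in for the model; the bigram table below is invented purely for illustration:

```python
# Hypothetical "model": a bigram lookup table standing in for a real network.
BIGRAM = {"the": "cat", "cat": "sat", "sat": "down"}

def generate(prompt_tokens, max_new_tokens=3):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        nxt = BIGRAM.get(tokens[-1])  # "predict" the next token
        if nxt is None:               # no known continuation: stop
            break
        tokens.append(nxt)            # feed the prediction back as input
    return tokens

print(generate(["the"]))  # ['the', 'cat', 'sat', 'down']
```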
• The pre-trained GPT-3 model does not need to be retrained when migrated to a new task.
You only need to provide a task description (optional), then provide several examples (task
queries and corresponding answers, organized in pairs), and finally add the query to be
answered by the model. After the preceding content is packaged as the model input, the
model can correctly output the answer corresponding to the last query.
Few-Shot Prompting
⚫ Although large language models demonstrate impressive zero-shot capabilities, they still perform poorly
on more complex tasks when using zero-shot settings.
⚫ Few-shot prompting can be used as a technique to enable contextual learning, providing demonstrations
in the prompt to guide the model to achieve better performance. The demonstrations condition the model for
the final example, for which it is expected to generate a response.
⚫ Few-shot examples can be selected randomly.
Q: <Question>?
A: <Answer>
Q: <Question>?
A: <Answer>
Q: <Question>?
A: <Answer>
Q: <Question>?
A:
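• The template above can be assembled programmatically. The demonstration pairs in this sketch are placeholders:

```python
def build_prompt(examples, query):
    """examples: list of (question, answer) demonstration pairs."""
    lines = []
    for q, a in examples:
        lines.append(f"Q: {q}?")
        lines.append(f"A: {a}")
    lines.append(f"Q: {query}?")
    lines.append("A:")          # left open for the model to complete
    return "\n".join(lines)

demos = [("What is 2+2", "4"), ("What is 3+3", "6")]
print(build_prompt(demos, "What is 5+5"))
```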
• One-shot prompting is the special case of few-shot prompting with a single demonstration.
Example of Few-Shot Prompting
Prompt: Certain substances in apples may cause allergies in consumers, leading to symptoms such as rashes and difficulty breathing; at
the same time, excessive fructose and dietary fiber in apples may cause digestive symptoms such as bloating and diarrhea.
Evaluation: Negative
Bananas are a very beneficial fruit for the human body.
Evaluation: Positive
Durian is sweet and delicious, with very high nutritional value.
Evaluation:
Output:
Positive
1. AI Business Process
Prompt: The sum of the odd numbers in this group is even: 4, 8, 9, 15, 12, 2, 1.
A: Add all the odd numbers (9, 15, 1) to get 25. The answer is False.
The sum of the odd numbers in this group is even: 17, 10, 19, 4, 8, 12, 24.
A: Add all the odd numbers (17, 19) to get 36. The answer is True.
The sum of the odd numbers in this group is even: 16, 11, 14, 4, 8, 13, 24.
A: Add all the odd numbers (11, 13) to get 24. The answer is True.
The sum of the odd numbers in this group is even: 17, 9, 10, 12, 13, 4, 2.
A: Add all the odd numbers (17, 9, 13) to get 39. The answer is False.
The sum of the odd numbers in this group is even: 15, 32, 5, 13, 82, 7, 1.
A:
Output: Add all odd numbers (15, 5, 13, 7, 1) to get 41. The answer is False.
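• The answers in this chain-of-thought prompt can be checked mechanically:

```python
def odd_sum_is_even(numbers):
    """Return (sum of odd numbers, whether that sum is even)."""
    odds = [n for n in numbers if n % 2 == 1]
    return sum(odds), sum(odds) % 2 == 0

print(odd_sum_is_even([15, 32, 5, 13, 82, 7, 1]))  # (41, False)
```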
C. Better scalability
D. High generalization
• BCD
Summary
⚫ This chapter introduces the entire process of AI service and foundation model
building, and describes how to use prompt engineering to efficiently use foundation
models to obtain more accurate answers.
1. Voice Assistants
2. Smart Homes
3. Intelligent Vehicles
4. Intelligent Recommendation
5. Intelligent Robots
6. AI4Science
Voice assistant
• https://consumer.huawei.com/en/emui/celia/
Voice Assistant – Celia
⚫ Celia is a smart assistant. Powered by foundation models, Celia is capable of providing solutions
for every scenario.
• https://consumer.huawei.com/en/emui/celia/
Celia – Simplify Your Daily Routines
Wherever you are and whatever you are doing, with the help of Celia, you can get your everyday
tasks done quickly and effortlessly. Be productive and enjoy your day!
1. Voice Assistants
2. Smart Homes
3. Intelligent Vehicles
4. Intelligent Recommendation
5. Intelligent Robots
6. AI4Science
• https://3ms.huawei.com/km/static/image/detail.html?fid=66109
Smart Home – Making Your Life Easier
⚫ AI connects home appliances to create an intelligent ecosystem
When you step into your home, warm lights fade in from the hallway
to the living room. The intelligent environment control system
automatically adjusts the ambient temperature, humidity, and air
quality to suit your preferences. Let the home take care of itself, and
relax after a long day.
1. Voice Assistants
2. Smart Homes
3. Intelligent Vehicles
4. Intelligent Recommendation
5. Intelligent Robots
6. AI4Science
1. Voice Assistants
2. Smart Homes
3. Intelligent Vehicles
4. Intelligent Recommendation
5. Intelligent Robots
6. AI4Science
⚫ Knowledge graphs are knowledge bases used in intelligent recommender systems. They are essentially structured networks that store relationships
between entities. They contain a large amount of background information about target objects and relationships between objects in recommendation
systems. Knowledge graphs provide precise profiles of target objects, such as user behavior, interests, and requirements, to implement accurate
matching and targeted recommendation, achieving scenario- and task-specific personalized recommendation. This close relationship enables
knowledge graphs to shine in intelligent recommendation.
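• A minimal sketch of this idea, assuming a toy triple store and a single shared-attribute relation (all entity and relation names are invented for illustration):

```python
# Tiny knowledge graph: (entity, relation) -> value triples.
KG = {
    ("MovieA", "genre"): "sci-fi",
    ("MovieB", "genre"): "sci-fi",
    ("MovieC", "genre"): "drama",
}

def recommend(liked_item, catalog):
    """Recommend catalog items that share a relation value with liked_item,
    i.e. a one-hop walk through the knowledge graph."""
    liked_genre = KG.get((liked_item, "genre"))
    return [item for item in catalog
            if item != liked_item and KG.get((item, "genre")) == liked_genre]

print(recommend("MovieA", ["MovieA", "MovieB", "MovieC"]))  # ['MovieB']
```

Real systems extend this to multi-hop paths and learned entity embeddings, but the core mechanism — matching target objects through stored entity relationships — is the same.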
1. Voice Assistants
2. Smart Homes
3. Intelligent Vehicles
4. Intelligent Recommendation
5. Intelligent Robots
6. AI4Science
(Figure: 0. Different mugs, some observed by the robot, others not. 1. Give an instruction
for a mug observed by the robot. 2. Give an instruction for mug 1, not yet observed by the
robot. 3. Give an instruction for mug 2, not yet observed by the robot.)
• https://arxiv.org/pdf/2410.07864
• https://rdt-robotics.github.io/rdt-robotics/
Contents
1. Voice Assistants
2. Smart Homes
3. Intelligent Vehicles
4. Intelligent Recommendation
5. Intelligent Robots
6. AI4Science
• The Nobel Prize in Physics was awarded to American scientist John Hopfield and Canadian
scientist Geoffrey Hinton for their "foundational discoveries and inventions that enable
machine learning with artificial neural networks". The work of these two pioneers in AI has
laid a foundation for the development of deep learning and neural networks, and plays a
crucial role in the booming development of AI.
• Half of the Nobel Prize in Chemistry was awarded to American scientist David Baker for his
contribution to computational protein design, while British scientist Demis Hassabis and
American scientist John Jumper shared the other half for their achievements in protein
structure prediction. AlphaFold2, an AI model developed by Hassabis and Jumper, has
solved a 50-year-old problem by predicting the complex structures of approximately 200
million known proteins. It has been used by over 2 million people worldwide and holds
revolutionary significance in fields like drug R&D.
• The awarding of these prizes reflects a new trend in scientific research: AI, as a force that
cannot be ignored, is driving a paradigm shift in scientific research. In fields such as physics,
chemistry, biology, and medicine, AI has become an important tool for solving long-
standing complex scientific problems. It has established a paradigm theoretically capable of
solving all scientific problems: starting from practical problems, transforming them into
input data that AI can process, and then using deep learning networks to ultimately output
results. Many scientists believe that AI will continuously push scientific research beyond
traditional frameworks, achieving more profound and broader innovations.
AI + Quantum Mechanics
FermiNet (DeepMind, 2024)
FermiNet is a neural network architecture proposed by DeepMind. It supports a parameterized
wave function representation method and can efficiently process multi-electron quantum
systems, improving the accuracy and efficiency of electronic structure calculations. FermiNet
calculates the energy of atoms and molecules, and still captures 97% of the relevant energy in
a molecular system consisting of the 30 electrons of bicyclobutane.
David Pfau, et al. Accurate computation of quantum excited states with neural networks. Science, 2024.
LapNet (Peking University and ByteDance, 2024)
LapNet is a deep learning architecture developed by ByteDance and Peking University for
neural network-based variational Monte Carlo (NN-VMC). The team designed a forward
Laplacian computing framework to calculate the Laplacian related to the neural network (a
bottleneck of NN-VMC) through an effective forward propagation process. LapNet achieves
significant acceleration, extending the applicability of NN-VMC to larger systems.
Ruichen Li, et al. A computational framework for neural network based variational Monte Carlo with Forward Laplacian. Nature Machine Intelligence, 2024.
AI + Material Science
GNoME (DeepMind and Nature, 2023)
The GNoME tool launched by DeepMind has discovered 2.2 million types of new
crystals, among which 380,000 crystals with stable structures are expected to be
synthesized. This may drive technological transformations in next-generation batteries
and superconductors. GNoME uses a graph neural network (GNN) model to efficiently
predict the crystal structure and stability of materials, greatly accelerating the discovery
of new materials. Compared with traditional experimental methods, AI models can
predict millions of materials in a short time, shortening the research period and
significantly saving experimental costs. AmilMerchant et al. Scaling deep learning for materials discovery, Nature (2023)
Weinan E, et. al. DeePMD kit v2: A software package for deep potential models. 2023
⚫ This chapter describes how AI reshapes our lives, as well as AI applications in fields
such as meteorology and robotics.