HCIA-AI V4.0 Training Material
Foreword
1. AI Overview
2. AI Technologies
4. AI Applications
"Artificial intelligence is the science of making machines do things that would require intelligence
if done by men."
-- Marvin Minsky
• "The branch of computer science concerned with making computers behave like humans."
-- A widespread early definition of AI proposed by John McCarthy at the Dartmouth
Conference in 1956. However, it seems that this definition ignores the possibility of strong
AI. According to another definition, AI is the intelligence (weak AI) demonstrated by
artificial machines.
• Alan Turing discussed the question "Can machines think?" in his 1950 paper Computing Machinery
and Intelligence.
What Is Intelligence?
⚫ Professor Howard Gardner proposed the theory of multi-intelligence, and listed eight
capabilities that reflect multi-intelligence:
Verbal/Linguistic
Logical/Mathematical
Visual/Spatial
Bodily/Kinesthetic
Musical/Rhythmic
Inter-personal/Social
Intra-personal/Introspective
Naturalist
6 Copyright © Huawei Technologies Co., Ltd. All rights reserved.
• 1. Linguistic intelligence
It refers to the ability to express thoughts and understand others orally and in writing, to master
speech, semantics, and grammar, to think and express in words, and to appreciate the deeper
meaning of language. People with this intelligence are well suited to jobs such as political
activist, presenter, lawyer, speaker, editor, writer, journalist, and teacher.
• 2. Logical-mathematical intelligence
It refers to the ability to calculate, measure, infer, conclude, classify, and carry out complex
mathematical operations. This type of intelligence possesses sensitivity to logical means and
relationships, statements and propositions, functions, and other related abstract concepts.
People with this intelligence are well suited to jobs such as scientist, accountant, statistician,
engineer, and computer software R&D engineer.
• 3. Spatial intelligence
It refers to the ability to accurately perceive the visual space and surroundings and present the
perception in the form of graphics. This type of intelligence possesses sensitivity to colors,
lines, shapes, forms, and spatial relationships. People with this intelligence are well suited to
jobs such as interior designer, architect, photographer, painter, and pilot.
• 4. Bodily-kinesthetic intelligence
It refers to the ability to express thoughts and feelings with the body and to make or operate
objects with hands. This type of intelligence possesses special physical skills such as balance,
coordination, agility, strength, flexibility and speed, and abilities triggered by tactile sensation.
People with this intelligence are well suited to jobs such as athlete, actor, dancer, surgeon,
jeweller, and mechanic.
What Is AI?
⚫ Artificial intelligence can be interpreted as "artificial" plus "intelligence". "Artificial" means being designed, created,
and made by humans. "Intelligence" means thinking and behaving like humans.
⚫ Artificial intelligence is a new technical science that deals with the research and development of theories, methods,
techniques, and application systems for simulating and extending human intelligence. In 1956, the term "Artificial
intelligence" was first coined by John McCarthy, who defined it as "the science and engineering of making intelligent
machines." The purpose of AI is to make machines intelligent and give them human thoughts.
⚫ AI has become an interdisciplinary subject that overlaps with various fields.
• Machine learning can be understood from multiple aspects. Tom Mitchell, known as the
"godfather of global machine learning", defined machine learning as: For a task T and
performance metric P, if the performance of a computer program measured by P on T self-improves
with experience E, the computer program is learning from experience E. This definition is simple
and abstract. However, as we deepen our understanding of machine learning, we will find that its
connotation and extension change over time. Because machine learning involves many fields and
applications and develops rapidly, it is not easy to define it simply and clearly.
Relationships Among AI, Machine Learning, and Deep Learning (1)
(Figure: nested circles - AI is the area of research; machine learning is a way to implement it;
deep learning is the current mainstream. Many schools of thought have emerged during the
development of AI.)
• The term "artificial intelligence" was previously used to describe machines that imitate and
demonstrate "human" cognitive skill related to human thinking.
• A machine learning algorithm builds a model based on sample data (referred to as training
data), so that predictions or decisions can be made without explicit programming.
• Deep learning is a type of machine learning, and machine learning is one way, though not the
only one, to implement artificial intelligence. The concept of deep learning originates from the study
of artificial neural networks. A multi-layer perceptron containing multiple hidden layers is a
deep learning structure. Deep learning uses higher level features derived from the lower
level features to form a hierarchical representation. The motivation of deep learning
research is to establish a neural network that simulates the human brain for analysis and
learning. The neural network simulates the mechanism of the human brain to interpret
data, such as images, sounds, and texts.
Major Schools of AI - Symbolism
⚫ Symbolism is also called Logicism, Psychologism, or Computerism.
⚫ Symbolism believes that AI is built upon mathematical logic. Followers of this school of thought
believe that symbols are the cognitive primitives of humans, and the human cognition process
is based on inferring and calculating a variety of symbols. In their opinion, both humans and
computers use physical symbol systems, so computers can be used to simulate the intelligent
behavior of humans.
Apple in the eyes of symbolism
• After 1956, symbolism developed heuristic algorithms, expert systems, and knowledge
engineering theories and technologies, and made significant progress in the 1980s.
• The successful launch and application of the expert system is of great significance for
leading AI to engineering application and linking AI theory with practice.
• Units in a network can represent neurons, and connections can represent synapses, just as
in the human brain.
• Behaviorism concerns more about application practices and how to learn from the
environment continuously to make corrections.
• School masterpiece: Brooks' hexapod walking robot, a control system that simulates insect
behavior based on the perception-action pattern.
• This school is somewhat similar to an adaptive control system, which collects data from the
environment using sensors and acts on the system.
Three Elements for AI Development
⚫ Independent of the school of thought, AI research needs three key elements.
▫ Computing power: the engine of AI and the driving force behind AI systems. Examples:
data centers, distributed computing, cloud computing, edge computing, and high-performance
computing (HPC).
▫ Data: the powerhouse of AI and the fuel for AI systems. Examples: data mining, data
analysis, data warehouses, data visualization, data security, and privacy protection.
▫ Algorithms: the brain of AI and the command center of AI systems. Examples: machine
learning (ML), deep learning (DL), natural language processing (NLP), computer vision (CV),
and recommendation systems.
⚫ Weak AI
The weak AI theory holds that it is impossible to create an intelligent machine that can truly
reason and solve problems. Such a machine only looks intelligent but does not have real human
intelligence or self-awareness.
• Strong AI refers to AI that can compete with humans in all aspects. Therefore, strong AI is
not limited to a specified field, but makes robots comparable to humans in all aspects.
Strong AI can think, plan, solve problems, think abstractly, understand complex concepts,
learn quickly, and learn from experience. Currently, it is believed that if the human brain
can be simulated and all neurons and synapses in the human brain can be imitated on the
same scale, strong AI will naturally occur.
• Now we are in the weak AI phase. Weak AI alleviates human intellectual labor, similar to
advanced bionics. AI outperforms humans only in some aspects.
Contents
1. AI Overview
2. AI Technologies
◼ AI Technologies
NLP
CV
Foundation Model
Multimodal
Other
1. AI Overview
2. AI Technologies
AI Technologies
◼ NLP
CV
Foundation Model
Many NLP tasks, such as Named Entity Recognition, can be viewed as sequence-to-sequence mapping.
1. AI Overview
2. AI Technologies
AI Technologies
NLP
◼ CV
Foundation Model
(Figure: object detection example - multiple cars in a scene, each located with a labeled
bounding box.)
• Due to different appearances, shapes, and arrangements of objects, as well as lighting and
shading during imaging, object detection has always been the largest challenge in the field
of computer vision.
• Despite using similar techniques, object recognition differs slightly from object detection.
Given a specific object, object recognition aims to find an instance of that object in an image.
The goal is not to classify, but to judge whether the object appears in the image and, if so, to
locate it. For example, a real-time feed from a security camera can be monitored to recognize a
particular person's face.
CV Task - Image Segmentation
⚫ Image segmentation is the process of partitioning an image into multiple segments based on
the problem to be solved.
⚫ There are many algorithms and application methods for image segmentation. Common ones
include connected component segmentation, motion segmentation, and object segmentation.
▫ Semantic segmentation refers to pixel-level image recognition, that is, marking the
object category to which each pixel in the image belongs.
▫ Input: image.
▫ Output: segmented images with the same resolution as the input image and label of
each pixel category.
CV Task - Object Tracking
⚫ Object tracking is a core research area in computer vision and has a wide range of applications,
such as intelligent transportation, security monitoring, human-machine interaction, and
autonomous driving.
⚫ Tracking algorithms obtain the trajectories of target objects over time to analyze their
movement behaviors.
Image inpainting
1. AI Overview
2. AI Technologies
AI Technologies
NLP
CV
◼ Foundation Model
(Figure: Transformer architecture - positional encoding, multi-head attention, masked multi-head
attention, and feed-forward layers.)
In 2014, researchers in machine translation proposed the Seq2Seq model. This was a new way to
implement end-to-end machine translation based on the RNN architecture.
• The core idea of the Seq2Seq model is to use an encoder network to encode an input
sequence into a vector in a fixed dimension or a series of hidden states, and then the
decoder network generates a target sequence word by word.
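The encoder-decoder idea above can be sketched as a toy forward pass in pure Python (random weights, no training; all sizes and names here are illustrative, not from the training material):

```python
import math, random

# Toy Seq2Seq sketch: an RNN encoder compresses an input token sequence into a
# fixed-size context vector; an RNN decoder then emits output tokens one by one.
random.seed(0)
V, H = 10, 8                                   # vocabulary size, hidden size
rand_mat = lambda r, c: [[random.gauss(0, 0.1) for _ in range(c)] for _ in range(r)]
E, Wxh, Whh, Why = rand_mat(V, H), rand_mat(H, H), rand_mat(H, H), rand_mat(H, V)

def step(h, tok, W_in, W_rec):
    """One RNN step: h' = tanh(E[tok] @ W_in + h @ W_rec)."""
    return [math.tanh(sum(E[tok][i] * W_in[i][j] for i in range(H)) +
                      sum(h[i] * W_rec[i][j] for i in range(H)))
            for j in range(H)]

def encode(tokens):
    """Run the encoder; the final hidden state is the fixed-size context vector."""
    h = [0.0] * H
    for t in tokens:
        h = step(h, t, Wxh, Whh)
    return h

def decode(context, max_len=5):
    """Greedy decoding: feed the previous output token back in at each step."""
    h, tok, out = context, 0, []               # token 0 acts as <start>
    for _ in range(max_len):
        h = step(h, tok, Wxh, Whh)
        logits = [sum(h[i] * Why[i][j] for i in range(H)) for j in range(V)]
        tok = logits.index(max(logits))        # pick the most probable next token
        out.append(tok)
    return out

context = encode([3, 1, 4])                    # encode a 3-token "source sentence"
target = decode(context)
print(len(context), len(target))               # 8 5
```

The key design point is the bottleneck: the entire input must pass through one fixed-size vector, which is exactly the limitation that later attention mechanisms address.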
Where Does a Foundation Model Come from? (2)
Algorithm: from millions of parameters (AlexNet, VGG, ResNet, ELMO) to billions of parameters
(Transformer-based models).
Small models are light and highly efficient. They are suitable when resources are limited and
only one task needs to be performed, for example, a watch that is only used to check the time.
Large models feature higher processing capabilities and accuracy. They are suitable when high
complexity and accuracy are required; for example, a smartwatch can be used to check the time,
heartbeat, sleep, and other parameters.
• Foundation models require massive data for learning. It is costly to train models for each
industry and scenario from scratch. Therefore, the learning of foundation models is based
on a basic model pre-trained on massive data and then fine-tuning is performed based on
industry data.
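The pretrain-then-fine-tune idea above can be sketched as a linear probe: a frozen "pretrained" feature extractor plus a small head fitted on new industry data. All data, names, and numbers below are made up for illustration; real fine-tuning updates neural network layers, not a hand-solved 2x2 system.

```python
# Toy sketch of pre-training + fine-tuning: the "pretrained" feature extractor
# is frozen, and only a small linear head is fitted on the new "industry" data
# via ordinary least squares (2-feature normal equations solved by hand).
def features(x):
    return [1.0, x]                      # frozen pretrained representation

xs = [0.1 * i for i in range(10)]        # industry fine-tuning inputs
ys = [1.0 + 2.0 * x for x in xs]         # industry targets: y = 1 + 2x

# Accumulate F^T F and F^T y for the 2-feature case.
s00 = s01 = s11 = b0 = b1 = 0.0
for x, y in zip(xs, ys):
    f0, f1 = features(x)
    s00 += f0 * f0; s01 += f0 * f1; s11 += f1 * f1
    b0 += f0 * y;   b1 += f1 * y

det = s00 * s11 - s01 * s01              # invert the 2x2 system by hand
w0 = (s11 * b0 - s01 * b1) / det
w1 = (s00 * b1 - s01 * b0) / det
print(round(w0, 2), round(w1, 2))        # head recovers 1.0 2.0
```

Only the head weights are learned; the expensive representation is reused, which is the economic argument the paragraph above makes.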
Foundation Models That Can Speak and Paint
⚫ Unlike conventional single-modal models (which process only one type of information such as text,
images, or audio), a foundation model can integrate multi-modal data (text, image, video, and audio) for
comprehensive understanding and inference.
⚫ Cross-modal understanding: The relationship between different types of data can be understood. For
example, information is extracted from an image and described in text, or an image or video is generated
based on a text description.
Foundation Model
• Emergence also exists in daily life, for example, snowflake formation, traffic jams, animal
migration, and vortex formation. Snowflakes are used as an example for explanation. A
snowflake is composed of small water molecules. However, if a large number of water
molecules interact with each other under the premise of external temperature change, a
regular, symmetric, and beautiful snowflake will be formed at the macro level.
• When the model size is not large enough, certain tasks cannot be handled properly. However,
once the model size exceeds a threshold, those tasks can suddenly be performed well.
Stepwise Thinking - CoT
⚫ Chain of thought (CoT) is an improved prompting strategy that helps LLMs perform better in complex inference
tasks, such as arithmetic, common-sense, and symbolic inference. Unlike ICL, where prompts are constructed using
simple input-output pairs, CoT adds the intermediate inference steps that lead to the final output into the prompts.
Model output (standard prompting): A: The answer is 27. ✗
Model output (CoT prompting): A: The cafeteria had 23 apples originally. They used 20 to make
lunch. So, they had 23 - 20 = 3. They bought 6 more apples, so they have 3 + 6 = 9. The answer
is 9. ✓
• Compared with the traditional in-context learning (which uses x1, y1, x2, y2, ... xtest as the
input to enable the foundation model to provide the output ytest), the chain of thought
introduces intermediate inference prompts.
• https://bbs.huaweicloud.com/blogs/406077.
Contents
1. AI Overview
2. AI Technologies
4. AI Applications
07/2023: Launch of DeepSeek. Liang Wenfeng, co-founder of High-Flyer, launched DeepSeek.
11/2023: DeepSeek Coder. A free and open-source code model was released, designed for coding tasks.
11/2023: DeepSeek-LLM Chat. DeepSeek-LLM Chat was developed with 67 billion parameters.
05/2024: DeepSeek-V2. DeepSeek-V2 was launched at a price point similar to mainstream LLMs,
securing seventh spot in a ranking by the University of Waterloo.
11/2024: DeepSeek-R1-Lite-Preview. DeepSeek claimed that its reasoning model exceeds OpenAI o1
in logical inference, mathematical reasoning, and real-time problem solving.
12/2024: DeepSeek-V3. Developed with 671 billion parameters, V3 was trained in just 55 days and
at one-tenth the cost of similar models, outperforming Llama 3.1 and Qwen 2.5 and rivaling GPT-4o.
01/2025: DeepSeek-R1. DeepSeek-R1 is positioned to provide performance on par with OpenAI o1 and
supports model distillation.
• DeepSeek provides performance that rivals GPT-4o, but at a much lower training cost. While
GPT-4 reportedly ran up training costs reaching US$78–100 million, DeepSeek-V3 was trained
for just US$5.576 million (2.788 million H800 GPU hours), or just 5–10% of the cost of similar
models.
• DeepSeek uses the new Mixture of Experts architecture, Multi-Head Latent Attention,
auxiliary loss-free load balancing, and Multi-Token Prediction.
• Challenging big tech monopoly: For a long time, the AI field has been dominated by a
few tech giants that boast countless resources and other advantages. With DeepSeek,
there is legitimate competition to the existing monopoly. DeepSeek boasts strong
capabilities and a unique open-source strategy that have combined into various
breakthroughs and widespread uptake in major markets. Many users and developers
have moved to DeepSeek for both personal and work issues, disrupting the previous
balance and prompting rival companies to reassess their market strategies.
• Intense industry competition: DeepSeek's success has inspired more companies and
research institutions to invest in AI. To gain a competitive edge, major companies are
investing more in R&D by introducing more cutting-edge technologies, which have not
only improved technical performance but also reduced costs and expanded application
scenarios. In the LLM market, this has sparked a price war, forcing other companies to
lower prices to remain competitive. This greater competition has catalyzed faster
technological iterations and ensured a baseline for AI development costs.
• Lower development barrier: Open-source models have reduced the barrier for
enterprises and developers to develop AI. Instead of starting from scratch, developers
instead have free tools to quickly build complex AI models or applications. Now, SMEs or
other small-scale teams can develop and train applications quicker and more
economically.
• More importance attached to inference: Since development costs are set to decrease
with each iteration, the focus will shift to balancing cost and performance in the inference
phase. This will likely lead to changes in computing power structures. The importance of
inference computing power will increase significantly, driven by user needs.
• Technology sharing: Open-source models can be shared and used by global developers,
encouraging wider technology sharing and collaboration. By open-sourcing code and
models, ecosystem users can work together to improve AI technology, while companies
can actively participate in open-source projects to share their achievements and hear
from others' experience.
• Secure and economical R&D: Open-source projects can lower R&D costs and risks. SMEs
and startups can leverage open-source AI to create their products or services without
substantial investment or uncertainty from starting from scratch. Further feedback from
the open-source community can help refine and enhance products for added market
competitiveness.
1. AI Overview
2. AI Technologies
4. AI Applications
◼ AI Application History
AI Application Cases
Other
Traditional DL model applications:
▫ Specific optimization tasks; single task types (e.g., object detection, machine translation,
image generation)
▫ Generalization: weak adaptation to new samples; single-type data inference
▫ Model capability: poor generation capabilities; mostly used for discrimination tasks
LLM applications (Transformer-based):
▫ A wide coverage of domains; diverse tasks
▫ Generalization: strong capabilities in cross-domain task processing, knowledge transfer, and
adaptation to new samples
▫ Model capability: strong creativity, zero-shot & few-shot learning, and logical reasoning
1. AI Overview
2. AI Technologies
4. AI Applications
AI Application History
◼ AI Application Cases
Scenario 2: Underground intelligent monitoring (belt tear monitoring, belt deviation monitoring)
Challenges
Coal conveyor belt: It's unsafe and time-consuming to inspect a coal conveyor belt longer than
20 km. Multiple belt segments need to collaborate to prevent coal accumulation.
Behavior protocol violations: not wearing safety helmets, smoking, passing under equipment,
walking in laneways, and sitting on belts.
Operational violations: improper operations during gas drainage and water inspection and
drainage.
• https://e.huawei.com/cn/material/wireless/4d008289c5424b31a055c71eaaed5790
AI Helps Protect Nature - Preserving Chile's Biodiversity
⚫ The Nature Guardian project uses Huawei Cloud and AI to research and protect Darwin's foxes in
the Nahuelbuta Mountains. The Nature Guardian is an acoustic monitoring system developed by
Rainforest Connection (RFCx) and has been effective in several projects.
⚫ It consists of solar devices equipped with microphones and antennas. These devices collect
sound data from the surrounding environment and transmit it to the cloud through wireless
networks for AI data analysis. Each device can cover three square kilometers around the clock.
⚫ The Nature Guardian can capture animal sounds as well as illegal noise made by poachers'
gunshots or trucks and electric saws of illegal loggers.
⚫ The trained AI model can identify the sounds of different animals, enabling experts to study the
distribution and behavior of species and helping with environmental protection through
adaptive management.
⚫ If a threat is identified, the system sends a real-time alarm to the ranger's mobile application for
a fast response.
Darwin's Fox
Source: https://www.huawei.com/cn/tech4all/stories/nature-guardians-for-biodiversity-in-chile
• https://www.huawei.com/cn/tech4all/stories/nature-guardians-for-biodiversity-in-chile
AI Safeguards Nature - Protecting Wildlife in Greece with a Shield of Sound
• https://www.huawei.com/cn/tech4all/stories/wildlife-greece
Scaling Law in AI for Science
⚫ Multi-disciplinary and multi-scale data are used to train four fundamental science models:
life, materials, fluids, and electromagnetism.
(Figure: AI4Science universal models - Universal Material Model: atomic orbitals, crystal
structures, small molecules; Universal Life Model: proteins, genes, cells, organs, brain;
Universal Fluid Model: turbine simulation, weather forecasting; Universal Electromagnetic Model:
electromagnetism.)
(Figure: drug discovery pipeline - a drug molecule model and database produce 100 million
structurally novel compounds; a compound property predictor and a compound optimizer screen the
compound library into a hit compound candidate library, then a lead compound candidate library,
and finally a screening result, making drug target prediction more accurate and shortening the
lead compound discovery cycle from years to 1 month.)
Government applications: event analysis, report generation, intelligent online services,
government service digital humans, government hotlines, document writing, government service
assistants, policy recommendation, and more.
1. AI Overview
2. AI Technologies
4. AI Applications
▫ Future of AI
• Controllability.
Is Seeing Still Believing?
⚫ As AI technologies develop, we begin to question the credibility of images, audio, and video.
Now, technologies such as multi-modal large models and generative adversarial network (GAN)
can be used to produce false images and videos, making it difficult to distinguish between what
is true and what is false.
For example, Lyrebird is a tool that can automatically imitate human voices from several minutes of
recording samples.
Deepfake can generate videos with fake faces.
• Confidential computing: Data transmission and computing are confidential, and privacy
protection is costly.
1. AI Overview
2. AI Technologies
4. AI Applications
▫ Future of AI
• Digital government: smart city, smart government, smart emergency response, sunny
kitchen, smart water conservancy, and the like. For example, sunny kitchen uses AI machine
vision to identify food types on meal plates and calculate prices and calories. Smart water
conservancy and management can solve problems in the following aspects: 1. black water
management and sewage monitoring (AI video algorithm for sewage identification, satellite
remote sensing, and water pump gate scheduling); 2. reservoir (enhanced video, intelligent
power supply, and integrated pole site for water conditions) 3. rivers and lakes (spectral
water quality analysis, and 5G access to drones/ships).
• Smart transportation: airports (IOC queuing management – sensing an extra long passenger
queue: AI + video analysis; stand scheduling: operation optimization AI algorithm + big data
+ IoT + GIS + video + simulation...), and highways (holographic intersection, unmanned
driving...).
Value - Growing Market
▫ Market scale: The global AI market was expected to reach USD638.2 billion in 2024 (USD538.1
billion in 2023).
▫ Foundation model solutions: In the first half of 2024, the market scale of AI foundation model
solutions in China was CNY1.38 billion. The estimated CAGR is 56.2% from 2024 to 2028, reaching
a market scale of CNY21.1 billion by 2028.
Trends: wider industry applications; multimodal and cross-modal learning; enhanced
personalization and adaptability; improved self-learning capabilities; stricter ethical and
security control; energy efficiency and sustainability; open-source and democratized AI
technologies.
• Wider industry applications: As technologies become more mature and adaptable, large
models are poised to find broader applications in traditional industries, such as
manufacturing, energy, and logistics, to support more efficient problem-solving and
decision-making.
• Multimodal and cross-modal learning: Future large models will not be limited to a single
domain such as text or images. Instead, they will be able to process and integrate
multiple types of data (such as text, images, videos, and sounds) to provide a more
comprehensive AI application experience.
• Enhanced personalization and adaptability: Large models will better adapt to specific
needs of individual users and provide customized services. For example, in the education
and health fields, models can provide personalized suggestions based on individuals'
learning progress or health status.
• Improved self-learning capabilities: Future large models will have stronger self-learning
capabilities, allowing them to learn new patterns and knowledge through continuous
interaction and feedback. This reduces the reliance on large-scale labeled data.
• Stricter ethical and security control: As large models become more widespread, ethical,
privacy, and security issues will receive more attention. Future development will focus
on ensuring the transparency, explainability, and fairness of models while strengthening
data protection.
• Energy efficiency and sustainability: Given the enormous energy consumption required to
train large models, future research will focus on improving the energy efficiency of
models and reducing the carbon footprint to support sustainable development.
Summary
⚫ This chapter covered the definition and history of AI, its applications and sectors, and
controversial topics and future trends in the field.
B. Semantic segmentation
C. Intelligent driving
D. Video analysis
• Answer:
▫ BCD
Recommendations
⚫ Huawei Cloud
https://www.huaweicloud.com/intl/en-us/
Understanding machine learning: data (experience 𝐸) is fed to a learning algorithm that performs
a task 𝑇, whose quality is assessed by a performance measure 𝑃.
• Deep learning is a sub-field of machine learning. To understand deep learning, you need to
first understand the fundamentals of machine learning.
▫ Task 𝑇 represents how the machine learning system should process a sample.
(Figure: analogy - historical data is summarized into experience; training data is used to train
a model with machine learning; new data is fed to the model for prediction.)
Machine learning suits scenarios where:
▫ Rules are complex or difficult to describe, for example, speech recognition.
▫ Task rules change over time, for example, part-of-speech tagging, in which new words or word
meanings can be generated at any time.
▫ Data distribution changes over time and programs need to adapt to new data constantly, for
example, sales trend forecasting.
(Figure: rule complexity vs. problem scale - simple questions: low complexity, small scale;
rule-based algorithms: low complexity, large scale; manual rules: high complexity, small scale;
machine learning algorithms: high complexity, large scale.)
Ideal: target function 𝑓: 𝑋 → 𝑌 (unknown)
Actual: training data 𝐷: {(𝑥1 , 𝑦1 ), ⋯ , (𝑥𝑛 , 𝑦𝑛 )} → learning algorithm → hypothesis
function 𝑔 ≈ 𝑓
⚫ The target function 𝑓 is unknown, and the learning algorithm cannot obtain a perfect
function 𝑓.
⚫ The hypothesis function 𝑔 approximates 𝑓, but may be different from 𝑓.
Supervised learning: each sample has features (feature 1 ... feature n) and a target (label) for
the algorithm to learn.

Weather | Temperature | Wind Speed | Suitable for Exercise
Sunny   | High        | High       | Yes
Rainy   | Low         | Medium     | No
Sunny   | Low         | Low        | Yes
Supervised Learning - Regression
⚫ Regression reflects the features of sample attributes in a dataset. A function is used to express
the sample mapping relationship and further discover the dependency between attributes.
Examples include:
◼ How much money can I make from stocks next week?
◼ What will the temperature be on Tuesday?
Monday: 38°. Tuesday: ?
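The temperature example can be sketched as a least-squares linear fit (all numbers below are illustrative, not real data):

```python
# Minimal linear regression: fit y = a*x + b to past daily temperatures by the
# closed-form least-squares slope and intercept, then extrapolate one day.
days  = [0, 1, 2, 3, 4]                 # Thursday .. Monday (x)
temps = [30.0, 32.0, 34.0, 36.0, 38.0]  # observed temperatures (y)

n = len(days)
mx = sum(days) / n
my = sum(temps) / n
a = sum((x - mx) * (y - my) for x, y in zip(days, temps)) / \
    sum((x - mx) ** 2 for x in days)
b = my - a * mx
print(round(a * 5 + b, 1))  # predicted Tuesday temperature -> 40.0
```

The fitted function expresses the mapping from day to temperature, which is exactly the "dependency between attributes" the slide describes.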
Semi-supervised learning: samples have features (feature 1 ... feature n), but only some have
known targets (the rest are unknown); the algorithm learns a model from both labeled and
unlabeled data.
(Figure: reinforcement learning loop - the agent acts on the environment, which returns reward
𝑟𝑡+1 and next state 𝑠𝑡+1.)
• Reinforcement learning uses a series of actions to maximize the reward function to learn a
model.
• Both good and bad behavior can help reinforcement learning models learn.
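The "maximize the reward through a series of actions" idea can be sketched as an epsilon-greedy agent on a two-armed bandit. The reward probabilities and hyperparameters below are made up for illustration:

```python
import random

# Epsilon-greedy bandit: explore occasionally, otherwise exploit the arm with
# the highest estimated value; both rewarded and unrewarded pulls update the
# estimates, so good and bad outcomes alike inform learning.
random.seed(0)
true_p = [0.3, 0.8]                 # arm 1 pays off more often (hidden from agent)
counts = [0, 0]
values = [0.0, 0.0]                 # running estimate of each arm's reward

for step in range(2000):
    if random.random() < 0.1:       # explore with probability 0.1
        a = random.randrange(2)
    else:                           # exploit the current best estimate
        a = values.index(max(values))
    r = 1.0 if random.random() < true_p[a] else 0.0
    counts[a] += 1
    values[a] += (r - values[a]) / counts[a]   # incremental mean update

print(values.index(max(values)))    # arm 1 should end up with the higher estimate
```

After enough interaction, the estimated values approach the true payoff rates and the agent concentrates its actions on the higher-reward arm.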
Overall process: data preparation → data cleansing → feature extraction and selection → model
training → model evaluation → model deployment and integration, with feedback and iteration
throughout.
Data preprocessing:
▫ Data cleansing: fill in missing values, and detect and eliminate noise and other abnormal
points (e.g., missing values, invalid values, misfielded values).
▫ Data standardization: standardize data to reduce noise and improve model accuracy.
▫ Data dimension reduction: simplify data attributes to avoid the curse of dimensionality.
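Two of these steps can be sketched in plain Python (a toy column of numbers, made up for illustration):

```python
# Preprocessing sketch: fill a missing value with the column mean, then
# z-score standardize the column ((v - mean) / standard deviation).
col = [4.0, 8.0, None, 6.0, 2.0]

known = [v for v in col if v is not None]
mean = sum(known) / len(known)
filled = [mean if v is None else v for v in col]          # missing-value fill

std = (sum((v - mean) ** 2 for v in filled) / len(filled)) ** 0.5
standardized = [round((v - mean) / std, 2) for v in filled]  # z-score
print(standardized)  # [-0.5, 1.5, 0.0, 0.5, -1.5]
```

After standardization the column has zero mean and unit variance, which puts differently scaled features on a comparable footing for training.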
Benefits of feature selection:
▫ Simplifies models for easy interpretation
▫ Shortens training time
▫ Avoids the curse of dimensionality
▫ Improves model generalization and avoids overfitting
Embedded method
• Common method: LASSO regression
• A penalty term is added to the model and used to eliminate insignificant features.
Overall Process of Building an AI Model
Model building process
• After data cleansing and feature extraction, we need to start building the model. The
general procedure for building a model is shown above (supervised learning).
Supervised Learning Example - Learning Phase
⚫ Use a classification model to determine whether a person is a basketball player based on
specific features.
Features (attributes) Target (label)
• Model example
Supervised Learning Example - Prediction Phase
Name     | City    | Age | Label
Marine   | Miami   | 45  | ?
Julien   | Miami   | 52  | ?
Fred     | Orlando | 20  | ?
Michelle | Boston  | 34  | ?
Nicolas  | Phoenix | 90  | ?
New (unknown) data: the label cannot be read off directly; the model must predict whether each
person is a basketball player.
Apply the model:
IF city = Miami → Probability = +0.7
IF city = Orlando → Probability = +0.2
IF age > 42 → Probability = +0.05*age + 0.06
IF age <= 42 → Probability = +0.01*age + 0.02
• Model example
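The rule model from the slide can be implemented directly. Note that the rules are additive and the resulting "Probability" can exceed 1, so it behaves as a score rather than a calibrated probability:

```python
# The slide's rule model: city and age each contribute to an additive score.
def score(city, age):
    p = 0.0
    if city == "Miami":
        p += 0.7
    elif city == "Orlando":
        p += 0.2
    if age > 42:
        p += 0.05 * age + 0.06
    else:
        p += 0.01 * age + 0.02
    return p

rows = [("Marine", "Miami", 45), ("Julien", "Miami", 52), ("Fred", "Orlando", 20),
        ("Michelle", "Boston", 34), ("Nicolas", "Phoenix", 90)]
for name, city, age in rows:
    print(name, round(score(city, age), 2))
```

Running this scores every row of the prediction table, e.g. Marine gets 0.7 + 0.05*45 + 0.06 = 3.01.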
• Generalization
The accuracy of predictions based on actual data
• Explainability
Predicted results are easy to explain
• Prediction speed
The time needed to make a prediction
• Once the form of a problem's hypothesis is given, all possible functions constitute a space,
which is hypothesis space. The problem of machine learning is searching for a suitable
fitting function in a hypothesis space.
• Underfitting: It occurs if the mathematical model is too simple or the training time is too
short. To solve the underfitting problem, use a more complex model or extend the training
time.
Model Effectiveness (2)
⚫ Model capacity, also known as model complexity, is the capability of the model to fit various functions.
◼ With sufficient capacity to handle task complexity and training data volumes, the algorithm results are optimal.
◼ Models with an insufficient capacity cannot handle complex tasks because underfitting may occur.
◼ Models with a large capacity can handle complex tasks, but overfitting may occur when the capacity is greater
than the amount required by a task.
(Figure: underfitting - features not learned; good fitting; overfitting - noise learned.)
▫ Theoretically, if there is infinite amount of data and a perfect model, the error can be
eliminated.
(Figure: as model complexity increases, training error decreases steadily while test error first
decreases and then increases.)
𝑅² = 1 − RSS/TSS = 1 − ∑ᵢ₌₁ᵐ (ŷᵢ − yᵢ)² / ∑ᵢ₌₁ᵐ (ȳ − yᵢ)²
where ŷᵢ is the predicted value, yᵢ the observed value, and ȳ the mean of the observed values.
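The R² formula can be checked with toy numbers (illustrative, not real data):

```python
# R-squared from its definition: 1 - RSS/TSS, where RSS is the residual sum of
# squares and TSS is the total sum of squares around the mean.
y     = [3.0, 5.0, 7.0, 9.0]        # observed values
y_hat = [2.8, 5.1, 7.2, 8.9]        # model predictions

mean_y = sum(y) / len(y)
rss = sum((yh - yi) ** 2 for yh, yi in zip(y_hat, y))
tss = sum((mean_y - yi) ** 2 for yi in y)
r2 = 1 - rss / tss
print(round(r2, 3))
```

An R² close to 1 means the model explains nearly all the variance around the mean; here RSS = 0.1 and TSS = 20, giving about 0.995.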
⚫ Gradient descent updates the weight along the negative gradient of the loss:
𝑤ₖ₊₁ = 𝑤ₖ − 𝜂∇𝑓𝑤ₖ(𝑥ⁱ)
𝜂 is the learning rate. 𝑖 indicates the 𝑖-th data sample.
• Typically, gradient descent decreases the loss steadily, but within a very small range around
the target point the gradient may fluctuate.
Machine Learning Training Methods - Gradient Descent (2)
⚫ Batch gradient descent (BGD) uses the sum of gradients of all 𝑚 samples of the dataset at the current
point to update the weight parameter.
w_{k+1} = w_k − (η/m)·Σᵢ₌₁ᵐ ∇f_{w_k}(xⁱ)
⚫ Stochastic gradient descent (SGD) uses the gradient of one randomly chosen sample of the dataset at the
current point to update the weight parameter.
w_{k+1} = w_k − η·∇f_{w_k}(xⁱ)
⚫ Mini-batch gradient descent (MBGD) combines the features of BGD and SGD, and chooses the gradients
of 𝑛 samples in a dataset each time to update the weight parameter.
w_{k+1} = w_k − (η/n)·Σᵢ₌ₜ^{t+n−1} ∇f_{w_k}(xⁱ)
Machine Learning Training Methods - Gradient Descent (3)
⚫ Comparison of gradient descent methods
◼ SGD randomly chooses samples for each training pass, causing instability. As a result, the loss
function fluctuates or even produces reverse displacement during the process of dropping to the
minimum.
◼ BGD is the most stable, but it consumes too many compute resources. MBGD is a balance between
BGD and SGD.
BGD
Use all training samples for each training pass.
SGD
One training sample is used for each training pass.
MBGD
A certain number of training samples are used for
each training pass.
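The three update rules can be compared on a toy one-weight linear model; the dataset, learning rate, and iteration count below are illustrative assumptions.

```python
import random

# Toy dataset generated from y = 2x, so the optimal weight is w = 2.
data = [(x, 2.0 * x) for x in [0.5, 1.0, 1.5, 2.0, 2.5]]
eta = 0.05  # learning rate (an illustrative choice)

def grad(w, sample):
    """Gradient of the squared error 0.5*(w*x - y)^2 with respect to w."""
    x, y = sample
    return (w * x - y) * x

def bgd_step(w, samples):
    # BGD: average the gradients of all m samples.
    return w - eta * sum(grad(w, s) for s in samples) / len(samples)

def sgd_step(w, samples, rng):
    # SGD: gradient of one randomly chosen sample.
    return w - eta * grad(w, rng.choice(samples))

def mbgd_step(w, samples, n, rng):
    # MBGD: average the gradients of a mini-batch of n samples.
    batch = rng.sample(samples, n)
    return w - eta * sum(grad(w, s) for s in batch) / n

rng = random.Random(0)
w_b = w_s = w_m = 0.0
for _ in range(200):
    w_b = bgd_step(w_b, data)
    w_s = sgd_step(w_s, data, rng)
    w_m = mbgd_step(w_m, data, 2, rng)
print(round(w_b, 2), round(w_s, 2), round(w_m, 2))  # all three approach 2.0
```

BGD moves smoothly, SGD fluctuates from step to step, and MBGD sits in between, matching the comparison above.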
(Figure: Model → Train, with hyperparameters used to control training.)
⚫ Hyperparameters are configurations outside the model.
Common hyperparameters
• Parameters are a part of a model that is learned from historical training data and key to
machine learning algorithms. They have the following characteristics:
• Examples:
Hyperparameter search general process:
1. Divide a dataset into a training set, validation set, and test set.
2. Optimize the model parameters using the training set based on the model performance metrics.
3. Search for model hyperparameters using the validation set based on model performance metrics.
4. Perform step 2 and step 3 alternately until model parameters and hyperparameters are determined, and assess the model using the test set.
Search algorithms (step 3):
•Grid search
•Random search
•Heuristic intelligent search
•Bayesian search
⚫ Note:
◼ In a random search, a search is first performed within a broad range, and
then the range is narrowed based on the location of the best result.
◼ Some hyperparameters are more important than others and affect random
search preferences.
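A minimal sketch of grid search versus random search over two hypothetical hyperparameters; the objective function and parameter names are made up for illustration.

```python
import random

# Hypothetical objective: validation score as a function of two
# hyperparameters (best at lr = 0.1, reg = 0.01 by construction).
def validation_score(lr, reg):
    return -((lr - 0.1) ** 2 + (reg - 0.01) ** 2)

rng = random.Random(42)

# Grid search: evaluate every combination on a fixed grid.
grid = [(lr, reg) for lr in (0.001, 0.01, 0.1, 1.0) for reg in (0.001, 0.01, 0.1)]
best_grid = max(grid, key=lambda p: validation_score(*p))

# Random search: sample the same budget of points at random; in practice
# the range is then narrowed around the best result and sampled again.
samples = [(rng.uniform(0.0, 1.0), rng.uniform(0.0, 0.1)) for _ in range(12)]
best_rand = max(samples, key=lambda p: validation_score(*p))

print(best_grid)  # (0.1, 0.01) on this grid
print(best_rand)
```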
• After a dataset is divided into a fixed training set and a fixed test set, a small test set is
problematic: it implies statistical uncertainty around the estimated mean test error, making
it difficult to claim that algorithm 𝐴 works better than algorithm 𝐵 on the given task.
When the dataset has hundreds of thousands of samples or
more, this is not a serious problem. When the dataset is too small, alternative procedures
enable one to use all the samples in the estimation of the mean test error, at the price of
increased computational workload.
• In k-fold CV, 𝑘 is typically greater than or equal to 2. It usually starts from 3 and is set to 2
only when the original dataset is small. k-fold CV can effectively avoid over-learning and
under-learning, and the final result is also persuasive.
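The k-fold splitting idea can be sketched as follows, assuming only that samples are addressed by index:

```python
# A minimal k-fold split: each fold serves once as the validation set
# while the remaining folds form the training set.
def k_fold_splits(n_samples, k):
    """Yield (train_indices, validation_indices) pairs for k-fold CV."""
    indices = list(range(n_samples))
    fold_size = n_samples // k
    for i in range(k):
        start = i * fold_size
        stop = (i + 1) * fold_size if i < k - 1 else n_samples
        val = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, val

for train, val in k_fold_splits(9, 3):
    print(val)  # [0, 1, 2] then [3, 4, 5] then [6, 7, 8]
```

Every sample appears in exactly one validation fold, so all data contributes to the estimated mean test error, at the price of training k models.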
Cross-Validation (2)
Full dataset
• Simple linear regression analysis refers to regression analysis where only one independent
variable and one dependent variable exist, and their relationship can be approximately
represented by a straight line. Multiple linear regression analysis involves two or more
independent variables and the relationship between independent variables and dependent
variables is linear. The linear regression model is a straight line only when the variable x is
one-dimensional. It is a hyperplane when this variable is multi-dimensional. For example,
the price of an apartment is determined by a variety of factors such as the area, layout, and
location. Prediction of the apartment price based on these factors can be abstracted into a
linear regression problem.
⚫ Loss function: 𝐽(𝑤) = (1/2𝑚)·Σᵢ₌₁ᵐ (h_𝑤(𝑥ⁱ) − 𝑦ⁱ)²
⚫ We want the predicted value to approach the actual value as closely as possible, that is, to minimize the loss value. We
can use a gradient descent algorithm to calculate the weight parameter 𝑤 at which the loss function reaches its
minimum, thereby completing model building.
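For simple linear regression, the minimizing weight and bias can also be computed in closed form; a minimal sketch with illustrative data:

```python
# Closed-form simple linear regression (one feature):
# w = cov(x, y) / var(x), b = mean(y) - w * mean(x).
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    w = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - w * mx
    return w, b

# Illustrative data lying exactly on y = 3x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [4.0, 7.0, 10.0, 13.0]
w, b = fit_line(xs, ys)
print(w, b)  # 3.0 1.0
```

Gradient descent reaches the same minimum iteratively, which is what makes it usable when the number of features is too large for a closed-form solve.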
Linear Regression Extension - Polynomial Regression
⚫ Polynomial regression is an extension of linear regression. Because the complexity of a dataset
exceeds the possibility of fitting performed using a straight line (obvious underfitting occurs if
the original linear regression model is used), polynomial regression is used.
h_w(x) = w₁x + w₂x² + ⋯ + wₙxⁿ + b
Here, the highest power 𝑛 indicates the degree of the polynomial.
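Polynomial regression can be sketched as linear regression over expanded features; the weights below are illustrative, not fitted:

```python
# Polynomial regression reduces to linear regression over expanded
# features: map x to [x, x^2, ..., x^n] and fit a linear model on them.
def poly_features(x, degree):
    return [x ** d for d in range(1, degree + 1)]

def predict(x, weights, bias):
    # h_w(x) = w1*x + w2*x^2 + ... + wn*x^n + b
    feats = poly_features(x, len(weights))
    return sum(w * f for w, f in zip(weights, feats)) + bias

# Illustrative degree-2 model: h(x) = 1*x + 2*x^2 + 0.5
print(predict(3.0, [1.0, 2.0], 0.5))  # 3 + 18 + 0.5 = 21.5
```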
(Figure: softmax classification examples: boy? girl?; apple? orange? banana? grape?)
p(y = k | x; w) = exp(wₖᵀx) / Σₗ₌₁ᴷ exp(wₗᵀx),  k = 1, 2, …, K
Example output (Class: Probability): Grape: 0.09; Banana: 0.01
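The softmax formula above can be sketched directly; the weight vectors and input are illustrative assumptions, and subtracting the maximum score is a standard numerical-stability trick:

```python
import math

# Softmax over class scores w_k^T x for K classes.
def softmax_probs(W, x):
    scores = [sum(wi * xi for wi, xi in zip(w, x)) for w in W]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

W = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # one weight vector per class
x = [2.0, 1.0]
probs = softmax_probs(W, x)
print([round(p, 3) for p in probs])
print(round(sum(probs), 6))  # the probabilities sum to 1.0
```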
(Figure: a decision-tree example that guesses an animal by asking small or large, short-nosed or long-nosed,
stays on land or stays in water, leading to "it could be a squirrel / rat / giraffe / elephant".)
• How to construct a decision tree is very important. We should determine the topological
structure of feature attributes based on attribute selection measures. The key step is to
split attributes. That is, different branches are constructed based on the differences of a
feature attribute on a node.
• The common learning algorithms used to generate a decision tree include ID3, C4.5, and
CART.
Structure of a Decision Tree
Root node
Subnode
Subnode
• Nodes that have no subnodes are called leaf nodes.
Key to Decision Tree Construction
⚫ A decision tree requires feature attributes and an appropriate tree structure. The key step of
constructing a decision tree is to divide data of all feature attributes, compare the result sets in terms of
purity, and select the attribute with the highest purity as the data point for dataset division.
⚫ Purity is measured mainly through the information entropy and GINI coefficient. The formula is as
follows:
H(X) = −Σₖ₌₁ᴷ pₖ·log₂(pₖ)        Gini = 1 − Σₖ₌₁ᴷ pₖ²
For CART regression trees, the optimal split (j, s) minimizes:
min_{j,s} [ min_{c₁} Σ_{xᵢ∈R₁(j,s)} (yᵢ − c₁)² + min_{c₂} Σ_{xᵢ∈R₂(j,s)} (yᵢ − c₂)² ]
⚫ 𝑝𝑘 indicates the probability that a sample belongs to category 𝑘 (in a total of K categories). A larger purity difference
between the sample before and after division indicates a better decision tree.
⚫ Common decision tree algorithms include ID3, C4.5, and CART.
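The two purity measures can be sketched as:

```python
import math

# Purity measures from the slide: information entropy and Gini coefficient,
# computed from the class probabilities p_k at a node.
def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def gini(probs):
    return 1.0 - sum(p * p for p in probs)

# A 50/50 node is the least pure; a 90/10 node is much purer.
print(entropy([0.5, 0.5]), gini([0.5, 0.5]))  # 1.0 0.5
print(round(entropy([0.9, 0.1]), 3), round(gini([0.9, 0.1]), 3))
```

A split is chosen by comparing the purity before the split with the weighted purity of the child nodes; a larger difference means a better split.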
Mapping
(Figure: a two-dimensional dataset with two sample categories. Both the division methods on the left and
right can divide the data, but which is correct?)
Maximize the
distance from each
support vector to
the line
Common kernel functions: linear kernel, polynomial kernel, Gaussian kernel, sigmoid kernel.
(Figure: mapping from the input space to a high-dimensional feature space.)
P(Cₖ | X₁, …, Xₙ) = P(X₁, …, Xₙ | Cₖ)·P(Cₖ) / P(X₁, …, Xₙ)
◼ 𝑋1 , 𝑋2 , … , 𝑋𝑛 are data features, which are usually described by m measurement values of the attribute set.
◼ For example, the attribute of the color feature may be red, yellow, and blue.
• Class conditional independence: The Bayes classifier assumes that the effect of an attribute
value on a given class is independent of the values of other attributes. This assumption is
made to simplify the calculation and becomes "naive" in this sense.
• Bayes classifiers show high accuracy and speed when applied to large databases.
Naive Bayes (2)
⚫ Feature independent hypothesis example:
◼ If a fruit is red, round, and about 10 cm in diameter, it can be considered an apple.
◼ A Naive Bayes classifier believes that each of these features independently contributes to the
probability of the fruit being an apple, regardless of any possible correlation between color,
roundness, and diameter features.
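The independence assumption can be sketched with made-up probability tables for the apple example (all numbers are illustrative):

```python
# Naive Bayes scoring under the feature-independence assumption:
# P(C_k | X) is proportional to P(C_k) * product over i of P(X_i | C_k).
prior = {"apple": 0.5, "other": 0.5}
likelihood = {
    "apple": {"red": 0.8, "round": 0.9, "10cm": 0.7},
    "other": {"red": 0.3, "round": 0.4, "10cm": 0.2},
}

def posterior(features):
    scores = {}
    for c in prior:
        score = prior[c]
        for f in features:
            score *= likelihood[c][f]  # independence: multiply per-feature terms
        scores[c] = score
    total = sum(scores.values())  # normalize by the evidence P(X)
    return {c: s / total for c, s in scores.items()}

p = posterior(["red", "round", "10cm"])
print(round(p["apple"], 3))  # a high probability for "apple"
```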
(Figure: ensemble learning: a large training set is sampled into subset 1 … subset n, each trains a model
producing prediction 1 … prediction n, and the predictions are combined into an ensemble model.)
k-means clustering automatically classifies unlabeled data.
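A minimal one-dimensional k-means sketch; the data and k = 2 are illustrative:

```python
import random

# k-means in one dimension: alternate an assignment step (each point to
# its nearest center) and an update step (each center to its cluster mean).
def kmeans_1d(xs, k, iters=20, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(xs, k)  # random initial centers
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in xs:  # assignment step
            i = min(range(k), key=lambda j: abs(x - centers[j]))
            clusters[i].append(x)
        for i, c in enumerate(clusters):  # update step
            if c:
                centers[i] = sum(c) / len(c)
    return sorted(centers)

data = [1.0, 1.2, 0.8, 10.0, 10.5, 9.5]
print(kmeans_1d(data, 2))  # centers near 1.0 and 10.0
```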
⚫ This course first describes the definition and types of machine learning, as well as
problems machine learning solves. Then, it introduces key knowledge points of
machine learning, including the overall procedure (data preparation, data cleansing,
feature selection, model evaluation, and model deployment), common algorithms
(including linear regression, logistic regression, decision tree, SVM, Naive Bayes, k-
NN, ensemble learning, and k-means clustering), and hyperparameters.
• Answers: 1. D 2. False
Recommendations
⚫ Huawei Talent
https://e.huawei.com/en/talent/portal/#/
1. Perceptron
5. Transformer Architecture
Problem analysis → Task locating → Machine learning algorithm
• Problem analysis and task locating: Locate the problem, convert the problem into a
machine learning problem, determine required data, and then collect the required data.
This part includes data exploration, scenario abstraction, and algorithm selection. For
example, logistic regression can be used to determine whether we can drink coffee, and the
Naive Bayes algorithm can be used to distinguish junk mails from normal ones. AI systems
do not detect these directly; instead, they gather data from doctors, such as the human
body's tolerance to caffeine and whether caffeine causes sleeplessness.
• Data cleansing: Data cleansing is the process of rechecking and validating data to remove
duplicate information, correct errors, and ensure data consistency. It also covers sampling
(random, systematic, and stratified), normalization, noise removal, and data filtering.
• Feature selection
• Model training
⚫ The figure on the left shows an example of a perceptron that receives two
input signals: x1 and x2 are the input signals, y is the output signal, and w1
and w2 are the weights.
⚫ The circles in the figure are neurons or nodes. When the input signals are
sent to the neuron, they are multiplied by fixed weights (w1·x1 and w2·x2).
The neuron calculates the sum of the input signals and outputs 1 only
when the sum exceeds a certain threshold. This is also called neuron
activation.
• In 1957, Frank Rosenblatt at the aviation lab of Cornell University invented the
perceptron artificial neural network. Considered the simplest form of a feedforward neural
network, it is a binary linear classifier whose activation function is the sign
function sign(x). The perceptron was the first practical application of the artificial neural
network, marking a new stage of neural network development. After implementing the
perceptron in software, Frank Rosenblatt started to build the hardware perceptron
Mark I. He used 400 optoelectronic devices as neurons and adjustable potentiometers as
synaptic weights; motors adjusted the potentiometers to implement weight changes
during learning. He built the hardware perceptron in this way and used it for image
recognition. Due to technical limitations at the time, physically implemented
perceptrons were rare. This image recognition system drew attention from many parties
and received a large amount of funding from the U.S. Navy.
• Signals mentioned here can be seen as something that has mobility like an electric current
or a river. Similar to an electric current in a wire sending electrons forward, perceptron
signals form a flow and transmit information forward. The difference from an electric
current is that a perceptron signal has only two values, current (1) and no current (0). In
this document, 0 indicates that no signal is transmitted, and 1 indicates that signals are
transmitted.
Mathematical Representation of a Perceptron
⚫ Input signals of the perceptron have their own inherent weights, which determine the
importance of each signal. A larger weight indicates that the signal is more important.
A weight is similar to the resistance in an electric circuit. Resistance determines how difficult it is for current to flow. The lower the
resistance, the larger the current flowing through. When it comes to the perceptron, a larger weight indicates a larger signal that
passes through the perceptron. Resistance and weight play the same role in controlling the signal flow difficulty (or ease).
(w1, w2, θ) = (0.5, 0.5, 0.7) (w1, w2, θ) = (−0.5, −0.5, −0.7) (w1, w2, θ) = ?
⚫ When the OR gate truth table is expressed in a two-dimensional coordinate system, the single-layer
perceptron linearly separates the two types of values.
⚫ The XOR gate is also referred to as a logical XOR circuit. The output is 1 only when exactly one of x1 and
x2 is 1.
⚫ Can we use a single-layer perceptron to distinguish between 0 and 1 in an XOR gate?
(Figure: OR gate vs. XOR gate in the x1-x2 plane.)
Replace the question marks (?) with the AND gate, NAND gate,
and OR gate to implement the XOR gate.
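A sketch using the slide's parameters for AND and NAND (the OR threshold is an illustrative choice), showing how a two-layer combination implements XOR:

```python
# Single-layer perceptrons for AND, NAND, and OR, then XOR built by
# combining them into two layers.
def perceptron(x1, x2, w1, w2, theta):
    return 1 if w1 * x1 + w2 * x2 > theta else 0

def AND(x1, x2):  return perceptron(x1, x2, 0.5, 0.5, 0.7)
def NAND(x1, x2): return perceptron(x1, x2, -0.5, -0.5, -0.7)
def OR(x1, x2):   return perceptron(x1, x2, 0.5, 0.5, 0.2)  # theta = 0.2 is an illustrative choice

def XOR(x1, x2):
    s1 = NAND(x1, x2)   # first layer
    s2 = OR(x1, x2)
    return AND(s1, s2)  # second layer

for a in (0, 1):
    for b in (0, 1):
        print(a, b, XOR(a, b))  # outputs 0, 1, 1, 0
```

No single linear boundary separates the XOR outputs, which is why a second layer is required.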
Implementation of the XOR Gate — Two-layer Perceptron
(Figure: x1 and x2 feed two first-layer gates producing s1 and s2, which feed the output gate;
truth table of the XOR gate implementation process.)
The XOR gate can be implemented by combining the AND, NAND, and OR gates.
⚫ Question: How does a multi-layer
perceptron transform the
perceptron function from linear to
nonlinear?
𝑠𝑖𝑔𝑛(𝑥) = 1 if 𝑥 > 0, −1 if 𝑥 ≤ 0 is applied at each neuron (a1, a2, and y) of the 3-layer perceptron,
with thresholds 𝜽𝟏 and 𝜽𝟐.
If a single-layer perceptron is used, the weighted sum is decisive for discrimination, and it is a linear classifier.
Without the activation function, the hidden outputs 𝑎1 = 𝑥1·𝑤1 + 𝑥2·𝑤2 + 𝜃1 and 𝑎2 = 𝑥1·𝑤3 + 𝑥2·𝑤4 + 𝜃1 are linear, and the output
𝑎1·𝑤5 + 𝑎2·𝑤6 + 𝜃2 = 𝑤5(𝑥1·𝑤1 + 𝑥2·𝑤2 + 𝜃1) + 𝑤6(𝑥1·𝑤3 + 𝑥2·𝑤4 + 𝜃1) + 𝜃2
is still linear in 𝑥1 and 𝑥2. It is the nonlinear sign activation applied between layers that makes the
multi-layer perceptron a nonlinear classifier.
(Figure: a deep neural network with an input layer, hidden layers, and an output layer.)
• What is deep learning? In the previous course, we have learned that deep
learning is a subset of machine learning.
• Generally, the deep learning architecture is a deep neural network. The depth
refers to the number of layers in the neural network.
• The network is built by simulating the human neural network.
(Figure: a feedforward network with inputs 𝑥1-𝑥3, biases 𝑏1-𝑏3, hidden layers 1 and 2, and outputs 𝑦ⱼ⁽ˡ⁾.)
⚫ X indicates the input data.
⚫ W indicates the weight. 𝑤ⱼᵢˡ is the weight connecting node 𝑗 at layer 𝐿 to node 𝑖 at
layer 𝐿 − 1 (for example, 𝑤₄₃¹).
⚫ B is the threshold for activating a neuron, and is referred to as a bias
in a feedforward neural network.
⚫ Y is the output of a neuron, and represents the output after the
input is processed by the activation function.
• The more hidden layers, the stronger the identification capability of the neural
network.
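The notation above can be sketched as a forward pass with a sigmoid activation; the layer sizes and values are illustrative assumptions:

```python
import math

# Each layer computes y = g(Wx + b), where g is the activation function.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(W, b, x):
    # W[j][i] connects node j of this layer to node i of the previous layer.
    return [sigmoid(sum(wji * xi for wji, xi in zip(row, x)) + bj)
            for row, bj in zip(W, b)]

x = [1.0, 0.5]
W1, b1 = [[0.2, -0.4], [0.7, 0.1]], [0.1, -0.2]  # hidden layer
W2, b2 = [[1.0, -1.0]], [0.0]                    # output layer
h = layer_forward(W1, b1, x)
y = layer_forward(W2, b2, h)
print([round(v, 3) for v in y])
```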
Adding Nodes at a Single Layer
⚫ As we shift from single-layer to the multi-layer perceptrons, we realize that we can increase
both the number of layers on a network and the number of network nodes at a single layer.
How to identify different handwritten 5s?
• Algorithms that humans think of → Answer
• Features that humans think of (SIFT, HOG, etc.) → Machine learning (SVM, KNN, etc.) → Answer
• Neural network (deep learning) → Answer
(Figure: training data → feature learning and self-optimization → established identification mode;
validation and test data → identification: flowers? leaves?)
The validation set is used to evaluate the model training at the end of a phase, so that we can make
adjustments based on the validation results.
The test set is generally data that has never been seen by a formally-working model. The results
obtained with the test set directly reflect the model's performance.
• https://3ms.huawei.com/km/static/image/detail.html?fid=57074
• https://3ms.huawei.com/km/static/image/detail.html?fid=56891
How Does a Neural Network Learn?
(Figure analogy: learning → simulation test → your answer vs. the true answer → reflection.
Neural network learning process: identification result vs. true result → calculate error → parameter adjustment.)
The gradient of 𝑓 at 𝑊′ = [𝑤0′, 𝑤1′, …, 𝑤𝑛′]ᵀ is as follows:
𝛻𝑓(𝑤0′, 𝑤1′, …, 𝑤𝑛′) = [∂𝑓/∂𝑤0, ∂𝑓/∂𝑤1, …, ∂𝑓/∂𝑤𝑛]ᵀ |_{𝑊=𝑊′}
• The BGD algorithm uses the entire training set for each learning. Therefore, each update is
made towards the correct direction, ensuring convergence to the extrema in the end. A
convex function converges to a global extremum, and a non-convex function may converge
to a local extremum. However, this algorithm requires longer learning time and more
memory resources.
Stochastic Gradient Descent (SGD) Algorithm
⚫ A common variant, Incremental Gradient Descent or Stochastic Gradient Descent, has
been developed to solve the issues found in the BGD algorithm. One of its
implementations is online learning, which updates the gradient based on each sample.
∆𝑤𝑖 = −(𝜂/𝑛)·Σ_{𝑑∈𝐷} ∂C(𝑡𝑑, 𝑜𝑑)/∂𝑤𝑖  ⟹  ∆𝑤𝑖 = −𝜂·∂C(𝑡𝑑, 𝑜𝑑)/∂𝑤𝑖 (for a single sample 𝑑)
⚫ SGD:
Initialize each 𝑤𝑖 to a random value with a smaller absolute value.
Before the termination condition is met, do as follows:
◼ Randomly select <X, t> in the training set:
− Input X to this unit and calculate the output o.
− For each 𝑤𝑖 in this unit: 𝑤𝑖 += −𝜂·∂C(𝑡, 𝑜)/∂𝑤𝑖
• This gradient descent algorithm updates the weight based on each sample. Most training
samples contain noise. As a result, when the extremum is approximated to, the gradient
direction is oriented up and down near the extremum but difficult to converge to the
extremum.
Mini-batch Gradient Descent (MBGD) Algorithm
⚫ The MBGD algorithm was designed to address the issues found in the previous two gradient descent
algorithms and is now most widely used. It uses a small fixed batch size (BS) of samples to compute ∆𝑤𝑖
and update weights.
Initialize each 𝑤𝑖 to a random value with a smaller absolute value.
Before the termination condition is met, do as follows:
◼ Initialize each ∆𝑤𝑖 to zero.
◼ For each <X, t> in the next batch of samples (number of samples = BS) obtained from the training set, do as
follows:
− Input X to this unit and calculate the output o.
− For each 𝑤𝑖 in this unit: ∆𝑤𝑖 += −(𝜂/BS)·∂C(𝑡, 𝑜)/∂𝑤𝑖
◼ For each 𝑤𝑖 in this unit: 𝑤𝑖 += ∆𝑤𝑖
• This gradient descent algorithm considers both efficiency and gradient stability, and its
remaining gradient noise can even help it escape shallow local minima. It is the most
commonly used gradient descent algorithm in actual work. The batch size varies with the
specific problem and is usually set to 32.
Network Training Process
⚫ Forward propagation: 𝐶(𝑊) = ½(𝑦 − 𝑎3)² = ½(𝑦 − 𝑔(𝑊2·𝑔(𝑊1·𝑋)))².
⚫ If the parameter 𝑊1 needs to be updated, then according to 𝑊𝑡+1 = 𝑊𝑡 − 𝜂·d𝐶/d𝑊,
compute: d𝐶/d𝑊1 = (∂𝐶/∂𝑎3)·(∂𝑎3/∂𝑧3)·(∂𝑧3/∂𝑎2)·(∂𝑎2/∂𝑧2)·(∂𝑧2/∂𝑊1).
Forward propagation
Back propagation
⚫ If 𝑧 = 𝑓(𝑡) and 𝑡 = 𝑔(𝑥), the derivative of 𝑧 with respect to 𝑥 may be expressed by (the derivative of 𝑧 with
respect to 𝑡) multiplied by (the derivative of 𝑡 with respect to 𝑥).
⚫ This is the chain rule.
• Neural networks propagate forward layer by layer. How can we calculate the gradients of
all parameters based on the gradient of the loss function?
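For a scalar version of this network, the chain-rule gradient can be checked against a numerical derivative; all values are illustrative:

```python
import math

# Scalar network: z2 = w1*x, a2 = g(z2), z3 = w2*a2, a3 = g(z3),
# C = 0.5*(y - a3)^2. The chain rule gives
# dC/dw1 = (dC/da3)(da3/dz3)(dz3/da2)(da2/dz2)(dz2/dw1).
def g(z):  return 1.0 / (1.0 + math.exp(-z))  # sigmoid
def dg(z): return g(z) * (1.0 - g(z))         # its derivative

def grad_w1(x, y, w1, w2):
    z2 = w1 * x; a2 = g(z2)
    z3 = w2 * a2; a3 = g(z3)
    return (a3 - y) * dg(z3) * w2 * dg(z2) * x  # chain rule, factor by factor

# Check against a central-difference numerical derivative.
x, y, w1, w2, eps = 0.5, 1.0, 0.3, -0.8, 1e-6
def cost(w1_):
    return 0.5 * (y - g(w2 * g(w1_ * x))) ** 2
numeric = (cost(w1 + eps) - cost(w1 - eps)) / (2 * eps)
print(abs(grad_w1(x, y, w1, w2) - numeric) < 1e-6)  # True
```

Backpropagation applies exactly this chain of local derivatives layer by layer, reusing intermediate values from the forward pass.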
(Figure: a chain of layers x → z1|y1 → z2|y2 → z3|y3 → z4|y4 → C with weights w1-w4.)
⚫ The derivative of the sigmoid function is at most 1/4, and the network weight |w| is usually less than 1.
Therefore, |σ'(z)·w| ≤ 1/4. When using the chain rule, the value of ∂C/∂𝑏1 becomes smaller as the
number of layers increases, resulting in gradient vanishing.
⚫ When the network weight |w| is large, such that |σ'(z)·w| > 1, the gradient explodes.
⚫ Solution: Use gradient clipping to prevent the gradient from exploding, and use the ReLU activation
function and LSTM neural network to mitigate gradient vanishing.
• Gradient exploding occurs when the value of w is large. However, it rarely occurs when
the sigmoid activation function is used, because the value of 𝜎′(𝑧) is also related to w
(since 𝑧 = 𝑤𝑥 + 𝑏, a large w pushes z toward the saturated region where 𝜎′(𝑧) is small),
unless the input value of this layer stays within a small range.
• Gradient exploding and vanishing are caused by large network depth and unstable network
weight update. In essence, they are caused by the chain rule in gradient backward
propagation.
• Gradient clipping is proposed to solve gradient exploding. The principle is to set a gradient
clipping threshold. When being updated, if the gradient exceeds the threshold, the gradient
must be limited under the threshold. This prevents gradient exploding.
• ReLU: The principle is simple. If the derivative of the activation function is 1, gradient
vanishing and exploding do not occur, and each network layer has the same update rate.
The ReLU function is developed for this purpose.
• For long short-term memory (LSTM) networks, gradient vanishing does not easily occur
thanks to the complex gates inside the LSTM, which can carry residual memory from
previous training steps through updates. Therefore, LSTM is often used for text
generation. CNN-based LSTM combinations are also available; try them if you are
interested.
Contents
1. Perceptron
2. Fully-connected Neural Network and How It Is Trained
Deep Neural Network
Common Activation Functions
Neural Network Training
◼ Optimizers
Regularization
3. Convolutional Neural Network
4. Model Architecture Based on the Recurrent Neural Network
5. Transformer Architecture
6. Basic Foundation Model Architecture
Faster Model Training
⚫ Neural network training is very slow
when using normal methods.
⚫ To speed up the process, we can use
special tools called optimizers.
• https://3ms.huawei.com/km/static/image/detail.html?fid=57497
Optimizers
⚫ There are various improved versions of gradient descent algorithms. In object-oriented
programming, different gradient descent algorithms are often encapsulated into an object
called an optimizer.
⚫ There are several reasons for improving algorithms, including:
Accelerating algorithm convergence
Helping the algorithm avoid getting stuck in, or escape from, local extrema
Simplifying manual parameter setting, especially the learning rate
⚫ Common optimizers include the common SGD optimizer, momentum optimizer, AdaGrad,
RMSprop, and Adam.
Momentum Optimizer
⚫ Adding the momentum term to ∆𝑤𝑗𝑖 is a basic way to improve an algorithm. Assume that the
weight correction of the 𝑛-th iteration is ∆𝑤𝑗𝑖 𝑛 . The weight correction rule is:
∆𝑤ⱼᵢˡ(𝑛) = −𝜂·𝛿ᵢˡ⁺¹·𝑥ⱼˡ(𝑛) + 𝛼·∆𝑤ⱼᵢˡ(𝑛 − 1)
Where α is a constant (0 ≤ 𝛼 < 1) called the momentum coefficient, and 𝛼·∆𝑤ⱼᵢˡ(𝑛 − 1) is the
momentum term.
⚫ Imagine that a small ball rolls down from a random point on the error surface. The momentum
term gives inertia to the ball.
• 𝜂𝛿𝑖𝑙+1 𝑥𝑗𝑙 𝑛 indicates the original gradient and direction, j represents the j-th neuron at the
l-th layer, and i represents the i-th neuron at the (l+1)-th layer.
Advantages and Disadvantages of the Momentum Optimizer
⚫ Advantages:
Enhances the stability of the gradient correction direction and reduces abrupt changes.
In areas where the gradient direction is stable, the ball rolls faster and faster (there is an upper speed
limit because 𝛼 < 1), which helps the ball quickly cross flat areas and accelerates
convergence.
A small ball with inertia is more likely to roll over some narrow local extrema.
⚫ Disadvantages:
The learning rate 𝜂 and momentum 𝛼 need to be manually set. Usually, additional experiments are
necessary to determine the appropriate values.
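The momentum update can be sketched on an illustrative quadratic loss:

```python
# Momentum update from the slide: delta(n) = -eta*grad + alpha*delta(n-1),
# applied to the illustrative loss f(w) = (w - 3)^2 with minimum at w = 3.
def grad(w):
    return 2.0 * (w - 3.0)

eta, alpha = 0.1, 0.9  # learning rate and momentum coefficient
w, delta = 0.0, 0.0
for _ in range(300):
    delta = -eta * grad(w) + alpha * delta  # the momentum term adds inertia
    w += delta
print(round(w, 3))  # converges to the minimum w = 3
```

With 𝛼 = 0.9 the iterate oscillates around the minimum like a ball with inertia before settling, which is exactly the behavior described above.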
𝑔𝑡 indicates the t-th gradient, and 𝑟𝑡 is the gradient accumulation variable. The initial value of 𝑟 is 0, and it increases
continuously. 𝜂 indicates the global learning rate, which needs to be set manually. 𝜀 is a small constant set to 10⁻⁷
for numerical stability.
⚫ 𝑔𝑡 indicates the t-th gradient, and 𝑟 is the gradient accumulation variable. The initial value of 𝑟 is 0;
instead of growing without bound, the accumulated value is decayed at each step by the attenuation
factor 𝛽. 𝜂 indicates the global learning rate, which needs to be set manually. 𝜀 is a small constant
set to 10⁻⁷ for numerical stability.
Adam Optimizer (1)
⚫ Adaptive Moment Estimation (Adam) was developed based on AdaGrad and AdaDelta. It
maintains two additional variables, 𝑚𝑡 and 𝑣𝑡 , for each variable to be trained.
𝑚𝑡 = 𝛽1 𝑚𝑡−1 + (1 − 𝛽1 )𝑔𝑡
𝑣𝑡 = 𝛽2 𝑣𝑡−1 + (1 − 𝛽2 )𝑔𝑡2
⚫ Where 𝑡 represents the 𝑡-th iteration, and 𝑔𝑡 is the calculated gradient. 𝑚𝑡 and 𝑣𝑡 are moving
averages of the gradient and square gradient, respectively. From a statistical perspective,
𝑚𝑡 and 𝑣𝑡 are estimates of the gradient's first moment (the average value) and the second
moment (the uncentered variance), respectively, hence the name of the method.
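The Adam update can be sketched for a single parameter on an illustrative quadratic loss; the hyperparameter values follow common defaults, which is an assumption here:

```python
import math

# Adam on the illustrative loss f(w) = (w - 3)^2, minimum at w = 3.
def grad(w):
    return 2.0 * (w - 3.0)

eta, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 501):
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g        # first moment (moving average)
    v = beta2 * v + (1 - beta2) * g * g    # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)           # bias correction
    v_hat = v / (1 - beta2 ** t)
    w -= eta * m_hat / (math.sqrt(v_hat) + eps)
print(round(w, 2))  # ≈ 3.0
```

The bias-correction terms compensate for 𝑚𝑡 and 𝑣𝑡 being initialized at zero, which would otherwise bias the early estimates toward zero.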
• https://3ms.huawei.com/km/static/image/detail.html?fid=64716
Overfitting
⚫ Problem description: The model performs well with the training set, but poorly with the test set.
⚫ Root cause: There are too many feature dimensions, model assumptions, and parameters, too much
noise, but very little training data. As a result, the fitting function predicts the training set almost
perfectly, but the prediction result for a new test data set is unsatisfactory. Training data is overfitted
without considering the model's generalization capability.
(Figure: three fits of y against x: underfitting (features not learned), good fitting, overfitting (noise learned).)
• Overfitting is caused by insufficient data and complex model. In deep learning, small
training data size and high model complexity are two common problems. They result in high
prediction accuracy with training data, but low accuracy with test data. This is how
overfitting occurs.
▫ Obtain more data: Obtain more data from data sources and perform data
augmentation.
▫ Use a proper model: Reduce the number of network layers and the number of
neurons to limit the fitting capability of the network.
▫ Dropout
▫ Using multiple models: Bagging uses diverse models to fit different parts of the
training set, while Boosting uses only simple models (for example, shallow neural networks).
𝐿1 Regularization
⚫ Add the norm constraint 𝐿1 to model parameters, that is,
𝐽ሚ 𝑤; 𝑋, 𝑦 = 𝐽 𝑤; 𝑋, 𝑦 + 𝛼 𝑤 1
⚫ If a gradient method is used to resolve the value, the parameter gradient is:
𝛻 𝐽ሚ 𝑤 = 𝛼𝑠𝑖𝑔𝑛 𝑤 + 𝛻𝐽 𝑤
⚫ The parameter optimization method is:
𝑤𝑡+1 = 𝑤 − 𝜀𝛼𝑠𝑖𝑔𝑛(𝑤) − 𝜀𝛻𝐽(𝑤)
𝐿1 regularization can reduce parameter values directly to 0, which makes it different from 𝐿2
regularization. The 𝐿2 optimization method does not directly reduce the parameter value to 0 but
to a value close to 0.
𝐿2 Regularization
⚫ Add norm penalty term 𝐿2 to prevent overfitting.
𝐽ሚ(𝑤; 𝑋, 𝑦) = 𝐽(𝑤; 𝑋, 𝑦) + (𝛼/2)·‖𝑤‖₂²
A parameter optimization method can be inferred using an optimization technology (such as a
gradient method):
𝑤 = 1 − 𝜀𝛼 𝑤 − 𝜀𝛻𝐽(𝑤)
where 𝜀 is the learning rate. Compared with a common gradient optimization formula, this
formula multiplies the parameter by a reduction factor.
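The reduction-factor effect can be sketched by comparing plain gradient descent with the L2 update on an illustrative one-parameter loss:

```python
# L2-regularized step from the slide: w <- (1 - eps*alpha)*w - eps*grad_J(w).
# Compared with plain gradient descent, the weight is first multiplied by
# the reduction factor (1 - eps*alpha), i.e. "weight decay".
def grad_J(w):
    return 2.0 * (w - 1.0)  # loss J(w) = (w - 1)^2, minimum at w = 1

eps_lr, alpha = 0.1, 0.5  # learning rate and regularization strength
w_plain, w_l2 = 3.0, 3.0
for _ in range(200):
    w_plain -= eps_lr * grad_J(w_plain)
    w_l2 = (1 - eps_lr * alpha) * w_l2 - eps_lr * grad_J(w_l2)
print(round(w_plain, 3), round(w_l2, 3))  # 1.0 0.8
```

The regularized weight settles at 0.8 rather than the unregularized minimum 1.0: the penalty pulls parameters toward the origin, trading a little training fit for lower model capacity.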
• Add a regular term Ω(θ) = ½‖𝑤‖₂² to the target function to make the weight closer to
the origin.
(Figure: validation-set and training-set error vs. training round.)
• Generally, a relatively large number (n) of iterations is set. First we need to save the current
model, including the network structure and weights, and train the model for num_batch
times (that is, an epoch) to obtain a new model. Then, use the test set as the input of the
new model for testing. If the test error is greater than the last test error, we do not stop the
test immediately. Instead, we should continue to conduct the training and test for several
epochs. If the error does not decrease, we stop the model training when we believe that
the minimum error is achieved in the last test.
• The common practice is to record the best test set accuracy p so far. If the accuracy does
not exceed p in m consecutive periods, it may be considered that p does not increase
anymore, and iteration may be stopped early (early stopping).
• As shown in the figure, the test error gradually decreases in the first several epochs.
However, it slightly increases in a certain epoch. This indicates that overfitting has occurred.
• Early stopping is to stop training before the test error starts to increase, even if the training
has not converged, that is, the training error has not reached the minimum.
Dropout
⚫ Dropout is a common simple regularization method, which has been widely used since 2014. Simply put,
dropout randomly discards some inputs during training. Parameters corresponding to the discarded
inputs are not updated.
⚫ Dropout is an integration method. It combines all sub-network results and obtains sub-networks by
randomly dropping inputs.
(Figure: the full network y = f(h₁, h₂; x₁, x₂) and the sub-networks obtained by randomly dropping
input and hidden units; dropout combines the results of all these sub-networks.)
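A sketch of dropout on one layer's activations, using the common "inverted dropout" form; scaling kept units by 1/p is an implementation convention, not stated in the slide:

```python
import random

# Inverted dropout: each unit is kept with probability p and rescaled by
# 1/p so the expected activation is unchanged between training and test.
def dropout(activations, p, rng):
    out = []
    for a in activations:
        if rng.random() < p:
            out.append(a / p)  # keep and rescale
        else:
            out.append(0.0)    # drop: this unit's parameters get no update
    return out

rng = random.Random(0)
h = [0.3, 1.2, -0.7, 0.5]
print(dropout(h, 0.5, rng))  # roughly half the units are zeroed
```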
• When the samples of each test round are input into the neural network for training, a
probability p is set, so that each neuron has a specific probability of dying and being
excluded from network training.
• The process is as follows:
1. Temporarily delete neurons in the hidden layer of the network with probability p,
keeping the input and output neurons unchanged, and back up the parameters of the
deleted neurons.
2. Propagate the input x forward through the modified network, and propagate the loss
back through the same modified network. After this is done for a small batch of
training samples, update the parameters (w, b) of the undeleted neurons with the SGD
algorithm; the parameters of the deleted neurons remain the same as before the
deletion.
3. Restore the deleted neurons, then repeat the process with a newly sampled subset of
temporarily deleted neurons.
• The sampling probability of each entry is 0.8 for the input and 0.5 for the hidden layers.
• Advantages:
▫ Compared with weight decay and norm constraints, this strategy is more effective.
▫ It is computationally cheap and simple and can be used in non-deep-learning models.
▫ However, it is less effective when the training data is insufficient.
▫ Stochasticity is neither necessary nor sufficient to achieve the regularizing effect of
dropout. Fixed masking parameters can be constructed to obtain good solutions.
• In addition to the preceding methods, we can also use semi-supervised learning, multi-task
learning, early stopping, parameter sharing, ensemble methods, and adversarial training.
Contents
1. Perceptron
5. Transformer Architecture
• A filter matrix is a set of fixed weights and can be seen as a constant filter (kernel). The
convolution (weighting each element by the kernel and summing) is performed between an image
(data in different data windows) and a kernel. This type of network is called a CNN.
• Local receptive field: It is generally considered that human perception of the outside world
is from local to global. Spatial correlations among local pixels of an image are closer than
those among pixels that are far away. Therefore, each neuron does not need to know the
global image. It only needs to know the local image and then the local information is
combined at a higher level to generate global information. The idea of local network
connection is also inspired by the biological visual system structure. The neurons in the
visual cortex receive local information (respond to stimuli of certain regions).
• Parameter sharing: One or more filters can be used to scan the input images. The
parameters of the filter are weights. At the layers scanned by the same filter, each filter
uses the same parameters to perform weighted computation. Weight sharing means that
the parameter values of each filter do not change when the filter scans the entire image.
For example, if we have three feature filters and each filter scans the entire image, the
parameter values of each filter do not change during the scanning process. In other words,
every position of the image is processed with the same weights.
Main CNN Concepts
⚫ Local receptive field: It is believed that humans perceive the outside world from local
to global. The spatial correlations among an image's local pixels are closer than those
among the pixels that are far away. As such, each neuron does not need to know the
global image; it only needs to know the local image. Then, the local information is
combined at a higher level to generate global information.
⚫ Parameter sharing: One or more convolution cores may be used to scan input images.
Parameters carried by the convolution cores are weights. In a layer scanned by
convolution cores, each core uses the same parameters for weighted computation.
Weight sharing means that, when each convolution core scans an entire image, its
parameters are fixed.
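Parameter sharing can be made concrete with a small sketch. The sizes below are illustrative; the point is that a convolutional layer's parameter count depends only on the kernel, not on the image size, because the same weights scan every position.

```python
import torch
import torch.nn as nn

# A single set of 3x3 kernels scans the whole image with shared weights:
# the parameter count is independent of the image size.
conv = nn.Conv2d(in_channels=1, out_channels=3, kernel_size=3, padding=1)

image = torch.randn(1, 1, 32, 32)   # [N, C, H, W]
features = conv(image)

print(features.shape)               # torch.Size([1, 3, 32, 32])
# 3 filters x (3*3 weights + 1 bias) = 30 parameters, regardless of H and W.
print(sum(p.numel() for p in conv.parameters()))  # 30
```

Doubling the image to 64x64 changes the output size but leaves the 30 shared parameters untouched.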
(Figure: the network output is vectorized into class scores, e.g., Pbird, Pfish, Pdog, Pcat.)
• Rectified linear units layer (ReLU layer): uses ReLU f(x) = max(0, x) as the activation
function.
• Pooling layer: partitions the features obtained from the convolutional layer into regions and outputs the maximum or average value of each region, generating new features with a smaller spatial size.
• Fully-connected layer: integrates all local features into global features to compute the final
scores for each type.
(Figure: a feed-forward network mapping an input tensor to an output tensor through layers F1 … Fn, each with weights W1, b1 … Wn, bn followed by an activation.)
• The actual classification networks are feed forward networks that are formed by
interconnected convolutional and pooling layers. The pooling layer has the following
functions:
▫ Invariance: Max pooling ensures invariance within a certain range, because the maximum value of a region is the output value regardless of where it appears within the region.
▫ Reducing the input size for the next layer: Pooling effectively reduces the size of the
input data for the next layer, the number of parameters, and computation workload.
▫ Obtaining fixed-length data: By properly setting the pooling window size and stride,
we can obtain fixed-length outputs from variable-length inputs.
▫ Increasing the scale: The features of the upper layer can be extracted from a larger
scale.
▫ Preventing overfitting: Pooling simplifies the network and reduces the fitting
precision. Therefore, it can prevent overfitting (note the possible underfitting).
Fully-connected Layer
⚫ The fully-connected layer is essentially a classifier. The features extracted on the convolutional
and pooling layers are straightened and placed at the fully-connected layer to output and
classify results.
Straighten
⚫ Generally, the Softmax function is used as the activation function of the final fully-connected
output layer to combine all local features into global ones and compute the score of each class.
softmax(z_j) = e^{z_j} / Σ_k e^{z_k}
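The softmax formula can be computed directly; the scores below are illustrative raw outputs of a final fully-connected layer.

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the result is unchanged.
    e = np.exp(z - np.max(z))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])   # raw class scores from the FC layer
probs = softmax(scores)
print(probs)            # ~[0.659, 0.242, 0.099]
print(probs.sum())      # 1.0 — a valid probability distribution
```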
1. Perceptron
2. Fully-connected Neural Network and How It Is Trained
3. Convolutional Neural Network
4. Model Architecture Based on the Recurrent Neural Network
◼ RNN
LSTM
Seq2Seq
5. Transformer Architecture
6. Basic Foundation Model Architecture
⚫ S_t traverses multiple hidden layers, and then traverses the fully-connected layer V to obtain the final output O_t at time t: O_t = softmax(S_t V).
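A single RNN time step can be sketched as follows. The slide gives only the output mapping O_t = softmax(S_t V); the tanh recurrence and the weight names U and W below are conventional assumptions, and all shapes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# One RNN time step: the hidden state S_t is produced from the input X_t
# and the previous state S_{t-1}, then mapped through the output matrix V.
U = rng.normal(size=(4, 8))   # input-to-hidden weights (assumed)
W = rng.normal(size=(8, 8))   # hidden-to-hidden weights (assumed)
V = rng.normal(size=(8, 3))   # hidden-to-output weights

x_t = rng.normal(size=(4,))
s_prev = np.zeros(8)

s_t = np.tanh(x_t @ U + s_prev @ W)   # assumed recurrence
o_t = softmax(s_t @ V)                # O_t = softmax(S_t V)
print(o_t.shape)  # (3,) — a distribution over 3 output classes
```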
• z_t is the value before the memory cell enters the activation function. The error term satisfies δ_t = ∂C/∂z_t, accumulated recursively across time steps through the recurrent weights W^T.
⚫ While the standard RNN architecture solves the problem of information memory, long-term information attenuates as it propagates through time.
Long-term information needs to be saved in many tasks. For example, a hint at the beginning of a speculative
fiction may not be explained until the end.
RNN may not be able to save long-term information due to the limited memory cell capacity.
We expect that memory cells can remember key information.
1. Perceptron
2. Fully-connected Neural Network and How It Is Trained
3. Convolutional Neural Network
4. Model Architecture Based on the Recurrent Neural Network
RNN
◼ LSTM
Seq2Seq
5. Transformer Architecture
6. Basic Foundation Model Architecture
⚫ The decision is made by the forget gate. This gate reads h_{t−1} and x_t, and outputs a value between 0 and 1 for each element of the cell state C_{t−1}. The value 1 means the information is completely retained, while the value 0 means it is completely discarded.
(Figure: the LSTM cell, shown repeatedly with each gate highlighted — inputs h_{t−1} and x_t, cell state C_{t−1} → C_t, forget gate f_t, input gate i_t, candidate state C̃_t, and output gate o_t.)
f_t = σ(W_f · [h_{t−1}, x_t] + b_f)
C̃_t = tanh(W_C · [h_{t−1}, x_t] + b_C)
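One LSTM step can be written out directly from the gate equations. This is a minimal sketch: the weight layout (all four gates packed into one matrix) and the tensor sizes are illustrative assumptions, not the slide's notation.

```python
import torch

torch.manual_seed(0)

# One LSTM time step, following the standard gate equations
# (f_t, i_t, o_t via sigmoid; candidate state via tanh).
def lstm_step(x_t, h_prev, c_prev, W, b):
    z = torch.cat([h_prev, x_t]) @ W + b           # all four gates at once
    f, i, o, g = z.chunk(4)
    f, i, o = torch.sigmoid(f), torch.sigmoid(i), torch.sigmoid(o)
    g = torch.tanh(g)                               # candidate state C~_t
    c_t = f * c_prev + i * g                        # forget old, add new
    h_t = o * torch.tanh(c_t)                       # expose part of the cell
    return h_t, c_t

hidden, inputs = 5, 3                               # illustrative sizes
W = torch.randn(hidden + inputs, 4 * hidden)
b = torch.zeros(4 * hidden)
h, c = torch.zeros(hidden), torch.zeros(hidden)
h, c = lstm_step(torch.randn(inputs), h, c, W, b)
print(h.shape, c.shape)  # torch.Size([5]) torch.Size([5])
```

In practice torch.nn.LSTM implements exactly this recurrence efficiently over whole sequences.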
(Figure: RNN input/output patterns — one to one, one to many, many to one, and two many-to-many variants.)
1. Perceptron
2. Fully-connected Neural Network and How It Is Trained
3. Convolutional Neural Network
4. Model Architecture Based on the Recurrent Neural Network
RNN
LSTM
◼ Seq2Seq
5. Transformer Architecture
6. Basic Foundation Model Architecture
⚫ Models such as RNN and LSTM can be selected for the encoder and decoder.
Seq2Seq
⚫ Sequence-to-sequence (Seq2Seq): A sequence is input and another sequence is output. This is a typical
encoder-decoder model. The encoder encodes the input data, and the decoder decodes the encoded
data. It is also a solution to the time sequence problem.
Encoder Decoder
⚫ The encoding part is marked blue. The first <eos> indicates the end of encoding. The decoding part is left
white. <bos> is the first input, indicating the start of decoding. The second <eos> indicates the end of
decoding.
The encoder converts the hidden states of all time steps into a context variable c by using a selected function.
⚫ After obtaining the decoder's hidden state, we can use the output layer and a softmax operation to calculate the conditional probability distribution P(y_{t′} | y_1, …, y_{t′−1}, c) of the output y_{t′} at time step t′.
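The encoder-decoder structure above can be sketched in a few lines. The GRU choice, vocabulary size, and dimensions are illustrative assumptions; the slide notes that plain RNNs or LSTMs work equally well.

```python
import torch
import torch.nn as nn

# Minimal Seq2Seq sketch: the encoder compresses the source sequence
# into a context variable; the decoder generates conditioned on it.
class Seq2Seq(nn.Module):
    def __init__(self, vocab=32, emb=16, hidden=32):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.decoder = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, src, tgt):
        _, context = self.encoder(self.embed(src))      # context variable c
        dec_out, _ = self.decoder(self.embed(tgt), context)
        return self.out(dec_out)                        # logits per time step

model = Seq2Seq()
src = torch.randint(0, 32, (2, 7))   # batch of 2 source sequences
tgt = torch.randint(0, 32, (2, 5))   # shifted target sequences
print(model(src, tgt).shape)         # torch.Size([2, 5, 32])
```

Applying softmax over the last dimension of the logits yields the conditional distribution over the next token at each step.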
1. Perceptron
5. Transformer Architecture
Attention Mechanism
Attention is derived from human cognitive capabilities. For example, when people observe a scene or deal with an event, they usually pay attention to the salient objects in the scene and try to grasp the main contradiction of the event. The attention mechanism enables people to focus on important aspects while ignoring minor ones, increasing their efficiency.
Nature of Attention
Attention Principle
⚫ Attention can be used without the encoder-decoder framework. The following figure shows the
principle of attention without the encoder-decoder framework.
(Figure: attention as weighted reading — "The Avengers movie is awesome. I want to learn more. Find related books. Learn about Marvel's background." A reader allocates most attention to Marvel movie books (read carefully) and little to World War II books (read quickly).)
• From the preceding modeling, we can see that the idea of attention is simple: weighted summation. As a rough analogy, humans learn a new language in four stages: rote learning (learning grammar by recitation) -> outlining points (grasping key words in sentences) -> generalizing (understanding the context and the relationships behind the language in complex conversations) -> reaching the peak of perfection (immersive practice).
• This is similar to the development of attention. The RNN era was a period of rote learning. The attention model then evolved into the Transformer, which has excellent representation learning capabilities, and later into GPT and BERT, which accumulated a wealth of practical experience through multi-task, large-scale learning.
Advantages of Attention
⚫ Fewer parameters: Compared with CNN and RNN, this model is less complex and has fewer parameters.
As such, it needs less computing power.
⚫ Higher speed: Attention addresses the fact that RNN cannot perform parallel computing. The steps in the
Attention mechanism do not depend on the previous step. As such, the Attention mechanism supports
parallel processing, just like CNN.
⚫ Better results: Before attention was introduced, models struggled to remember older information, just as people with short memories cannot recall the distant past.
⚫ The Transformer model uses a whole new formula and rejects the CNN
and RNN structures. It uses the Attention mechanism to automatically
capture the relative associations at different positions of the input
sequence, which is conducive to processing long texts. In addition, the
model enables a high degree of parallelism, speeding up training.
⚫ Transformer uses the self-attention mechanism. As the name implies, self-attention refers to
the attention mechanism used between source elements or target elements, rather than
between the source and the target. The two mechanisms use the same computing process, but
the object to be computed is different.
If you want to know what "its" refers to and which words are relevant to
"its" in this sentence, you can use "its" as a query, this sentence as a key
and a value to calculate the value of attention. Through self-attention,
we find that "its" is most relevant to "law" and "application."
⚫ The self-attention mechanism focuses more on its own relationships. Therefore, the encoder and decoder of the self-attention
mechanism can be used separately, which further generates different types of network structures.
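The self-attention computation described above can be sketched directly: queries, keys, and values are all projections of the same sequence, and the scaled dot-product scores weight the values. The sequence length and model dimension below are illustrative.

```python
import torch
import torch.nn.functional as F

def self_attention(x, Wq, Wk, Wv):
    # Q, K, V all come from the same sequence x — that is what makes
    # this "self"-attention rather than source-target attention.
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # position-to-position relevance
    weights = F.softmax(scores, dim=-1)            # attention distribution
    return weights @ V                             # weighted sum of values

torch.manual_seed(0)
seq_len, d_model = 6, 8                            # illustrative sizes
x = torch.randn(seq_len, d_model)
Wq, Wk, Wv = (torch.randn(d_model, d_model) for _ in range(3))
out = self_attention(x, Wq, Wk, Wv)
print(out.shape)  # torch.Size([6, 8])
```

In the "its" example, the attention weights row for the query position of "its" would concentrate on the positions of "law" and "application."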
100 Copyright © Huawei Technologies Co., Ltd. All rights reserved.
Contents
1. Perceptron
5. Transformer Architecture
• Transformer is used instead of LSTM because, although pre-training helps capture some language information, the LSTM model limits the prediction capability to the short term. Transformer enables us to capture longer-term language structures while achieving better generalization performance across different tasks.
• For positional encoding, GPT-1 differs sharply from the standard Transformer. The standard Transformer represents positions with sine and cosine functions, while GPT-1 uses random initialization similar to that of a word vector and updates it during training; that is, each position is treated as an embedding to be learned.
BERT
⚫ Released in 2018, BERT is a bidirectional model that analyzes the context of a complete sequence and then performs predictions. The model is trained on a plain text corpus and Wikipedia using 3.3 billion words and has 340 million parameters.
⚫ BERT can answer questions.
(Figure: BERT architecture — stacked Transformer (Trm) blocks built from multi-head attention, feed forward, and add & norm layers, with positional encoding at the input, producing outputs T1 T2 … Tn.)
(Figure: multi-task processing — instead of a separate model per task (Task A -> Model A -> Result A, Task B -> Model B, …, Task N), one comprehensive model processes multiple tasks and produces the multiple task results, Result A … Result N.)
• In actual data distribution, there are many natural subsets, such as different domains,
topics, languages, and modalities. When a single model is used for learning and the model
capacity is small, different subsets interfere with model fitting, causing slow model training
and difficult generalization.
• For a conventional learning model, training is intended to enable a final model to execute multiple tasks in different scenarios. However, when the model updates its weights for one scenario, the training also affects its weights for other scenarios. This mutual impact of the weights is called the interference effect. The stronger the interference effect, the slower the model learns and the poorer it generalizes.
MoE
⚫ Mixture-of-Experts (MoE) is a model design strategy that combines multiple models ("experts") to
achieve better prediction performance.
⚫ MoE can effectively improve the capacity and efficiency of a foundation model. Generally, MoE uses a gating mechanism to select specific experts for each input and to combine and balance the experts' outputs, ultimately determining the final prediction.
• Model size is one of the key factors for improving model performance, which is why today's foundation models are so effective. With a limited computing resource budget, training a larger model for fewer training steps is usually better than training a smaller model for more steps.
• A significant advantage of MoE is that effective pre-training can be performed with far fewer computing resources than a dense model requires. This means you can significantly scale up the model or dataset under the same computational budget. Especially in the pre-training phase, an MoE model can usually reach the same quality level faster than a dense model.
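The gating idea can be shown with a minimal sketch. This dense version weights every expert's output by the gate's softmax; production MoE layers (e.g., Switch-style) route each token to only the top-k experts for sparsity. All sizes and the class name are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    """Minimal MoE sketch: a gate weights the outputs of several experts."""
    def __init__(self, d_in=4, d_out=2, n_experts=3):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(d_in, d_out) for _ in range(n_experts))
        self.gate = nn.Linear(d_in, n_experts)

    def forward(self, x):
        weights = F.softmax(self.gate(x), dim=-1)                   # [batch, n_experts]
        outputs = torch.stack([e(x) for e in self.experts], dim=1)  # [batch, n_experts, d_out]
        # Combine expert predictions, weighted by the gate.
        return (weights.unsqueeze(-1) * outputs).sum(dim=1)

moe = TinyMoE()
y = moe(torch.randn(5, 4))
print(y.shape)  # torch.Size([5, 2])
```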
Quiz
1. (Single-choice question) Which of the following is not a deep learning neural network? ( )
A. CNN
B. RNN
C. LSTM
D. Logistic
• 1.D
Quiz
A. Activation function
B. Convolution kernel
C. Pooling
D. Fully-connected layer
• 2.ABCD
Quiz
3. (True or false) Compared with the RNN, the CNN is more suitable for image recognition. (
)
A. True
B. False
• 3.A
Quiz
A. True
B. False
• B
Summary
⚫ This chapter discusses the training process of deep neural networks, including
multiple network architectures developed from conventional artificial neural
networks. Finally, it describes the infrastructure for building large models.
⚫ Recent decades have seen explosive growth in the research and application of deep
learning. This has given rise to the third tide of Artificial Intelligence (AI)
development, achieving particular success in image recognition, speech recognition
and synthesis, autonomous driving, and machine vision. These advances require
more advanced algorithms and underlying frameworks.
⚫ Ongoing development in deep learning frameworks has made it easier to use
extensive computational resources to train neural network models on large datasets.
This section describes the use of AI frameworks and the process of developing AI
applications.
1. AI Framework
◼ Functions of AI Development Frameworks
Introduction to Mainstream AI Development Frameworks
3. PyTorch
• The emergence of deep learning frameworks lowers the bar for developers. Developers no longer need to write code for complex neural networks and backpropagation algorithms from scratch. Instead, they can use existing models and configure parameters as required, with the model parameters trained automatically. Moreover, you can add self-defined network layers to existing models, or select the required classifiers and optimization algorithms.
Functions of the AI Development Framework
⚫ Model development process: data preparation and processing -> model building -> model debugging -> model training -> accuracy evaluation -> model conversion -> inference deployment, with iterative optimization throughout.
⚫ Accelerates scientific research and development
▫ Provides easy-to-use and extensive programming APIs
▫ Allows developers to complete model development with ease
▫ Unifies data and other APIs, making scientific research and experiments easier
⚫ Enables stable and high-performance training
▫ Enables full-stack software-hardware synergy for high-performance training
▫ Provides functions such as automatic differentiation and automatic parallelization
▫ Optimizes model compilation and training
⚫ Makes inference deployment easier
▫ Provides hardware access capabilities
▫ Simplifies technical details of the underlying hardware
▫ Solves difficulties in model adaptation and deployment
(Figure: the AI development framework takes a model structure and handles forward computation, backward computation, graph optimization, and hardware-based training, so developers can quickly implement algorithms without worrying about the underlying logic.)
The AI framework usually further optimizes the computational graph to improve the
computational efficiency.
It also schedules the graph and manages the memory through the computational
graph execution process.
• After the model structure is built, the AI framework converts the model into a
computational graph and further optimizes the computational graph before execution.
Model Training and Deployment
1. AI Framework
Functions of AI Development Frameworks
◼ Mainstream AI Development Frameworks
3. PyTorch
⚫ Disadvantages:
Weak community ecosystem: Compared with deep learning frameworks such as PyTorch, JAX has a
relatively small community ecosystem and fewer documentation resources.
1. AI Framework
3. PyTorch
# Download and load the training set (when it is not stored locally).
# Transform is the transformation operation performed on data.
trainset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
Image data is usually a 4-dimensional tensor in the [N,H,W,C] format. N indicates the batch data size,
H and W indicate the pixel height and width of the image, and C indicates the number of channels.
◼ In PyTorch, the default format of received image data tensors is [N,C,H,W].
◼ In TensorFlow, the default format of received image tensors is [N,H,W,C].
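Converting between the two layouts is a single permute. The batch and image sizes below are illustrative.

```python
import torch

# Convert a batch of images between the two tensor layouts.
nhwc = torch.randn(8, 224, 224, 3)        # TensorFlow-style [N, H, W, C]
nchw = nhwc.permute(0, 3, 1, 2)           # PyTorch-style   [N, C, H, W]
print(nchw.shape)                          # torch.Size([8, 3, 224, 224])
print(nchw.permute(0, 2, 3, 1).shape)      # back to [8, 224, 224, 3]
```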
⚫ Some frameworks define data formats that can be linearly read. This method is particularly suitable for streaming data over the network, and is also useful for pre-processing when buffering any data.
▫ Example: TFRecord in TensorFlow.
▫ These data formats reduce disk I/O and network I/O overhead.
(Figure: a data file composed of scalar and block data pages, plus a separate index file.)
Inference model files support two types of data: training parameters and network models.
▫ Checkpoint (.ckpt): Uses the Protocol Buffers format and stores all parameter values of the network by default; it can also be configured to store the model structure and training status. Generally used to resume training after a training task is interrupted, or to fine-tune a task after training.
▫ ONNX (.onnx): The Open Neural Network Exchange (ONNX) is a general representation for machine learning models. Normally used for model migration between different frameworks or on an inference engine (e.g., TensorRT).
▫ bin (.bin): A common binary format used to save and load various models and data. Used when a PyTorch model needs to be converted into a common binary format.
▫ .pt or .pth: The default model file format of PyTorch, used to save and load a complete PyTorch model, including the model structure and parameters. Used, for example, to save the optimal model during training, or to load the trained model during deployment.
1. AI Framework
3. PyTorch
(Figure: a computational graph — Tensor X and Tensor W feed a Matmul node, whose result passes through Sigmoid to produce Tensor Y.)
• A node represents a variable or operation. The input nodes represent input data or
parameters, and the intermediate node and the output node represent the calculation
result.
• The nodes are connected with edges, indicating the data flow or computation
dependencies. Each edge usually connects output from one operation to input of another
operation.
From Front-end Language to a Computational Graph
⚫ A computational graph is a way to represent mathematical functions with the graph theory
language. It is a standardized method used by deep learning frameworks to express neural
network models.
Front-end programming languages and interfaces: build model
-> Intermediate model representation: computational graph IR (DAG)
-> Auto differentiation, graph optimization
-> Optimization, scheduling, and compilation

class _DenseLayer(nn.Module):
    def __init__(
        self, num_input_features: int, growth_rate: int,
        bn_size: int, drop_rate: float, memory_efficient: bool = False
    ) -> None:
        super().__init__()
        self.norm1 = nn.BatchNorm2d(num_input_features)
        self.relu1 = nn.ReLU(inplace=True)
        ...
    def forward(self, x):
        ...
# Data initialization
x = tf.Variable(tf.random.normal([512, 1024]))  # Batch size 512
W = tf.Variable(tf.random.normal([1024, 1024]))
iterations = 100
# Time the dynamic (eager) mode.
start_time = time.perf_counter()
for _ in range(iterations):
    _ = tf.matmul(x, W)  # Immediate execution mode
dynamic_duration = time.perf_counter() - start_time
# Time the static (graph) mode (TF 1.x style; assumes a Session `sess`
# and a graph output `y` were defined beforehand).
start_time = time.perf_counter()
for _ in range(iterations):
    sess.run(y, feed_dict={x: input_data.eval()})
static_duration = time.perf_counter() - start_time
PyTorch:
▫ The dynamic graph mode is used by default.
▫ You can use the torch.jit.trace and torch.jit.script functions to improve efficiency with static graphs.
TensorFlow:
▫ In TF 1.x, the static graph mode is used by default. You can run tf.enable_eager_execution() to manually enable the dynamic graph mode.
▫ In TF 2.x, the dynamic graph mode is used by default. You can use tf.function to implement static graphs.
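The torch.jit.trace path mentioned above can be sketched briefly: tracing records the operations executed on example inputs and freezes them into a static graph that can be optimized and serialized. The function and tensor sizes are illustrative.

```python
import torch

def f(x, w):
    # A tiny forward computation to trace.
    return torch.sigmoid(x @ w)

x = torch.randn(4, 8)
w = torch.randn(8, 8)

# trace runs f once on the example inputs and records the operations
# into a static graph; the traced module is callable like the original.
traced = torch.jit.trace(f, (x, w))
print(torch.allclose(traced(x, w), f(x, w)))  # True
```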
1. AI Framework
3. PyTorch
◼ Basic Modules
Implementation of the LeNet Network Structure
torch.autograd PyTorch's automatic differentiation engine, which supports gradient calculation and is mainly used for gradient
backpropagation after the loss function is obtained.
torch.optim Provides multiple optimization algorithms for model parameter updates and optimization.
torch.device Specifies the device on which tensors and models should run, allowing switching between the CPU and the GPU.
torch.distributions Provides a series of classes to enable PyTorch to sample different distributions and generate computational graphs for
the probabilistic sampling process.
torch.jit PyTorch's instant compiler module, which can convert dynamic graphs into static ones for optimization and
serialization.
torch.random Provides a series of methods to save and set the status of the random number generator, which helps debug the
structure and performance of neural networks.
torch.onnx Defines the deep learning model description files exported and loaded by PyTorch in the ONNX format, enabling
model exchanges with other deep learning frameworks.
# Define a Dataset. The following uses the open-source data in PyTorch as an example.
trainset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
# Use DataLoader to load the dataset.
trainloader = DataLoader(trainset, batch_size=64, shuffle=True)
class MLP(nn.Module):
    # Declare layers with model parameters. Here, two fully connected layers are declared.
    def __init__(self, **kwargs):
        super(MLP, self).__init__(**kwargs)
        self.hidden = nn.Linear(784, 256)
        self.act = nn.ReLU()
        self.output = nn.Linear(256, 10)

    # Define the forward computation: how the output is computed from the input.
    def forward(self, x):
        return self.output(self.act(self.hidden(x)))
⚫ You do not need to define a backpropagation function. The system automatically generates the
backward function required for backpropagation by automatically calculating the gradient.
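Automatic gradient generation can be seen on a one-line function: define the forward computation on a tensor that requires gradients, call backward(), and read the gradients that autograd computed.

```python
import torch

# Autograd builds the backward pass for us: no manual derivative code.
x = torch.tensor([2.0, 3.0], requires_grad=True)
y = (x ** 2).sum()          # y = x1^2 + x2^2
y.backward()                # autograd computes dy/dx
print(x.grad)               # tensor([4., 6.]) since dy/dx_i = 2 * x_i
```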
# Input data (Assume that the number of input images is batch_size, the number of input
channels is in_channels, and the size is height x width.)
input_tensor = torch.randn(batch_size, in_channels, height, width)
nonlinearity: nonlinear activation function. The value is tanh or relu. The default value is tanh.
batch_first: If the value is True, the first dimension of the input and output is the batch size.
1. AI Framework
3. PyTorch
Basic Modules
◼ Implementation of the LeNet Network Structure
◼ It has two convolutional layers. The first one uses # Define the LeNet network structure.
class LeNet(nn.Module):
six 5x5 convolution kernels with a stride of 1 and a
def __init__(self):
padding of 2, and outputs six feature maps. The super(LeNet, self).__init__()
second one uses 16 5x5 convolution kernels with a # Define a two-dimensional convolutional layer.
self.conv1 = nn.Conv2d(1, 6, kernel_size=5, stride=1, padding=2)
stride of 1 and no padding, and outputs 16 feature self.conv2 = nn.Conv2d(6, 16, kernel_size=5, stride=1)
maps. # Define the fully connected layer.
self.fc1 = nn.Linear(16*5*5, 120)
◼ The average pooling layer is used to reduce the self.fc2 = nn.Linear(120, 84)
dimensions of the feature maps. self.fc3 = nn.Linear(84, 10)
# Define the pooling layer.
◼ ReLU is used after the convolutional layers, and self.pool = nn.AvgPool2d(kernel_size=2, stride=2)
sigmoid or softmax (at the output layer) is used # Activation function
self.relu = nn.ReLU()
after the fully connected layer. self.sigmoid = nn.Sigmoid()
running_loss += loss.item()
if i % 100 == 99: # Print the loss every 100 batches.
print(f'[Epoch {epoch+1}, Batch {i+1}] loss: {running_loss/100:.3f}')
running_loss = 0.0
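The loss-printing snippet above sits inside a standard training loop. A minimal self-contained sketch of that loop follows; the model, random stand-in batch, and hyperparameters are illustrative assumptions, not the lab's exact setup.

```python
import torch
import torch.nn as nn

# Minimal training loop sketch around the standard four steps:
# zero gradients -> forward + loss -> backward -> optimizer step.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

images = torch.randn(64, 1, 28, 28)       # stand-in for one batch
labels = torch.randint(0, 10, (64,))

for epoch in range(2):
    optimizer.zero_grad()                  # clear old gradients
    loss = criterion(model(images), labels)
    loss.backward()                        # backpropagation
    optimizer.step()                       # parameter update
    print(f'[Epoch {epoch+1}] loss: {loss.item():.3f}')
```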
1. AI Framework
3. PyTorch
Requirement
analysis
https://PyTorch.org/get-started/locally/
File structure
Data Pre-processing
⚫ To improve the model's accuracy and ensure the generalization capability, you need to perform data
augmentation and standardization before using the data to train the model.
⚫ Load and process the dataset, including:
Reading the dataset
Defining parameters required for data augmentation and processing
Generating data augmentation operations according to the parameters
Processing the generated dataset
⚫ Parameter tuning: You can adjust different hyper-parameter combinations during training.
⚫ For details about the code, see the lab guide.
Model inference
• As the batch size and sequence length increase, the KV cache also increases, causing great
pressure on memory access.
Model Compression Method - Model Pruning
Pruning is an important model compression technology. The basic idea is to remove less
important weights and branches from a model and sparse the network structure to obtain a
model with fewer parameters. However, pruning may also cause model performance
deterioration. As such, you must find a balance between model size and performance. Neuron
connections in a neural network are mathematically represented as a weight matrix. This means
that pruning changes some elements in the weight matrix to zero elements. The following figure
shows the pruning process. The purpose is to remove less important synapses or neurons.
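Setting weight-matrix elements to zero can be sketched with simple magnitude pruning: zero out the half of the weights with the smallest absolute value. The matrix size and 50% sparsity target are illustrative.

```python
import torch

torch.manual_seed(0)
weight = torch.randn(4, 4)

# Magnitude pruning sketch: weights whose absolute value is at or below
# the median are considered "less important" and set to zero.
threshold = weight.abs().flatten().median()
mask = weight.abs() > threshold
pruned = weight * mask

print((pruned == 0).float().mean())  # 0.5 — half of the entries are now zero
```

Real pruning pipelines then fine-tune the sparse model to recover accuracy; torch.nn.utils.prune offers ready-made versions of this idea.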
1. (Multiple-choice question) Which of the following hardware devices can be used for PyTorch
training?
A. CPU
B. GPU
C. TensorFlow
D. CUDA
• Answer: AB
Recommendations
⚫ https://pytorch.org/
⚫ This chapter describes the AI business construction process, how to train a large
model (the entire training and fine-tuning process), and how to efficiently use a large
model.
1. AI Business Process
(Figure: AI business process — business requirements -> data collection -> data preprocessing -> model selection and design -> model training -> model evaluation -> model deployment.)
• https://support.huaweicloud.com/function-modelarts/index.html#
Requirement Analysis
• Personalized news push: news content push based on user interest. News reading habits
and user profiles are crucial. Common methods include content-based recommendations,
collaborative filtering, and deep learning sequence models.
• Security
• Target recognition, including identity authentication, social media filtering, and security
management. Data sets such as Labeled Faces in the Wild (LFW) and MS-Celeb-1M are
used. Algorithms include convolutional neural network (CNN), FaceNet, and ArcFace.
• Image recognition and classification, including object recognition and medical image analysis. Datasets such as ImageNet, Common Objects in Context (COCO), and ChestX-ray14 are used to train and test models. Deep learning frameworks such as ResNet, VGG, and the Inception series of models are used.
• Autonomous driving, including road sign recognition, pedestrian detection, and obstacle
avoidance. KITTI, Cityscapes, and Waymo Open Dataset are popular for autonomous
driving. These datasets use object detection methods like YOLO, SSD, and Faster R-CNN.
They also use 3D sensing methods like PointNet and RangeNet++, which work with lidar
data.
• Image generation
Natural Language Processing (NLP)
Conversational bot
Sentiment analysis
Machine translation
Speech synthesis
• Chatbot, including customer service assistant and smart speaker. The training dataset may
contain many dialog records and encyclopedia entries. Common algorithms are the
sequence-to-sequence (Seq2Seq) model, Transformer architectures like BERT and GPT, and
reinforcement learning methods.
• Sentiment analysis, including analysis of the sentiment tendency of user comments and
social media posts. The IMDb movie review dataset and Twitter sentiment analysis dataset
are commonly used benchmark datasets. Common algorithms include the bag of words
(BOW) model, TF-IDF, and deep learning models such as TextCNN, LSTM, and bidirectional
LSTM.
• Machine translation, that is, real-time translation between different languages. The
Workshop on Machine Translation (WMT) provides multiple language pair datasets for
training translation models. Google's transformer model and its variants, like Transformer-
XL and BERT-based architectures, are the mainstream algorithms in this field.
• Speech recognition and generation, including converting speech into text or text into
speech. They are the basis for voice assistants and automatic speech translation systems.
Data Preparation (1)
⚫ Data collection
Recommendation system: Collect users' historical behavior data, which may include users' purchase
history, browsing records, and rating data. Public datasets can be used, for example, the MovieLens
movie rating dataset.
CV: Obtain diversified image or video data in related fields to cover various objects, scenarios, and
conditions that the model should recognize. Consider using public datasets like ImageNet, COCO, or
PASCAL VOC as the starting point.
NLP: Use crawlers to obtain related text from the Internet, or obtain text from public datasets, books,
news, and social media platforms. In addition, questionnaires are also an effective way to collect
specific data.
Forward: z_2 = W_1 X, a_2 = g_1(z_2), z_3 = W_2 a_2, a_3 = g_2(z_3), C(W) = ½ (y − a_3)²
Backward propagation (chain rule):
∂C/∂W_1 = (∂z_2/∂W_1)(∂a_2/∂z_2)(∂z_3/∂a_2)(∂a_3/∂z_3)(∂C(W)/∂a_3)
∂C/∂W_2 = (∂z_3/∂W_2)(∂a_3/∂z_3)(∂C(W)/∂a_3)
Model Evaluation (1)
⚫ Classification task evaluation metrics
Accuracy: It is one of the most basic evaluation metrics. It measures the ratio of the number of samples that are
correctly predicted by the model to the total number of samples.
Precision: In a classification task, precision is the proportion of true positive predictions among all positive
predictions made by the model.
Recall: For binary or multi-class classification, recall is the proportion of actual positive samples that the model correctly predicts as positive, reflecting the model's ability to find positive samples.
F1-Score: The F1-Score is the harmonic mean of precision and recall. It combines precision and recall. A higher
F1 value indicates that the model achieves a better balance between precision and recall.
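The four metrics can be computed directly from a toy confusion matrix; the labels below are made up for illustration.

```python
# Toy binary labels: compute accuracy, precision, recall, and F1.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives: 3
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives: 1
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives: 1

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision = tp / (tp + fp)          # correct among predicted positives
recall = tp / (tp + fn)             # found among actual positives
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(accuracy, precision, recall, f1)  # 0.75 0.75 0.75 0.75
```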
• The trained model file format in PyTorch is PTH, while ONNX is an exchange format
developed by AI companies.
Contents
1. AI Business Process
▫ Model Fine-tuning
▫ RLHF
▫ Model Evaluation
(Figure: large model business process — business requirements -> data collection -> data preprocessing -> large model selection and design -> model deployment.)
• https://support.huaweicloud.com/intl/en-us/function-modelarts/index.html
Large Models Are More Complex with More Parameters
⚫ Large models often have billions or even more parameters and can be hundreds of gigabytes or larger.
Large models based on Transformer use the self-attention mechanism, which brings a large number of
parameters. The self-attention mechanism allows the model to independently consider each position in the
input sequence and flexibly capture long-distance dependencies, which further increases the demand for model
parameters.
For complex, high-dimensional data distribution, more parameters help the model fit better, reduce bias, and
improve training performance. In particular, when processing tasks such as natural language, the model needs
enough capacity to understand and generate various sentence structures and meanings, so as to better cope
with the richness and complexity of language.
⚫ For example, it is estimated based on documentation that GPT-3 has about 175 billion parameters.
LLaMA2 by Meta AI comes in three sizes: 7 billion (7B), 13 billion (13B), and 70 billion (70B) parameters.
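• As a rough sanity check on these numbers, a decoder-only Transformer's parameter count can be approximated from its layer count and hidden size. The 96-layer, 12288-hidden configuration below matches GPT-3's published one; the 12·d² per-layer rule of thumb and the vocabulary size are simplifying assumptions:

```python
def transformer_params(layers, hidden, vocab=50257):
    """Rough decoder-only parameter estimate: each layer has ~4*d^2
    (attention projections) + 8*d^2 (MLP with 4x expansion), plus an
    embedding matrix of vocab*d. Biases and norms are ignored."""
    per_layer = 12 * hidden * hidden
    return layers * per_layer + vocab * hidden

est = transformer_params(96, 12288)   # GPT-3-like configuration
print(f"~{est / 1e9:.0f}B parameters")  # close to the reported 175B
```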
(Table: training data by model)
Data processing: filtering and deduplication
Training data — ratio: 60%, 22%, 8%, 8%, and 3% | 67%, 15%, 4.5%, 4.5%, 2.5%, and 2.0% | ......
Training data — tokens: 300B | 1.4T | 400B
• Datasets need to be cleansed to remove errors and duplicates. The original GPT-3 data was
45 TB, but only 570 GB remained after filtering. The ratio of data used varies according to
the data quality.
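• A simplified sketch of such cleansing, assuming exact-hash deduplication and a length heuristic as the quality filter (real pipelines use far more sophisticated rules; the threshold below is invented for illustration):

```python
import hashlib

def clean_corpus(docs, min_words=5):
    """Drop exact duplicates (by content hash) and too-short documents."""
    seen = set()
    kept = []
    for doc in docs:
        text = " ".join(doc.split())          # normalize whitespace
        if len(text.split()) < min_words:     # quality filter (length heuristic)
            continue
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:                    # exact-duplicate filter
            continue
        seen.add(digest)
        kept.append(text)
    return kept

docs = ["the cat sat on the mat today",
        "the  cat sat on the mat today",   # duplicate after normalization
        "too short"]
print(clean_corpus(docs))  # only one document survives
```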
Diversity of Large Model Training Data
⚫ Training large models needs large-scale, high-quality, and multimodal datasets. This data typically needs
to be collected from various domains and sources, and can be texts, images, audio, and videos.
(Figure: the training stages — pre-training and supervised fine-tuning both use language
modeling to predict the next token; reward model training is a classification task over
(prompt, winning_response, losing_response) triples; reinforcement learning generates
responses that maximize the reward.)
• During pre-training, we use huge amounts of data from sources like web pages, Wikipedia,
books, GitHub, papers, and Q&A sites. This data contains hundreds of billions to trillions of
words. We train the model using a supercomputer with thousands of powerful GPUs and
fast networks. It takes dozens of days to complete this training and build the basic language
model (base model). The base model develops the capability to model long texts, enabling it
to generate language: based on the input prompt, it can produce text completions.
Pre-training Datasets
⚫ Data sources of large models are diverse, mainly including the following:
Public data on the Internet: The Internet holds vast amounts of public data, such as texts, images, and videos.
Developers can crawl these data on the Internet to build their datasets. Such data may come from sources like
news sites, social media, blogs, and forums.
Professional databases and data released by institutions: Many professional institutions, databases, and
academic organizations release organized and processed datasets. These datasets usually have high quality and
reliability. Large models can use these data for training to improve their performance and accuracy.
Internal enterprise data: For some customized large models, enterprises may use their internal data to build
datasets. Such data may include enterprise business data, user behavior data, and product data, helping the
model better adapt to the actual needs of enterprises.
User-generated content: With the popularity of social media and online platforms, user-generated content (such
as comments, ratings, and feedback) has also become an important source of large model datasets. Such data
reflects users' real needs and preferences, helping the model better understand and meet their needs.
▫ The number of computing units decides how the parameters are split. The data also
needs to be divided for matrix calculations.
▫ The number of computing units decides how the weight matrix w is split. In this case,
the input data X does not need to be split.
Pipeline Parallelism
⚫ The following figure shows pipeline parallelism. Assume that there are four NPUs. The parameters of the large
model are divided into four parts by layer and placed on devices 0, 1, 2, and 3 in sequence. For a minibatch of data,
forward propagation is first performed on device 0 (the gray block F0 in the figure). After the calculation is
complete, the result of device 0 is sent to device 1 to continue forward propagation (the yellow block F0). After
forward propagation completes on the last NPU, device 3, the loss is calculated there and backward propagation
begins (the purple block B0). Backward propagation proceeds back until device 0 completes, after which the
optimizer updates the model parameters on each card.
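• The schedule described above can be sketched as a toy timetable. The device count, micro-batch count, and the naive all-forwards-then-all-backwards (GPipe-style) ordering are illustrative assumptions, not any real framework's scheduler:

```python
def pipeline_schedule(devices=4, microbatches=2):
    """Return sorted (time_step, device, phase, microbatch) events for a
    naive GPipe-style schedule: all forwards first, then all backwards."""
    events = []
    # forward: micro-batch m reaches device d at step d + m
    for m in range(microbatches):
        for d in range(devices):
            events.append((d + m, d, "F", m))
    start_b = devices - 1 + microbatches  # backward begins after last forward
    # backward: micro-batch m runs on device d at step start_b + (devices-1-d) + m
    for m in range(microbatches):
        for d in reversed(range(devices)):
            events.append((start_b + (devices - 1 - d) + m, d, "B", m))
    return sorted(events)

for step, dev, phase, mb in pipeline_schedule():
    print(f"t={step}: device {dev} runs {phase}{mb}")
```

The idle steps visible in the printout (e.g. device 0 waits between F and B) are the "pipeline bubble" that more advanced schedules such as 1F1B try to shrink.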
1. AI Business Process
▫ Supervised Fine-tuning
▫ Reinforcement Learning
▫ Model Evaluation
1. AI Business Process
▫ Supervised Fine-tuning
▫ RLHF
▫ Model Evaluation
• Why use instruction tuning: Because GPT's outputs can be random, inaccurate,
uninteresting, or harmful due to its training on large datasets.
• RLHF is a standard technology used to make sure LLMs create real, harmless, and helpful
content. However, human communication is a personal and creative act, and the value of
LLM output largely depends on human values and preferences. Different models are
trained in various ways and use different human feedback, leading to varied results. How
well each model matches human values depends on its creator.
Reward Model (RM) Training
⚫ Randomly select questions from the question library. Use the model generated after
supervised instruction fine-tuning (the SFT model) to generate multiple responses for
each question.
⚫ Labeling personnel rank the responses based on a thorough evaluation. This process is
like getting guidance from a mentor or teacher.
⚫ Then, use the ranking results to train the reward model. Multiple ranking results are
paired to form multiple training data pairs.
⚫ The RM takes an input and gives a score to rate the response's quality. For each training
pair, its parameters are adjusted to make sure high-quality responses get higher scores
than low-quality ones.
(Figure: for the question "What is a banana?" selected from the question library, the SFT
model repetitively generates answers 4 times — A: A sour fruit... B: A piece of decoration...
C: Something monkeys love to eat... D: Bananas are yellow... — the answers are manually
sorted, and the sorting result is used to train the reward model.)
• Data preparation: First, prepare a batch of data for training the reward model. This data
should include text samples and their human evaluation scores. You can get these scores
from crowdsourcing platforms or expert reviews, as they reflect how people perceive text
quality.
• Model construction: Next, build a neural network as the reward model. You can use a
simple multilayer perceptron (MLP) or a more complex model like the Transformer. The key
is to ensure the model can effectively capture human evaluation criteria.
• Training process: Use the prepared data to train the reward model. The training target is to
minimize the error between the model's predicted scores and the actual human evaluation
scores. You can achieve this with optimization methods like gradient descent.
• Validation and tuning: Validate the model during training to ensure it can accurately
predict human evaluations on new data. If the model performs poorly, we may need to
adjust its structure or training method.
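• The pairwise training objective described above can be sketched as follows, assuming the commonly used Bradley-Terry-style ranking loss; the scalar scores below are placeholders for the outputs of a real scoring network:

```python
import math

def pairwise_loss(score_winner, score_loser):
    """-log(sigmoid(r_w - r_l)): near zero when the winning response
    already scores higher, large when the ranking is violated."""
    return -math.log(1.0 / (1.0 + math.exp(-(score_winner - score_loser))))

good = pairwise_loss(2.0, -1.0)   # ranking respected -> small loss
bad = pairwise_loss(-1.0, 2.0)    # ranking violated  -> large loss
assert good < bad
```

Minimizing this loss over many (winning_response, losing_response) pairs pushes the reward model to score preferred answers above rejected ones.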
Reinforcement Learning
⚫ Reinforcement learning considers the overall impact more than supervised learning.
Supervised learning gives feedback on individual tokens to help the model give the right answer. Reinforcement
learning gives feedback on the whole output instead of individual tokens. Reinforcement learning works better
for large language models due to the difference in feedback granularity.
⚫ Reinforcement learning is more effective at addressing hallucination issues.
In supervised learning, the model gives a result even if it doesn't know the answer. In reinforcement learning,
the model avoids low-scoring answers by using reward scores.
⚫ Reinforcement learning better handles reward accumulation issues in multi-round conversations.
Building effective multi-round dialog interactions and determining if the final goal is met is hard with supervised
learning. But using reinforcement learning, we can create a reward function that assesses the model's output
based on the dialog's coherence and context.
1. AI Business Process
▫ Supervised Fine-tuning
▫ RLHF
▫ Model Evaluation
• For example, when given only the LeetCode problem number, the model outputs the
correct solution. This suggests that the training data was contaminated.
• For example, our recent study, PromptBench, is the first evaluation benchmark for
prompt robustness in LLMs. We found that these models often struggle with interference
and lack stability. This led us to improve the system's fault tolerance through better
prompts.
• Large models keep evolving and becoming more powerful. Can we assess their abilities
from an evolutionary viewpoint using well-designed and scientific methods? How can we
predict potential risks in advance? These are important research topics.
Evaluation Dimensions
⚫ Large language models can handle many complex natural language tasks, unlike traditional algorithms.
Therefore, model evaluation must be performed from multiple dimensions. The evaluation dimensions
can be classified into the following three aspects:
Knowledge and capability: Large language models have rich knowledge and can handle many tasks, including
natural language processing (text classification and information extraction), knowledge Q&A (reading
comprehension and open-domain Q&A), natural language generation (text summarization and text creation),
logical reasoning, and code generation.
Ethics and security: Models are trained based on the 3H principles: helpfulness, honesty, and harmlessness.
Harmlessness ensures that the model's responses match human values.
Vertical domains, mainly covering complex reasoning capability (such as knowledge reasoning and mathematical
reasoning), environment interaction capability (for example, an agent generates actions based on language
instructions for home tasks), and problem solving capability in specific domains (such as finance, law, and
healthcare).
• https://3ms.huawei.com/km/blogs/details/14813656
Challenges in Large Model Evaluation (1)
⚫ Diverse foundation model capabilities make evaluation challenging
It is challenging to make comprehensive and fair evaluations on diverse foundation model
capabilities. For example, Stanford evaluates models across 42 scenario tasks in 7 dimensions,
SuperCLUE defines 70 scenario tasks, C-Eval selects 52 scenario tasks, and AGIEval chooses 20
standardized test tasks.
1. AI Business Process
...
• Provide as clear and specific instructions as possible to express what you want the model to
do. This will guide the model to create content as you expect and reduce the chance of
getting irrelevant or wrong answers.
• Most large models have emergent capabilities, can solve relatively complex problems, and
have a very broad knowledge coverage.
• They can understand and analyze natural language input from users and provide services
and assistance in different fields and tasks.
1. AI Business Process
Q: <Question>?
A:
Output: The description includes potential negative impacts of apples without directly mentioning any positive aspects of apples. The following is an analysis of
the positive and negative evaluations of the description:
1. Negative evaluation
• "Certain substances in apples may cause allergies in consumers": This points out the potential health risks apples may pose to certain groups, which
is a clear negative evaluation.
• "Leading to symptoms such as rashes and difficulty breathing": It lists the symptoms of allergic reactions in detail, further reinforcing the potential
negative health impacts of apples.
• "Excessive fructose and dietary fiber may cause digestive symptoms such as bloating and diarrhea": This again emphasizes the potential side effects
of apple consumption, which is a negative evaluation of the potential adverse effects of apples.
In this description, no positive attributes of apples, such as nutritional value or health benefits, are mentioned, so overall it leans towards a negative
evaluation.
2. Positive evaluation: Although there is no direct positive evaluation in the description, it can be inferred that, since apples are widely recognized as a
healthy food, the mentioned allergens and digestive issues usually do not prevent most people from consuming apples. However, this positive
perspective is based on general knowledge rather than what is provided in the description itself.
In summary, the description focuses on the potential risks of apples without explicitly expressing any positive features, so it can be judged as a negative
evaluation. Note that a comprehensive evaluation would usually also need to consider positive aspects such as the nutritional and health benefits of
apples.
• If only simple data judgment is required, the output is complex and contains a large amount
of unnecessary information.
• Models of the GPT series are all autoregressive language models. That is, such models
predict the next word based on the current input, then append the prediction to the input
to predict the following word. This process repeats cyclically.
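• The cyclic prediction loop can be sketched with a toy stand-in for the model; the bigram table below is invented purely for illustration:

```python
# Hypothetical "model": a bigram lookup table standing in for a real network.
BIGRAM = {"the": "cat", "cat": "sat", "sat": "down"}

def generate(prompt_tokens, max_new_tokens=3):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        nxt = BIGRAM.get(tokens[-1])  # "predict" the next token
        if nxt is None:               # no known continuation: stop
            break
        tokens.append(nxt)            # feed the prediction back as input
    return tokens

print(generate(["the"]))  # ['the', 'cat', 'sat', 'down']
```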
• The pre-trained GPT-3 model does not need to be retrained when migrated to a new task.
You only need to provide a task description (optional), then provide several examples (task
queries and corresponding answers, organized in pairs), and finally add the query to be
answered by the model. After the preceding content is packaged as the model input, the
model can correctly output the answer corresponding to the last query.
Few-Shot Prompting
⚫ Although large language models demonstrate impressive zero-shot capabilities, they still perform poorly
on more complex tasks when using zero-shot settings.
⚫ Few-shot prompting can be used as a technique to enable contextual learning, providing demonstrations
in the prompt to guide the model to achieve better performance. The demonstrations condition the model for
the final example, for which it is expected to generate a response.
⚫ Few-shot examples can be selected randomly.
Q: <Question>?
A: <Answer>
Q: <Question>?
A: <Answer>
Q: <Question>?
A: <Answer>
Q: <Question>?
A:
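• The template above can be assembled programmatically. The demonstration pairs in this sketch are placeholders:

```python
def build_prompt(examples, query):
    """examples: list of (question, answer) demonstration pairs."""
    lines = []
    for q, a in examples:
        lines.append(f"Q: {q}?")
        lines.append(f"A: {a}")
    lines.append(f"Q: {query}?")
    lines.append("A:")          # left open for the model to complete
    return "\n".join(lines)

demos = [("What is 2+2", "4"), ("What is 3+3", "6")]
print(build_prompt(demos, "What is 5+5"))
```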
• One-shot prompting is the special case of few-shot prompting with a single demonstration.
Example of Few-Shot Prompting
Prompt: Certain substances in apples may cause allergies in consumers, leading to symptoms such as rashes and difficulty breathing; at
the same time, excessive fructose and dietary fiber in apples may cause digestive symptoms such as bloating and diarrhea.
Evaluation: Negative
Bananas are a very beneficial fruit for the human body.
Evaluation: Positive
Durian is sweet and delicious, with very high nutritional value.
Evaluation:
Output:
Positive
1. AI Business Process
Prompt: The sum of the odd numbers in this group is even: 4, 8, 9, 15, 12, 2, 1.
A: Add all the odd numbers (9, 15, 1) to get 25. The answer is False.
The sum of the odd numbers in this group is even: 17, 10, 19, 4, 8, 12, 24.
A: Add all the odd numbers (17, 19) to get 36. The answer is True.
The sum of the odd numbers in this group is even: 16, 11, 14, 4, 8, 13, 24.
A: Add all the odd numbers (11, 13) to get 24. The answer is True.
The sum of the odd numbers in this group is even: 17, 9, 10, 12, 13, 4, 2.
A: Add all the odd numbers (17, 9, 13) to get 39. The answer is False.
The sum of the odd numbers in this group is even: 15, 32, 5, 13, 82, 7, 1.
A:
Output: Add all odd numbers (15, 5, 13, 7, 1) to get 41. The answer is False.
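• The answers in this chain-of-thought prompt can be checked mechanically:

```python
def odd_sum_is_even(numbers):
    """Return (sum of odd numbers, whether that sum is even)."""
    odds = [n for n in numbers if n % 2 == 1]
    return sum(odds), sum(odds) % 2 == 0

print(odd_sum_is_even([15, 32, 5, 13, 82, 7, 1]))  # (41, False)
```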
C. Better scalability
D. High generalization
• BCD
Summary
⚫ This chapter introduces the entire process of AI service and foundation model
building, and describes how to use prompt engineering to efficiently use foundation
models to obtain more accurate answers.
1. Voice Assistants
2. Smart Homes
3. Intelligent Vehicles
4. Intelligent Recommendation
5. Intelligent Robots
6. AI4Science
Voice assistant
• https://consumer.huawei.com/en/emui/celia/
Voice Assistant – Celia
⚫ Celia is a smart assistant. Powered by foundation models, Celia is capable of providing solutions
for every scenario.
• https://consumer.huawei.com/en/emui/celia/
Celia – Simplify Your Daily Routines
Wherever you are and whatever you are doing, with the help of Celia, you can get your everyday
tasks done quickly and effortlessly. Be productive and enjoy your day!
1. Voice Assistants
2. Smart Homes
3. Intelligent Vehicles
4. Intelligent Recommendation
5. Intelligent Robots
6. AI4Science
• https://3ms.huawei.com/km/static/image/detail.html?fid=66109
Smart Home – Making Your Life Easier
⚫ AI connects home appliances to create an intelligent ecosystem
When you step into your home, warm lights fade in from the hallway
to the living room. The intelligent environment control system
automatically adjusts the ambient temperature, humidity, and air
quality to suit your preferences. Let the home take care of itself, and
relax after a long day.
1. Voice Assistants
2. Smart Homes
3. Intelligent Vehicles
4. Intelligent Recommendation
5. Intelligent Robots
6. AI4Science
1. Voice Assistants
2. Smart Homes
3. Intelligent Vehicles
4. Intelligent Recommendation
5. Intelligent Robots
6. AI4Science
⚫ Knowledge graphs are knowledge bases used in intelligent recommender systems. They are essentially structured networks that store relationships
between entities. They contain a large amount of background information about target objects and relationships between objects in recommendation
systems. Knowledge graphs provide precise profiles of target objects, such as user behavior, interests, and requirements, to implement accurate
matching and targeted recommendation, achieving scenario- and task-specific personalized recommendation. This close relationship enables
knowledge graphs to shine in intelligent recommendation.
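• A minimal sketch of this idea, assuming a toy triple store and a single shared-attribute relation (all entity and relation names are invented for illustration):

```python
# Tiny knowledge graph: (entity, relation) -> value triples.
KG = {
    ("MovieA", "genre"): "sci-fi",
    ("MovieB", "genre"): "sci-fi",
    ("MovieC", "genre"): "drama",
}

def recommend(liked_item, catalog):
    """Recommend catalog items that share a relation value with liked_item,
    i.e. a one-hop walk through the knowledge graph."""
    liked_genre = KG.get((liked_item, "genre"))
    return [item for item in catalog
            if item != liked_item and KG.get((item, "genre")) == liked_genre]

print(recommend("MovieA", ["MovieA", "MovieB", "MovieC"]))  # ['MovieB']
```

Real systems extend this to multi-hop paths and learned entity embeddings, but the core mechanism — matching target objects through stored entity relationships — is the same.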
1. Voice Assistants
2. Smart Homes
3. Intelligent Vehicles
4. Intelligent Recommendation
5. Intelligent Robots
6. AI4Science
(Figure: 0. Different mugs, some observed by the robot, others not. 1. Give an instruction
for a mug observed by the robot. 2. Give an instruction for mug 1, not yet observed by the
robot. 3. Give an instruction for mug 2, not yet observed by the robot.)
• https://arxiv.org/pdf/2410.07864
• https://rdt-robotics.github.io/rdt-robotics/
Contents
1. Voice Assistants
2. Smart Homes
3. Intelligent Vehicles
4. Intelligent Recommendation
5. Intelligent Robots
6. AI4Science
• The Nobel Prize in Physics was awarded to American scientist John Hopfield and Canadian
scientist Geoffrey Hinton for their "foundational discoveries and inventions that enable
machine learning with artificial neural networks". The work of these two pioneers in AI has
laid a foundation for the development of deep learning and neural networks, and plays a
crucial role in the booming development of AI.
• Half of the Nobel Prize in Chemistry was awarded to American scientist David Baker for his
contribution to computational protein design, while British scientist Demis Hassabis and
American scientist John Jumper shared the other half for their achievements in protein
structure prediction. AlphaFold2, an AI model developed by Hassabis and Jumper, has
solved a 50-year-old problem by predicting the complex structures of approximately 200
million known proteins. It has been used by over 2 million people worldwide and holds
revolutionary significance in fields like drug R&D.
• The awarding of these prizes reflects a new trend in scientific research: AI, as a force that
cannot be ignored, is driving a paradigm shift in scientific research. In fields such as physics,
chemistry, biology, and medicine, AI has become an important tool for solving long-
standing complex scientific problems. It has established a paradigm theoretically capable of
solving all scientific problems: starting from practical problems, transforming them into
input data that AI can process, and then using deep learning networks to ultimately output
results. Many scientists believe that AI will continuously push scientific research beyond
traditional frameworks, achieving more profound and broader innovations.
AI + Quantum Mechanics
FermiNet (DeepMind, 2024)
FermiNet is a neural network architecture proposed by DeepMind. It supports a parameterized
wave function representation method and can efficiently process multi-electron quantum
systems, improving the accuracy and efficiency of electronic structure calculations. FermiNet
calculates the energy of atoms and molecules, and still captures 97% of the relevant energy in
a molecular system consisting of the 30 electrons of bicyclobutane.
David Pfau, et al. Accurate computation of quantum excited states with neural networks. Science, 2024.
LapNet (Peking University and ByteDance, 2024)
LapNet is a deep learning architecture developed by ByteDance and Peking University for
neural network-based variational Monte Carlo (NN-VMC). The team designed a forward
Laplacian computing framework to calculate the Laplacian related to the neural network (a
bottleneck of NN-VMC) through an effective forward propagation process. LapNet achieves
significant acceleration, extending the applicability of NN-VMC to larger systems.
Ruichen Li, et al. A computational framework for neural network based variational Monte Carlo with Forward Laplacian. Nature Machine Intelligence, 2024.
AI + Material Science
GNoME (DeepMind and Nature, 2023)
The GNoME tool launched by DeepMind has discovered 2.2 million types of new
crystals, among which 380,000 crystals with stable structures are expected to be
synthesized. This may drive technological transformations in next-generation batteries
and superconductors. GNoME uses a graph neural network (GNN) model to efficiently
predict the crystal structure and stability of materials, greatly accelerating the discovery
of new materials. Compared with traditional experimental methods, AI models can
predict millions of materials in a short time, shortening the research period and
significantly saving experimental costs. AmilMerchant et al. Scaling deep learning for materials discovery, Nature (2023)
Weinan E, et. al. DeePMD kit v2: A software package for deep potential models. 2023
⚫ This chapter describes how AI reshapes our lives, as well as AI applications in fields
such as meteorology and robotics.