System Architecture and Performance - H7 - en - DM00306681
Application note
STM32H72x, STM32H73x, and single-core STM32H74x/75x
system architecture and performance
Introduction
The STM32H7 Series is the first series of STMicroelectronics microcontrollers in 40 nm process technology. This technology enables STM32H7 devices to integrate high-density embedded Flash memory and SRAM, easing the resource constraints that typically complicate high-end embedded development. It also unleashes the performance of the core and enables ultra-fast data transfers through the system, while realizing major power savings.
In addition, the STM32H7 Series is the first series of Arm® Cortex®-M7-based 32-bit
microcontrollers able to run at up to 550 MHz, reaching new performance records of
1177 DMIPS and 2777 CoreMark®.
The STM32H7 Series continues the STM32F7 Series of high-performance products, with significant architecture improvements that boost performance versus STM32F7 Series devices.
The architecture and performance of STM32H7 Series devices make them ideally suited for
industrial gateways, home automation, telecom equipment and smart consumer products,
high-performance motor control and domestic appliances, and use in small devices with rich
user interfaces such as smart watches.
This application note focuses on STM32H72x, STM32H73x, STM32H742x,
STM32H743/753x and STM32H750x single-core microcontrollers, referred to herein as
STM32H72x/73x/74x/75x (see Table 1). Dual-core devices are not covered by this
document. Its objective is to present the global architecture of the devices as well as their memory interfaces and features, which provide a high degree of flexibility for achieving the best trade-off between performance and code/data size.
The application note also provides the results of a software demonstration of the
STM32H74x/75x Arm® Cortex®-M7 single-core architecture performance in various
memory partitioning configurations with different code and data locations.
This application note is delivered with the X-CUBE-PERF-H7 Expansion Package dedicated
to STM32H742x, STM32H743/753x and STM32H750x microcontrollers. This Expansion
Package includes the H7_single_cpu_perf project aimed at demonstrating the performance
of CPU memory accesses in different configurations with code execution and data storage
in different memory locations using L1 cache. The project runs on the STM32H743I-EVAL
board.
Reference documents
• Reference manual STM32H723/733, STM32H725/735 and STM32H730 Value line
advanced Arm®-based 32-bit MCUs (RM0468)
• Reference manual STM32H742, STM32H743/753 and STM32H750 Value line
advanced Arm®-based 32-bit MCUs (RM0433)
All documents are available from STMicroelectronics website: www.st.com.
Contents
1 General information
2 STM32H72x/73x/74x/75x system architecture overview
2.1 Cortex®-M7 core
2.2 Cortex®-M7 system caches
2.3 Cortex®-M7 memory interfaces
2.3.1 AXI bus interface
2.3.2 TCM bus interface
2.3.3 AHBS bus interface
2.3.4 AHBP bus interface
2.4 STM32H72x/73x/74x/75x interconnect matrix
2.4.1 AXI bus matrix in the D1 domain
2.4.2 AHB bus matrices in the D2 and D3 domains
2.4.3 Inter-domain buses
2.5 STM32H72x/73x/74x/75x memories
2.5.1 Embedded Flash memory
2.5.2 Embedded RAM
2.5.3 External memories
2.6 Main architecture differences between STM32F7 Series and STM32H72x, STM32H73x, STM32H74x and STM32H75x devices
3 Typical application
3.1 FFT demonstration
3.2 Configuring demonstration projects
6 Conclusion
7 Revision history
1 General information
2 STM32H72x/73x/74x/75x system architecture overview
a. Arm is a registered trademark of Arm Limited (or its subsidiaries) in the US and/or elsewhere.
Table: Cortex®-M7 default memory map attributes

| Address range | Region | Memory type | Attributes | Execute never |
|---|---|---|---|---|
| 0x0000 0000-0x1FFF FFFF | Code | Normal | Cacheable, write-through, allocate on read miss | No |
| 0x2000 0000-0x3FFF FFFF | SRAM | Normal | Cacheable, write-back, allocate on read and write miss | No |
| 0x4000 0000-0x5FFF FFFF | Peripheral | Device | Non-shareable | Yes |
| 0x6000 0000-0x7FFF FFFF | RAM | Normal | Cacheable, write-back, allocate on read and write miss | No |
| 0x8000 0000-0x9FFF FFFF | RAM | Normal | Cacheable, write-through, allocate on read miss | No |
| 0xA000 0000-0xBFFF FFFF | External device | Device | Shareable | Yes |
| 0xC000 0000-0xDFFF FFFF | External device | Device | Non-shareable | Yes |
| 0xE000 0000-0xE00F FFFF | Private peripheral bus | Strongly ordered | - | Yes |
| 0xE010 0000-0xFFFF FFFF | Vendor system | Device | Non-shareable | Yes |
In STM32H72x/73x/74x/75x, the 64-bit AXI master bus connects the core to the 64-bit AXI
bus matrix (D1 domain).
Figure 1. STM32H72x and STM32H73x system architecture (block diagram, not reproduced: the Cortex®-M7 with L1 cache and ITCM-/DTCM-RAM, the 64-bit AXI bus matrix of the D1 domain with its ASIB1 to ASIB6 and AMIB ports serving the LTDC, DMA2D, MDMA, SDMMC1, OCTOSPI1/OTFDEC1, OCTOSPI2/OTFDEC2, AXI SRAM, GPV and the AHB3/APB3 bridges, the 32-bit AHB bus matrix of the D2 domain serving DMA1, DMA2, SDMMC2, Ethernet MAC, USB HS1, shared SRAM and the AHB1/AHB2/APB1/APB2 bridges, and the D1-to-D2, D2-to-D1, D1-to-D3 and D2-to-D3 inter-domain buses)
Figure 2. STM32H74x and STM32H75x system architecture
(Block diagram not reproduced. Note 1: blocks marked (1), such as the LTDC and SRAM3, are not available in STM32H742xx devices. Note 2: blocks marked (2), such as Flash bank 2, are not available in STM32H750xx devices. The diagram shows the Cortex®-M7 with L1 cache and ITCM-/DTCM-RAM, the 64-bit AXI bus matrix of the D1 domain serving the LTDC, DMA2D, MDMA, SDMMC1, Flash banks 1 and 2, FMC, QUADSPI, AXI SRAM, GPV and the AHB3/APB3 bridges, the 32-bit AHB bus matrices of the D2 domain (DMA1, DMA2, SDMMC2, Ethernet MAC, USB HS1/HS2, SRAM1 to SRAM3, AHB1/AHB2/APB1/APB2) and of the D3 domain (BDMA, AHB4/APB4, SRAM4, Backup SRAM), and the inter-domain buses.)
– the D2-to-D3 AHB inter-domain that connects the D2 domain to the D3 domain
– the D2-to-D1 AHB inter-domain that connects the D2 domain to the D1 domain
The AHB bus matrix in the D3 domain is dedicated to reset, clock control, power
management and GPIOs. It interconnects:
• three initiators:
– the D1-to-D3 AHB inter-domain that connects the D1 domain to the D3 domain
– the D2-to-D3 AHB inter-domain that connects the D2 domain to the D3 domain
– the BDMA memory AHB bus
• two bus slaves:
– the AHB4 peripherals including the AHB to APB bridge (connection 6 in Figure 2)
and APB4 peripherals
– the internal SRAM4 (up to 64 Kbytes) and the 4-Kbyte Backup SRAM, which share the same AHB bus
If SDMMC1, LTDC, or DMA2D needs data located, for example, in SRAM4, the MDMA can be used to transfer these data from SRAM4 to the AXI SRAM first.
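For illustration, such an MDMA copy might look like the following sketch, assuming the STM32CubeH7 HAL and a software-triggered transfer (the channel number, buffer-transfer length and function name are hypothetical, not taken from this application note):

```c
#include "stm32h7xx_hal.h"

static MDMA_HandleTypeDef hmdma;

/* Copy a buffer from SRAM4 (D3) to AXI SRAM (D1) so that D1/D2 masters
   such as SDMMC1, LTDC or DMA2D can then consume it. */
HAL_StatusTypeDef copy_sram4_to_axisram(uint32_t *src, uint32_t *dst,
                                        uint32_t bytes)
{
  __HAL_RCC_MDMA_CLK_ENABLE();

  hmdma.Instance                      = MDMA_Channel0;   /* hypothetical channel   */
  hmdma.Init.Request                  = MDMA_REQUEST_SW; /* software trigger       */
  hmdma.Init.TransferTriggerMode      = MDMA_BLOCK_TRANSFER;
  hmdma.Init.Priority                 = MDMA_PRIORITY_HIGH;
  hmdma.Init.Endianness               = MDMA_LITTLE_ENDIANNESS_PRESERVE;
  hmdma.Init.SourceInc                = MDMA_SRC_INC_WORD;
  hmdma.Init.DestinationInc           = MDMA_DEST_INC_WORD;
  hmdma.Init.SourceDataSize           = MDMA_SRC_DATASIZE_WORD;
  hmdma.Init.DestDataSize             = MDMA_DEST_DATASIZE_WORD;
  hmdma.Init.DataAlignment            = MDMA_DATAALIGN_PACKENABLE;
  hmdma.Init.BufferTransferLength     = 128;             /* bytes per buffer       */
  hmdma.Init.SourceBurst              = MDMA_SOURCE_BURST_SINGLE;
  hmdma.Init.DestBurst                = MDMA_DEST_BURST_SINGLE;
  hmdma.Init.SourceBlockAddressOffset = 0;
  hmdma.Init.DestBlockAddressOffset   = 0;
  if (HAL_MDMA_Init(&hmdma) != HAL_OK)
    return HAL_ERROR;

  /* One block of 'bytes' bytes, polled to completion */
  if (HAL_MDMA_Start(&hmdma, (uint32_t)src, (uint32_t)dst, bytes, 1) != HAL_OK)
    return HAL_ERROR;
  return HAL_MDMA_PollForTransfer(&hmdma, HAL_MDMA_FULL_TRANSFER, 100);
}
```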
For STM32H74x/75x, the D1-to-D3 AHB inter-domain bus also makes it possible for some masters located in the D2 domain to access the resources located in the D3 domain. Note that some masters in the D2 domain, such as USB OTG_HS, do not have direct access to the D3 resources through the D2-to-D3 AHB inter-domain bus: the access is performed first through the D2-to-D1 AHB and then through the D1-to-D3 AHB inter-domain buses (refer to the USBHS2 access to SRAM4 path highlighted in yellow in Figure 3).
Table: STM32H72x/73x/74x/75x bus-master-to-bus-slave interconnect. The full access matrix (bus masters: Cortex®-M7 ITCM and AHBP ports, MDMA AHBS and AXI ports, DMA1/DMA2 PERIPH and MEM ports, SDMMC1, SDMMC2, USBHS1, USBHS2, BDMA, LTDC and DMA2D; bus slaves: ITCM, DTCM, AHB3/APB3 peripherals, Flash bank 1, AXI SRAM, QUADSPI, OCTOSPI, FMC, SRAM1 to SRAM3, AHB1/AHB2/AHB4 and APB4 peripherals, SRAM4 and Backup RAM) is not reproduced here; its notes are kept below:
1. Bold font denotes a 64-bit bus, plain font a 32-bit bus.
2. LTDC is not available on STM32H742x devices.
3. Cells in the table body indicate access possibility, utility, path and type. Access possibility and utility: any figure = access possible, "-" = access not possible, gray shading = access useful/usable. Access path: D = direct, 1 = via the AXI bus matrix, 2 = via the AHB bus matrix in D2, 3 = via the AHB bus matrix in D3, 4 = via the AHB/APB bridge in D1, 5 = via the AHB/APB bridge in D2, 6 = via the AHB/APB bridge in D3, 7 = via the AHBS bus of the Cortex®-M7; multi-digit numbers = the interconnect path goes through more than one matrix and/or bridge, in the order of the digits. Access type: plain = 32-bit, italic = 32-bit on the bus-master end / 64-bit on the bus-slave end, bold = 64-bit.
4. Flash bank 2 is available only in STM32H74x/75x devices, except STM32H750x.
5. QUADSPI and USBHS2 are available only in STM32H74x/75x devices.
6. OCTOSPI is available only in STM32H72x/73x devices.
7. SRAM3 is available only in STM32H74x/75x devices, except STM32H742x.
8. Connection available only in STM32H74x/75x devices.
Figure 3 shows ten examples of paths used by masters located in the D1 and D2 domains to access resources located in the D1, D2, and D3 domains. This example is based on STM32H74x/75x; the paths are the same for STM32H72x/73x, except for the USBHS2 access to SRAM4 path.
Figure 3. Master access path examples on STM32H74x and STM32H75x (diagram not reproduced; the highlighted paths include the CPU access to SRAM4, the CPU access to AXI SRAM, the DMA1 access to the Quad-SPI, the DMA2 access to SRAM4, the MDMA access to the DTCM-RAM through the AHBS bus of the Cortex®-M7, and the USBHS2 access to SRAM4 through the D2-to-D1 and D1-to-D3 inter-domain buses)
Figure 4. Embedded Flash memory interface connections (diagram not reproduced: Flash bank 1 and Flash bank 2 are connected to the 64-bit AXI bus matrix through the Flash memory interface over 266-bit wide bank buses, and the Flash configuration registers are accessed through the AHB3 Flash configuration bus; Flash bank 2 is not available on STM32H750x devices)
To run at a CPU speed higher than 520 MHz, the STM32H72x/73x must disable the ECC on this RAM through the CPUFREQ_BOOST option byte.
For STM32H72x/73x, the ITCM-RAM size can be increased up to 192 Kbytes with a 64-Kbyte granularity, at the expense of the AXI-SRAM size.
• 128 Kbytes of DTCM-RAM
The DTCM-RAM is located in the D1 domain.
It is split into two DTCM-RAMs with 32-bit access each. Both memories are connected
respectively to the D0TCM and D1TCM ports of the Cortex®-M7 (not represented in the
figures) and can be used in parallel (for load/store operations) thanks to the Cortex®-
M7 dual issue capability. The DTCM-RAM is mapped on the DTCM interface at
address 0x2000 0000. It is accessible only by the CPU and the MDMA and can be
accessed by the Cortex®-M7 through the DTCM bus (light green bus in Figure 2) and
by the MDMA through the specific AHBS bus of the Cortex®-M7 (light pink path in
Figure 3). It is accessible by the Cortex®-M7 by bytes, half-words (16 bits), words
(32 bits) or double words (64 bits).
The DTCM-RAM is accessible at the maximum Cortex®-M7 clock speed without any
latency. To perform accesses at a CPU speed higher than 520 MHz, the
STM32H72x/73x must disable the ECC on this RAM through the CPUFREQ_BOOST
option byte.
Concurrent accesses to the DTCM-RAM by the Cortex®-M7 and the MDMA, as well as their priorities, can be handled by the slave control register of the Cortex®-M7 itself (CM7_AHBSCR register). A higher priority can be given to the Cortex®-M7 accesses to the DTCM-RAM versus the MDMA accesses. For more details about this register, refer to the Arm® Cortex®-M7 processor technical reference manual.
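As a minimal sketch, such an adjustment can be done through the CMSIS symbol for this register. The CTL value written below is an assumption for illustration only; the actual field encodings must be taken from the Arm® Cortex®-M7 TRM:

```c
#include "stm32h7xx.h"   /* provides the CMSIS Cortex-M7 SCB definitions */

/* Tune the arbitration between core and AHBS (MDMA) accesses to the
   DTCM-RAM via CM7_AHBSCR (CMSIS: SCB->AHBSCR). */
void dtcm_arbitration_tune(void)
{
  uint32_t reg = SCB->AHBSCR;

  reg &= ~SCB_AHBSCR_CTL_Msk;
  reg |= (1U << SCB_AHBSCR_CTL_Pos);  /* assumed encoding: fixed-priority
                                         scheme; verify against the TRM */
  SCB->AHBSCR = reg;
}
```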
• Up to 512 Kbytes of AXI SRAM
The AXI SRAM is mapped at address 0x2400 0000.
It can be accessed by all masters located in the D1 domain, and by masters located in the D2 domain through the D2-to-D1 AHB inter-domain bus. The BDMA located in the D3 domain cannot access this memory. The AXI SRAM is connected to the AXI bus matrix through a 64-bit wide AXI bus and can be accessed by bytes (8 bits), half-words (16 bits), full-words (32 bits) or double-words (64 bits). Refer to Figure 3 for some possible AXI SRAM accesses. The AXI SRAM can be used for read/write data storage as well as for code execution. It is accessed at the same frequency as the AXI bus matrix (half the maximum CPU frequency).
• 192 Kbytes of RAM shared between ITCM and AXI RAM (available only in
STM32H72x/73x)
It is located in the D1 domain and can be allocated either to the ITCM-RAM or to the AXI SRAM, with a 64-Kbyte granularity.
When used as ITCM-RAM, it is accessed as the 64-Kbyte ITCM-RAM, except that it is
mapped at address 0x0001 0000, contiguous to the fixed ITCM-RAM.
When used as AXI SRAM, it is accessed as the 128-Kbyte AXI SRAM, except that it is
mapped at address 0x2402 0000 contiguous to the fixed AXI SRAM.
This feature can be configured through the TCM_AXI_SHARED[1:0] option byte.
All internal RAMs feature error correction code (ECC). The ECC must be disabled on the TCM-RAMs, through the CPUFREQ_BOOST option byte, when STM32H72x/73x devices run above 520 MHz.
Table 5 and Table 6 summarize the internal memory mapping and the memory sizes of the STM32H72x/73x/74x/75x devices. For STM32H72x/73x:

| Memory | Start address | Size | Bus (width) | Domain | Max. frequency |
|---|---|---|---|---|---|
| Flash memory FLASH-1 | 0x0800 0000 | 1 Mbyte(1) | AXI (64 bits) | D1 | 275 MHz |
| DTCM-RAM | 0x2000 0000 | 128 Kbytes | DTCM (64 bits) | D1 | 550 MHz |
| ITCM-RAM | 0x0000 0000 | (2) | ITCM (64 bits) | D1 | 550 MHz |
| AXI SRAM | 0x2400 0000 | (2) | AXI (64 bits) | D1 | 275 MHz |
| SRAM1 | 0x3000 0000 | 16 Kbytes | AHB (32 bits) | D2 | 275 MHz |
| SRAM2 | 0x3000 4000 | 16 Kbytes | AHB (32 bits) | D2 | 275 MHz |
| SRAM4 | 0x3800 0000 | 64 Kbytes | AHB (32 bits) | D3 | 275 MHz |
| Backup SRAM | 0x3880 0000 | 4 Kbytes | AHB (32 bits) | D3 | 275 MHz |

1. Up to 1 Mbyte, depending on the device.
2. Refer to the ITCM/DTCM/AXI configuration table in the RM0468 reference manual.
FMC bank mapping for STM32H72x/73x (default and swapped configurations):

| Address range | Default mapping (BMAP[1:0] = 00b) | NOR/PSRAM and SDRAM banks swapped (BMAP[1:0] = 01b) |
|---|---|---|
| 0xD000 0000-0xDFFF FFFF | SDRAM Bank 2 (256 Mbytes) | SDRAM Bank 2 (256 Mbytes) |
| 0xC000 0000-0xCFFF FFFF | SDRAM Bank 1 (256 Mbytes) | NOR/PSRAM (4 x 64 Mbytes) |
| 0xA000 0000-0xBFFF FFFF | Reserved | Reserved |
| 0x9000 0000-0x9FFF FFFF | OCTOSPI1, memory-mapped mode (256 Mbytes) | OCTOSPI1, memory-mapped mode (256 Mbytes) |
| 0x8000 0000-0x8FFF FFFF | NAND (256 Mbytes) | NAND (256 Mbytes) |
| 0x7000 0000-0x7FFF FFFF | OCTOSPI2, memory-mapped mode (256 Mbytes) | OCTOSPI2, memory-mapped mode (256 Mbytes) |
| 0x6000 0000-0x6FFF FFFF | NOR/PSRAM (4 x 64 Mbytes) | SDRAM Bank 1 (256 Mbytes) |
FMC bank mapping for STM32H74x/75x (default and swapped configurations):

| Address range | Default mapping (BMAP[1:0] = 00b) | NOR/PSRAM and SDRAM swapped (BMAP[1:0] = 01b) | SDRAM Bank 2 remapped (BMAP[1:0] = 10b) |
|---|---|---|---|
| 0xD000 0000-0xDFFF FFFF | SDRAM Bank 2 (256 Mbytes) | SDRAM Bank 2 (256 Mbytes) | SDRAM Bank 2 (256 Mbytes) |
| 0xC000 0000-0xCFFF FFFF | SDRAM Bank 1 (256 Mbytes) | NOR/PSRAM (4 x 64 Mbytes) | SDRAM Bank 1 (256 Mbytes) |
| 0xA000 0000-0xBFFF FFFF | Reserved | Reserved | Reserved |
| 0x9000 0000-0x9FFF FFFF | Quad-SPI, memory-mapped mode (256 Mbytes) | Quad-SPI, memory-mapped mode (256 Mbytes) | Quad-SPI, memory-mapped mode (256 Mbytes) |
| 0x8000 0000-0x8FFF FFFF | NAND (256 Mbytes) | NAND (256 Mbytes) | NAND (256 Mbytes) |
| 0x7000 0000-0x7FFF FFFF | SDRAM Bank 1 (256 Mbytes) | SDRAM Bank 2 (256 Mbytes) | SDRAM Bank 2 (256 Mbytes) |
| 0x6000 0000-0x6FFF FFFF | NOR/PSRAM (4 x 64 Mbytes) | SDRAM Bank 1 (256 Mbytes) | NOR/PSRAM (4 x 64 Mbytes) |
Figure 7 shows the possible paths that interconnect the Cortex®-M7 and the different DMAs with these external memories via the AXI and AHB buses. The example covers STM32H74x/75x; it also applies to STM32H72x/73x by replacing the Quad-SPI interface with the Octo-SPI interface. As shown in Figure 7, the external memories can benefit from the Cortex®-M7 cache and therefore maximize performance whatever their usage (data storage or code execution). This enables combining high performance with large memory size.
The path highlighted in pink in Figure 7 represents the connection between the external
memories and the Cortex®-M7, by means of the FMC.
The path highlighted in yellow represents the connection between the external memories
and the Cortex®-M7, by means of the Quad-SPI controller.
The path highlighted in light green represents the access paths of some masters located in
the D1 domain to external memories connected by means of the FMC controller.
The path highlighted in dark green represents the access path of some masters located in
the D1 domain to external memories connected by means of the Quad-SPI controller.
The paths highlighted in purple and in blue represent, respectively, the accesses of masters located in the D2 domain to external memories through the Quad-SPI controller and through the FMC controller, both located in the D1 domain.
All external memories are accessible by any master available in the system with the
exception of the BDMA.
Figure 7. Access paths between the Cortex®-M7, the DMAs and the external memories through the FMC and Quad-SPI controllers (diagram not reproduced)
3 Typical application
This application note provides a software example that demonstrates the performance of
the STM32H74x/75x devices. The selected example is based on the FFT example provided
in the CMSIS library. The H7_single_cpu_perf project can be used as a skeleton where the
user can integrate her/his application.
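The CMSIS-DSP FFT entry point used by such a benchmark typically looks like the following sketch (the buffer names and FFT length are illustrative, not taken from the project):

```c
#include "arm_math.h"
#include "arm_const_structs.h"

#define FFT_LEN 1024U

/* Interleaved complex input/output: re[0], im[0], re[1], im[1], ... */
static float32_t fft_buf[2U * FFT_LEN];
static float32_t fft_mag[FFT_LEN];

void run_fft(void)
{
  /* In-place forward complex FFT with bit-reversal of the output */
  arm_cfft_f32(&arm_cfft_sR_f32_len1024, fft_buf, 0, 1);

  /* Magnitude of the complex spectrum */
  arm_cmplx_mag_f32(fft_buf, fft_mag, FFT_LEN);
}
```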
7 - D1_Flash - D1_DTCM: the program is executed from the internal Flash memory and the data storage is done in the DTCM-RAM, with the I-cache and the D-cache enabled. The FFT algorithm uses a large amount of constant data; in this configuration these read-only data are located in the internal Flash memory, which is why the D-cache is also enabled.
8 - D1_QuadSPI_Single - D1_DTCM: the program is executed from the Quad-SPI Flash memory with the I-cache and the D-cache enabled (since the constants are located in the Quad-SPI Flash). The Quad-SPI Flash memory is configured in Single mode and runs with DDR mode enabled, at 60 MHz if the system clock frequency is set to 480 MHz and at 50 MHz if it is set to 400 MHz.
9 - D1_QuadSPI_Dual - D1_DTCM: the program is executed from the Quad-SPI Flash memory with the I-cache and the D-cache enabled (since the constants are located in the Quad-SPI Flash). The Quad-SPI Flash memory is configured in Dual mode and runs with DDR mode enabled, at 60 MHz if the system clock frequency is set to 480 MHz and at 50 MHz if it is set to 400 MHz.
10 - D1_SDRAM_Swapped - D1_DTCM: the program is executed from the FMC-SDRAM with bank 2 remapped (0xD000 0000 -> 0x7000 0000) and the data storage is done in the DTCM-RAM. The I-cache and the D-cache are enabled (the constants are located in the SDRAM) and the SDRAM runs at 100 MHz.
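The bank remapping itself is controlled by the BMAP[1:0] bits of the FMC_BCR1 register. A minimal sketch, assuming the STM32CubeH7 HAL helper HAL_SetFMCMemorySwappingConfig() and its FMC_SWAPBMAP_SDRAM_SRAM constant:

```c
#include "stm32h7xx_hal.h"

/* Remap the FMC banks (BMAP[1:0] in FMC_BCR1) so that the SDRAM appears
   in the 0x6000 0000-0x7FFF FFFF range, which is cacheable and
   executable with the default memory attributes. */
void fmc_swap_sdram(void)
{
  HAL_SetFMCMemorySwappingConfig(FMC_SWAPBMAP_SDRAM_SRAM);
}
```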
Each configuration has its own set of flags, defined in the project configuration. Figure 10 shows where these flags are defined for the MDK-ARM toolchain.
The code is optimized for time (level 3) for all the configurations.
Project flags are the following:
• USE_VOS0_480MHZ: defines the system clock frequency.
– if USE_VOS0_480MHZ = 1, the system clock frequency is set to 480 MHz.
– if USE_VOS0_480MHZ = 0, the system clock frequency is set to 400 MHz.
• AHB_FREQ_X_CORE_FREQ: defines the core and bus matrices operating
frequencies.
X is equal to HALF or EQU. There are two configurations:
– AHB_FREQ_HALF_CORE_FREQ: the AXI bus matrix and the two AHB bus matrices run at half the core frequency.
– AHB_FREQ_EQU_CORE_FREQ: the AXI bus matrix and the two AHB bus
matrices run at the same frequency as the core.
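As an illustration of how these flags typically steer the clock tree (a hedged sketch; the actual project code may differ, and the DEMO_ names are hypothetical):

```c
#include "stm32h7xx_hal.h"

/* Compile-time selection of the AXI/AHB matrix divider from the project
   flags (flag names from this application note; the RCC constants are the
   standard STM32CubeH7 HAL ones). */
#if defined(AHB_FREQ_HALF_CORE_FREQ)
  #define DEMO_AHB_DIV  RCC_HCLK_DIV2   /* bus matrices at half the core clock */
#elif defined(AHB_FREQ_EQU_CORE_FREQ)
  #define DEMO_AHB_DIV  RCC_HCLK_DIV1   /* bus matrices at the core clock      */
#endif

void demo_bus_clock_config(RCC_ClkInitTypeDef *clk)
{
  clk->AHBCLKDivider = DEMO_AHB_DIV;    /* applied via HAL_RCC_ClockConfig() */
}
```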
Note: When modifying the RAM regions (stack and heap regions) in the scatter files, the user must modify the stack and heap sizes accordingly in the ASM menu of the MDK-ARM toolchain. The size of a region in the scatter file is not taken as the real stack size of the main application: the user must adjust the STACK_SIZE_APPLICATION and HEAP_SIZE_APPLICATION flag values so that they match the heap/stack region sizes configured in the scatter file.
Figure 11 shows where to modify these flags (blue frame). There is also an initial stack pointer that is used when external memories are used for data storage. Its default size is 1 Kbyte; it can be changed by modifying the Stack_Size_Init variable in the startup file. The initial stack pointer base address can be configured in the ASM (assembly control symbols) menu, as shown in Figure 11 (red frame).
The scatter files of the different configurations are located under the MDK-ARM\scatter_files path in the demonstration project.
For the IAR™ (EWARM) toolchain, the linker configuration files are located under EWARM\icf_files. For the System Workbench toolchain, the linker configuration files are located under SW4STM32\<project folder configuration>.
The results can be displayed on a HyperTerminal PC application through the UART, using the virtual COM port with the following configuration:
• Baudrate: 115200
• Data bits: 7 bits
• Stop bits: 1 bit
• Parity: Odd
• HW flow control: none
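With the STM32CubeH7 HAL, these settings map onto a UART initialization of the following form. This is a sketch: the UART instance is an assumption for the STM32H743I-EVAL virtual COM port, and note that the STM32 WordLength field counts the parity bit, so 7 data bits plus odd parity is configured as an 8-bit word length with parity enabled:

```c
#include "stm32h7xx_hal.h"

static UART_HandleTypeDef huart;

void vcp_uart_init(void)
{
  huart.Instance        = USART1;             /* assumption: VCP instance    */
  huart.Init.BaudRate   = 115200;
  huart.Init.WordLength = UART_WORDLENGTH_8B; /* 7 data bits + 1 parity bit  */
  huart.Init.StopBits   = UART_STOPBITS_1;
  huart.Init.Parity     = UART_PARITY_ODD;
  huart.Init.HwFlowCtl  = UART_HWCONTROL_NONE;
  huart.Init.Mode       = UART_MODE_TX_RX;
  HAL_UART_Init(&huart);
}
```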
To know which COM number the board uses, the user must connect the board's ST-LINK to the PC through a USB cable and go to Control Panel > System > Device Manager > Ports (COM & LPT). Figure 12 shows an example where the UART COM number is COM4.
This section explains the activation of each feature (I-cache and D-cache) depending on the configuration used, and presents the corresponding results (time spent by the FFT algorithm, in nanoseconds).
The results are obtained with the Keil® MDK-ARM (v5.27.1) toolchain, the STM32H7xx
Pack version 2.2.0 and the STM32CubeH7 MCU Package version 1.4.0.
The MDK-ARM code optimization configuration is set to level 3 (optimized for time).
If the instruction fetch is done through the AXIM bus of the CPU, the I-cache must be
enabled to increase the performance of the code execution.
If the data is fetched through the AXIM bus of the CPU, the D-cache must be enabled to
increase the performance of the data access to memories and remove their latencies.
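Both caches are enabled with the standard CMSIS Cortex®-M7 helpers, for example:

```c
#include "stm32h7xx.h"  /* pulls in the CMSIS core_cm7.h cache helpers */

void benchmark_caches_enable(void)
{
  SCB_EnableICache();   /* invalidate, then enable the instruction cache */
  SCB_EnableDCache();   /* invalidate, then enable the data cache        */
}
```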
If the code is not located in the ITCM-RAM, the I-cache must be enabled, as in the following configurations:
• 7 - D1_Flash - D1_DTCM
• 8 - D1_QuadSPI_Single - D1_DTCM
• 9 - D1_QuadSPI_Dual - D1_DTCM
• 10 - D1_SDRAM_Swapped - D1_DTCM
In these configurations, the instructions are fetched through the AXIM bus of the CPU.
The D-cache must be enabled for all configurations where the read/write data are not located in the DTCM-RAM, as in the following configurations:
• 2 - D1_ITCM - D1_AXISRAM
• 3 - D1_ITCM - D2_SRAM1
• 4 - D1_ITCM - D2_SRAM2
• 5 - D1_ITCM - D3_SRAM4
• 6 - D1_ITCM - D1_SDRAM
The D-cache must also be enabled when the read-only data are not located in the ITCM-RAM, as in the following configurations:
• 7 - D1_Flash - D1_DTCM
• 8 - D1_QuadSPI_Single - D1_DTCM
• 9 - D1_QuadSPI_Dual - D1_DTCM
• 10 - D1_SDRAM_Swapped - D1_DTCM
The FFT algorithm used in the demonstration accesses a very large amount of read-only data. Disabling the data cache for configurations 7, 8, 9, and 10 therefore drastically decreases performance.
For configurations 6 and 10, the SDRAM is remapped from 0xD000 0000 to 0x7000 0000 in order to allow cache usage and make the memory region executable, since the default memory region from 0xA000 0000 to 0xDFFF FFFF is a Device memory type region, which is neither cacheable nor executable.
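An alternative to the remapping, shown here only for illustration, is to override the default attributes of the SDRAM region at 0xD000 0000 with an MPU region. This is a minimal sketch using the STM32CubeH7 HAL; the region number and attribute choices are assumptions:

```c
#include "stm32h7xx_hal.h"

void mpu_config_sdram(void)
{
  MPU_Region_InitTypeDef r = {0};

  HAL_MPU_Disable();

  /* Make the SDRAM at 0xD000 0000 normal memory: cacheable
     (write-back, write-allocate via TEX level 1 + C + B) and executable. */
  r.Enable           = MPU_REGION_ENABLE;
  r.Number           = MPU_REGION_NUMBER0;        /* hypothetical region slot */
  r.BaseAddress      = 0xD0000000;
  r.Size             = MPU_REGION_SIZE_256MB;
  r.AccessPermission = MPU_REGION_FULL_ACCESS;
  r.DisableExec      = MPU_INSTRUCTION_ACCESS_ENABLE;
  r.IsCacheable      = MPU_ACCESS_CACHEABLE;
  r.IsBufferable     = MPU_ACCESS_BUFFERABLE;
  r.IsShareable      = MPU_ACCESS_NOT_SHAREABLE;
  r.TypeExtField     = MPU_TEX_LEVEL1;
  HAL_MPU_ConfigRegion(&r);

  HAL_MPU_Enable(MPU_PRIVILEGED_DEFAULT);
}
```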
Table 8. MDK-ARM results of data storage in different memory locations (execution location fixed in ITCM-RAM), CPU running at 480 MHz (AHB_FREQ_HALF_CORE_FREQ, USE_VOS0_480MHZ = 1, Flash ws = 4)
Table 10. MDK-ARM results of data storage in different memory locations (execution location fixed in ITCM-RAM), CPU running at 240 MHz (AHB_FREQ_EQU_CORE_FREQ, USE_VOS0_480MHZ = 1, Flash ws = 4)
Table 11. MDK-ARM results of execution in different memory locations (data location fixed in DTCM-RAM), CPU running at 240 MHz (AHB_FREQ_EQU_CORE_FREQ, USE_VOS0_480MHZ = 1, Flash ws = 4)
Table 12. MDK-ARM results of data storage in different memory locations (execution location fixed in ITCM-RAM), CPU running at 400 MHz (AHB_FREQ_HALF_CORE_FREQ, USE_VOS0_480MHZ = 0, Flash ws = 2)
Table 13. MDK-ARM results of execution in different memory locations (data location fixed in DTCM-RAM), CPU running at 400 MHz (AHB_FREQ_HALF_CORE_FREQ, USE_VOS0_480MHZ = 0, Flash ws = 2)
Table 14. MDK-ARM results of data storage in different memory locations (execution location fixed in ITCM-RAM), CPU running at 200 MHz (AHB_FREQ_EQU_CORE_FREQ, USE_VOS0_480MHZ = 0, Flash ws = 2)
Table 15. MDK-ARM results of execution in different memory locations (data location fixed in DTCM-RAM), CPU running at 200 MHz (AHB_FREQ_EQU_CORE_FREQ, USE_VOS0_480MHZ = 0, Flash ws = 2)
| Cache configuration | Configuration | Execution time (ns)(1) | Relative ratio |
|---|---|---|---|
| - | 1 - D1_ITCM - D1_DTCM (reference) | 624765 | 1.00 |
| I-cache + D-cache ON | 7 - D1_Flash - D1_DTCM | 647915 | 1.04 |
| I-cache + D-cache ON | 8 - D1_QuadSPI_Single - D1_DTCM | 821015 | 1.31 |
| I-cache + D-cache ON | 9 - D1_QuadSPI_Dual - D1_DTCM | 748810 | 1.20 |
| I-cache + D-cache ON | 10 - D1_SDRAM_Swapped - D1_DTCM | 654390 | 1.05 |

1. The execution time values may vary from one toolchain version to another.
The relative ratio calculation enables comparing the performance of a given configuration
versus the configuration having the best performance (1 - D1_ITCM - D1_DTCM) as well as
comparing the performance of a given configuration with another one.
The chart in Figure 13 shows the relative ratio of each configuration versus configuration 1
that represents the reference of this benchmark. This chart shows the FFT benchmark of
data storage in different memory locations while the code location is fixed in ITCM-RAM with
a CPU running at 480 MHz.
Figure 13. STM32H74x and STM32H75x FFT benchmark: data storage in different
memory locations (code in ITCM-RAM) at 480 MHz with MDK-ARM toolchain
The chart in Figure 14 shows the relative ratio of each configuration versus configuration 1
(D1_ITCM - D1_DTCM). It represents the FFT benchmark of the code execution from
different memory locations while the data storage location is fixed in the DTCM-RAM with
the CPU running at 480 MHz.
Figure 14. STM32H74x and STM32H75x FFT benchmark: code execution from different
memory locations (R/W data in DTCM-RAM) at 480 MHz with MDK-ARM toolchain
The chart in Figure 15 shows the relative ratio of each configuration versus configuration 1
that represents the reference of this benchmark. This chart shows the FFT benchmark of
data storage in different memory locations while the code location is fixed in ITCM-RAM with
CPU running at 400 MHz.
Figure 15. STM32H74x and STM32H75x FFT benchmark: data storage in different
memory locations (code in ITCM-RAM) at 400 MHz with MDK-ARM toolchain
The chart in Figure 16 shows the relative ratio of each configuration versus configuration 1
(D1_ITCM - D1_DTCM). It represents the FFT benchmark of the code execution from
different memory locations while the data storage location is fixed in the DTCM-RAM with
the CPU running at 400 MHz.
Figure 16. STM32H74x and STM32H75x FFT benchmark: code execution from
different memory locations (R/W data in DTCM-RAM) at 400 MHz
with MDK-ARM toolchain
Table 17. Number of Flash wait states versus performance (MDK-ARM), CPU running at 400 MHz and AXI running at 200 MHz (VOS1), based on the 7 - D1_Flash - D1_DTCM configuration
The above results (Table 16 and Table 17) show a performance decrease of about 0.54 % each time the number of Flash wait states is incremented by one.
Table 18. SDRAM data read/write access performance versus bus width and clock frequency, based on the 6 - D1_ITCM - D1_SDRAM configuration
Table 19. Execution performance from SDRAM versus bus width and clock frequency, based on the 10 - D1_SDRAM_Swapped - D1_DTCM configuration
Table 20 provides the SDRAM data read/write access results for the SDRAM swapped and non-swapped configurations, based on the 6 - D1_ITCM - D1_SDRAM configuration.
Table 20. SDRAM data read/write access performance in swapped and non-swapped bank configurations, based on the 6 - D1_ITCM - D1_SDRAM configuration
Table 21 provides the results of the code execution from SDRAM in the swapped and non-swapped configurations, based on the 10 - D1_SDRAM_Swapped - D1_DTCM configuration.
Table 21. Execution performance from SDRAM in swapped and non-swapped bank configurations, based on the 10 - D1_SDRAM_Swapped - D1_DTCM configuration
This section provides some tips about code and data partitioning in
STM32H72x/73x/74x/75x memory in order to get the best trade-off between performance
and code/data sizes. Recommendations about product optimal configuration and issue
avoidance are also provided.
The SRAM4, located in the D3 domain, is generally used to store data for the low-power part of the user application. It can be used to retain some application data when the D1 and D2 domains enter
DStandby mode. The low-power application data can be R/W data for the CPU or buffers to
be transferred by peripherals (located in the D3 domain) such as LPUART1, I2C4, and
others.
Note: The data cache must be cleaned before switching the D1 domain to Standby mode. Cleaning the data cache avoids losing data transferred to SRAM4 for retention.
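A minimal sketch using the CMSIS cache-maintenance helpers (the buffer name and size are hypothetical):

```c
#include "stm32h7xx.h"

extern uint32_t retained_buf[256];   /* hypothetical data kept in SRAM4,
                                        32-byte aligned for cache ops     */

void prepare_d1_standby(void)
{
  /* Write dirty cache lines back so the retained data actually reach SRAM4 */
  SCB_CleanDCache_by_Addr(retained_buf, (int32_t)sizeof(retained_buf));
  /* Alternatively, SCB_CleanDCache() cleans the whole data cache. */
}
```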
SRAM4 remains available as long as the whole system is not in Standby mode. Otherwise, it is still possible to use the Backup SRAM to retain some data, with a battery connected to VBAT. The Backup SRAM size is, however, limited to 4 Kbytes in order to reduce leakage (refer to the RM0433 and RM0468 reference manuals and to the associated documentation, which provides a low-power application example and describes the usage of the D3 domain in low-power mode applications).
SRAM4 can also be used for CPU regular data storage as an internal memory extension for
non-low-power applications.
When the application needs more memory and when its code, data, or both, do not fit in the
internal memories, the external memories can be used to expand the memory size without a
loss of performance.
For example, an external NOR Flash memory up to 64 Mbytes connected through the FMC
can contain the application instructions with the cache enabled.
If the internal RAM is insufficient, data storage can be achieved in an external SRAM or SDRAM through the FMC interface, with the data cache enabled. These memories can contain either frame buffers for graphical applications or non-critical data, while the most critical data are given priority for placement in the DTCM-RAM.
Quad-SPI and Octo-SPI Flash memories can be used to store read-only data (relatively large image or audio files, for instance) while keeping, by enabling the data cache, almost the same level of performance as an internal Flash memory access.
Quad-SPI and Octo-SPI Flash memories can also be used to store the application code in memory-mapped mode (up to 256 Mbytes) and, at the same time, to save several GPIOs in the smaller STM32H72x/73x/74x/75x packages, compared to parallel Flash memories that must be connected to the FMC interface. In that case, when the CPU regularly accesses read-only data, these data can be mapped in the internal Flash memory. If the application needs more memory space and more execution performance, the user can load the application in the Quad-SPI/Octo-SPI Flash memory (load region) and use an external SDRAM to which the application is copied at the scatter-load phase and from which it is executed (execution region).
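A scatter-file skeleton for such a load/execution split might look as follows. This is a hedged sketch: the region names, addresses and sizes are illustrative and must be adapted to the actual board and project.

```
; Hypothetical MDK-ARM scatter file: image stored in memory-mapped
; Quad-SPI Flash, code/constants copied to and executed from SDRAM.
LR_QSPI 0x90000000 0x10000000 {        ; load region: Quad-SPI Flash
    ER_SDRAM 0xD0000000 0x02000000 {   ; execution region: external SDRAM
        * (+RO)                        ; code and read-only data
    }
    RW_DTCM 0x20000000 0x00020000 {    ; read/write data in DTCM-RAM
        * (+RW +ZI)
    }
}
```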
6 Conclusion
7 Revision history
16-Jul-2019, revision 3:
Updated:
– Introduction: stm32h7x3_cpu_perf replaced with H7_single_cpu_perf.
– Section 3: Typical application: stm32h7x3_cpu_perf replaced with H7_single_cpu_perf.
– Section 3.2: Configuring demonstration projects.
Added:
– Section 4.1.1: Effects of data and instructions locations on performance.
– Figure 8 through Figure 12 in Section 3: Typical application.
– Figure 13 through Figure 15 in Section 4: Benchmark results and analysis.
– Section 4.1.2: Impact of basic parameters on performance.

16-Sep-2020, revision 4:
– Updated document to cover STM32H72x and STM32H73x microcontrollers.
– Replaced Flash A and B by Flash bank 1 and bank 2, respectively. Specified that only STM32H74x and STM32H75x (except STM32H750x) feature two banks.
– Updated Section 2.5.1: Embedded Flash memory.
– Replaced QSPI interface (QSPI) by Quad-SPI interface (QUADSPI).
– Section: Flexible memory controller interface (FMC): changed the update frequency for synchronous memories to the kernel clock divided by 2 or 3.
STMicroelectronics NV and its subsidiaries (“ST”) reserve the right to make changes, corrections, enhancements, modifications, and
improvements to ST products and/or to this document at any time without notice. Purchasers should obtain the latest relevant information on
ST products before placing orders. ST products are sold pursuant to ST’s terms and conditions of sale in place at the time of order
acknowledgement.
Purchasers are solely responsible for the choice, selection, and use of ST products and ST assumes no liability for application assistance or
the design of Purchasers’ products.
Resale of ST products with provisions different from the information set forth herein shall void any warranty granted by ST for such product.
ST and the ST logo are trademarks of ST. For additional information about ST trademarks, please refer to www.st.com/trademarks. All other
product or service names are the property of their respective owners.
Information in this document supersedes and replaces information previously supplied in any prior versions of this document.
The scattered SRAM architecture of the STM32H72x/73x/74x/75x devices allows flexible partitioning of memory resources based on application requirements. This flexibility supports performance optimization by letting programmers allocate specific memory regions tailored to code and data size requirements, which can enhance execution speed and reduce power consumption. With multiple SRAM blocks, tasks can be distributed so as to minimize latency and maximize throughput, especially for applications requiring simultaneous accesses to different memory banks. This architecture particularly suits applications with varying demands for high-speed data processing versus power efficiency.
The trade-offs between dual-bank and single-bank Flash configurations in STM32H74x/75x concern performance and functionality. Dual-bank configurations allow parallel read/program/erase operations, reducing contention when multiple masters access the Flash memory, at the price of more hardware complexity and potentially higher power consumption. A single-bank configuration simplifies the design and saves power, but at the cost of throughput when accessing the Flash, since simultaneous operations are not possible. This choice impacts application performance depending on the demands for parallel data processing and power efficiency.
Considering both CPU frequency and bus frequency compatibility in the STM32H72x/73x/74x/75x configurations is crucial to ensure optimal performance and stability. A mismatch can lead to inefficient CPU cycles, where the processor is either forced to wait for slower peripheral communications or is underutilized. Matching CPU and bus frequencies ensures efficient data flow and task execution while minimizing idle time. This alignment enhances system throughput and can also contribute to energy savings, as the device can operate closer to its optimal performance point.
Using the ITCM-RAM as a fixed execution location generally maximizes instruction throughput because of its zero-wait-state access. The choice of data storage location, however, has significant implications for performance. Using a high-speed memory such as the DTCM-RAM for data storage offers minimal latency and better performance, while using slower memories such as SDRAM introduces wait states that reduce overall system speed. This configuration determines how quickly data-intensive operations can complete, with system performance hinging on the strategy chosen to balance high-speed data requirements against the available memory architecture.
In the STM32H74x/75x devices, the ITCM-RAM and DTCM-RAM play specific roles in enhancing performance. The ITCM-RAM is typically used to store code, thanks to its single-cycle access, optimizing execution performance. The DTCM-RAM is dedicated to data, especially where frequent read/write operations are needed. By separating code and data storage, the microcontroller minimizes cache usage conflicts and latency, thereby ensuring higher performance. Operating the ITCM-RAM for execution and the DTCM-RAM for data access in parallel optimizes overall throughput and ensures efficient resource utilization.
The performance characteristics of the STM32H72x/73x/74x/75x devices are influenced by their connection to the different bus matrices. The devices feature three separate bus matrices: a 64-bit AXI bus matrix in the D1 domain for high-performance operations, and two 32-bit AHB bus matrices in the D2 and D3 domains for communication peripherals and system control, respectively. High-bandwidth peripherals are connected to the AXI bus matrix in the D1 domain, allowing the high-speed operations essential for tasks requiring efficient data transfer. The 32-bit AHB buses in the D2 and D3 domains handle peripherals and basic functionality, providing a compromise between performance and power consumption. This organization reduces bus congestion and optimizes simultaneous operations by enabling separate masters to remain active without conflict.
Inter-domain bus configurations in the STM32H72x/73x/74x/75x devices create challenges primarily related to data latency and synchronization. Bridging the 32-bit AHB bus domains (D2 and D3) with the higher-bandwidth 64-bit AXI bus in the D1 domain can result in bottlenecks, especially when multiple peripherals require concurrent access across domains. Additionally, careful management is required to ensure data coherence and synchronization across these domains, particularly for real-time applications where predictable data transfer times are crucial.
Simultaneous operation of high-speed peripherals in STM32H72x/73x/74x/75x devices significantly improves computing efficiency by allowing multiple data streams to be processed concurrently. This is facilitated by the independent bus matrices that prevent bus contention, effectively enabling peripherals in different domains to operate without interference. As a result, tasks can execute in parallel, improving response times and throughput. This, however, requires well-coordinated scheduling and power management to avoid resource bottlenecks and to optimize overall energy efficiency.
Flash wait states influence the execution speed of programs running from the Flash memory in STM32H74x/75x microcontrollers. A higher number of wait states increases latency, as the CPU must idle while waiting for data, potentially slowing down execution. Conversely, reducing the number of wait states improves performance by decreasing the Flash memory access time; however, too few wait states may lead to instability or read errors if the device cannot operate within its timing constraints. A balance must therefore be struck based on the system's speed and reliability requirements.
The performance of data storage and retrieval from SDRAM in STM32H74x/75x devices is directly affected by the bus width and clock frequency. Increasing the bus width allows more data to be transferred in parallel, reducing transfer times and enhancing throughput, while a higher clock frequency increases both the transfer speed and the rate at which data can be processed. Higher frequencies may, however, increase power consumption and require more rigorous thermal management. Optimal configurations therefore depend on balancing these variables against performance requirements and power constraints.