
VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY

HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY


FACULTY OF COMPUTER SCIENCE AND ENGINEERING

Probability and Statistics (MT2013)

Assignment

Analyzing impact of GPUs’ characteristics on GPUs’ memory bandwidth

Advisor(s): PhD. Phan Thi Huong


Student(s): Tran Hung Son ID 2353055
Ngo Phan Khai Tu ID 2153951
Luong The Kiet ID 2352649
Dang Duy Truc Giang ID 2052963
Vo Hoang Anh Kiet ID 2352659
Bui Huy Phuong ID 2352948

HO CHI MINH CITY, JULY 2025


Member list & Workload
No.  Full name            Student ID  Workloads                                                 Contribution
1    Ngo Phan Khai Tu     2153951     Descriptive Statistic, Charts and Graphs                  100%
2    Tran Hung Son        2353055     Data preprocessing, Proofreading                          100%
3    Luong The Kiet       2352649     Inferential Statistics                                    100%
4    Dang Duy Truc Giang  2052963     Discussion and Expansion                                  100%
5    Vo Hoang Anh Kiet    2352659     Data Introduction and Background                          100%
6    Bui Huy Phuong       2352948     Descriptive Statistic, Analyzing the Charts and Graphs    100%
Ho Chi Minh City University of Technology
Faculty of Computer Science and Engineering

Contents
1 Data Introduction

2 Background
2.1 Basic Definitions
2.1.1 Linear Regression
2.1.2 Histogram
2.1.3 Scatter Plot
2.1.4 Correlation
2.2 Multiple Linear Regression

3 Data Preprocessing
3.1 Feature Selection
3.2 Data Reading

4 Descriptive Statistics
4.1 Types of Charts and Graphs
4.1.1 Analyzing the Distribution of Variables to Assess Normality Using Histograms
4.1.2 Analyzing the linearity of the chosen datasets in comparison to Memory Bandwidth
4.1.3 Boxplot between Memory Bandwidth and other datasets

5 Inferential Statistics
5.1 Linear Regression Problem for GPUs’ memory bandwidth
5.2 Data Splitting
5.3 Multiple Linear Regression Analysis
5.3.1 Model Purpose
5.3.2 Model Definition

6 Discussion and Expansion
6.1 Limitations
6.2 Future Directions

7 Conclusion


1 Data Introduction
• Data Source: All_GPUs.csv

• Acknowledgements: The dataset primarily originates from GPU manufacturers, including NVIDIA, AMD, and other contributors such as Game-Debate.

• Contents: This dataset contains detailed specifications, release dates, and performance metrics of GPUs.

• Population: Collection of GPUs from various manufacturers.

• Number of Variables: 34

• Number of Observations: 3,406

The main variables in the analysis include:

1. Memory_Bandwidth (Numerical): A measure of memory performance, calculated in GB/s.

2. Memory_Speed (Numerical): Memory speed, measured in MHz.

3. L2_Cache (Numerical): Level 2 cache capacity, processed in KB (converted if necessary).

4. Memory_Bus (Numerical): Memory bus width, measured in bits.

2 Background
2.1 Basic Definitions
2.1.1 Linear Regression

Linear regression is a fundamental statistical technique used to understand and predict the relationship between variables. It involves modeling the relationship between a dependent variable (response or target variable) and one or more independent variables (predictors or explanatory variables) using a linear equation. For simple linear regression, the equation has the form:

y = a + bx (2.1)

where:


• y: Predicted value.

• a: Intercept.

• b: Slope.

• x: Independent variable.

In multiple linear regression, the equation expands to include multiple predictors. The
method minimizes the sum of squared differences between observed and predicted values,
creating a "best-fit" line or hyperplane.
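
To make the least-squares idea concrete, here is a minimal Python sketch for the simple case y = a + bx. The report's own analysis uses R; the function name and the sample points below are made up for illustration only.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (a, b) that minimize the sum of squared residuals of y = a + b*x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("xs and ys must have the same length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and sum of cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx             # slope
    a = mean_y - b * mean_x   # intercept
    return a, b


# Illustrative points that lie exactly on y = 1 + 2x.
a, b = fit_simple_linear_regression([0, 1, 2, 3], [1, 3, 5, 7])
```

With noisy data the points no longer lie on the line, and (a, b) is the pair for which the squared vertical distances are smallest.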

2.1.2 Histogram

A histogram is a graphical representation used to summarize and visualize the distribution of a dataset. It divides the range of data into equal-sized intervals (bins) and uses vertical bars to show the frequency or count of data points falling within each bin. Key characteristics observable from histograms include:

• Central tendency (e.g., mean or median).

• Variability.

• Skewness.

• Presence of outliers.
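
The binning step can be sketched in a few lines of Python. This is only an illustration of equal-width binning with made-up values; plotting libraries choose the number of bins and the bin edges by their own rules.

```python
def histogram_counts(values, num_bins):
    """Count how many values fall into each of num_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    if width == 0:
        # All values are identical, so they all land in the first bin.
        return [len(values)] + [0] * (num_bins - 1)
    counts = [0] * num_bins
    for v in values:
        # The maximum value is placed in the last bin instead of a new one.
        index = min(int((v - lo) / width), num_bins - 1)
        counts[index] += 1
    return counts


# Made-up, right-skewed sample: most values are small, one is large.
counts = histogram_counts([1, 2, 2, 3, 3, 3, 10], num_bins=3)
```

Here the first bin holds almost all of the values and the last bin holds the single large one, which is the shape a right-skewed histogram takes.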

2.1.3 Scatter Plot

A scatter plot displays the relationship between two quantitative variables by plotting
data points on a Cartesian coordinate system. It helps visualize potential patterns, trends,
or associations. Key observations include:

• Linear relationships.

• Direction (positive or negative).

• Strength of the association.


2.1.4 Correlation

Correlation measures the statistical relationship or association between two variables. The most common measure is Pearson’s correlation coefficient (r):

• r = 1: Perfect positive relationship.

• r = −1: Perfect negative relationship.

• r = 0: No linear relationship.

Correlation does not imply causation but is valuable in exploring potential connections.
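
As a minimal sketch, Pearson's r can be computed from the deviations of each variable from its mean. The function name and the sample values are illustrative assumptions, not taken from the report.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("xs and ys must have the same length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when one variable is constant")
    return sxy / math.sqrt(sxx * syy)


r_positive = pearson_r([1, 2, 3], [2, 4, 6])   # perfectly increasing
r_negative = pearson_r([1, 2, 3], [6, 4, 2])   # perfectly decreasing
```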

2.2 Multiple Linear Regression


The Multiple Linear Regression (MLR) method analyzes the relationship between a dependent variable and multiple independent variables, for example when predicting GPU memory bandwidth from hardware characteristics. The general form of MLR is:

Y = β0 + β1 X1 + β2 X2 + · · · + βn Xn + ϵ (2.2)

where:

• Y : Dependent variable (GPU memory bandwidth).

• Xi : Independent variables (e.g., hardware characteristics).

• β0 : Intercept.

• βi : Regression coefficients for Xi .

• ϵ: Random error.

Key Assumptions:

1. Linearity: Relationship between dependent and independent variables is linear.

2. Normality of Residuals: Errors follow a normal distribution.

3. No Multicollinearity: Independent variables are not highly correlated.

4. Independence of Residuals: Errors are not correlated.


Model Performance Metrics:

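• R-squared (R²): Proportion of variance in the dependent variable explained by the independent variables.

• Root Mean Square Error (RMSE): Square root of the mean squared difference between predicted and actual values. It expresses the typical size of a prediction error in the units of the dependent variable.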

The results provide insights into key factors affecting GPU memory bandwidth and sup-
port hardware design optimization.
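
Both metrics follow directly from the residuals. The sketch below is a minimal Python illustration with made-up actual and predicted values; the function names are assumptions for this example.

```python
import math


def r_squared(actual, predicted):
    """1 - (residual sum of squares) / (total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    ss_tot = sum((y - mean_actual) ** 2 for y in actual)
    return 1 - ss_res / ss_tot


def rmse(actual, predicted):
    """Square root of the mean squared prediction error."""
    squared_errors = [(y - p) ** 2 for y, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


actual = [3.0, 5.0, 7.0, 9.0]       # made-up observed values
predicted = [2.5, 5.5, 7.0, 9.0]    # made-up model predictions
r2 = r_squared(actual, predicted)
error = rmse(actual, predicted)
```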

3 Data Preprocessing
3.1 Feature Selection
Before analyzing the data, we have to choose our main feature and its related features. To do so, we use two filters. The first filter is the proportion of missing data in each variable: if the proportion exceeds 10%, we consider the variable unusable and delete it. The second filter depends on the remaining variables and is discussed later.

3.2 Data Reading


Before working with the provided dataset, we take a quick overview to become familiar with the dataset and its structure for better understanding and analysis. We use the read.csv() and head() commands to read the dataset from the .csv file and then print it out. The dataset is shown in Figure 3.1.


Figure 3.1: Summary of the dataset before cleaning

At a quick glance, several different symbols are used to represent missing data. To simplify the cleaning process, we convert all of them to NA. We then build a summary table of the proportion of missing data in each variable.
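
The two steps above (unifying the missing-data symbols, then measuring the proportion of missing values per variable) can be sketched as follows. This is a Python illustration, not the report's R code: the set of markers and the column name Example_Feature are hypothetical, since the report does not list the actual symbols.

```python
# Hypothetical markers; the report does not state which symbols occur.
MISSING_MARKERS = {"", "-", "NA", "N/A"}


def normalize_missing(rows):
    """Replace every missing-data marker with None (the counterpart of NA in R)."""
    return [
        {
            col: (None if val is None or str(val).strip() in MISSING_MARKERS else val)
            for col, val in row.items()
        }
        for row in rows
    ]


def missing_proportion(rows):
    """Return {column: fraction of rows whose value is None}."""
    return {
        col: sum(row[col] is None for row in rows) / len(rows)
        for col in rows[0].keys()
    }


def columns_to_keep(rows, threshold=0.10):
    """Keep only the columns whose missing proportion is below the threshold."""
    return [col for col, p in missing_proportion(rows).items() if p < threshold]


rows = [
    {"Memory_Speed": "1750 MHz", "Example_Feature": "-"},
    {"Memory_Speed": "2000 MHz", "Example_Feature": ""},
    {"Memory_Speed": "1500 MHz", "Example_Feature": "N/A"},
]
cleaned = normalize_missing(rows)
proportions = missing_proportion(cleaned)
kept = columns_to_keep(cleaned)
```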


Figure 3.2: Summary of total missing values and Probability of NA values of each feature

For easier analysis, we plot these proportions in Figure 3.3.


Figure 3.3: Proportion of missing data in variables

Now that we have the missing-data percentage of each feature, we delete every feature whose missing-data percentage is 10% or higher. For the features with less than 10% missing data, the missing values will not noticeably affect the analysis, so we simply delete the observations that contain them.
Next, we address inconsistent units. The only variable with this problem is the L2 cache, which contains unusual unit notations such as KB(x2), KB(x3), and KB(x4). We therefore remove any whitespace and ’KB’, check for a multiplier (x2, x3, x4), and then return the numeric value.
After that, no variable has inconsistent notation, so for easier data handling we delete the trailing units and keep only the numeric values. The resulting table is shown in Figure 3.4.
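
A minimal Python sketch of the L2 cache parsing step is shown below. It assumes that the multiplier is applied to the size (so that 256KB(x2) becomes 512), which the report does not state explicitly; the function name is also an assumption.

```python
import re


def parse_l2_cache(raw):
    """Convert strings such as '512KB' or '256KB(x2)' to a size in KB."""
    # Drop all whitespace and unify case so 'x2' and 'X2' are treated alike.
    text = re.sub(r"\s+", "", raw).upper()
    match = re.fullmatch(r"(\d+(?:\.\d+)?)KB(?:\(X(\d+)\))?", text)
    if match is None:
        return None  # unparseable values are treated as missing
    size = float(match.group(1))
    multiplier = int(match.group(2)) if match.group(2) else 1
    # Assumption: the multiplier scales the stated size.
    return size * multiplier


doubled = parse_l2_cache("256KB(x2)")
plain = parse_l2_cache("1024 KB")
invalid = parse_l2_cache("n/a")
```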


Figure 3.4: Data after removing units

After the cleaning process, the dataset retains 17 variables and 3,036 observations. We must now choose the main subject and its related features before any further work. Memory bandwidth is an extremely important determinant of a GPU’s performance, so we chose it as the main feature. For the related features, we did not choose Architecture, Direct_X, and Resolution_WxH because they are strings with many complex values; handling them would take considerable time without any clear difference in the output. Shader relates more to software than to hardware: it depends on algorithms and code and can be updated across versions, so a GPU line does not have a fixed shader version. We also set aside the Dedicated variable because when Dedicated = No, the device is no longer a separate GPU but one that runs on the CPU’s resources, so its hardware characteristics (such as Memory Bandwidth) depend on the CPU/iGPU manufacturer. Statistics should therefore be based only on GPUs with Dedicated = Yes to ensure the accuracy of the model; however, even for dedicated GPUs, Memory Bandwidth still depends on other factors, such as the generation of the accompanying hardware. The same goes for Manufacturer, which is not a key feature in determining memory bandwidth. We therefore picked Memory_Speed, L2_Cache, and Memory_Bus as related features, because they are easy to handle and analyze and can moderately, or even strongly, affect memory bandwidth [1].
As a finishing touch, we remove all the remaining units. Two variables, Dedicated and Manufacturer, have a categorical data type, so we need to encode them. We use the factor() function to convert these columns to factors, then the as.numeric() function to convert the factors to numeric data. We now have everything we need for the assignment; the result is shown in Figure 3.5.

Figure 3.5: Final dataset
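
For readers unfamiliar with factor encoding, the following Python sketch mimics what converting a column to a factor and then to numeric does in R: each distinct category receives a 1-based integer code, with the levels sorted alphabetically. The function name and sample values are illustrative assumptions.

```python
def encode_as_factor(values):
    """Map each category to a 1-based integer code, with levels sorted alphabetically."""
    levels = sorted(set(values))
    code_of = {level: index + 1 for index, level in enumerate(levels)}
    return [code_of[v] for v in values], levels


codes, levels = encode_as_factor(["Yes", "No", "Yes", "Yes"])
```

The resulting codes are labels rather than quantities: the fact that "Yes" maps to 2 and "No" to 1 carries no numeric meaning beyond telling the categories apart.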

4 Descriptive Statistics
After the data cleaning step, we summarize the new data file using summary() in R. The summary is shown in Figure 4.1.

Figure 4.1: Data Summary

4.1 Types of Charts and Graphs


4.1.1 Analyzing the Distribution of Variables to Assess Normality Using Histograms

We begin by showing the distribution of Memory Bandwidth with a histogram to see whether it follows a normal distribution.

Figure 4.2: The original Memory Bandwidth distribution
Figure 4.3: The Memory Bandwidth distribution after applying the formula

The first histogram, which shows the original Memory Bandwidth data, indicates that the data does not follow a normal distribution. To analyze this variable effectively, we applied a transformation using the formula Y = ln(Y + 1) to make the dataset approximately normal. The distribution of the dataset after this transformation is represented by the second histogram.

We used the formula ln(Y + 1) instead of ln(Y) because the Memory Bus dataset contains zeros. Applying ln(Y) directly would fail for those values, since ln(0) is undefined, and to ensure consistency all variables, including Memory Bandwidth, are transformed using the same formula.
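
The transformation itself is a one-liner. The Python sketch below uses math.log1p, which computes ln(1 + y); the sample values are made up and only show that a zero maps to 0 instead of causing an error.

```python
import math


def log1p_transform(values):
    """Apply y -> ln(y + 1) to every value; zeros map to 0 instead of failing."""
    return [math.log1p(v) for v in values]


# Made-up values, including a zero that plain ln(y) could not handle.
transformed = log1p_transform([0, 64, 128, 256])
```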

Next, we present six histograms that display the three remaining datasets before and after applying the formula Y = ln(Y + 1).

Figure 4.4: The original Memory Speed distribution
Figure 4.5: The Memory Speed distribution after applying the formula

The histograms above show how the distribution of memory speed values differs before and after applying a logarithmic transformation.
Figure 4.4 shows a skewed distribution, with values concentrated at the lower end (left side) and a long tail extending to the right. This right-skewed pattern implies that the variable is not normally distributed. To analyze this variable effectively, we transformed the dataset using the formula Y = ln(Y + 1) to approximate normality. After the transformation, shown in Figure 4.5, the data appears more symmetrical and closer to a normal distribution. Logarithmic transformations are widely used to reduce skewness and bring data closer to normality, particularly for variables that grow exponentially or span wide ranges of values.
The logarithmic transformation helps meet the normality assumption, which is required by many statistical tests and models. It also reduces variability and improves the linear relationships between variables.

Figure 4.6: The original L2 Cache distribution
Figure 4.7: The L2 Cache distribution after applying the formula

In Figure 4.6, the histogram is strongly right-skewed, with the majority of the values concentrated at the lower end and a long tail to the right. This implies that the data is not normally distributed, which makes statistical analysis difficult without a transformation.
Using the formula Y = ln(Y + 1), the distribution in Figure 4.7 becomes more symmetric. The range of values is narrowed, with a more even distribution spanning 0 to 7.5 (logarithmic scale). However, there is still a significant bump at the low end (log-transformed values close to zero). The transformation does not completely normalize the data because there is still a strong concentration of values at the lower end (around ln(1) = 0) and a gradual spread toward higher values.
Despite this, the transformation reduced the extreme skewness while stabilizing the variance. The formula Y = ln(Y + 1) is used to avoid errors caused by ln(0), as logarithms are undefined for zero or negative numbers.

Figure 4.8: The original Memory Bus distribution
Figure 4.9: The Memory Bus distribution after applying the formula

In Figure 4.8, the histogram is skewed to the right. The vast majority of data points are concentrated on the left side of the histogram, with a long tail stretching to the right. This shows that a few extremely large values have shifted the mean to the right. The right side of the distribution contains some outliers, and the distribution deviates significantly from a bell curve, indicating that the data is not normally distributed. The right skewness suggests that most memory bus values are modest, with only a few larger values influencing the overall distribution.
Figure 4.9 shows that the distribution is still skewed to the right, although to a lesser extent than the original distribution. The outliers on the right side remain, albeit less noticeable, and the distribution does not resemble a bell curve.
According to the histograms, neither the original Memory Bus distribution nor the transformed distribution appears to be normally distributed. The right skewness and the presence of outliers indicate that the data may not satisfy the assumptions of many statistical tests that require normality.

4.1.2 Analyzing the linearity of the chosen datasets in comparison to Memory Bandwidth

In this section, we use scatter diagrams to show the linear relationship between each of three datasets (Memory Speed, L2 Cache, and Memory Bus) and the Memory Bandwidth dataset. Note that the formula Y = ln(Y + 1) has already been applied to all of the datasets.

Scatter plots are a powerful tool for visualizing the relationship between two numerical variables. They make it possible to quickly identify trends and patterns between two variables, as well as data points that depart significantly from the overall trend. Scatter plots are an effective tool for exploratory data analysis, hypothesis testing, and data-driven decision making.

Figure 4.10: Scatter Plot of Memory Speed and Memory Bandwidth

Figure 4.10 demonstrates a positive linear relationship between Memory Speed (MHz) and Memory Bandwidth (GB/s). As memory speed grows, so does memory bandwidth. This suggests that faster memory speeds are typically associated with greater bandwidth.
The points on the scatter plot generally trend upward, indicating a positive correlation between the two variables. The points appear to cluster around a straight line, implying that the relationship between Memory Speed and Memory Bandwidth is approximately linear. The points are not exactly aligned along a straight line, showing some variation in the relationship. This variability could be due to other factors or measurement errors.

This relationship implies that increasing memory speed can result in more memory bandwidth, which can increase total system performance. System designers can use this information to optimize memory settings for specific applications.


Figure 4.11: Scatter Plot of L2 Cache and Memory Bandwidth

Figure 4.11 indicates a weak positive association between L2 cache size (KB) and memory
bandwidth (GB/s). The Memory Bandwidth tends to rise in tandem with the L2 Cache
size. However, the correlation is weak, and the data is highly variable.

The points on the scatter plot indicate an overall upward trend, although the relationship
is not as evident as with a strong correlation. The points are widely distributed, demon-
strating that Memory Bandwidth varies greatly for any given L2 Cache Size. Furthermore,
the data appears to be clustered in vertical bands, indicating that some L2 Cache sizes
are more prevalent than others.
Overall, the scatter plot shows a weak positive link between L2 Cache Size and Memory Bandwidth, and the data is highly variable. This suggests that other factors may have a larger influence on memory performance.


Figure 4.12: Scatter Plot of Memory Bus and Memory Bandwidth

Figure 4.12 shows a weak positive correlation between Memory Bus (Bit) and Memory
Bandwidth (GB/s). Memory Bandwidth tends to rise as the Memory Bus width increases.
However, the correlation is weak, and the data is highly variable.
The points on the scatter plot indicate an overall upward trend, although the relationship
is not as evident as with a strong correlation. The data points cluster around specific
Memory Bus widths, indicating popular configurations or design decisions. While a larger
Memory Bus can potentially support faster data transfer rates, other factors such as
memory speed, memory controller efficiency, and system architecture are likely to have a
greater impact on overall Memory Bandwidth.
The scatter plot indicates that there is much variability in the data and that, although there is a small positive correlation between Memory Bandwidth and Memory Bus width, the relationship is not strong. This implies that memory performance may be more significantly influenced by other factors.
Overall, these scatter diagrams illustrate that each of the chosen datasets has at least a weak linear relationship with the Memory Bandwidth dataset.
The correlation diagram then quantifies the strength of the linear relationship between each chosen dataset and Memory Bandwidth.


Figure 4.13: Correlation diagram of all datasets under consideration

The correlation matrix visually represents the linear relationships between pairs of vari-
ables. Each cell in the matrix shows the correlation coefficient, a numerical value that
ranges from -1 to 1.

• Positive Correlation: A positive correlation indicates that as one variable increases, the other variable also tends to increase. A value closer to 1 signifies a stronger positive relationship.

• Negative Correlation: A negative correlation indicates that as one variable increases, the other variable tends to decrease. A value closer to -1 signifies a stronger negative relationship.

• Zero Correlation: A correlation coefficient of 0 indicates no linear relationship between the variables.

The correlation diagram represents the linear relationships between the four variables: memory bandwidth, memory speed, L2 cache, and memory bus. The color intensity in each cell represents the strength and direction of the correlation.
Interpretation of Correlations:
Interpretation of Correlations:

1. Memory Bandwidth vs. Memory Speed: The intense red color indicates a strong positive correlation. This means that as Memory Speed increases, Memory Bandwidth also tends to increase. This is expected, as faster memory can transfer data at higher rates.

2. Memory Bandwidth vs. L2 Cache: The light red color indicates a weak positive correlation. This suggests that larger L2 caches might be associated with slightly higher Memory Bandwidths, but the relationship is not very strong. This could be due to factors like memory controller efficiency and system architecture.

3. Memory Bandwidth vs. Memory Bus: Similar to the relationship with L2 Cache, the positive correlation suggests that wider Memory Buses might be associated with slightly higher Memory Bandwidths. However, other factors likely play a more dominant role.

4.1.3 Boxplot between Memory Bandwidth and other datasets:

Figure 4.14: Boxplot of Memory Speed and Memory Bandwidth

Figure 4.14 shows the distribution of memory bandwidth across different memory speeds.
We can see that the majority of the data points fall within the range of 4 to 5 GB/s. The
overall distribution appears to be relatively narrow and centered around this range.

Median and quartiles


• Median: The blue line within the box represents the median memory bandwidth. It indicates that half of the observations have a memory bandwidth below this value, and half have a bandwidth above it.

• Quartiles: The box itself represents the middle 50% of the data. The bottom edge of the box corresponds to the first quartile (Q1), and the top edge corresponds to the third quartile (Q3). This means that 25% of the observations have a memory bandwidth below Q1, and 25% have a bandwidth above Q3.

Outliers
The range that the majority of the data, excluding outliers, falls within is shown by the
whiskers that extend from the box. Since the data points are inside the whiskers, there
aren’t any outliers in this instance.

Interpretation
The boxplot suggests that memory speed has a significant impact on memory bandwidth.
As the memory speed increases, the memory bandwidth generally tends to increase as
well. However, the relationship is not strictly linear. There is some variability in memory
bandwidth for different memory speeds.
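
The quantities read off a boxplot (median, quartiles, and outliers beyond the whiskers) can be computed as in the Python sketch below. It assumes the common convention that whiskers extend at most 1.5 times the interquartile range beyond the box, which the report does not state; the data values are made up.

```python
import statistics


def boxplot_summary(values, whisker=1.5):
    """Return the quartiles and the points lying beyond the whisker fences."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1  # height of the box
    lower_fence = q1 - whisker * iqr
    upper_fence = q3 + whisker * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {"q1": q1, "median": median, "q3": q3, "outliers": outliers}


# Made-up sample with one extreme value.
summary = boxplot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
```

A dataset with no points outside the fences, as described for the figures in this section, would return an empty outlier list.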

Figure 4.15: Boxplot of L2 Cache and Memory Bandwidth

Figure 4.15 illustrates the distribution of memory bandwidth across various L2 cache sizes. We can observe that the majority of the data points lie between 4 and 5 GB/s. The overall distribution appears rather narrow and centered on this range.


Median and quartiles


• Median: The blue line in the box reflects the median memory bandwidth. It shows that half of the observations have a memory bandwidth less than this amount, while the other half have a bandwidth more than it.

• Quartiles: The box reflects the middle 50 percent of the data. The box’s bottom edge corresponds to the first quartile (Q1), while the upper edge corresponds to the third quartile (Q3). This means that 25% of the observations have a memory bandwidth less than Q1, while 25% have a bandwidth greater than Q3.

Outliers
The whiskers extending from the box represent the range in which the majority of the
data falls, excluding outliers. In this situation, there are no outliers because the data
points are within the whiskers.

Interpretation
The boxplot indicates that L2 cache size has a considerable effect on memory bandwidth.
As the capacity of the L2 cache increases, so does memory bandwidth. However, the
relationship is not exactly linear. Memory bandwidth varies depending on the L2 cache
size.

Figure 4.16: Box Plot of Memory Bus and Memory Bandwidth


The distribution of memory bandwidth across various memory bus sizes is shown in Figure 4.16. Most of the data points, as we can see, fall between 4 and 5 GB/s. The distribution is generally rather narrow and centered on this range.

Median and quartiles


• Median: The median memory bandwidth is shown by the blue line inside the box. It shows that the memory bandwidth of half of the observations is below this value, while the bandwidth of the other half is above it.

• Quartiles: The middle 50% of the data is represented by the box itself. The first quartile (Q1) is represented by the box’s bottom edge, while the third quartile (Q3) is represented by its top edge. The memory bandwidth of 25% of the observations is below Q1, and that of 25% is above Q3.

Outliers
The range that the majority of the data, excluding outliers, falls within is shown by the
whiskers that extend from the box. Since the data points are inside the whiskers, there
aren’t any outliers in this instance.

Interpretation
According to the boxplot, memory bandwidth is significantly impacted by memory bus
size. In general, the memory bandwidth tends to increase in tandem with the size of the
memory bus. The relationship isn’t exactly linear, though. The memory bandwidth varies
somewhat depending on the size of the memory bus.

5 Inferential Statistics
5.1 Linear Regression Problem for GPUs’ memory bandwidth
Memory Bandwidth is a crucial feature of GPUs, significantly influencing performance and responsiveness. Higher memory bandwidth enhances efficiency, particularly in tasks demanding quick data access, such as gaming and graphics rendering. As highlighted in this report, the primary objective is to analyze Memory Bandwidth and examine how various attributes (independent variables) impact this dependent variable.

5.2 Data Splitting


We split the data into two parts, train_data and test_data, in an 80:20 ratio. The train_data is used to build the model, while the test_data is used for making predictions.
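
A minimal Python sketch of an 80:20 random split is given below. The fixed seed and the function name are assumptions made for reproducibility of the example; the report does not say how the rows were shuffled.

```python
import random


def train_test_split(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of the rows and cut it into train and test parts."""
    shuffled = list(rows)
    # A dedicated, seeded generator keeps the split reproducible (seed is arbitrary).
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Row indices stand in for the observations of the cleaned dataset.
train_data, test_data = train_test_split(list(range(100)))
```

Shuffling before cutting matters: if the file is ordered (for example by release date or manufacturer), taking the first 80% of rows would give a training set that differs systematically from the test set.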

5.3 Multiple Linear Regression Analysis


5.3.1 Model Purpose

In this section, we will examine the relationship between Memory Bandwidth and various key GPU features. The focus will be on Multiple Linear Regression (MLR), as it is a suitable method for predicting a quantitative dependent variable like Memory Bandwidth, using a mix of both numerical and categorical independent variables. This approach allows us to model and understand the linear relationships between Memory Bandwidth and GPU characteristics, such as Memory Speed, Memory Bus, and L2 Cache (numerical variables). MLR’s capability to manage these diverse variable types and provide interpretable coefficients makes it a suitable choice for our analysis.

5.3.2 Model Definition

The multiple linear regression model is mathematically defined as follows:

Y = β0 + β1 x1 + β2 x2 + β3 x3 + ϵ

where:

• Y : Dependent variable (target), representing Memory Bandwidth.

• xi : Independent variables (predictors), including Memory Speed, Memory Bus and L2 Cache.

• β0 : Intercept (constant term).

• βi : Coefficients representing the effect of each predictor variable xi on the dependent variable.

• ϵ: Error term (residuals).

To estimate the coefficients (βi ), the least squares criterion is employed, minimizing the sum of squared residuals. The resulting fitted model is expressed as:

Ŷ = β̂0 + β̂1 x1 + β̂2 x2 + β̂3 x3

where:

• Ŷ : predicted value of the dependent (target) variable, Memory Bandwidth

• xi : explanatory (predictor) variables: Memory Speed, Memory Bus, L2 Cache

• β̂0 : estimated y-intercept (constant term)

• β̂i : estimated slope coefficient for the corresponding explanatory variable xi
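
The least squares estimates can be obtained by solving the normal equations (XᵀX)β̂ = Xᵀy. The following standard-library Python sketch illustrates this for three predictors; it is not the report's R code, the function names are assumptions, and the data is made up so that it follows y = 1 + 2x1 + 3x2 − x3 exactly.

```python
def solve_linear_system(a, b):
    """Solve a.x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: predictors may be collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_ols(predictors, y):
    """Return [b0, b1, ..., bk] that minimize the sum of squared residuals."""
    design = [[1.0] + list(row) for row in predictors]  # leading 1 is the intercept
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(coefficients, row):
    """Evaluate the fitted model for one observation."""
    return coefficients[0] + sum(b * x for b, x in zip(coefficients[1:], row))


# Made-up predictor rows (x1, x2, x3) and a response built from known coefficients.
rows = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 0), (1, 0, 1), (0, 1, 1), (2, 1, 3)]
y = [1 + 2 * x1 + 3 * x2 - x3 for x1, x2, x3 in rows]
coefficients = fit_ols(rows, y)
```

Because the made-up response contains no noise, the fit recovers the generating coefficients; on real data the estimates differ from any "true" values by sampling error, which is what the hypothesis tests in this section assess.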

In this study, the model incorporates three predictors to determine their influence on
memory bandwidth.

Hypothesis Testing For each predictor, a hypothesis test is conducted to assess its
statistical significance:

• Null Hypothesis (H0 ): βi = 0, i ∈ {1, 2, 3}, indicating no linear relationship between
the predictor and memory bandwidth when the other predictors are held fixed.

• Alternative Hypothesis (H1 ): βi ≠ 0, i ∈ {1, 2, 3}, indicating a significant relationship
between the predictor and memory bandwidth.

Two metrics are associated with each hypothesis test: the t-value and the p-value.

• t-Value: The t-value is the estimated coefficient divided by its standard error,
t = β̂i / SE(β̂i ). It measures how far the estimate lies from zero relative to its
sampling variability. A larger absolute t-value indicates stronger evidence against H0 ,
which helps assess whether a predictor contributes significantly to the model in
analyzing and predicting memory bandwidth.

• p-Value: The p-value is the probability of observing a t-value with an absolute
magnitude at least as large as the one calculated from the sample data, assuming
the null hypothesis (H0 ) is true. A p-value below 0.05 means that the null hypothesis
is rejected at the 5% significance level, suggesting that the predictor has a
statistically significant relationship with memory bandwidth.

Model Fitting and conclusion

The hypothesis testing results show which predictors have a meaningful impact on
memory bandwidth, guiding interpretation and further model refinement. With this
framework, the analysis seeks to identify and quantify the key determinants of GPU
memory bandwidth. The model summarized below is fitted on the train_data.

Figure 5.1: Summary of Multiple Linear Regression model

Model Fit:

• The Residual Standard Error (RSE) is 79.75, suggesting that the average deviation
of observed values from predicted values is approximately 79.75 units. While this
indicates some prediction error, its acceptability depends on the domain and scale
of the data.

• The Multiple R-squared value is 0.6405, indicating that about 64.05% of the variance
in Memory Bandwidth is explained by the predictors. The Adjusted R-squared value
(0.6405) is very close to the R-squared value, which suggests that the fit is not
inflated by superfluous predictors. Overfitting itself is better judged by the
prediction error on the test_data.

• The F-statistic is 1518, with a p-value less than 2.2 × 10^−16. This indicates that
the overall model is highly significant and that the predictors, as a group, explain a
substantial portion of the variance in Memory Bandwidth.
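The link between R-squared, Adjusted R-squared and the overall F-statistic can be sketched with the standard formulas for a linear model with an intercept, n observations and k predictors. The function names and the example numbers below are hypothetical; the sample size of the GPU data is not restated here, so the reported values are not reproduced.

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R^2 penalizes R^2 for the number of predictors k."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

def f_statistic(r2, n, k):
    """Overall F-statistic: explained variance per predictor over residual variance per degree of freedom."""
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))

# Hypothetical example: R^2 = 0.5 from n = 103 observations and k = 2 predictors.
print(round(adjusted_r_squared(0.5, 103, 2), 4))
print(round(f_statistic(0.5, 103, 2), 4))
```

When n is large relative to k, the factor (n − 1)/(n − k − 1) is close to 1, which is why the two R-squared values nearly coincide in large samples.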

Feature Selection: Based on the model summary, the significant predictors influencing
Memory Bandwidth are Memory Speed, L2 Cache and Memory Bus.
These variables have high absolute t-values and p-values below 0.05, indicating their
importance in explaining the variability of Memory Bandwidth. Customers should consider
these factors when evaluating GPU performance.

Effects of Predictors on Memory Bandwidth: Analyzing the coefficients provides
insights into the contributions and relationships of each predictor. Each estimate is
the change in predicted Memory Bandwidth for a one-unit increase in that predictor, with
the other predictors held fixed:

1. Numerical Predictors:

• Memory Speed (Estimate = 0.1069): For every 1 MHz increase in Memory
Speed, the predicted Memory Bandwidth increases by 0.1069 units, indicating
a positive and direct relationship.
• L2 Cache (Estimate = 0.0572): A 1-unit increase in L2 Cache size leads
to an increase of 0.0572 units in the predicted Memory Bandwidth, showing a
positive effect.
• Memory Bus (Estimate = 0.2503): For every 1-bit increase in Memory Bus
width, the predicted Memory Bandwidth increases by 0.2503 units, also a
positive effect. Because the predictors are measured in different units, the
sizes of the three estimates are not directly comparable.

2. Categorical Predictors: Not applicable in this model summary, but if included,
their coefficients would represent the difference in the predicted Memory Bandwidth
between levels of the categorical variable.
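The coefficient interpretation above can be checked with a short Python sketch. It uses the three slope estimates quoted in the text; the dictionary keys, the function name `predicted_change`, and the 128-bit comparison are assumptions for illustration. Only changes in the prediction are computed, since the intercept is not restated in the text.

```python
# Slope estimates quoted in the text (per one unit of each predictor).
COEFFICIENTS = {"memory_speed": 0.1069, "l2_cache": 0.0572, "memory_bus": 0.2503}

def predicted_change(deltas):
    """Change in predicted Memory Bandwidth for the given predictor changes,
    with every predictor not listed held fixed."""
    return sum(COEFFICIENTS[name] * delta for name, delta in deltas.items())

# Hypothetical comparison: widening the memory bus by 128 bits.
print(round(predicted_change({"memory_bus": 128}), 4))
```

Under the fitted linear model, the effect of changing several predictors at once is the sum of the individual effects, which is what the function computes.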

Standard Error (SE): The standard error of a coefficient measures the sampling
variability of its estimate, that is, how much the estimate would be expected to vary
from sample to sample.

• A small SE relative to the size of the estimate indicates that the coefficient is
estimated precisely.

• In this model, the SE values are low relative to the estimates for all predictors
(e.g., Memory Speed SE = 0.0042, L2 Cache SE = 0.0021, Memory Bus SE = 0.0090),
supporting the reliability of the estimates.
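The t-value of each predictor follows directly from its estimate and standard error. The Python sketch below uses the rounded values quoted in the text, so the results may differ slightly from those in Figure 5.1. The p-value uses a standard normal approximation to the t distribution, which assumes a large number of residual degrees of freedom; that assumption and the function names are ours.

```python
import math

# (estimate, standard error) pairs quoted in the text.
reported = {
    "Memory Speed": (0.1069, 0.0042),
    "L2 Cache": (0.0572, 0.0021),
    "Memory Bus": (0.2503, 0.0090),
}

def t_value(estimate, std_error):
    """t statistic for H0: beta_i = 0."""
    return estimate / std_error

def two_sided_p_normal(t):
    """Two-sided p-value under a standard normal approximation (large df assumed)."""
    return math.erfc(abs(t) / math.sqrt(2.0))

for name, (est, se) in reported.items():
    t = t_value(est, se)
    print(f"{name}: t = {t:.2f}, approximate p = {two_sided_p_normal(t):.3g}")
```

All three t-values are far above the usual threshold of about 1.96 for a 5% two-sided test, which matches the conclusion that each predictor is statistically significant.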

5.4 Checking Model Assumptions


Using Residual Plots to Validate Assumptions Residual diagnostic plots for the fitted
model (for example, by calling plot() on a fitted lm object in R) provide four key plots
for model evaluation:

1. Plot 1 - Residuals vs. Fitted Values:

• This plot checks for linearity and homoscedasticity. If the residuals are randomly
scattered around the zero line without any discernible pattern, the assumption of
constant variance is satisfied.

2. Plot 2 - Normal Q-Q Plot:

• This plot checks whether the residuals are normally distributed. If the points
closely follow the diagonal line, the residuals can be considered normally
distributed.

3. Plot 3 - Scale-Location Plot:

• Another tool for assessing homoscedasticity. This plot helps identify whether
the variance of the residuals remains stable across the range of fitted values.

4. Plot 4 - Residuals vs. Leverage Plot:

• This plot is used to detect influential data points, such as outliers with high
leverage, that may have a strong impact on the model. Points with high leverage or
extreme residuals should be examined carefully.

Figure 5.2: Plots for model verification

Observations from the Residual Diagnostic Plots

1. Residuals vs Fitted:

• This plot shows the relationship between residuals (differences between observed
and predicted values) and fitted values (predicted values from the model).
• The plot reveals a non-random pattern with increasing spread as fitted values
increase. This suggests a heteroscedasticity (non-constant variance) problem,
meaning that the variability in residuals increases with the fitted values.
• Ideally, the residuals should be randomly scattered around zero with no clear
trend, indicating a good model fit.

2. Q-Q Plot (Quantile-Quantile Plot):

• The Q-Q plot compares the distribution of residuals to a normal distribution.
• In this case, the residuals appear to deviate significantly from the normal
distribution, particularly in the tails. This suggests that the residuals may not be
normally distributed, which is a key assumption for linear regression models.
• A good fit would show the points lying close to the diagonal line.

3. Scale-Location Plot:

• This plot shows the relationship between the square root of the standardized
residuals and the fitted values.
• Like the "Residuals vs Fitted" plot, it shows a non-random pattern. The spread
of residuals appears to increase with fitted values, confirming the presence of
heteroscedasticity.

4. Residuals vs Leverage:

• This plot helps identify influential data points that may disproportionately
affect the regression model.
• The plot indicates a few points with high leverage, which could be influential
observations. Points with high leverage and large residuals may have a
significant impact on the model’s results and should be examined more closely.

Overall Assessment:

• The model shows potential issues with heteroscedasticity and non-normality of
residuals, so the standard errors, t-values and p-values reported above should be
interpreted with caution.

• There are points with high leverage, which could distort the model fit.
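As an informal numeric companion to the Scale-Location plot, one can compare the residual variance in the upper half of the fitted values with that in the lower half. The Python sketch below is a crude check under our own assumptions, not a formal test and not part of the report's analysis; the function name `spread_ratio` and the synthetic residuals are hypothetical.

```python
from statistics import pvariance

def spread_ratio(fitted, residuals):
    """Residual variance for the larger half of fitted values divided by
    the residual variance for the smaller half."""
    pairs = sorted(zip(fitted, residuals))  # order observations by fitted value
    half = len(pairs) // 2
    low = [r for _, r in pairs[:half]]
    high = [r for _, r in pairs[-half:]]
    return pvariance(high) / pvariance(low)

# Synthetic residuals whose spread grows with the fitted value (not the GPU data).
fitted = list(range(1, 41))
growing = [(-1) ** i * 0.1 * f for i, f in enumerate(fitted)]
constant = [(-1) ** i * 1.0 for i in range(40)]
print(spread_ratio(fitted, growing) > 1.0)
print(spread_ratio(fitted, constant))
```

A ratio well above 1 is consistent with the fan-shaped pattern described above, while a ratio near 1 is what constant variance would produce.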

6 Discussion and Expansion


This study underscores the importance of GPU properties, particularly Memory Speed
and Memory Bus, as primary determinants of memory bandwidth, a critical performance
metric. Shader Count and Dedicated Memory had little effect, which points to
memory-related improvements rather than raw processing power, while L2 Cache stood out
as a significant contributor because of its role in lowering latency and enhancing data
locality. These findings matter because they guide GPU manufacturers to concentrate on
enhancing memory performance and bus width for AI and gaming applications. Researchers
can refine benchmarking techniques for GPU performance evaluation, and developers can
tailor their algorithms to particular memory configurations. Although the study offers
useful information, it is limited by a small dataset that may not be relevant to more
recent architectures and that does not take dynamic performance, power consumption, or
thermal efficiency into account. Multiple linear regression also ignores non-linear
relationships and interactions between variables. Future research can address these
limitations by incorporating machine learning models, expanding datasets, and analyzing
workload-specific performance. By showing how data-driven methodologies can inform
advances in computing technology across domains such as artificial intelligence, gaming,
and real-time graphics, this work advances our understanding of GPU optimization.

6.1 Limitations
The presented study has limitations. The dataset’s scope is bound to a certain set
of GPU models, so the findings might not generalise to newer architectures or emerging
manufacturers. In addition, factors such as thermal efficiency, power consumption,
and dynamic performance under load were omitted, which makes the analysis less
comprehensive. Multiple linear regression further limits the analysis since it does not
capture nonlinear relationships or higher-order interactions among variables. While the
model contributes useful insight, such interactions could make a substantial difference
to real-world GPU performance.

6.2 Future Directions

Future research could expand on this work in several ways:

1. Incorporating Non-Linear Models: Machine learning algorithms such as neural networks
or decision trees could uncover non-linear relationships among the GPU’s features.

2. Including Additional Variables: A broader understanding of GPU capability would
also involve collecting data on real-time performance metrics, power consumption, and
thermal efficiency.

3. Broadening the Dataset: Including additional brands and more recent
models should broaden the applicability of these findings.

4. Scenario-Based Analysis: Studying GPU performance under distinct workloads such as
scientific simulations, AI training, or gaming could surface application-specific
improvements, since the relationships found here may differ across workloads.

The study’s results reach beyond GPU architecture: they offer insights for optimizing
hardware and software for artificial intelligence, gaming, and real-time graphics,
where high computational power is necessary. By showing how data-driven approaches
and statistical methods can support system design and performance prediction, this
research opens avenues for further work in computational technology.

7 Conclusion
This report applies statistical methods and computational analysis to examine the
relationship between performance metrics, notably memory bandwidth, and various
features of GPUs. These features include, among others, Memory Speed, L2 Cache, Memory
Bus, Shader Count, Dedicated Memory, and Manufacturer Manipulation. R programming was
used to clean, normalize, and transform the data sets for analysis. Descriptive
statistics and histograms showed trends and patterns, while transformations such as
logarithmic scaling made the data better suited to modelling. The linear regression
indicated that Memory Speed and Memory Bus were major predictors of memory bandwidth.
These findings are useful for researchers, developers, and manufacturers
involved in performance optimization of either standalone or integrated GPUs, providing
practical insights that can be applied in their work. However, the study still has a
long way to go in exploring nonlinear interactions and the use of larger datasets.

Here is the link to our R code for this assignment.

References
[1] Massimiliano Fatica and Gregory Ruetsch. Chapter 2 - performance measurement
and metrics. In Massimiliano Fatica and Gregory Ruetsch, editors, CUDA Fortran for
Scientists and Engineers, pages 31–42. Morgan Kaufmann, Boston, 2014.
