Democratic and Popular Republic of Algeria
Ministry of Higher Education and Scientific Research
SAAD DAHLEB University - BLIDA 1
Faculty of Sciences
Department of Computer Science
Master Ingénierie des Systèmes Intelligents (ISI)
Mini project
Data Analysis
Prepared by:
Medjrab Feriel
Kamilya Guettai
Malak Benaissi
Contents
1 Introduction
1.1 Context
1.2 Problem Statement
1.3 Project Objectives
1.4 Dataset Characteristics
1.5 Dataset Cleaning
2 Univariate Analysis
2.1 Introduction
2.2 Age Distribution
2.3 Gender Distribution
2.4 Conclusion
3 Bivariate Analysis
3.1 Introduction
3.1.1 Correlation Matrix
3.1.2 Result
3.2 Correlation Test (Example Between Two Quantitative Variables)
3.2.1 Results
3.3 Analysis of Qualitative Variables (Chi-Square Test)
3.3.1 Methodology
3.3.2 Results
3.3.3 Visualization
3.4 Visualization of Quantitative Relationships
3.5 Conclusion
4 Multivariate Analysis
4.1 Introduction
4.2 Methodology
4.2.1 Data Preparation
4.2.2 Principal Component Analysis (PCA)
4.3 Results Interpretation
List of Figures
2.1 Age Distribution of the Population
2.2 Gender Distribution of the Population
2.3 Top Types of Injury in the Population
3.1 Relationship between date of death and age
3.2 Relationship between Gender and Citizenship
4.1 PCA scatter plot
Chapter 1
Introduction
1.1 Context
The Palestinian conflict is one of the most prolonged and devastating struggles in modern
history, marked by immense human suffering and the tragic loss of countless innocent lives.
Behind every statistic is a story of pain, resilience, and injustice.
This project focuses on a dataset documenting Palestinians killed by Israeli forces. Each
entry in this dataset represents not just a number, but a life lost, a family grieving, and a reality
that cannot be ignored. By leveraging data analysis, we aim to uncover meaningful insights, not
as mere trends but as a way to honor the voices of those impacted and highlight the magnitude
of this crisis.
Using the Python programming language, we take a systematic approach to analyze the
dataset through univariate, bivariate, and multivariate techniques. Our goal is to provide a
deeper understanding of the demographic, geographical, and situational dimensions of these
tragic events while maintaining a respectful and humane perspective.
1.2 Problem Statement
The Palestinian conflict has led to the loss of countless innocent lives, leaving behind data
that tells a devastating story of suffering and injustice. The main challenge is: How can we
analyze this dataset to uncover key patterns, understand the factors behind these tragic
events, and present insights that shed light on the magnitude of this humanitarian crisis?
1.3 Project Objectives
The main objective of this project is to analyze the dataset on Palestinians killed by Israeli forces
to uncover relevant trends and relationships. The specific objectives include:
• Dataset Description: Provide a detailed overview, including the size of the dataset, the
number of variables (quantitative and qualitative), and key information.
• Trend Analysis: Identify demographic and geographical trends, such as age, gender, or
the most affected regions.
• Relationships Between Variables: Explore relationships between factors such as the
type of weapons used, locations of incidents, and characteristics of the victims.
• Preparation for Advanced Analysis: Facilitate advanced analysis or impactful
visualizations to raise awareness and provide a deeper understanding of the reported data.
1.4 Dataset Characteristics
• Number of Variables: 16
• Notable Variables: “Date of Event,” “Age,” “Gender,” “Place of Residence,” “Killed
By,” “Type of Injury,” “Ammunition.”
• Nature of Data: Qualitative (categorical) and quantitative (numerical).
• Columns: Includes information such as name, event date, age, citizenship, event
location, gender, involvement in hostilities, cause of death, and notes.
• Null Values: Some columns, including “Age,” “Gender,” “Type of injury,” “Ammunition,”
and others, have missing values.
1.5 Dataset Cleaning
• Date Columns: Converted “Date of event” and “Date of death” to datetime format.
• Missing Values:
– “Age”: Filled with the median age.
– Other categorical columns (“Type of injury,” “Ammunition”): Filled with “Unknown.”
• Gender: Standardized to uppercase (e.g., “M,” “F”); missing values replaced with
“UNKNOWN.”
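The cleaning steps listed above can be sketched in pandas as follows. This is a minimal illustration under stated assumptions, not the project's actual script: the toy rows and the snake_case column names (`date_of_event`, `type_of_injury`, and so on) are hypothetical stand-ins for the real dataset.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for the real file; names and values are assumptions.
df = pd.DataFrame({
    "date_of_event": ["2021-05-10", "2021-05-11", "2021-05-12", "2021-05-13"],
    "date_of_death": ["2021-05-10", "2021-05-12", "2021-05-12", None],
    "age": [19.0, np.nan, 34.0, 23.0],
    "gender": ["m", "F", None, "M"],
    "type_of_injury": ["gunfire", None, "explosion", "gunfire"],
    "ammunition": [None, "missile", None, "live ammunition"],
})


def clean(frame: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps of Section 1.5 to a copy of the frame."""
    out = frame.copy()
    # Date columns: parse to datetime; unparseable or missing entries become NaT.
    for col in ["date_of_event", "date_of_death"]:
        out[col] = pd.to_datetime(out[col], errors="coerce")
    # Age: fill missing values with the median of the observed ages.
    out["age"] = out["age"].fillna(out["age"].median())
    # Other categorical columns: fill missing values with "Unknown".
    for col in ["type_of_injury", "ammunition"]:
        out[col] = out[col].fillna("Unknown")
    # Gender: uppercase, then label missing values "UNKNOWN".
    out["gender"] = out["gender"].str.upper().fillna("UNKNOWN")
    return out


cleaned = clean(df)
print(cleaned)
```

Working on a copy keeps the raw data available for comparison, which makes it easy to check how many values each step actually changed.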
Chapter 2
Univariate Analysis
2.1 Introduction
This chapter presents a univariate analysis of the age, gender, and injury-type variables.
For age, the analysis focuses on the median as the measure of central tendency; for the
categorical variables, it examines the frequency of each category. Together, these describe
the typical characteristics of the population under study and highlight its main trends.
2.2 Age Distribution
Median Age: The median age is 23, indicating that half of the individuals are younger than this
age, and the other half are older.
The age distribution peaks in the late teens and early twenties, so young adults make up the
largest group.
Frequencies decline sharply for individuals older than 30, indicating fewer cases in older
age groups.
Insight: The median of 23 lies close to the peak of the distribution, which confirms that most
individuals cluster in the young-adult range and is consistent with a distribution concentrated
at younger ages.
Figure 2.1: Age Distribution of the Population
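As a small illustration of how the median and the shape of an age distribution can be read together, the sketch below uses made-up ages (not values from the dataset) chosen so that the median is 23. The ten-year bins are an arbitrary choice for the example.

```python
import numpy as np
import pandas as pd

# Made-up ages for illustration only; they are not taken from the dataset.
ages = pd.Series([17, 18, 19, 19, 21, 23, 24, 27, 31, 45, 62], name="age")

median_age = ages.median()   # middle value of the sorted ages
mean_age = ages.mean()       # pulled upward by the few older ages
skewness = ages.skew()       # positive when the tail extends toward older ages

# Ten-year bins, the same kind of grouping a histogram draws.
bin_edges = np.arange(0, 80, 10)
counts, _ = np.histogram(ages, bins=bin_edges)

# Text histogram: one '#' per individual in each bin.
for left, right, n in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"{left:>2}-{right:<2} {'#' * int(n)}")
print(f"median={median_age}, mean={mean_age:.1f}, skewness={skewness:.2f}")
```

When the mean sits above the median and the skewness is positive, the bulk of the values lies at younger ages with a thinner tail toward older ones, which is the pattern described in this section.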
2.3 Gender Distribution
The gender distribution is highly imbalanced:
• Male (M): The overwhelming majority of cases.
• Female (F): A small fraction of entries.
• Unknown genders: Another significant portion.
Figure 2.2: Gender Distribution of the Population
Top Types of Injury
The leading cause of injuries is Gunfire, followed by Missile or Explosive.
A substantial portion of cases is labeled as Unknown, which limits detailed conclusions.
Figure 2.3: Top Types of Injury in the Population
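Category frequencies such as those behind the gender and injury-type charts can be tabulated with `value_counts`. The sketch below is illustrative only: the helper name `frequency_table` and the category labels are assumptions, and the counts are invented.

```python
import pandas as pd


def frequency_table(values: pd.Series) -> pd.DataFrame:
    """Return counts and percentage shares per category, most frequent first."""
    counts = values.value_counts(dropna=False)
    shares = (counts / counts.sum() * 100).round(1)
    return pd.DataFrame({"count": counts, "share_pct": shares})


# Made-up labels and counts for illustration; real categories may differ.
injuries = pd.Series(
    ["gunfire"] * 6 + ["explosion"] * 3 + ["Unknown"] * 2 + ["other"],
    name="type_of_injury",
)

table = frequency_table(injuries)
top_three = table.head(3)
print(top_three)
```

Reporting the share of the "Unknown" category alongside the others makes it explicit how much of the variable is unobserved before any conclusion is drawn from the ranking.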
2.4 Conclusion
In conclusion, the univariate analysis of the data reveals several key insights into the distribution
of age, gender, and injury types within the population. The median age shows a concentration of
individuals in their late teens and early twenties, highlighting the youthfulness of the population.
The gender distribution is heavily imbalanced, with a predominance of male cases. Finally,
gunfire stands out as the leading cause of injuries, underscoring the importance of addressing
this particular issue. By leveraging these insights, future research and policy-making can be
more targeted and informed.
Chapter 3
Bivariate Analysis
3.1 Introduction
The objective of bivariate analysis is to examine relationships between pairs of variables in
the dataset. This analysis helps identify correlations, dependencies, or significant associations
between variables. The techniques include:
1. Correlation analysis for quantitative variables.
2. Chi-Square test for qualitative variables.
3. Visualization of relationships using appropriate graphs.
Correlation Analysis (Quantitative Variables)
A correlation matrix was constructed to examine the strength and direction of relationships
between numerical variables.
3.1.1 Correlation Matrix
Correlation values range from -1 to 1:
• A correlation close to 1 indicates a strong positive relationship.
• A correlation close to -1 indicates a strong negative relationship.
• A correlation close to 0 indicates no linear relationship.
3.1.2 Result
Correlation coefficients and their significance were calculated. A heatmap was used to visualize
the relationships.
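A correlation matrix of this kind can be computed with `DataFrame.corr`. The following is a sketch on synthetic data; the column names `age`, `age_plus_noise`, and `unrelated` are hypothetical and exist only to show a strong and a weak correlation side by side.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic numeric columns (assumptions, not the project's variables).
age = rng.normal(27, 10, n)
toy = pd.DataFrame({
    "age": age,
    "age_plus_noise": age + rng.normal(0, 2, n),  # built to track age closely
    "unrelated": rng.normal(0, 1, n),             # generated independently
})

# Pairwise Pearson coefficients; the result is symmetric with 1.0 on the diagonal.
corr = toy.corr(method="pearson")
print(corr.round(2))
```

The resulting matrix can be passed to a plotting function such as `seaborn.heatmap(corr, annot=True)` to obtain a heatmap.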
3.2 Correlation Test (Example Between Two Quantitative Variables)
Methodology
The Pearson correlation test was used to check whether a statistically significant linear
relationship exists between two quantitative variables (e.g., Variable1 and Variable2).
3.2.1 Results
• Correlation coefficient: r = X.XX (calculated value).
• P-value: p = X.XXXX.
Visualization
Figure 3.1: Relationship between date of death and age
Interpretation:
• If p < 0.05: A statistically significant linear relationship exists between the two variables.
• Otherwise: No statistically significant linear relationship is detected.
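The test and the decision rule above can be sketched with `scipy.stats.pearsonr`. The data here are synthetic and deliberately constructed to be correlated; `x`, `y`, and the 0.05 threshold variable `alpha` are illustrative names, and the output says nothing about the project's own results.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(42)

# Synthetic variables, built so that y depends linearly on x plus noise.
x = rng.normal(size=100)
y = 0.8 * x + rng.normal(scale=0.5, size=100)

# pearsonr returns the correlation coefficient and the two-sided p-value.
r, p_value = pearsonr(x, y)

alpha = 0.05
significant = p_value < alpha
print(f"r = {r:.2f}, p = {p_value:.4f}, significant at {alpha}: {significant}")
```

A small p-value only indicates that a linear association is unlikely to be due to chance; the size of `r` is what describes how strong that association is.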
3.3 Analysis of Qualitative Variables (Chi-Square Test)
3.3.1 Methodology
A cross-tabulation table was generated to evaluate the association between two qualitative
variables (e.g., Variable1 and Variable2). The Chi-Square test was then applied to check for
independence between these variables.
3.3.2 Results
• Chi2: X.XX (calculated value).
• P-value: p = X.XXXX.
Interpretation:
• If p < 0.05: A statistically significant association between the two variables is observed.
• Otherwise: The hypothesis of independence cannot be rejected; no significant association
is detected.
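A cross-tabulation followed by a Chi-Square test of independence can be sketched with `pandas.crosstab` and `scipy.stats.chi2_contingency`. The rows below are invented, and the citizenship labels "X" and "Y" are placeholders rather than values from the dataset.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Invented rows; "X" and "Y" are placeholder citizenship labels.
toy = pd.DataFrame({
    "gender": ["M"] * 30 + ["F"] * 10,
    "citizenship": ["X"] * 25 + ["Y"] * 5 + ["X"] * 5 + ["Y"] * 5,
})

# Contingency table of observed counts (rows: gender, columns: citizenship).
table = pd.crosstab(toy["gender"], toy["citizenship"])

# Returns the statistic, the p-value, the degrees of freedom, and the counts
# expected under independence. SciPy applies Yates' continuity correction
# by default when the table is 2x2.
chi2, p_value, dof, expected = chi2_contingency(table)

print(table)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}, dof = {dof}")
```

Comparing `table` with `expected` shows which cells drive the statistic. The test is generally considered reliable only when the expected counts are not too small.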
3.3.3 Visualization
Figure 3.2: Relationship between Gender and Citizenship
3.4 Visualization of Quantitative Relationships
A pairplot was generated to visually examine the relationships between all pairs of quantitative
variables. This plot helps identify trends or non-linear relationships that may not appear in the
correlation coefficients.
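The point that a plot can reveal what a correlation coefficient hides can be illustrated numerically. In this synthetic sketch, `y` is fully determined by `x`, yet the Pearson coefficient over the whole range is essentially zero because the relationship is not linear.

```python
import numpy as np

# Synthetic example: a symmetric range of x and a purely quadratic y.
x = np.linspace(-3, 3, 61)
y = x ** 2

# Over the full range the upward and downward halves cancel out.
r_full = np.corrcoef(x, y)[0, 1]

# Restricted to x >= 0, the same relationship looks strongly positive.
mask = x >= 0
r_right = np.corrcoef(x[mask], y[mask])[0, 1]

print(f"r over full range: {r_full:.3f}")
print(f"r for x >= 0:      {r_right:.3f}")
```

A scatter plot of these two variables would show a clear U shape, which is the kind of pattern a pairplot makes visible.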
3.5 Conclusion
1. Significant Correlations: Numerical variables showing strong correlations can be
further studied to understand their interactions. These relationships could be used for
predictive analysis.
2. Significant Qualitative Associations: If significant relationships were detected using
the Chi-Square test, it would be interesting to explore their relevance in the context of
this study.
3. Next Steps:
• To deepen the analysis, multivariate techniques such as PCA could be used to reduce
dimensionality and identify hidden patterns.
• Integrate these findings into predictive models or advanced analyses to maximize
their utility.
Chapter 4
Multivariate Analysis
4.1 Introduction
Multivariate analysis involves the simultaneous examination of three or more variables to
understand the complex relationships among them. This method helps determine how these variables
interact and influence outcomes. Commonly used techniques include:
• Principal Component Analysis (PCA),
• Correspondence Analysis (CA),
• Other methods for dimensionality reduction and uncovering hidden patterns.
The objective of this chapter is to conduct a multivariate analysis based on the relationships
identified during the bivariate analysis.
4.2 Methodology
4.2.1 Data Preparation
To carry out this multivariate analysis, the following steps were performed:
1. Identification of relevant numerical and categorical variables,
2. Standardization of numerical data using the StandardScaler method from the scikit-learn
library,
3. Verification and handling of missing values through imputation or removal.
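The preparation steps above can be sketched as follows. The toy frame and the column name `event_year` are assumptions made for illustration. Median imputation via `SimpleImputer` is one possible way to handle missing values, and applying it before scaling is a choice made in this sketch so that the scaler sees complete columns.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Toy frame; column names and values are assumptions for illustration.
toy = pd.DataFrame({
    "age": [19.0, 23.0, np.nan, 34.0, 41.0, 17.0],
    "event_year": [2005, 2009, 2014, 2014, 2021, np.nan],
    "gender": ["M", "M", "F", "M", "F", "M"],
})

# 1. Identify the numerical variables (categorical ones are left aside here).
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# 2. Handle missing values: replace each gap with the column median.
imputed = SimpleImputer(strategy="median").fit_transform(toy[numeric_cols])

# 3. Standardize: each column ends up with mean 0 and unit variance.
scaled = StandardScaler().fit_transform(imputed)
scaled_df = pd.DataFrame(scaled, columns=numeric_cols, index=toy.index)

print(scaled_df.round(2))
```

Standardization matters for PCA because variables measured on larger scales would otherwise dominate the principal components.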
4.2.2 Principal Component Analysis (PCA)
PCA was employed to reduce the dimensionality of numerical variables and identify axes that
explain the majority of variance in the data.
PCA Results
Two principal components were extracted, explaining a significant proportion of the variance
in the data. A summary of the results is as follows:
• Component 1: explains XX% of the variance,
• Component 2: explains XX% of the variance.
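The share of variance carried by each component can be read from `explained_variance_ratio_` after fitting scikit-learn's `PCA`. The sketch below runs on synthetic data with four hypothetical variables, so its percentages are unrelated to the project's figures.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
base = rng.normal(size=(150, 1))

# Four synthetic numeric variables; the first three share a common factor.
X = np.hstack([
    base + rng.normal(scale=0.3, size=(150, 1)),
    base + rng.normal(scale=0.3, size=(150, 1)),
    -base + rng.normal(scale=0.3, size=(150, 1)),
    rng.normal(size=(150, 1)),
])

X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
scores = pca.fit_transform(X_scaled)      # coordinates used for the scatter plot
explained = pca.explained_variance_ratio_  # fraction of variance per component
loadings = pca.components_                 # weight of each variable on each axis

for i, share in enumerate(explained, start=1):
    print(f"Component {i}: explains {share:.1%} of the variance")
```

The rows of `loadings` indicate which original variables contribute most to each axis, which helps when interpreting the scatter plot of `scores`.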
Visualization
Figure 4.1: PCA scatter plot
4.3 Results Interpretation
The extracted axes enable:
• Identification of potential groupings or clusters,
• Highlighting hidden relationships or patterns among variables,
• Simplifying the visualization of high-dimensional data.