Exploratory Data Analysis on

JUDICIARY

by

Group 05

Kunal Anand (ID: 202318057, Course: MSc(DS))
Anjali Singh (ID: 202318050, Course: MSc(DS))
Ashutosh Anand (ID: 202318035, Course: MSc(DS))

Course Code: IT 462


Semester: Winter 2023

Under the guidance of

Dr. Gopinath Panda

Dhirubhai Ambani Institute of Information and Communication Technology

April 29, 2024


Acknowledgment
We, Kunal Anand, Anjali Singh, and Ashutosh Anand, would like to express our sincere gratitude to
our supervisor, Dr. Gopinath Panda, for his invaluable guidance, support, and expertise throughout
the course of our project. His wisdom and encouragement have been pivotal in steering this project
towards its successful completion.
Our thanks are also extended to the faculty and administration of the Dhirubhai Ambani Institute
of Information and Communication Technology (DAIICT) for providing us with the resources and
environment conducive to academic inquiry and research. The facilities and intellectual ambiance
offered by DAIICT have greatly enriched our learning experience.
We would also like to acknowledge the contributions of our classmates and colleagues who provided
us with constructive feedback and assisted us in refining our analytical approaches. Their perspectives
and critiques were essential in enhancing the quality of our work.
Special appreciation goes to the technical staff at DAIICT, whose prompt assistance with software
and hardware issues allowed for a seamless research process. Their behind-the-scenes efforts often go
unnoticed but are fundamental to the execution of projects such as ours.
Lastly, we are grateful for the data sources that were made available to us for the purpose of this
study. The insights gained from these datasets have not only facilitated our understanding of the
judiciary system but have also allowed us to contribute meaningfully to the discourse on data-driven
policy-making.
In sum, this project has been a collaborative effort, and its success is attributed to the support and
camaraderie of all those who have been a part of this academic endeavour. We are truly thankful for
everyone's contribution.
Kunal Anand
Anjali Singh
Ashutosh Anand
Declaration
I, Kunal Anand, along with my project partners, Anjali Singh and Ashutosh Anand, hereby declare
that the project titled "Exploratory Data Analysis of Judiciary" submitted to the Dhirubhai Ambani
Institute of Information and Communication Technology (DAIICT), is an original work carried out by
us under the guidance of Dr. Gopinath Panda.
We affirm that this project has not been submitted previously, in part or in full, to this or any other
university or institution for the award of a degree, diploma, or any other qualification. The work
presented in this project is a result of our own efforts, and all sources of information and data used
in the study have been duly acknowledged.
We understand the seriousness of academic dishonesty and the institute's policy against plagiarism,
and we assert that all the information presented in this project is cited and referenced appropriately.
Furthermore, we consent to the institute's use of anti-plagiarism software to check the originality of
our submission. We comprehend that any detected act of plagiarism may lead to disciplinary actions
as per the institute's norms and policies.
Signature:
Kunal Anand
Anjali Singh
Ashutosh Anand
Date: April 29, 2024
Place: Dhirubhai Ambani Institute of Information and Communication Technology (DAIICT)
Dataset Description
Data Sources
In our project report, the data sources section acknowledges the origins of the comprehensive dataset
utilized throughout our exploratory data analysis. The primary source of our data is the National
Judicial Data Grid (NJDG), which operates under the auspices of the Ministry of Law and Justice,
Government of India. This repository is an invaluable asset, offering a vast trove of judicial data that
is instrumental for analysis and research purposes.
Additionally, a significant portion of our dataset was meticulously scraped from data.gov.in, the
central government's open data platform, which encourages citizen engagement and empowerment
through access to public information.
The combination of these two robust sources ensures that our analysis is grounded in comprehensive
and authoritative data, enabling us to draw meaningful insights and to underpin our visualizations
with accuracy and reliability.
In our project on Exploratory Data Analysis (EDA) of the judiciary dataset, we've meticulously
examined three distinct datasets to uncover patterns, trends, and insights within the Indian legal
landscape. These datasets encompass a wide array of features which, when combined, offer a
holistic view of the judiciary's operations and highlight areas that may benefit from policy reform or
intervention.
The first dataset presents a detailed breakdown of civil and criminal cases by the age of the case,
type, and the corresponding bench across all Indian High Courts. 'Particulars' indicate the
nature of the cases such as 'Writ Petition', 'First Appeal', etc., providing insight into the court's
workload distribution. The 'Civil' and 'Criminal' columns quantify the case types, while the 'Total'
represents their sum, offering a view of the overall caseload. These figures are pivotal in understanding
how different cases are distributed across the judicial spectrum, which in turn can influence court
management and prioritization.

The second dataset focuses on the count and details of judges, including their distribution across
various districts and the court numbers they're associated with. Of particular interest is the
'female_judge' column, which indicates gender representation within the judiciary – an important
aspect when considering diversity and inclusivity in the legal system.
IT 462 EDA

The third dataset broadens the perspective to multiple regions, capturing civil and criminal cases
along with detailed classifications such as 'Original', 'Appeal', 'Application', and 'Execution'. This
enables an assessment of the judicial process efficiency and identifies potential backlog hotspots.

Across these datasets, we have used specific features that serve distinct analytical purposes:
- Case Age: Indicates the longevity of cases, shedding light on the efficiency of the judicial process.
- Case Type (Particulars): Reveals the nature of the judiciary's workload, helping to identify specialized
areas that may require additional resources or reforms.
- Gender Representation (Female Judge): Provides essential data for assessing gender diversity within
the judiciary.
- Geographical Distribution (State, District): Offers insights into the regional disparities or
concentrations of caseloads, essential for resource allocation and policy planning.
Comprehensively, these datasets equip us with a granular view of the judiciary's functioning – from
case type distributions and judge demographics to case age and regional disparities. This forms a
bedrock for a data-driven approach to improving judicial services, ensuring gender equality within
the legal profession, and designing targeted interventions for regions displaying disparities in case
pendency or resource allocation. It also allows for the measurement of the impact of judicial reforms
over time and aids in making the judicial system more responsive to the needs of the populace.
To ensure the effectiveness of our analysis and the subsequent recommendations, each feature
within these datasets was meticulously cleaned, encoded, and analyzed, taking into consideration
the nuances and contexts that they represent. The insights derived from this EDA will be crucial in
advocating for a more efficient, equitable, and transparent judicial system.


DATA CLEANING
Data cleaning is a crucial preliminary step in any data analysis project, as it ensures the reliability
of the findings derived from the data. In our project focused on Exploratory Data Analysis of the
judiciary dataset, comprehensive data cleaning processes were applied to several datasets to ensure
accuracy and integrity in our subsequent analysis.

Here's a detailed note on the data cleaning steps we undertook and their importance based on the
findings from our analysis:
1. Handling Missing or Incomplete Data: We came across instances where data entries were missing
or recorded as 'NaN', 'null', or inconsistent placeholders (like '##'). These were addressed either
by imputing values based on the context and available information or by removing these entries
when they represented a negligible portion of the dataset, thereby maintaining the dataset's integrity
without skewing the analysis.
2. Standardizing Entries: In datasets with categorical variables, such as case types or judicial positions,
we noticed discrepancies in naming conventions. Standardization was applied by mapping various
synonyms or similar entries to a uniform set of categories, ensuring that the analysis did not fragment
or duplicate categories unnecessarily.
3. Correcting Data Formats: Some columns, particularly those intended to represent numerical
values, contained entries in non-standard formats. These entries were converted into a consistent
numerical format, enabling accurate computations and comparisons during the analysis.
4. Encoding Categorical Variables: For machine learning models, categorical variables were
transformed into a format that could be interpreted by the algorithms. Techniques such as one-hot
encoding were used to convert categorical features like 'Particulars' and 'State' into binary vectors,
ensuring that the models could utilize this information effectively.
5. Identifying and Removing Outliers: Through the EDA process, outlier detection was paramount,
especially in understanding the distribution of cases and judge assignments. Outliers were evaluated
to determine if they represented genuine anomalies or data recording errors. Genuine outliers were
kept for their informational value, while errors were corrected or removed.
6. Gender Data Clean-Up: The 'female_judge' column is meant to hold a binary indicator of a judge's
gender, but some entries held other codes, such as '-9998 unclear'. These entries were identified
and either recoded with the appropriate binary value or removed if no accurate recoding was possible.

7. Date Formatting: Start and end dates associated with cases or judicial appointments were
standardized to a common date format. This allowed for the calculation of case durations and the
tenure of judges, providing insights into the judicial process's efficiency and judges' service periods.
8. Validation Against External Sources: For certain data points that appeared ambiguous or unusual,
external data sources were consulted to validate the information. This ensured that our dataset
remained a true representation of the judicial landscape.
9. Data Consistency Checks: Across multiple datasets, we ensured consistency in terms like the
classification of cases and the identification of courts and districts. Inconsistencies were rectified to
align all datasets, enabling a cohesive analysis across different dimensions of the data.
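
Steps 1, 3, and 6 above can be illustrated with a minimal, dependency-free Python sketch. The placeholder strings and the 'female_judge', 'Civil', and 'Criminal' column names come from the text; the sample rows, the helper names to_count and clean_female_judge, the thousands-separator format, and the choice to mark unrecoverable entries as missing (None) are assumptions made for illustration and are not the project's actual code.

```python
import math

# Placeholders named in the text that stand for "no value".
MISSING_TOKENS = {"", "nan", "null", "##"}

def to_count(raw):
    """Convert a raw cell to an int count, or None if it is missing or unparseable."""
    if raw is None:
        return None
    if isinstance(raw, float) and math.isnan(raw):
        return None
    text = str(raw).strip()
    if text.lower() in MISSING_TOKENS:
        return None
    # Assumed non-standard format: thousands separators such as "1,204".
    text = text.replace(",", "")
    try:
        return int(float(text))
    except ValueError:
        return None

def clean_female_judge(raw):
    """Map a 'female_judge' cell to 1, 0, or None (e.g. for '-9998 unclear')."""
    text = str(raw).strip().lower()
    if text in {"1", "1.0"}:
        return 1
    if text in {"0", "0.0"}:
        return 0
    return None

# Hypothetical sample rows, not taken from the datasets.
rows = [
    {"Civil": "1,204", "Criminal": "310", "female_judge": "1"},
    {"Civil": "##", "Criminal": "null", "female_judge": "-9998 unclear"},
    {"Civil": 57.0, "Criminal": float("nan"), "female_judge": "0"},
]
cleaned = [
    {
        "Civil": to_count(r["Civil"]),
        "Criminal": to_count(r["Criminal"]),
        "female_judge": clean_female_judge(r["female_judge"]),
    }
    for r in rows
]
```

Rows that still contain None after this pass can then be imputed or dropped, depending on how large a share of the dataset they represent.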
The thorough data cleaning process laid a strong foundation for the EDA, providing confidence in
the insights drawn regarding the distribution and pendency of cases, the demographic composition
of judges, and the judicial system's operational aspects. It emphasized the importance of clean,
well-structured data as the basis for any informed decision-making process. With clean data, we could
accurately analyze trends, identify areas for improvement, and provide actionable recommendations
to enhance the effectiveness and fairness of the judiciary system.
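
The date standardization described in step 7 can be sketched in the same spirit. The three input formats in KNOWN_FORMATS are assumptions, since the actual date formats in the datasets are not stated here, and parse_date and duration_days are hypothetical helper names.

```python
from datetime import datetime

# Candidate input formats are an assumption; the source does not list them.
KNOWN_FORMATS = ("%Y-%m-%d", "%d-%m-%Y", "%d/%m/%Y")

def parse_date(text):
    """Return a date parsed with the first matching format, else None."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None

def duration_days(start_text, end_text):
    """Days between two date strings, or None if either cannot be parsed."""
    start, end = parse_date(start_text), parse_date(end_text)
    if start is None or end is None:
        return None
    return (end - start).days
```

Once every date is parsed into one representation, case durations and judge tenures reduce to simple subtraction, and unparseable entries surface as None instead of silently distorting the result.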


Feature Extraction
Feature extraction in our project involves transforming raw data into structured features that can
be fed into our predictive models. This step is critical, especially when dealing with complex and
heterogeneous data such as legal case records and judge profiles.
- One-Hot Encoding: Categorical variables like 'State', 'District', and 'Particulars' represent nominal
data without inherent numeric value. We employ one-hot encoding to convert these categories into
a series of binary variables that indicate the presence or absence of each category. This is crucial for
linear models, which require numerical input.

- Aggregation: For the 'Case Age' feature, we convert the age of cases into categorical bins, such as
'0 to 1 Year', '1 to 3 Years', etc. This not only simplifies the model's understanding of the data but
also helps in identifying trends related to case resolution times.


- Gender Representation: The 'female_judge' column is a binary variable indicating the gender of
judges. This is directly used as a feature to analyze diversity within the judiciary without the need
for further transformation.
- Temporal Features: Date columns from the datasets are converted into more informative temporal
features that might help to uncover patterns over time, such as year, month, or even day of the week
when the case was filed or resolved.
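
The transformations listed above can be sketched in plain Python as follows. This is an illustrative sketch, not the project's code: the function names are hypothetical, and only the '0 to 1 Year' and '1 to 3 Years' brackets come from the text, so the remaining cut points and labels are assumptions.

```python
from datetime import date

def one_hot(values):
    """One-hot encode a list of category labels.

    Returns (categories, rows), where each row is a list of 0/1 flags
    aligned with the sorted category list.
    """
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return categories, rows

def age_bin(years):
    """Assign a case age in years to a bracket label.

    Only the first two labels come from the text; the later cut points
    are hypothetical. A case of exactly 1 year falls in '1 to 3 Years'.
    """
    if years < 1:
        return "0 to 1 Year"
    if years < 3:
        return "1 to 3 Years"
    if years < 5:
        return "3 to 5 Years"
    return "Above 5 Years"

def temporal_features(d):
    """Split a date into year, month, and day of week (Monday = 0)."""
    return {"year": d.year, "month": d.month, "weekday": d.weekday()}

cats, encoded = one_hot(["Writ Petition", "First Appeal", "Writ Petition"])
```

Each distinct 'Particulars' value becomes its own 0/1 column, so a model never treats the categories as if they had a numeric order.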
Feature Selection
Feature selection involves choosing the most significant features based on their relationship with the
outcome variable. This process not only improves model accuracy but also reduces overfitting and
enhances interpretability.
- Correlation Analysis: We utilize heatmaps to understand the correlation between different features
and the target variable. Highly correlated features are prioritized as they are more likely to have
predictive power.


- Coefficient Impact: In regression models, the size of the coefficient for each feature can indicate
its influence on the target variable, provided the features are on comparable scales. Features with
larger coefficients are likely more important. For instance, 'Total' and specific 'Particulars' showing
high coefficients are indicative of their significant role in predicting caseloads.
- Feature Importance from Model Output: Bar charts displaying the importance of features from
model outputs, such as from a decision tree or random forest, inform us about the features that
most strongly predict our target variable. This is used to fine-tune feature selection iteratively.
- Predictive Performance Metrics: We evaluate features based on how well they improve the model's
predictive performance, using metrics like RMSE for regression and accuracy for classification tasks.
- Dimensionality Reduction Techniques: While not explicitly shown in the provided data, techniques
such as PCA (Principal Component Analysis) could be employed in scenarios where the datasets have
a high number of features, to reduce the feature space to a smaller set of uncorrelated components.
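
The correlation-based screening described in the first bullet can be sketched as a ranking of features by the absolute Pearson correlation with the target. The column values below are hypothetical, rank_by_correlation is an assumed helper name, and treating a constant column as having zero correlation is a convention chosen for this sketch.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Correlation is undefined for a constant column; report 0.0 here.
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def rank_by_correlation(features, target):
    """Sort (name, correlation) pairs by absolute correlation, strongest first."""
    scores = {name: pearson(col, target) for name, col in features.items()}
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)

# Hypothetical columns; the target plays the role of a caseload measure.
features = {"Civil": [10, 20, 30, 40], "Criminal": [5, 3, 6, 4]}
target = [15, 23, 36, 44]
ranked = rank_by_correlation(features, target)
```

A heatmap displays the same coefficients for every pair of columns at once; the ranking simply reads off the row for the target variable.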
The ultimate goal of feature engineering in our project is to create predictive models that are both
accurate and interpretable. By selecting features that have a strong and understandable relationship
with the target outcome, we aim to provide clear and actionable recommendations for judicial reform
and resource allocation. Our careful approach to feature engineering ensures that our models are
robust, our conclusions are reliable, and our policy suggestions are grounded in solid empirical
evidence.


Visualization
Data visualization plays a crucial role in our project's exploratory data analysis (EDA) of the judiciary
dataset. By transforming complex datasets into graphical representations, we gain a clear and
immediate understanding of the patterns, trends, and anomalies within the judicial system.
In the project report, the data visualization segment is meticulously crafted to illustrate the intricacies
of the judiciary dataset, with a specific emphasis on the state of Gujarat among others. The
visualizations are not merely ornamental but serve as a pivotal tool for unearthing trends, disparities,
and insights that may not be immediately apparent through raw data.
For Gujarat, as with each state, a series of plots encapsulate the distribution of cases, tenure of
unresolved cases, types of cases, and the representation of gender within the judiciary framework.
These plots include bar graphs depicting the count of cases across different courts, line graphs showing
case trends over time, and box plots reflecting the composition of case types. Each visualization is
chosen for its ability to communicate complex data in an accessible and comprehensible manner.
Visualizations for the other states are shown in a collage format. The use of a collage format
allows for a comparative analysis across states, offering a visual narrative of the judicial landscape.
For instance, the collation of plots in a single view enables an immediate visual assessment of how
Gujarat compares to its counterparts regarding the efficiency of case resolution or the gender balance
in judicial appointments.

In our EDA, we utilize various types of data visualizations to accomplish the following objectives:
1. Identify Distributions and Outliers: Histograms and box plots allow us to quickly identify the
distribution of cases across different courts and the presence of outliers. For instance, we can
highlight courts with unusually high case backlogs, which may require further investigation and
targeted reforms.
2. Compare Groups: Bar charts enable us to compare the number of cases between different
regions or categories. Through these comparisons, we can identify disparities in case numbers and
potentially uncover underlying causes, such as regional legal differences or the varying efficiency of
court processes.
3. Understand Relationships: Correlation heatmaps help us discern the relationships between different
variables, such as case type, duration, and frequency. These insights are instrumental in understanding
how different aspects of the judiciary are interconnected and can impact case outcomes.
4. Evaluate Predictive Models: By visualizing model predictions, such as with ROC curves and
scatter plots of actual vs. predicted values, we assess the performance of our predictive models. This
helps ensure that our models are accurate and reliable for making data-driven decisions.
5. Highlight Trends Over Time: Time-series plots would allow us to visualize case trends over time,
helping us understand how case numbers have evolved and to predict future trends.
6. Facilitate Communication: Clear and compelling visualizations communicate our findings effectively
to both technical and non-technical stakeholders, ensuring that our insights have the maximum
impact.
7. Drive Actionable Insights: Ultimately, the goal of our data visualizations is to inform policy
recommendations and strategic decisions aimed at improving the judicial system's efficiency and
fairness.
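
For the first objective, the outliers that a box plot draws as individual points are conventionally those beyond 1.5 times the interquartile range from the quartiles. A minimal sketch of that rule is given below; the pending-case counts are hypothetical, and the 'inclusive' quartile method is an assumption, since plotting libraries differ in how they interpolate quartiles.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR], the usual box-plot fences."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical pending-case counts for seven courts.
pending = [120, 135, 150, 160, 170, 180, 2400]
flagged = iqr_outliers(pending)
```

A flagged court is a candidate for review, not automatically an error: as noted in the data cleaning section, genuine anomalies are kept for their informational value.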
Analysis of Female Judge Representation
Distribution Analysis
Each graph provides a visual representation of a specific state or territory, detailing the counts of
female judges within its various regions. The heights of the bars offer immediate insight into the
distribution of female representation within the judiciary, with significant variances evident across
different regions. Certain states display a more uniform distribution of female judges, suggesting
a more equitable spread across regions. Conversely, other states show stark disparities, with some
regions significantly outpacing others in female judge representation, calling attention to potential
regional imbalances.


These visual representations are crucial for identifying regions where gender representation in the
judiciary is progressive, as well as areas where it is lacking. The graphs are instrumental in informing
policy decisions, directing interventions, and evaluating the success of measures aimed at enhancing
female participation.
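
The per-region counts behind such bar charts can be sketched as a simple grouped tally. The 'female_judge' and 'District' field names follow the datasets described earlier; the records themselves and the helper name female_share_by_group are hypothetical, and entries with an unknown gender code are assumed to be excluded from the denominator.

```python
from collections import defaultdict

def female_share_by_group(records, key):
    """Per group: (female judge count, judges with known gender, female share)."""
    female = defaultdict(int)
    known = defaultdict(int)
    for rec in records:
        flag = rec["female_judge"]
        if flag is None:  # skip entries whose gender could not be recoded
            continue
        known[rec[key]] += 1
        female[rec[key]] += flag
    return {g: (female[g], known[g], female[g] / known[g]) for g in known}

# Hypothetical records, not taken from the datasets.
records = [
    {"District": "District A", "female_judge": 1},
    {"District": "District A", "female_judge": 0},
    {"District": "District A", "female_judge": 1},
    {"District": "District A", "female_judge": None},
    {"District": "District B", "female_judge": 0},
    {"District": "District B", "female_judge": 0},
]
summary = female_share_by_group(records, "District")
```

Reporting the share alongside the raw count helps when regions differ greatly in size, since a large district can have many female judges and still a low proportion.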
Civil and Criminal Case Distribution
- Case Age Distribution: In the bar charts, we have broken down the number of civil and criminal cases
by the age of each case in different high courts.
- Trend Analysis: We use these charts to identify potential backlog issues, with taller bars in older age
categories pointing to more significant backlogs.


- Case Type Comparison: The contrasting colours in each category enable a direct comparison
between the numbers of civil and criminal cases, which may highlight systemic biases or areas of
focus for each court.


# Female Judges in Judicial Positions


- Intensity Representation: The heatmap appears to represent the concentration of female judges in
top judicial positions, with the intensity of colour denoting the count.
- District Analysis: The heatmap allows us to easily identify which districts have higher or lower
representations of female judges.
- Position Analysis: It also clarifies which judicial positions are most commonly held by female judges
across these districts.

Key Findings
- Distribution Insight: From this, we can glean which positions are more commonly occupied by
women, which may illustrate career progression barriers or disparities in gender distribution within
the judiciary.
- Data Clustering: We note that clusters of dots in certain areas may indicate districts with a more
significant female presence in certain judicial roles.
1. Case Age: We observe that some courts have noticeably larger numbers of older cases,
hinting at potential inefficiencies or heavy caseloads.
2. Case Type Disparity: There is a notable variation in the proportion of civil versus criminal cases
across different high courts.
3. Female Representation: We note that certain judicial positions are more frequently held by women,
suggesting structural nuances in the judiciary's professional development paths for women.
4. Regional Variations: The heatmap reveals that female judges' distribution is not even across
various positions or districts, meriting further investigation into recruitment, selection, and promotion
processes.

Importance and Conclusions of our Analysis:


Further statistical analysis on our part could yield additional insights such as the average age of cases
per court, the civil to criminal case ratios, and variations in female judge representation per position.
Recognizing these patterns is imperative for judicial administrators to address backlog issues, ensure
an equitable distribution of workload, and advocate for gender diversity.
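
Two of the follow-up statistics mentioned above, the civil-to-criminal ratio and the weight of older cases in a court's pending load, can be sketched as below. The bracket labels beyond '1 to 3 Years', the counts, and both helper names are hypothetical.

```python
def civil_criminal_ratio(civil, criminal):
    """Civil cases per criminal case; None when there are no criminal cases."""
    return civil / criminal if criminal else None

def share_older_than(counts_by_bracket, old_brackets):
    """Fraction of a court's pending cases that fall in the given age brackets."""
    total = sum(counts_by_bracket.values())
    if total == 0:
        return 0.0
    old = sum(counts_by_bracket.get(b, 0) for b in old_brackets)
    return old / total

# Hypothetical pending-case counts for one court, keyed by age bracket.
court = {
    "0 to 1 Year": 400,
    "1 to 3 Years": 300,
    "3 to 5 Years": 200,
    "Above 5 Years": 100,
}
old_share = share_older_than(court, ["3 to 5 Years", "Above 5 Years"])
ratio = civil_criminal_ratio(600, 400)
```

Computed per court, these two numbers give a compact basis for comparing backlog pressure and case mix across benches.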
In the context of the broader project, these observations can be leveraged to discuss the imperative
of gender equality in the legal profession. They shed light on the influence of societal norms on
women's professional opportunities and underscore the need for a judiciary that mirrors the diversity
of the population it serves.
Model fitting
In our project, we have taken an analytical approach to understanding the workings of the judiciary
system by performing Exploratory Data Analysis (EDA) on a comprehensive judiciary dataset. A
critical part of this EDA involves fitting data models that can reveal underlying patterns and inform
our understanding of case distributions, pendency, and other significant judicial metrics.
Model fitting is an iterative process where we select the appropriate algorithms to map the underlying
structure of the data. For this judiciary dataset, which encompasses a variety of factors such as
case type, duration, and geographical location, we employ models that accommodate the categorical
nature of many variables and the count-based nature of case volumes.
The model fitting process involves splitting the dataset into training and testing sets, applying a
logistic regression model, and subsequently evaluating its performance. The model's accuracy is
gauged, but it's the analysis of coefficients, confusion matrix, ROC curve, and the distribution of
predicted probabilities that provide a nuanced understanding of the model's predictive power.
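
The split, fit, and evaluate sequence described above can be sketched without any external library. This is a toy illustration under stated assumptions, not the model fitted in this project: the data are synthetic with a single feature, the 80/20 split, learning rate, and epoch count are arbitrary choices, and the fit uses plain batch gradient descent where a library implementation would use a more refined solver.

```python
import math
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, x):
    """Probability of class 1 for feature vector x under weights [bias, w1, ...]."""
    return sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))

def fit_logistic(rows, lr=0.1, epochs=2000):
    """Fit weights [bias, w1, ..., wk] by batch gradient descent on the log loss."""
    k = len(rows[0][0])
    w = [0.0] * (k + 1)
    for _ in range(epochs):
        grad = [0.0] * (k + 1)
        for x, y in rows:
            err = predict_proba(w, x) - y
            grad[0] += err
            for j, xj in enumerate(x):
                grad[j + 1] += err * xj
        w = [wi - lr * g / len(rows) for wi, g in zip(w, grad)]
    return w

# Synthetic toy data: one feature, label 1 when the feature is positive.
data = [([x / 10.0], 1 if x > 0 else 0) for x in range(-20, 21) if x != 0]
train, test = train_test_split(data)
weights = fit_logistic(train)
accuracy = sum(
    (predict_proba(weights, x) >= 0.5) == bool(y) for x, y in test
) / len(test)
```

The fitted weights play the role of the coefficients discussed in the text: the sign and size of weights[1] show how the feature shifts the predicted probability.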
In the case of predicting total case counts, we utilize regression models because the target variable
is numeric. The performance of these models is evaluated using metrics like Root Mean Squared Error
(RMSE), the square root of the mean squared difference between the model's predictions and the
actual values, which penalizes large errors more heavily than small ones.
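
As a concrete reference, RMSE can be computed as in the short sketch below. The case counts are hypothetical and serve only to show the arithmetic.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical actual and predicted total case counts for three courts.
error = rmse([100, 200, 300], [110, 190, 300])
```

Because the result is in the same unit as the target, an RMSE of about 8 here reads directly as "off by roughly 8 cases per court".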

Moreover, for predictions concerning the probability of a judge's position being filled by a female,
logistic regression models are implemented due to the binary outcome of the target variable. We
assess the model's classification efficacy with the help of confusion matrices and ROC curves, which
offer insights into the true positive rates and the balance between sensitivity and specificity.
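
The quantities read off a confusion matrix can be made explicit with a small sketch. The labels and predictions below are hypothetical; 1 stands for a position held by a female judge and 0 otherwise, following the binary coding described earlier.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels coded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

def sensitivity_specificity(y_true, y_pred):
    """Sensitivity (true positive rate) and specificity (true negative rate)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity

# Hypothetical labels and thresholded predictions.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]
sens, spec = sensitivity_specificity(y_true, y_pred)
```

An ROC curve is obtained by repeating this calculation over a range of probability thresholds and plotting sensitivity against one minus specificity.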


Through these models, we aim to discern the influential factors that can predict case outcomes and
judge appointments. The significance of features is determined by the magnitude of their coefficients
in regression models or their importance scores in classification models.

The confusion matrix and ROC curve offer mixed signals. While the confusion matrix indicates a
high number of true classifications, the ROC curve suggests that the model performs no better than
random chance, with an AUC of 0.5. This points to a model that is accurate on the surface but may
not have substantial predictive capability.
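
One way a high share of correct classifications can coexist with an AUC of 0.5 is class imbalance: a model that assigns every row the same score matches the majority class most of the time but cannot rank positives above negatives. The sketch below reproduces that pattern on hypothetical labels (10 positives in 100 rows, an assumed ratio). It illustrates the mechanism and is not a diagnosis of the model fitted in this project.

```python
def auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as half a win, which is the standard convention.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))

def accuracy(y_true, scores, threshold=0.5):
    """Share of rows whose thresholded score matches the label."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(t == p for t, p in zip(y_true, preds)) / len(y_true)

# Hypothetical imbalanced labels: 10 positives among 100 rows.
y_true = [1] * 10 + [0] * 90
# A degenerate model that gives every row the same low score.
scores = [0.1] * 100

acc = accuracy(y_true, scores)
area = auc(y_true, scores)
```

Here the accuracy is 0.9 while the AUC is 0.5, which is why accuracy alone is a weak summary when one class dominates.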

The distribution of predicted probabilities exhibits a clustering of values, hinting at the model's
confidence in its predictions, but this confidence does not necessarily equate to correctness, as
indicated by the ROC curve analysis.


In conclusion, the logistic regression model provides an initial framework to understand factors
influencing the representation of female judges. However, the ROC curve analysis suggests that
further refinement is required. Future steps could involve augmenting the model with additional
data sources, considering alternative algorithms like random forests or support vector machines, and
applying cross-validation to better understand the model's stability. Our project lays the groundwork
for future studies aimed at enhancing gender diversity within the judiciary and provides a template
for how data-driven methodologies can be employed to inform policy and administrative decisions.
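
The cross-validation step suggested above can be sketched as a k-fold index generator. The function name and the fold-sizing rule (earlier folds absorb the remainder) are choices made for this sketch; rows are assumed to have been shuffled beforehand.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_idx, test_idx) pairs that partition range(n_rows) into k folds."""
    if not 2 <= k <= n_rows:
        raise ValueError("k must be between 2 and n_rows")
    # The first n_rows % k folds receive one extra row each.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_rows))
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(10, 3))
```

Fitting the model once per fold and comparing the resulting scores shows how much the reported performance depends on which rows happened to land in the test set.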
Conclusion
The comprehensive Exploratory Data Analysis (EDA) of the judiciary dataset has provided vital
insights into the functioning of the Indian judicial system. This analysis encompassed a plethora of
features, including the age and type of cases, judicial benches, and gender representation within the
judiciary. The findings indicate significant variances in the distribution of case types and highlight
disparities in gender representation and geographical caseloads.
Findings
- Case Age and Type: The analysis revealed a notable accumulation of cases in specific
age brackets, suggesting potential bottlenecks in the judicial process. Cases spanning
over 30 years indicate a critical need for interventions to address long-standing delays.


- Judicial Workload: The distribution of cases among various types, such as writ petitions and appeals,
indicates that certain judicial processes may be more prone to backlogs. This finding emphasizes the
importance of targeted reforms in case management and resource allocation.

- Gender Representation: Data on the gender of judges across various districts revealed the state of
gender diversity within the judiciary. While there has been progress, the representation of female
judges in certain regions remains disproportionately low, suggesting the need for continued efforts
towards gender inclusivity.
- Regional Disparities: The analysis also brought to light regional disparities in caseloads, with
some benches dealing with a much higher number of cases than others, pointing towards uneven
distribution of resources and possibly varying levels of efficiency.
Future Scope
The insights garnered from the EDA have laid down a foundation for a multitude of pathways for
future research and policy-making:
- Predictive Modelling: Leveraging machine learning algorithms to predict case outcomes, durations,
and judge assignments could substantially enhance the efficiency of the judiciary.
- Policy Reform: Using the EDA findings to guide judicial reforms focused on caseload management,
gender diversity, and resource distribution.
- Longitudinal Studies: Conducting studies over time to measure the impact of interventions and
reforms on the efficiency and equity of the judiciary.


Challenges
Despite the actionable insights, there are several challenges that need to be considered:
- Data Quality: The datasets may have inconsistencies, missing values, or errors that could affect the
accuracy of the analysis.
- Systemic Inertia: The judiciary is a complex system with entrenched processes, and thus, reforms
based on EDA may face resistance and require sustained advocacy.
- Changing Legal Landscape: As laws evolve and new types of cases emerge, continuously updating
and maintaining the datasets will be essential for the relevance of the findings.
- Technology Adoption: There is a need for the judiciary to adopt advanced analytics and AI
technologies to keep pace with the increasing volume and complexity of cases.
In conclusion, while the EDA has offered critical insights and suggested areas for improvement, the
journey towards a transformed judiciary will be iterative and require ongoing evaluation. The findings
must be interpreted with caution, considering the limitations of the data and the multifaceted nature
of the judicial system. Nonetheless, the future holds promise for a data-driven judiciary characterized
by increased efficiency, transparency, and fairness.
