Deep Learning for Resume Parsing
By
Shivesh Kumar
ERP ID:
0221PGM0357
BATCH: 2022-24
Internal Guide:
Prof. Manisha Verma
Assistant Professor
ACKNOWLEDGEMENT
I would like to express my sincere gratitude to all the people who have supported me
throughout this project. First and foremost, I would like to thank my internal guide, Prof.
Manisha Verma, for her valuable guidance, encouragement, and feedback. She has been a
constant source of inspiration and motivation for me, and has been very patient and
supportive of me throughout this journey.
Finally, I would like to thank my parents & friends. I would not have been able to complete
this project without their support. They have always been there for me, cheering me up,
listening to my problems, and giving me moral support. I dedicate this project to them.
DECLARATION
I, Shivesh Kumar, roll no. 0221PGM057, student of PGDM of Doon Business School, Dehradun,
hereby declare that the project report on “To develop a deep learning-based ML model to extract
information from digital resumes which can be
generalized across resume formats” is an original and authenticated work done by me. I further
declare that it has not been submitted elsewhere by any other person in any of the institutes for the
award of any degree or diploma.
Date:
PREFACE
In the ever-evolving landscape of recruitment and talent acquisition, the digitalization of resumes
has revolutionized the way organizations sift through vast pools of candidate information. With
the exponential growth of digital resumes across various formats, the need for efficient and
accurate information extraction tools has become paramount. This report documents the journey
of developing a deep learning-based machine learning (ML) model aimed at extracting pertinent
information from digital resumes while ensuring generalizability across diverse resume formats.
The significance of this endeavor lies in its potential to streamline the recruitment process,
empowering hiring professionals to swiftly identify top talent amidst the sea of applicants. By
harnessing the power of deep learning techniques, we embark on a quest to revolutionize resume
parsing, transcending the limitations posed by traditional rule-based parsing methods.
This project is rooted in the recognition of the challenges inherent in resume parsing, particularly
in the face of varied formats, layouts, and linguistic nuances. Traditional parsing algorithms often
struggle to adapt to the intricacies of modern resume designs, leading to inefficiencies and
inaccuracies in information extraction. Through the lens of deep learning, we aim to transcend
these limitations, leveraging the inherent capabilities of neural networks to discern patterns and
extract information with unprecedented accuracy and versatility.
Our journey begins with a comprehensive exploration of existing literature and methodologies in
the realm of resume parsing and deep learning. By synthesizing insights from cutting-edge
research and real-world applications, we lay the groundwork for our approach, drawing inspiration
from the successes and pitfalls of previous endeavors.
Central to our methodology is the development of a robust deep learning architecture capable of
learning intricate patterns and structures within resumes of varying formats. Leveraging
techniques such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs),
we endeavor to create a model that transcends the constraints of rigid parsing rules, instead
learning to adapt and generalize across diverse resume layouts.
In conclusion, this report stands as a testament to the transformative power of deep learning in the
realm of resume parsing and information extraction. Through meticulous research,
experimentation, and collaboration, we aim to pave the way for a future where recruitment is not
constrained by the limitations of technology but empowered by its boundless possibilities.
EXECUTIVE SUMMARY
The project titled "Deep Learning-Based Resume Information Extraction Model" represents a
significant exploration into the realm of machine learning and artificial intelligence. Its objectives
were to develop a versatile model for extracting information from digital resumes while ensuring
generalizability across various resume formats.
The primary focus was on addressing the challenges inherent in resume parsing and information
extraction. Utilizing deep learning methodologies, the project aimed to leverage neural networks
to accurately discern patterns and extract relevant information. By adopting a multifaceted
approach encompassing data preprocessing, model development, and evaluation, the project
sought to create a robust solution capable of accommodating diverse resume layouts and linguistic
nuances.
Key findings highlighted the efficacy of the deep learning approach in overcoming traditional
parsing limitations. The developed model exhibited promising performance metrics, achieving
high accuracy rates across a spectrum of resume formats. Notably, its adaptability showcased
potential for widespread applicability in diverse recruitment scenarios.
However, the project faced challenges in navigating the complexities of deep learning
methodologies and data preprocessing techniques. Overcoming these obstacles required
perseverance, collaboration, and innovative problem-solving approaches. Through iterative
refinement and optimization, the project succeeded in developing a robust and versatile resume
information extraction model.
The contributions of the project extend beyond its technical achievements, encapsulating a
transformative learning experience for the participants. Through hands-on experimentation and
collaborative problem-solving, team members gained invaluable insights into the intricacies of
deep learning and its applications in real-world scenarios.
In today's digitally driven world, the process of talent acquisition has undergone a significant
transformation, propelled by the widespread adoption of digital resumes. As organizations strive
to identify top talent amidst vast pools of applicants, the need for efficient and accurate resume
parsing tools has become increasingly pronounced. Traditional parsing methods, reliant on rule-
based algorithms, often falter when confronted with the diverse array of formats, layouts, and
linguistic nuances characteristic of modern resumes. In response to these challenges, the
emergence of deep learning techniques offers a promising avenue for revolutionizing resume
parsing and information extraction.
The objective of this project is to develop a deep learning-based machine learning (ML) model
capable of extracting information from digital resumes with a high degree of accuracy and
generalizability across various formats. By harnessing the power of neural networks, this endeavor
seeks to transcend the limitations of traditional parsing methods and unlock new frontiers in talent
acquisition.
The significance of this undertaking lies in its potential to streamline the recruitment process,
empowering organizations to efficiently sift through large volumes of resumes and identify the
most qualified candidates. Traditional parsing algorithms often struggle to adapt to the intricate
designs and diverse structures of modern resumes, leading to inefficiencies and inaccuracies in
information extraction. By contrast, deep learning models have demonstrated remarkable
capabilities in discerning patterns and extracting meaningful insights from complex data, making
them well-suited for the task of resume parsing.
At the heart of this project is the recognition of the multifaceted challenges inherent in resume
parsing across various formats. From chronological resumes to functional resumes, and from PDFs
to text documents, the diversity of formats presents a formidable obstacle for traditional parsing
methods. Moreover, linguistic nuances such as synonyms, abbreviations, and variations in
formatting further complicate the task of information extraction. By developing a deep learning-
based model, we aim to create a solution that transcends these challenges and delivers accurate
and reliable results across a spectrum of resume formats.
The methodology employed in this project encompasses several key phases, each designed to
address specific aspects of resume parsing and information extraction. The initial phase involves
data collection and preprocessing, wherein digital resumes from diverse sources are curated and
standardized to facilitate model training. Subsequently, the model development phase entails the
design and implementation of a deep learning architecture tailored to the task of resume parsing.
Techniques such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs)
may be leveraged to discern patterns and extract information from resumes effectively. Finally,
the evaluation phase involves assessing the performance of the developed model against a
benchmark dataset, with metrics such as accuracy, precision, and recall serving as indicators of its
efficacy.
In conclusion, this project represents a pioneering effort to harness the power of deep learning for
the task of resume parsing and information extraction. By developing a model capable of
accurately extracting information from digital resumes across various formats, we aim to
revolutionize the recruitment landscape and empower organizations with the tools they need to
identify top talent efficiently and effectively. Through meticulous research, experimentation, and
collaboration, we seek to unlock new possibilities in talent acquisition and pave the way for a
future where recruitment is not constrained by the limitations of technology, but rather empowered
by its boundless potential.
LEARNINGS ACQUIRED
The provided code includes imports and installations necessary for this dissertation report. It
imports the NLTK library, which is a powerful tool for natural language processing tasks, and
downloads necessary resources such as stopwords and tokenizers. Additionally, it installs two
Python packages: docx2txt, which allows extraction of text from Microsoft Word documents, and
chart_studio, which is used for creating and sharing interactive plots and graphs online. These
imports and installations are crucial for setting up the environment to work with text data,
extract information from documents, and visualize the findings.
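As a minimal sketch of this setup (the package names are taken from the report; the exact NLTK resources downloaded are assumed):

# Shell installs, run once in the notebook environment:
#   pip install docx2txt chart_studio

import nltk

nltk.download('stopwords')   # stopword lists used during text cleaning
nltk.download('punkt')       # tokenizer models used for word tokenization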
In addition to setting up the environment for text processing and analysis, the provided code
demonstrates a proactive approach towards leveraging existing libraries and tools to enhance the
efficiency and effectiveness of the dissertation research. By importing the NLTK library and
downloading necessary resources, the code showcases a commitment to leveraging established
tools and techniques in natural language processing (NLP). Moreover, the installation of docx2txt
and chart_studio packages signifies an intention to seamlessly integrate document processing
capabilities and data visualization tools into the dissertation workflow.
Furthermore, by utilizing NLTK for tokenization and stopwords removal, the code lays the
groundwork for preprocessing textual data, a crucial step in extracting meaningful insights from
digital resumes. This proactive approach to preprocessing not only ensures the quality of data but
also streamlines subsequent analysis tasks. Similarly, the installation of docx2txt underscores a
readiness to handle diverse data formats, enabling the extraction of text from Microsoft Word
documents without the need for manual conversion.
The provided code imports a comprehensive set of Python libraries and modules essential for
conducting various tasks related to data analysis, natural language processing, and machine
learning.
NumPy and Pandas: These libraries are fundamental for data manipulation and analysis. NumPy
provides support for multi-dimensional arrays and mathematical functions, while Pandas offers
data structures and tools for working with structured data.
Seaborn and Matplotlib: These libraries are used for data visualization. Seaborn is built on top of
Matplotlib and provides a high-level interface for creating attractive statistical graphics.
NLTK (Natural Language Toolkit): NLTK is a powerful library for natural language processing
tasks. It includes tools for tokenization, stopwords removal, and other text preprocessing tasks.
scikit-learn: Also known as sklearn, scikit-learn is a widely used machine learning library in
Python. It provides various algorithms for classification, regression, clustering, and model
evaluation.
docx2txt: This library allows extraction of plain text from Microsoft Word documents (.docx),
which can be useful for processing textual data.
WordCloud: WordCloud is a Python package for creating word clouds, which visually represent
the most frequent words in a text corpus.
Plotly and Chart Studio: These libraries are used for interactive data visualization. Plotly allows
creating interactive plots directly in Python, while Chart Studio provides a platform for sharing
and collaborating on interactive plots online.
By importing these libraries, the project gains access to a wide range of tools and functionalities
necessary for preprocessing, analyzing, and visualizing the data, as well as building and evaluating
machine learning models. Together they support thorough research on deep learning-based resume
parsing and information extraction.
The provided code first reads a CSV file named 'UpdatedResumeDataSet.csv' into a Pandas
DataFrame called df, using the pd.read_csv() function. The encoding='utf-8' parameter ensures
that the file is read with the correct character encoding.
After reading the CSV file into the DataFrame, the .head() method is called to display the first few
rows of the DataFrame, providing a quick glimpse of the structure and content of the data.
Subsequently, the .describe() method is invoked on the DataFrame. For numerical columns this
method reports statistics such as count, mean, standard deviation, minimum, and maximum values;
for the text columns in this dataset it reports the count, the number of unique values, the most
frequent value, and its frequency. This summary provides valuable insight into the composition
and characteristics of the data in the DataFrame.
Together, these code snippets form the initial steps in the data analysis process. The data is loaded
into memory, and basic exploratory analysis is performed to gain a better understanding of its
structure and content. This lays the groundwork for more in-depth analysis and modeling tasks.
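A minimal sketch of this loading and inspection step, assuming the CSV file sits in the working directory:

df = pd.read_csv('UpdatedResumeDataSet.csv', encoding='utf-8')  # load the dataset

print(df.head())       # first few rows: 'Category' and 'Resume' columns
print(df.describe())   # summary statistics for each column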
The provided code performs several operations on the Pandas DataFrame df, which contains the
data loaded from the 'UpdatedResumeDataSet.csv' file. Here's an explanation of each line:
df.info(): This method prints a concise summary of the DataFrame, including information about
the data types of each column, the number of non-null values, and memory usage. It's a useful
way to
quickly inspect the structure and data types of the DataFrame.
df.shape: This attribute returns a tuple representing the dimensions of the DataFrame, where the
first element indicates the number of rows and the second element indicates the number of
columns. It provides information about the size of the DataFrame, helping to understand the data's
overall structure.
df.nunique(): This method returns the number of unique values in each column of the DataFrame.
It provides insights into the cardinality of categorical variables and the variability of numerical
variables within the dataset. This information is useful for understanding the diversity and
distribution of values across different columns.
Together, these operations allow for a comprehensive exploration of the DataFrame, including
understanding its structure, dimensions, missing values, and uniqueness of values in each column.
This exploration is essential for gaining insights into the dataset and informing subsequent data
preprocessing and analysis steps.
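The three inspection calls, shown together for reference:

df.info()            # column dtypes, non-null counts, memory usage
print(df.shape)      # (number of rows, number of columns)
print(df.nunique())  # number of unique values per column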
The provided code generates a count plot using Seaborn, illustrating the distribution of categories
within the DataFrame df. It adjusts the plot size and rotates x-axis labels for better visualization.
Additionally, it annotates each bar with the count of occurrences, providing specific information
on category frequencies. This visualization serves as a concise summary, offering insights into the
dataset's categorical composition at a glance.
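A sketch of this count plot; the figure size, label rotation, and annotation placement are illustrative values rather than the exact ones used in the report:

plt.figure(figsize=(15, 5))
ax = sns.countplot(x='Category', data=df)          # one bar per resume category
plt.xticks(rotation=90)                            # rotate labels for readability
for p in ax.patches:                               # annotate each bar with its count
    ax.annotate(int(p.get_height()),
                (p.get_x() + p.get_width() / 2, p.get_height()),
                ha='center', va='bottom')
plt.tight_layout()
plt.show()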
The result of this code is a distribution plot that visually represents the distribution
of resume lengths in the DataFrame df. The x-axis represents the length of resumes,
while the y-axis represents the density of occurrence for each length value. The plot
provides insights into the distribution pattern of resume lengths, helping to
understand the range and variability of resume lengths in the dataset.
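A possible reconstruction of this plot; the derived 'length' column is an assumed intermediate step:

df['length'] = df['Resume'].str.len()    # character length of each resume
plt.figure(figsize=(10, 5))
sns.histplot(df['length'], kde=True)     # histogram with a density curve
plt.xlabel('Resume length (characters)')
plt.show()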
This code snippet generates word clouds for the top three categories of resumes in
the DataFrame df. Let's break down the code and then explain its result:
for label, cmap in zip(top_jobs, a): This loop iterates over each top category and its
corresponding colormap specified in the list a.
plt.figure(figsize=(10, 6)): This line creates a new figure with a size of 10 inches in
width and 6 inches in height using Matplotlib's plt.figure() function. This sets the
dimensions of the plot.
plt.imshow(wc): This line displays the word cloud using Matplotlib's imshow()
function.
plt.axis("off"): This line removes the axis from the plot, making it cleaner and
focusing solely on the word cloud.
The result of this code is a series of word clouds, each representing the common
words used in resumes belonging to the top three categories. Each word cloud
provides a visual representation of the most frequent words in resumes for a specific
category, helping to identify trends and patterns in the language used across different
job categories.
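A sketch of the word-cloud loop; the names top_jobs and a follow the report, while their exact contents (three categories, three colormaps) are assumed here:

top_jobs = df['Category'].value_counts().index[:3]   # three most frequent categories
a = ['viridis', 'plasma', 'magma']                    # one colormap per category

for label, cmap in zip(top_jobs, a):
    text = ' '.join(df[df['Category'] == label]['Resume'])   # pool resumes of this category
    wc = WordCloud(colormap=cmap, background_color='white',
                   width=800, height=400).generate(text)
    plt.figure(figsize=(10, 6))
    plt.imshow(wc)        # render the word cloud image
    plt.axis('off')       # hide axes for a cleaner look
    plt.title(label)
    plt.show()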
First, it creates a copy of the original DataFrame df and adds a new column named
'cleaned_resume' to store the cleaned text. A cleanResume() helper function is defined to strip
URLs, special characters, and extra whitespace from the raw text. Finally, this function is applied
to each resume in the 'Resume' column of resumeDataSet using the .apply() method, producing the
'cleaned_resume' column with the cleaned text for each resume.
This process standardizes the format of the resume text data, removing unnecessary
elements and preparing it for further analysis or modeling tasks.
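A plausible version of this cleaning step; the exact regular expressions in the original notebook may differ:

import re

def cleanResume(resumeText):
    resumeText = re.sub(r'http\S+\s*', ' ', resumeText)     # remove URLs
    resumeText = re.sub(r'[^\x00-\x7f]', ' ', resumeText)   # remove non-ASCII characters
    resumeText = re.sub(r'[^\w\s]', ' ', resumeText)        # remove punctuation and special characters
    resumeText = re.sub(r'\s+', ' ', resumeText)            # collapse repeated whitespace
    return resumeText.strip()

resumeDataSet = df.copy()
resumeDataSet['cleaned_resume'] = resumeDataSet['Resume'].apply(cleanResume)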
The provided code segment encodes the categorical variable 'Category' in the
DataFrame resumeDataSet using Label Encoding.
var_mod = ['Category']: This line specifies the list of variables to be label encoded.
The loop for i in var_mod: iterates over each variable specified in var_mod.
The transformed numerical labels are then assigned back to the 'Category' column
of resumeDataSet.
resumeDataSet.head(): Finally, this line displays the first few rows of the DataFrame
resumeDataSet after the label encoding process, showing the updated 'Category'
column with numerical labels.
Label encoding transforms categorical data into numerical format, which is often
necessary for machine learning algorithms to process the data effectively. In this
case, the 'Category' column is encoded to enable further analysis or modeling tasks
that require numerical input.
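The encoding step, following the variable names used in the report:

from sklearn.preprocessing import LabelEncoder

var_mod = ['Category']            # columns to be label encoded
le = LabelEncoder()
for i in var_mod:
    resumeDataSet[i] = le.fit_transform(resumeDataSet[i])   # category names -> integer labels

resumeDataSet.head()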
The provided code segment prepares the resume text data for machine learning
model training using TF-IDF (Term Frequency-Inverse Document Frequency)
vectorization and splits it into training and testing sets. Here's an overview of the
process:
First, the necessary libraries and modules are imported: train_test_split from
sklearn.model_selection and TfidfVectorizer from sklearn.feature_extraction.text.
The cleaned resume text data and corresponding category labels are extracted from
the DataFrame resumeDataSet and stored in variables requiredText and
requiredTarget, respectively.
The TF-IDF feature vectors (WordFeatures) and the corresponding target labels
(requiredTarget) are split into training and testing sets using train_test_split. The
training set (X_train and y_train) contains 80% of the data, while the testing set
(X_test and y_test) contains 20%. The random_state parameter ensures
reproducibility of the split.
Finally, the dimensions (shape) of the training feature matrix X_train and testing
feature matrix X_test are printed to verify the sizes of the datasets.
Overall, this code segment transforms the resume text data into numerical feature
vectors using TF-IDF vectorization and splits it into training and testing sets,
preparing it for further machine learning model training and evaluation.
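A sketch of this step; vectorizer options such as sublinear_tf and the English stop-word list are assumptions, while the 80/20 split follows the report:

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

requiredText = resumeDataSet['cleaned_resume'].values    # cleaned resume texts
requiredTarget = resumeDataSet['Category'].values        # encoded category labels

word_vectorizer = TfidfVectorizer(sublinear_tf=True, stop_words='english')
word_vectorizer.fit(requiredText)
WordFeatures = word_vectorizer.transform(requiredText)   # TF-IDF feature matrix

X_train, X_test, y_train, y_test = train_test_split(
    WordFeatures, requiredTarget, test_size=0.2, random_state=42)

print(X_train.shape)   # size of the training feature matrix
print(X_test.shape)    # size of the testing feature matrix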
The provided code segment trains a K-Nearest Neighbors (KNN) classifier on the
training data and evaluates its performance on both the training and testing sets.
Next, the classifier is fitted to the training data, where it learns the relationships
between the features and the target labels.
Once trained, the classifier predicts the target labels for the testing data.
The accuracy of the classifier on both the training and testing sets is computed
using the .score() method, which compares the predicted labels with the actual
labels. This accuracy score indicates the proportion of correctly classified
instances.
The accuracy of the KNN classifier on the training set is printed, followed by the
accuracy on the testing set. These accuracy scores provide insights into how well
the classifier generalizes to unseen data and help assess its overall performance.
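A sketch of the training and evaluation step; wrapping KNN in a one-vs-rest classifier is a common choice for this multi-class dataset but is an assumption here:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.multiclass import OneVsRestClassifier

clf = OneVsRestClassifier(KNeighborsClassifier())   # K-Nearest Neighbors, one classifier per class
clf.fit(X_train, y_train)                           # learn from the training set

prediction = clf.predict(X_test)                    # predict labels for unseen resumes
print('Training accuracy: {:.2f}'.format(clf.score(X_train, y_train)))
print('Test accuracy:     {:.2f}'.format(clf.score(X_test, y_test)))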
The provided code segment saves the trained machine learning model and
associated objects using the Pickle module. Here's how it works:
import os: This imports the os module, which provides functions for interacting
with the operating system.
import pickle: This imports the pickle module, which is used for serializing and
deserializing Python objects.
__init__(self): This is the constructor method of the JobPredictor class. It initializes the class
instance with the trained objects required for prediction:
self.le: Holds the Label Encoder object for transforming predicted labels back to
their original categorical values.
self.word_vectorizer: Holds the Word Vectorizer object for transforming resume
text into feature vectors.
self.clf: Holds the Classifier object (K-Nearest Neighbors) for making predictions.
predict(self, resume): This method takes a resume text as input, transforms it into a
feature vector using the Word Vectorizer, and then predicts the job category using
the trained Classifier. The predicted label is transformed back to its original
categorical value using the Label Encoder, and the result is returned.
Overall, the JobPredictor class provides a convenient interface for using the
trained machine learning model to predict job categories from resume texts. The
predict() method returns the most likely job category, while the predict_proba()
method provides the probability distribution over all job categories.
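A sketch of the persistence step and the JobPredictor wrapper described above; the pickle file name and the internal layout of the saved objects are assumptions:

import os
import pickle

os.makedirs('models', exist_ok=True)
with open('models/job_predictor.pkl', 'wb') as f:
    pickle.dump({'le': le, 'word_vectorizer': word_vectorizer, 'clf': clf}, f)

class JobPredictor:
    def __init__(self):
        with open('models/job_predictor.pkl', 'rb') as f:
            objs = pickle.load(f)
        self.le = objs['le']                            # label encoder for category names
        self.word_vectorizer = objs['word_vectorizer']  # TF-IDF vectorizer
        self.clf = objs['clf']                          # trained classifier

    def predict(self, resume):
        features = self.word_vectorizer.transform([resume])
        label = self.clf.predict(features)[0]
        return self.le.inverse_transform([label])[0]    # integer label -> category name

    def predict_proba(self, resume):
        features = self.word_vectorizer.transform([resume])
        return self.clf.predict_proba(features)[0]      # probability for each category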
SECOND CODE:
The provided job description outlines the required skills, roles, and responsibilities for a
technical position. Here's a summary of the key points:
Skills Required:
Hands-on experience with ETL integration, Core JAVA, Spring Boot, and APIs.
Proficiency in DB2 or Azure SQL server, including experience with SQL queries.
Understanding of File Transfer protocols such as FTP, SFTP, and PGP Encryption.
Familiarity with mainframe integration for ETL processing.
Technical expertise in UNIX shell scripting.
Knowledge of Web Services and experience in developing ETL processes.
Ability to write, create, and update technical documents.
Experience in batch job/process scheduling.
Familiarity with data integration, data streaming, WebSphere MQ, and
Communication Networks.
Exposure to event-driven programming concepts.
Understanding of Data Modelling and Data Architecture.
Roles & Responsibilities:
The code segment then instantiates a JobPredictor object and uses it to predict the position or
job category based on the provided job description. Here's a summary of how it works:
Overall, this code segment provides a quick way to predict the job position or
category based on a given job description using the trained machine learning
model encapsulated within the JobPredictor class.
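An illustrative usage of the class on the job description text; jd is a placeholder string standing in for the description summarised above:

jd = """Hands-on experience with ETL integration, Core JAVA, Spring Boot and APIs;
DB2 or Azure SQL Server; FTP/SFTP and PGP encryption; UNIX shell scripting;
WebSphere MQ; data modelling and data architecture."""

job_predictor = JobPredictor()
predicted_position = job_predictor.predict(jd)   # most likely job category for the JD
print(predicted_position)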
SECOND CODE:
The provided code segment utilizes text processing techniques and cosine
similarity to calculate the match percentage between a resume (extracted from a
.docx file) and a job description. Here's an overview of how it works:
mode = "gauge+number": This sets the mode of the indicator to display both as a
gauge and as a numerical value.
value = match_percentage_docx: This sets the value of the indicator to the
calculated match percentage between the resume and the job description.
domain = {'x': [0, 1], 'y': [0, 1]}: This specifies the domain of the indicator within
the figure, indicating its position and size.
title = {'text': "Match with JD"}: This sets the title of the indicator to "Match with
JD", indicating that it represents the match percentage with the job description.
Finally, fig.show() displays the configured figure, showing the gauge indicator
representing the match percentage.
Overall, this code segment predicts the job category of the resume and visualizes
the predicted probabilities of the resume belonging to each job category using a
bar chart, providing insights into the likelihood of the resume being suitable for
different positions.
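A sketch of the matching and gauge step; 'resume.docx' is a placeholder file name, and a simple count-vector representation is assumed for the cosine similarity:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

resume_text = docx2txt.process('resume.docx')       # extract plain text from the .docx resume
count_matrix = CountVectorizer().fit_transform([resume_text, jd])
match_percentage_docx = round(cosine_similarity(count_matrix)[0][1] * 100, 2)

fig = go.Figure(go.Indicator(
    mode="gauge+number",                   # show both a gauge and the numeric value
    value=match_percentage_docx,           # resume-vs-JD match percentage
    domain={'x': [0, 1], 'y': [0, 1]},
    title={'text': "Match with JD"}))
fig.show()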
The next code segment processes multiple resumes stored in .docx files and predicts their job
positions using the JobPredictor class. It also calculates the match percentage between each
resume and a predefined job description. Here's how it works:
match_percentage = {}: This line initializes an empty dictionary to store the match
percentage between each resume and the job description.
Overall, this code segment visualizes the match percentage between each resume
and the job description using a bar chart, providing insights into how well each
resume aligns with the job requirements.
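A sketch of this batch step; uploaded_files is a placeholder list of .docx file names:

uploaded_files = ['resume1.docx', 'resume2.docx', 'resume3.docx']

match_percentage = {}      # match percentage per resume
predicted_positions = {}   # predicted job category per resume
for path in uploaded_files:
    text = docx2txt.process(path)
    predicted_positions[path] = job_predictor.predict(text)
    cm = CountVectorizer().fit_transform([text, jd])
    match_percentage[path] = round(cosine_similarity(cm)[0][1] * 100, 2)

fig = go.Figure(go.Bar(x=list(match_percentage.keys()),
                       y=list(match_percentage.values())))
fig.update_layout(title="Match with JD per resume", yaxis_title="Match %")
fig.show()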
This code segment calculates the number of resumes that match the predicted job category of the
provided job description and displays this information using a delta indicator from Plotly. Here's
how it works:
The print statement displays the predicted job category of the job description.
mode = "delta+number": This sets the mode of the indicator to display both the
delta value and the total number.
delta = {'reference': len(uploaded_files)}: This specifies the reference value as the
total number of uploaded files.
value = total_matched: This sets the current value of the indicator to the number of
resumes that match the predicted job category.
title = {'text': f"{total_matched} out of {len(uploaded_files)} Resume falls on same category of
JD."}: This sets the title of the indicator to display the count of matching resumes out of the
total number, indicating how many resumes fall in the same category as the job description.
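A sketch of this step, continuing from the placeholder data above:

jd_category = job_predictor.predict(jd)   # predicted job category of the job description
print(jd_category)

total_matched = sum(1 for pos in predicted_positions.values() if pos == jd_category)

fig = go.Figure(go.Indicator(
    mode="delta+number",                        # show the value alongside its delta
    value=total_matched,                        # resumes matching the JD category
    delta={'reference': len(uploaded_files)},   # reference: total resumes uploaded
    title={'text': f"{total_matched} out of {len(uploaded_files)} Resume falls on "
                   f"same category of JD."}))
fig.show()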
Pie Chart:
The journey embarked upon to develop a deep learning-based ML model for extracting
information from digital resumes, with the overarching goal of generalizing across diverse
resume formats, is a testament to the fusion of innovation, technical expertise, and real-
world application in the field of HR technology. Through a meticulous exploration of
various coding techniques, libraries, and machine learning algorithms, this endeavor
represents a pioneering effort towards revolutionizing the recruitment landscape and
streamlining the talent acquisition process.
The foundation of this endeavor lies in the preprocessing of resume data, where meticulous
steps were taken to transform raw textual information into a format conducive to machine
learning analysis. Techniques such as text tokenization, removal of special characters, and
vectorization played a pivotal role in preparing the textual data for subsequent analysis. By
leveraging these preprocessing techniques, the model was equipped to extract meaningful
insights from the vast corpus of resume information, irrespective of format or structure.
The culmination of this coding endeavor represents more than just a technical achievement;
it signifies a paradigm shift in the way recruitment and talent acquisition are conducted. By
automating the labor-intensive process of resume screening and categorization, the
developed model empowers HR professionals to focus their efforts on strategic decision-
making and talent development initiatives. Moreover, by standardizing the resume extraction
process across different formats, the model promotes fairness, objectivity, and inclusivity in
the recruitment process, thereby mitigating bias and enhancing diversity in the workforce.
In conclusion, the journey towards developing a deep learning-based ML model for resume
extraction epitomizes the convergence of innovation, expertise, and vision in HR
technology. It heralds a new era in recruitment, where AI-driven solutions offer
unprecedented efficiency, accuracy, and scalability in talent acquisition. As organizations
embrace the transformative power of AI, the developed model stands poised to redefine the
contours of recruitment, unlocking new opportunities for growth, diversity, and
organizational success.