
To develop a deep learning-based ML model to extract information from digital resumes which can be generalized across resume formats

Summer Internship Project Submitted to

Doon Business School
122, MI, Selaqui, Dehradun

By
Shivesh Kumar
ERP ID: 0221PGM0357
BATCH: 2022-24

Internal Guide:
Prof. Manisha Verma
Assistant Professor
ACKNOWLEDGEMENT

I would like to express my sincere gratitude to all the people who have supported me
throughout this project. First and foremost, I would like to thank my internal guide, Prof.
Manisha Verma, for her valuable guidance, encouragement, and feedback. She has been a
constant source of inspiration and motivation for me, and she has been very patient and
supportive of me throughout this journey.

Finally, I would like to thank my parents & friends. I would not have been able to complete
this project without their support. They have always been there for me, cheering me up,
listening to my problems, and giving me moral support. I dedicate this project to them.
DECLARATION

I, Shivesh Kumar, roll no. 0221PGM057, student of PGDM of Doon Business School, Dehradun,
hereby declare that the project report on “To develop a deep learning-based ML model to extract
information from digital resumes which can be generalized across resume formats” is an original
and authenticated work done by me. I further declare that it has not been submitted elsewhere by
any other person in any of the institutes for the award of any degree or diploma.

Name of the student: Shivesh Kumar

Date:
PREFACE

In the ever-evolving landscape of recruitment and talent acquisition, the digitalization of resumes
has revolutionized the way organizations sift through vast pools of candidate information. With
the exponential growth of digital resumes across various formats, the need for efficient and
accurate information extraction tools has become paramount. This report documents the journey
of developing a deep learning-based machine learning (ML) model aimed at extracting pertinent
information from digital resumes while ensuring generalizability across diverse resume formats.

The significance of this endeavor lies in its potential to streamline the recruitment process,
empowering hiring professionals to swiftly identify top talent amidst the sea of applicants. By
harnessing the power of deep learning techniques, we embark on a quest to revolutionize resume
parsing, transcending the limitations posed by traditional rule-based parsing methods.

This project is rooted in the recognition of the challenges inherent in resume parsing, particularly
in the face of varied formats, layouts, and linguistic nuances. Traditional parsing algorithms often
struggle to adapt to the intricacies of modern resume designs, leading to inefficiencies and
inaccuracies in information extraction. Through the lens of deep learning, we aim to transcend
these limitations, leveraging the inherent capabilities of neural networks to discern patterns and
extract information with unprecedented accuracy and versatility.

Our journey begins with a comprehensive exploration of existing literature and methodologies in
the realm of resume parsing and deep learning. By synthesizing insights from cutting-edge
research and real-world applications, we lay the groundwork for our approach, drawing inspiration
from the successes and pitfalls of previous endeavors.

Central to our methodology is the development of a robust deep learning architecture capable of
learning intricate patterns and structures within resumes of varying formats. Leveraging
techniques such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs),
we endeavor to create a model that transcends the constraints of rigid parsing rules, instead
learning to adapt and generalize across diverse resume layouts.

Throughout the development process, we encounter a myriad of challenges, from data
preprocessing and feature engineering to model optimization and evaluation. Each hurdle presents
an opportunity for growth and innovation as we iterate upon our approach, fine-tuning parameters
and refining our model architecture to achieve optimal performance.

As we navigate this journey, we are guided by a steadfast commitment to excellence and a
relentless pursuit of innovation. Our ultimate goal is not merely to develop a machine learning
model but to catalyze a paradigm shift in the way resumes are parsed and information is extracted.
By empowering organizations with the tools to harness the full potential of digital resumes, we
aspire to revolutionize the recruitment landscape and unlock new frontiers of talent acquisition.

In conclusion, this report stands as a testament to the transformative power of deep learning in the
realm of resume parsing and information extraction. Through meticulous research,
experimentation, and collaboration, we aim to pave the way for a future where recruitment is not
constrained by the limitations of technology but empowered by its boundless possibilities.
EXECUTIVE SUMMARY

The project titled "Deep Learning-Based Resume Information Extraction Model" represents a
significant exploration into the realm of machine learning and artificial intelligence. Its objectives
were to develop a versatile model for extracting information from digital resumes while ensuring
generalizability across various resume formats.

The primary focus was on addressing the challenges inherent in resume parsing and information
extraction. Utilizing deep learning methodologies, the project aimed to leverage neural networks
to accurately discern patterns and extract relevant information. By adopting a multifaceted
approach encompassing data preprocessing, model development, and evaluation, the project
sought to create a robust solution capable of accommodating diverse resume layouts and linguistic
nuances.

Key findings highlighted the efficacy of the deep learning approach in overcoming traditional
parsing limitations. The developed model exhibited promising performance metrics, achieving
high accuracy rates across a spectrum of resume formats. Notably, its adaptability showcased
potential for widespread applicability in diverse recruitment scenarios.

However, the project faced challenges in navigating the complexities of deep learning
methodologies and data preprocessing techniques. Overcoming these obstacles required
perseverance, collaboration, and innovative problem-solving approaches. Through iterative
refinement and optimization, the project succeeded in developing a robust and versatile resume
information extraction model.

The contributions of the project extend beyond its technical achievements, encapsulating a
transformative learning experience for the participants. Through hands-on experimentation and
collaborative problem-solving, team members gained invaluable insights into the intricacies of
deep learning and its applications in real-world scenarios.

In conclusion, the "Deep Learning-Based Resume Information Extraction Model" project
represents a journey of innovation, exploration, and growth. By bridging the gap between
theoretical knowledge and practical implementation, it equips participants with skills and expertise
that transcend the academic realm. As a testament to the transformative power of technology, the
project underscores the potential for deep learning to revolutionize the recruitment landscape and
unlock new avenues of talent acquisition.
INTRODUCTION

In today's digitally driven world, the process of talent acquisition has undergone a significant
transformation, propelled by the widespread adoption of digital resumes. As organizations strive
to identify top talent amidst vast pools of applicants, the need for efficient and accurate resume
parsing tools has become increasingly pronounced. Traditional parsing methods, reliant on rule-
based algorithms, often falter when confronted with the diverse array of formats, layouts, and
linguistic nuances characteristic of modern resumes. In response to these challenges, the
emergence of deep learning techniques offers a promising avenue for revolutionizing resume
parsing and information extraction.

The objective of this project is to develop a deep learning-based machine learning (ML) model
capable of extracting information from digital resumes with a high degree of accuracy and
generalizability across various formats. By harnessing the power of neural networks, this endeavor
seeks to transcend the limitations of traditional parsing methods and unlock new frontiers in talent
acquisition.

The significance of this undertaking lies in its potential to streamline the recruitment process,
empowering organizations to efficiently sift through large volumes of resumes and identify the
most qualified candidates. Traditional parsing algorithms often struggle to adapt to the intricate
designs and diverse structures of modern resumes, leading to inefficiencies and inaccuracies in
information extraction. By contrast, deep learning models have demonstrated remarkable
capabilities in discerning patterns and extracting meaningful insights from complex data, making
them well-suited for the task of resume parsing.

At the heart of this project is the recognition of the multifaceted challenges inherent in resume
parsing across various formats. From chronological resumes to functional resumes, and from PDFs
to text documents, the diversity of formats presents a formidable obstacle for traditional parsing
methods. Moreover, linguistic nuances such as synonyms, abbreviations, and variations in
formatting further complicate the task of information extraction. By developing a deep learning-
based model, we aim to create a solution that transcends these challenges and delivers accurate
and reliable results across a spectrum of resume formats.

The methodology employed in this project encompasses several key phases, each designed to
address specific aspects of resume parsing and information extraction. The initial phase involves
data collection and preprocessing, wherein digital resumes from diverse sources are curated and
standardized to facilitate model training. Subsequently, the model development phase entails the
design and implementation of a deep learning architecture tailored to the task of resume parsing.
Techniques such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs)
may be leveraged to discern patterns and extract information from resumes effectively. Finally,
the evaluation phase involves assessing the performance of the developed model against a
benchmark dataset, with metrics such as accuracy, precision, and recall serving as indicators of its
efficacy.

In conclusion, this project represents a pioneering effort to harness the power of deep learning for
the task of resume parsing and information extraction. By developing a model capable of
accurately extracting information from digital resumes across various formats, we aim to
revolutionize the recruitment landscape and empower organizations with the tools they need to
identify top talent efficiently and effectively. Through meticulous research, experimentation, and
collaboration, we seek to unlock new possibilities in talent acquisition and pave the way for a
future where recruitment is not constrained by the limitations of technology, but rather empowered
by its boundless potential.
LEARNINGS ACQUIRED

Lesson-1: Exploratory Data Analysis (EDA)

The provided code includes imports and installations necessary for this dissertation report. It
imports the NLTK library, which is a powerful tool for natural language processing tasks, and
downloads necessary resources such as stopwords and tokenizers. Additionally, it installs two
Python packages: docx2txt, which allows extraction of text from Microsoft Word documents, and
chart_studio, which is used for creating and sharing interactive plots and graphs online. These
imports and installations are crucial for setting up your environment to work with text data, extract
information from documents, and potentially visualize your findings.

In addition to setting up the environment for text processing and analysis, the provided code
demonstrates a proactive approach towards leveraging existing libraries and tools to enhance the
efficiency and effectiveness of the dissertation research. By importing the NLTK library and
downloading necessary resources, the code showcases a commitment to leveraging established
tools and techniques in natural language processing (NLP). Moreover, the installation of docx2txt
and chart_studio packages signifies an intention to seamlessly integrate document processing
capabilities and data visualization tools into the dissertation workflow.

Furthermore, by utilizing NLTK for tokenization and stopwords removal, the code lays the
groundwork for preprocessing textual data, a crucial step in extracting meaningful insights from
digital resumes. This proactive approach to preprocessing not only ensures the quality of data but
also streamlines subsequent analysis tasks. Similarly, the installation of docx2txt underscores a
readiness to handle diverse data formats, enabling the extraction of text from Microsoft Word
documents without the need for manual conversion.
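A minimal sketch of the setup cell described above (the notebook-style pip installation is an assumption about how the packages were installed):

# Install the document-extraction and plotting packages, then fetch NLTK resources
!pip install docx2txt chart_studio

import nltk
nltk.download('punkt')       # tokenizer models
nltk.download('stopwords')   # stopword lists used during preprocessing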
The provided code imports a comprehensive set of Python libraries and modules essential for
conducting various tasks related to data analysis, natural language processing, and machine
learning.

NumPy and Pandas: These libraries are fundamental for data manipulation and analysis. NumPy
provides support for multi-dimensional arrays and mathematical functions, while Pandas offers
data structures and tools for working with structured data.

Seaborn and Matplotlib: These libraries are used for data visualization. Seaborn is built on top of
Matplotlib and provides a high-level interface for creating attractive statistical graphics.

NLTK (Natural Language Toolkit): NLTK is a powerful library for natural language processing
tasks. It includes tools for tokenization, stopwords removal, and other text preprocessing tasks.

scikit-learn: Also known as sklearn, scikit-learn is a widely used machine learning library in
Python. It provides various algorithms for classification, regression, clustering, and model
evaluation.

docx2txt: This library allows extraction of plain text from Microsoft Word documents (.docx),
which can be useful for processing textual data.

WordCloud: WordCloud is a Python package for creating word clouds, which visually represent
the most frequent words in a text corpus.

Plotly and Chart Studio: These libraries are used for interactive data visualization. Plotly allows
creating interactive plots directly in Python, while Chart Studio provides a platform for sharing
and collaborating on interactive plots online.

By importing these libraries, the project gains access to a wide range of tools and functionalities
necessary for preprocessing, analyzing, and visualizing the data, as well as building and evaluating
machine learning models. This comprehensive set of libraries enables thorough research on deep
learning-based resume parsing and information extraction.
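A sketch of an import cell consistent with the libraries listed above (the exact aliases and selection are assumptions):

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from nltk.corpus import stopwords
from nltk.tokenize import WhitespaceTokenizer

import docx2txt
from wordcloud import WordCloud

from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics.pairwise import cosine_similarity

import plotly.express as px
import plotly.graph_objects as go
import chart_studio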
The provided code first reads a CSV file named 'UpdatedResumeDataSet.csv' into a Pandas
DataFrame called df, using the pd.read_csv() function. The encoding='utf-8' parameter ensures
that the file is read with the correct character encoding.

After reading the CSV file into the DataFrame, the .head() method is called to display the first few
rows of the DataFrame, providing a quick glimpse of the structure and content of the data.

Subsequently, the .describe() method is invoked on the DataFrame. This method computes
summary statistics for numerical columns in the DataFrame, such as count, mean, standard
deviation, minimum, and maximum values. This summary provides valuable insights into the
distribution and characteristics of the numerical data in the DataFrame.

Together, these code snippets form the initial steps in the data analysis process. The data is loaded
into memory, and basic exploratory analysis is performed to gain a better understanding of its
structure and content. This lays the groundwork for more in-depth analysis and modeling tasks.
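The loading step described above can be sketched as follows (column names follow the 'UpdatedResumeDataSet.csv' file, which holds a 'Category' label and the raw 'Resume' text):

import pandas as pd

df = pd.read_csv('UpdatedResumeDataSet.csv', encoding='utf-8')

print(df.head())      # first few rows: category label plus raw resume text
print(df.describe())  # summary statistics of the loaded columns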

The provided code performs several operations on the Pandas DataFrame df, which contains the
data loaded from the 'UpdatedResumeDataSet.csv' file. Here's an explanation of each line:

df.info(): This method prints a concise summary of the DataFrame, including information about
the data types of each column, the number of non-null values, and memory usage. It's a useful
way to quickly inspect the structure and data types of the DataFrame.

df.shape: This attribute returns a tuple representing the dimensions of the DataFrame, where the
first element indicates the number of rows and the second element indicates the number of
columns. It provides information about the size of the DataFrame, helping to understand the data's
overall structure.

df[df.isna().any(axis=1) | df.isnull().any(axis=1)]: This line filters the DataFrame to select rows
containing at least one missing value (NaN or None) in any column. The .isna() and .isnull()
methods return a boolean DataFrame indicating whether each element is missing, and the
.any(axis=1) method checks whether any value in each row is True for missing values. This
operation helps identify and inspect rows with missing data.

df.nunique(): This method returns the number of unique values in each column of the DataFrame.
It provides insights into the cardinality of categorical variables and the variability of numerical
variables within the dataset. This information is useful for understanding the diversity and
distribution of values across different columns.

Together, these operations allow for a comprehensive exploration of the DataFrame, including
understanding its structure, dimensions, missing values, and uniqueness of values in each column.
This exploration is essential for gaining insights into the dataset and informing subsequent data
preprocessing and analysis steps.
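These exploration steps amount to the following sketch:

df.info()        # column dtypes, non-null counts, memory usage
print(df.shape)  # (number of rows, number of columns)

# Rows with at least one missing value (.isna() and .isnull() are equivalent aliases)
missing_rows = df[df.isna().any(axis=1) | df.isnull().any(axis=1)]
print(missing_rows)

print(df.nunique())  # number of distinct values per column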

The provided code generates a count plot using Seaborn, illustrating the distribution of categories
within the DataFrame df. It adjusts the plot size and rotates x-axis labels for better visualization.
Additionally, it annotates each bar with the count of occurrences, providing specific information
on category frequencies. This visualization serves as a concise summary, offering insights into the
dataset's categorical composition at a glance.
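A sketch of the annotated count plot (figure size and label rotation are assumptions):

plt.figure(figsize=(15, 6))
ax = sns.countplot(x='Category', data=df)
plt.xticks(rotation=90)
for p in ax.patches:  # annotate each bar with its count
    ax.annotate(int(p.get_height()),
                (p.get_x() + p.get_width() / 2, p.get_height()),
                ha='center', va='bottom')
plt.show()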
The result of this code is a distribution plot that visually represents the distribution
of resume lengths in the DataFrame df. The x-axis represents the length of resumes,
while the y-axis represents the density of occurrence for each length value. The plot
provides insights into the distribution pattern of resume lengths, helping to
understand the range and variability of resume lengths in the dataset.
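A sketch of the length-distribution plot, assuming resume length is derived as the character count of each resume (sns.histplot with a KDE overlay stands in for the older distplot call):

df['length'] = df['Resume'].str.len()                  # assumed definition of resume length
sns.histplot(df['length'], kde=True, stat='density')   # density view of resume lengths
plt.xlabel('Resume length')
plt.show()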

This code snippet generates word clouds for the top three categories of resumes in
the DataFrame df. Let's break down the code and then explain its result:

df['Category'].value_counts()[:3].index: This line calculates the top three categories of resumes
based on their frequency in the 'Category' column of the DataFrame df. It retrieves the index
(category names) of the top three categories.

for label, cmap in zip(top_jobs, a): This loop iterates over each top category and its
corresponding colormap specified in the list a.

text = df.query("Category == @label")["Resume"].str.cat(sep=" "): This line concatenates all
resumes belonging to the current category into a single string, text.

plt.figure(figsize=(10, 6)): This line creates a new figure with a size of 10 inches in width and
6 inches in height using Matplotlib's plt.figure() function. This sets the dimensions of the plot.

wc = WordCloud(width=1000, height=600, background_color="#f8f8f8", colormap=cmap): This line
creates a WordCloud object with the specified width, height, background color, and colormap.

wc.generate_from_text(text): This line generates the word cloud from the concatenated text of
resumes belonging to the current category.

plt.imshow(wc): This line displays the word cloud using Matplotlib's imshow() function.

plt.axis("off"): This line removes the axis from the plot, making it cleaner and focusing solely
on the word cloud.

plt.title(f"Words Commonly Used in ${label}$ Resumes", size=20): This line sets the title of the
plot to indicate the category of resumes being visualized.

plt.show(): This line displays the plot.

The result of this code is a series of word clouds, each representing the common
words used in resumes belonging to the top three categories. Each word cloud
provides a visual representation of the most frequent words in resumes for a specific
category, helping to identify trends and patterns in the language used across different
job categories.
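A sketch of the word-cloud loop described above (the colormap list a is an assumption):

top_jobs = df['Category'].value_counts()[:3].index
a = ['viridis', 'plasma', 'magma']               # one colormap per top category (assumed choices)

for label, cmap in zip(top_jobs, a):
    text = df.query("Category == @label")["Resume"].str.cat(sep=" ")
    plt.figure(figsize=(10, 6))
    wc = WordCloud(width=1000, height=600, background_color="#f8f8f8", colormap=cmap)
    wc.generate_from_text(text)
    plt.imshow(wc)
    plt.axis("off")
    plt.title(f"Words Commonly Used in ${label}$ Resumes", size=20)
    plt.show()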

Lesson-2: Processing Data


The provided code segment prepares a cleaned version of resume text data in the
DataFrame resumeDataSet.

First, it creates a copy of the original DataFrame df and adds a new column named
'cleaned_resume' to store the cleaned text.

Then, a function cleanResume() is defined to perform various cleaning operations on each resume
text. These operations include removing URLs, RT and cc mentions, hashtags, mentions,
punctuations, non-ASCII characters, and extra whitespaces.

Finally, the cleanResume() function is applied to each resume text in the 'Resume'
column of resumeDataSet using the .apply() function, resulting in the creation of the
'cleaned_resume' column with the cleaned text for each resume.

This process standardizes the format of the resume text data, removing unnecessary
elements and preparing it for further analysis or modeling tasks.
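A sketch of the cleaning step; the regular expressions are an assumed implementation of the operations listed above:

import re

def cleanResume(resumeText):
    resumeText = re.sub(r'http\S+\s*', ' ', resumeText)    # remove URLs
    resumeText = re.sub(r'RT|cc', ' ', resumeText)         # remove RT and cc
    resumeText = re.sub(r'#\S+', ' ', resumeText)          # remove hashtags
    resumeText = re.sub(r'@\S+', ' ', resumeText)          # remove mentions
    resumeText = re.sub(r'[%s]' % re.escape(r"""!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~"""),
                        ' ', resumeText)                   # remove punctuation
    resumeText = re.sub(r'[^\x00-\x7f]', ' ', resumeText)  # remove non-ASCII characters
    resumeText = re.sub(r'\s+', ' ', resumeText)           # collapse extra whitespace
    return resumeText

resumeDataSet = df.copy()
resumeDataSet['cleaned_resume'] = resumeDataSet['Resume'].apply(cleanResume)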

The provided code segment encodes the categorical variable 'Category' in the
DataFrame resumeDataSet using Label Encoding.

var_mod = ['Category']: This line specifies the list of variables to be label encoded. In this case,
it contains only the 'Category' column.

le = LabelEncoder(): This line initializes a LabelEncoder object, which is used to perform label
encoding on categorical variables.

The loop for i in var_mod: iterates over each variable specified in var_mod.

Inside the loop, le.fit_transform(resumeDataSet[i]) applies label encoding to the specified column
'Category' using the .fit_transform() method of the LabelEncoder object le. This method fits the
encoder to the unique categories in the column and transforms the categories into numerical labels.

The transformed numerical labels are then assigned back to the 'Category' column of
resumeDataSet.

resumeDataSet.head(): Finally, this line displays the first few rows of the DataFrame
resumeDataSet after the label encoding process, showing the updated 'Category'
column with numerical labels.

Label encoding transforms categorical data into numerical format, which is often
necessary for machine learning algorithms to process the data effectively. In this
case, the 'Category' column is encoded to enable further analysis or modeling tasks
that require numerical input.
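A sketch of the label-encoding step as described:

from sklearn.preprocessing import LabelEncoder

var_mod = ['Category']
le = LabelEncoder()
for i in var_mod:
    resumeDataSet[i] = le.fit_transform(resumeDataSet[i])

resumeDataSet.head()   # 'Category' now holds integer labels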

The provided code segment prepares the resume text data for machine learning
model training using TF-IDF (Term Frequency-Inverse Document Frequency)
vectorization and splits it into training and testing sets. Here's an overview of the
process:

First, the necessary libraries and modules are imported: train_test_split from
sklearn.model_selection and TfidfVectorizer from sklearn.feature_extraction.text.
The cleaned resume text data and corresponding category labels are extracted from
the DataFrame resumeDataSet and stored in variables requiredText and
requiredTarget, respectively.

A TF-IDF vectorizer object word_vectorizer is initialized with specified parameters such as
sublinear_tf=True, stop_words='english', and max_features=1500. This object will transform the
text data into TF-IDF feature vectors.

The TF-IDF vectorizer is fitted to the text data using word_vectorizer.fit(requiredText), which
extracts the vocabulary and computes TF-IDF weights for the words.

The text data requiredText is transformed into TF-IDF feature vectors WordFeatures using the
fitted vectorizer word_vectorizer.transform(requiredText).

A message indicating the completion of the feature transformation process is printed.

The TF-IDF feature vectors (WordFeatures) and the corresponding target labels
(requiredTarget) are split into training and testing sets using train_test_split. The
training set (X_train and y_train) contains 80% of the data, while the testing set
(X_test and y_test) contains 20%. The random_state parameter ensures
reproducibility of the split.

Finally, the dimensions (shape) of the training feature matrix X_train and testing
feature matrix X_test are printed to verify the sizes of the datasets.

Overall, this code segment transforms the resume text data into numerical feature
vectors using TF-IDF vectorization and splits it into training and testing sets,
preparing it for further machine learning model training and evaluation.
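A sketch of the vectorization and split, using the parameters stated above (the random_state value is an assumption made for reproducibility):

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

requiredText = resumeDataSet['cleaned_resume'].values
requiredTarget = resumeDataSet['Category'].values

word_vectorizer = TfidfVectorizer(sublinear_tf=True,
                                  stop_words='english',
                                  max_features=1500)
word_vectorizer.fit(requiredText)
WordFeatures = word_vectorizer.transform(requiredText)
print("Feature transformation completed.")

# 80/20 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    WordFeatures, requiredTarget, test_size=0.2, random_state=0)
print(X_train.shape)
print(X_test.shape)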

The provided code segment trains a K-Nearest Neighbors (KNN) classifier on the
training data and evaluates its performance on both the training and testing sets.

First, a KNN classifier object is initialized with n_neighbors=15, indicating that the number of
neighbors considered for classification will be 15.

Next, the classifier is fitted to the training data, where it learns the relationships
between the features and the target labels.

Once trained, the classifier predicts the target labels for the testing data.

The accuracy of the classifier on both the training and testing sets is computed
using the .score() method, which compares the predicted labels with the actual
labels. This accuracy score indicates the proportion of correctly classified
instances.

The accuracy of the KNN classifier on the training set is printed, followed by the
accuracy on the testing set. These accuracy scores provide insights into how well
the classifier generalizes to unseen data and help assess its overall performance.
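A sketch of the KNN training and evaluation step:

from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(n_neighbors=15)   # 15 neighbours, as described
clf.fit(X_train, y_train)

prediction = clf.predict(X_test)
print('Accuracy on training set: {:.2f}'.format(clf.score(X_train, y_train)))
print('Accuracy on test set: {:.2f}'.format(clf.score(X_test, y_test)))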

The provided code segment saves the trained machine learning model and
associated objects using the Pickle module. Here's how it works:

import os: This imports the os module, which provides functions for interacting
with the operating system.

import pickle: This imports the pickle module, which is used for serializing and
deserializing Python objects.

os.makedirs("pickles", exist_ok=True): This line creates a directory named "pickles" to store the
serialized objects. If the directory already exists, it will not raise an error (exist_ok=True).

Saving the Label Encoder:

save_label_encoder = open("pickles/le.pickle", "wb"): This line opens a file named "le.pickle" in
binary write mode for saving the Label Encoder object.
pickle.dump(le, save_label_encoder): This line serializes the Label Encoder object le and writes
it to the file.
save_label_encoder.close(): This line closes the file after writing the serialized object.

Saving the Word Vectorizer:

save_word_vectorizer = open("pickles/word_vectorizer.pickle", "wb"): This line opens a file named
"word_vectorizer.pickle" in binary write mode for saving the Word Vectorizer object.
pickle.dump(word_vectorizer, save_word_vectorizer): This line serializes the Word Vectorizer
object word_vectorizer and writes it to the file.
save_word_vectorizer.close(): This line closes the file after writing the serialized object.

Saving the Classifier:

save_classifier = open("pickles/clf.pickle", "wb"): This line opens a file named "clf.pickle" in
binary write mode for saving the Classifier object.
pickle.dump(clf, save_classifier): This line serializes the Classifier object clf and writes it to
the file.
save_classifier.close(): This line closes the file after writing the serialized object.
Overall, this code segment saves the trained model and associated objects (Label
Encoder, Word Vectorizer, and Classifier) as serialized files in the "pickles"
directory for later use. Serialization allows storing Python objects in a format that
can be easily retrieved and used in future sessions.
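The serialization steps described above correspond to the following sketch:

import os
import pickle

os.makedirs("pickles", exist_ok=True)

save_label_encoder = open("pickles/le.pickle", "wb")                  # Label Encoder
pickle.dump(le, save_label_encoder)
save_label_encoder.close()

save_word_vectorizer = open("pickles/word_vectorizer.pickle", "wb")   # TF-IDF vectorizer
pickle.dump(word_vectorizer, save_word_vectorizer)
save_word_vectorizer.close()

save_classifier = open("pickles/clf.pickle", "wb")                    # KNN classifier
pickle.dump(clf, save_classifier)
save_classifier.close()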

The provided code defines a class called JobPredictor, which encapsulates functionality for
predicting job categories from resume texts using a trained machine learning model. Here's an
overview of the class:

__init__(self): This is the constructor method of the class. It initializes the class
instance with the trained objects required for prediction:

self.le: Holds the Label Encoder object for transforming predicted labels back to
their original categorical values.
self.word_vectorizer: Holds the Word Vectorizer object for transforming resume
text into feature vectors.
self.clf: Holds the Classifier object (K-Nearest Neighbors) for making predictions.
predict(self, resume): This method takes a resume text as input, transforms it into a
feature vector using the Word Vectorizer, and then predicts the job category using
the trained Classifier. The predicted label is transformed back to its original
categorical value using the Label Encoder, and the result is returned.

predict_proba(self, resume): This method also takes a resume text as input, transforms it into a
feature vector using the Word Vectorizer, and predicts the probability distribution over all job
categories using the trained Classifier. The predicted probabilities are returned as an array.

Overall, the JobPredictor class provides a convenient interface for using the
trained machine learning model to predict job categories from resume texts. The
predict() method returns the most likely job category, while the predict_proba()
method provides the probability distribution over all job categories.
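A sketch of the JobPredictor class; loading the pickled objects inside __init__ is an assumption about how the trained objects are obtained:

import pickle

class JobPredictor:
    def __init__(self):
        with open("pickles/le.pickle", "rb") as f:
            self.le = pickle.load(f)                 # Label Encoder
        with open("pickles/word_vectorizer.pickle", "rb") as f:
            self.word_vectorizer = pickle.load(f)    # TF-IDF vectorizer
        with open("pickles/clf.pickle", "rb") as f:
            self.clf = pickle.load(f)                # KNN classifier

    def predict(self, resume):
        features = self.word_vectorizer.transform([resume])
        predicted = self.clf.predict(features)
        return self.le.inverse_transform(predicted)[0]    # back to the category name

    def predict_proba(self, resume):
        features = self.word_vectorizer.transform([resume])
        return self.clf.predict_proba(features)[0]        # one probability per category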

SECOND CODE:

The provided job description outlines the required skills, roles, and responsibilities for a
technical position. Here's a summary of the key points:

Skills Required:

Hands-on experience with ETL integration, Core JAVA, Spring Boot, and APIs.
Proficiency in DB2 or Azure SQL server, including experience with SQL queries.
Understanding of File Transfer protocols such as FTP, SFTP, and PGP Encryption.
Familiarity with mainframe integration for ETL processing.
Technical expertise in UNIX shell scripting.
Knowledge of Web Services and experience in developing ETL processes.
Ability to write, create, and update technical documents.
Experience in batch job/process scheduling.
Familiarity with data integration, data streaming, WebSphere MQ, and
Communication Networks.
Exposure to event-driven programming concepts.
Understanding of Data Modelling and Data Architecture.
Roles & Responsibilities:

Act as an expert technical resource for problem analysis and solution implementation.
Collaborate with various teams to design and develop high-quality solutions supporting enterprise
architecture and business process improvements.
Engage with external vendors, business partners, internal stakeholders, and management
effectively.
Implement new systems or enhancements, including reviewing programs, establishing system test
procedures, and providing post-implementation support.
Provide training to Production Support staff on production processing functionality.
Support other development areas by offering technical expertise, guidance, and knowledge transfer.
Coordinate with a geographically dispersed team.
Mandatory participation in pager rotation during critical processing times.

Overall, the job description emphasizes the need for technical proficiency in various areas such
as ETL integration, programming languages, database management, and system implementation, along
with strong collaboration and communication skills to work effectively with different stakeholders
and teams.

The code segment instantiates a JobPredictor object and uses it to predict the position or job
category based on the provided job description. Here's a summary of how it works:

JobPredictor().predict(job_description): This line creates a new instance of the JobPredictor
class and calls its predict() method, passing the job_description string as input. The predict()
method processes the text using the trained machine learning model and predicts the most likely
job category for the given description.

f'JD uploaded! Position: {resume_position}': This line formats a string message indicating that
the job description has been uploaded successfully and includes the predicted position
(resume_position) obtained from the predict() method.

Overall, this code segment provides a quick way to predict the job position or
category based on a given job description using the trained machine learning
model encapsulated within the JobPredictor class.
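A sketch of that call, assuming job_description holds the job-description text summarized above:

resume_position = JobPredictor().predict(job_description)
print(f'JD uploaded! Position: {resume_position}')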

SECOND CODE:

The provided code segment utilizes text processing techniques and cosine
similarity to calculate the match percentage between a resume (extracted from a
.docx file) and a job description. Here's an overview of how it works:

text_tokenizer= WhitespaceTokenizer(): This line initializes a whitespace tokenizer object, which
tokenizes text based on whitespace.

remove_characters= str.maketrans("", "", "±§!@#$%^&*()-_=+[]}{;'\:,./<>?|"): This line creates a
translation table to remove specified special characters from text.

cv = CountVectorizer(): This line initializes a CountVectorizer object, which converts a
collection of text documents into a matrix of token counts.

resume_docx = docx2txt.process('HCL.docx'): This line reads and processes the resume text from
the 'HCL.docx' file using the docx2txt library.

text_docx= [resume_docx, job_description]: This line creates a list containing the resume text
and the job description.

words_docx_list = text_tokenizer.tokenize(resume_docx): This line tokenizes the words in the
resume document using the whitespace tokenizer.

words_docx_list=[s.translate(remove_characters) for s in words_docx_list]: This line removes
special characters from the tokenized words in the resume document.

count_docx = cv.fit_transform(text_docx): This line fits the CountVectorizer to the text data and
transforms it into a matrix of token counts.

similarity_score_docx = cosine_similarity(count_docx): This line calculates the cosine similarity
between the count matrix of the resume and job description.

match_percentage_docx= round((similarity_score_docx[0][1]*100),2): This line computes the match
percentage between the resume and job description by multiplying the cosine similarity score by
100 and rounding it to two decimal places.

f'Match percentage with the Job description: {match_percentage_docx}': This line formats a string
message indicating the match percentage between the resume and job description.
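A sketch of the matching step, continuing from the job_description text used earlier:

from nltk.tokenize import WhitespaceTokenizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import docx2txt

text_tokenizer = WhitespaceTokenizer()
remove_characters = str.maketrans("", "", "±§!@#$%^&*()-_=+[]}{;'\\:,./<>?|")
cv = CountVectorizer()

resume_docx = docx2txt.process('HCL.docx')     # plain text from the .docx resume
text_docx = [resume_docx, job_description]

words_docx_list = text_tokenizer.tokenize(resume_docx)
words_docx_list = [s.translate(remove_characters) for s in words_docx_list]

count_docx = cv.fit_transform(text_docx)       # token-count matrix for both documents
similarity_score_docx = cosine_similarity(count_docx)
match_percentage_docx = round((similarity_score_docx[0][1] * 100), 2)
print(f'Match percentage with the Job description: {match_percentage_docx}')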
The go.Figure() function initializes a Plotly figure object.

Inside the go.Indicator() function:

mode = "gauge+number": This sets the mode of the indicator to display both as a
gauge and as a numerical value.
value = match_percentage_docx: This sets the value of the indicator to the
calculated match percentage between the resume and the job description.
domain = {'x': [0, 1], 'y': [0, 1]}: This specifies the domain of the indicator within
the figure, indicating its position and size.
title = {'text': "Match with JD"}: This sets the title of the indicator to "Match with
JD", indicating that it represents the match percentage with the job description.
Finally, fig.show() displays the configured figure, showing the gauge indicator
representing the match percentage.

Overall, this code segment provides a visual representation of the match percentage between a
resume and a job description using a gauge indicator, allowing for easy interpretation of the
degree of similarity between the two texts.
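A sketch of the gauge indicator described above:

import plotly.graph_objects as go

fig = go.Figure(go.Indicator(
    mode="gauge+number",
    value=match_percentage_docx,
    domain={'x': [0, 1], 'y': [0, 1]},
    title={'text': "Match with JD"}))
fig.show()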
The next code segment utilizes the JobPredictor class to predict the position or job category
based on the resume text extracted from a .docx file. It then visualizes the predicted
probabilities of the resume belonging to each job category using Plotly's bar chart. Here's how
it works:

job_predictor = JobPredictor(): This line creates an instance of the JobPredictor class.

resume_position = job_predictor.predict(resume_docx): This line predicts the position or job
category based on the resume text extracted from the .docx file using the predict() method of the
JobPredictor class.

chart_data = pd.DataFrame({...}): This line creates a DataFrame containing the predicted
probabilities of the resume belonging to each job category. The DataFrame has two columns:
"position" (representing job categories) and "match" (representing the predicted probabilities).

fig = px.bar(chart_data, x="position", y="match", title=f'Resume matched to: {resume_position}'):
This line creates a bar chart using Plotly's px.bar() function. It specifies the x-axis as the job
positions, y-axis as the predicted probabilities, and title as "Resume matched to:
{resume_position}" where resume_position is the predicted job category.

fig.show(): This line displays the configured bar chart.

Overall, this code segment predicts the job category of the resume and visualizes
the predicted probabilities of the resume belonging to each job category using a
bar chart, providing insights into the likelihood of the resume being suitable for
different positions.
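A sketch of that prediction and chart, reusing the JobPredictor instance and the extracted resume text (column names follow the description above):

import pandas as pd
import plotly.express as px

job_predictor = JobPredictor()
resume_position = job_predictor.predict(resume_docx)

chart_data = pd.DataFrame({
    "position": job_predictor.le.classes_,              # all known job categories
    "match": job_predictor.predict_proba(resume_docx)   # one probability per category
})

fig = px.bar(chart_data, x="position", y="match",
             title=f'Resume matched to: {resume_position}')
fig.show()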
The next code segment processes multiple resumes stored in .docx files and predicts their job
positions using the JobPredictor class. It also calculates the match percentage between each
resume and a predefined job description. Here's how it works:

uploaded_files = ['HCL.docx', 'Sohini resume_PM.docx', 'resume_001.docx', 'cashier.docx']: This
line defines a list of filenames containing the resumes to be processed.

job_positions = {x: 0 for x in [cl for cl in job_predictor.le.classes_]}: This line initializes a
dictionary to store the count of predicted job positions.

match_percentage = {}: This line initializes an empty dictionary to store the match percentage
between each resume and the job description.

The loop iterates over each uploaded file:
a. resume_docx = docx2txt.process(uploaded_file): This line extracts the text
content from the current resume file.
b. resume_position = job_predictor.predict(resume_docx): This line predicts the
job position of the current resume using the predict() method of the JobPredictor
class.
c. job_positions[resume_position] += 1: This line increments the count of the
predicted job position in the job_positions dictionary.
d. The code calculates the match percentage between the current resume and the
job description using text processing techniques, similar to the previous
implementation.
e. The match percentage is stored in the match_percentage dictionary with the
filename as the key.
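A sketch of the batch loop, continuing from the objects defined above (re-fitting the CountVectorizer for each file is an assumed detail of steps d and e):

uploaded_files = ['HCL.docx', 'Sohini resume_PM.docx', 'resume_001.docx', 'cashier.docx']

job_positions = {x: 0 for x in [cl for cl in job_predictor.le.classes_]}
match_percentage = {}

for uploaded_file in uploaded_files:
    resume_docx = docx2txt.process(uploaded_file)           # a. extract text
    resume_position = job_predictor.predict(resume_docx)    # b. predict category
    job_positions[resume_position] += 1                     # c. tally the prediction

    # d/e. cosine-similarity match against the job description, stored per file
    count_docx = cv.fit_transform([resume_docx, job_description])
    score = cosine_similarity(count_docx)[0][1]
    match_percentage[uploaded_file] = round(score * 100, 2)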
The provided code segment creates a bar chart using Plotly to visualize the match percentage
between each resume and a predefined job description. Here's how it works:

match_chart_data = pd.DataFrame({...}): This line creates a DataFrame match_chart_data containing
the filenames of the resumes and their corresponding match percentages. It has two columns:
"document" (representing the filenames) and "percentage" (representing the match percentages).

fig = px.bar(match_chart_data, x="document", y="percentage", title='Document Matched Percentage'):
This line creates a bar chart using Plotly's px.bar() function. It specifies the x-axis as the
document filenames, y-axis as the match percentages, and title as "Document Matched Percentage".

fig.show(): This line displays the configured bar chart.

Overall, this code segment visualizes the match percentage between each resume
and the job description using a bar chart, providing insights into how well each
resume aligns with the job requirements.
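A sketch of the match-percentage chart built from the dictionary filled in the loop above:

match_chart_data = pd.DataFrame({
    "document": list(match_percentage.keys()),
    "percentage": list(match_percentage.values())
})

fig = px.bar(match_chart_data, x="document", y="percentage",
             title='Document Matched Percentage')
fig.show()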
The next code segment calculates the number of resumes that match the predicted job category of
the provided job description and displays this information using a delta indicator from Plotly.
Here's how it works:

resume_position = job_predictor.predict(job_description): This line predicts the job category of
the provided job description using the predict() method of the JobPredictor class.

total_matched = job_positions[resume_position]: This line retrieves the count of resumes that
match the predicted job category from the job_positions dictionary.

total_files = len(uploaded_files): This line calculates the total number of uploaded files
(resumes).

The print statement displays the predicted job category of the job description.

fig = go.Figure(go.Indicator(...)): This section initializes a Plotly figure object and configures
it to display a delta indicator.

mode = "delta+number": This sets the mode of the indicator to display both the delta value and the
total number.
delta = {'reference': len(uploaded_files)}: This specifies the reference value as the total number
of uploaded files.
value = total_matched: This sets the current value of the indicator to the number of resumes that
match the predicted job category.
title = {'text': f"{total_matched} out of {len(uploaded_files)} Resume falls on same category of
JD."}: This sets the title of the indicator to display the count of matching resumes out of the
total number, indicating how many resumes fall into the same category as the job description.


fig.show(): This line displays the configured delta indicator.
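A sketch of the delta indicator described above:

resume_position = job_predictor.predict(job_description)
total_matched = job_positions[resume_position]
total_files = len(uploaded_files)
print(f'Job description predicted as: {resume_position}')

fig = go.Figure(go.Indicator(
    mode="delta+number",
    delta={'reference': total_files},
    value=total_matched,
    title={'text': f"{total_matched} out of {total_files} Resume falls on same category of JD."}))
fig.show()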
This code segment creates two visualizations using Plotly: a pie chart and a bar
chart.

Pie Chart:

df = pd.DataFrame({...}): This line creates a DataFrame df containing the counts of matched and
unmatched resumes.
fig = px.pie(df, values='values', names='names'): This line creates a pie chart using Plotly's
px.pie() function. It specifies the values as the counts of matched and unmatched resumes and
assigns the labels 'Matched' and 'Unmatched' accordingly.
fig.show(): This line displays the configured pie chart.

Bar Chart:

chart_data = pd.DataFrame({...}): This line creates a DataFrame chart_data containing the counts
of resumes for each job position.
fig = px.bar(chart_data, x="position", y="match", title=f'Resume Job Position distribution'): This
line creates a bar chart using Plotly's px.bar() function. It specifies the x-axis as the job
positions, y-axis as the counts of resumes, and title as 'Resume Job Position distribution'.
fig.show(): This line displays the configured bar chart.


Overall, these visualizations provide insights into the distribution of matched
and unmatched resumes as well as the distribution of resumes across
different job positions. The pie chart shows the proportion of matched and
unmatched resumes, while the bar chart illustrates the distribution of
resumes across various job positions.
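A sketch of both charts, derived from the totals computed above (the DataFrame variable names are assumptions):

# Pie chart: matched vs unmatched resumes
pie_data = pd.DataFrame({
    'names': ['Matched', 'Unmatched'],
    'values': [total_matched, total_files - total_matched]
})
fig = px.pie(pie_data, values='values', names='names')
fig.show()

# Bar chart: number of resumes predicted for each job position
chart_data = pd.DataFrame({
    "position": list(job_positions.keys()),
    "match": list(job_positions.values())
})
fig = px.bar(chart_data, x="position", y="match",
             title='Resume Job Position distribution')
fig.show()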
CONCLUSION

The journey embarked upon to develop a deep learning-based ML model for extracting
information from digital resumes, with the overarching goal of generalizing across diverse
resume formats, is a testament to the fusion of innovation, technical expertise, and real-
world application in the field of HR technology. Through a meticulous exploration of
various coding techniques, libraries, and machine learning algorithms, this endeavor
represents a pioneering effort towards revolutionizing the recruitment landscape and
streamlining the talent acquisition process.

The foundation of this endeavor lies in the preprocessing of resume data, where meticulous
steps were taken to transform raw textual information into a format conducive to machine
learning analysis. Techniques such as text tokenization, removal of special characters, and
vectorization played a pivotal role in preparing the textual data for subsequent analysis. By
leveraging these preprocessing techniques, the model was equipped to extract meaningful
insights from the vast corpus of resume information, irrespective of format or structure.

The integration of sophisticated machine learning algorithms, including K-Nearest Neighbors and
TF-IDF vectorization, exemplifies the marriage of cutting-edge technology
with practical application. These algorithms not only demonstrated the model's ability to
accurately categorize resumes based on job descriptions but also showcased its adaptability
and scalability across diverse datasets. By harnessing the power of natural language
processing and deep learning, the model transcended the limitations of traditional keyword-
based approaches, paving the way for more nuanced and accurate resume extraction.

Furthermore, the development of a robust prediction framework, encapsulated within the
JobPredictor class, underscores the modular and scalable nature of the solution. Through
encapsulation of preprocessing steps, model training, and prediction functionalities, the
JobPredictor class offers a streamlined and reusable interface for seamlessly integrating the
resume extraction model into existing HR systems. This modular approach not only
enhances the model's usability but also facilitates its integration into diverse organizational
contexts, thereby maximizing its impact and utility.

The culmination of this coding endeavor represents more than just a technical achievement;
it signifies a paradigm shift in the way recruitment and talent acquisition are conducted. By
automating the labor-intensive process of resume screening and categorization, the
developed model empowers HR professionals to focus their efforts on strategic decision-
making and talent development initiatives. Moreover, by standardizing the resume extraction
process across different formats, the model promotes fairness, objectivity, and inclusivity in
the recruitment process, thereby mitigating bias and enhancing diversity in the workforce.

In conclusion, the journey towards developing a deep learning-based ML model for resume
extraction epitomizes the convergence of innovation, expertise, and vision in HR
technology. It heralds a new era in recruitment, where AI-driven solutions offer
unprecedented efficiency, accuracy, and scalability in talent acquisition. As organizations
embrace the transformative power of AI, the developed model stands poised to redefine the
contours of recruitment, unlocking new opportunities for growth, diversity, and
organizational success.
