PCOS - PHASE 1 Report
Submitted by
Assistant Professor
BACHELOR OF TECHNOLOGY
in
DECEMBER 2024
DEPARTMENT OF INFORMATION TECHNOLOGY
BONAFIDE CERTIFICATE
This is to certify that the project work entitled “MACHINE LEARNING MODEL FOR PCOS
DEVIPRIYA [21ITL002] in partial fulfillment of the requirements for the award of the B.Tech
Degree in Information Technology by Pondicherry University during the academic year 2024 -
2025.
We are thankful and grateful to our beloved guide, Mrs. E. Valarmathi, Assistant
Professor, Department of Information Technology, whose continued support and guidance
enabled us to complete our project. She has been a great source of encouragement to us.
We also sincerely thank our project coordinator, Dr. K. Lakshmi, Professor, Department
of Information Technology, whose continuous reviews and helpful comments enabled us to
proceed with our project.
We sincerely thank our Management, for their support throughout the entire duration of the
project.
We thank all our Staff members, Family and Friends for their constant support and
encouragement throughout the entire duration of the project.
We would like to thank the ALMIGHTY for his grace and blessings over us throughout the
project.
ABSTRACT
Polycystic Ovary Syndrome (PCOS) is a prevalent endocrine disorder affecting millions of women,
characterized by hormonal imbalances, irregular cycles, metabolic issues, and reproductive challenges.
Despite its long-term health risks, including infertility, diabetes, and cardiovascular diseases, PCOS often
goes undiagnosed due to its varied symptoms and lack of standardized diagnostic methods. This project
leverages machine learning (ML) to improve PCOS detection, severity classification, and management,
addressing significant healthcare gaps. Using feature selection techniques like Boruta, the system identifies
and prioritizes key clinical and demographic features, enhancing diagnostic precision while reducing model
complexity. Advanced ML models, such as CatBoost and Distributed AutoML, then detect PCOS and classify its severity
based on factors like symptom intensity and metabolic health, with the aim of maintaining accuracy across diverse datasets.
The system provides personalized lifestyle recommendations, considering BMI, diet, physical activity, and
stress levels. These include tailored dietary plans, exercise regimens, mental health support, and monitoring
schedules. By combining ML with clinical insights, this scalable, patient-focused tool enhances diagnosis,
treatment, and care, aiming to improve outcomes while reducing the strain on healthcare systems.
LIST OF FIGURES
LIST OF ABBREVIATIONS
ML Machine Learning
TABLE OF CONTENTS
LIST OF FIGURES v
LIST OF ABBREVIATIONS vi
1 INTRODUCTION 1
2 LITERATURE SURVEY 6
3 PROPOSED SYSTEM 16
3.5 Modules 21
4 SYSTEM REQUIREMENTS 22
5 RESULT 23
5.1 Result and Discussion 23
6 CONCLUSION 24
REFERENCES 25
APPENDIX I 28
APPENDIX II 30
APPENDIX III 33
CHAPTER 1
INTRODUCTION
1.1 About the Project
Polycystic Ovary Syndrome (PCOS) is a common yet complex hormonal disorder affecting
women of reproductive age globally. Estimated to impact 5–10% of this demographic, PCOS presents
with a wide array of symptoms such as irregular menstrual cycles, elevated androgen levels, insulin
resistance, weight gain, and multiple ovarian cysts. This variability not only makes diagnosis
challenging but also often delays timely interventions. Left undiagnosed or untreated, PCOS can lead to
more severe health conditions, including type 2 diabetes, cardiovascular diseases, and infertility.
Therefore, accurate and early diagnosis is crucial for improving health outcomes and providing women
with timely, targeted care.
Currently, PCOS diagnosis relies on the Rotterdam criteria, which involve clinical assessments,
hormonal testing, and ultrasound imaging. These methods are resource-intensive, subjective, and
frequently inaccessible, especially in low-resource settings. Additionally, inconsistencies among
diagnostic criteria used by different healthcare systems further complicate the diagnostic process. Thus,
there is a growing need for a more objective, accessible, and efficient approach to diagnosing PCOS,
one that minimizes subjectivity, reduces costs, and speeds up the diagnostic timeline.
Machine learning (ML) offers transformative potential in healthcare by analyzing vast amounts
of complex data, identifying patterns, and delivering predictions with high accuracy. For PCOS,
machine learning can process clinical, hormonal, and imaging data to provide an integrated diagnostic
approach that circumvents the limitations of traditional methods. Using Boruta for feature selection and
models such as CatBoost and Distributed AutoML, the system can classify and predict PCOS more accurately by
analyzing diverse data points collectively. This study proposes an ML-based diagnostic system that aims
to achieve more accurate, objective, and accessible PCOS detection.
Machine learning provides a range of techniques for PCOS detection by analyzing complex clinical,
biochemical, and lifestyle data. Here are key approaches explained in detail:
Feature Engineering and Selection: Effective PCOS detection requires identifying significant
patterns in data. Feature engineering involves creating meaningful input variables from raw
data, while feature selection ensures that only the most relevant factors are used. This reduces
noise, enhances model performance, and provides insights into which clinical parameters are
most predictive of PCOS.
Supervised Learning: Supervised machine learning techniques are widely applied for PCOS
detection. These models are trained on labeled datasets where the presence or absence of PCOS
is already known. By learning from the input-output relationships, the models can predict the
likelihood of PCOS in new patients. This approach is effective for structured datasets
containing clinical and diagnostic data.
Imbalanced Data Handling: PCOS datasets often have an uneven distribution of positive
(PCOS) and negative (non-PCOS) cases, which can bias the model. Techniques like
oversampling the minority class or generating synthetic samples ensure that the model learns
effectively from both classes. These methods improve sensitivity and reduce false negatives in
predictions.
Explainable AI: One challenge in healthcare applications is the need for transparency in
predictions. Explainable AI methods, such as feature importance and interpretable
visualizations, provide clinicians with insights into the reasoning behind model predictions.
This fosters trust and aids in clinical decision-making.
Deep Learning for Complex Data: For datasets with intricate relationships, such as those
involving hormonal interactions or ultrasound imaging, deep learning methods like neural
networks and convolutional networks excel. These models can automatically identify patterns
and features without extensive preprocessing, making them ideal for unstructured data like
images.
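The imbalanced-data handling described above can be illustrated with random oversampling, the simplest of the techniques mentioned. This is a minimal sketch, not the project's implementation: the function name `random_oversample` and the toy BMI values are assumptions made for the example. Synthetic-sample methods such as SMOTE interpolate new points instead of duplicating existing ones.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # All original rows belonging to this class.
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


# Toy data (made-up values): 6 non-PCOS records (0) and 2 PCOS records (1).
rows = [[22.1], [24.3], [21.7], [23.0], [25.2], [22.8], [31.4], [29.9]]
labels = [0, 0, 0, 0, 0, 0, 1, 1]
bal_rows, bal_labels = random_oversample(rows, labels)
print(Counter(bal_labels))
```

Oversampling would normally be applied to the training split only, so that duplicated records do not leak into the evaluation data.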
One of the standout features of CatBoost is its ability to work well out of the box, often requiring
minimal parameter tuning to achieve high performance. Its ordered boosting scheme is designed to reduce
overfitting, and it handles missing values in numerical features natively. CatBoost also supports GPU
training, which speeds up model fitting, especially for large datasets. Its versatility and ease of use make it
popular in applications ranging from recommendation systems to financial modeling and healthcare analytics.
CatBoost can be highly effective for detecting Polycystic Ovary Syndrome (PCOS) due to its ability
to handle diverse and complex datasets that may include a mix of numerical and categorical features. In
PCOS detection, datasets often consist of clinical parameters (like BMI, insulin levels, hormone profiles),
patient history (e.g., menstrual cycle irregularities), and categorical data (e.g., lifestyle habits or family
history). CatBoost's capability to process categorical features directly, without extensive preprocessing,
makes it a natural fit for such tasks.
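One way to see how categorical features can be used without manual encoding is the ordered target statistic, the idea CatBoost's authors describe for turning categories into numbers. The sketch below is a simplified, pure-Python illustration under stated assumptions (a single random permutation, a binary target, and the hypothetical names `ordered_target_stats`, `prior` and `a`). It is not CatBoost's actual code.

```python
import random


def ordered_target_stats(categories, targets, prior=0.5, a=1.0, seed=0):
    """Encode a categorical column using only targets seen earlier in a random order."""
    rng = random.Random(seed)
    order = list(range(len(categories)))
    rng.shuffle(order)
    sums, counts = {}, {}
    encoded = [0.0] * len(categories)
    for idx in order:
        cat = categories[idx]
        s, c = sums.get(cat, 0.0), counts.get(cat, 0)
        # Smoothed mean of the targets of previously visited rows with the same category.
        encoded[idx] = (s + a * prior) / (c + a)
        # Only now add this row's own target, so a row never sees its own label.
        sums[cat] = s + targets[idx]
        counts[cat] = c + 1
    return encoded


# Made-up example: family history category versus PCOS label.
family_history = ["yes", "no", "yes", "no", "yes", "unknown"]
pcos = [1, 0, 1, 0, 0, 1]
enc = ordered_target_stats(family_history, pcos)
print(enc)
```

Because each row is encoded from earlier rows only, its own label does not leak into its feature value, which is the motivation for the ordering.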
Fig. 1.1 CatBoost Structure
Boruta is a feature selection algorithm that identifies the relevant features in a dataset by
comparing their importance with that of artificially created shadow features. It generates these shadows by
shuffling the values of each original feature, which breaks any relationship with the target and provides a
baseline for deciding whether a feature contributes meaningful information or only noise. Boruta then trains a
model (a random forest in the original algorithm) on the extended dataset and assesses feature importance.
Over repeated iterations, an original feature whose importance is significantly higher than that of the
best-scoring shadow feature is deemed relevant; a feature that falls significantly below it is rejected.
When combined with CatBoost for a PCOS detection task, Boruta can reduce the
dimensionality of the dataset by selecting only the most informative features, which can improve both
training speed and model performance. CatBoost, known for its ability to handle categorical data efficiently,
benefits from Boruta's feature selection by eliminating unnecessary or redundant variables, which can
reduce overfitting and enhance generalization. This ensures that the final CatBoost model focuses on the
most impactful clinical and demographic indicators, such as hormone levels, insulin resistance, and lifestyle
factors, ultimately leading to more accurate and reliable PCOS detection outcomes.
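The shadow-feature comparison can be sketched in a few lines. This is a deliberately simplified illustration: it uses absolute Pearson correlation as a stand-in importance score, whereas Boruta itself relies on model-based importance and a statistical test over the iterations. The names `abs_corr` and `shadow_hits` and the synthetic data are assumptions made for the example.

```python
import random


def abs_corr(xs, ys):
    """Absolute Pearson correlation, used here as a stand-in importance score."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return abs(cov) / ((vx * vy) ** 0.5) if vx and vy else 0.0


def shadow_hits(features, target, n_iter=20, seed=0):
    """Count how often each real feature beats the best shuffled (shadow) feature."""
    rng = random.Random(seed)
    hits = {name: 0 for name in features}
    for _ in range(n_iter):
        shadow_best = 0.0
        for values in features.values():
            shadow = values[:]   # copy, then shuffle to break any link with the target
            rng.shuffle(shadow)
            shadow_best = max(shadow_best, abs_corr(shadow, target))
        for name, values in features.items():
            if abs_corr(values, target) > shadow_best:
                hits[name] += 1
    return hits


# Synthetic data: one feature tracks the label, the other is pure noise.
rng = random.Random(1)
target = [rng.randint(0, 1) for _ in range(200)]
features = {
    "informative": [y + rng.gauss(0, 0.3) for y in target],
    "noise": [rng.gauss(0, 1) for _ in target],
}
hits = shadow_hits(features, target)
print(hits)
```

A feature that beats the best shadow in nearly every iteration would be kept; one that rarely does would be a candidate for rejection.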
1.6 DAML
Distributed AutoML is an automated machine learning approach that leverages multiple machines or
computing resources to train models in parallel, significantly speeding up model development and scaling.
It combines hyperparameter optimization, data preprocessing, and model training across distributed
infrastructure, ensuring that large-scale datasets are efficiently handled. This distributed setup enables the
system to explore a wide range of algorithms and parameter combinations simultaneously, thus finding the
best model configurations faster and more effectively than a single-machine setup.
When applied to PCOS severity classification, Distributed AutoML can process large datasets
containing clinical and biochemical indicators, such as hormone levels, BMI, insulin resistance, and other
diagnostic parameters. It trains and optimizes multiple CatBoost models across different machines in
parallel, ensuring a thorough exploration of hyperparameters and feature interactions. By doing so, the
system can accurately detect the severity of PCOS by considering subtle patterns and relationships within
the data. This allows for robust classification of severity levels (mild, moderate, severe) and ensures that
clinical predictions are both scalable and reliable, facilitating better diagnosis and treatment planning.
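The idea of evaluating many configurations at the same time can be sketched on a single machine, with worker threads standing in for the separate machines of a distributed setup. Everything here is an assumption made for illustration: the `evaluate` function is a toy score, not a trained model, and the grid values are arbitrary. A real system would train and validate a model for each configuration and would dispatch the work through a cluster scheduler.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def evaluate(config):
    """Hypothetical scoring function standing in for training and validating one model."""
    depth, lr = config
    # Toy score that peaks at depth=6, lr=0.1; a real system would return a validation metric.
    return -((depth - 6) ** 2) - 100 * (lr - 0.1) ** 2


# Every combination of tree depth and learning rate to try.
grid = list(product([4, 6, 8], [0.03, 0.1, 0.3]))

# Evaluate the configurations concurrently; pool.map preserves the order of `grid`.
with ThreadPoolExecutor(max_workers=4) as pool:
    scores = list(pool.map(evaluate, grid))

best_config, best_score = max(zip(grid, scores), key=lambda pair: pair[1])
print(best_config, best_score)
```

Because each configuration is evaluated independently, the same pattern extends to more workers without changing the search logic.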
Forward Chaining is a rule-based reasoning method that starts with known data and applies logical
rules step-by-step to draw conclusions. It sequentially infers new facts until a goal is achieved or no further
inferences can be made, making it efficient for processing relevant information.
In a PCOS recommendation system, forward chaining uses clinical data like hormone levels and
BMI to apply diagnostic rules. It deduces patterns and relationships to determine PCOS severity and
provides personalized, actionable recommendations for treatment and lifestyle adjustments.
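A minimal forward-chaining loop looks like the sketch below. The fact names and rules are hypothetical and serve only to show the mechanism; they are not clinical guidance and are not taken from the project's rule base.

```python
def forward_chain(facts, rules):
    """Apply rules repeatedly until no rule adds a new fact."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            # A rule fires when all of its premises are already known facts.
            if premises <= facts and conclusion not in facts:
                facts.add(conclusion)
                changed = True
    return facts


# Hypothetical rules for illustration only: (set of premises, conclusion).
RULES = [
    ({"bmi_high"}, "recommend_weight_management"),
    ({"insulin_resistance"}, "recommend_low_gi_diet"),
    ({"bmi_high", "insulin_resistance"}, "metabolic_risk"),
    ({"metabolic_risk"}, "recommend_regular_monitoring"),
]

derived = forward_chain({"bmi_high", "insulin_resistance"}, RULES)
print(sorted(derived))
```

The last rule fires only after an earlier rule has derived `metabolic_risk`, which is the step-by-step inference that distinguishes forward chaining from a flat lookup table.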
PCOS symptoms vary widely among individuals, requiring treatment plans tailored to each
patient’s unique hormonal and lifestyle profile. This project proposes a personalized lifestyle
recommendation system that leverages patient-specific data, including symptom severity, hormonal
imbalances, and lifestyle factors, to generate targeted interventions. This feature aims not only to
manage symptoms but also to reduce the long-term health risks associated with PCOS, such as type 2
diabetes and cardiovascular disease.
CHAPTER 2
LITERATURE SURVEY
2.1 Introduction to the Survey
Machine learning (ML) offers innovative solutions for PCOS detection by analyzing clinical,
biochemical, and lifestyle data, uncovering patterns beyond traditional methods. These models enhance
diagnostic accuracy and reduce dependence on invasive procedures.
This survey reviews ML techniques for PCOS detection, focusing on feature selection, data
preprocessing, model training, and evaluation. It highlights their strengths, limitations, and research
gaps, aiming to guide the development of reliable diagnostic tools.
Khanna et al.[1] propose a multi-level stack machine learning framework for diagnosing PCOS,
combining high predictive accuracy with enhanced interpretability using tools like SHAP, LIME, ELI5, and
Qlattice. By optimizing feature selection and explainability, the framework addresses the need for
transparency in medical AI applications, helping clinicians understand the relationships between features
and PCOS diagnosis. However, its complexity and reliance on advanced tools present challenges for real-
world implementation, requiring significant computational resources and expertise, which may limit its
scalability and accessibility in healthcare settings.
2.2.2 PCONet: A Convolutional Neural Network Architecture to Detect Polycystic Ovary
Syndrome (PCOS) from Ovarian Ultrasound Images
2.2.3 Deep Learning Algorithm for Automated Detection of Polycystic Ovary Syndrome
Using Scleral Images
Lv et al.[3] propose a novel, non-invasive method for detecting PCOS using scleral images,
leveraging U-Net and ResNet deep learning architectures. Their framework achieves an impressive
AUC of 0.9799, demonstrating high accuracy and efficiency. Its lightweight design makes it ideal for
real-time clinical use, especially in settings with limited computational resources. However, the model's
scope is limited to PCOS detection, lacking generalizability to other ovarian cysts. Expanding the
dataset and improving adaptability are crucial steps for increasing its clinical applicability to a wider
range of conditions.
2.2.4 Polycystic Ovary Syndrome Detection Machine Learning Model Based on Optimized
Feature Selection and Explainable Artificial Intelligence
Elmannai et al.[4] present a machine learning-based approach for PCOS detection that
prioritizes optimized feature selection and explainable AI, ensuring transparency and clinical
interpretability. Using Random Forest and explainable AI tools, the model provides accurate predictions
while offering insights into the diagnostic process, bridging technology and clinical decision-making.
Although effective, the model's complexity and reliance on advanced tools may limit accessibility in
resource-constrained settings. Simplifying its deployment without compromising accuracy and
transparency is essential for broader adoption in diverse healthcare environments.
2.2.5 Empowering Early Detection: A Web-Based Machine Learning Approach for PCOS
Prediction
The study by Rahman et al.[5] explores a web-based solution for early PCOS detection using
machine learning, highlighting the potential of digital health tools to improve accessibility, especially in
underserved areas. The platform enables remote diagnostic insights for patients and healthcare
providers, addressing challenges like geographical barriers and limited specialist access. While the
approach offers a cost-effective and convenient solution, it raises cybersecurity concerns regarding the
protection of sensitive patient data. Future work should focus on strengthening security measures while
ensuring continued accessibility and diagnostic accuracy.
The study by Bedi et al.[7] presents an advanced approach to PCOS detection by combining
image processing techniques with machine learning, achieving high diagnostic accuracy. The method
integrates adaptive bilateral filtering to enhance image quality by reducing noise while retaining
important details, and the attention residual U-Net focuses on crucial regions for more accurate feature
extraction. Although the framework offers a robust system for AI-driven diagnostics, its reliance on
computationally intensive algorithms and specialized hardware may limit accessibility in resource-
constrained settings. Future research should focus on optimizing the system's efficiency to ensure
scalability and practicality for widespread clinical use.
2.2.8 Polycystic Ovary Syndrome with Machine Learning Algorithms from Electronic
Health Records
The study by Zad et al.[8] investigates using machine learning on electronic health record (EHR)
data for PCOS diagnosis, showcasing the potential of predictive models to enable early and scalable
detection. The research employs algorithms like logistic regression, SVM, gradient-boosted trees, and
random forests to identify patterns within large EHR datasets, ensuring predictions are applicable across
diverse patient profiles. While the approach offers promising diagnostic insights, its reliability needs
validation across different populations and healthcare systems, as variations in EHR structures,
demographics, and diagnostic criteria may impact performance. Further research should focus on
refining the model’s adaptability for broader clinical application.
The study by Kermanshahchi et al.[9] presents a multi-method machine learning approach for
diagnosing PCOS using ultrasound imaging, aiming to improve diagnostic accuracy by addressing the
limitations of traditional methods. The model automates image analysis, reducing subjectivity and
variability, and providing clinicians with reliable diagnostic support. While this enhances early detection
and patient outcomes, the model's effectiveness depends on the quality of ultrasound imaging, which
may not capture all PCOS-related factors. Future research should focus on integrating additional
diagnostic methods and datasets to achieve a more comprehensive evaluation and ensure robustness
across various clinical settings.
2.2.10 Exploring the Dominant Features and Data-Driven Detection of Polycystic Ovary
Syndrome Through Modified Stacking Ensemble Machine Learning Technique
The study by Suha and Islam[10] introduces a modified stacking ensemble approach for PCOS
detection, combining multiple machine learning models to enhance diagnostic accuracy by leveraging
their strengths. The method focuses on identifying critical features that contribute to PCOS, improving
result interpretability and transparency for clinicians. While the approach boosts predictive power and
provides insights into disease factors, it requires extensive model tuning and parameter optimization,
which could hinder practical deployment in resource-limited settings. Future research should aim to
simplify the model and streamline training processes, ensuring efficient implementation in clinical
practice without sacrificing diagnostic performance.
The review by Singh et al.[11] offers a comprehensive examination of PCOS, covering its
causes, clinical management, and emerging treatments, providing valuable insights for clinicians and
researchers. It addresses various aspects of PCOS, from genetic and hormonal factors to reproductive
and metabolic impacts, while exploring pharmacological and lifestyle interventions. However, the
review lacks specific insights into algorithmic approaches or machine learning integration, which are
crucial for developing predictive models and diagnostic tools. Future research should focus on
combining clinical practices with computational methods to enhance the diagnosis and treatment of
PCOS in a data-driven healthcare environment.
2.2.12 A Novel Approach for Polycystic Ovary Syndrome Prediction Using Machine
Learning in Bioinformatics
The study by Nasim et al.[12] introduces a bioinformatics and machine learning-based method
for predicting PCOS outcomes, highlighting the potential of computational approaches in analyzing
genetic and biochemical data. This data-driven framework offers significant healthcare implications by
enhancing diagnostic accuracy and supporting personalized treatment strategies. However, the model's
reliance on bioinformatics data may limit its generalizability to routine clinical settings. Future research
should focus on integrating bioinformatics methods with broader clinical practices, ensuring adaptability
across diverse patient populations and enhancing accessibility in standard healthcare environments.
The study by H. G et al.[13] develops a machine learning model for PCOS detection,
emphasizing feature selection to improve diagnostic accuracy by focusing on the most relevant patient
data. The feature selection process enhances model performance by reducing noise and prioritizing
critical variables, leading to better generalization and fewer clinical errors. However, the complexity of
handling large datasets and the need for advanced data preprocessing pose challenges in practical
implementation. Future
research should aim to optimize these processes, ensuring the model is efficient, scalable, and accessible
for widespread clinical use.
2.2.15 Hybrid Machine Learning Algorithms for Polycystic Ovary Syndrome Detection
The study by Alshakrani et al.[15] explores a hybrid machine learning approach for PCOS
detection, combining multiple algorithms to enhance accuracy and robustness by leveraging their
complementary strengths. This method addresses the limitations of individual models, such as
overfitting or underfitting, resulting in more reliable and generalized detection outcomes. While the
hybrid approach offers advantages in handling complex datasets, it requires significant computational
resources, which may limit its use in resource-constrained healthcare settings. Future research should
focus on optimizing these models to reduce computational demands, ensuring their accessibility and
efficiency in practical clinical applications.
The study by Sultan Bin Habib et al.[16] examines various machine learning techniques to
improve PCOS diagnosis, emphasizing their practicality in identifying patterns and making accurate
predictions from patient data. These methods offer scalable, data-driven solutions that automate
detection, reduce human error, and ensure consistent results. While the research demonstrates the
potential for early and effective intervention, its scope may be limited to specific populations, which
could affect generalizability across diverse healthcare settings and demographics. Future research should
focus on validating and adapting these models to broader clinical environments, ensuring their
effectiveness and applicability across different patient groups and healthcare systems.
The study by Ajil et al.[17] explores the use of ensemble models to enhance PCOS detection,
leveraging methods like bagging, boosting, and stacking to combine the strengths of multiple individual
models. This approach improves diagnostic accuracy by reducing bias and variance, and better handling
data inconsistencies and noise. While ensemble methods offer robust and reliable predictions, their
complexity can make interpretability challenging for clinicians, potentially undermining trust in the
model's outputs. Future research should focus on enhancing the transparency and explainability of
ensemble models, ensuring their practical utility in clinical decision-making without sacrificing trust and
comprehensibility.
2.2.18 A Deep Learning Fusion Approach to Diagnose Polycystic Ovary Syndrome (PCOS)
The study by Alamoudi et al.[18] presents a novel fusion approach using deep learning models
to improve PCOS diagnosis, combining multiple architectures like convolutional neural networks
(CNNs) and recurrent neural networks (RNNs) to enhance diagnostic accuracy. This fusion technique
aims to better analyze complex patient data, offering robust and generalizable solutions across diverse
clinical environments. However, challenges remain in data collection due to privacy concerns and
regulatory compliance (e.g., HIPAA, GDPR). Future research should focus on developing secure,
privacy-preserving methods for data collection, ensuring ethical implementation while leveraging the
full potential of deep learning technologies in healthcare diagnostics.
The review by Gandhi et al.[19] summarizes various machine learning techniques for PCOS
detection, including support vector machines (SVM), decision trees, and neural networks, providing
valuable insights into their strengths and limitations. While the paper offers a comprehensive overview
of current methodologies, it does not introduce new algorithms or groundbreaking findings. Instead, it
serves as a reference for researchers and healthcare professionals, consolidating existing knowledge to
aid clinical decision-making and personalized treatment. Future research should focus on innovating and
refining these methods to further advance machine learning applications in PCOS detection.
2.2.20 Labelling Self-Tracked Menstrual Health Records with Hidden Semi-Markov Models
Symul and Holmes[20] state that Hidden Semi-Markov Models (HSMMs) offer a robust
approach to analyzing menstrual health records for PCOS detection, combining temporal dependencies
with the flexibility to model varying menstrual cycle durations. They effectively handle missing data,
which is common in user-tracked menstrual records, ensuring more reliable detection of menstrual
irregularities associated with PCOS. However, the performance of HSMMs relies on the quality and
consistency of user input; sparse or inconsistent data can hinder accurate pattern detection. Future efforts
should focus on promoting consistent data tracking among users and implementing data validation
techniques to enhance the predictive accuracy of HSMMs in clinical and real-world scenarios.
The application of machine learning (ML) in diagnosing Polycystic Ovary Syndrome (PCOS)
has shown significant advancements across various techniques. However, these approaches come with
inherent challenges that need to be addressed for broader clinical adoption.
Diagnostic criteria like Rotterdam, NIH, and AES rely on symptoms such as irregular cycles,
hyperandrogenism, and polycystic ovarian morphology. These methods are non-invasive and accessible,
offering quick diagnoses. However, they are subjective, prone to misdiagnosis, and fail to consider
metabolic or genetic profiles.
Hormonal tests assess androgen excess and insulin resistance, providing precise biochemical
markers. While reducing subjectivity, these tests are costly, invasive, and require specialized resources,
limiting accessibility. Hormonal fluctuations across cycles can also affect accuracy.
Methods like SVM, Decision Trees, and Random Forest leverage clinical data for scalable PCOS
detection. They perform well with structured data but require extensive preprocessing and struggle with
high-dimensional or unstructured datasets.
Deep Learning Techniques:
Deep learning models like CNNs and U-Net excel in analyzing imaging data, offering high
accuracy in PCOS detection. However, they require large datasets and advanced hardware and are
computationally expensive. Their lack of interpretability also limits clinical trust.
Hybrid and ensemble methods, combining algorithms like Random Forest and XGBoost,
enhance robustness and accuracy. However, they are computationally demanding, harder to interpret,
and risk redundancy or overfitting if not implemented properly.
XAI techniques like SHAP and LIME improve model transparency, aiding clinical adoption by
explaining predictions. Yet, they add computational overhead and may oversimplify complex models,
limiting their ability to fully capture decision-making nuances.
High-quality, diverse datasets improve model performance and reduce bias. However, privacy
concerns and ethical issues hinder access to large, labeled datasets, limiting generalizability and
scalability of PCOS detection systems.
Imbalanced datasets, where non-PCOS cases outnumber PCOS cases, skew models toward the
majority class, reducing recall for PCOS detection. Addressing this often involves oversampling,
undersampling, or advanced algorithms, but these may lead to overfitting or underperformance in
diverse real-world scenarios. A balanced approach is critical for clinical reliability.
representative datasets.
Computational Complexity:
Deep learning models demand substantial computational power, limiting their use in resource-
constrained settings. Simplified architectures or lightweight frameworks are needed for practical
deployment without sacrificing accuracy.
The use of sensitive patient data raises ethical and privacy issues. Ensuring compliance with
regulations, informed consent, and data security requires robust governance frameworks and
interdisciplinary collaboration.
Machine learning models often rely on datasets with numerous redundant or irrelevant features,
which can increase computational complexity and lead to overfitting. Including features that do not
contribute meaningfully to the diagnostic process can also obscure the clinical relevance of the model’s
predictions, reducing its trustworthiness among practitioners.
CHAPTER 3
PROPOSED SYSTEM
3.1 Problem Definition
PCOS detection systems often suffer from overcomplexity due to the inclusion of redundant or
insignificant attributes, which can obscure actionable insights and hinder their practical use in clinical
settings. Additionally, lifestyle recommendation systems for PCOS are typically generic, failing to
account for individual variations such as medical history, specific symptoms, and lifestyle factors. This
lack of personalization limits their effectiveness in providing tailored interventions essential for
managing a multifaceted condition like PCOS. Addressing these issues is crucial for developing systems
that are both clinically relevant and patient-centric.
The proposed work focuses on creating a comprehensive diagnostic and recommendation system
for PCOS. The main phases include data collection, model training, diagnostic evaluation, and
recommendation generation.
1. Data Collection: Clinical data, including symptoms, menstrual history, hormonal profiles, and
imaging data, will be collected from a representative sample of patients diagnosed with PCOS.
To ensure dataset balance, data from healthy individuals will also be acquired. This phase will
aim to create a comprehensive and diverse dataset suitable for training and evaluating the
models.
2. Data Preprocessing: The collected data will undergo cleaning, normalization, and
standardization to ensure compatibility with machine learning algorithms. Feature selection will
be performed using Boruta to identify the most relevant variables for PCOS classification and
severity analysis.
3. Model Selection and Training: Several machine learning techniques will be evaluated.
CatBoost will be used for PCOS classification due to its efficiency with categorical data and
robust performance. Distributed AutoML (DAML) will be applied for severity classification to
leverage its capability to handle complex datasets. Both models will be trained and optimized
using the processed data to achieve high diagnostic accuracy.
4. Lifestyle Recommendation System: A personalized lifestyle recommendation system will be
developed using a forward chaining approach. Based on the diagnostic profile, this system will
provide tailored recommendations, including dietary advice, exercise routines, and lifestyle
adjustments, to help manage PCOS symptoms effectively.
The User Input module serves as the entry point for patients and healthcare providers to provide
health-related data. This includes details like symptoms, hormonal levels, and lifestyle habits such as
diet and exercise. The interface is designed to be user-friendly, ensuring accessibility for individuals
with varying levels of technical expertise. It also includes input validation mechanisms to ensure that the
data collected is complete and accurate before it is passed to the next stage.
Additionally, this module facilitates real-time interaction with the system, ensuring that data
entry is seamless and efficient. It can be deployed on multiple platforms, such as web-based interfaces
or mobile applications, providing flexibility in how users interact with the system. By streamlining the
process of data entry, this component lays the foundation for accurate and effective diagnosis and
recommendations.
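The input validation mentioned above could look roughly like the following sketch, which checks that required fields are present, numeric, and within a plausible range before a record is passed on. The field names and the ranges are assumptions made for illustration only; they are not taken from the project and are not clinically validated limits.

```python
# Input-validation sketch. Field names and allowed ranges are
# illustrative assumptions, not clinically validated limits.

REQUIRED_FIELDS = {
    # field name: (accepted types, minimum, maximum)
    "age_years": ((int, float), 10, 60),
    "weight_kg": ((int, float), 20, 250),
    "cycle_length_days": ((int, float), 1, 120),
}


def validate_record(record):
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for field, (types, low, high) in REQUIRED_FIELDS.items():
        if field not in record or record[field] is None:
            problems.append(f"{field}: missing")
            continue
        value = record[field]
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, types):
            problems.append(f"{field}: expected a number")
            continue
        if not low <= value <= high:
            problems.append(f"{field}: {value} outside [{low}, {high}]")
    return problems
```

Returning a list of problems rather than raising on the first error lets the interface show the user every field that needs correction in one pass.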
The Database module is a centralized repository that securely stores all the data collected from
users. This includes clinical data, hormonal profiles, imaging results, and other relevant medical
information. The database ensures data security through robust encryption protocols and controlled
access mechanisms, preventing unauthorized access to sensitive patient information.
This module prioritizes scalability and efficiency. It is designed to handle large volumes of data,
ensuring smooth operation even as the user base grows. The database also facilitates fast and reliable
data retrieval, allowing other system components, such as the Machine Learning Model, to access the
required information for processing without delays.
The Machine Learning Model forms the analytical core of the system, responsible for processing
patient data to diagnose PCOS and classify its severity. Trained on diverse datasets, it employs
advanced algorithms to identify patterns and relationships in the input data, which may indicate the
presence of PCOS. Specifically, CatBoost is used for PCOS diagnosis, leveraging its efficiency in
handling categorical data and ensuring high accuracy and reliability in predictions based on clinical,
hormonal, and imaging data stored in the database.
For severity classification, Distributed AutoML (DAML) is implemented, enabling the system
to assess the progression or risk level of the condition by analyzing complex patterns in the data. This
classification helps healthcare providers prioritize treatment plans and interventions based on the
patient's specific needs.
The model is designed to continuously evolve by retraining on new data, ensuring it stays
current and adapts to emerging clinical findings and patterns. This iterative learning process enhances its
diagnostic and classification capabilities over time, providing robust support for PCOS management.
3.3 Work Flow:
1. Data Input: Patients enter personal and health data through the system interface, including
lifestyle, clinical symptoms, and medical history.
2. Data Processing: The system preprocesses the input data, applying cleaning, normalization,
and feature selection for compatibility with machine learning algorithms.
3. Diagnosis: The processed data is analyzed using the trained machine learning model to
detect PCOS and its severity.
5. Feedback Loop: Patients provide feedback on the recommendations, allowing the system
to adapt and improve future suggestions.
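The cleaning and normalization mentioned in the Data Processing step can be sketched for a single numeric column as follows. Median imputation and min-max scaling are assumed here purely for illustration; the report does not state which imputation or scaling method the system uses, and the function names are hypothetical.

```python
from statistics import median


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("column has no observed values")
    fill = median(observed)
    return [fill if v is None else v for v in values]


def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def preprocess_column(values):
    """Clean one numeric column (assumed method), then normalize it."""
    return min_max_scale(impute_median(values))
```

For example, `preprocess_column([2, None, 4, 10])` fills the gap with the median of the observed values (4) and then rescales the column so that its minimum maps to 0 and its maximum to 1.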
3.4 Use Case Diagram:
The use case diagram represents the interactions between different actors and the PCOS
detection system, highlighting the various functionalities and processes that the system supports. It
provides a clear view of how patients, doctors, and the system interact to achieve PCOS detection,
severity classification, and personalized healthcare recommendations.
3.5 Modules:
The proposed system consists of several key modules, each responsible for a different aspect
of PCOS detection and management:
1. Data Collection Module: This module collects data from patients, including clinical,
hormonal, and lifestyle information. It ensures that data is collected in a standardized format for
efficient preprocessing.
2. Data Preprocessing Module: This module cleanses and normalizes the data, removing outliers
and ensuring that only relevant features are retained. Proper preprocessing is crucial for
accurate model training.
3. Machine Learning Module: The core of the system, this module includes various machine
learning algorithms tested for accuracy in PCOS diagnosis. It includes model training,
validation, and selection of the best-performing model.
4. Recommendation Module: Based on the diagnostic outcomes, this module provides lifestyle
recommendations tailored to the patient’s unique needs. It considers factors such as dietary
adjustments, exercise suggestions, and health tips to improve overall well-being and manage
PCOS symptoms.
5. Feedback Module: This module collects patient feedback on the lifestyle recommendations
and diagnostic outcomes. Feedback helps in refining the recommendation algorithm, ensuring
that future suggestions are more effective and relevant.
CHAPTER 4
SYSTEM REQUIREMENTS
This section outlines the hardware and software requirements necessary to implement the proposed
PCOS detection and lifestyle recommendation system. These requirements are essential for running
machine learning models, processing patient data, and providing personalized recommendations.
RAM: 8 GB
CHAPTER 5
RESULT
5.1 Result and Discussion
In the PCOS detection system, data collection focused on gathering critical features, including
follicle numbers in the right and left ovaries, skin darkening, hair growth, weight gain, and menstrual
cycle regularity. These features were selected for their relevance in distinguishing between individuals
with and without PCOS. To understand the data distribution and identify trends, boxplots were utilized.
The boxplots highlighted significant differences in feature distribution across PCOS and non-PCOS
groups, with the PCOS group exhibiting higher follicle counts, increased prevalence of skin darkening
and hair growth, greater weight gain, and irregular menstrual cycles. Outliers were identified,
showcasing variability within the PCOS group.
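The outliers seen in a boxplot are usually the points beyond the whiskers. A minimal sketch of that rule is given below, assuming the common 1.5 × IQR convention; the report does not state which rule was applied, and the helper name `iqr_outliers` is hypothetical.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # method="inclusive" interpolates linearly between data points.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]
```

Applied to a feature such as the follicle count within one group, the function would list the observations that a boxplot draws as individual points.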
Data preprocessing began with a correlation analysis to uncover relationships among features.
The analysis revealed strong correlations between PCOS and follicle numbers in the right (0.65) and left
ovaries (0.60), as well as a high correlation (0.80) between follicle counts in both ovaries, indicating
symmetry in ovarian changes. Moderate correlations with PCOS were observed for skin darkening (0.48),
weight gain (0.44), menstrual cycle irregularity (0.40), and hair growth (0.37); although only moderately
correlated, menstrual cycle irregularity remains clinically significant. These findings guided the selection of key features
for model training, ensuring that the most relevant attributes were included to enhance diagnostic
accuracy.
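The correlation-based selection described above can be sketched in plain Python as follows. The 0.4 cut-off mirrors the threshold used in the code of Appendix I; the helper names and the toy data in the usage note are assumptions for illustration. Returning 0.0 for a constant column is a simplification (pandas would report NaN in that case).

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # simplification: treat a constant column as uncorrelated
    return cov / sqrt(var_x * var_y)


def select_features(columns, target, threshold=0.4):
    """Keep feature names whose |correlation| with target exceeds threshold.

    columns maps a feature name to its list of values; the result is
    ordered from the strongest to the weakest correlation.
    """
    scores = {name: abs(pearson(vals, target)) for name, vals in columns.items()}
    kept = [name for name, score in scores.items() if score > threshold]
    return sorted(kept, key=lambda name: -scores[name])
```

With a toy target `[0, 0, 1, 1]`, a column `[1, 2, 3, 4]` has a correlation of about 0.89 and is kept, whereas a column `[1, 0, 0, 1]` has a correlation of 0 and is dropped.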
To ensure consistency and usability, missing data was addressed, and preprocessing steps were
implemented to normalize and clean the dataset. The combination of visualization techniques and
correlation analysis provided a comprehensive understanding of the data, setting a strong foundation for
developing effective machine learning models for PCOS detection.
CHAPTER 6
CONCLUSION
This project successfully developed a machine learning-based system for the detection of Polycystic
Ovary Syndrome (PCOS), incorporating an innovative personalized lifestyle recommendation module. By
integrating clinical, hormonal, and imaging data, the system addresses key challenges associated with
traditional diagnostic methods, such as subjectivity, high costs, and limited accessibility. These
improvements make the system particularly beneficial in resource-constrained settings, where advanced
diagnostic facilities may not be readily available.
The machine learning model demonstrated exceptional performance in diagnosing PCOS, validated
through metrics such as accuracy, precision, recall, and ROC-AUC scores. This high level of reliability
ensures that the system can serve as a dependable diagnostic tool for both healthcare professionals and
patients. Furthermore, the integration of personalized lifestyle recommendations adds significant value to
PCOS management by offering tailored advice that empowers patients to make informed decisions about
their health. This novel approach to symptom management can lead to improved outcomes and an enhanced
quality of life for individuals living with PCOS.
A unique aspect of this system is its feedback mechanism, which allows continuous refinement and
adaptation of lifestyle recommendations based on user input and evolving clinical data. This adaptability
ensures that the system remains relevant and effective in addressing the diverse and changing needs of
patients. Additionally, the modular design of the system facilitates future enhancements, such as
incorporating new diagnostic criteria or extending its functionality to other health conditions.
The proposed system represents a significant advancement in leveraging technology for PCOS
detection and management. It not only provides accurate and accessible diagnostics but also empowers
patients with actionable, personalized health insights. This dual focus on diagnosis and personalized care
has the potential to transform PCOS management, making it more patient-centric, efficient, and effective.
By bridging the gap between advanced diagnostics and holistic care, this project underscores the
transformative power of technology in addressing complex healthcare challenges.
REFERENCES
[1] Khanna, V. V., Chadaga, K., Sampathila, N., Prabhu, S., Bhandage, V., & Hegde, G. K. (2023). A
distinctive explainable machine learning framework for detection of polycystic ovary syndrome.
Applied System Innovation, 6(2), 32.
[2] Hosain, A. K. M. S., Mehedi, M. H. K., & Kabir, I. E. PCONet: A convolutional neural network
architecture to detect polycystic ovary syndrome (PCOS) from ovarian ultrasound images. doi:
10.1109/ICEET56468.2022.10007353.
[3] Lv, W., Song, Y., Fu, R., Lin, X., Su, Y., Jin, X., ... & Huang, G. (2022). Deep learning algorithm
for automated detection of polycystic ovary syndrome using scleral images. Frontiers in
Endocrinology, 12, 789878.
[4] Elmannai, H., El-Rashidy, N., Mashal, I., Alohali, M. A., Farag, S., El-Sappagh, S., & Saleh, H.
(2023). Polycystic ovary syndrome detection machine learning model based on optimized feature
selection and explainable artificial intelligence. Diagnostics, 13(8), 1506.
[5] Rahman, M. M., Islam, A., Islam, F., Zaman, M., Islam, M. R., Sakib, M. S. A., & Babu, H. M. H.
(2024). Empowering early detection: A web-based machine learning approach for PCOS prediction.
Informatics in Medicine Unlocked, 47, 101500.
[6] Agrawal, A., Ambad, R., Lahoti, R., Muley, P., & Pande, P. S. (2022). Role of artificial intelligence
in PCOS detection. Journal of Datta Meghe Institute of Medical Sciences University, 17(2), 491-
494.
[7] Bedi, P., Goyal, S. B., Rajawat, A. S., & Kumar, M. (2024). An integrated adaptive bilateral filter-
based framework and attention residual U-net for detecting polycystic ovary syndrome. Decision
Analytics Journal, 10, 100366.
[8] Zad, Z., Jiang, V. S., Wolf, A. T., Wang, T., Cheng, J. J., Paschalidis, I. C., & Mahalingaiah, S.
(2024). Predicting polycystic ovary syndrome with machine learning algorithms from electronic
health records. Frontiers in Endocrinology, 15, 1298628.
[9] Kermanshahchi, J., Reddy, A. J., Xu, J., Mehrok, G. K., & Nausheen, F. (2024). Development of a
Machine Learning-Based Model for Accurate Detection and Classification of Polycystic Ovary
Syndrome on Pelvic Ultrasound. Cureus, 16(7).
[10] Suha, S. A., & Islam, M. N. (2023). Exploring the dominant features and data-driven detection
of polycystic ovary syndrome through modified stacking ensemble machine learning technique.
Heliyon, 9(3).
[11] Singh, S., Pal, N., Shubham, S., Sarma, D. K., Verma, V., Marotta, F., & Kumar, M. (2023).
Polycystic ovary syndrome: etiology, current management, and future therapeutics. Journal of
Clinical Medicine, 12(4), 1454.
[12] S. Nasim, M. S. Almutairi, K. Munir, A. Raza and F. Younas, "A Novel Approach for Polycystic
Ovary Syndrome Prediction Using Machine Learning in Bioinformatics," in IEEE Access, vol. 10,
pp. 97610-97624, 2022, doi: 10.1109/ACCESS.2022.3205587.
[13] Harinisri. G et al., "Machine Learning-Driven Polycystic Ovary Syndrome Detection with
Feature Selection," 2023 6th International Conference on Recent Trends in Advance Computing
(ICRTAC), Chennai, India, 2023, pp. 523-529, doi: 10.1109/ICRTAC59277.2023.10480753.
[14] K. Gupta and R. Prasad, "Polycystic Ovary Syndrome Detection using Deep Learning," 2023 6th
International Conference on Contemporary Computing and Informatics (IC3I), Gautam Buddha
Nagar, India, 2023, pp. 1465-1468, doi: 10.1109/IC3I59117.2023.10397615.
[15] S. Alshakrani, S. Hilal and A. M. Zeki, "Hybrid Machine Learning Algorithms for Polycystic
Ovary Syndrome Detection," 2022 International Conference on Data Analytics for Business and
Industry (ICDABI), Sakhir, Bahrain, 2022, pp. 160-164, doi:
10.1109/ICDABI56818.2022.10041525.
[16] A. Z. Sultan Bin Habib, M. A. Bin Syed, M. E. Islam and T. Tasnim, "Investigation of
Polycystic Ovary Syndrome (PCOS) Diagnosis Using Machine Learning Approaches," 2023 5th
International Conference on Sustainable Technologies for Industry 5.0 (STI), Dhaka, Bangladesh,
2023, pp. 1-6, doi: 10.1109/STI59863.2023.10465079.
[17] Ajil A., Jain T., Narmartha T. M., Visamya S., & Lakshmi T. G. (2023). Detection of PCOS
using ensemble models. International Journal for Research in Applied Science & Engineering
Technology (IJRASET), 11(5), 420-424. [Link]
[18] Alamoudi, A., Khan, I. U., Aslam, N., Alqahtani, N., Alsaif, H. S., Al Dandan, O., ... & Al Bahrani,
R. (2023). A deep learning fusion approach to diagnosis the polycystic ovary syndrome (pcos). Applied
Computational Intelligence and Soft Computing, 2023(1), 9686697.
[19] Gandhi, D., Patel, B., & Dave, N. (2024). PCOS detection using machine learning
algorithms.
[20] Symul L, Holmes S. Labeling Self-Tracked Menstrual Health Records with Hidden Semi-
Markov Models. IEEE J Biomed Health Inform. 2022 Mar;26(3):1297-1308. doi:
10.1109/JBHI.2021.3110716. Epub 2022 Mar 7. PMID: 34495854.
APPENDIX I
CODE DESCRIPTION
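# Libraries for data handling and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style='darkgrid')

# Load the two dataset files (with and without infertility records)
df_inf = pd.read_csv("PCOS_infertility.csv", encoding='iso-8859-1')
df_noinf = pd.read_csv("data without infertility _final.csv", encoding='iso-8859-1')
print(f"Shape of df_inf: {df_inf.shape}")
print(f"Shape of df_noinf: {df_noinf.shape}")
df_inf.sample(5)
df_noinf.sample(5)

# Drop the unnamed extra column and inspect the column types
df_noinf.drop(["Unnamed: 42"], axis=1, inplace=True)
df_noinf.dtypes

# AMH is read as text; replace the non-numeric entry 'a' with 0 and convert to float
df_noinf['AMH(ng/mL)'].unique()
df_noinf['AMH(ng/mL)'] = df_noinf['AMH(ng/mL)'].replace('a', 0).astype(float)
df_noinf.dtypes

# Rank features by absolute correlation with the PCOS label and keep those above 0.4
# (numeric_only=True skips any remaining non-numeric columns; requires pandas >= 1.5)
corr_features = df_noinf.corrwith(df_noinf["PCOS (Y/N)"], numeric_only=True).abs().sort_values(ascending=False)
corr_features = corr_features[corr_features > 0.4].index
corr_features
df_noinf['Cycle(R/I)'].unique()

# Apply the same AMH cleaning to the infertility dataset
# (fixed: the original assigned the column from df_noinf instead of df_inf)
df_inf.dtypes
df_inf['AMH(ng/mL)'].unique()
df_inf['AMH(ng/mL)'] = df_inf['AMH(ng/mL)'].replace('a', 0).astype(float)
df_inf.dtypes
df_inf.corrwith(df_inf["PCOS (Y/N)"], numeric_only=True).abs()

# Keep only the selected features (the PCOS label itself is among them)
df_noinf = df_noinf[corr_features]
df_noinf.head()
df_noinf.columns

# Boxplots of each selected feature, split by PCOS status
plt.figure(figsize=(14, 5))
plt.subplot(1, 6, 1)
sns.boxplot(x='PCOS (Y/N)', y='Follicle No. (R)', data=df_noinf)
plt.subplot(1, 6, 2)
sns.boxplot(x='PCOS (Y/N)', y='Follicle No. (L)', data=df_noinf)
plt.subplot(1, 6, 3)
sns.boxplot(x='PCOS (Y/N)', y='Skin darkening (Y/N)', data=df_noinf)
plt.subplot(1, 6, 4)
sns.boxplot(x='PCOS (Y/N)', y='hair growth(Y/N)', data=df_noinf)
plt.subplot(1, 6, 5)
sns.boxplot(x='PCOS (Y/N)', y='Weight gain(Y/N)', data=df_noinf)
plt.subplot(1, 6, 6)
sns.boxplot(x='PCOS (Y/N)', y='Cycle(R/I)', data=df_noinf)

# Correlation heatmap of the selected features
plt.figure(figsize=(6, 5))
sns.heatmap(df_noinf.corr(), annot=True)
plt.show()

# Separate the target label from the feature matrix
y = df_noinf['PCOS (Y/N)']
X = df_noinf.drop(['PCOS (Y/N)'], axis=1)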
APPENDIX II
SCREENSHOTS
APPENDIX III
PUBLICATION DETAILS
SURVEY PAPER