Faculty: Medicine
Course Name: Data Sciences, Robotics and AI
Course Code: MBM 3102
Lecturer: Ms E.T Nyakujipa
Title: Introduction to Data Sciences
Student name: Kundhlande Tadiwanashe
Student number: N02316854Y
1. Define the term "Data Science" and describe the three core disciplines that intersect to
form it. (8 marks)
● Data Science is an interdisciplinary field that uses scientific methods, processes,
algorithms, and systems to extract knowledge and insights from structured and
unstructured data. It integrates techniques from statistics, computer science (including
programming and machine learning), and domain expertise such as medicine to
analyze, interpret, and apply data for solving real-world problems. These three
intersecting disciplines form the foundation:
● Statistics - analyzes and interprets data trends.
● Computer Science and Programming - builds models, automates processes, and works
with large datasets.
● Domain Expertise - understands data in its real-world context (medical, financial, etc.).
● By combining these skills, data science enables medical researchers to build
predictive models, analyze trends, and extract actionable insights from complicated
healthcare data.
2. List any three characteristics of big data (3 marks)
● Volume- Very large amounts of data, often measured in terabytes to petabytes
● Variety- Different data types, such as text, images, records, and sensor readings
● Velocity- The speed at which data is generated and must be processed
3. Explain the purpose of each step in the ETL (Extract, Transform, Load) process. Use
a hypothetical example of collecting patient blood pressure readings from multiple
clinics to illustrate your answer. (8 marks)
● Extract - Gather patient blood pressure recordings from the various clinic databases or
other sources. In this step, the aim is to collect all those readings, regardless of format or
location.
● Transform - Clean and standardize the data, such as converting all pressure readings to
the same units (e.g., mmHg), correcting errors, and removing duplicates. This also includes
imputing missing data if necessary.
● Load - Place the transformed data into a central hospital database where it can be
analyzed for trends, outliers, or clinical decision-making.
● For example, if Clinic A logs blood pressure in mmHg and Clinic B in kPa, during
transformation you'd convert all readings to mmHg before loading them into your
patient records system.
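The three ETL steps above can be sketched in a few lines of Python. This is only a minimal illustration: the clinic records, field names, and the in-memory list standing in for the central database are all hypothetical, and this sketch drops missing readings instead of imputing them. The conversion factor used is 1 kPa ≈ 7.50062 mmHg.

```python
KPA_TO_MMHG = 7.50062  # 1 kPa is approximately 7.50062 mmHg

# Hypothetical raw records as they might arrive from two clinics.
clinic_a = [
    {"patient_id": "P001", "systolic": 120.0, "unit": "mmHg"},
    {"patient_id": "P002", "systolic": 135.0, "unit": "mmHg"},
    {"patient_id": "P002", "systolic": 135.0, "unit": "mmHg"},  # duplicate entry
]
clinic_b = [
    {"patient_id": "P003", "systolic": 16.0, "unit": "kPa"},
    {"patient_id": "P004", "systolic": None, "unit": "kPa"},  # missing reading
]


def extract(*sources):
    """Extract: gather every record from every source, whatever its format."""
    return [dict(record) for source in sources for record in source]


def transform(records):
    """Transform: convert to mmHg, skip missing values, remove duplicates."""
    cleaned = []
    seen = set()
    for record in records:
        value = record["systolic"]
        if value is None:
            continue  # assumption: drop missing readings rather than impute
        if record["unit"] == "kPa":
            value = value * KPA_TO_MMHG
        value = round(value, 1)
        key = (record["patient_id"], value)
        if key in seen:
            continue  # assumption: same patient and same value is a duplicate
        seen.add(key)
        cleaned.append({"patient_id": record["patient_id"], "systolic_mmhg": value})
    return cleaned


def load(records, database):
    """Load: append the cleaned records to the central store."""
    database.extend(records)
    return database


central_db = load(transform(extract(clinic_a, clinic_b)), [])
for row in central_db:
    print(row)
```

In a real system the duplicate check would normally also use a timestamp, since the same patient can legitimately have two identical readings.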
4. Why is data transformation (e.g., normalization) a critical step before building a
machine learning model? (3 marks)
● Data transformation, including normalization, is a crucial preprocessing step that
scales data features to a common range (e.g., 0 to 1), ensuring that no single
measurement dominates due to its numeric size.
● This equal scaling allows machine learning algorithms to treat all features fairly,
improving model convergence, stability, and overall performance.
● Normalization also helps meet the assumptions of algorithms that are sensitive to
feature scale (e.g., distance-based methods such as k-nearest neighbours) and removes
differences that come only from the choice of measurement units. This is especially
important for models trained with gradient-based optimization, such as neural networks
and logistic regression, where normalization accelerates training and prevents issues
caused by widely varying feature scales.
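As a minimal sketch of the idea above, min-max normalization rescales each feature to the range 0 to 1 using (x - min) / (max - min). The feature names and values below are hypothetical and chosen only to show two features with very different numeric sizes.

```python
def min_max_normalize(values):
    """Rescale a list of numbers to the range 0 to 1."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # A constant feature carries no information; map it to 0.0.
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]


# Hypothetical features on very different scales.
ages_years = [20, 40, 60, 80]
platelet_counts = [150_000, 250_000, 350_000, 450_000]

ages_scaled = min_max_normalize(ages_years)
platelets_scaled = min_max_normalize(platelet_counts)

print(ages_scaled)
print(platelets_scaled)
```

After scaling, both features span the same 0 to 1 range, so the platelet count no longer dominates simply because its raw numbers are larger.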
5. Distinguish between data exploration and data cleaning (4 marks)
● Data exploration helps identify problems, while data cleaning removes them.
Aspect | Data Exploration | Data Cleaning
Purpose | Understanding data patterns, trends, and identifying potential issues | Fixing or removing errors, inconsistencies, and incomplete data
Timing | Performed first to guide subsequent cleaning efforts | Follows exploration insights for targeted corrections
Activities | Statistical summaries, visualizations, pattern discovery | Removing duplicates, correcting errors, handling missing values
Outcome | Insights about data quality and characteristics | Clean, reliable dataset ready for analysis
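The difference can be illustrated with a small sketch: exploration only reports on the data, while cleaning changes it. The heart-rate readings and the plausible range of 30 to 220 bpm used below are hypothetical values chosen for illustration.

```python
# Hypothetical heart-rate readings (beats per minute) with two problems:
# one missing value and one implausible value.
readings = [72, 75, None, 75, 300, 68]


def explore(values):
    """Exploration: summarize the data without changing it."""
    present = [v for v in values if v is not None]
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "mean": sum(present) / len(present),
    }


def clean(values, low=30, high=220):
    """Cleaning: drop missing and out-of-range values found during exploration."""
    return [v for v in values if v is not None and low <= v <= high]


summary = explore(readings)
print(summary)  # the maximum of 300 and the missing value stand out here

cleaned = clean(readings)
print(cleaned)
```

The summary is what prompts the cleaning rule: the missing value and the maximum of 300 are only noticed because the data was explored first.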
6. Differentiate between Supervised and Unsupervised Learning. Provide one medical
example for each type. (6 marks)
Feature | Supervised Learning | Unsupervised Learning
Definition | Uses labeled training data to predict known outcomes | Discovers hidden patterns in unlabeled data without predefined targets
Medical Example | Automated ECG interpretation to classify heart rhythm abnormalities (normal vs. atrial fibrillation) | Identifying patient subgroups in heart failure based on clinical characteristics to discover new disease phenotypes
Application | Disease diagnosis, risk prediction, treatment recommendation | Patient phenotyping, drug discovery, precision medicine
● Supervised learning might use thousands of X-rays labeled as "disease" or "healthy"
to train an algorithm; unsupervised learning could reveal new subgroups of diabetes
patients based on their test results.
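The contrast can be shown with a toy sketch on a single hypothetical test result (a glucose-like value). The supervised function needs labels to make a prediction; the unsupervised function receives no labels and simply splits the values into two groups. Both functions are simplified stand-ins (a 1-nearest-neighbour rule and a one-dimensional two-cluster k-means), not clinical methods, and all values are invented.

```python
def nearest_neighbor_predict(labeled_examples, query):
    """Supervised: return the label of the closest labeled example."""
    closest = min(labeled_examples, key=lambda pair: abs(pair[0] - query))
    return closest[1]


def two_means(values, iterations=10):
    """Unsupervised: split unlabeled values into two clusters (1-D k-means, k=2)."""
    center_1, center_2 = min(values), max(values)
    if center_1 == center_2:
        raise ValueError("need at least two distinct values to form two clusters")
    for _ in range(iterations):
        group_1 = [v for v in values if abs(v - center_1) <= abs(v - center_2)]
        group_2 = [v for v in values if abs(v - center_1) > abs(v - center_2)]
        center_1 = sum(group_1) / len(group_1)
        center_2 = sum(group_2) / len(group_2)
    return sorted(group_1), sorted(group_2)


# Supervised: every training value comes with a known label.
labeled = [(5.0, "healthy"), (5.4, "healthy"), (8.9, "disease"), (9.5, "disease")]
prediction = nearest_neighbor_predict(labeled, 9.0)
print(prediction)

# Unsupervised: only the values are given; the groups are discovered.
unlabeled = [4.8, 5.1, 5.3, 8.7, 9.2, 9.6]
clusters = two_means(unlabeled)
print(clusters)
```

Note that the clustering output has no names attached to the groups; interpreting what each subgroup means is left to domain experts.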
7. A research team uses a logistic regression model to predict the likelihood of a disease
based on patient biomarkers.
* a) What is the output of a logistic regression model? (2 marks)
● The output of a logistic regression model is a probability, a value between 0
and 1, that represents the likelihood of a specific outcome (e.g., the probability
of a disease based on patient biomarkers).
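A minimal sketch of how this probability is produced: the model computes a weighted sum of the biomarkers and passes it through the logistic (sigmoid) function, which maps any number to a value between 0 and 1. The weights, bias, and biomarker values below are hypothetical, not fitted to any data.

```python
import math


def predict_probability(biomarkers, weights, bias):
    """Return the logistic regression output: sigmoid of the weighted sum."""
    z = bias + sum(w * x for w, x in zip(weights, biomarkers))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients for two biomarkers.
weights = [0.8, -0.5]
bias = -0.2

p = predict_probability([1.5, 0.4], weights, bias)
print(round(p, 3))
```

To turn the probability into a yes/no diagnosis, a threshold (commonly 0.5) is applied; lowering the threshold flags more patients as positive.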
* b) Define what accuracy and recall measure in the context of evaluating this
diagnostic model. Why might recall be particularly important in a medical setting? (6
marks)
● Accuracy is the proportion of total correct predictions (both positive and negative)
made by the model, measuring overall how often the model is right.
● Recall (also called sensitivity) is the proportion of actual positive cases (patients who
have the disease) that the model correctly identifies as positive.
● In a medical setting, recall is especially important because missing actual positive
cases (false negatives) means some patients with the disease could go undiagnosed,
leading to potentially serious consequences. Prioritizing recall reduces false negatives
so that most, if not all, sick patients are identified, usually at the cost of more false
positives that follow-up testing can rule out. This is essential in healthcare, where
missing a case can impact patient health and safety.
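Both measures follow directly from the counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN): accuracy = (TP + TN) / (TP + TN + FP + FN) and recall = TP / (TP + FN). The sketch below uses hypothetical counts for 100 patients, 10 of whom have the disease, to show how a model can have high accuracy and still miss every sick patient.

```python
def accuracy(tp, tn, fp, fn):
    """Proportion of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def recall(tp, fn):
    """Proportion of actual positive cases that the model detects."""
    return tp / (tp + fn)


# Hypothetical model A: finds 8 of the 10 sick patients.
acc_a = accuracy(tp=8, tn=85, fp=5, fn=2)
rec_a = recall(tp=8, fn=2)
print(acc_a, rec_a)  # 0.93 0.8

# Hypothetical model B: predicts "healthy" for everyone.
acc_b = accuracy(tp=0, tn=90, fp=0, fn=10)
rec_b = recall(tp=0, fn=10)
print(acc_b, rec_b)  # 0.9 0.0
```

Model B is right 90% of the time only because most patients are healthy; its recall of 0 shows that it is useless as a diagnostic tool.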
8. Describe two distinct ways Artificial Intelligence is currently being used to assist in
medical diagnosis. (4 marks)
● AI analyzes medical images, such as X-rays or MRI scans, to detect abnormalities
like tumors, fractures, or signs of diseases. These image-recognition systems help
radiologists identify conditions with high speed and accuracy.
● AI-powered clinical decision support systems review large volumes of patient health
records, lab results, and symptoms to suggest potential diagnoses or flag high-risk
patients for further assessment, aiding clinicians in making faster and better-informed
decisions.
9. A hospital implements an AI system to prioritize patients in the emergency room
based on the severity of their condition. Discuss two potential ethical risks or biases
that could be present in such a system and how they might be mitigated. (6 marks)
● Algorithmic bias - training data may reflect historical disparities in care, so the
system could systematically under-prioritize minority groups, leading to unfair
differences in care. This can be mitigated by using diverse, representative training
data and regularly checking the AI’s performance across different groups to address
disparities.
● Lack of transparency (“black box problem”) - clinicians may not understand how the
AI reached its decision, making it hard to trust or challenge the system. This can be
mitigated by implementing explainable AI (XAI) techniques and by establishing clear
ethical frameworks and human oversight, so that clinicians can review and override
the system's prioritization.
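One simple way to check performance across groups, as suggested in the first mitigation, is to compute recall separately for each patient group and compare the results. The sketch below assumes hypothetical audit records of the form (group, actually urgent, flagged urgent by the AI); the group names and values are invented for illustration.

```python
from collections import defaultdict


def recall_by_group(records):
    """Return recall per group from (group, actual, predicted) records."""
    true_positives = defaultdict(int)
    false_negatives = defaultdict(int)
    for group, actual, predicted in records:
        if actual == 1:  # only truly urgent patients count towards recall
            if predicted == 1:
                true_positives[group] += 1
            else:
                false_negatives[group] += 1
    groups = set(true_positives) | set(false_negatives)
    return {
        g: true_positives[g] / (true_positives[g] + false_negatives[g])
        for g in sorted(groups)
    }


# Hypothetical audit data: 1 = urgent, 0 = not urgent.
audit = [
    ("A", 1, 1), ("A", 1, 1), ("A", 1, 1), ("A", 1, 0), ("A", 0, 0),
    ("B", 1, 1), ("B", 1, 0), ("B", 1, 0), ("B", 1, 0), ("B", 0, 0),
]

group_recall = recall_by_group(audit)
print(group_recall)
```

A large gap between groups, as in this invented data, would be a signal to review the training data and the model before relying on it for triage.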