
Data Science

This paper aimed to predict disease diagnoses using language from Facebook posts. The researchers collected Facebook status updates from 999 patients who consented to share their social media and electronic medical record data. They used natural language processing techniques like topic modeling to analyze word patterns in the Facebook posts. Statistical models were built to predict 21 common medical diagnoses based on Facebook language alone, patient demographics alone, and a combination of both. The results showed Facebook language could significantly predict several diagnoses and provided additional predictive value beyond demographics alone. Key topics related to specific diagnoses were also identified. This research demonstrated the potential for analyzing social media data to predict individual health conditions.

Uploaded by

Abreham

Reviewed Article: Facebook Language Predicts Disease Diagnosis

Reviewer:

Abreham Getachew
IT Doctoral program Addis Ababa University

A paper submitted in partial fulfillment of the requirements for the course Advanced Seminar in
Information Systems, PhD in Information Systems

Submitted to:

Tibebe Beshah (PhD)

Dida Midekso (PhD)
Contents
1. Introduction
2. The Methodology employed in the research
2.1. Statistical Analysis
3. Results and Discussions
3.1. Prediction of diagnoses from FACEBOOK posts
3.2. Topics predicting diagnosis
4. Conclusions
5. References
1. Introduction
According to a 2017 statistics report [8], almost two billion people share information via
Facebook. Social media has become a major platform for communicating information. Many users
share information daily on such platforms, including their demographic characteristics,
personalities, beliefs, and actions. This information is communicated in different ways, such as
posts, status updates, photographs, location check-ins, and other details. Social media data can
therefore provide information about individuals' experiences and behavior. The authors' intention
is to use such data, which has the potential to reveal an individual's behavior and experience,
to predict individuals' health conditions.

Other studies also show that social media data can reveal unexplored pathways for understanding
individual and population health, such as predicting heart disease mortality rates [1][3][5],
tracking cholera outbreaks [2], and identifying individuals with postpartum depression [4][6].

The authors claim that studies linking social media activity and medical diagnoses at the
individual level are rare and limited to connections with self-reported data over limited samples
not validated with health records. I appreciate the attempt to predict health conditions at the
individual level based on self-reported data. Remarkably, the amount of data generated today has
never been seen before in human history [9]. Big data must be turned into useful information and
insights that can solve the problems of organizations and individuals. Social media is currently a
place where everyone communicates and shares ideas, attitudes, and feelings, so industries,
including the health sector, should pay attention to building systems that enable them to control
and monitor health conditions and outbreaks, as well as public sentiment, in order to improve
services and public health.

2. The Methodology employed in the research
The paper relates patients' Facebook posts to the diagnoses recorded in their electronic medical
records (EMR). The authors collected Facebook updates from 1772 individuals who were willing to
share their data.

Adult patients seeking care in an urban academic health system were invited to share their past
social media activity and EMR data. For all participants enrolled through October 2015 in the
emergency department who agreed to share their Facebook data (n = 1772), the authors retrieved
status updates up to 5 years prior, ranging from March 2009 through October 2015. The analysis was
then limited to those with at least 500 words across all of their Facebook status updates. However,
people always worry about privacy on social media, and asking for permission to access their
accounts will raise privacy concerns. The authors should therefore address such concerns and give
guarantees about participants' privacy. Privacy concerns will affect the recruitment of subjects
and will lead to biased results.

The other point is that, since individuals' posts are written in natural language, many issues
arise in analyzing such data. Yet the researchers entirely ignored NLP issues such as cleaning or
preprocessing the data, linguistic analysis, etc. The results would be more accurate and of higher
quality if the researchers integrated different preprocessing and linguistic analysis techniques.

The authors also did not show how they extracted the social media data or what tool they used.
Social media data is secured, and no one can simply get access to another person's data. They
should explicitly discuss the processes and procedures by which they retrieved the data from
subjects.

From the health system’s EMRs, they retrieved demographics (age, sex, and race) and prior
diagnoses (by International Classification of Diseases [ICD-9] codes). They grouped the ICD-9
codes and added categories for diagnosis codes not reflected in the index but prevalent in the
sample (e.g., pregnancy-related diagnoses), for a total of 41 categories. They then filtered the
list of diagnosis categories to those attributed to at least 30 patients in the cohort, resulting
in 21 categories.
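
The filtering step can be illustrated with a minimal Python sketch. The data structure (a mapping from patient id to a set of category names), the function name, and the toy cohort are assumptions made for illustration only; the reviewed paper does not show its code.

```python
from collections import Counter

def frequent_categories(patient_categories, min_patients=30):
    """Return diagnosis categories attributed to at least `min_patients` patients.

    `patient_categories` maps a patient id to the set of diagnosis
    categories found in that patient's record (hypothetical structure).
    """
    counts = Counter()
    for categories in patient_categories.values():
        counts.update(set(categories))  # count each patient once per category
    return sorted(c for c, n in counts.items() if n >= min_patients)

# Toy cohort: 40 patients with "pregnancy", 5 of whom also have "rare_condition".
cohort = {f"p{i}": {"pregnancy"} for i in range(40)}
for i in range(5):
    cohort[f"p{i}"].add("rare_condition")

kept = frequent_categories(cohort, min_patients=30)
print(kept)
```

With the threshold of 30 only the common category survives, which mirrors how 41 categories were reduced to 21 in the study.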

They extracted all words and word pairs (two neighboring words, e.g. “sick of”) from each
participant’s status updates, then grouped similar words into “topics” using Latent Dirichlet
Allocation (LDA), a probabilistic technique which automatically clusters words by looking at other
words with which they often co-occur. This generated 200 topics, which they tested for their
association with diagnoses from the EMR. As a result, topics may contain slang, misspellings, and
variations on contractions which themselves may be predictive.
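
A rough sketch of the word and word-pair extraction, together with the 500-word eligibility rule, is given below. The tokenizer (lowercasing and a simple regular expression) is an assumption, since the reviewed paper does not describe one, and the LDA step itself is not reproduced here.

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z0-9']+")  # assumed tokenizer, not the authors'

def tokenize(text):
    return TOKEN_RE.findall(text.lower())

def ngram_counts(posts):
    """Count words and neighboring word pairs across one participant's posts."""
    words, pairs = Counter(), Counter()
    total = 0
    for post in posts:
        tokens = tokenize(post)
        total += len(tokens)
        words.update(tokens)
        pairs.update(zip(tokens, tokens[1:]))  # pairs never span two posts
    return words, pairs, total

def eligible(posts, min_words=500):
    """Apply the minimum word count used to select participants."""
    return ngram_counts(posts)[2] >= min_words

posts = ["So sick of this cold", "Sick of waiting at the hospital"]
words, pairs, total = ngram_counts(posts)
print(words["sick"], pairs[("sick", "of")], total)
```

Because nothing is normalized beyond lowercasing, slang and misspellings stay in the vocabulary as separate features, which is consistent with the remark that they may themselves be predictive.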

2.1. Statistical Analysis

They built three predictive models associating Facebook posts with EMR-based diagnoses:

• Model 1 used Facebook language (words, word pairs, and topics).
• Model 2 used the demographics of age, sex, and race.
• Model 3 used both demographics and Facebook language.

For model 1, which includes tens of thousands of predictors (i.e. words, word pairs, and topics),
they used extremely randomized trees (ERT), a variant of random forests well suited to handling
many predictors.

For model 2, they fit an L2-penalized (ridge) logistic regression, an ideal model when there are
relatively few predictors (to confirm, they also ran the ERT approach and found all accuracies
were the same or lower). Model 3 was a combination of models 1 and 2.
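
To make the idea of an L2 penalty concrete, here is a small pure-Python sketch of ridge logistic regression fitted by batch gradient descent. It is an illustration only: the authors most likely relied on a library implementation, and the learning rate, number of epochs, penalty values, and toy data (a standardized age and a binary sex indicator) are all assumptions. The ERT model is not sketched here.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_ridge_logistic(X, y, penalty=1.0, lr=0.1, epochs=2000):
    """Minimize mean log-loss + (penalty / 2n) * ||w||^2 by gradient descent.

    The intercept b is left unpenalized.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err
            for j in range(d):
                grad_w[j] += err * xi[j]
        for j in range(d):
            w[j] -= lr * (grad_w[j] + penalty * w[j]) / n
        b -= lr * grad_b / n
    return w, b

def predict_proba(w, b, x):
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))

# Toy data: [standardized age, sex indicator]; the diagnosis depends on age.
X = [[-1.5, 0], [-1.0, 1], [-0.5, 0], [-0.25, 1],
     [0.25, 0], [0.5, 1], [1.0, 0], [1.5, 1]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

w_weak, b_weak = fit_ridge_logistic(X, y, penalty=0.01)
w_strong, b_strong = fit_ridge_logistic(X, y, penalty=5.0)
print(w_weak, w_strong)
```

The stronger penalty shrinks the age coefficient toward zero, which is the behavior that keeps a model with few predictors from overfitting.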

The approach compares the predictive ability of Facebook language (model 1) to that of
demographics (model 2) as well as the incremental contribution of Facebook language to
demographics (comparing model 3 to model 2).

They measured predictive ability using the area under the receiver operating characteristic curve
(AUC), a measure of discrimination. To control for model overfitting, they measured AUC with
10-fold cross-validation. They split the sample into 10 equal-sized, non-overlapping, and
stratified partitions, fit models over 9 partitions, and tested the fitted model on the remaining
held-out partition. The process repeats for ten iterations such that each partition is used as the
held-out test partition once. They used a Monte Carlo permutation test to calculate the
significance of the difference between any two AUCs, correcting for multiple hypothesis testing
using the Benjamini-Hochberg false-discovery rate procedure.
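
The cross-validated AUC described above can be sketched in plain Python as follows. The helper names, the toy "model", and the choice to average the ten per-fold AUCs (rather than pool the held-out predictions) are assumptions; the reviewed paper does not state these details.

```python
import random

def auc(labels, scores):
    """AUC via the rank-sum identity: the share of (positive, negative)
    pairs ranked correctly, with ties counted as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def stratified_folds(labels, k=10, seed=0):
    """Split indices into k non-overlapping folds with similar class balance."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)  # deal each class out round-robin
    return folds

def cross_validated_auc(X, y, fit, k=10, seed=0):
    """Mean held-out AUC; `fit(X_train, y_train)` returns a scoring function."""
    aucs = []
    for held_out in stratified_folds(y, k, seed):
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        score = fit([X[i] for i in train], [y[i] for i in train])
        aucs.append(auc([y[i] for i in held_out],
                        [score(X[i]) for i in held_out]))
    return sum(aucs) / len(aucs)

def fit_identity(X_train, y_train):
    """Toy 'model' that scores a patient by the single feature value."""
    return lambda x: x[0]

X = [[float(i)] for i in range(40)]
y = [0] * 20 + [1] * 20  # perfectly separable toy data
mean_auc = cross_validated_auc(X, y, fit_identity, k=10)
print(mean_auc)
```

Since each patient is scored only by a model that never saw that patient during fitting, the resulting AUC estimates out-of-sample discrimination rather than fit to the training data.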

They also evaluated each individual topic’s predictive ability by comparing three AUCs:

(1) From usage scores for the topic alone
(2) From a logistic regression model over age, sex, and race
(3) From a logistic regression model over age, sex, race, plus the topic.
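
Whether two such AUCs differ significantly, and whether that still holds after testing many topics and diagnoses, can be checked along the following lines. The permutation scheme below (randomly swapping the two models' scores for each patient) is one common choice and only an assumption about what the authors did; the toy scores are deliberately extreme.

```python
import random

def auc(labels, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def permutation_p_value(labels, scores_a, scores_b, n_perm=199, seed=0):
    """Two-sided Monte Carlo p-value for the difference between two AUCs."""
    rng = random.Random(seed)
    observed = abs(auc(labels, scores_a) - auc(labels, scores_b))
    extreme = 0
    for _ in range(n_perm):
        a, b = [], []
        for sa, sb in zip(scores_a, scores_b):
            if rng.random() < 0.5:  # swap the two models for this patient
                sa, sb = sb, sa
            a.append(sa)
            b.append(sb)
        if abs(auc(labels, a) - auc(labels, b)) >= observed:
            extreme += 1
    return (extreme + 1) / (n_perm + 1)

def benjamini_hochberg(p_values, alpha=0.05):
    """Return True for each hypothesis rejected at false-discovery rate alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / m:
            cutoff = rank  # largest rank that passes its threshold
    rejected = [False] * m
    for i in order[:cutoff]:
        rejected[i] = True
    return rejected

labels = [0] * 20 + [1] * 20
scores_a = [1 + i / 100 for i in range(20)] + [2 + i / 100 for i in range(20)]
scores_b = [-s for s in scores_a]  # toy model that ranks patients backwards
p_diff = permutation_p_value(labels, scores_a, scores_b)
p_same = permutation_p_value(labels, scores_a, scores_a)
print(p_diff, p_same, benjamini_hochberg([0.01, 0.04, 0.03, 0.20]))
```

The Benjamini-Hochberg step matters here because 200 topics tested against 21 diagnosis categories yield thousands of comparisons, some of which would look significant by chance alone.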

They used word clouds to display topics. All statistical analyses were performed in Python 2.7.10
(Python Software Foundation, [Link]). However, Python exists in two major versions, Python 2 and
Python 3, and Python 3 is not backward compatible: code written in Python 2 does not necessarily
run in Python 3. The authors should have used Python 3; relying on Python 2 may be challenging for
future researchers who need to extend this work.
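
A few well-known differences show why Python 2 code cannot simply be rerun. The snippet below is generic and unrelated to the authors' own code, which is not shown in the reviewed paper; it is meant to be run under Python 3.

```python
# Run under Python 3. Comments note what Python 2.7 would have produced.
true_division = 7 / 2    # 3.5 in Python 3; Python 2.7 gives 3 (integer division)
floor_division = 7 // 2  # 3 in both versions
doubled = map(lambda n: n * 2, [1, 2, 3])  # lazy iterator in Python 3; a list in 2.7
is_list = isinstance(doubled, list)

print(true_division, floor_division, is_list, list(doubled))
# A Python 2 print statement such as `print "hello"` is a SyntaxError in Python 3.
```

Silent changes such as the division result are the more dangerous kind for a statistical analysis, because the code still runs but may produce different numbers.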

3. Results and Discussions


They evaluated the association between consenting patients’ Facebook posts and the diagnoses
evident in their electronic medical records (EMR). There were 1143 patients who shared both social
media and EMR data for the social mediome study. Of these, 87% (999/1143) had an adequate number of
status updates (a minimum of 500 words) and health record data for language analysis. These 999
participants contributed a total of 949,530 status updates containing 20,248,122 words (words per
post average: 21.32 +/- 27.95).

Most participants were young adults (mean age 28.4 +/- 8.6, range 18-65), female (76%, 758/999),
and African-American (71%, 710/999).

The most common diagnosis categories identified in the EMR of these patients included:

• Digestive abdominal symptoms (64%, 641/999)

• Genitourinary disorders (56%, 562/999)

• Injury and poisoning (54%, 543/999)

• Respiratory symptoms (43%, 433/999) and

• Pregnancy (43% of female participants, 323/758)

3.1. Prediction of diagnoses from FACEBOOK posts

The accuracies (AUCs) of the three predictive models were evaluated for each of the 21 diagnosis
categories. Most of them, 86% (18/21), were significantly better predicted using features from
Facebook posts than from age, sex, and race alone (AUC 0.58-0.70; p < .05).

The diagnosis categories best predicted by Facebook statuses included hypertension (AUC=0.70),
pregnancy (AUC=0.70), digestive abdominal symptoms (AUC=0.68), depression (AUC=0.64),
and diabetes (AUC=0.63).

3.2. Topics predicting diagnosis

Predictive topics include:

• [“please,” “come,” “help,” “somebody,” “someone”] for depression,

• [“drink,” “drinking,” “bottle,” “wine,” “water,” “coffee”] for alcohol abuse,

• [“wanna,” “call,” “text,” “bored”] for drug abuse,

• [“thank,” “god,” “please,” “pray,” “help”] for diabetes,

• [“hospital,” “surgery,” “pain,” “blood,” “dr”] for hypertension,

• [“stomach,” “hurt,” “bad”] for digestive abdominal symptoms and

• [“mother,” “kids,” “father,” “children”] for obesity.

4. Conclusions
In general, the primary finding of the study was that the language people use on Facebook is
predictive of their health conditions reported in an EMR, and that it often predicts health
conditions better than typically available data: patient age, sex, and race. Data from social
media offer a personalized window into what people think, feel, and do. Although some early
research has linked social media activity and content with health, this is the first study to do
so at the level of the patient with EMR data.

Twitter alone had more than 319 million users as of the fourth quarter of 2016 [7]. Eichstaedt et
al. [5] used language expressed on Twitter to characterize community-level psychological
correlates of age-adjusted mortality from atherosclerotic heart disease (AHD). The reviewed paper
states that its data was retrieved from social media, but only from Facebook. Using both Facebook
and Twitter may increase the quantity of data, which would help to improve the accuracy of the
predictions.

5. References
1. Eichstaedt JC, Schwartz HA, Kern ML, et al. Psychological language on Twitter predicts
county-level heart disease mortality. Psychol Sci 2015;26:159–69.
2. Chunara R, Andrews JR, Brownstein JS. Social and news media enable estimation of
epidemiological patterns early in the 2010 Haitian cholera outbreak. Am J Trop Med Hyg
2012;86:39–45.
3. Prieto VM, Matos S, Álvarez M, et al. Twitter: a good place to detect health conditions.
PLoS ONE 2014;9:e86191.
4. Park S, Lee SW, Kwak J, et al. Activities on Facebook reveal the depressive state of users.
J Med Internet Res 2013;15:e217.
5. Eichstaedt JC, Schwartz HA, Kern ML, et al. Psychological language on Twitter predicts
county-level heart disease mortality. Psychol Sci 2015;26:159–69.
6. De Choudhury M, Gamon M, Counts S, et al. Predicting depression via social media.
Paper presented at the International AAAI Conference on Weblogs and Social Media; 27
May 2014; Ann Arbor, MI.
7. [Link]
accessed April 10, 2017.
8. [Link]
users/, last accessed April 9, 2017.
9. Gandomi, Amir, and Murtaza Haider. "Beyond the hype: Big data concepts, methods, and
analytics." International Journal of Information Management 35.2 (2015): 137-144.
