Data at Scale
Sara Muneeb
Computer Science Department
Introduction
Data at scale (big data) describes
all kinds of data, including:
databases of numbers
images of people, things, and places
recorded footage of conversations
videos
texts
environmentally sensed data
Data at Scale
Data at scale has huge potential for
grounding and elucidating problems
It can be collected, used, and communicated
in a wide variety of ways
Big Data and Issues
However, beyond societal benefits, data can also be
used in potentially harmful ways
◦ Security Issues
Data collected by multiple organizations contains sensitive
information that is vulnerable to being stolen by
hackers and leaked online.
◦ Privacy Risk
The collected information can be used for malicious activities
and can lead to the loss of anonymity.
◦ Unintentional Data Discrimination
Inaccurate big data may create analytical biases and
unintentionally discriminate against individuals based on age, race,
gender, and ethnicity.
Approaches for Collecting and
Analyzing Data
1. Scraping and 2nd Source Data
2. Collecting Personal Data
3. Crowdsourcing Data
4. Sentiment Analysis
5. Social Network Analysis
6. Combining Multiple Sources of
Data
Scraping and 2nd Source Data
Scraping
One way to extract data is by “scraping” it
from websites (assuming that this is
allowed by the application).
Once the data is scraped, it can be entered
into a human-readable spreadsheet for
study and analyzed using data science tools.
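The two steps above (scrape, then put the result in a spreadsheet) can be sketched as follows. This is a minimal illustration, not a production scraper: the HTML snippet, the "review" class name, and the function names are all hypothetical, and a real scraper would first fetch the page from a site whose terms permit scraping. Only Python standard-library modules are used.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical page content standing in for a downloaded web page.
SAMPLE_HTML = """
<ul>
  <li class="review">Great battery life</li>
  <li class="review">Screen scratches easily</li>
  <li class="ad">Buy now</li>
</ul>
"""


class ReviewParser(HTMLParser):
    """Collects the text of every <li class="review"> element."""

    def __init__(self):
        super().__init__()
        self.in_review = False
        self.reviews = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "li" and ("class", "review") in attrs:
            self.in_review = True

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_review = False

    def handle_data(self, data):
        if self.in_review and data.strip():
            self.reviews.append(data.strip())


def scrape_to_csv(html):
    """Return the scraped reviews as CSV text that a spreadsheet can open."""
    parser = ReviewParser()
    parser.feed(html)
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["id", "review"])
    for i, text in enumerate(parser.reviews, start=1):
        writer.writerow([i, text])
    return buffer.getvalue()


csv_text = scrape_to_csv(SAMPLE_HTML)
print(csv_text)
```

The advertisement item is skipped because only elements with the assumed "review" class are collected; the CSV text could be written to a file and opened in any spreadsheet tool.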
Scraping and 2nd Source
Data
Second Source
The openly available big data
that Google and other companies
now provide for researchers to
mine offers a “second source”
methodology
Search terms,
Facebook posts,
Instagram comments, and so on
Analytical Tool: Google Trends
Scraping and 2nd Source
Data
Analysis of this data can
indirectly reveal new insights
about the users’ concerns,
desires, behaviors, and habits.
The important question is what is done with
the newly available data.
Collecting Personal Data
Nowadays, many apps and wearable
devices exist that people can buy off
the shelf, which can collect all sorts of
personal data and visualize it.
For example, they can quantify health, screen time, and
sleep.
Self-tracking is also increasingly being
used by people who have a condition
or disease as a form of self-care.
Collecting Personal Data
Quantified-self projects generate
lots of data.
Crowdsourcing Data
People crowdsource information or
work together using online
technologies to collect and share data.
One form is crowd research, where many
researchers from all over the world
come together to work on large
problems.
For example, climate science
Crowdsourcing Data
Conducting research on a massive
scale enables potentially hundreds or
thousands of people to work on a
single project.
Examples: iSpotNature, eBird, iNaturalist and the
Zooniverse.
Crowd projects raise a number of
issues as to who owns and manages
the data.
Sentiment Analysis
Sentiment analysis is a technique that
is used to infer the affect (emotional tone) of what a
group of people or a crowd is feeling or
saying.
The phrases that people use when
offering their opinions or views are
scored as being negative, positive, or
neutral.
For example:
anger, sadness, or fear (negative feelings)
happiness, joy, or enthusiasm (positive feelings).
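The scoring step described above can be sketched with a tiny word-list approach. This is only one simple way to do it, under stated assumptions: the two lexicons and the function name are hypothetical, and real tools use far larger lexicons or trained models.

```python
import re

# Tiny hypothetical lexicons; real tools use much larger ones.
POSITIVE = {"happy", "joy", "love", "great", "enthusiastic"}
NEGATIVE = {"angry", "sad", "fear", "hate", "terrible"}


def score_phrase(phrase):
    """Label a phrase as positive, negative, or neutral by counting lexicon words."""
    words = re.findall(r"[a-z']+", phrase.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


phrases = [
    "I love this, it is great",
    "I am sad and angry",
    "The bus arrives at noon",
]
labels = [score_phrase(p) for p in phrases]
for phrase, label in zip(phrases, labels):
    print(f"{label:8s} {phrase}")
```

Word counting of this kind ignores negation and sarcasm: score_phrase("not great") is labeled positive, which is one reason such scores are best treated as rough indicators.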
Sentiment Analysis
Sources
◦ Tweets
◦ Text
◦ Online reviews
◦ Social media
◦ Facial expressions
Tools
◦ DisplayR
◦ CrowdFlower
◦ MonkeyLearn
Sentiment Analysis
Sentiment analysis is commonly used
by marketing and advertising
companies to decide on what types of
ads to design and place.
Sentiment analysis as a technique is
not an exact science and should be
viewed more as a heuristic than as an
objective evaluation method.
Social Network Analysis
Social network analysis (SNA) is a
method based on social network theory
for analyzing and evaluating the
strength of social ties within a network.
It is used to understand the relationships
that form among people and groups
within and across different social media
platforms, and in offline social
networks too.
Social Network Analysis
Sources
1. Weibo
2. Tencent
3. Baidu
4. Facebook
5. Twitter
6. Instagram
7. YouTube
Social Network Analysis
Two main elements make up a social network:
1. Nodes, which are also sometimes called
entities or vertices, represent people and
topics.
2. Edges, which are also known as links or
ties, are the connections between the
nodes. The edges show the connections
among nodes, for example, among the
members of a family.
Directional and Nondirectional Edges
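Nodes and edges of both kinds can be represented with plain lists of pairs. The sketch below is illustrative only: the account names, the "follows" and "family" relations, and the function names are hypothetical. A directional edge (a, b) means a points to b; a nondirectional edge connects both nodes equally.

```python
from collections import defaultdict

# Hypothetical nodes and ties.
follows = [("ana", "ben"), ("ben", "ana"), ("cy", "ana")]  # directional: who follows whom
family = [("ana", "dee"), ("dee", "eli")]                  # nondirectional: family ties


def in_degree(edges):
    """Count incoming directional edges per node (for example, number of followers)."""
    counts = defaultdict(int)
    for _source, target in edges:
        counts[target] += 1
    return dict(counts)


def neighbours(edges):
    """Build a symmetric adjacency list for nondirectional edges."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return {node: sorted(linked) for node, linked in adjacency.items()}


followers = in_degree(follows)
relatives = neighbours(family)
print(followers)
print(relatives)
```

Counting ties per node is one of the simplest measures of how connected a node is; libraries dedicated to network analysis offer many more.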
Combining Multiple Sources of
Data
Data can be collected from multiple sources
by combining automatic sensing and
subjective reporting.
The goal is to obtain a more
comprehensive picture of a
domain,
such as a population’s mental health.
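Combining a sensed stream with self-reports often comes down to joining records on a shared key such as the date. The sketch below assumes hypothetical data (sleep hours from a wearable and a 1 to 5 self-reported stress rating) and a hypothetical function name; days missing from one source are kept with an empty value rather than dropped.

```python
# Hypothetical automatically sensed data: hours of sleep per day.
sensed_sleep_hours = {"2024-03-01": 7.5, "2024-03-02": 5.0, "2024-03-03": 6.0}

# Hypothetical subjective reports: stress rating from 1 (low) to 5 (high).
reported_stress = {"2024-03-01": 2, "2024-03-02": 4}


def combine(sensed, reported):
    """Join both sources by day, keeping days that appear in either one."""
    rows = []
    for day in sorted(set(sensed) | set(reported)):
        rows.append(
            {
                "day": day,
                "sleep_hours": sensed.get(day),  # None if nothing was sensed
                "stress": reported.get(day),     # None if nothing was reported
            }
        )
    return rows


combined = combine(sensed_sleep_hours, reported_stress)
for row in combined:
    print(row)
```

Keeping the gaps visible matters because missing self-reports are common and can themselves be informative.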
Combining Multiple Sources of
Data
Example: StudentLife (Harari et al.,
2017)
Students’ activity, sleep, and attendance levels against
deadlines during a term. Source: StudentLife study
Visualizing and Exploring
Data
Visualization
Making data meaningful includes being able to see it and
understand the way that it is represented
and its context. Questions to ask:
1. What kind of data is it?
2. What is the data about?
3. Why was it collected?
4. Why was it analyzed and represented in a
particular way?
The skills needed to understand and
interpret visualizations are referred to as
visual literacy.
Visualizing and Exploring Data
A simplified path for data to be
meaningful
Visualizing and Exploring
Data
The goal of data visualization tools is to
amplify human cognition so that users
can see patterns, trends, correlations,
and anomalies in the data that lead
them to gain new insights and make
new discoveries
Data visualization tools can help users
change and manipulate variables to see
what happens
“overview first, zoom and filter, and then details on
demand.” Ben Shneiderman (1996)
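The three stages in the quotation can be read as three operations on the same data set. The sketch below is one possible interpretation, not a definitive one: the records, tickers, sectors, and function names are all hypothetical, and a real tool would render each stage graphically rather than print it.

```python
from collections import Counter

# Hypothetical records; tickers and values are made up for illustration.
STOCKS = [
    {"ticker": "AAA", "sector": "tech", "change_pct": 1.2},
    {"ticker": "BBB", "sector": "tech", "change_pct": -0.4},
    {"ticker": "CCC", "sector": "energy", "change_pct": -2.1},
    {"ticker": "DDD", "sector": "energy", "change_pct": 0.3},
]


def overview(records):
    """Overview first: summarize the whole data set (count per sector)."""
    return dict(Counter(r["sector"] for r in records))


def zoom_and_filter(records, sector, falling_only=False):
    """Zoom and filter: narrow to one sector, optionally to falling stocks."""
    subset = [r for r in records if r["sector"] == sector]
    if falling_only:
        subset = [r for r in subset if r["change_pct"] < 0]
    return subset


def details(records, ticker):
    """Details on demand: return the full record for one selected item."""
    for r in records:
        if r["ticker"] == ticker:
            return r
    return None


summary = overview(STOCKS)
falling_energy = zoom_and_filter(STOCKS, "energy", falling_only=True)
selected = details(STOCKS, "CCC")
print(summary)
print(falling_energy)
print(selected)
```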
Example
A market map of the S&P 500, which is a financial index for
stocks. Green indicates stocks that increased in value, and
red indicates stocks that decreased in value that day.
Visualizing and Exploring Data -
Dashboard
A dashboard is an interactive panel that
contains control widgets such as
sliders
checkboxes
radio buttons
and coordinated multiple-window
displays of different kinds of graphical
representations,
such as bar and line graphs, heat maps, tree
maps, infographics, word clouds, scatterplots, and
other kinds of visualizations.
Example
A dashboard that was created to show changes in sales
information.
Ethical Design Concerns
“Ethics” is usually taken
to mean “the standards of
conduct that distinguish between
right and wrong, good and bad,
and so on.”
Privacy by design is a way to
avoid collecting excessive data
that might be sensitive but is not
needed.
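One concrete practice in this spirit is to drop fields that the stated purpose does not need before anything is stored. The sketch below assumes a hypothetical fitness app whose purpose only requires step counts and sleep hours; the field names and the function name are illustrative.

```python
# Fields assumed to be necessary for the app's stated purpose (hypothetical).
NEEDED_FIELDS = {"steps", "sleep_hours"}


def minimise(record, needed=NEEDED_FIELDS):
    """Keep only the needed fields so sensitive extras are never stored."""
    return {key: value for key, value in record.items() if key in needed}


# Hypothetical raw record as a device might report it.
raw_record = {
    "name": "hypothetical user",
    "location": "51.5,-0.1",
    "steps": 8200,
    "sleep_hours": 6.5,
}

stored_record = minimise(raw_record)
print(stored_record)
```

Using an allow-list of needed fields, rather than a block-list of sensitive ones, means any new field a device starts reporting is excluded by default.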
Data Ethics Principles
1. Fairness refers to impartial and just treatment or
behavior without favoritism or discrimination
2. Accountability refers to whether an intelligent or
automated system that uses AI algorithms can
explain its decisions in ways that enable people to
believe they are accurate and correct.
3. Transparency refers to the extent to which a system
makes its decisions visible and how they were derived
4. Explainability refers to a growing expectation in HCI
and AI that systems, especially those that collect data
and make decisions about people, provide
explanations that laypeople can understand.
Conclusion
Introduction
Approaches to collecting and
analyzing data
Visualizing and exploring data
Ethical design concerns