Data Fundamentals: Key Concepts Explained
data-action-lab.com
“Reports that say that something hasn't happened are always interesting
to me, because as we know, there are known knowns; there are things
we know that we know. There are known unknowns; that is to say,
there are things that we now know we don't know. But there are also
unknown unknowns – there are things we do not know we don't know.”
Donald Rumsfeld, US Department of Defense News Briefing, 2002
OUTLINE
DATA 101 – BASIC DATA CONCEPTS
DATA FUNDAMENTALS
“You can have data without information, but you cannot have
information without data.”
Daniel Keys Moran (attributed)
MODULE LEARNING OBJECTIVES
WHAT IS DATA? WHERE DOES IT COME FROM?
OBJECTS AND ATTRIBUTES
Object: apple
Shape: spherical
Colour: red
Function: food
Location: fridge
Owner: Jen
FROM ATTRIBUTES TO DATASETS
Attributes are fields (or columns) in a database; objects are instances (or rows)
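A minimal sketch of this mapping, assuming plain Python dictionaries stand in for database records. The apple comes from the previous slide; the second object and the helper name `to_table` are made up for illustration.

```python
# Each object is an instance (row); each attribute is a field (column).
# The second record is invented purely for illustration.
objects = [
    {"object": "apple", "shape": "spherical", "colour": "red",
     "function": "food", "location": "fridge", "owner": "Jen"},
    {"object": "mug", "shape": "cylindrical", "colour": "white",
     "function": "container", "location": "cupboard", "owner": "Sam"},
]

def to_table(records):
    """Return (columns, rows): one column per attribute, one row per object."""
    columns = list(records[0].keys())
    rows = [[record[col] for col in columns] for record in records]
    return columns, rows

columns, rows = to_table(objects)
print(columns)
for row in rows:
    print(row)
```

The sketch assumes every object carries the same attributes; real datasets often need a rule for missing values.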
[https://archive.ics.uci.edu/ml/datasets/Mushroom]
Amanita muscaria
Habitat: woods
Gill Size: narrow
Odor: none
Spores: white
Classification problem: Is Amanita muscaria edible, or poisonous?
DISCUSSION
ASKING THE RIGHT QUESTIONS
Warning: not every situation calls for data science, artificial intelligence, machine
learning, or analytics.
DATA SCIENCE/MACHINE LEARNING/A.I. TASKS
Classification and class probability estimation: which clients are likely to be repeat
customers?
Clustering: do customers form natural groups?
Association rule discovery: what books are commonly purchased together?
Others:
profiling and behaviour description; link prediction; value estimation (how much is a client
likely to spend in a restaurant); similarity matching (which prospective clients are similar to a
company's best clients?); data reduction; influence/causal modeling, etc.
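As a rough illustration of association rule discovery from the list above, the sketch below counts how often pairs of items appear in the same basket (support) and how often one item accompanies another (confidence). The baskets, the item labels, and the function names are hypothetical; real rule miners (e.g. Apriori-style algorithms) do considerably more.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase baskets; the letters stand for made-up book titles.
baskets = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "D"},
]

def pair_support(baskets):
    """Fraction of baskets that contain each pair of items."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    n = len(baskets)
    return {pair: count / n for pair, count in counts.items()}

def confidence(baskets, antecedent, consequent):
    """Among baskets containing the antecedent, the share that also contain the consequent."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)

support = pair_support(baskets)
print(support[("A", "B")])            # share of baskets with both A and B
print(confidence(baskets, "A", "B"))  # share of A-baskets that also hold B
```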
CLASSIFICATION
Many different techniques can carry this out, but the steps are
the same:
• Use a training set to teach the classifier how to classify.
• Test/validate the classifier using new data.
• Use the classifier to classify novel instances.
Some classifiers (e.g. neural nets) are very ‘black box’: they
might be good at classifying, but you don’t know why!
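The three steps can be sketched with a deliberately simple classifier. This is a toy 1-nearest-neighbour example under stated assumptions: the records and labels below are made up (they are NOT taken from the UCI Mushroom dataset and say nothing about real edibility), and the function names are hypothetical.

```python
# Made-up records: (habitat, gill size, odor, spores) -> label.
# Illustration only; not real data and not a guide to edibility.
training_set = [
    (("woods", "narrow", "none", "white"), "poisonous"),
    (("woods", "narrow", "foul", "white"), "poisonous"),
    (("grasses", "broad", "almond", "brown"), "edible"),
    (("grasses", "broad", "none", "brown"), "edible"),
    (("meadows", "broad", "anise", "black"), "edible"),
    (("paths", "narrow", "foul", "white"), "poisonous"),
]
test_set = [
    (("woods", "narrow", "foul", "brown"), "poisonous"),
    (("meadows", "broad", "almond", "brown"), "edible"),
]

def hamming(a, b):
    """Number of attributes on which two instances differ."""
    return sum(x != y for x, y in zip(a, b))

def train(examples):
    """Step 1: a 1-NN classifier 'learns' by memorising the labelled examples."""
    return list(examples)

def classify(model, instance):
    """Step 3: return the label of the closest memorised example."""
    nearest = min(model, key=lambda example: hamming(example[0], instance))
    return nearest[1]

def accuracy(model, examples):
    """Step 2: share of held-out examples that are classified correctly."""
    correct = sum(classify(model, x) == y for x, y in examples)
    return correct / len(examples)

model = train(training_set)
print("test accuracy:", accuracy(model, test_set))
print("novel instance:", classify(model, ("woods", "narrow", "none", "brown")))
```

Unlike a neural net, this classifier is easy to inspect: the prediction can always be traced back to the single nearest training example.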
TIME SERIES ANALYSIS
ANOMALY DETECTION
[Excerpts from an article in Energy; journal homepage: www.elsevier.com/locate/energy]
Article history: Received 5 February 2018; Accepted 19 May 2018; Available online 21 May 2018.
Keywords: Energy consumption; Building energy management; Adaptive symbolic aggregate approximation; Anomaly detection; Data mining; Smart buildings.
Abstract: The energy management of buildings currently offers a powerful opportunity to enhance energy efficiency and reduce the mismatch between the actual and expected energy demand, which is often due to an anomalous operation of the equipment and control systems. In this context, the characterisation of energy consumption patterns over time is of fundamental importance. This paper proposes a novel methodology for the characterisation of energy time series in buildings and the identification of infrequent and unexpected energy patterns. The process is based on an enhanced Symbolic Aggregate approXimation (SAX) process, and it includes an optimised tuning of the time window width and of the symbol intervals according to the building energy behaviour. The methodology has been tested on the whole electrical load of buildings for two case studies, and its flexibility and robustness have been confirmed. In order to demonstrate the implications for a preliminary diagnosis, some unexpected trends of the total electrical load have also been discussed in a post-mining phase, using additional datasets related to heating and cooling electrical energy needs.
The process can be used to support stakeholders in characterising building behaviour, to define appropriate energy management strategies, and to send timely alerts based on anomaly detection outcomes.
© 2018 Elsevier Ltd. All rights reserved.
Excerpt (left column): [...] sub-daily time window level by developing predictive classification models for each time window. The process made it possible to discover improper building energy management by detecting and analysing anomalous reduced daily patterns. To this aim, the output of the process consists of a set of rules and a graphical visualisation of anomalous observations/trends that deviate from the frequent/expected energy consumption patterns. In order to demonstrate the effectiveness of the methodology, the whole process was tested on two different buildings.
The process was conceived to be general for different types of buildings and to be useful in the post-occupancy phase so as to [...]
Excerpt (right column): [...] over several different stages, as shown in Fig. 2. The first stage is aimed at data preparation. Data pre-processing is a crucial task to prepare the time series for the load pattern analysis. At this stage, the energy consumption time series are analysed in order to identify any missing values and/or punctual outliers that have to be removed. The second stage of the analysis is aimed at transforming the energy consumption time series by implementing an enhanced SAX process. In detail, two preliminary hypotheses are formulated in different ways from the classic SAX implementation presented in Section 2. The first hypothesis is related to the length of the non-overlapping W windows on the [...]
Fig. 2. Framework for advanced energy consumption characterisation in buildings and anomalous pattern detection.
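To give a feel for the symbolic step, here is a minimal sketch of the classic SAX transformation (z-normalisation, piecewise aggregate approximation, Gaussian breakpoints), not the enhanced, tuned variant proposed in the article. The load values, the 4-letter alphabet, and the rule of flagging words that occur only once are all assumptions made for illustration.

```python
from collections import Counter
from statistics import NormalDist, mean, pstdev

def sax(series, n_segments, alphabet="abcd"):
    """Classic SAX: z-normalise, average over equal windows, map to symbols."""
    mu, sigma = mean(series), pstdev(series)
    z = [0.0 if sigma == 0 else (x - mu) / sigma for x in series]
    # Piecewise aggregate approximation; assumes len(series) is divisible
    # by n_segments (non-overlapping windows of equal width).
    width = len(z) // n_segments
    paa = [mean(z[i * width:(i + 1) * width]) for i in range(n_segments)]
    # Breakpoints split the standard normal into equiprobable regions.
    k = len(alphabet)
    breakpoints = [NormalDist().inv_cdf(i / k) for i in range(1, k)]
    return "".join(alphabet[sum(value > b for b in breakpoints)] for value in paa)

# Hypothetical daily load profiles (16 readings per day, made-up numbers).
typical = [1, 1, 1, 1, 5, 5, 5, 5, 9, 9, 9, 9, 1, 1, 1, 1]
unusual = [9, 9, 9, 9, 9, 9, 9, 9, 1, 1, 1, 1, 1, 1, 1, 1]
days = [typical, typical, typical, unusual]

words = [sax(day, 4) for day in days]
counts = Counter(words)
# Simplistic stand-in for anomaly detection: flag words seen only once.
rare = [word for word in words if counts[word] == 1]
print(words, rare)
```

Each day becomes a short word, so infrequent daily patterns can be spotted by counting words rather than comparing raw time series.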
Unsupervised techniques:
• Association rules
• Recommender engines
• Novel categories (clustering)
SOME PRACTICAL DEFINITIONS
DATA FUNDAMENTALS
MODULE LEARNING OBJECTIVES
WHAT IS DATA ANALYSIS?
(Carrying out calculations on data?) The more complicated the pattern, the
more complicated the analysis (?)
WHAT IS DATA SCIENCE?
WHAT IS MACHINE LEARNING?
WHAT IS ARTIFICIAL/AUGMENTED INTELLIGENCE?
MODULE LEARNING OBJECTIVES
THE DATA SCIENCE “WORKFLOW”
[Diagram: workflow stages]
Objective/Rationale | Data Collection | Data Exploration | Utilization and Decision Support
Infrastructure and Data Management | Data Preparation | Modeling and Analysis | Communication
THE DATA ANALYSIS PROCESS
Iterative process: feature selection and data reduction may require numerous visits
to domain experts before models start yielding promising results.
[James Taylor]
LIFE AFTER ANALYSIS
When an analysis or model is ‘released into the wild’, it can take on a life of its own.
Analysts may eventually have to relinquish control over dissemination. Results may
be misappropriated, misunderstood, or shelved. What can be done to prevent this?
Because of analytic decay, it is better to see the last analytical step not as a static
dead end, but rather as an invitation to return to the beginning of the process.
DATA SCIENCE ECOSYSTEM
Data analysis is a team sport, with team members needing a good understanding of
both data and context.
• data management
• data preparation
• analysis
• communications
Even slight improvements over a current approach can find a useful place in an
organization – data science is not solely about Big Data and disruption!
MODELS AND SYSTEMS THINKING
DATA FUNDAMENTALS
“What if the only valid model of the Universe is the Universe itself?”
Unknown
MODULE LEARNING OBJECTIVES
REPRESENTATION
[Diagram labels: eyeball; detail; rigour; DATA*; ACTION (based on goal)]
THINKING IN SYSTEMS TERMS
In order to understand how various aspects of the World interact with one another,
we need to carve out chunks corresponding to the aspects and define their
boundaries.
A system is made up of objects with properties that potentially change over time.
Within the system we perceive actions and evolving properties leading us to think in
terms of processes.
This generates data points, capturing the underlying reality to some degree of
accuracy and error (biased or unbiased).
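The difference between biased and unbiased error can be illustrated with a small simulation. This is only a sketch: the "true" value, the Gaussian noise model, the size of the bias, and the function name `measure` are all assumptions.

```python
import random
from statistics import mean

def measure(true_value, n, bias=0.0, noise_sd=1.0, seed=0):
    """Simulate n noisy measurements of true_value, with an optional constant bias."""
    rng = random.Random(seed)
    return [true_value + bias + rng.gauss(0.0, noise_sd) for _ in range(n)]

true_value = 20.0  # hypothetical quantity being measured
unbiased = measure(true_value, 10_000)
biased = measure(true_value, 10_000, bias=2.0)

# Unbiased error averages out as more data points are collected;
# a systematic bias does not, no matter how many points are collected.
print(round(mean(unbiased) - true_value, 2))
print(round(mean(biased) - true_value, 2))
```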
IDENTIFYING GAPS IN KNOWLEDGE
The solution is to be flexible. When faced with such a gap, go back, ask questions,
and modify the system representation.
CONCEPTUAL MODELS
Exercise:
• Assume that an acquaintance has just set foot in your living space for the first time.
• You are on the phone with them but not currently at home.
• Explain to them how to go about preparing a cup of sugar.
Is the data which has been collected and analyzed going to be of any use when it
comes to understanding the system?
Is the combination of system and data sufficient to understand the aspects of the
world under consideration?
TAKE-AWAYS
Certain aspects of the Universe can be approximated with the help of systems.
System models provide the basis under which data is identified and collected, but
data itself is approximate and selective.
Knowledge gaps happen. Be prepared and ready to re-visit your set-up regularly.
We often rely only on implicit conceptual modeling, but there is danger in doing so.
If the data, the system, and the world are out of alignment, insights might prove
useless.
ETHICAL CONSIDERATIONS AND BEST PRACTICES
DATA FUNDAMENTALS
“We have flown the air like birds and swum the sea like fishes, but
have yet to learn the simple act of walking the Earth like brothers.”
Martin Luther King, Jr.
MODULE LEARNING OBJECTIVES
DISCUSSION
THE NEED FOR ETHICS
Formerly: “Wild West” mentality to data collection (and use). Whatever wasn’t
technologically forbidden was allowed.
Now: professional codes of conduct are being devised for data scientists (outline
responsible ways to practice data science).
Additional responsibility for data scientists; but also protection against being hired
to carry out questionable analyses.
Does your organization have a code of ethics for its data scientists? For its
employees?
WHAT ARE ETHICS?
Broadly speaking, ethics refers to the study and definition of right and wrong
conduct:
• “not […] social convention, religious beliefs, or laws” (R.W. Paul, L. Elder).
Analytically, the general is preferred to the anecdotal – decisions made on the basis
of machine learning and A.I. (security, financial, marketing, etc.) may affect real
beings in unpredictable ways.
BEST PRACTICES
“Do No Harm”: data collected from an individual should not be used to harm the
individual.
Informed Consent:
• Individuals must agree to the collection and use of their data.
• Individuals must have a real understanding of what they are consenting to, and of
possible consequences for them and others.
Keep Data Public: data should be kept public (all? most? any?).
Opt-In/Opt-Out: Informed consent requires the ability to opt out.
Anonymize Data: removal of id fields from data prior to analysis.
“Let the Data Speak”:
• no cherry picking
• importance of validation (more on this later)
• correlation and causation (more on this later, too)
• repeatability
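The "Anonymize Data" practice above can be sketched as a simple pre-processing step. The field names, the records, and the helper `strip_id_fields` are hypothetical; which fields count as identifiers depends on the dataset at hand.

```python
# Hypothetical identifier fields; adjust to the dataset at hand.
ID_FIELDS = {"name", "email", "customer_id"}

def strip_id_fields(records, id_fields=ID_FIELDS):
    """Return copies of the records without the listed identifier fields."""
    return [
        {key: value for key, value in record.items() if key not in id_fields}
        for record in records
    ]

# Made-up records, for illustration only.
records = [
    {"customer_id": 101, "name": "Jen", "email": "jen@example.com",
     "age_group": "30-39", "visits": 4},
    {"customer_id": 102, "name": "Sam", "email": "sam@example.com",
     "age_group": "20-29", "visits": 1},
]

anonymized = strip_id_fields(records)  # the original records are left untouched
print(anonymized)
```

Dropping direct identifiers is only a first step: combinations of the remaining fields can sometimes still single out an individual.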
MODEL ASSESSMENT AND VALIDITY
Data can be used in conjunction with existing models to come to some conclusions,
or can be used to update the model itself.
At what point does one determine that the current data model is out-of-date or is
not useful anymore?
READINGS AND REFERENCES
DATA FUNDAMENTALS
REFERENCES
Mayer-Schönberger, V. and Cukier, K. [2013], Big Data: A Revolution That Will Transform How We Live,
Work, and Think, Eamon Dolan/Houghton Mifflin Harcourt.
Mayer-Schönberger, V. [2009], Delete: The Virtue of Forgetting in the Digital Age, Princeton University
Press.
Data Science Association, Data Science Code of Professional Conduct.
Chen, M. [2013], Is ‘Big Data’ Actually Reinforcing Social Inequalities?, The Nation.
Shin, L. [2013], How the New Field of Data Science is Grappling With Ethics, SmartPlanet.
Schutt, R. and O'Neil, C. [2013], Doing Data Science: Straight Talk From the Front Line, O'Reilly.
O'Neil, C. [2016], Weapons of Math Destruction: How Big Data Increases Inequality and Threatens
Democracy, Crown.
Chang, R.M., Kauffman, R.J., Kwon, Y. [2014], Understanding the paradigm shift to computational social
science in the presence of big data, Decision Support Systems, 63:67–80, Elsevier.
Hurlburt, G.F., Voas, J. [2014], Big Data, Networked Worlds, IEEE Computer Society.
Introna, L.D. [2007], Maintaining the reversibility of foldings: Making the ethics (politics) of information
technology visible, Ethics and Information Technology, 9:11–25, Springer.
Floridi, L. [2011], The philosophy of information, Oxford University Press.
Floridi, L. (ed) [2006], The Cambridge handbook of information and computer ethics, Cambridge
University Press.
Big Data & Ethics
Mason, H. [2012], What is a Data Scientist?, Forbes.
Schlimmer, J.S. [1987], Concept Acquisition Through Representational Adjustment (Technical Report
87-19), Department of Information and Computer Science, University of California, Irvine.
Iba, W., Wogulis, J., Langley, P. [1988], Trading off Simplicity and Coverage in Incremental Concept
Learning, in Proceedings of the 5th International Conference on Machine Learning, 73-79. Ann Arbor,
Michigan: Morgan Kaufmann.
Gorelik, B. [2017], Don’t study data science as a career move; you’ll waste your time!, gorelik.net
Leskovec, J., Rajaraman, A., Ullman, J. [2015], Mining of Massive Datasets, Cambridge University Press.
Hastie, T., Tibshirani, R., Friedman, J. [2008], The Elements of Statistical Learning: Data Mining,
Inference, and Prediction, 2nd ed., Springer.
Provost, F., Fawcett, T. [2013], Data Science for Business, O'Reilly.