Unit 1
Prepared by: Prof. Megha Mehta
Defining data science
Recognizing the different types of data
Gaining insight into the data science process
Fields of data science
Data science is the study of data to extract meaningful insights for business.
It is a multidisciplinary approach that combines principles and practices from the fields
of mathematics, statistics, artificial intelligence, and computer engineering to analyse
large amounts of data.
This analysis helps data scientists to ask and answer questions like what happened, why
it happened, what will happen, and what can be done with the results.
Data science is an evolutionary extension of statistics capable of dealing with the
massive amounts of data produced today. It adds methods from computer science to the
repertoire of statistics.
Data science is important because it combines tools, methods, and technology to
generate meaning from data.
Modern organizations are inundated with data; there is a proliferation of devices that
can automatically collect and store information.
Online systems and payment portals capture ever more data in e-commerce, medicine,
finance, and nearly every other aspect of human life.
We have text, audio, video, and image data available in vast quantities.
In data science and big data you’ll come across many different types of data, and each
of them tends to require different tools and techniques.
The main categories of data are these:
Structured
Unstructured
Natural language
Machine-generated
Graph-based
Audio, video, and images
Streaming
Structured data is data that depends on a data model and resides in a fixed field within a
record.
As such, it’s often easy to store structured data in tables within databases or Excel files.
SQL, or Structured Query Language, is the preferred way to manage and query data that
resides in databases.
You may also come across structured data that is still difficult to store in a traditional
relational database.
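As a minimal sketch of how structured data is queried, the following uses Python's built-in sqlite3 module; the customers table and its rows are invented purely for illustration.

```python
import sqlite3

# Create an in-memory database with a small, hypothetical "customers" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers (name, city) VALUES (?, ?)",
    [("Asha", "Ahmedabad"), ("Ravi", "Mumbai"), ("Meera", "Ahmedabad")],
)

# Because every record fits the same fixed fields, a simple SQL query can answer
# questions such as "how many customers per city?".
for city, count in conn.execute("SELECT city, COUNT(*) FROM customers GROUP BY city"):
    print(city, count)

conn.close()
```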
Unstructured data is data that isn’t easy to fit into a data model because the content is
context-specific or varying.
One example of unstructured data is your regular email.
Although email contains structured elements such as the sender, title, and body text, it's
a challenge to find, for example, the number of people who have written an email
complaint about a specific employee, because there are so many ways to refer to a person.
The thousands of different languages and dialects out there further complicate this.
Natural language is a special type of unstructured data; it’s challenging to process
because it requires knowledge of specific data science techniques and linguistics.
The natural language processing community has had success in entity recognition, topic
recognition, summarization, text completion, and sentiment analysis, but models trained
in one domain don’t generalize well to other domains.
Even state-of-the-art techniques aren’t able to decipher the meaning of every piece of
text.
This shouldn’t be a surprise though: humans struggle with natural language as well. It’s
ambiguous by nature.
The concept of meaning itself is questionable here. Have two people listen to the same
conversation. Will they get the same meaning? The meaning of the same words can vary
when coming from someone upset or joyous.
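The following is only a toy sketch of lexicon-based sentiment scoring, using just the Python standard library; the word lists and example sentences are invented, and real NLP systems rely on trained models rather than hand-written lists.

```python
import re
from collections import Counter

# A toy, lexicon-based sentiment scorer. The word lists below are invented for
# illustration; real NLP systems learn such signals from data.
POSITIVE = {"great", "happy", "excellent", "love"}
NEGATIVE = {"bad", "angry", "complaint", "terrible"}

def naive_sentiment(text: str) -> int:
    """Return a crude sentiment score: positive word count minus negative word count."""
    words = Counter(re.findall(r"[a-z']+", text.lower()))
    return sum(words[w] for w in POSITIVE) - sum(words[w] for w in NEGATIVE)

print(naive_sentiment("I love this product, it is excellent"))       # prints 2
print(naive_sentiment("Terrible service, I am filing a complaint"))  # prints -2
```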
Machine-generated data is information that’s automatically created by a computer,
process, application, or other machine without human intervention.
Machine-generated data is becoming a major data resource and will continue to do so.
Wikibon has forecast that the market value of the industrial Internet (a term coined by
Frost & Sullivan to refer to the integration of complex physical machinery with
networked sensors and software) will be approximately $540 billion in 2020.
IDC (International Data Corporation) has estimated there will be 26 times more
connected things than people in 2020. This network is commonly referred to as the
internet of things.
The analysis of machine data relies on highly scalable tools, due to its high volume and
speed. Examples of machine data are web server logs, call detail records, network event
logs, and telemetry.
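As a small sketch of working with machine-generated data, the following parses one (made-up) web server log line, assuming the common Apache/NGINX access-log layout.

```python
import re

# One line of machine-generated data: a web server access log entry (made up).
LOG_LINE = '203.0.113.7 - - [12/Mar/2024:10:15:32 +0000] "GET /index.html HTTP/1.1" 200 5123'

# Pattern for the common/combined log format.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+)'
)

match = LOG_PATTERN.match(LOG_LINE)
if match:
    record = match.groupdict()
    print(record["ip"], record["method"], record["path"], record["status"])
```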
“Graph data” can be a confusing term because any data can be shown in a graph.
“Graph” in this case points to mathematical graph theory.
In graph theory, a graph is a mathematical structure to model pair-wise relationships
between objects.
Graph or network data is, in short, data that focuses on the relationship or adjacency of
objects. Graph structures use nodes, edges, and properties to represent and store
graph data.
Graph-based data is a natural way to represent social networks, and its structure allows
you to calculate specific metrics such as the influence of a person and the shortest path
between two people.
Examples of graph-based data can be found on many social media websites.
For instance, on LinkedIn you can see who you know at which company. Your follower list
on Twitter is another example of graph-based data.
The power and sophistication come from multiple, overlapping graphs over the same
nodes.
For example, imagine one graph whose edges connect "friends" on Facebook.
Imagine another graph with the same people that connects business colleagues via
LinkedIn, and a third graph based on movie interests on Netflix.
Overlapping the three different-looking graphs makes more interesting questions
possible.
Graph databases are used to store graph-based data and are queried with specialized
query languages such as SPARQL.
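A full graph database is overkill for a sketch, so the following models a tiny, hypothetical friend network as a plain Python adjacency dictionary and uses breadth-first search to find the shortest path between two people, one of the metrics mentioned above.

```python
from collections import deque

# A tiny social graph stored as an adjacency dictionary. Names are hypothetical.
friends = {
    "Asha":  ["Ravi", "Meera"],
    "Ravi":  ["Asha", "John"],
    "Meera": ["Asha", "John"],
    "John":  ["Ravi", "Meera", "Priya"],
    "Priya": ["John"],
}

def shortest_path(graph, start, goal):
    """Breadth-first search: return the shortest chain of people linking start to goal."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in graph.get(node, []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None

print(shortest_path(friends, "Asha", "Priya"))  # ['Asha', 'Ravi', 'John', 'Priya']
```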
Graph data poses its own challenges, but audio and image data can be even more
difficult for a computer to interpret.
Audio, image, and video are data types that pose specific challenges to a data scientist.
Tasks that are trivial for humans, such as recognizing objects in pictures, turn out to be
challenging for computers. MLBAM (Major League Baseball Advanced Media)
announced in 2014 that it would increase video capture to approximately 7 TB per game
for the purpose of live, in-game analytics.
High-speed cameras at stadiums will capture ball and athlete movements to calculate in
real time, for example, the path taken by a defender relative to two baselines.
Recently a company called DeepMind succeeded in creating an algorithm that's
capable of learning how to play video games.
This algorithm takes the video screen as input and learns to interpret everything via a
complex process of deep learning.
It's a remarkable feat that prompted Google to buy the company for its own Artificial
Intelligence (AI) development plans.
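Both recognizing objects in pictures and learning from a game screen start from the same raw material: to a computer, an image or video frame is just a grid of numbers. The tiny grayscale "image" below is invented purely for illustration.

```python
# A grayscale image is a grid of pixel intensities (0 = black, 255 = white).
# This 4x4 "image" is made up for illustration.
image = [
    [  0,   0, 255, 255],
    [  0,   0, 255, 255],
    [255, 255,   0,   0],
    [255, 255,   0,   0],
]

# What a person instantly sees as a pattern, an algorithm must infer from raw
# numbers, e.g. by comparing average brightness per region.
top_left = [pixel for row in image[:2] for pixel in row[:2]]
print(sum(top_left) / len(top_left))  # 0.0 -> the top-left block is dark
```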
The learning algorithm takes in data as it’s produced by the computer game; it’s
streaming data.
While streaming data can take almost any of the previous forms, it has an extra property.
The data flows into the system when an event happens instead of being loaded into a
data store in a batch.
Although this isn’t really a different type of data, we treat it here as such because you
need to adapt your process to deal with this type of information.
Examples are the "What's trending" section on Twitter, live sporting or music events, and
the stock market.
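A minimal sketch of the difference between batch loading and streaming is shown below: a Python generator simulates events arriving one at a time, and each event is processed as it comes in. The simulated stock-price ticks are made up.

```python
import random
import time

def event_stream(n_events=5):
    """Simulate a data stream: events arrive one at a time, not as a stored batch."""
    for _ in range(n_events):
        yield {"price": round(random.uniform(99.0, 101.0), 2)}  # made-up stock ticks
        time.sleep(0.1)  # pretend we are waiting for the next real-world event

# Process each event as it arrives, e.g. by maintaining a running average.
total, count = 0.0, 0
for event in event_stream():
    count += 1
    total += event["price"]
    print(f"event {count}: price={event['price']}, running average={total / count:.2f}")
```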
The data science process typically consists of six steps, as you can see in the mind map.
The first step of this process is setting a research goal. The main purpose here is making
sure all the stakeholders understand the what, how, and why of the project.
Data science is mostly applied in the context of an organization.
When the business asks you to perform a data science project, you’ll first prepare a
project charter.
This charter contains information such as what you’re going to research, how the
company benefits from that, what data and resources you need, a timetable, and
deliverables.
The second phase is data retrieval.
You want to have data available for analysis, so this step includes finding suitable data
and getting access to the data from the data owner.
The result is data in its raw form, which probably needs polishing and transformation
before it becomes usable.
You’ve stated in the project charter which data you need and where you can find it.
In this step you ensure that you can use the data in your program, which means checking
the data's existence, its quality, and your access to it.
Data can also be delivered by third-party companies and takes many forms ranging
from Excel spreadsheets to different types of databases.
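A sketch of what data retrieval can look like in practice is shown below, assuming pandas is available; the file, database, and table names are placeholders, and tiny stand-in sources are created first so the example is self-contained.

```python
import sqlite3
import pandas as pd

# In a real project these sources come from data owners or third parties; here we
# create tiny stand-ins so the sketch runs on its own. Names are placeholders.
with open("sales_2023.csv", "w") as f:
    f.write("customer_id,amount\n1,100.0\n2,250.0\n")

conn = sqlite3.connect("company.db")
conn.execute("CREATE TABLE IF NOT EXISTS customers (customer_id INTEGER, region TEXT)")
conn.execute("DELETE FROM customers")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "West"), (2, "North")])
conn.commit()

# Data retrieval: a flat-file delivery and a table owned by another team.
sales = pd.read_csv("sales_2023.csv")
customers = pd.read_sql("SELECT * FROM customers", conn)
conn.close()

# A quick check that the raw data exists and is accessible.
print(sales.shape, customers.shape)
print(sales.head())
```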
Now that you have the raw data, it’s time to prepare it.
This includes transforming the data from a raw form into data that’s directly usable in
your models.
To achieve this, you’ll detect and correct different kinds of errors in the data, combine
data from different data sources, and transform it.
If you have successfully completed this step, you can progress to data visualization and
modeling.
This phase consists of three subphases: data cleansing removes false values from a data
source and inconsistencies across data sources; data integration enriches data sources
by combining information from multiple data sources; and data transformation ensures
that the data is in a suitable format for use in your models.
Data collection is an error-prone process; in this phase you enhance the quality of the
data and prepare it for use in subsequent steps.
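A sketch of the three subphases using pandas on a small, invented dataset is shown below; the column names and values are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical raw data with typical collection errors.
sales = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "amount": [100.0, -1.0, 250.0, None],   # -1 and None are invalid entries
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "region": ["West", "North", "West"],
})

# Data cleansing: drop impossible and missing amounts.
sales = sales[sales["amount"] > 0].dropna(subset=["amount"])

# Data integration: enrich the sales records with customer information.
combined = sales.merge(customers, on="customer_id", how="left")

# Data transformation: reshape into a form a model can use, e.g. totals per region.
per_region = combined.groupby("region")["amount"].sum().reset_index()
print(per_region)
```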
The fourth step is data exploration.
The goal of this step is to gain a deep understanding of the data. You’ll look for patterns,
correlations, and deviations based on visual and descriptive techniques. The insights
you gain from this phase will enable you to start modeling.
Data exploration is concerned with building a deeper understanding of your data.
You try to understand how variables interact with each other, the distribution of the data,
and whether there are outliers.
To achieve this you mainly use descriptive statistics, visual techniques, and simple
modeling.
This step often goes by the abbreviation EDA, for Exploratory Data Analysis.
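A sketch of typical EDA steps is given below, assuming pandas and matplotlib; the small dataset is invented purely to illustrate descriptive statistics, correlations, and visual checks for outliers.

```python
import pandas as pd
import matplotlib.pyplot as plt

# A small, invented dataset standing in for the data retrieved earlier.
df = pd.DataFrame({
    "age":    [23, 35, 31, 52, 46, 29, 41, 95],   # 95 looks like a possible outlier
    "income": [28, 52, 45, 80, 70, 38, 60, 61],   # in thousands
})

# Descriptive statistics and pairwise correlations: the core of EDA.
print(df.describe())
print(df.corr())

# Visual techniques: a histogram and a scatter plot reveal distributions and outliers.
df["age"].hist(bins=8)
plt.title("Age distribution")
plt.show()

df.plot.scatter(x="age", y="income", title="Age vs. income")
plt.show()
```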
In this phase you use models, domain knowledge, and insights about the data you found
in the previous steps to answer the research question.
You select a technique from the fields of statistics, machine learning, operations
research, and so on.
Building a model is an iterative process that involves selecting the variables for the
model, executing the model, and model diagnostics.
It is now that you attempt to gain the insights or make the predictions stated in your
project charter. Now is the time to bring out the heavy guns, but remember that research
has taught us that a combination of simple models often (but not always) outperforms
one complicated model.
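A sketch of this iterative modeling step is shown below, assuming scikit-learn; it uses synthetic data and compares two simple models with a combination of them (a voting ensemble), echoing the point that combined simple models can compete with a single complicated one.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the prepared dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Two simple models, plus a combination of them (a voting ensemble).
simple_a = LogisticRegression(max_iter=1000)
simple_b = DecisionTreeClassifier(max_depth=3, random_state=0)
combined = VotingClassifier([("lr", simple_a), ("tree", simple_b)], voting="soft")

# Model diagnostics via cross-validation: compare candidates and iterate.
for name, model in [("logistic", simple_a), ("tree", simple_b), ("combined", combined)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```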
If you’ve done this phase right, you’re almost done.
The last step of the data science process is presenting your results and automating the
analysis, if needed.
One goal of a project is to change a process and/or make better decisions. You may still
need to convince the business that your findings will indeed change the business
process as expected. This is where you can shine in your influencer role.
The importance of this step is more apparent in projects on a strategic and tactical level.
Certain projects require you to perform the business process over and over again, so
automating the project will save time.
Finally, you present the results to your business.
These results can take many forms, ranging from presentations to research reports.
Sometimes you’ll need to automate the execution of the process because the business
will want to use the insights you gained in another project or enable an operational
process to use the outcome from your model.
The field of data science encompasses multiple subdisciplines such as data analytics,
data mining, artificial intelligence, machine learning, and others.
Data Analytics
While data analysts are focused on extracting meaningful insights from various data sources,
data scientists go beyond that to “forecast the future based on past patterns,” according to
SimpliLearn. “A data scientist creates questions, while a data analyst finds answers to the
existing set of questions.”
Artificial Intelligence
Commonly called AI, artificial intelligence, according to Techopedia, “aims to imbue software
with the ability to analyze its environment using either predetermined rules and search
algorithms or pattern recognizing machine learning models, and then make decisions based
on those analyses. In this way, AI attempts to mimic biological intelligence to allow the software
application or system to act with varying degrees of autonomy, thereby reducing manual
human intervention for a wide range of functions.”
Machine Learning
Machine learning algorithms use statistics to find patterns in massive amounts of data,
according to MIT Technology Review. A subdiscipline of AI, “machine learning is the process
that powers many of the services we use today — recommendation systems like those on
Netflix, YouTube, and Spotify; search engines like Google and Baidu; social-media feeds like
Facebook and Twitter; voice assistants like Siri and Alexa. The list goes on.”
Data Visualization
Data visualization is the graphical representation of information and data. By using visual
elements like charts, graphs, and maps, data visualization tools provide an accessible way to
see and understand trends, outliers, and patterns in data.
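A sketch of a basic visualization with matplotlib is shown below; the monthly sales figures are invented and serve only to show how a chart makes a trend easy to see.

```python
import matplotlib.pyplot as plt

# Invented monthly sales figures, used only to illustrate a simple visualization.
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
sales = [120, 135, 128, 160, 175, 190]

plt.plot(months, sales, marker="o")
plt.title("Monthly sales (illustrative data)")
plt.xlabel("Month")
plt.ylabel("Sales (units)")
plt.grid(True)
plt.show()
```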
Data mining
Data mining is the process of extracting and discovering patterns in large data sets involving
methods at the intersection of machine learning, statistics, and database systems.
Statistics and Probability
Statistics and probability represent a considerable area of mathematics that greatly impacts
data science and is among its most widely used foundations. This specialty area is all about
establishing and working with finite figures as well as the effects of the ever-present factor of
“chance” in all things. Data scientists with training in this area are a great asset to general and
specialized areas of the data science industry today, including:
Epidemiologist
Statistician
Business Intelligence Analyst
Social Science Data Analyst
General Data Scientist