
Data Science & Machine Learning
20INMCA501
Module 1
• Introduction to Data Science - Benefits and uses of data science and
big data - Facets of data - The data science process - The big data
ecosystem and data science - Introduction to Hadoop.
Data science and big data
• Big data is a blanket term for any collection of data sets so large or
complex that it becomes difficult to process them using traditional
data management techniques such as relational database management
systems (RDBMS).
• Data science involves using methods to analyze massive amounts of
data and extract the knowledge it contains.
• You can think of the relationship between big data and data science as
being like the relationship between crude oil and an oil refinery.
The characteristics of big data are
often referred to as the three Vs:
■ Volume—How much data is there?
■ Variety—How diverse are different types of data?
■ Velocity—At what speed is new data generated?
Often these characteristics are complemented with a fourth V:
■ Veracity—How accurate is the data?
These four properties make big data different from the data found
in traditional data management tools.
Benefits and uses of data science and big data
• Data science and big data are used almost everywhere in both commercial
and non-commercial settings.
• Commercial companies in almost every industry use data science and big
data to gain insights into their customers, processes, staff, competition, and
products.
• Many companies use data science to offer customers a better user
experience, as well as to cross-sell, up-sell, and personalize their offerings.
• A good example of this is Google AdSense, which collects data from
internet users so relevant commercial messages can be matched to the
person browsing the internet.
• Financial institutions use data science to predict stock markets, determine
the risk of lending money, and learn how to attract new clients for their
services.
• Many governmental organizations not only rely on internal data
scientists to discover valuable information, but also share their data
with the public.
• A data scientist in a governmental organization gets to work on
diverse projects such as detecting fraud and other criminal activity or
optimizing project funding.
• Nongovernmental organizations (NGOs) are also no strangers to using
data. They use it to raise money and defend their causes.
• Universities use data science in their research but also to enhance the
study experience of their students. The rise of massive open online
courses (MOOCs) produces a lot of data, which allows universities to
study how this type of learning can complement traditional classes.
Facets of data
■ Structured
■ Unstructured
■ Natural language
■ Machine-generated
■ Graph-based
■ Audio, video, and images
■ Streaming
Structured data
• Structured data is data that depends on a data model and resides in a
fixed field within a record.
• SQL, or Structured Query Language, is the preferred way to manage and
query data that resides in databases.
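• As a small illustration (not from the slides), the sketch below shows structured data living in fixed fields and being queried with SQL, using Python's built-in sqlite3 module; the table and column names are invented.

```python
# A minimal sketch of structured data: a fixed-schema table queried with SQL.
import sqlite3

conn = sqlite3.connect(":memory:")          # throwaway in-memory database
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Asha", "IN"), (2, "Ben", "US"), (3, "Carla", "US")],
)

# Every record fits the same fixed fields, so SQL can filter and aggregate it.
for row in conn.execute("SELECT country, COUNT(*) FROM customers GROUP BY country"):
    print(row)        # ('IN', 1) and ('US', 2)
```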
Unstructured data
• Unstructured data is data that isn’t easy to fit into a data
model because the content is context-specific or
varying. One example of unstructured data is your
regular email.
Natural language
• Natural language is a special type of unstructured data; it’s
challenging to process because it requires knowledge of
specific data science techniques and linguistics.
• The natural language processing community has had success
in entity recognition, topic recognition, summarization, text
completion, and sentiment analysis, but models trained in
one domain don’t generalize well to other domains.
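• As a brief, hedged illustration (not part of the slides), the sketch below uses the NLTK library, mentioned later in this module, for two of the tasks named above: tokenization and sentiment analysis. The example sentence is invented, and the download calls assume internet access.

```python
# A small illustration of natural-language tasks with NLTK.
import nltk
nltk.download("punkt")           # tokenizer model
nltk.download("vader_lexicon")   # lexicon for the VADER sentiment analyzer

from nltk.tokenize import word_tokenize
from nltk.sentiment import SentimentIntensityAnalyzer

text = "The delivery was late, but the support team was wonderful."
print(word_tokenize(text))                                  # split text into word tokens
print(SentimentIntensityAnalyzer().polarity_scores(text))   # mixed sentiment scores
```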
Machine-generated data
• Machine-generated data is information that’s automatically
created by a computer, process, application, or other
machine without human intervention. Machine-generated
data is becoming a major data resource and will continue to
do so.
• Examples of machine data are web server logs, call detail
records, network event logs, and telemetry
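• A minimal sketch of working with machine-generated data: parsing a (made-up) web server log line into structured fields with a regular expression.

```python
import re

line = '192.168.1.10 - - [05/Feb/2024:10:15:32 +0000] "GET /index.html HTTP/1.1" 200 512'
pattern = r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]+)" (?P<status>\d+) (?P<size>\d+)'

match = re.match(pattern, line)
if match:
    # Once parsed, the log line becomes structured data you can store or analyze.
    print(match.group("ip"), match.group("status"), match.group("request"))
    # 192.168.1.10 200 GET /index.html HTTP/1.1
```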
Graph-based or network data
• “Graph data” can be a confusing term because any data can
be shown in a graph. “Graph” in this case points to
mathematical graph theory.
• In graph theory, a graph is a mathematical structure to
model pair-wise relationships between objects.
• Graph or network data is, in short, data that focuses on the
relationship or adjacency of objects.
• The graph structures use nodes, edges, and properties to
represent and store graphical data. Graph-based data is a
natural way to represent social networks, and its structure
allows you to calculate specific metrics such as the influence
of a person and the shortest path between two people.
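• A short sketch of graph data using the networkx library (an assumed dependency; any graph library would do): people are nodes, friendships are edges, and the names are invented.

```python
import networkx as nx

g = nx.Graph()
g.add_edges_from([("Ann", "Bob"), ("Bob", "Chris"), ("Chris", "Dina"),
                  ("Ann", "Dina"), ("Dina", "Evan")])

# Shortest path between two people, and a simple "influence" measure.
print(nx.shortest_path(g, "Ann", "Evan"))        # e.g. ['Ann', 'Dina', 'Evan']
print(nx.degree_centrality(g))                   # who is most connected
```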
Audio, image, and video
• Audio, image, and video are data types that pose specific
challenges to a data scientist.
• Tasks that are trivial for humans, such as recognizing objects
in pictures, turn out to be challenging for computers.
• High-speed cameras at stadiums will capture ball and athlete
movements to calculate in real time, for example, the path
taken by a defender relative to two baselines.
Streaming data
• The data flows into the system when an event happens
instead of being loaded into a data store in a batch.
• Examples are the “What’s trending” on Twitter, live
sporting or music events, and the stock market.
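• A toy sketch (pure Python, made-up ticker symbols) of the streaming idea: events are processed as they arrive instead of being loaded into a data store first.

```python
import random
import time

def price_stream():
    """Simulate an endless stream of (symbol, price) events."""
    while True:
        yield random.choice(["AAA", "BBB"]), round(random.uniform(90, 110), 2)
        time.sleep(0.1)

for i, (symbol, price) in enumerate(price_stream()):   # consume events one by one
    if price > 108:
        print(f"alert: {symbol} spiked to {price}")
    if i >= 50:                                         # stop the demo after 50 events
        break
```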
Data Science
Data science is a collection of techniques used to extract value from data.
Data science techniques rely on finding useful patterns, connections, and
relationships within data.
Data science is also commonly referred to as knowledge discovery,
machine learning, predictive analytics, and data mining.
The use of the term science in data science indicates that the methods are
evidence-based and built on empirical knowledge, more specifically
historical observations.
Artificial intelligence, Machine learning, and data science are all related to
each other.
Artificial intelligence is about giving machines the capability of mimicking
human behaviour. Examples include facial recognition and automated
driving.
Machine learning, which can be considered either a sub-field of or one of the
tools of artificial intelligence, provides machines with the capability of
learning from experience. Experience for machines comes in the form of
data. Data that is used to teach machines is called training data.
Machine learning algorithms, also called “learners”, take both the known
input and output (training data) to figure out a model for the program
which converts input to output.
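• As a minimal sketch (not from the slides), the example below shows a learner in scikit-learn, a library assumed to be installed and introduced later in this module, being given known inputs and outputs and figuring out the model that maps one to the other; the numbers are invented.

```python
from sklearn.linear_model import LinearRegression

X_train = [[1], [2], [3], [4]]       # known input: years of experience
y_train = [30, 35, 41, 46]           # known output: salary in thousands

model = LinearRegression().fit(X_train, y_train)   # "learning from experience"
print(model.predict([[5]]))                        # apply the learned model to new input
```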
What is Data Science
Data science is the business application of machine learning, artificial
intelligence, and other quantitative fields like statistics, visualization, and
mathematics.
In the context of how data science is used today, it relies heavily on
machine learning and is sometimes called data mining.
Examples of data science use cases are: recommendation engines that
can recommend movies for a particular user, a fraud alert model that
detects fraudulent credit card transactions, or a model that predicts
revenue for the next quarter.
Data science starts with data, which can range from a simple array of a
few numeric observations to a complex matrix of millions of
observations with thousands of variables. Data science utilizes certain
specialized computational methods in order to discover meaningful and
useful structures within a dataset.
The data science process
Setting the research goal
• Data science is mostly applied in the context of an organization.
• When the business asks you to perform a data science project, you’ll
first prepare a project charter.
• This charter contains information such as what you’re going to
research, how the company benefits from that, what data and
resources you need, a timetable, and deliverables.
• The data science process will be applied to bigger case studies and
you’ll get an idea of different possible research goals.
Retrieving data
• The second step is to collect data.
• You’ve stated in the project charter which data you need and where you
can find it.
• In this step you ensure that you can use the data in your program, which
means checking the existence of, quality, and access to the data.
• Data can also be delivered by third-party companies and takes many
forms ranging from Excel spreadsheets to different types of databases.
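• A hedged sketch of the retrieval step with pandas (assumed installed, along with an Excel reader); the file, database, and table names are placeholders, not real project assets.

```python
import sqlite3
import pandas as pd

sales = pd.read_excel("sales_2024.xlsx")                  # third-party Excel delivery
conn = sqlite3.connect("warehouse.db")                    # an internal database
customers = pd.read_sql("SELECT * FROM customers", conn)

# Quick existence/quality checks before moving on.
print(sales.shape, customers.shape)
print(sales.isna().sum())                                 # how many values are missing?
```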
Data preparation
• Data collection is an error-prone process;
• in this phase you enhance the quality of the data and prepare it for
use in subsequent steps.
• This phase consists of three sub phases: data cleaning removes false
values from a data source and inconsistencies across data sources,
data integration enriches data sources by combining information from
multiple data sources, and data transformation ensures that the data
is in a suitable format for use in your models.
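• A compact sketch of the three sub phases with pandas; the column names and values are invented for illustration.

```python
import numpy as np
import pandas as pd

orders = pd.DataFrame({"customer_id": [1, 1, 2], "amount": [100, 100, -5]})
regions = pd.DataFrame({"customer_id": [1, 2], "region": ["North", "South"]})

clean = orders.drop_duplicates()                       # cleansing: remove duplicates
clean = clean[clean["amount"] >= 0]                    # cleansing: drop impossible values
merged = clean.merge(regions, on="customer_id")        # integration: combine sources
merged["log_amount"] = np.log1p(merged["amount"])      # transformation: rescale for modeling
print(merged)
```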
Data exploration
• Data exploration is concerned with building a deeper understanding
of your data.
• You try to understand how variables interact with each other, the
distribution of the data, and whether there are outliers.
• To achieve this you mainly use descriptive statistics, visual techniques,
and simple modeling.
• This step often goes by the abbreviation EDA, for Exploratory Data
Analysis.
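• A minimal EDA sketch with pandas and matplotlib (both assumed installed); the dataset is synthetic.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"age": np.random.normal(35, 8, 500),
                   "income": np.random.normal(40_000, 9_000, 500)})

print(df.describe())                 # descriptive statistics: mean, spread, quartiles
print(df.corr())                     # how do the variables interact?
df.hist(bins=30)                     # distributions and possible outliers
plt.show()
```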
Data modeling or model building
• In this phase you use models, domain knowledge, and insights about
the data you found in the previous steps to answer the research
question.
• You select a technique from the fields of statistics, machine learning,
operations research, and so on. Building a model is an iterative
process that involves selecting the variables for the model, executing
the model, and model diagnostics.
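• A brief sketch of the build-and-diagnose loop using scikit-learn's cross-validation on one of its bundled toy datasets (an illustration, not the course data).

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

for candidate in (LinearRegression(), Ridge(alpha=1.0)):
    scores = cross_val_score(candidate, X, y, cv=5)        # model diagnostics
    print(type(candidate).__name__, scores.mean().round(3))
# Based on the scores you revise the variables or the technique and try again.
```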
Presentation and automation
• Finally, you present the results to your business.
• These results can take many forms, ranging from presentations to
research reports.
• Sometimes you’ll need to automate the execution of the process
because the business will want to use the insights you gained in
another project or enable an operational process to use the outcome
from your model.
AN ITERATIVE PROCESS
• The previous description of the data science process gives you the
impression that you walk through this process in a linear way, but in
reality you often have to step back and rework certain findings.
• For instance, you might find outliers in the data exploration phase
that point to data import errors.
• As part of the data science process you gain incremental insights,
which may lead to new questions.
• To prevent rework, make sure that you scope the business question
clearly and thoroughly at the start.
The big data ecosystem and
data science
The big data ecosystem can be grouped into technologies that have similar goals
and functionalities.
• Distributed file systems
• Distributed programming framework
• Data integration framework
• Machine learning frameworks
• NoSQL databases
• Scheduling tools
• Benchmarking tools
• System deployment
• Service programming
• Security
Distributed file systems
• A distributed file system is similar to a normal file system, except that it
runs on multiple servers at once.
• Because it’s a file system, you can do almost all the same things you’d do
on a normal file system. Actions such as storing, reading, and deleting files
and adding security to files are at the core of every file system, including
the distributed one.
• Distributed file systems have significant advantages:
 They can store files larger than any one computer disk.
 Files get automatically replicated across multiple servers for redundancy or parallel operations, while hiding the complexity of doing so from the user.
 The system scales easily: you’re no longer bound by the memory or storage restrictions of a single server.
• In the past, scale was increased by moving everything to a server with
more memory, storage, and a better CPU (vertical scaling). Nowadays
you can add another small server (horizontal scaling). This principle
makes the scaling potential virtually limitless.
• The best-known distributed file system at this moment is the Hadoop
Distributed File System (HDFS). It is an open source implementation of
the Google File System.
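• A hedged sketch of everyday file operations on HDFS using the third-party hdfs (HdfsCLI) Python package; the NameNode address, user, and paths are placeholders for illustration only.

```python
from hdfs import InsecureClient

client = InsecureClient("http://namenode:9870", user="hadoop")

client.makedirs("/data/raw")                       # same ideas as a normal file system:
client.upload("/data/raw/logs.csv", "logs.csv")    # store a file (replicated for you)
print(client.list("/data/raw"))                    # list a directory
with client.read("/data/raw/logs.csv") as reader:  # read it back
    print(reader.read(100))
```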
Distributed programming framework
• Once you have the data stored on the distributed file system, you want to
exploit it.
• One important aspect of working on a distributed hard disk is that you
won’t move your data to your program, but rather you’ll move your
program to the data.
• When you start from scratch with a normal general-purpose programming
language such as C, Python, or Java, you need to deal with the complexities
that come with distributed programming, such as restarting jobs that have
failed, tracking the results from the different sub processes, and so on.
• Luckily, the open source community has developed many frameworks to
handle this for you, and these give you a much better experience working
with distributed data and dealing with many of the challenges it carries.
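• A toy, single-machine sketch of the map/reduce style these frameworks distribute for you; restarting failed jobs, shuffling intermediate results, and tracking sub processes are exactly what the real frameworks handle behind the scenes.

```python
from collections import defaultdict

documents = ["big data is big", "data science uses big data"]

# Map: each document is turned into (word, 1) pairs independently, so the work
# can run in parallel close to where each block of data lives.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Reduce: pairs with the same key are combined into a final count.
counts = defaultdict(int)
for word, one in mapped:
    counts[word] += one

print(dict(counts))    # {'big': 3, 'data': 3, 'is': 1, 'science': 1, 'uses': 1}
```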
Data integration framework
• Once you have a distributed file system in place, you need to add
data.
• You need to move data from one source to another, and this is where
the data integration frameworks such as Apache Sqoop and Apache
Flume excel. The process is similar to an extract, transform, and load
process in a traditional data warehouse.
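• A minimal extract-transform-load sketch in plain Python/pandas; in a big data setting, tools such as Sqoop or Flume play this role at much larger scale. The file, column, and table names are placeholders.

```python
import sqlite3
import pandas as pd

raw = pd.read_csv("web_events.csv")                       # extract from a source system
raw["event_time"] = pd.to_datetime(raw["event_time"])     # transform into a clean format
daily = raw.groupby(raw["event_time"].dt.date).size().reset_index(name="events")

with sqlite3.connect("warehouse.db") as conn:             # load into the target store
    daily.to_sql("daily_events", conn, if_exists="replace", index=False)
```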
Machine learning frameworks
• When you have the data in place, it’s time to extract the coveted
insights. This is where you rely on the fields of machine learning,
statistics, and applied mathematics.
• With the enormous amount of data available nowadays, one
computer can no longer handle the workload by itself.
• One of the biggest issues with the old algorithms is that they don’t
scale well. With the amount of data we need to analyze today, this
becomes problematic, and specialized frameworks and libraries are
required to deal with this amount of data.
• The most popular machine-learning library for Python is Scikit-learn.
It’s a great machine-learning toolbox
There are, of course, other Python libraries:
• PyBrain for neural networks—Neural networks are learning algorithms that
mimic the human brain in learning mechanics and complexity.
• NLTK or Natural Language Toolkit—As the name suggests, its focus is
working with natural language.
• Pylearn2—Another machine learning toolbox but a bit less mature than
Scikit-learn.
• TensorFlow—A Python library for deep learning provided by Google.
Spark is a new Apache-licensed machine-learning engine, specializing in
real-time machine learning. It’s worth taking a look at, and you can
read more about it at http://spark.apache.org/.
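• A tiny example with Scikit-learn, the library named above, training a classifier on one of its bundled toy datasets (an illustration, not a course dataset).

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)
print(clf.score(X_test, y_test))        # accuracy on data the model has not seen
```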
NoSQL databases
• If you need to store huge amounts of data, you require software that’s
specialized in managing and querying this data.
• By solving several of the problems of traditional databases, NoSQL
databases allow for a virtually endless growth of data. The “No” in this
context stands for “Not Only.”
• Many different types of databases have arisen, but they can be
categorized into the following types:
• Column databases, Document stores, Streaming data, Key-value stores,
SQL on Hadoop, New SQL, Graph databases.
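• A schematic illustration (plain Python, no real database) of why document stores are flexible: each “document” can carry its own fields, unlike a fixed-schema SQL table. The records are invented.

```python
customers = [
    {"_id": 1, "name": "Asha", "email": "[email protected]"},
    {"_id": 2, "name": "Ben", "phones": ["555-0101", "555-0102"], "loyalty_tier": "gold"},
]

# A query is just a filter over documents; real document stores index this at scale.
gold = [doc for doc in customers if doc.get("loyalty_tier") == "gold"]
print(gold)
```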
Scheduling tools
• Scheduling tools help you automate repetitive tasks and trigger jobs
based on events such as adding a new file to a folder.
• These are similar to tools such as CRON on Linux but are specifically
developed for big data.
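• A toy, standard-library-only sketch of the idea: trigger a job whenever a new file lands in a folder. Big data schedulers do this (and much more) across whole clusters; the folder name is a placeholder.

```python
import os
import time

WATCH_DIR = "incoming"          # placeholder folder to watch
seen = set(os.listdir(WATCH_DIR))

while True:
    current = set(os.listdir(WATCH_DIR))
    for new_file in current - seen:
        print(f"new file detected, starting job for {new_file}")
    seen = current
    time.sleep(5)               # poll every five seconds
```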
Benchmarking tools
• This class of tools was developed to optimize your big data installation by
providing standardized profiling suites.
• A profiling suite is taken from a representative set of big data jobs.
• Using an optimized infrastructure can make a big cost difference.
System deployment
• Setting up a big data infrastructure and assisting engineers in deploying
new applications into the big data cluster is where system deployment
tools shine.
• They largely automate the installation and configuration of big data
components.
Service programming
• Service tools excel here by exposing big data applications to other
applications as a service.
• Data scientists sometimes need to expose their models through services. The
best-known example is the REST service; REST stands for representational
state transfer. It’s often used to feed websites with data.
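• A hedged sketch of exposing a model as a REST service with Flask (an assumed dependency); the route, payload format, and scoring rule are invented stand-ins for a real model.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()                              # e.g. {"amount": 120.5}
    score = 1.0 if features.get("amount", 0) > 100 else 0.0    # stand-in for a real model
    return jsonify({"fraud_score": score})

if __name__ == "__main__":
    app.run(port=5000)
```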
Security
• Big data security tools allow you to have central and fine-grained control
over access to the data.
• Big data security has become a topic in its own right, and data scientists
are usually only confronted with it as data consumers; seldom will they
implement the security themselves.
Introduction to Hadoop
• Hadoop is an open-source software framework from Apache that is used for
storing, processing, and analyzing very large amounts of data in a distributed
computing environment.
• Its framework is based on Java programming with some native code in C and
shell scripts.
• It is designed to handle big data and is based on the MapReduce programming
model, which allows for the parallel processing of large datasets.
• It is being used by Facebook, Yahoo, Google, Twitter, LinkedIn and many more.
• The Apache Hadoop software library is a framework that allows for the
distributed processing of large data sets across clusters of computers
using simple programming models. It is designed to scale up from single
servers to thousands of machines, each offering local computation and
storage.
Hadoop has two main components:
• HDFS (Hadoop Distributed File System): This is the storage component of
Hadoop, which allows for the storage of large amounts of data across multiple
machines. It is designed to work with commodity hardware, which makes it
cost-effective.
• YARN (Yet Another Resource Negotiator): This is the resource management
component of Hadoop, which manages the allocation of resources (such as
CPU and memory) for processing the data stored in HDFS.
• Hadoop also includes several additional modules that provide additional
functionality, such as Hive (a SQL-like query language), Pig (a high-level
platform for creating MapReduce programs), and HBase (a non-relational,
distributed database).
• Hadoop is commonly used in big data scenarios such as data warehousing,
business intelligence, and machine learning. It’s also used for data processing,
data analysis, and data mining.
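• A hedged sketch of Hadoop's MapReduce model in the Hadoop Streaming style, where the mapper and reducer are plain Python scripts reading stdin and writing stdout; they are shown here in one file with a --reduce flag for brevity, and the actual streaming-jar invocation would depend on your cluster.

```python
import sys

def mapper():
    for line in sys.stdin:                       # each mapper gets a slice of the input
        for word in line.split():
            print(f"{word}\t1")                  # emit (word, 1) pairs

def reducer():
    current, total = None, 0
    for line in sys.stdin:                       # Hadoop delivers pairs sorted by key
        word, count = line.rsplit("\t", 1)
        if word != current and current is not None:
            print(f"{current}\t{total}")         # flush the finished key
            total = 0
        current = word
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")             # flush the last key

if __name__ == "__main__":
    reducer() if "--reduce" in sys.argv else mapper()
```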
