Big data notes:
Module 2:
Lec 1: What launched the big data era?
As per the McKinsey report (2013), big data is a torrent: users are generating more and more data day by
day, and the demand for on-demand computing (cloud computing) is also increasing; together these
launched the big data era. Google or Facebook (one of the two) is said to have started it around 2004.
What makes big data valuable?
Question: Describe examples from fields where big data has enabled better models that allow for
higher-precision recommendations or solutions to make the world a better place.
Answer: big data enables better models, which give higher precision.
- Personalized marketing.
- Business development.
- Hearing the voice of each consumer; for example, Walmart uses it for personalized customer
service.
- Better marketing campaigns; for example, recommendation engines on Amazon, Netflix, etc.
- Sentiment analysis of events, customers, or products through reviews (see the sketch after this list).
- Mobile advertising: GPS enables real-time, location-based advertising.
- Collective consumer behavior: using global consumer behavior for product growth.
- Biomedical applications: genome data editing, etc.
- Personalized cancer treatment.
- Smart cities.
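For intuition, a minimal sketch of lexicon-based sentiment scoring over reviews in plain Python; the word lists and example reviews are made up for illustration, and real systems use far richer models:

import re

# toy sentiment lexicons; real lexicons contain thousands of scored words
POSITIVE = {"great", "love", "excellent", "good", "fast"}
NEGATIVE = {"bad", "slow", "terrible", "hate", "broken"}

def sentiment_score(review: str) -> int:
    # tokenize: lowercase and keep alphabetic words only
    words = re.findall(r"[a-z]+", review.lower())
    # positive hits minus negative hits
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

reviews = [
    "Great product, love the fast delivery",
    "Terrible quality, arrived broken",
]
for r in reviews:
    label = "positive" if sentiment_score(r) > 0 else "negative"
    print(f"{label}: {r}")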
Saving lives with big data: wildfire analysis.
Question: Give examples of sensor-, organization-, and people-generated data used in wildfire
analytics.
Break it up into two parameters: prediction and response.
Where does big data come from?
Origin of big data:
- Not new.
- Generated by machines, people, and organizations.
- The Large Hadron Collider generates 40 TB of data every second.
- Organizations: transactions, etc.
How machine-generated data is useful:
Question: Understand how machine-generated big data is being used to enable real-time actions,
and identify what is needed to start creating a big data strategy that includes machine-generated
data.
Answer:
Human-generated data (for example, notes, texts, images, audio, and videos) is unstructured and
does not fit pre-defined data models, so it must be cleaned and shaped to fit a particular
predefined data model.
Still, companies make use of it with technologies like Hadoop, Storm, Spark, and NoSQL to process
unstructured data.
- Hadoop is an open-source big data framework designed to support enormous amounts of data
in a distributed computing environment.
- Real-time data, called high-velocity data, is handled by Apache Storm and Apache Spark (see the sketch below).
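A minimal PySpark sketch of reading unstructured text, doing a basic cleaning pass, and counting words in parallel; this assumes PySpark is installed, and the input file notes.txt is hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("unstructured-text-demo").getOrCreate()

# read raw, unstructured text (e.g., human-generated notes) into an RDD
lines = spark.sparkContext.textFile("notes.txt")  # hypothetical input file

# basic cleaning: lowercase, split into words, drop empty tokens
words = (lines.flatMap(lambda line: line.lower().split())
              .filter(lambda w: w != ""))

# count word frequencies in parallel across the cluster
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

print(counts.take(10))
spark.stop()

The same word-count logic also runs on a Hadoop cluster via MapReduce; Spark keeps intermediate data in memory, which is why it is a common choice for high-velocity data.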
Module 4:
Getting value out of big data is teamwork, where people with different domains of expertise sit
together to draw insights from data. Many insights can be drawn from the same piece of
data. An insight, or let's say a prediction, is derived using data as empirical evidence, including
near-real-time data.
Building a big data strategy: a plan of action or policy designed to achieve an overall aim.
- Strategy: aim, policy, plan, action.
5 P's of data science: people, purpose, process, platforms, programmability.
Purpose: The purpose refers to the challenge or set of challenges defined by your big
data strategy. The purpose can be related to a scientific analysis with a hypothesis or a
business metric that needs to be analyzed, often based on big data.
People: Data scientists are often seen as people who possess skills in a variety of
topics, including: science or business domain knowledge; analysis using statistics,
machine learning, and mathematical knowledge; and data management, programming, and
computing. In practice, this is generally a group of researchers composed of people with
complementary skills.
Process: Since there is a predefined team with a purpose, a great place for this team to
start is a process they could iterate on. We can simply say: people with a purpose will
define a process to collaborate and communicate around! The process of data science
includes techniques for statistics, machine learning, programming, computing, and data
management. A process is conceptual in the beginning and defines the coarse set of steps
and how everyone can contribute to it. Note that similar reusable processes can be
applicable to many applications with different purposes when employed within different
workflows. Data science workflows combine such steps in executable graphs. We believe
that process-oriented thinking is a transformative way of conducting data science to
connect people and techniques to applications. Execution of such a data science process
requires access to many datasets, Big and small, bringing new opportunities and
challenges to Data Science. There are many Data Science steps or tasks, such as Data
Collection, Data Cleaning, Data Processing/Analysis, Result Visualization, resulting in a
Data Science Workflow. Data science processes may need user interaction and other
manual operations, or be fully automated. Challenges for the data science process include:
1) how to easily integrate all needed tasks to build such a process; and 2) how to find the best
computing resources and efficiently schedule process executions to the resources based
on process definition, parameter settings, and user preferences.
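As a toy illustration of wiring such steps into an executable workflow (not the course's own code; every function here is a made-up stand-in):

def collect():
    # stand-in for data collection (e.g., reading files or calling an API)
    return ["  3 ", "7", None, "5"]

def clean(raw):
    # data cleaning: drop missing values, normalize types
    return [int(x.strip()) for x in raw if x is not None]

def analyze(values):
    # trivial processing/analysis: mean of the cleaned values
    return sum(values) / len(values)

def report(result):
    # stand-in for result visualization/reporting
    print(f"mean = {result:.2f}")

# the workflow is just the steps wired into an executable sequence
report(analyze(clean(collect())))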
Platforms: Based on the needs of an application-driven purpose and the amount of data
and computing required to perform this application, different computing and data
platforms can be used as a part of the data science process. This scalability should be
made part of any data science solution architecture.
Programmability: Capturing a scalable data science process requires aid from
programming languages, e.g., R, and patterns, e.g., MapReduce. Tools that provide
access to such programming techniques are key to making the data science process
programmable on a variety of platforms.
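Since MapReduce is named as a pattern, here is a minimal single-machine sketch of its map and reduce steps in plain Python; a real deployment (e.g., Hadoop) would distribute these steps across a cluster:

from functools import reduce
from collections import Counter

documents = ["big data needs big tools", "data science is team work"]

# map step: each document independently emits its own word counts
mapped = [Counter(doc.split()) for doc in documents]

# reduce step: partial counts from all mappers are merged
totals = reduce(lambda a, b: a + b, mapped, Counter())
print(totals.most_common(3))

Because the map step needs no coordination between documents, it parallelizes naturally; only the reduce step merges results.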
To summarize, data science can be defined as a craft of using the five pieces identified above.
Having a process between the more business-driven P's (people and purpose) and the more
technically driven P's (platforms and programmability) leads to a streamlined approach that starts
and ends with a defined business value, team accountability, and collaboration in mind.
Asking the right questions:
- Define the problem.
- Assess the current situation.
Data analysis process steps: acquire – prepare – analyze – report – act.
1) Acquiring data: