Basic principles of Big Data
M1. BASIC CONCEPTS OF BIG DATA
Every day we hear the term Big Data in the media, as well as
in the communications of all companies.
Big Data is clearly a key concept in today's economy and in our society.
Next, we will explain how technology has evolved, what the
motivations for the new technologies have been, and the phases
of a project that uses Big Data technology.
Welcome
We are facing an era of change that no previous generation has experienced in
terms of economic and social effects. A large part of this change that we are experiencing
is driven by globalization and the digitalization of the economy. Companies are in
a race towards digitalization, with a focus on being able to compete in an economy
that is increasingly digital.
Introduction
Background of Big Data
We are going to review how technology has evolved over the last 40 years.
For all those who are millennials or centennials, it will be like a history class.
Technologies and timeline of Big Data
Let's explain how Big Data technology was born. Like much of the technology
we currently use, both tools and services, it has its origins in
Google.
Google kept making progress in the creation of inverted indexes,
specifically through two technologies: the Google File System and the
Map/Reduce paradigm.
Below, in the attached document, you can see the chronology of the Big Data
ecosystem.
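To make the Map/Reduce paradigm more concrete, here is a minimal single-machine sketch in Python that builds a small inverted index (a mapping from each word to the documents that contain it). It only imitates the map, shuffle and reduce steps inside one process; it is not Google's implementation, and the function names and sample documents are illustrative assumptions.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Map: emit one (word, doc_id) pair for every word in a document."""
    return [(word.lower(), doc_id) for word in text.split()]


def shuffle(pairs):
    """Shuffle: group all emitted document ids by their key (the word)."""
    groups = defaultdict(list)
    for word, doc_id in pairs:
        groups[word].append(doc_id)
    return groups


def reduce_phase(word, doc_ids):
    """Reduce: collapse the grouped ids into a sorted list of unique documents."""
    return word, sorted(set(doc_ids))


def build_inverted_index(documents):
    """Run map, shuffle and reduce in sequence over a dict of documents."""
    pairs = []
    for doc_id, text in documents.items():
        pairs.extend(map_phase(doc_id, text))
    return dict(reduce_phase(word, ids) for word, ids in shuffle(pairs).items())


# Hypothetical sample documents, for illustration only.
documents = {
    "doc1": "big data is big",
    "doc2": "data lake ingestion",
}
index = build_inverted_index(documents)
print(index["data"])
```

In a real cluster, the map and reduce steps would run in parallel on different machines, with the input and output stored in a distributed file system.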
Phases of a Big Data project
Let's define the stages or phases of a Big Data project. When we talk about Big Data,
we are actually talking about an ecosystem of tools that allow us to tackle the
different phases of a Big Data project.
The 4 V's of Big Data: Volume and Variety
We are going to finish this module talking about the 4V's.
When we talk about Big Data, it is inevitable to talk about the Vs that characterize it.
They define the characteristics or conditions that a project must meet before
we can say that it is truly Big Data.
The four Vs are as follows:
Volume
Variety
Velocity (speed)
Value
In the following video, we will learn about Volume and Variety.
The 4 V's of Big Data: Velocity and Value
In the following video, we will learn about Velocity and Value.
Test M1
The phases of a Big Data project are...
Storage, Identification, Data Processing, Data Lake Ingestion, Visualization.
Identification, Data Lake Ingestion, Storage, Data Processing, Visualization.
Identification
Visualization.
Storage, Data Lake Ingestion, Identification, Data Processing, Visualization.
Why is it important to verify the accuracy of data?
Untrue information causes serious distortion, since it would result in a product
that does not meet expectations.
All of the above.
If we use unreliable sources, we may end up with biases in the data analysis that
can lead us to make incorrect decisions.
A larger volume of data helps us draw more accurate conclusions. But it is not only
the quantity that matters: the quality of the data is also needed to ensure a reliable result.
Where can we find unstructured data?
Web page.
The result of the multiple choice questionnaires.
Videos.
HTML.
What does velocity (speed) refer to as a characteristic of Big Data?
Speed refers to the ability to know information at the speed at which
it is generated.
The possibility of processing data in real time and obtaining real-time information.
All of the above.
Speed refers to the ability to handle and process data within the time in which
it is valid, to keep the product updated and thus get the most out of it.
Where can we find semi-structured data?
Spreadsheet.
Texts.
Web page.
The result of the multiple choice questionnaires.
Who created a project in Apache called Hadoop, which is an implementation
of the Map/Reduce paradigm?
Mark Zuckerberg.
Matei Zaharia.
Doug Cutting.
Jeff Dean.
When we refer to an inferential data study, we talk about...
A study that aims to find and establish connections between the data.
We look for correlations, linearity, and relationships between the variables.
A study that tries to explain what happens to one variable when another is changed.
A study aimed at testing theories, which is greatly affected by the data sample,
since we only have part of the data and must deal with its uncertainty. It is the
objective of statistical models.
A study in which the data scientist relies on the data they have to predict the
future.
The phases of a Big Data project are...
Four.
Six.
Three.
Five.
What is distributed programming?
We talk about distributed programming whenever there are several programmers
working on a common project.
Distributed programming refers to open-source tools that are
openly available and can be used at no cost.
The use of different machines that work together to provide a solution to a problem.
For the processing of Big Data, different machines collaborate in an established order.
Distributed programming is only discussed when there is no more than one machine working
at the same time.
What is Big Data?
Big Data refers to the knowledge that the data contains.
When we talk about Big Data, we refer to a volume of data that can be managed
with tools like Excel spreadsheets.
The concept of Big Data refers to a massive amount of
structured, semi-structured, and unstructured data that has the potential to be
mined for information.
Big Data is another word for Artificial Intelligence.
M2. STRATEGY AND DATA GOVERNANCE
Data is the fundamental element of information. It is the basis of
knowledge and, by extension, the support for many decisions.
Data is also known as facts or events. It consists of symbolic representations,
such as text, numbers, images, or audio and video recordings, that can be
stored and even processed.
Data may be inaccurate, incomplete, outdated, or even
incomprehensible, which in practice is a problem, since in these cases its value is
greatly reduced.
Concepts: Data and objectives of Data Governance
The purpose of Data Governance is to control and optimize the data
the company holds so that it can be understood and used. This means that
the data is defined, there are people responsible for it, its quality is known, and clear
rules exist to manage it.
Approaches to the definition of Data Strategy
Now more than ever, the ability to manage a large volume and diversity of
information is critical for the survival of businesses. The mere act of handling such
an amount of data leads to problems that often require complex
solutions, such as ensuring data uniqueness, ensuring its quality, safeguarding
its accessibility, and taking care of its security.
Defensive-Offensive Data Strategy and balance between both
Let's see what the Defensive-Offensive data strategy consists of and the balance that is
struck between the two to achieve business objectives.
Principles and roles in Data Governance
Data Governance is defined as a set of principles, policies,
procedures, tools, roles and responsibilities aimed at improving
data quality and consistency and at achieving greater and better
data availability.
In this way, companies can meet their information needs for
management, reporting, and decision-making.
Establishing suitable data governance gives a clear view of the
data: who owns it, what it is used for, how it can be
managed, and how value can be extracted from it.
For data to have value, it must be available to and understandable by the different
users, and reliable for decision-making.
It is estimated that only 15% of the data stored by companies is valuable.
Data governance is a task for the entire organization. To a greater or lesser extent,
everyone performs some function on the data.
It is important to distinguish between the main roles and their associated responsibilities,
and to identify who within the company will perform each function. Refer to the attached
document to learn more.
Functions and tools of Data Governance
The main problem of Data Governance is that it encompasses a wide range of functions.
To make things easier, it is usually broken down into subdisciplines that can be prioritized
according to their alignment with the organization's goals and the economic impact they
represent.
To carry out the Data Governance and Data Management functions, it is necessary to have
tools that allow the different defined processes to be managed and executed
automatically.
Discover Artificial Intelligence
Artificial intelligence may be the most disruptive technology we have known so
far, and represents one of the greatest milestones of our time.
After decades of development, artificial intelligence has left the university and the
laboratory, and it has gradually seeped into various areas of our lives:
our mobile phones, our cars, our banks, and even the way
we listen to music.
Applications of Artificial Intelligence
Next, we will learn about the different uses of artificial intelligence in the
main sectors.
Test M2
Data governance is a task of...
The entire organization.
CDO.
CEO.
Data Owner.
Who coined the term Artificial Intelligence?
Doug Cutting.
Andreas Kaplan.
John McCarthy.
Alan Turing.
What is the order of the data processing pyramid, from the base to the top?
Data, Information, Knowledge, Wisdom.
Knowledge, Wisdom, Data, Information.
Wisdom, Knowledge, Data, Information.
Information, Data, Knowledge, Wisdom.
In which years did the first advances that can be considered AI take place?
In the 1930s.
In the 1840s.
In the 1940s.
In the 1960s.
Many of the tools in the DAMA framework are Open Source. That means that...
They are open and anyone can use them, but they do not allow access to their
programming code.
They are open and anyone can use them. They also allow access to
their programming code.
Once the license is paid, they offer unlimited use of the tool.
They comply with transparency standards and break down their sources.
When do we talk about Knowledge in the data processing pyramid?
We talk about Knowledge when we internalize what we have learned and put it into
practice.
We talk about Knowledge whenever we have stored data without having
processed it.
We talk about Knowledge when we give meaning to a fact or piece of data, and it
incorporates a formal definition that allows it to be standardized and organized.
We talk about Knowledge when we add a perspective, a hypothesis or an
interpretation of the information we have about its meaning.
Which tools does data governance use to manage its processes?
Data dictionary.
Data library.
Dashboard.
All of the above.
What is not a function of data governance?
Data architecture.
Data creation.
Operations management.
Security management.
The Dartmouth Conference marked the birth of Artificial Intelligence.
What year did it occur?
1950.
1957.
1956.
1946.
When do we talk about Information in the data processing pyramid?
We talk about Information when we internalize what we have learned and put it into
practice.
We talk about Information whenever we have stored data without having
processed it.
We talk about Information when we give meaning to a fact or piece of data, and it
incorporates a formal definition that allows it to be standardized and organized.
We talk about Information when we add a perspective, a hypothesis or an
interpretation of the meaning of data.
M3. USE CASES AND VISUALIZATION STRATEGIES
In this module, we learn about use cases from real companies where Big Data is applied,
and the most important strategies for visualizing the information
collected.
Introduction to use cases
In this video, we will see practical cases in which Big Data is used.
Many companies make use of this technology in their daily operations. We are going to
look at them so that we become aware that Big Data is part of our day-to-day lives.
Applied example 'House of Cards'
Netflix is an American company that provides a service for which a user
(customer) can register by paying a fixed monthly fee, gaining unlimited access to
all the content on the platform, mainly movies and series.
Its data processing infrastructure is very mature and innovative, based on
Amazon technology for both storage and processing, with heavy use
of Spark.
Before starting with the case of 'House of Cards', let's describe the data that
Netflix can potentially use for later analysis and to feed back into its
systems.
What static data is obtained?
Customer data: age, gender, country and city of residence.
Data about movies and series, among others:
Producer.
Main actors and actresses.
Supporting actors and actresses.
Genre: comedy, intrigue, love.
Screenwriters.
What dynamic data is stored?
For each client:
What they have watched: genre, actors.
Ratings: how customers rate the content they have viewed (4 million ratings daily).
When they have paused playback.
When they have fast-forwarded or rewound.
Which content they watch on which day of the week.
Viewing date.
Viewing time.
When they have abandoned a piece of content.
Content searches that each customer has performed.
How they navigate the website, and the time spent on each piece of content.
Whether they clicked for more information.
Whether they watched the trailer.
When the credits start.
Recommendation systems and active listening: 'Amazon', 'Netflix' and 'Walmart'
Amazon
Recommendation systems are a widely used personalization tool
and are very effective.
Amazon is another company that makes extensive use of data and machine
learning. 35% of its sales come from its recommendation system.
It has various recommendation systems; in this section we will describe two of them.
Netflix
Netflix has one of the most powerful recommendation systems in existence. 70% of
the views on Netflix come from its recommendation system.
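The collaborative approach behind this kind of recommendation system can be sketched in two steps: first find the users who have watched the same content, then estimate from their ratings the rating a given user would give to something they have not yet seen. The following is a minimal illustration under that assumption, not Netflix's actual algorithm; the function name `predict_rating`, the sample users, and the titles are all hypothetical.

```python
def predict_rating(ratings, user, title):
    """Estimate the rating `user` would give `title` from users with shared history."""
    seen = set(ratings[user])
    weighted_sum = 0.0
    total_weight = 0
    for other, other_ratings in ratings.items():
        # Only other users who have actually rated the title can contribute.
        if other == user or title not in other_ratings:
            continue
        # Step 1: a user is a neighbour if they watched some of the same content.
        overlap = len(seen & set(other_ratings))
        if overlap == 0:
            continue
        # Step 2: weight each neighbour's rating by the amount of shared history.
        weighted_sum += overlap * other_ratings[title]
        total_weight += overlap
    # With no neighbours there is nothing to base a prediction on.
    return weighted_sum / total_weight if total_weight else None


# Hypothetical ratings on a 1-5 scale, for illustration only.
ratings = {
    "ana": {"Show A": 5, "Show B": 4},
    "ben": {"Show A": 4, "Show B": 5, "Show C": 2},
    "carla": {"Show A": 5, "Show C": 4},
    "dan": {"Show D": 1, "Show C": 5},
}
print(predict_rating(ratings, "ana", "Show C"))
```

Here the prediction for "ana" relies only on "ben" and "carla", because "dan" shares no viewing history with her. Production systems typically use richer similarity measures and far more signals than this sketch.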
Walmart
Walmart is also one of the companies that has invested most heavily in
Big Data technology.
Its results faithfully reflect this strategic decision: it uses Big Data
extensively for price analysis, warehouse stock optimization, and staff
selection and retention.
In the video, we will see how it uses social media to optimize the product lines
available in each store and provide better service to its customers.
Retail sector: Investment funds and price optimization
Investment funds also use Big Data with the intention of obtaining
information ahead of the competition, thus gaining an advantage when making
decisions about buying or selling a certain stock.
Another area where Big Data is used is in price optimization.
Let's see in the video how it applies in these two sectors.
How do we obtain the information
We are more than used to seeing presentations and, in many cases, giving them. Often
we are struck by how clearly some presentations convey
their message: they are efficient.
In this video we are going to see the keys to visualization so that from now on
you can also make efficient presentations.
Phases of a visualization process
Let's look at the steps that must be followed to create an effective visualization:
The objective of visualization.
The audience.
The data.
Data types
Let's get to know the different types of data we may have as the basis for a correct
visualization.
Properties and elements of a visualization
A visual element is any attribute that makes up a visualization, such as colors,
bars, shapes, and other resources that we can use to build it.
Let's see in the video the most commonly used elements in visualizations and whether or not they have the
property of natural order.
Graphics, colors, and text: recommended use
Let's look at the most commonly used graphs in daily life and when their use is recommended.
Lastly, let's learn some more about colors and text in a
visualization.
You no longer have any excuse not to create effective visualizations. Put them into practice to
achieve great results!
Test M3
When we talk about types of data that cannot be measured with a numerical
value, and we refer to non-numerical categories that do not allow an order,
we talk about...
Categorical qualitative variable.
Discrete quantitative variable.
Nominal qualitative variable.
Continuous quantitative variable.
In the phases of a visualization process, when we talk about a type of data that
takes isolated values and does not allow intermediate values between two specific values,
we refer to...
Categorical qualitative variable.
Discrete quantitative variable.
Continuous quantitative variable.
Nominal qualitative variable.
Which characteristics or conditions of the audience do not influence the decision of how
to visualize the data?
Their experience.
Their sector.
Their last promotion.
Their position.
How do Netflix and similar companies use Big Data?
They use user data to decide what types of content to create in order to increase
the satisfaction of users, who increasingly find content that they like on the
platform.
They use user data to set the monthly price based on the users'
socioeconomic level.
They use users' data to conduct surveys and thus increase knowledge about
them.
All of the above.
What does Business Intelligence not describe?
It captures information from the organization's available sources and, after applying
analysis algorithms, displays it in order to assist in strategic decision-making
in the company.
It processes mainly unstructured information, such as natural language or social networks.
All the analyzed and visualized information comes from structured data sources.
Managers use it mainly to turn their companies into effective and
efficient organizations.
What percentage of the information that our brain processes is captured through
sight?
70%
80%
40%
55%
How does the collaborative recommendation system work?
Big Data analysis tools are extensively used to analyze the
prices that each user would be willing to pay for the products, and the system recommends those
products to them.
The system geolocates and analyzes content on social media and cross-references this information
with the products available in their stores. If people are talking about a product that is not
available, they include it in the catalog.
First, all users who have viewed the same content are identified. Second,
an algorithm calculates, based on the ratings of other users, the rating that the user would give
to that content.
The system analyzes whether a user has viewed or purchased a product. The system understands that
all those users who also viewed or purchased the same product are similar.
That is why it recommends to you other products that those users have viewed.
In a company, there are 25 female workers and 25 male workers. What type of variable do the
workers represent?
Continuous quantitative variable.
Nominal qualitative variable.
Discrete quantitative variable.
Categorical qualitative variable.
The use of uppercase letters in the text can...
Suggest anger.
None of the above.
Make it easier to read.
Distract the focus.
Which of these visualization elements does not have a natural order property?
Shapes.
Text labels.
Length.
Brightness.