DATA MINING
APPLIED TO ELECTRIC POWER GRIDS
Dr.-Ing. Jaime Cepeda
Big Data Concepts
MAY 2023
1. Introduction
Data, data, data, …
Information
Gartner's Hype Cycle for Emerging Technologies Maps the Journey to Digital Business
▪ Google trends
https://trends.google.com/trends/explore?date=all&q=big%20data
Any data analysis project is being labeled Big Data simply because it deals with a lot of data.
(www.stratebi.com, 2017)
2. What Is Not Big Data?
▪ It is not just a huge database.
▪ It is not a huge data warehouse.
▪ It is not a new form of Business Intelligence.
▪ It is not moving the database to the cloud.
▪ It is not just social media analysis.
▪ It is not just Hadoop.
3. What Is Big Data?
Big Data is data that exceeds the processing
capacity of conventional database systems.
The data is too big, moves too fast, or
doesn’t fit the structures of your database
architectures. To gain value from this data,
you must choose an alternative way to
process it.
(Edd Dumbill, analyst at O’Reilly Media)
Big Data is high-volume, high-velocity and/or
high-variety information assets that demand
cost-effective, innovative forms of information
processing that enable enhanced insight,
decision making, and process automation.
(Gartner IT Glossary)
Big Data
Data sets whose size (volume), complexity (variability), and growth rate (velocity) make them difficult to capture, manage, process, or analyze with conventional technologies and tools, such as relational databases, conventional statistics, or visualization packages, within the time needed for them to be useful.
Collect - Store - Search - Share - Analyze - Visualize - Process
4. Dimensions of Big Data
1.- Volume
2.- Velocity
3.- Variety
4.- Variability
5.- Veracity
6.- Visualization
7.- Value
(https://goo.gl/rRrLWA)
Velocity
Volume
Variety
Do all dimensions apply to all projects?
Buyya, R., Calheiros, R. N., & Dastjerdi, A. V. (Eds.). (2016). Big data: principles and paradigms. Morgan Kaufmann.
5. Case Studies
Big data and sports
https://goo.gl/cXBKo6
https://www.youtube.com/watch?v=DXq30dvE0Xg
Big data from social networks to predict citizen behavior
https://goo.gl/FGhofB
https://www.youtube.com/watch?v=yoSqojO2-CQ
Big data applied to politics
https://www.youtube.com/watch?v=ku78zo9fhoI
6. Big Data Process
The Big Data process
Acquisition - Storage - Indexing - Analysis - Visualization - Decision making
Big Data ecosystem (layers, top to bottom)
Visualization / Business Intelligence | Statistical analysis | Machine learning
Big Data platform | Databases (SQL, NoSQL) | Data mart
ETL: Extraction, Transformation, Load
PMUs | RTUs | Web | Social networks | Images | Audio
Acquisition
Logstash: a log management tool used to collect, transform, and forward logs to a NoSQL-type data store.
Python scripts: Python can connect to many information sources (Twitter, Facebook, etc.); scripts are used to prepare the input data.
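As a minimal sketch of the kind of preparation script mentioned above, the following Python snippet normalizes raw records into JSON lines ready to be forwarded to a data store. The field names (`station`, `timestamp`, `voltage_kv`), the semicolon-separated layout, and the sample records are hypothetical, chosen only for illustration; a real script would read from the actual source API or log files.

```python
import json
from datetime import datetime, timezone

# Hypothetical raw records, as they might arrive from a CSV export or a log line.
RAW_RECORDS = [
    "S1;2023-05-01T10:00:00;229.8",
    "S2;2023-05-01T10:00:00;",        # missing measurement
    "S1;2023-05-01T10:00:01;230.1",
]

def parse_record(line):
    """Turn one 'station;timestamp;voltage' line into a dict, or None if incomplete."""
    station, stamp, value = line.split(";")
    if not value:
        return None  # drop records with no measurement
    # Assumption: source timestamps are in UTC.
    when = datetime.fromisoformat(stamp).replace(tzinfo=timezone.utc)
    return {"station": station, "timestamp": when.isoformat(), "voltage_kv": float(value)}

def prepare(lines):
    """Return the clean records as JSON strings, one per record."""
    records = (parse_record(line) for line in lines)
    return [json.dumps(r) for r in records if r is not None]

json_lines = prepare(RAW_RECORDS)
for line in json_lines:
    print(line)
```

Incomplete records are dropped here for simplicity; depending on the analysis, flagging or imputing them may be preferable.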
Storage
Neo4j: a highly transactional graph database used to find relationships between entities.
Couchbase: a distributed, highly scalable NoSQL server that stores large amounts of information in JSON format.
Hadoop (HDFS): an open-source framework for the distributed storage and processing of large volumes of information using a programming model called MapReduce.
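The MapReduce model can be illustrated without a Hadoop cluster. The sketch below runs the three phases (map, shuffle, reduce) in plain Python on a single machine to find the peak load per feeder. The feeder names and readings are invented for illustration, and the function names are not part of any Hadoop API.

```python
from collections import defaultdict

# Hypothetical input: (feeder, load in MW) readings, split into two "blocks"
# in the way a distributed file system splits a large file.
BLOCKS = [
    [("F1", 4.2), ("F2", 3.1), ("F1", 5.0)],
    [("F2", 3.9), ("F3", 1.2), ("F1", 4.7)],
]

def map_phase(block):
    """Map: emit one (key, value) pair per input record."""
    return [(feeder, load) for feeder, load in block]

def shuffle_phase(mapped_blocks):
    """Shuffle: group all values that share a key."""
    groups = defaultdict(list)
    for pairs in mapped_blocks:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: collapse each key's values into one result (here, the maximum)."""
    return {key: max(values) for key, values in groups.items()}

# In a real cluster each block would be mapped by a separate task in parallel.
mapped = [map_phase(block) for block in BLOCKS]
peak_load = reduce_phase(shuffle_phase(mapped))
print(peak_load)
```

Because each block is mapped independently and each key is reduced independently, both phases can be spread over many machines, which is what makes the model suitable for large volumes.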
Indexing
Elasticsearch: a search server that indexes large amounts of data in a distributed manner.
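Search servers of this kind are typically built around an inverted index, which maps each term to the documents that contain it. The snippet below is a single-machine sketch of that idea in plain Python. It does not use or imitate the Elasticsearch API, and the sample documents are hypothetical.

```python
from collections import defaultdict

# Hypothetical documents keyed by id.
DOCS = {
    1: "voltage sag detected at substation north",
    2: "breaker trip at substation south",
    3: "voltage recovered at substation north",
}

def build_index(docs):
    """Map each lowercase term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every term of the query (AND search)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

index = build_index(DOCS)
print(sorted(search(index, "voltage north")))
```

A lookup touches only the entries for the query terms rather than scanning every document, which is why indexing pays off as the collection grows.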
Visualization and analysis
Kibana: a visualization tool that displays the indexed data coming from Elasticsearch.
7. Miscellaneous Concepts and Terminology
Concepts
Cluster
A cluster is a collection of one or more nodes (servers) that together hold your entire data and provide federated indexing and search
capabilities across all nodes. A cluster is identified by a unique name. This name is important because a node can only be part of a cluster if
the node is set up to join the cluster by its name.
Node
A node is a single server that is part of your cluster, stores your data, and participates in the cluster’s indexing and search capabilities. Just like a
cluster, a node is identified by a name which by default is a random Universally Unique IDentifier (UUID) that is assigned to the node at startup.
You can define any node name you want if you do not want the default. This name is important for administration purposes where you want to
identify which servers in your network correspond to which nodes in your cluster.
Index
An index is a collection of documents that have somewhat similar characteristics. For example, you can have an index for customer data,
another index for a product catalog, and yet another index for order data. An index is identified by a name and this name is used to refer to the
index when performing indexing, search, update, and delete operations against the documents in it.
JSON
▪ https://jsonformatter.curiousconcept.com/
▪ http://www.json.org/
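For illustration, a JSON document can be produced and parsed with Python's standard `json` module. The measurement fields below are hypothetical and only show how nested objects and arrays map to Python dictionaries and lists.

```python
import json

# Hypothetical measurement record with a nested object and an array.
measurement = {
    "device": "PMU-01",
    "timestamp": "2023-05-01T10:00:00Z",
    "phasor": {"magnitude_kv": 230.1, "angle_deg": -12.5},
    "flags": ["synchronized", "valid"],
}

text = json.dumps(measurement, indent=2)   # Python dict -> JSON text
restored = json.loads(text)                # JSON text -> Python dict

print(text)
print(restored["phasor"]["angle_deg"])
```

The round trip returns an equal dictionary, which is what makes JSON convenient as an exchange format between acquisition scripts and document stores.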
Data Science & Data Engineering
Data Engineer
Data Scientist
https://www.datacamp.com/community/blog/data-scientist-vs-data-engineer
8. Data Mining
“Data Mining” constitutes a young and promising area of mathematics whose
objective is to allow the “knowledge discovery from data” (KDD).
In general terms, data mining refers to “extracting or mining knowledge from large
amounts of data (i.e. big data)”. This knowledge is obtained via the determination or
extraction of patterns immersed in the data (i.e. pattern recognition).
Databases and Data Warehouse → Data Mining → Patterns → Knowledge
Data mining tasks can be classified into two categories: descriptive and predictive.
Descriptive mining tools characterize the general properties of the data in the
database, whereas predictive mining techniques carry out inference on the current
data in order to make predictions (based on learning). Both types of data mining tools
can be applied in different instances, such as: numerosity reduction, dimensionality
reduction, signal processing, clustering, classification, regression.
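The difference between the two categories can be sketched on a toy data set. In the plain-Python example below (hypothetical temperature and load values), the descriptive step summarizes the data already collected, while the predictive step fits a least-squares line and uses it to estimate an unseen value. Simple linear regression stands in here for the predictive techniques listed above; it is an illustrative choice, not a recommendation.

```python
from statistics import mean

# Hypothetical observations: ambient temperature (deg C) and feeder load (MW).
temperature = [18.0, 20.0, 22.0, 24.0, 26.0]
load = [3.0, 3.4, 3.8, 4.2, 4.6]

# Descriptive: characterize the data that is already in the database.
summary = {"mean_load": mean(load), "min_load": min(load), "max_load": max(load)}

# Predictive: learn a model (least-squares line) and infer a new value.
t_bar, l_bar = mean(temperature), mean(load)
slope = sum((t - t_bar) * (l - l_bar) for t, l in zip(temperature, load)) / sum(
    (t - t_bar) ** 2 for t in temperature
)
intercept = l_bar - slope * t_bar

def predict(t):
    """Estimate the load for a temperature not present in the data."""
    return intercept + slope * t

print(summary)
print(round(predict(28.0), 2))
```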
Most data mining techniques are designed to analyze multivariate data (i.e., a
data set consisting of a large number of interrelated variables). Such a data set is usually
arranged as a data matrix X of size n × p, where n is the number of observations and p
is the number of variables.
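As a small illustration of this layout, the sketch below stores a data matrix as a list of rows in plain Python, with n = 4 observations and p = 3 variables, and standardizes each column, a common preparation step before multivariate analysis. The variable names and values are hypothetical.

```python
from statistics import mean, pstdev

# Hypothetical data matrix X: each row is an observation, each column a variable
# (voltage in kV, current in A, frequency in Hz).
X = [
    [229.8, 410.0, 59.98],
    [230.1, 395.0, 60.01],
    [230.4, 402.0, 60.00],
    [229.7, 421.0, 59.97],
]

n = len(X)       # number of observations (rows)
p = len(X[0])    # number of variables (columns)

columns = list(zip(*X))                  # transpose: one tuple per variable
means = [mean(col) for col in columns]
stds = [pstdev(col) for col in columns]  # population standard deviation per column

# Standardize: subtract the column mean and divide by the column standard deviation.
Z = [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in X]

print(n, p)
print([round(mean(col), 6) for col in zip(*Z)])
```

After standardization every column has mean 0 and standard deviation 1, so variables measured in different units contribute on a comparable scale.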
Data mining is an interdisciplinary field that blends statistics,
artificial intelligence, and management science and information systems for
pattern recognition, mathematical modeling, and database activities.
Figure: Data Mining and Data Science
Figure: diagram relating Artificial Intelligence, Machine Learning, Deep Learning, Data Science, Business Intelligence, Small Data, and Big Data
Levels of data science
Data mining & Big Data & Smart Grid
Smart Grid