Terabit Metadata extraction application solution for MNO
2023
Challenges faced by MNOs in Managing and Analyzing Large
Volumes of Data
● Scale: Exponential growth of mobile networks leads to massive volumes of metadata, reaching
terabit levels.
● Storage: Storing vast amounts of metadata requires significant infrastructure investments. MNOs
face challenges in finding cost-effective, scalable solutions.
● Processing Speed: Real-time analysis of metadata is crucial for network optimization, security, and
customer insights. Traditional processing methods often fall short in handling the speed and scale of
modern network metadata.
● Compliance and Privacy: MNOs must comply with strict data privacy regulations. Ensuring
compliance adds complexity and requires robust security measures to protect user information.
● Lack of Insights: Despite abundant metadata, extracting actionable insights is challenging. MNOs
struggle to derive meaningful patterns, trends, and correlations from the vast data pool, limiting
informed decision-making.
● Resource Constraints: MNOs face limitations in skilled personnel, budget, and technological
capabilities. These constraints hinder the effective management and analysis of large metadata
volumes.
Managing data retention in large-scale systems cost-effectively
Cubro’s innovative solution can minimize data volume for enhanced efficiency.
Cubro’s solutions can reduce the quantity of data generated and stored for network
monitoring, security and analytics use cases while retaining all the required data – this
approach can reduce the capacity and footprint required of network tools and associated
data lakes. It provides significant benefits in performance, CAPEX and OPEX, and power and
space consumption. This document describes the associated challenges and solutions.
Overview and Advantages of Cubro’s Solution
1:) The first step is to produce the metadata in a very efficient manner. The Cubro solution is far more efficient than common monitoring approaches.
● The metadata volume of an IPFIX (flow-based) solution is about 3% of the raw input.
● The Cubro solution offers the same metadata quality at only 0.1% of the raw input.
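To see what these ratios mean in practice, the daily metadata volume for a given raw input rate can be sketched with simple arithmetic (the 1 Tbps input rate here is an illustrative assumption):

```python
# Hypothetical sizing sketch: daily metadata volume for a 1 Tbps tap,
# using the ratios above (IPFIX ~3%, Cubro ~0.1% of the raw input).
RAW_TBPS = 1.0               # assumed raw input rate in terabit/s
SECONDS_PER_DAY = 86_400

raw_tb_per_day = RAW_TBPS * SECONDS_PER_DAY / 8   # terabit/s -> TByte/day

ipfix_tb_per_day = raw_tb_per_day * 0.03          # IPFIX metadata, 3%
cubro_tb_per_day = raw_tb_per_day * 0.001         # Cubro metadata, 0.1%

print(f"raw:   {raw_tb_per_day:,.0f} TB/day")     # 10,800 TB/day
print(f"IPFIX: {ipfix_tb_per_day:,.1f} TB/day")   # 324.0 TB/day
print(f"Cubro: {cubro_tb_per_day:,.1f} TB/day")   # 10.8 TB/day
```

At this scale, the difference between 3% and 0.1% is the difference between hundreds of terabytes and around ten terabytes of metadata per day.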
2:) The second step is aggregation, filtering and enrichment at the beginning of the processing chain.
● This is important to reduce the data volume and the processing effort along the chain.
● This is why proper use case planning is so important – it makes the solution faster and more cost-efficient, and increases the ROI.
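The idea of reducing volume at the very start of the chain can be sketched as follows (a minimal illustration with a hypothetical record schema; the app names and byte counts are made up for the example):

```python
# Minimal sketch (hypothetical record schema): filtering and aggregation at
# the very start of the chain, so every later stage handles fewer records.
from collections import defaultdict

RELEVANT_APPS = {"Netflix", "YouTube", "Whatsapp"}   # assumed use-case filter

def reduce_flows(records):
    """Drop irrelevant apps, then aggregate bytes per (subscriber, app)."""
    totals = defaultdict(int)
    for rec in records:
        if rec["app"] in RELEVANT_APPS:              # filter early
            totals[(rec["subscriber"], rec["app"])] += rec["bytes"]
    return dict(totals)

flows = [
    {"subscriber": 2325670310, "app": "Netflix", "bytes": 1200},
    {"subscriber": 2325670310, "app": "Netflix", "bytes": 800},
    {"subscriber": 2325645540, "app": "Xbox",    "bytes": 900},  # filtered out
]
print(reduce_flows(flows))  # {(2325670310, 'Netflix'): 2000}
```

Three input records collapse into one output record; at terabit scale, the same pattern is what keeps downstream tools and data lakes small.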
3:) Data lake usage versus database.
● Data lake: a data lake is a centralized repository that stores large volumes of structured, semi-structured, and unstructured data in its raw, unprocessed form. It is designed to store vast amounts of data from various sources, such as transactional systems, social media feeds, log files, and sensor data.
● Database: a database is a structured collection of data organized and managed to support efficient data storage, retrieval, and manipulation. Databases provide a structured way to store and organize data, ensuring data integrity and consistency.
The data lake gives more options for designing use cases because the original idea of a data lake is to collect data from multiple sources and then produce data tables which can be consumed by different applications.
Omnia QM Solution
Cubro Omnic (NPU) cards in a server
[Diagram]
● User plane: raw packet input of up to 40 – 85 Gbit/s per smart NIC, depending on the traffic type
● Signalling traffic: enrichment data input from different sources
● Enrichment and data processing on the Intel CPUs
● Efficient metadata generation
● Enriched metadata delivered to different outputs
1–2 Tbps user plane monitoring in an MNO (real picture)
[Photo]
● NPB EXA64100
● Signalling probe
● Efficient metadata generation: 3 servers with 24 NPU smart NICs
● 3 Kafka servers for metadata handling
● Entire rack footprint: only 19 U (836 mm)
Aggregation, filtering and enrichment at the beginning of the processing chain!
System schema for 1 – 1.2 TB of user traffic
The amount of data challenge
Number of subscribers in Mio.   0,5    1      2      3      4      5      6       7       8       9       10
CDR's per sec                   8.333  16.667 33.333 50.000 66.667 83.333 100.000 116.667 133.333 150.000 166.667
Volume per Hour in GByte        300    600    1200   1800   2400   3000   3600    4200    4800    5400    6000
Volume per Day in TByte         7,031  14,063 28,125 42,188 56,250 70,313 84,375  98,438  112,500 126,563 140,625
Volume per Week in PByte        0,048  0,096  0,192  0,288  0,385  0,481  0,577   0,673   0,769   0,865   0,961
Volume per Month in PByte       0,213  0,426  0,851  1,277  1,703  2,129  2,554   2,980   3,406   3,831   4,257
Volume per 3 Months in PByte    0,639  1,277  2,554  3,831  5,109  6,386  7,663   8,940   10,217  11,494  12,772
This table shows the raw traffic volume versus the number of subscribers. It shows that only a data lake
approach can handle this volume of data. This approach also gives the flexibility to develop use cases.
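The figures in the table can be reproduced with simple arithmetic (a sketch; binary units are assumed, a month is taken as 31 days, and the one-CDR-per-subscriber-per-minute rate is inferred from the table, not stated in it):

```python
# Sketch reproducing the sizing table: derive daily/weekly/monthly volumes
# from an hourly raw volume, plus the implied CDR rate per subscriber.
def volumes(gb_per_hour: float) -> dict:
    tb_day = gb_per_hour * 24 / 1024        # GByte/hour -> TByte/day
    pb_week = tb_day * 7 / 1024             # TByte/day  -> PByte/week
    pb_month = tb_day * 31 / 1024           # assumes ~31 days per month
    return {"TB/day": tb_day, "PB/week": pb_week, "PB/month": pb_month}

def cdrs_per_sec(subscribers: float) -> float:
    return subscribers / 60                 # ~1 CDR per subscriber per minute

print(volumes(300))          # first column: ~7.031 TB/day, ~0.048 PB/week
print(cdrs_per_sec(0.5e6))   # ~8333 CDRs/sec for 0.5 million subscribers
```

Running this for the first column (300 GByte/hour, 0,5 Mio. subscribers) reproduces the table's 7,031 TByte/day, 0,048 PByte/week and 8.333 CDRs per second.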
The challenge is to define the use case first
● Modify the metadata in a way that produces more efficient CDRs.
● Produce Kafka topics in a way that they can easily be consumed in the data lake.
● Filter and remove data which is not relevant for the applied use cases.
● Aggregate data over time.
● Use compression algorithms to compress data which is not relevant for indexing.
● Limit the time the data is kept.
● The data lake has the biggest impact on optimizing the solution.
The definition of the use cases is a major factor in making such a solution efficient in terms of performance and cost.
It must be a closed loop. All elements in the chain can support optimizing the data volume.
Use case examples

Proactive Service Outage Alerts: for this use case, only a slim table is needed, with just the total amount of traffic per application.

  Xbox        2,2 Gbit/sec
  Whatsapp    0,5 Gbit/sec
  Office 365  0 Gbit/sec
  Netflix     7,1 Gbit/sec
  YouTube     6,2 Gbit/sec

Misuse of Service: for this use case, a table is needed with only the total amount of traffic per subscriber.

  2325670310  2,2 TByte/day
  2325545673  1,4 TByte/day
  2325645540  8,2 TByte/day
  2325234784  0,2 TByte/day
  2325635143  3,2 TByte/day

Reselling of geolocation data: for this use case, different feeds are needed to build the table.
1:) Subscriber base station attachment over time (CDR-based information):

  2325670310  Cell ID 55
  2325670310  Cell ID 55
  2325645540  Cell ID 55
  2325645540  Cell ID 56
  2325635143  Cell ID 58

2:) A LOG/MAP file from the service provider, which builds the correlation between the cell ID and the geographical data (static log information):

  Cell ID 55  Lat,Long 48.2030915,16.2076345
  Cell ID 56  Lat,Long 48.1802423,16.26807372
  Cell ID 57  Lat,Long 48.1614862149,16.28135
  Cell ID 58  Lat,Long xxxxx
  Cell ID 59  Lat,Long xxxxx

Resulting consumable table:

  2325670310  Lat,Long 48.2030915,16.2076345
  2325645540  Lat,Long 48.2030915,16.2076345
  2325645540  Lat,Long 48.1802423,16.26807372

The data lake approach is that these tables are preprocessed: if an application wants to consume the data, only reading these files is needed, and no search or processing is required. Once all tables for a use case are produced, the original data can be deleted. This helps to keep the storage volume small.
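The join that produces the consumable geolocation table can be sketched as follows, using the example values from this slide (a minimal illustration, not the actual pipeline implementation):

```python
# Sketch: build the consumable geolocation table by joining the subscriber
# attachment feed (CDR-based) with the cell-ID-to-coordinates log file.
attachments = [          # subscriber base station attachment over time
    (2325670310, 55),
    (2325645540, 55),
    (2325645540, 56),
]
cell_geo = {             # LOG/MAP file: cell ID -> (lat, long)
    55: (48.2030915, 16.2076345),
    56: (48.1802423, 16.26807372),
}

# The consumable table: one (subscriber, lat, long) row per attachment.
consumable = [
    (sub, *cell_geo[cell]) for sub, cell in attachments if cell in cell_geo
]
for row in consumable:
    print(row)
```

Because this join runs once during preprocessing, any application reading the consumable table later needs no lookup or correlation logic of its own.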
The full data solution
[Diagram]
● Data consumers are typically 3rd-party tools which consume the data out of the object storage, e.g. Proactive Service Outage Alerts, Reselling of geolocation data, Misuse of Service.
● The data lake is a software solution which reads the raw data and produces consumable data tables, based on the use case definition.
Data Lake design
@Cubro Confidential
Advantages of a data lake compared to a DB
● All reports and results are preprocessed.
● This means access is very fast.
● However, it costs more storage and CPU resources than a DB approach.
List of use cases

Service related
● Availability of a service in the entire network
● Availability of a service separated by RAT
● Availability of a service per base station
● Service distribution in the entire network
● Service distribution per RAT
● Service distribution per network region

Subscriber related
● Number of subscribers online
● Movement of a subscriber
● How many subscribers are located per base station
● How many subscribers per sector or region

Performance related
● Performance per subscriber (bandwidth and volume)
● Performance per base station (bandwidth and volume)
● Performance per data center
● Performance per network segment
● Performance per RAT

Security related
● Subscribers with significant SIM card changes
● Subscribers with constantly high load
● Subscribers with a high load of suspicious applications (e.g. only VPN or TOR)

Geolocation related
● Geolocation of a subscriber
● Movement of a subscriber
● How many subscribers are located per base station
● How many subscribers per sector or region

The combination of the use cases also helps to troubleshoot customer calls to the call center. See next page.
There are many more possible use cases, but they cannot be disclosed here. Call for more information.
A service-related call flow (Xbox as an example)
THANK YOU

US & North America: americas@[Link]
APAC: jl@[Link]
Japan: japan@[Link]
EMEA: sales@[Link]