Introduction to Big Data
Instructor: Li Yang
What’s big data?
• Term for data sets so large and complex that traditional data processing
and storage techniques fail
• Large Volume: (MB, GB, TB, PB)
• Large Variety: (online data: web, photo, video, social; offline data: sensor
data,…)
• Varied Velocity: (periodic, realtime)
• Veracity: (quality of data: usually poor for big data)
Why big data?
• Advances in technology create the need to store and process huge amounts of data
• Web and Super (cloud) computing
• Traditional data processing techniques (RDBMSs) fail:
• Fit for numeric, well-structured, clean (no missing) data
• Scaling requires high costs (expensive hardware)
• Fault tolerance (ability to recover from hardware failure) is also expensive
• Traditional data processing techniques can’t scale to fit big data without massive code
development
Evolution of big data techniques
• Hadoop
• HDFS
• MapReduce
• Spark: designed to run on top of Hadoop (upgrade)
• User-friendly
• Efficient: up to 100 times faster in memory and 10 times faster on disk than MapReduce
• Combines SQL, Streaming, and other complicated analytics
• Runs ‘everywhere’ (not only on Hadoop but also Mesos, …)
Big data analytics: comparison of Hadoop MapReduce and Apache Spark, 2016
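To make the MapReduce idea concrete, here is a minimal single-machine sketch of the map, shuffle, and reduce phases, using word counting as the task. It is only an illustration in plain Python: it does not use the Hadoop or Spark APIs, and the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are assumptions chosen for this example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce step: sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines):
    """Run map -> shuffle -> reduce over an iterable of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return reduce_phase(shuffle_phase(mapped))


if __name__ == "__main__":
    sample = ["big data is big", "data is everywhere"]
    print(word_count(sample))
```

In a real cluster the map and reduce steps run in parallel on many machines and the shuffle moves data over the network; here everything runs in one process so the data flow is easy to follow.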
Big Data Examples
• Walmart (more details later)
• Data mining: discover consumers’ purchase patterns
• Hadoop and NoSQL techniques
• Uber
• Machine learning: predict the demand everywhere and set the local price
• Netflix (more details later)
• Machine learning: cater to each consumer’s preferences (recommendation engine)
• Hadoop, SQL, Cassandra: online on-demand video streaming data
• eBay
• Requirement: rapid analysis of streaming data and quick action on it
• Apache Spark, Storm, Kafka
• Procter & Gamble
• Marketing, product development, supply chain
• Hadoop
Example of Big Data: Walmart
How did big data analysis help increase Walmart’s sales turnover?
• Walmart is an American multinational retail corporation that operates a chain of hypermarkets,
discount department stores, and grocery stores in the United States, headquartered in
Bentonville, Arkansas. (Wikipedia)
• Walmart ranks #1 in the Fortune 500 in 2021.
Walmart had a banner 2020, with U.S. e-commerce sales up 79% as pandemic-weary customers
consolidated shopping trips to fewer retailers and took advantage of the big-box giant’s
strong curbside pickup offering. Its Sam’s Club and international businesses also boomed for
similar reasons.
Walmart Data Source 1: consumers
Walmart tracks and targets every consumer individually
• Walmart gathers information on what customers buy, where they live, and which products
they like through in-store Wi-Fi
• Walmart collects every clickable action on Walmart.com, as well as what consumers buy
in-store and online
• Walmart also pays attention to local news, social network trends, and even local weather
Walmart Data Source 2: employees and itself
Walmart tracks every employee
• Walmart collects the online retailers’ information
• Walmart gathers the employees’ information to optimize its own organization and improve
efficiency
Example of Big Data: Walmart
• Summary: American multinational retail giant Walmart collects 2.5
petabytes of unstructured data from 1 million customers every hour
Usage of Big Data by Walmart
• Launching new products
• Design popular products that catch trends (e.g., Christmas products)
• Better Predictive Analysis
• Demand
• Pricing
• Logistics
• Customized Recommendations
• Tailored coupons
• Tailored advertising
Big Data Analytic Solutions
• Social Media Big Data Solutions
• Social Media Data is unstructured, informal and generally ungrammatical
• A large part of Walmart’s data-driven decisions is based on social media data (Facebook comments, Pinterest pins,
Twitter tweets, LinkedIn shares, …)
1. Social Genome: developed by WalmartLabs; analyzes social network data to better understand the context of its users
2. Shopycat, a gift recommendation engine at Walmart: app developed by Walmart; helps consumers buy ideal
gifts for their friends during the holiday rush; also gives detailed reference information for the recommendations
3. Inventory management at Walmart: helps managers optimize product storage and decide how many cashiers
and self-checkouts should be open
• Mobile Big Data Solutions
• More than half of Walmart’s customers use smartphones
• Walmart’s mobile application: a shopping list that tells customers where the items they want are located and helps them by
providing discounts; the app’s geofencing feature senses whenever a user enters a Walmart store in the US.
• https://www.forbes.com/sites/bernardmarr/2017/08/29/how-walmart-is-using-machine-learning-ai-iot-and-big-data-to-boost-retail-performance/?sh=68bd71496cb1
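One simple way a geofencing check like the one described for the mobile app could work is to compare the phone's distance from a store against a radius. The sketch below uses the haversine great-circle formula. The store coordinates, the 150 m radius, and the function names are hypothetical values chosen for illustration; nothing here is Walmart's actual implementation.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def inside_geofence(user, store, radius_m=150.0):
    """True if the user's (lat, lon) lies within radius_m of the store's (lat, lon)."""
    return haversine_m(*user, *store) <= radius_m


if __name__ == "__main__":
    store = (36.3729, -94.2088)       # hypothetical store location
    nearby_user = (36.3730, -94.2089)  # a few meters away
    distant_user = (36.4000, -94.2088)  # roughly 3 km north
    print(inside_geofence(nearby_user, store))
    print(inside_geofence(distant_user, store))
```

A production app would typically rely on the phone operating system's geofencing service rather than polling coordinates itself; the distance test above only shows the underlying geometry.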
Example of Big Data: Netflix
Netflix Recommender System: A Big Data Case Study, Kasula, 2020
• Netflix is an American over-the-top content platform and production company headquartered in Los
Gatos, CA. The company's primary business is a subscription-based streaming service offering
online streaming from a library of films and television series, including those produced in-house.
(by Wikipedia)
• Its main source of income is users’ subscription fees. Users can stream any of a wide range of
movies and TV shows at any time on a variety of internet-connected devices.
• Netflix’s primary asset is its technology, especially its recommendation system. Recommender
systems are a subclass of information filtering systems (Recommender system, 2020).
• Most recommender systems study users through their history. There are two primary
approaches: collaborative filtering and content-based filtering.
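As an illustration of the collaborative filtering approach only (content-based filtering is not shown), the sketch below predicts a missing rating as a similarity-weighted average of other users' ratings. The users, titles, and ratings are made up, and user-based cosine similarity is just one common variant; this is not a description of Netflix's production system.

```python
import math

# Hypothetical ratings (1-5 stars); a missing key means "not rated".
RATINGS = {
    "ann": {"A": 5, "B": 4, "D": 1},
    "bob": {"A": 4, "B": 5, "C": 5, "D": 1},
    "cat": {"A": 1, "B": 1, "C": 2, "D": 5},
}


def cosine_similarity(u, v):
    """Cosine similarity computed over the titles both users have rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[t] * v[t] for t in shared)
    norm_u = math.sqrt(sum(u[t] ** 2 for t in shared))
    norm_v = math.sqrt(sum(v[t] ** 2 for t in shared))
    return dot / (norm_u * norm_v)


def predict_rating(ratings, user, title):
    """Similarity-weighted average of other users' ratings for `title`."""
    num = den = 0.0
    for other, their in ratings.items():
        if other == user or title not in their:
            continue
        sim = cosine_similarity(ratings[user], their)
        num += sim * their[title]
        den += abs(sim)
    return num / den if den else None


if __name__ == "__main__":
    # ann has not rated title "C"; estimate it from bob and cat.
    print(round(predict_rating(RATINGS, "ann", "C"), 2))
```

Because ann's history resembles bob's far more than cat's, the estimate is pulled toward bob's rating of title "C".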
Big Data Source: Netflix
• Internal source of data:
• Billions of ratings from its members
• Stream-related data such as duration, time of playing, device type, day of the week, and other
context-related information
• The patterns and titles that subscribers add to their queues
• All the metadata related to a title in the catalog, such as director, actor, genre, rating, and reviews from different
platforms
• Search-related text entered by Netflix subscribers
• External source of data:
• box office information, performance and critic reviews
• demographics, culture, language, and other temporal data
Big Data Example: Netflix
• What does Netflix want from Big Data?
• Recommend the ‘next content’ to each user
• What is the ‘next content’ for each consumer?
• What are the big-data challenges for Netflix?
• Volume: approximately 105 TB of data for videos alone; 10,000 GB of rating data alone
• Velocity: collects data about the time of day, the types of devices content is watched on, and the duration of each watch
• Veracity: bias, noise, and abnormalities in the data; not all movies are rated equally by an individual
• Variety: most of the data is in a structured format, such as time of day, duration of watch, popularity, social data, search-related
information, stream-related data, etc. However, Netflix may also use unstructured data, for example the thumbnail pictures that
it uses for personalization.
Data Ecosystem: Netflix
Big Data Example: Netflix
• What advanced techniques are used for big data?
• Data Storage and preprocessing
• Hadoop, Cassandra, S3
• Machine learning
• Supervised learning: classification, regression
• Unsupervised learning: clustering, compression, dimension reduction
• Other techniques
• Matrix factorization
• Singular value decomposition
• Probabilistic graphical models
• Ensemble methods
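To show what matrix factorization means for ratings data, the sketch below learns two small factor matrices, `P` (users) and `Q` (titles), whose product approximates the observed ratings. It uses plain stochastic gradient descent with L2 regularization on a made-up 4x4 rating matrix. The hyperparameters (`rank`, `epochs`, `lr`, `reg`) and all names are assumptions for illustration, not settings taken from Netflix.

```python
import random

# Hypothetical ratings: rows are users, columns are titles, None = not rated.
RATINGS = [
    [5, 4, None, 1],
    [4, 5, 5, 1],
    [1, 1, 2, 5],
    [1, None, 1, 4],
]


def factorize(ratings, rank=2, epochs=3000, lr=0.01, reg=0.02, seed=0):
    """Learn P (users x rank) and Q (items x rank) so that P[u].Q[i] ~ ratings[u][i]."""
    rng = random.Random(seed)
    n_users, n_items = len(ratings), len(ratings[0])
    P = [[rng.uniform(0.0, 0.5) for _ in range(rank)] for _ in range(n_users)]
    Q = [[rng.uniform(0.0, 0.5) for _ in range(rank)] for _ in range(n_items)]
    observed = [(u, i, r) for u, row in enumerate(ratings)
                for i, r in enumerate(row) if r is not None]
    for _ in range(epochs):
        for u, i, r in observed:
            err = r - sum(P[u][k] * Q[i][k] for k in range(rank))
            for k in range(rank):
                p, q = P[u][k], Q[i][k]  # keep old values for a simultaneous update
                P[u][k] += lr * (err * q - reg * p)
                Q[i][k] += lr * (err * p - reg * q)
    return P, Q


def predict(P, Q, u, i):
    """Predicted rating of user u for item i: dot product of their factors."""
    return sum(p * q for p, q in zip(P[u], Q[i]))


def observed_rmse(ratings, P, Q):
    """Root mean squared error over the entries that were actually rated."""
    errs = [(r - predict(P, Q, u, i)) ** 2
            for u, row in enumerate(ratings)
            for i, r in enumerate(row) if r is not None]
    return (sum(errs) / len(errs)) ** 0.5


if __name__ == "__main__":
    P, Q = factorize(RATINGS)
    print("training RMSE:", observed_rmse(RATINGS, P, Q))
    print("estimate for user 0, title 2:", predict(P, Q, 0, 2))
```

The missing entries are never used in training; after fitting, `predict` fills them in, which is how a factorization model produces recommendations.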
Big Data Example: Netflix
• What results has Netflix obtained from big data?
• Overall user engagement with Netflix has increased with the help of the
recommender system, leading to lower cancellation rates and more streaming hours.
• Member satisfaction increased with the development and changes to the recommendation system.
• Personalization and recommendations save Netflix more than $1 billion per year.
• Examples:
• The winning algorithm predicted ratings more accurately, improving on ‘Cinematch’
by 10.06% (Netflix Prize, 2020).
• According to (Netflix Technology Blog, 2017b), Singular Value Decomposition
reduced the RMSE to 0.8914, whereas Restricted Boltzmann Machines reduced the
RMSE to 0.8990.
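The results above are stated in terms of RMSE (root mean squared error), the square root of the average squared difference between predicted and actual ratings; lower is better. The sketch below shows how RMSE and a percentage improvement over a baseline can be computed. The sample ratings and the baseline and new RMSE values are made-up numbers for illustration, not Netflix data.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equal-length rating sequences."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


def relative_improvement(baseline_rmse, new_rmse):
    """Percentage by which new_rmse improves on (is lower than) baseline_rmse."""
    return 100.0 * (baseline_rmse - new_rmse) / baseline_rmse


if __name__ == "__main__":
    actual = [4, 3, 5, 1]               # hypothetical true ratings
    predicted = [3.5, 3.0, 4.0, 2.0]    # hypothetical model output
    print("RMSE:", rmse(predicted, actual))
    # Hypothetical baseline and improved RMSE values.
    print("improvement (%):", relative_improvement(1.0, 0.9))
```

An improvement figure such as the one quoted for the prize-winning algorithm is a relative reduction of this kind, measured against the baseline system's RMSE on the same test data.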