Teaching Note - Big Data and Cloud Computing-Vaidik

The document discusses big data, its sources, types, and the 4 Vs (volume, velocity, variety, and veracity). It describes big data as large data that cannot be processed by traditional systems due to its size and complexity. Common sources are people through social media, machines generating data, and data from organizations. The types are structured, unstructured, and semi-structured. Big data solutions need to store and process huge volumes of data flexibly and at scale. The document also covers cloud computing models and services.

Uploaded by

nilesh.das22h
Copyright
© © All Rights Reserved

Contents

Big Data
Big Data and Its Various Aspects
Big Data Major Sources
Types of Data
4 Vs of Big Data
Summary
Big Data Industry Case Studies
Difference between veracity and validity
Conventional Data Processing System and Big Data
Job roles in Big Data
Big data engineers
Cloud Computing
Traditional Data Centers
Benefits of Colocation Data Centre
Cloud Computing
Benefits of Cloud Computing
Cloud-based Architecture
Cloud Deployment Models
1. Private Cloud
2. Public Cloud
3. Hybrid Cloud
4. Community Cloud
5. Multi Cloud Strategy
Types of Cloud Services
1. Infrastructure as a Service (IaaS)
2. Platform as a Service (PaaS)
3. Software as a Service (SaaS)

Big Data

Big Data and Its Various Aspects


As per Wikipedia, data refers to the “fact that some existing information or knowledge is represented or
coded in some form suitable for better usage or processing”.

Big Data shares the same definition as that of data, with the only difference that it is huge in size. Big Data
is not only massive, but it also has the potential to grow exponentially for an indefinite period. It can grow
even to the extent where it cannot be managed or processed by using traditional techniques such as
RDBMSs.

There is no benchmark concerning size to determine whether the given data is big data or not. One of the
definitions for big data is —

“If the data has expanded to such an extent that now a single computing system is unable to store and
process it, then we can call this data, big data.”

Today, access to the internet has become affordable, and Internet usage has grown widespread. Each
online activity, such as sending emails, booking movie tickets, posting on blogs, posting reviews on an e-
commerce portal, etc., generates massive volumes of data. This growing use of the Internet facilitates
data generation and easy access to the generated data. Hence, we have entered a world that has become
data-driven. Organisations and various industries are leveraging the power of big data to make important
decisions.

The following are some important aspects of big data. They are very similar to the 4 Vs associated with
big data, which are covered later.

 Volume: The size of the data is huge, i.e. in the range of terabytes or even more than that.

 Rate of change: The nature of the data changes because of the changes in transactions. There
could be multiple reasons supporting the changes in transactions, such as a change in business
logic or a change in requirements.

 Variety: Based on the form of data, it can be broadly divided into three categories:

o Structured: Data stored in a tabular format.

o Unstructured: Data that does not have a well-defined structure, e.g. videos and images.

o Semi-structured: Data that is partially structured or a combination of both structured and
unstructured data, e.g. email. Email is semi-structured because it has a well-defined
structure containing sender address, receiver addresses, subject, message body,
attachments, etc. But the content mentioned in the subject or message body is entirely
unstructured.
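
The email example above can be sketched in a few lines of Python. This is only an illustration: the addresses and message text are invented, and the helper name split_email is hypothetical. It uses the standard-library email package to separate the structured header fields from the free-text body.

```python
from email import message_from_string

# Invented sample message; the addresses and text are placeholders.
raw_email = """\
From: alice@example.com
To: bob@example.com
Subject: Trip report

Had a great time at the hill station. Photos coming soon!
"""

def split_email(raw):
    """Return (structured header fields, unstructured body text)."""
    msg = message_from_string(raw)
    # Header fields have fixed names, so they can be read like table columns.
    headers = {name: msg[name] for name in ("From", "To", "Subject")}
    # The body is free text with no predefined structure.
    body = msg.get_payload()
    return headers, body

headers, body = split_email(raw_email)
print(headers)
print(body)
```

The headers can be queried by field name, much like columns in a table, whereas the body would need text analysis before anything could be extracted from it.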

Big Data Major Sources


Today, almost all leading companies such as Amazon, Google, and Facebook store and process big data to
gain insights into several previously unknown facts and figures. Looking at the potential benefits of storing
and processing big data, organisations which were not previously into big data have started big data
processing for better decision-making.

Broadly, the sources of big data can be classified into three major categories:

People: Today, people are quite active on the Internet through social networking sites such as Facebook,
Twitter, and Instagram. On these platforms, they share a lot of information through posts; it could be a
valid opinion regarding a political issue or a post about their recent visit to a hill station. This data is termed
as ‘shared data’ — data that is shared by people. Even the user ratings for a movie or product can be
referred to as data generated by people.

Machine: Data that is generated by a machine/computer in a periodic manner, or at the occurrence of
some event, is termed as ‘machine-generated data’. Some common examples of data produced by
machines are from cell phone towers, RFID tag scanners and car sensors.

Organisation: This refers to the data that is generated by an organisation itself. This data mostly has a
well-defined structure. Some examples of data produced by organisations are internal sales data or
customers’ demographic information. This kind of data can be integrated with some external data to
generate useful insights. Archived data is also helpful in performing a comparative analysis with the latest
data and for historical analyses, which is useful for prediction of future trends.

Big data has applications across many industries, for example:

Retail: In retail, big data is used in clickstream analysis and sentiment analysis, which help in
understanding buying patterns, customer purchase intent and more.

Financial Services: Big data is used to detect and stop fraudulent transactions in financial service
companies. It is also used to design and modify predictive models for various investment strategies,
which institutions use to decide how much money to invest, which sectors to invest in, when to invest, etc.

Healthcare: In healthcare, predictive models can be generated for resource planning and distribution. Big
data is also used in research labs and therapy. Because health can be monitored remotely, patients may
not always need to visit a physician in person for primary consultations.

Manufacturing: In manufacturing, big data can be used to read and analyse data from sensors attached to
various machine parts. This helps in the maintenance of equipment and prevents the loss of machine hours.

Types of Data
The key requirements for big data systems are:

 The ability to store huge volumes of data


 The ability to process the huge volumes of stored data
 Flexibility and scalability to accommodate the growth in data

The solution proposed by Google to handle big data is as follows:

 Google File System: To store data in a distributed manner across multiple interconnected
computers.
 MapReduce: To run processes on data stored in a distributed manner across multiple
interconnected computers.
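
To make the MapReduce idea more concrete, here is a minimal single-process sketch of a word count. It is not Google's implementation: the function names are hypothetical, and each string in chunks merely stands in for a block of a file that would live on a different machine. The point is only to show the three stages: map, group by key, and reduce.

```python
from collections import defaultdict

def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    """Group the emitted values by key, as happens between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)

def word_count(chunks):
    # In a real cluster each chunk would be mapped on the machine that stores it.
    mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

# Each string stands in for a block stored on a different computer.
chunks = ["big data is big", "data is everywhere"]
counts = word_count(chunks)
print(counts)
```

Because each chunk is mapped independently and each key is reduced independently, both phases can be spread over many machines, which is what allows this style of processing to scale.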

4 Vs of Big Data
 Volume: The amount of data is so large that it cannot be stored, processed, and interpreted by a
single computing machine.
 Velocity: This is the speed at which the data arrives. The system should be capable of accessing,
storing, and processing this high-speed data.
 Variety: Data can be structured, semi-structured, or unstructured. The system should be capable
of handling all kinds of data.
 Veracity: The data may be of questionable quality or lack authenticity. It is prone to noise
or bias.

Summary
 What is Big Data - A large amount of data which cannot be processed and stored by a single
system.
 Sources of Big Data - The three broad categories of sources of Big data are People, Machine and
Organisation. Customer ratings and social media posts are some examples of people generated
data. Data captured by sensors and cell phone towers are examples of data generated by
machines.
 Types of Data - Data can be classified as structured, semi-structured or unstructured.
 Applications of Big Data - Big data has applications across retail, financial services, healthcare,
manufacturing, etc.
 4Vs - The 4Vs of Big data are volume, velocity, variety and veracity.

Big Data Industry Case Studies


We’re currently living in the digital era. One of the first examples of digitisation was the advent of emails.
With emails, message sharing has become convenient and time-saving. Hence, we still use emails for both
personal and professional message sharing. Almost all the services around us have become digital, from
booking tickets to transferring money to another bank account.

Apart from the 4 Vs covered by our industry expert, researchers have come up with some additional Vs
as well. They are —

 Variability: This refers to the way the meaning of data can change. In other words, the meaning of
the data depends on the context in which it appears. One of the simplest examples is the numerous
words in the English language that have multiple meanings. Interpreting such a word without
considering its surrounding words will not give you its exact meaning; if you analyse the entire
sentence in which the word is present, you get the exact sense of the word.
 Validity: This refers to the correctness of the values present in the dataset. The dataset is bound
to have some anomalies in it, such as missing values, incorrect values, etc. It's important to get
rid of these anomalies before using this dataset for processing. The presence of these defects will
influence the output, thereby giving you erroneous results. One example of invalid data is
negative values in the age column.
 Volatility: This refers to the expiry or life expectancy of data. Depending on the business
requirements, it is sometimes necessary to archive or eliminate old data completely. For example,
after the implementation of GST, data corresponding to older taxes such as VAT is archived because
it is no longer needed for current processing.
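
The validity check described above (missing values, negative ages) can be sketched as follows. The records, the function name find_invalid_rows and the upper age limit of 120 are assumptions made for illustration only.

```python
def find_invalid_rows(rows, min_age=0, max_age=120):
    """Return (index, reason) pairs for rows whose 'age' is missing or out of range.

    The limits are illustrative assumptions, not a standard.
    """
    problems = []
    for i, row in enumerate(rows):
        age = row.get("age")
        if age is None:
            problems.append((i, "missing age"))
        elif not (min_age <= age <= max_age):
            problems.append((i, "age out of range"))
    return problems

# Invented sample records.
records = [
    {"name": "A", "age": 34},
    {"name": "B", "age": -5},    # invalid: negative age
    {"name": "C", "age": None},  # invalid: missing value
    {"name": "D", "age": 61},
]

problems = find_invalid_rows(records)
bad_indexes = {index for index, _ in problems}
clean = [row for i, row in enumerate(records) if i not in bad_indexes]
print(problems)
print(clean)
```

Whether invalid rows are dropped, corrected or flagged for review depends on the use case; dropping them here is just the simplest option to show.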

Difference between veracity and validity


To find out how veracity is different from validity, we will discuss a few examples from our day-to-day
lives.

Veracity determines whether the given data is trustworthy or not. For instance, the data will suffer from
veracity issues if its source is not credible. Suppose a student is doing a case study on the "Pollution of
Rivers in India". For this, he/she has to collect water samples from various rivers and has to get them
examined in a lab to determine the degree of pollution in each stream.

Here, if the student has collected a single sample from every river, then you can say that the data collected
suffers from veracity issues. This is because pollution is not uniform along a river's course. A river is
typically least polluted near its origin and most polluted towards the end of its path, where it meets the sea.
Let's assume that in this example, the student has collected a single sample from each river at its origin. The
lab readings will then show that the rivers are not polluted, even though the rivers may be highly polluted
where they flow through highly populated regions. So ideally, the data would not suffer from veracity
issues if, for each river, multiple samples were collected at various points along its entire course.
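
The effect of sampling only at the origin can be illustrated with a few numbers. The readings below are hypothetical values on an arbitrary pollution index, not measurements; they only show how a single sample can misrepresent the whole river.

```python
from statistics import mean

# Hypothetical readings (arbitrary index, higher = more polluted), ordered by
# increasing distance from the river's origin. These are not real measurements.
readings_along_river = [2, 5, 18, 40, 55]

def summarise(readings):
    """Compare a single sample at the origin with samples from the whole course."""
    return {
        "origin_only": readings[0],
        "mean_of_all_samples": mean(readings),
        "worst_sample": max(readings),
    }

summary = summarise(readings_along_river)
print(summary)
```

With these illustrative values, the origin-only reading is far lower than both the mean and the worst sample, so a conclusion based on it alone would not be trustworthy.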

Validity refers to the correctness of the data. While performing lab experiments, if properly calibrated
standard instruments are not used, then the readings will be inaccurate. It can then be said that these
readings are not valid.

Conventional Data Processing System and Big Data


You now know the characteristics of big data and its industry use cases. However, you must also know
exactly what the problems with traditional data-processing systems were and why those systems can’t
process big data.

Previously, the most commonly used data-storing and data-processing systems were RDBMSs (Relational
Database Management Systems). An RDBMS uses tables to store data in a row-column format. These
tables have a well-defined schema/metadata, and the data that is stored in each table has to comply with
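the underlying schema. Operational (transactional) data was mostly stored and analysed using
RDBMSs. However, big data is generally not stored and processed using these traditional systems: as noted
earlier, it can grow beyond what a single computing system can store and process, and much of it is
unstructured and does not fit a fixed schema.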
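
The idea that data in an RDBMS table must comply with the table's schema can be sketched with Python's built-in sqlite3 module. SQLite is used here only as a convenient stand-in for an RDBMS, and the orders table with its columns is invented for the example.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# The schema states which columns exist and which rules each value must satisfy.
conn.execute(
    """
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer TEXT NOT NULL,
        amount   REAL NOT NULL CHECK (amount >= 0)
    )
    """
)

# A row that complies with the schema is accepted.
conn.execute("INSERT INTO orders (customer, amount) VALUES (?, ?)", ("Asha", 499.0))

# A row that violates the schema (customer is missing) is rejected.
try:
    conn.execute("INSERT INTO orders (customer, amount) VALUES (?, ?)", (None, 120.0))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(rejected, row_count)
```

This strictness suits transactional data, but it is also why free-form content such as images, videos or message text does not fit naturally into such tables.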

Job roles in Big Data


 Big data consultant: This role is mostly functional and domain-specific. The responsibilities
include —
o Identifying big data-relevant use cases
o Cost-benefit analysis of big data
o Impact of big data on the organisation

 Architect: This role is mostly technical and requires a sound knowledge of the various big data
tools. The responsibilities include —
o Understanding how traditional and big data systems interact
o Designing the infrastructure for distributed computing
o Determining the specifications of the infrastructure or the cluster in terms of the number
of nodes in it, the capacity of each node, etc.

 Analyst: Analysts have data-relevant skills. The responsibilities of an analyst include —


o Visualising and reporting facts in an appropriate format
o Deriving insights and patterns from data
o Applying sound subject-matter and implementation knowledge of various machine
learning and artificial intelligence techniques
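
One of the architect's responsibilities listed above is determining the number of nodes in a cluster and the capacity of each node. A rough back-of-the-envelope sketch of that arithmetic is shown below. Every figure in it (data size, replication factor, disk capacity, usable fraction) is an assumption chosen for illustration, and real sizing would also consider CPU, memory and data growth.

```python
import math

def nodes_needed(raw_data_tb, replication_factor, node_capacity_tb, usable_fraction=0.75):
    """Estimate how many storage nodes are needed for a given amount of data.

    raw_data_tb        -- size of the data before replication (assumed)
    replication_factor -- number of copies kept of each block (assumed)
    node_capacity_tb   -- disk capacity of one node (assumed)
    usable_fraction    -- share of each disk left for data after overheads (assumed)
    """
    usable_per_node_tb = node_capacity_tb * usable_fraction
    total_stored_tb = raw_data_tb * replication_factor
    return math.ceil(total_stored_tb / usable_per_node_tb)

# Illustrative figures only: 100 TB of data, 3 copies, 20 TB disks.
nodes = nodes_needed(raw_data_tb=100, replication_factor=3, node_capacity_tb=20)
print(nodes)
```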

Big data engineers


Software developers who have expertise in programming languages such as Java, Python, etc. are well
placed to transition to the role of a big data engineer. This role requires coding skills,
enterprise architecture expertise, and some data science knowledge. The following are the
responsibilities and skills of a big data engineer in detail:

 Big data engineers follow the design suggested by big data architects and build the required
system.
 They are responsible for developing, maintaining, testing, and evaluating big data solutions.
 They directly work with various components in the Hadoop technology stack, such as MapReduce,
Pig, Hive, etc. Hence, big data engineers are expected to have a thorough understanding of such
tools.
 They must have good knowledge of data warehousing and NoSQL technologies.
 A software engineer/developer has prior experience in object-oriented design, coding, and
testing. These skills are must-haves for a big data engineer as well because a big data engineer is
expected to code and develop applications that process big data.
 Experience in algorithms and good problem-solving skills are huge advantages.
Cloud Computing

Traditional Data Centers


A data centre is a dedicated space within a building, in which all the computer systems and other
components related to storage or network are kept.
The following are the two ways of deploying a traditional data centre:
On-premise systems: The data centre is hosted and maintained within the organisation.
Colocation: The data centre used by an organisation is hosted by a third-party firm. The
organisation provides the required computer servers, storage and networking, whereas the third-party
firm provides the power, cooling and physical security. This is similar to the relationship
between a property manager and a customer who rents space to house equipment.
Benefits of Colocation Data Centre
Risk Management: Colocation data centres typically provide backup power, so network traffic is less
likely to be affected by a power failure. This helps maintain business continuity in case of a major disaster.
Security: Colocation providers equip their data centres with up-to-date security technology that helps
prevent unauthorised access.
Cost: Colocation results in cost savings, such as reduced capital expenditure on power sources,
backup generators, etc.
Bandwidth: Colocation provides the amount of bandwidth required by an organisation to
function properly.
Scalability: Enterprises can expand according to their needs through colocation.

Cloud Computing
Cloud computing refers to the delivery of services such as processing, storage, databases, networking, etc.
to users and organisations over the Internet, based on their requirements. The servers on which
the software and databases run are located in data centres across the world. Users and organisations
can access these servers through the Internet from anywhere.

Cloud computing is a pay-as-you-go service, i.e., the users pay only for the services that they use.
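
The pay-as-you-go idea can be made concrete with a small calculation. The rates and costs below are invented for illustration and do not reflect any provider's actual pricing; the function names are hypothetical.

```python
def monthly_cost_pay_as_you_go(hours_used, hourly_rate):
    """Cloud style: the bill depends only on what was actually used."""
    return hours_used * hourly_rate

def monthly_cost_owned(upfront_cost, lifetime_months, monthly_running_cost):
    """Owned hardware: the purchase is spread over its lifetime, used or not."""
    return upfront_cost / lifetime_months + monthly_running_cost

# Illustrative numbers only (currency units are arbitrary).
cloud_light_use = monthly_cost_pay_as_you_go(hours_used=200, hourly_rate=0.25)
cloud_heavy_use = monthly_cost_pay_as_you_go(hours_used=720, hourly_rate=0.25)
owned = monthly_cost_owned(upfront_cost=3600, lifetime_months=36, monthly_running_cost=20)

print(cloud_light_use, cloud_heavy_use, owned)
```

With these made-up figures, light usage is much cheaper in the cloud, while round-the-clock usage costs more than the owned server, which is why usage patterns matter when comparing the two.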

The two main users of Cloud are:

 End users: These users use cloud services for their own individual benefit.
 Business management users: These users utilise cloud services at an organisational
level.

The three major cloud providers are:

 Amazon Web Service (AWS)


 Microsoft Azure
 Google Cloud Platform

Benefits of Cloud Computing


 Cost Saving: Cloud computing eliminates the need for buying or maintaining hardware and
software resources, for setting up data centres, and for maintaining computing infrastructure.
Most cloud services are pay-as-you-go, which means customers only pay for the services that
they use. Use of the cloud can also reduce staff costs, since there is less infrastructure to
maintain.
 Scalability: Cloud computing provides businesses with the ability to expand their resources when
needed. Organisations need not worry about installing infrastructure, as it is done by the cloud
service providers. They can also scale up their applications as per requirement.
 Availability: Cloud is available at different locations across the world. It allows companies to
expand to new geographical regions and deploy the resources globally.
 Security: Cloud allows access to data and applications to authorised and authenticated users only.
 Data Storage Space: Organisations can opt for the exact amount of storage they need and pay
only for the space that they use.

Cloud-based Architecture
Cloud architecture is a combination of different technologies that are brought together to create the
cloud, which provides shared scalable resources across a network.

Cloud architecture has the following two components:

Front End: This is the client or end-user. It consists of all the applications and interfaces that are used by
clients to access cloud resources.
Back End: This is the cloud itself. It consists of the infrastructure, such as databases, computing
resources, deployment models, etc., that is required to build the cloud.

Both the front end and back end are connected to each other through a network, usually the Internet.

The following image illustrates the schematic structure of a cloud-based architecture.

Cloud Deployment Models


1. Private Cloud
In a private cloud, cloud resources are solely operated for a single organisation. The cloud is either
managed by the organisation itself or by a third party and is hosted internally or externally.

These are some of the private cloud providers:

 IBM
 Oracle
 VMware
 Hewlett Packard Enterprise (HPE)

Advantages:

 It provides high security and also restricts access only to authorised users. Hence, this kind of
infrastructure is generally preferred in financial institutions like banks, insurance firms, etc.
 It provides high control over the resources.
Disadvantages:

 It is not cost-effective when compared with a public cloud.


 It has limited scalability: it can be scaled only up to the capacity of the internally hosted resources.

2. Public Cloud
In a public cloud, cloud resources are owned and operated by a third party. The services are provided to
the users through the Internet. The cloud service provider is solely responsible for maintaining the
resources.

Following are some of the major public cloud service providers:

 Google Cloud Platform


 Microsoft Azure
 Amazon Web Services

Advantages:

 It is highly scalable. It offers flexibility to either scale up or scale down the usage of
resources as per the demand or based on a user’s request.
 It is also cost-effective. Users of a public cloud have to pay for only what they use.

Disadvantages:

 A public cloud might have security issues.


 It is not 100% customisable as per an organisation’s requirements.

3. Hybrid Cloud
Hybrid cloud is a combination of public and private clouds and allows organisations to share data between
them. It can be a combination of:

 At least one private cloud and at least one public cloud,


 Two or more private clouds, or
 Two or more public clouds.
The clouds chosen to build a hybrid cloud should be able to connect multiple computers
through a single network. They should also be able to move workloads from one environment
to another. It is the orderly, consistent connections between the separate clouds that make them a
hybrid cloud. Usually, the performance of a hybrid cloud depends on how well its connections are
developed and managed. Public and private clouds are linked using complex networks of LANs, APIs, VPNs,
etc. Several cloud providers give their customers preconfigured connections. For example:

 Dedicated Interconnect by Google Cloud


 Direct Connect by AWS, and
 ExpressRoute by Microsoft Azure.

Advantages:

 Security: Sensitive data and workloads can be kept on the private cloud, so a hybrid cloud can
retain the security benefits of a private cloud.
 Scalability: The public cloud is scalable. Therefore, a hybrid cloud, which combines public and
private clouds, can also scale by using public cloud resources.
 Flexibility: Users can access both the private and the public cloud as per their requirements.
 Cost: The public cloud is cost-effective, so a hybrid cloud is also cost-effective to the extent
that the user runs workloads on the public cloud.

Disadvantages:

 Complex networking: Because a hybrid cloud combines public and private clouds, configuring
the network between them can be difficult.
 Organisation’s security compliance: Both the public and the private cloud should comply
with the organisation’s security norms, and it is not easy to set up the clouds to meet this
requirement.

4. Community Cloud
When different cloud services are integrated into a single cloud to meet the specific needs of an industry,
community or business sector, the cloud is known as the community cloud.

The infrastructure of the community cloud is shared between organisations that have common concerns
or interests. Industries such as healthcare, media, etc. opt for community clouds.
Advantages:

 The cost of maintenance can be shared among the organisations in the community.
 It is more secure than a public cloud and less expensive than a private cloud.

Disadvantages:

 It is difficult to distribute the responsibilities among the organisations in a community.


 It is difficult to segregate the data among the organisations in a community.

5. Multi Cloud Strategy


In a multi-cloud strategy, an organisation uses multiple public and private clouds from different
providers. This is done to avoid lock-in with a single vendor and to make use of the best services
offered by each provider.
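
One common way to limit vendor lock-in in application code is to program against a small interface rather than a specific provider's SDK. The sketch below illustrates that design idea only: the ObjectStore interface, the InMemoryStore class and the provider names are all hypothetical, and no real cloud API is called.

```python
from typing import Protocol

class ObjectStore(Protocol):
    """The minimal storage interface the application depends on."""
    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...

class InMemoryStore:
    """Stand-in for one provider's storage service (hypothetical, no real SDK calls)."""
    def __init__(self, provider_name: str) -> None:
        self.provider_name = provider_name
        self._objects: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]

def archive_report(store: ObjectStore, report: bytes) -> str:
    """Application code: works with any store that offers put() and get()."""
    key = "reports/latest"
    store.put(key, report)
    return key

# The same application code runs against two different (simulated) providers.
provider_a = InMemoryStore("provider-a")
provider_b = InMemoryStore("provider-b")
for store in (provider_a, provider_b):
    archive_report(store, b"quarterly numbers")

print(provider_a.get("reports/latest"), provider_b.get("reports/latest"))
```

In practice, each provider would get its own adapter class wrapping that provider's SDK, while the application code stays unchanged.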

Types of Cloud Services


1. Infrastructure as a Service (IaaS)
These services are a set of compute, storage and network resources that are virtualised by cloud
providers so that users can access and configure them according to their needs. With IaaS, a user can
rent IT infrastructure.

Common examples of IaaS are as follows:

 AWS EC2
 Google Compute Engine (GCE)
 DigitalOcean

Advantages

 The service provider provides the infrastructure, and the user only has to install the operating
system they require and work on it.
 The user can modify the architecture as per their requirements since it is basic cloud
infrastructure.
 The user has full control over all the computing resources.
Disadvantages

 There can be security risks in IaaS because the environment is multi-tenant.
 A vendor outage can leave users unable to access their data for a while.

2. Platform as a Service (PaaS)


These are cloud services that provide an on-demand environment for developing and managing a
software application. PaaS can be used to build, run and manage application programming interfaces
(APIs).

The following are some of the common examples of PaaS:

 Microsoft Azure (formerly Windows Azure)
 OpenShift
 AWS Elastic Beanstalk.

Advantages:

 Prebuilt platform: PaaS provides an already built platform for users to build and run their
applications.
 It is a simple model to use and deploy applications.
 Low cost: Since the platform is already built, the user needs to create only their applications. This
reduces the costs related to hardware and software.

Disadvantages:

 Migration issues: Migrating the user applications from one PaaS vendor to another might raise
some issues.
 Platform restrictions: The platforms provided by some vendors may have certain restrictions, for
instance, the user can use only certain specified languages.
3. Software as a Service (SaaS)
These are cloud services that provide the user with a complete software application over the Internet. All
the infrastructure, application tools, data, etc. are located at data centres managed by the service
providers.

Some of the examples of SaaS are as follows:

 Google Workspace (formerly Google Apps)
 Salesforce
 Dropbox.

There are two modes of SaaS:

Simple Multi-Tenancy: Each user has independent resources that are different from the resources
of other users.

Fine Grain Multi-Tenancy: Resources are shared by several users but the functionality of these
resources remains the same.

Advantages:

 Ease of access: Users can access the applications on the server from anywhere, using most types
of Internet-connected devices.
 Low maintenance: Users need not update an application. The application is on the server, and it
is the service provider’s responsibility to maintain the application.
 Quick setup: Users do not require any hardware to install the application. The SaaS application is
already present on the cloud.

Disadvantages:

 Lack of control: Users do not have control over the SaaS applications. Only the vendor has full
control of SaaS applications.
 Connectivity issue: The applications can be accessed only via the Internet. Hence, if there is
no Internet connection, users cannot access the applications.
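
The difference between the three service models comes down to who manages which layer. The sketch below encodes that split as a small lookup table, following the descriptions above (an IaaS user installs the operating system, a PaaS user creates only the applications, and a SaaS provider maintains everything). The four layer names are a simplification chosen for this example.

```python
# Simplified stack, from top to bottom. The layer names are an assumption.
LAYERS = ["application", "runtime/platform", "operating system", "infrastructure"]

# Layers managed by the user under each model; the provider manages the rest.
USER_MANAGED = {
    "IaaS": {"application", "runtime/platform", "operating system"},
    "PaaS": {"application"},
    "SaaS": set(),
}

def who_manages(model, layer):
    """Return 'user' or 'provider' for a given service model and layer."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer}")
    return "user" if layer in USER_MANAGED[model] else "provider"

# Print the full responsibility split for each model.
for model in ("IaaS", "PaaS", "SaaS"):
    print(model, {layer: who_manages(model, layer) for layer in LAYERS})
```

Reading the table from IaaS to SaaS, the user's share shrinks, which matches the trade-off described above: less maintenance effort in exchange for less control.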
