DATA COLLECTION
Data Collection Preliminaries
The quantity that gets measured is reflected in our records as “data.”
The word data comes from the Latin datum, meaning “given.” Thus, data (the plural of datum) refers to facts that are given or known to be true.
Primary Versus Secondary Dichotomy
PRIMARY DATA
- Data that is collected “at source” and specifically for the research at hand.
- The data source could be individuals, groups, organizations, etc. Data from them would be actively elicited or passively observed and collected. Thus, surveys, interviews, and focus groups all fall under the ambit of primary data.
- Tailored specifically to the questions posed by the research project.
- The disadvantages are cost and time.

SECONDARY DATA
- Data that has been previously collected for a purpose that is not specific to the research at hand.
Data generation through a designed experiment
The data is not readily available to the scientist, who designs an experiment and generates the data.
This method of data collection is possible when we can control different factors precisely while studying the effect of an important variable on the outcome.
Collection of Data That Already Exists
Collection of such data is usually done in three possible ways:
(1) complete enumeration,
(2) sample survey, and
(3) through available sources, where the data was possibly collected for a different purpose and is available in various published sources.
Complete enumeration is collecting data on all items/individuals/firms. The census is an example of complete enumeration.
In a sample survey, the data is not collected on the entire population but on a representative sample. It is commonly employed in market research, the social sciences, public administration, etc.
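As a quick illustration of drawing a representative sample, the following minimal Python sketch uses simple random sampling from a hypothetical population of customer IDs; the population size and sample size are assumptions chosen only for the example.

    import random

    # Hypothetical population of 10,000 customer IDs.
    population = list(range(1, 10001))

    random.seed(42)                          # fixed seed so the draw is reproducible
    sample = random.sample(population, 200)  # simple random sample of 200 units

    print(len(sample), sample[:5])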
Secondary data can be collected from two sources: internal or external
Internal data is collected by the company or its agents on behalf of the company. The defining characteristic of internal data is its proprietary nature.
The external data, on the other hand, can be collected by either third-party data providers (such as IRI,
AC Nielsen) or government agencies. In addition, recently another source of external secondary data
has come into existence in the form of social media/blogs/review websites/search engines where users
themselves generate a lot of data through C2B or C2C interactions.
Secondary data can also be classified by the nature of the data along the dimension of structure.
Structured data includes sales records, financial reports, customer records such as purchase history, etc.
Unstructured data is in the form of free-flowing text, images, audio, and video, which are difficult to store in a traditional database.
Data that is somewhere in between structured and unstructured is called semi-structured or hybrid data. For example, a product web page will have product details (structured) and user reviews (unstructured).
The data and its analysis can also be classified on the basis of whether a single unit is observed over multiple time points (time-series data), many units are observed once (cross-sectional data), or many units are observed over multiple time periods (panel data).
The panel could be balanced (all units are observed over all time periods) or unbalanced (observations on a few units are missing for a few time points).
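A minimal sketch in Python (pandas) of how the three layouts differ, using a small hypothetical firm/year dataset; the column names and values are assumptions for illustration only.

    import pandas as pd

    # Panel data: several units ("firms") observed over several time periods.
    panel = pd.DataFrame({
        "firm":  ["A", "A", "B", "B"],
        "year":  [2021, 2022, 2021, 2022],
        "sales": [100, 110, 90, 95],
    })

    # Cross-sectional data: many units observed once (a single year).
    cross_section = panel[panel["year"] == 2021]

    # Time-series data: one unit observed over multiple time points.
    time_series = panel[panel["firm"] == "A"].set_index("year")["sales"]

    print(panel)
    print(cross_section)
    print(time_series)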
DATA TYPES
Numerals
Alphabets
Special characters
Nominal
The categorical variable scale is defined as a scale that labels variables into distinct classifications and
doesn’t involve a quantitative value or order. (NO ORDERING OR DIRECTION)
Ordinal
A variable measurement scale used to simply depict the order of variables and not the difference between them. These scales generally depict non-mathematical ideas such as frequency, satisfaction, happiness, degree of pain, etc. (RANKINGS, ORDERING, OR SCALING)
Interval
A numerical scale on which both the order of the variables and the difference between them are known. Variables that have familiar, constant, and computable differences are classified using the interval scale. It is easy to remember the primary role of this scale, too: ‘interval’ indicates ‘distance between two entities,’ which is what the interval scale helps achieve.
Ratio
A variable measurement scale that not only produces the order of variables but also makes the difference between variables known, along with information on the value of a true zero. It assumes that the variables have an option for zero, that the difference between two variables is the same, and that there is a specific order between the options.
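As an illustrative sketch only, the four scales can be represented in Python with pandas; the variable names and values below are assumptions chosen to show what each scale does and does not allow.

    import pandas as pd

    # Nominal: labels with no ordering or direction (hypothetical cities).
    city = pd.Categorical(["Manila", "Cebu", "Davao"], ordered=False)

    # Ordinal: ordered categories, but the differences between them are not meaningful.
    satisfaction = pd.Categorical(
        ["low", "high", "medium"],
        categories=["low", "medium", "high"],
        ordered=True,
    )

    # Interval: numeric with constant, computable differences but no true zero (temperature in Celsius).
    temperature_c = pd.Series([10.0, 20.0, 30.0])

    # Ratio: numeric with a true zero, so ratios are meaningful (sales amounts).
    sales = pd.Series([0.0, 50.0, 100.0])

    print(satisfaction.min(), satisfaction.max())  # ordering is defined for ordinal data
    print(sales.iloc[2] / sales.iloc[1])           # a ratio only makes sense on a ratio scale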
Problem Formulation Preliminaries
Depending on the identification of the problem, data collection strategies, resources, and approaches will differ.
Four important points:
- Reality is messy.
- There are symptoms and then there is the cause or ailment itself. Curing the symptoms may not cure the ailment.
- Note the pattern of connections between symptom(s) and potential causes.
- Diagnose a problem (or cause) by narrowing the field of “ailments.”
Data Management: Relational Database Systems (RDBMS)
Storage and management of data is a key aspect of data science.
Data, simply speaking, is nothing but a collection of facts—a snapshot of the
world—that can be stored and processed by computers
Database
A database is a collection of organized data in the form of rows, columns, tables and indexes.
In a database, even a small piece of information becomes data.
Data owners tend to aggregate related information together and put it under one collective name, called a table.
A database system is a digital record-keeping system or an electronic filing cabinet.
Database systems can be used to store large amounts of data, and data can then be queried and
manipulated later using a querying mechanism/language.
A database management system (DBMS) is the system software that enables users to create, organize, and
manage databases
The main objectives of a DBMS:
- Mass storage
- Removal of duplication: the DBMS makes sure that the same data is not stored more than once
- Providing multiple-user access: two or more users can work concurrently
- Data integrity
- Ensuring the privacy of the data and preventing unauthorized access
- Data backup and recovery
- Non-dependence on a particular platform; and so on.
RELATIONAL DATABASE SYSTEMS
Relational database management system (RDBMS) is a database management system
(DBMS) that is based on the relational model of data.
DBMS: tells us about the tables in the database.
Relational DBMS: specifies the relations between different entities in the database.
The two main principles of the RDBMS are entity integrity and referential integrity.
Entity integrity
Every row must be identified by a unique value (the primary key), which cannot accept null values.
Referential integrity
Constraints specified between two relations must always be kept consistent (e.g., a foreign key value must match an existing primary key value).
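A minimal sketch of the two principles using Python's built-in sqlite3 module; the customers/orders tables are hypothetical, and note that SQLite enforces foreign keys only when the PRAGMA shown below is enabled.

    import sqlite3

    # In-memory database with hypothetical customers/orders tables.
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled

    # Entity integrity: every row is identified by a non-null, unique primary key.
    conn.execute("""
        CREATE TABLE customers (
            customer_id INTEGER PRIMARY KEY,
            name        TEXT NOT NULL
        )
    """)

    # Referential integrity: orders.customer_id must match an existing customers.customer_id.
    conn.execute("""
        CREATE TABLE orders (
            order_id    INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL,
            amount      REAL,
            FOREIGN KEY (customer_id) REFERENCES customers (customer_id)
        )
    """)

    conn.execute("INSERT INTO customers VALUES (1, 'Alice')")
    conn.execute("INSERT INTO orders VALUES (10, 1, 250.0)")      # OK: customer 1 exists

    try:
        conn.execute("INSERT INTO orders VALUES (11, 99, 80.0)")  # fails: no customer 99
    except sqlite3.IntegrityError as err:
        print("Referential integrity violation:", err)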
Advantages of RDBMS over EXCEL
RDBMS
- In an RDBMS, the information is stored independently from the user interface. This separation of storage and access makes the framework considerably more scalable and versatile.
- Data can be easily cross-referenced between multiple databases using relationships between them.
- An RDBMS utilizes centralized data storage systems, which makes backup and maintenance much easier.

EXCEL
- Excel is a two-dimensional spreadsheet, and thus it is extremely hard to make connections between information in various spreadsheets.
- It is easy to view the data or find particular data in Excel when the size of the information is small. It becomes very hard to read the information once it crosses a certain size.
- The data might scroll many pages when endeavoring to locate a specific record.
Structured Query Language (SQL)
SQL (structured query language) is a computer language exclusive to a particular application domain in
contrast to some other general-purpose language (GPL) such as C, Java, or Python that is broadly applicable
across domains.
SQL is text oriented and designed for managing (accessing and manipulating) data.
SQL was adopted as a national standard by ANSI (American National Standards Institute) in 1992. It is
the standard language for relational database management systems.
SQL statements are used to select a particular part of the data, retrieve data from a database, and update
data in the database using CREATE, SELECT, INSERT, UPDATE, DELETE, and DROP
commands. SQL commands can be divided into four categories:
DDL (data definition language)
deals with the database schemas and structure
DML (data manipulation language)
deals with tasks like storing, modifying, retrieving, deleting, and updating the data in/from the database.
DCL (data control language)
used to uphold database security in multi-user environments. The database administrator (DBA) is responsible for granting/revoking privileges on database objects.
TCL (transaction control language)
enables you to control and handle transactions to maintain the integrity of the data within SQL statements.
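A short sketch, again using Python's sqlite3 module, of DDL and DML statements with a commit standing in as the TCL step; SQLite itself does not support DCL commands such as GRANT/REVOKE, so those are omitted, and the products table is a hypothetical example.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # DDL: define the schema (hypothetical products table).
    cur.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, price REAL)")

    # DML: store, modify, retrieve, and delete rows.
    cur.execute("INSERT INTO products VALUES (1, 'pen', 10.0)")
    cur.execute("INSERT INTO products VALUES (2, 'notebook', 45.0)")
    cur.execute("UPDATE products SET price = 12.0 WHERE id = 1")
    cur.execute("DELETE FROM products WHERE id = 2")
    print(cur.execute("SELECT id, name, price FROM products").fetchall())

    # TCL: commit (or roll back) the transaction to make the changes permanent.
    conn.commit()

    # DDL again: remove the table when it is no longer needed.
    cur.execute("DROP TABLE products")
    conn.close()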
BIG DATA MANAGEMENT - (LESSON 4)
The twenty-first century is characterized by the digital revolution. The Digital Revolution, also known as the
Third Industrial Revolution, started in the 1980s and sparked the advancement and evolution of technology
from analog electronic and mechanical devices to the shape of technology in the form of machine learning and
artificial intelligence today.
Prevalence of big data:
• The total amount of data generated by mankind is 2.7 zettabytes, and it continues to grow at an
exponential rate.
• In terms of digital transactions, according to an estimate by IDC, we shall soon be conducting nearly
450 billion transactions per day.
• Facebook analyzes 30+ petabytes of user-generated data every day.
ELEMENTS OF BIG DATA
Note: When we say large datasets, we mean data sizes ranging from petabytes to exabytes and more. Please note that 1 byte = 8 bits.
The term big data is used colloquially to describe the vast variety of data that is being generated.
When we describe traditional data, we tend to put it into 3 categories:
1. Structured data is highly organized information that can be easily stored in a spreadsheet or table
using rows and columns. Any data that we capture in a spreadsheet with clearly defined columns and
their corresponding values in rows is an example of structured data.
2. Unstructured data may have its own internal structure. It does not conform to the standards of
structured data where you define the field name and its type. Video files, audio files, pictures, and
text are best examples of unstructured data.
3. Semi-structured data tends to fall in between the two categories mentioned above. There is generally
a loose structure defined for data of this type, but we cannot define stringent rules like we do for storing
structured data. Prime examples of semi-structured data are log files and Internet of Things (IoT) data
generated from a wide range of sensors and devices.
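As a small illustration of the semi-structured case, the Python sketch below (with hypothetical IoT log lines) parses loosely structured JSON records into a table; fields that a device does not report simply come out as missing values.

    import json
    import pandas as pd

    # Hypothetical IoT log lines: each record is JSON, but the fields vary between devices.
    raw_logs = [
        '{"device": "sensor-1", "ts": "2024-01-01T00:00:00", "temp_c": 21.5}',
        '{"device": "sensor-2", "ts": "2024-01-01T00:00:05", "humidity": 0.43, "temp_c": 20.1}',
    ]

    records = [json.loads(line) for line in raw_logs]

    # Flatten the loose structure into a table; missing fields become NaN.
    table = pd.DataFrame(records)
    print(table)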
CHARACTERISTICS OF BIG DATA (4 characteristics)
1. Volume:
It is the amount of the overall data that is already generated (by either individuals or companies). The
Internet alone generates huge amounts of data. It is estimated that the Internet has around 14.3 trillion
live web pages, which amounts to 672 exabytes of accessible data.
2. Variety:
Data is generated from different types of sources that are internal and external to the organization such
as social and behavioral and also comes in different formats such as structured, unstructured (analog
data, GPS tracking information, and audio /video streams), and semi-structured data—XML, Email, and
EDI.
3. Velocity:
Velocity simply states the rate at which organizations and individuals are generating data in the world
today. For example, a study reveals that 400 hours of video are uploaded to YouTube every minute.
4. Veracity:
It describes the uncertainty inherent in the data, that is, whether the obtained data is correct and consistent. It is very rare that data presents itself in a form that is ready to consume. Considerable effort goes into processing data, especially when it is unstructured or semi-structured.
PROCESSING BIG DATA
1. Analytical usage of data:
Organizations process big data to extract information relevant to a field of study. This relevant information can then be used to make decisions for the future. Organizations use techniques like data mining, predictive analytics, and forecasting to get timely and accurate insights that help them make the best possible decisions. For example, we can provide online shoppers with product recommendations.
2. Enable new product development:
Recent successful startups are a great example of leveraging big data analytics for new product enablement. Companies such as Uber or Facebook use big data analytics to provide personalized services to their customers in real time.
APPLICATIONS OF BIG DATA ANALYSIS
1. Customer Analytics in the Retail industry
-Retailers, especially those with large outlets across the country, generate huge amounts of data in a variety of formats from various sources such as POS transactions, billing details, loyalty programs, and CRM systems. This data needs to be organized and analyzed in a systematic manner to derive meaningful insights.
-Customers can be segmented based on their buying patterns and spend at every transaction. Marketers can
use this information for creating personalized promotions.
-Organizations can also combine transaction data with customer preferences and market trends to understand
the increase or decrease in demand for different products across regions.
-This information helps organizations to determine the inventory level and make price adjustments.
2. Fraudulent claims detection in Insurance industry
- In industries like banking, insurance, and healthcare, fraudulent transactions mostly have to do with monetary transactions; those that are not caught might cause huge expenses and lead to a loss of reputation for a firm. Prior to the advent of big data analytics, many insurance firms identified fraudulent transactions using statistical methods/models. However, these models have many limitations and can prevent fraud only to a limited extent because model building can happen only on sample data.
Big data analytics enables the analyst to overcome this issue with volumes of data—insurers can combine internal claim data with social data and other publicly available data like bank statements, criminal records, and medical bills of customers to better understand consumer behavior and identify any suspicious behavior.
BIG DATA TECHNOLOGIES
Big data requires different means of processing such voluminous, varied, and scattered data compared to traditional data storage and processing systems like RDBMS (relational database management systems), which are good at storing, processing, and analyzing structured data only.
Distributed Computing and Parallel Computing
Loosely speaking, distributed computing is the idea of dividing a problem
into multiple parts, each of which is operated upon by an individual machine or computer.
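The same divide-the-problem idea can be sketched on a single machine with Python's multiprocessing module, where worker processes stand in for the individual computers; the squaring task is a placeholder chosen only for illustration.

    from multiprocessing import Pool

    def square(x):
        # The work performed independently on each part of the problem.
        return x * x

    if __name__ == "__main__":
        numbers = range(10)
        # Each worker process operates on a portion of the input in parallel.
        with Pool(processes=4) as pool:
            results = pool.map(square, numbers)
        print(results)  # [0, 1, 4, 9, ...]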
Distributed Computing and Parallel Computing Limitations and Challenges
• Multiple failure points: If a single computer fails and the other machines cannot reconfigure themselves in the event of the failure, the overall system can go down.
• Latency: It is the aggregated delay in the system because of delays in the completion of individual tasks. This leads to a slowdown in system performance.
• Security: Unless handled properly, there are higher chances of unauthorized user access on distributed systems.
• Software: The software used for distributed computing is complex, hard to develop, expensive, and requires a specialized skill set. This makes it harder for organizations to deploy distributed computing software in their infrastructure.
HADOOP FOR BIG DATA
Hadoop is the first open source big data platform that is mature and has widespread usage. Hadoop was created by Doug Cutting at Yahoo!, and derives its roots directly from the Google File System (GFS) and the MapReduce programming model for distributed computing.
THE 3 COMPONENTS OF HADOOP ARCHITECTURE
-Hadoop Distributed File System (HDFS) for file storage, to store large amounts of data;
-MapReduce for processing the data stored in HDFS in parallel; and
-a resource manager known as Yet Another Resource Negotiator (YARN) for ensuring proper allocation of resources.
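The MapReduce idea can be sketched in plain Python with the classic word-count example; the "blocks" below stand in for the chunks into which HDFS would split a large file, and the shuffle step is simplified to a plain concatenation of mapper outputs.

    from collections import defaultdict

    # Hypothetical input split into blocks, as HDFS would split a large file.
    blocks = ["big data needs big storage", "data drives decisions"]

    def map_phase(block):
        # Map: emit (word, 1) pairs for each word in a block.
        return [(word, 1) for word in block.split()]

    def reduce_phase(pairs):
        # Reduce: sum the counts for each distinct word.
        counts = defaultdict(int)
        for word, n in pairs:
            counts[word] += n
        return dict(counts)

    # Simplified shuffle: concatenate all mapper outputs before reducing.
    mapped = [pair for block in blocks for pair in map_phase(block)]
    print(reduce_phase(mapped))  # e.g., {'big': 2, 'data': 2, ...}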
• Cluster: A cluster is nothing but a collection of individual computers interconnected via a network. The
individual computers work together to give users an impression of one large system.
• Node: Individual computers in the network are referred to as nodes. Each node has pieces of Hadoop
software installed to perform storage and computation tasks.
• Master–slave architecture: Computers in a cluster are connected in a master–slave configuration. There is
typically one master machine that is tasked with the responsibility of allocating storage and computing duties to
individual slave machines.
• Master node: It is typically an individual machine in the cluster that is tasked with the responsibility of
allocating storage and computing duties to individual slave machines.
• DataNode: DataNodes are individual slave machines that store actual data and perform computational tasks
as and when the master node directs them to do so.
• Distributed computing: The idea of distributed computing is to execute a program across multiple machines,
each one of which will operate on the data that resides on the machine.
• Distributed File System: As the name suggests, it is a file system that is
responsible for breaking a large data file into small chunks that are then stored on individual machines.
Additionally, Hadoop has in-built salient features such as scaling, fault tolerance, and rebalancing. We
describe them briefly below.
• Scaling: On the technology front, organizations require a platform that can scale up to handle rapidly increasing data volumes, and they also need a scalability extension for existing IT systems in content management, warehousing, and archiving.
Hadoop can easily scale as the volume of data grows, thus circumventing the size limitations of
traditional computational systems.
• Fault tolerance: To ensure business continuity, fault tolerance is needed to ensure that there is no loss of
data or computational ability in the event of individual node failures. Hadoop provides excellent fault tolerance
by allocating the tasks to other machines in case an individual machine is not available.
• Rebalancing: As the name suggests, Hadoop tries to evenly distribute data among the connected systems so
that no particular system is overworked or is lying idle.
HADOOP ECOSYSTEM
DATA VISUALIZATION -(LESSON 5)
Data analytics is a burgeoning field—with methods emerging quickly to explore and make sense of the huge
amount of information that is being created every day. However, with any data set or analysis result, the
primary concern is in communicating the results to the reader. Unfortunately, human perception is not
optimized to understand interrelationships between large (or even moderately sized) sets of numbers.
However, human perception is excellent at understanding interrelationships between sets of data, such as
series, deviations, and the like, through the use of visual representations.
The classic example of “Anscombe’s Quartet” illustrates this point. In 1973, Anscombe created these sets of data to demonstrate the importance of visualizing data.
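A short matplotlib/seaborn sketch of the quartet is shown below; it assumes seaborn's bundled "anscombe" example dataset is available (load_dataset may fetch it over the network), and the figure layout is only one possible way to present the four panels.

    import matplotlib.pyplot as plt
    import seaborn as sns

    # Load Anscombe's four x-y datasets (identical summary statistics, very different shapes).
    df = sns.load_dataset("anscombe")

    fig, axes = plt.subplots(2, 2, figsize=(8, 6), sharex=True, sharey=True)
    for ax, (name, group) in zip(axes.flat, df.groupby("dataset")):
        ax.scatter(group["x"], group["y"])
        ax.set_title(f"Dataset {name}")
    fig.suptitle("Anscombe's Quartet")
    plt.tight_layout()
    plt.show()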
Humans are bombarded by information from multiple sources and through multiple channels. This information
is gathered by our five senses, and processed by the brain. However, the brain is highly selective about what it
processes and humans are only aware of the smallest fraction of sensory input. Much of sensory input is
simply ignored by the brain, while other input is dealt with based on heuristic rules and categorization
mechanisms; these processes reduce cognitive load. Data visualizations, when executed well, aid in the
reduction of cognitive load, and assist viewers in the processing of cognitive evaluations.
Six Meta-Rules for Data Visualization
1. Simplicity Over Complexity. The simplest chart is usually the one that communicates most clearly. Use the
“not wrong” chart—not the “cool” chart
2. Direct Representation. Always directly represent the relationship you are trying to communicate. Do not
leave it to the viewer to derive the relationship from other information
3. Single Dimensionality. In general, do not ask viewers to compare in two dimensions. Comparing
differences in length is easier than comparing differences in area
4. Use Color Properly. Never use color on top of color—color is not absolute
5. Use Viewers’ Experience to Your Advantage. Do not violate the primal perceptions of your viewers.
Remember, up means more
6. Represent the Data Story with Integrity. Chart with graphical and ethical integrity. Do not lie, either by
mistake or intentionally
Rules for graphical integrity are summarized below:
Use Consistent Scales. What this means is that when building axes in visualizations, the meaning of a
distance should not change, so if 15 pixels represents a year at one point in the axis, 15 pixels should not
represent 3 years at another point in the axis.
Standardize (Monetary) Units. “In time-series displays of money, deflated and standardized units [... ]
are almost always better than nominal units.” This means that when comparing numbers, they should be
standardized.
Present Data in Context. “Graphics must not quote data out of context” (Tufte, p. 60). When telling
any data story, no data has meaning until it is compared with other data.
Show the Data. “Above all else show the data” (Tufte, p. 92). Tufte argues that visualizers often fill significant portions of a graph with “non-data” ink. He argues that, as much as possible, one should show the data to the viewer in the form of the actual data and annotations that call attention to particular “causality” in the data, and thereby drive viewers to generate understanding.