DATA COLLECTION
Data Collection Preliminaries
The quantity that gets measured is reflected in our records as “data.”
The word data comes from the Latin datum, meaning “given.” Thus, data (the plural of datum) refers to facts that are given or known to be true.
Primary Versus Secondary Dichotomy
PRIMARY DATA
- Data that is collected “at source” and specifically for the research at hand.
- The data source could be individuals, groups, organizations, etc. Data from them would be actively elicited or passively observed and collected. Thus, surveys, interviews, and focus groups all fall under the ambit of primary data.
- Tailored specifically to the questions posed by the research project.
- The disadvantages are cost and time.

SECONDARY DATA
- Data that has been previously collected for a purpose that is not specific to the research at hand.
Data generation through a designed experiment
The data is not readily available to the scientist, who designs an experiment and generates the data.
This method of data collection is possible when we can control different factors precisely while studying the effect of an important variable on the outcome.
Collection of Data That Already Exists
Collection of such data is usually done in three possible ways:
(1) complete enumeration,
(2) sample survey, and
(3) through available sources, where the data was possibly collected for a different purpose and is available in various published sources.
Complete enumeration is collecting data on all items/individuals/firms. The census is an example of complete enumeration.
In a sample survey, the data is not collected on the entire population but on a representative sample. It is commonly employed in market research, the social sciences, public administration, etc.
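As a quick illustration of drawing a representative sample, the following minimal Python sketch uses simple random sampling from a hypothetical population of customer IDs; the population size and sample size are assumptions chosen only for the example.

    import random

    # Hypothetical population of 10,000 customer IDs.
    population = list(range(1, 10001))

    random.seed(42)                          # fixed seed so the draw is reproducible
    sample = random.sample(population, 200)  # simple random sample of 200 units

    print(len(sample), sample[:5])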
Secondary data can be collected from two sources: internal or external
Internal data is collected by the company or its agents on behalf of the company. The defining characteristic of internal data is its proprietary nature.
The external data, on the other hand, can be collected by either third-party data providers (such as IRI,
AC Nielsen) or government agencies. In addition, recently another source of external secondary data
has come into existence in the form of social media/blogs/review websites/search engines where users
themselves generate a lot of data through C2B or C2C interactions.
Secondary data can also be classified by the nature of the data along the dimension of structure.
Structured data includes sales records, financial reports, customer records such as purchase history, etc.
Unstructured data is in the form of free-flowing text, images, audio, and video, which are difficult to store in a traditional database.
Data that is somewhere in between structured and unstructured is called semi-structured or hybrid data. For example, a product web page will have product details (structured) and user reviews (unstructured).
The data and its analysis can also be classified on the basis of whether a single unit is observed over multiple time points (time-series data), many units are observed once (cross-sectional data), or many units are observed over multiple time periods (panel data).
The panel could be balanced (all units are observed over all time periods) or unbalanced (observations on a few units are missing for a few time points).
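A minimal sketch in Python (pandas) of how the three layouts differ, using a small hypothetical firm/year dataset; the column names and values are assumptions for illustration only.

    import pandas as pd

    # Panel data: several units ("firms") observed over several time periods.
    panel = pd.DataFrame({
        "firm":  ["A", "A", "B", "B"],
        "year":  [2021, 2022, 2021, 2022],
        "sales": [100, 110, 90, 95],
    })

    # Cross-sectional data: many units observed once (a single year).
    cross_section = panel[panel["year"] == 2021]

    # Time-series data: one unit observed over multiple time points.
    time_series = panel[panel["firm"] == "A"].set_index("year")["sales"]

    print(panel)
    print(cross_section)
    print(time_series)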
DATA TYPES
Numerals
Alphabets
Special characters
Nominal
The categorical variable scale is defined as a scale that labels variables into distinct classifications and
doesn’t involve a quantitative value or order. (NO ORDERING OR DIRECTION)
Ordinal
A variable measurement scale used to simply depict the order of variables and not the difference between them. These scales generally depict non-mathematical ideas such as frequency, satisfaction, happiness, degree of pain, etc. (RANKINGS, ORDERING, OR SCALING)
Interval
A numerical scale on which both the order of the variables and the difference between them are known. Variables that have familiar, constant, and computable differences are classified using the interval scale. It is easy to remember the primary role of this scale, too: ‘interval’ indicates ‘distance between two entities,’ which is what the interval scale helps achieve.
Ratio
A variable measurement scale that not only produces the order of variables but also makes the difference between variables known, along with information on the value of a true zero. It assumes that the variables have an option for zero, that the difference between two variables is the same, and that there is a specific order between the options.
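As an illustrative sketch only, the four scales can be represented in Python with pandas; the variable names and values below are assumptions chosen to show what each scale does and does not allow.

    import pandas as pd

    # Nominal: labels with no ordering or direction (hypothetical cities).
    city = pd.Categorical(["Manila", "Cebu", "Davao"], ordered=False)

    # Ordinal: ordered categories, but the differences between them are not meaningful.
    satisfaction = pd.Categorical(
        ["low", "high", "medium"],
        categories=["low", "medium", "high"],
        ordered=True,
    )

    # Interval: numeric with constant, computable differences but no true zero (temperature in Celsius).
    temperature_c = pd.Series([10.0, 20.0, 30.0])

    # Ratio: numeric with a true zero, so ratios are meaningful (sales amounts).
    sales = pd.Series([0.0, 50.0, 100.0])

    print(satisfaction.min(), satisfaction.max())  # ordering is defined for ordinal data
    print(sales.iloc[2] / sales.iloc[1])           # a ratio only makes sense on a ratio scale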
Problem Formulation Preliminaries
Depending on the identification of the problem, data collection strategies, resources, and approaches will differ.
Four important points:
- Reality is messy.
- There are symptoms and then there is the cause or ailment itself. Curing the symptoms may not cure the ailment.
- Note the pattern of connections between symptom(s) and potential causes.
- Diagnose a problem (or cause) by narrowing the field of “ailments.”
Data Management: Relational Database Systems (RDBMS)
Storage and management of data is a key aspect of data science.
Data, simply speaking, is nothing but a collection of facts—a snapshot of the
world—that can be stored and processed by computers
Database
A database is a collection of organized data in the form of rows, columns, tables and indexes.
In a database, even a small piece of information becomes data.
Data owners tend to aggregate related information together and put it under one collective name, called a table.
A database system is a digital record-keeping system or an electronic filing cabinet.
Database systems can be used to store large amounts of data, and data can then be queried and
manipulated later using a querying mechanism/language.
A database management system (DBMS) is the system software that enables users to create, organize, and
manage databases
The main objectives of a DBMS:
- Mass storage
- Removal of duplication: the DBMS makes sure that the same data is not stored more than once
- Providing multiple-user access: two or more users can work concurrently
- Data integrity
- Ensuring the privacy of the data and preventing unauthorized access
- Data backup and recovery
- Non-dependence on a particular platform; and so on.
RELATIONAL DATABASE SYSTEMS
Relational database management system (RDBMS) is a database management system
(DBMS) that is based on the relational model of data.
DBMS: tells us about the tables in the database.
Relational DBMS: specifies the relations between different entities in the database.
The two main principles of the RDBMS are entity integrity and referential integrity.
Entity integrity
Every row must be identified by a unique value (the primary key), which cannot accept null values.
Referential integrity
Constraints specified between two relations must always be kept consistent (e.g., a foreign key value must match an existing primary key value).
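A minimal sketch of the two principles using Python's built-in sqlite3 module; the customers/orders tables are hypothetical, and note that SQLite enforces foreign keys only when the PRAGMA shown below is enabled.

    import sqlite3

    # In-memory database with hypothetical customers/orders tables.
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled

    # Entity integrity: every row is identified by a non-null, unique primary key.
    conn.execute("""
        CREATE TABLE customers (
            customer_id INTEGER PRIMARY KEY,
            name        TEXT NOT NULL
        )
    """)

    # Referential integrity: orders.customer_id must match an existing customers.customer_id.
    conn.execute("""
        CREATE TABLE orders (
            order_id    INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL,
            amount      REAL,
            FOREIGN KEY (customer_id) REFERENCES customers (customer_id)
        )
    """)

    conn.execute("INSERT INTO customers VALUES (1, 'Alice')")
    conn.execute("INSERT INTO orders VALUES (10, 1, 250.0)")      # OK: customer 1 exists

    try:
        conn.execute("INSERT INTO orders VALUES (11, 99, 80.0)")  # fails: no customer 99
    except sqlite3.IntegrityError as err:
        print("Referential integrity violation:", err)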
Advantages of RDBMS over EXCEL
RDBMS
- In an RDBMS, the information is stored independently from the user interface. This separation of storage and access makes the framework considerably more scalable and versatile.
- Data can be easily cross-referenced between multiple databases using relationships between them.
- An RDBMS utilizes centralized data storage systems, which makes backup and maintenance much easier.

EXCEL
- Excel is a two-dimensional spreadsheet, and thus it is extremely hard to make connections between information in various spreadsheets.
- It is easy to view the data or find particular data in Excel when the size of the information is small. It becomes very hard to read the information once it crosses a certain size.
- The data might scroll many pages when endeavoring to locate a specific record.
Structured Query Language (SQL)
SQL (structured query language) is a computer language exclusive to a particular application domain in
contrast to some other general-purpose language (GPL) such as C, Java, or Python that is broadly applicable
across domains.
SQL is text oriented and designed for managing (accessing and manipulating) data.
SQL was adopted as a national standard by ANSI (American National Standards Institute) in 1992. It is
the standard language for relational database management systems.
SQL statements are used to select a particular part of the data, retrieve data from a database, and update
data in the database using CREATE, SELECT, INSERT, UPDATE, DELETE, and DROP
commands. SQL commands can be divided into four categories:
DDL (data definition language)
deals with the database schemas and structure
DML (data manipulation language)
deals with tasks like storing, modifying, retrieving, deleting, and updating the data in/from the database.
DCL (data control language)
used to uphold database security in multi-user environments. The database administrator (DBA) is responsible for granting/revoking privileges on database objects.
TCL (transaction control language)
enables you to control and handle transactions to maintain the integrity of the data within SQL statements.
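A short sketch, again using Python's sqlite3 module, of DDL and DML statements with a commit standing in as the TCL step; SQLite itself does not support DCL commands such as GRANT/REVOKE, so those are omitted, and the products table is a hypothetical example.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # DDL: define the schema (hypothetical products table).
    cur.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, price REAL)")

    # DML: store, modify, retrieve, and delete rows.
    cur.execute("INSERT INTO products VALUES (1, 'pen', 10.0)")
    cur.execute("INSERT INTO products VALUES (2, 'notebook', 45.0)")
    cur.execute("UPDATE products SET price = 12.0 WHERE id = 1")
    cur.execute("DELETE FROM products WHERE id = 2")
    print(cur.execute("SELECT id, name, price FROM products").fetchall())

    # TCL: commit (or roll back) the transaction to make the changes permanent.
    conn.commit()

    # DDL again: remove the table when it is no longer needed.
    cur.execute("DROP TABLE products")
    conn.close()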
BIG DATA MANAGEMENT - (LESSON 4)
The twenty-first century is characterized by the digital revolution. The Digital Revolution, also known as the
Third Industrial Revolution, started in the 1980s and sparked the advancement and evolution of technology
from analog electronic and mechanical devices to the shape of technology in the form of machine learning and
artificial intelligence today.
Prevalence of big data:
• The total amount of data generated by mankind is 2.7 zettabytes, and it continues to grow at an
exponential rate.
• In terms of digital transactions, according to an estimate by IDC, we shall soon be conducting nearly
450 billion transactions per day.
• Facebook analyzes 30+ petabytes of user-generated data every day.
ELEMENTS OF BIG DATA
Note: When we say large datasets, we mean data sizes ranging from petabytes to exabytes and more. Please note that 1 byte = 8 bits.
The term big data is used colloquially to describe the vast variety of data that is being generated.
When we describe traditional data, we tend to put it into 3 categories:
1. Structured data is highly organized information that can be easily stored in a spreadsheet or table
using rows and columns. Any data that we capture in a spreadsheet with clearly defined columns and
their corresponding values in rows is an example of structured data.
2. Unstructured data may have its own internal structure. It does not conform to the standards of
structured data where you define the field name and its type. Video files, audio files, pictures, and
text are best examples of unstructured data.
3. Semi-structured data tends to fall in between the two categories mentioned above. There is generally
a loose structure defined for data of this type, but we cannot define stringent rules like we do for storing
structured data. Prime examples of semi-structured data are log files and Internet of Things (IoT) data
generated from a wide range of sensors and devices.
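As a small illustration of the semi-structured case, the Python sketch below (with hypothetical IoT log lines) parses loosely structured JSON records into a table; fields that a device does not report simply come out as missing values.

    import json
    import pandas as pd

    # Hypothetical IoT log lines: each record is JSON, but the fields vary between devices.
    raw_logs = [
        '{"device": "sensor-1", "ts": "2024-01-01T00:00:00", "temp_c": 21.5}',
        '{"device": "sensor-2", "ts": "2024-01-01T00:00:05", "humidity": 0.43, "temp_c": 20.1}',
    ]

    records = [json.loads(line) for line in raw_logs]

    # Flatten the loose structure into a table; missing fields become NaN.
    table = pd.DataFrame(records)
    print(table)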
CHARACTERISTICS OF BIG DATA (4 characteristics)
1. Volume:
It is the amount of the overall data that is already generated (by either individuals or companies). The
Internet alone generates huge amounts of data. It is estimated that the Internet has around 14.3 trillion
live web pages, which amounts to 672 exabytes of accessible data.
2. Variety:
Data is generated from different types of sources that are internal and external to the organization such
as social and behavioral and also comes in different formats such as structured, unstructured (analog
data, GPS tracking information, and audio /video streams), and semi-structured data—XML, Email, and
EDI.
3. Velocity:
Velocity simply states the rate at which organizations and individuals are generating data in the world
today. For example, a study reveals that 400 hours of video are uploaded to YouTube every minute.
4. Veracity:
It describes the uncertainty inherent in the data, that is, whether the obtained data is correct and consistent. It is very rare that data presents itself in a form that is ready to consume. Considerable effort goes into processing data, especially when it is unstructured or semi-structured.
PROCESSING BIG DATA
1. Analytical usage of data:
Organizations process big data to extract information relevant to a field of study. This relevant information can then be used to make decisions for the future. Organizations use techniques like data mining, predictive analytics, and forecasting to get timely and accurate insights that help them make the best possible decisions. For example, we can provide online shoppers with product recommendations.
2. Enable new product development:
Recent successful startups are a great example of leveraging big data analytics for new product enablement. Companies such as Uber or Facebook use big data analytics to provide personalized services to their customers in real time.
APPLICATIONS OF BIG DATA ANALYSIS
1. Customer Analytics in the Retail industry
-Retailers, especially those with large outlets across the country, generate huge amounts of data in a variety of formats from various sources such as POS transactions, billing details, loyalty programs, and CRM systems. This data needs to be organized and analyzed in a systematic manner to derive meaningful insights.
-Customers can be segmented based on their buying patterns and spend at every transaction. Marketers can
use this information for creating personalized promotions.
-Organizations can also combine transaction data with customer preferences and market trends to understand
the increase or decrease in demand for different products across regions.
-This information helps organizations to determine the inventory level and make price adjustments.
2. Fraudulent claims detection in Insurance industry
- In industries like banking, insurance, and healthcare, fraudulent transactions mostly have to do with monetary transactions; those that are not caught might cause huge expenses and lead to a loss of reputation for a firm. Prior to the advent of big data analytics, many insurance firms identified fraudulent transactions using statistical methods/models. However, these models have many limitations and can prevent fraud only to a limited extent because model building can happen only on sample data.
Big data analytics enables the analyst to overcome this issue with volumes of data—insurers can combine internal claim data with social data and other publicly available data like bank statements, criminal records, and medical bills of customers to better understand consumer behavior and identify any suspicious behavior.
BIG DATA TECHNOLOGIES
Big data requires different means of processing such voluminous, varied, and scattered data compared to traditional data storage and processing systems like RDBMS (relational database management systems), which are good at storing, processing, and analyzing structured data only.
Distributed Computing and Parallel Computing
Loosely speaking, distributed computing is the idea of dividing a problem
into multiple parts, each of which is operated upon by an individual machine or computer.
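The same divide-the-problem idea can be sketched on a single machine with Python's multiprocessing module, where worker processes stand in for the individual computers; the squaring task is a placeholder chosen only for illustration.

    from multiprocessing import Pool

    def square(x):
        # The work performed independently on each part of the problem.
        return x * x

    if __name__ == "__main__":
        numbers = range(10)
        # Each worker process operates on a portion of the input in parallel.
        with Pool(processes=4) as pool:
            results = pool.map(square, numbers)
        print(results)  # [0, 1, 4, 9, ...]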
Distributed Computing and Parallel Computing Limitations and Challenges
• Multiple failure points: If a single computer fails and the other machines cannot reconfigure themselves in the event of the failure, the overall system can go down.
• Latency: It is the aggregated delay in the system because of delays in the completion of individual tasks. This leads to a slowdown in system performance.
• Security: Unless handled properly, there are higher chances of unauthorized user access on distributed systems.
• Software: The software used for distributed computing is complex, hard to develop, expensive, and requires a specialized skill set. This makes it harder for organizations to deploy distributed computing software in their infrastructure.
HADOOP FOR BIG DATA
Hadoop is the first open source big data platform that is mature and has widespread usage. Hadoop was created by Doug Cutting at Yahoo!, and derives its roots directly from the Google File System (GFS) and the MapReduce programming model for distributed computing.
THE 3 COMPONENTS OF HADOOP ARCHITECTURE
-Hadoop Distributed File System (HDFS) for file storage, to store large amounts of data;
-MapReduce for processing the data stored in HDFS in parallel; and
-a resource manager known as Yet Another Resource Negotiator (YARN) for ensuring proper allocation of resources.
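The MapReduce idea can be sketched in plain Python with the classic word-count example; the "blocks" below stand in for the chunks into which HDFS would split a large file, and the shuffle step is simplified to a plain concatenation of mapper outputs.

    from collections import defaultdict

    # Hypothetical input split into blocks, as HDFS would split a large file.
    blocks = ["big data needs big storage", "data drives decisions"]

    def map_phase(block):
        # Map: emit (word, 1) pairs for each word in a block.
        return [(word, 1) for word in block.split()]

    def reduce_phase(pairs):
        # Reduce: sum the counts for each distinct word.
        counts = defaultdict(int)
        for word, n in pairs:
            counts[word] += n
        return dict(counts)

    # Simplified shuffle: concatenate all mapper outputs before reducing.
    mapped = [pair for block in blocks for pair in map_phase(block)]
    print(reduce_phase(mapped))  # e.g., {'big': 2, 'data': 2, ...}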
• Cluster: A cluster is nothing but a collection of individual computers interconnected via a network. The
individual computers work together to give users an impression of one large system.
• Node: Individual computers in the network are referred to as nodes. Each node has pieces of Hadoop
software installed to perform storage and computation tasks.
• Master–slave architecture: Computers in a cluster are connected in a master–slave configuration. There is
typically one master machine that is tasked with the responsibility of allocating storage and computing duties to
individual slave machines.
• Master node: It is typically an individual machine in the cluster that is tasked with the responsibility of
allocating storage and computing duties to individual slave machines.
• DataNode: DataNodes are individual slave machines that store actual data and perform computational tasks
as and when the master node directs them to do so.
• Distributed computing: The idea of distributed computing is to execute a program across multiple machines,
each one of which will operate on the data that resides on the machine.
• Distributed File System: As the name suggests, it is a file system that is
responsible for breaking a large data file into small chunks that are then stored on individual machines.
Additionally, Hadoop has in-built salient features such as scaling, fault tolerance, and rebalancing. We
describe them briefly below.
• Scaling: On the technology front, organizations require a platform that can scale up to handle rapidly increasing data volumes, and they also need a scalability extension for existing IT systems in content management, warehousing, and archiving.
Hadoop can easily scale as the volume of data grows, thus circumventing the size limitations of
traditional computational systems.
• Fault tolerance: To ensure business continuity, fault tolerance is needed to ensure that there is no loss of
data or computational ability in the event of individual node failures. Hadoop provides excellent fault tolerance
by allocating the tasks to other machines in case an individual machine is not available.
• Rebalancing: As the name suggests, Hadoop tries to evenly distribute data among the connected systems so
that no particular system is overworked or is lying idle.
HADOOP ECOSYSTEM
DATA VISUALIZATION -(LESSON 5)
Data analytics is a burgeoning field—with methods emerging quickly to explore and make sense of the huge
amount of information that is being created every day. However, with any data set or analysis result, the
primary concern is in communicating the results to the reader. Unfortunately, human perception is not
optimized to understand interrelationships between large (or even moderately sized) sets of numbers.
However, human perception is excellent at understanding interrelationships between sets of data, such as
series, deviations, and the like, through the use of visual representations.
The classic example of “Anscombe’s Quartet” illustrates this point. In 1973, Anscombe created these sets of data to demonstrate the importance of visualizing data.
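A short matplotlib/seaborn sketch of the quartet is shown below; it assumes seaborn's bundled "anscombe" example dataset is available (load_dataset may fetch it over the network), and the figure layout is only one possible way to present the four panels.

    import matplotlib.pyplot as plt
    import seaborn as sns

    # Load Anscombe's four x-y datasets (identical summary statistics, very different shapes).
    df = sns.load_dataset("anscombe")

    fig, axes = plt.subplots(2, 2, figsize=(8, 6), sharex=True, sharey=True)
    for ax, (name, group) in zip(axes.flat, df.groupby("dataset")):
        ax.scatter(group["x"], group["y"])
        ax.set_title(f"Dataset {name}")
    fig.suptitle("Anscombe's Quartet")
    plt.tight_layout()
    plt.show()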
Humans are bombarded by information from multiple sources and through multiple channels. This information
is gathered by our five senses, and processed by the brain. However, the brain is highly selective about what it
processes and humans are only aware of the smallest fraction of sensory input. Much of sensory input is
simply ignored by the brain, while other input is dealt with based on heuristic rules and categorization
mechanisms; these processes reduce cognitive load. Data visualizations, when executed well, aid in the
reduction of cognitive load, and assist viewers in the processing of cognitive evaluations.
Six Meta-Rules for Data Visualization
1. Simplicity Over Complexity. The simplest chart is usually the one that communicates most clearly. Use the
“not wrong” chart—not the “cool” chart
2. Direct Representation. Always directly represent the relationship you are trying to communicate. Do not
leave it to the viewer to derive the relationship from other information
3. Single Dimensionality. In general, do not ask viewers to compare in two dimensions. Comparing
differences in length is easier than comparing differences in area
4. Use Color Properly. Never use color on top of color—color is not absolute
5. Use Viewers’ Experience to Your Advantage. Do not violate the primal perceptions of your viewers.
Remember, up means more
6. Represent the Data Story with Integrity. Chart with graphical and ethical integrity. Do not lie, either by
mistake or intentionally
Rules for graphical integrity are summarized below:
Use Consistent Scales. What this means is that when building axes in visualizations, the meaning of a
distance should not change, so if 15 pixels represents a year at one point in the axis, 15 pixels should not
represent 3 years at another point in the axis.
Standardize (Monetary) Units. “In time-series displays of money, deflated and standardized units [... ]
are almost always better than nominal units.” This means that when comparing numbers, they should be
standardized.
Present Data in Context. “Graphics must not quote data out of context” (Tufte, p. 60). When telling
any data story, no data has meaning until it is compared with other data.
Show the Data. “Above all else show the data” (Tufte, p. 92). Tufte argues that visualizers often fill significant portions of a graph with “non-data” ink. He argues that, as much as possible, one should show the data to the viewer in the form of the actual data and annotations that call attention to particular “causality” in the data, and thereby drive viewers to generate understanding.