Data
- unorganized; a given fact, statement, or image
- Latin word that is plural form of datum (individual value in a collection of data)
- neuter past participle of dare, “to give”
- first used to mean "transmissible and storable computer information" in 1946
- collection of discrete or continuous values that convey information
- collected using techniques such as measurement, observation, query, or analysis, and is
typically represented as numbers or characters
● Prior to analysis, raw data (or unprocessed data) is typically cleaned: Outliers are
removed and obvious instrument or data entry errors are corrected
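A minimal sketch of the cleaning step above, assuming hypothetical sensor readings and a simple interquartile-range rule for outliers. The data, the function name, and the 1.5 threshold are illustrative assumptions, not a prescribed method.

```python
import statistics

def remove_outliers(values, k=1.5):
    """Drop values outside [Q1 - k*IQR, Q3 + k*IQR] (a common rule of thumb)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]

# Hypothetical raw readings; 250.0 looks like an instrument error.
raw = [20.1, 19.8, 20.4, 20.0, 250.0, 19.9, 20.2]
cleaned = remove_outliers(raw)
print(cleaned)
```

In practice an outlier should be inspected before it is dropped, since it may be a real observation rather than an error.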
Field Data
- collected in an uncontrolled in-situ environment
Experimental Data
- generated in the course of a controlled scientific experiment
- analyzed using calculation, reasoning, discussion, presentation, visualization
Computer Data
- information that is stored and processed digitally on a computer
- may be loaded into memory and processed by the computer's CPU, then stored as files
in folders on a hard drive or solid-state drive
- data is stored as binary data (a series of 1s and 0s called bits)
- 8 bits = 1 byte (the basic unit of storage)
- Text data (Unicode); image data (pixels)
- Flash drives can easily transfer data from one computer to another
- Computers with optical drives can burn data to (re)writable CDs and DVDs
- Network-connected computers can transfer data from one computer to another over a
local network or the Internet
- Reading or transferring digital data does not cause any deterioration or quality loss over
time
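A small sketch of how text becomes bytes and bits, assuming UTF-8 as the Unicode encoding (the notes do not name a specific encoding). It also shows that decoding the bytes returns the original text unchanged.

```python
text = "Hi"

# Unicode text -> bytes (assumption: UTF-8 encoding)
encoded = text.encode("utf-8")

# Each byte is 8 bits; show them as 1s and 0s
bits = " ".join(f"{byte:08b}" for byte in encoded)

# Reading the bytes back gives the same text, with no quality loss
decoded = encoded.decode("utf-8")

print(len(encoded), "bytes:", bits)
print(decoded == text)
```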
Information - data that have meaning within a context
● Data > Processing > Information
Science
- system of knowledge covering general truths or the operation of general laws
Data Science
- study of data to extract meaningful insights for business
- a multidisciplinary approach that combines principles and practices from different fields
to analyze large amounts of data
- combines math and statistics…to uncover actionable insights hidden in an organization’s
data
Data Ingestion
- data collection--both raw structured and unstructured data from all relevant sources
using a variety of methods
- ex. manual entry, web scraping, and real-time streaming data
Data storage and data processing
- cleaning data, deduplicating, transforming and combining the data using ETL (extract,
transform, load) jobs or other data integration technologies
- essential for promoting data quality before loading into a data warehouse, data lake, or
other repository
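The ETL idea above can be sketched with the Python standard library. The CSV text, the table name `sales`, and the in-memory SQLite database standing in for a data warehouse are all hypothetical choices made for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical raw export with inconsistent casing, stray spaces, and a duplicate row.
RAW_CSV = """customer,amount
Alice, 10.50
bob,20
alice,10.50
"""

def extract(text):
    """Extract: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: clean each row and drop duplicates, keeping first occurrences."""
    seen, out = set(), []
    for row in rows:
        record = (row["customer"].strip().lower(), float(row["amount"]))
        if record not in seen:
            seen.add(record)
            out.append(record)
    return out

def load(records, conn):
    """Load: write the cleaned records into the target repository."""
    conn.execute("CREATE TABLE sales (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", records)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
rows = conn.execute("SELECT customer, amount FROM sales ORDER BY customer").fetchall()
print(rows)
```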
Data analysis
- conduct an exploratory data analysis to examine biases, patterns, ranges, and
distributions of values within the data
- drives hypothesis generation for A/B testing
- allows analysts to determine the data’s relevance for use within modeling efforts
- organizations can become reliant on these insights for business decision making,
allowing them to drive more scalability
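A minimal exploratory pass over a small data set might look like the sketch below. The order values and channel labels are made-up sample data, and the chosen summary statistics are just one reasonable starting set.

```python
import statistics
from collections import Counter

# Hypothetical order values and the channel each order came from.
order_values = [12.0, 15.5, 14.0, 13.5, 90.0, 12.5, 14.5]
channels = ["web", "web", "store", "web", "web", "store", "web"]

# Ranges and central tendency of a numeric column
summary = {
    "count": len(order_values),
    "min": min(order_values),
    "max": max(order_values),
    "mean": statistics.mean(order_values),
    "median": statistics.median(order_values),
    "stdev": statistics.stdev(order_values),
}

# Distribution of a categorical column
channel_counts = Counter(channels)

print(summary)
print(channel_counts)
```

A mean well above the median, as in this sample, hints at a skewed distribution or an outlier worth checking before modeling.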
Communicate
- insights are presented as reports and other data visualizations that make the
insights, and their impact on the business, easier to understand
Data Science
- considered a discipline
Data Scientists
- practitioners within that field
- not necessarily directly responsible for all the processes involved in the data science
lifecycle
- make recommendations about what sort of data is useful or required
- responsibilities can commonly overlap with a data analyst, particularly with exploratory
data analysis and data visualization
- skillset is typically broader than the average data analyst
- works on new ways to capture, store, manipulate and analyze that data
Data Analyst
- makes sense out of existing data through routine analysis and writing reports
Business intelligence (BI)
- umbrella term for the technology that enables data preparation, data mining, data
management, and data visualization
- geared toward static (unchanging) data that is usually structured
Data Science Tools
R (RStudio) - R is an open-source programming language for statistical computing and
graphics; RStudio is a development environment for working with R
Python - dynamic and flexible programming language
SAS - comprehensive tool suite, including visualizations and interactive dashboards, for
analyzing, reporting, data mining, and predictive modeling
IBM SPSS - advanced statistical analysis
Excel - spreadsheet program created by Microsoft that organizes numbers and data with
formulas and functions
Cloud Computing
- scales data science by providing access to additional processing power, storage, and
other tools required for data science projects
Data Analytics
- science of analyzing raw data to draw conclusions about that information
- process of manipulating data to extract useful trends and hidden patterns which can help
us derive valuable insights to make business predictions
1. Identify the business question you’d like to answer
2. Collect raw data sets
3. Clean the data to prepare it for analysis
4. Analyze the data
5. Interpret the results of your analysis
Analytics
- discovery and communication of meaningful patterns in data
- relies on the simultaneous application of statistics, computer programming, and
operations research to quantify performance
- favors data visualization to communicate insight
- scientific process of transforming data into insight for making better decisions
- Data analytics (aims to get actionable insights resulting in smarter decisions and better
business outcomes)
● critical to designing and building a data warehouse or Business Intelligence (BI) architecture
Uses of Data Analytics
1. Improved Decision-Making
2. Better Customer Service – churn modelling (predict or identify what leads to customer
churn and change those things accordingly)
3. Efficient Operations
4. Effective Marketing - market segmentation techniques
Types of Data Analytics
1. Descriptive analytics
- looks at data and analyzes past events for insight into how to approach future
events
- quantifies relationships in data in a way that is often used to classify customers or
prospects into groups
- identifies many different relationships between customer and product
- ex. Data queries, reports, descriptive stats, data dashboard
2. Diagnostic analytics
- generally uses historical data, rather than other data, to answer a question or
solve a problem
- tries to find dependencies and patterns in the historical data of the particular
problem
- Ex. data discovery, data mining, correlations
3. Predictive analytics
- turn the data into valuable, actionable information
- uses data to determine the probable outcome of an event or a likelihood of a
situation occurring
- Techniques: linear regression, data mining, and Time Series Analysis and
Forecasting
- Basic cornerstones: predictive modelling, decision analysis and optimization,
transaction profiling
4. Prescriptive analytics
- automatically synthesizes big data, mathematical sciences, business rules, and
machine learning to make a prediction, then suggests decision options that
take advantage of the prediction
- suggests actions that benefit from the predictions and shows the decision maker
the implications of each decision option
- anticipates not only what will happen and when it will happen but also why it will
happen
- Ex. recommending optimal pricing strategies, supply chain adjustments, or
personalized treatment plans in healthcare
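To make the difference between descriptive and predictive analytics concrete, here is a small sketch using simple linear regression, one of the techniques listed above. The monthly sales figures are invented and perfectly linear so the result is easy to check by hand, and `fit_line` is a hypothetical helper, not a library function.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical sales per month
months = [1, 2, 3, 4, 5]
sales = [100, 120, 140, 160, 180]

# Descriptive: summarize what already happened
average_sales = sum(sales) / len(sales)

# Predictive: estimate the probable outcome for a future month
slope, intercept = fit_line(months, sales)
forecast_month_6 = slope * 6 + intercept

print(average_sales, slope, intercept, forecast_month_6)
```

A prescriptive step would go further and use such a forecast to recommend an action, for example how much stock to order.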
Further scope:
1. Retail - study sales patterns and consumer behavior
2. Healthcare - evaluate patient data
3. Finance - analyze investment data and spot trends
4. Marketing - study consumer behavior for marketing strategies
5. Manufacturing - examine production data
6. Transportation - evaluate logistics data
Data collection
- process of acquiring, collecting, extracting, and storing the voluminous amount of data
which may be in the structured or unstructured form
- Step before analyzing data
- collects raw data, which is then processed to obtain information (knowledge)
- main goal of data collection is to collect information-rich data
- Qualitative data and quantitative data
Actual data is divided into two:
1. Primary data
- Raw, original, and extracted directly from source
- Interview, survey method, observation, and experimental
2. Secondary data
- has already been collected and is reused for another valid purpose
- Internal source (found within org)
- External source (through external third party resources)
Other sources: sensor data, satellite data, web traffic, and open data
Data Source
- origin of a specific set of information
Big data
- Extremely large data sets used by data analysts; their data sources split into two
categories
1. Machine data sources - labeled by users, stored in the input machine, and not
easily shareable
2. File data sources - reside within single, shareable files, allowing multiple users
to access and edit the data from different locations
Qualitative Data (Categorical Data)
- describes the object under consideration using a finite set of discrete classes
- can’t be counted or measured easily using numbers and is therefore divided into
categories
a. Nominal
- set of values that don’t possess a natural ordering
- not quantifiable and cannot be measured through numerical units
b. Ordinal
- set of values that have a natural ordering while maintaining their class of values
- knowing whether data is nominal or ordinal helps us decide which encoding
strategy can be applied to which type of data
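The link between data type and encoding strategy can be sketched as follows: one-hot encoding for nominal values (no order to preserve) and integer codes for ordinal values (order preserved). The sample categories, the `one_hot` helper, and the `SIZE_ORDER` mapping are illustrative assumptions.

```python
colors = ["red", "green", "blue", "green"]      # nominal: no natural order
sizes = ["small", "large", "medium", "small"]   # ordinal: small < medium < large

def one_hot(values):
    """Encode nominal values as 0/1 columns, one column per category."""
    categories = sorted(set(values))
    return categories, [[1 if v == c else 0 for c in categories] for v in values]

# Assumed ordering for the ordinal feature
SIZE_ORDER = {"small": 0, "medium": 1, "large": 2}

color_categories, color_encoded = one_hot(colors)
size_encoded = [SIZE_ORDER[s] for s in sizes]

print(color_categories, color_encoded)
print(size_encoded)
```

Using integer codes for nominal data would suggest an order that does not exist, which is why the two types are encoded differently here.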
Quantitative Data
- quantifies things using numerical values, which makes the data countable or
measurable in nature
- there can be an infinite number of values a feature can take
a. Discrete
- numerical values that are integers or whole numbers
- cannot be measured; it can only be counted, as the objects included in discrete
data have a fixed value
- identified through charts, including bar charts, pie charts, and tally charts
b. Continuous
- fractional numbers are considered as continuous values
- can break down into smaller pieces and can take any value
- represented using a graph that easily reflects value fluctuation by the highs and
lows of the line through a certain period of time
Excel Data Types
1. Number data
- includes any kind of number
- Ex. monetary totals, whole numbers, percentages, decimals, dates, times,
integers, phone numbers
2. Text data
- includes characters such as alphabetical, numerical and special symbols
- you can use calculations on number data but not text data
- manually change the format of a cell to ensure it operates the way you want
- ex. Words, sentences, dates, times, addresses
3. Logical data
- either TRUE or FALSE, usually as the product of a test or comparison
- use a function to determine whether the data in your spreadsheet meets different
measures
a. AND
- help you determine whether your data meets multiple conditions
- ex. use this function to test if data in one cell is larger than a
certain amount and the data in another cell is also larger than
another amount
b. OR
- use this function to determine that at least one of your arguments
meets your conditions
- if none of the data matches your conditions, Excel produces a
FALSE value
c. XOR
- stands for "Exclusive Or": with two arguments, it returns TRUE only when exactly
one of them is TRUE (with more arguments, when an odd number are TRUE)
- Ex. use this function to ensure that only one of your cells contains
a certain value
d. NOT
- reverses a logical value: returns TRUE when its argument is FALSE, and FALSE
when it is TRUE
- use this function to flag data that doesn't match your conditions: each
non-matching value is marked TRUE so you can assess possible patterns in it
4. Error data
- occurs when Excel recognizes a mistake or missing information while processing
your entry
- produces an error value, such as #VALUE!
- "#" character at the beginning of each error value can help you easily recognize
these instances
a. #NAME?
- occurs when text inside a formula has no quotes or is missing a
beginning or end quote
b. #DIV/0!
- arise if you try dividing a number by zero
c. #REF!
- invalid cell reference error value may result if you remove or paste
items in a cell or range of cells where you previously entered a
formula
- To correct: undo your previous action and place your new data in a
cell or cell range that doesn't contain a formula
d. #NUM!
- appears if a formula or function contains an invalid numeric value
- also appears if the total that a formula or function produces is too
large for Excel to represent in a cell
e. #N/A
- enter this error value when you want to indicate to yourself areas
where you can enter a value later
- Excel may also automatically populate this value if imported data
contains empty or unreadable cells
f. #VALUE!
- indicates that an argument or operator in a function or formula is
invalid
- Ex. if you add cells with the + operator (such as =A1+B1) and one
cell contains alphabetical characters, you can get a #VALUE!
result
g. #NULL!
- If you're referencing the intersection between a range of cells in a
function, you may see this error value because those cells don't
actually intersect
- also appears if the cell ranges in a function are missing
separating commas
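The logical functions AND, OR, XOR, and NOT described under Logical data can be mirrored in Python to check their truth tables. The `excel_*` names below are hypothetical helpers written for this sketch, not part of Excel or any library, and the cell values are made up.

```python
def excel_and(*args):
    """TRUE only when every condition is TRUE."""
    return all(args)

def excel_or(*args):
    """TRUE when at least one condition is TRUE."""
    return any(args)

def excel_xor(*args):
    """TRUE when an odd number of conditions are TRUE."""
    return sum(bool(a) for a in args) % 2 == 1

def excel_not(arg):
    """Reverses a logical value."""
    return not arg

# Hypothetical cell values
a1, b1 = 120, 40

both_large = excel_and(a1 > 100, b1 > 100)    # like =AND(A1>100, B1>100)
any_large = excel_or(a1 > 100, b1 > 100)      # like =OR(A1>100, B1>100)
only_one_large = excel_xor(a1 > 100, b1 > 100)
a1_not_large = excel_not(a1 > 100)

print(both_large, any_large, only_one_large, a1_not_large)
```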
Tips for using Excel data types
1. Use the Excel TEXT function
- allows you to format data in a cell as text data
- =TEXT(value, format_text)
2. Limit number data to 15 digits
- Excel allows you to enter large numbers, but stores only 15 significant digits (not
counting commas and decimal points); digits past the 15th are changed to zeros
- ex. instead, use the letter B to signify that the number is in the billions
3. Check data after exporting
- when you export data into a spreadsheet, Excel may sometimes revert settings based
on what it thinks certain values represent
- Checking the new spreadsheet with the old one may help you identify missing or
altered formats so you can correct them
4. Search for the number sign
- to check your spreadsheet for error values, press Ctrl+F on your
keyboard and search for the "#" symbol
5. Determine number data display
- change the settings for that cell in the Format menu
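The 15-digit tip above can be illustrated with a short sketch that mimics what happens to a long number entered as number data, compared with the same value kept as text. The function `excel_like_number` is a hypothetical helper that assumes its input is a string of digits only (no sign, commas, or decimal point), and the 16-digit ID is made up.

```python
def excel_like_number(digits, limit=15):
    """Mimic a 15-significant-digit limit: digits past the limit become zeros."""
    if len(digits) <= limit:
        return int(digits)
    return int(digits[:limit] + "0" * (len(digits) - limit))

# Hypothetical 16-digit ID
account_id = "1234567890123456"

as_number = excel_like_number(account_id)   # last digit is lost
as_text = account_id                        # stored as text, every digit is kept

print(as_number)
print(as_text)
```

This is why long identifiers such as account or card numbers are usually better stored as text data than as number data.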