Data Analytics

Data analytics involves collecting, processing, and analyzing data from various sources, categorized into primary, secondary, internal, and external data. It encompasses structured, semi-structured, and unstructured data, each requiring different storage and processing methods. The need for data analytics arises from its ability to inform decision-making, improve efficiency, and enhance customer experience in today's data-driven landscape.


Data Analytics

Sources and Nature of Data in Data Analytics
Data analytics involves collecting, processing, and analyzing
data from various sources. These sources can be categorized
into different types:
Data Sources
• Primary Data: Data that is collected first-hand through direct
interaction or original research. Methods include:
• Surveys: Questionnaires, online polls, or face-to-face interactions.
• Interviews: Direct conversations with subjects to gather
qualitative insights.
• Experiments: Controlled conditions to observe specific outcomes.
Sources of Data
• Secondary Data: Data collected by someone else, often available
publicly or through licensed databases. Examples include:
• Government Reports: Statistical data from census, economic surveys, etc.
• Research Papers: Academic journals, white papers.
• Market Data: Purchased or publicly available industry reports, financial data.
• Internal Data: Data generated from within an organization’s own
processes. This can include:
• Sales Data: Information from point-of-sale systems, CRM platforms.
• Operational Data: Logs from manufacturing, inventory, and supply chains.
• Customer Data: Behavioral data from customer interactions, user feedback.
Sources of Data

• External Data: Data obtained from outside the
organization, such as:
• Public Databases: Open data portals (e.g., government or
world health databases).
• Social Media: User-generated content, sentiment analysis,
trends.
• Third-party Data Providers: Data purchased from vendors.
Nature of Data
• Data used in analytics can come in different forms and qualities. It's important to
understand its characteristics:
• Structured Data: Organized data that follows a specific format, easily stored in
databases (e.g., SQL databases). Examples include:
• Tables of numbers: Sales figures, financial statements.
• Log data: User interaction logs, transaction logs.
• Unstructured Data: Data that doesn’t fit into predefined models or structures,
requiring more complex processing. Examples include:
• Text: Emails, social media posts, documents.
• Multimedia: Images, audio, and video files.
• Semi-Structured Data: Data that is not fully structured but has some organizational
properties, often stored in formats like XML or JSON. Examples include:
• Sensor Data: IoT device logs, environmental readings.
• Web Data: Web scraping results, clickstream data.
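The contrast between semi-structured and structured data can be illustrated with a short Python sketch: a nested JSON record (the kind produced by clickstream or IoT logging) is parsed and flattened into the fixed-column row a relational store expects. The record and field names here are invented for illustration.

```python
import json

# A hypothetical semi-structured clickstream record, as it might arrive
# from web scraping or an IoT device log (field names are illustrative).
raw = '{"user": "u42", "page": "/home", "meta": {"device": "mobile", "ms": 120}}'

record = json.loads(raw)  # parse JSON into nested Python dicts

# Flatten the nested structure into a fixed set of columns: the kind of
# row a structured (relational) store can hold directly.
row = {
    "user": record["user"],
    "page": record["page"],
    "device": record["meta"]["device"],
    "load_ms": record["meta"]["ms"],
}
print(row)
```

The flattening step is exactly what makes semi-structured data harder to store than structured data: the schema has to be decided by the analyst rather than being given in advance.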
Data Collection Methods
• Manual Collection: Human-driven, such as filling out surveys or manually
inputting data.
• Automated Collection: Data gathered through software tools, sensors, or
web scraping.

Understanding the sources and nature of data is critical for ensuring
that the analytics process yields meaningful insights and drives
decision-making.
Classification of Data in Data Analytics
• Data in analytics can be classified based on its structure, which
impacts how it is stored, processed, and analyzed. The three main
types of data classification are structured, semi-structured, and
unstructured data.

Data Type            | Structure                        | Examples                                   | Storage Method
Structured Data      | Predefined format (rows/columns) | SQL databases, financial records, POS data | Relational Databases (SQL, Oracle)
Semi-Structured Data | Partially organized (tags/keys)  | JSON, XML, log files, NoSQL documents      | Document Stores (MongoDB, CouchDB)
Unstructured Data    | No specific structure            | Text documents, multimedia, social media   | Data Lakes, Cloud Storage (HDFS)
Characteristics of Data in
Data Analytics
• Understanding the characteristics of data is essential in data
analytics, as it affects the way data is collected, processed,
stored, and analyzed. Below are the key characteristics that
define data in the context of analytics.
• Accuracy: Accuracy refers to how closely the data
represents the true values or conditions of the entities being
measured. High accuracy ensures that the data is reliable
and can be used confidently in analysis and decision-making.
• Example: A weather sensor measuring the correct temperature
without any deviations or errors.
Characteristics of Data in
Data Analytics
• Completeness: Completeness refers to whether all necessary data is
available. Missing or incomplete data can lead to inaccurate results
or biased conclusions in the analysis process.
• Example: A customer database where all customers have complete
information (name, address, phone number, etc.) vs. a database where
some key information is missing.
• Consistency: Consistency ensures that data across different sources
or datasets follows the same formats, conventions, and units of
measurement. It ensures that data does not conflict when combined
from multiple sources.
• Example: Sales data across different branches of a company recorded in the
same currency, with uniform product codes.
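The completeness and consistency checks described above can be sketched in a few lines of plain Python. The toy customer records and the "upper-case country code" rule are invented for illustration; real pipelines apply the same logic with tools like Pandas.

```python
# A minimal sketch of completeness and consistency checks on a toy
# customer dataset (records and rules are invented for illustration).
customers = [
    {"name": "Asha", "phone": "555-0101", "country": "IN"},
    {"name": "Ben", "phone": None, "country": "in"},  # missing phone, odd casing
]

required = ("name", "phone", "country")

# Completeness: every required field must be present and non-empty.
incomplete = [c for c in customers if any(not c.get(f) for f in required)]

# Consistency: enforce one convention (upper-case country codes) across records.
for c in customers:
    c["country"] = c["country"].upper()

print(len(incomplete))                    # records failing the completeness check
print({c["country"] for c in customers})  # a single, consistent convention
```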
Big Data
• Big Data refers to datasets that are so large and complex that they
cannot be processed using conventional data management tools.
The key characteristics of Big Data are:
• Volume: The size of the data is extremely large, often ranging from
terabytes to petabytes and beyond.
• Velocity: Data is generated and processed at high speed, often in real
time or near-real-time.
• Variety: Data comes in different formats, such as structured, semi-structured,
and unstructured data (e.g., text, images, videos, social
media posts).
Big Data

Other characteristics include:
• Veracity: The trustworthiness and reliability of data; unstructured
data in particular is often noisy or uncertain.
• Value: The potential insights or business value derived from
Big Data analytics.
Components of a Big Data
Platform
• A Big Data platform typically includes a variety of tools and
technologies to handle the challenges posed by Big Data.
Some key components are:
• Data Storage:
• Distributed File Systems: Big Data platforms often use distributed
storage solutions to handle large datasets.
• Examples include the Hadoop Distributed File System (HDFS) and cloud
storage solutions (Amazon S3, Google Cloud Storage).
• NoSQL Databases: These databases (e.g. MongoDB) are designed
to store and manage semi-structured and unstructured data.
Components of a Big Data
Platform
• Data Processing Frameworks:
• Batch Processing: Frameworks like Apache Hadoop allow for the
processing of large datasets in batches.
• Stream Processing: Tools like Apache Kafka and Apache Flink enable
real-time or near-real-time processing of continuous data streams.
• Data Management Tools:
• Data Integration: Tools like Apache NiFi and Talend help in integrating
various data sources, allowing data to be collected, cleaned, and
transformed for analysis.
• Data Governance: Ensures data security, privacy, and compliance with
regulations using tools like Apache Ranger or AWS Lake Formation.
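The stream-processing idea above can be mimicked in miniature with Python generators: events are filtered and forwarded one at a time as they "arrive", rather than loaded as a batch. The event amounts and fraud threshold are invented; platforms like Kafka and Flink do this at scale, distributed and fault-tolerant.

```python
# A toy stand-in for a stream-processing pipeline: events flow through
# and are filtered one at a time instead of being processed as a batch.
# (Real platforms like Kafka or Flink distribute this across machines.)

def event_stream():
    """Yield hypothetical transaction amounts as they 'arrive'."""
    for amount in [120.0, 15.5, 990.0, 42.0, 1500.0]:
        yield amount

def flag_large(stream, threshold=1000.0):
    """Emit only events above a threshold, e.g. for fraud review."""
    for amount in stream:
        if amount > threshold:
            yield amount

flagged = list(flag_large(event_stream()))
print(flagged)  # events that would be routed onward for review
```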
Components of a Big Data
Platform
• Data Analytics Tools:
• Big Data Querying: Apache Hive and Apache Impala allow users
to query large datasets using SQL-like queries.
• Machine Learning Frameworks: Big Data platforms often
integrate with machine learning libraries (e.g., Apache Spark
MLlib, TensorFlow) for advanced predictive analytics and data
modeling.
• Data Visualization: Tools like Tableau, Power BI, and
Apache Superset provide visual representations of large
datasets, allowing for easier interpretation and insights.
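The SQL-like querying that Hive and Impala provide over large datasets can be previewed at small scale with Python's built-in sqlite3 module. The `sales` table and its values are invented; the point is that the same declarative aggregate query applies whether the engine is SQLite or a distributed warehouse.

```python
import sqlite3

# An in-memory database standing in for a warehouse table; Hive/Impala
# run essentially this kind of SQL over distributed storage instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 100.0), ("south", 250.0), ("north", 50.0)],
)

# A declarative aggregate query: total sales per region.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)
conn.close()
```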
Big Data Technologies
• Apache Hadoop: One of the most widely used frameworks, Hadoop enables
distributed storage (HDFS) and distributed processing (MapReduce) of large datasets.
• Apache Spark: A powerful data processing engine that supports both batch and real-
time analytics. Spark is often used for large-scale machine learning, graph processing,
and stream analytics.
• NoSQL Databases: Unlike traditional relational databases, NoSQL databases such as
MongoDB, Cassandra, and HBase can handle unstructured or semi-structured data,
providing scalability and flexibility.
• Apache Kafka: A distributed streaming platform that enables the ingestion of real-
time data streams for processing.
• Cloud-based Big Data Platforms: Major cloud providers such as Amazon Web Services
(AWS), Google Cloud Platform (GCP), and Microsoft Azure offer scalable Big Data
solutions, including data lakes, distributed computing, and machine learning services.
Need for Data Analytics
• Data analytics is essential in today's digital age for several reasons, offering key benefits
across industries and organizations:
• Informed Decision-Making: Analytics helps businesses make data-driven decisions by
uncovering trends, patterns, and insights from large datasets.
• Example: Companies use sales data to optimize product pricing and marketing
strategies.
• Improving Efficiency: By analyzing operational data, businesses can identify bottlenecks,
reduce costs, and improve overall efficiency.
• Example: Manufacturers use predictive analytics for maintenance, reducing downtime.
• Enhancing Customer Experience: Data analytics provides insights into customer
preferences, allowing businesses to personalize services and improve customer
satisfaction.
• Example: E-commerce platforms recommend products based on user behavior and
purchase history.
Need for Evolution of Analytics
Scalability

• The evolution of analytics scalability is driven by the
increasing complexity, volume, and diversity of data in
today's digital landscape. The need for scalable analytics
arises from the following factors:
• Growing Data Volume: With the explosion of data from social
media, IoT devices, and other sources, traditional systems
cannot handle the massive datasets. Scalable analytics
enables processing and analyzing large data efficiently.
• Example: A retailer analyzing billions of transactions to optimize
inventory and supply chain.
• Real-Time Analytics: As businesses demand quicker
insights for real-time decision-making, scalable systems are
essential to process high-velocity data streams without
delays.
• Example: Financial institutions using real-time analytics to detect
fraud instantly.
• Variety of Data: Data comes in structured, semi-
structured, and unstructured formats (e.g., text, images,
videos), requiring scalable analytics platforms that can
process and integrate diverse data sources.
• Example: Social media analytics that combine text, image, and
video content for sentiment analysis.
Analytic Process and
Tools
• The data analytics process consists of several steps that help in
turning raw data into actionable insights:
• Data Collection: Gathering data from various sources such as
databases, sensors, social media, or surveys.
• Tools: Web scraping tools (e.g., Scrapy), database management systems
(e.g., MySQL, MongoDB).
• Data Cleaning: Preparing the data by handling missing values,
correcting errors, and removing duplicates to ensure data quality.
• Tools: OpenRefine, Trifacta, Pandas (Python).
• Data Exploration: Analyzing the data to understand its patterns and
distributions using descriptive statistics or visualizations.
• Tools: Excel, Tableau, Power BI, Python libraries (e.g., Matplotlib, Seaborn).
• Data Modelling: Applying statistical models, machine learning
algorithms, or predictive analytics to extract meaningful insights.
• Tools: R, Python (e.g., Scikit-learn, TensorFlow), SAS, SPSS.
• Data Interpretation: Drawing conclusions from the analysis and
translating them into actionable business strategies or solutions.
• Tools: Visualization tools (e.g., Power BI, Tableau), reporting tools (e.g., Google
Data Studio).
• Deployment & Monitoring: Implementing the analytics models in real-
world scenarios and continuously monitoring the results for
improvements.
• Tools: Apache Kafka (real-time processing), Jenkins (automation), cloud
platforms (e.g., AWS, Azure).
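The cleaning and exploration steps above can be sketched end to end in plain Python. The dataset is invented, with the usual defects (a missing value and duplicates); in practice tools like Pandas and OpenRefine perform the same operations at scale.

```python
import statistics

# Raw observations with typical quality problems: a missing value
# and duplicate entries (values are invented for illustration).
raw = [10.0, 12.5, None, 12.5, 14.0, 10.0, 11.5]

# Data cleaning: drop missing values, then remove duplicates
# while preserving the original order.
cleaned = [x for x in raw if x is not None]
seen, deduped = set(), []
for x in cleaned:
    if x not in seen:
        seen.add(x)
        deduped.append(x)

# Data exploration: simple descriptive statistics.
print(len(deduped))              # remaining observations
print(statistics.mean(deduped))  # central tendency
print(statistics.median(deduped))
```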
