
Lecture-7

Big Data (KCS-061)


Unit 1: Introduction to Big Data
• Types of digital data
• History of Big Data innovation
• Introduction to Big Data platform, drivers for Big Data
• Big Data architecture and characteristics
• 5 Vs of Big Data
• Big Data technology components
• Big Data importance and applications
• Big Data features – security, compliance, auditing and protection
• Big Data privacy and ethics
• Big Data Analytics
• Challenges of conventional systems
• Intelligent data analysis, nature of data, analytic processes and tools, analysis vs
reporting, modern data analytic tools
Nature of Data

• The data in Big Data can be any of the following:

• Structured

• Unstructured

• Semi-structured
• Usually, data is in an unstructured format, which makes extracting
information from it difficult.
• According to Merrill Lynch, 80–90% of business data is either
unstructured or semi-structured.
• Gartner likewise estimates that unstructured data constitutes 80% of
all enterprise data.
Formats of Digital Data
(Figure: percent distribution of the three forms of data: structured, unstructured, and semi-structured)
• Structured
By structured data, we mean data that can be processed, stored, and retrieved in a fixed
format. It refers to highly organized information that can be readily and seamlessly
stored and accessed from a database by simple search engine algorithms. For instance,
the employee table in a company database will be structured as the employee details,
their job positions, their salaries, etc., will be present in an organized manner.
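
As a minimal sketch of this idea (the table name, columns, and values here are invented for illustration, not from the original), structured data can live in a relational table and be retrieved with a fixed-format query:

```python
# Structured data sketch: a relational employee table with a fixed schema.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, "
    "position TEXT, salary REAL)"
)
conn.executemany(
    "INSERT INTO employee (name, position, salary) VALUES (?, ?, ?)",
    [("Asha", "Analyst", 52000.0), ("Ravi", "Engineer", 61000.0)],
)
# Because the format is fixed, a simple query retrieves exactly the
# fields asked for, with no parsing needed.
rows = conn.execute(
    "SELECT name, salary FROM employee WHERE salary > 55000"
).fetchall()
print(rows)  # [('Ravi', 61000.0)]
```

The fixed schema is what makes "readily and seamlessly stored and accessed" possible: every record has the same fields in the same types.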

• Unstructured
This refers to data that lacks any specific form or structure, which makes it
very difficult and time-consuming to process and analyze. Email is a common
example of unstructured data.
• Semi-structured
This data contains both of the formats mentioned above, that is, structured and
unstructured data. To be precise, it refers to data that has not been classified
under a particular repository (database) but still contains vital information or
tags that segregate individual elements within the data.
Example: data in an XML file
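
The XML example can be sketched as follows; the fragment and field names are invented, but they show how tags mark out individual elements even when records differ in shape:

```python
# Semi-structured data sketch: an XML fragment has no rigid table schema,
# but its tags still segregate individual elements.
import xml.etree.ElementTree as ET

doc = """
<employees>
  <employee id="1"><name>Asha</name><role>Analyst</role></employee>
  <employee id="2"><name>Ravi</name></employee>
</employees>
"""
root = ET.fromstring(doc)
# Fields may be missing (the second record has no <role>), which is what
# makes the data "semi" structured, yet the tags let us extract values.
names = [e.findtext("name") for e in root.iter("employee")]
print(names)  # ['Asha', 'Ravi']
```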
The Analytic Process
• An analysis process contains all or some of the following phases:

• Business Understanding

• Data Collection and Understanding

• Data Preparation

• Modeling

• Evaluation

• Deployment
1. Business Understanding:

• This step focuses on understanding the business in all its different aspects. It involves the
following steps:

a) Identify the goal and frame the business problem.

b) Gather information on resources, constraints, assumptions, risks, etc.

c) Prepare the analytical goal.

d) Prepare a flow chart of the process.
2. Data Collection:

• Collecting data is an important task in executing a project plan accurately.

• In this phase, data from different data sources is collected first and then described in terms of its
application and the needs of the project.

• This process is also called data exploration.

• Exploration of the data is required to ensure the quality of the collected data.
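
A first quality check during exploration might be sketched like this (the records and field names are invented for illustration):

```python
# Data-exploration sketch: profile each field of the collected records
# for missing values before analysis begins.
records = [
    {"name": "Asha", "age": 29, "city": "Pune"},
    {"name": "Ravi", "age": None, "city": "Delhi"},
    {"name": None, "age": 34, "city": "Delhi"},
]

def missing_counts(rows):
    """Count None values per field: a basic quality check on collected data."""
    counts = {}
    for row in rows:
        for field, value in row.items():
            # bool adds as 0/1, so this tallies missing entries per field
            counts[field] = counts.get(field, 0) + (value is None)
    return counts

print(missing_counts(records))  # {'name': 1, 'age': 1, 'city': 0}
```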


3. Data Preparation:

• In this step, the provided data is prepared and cleaned.

• In other words, unnecessary or unwanted data is removed in this phase.

4. Data Modeling:

• In this phase, a model is created by using a data modeling technique.

• The data model is used to analyze the relationship between different selected objects in the data.

• Test cases are created to assess the applicability of the model, and the data is structured according
to the model.
5. Data Evaluation:

• The results obtained from the different test cases are evaluated and reviewed for errors.

• After validating the results, analysis reports are created for determining the next plan of action.

6. Deployment:

• In this phase, the plan is finalized for deployment.

• The deployed plan is constantly checked for errors and maintained.

• This process is also termed reviewing the project.
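
The phases above can be sketched as a toy pipeline; all function names and data here are invented for illustration, not a prescribed API:

```python
# Toy end-to-end sketch of the analytic phases on invented sales records.

def collect():                      # 2. Data Collection
    return [{"price": 10, "sold": 5}, {"price": 12, "sold": None},
            {"price": 11, "sold": 7}]

def prepare(rows):                  # 3. Data Preparation: drop incomplete rows
    return [r for r in rows if None not in r.values()]

def model(rows):                    # 4. Modeling: average units sold
    return sum(r["sold"] for r in rows) / len(rows)

def evaluate(result, expected_range=(0, 100)):  # 5. Evaluation: sanity-check
    return expected_range[0] <= result <= expected_range[1]

# 1. Business understanding fixed the question ("what is typical demand?");
# 6. deployment would wrap this flow in a scheduled, monitored job.
clean = prepare(collect())
avg_sold = model(clean)
assert evaluate(avg_sold)
print(avg_sold)  # 6.0
```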


• Phases of analysis: (figure omitted)
Analysis vs Reporting
• Sometimes the line between reporting and analysis tends to blur.
• We need to be able to distinguish between these two areas.

• Reporting:
• It is a process in which data is organized and summarized in an easy-to-understand format.
• Reports enable organizations to monitor various performance parameters and improve customer
satisfaction.

• Analysis:
• It is a process in which data and reports are examined to derive insights from them.
• These insights help an organization perform important tasks in a timely manner, such as
planning a strategy, taking important business decisions, introducing a new product, and
improving customer satisfaction.
• In simple words, reporting can be considered a process in which raw data is transformed into
useful information, and analysis a process that transforms information into insights.

• While both draw upon the same collected online data, reporting and analysis are very different in
terms of their purpose, tasks, outputs, delivery, and value.
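
The contrast can be sketched with a few lines of Python on invented sales data: the first step produces a report (information), the second examines that report for an insight:

```python
# Reporting vs. analysis in miniature (all data invented for illustration).
sales = [("north", 120), ("south", 80), ("north", 130), ("south", 70)]

# Reporting: organize and summarize raw data into total sales per region.
report = {}
for region, amount in sales:
    report[region] = report.get(region, 0) + amount
print(report)  # {'north': 250, 'south': 150}

# Analysis: examine the report for an insight that can drive a decision,
# e.g. which region lags and by how much.
best = max(report, key=report.get)
worst = min(report, key=report.get)
gap = report[best] - report[worst]
print(f"{worst} trails {best} by {gap}")  # south trails north by 100
```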
Modern Data Analytic Tools
• Various types of analytical tools are available in the market, but no company can buy and
implement all of them.

• Some of the open-source analytical tools are as follows:


✔ GridGain
✔ HPCC
✔ Storm
✔ Terrastore
✔ Neo4j

• The decision to invest in an analytical tool is a crucial one and needs careful consideration on the
part of a company on various parameters.

• The following are some popular analytical tools:

✔ The R Project for Statistical Computing

✔ IBM SPSS

✔ SAS
Thank You

Common questions

The data collection phase is crucial to the analytics process as it involves gathering data from various sources relevant to the project's goals. This phase is also referred to as data exploration, wherein data is evaluated to ensure its quality and applicability for subsequent analysis. Proper data collection lays the groundwork for accurate data preparation, modeling, and evaluation, directly impacting the quality of analytical outcomes. High-quality, relevant data helps create reliable models and generate actionable insights, whereas poor-quality data can lead to erroneous conclusions and inefficient decisions.

Understanding the nature of data is fundamental in crafting effective Big Data solutions because it determines the methods and tools used for data processing and analysis. With structured data, traditional database management methods suffice, while unstructured and semi-structured data require advanced analytical tools and methodologies. The prevalence of unstructured data in enterprises means that specialized techniques are needed to harness its potential. Recognizing these differences helps in choosing appropriate technologies and developing efficient data models, ensuring that data-driven decisions are well-supported by accurate analysis.

Security, compliance, auditing, and protection are critical components that impact Big Data applications by ensuring data integrity, privacy, and trustworthiness. Security measures protect sensitive information from unauthorized access and breaches. Compliance ensures that data handling meets legal and industry-specific regulations. Auditing provides transparency and traceability in data operations, allowing organizations to monitor and verify compliance with these regulations. Protection involves setting policies and practices that safeguard data throughout its lifecycle. Together, these elements help mitigate risks associated with data storage and processing, fostering trust in Big Data applications and their outputs.

Reporting and analysis differ greatly in their functions despite both utilizing collected data. Reporting organizes and summarizes data into a clear format that allows monitoring of performance parameters, enhancing decision-making by providing factual information at a glance. Analysis, however, involves a deeper examination of data and reports to derive insights, which can guide strategic planning and decision-making. Therefore, while reporting provides the 'what' of business performance, analysis offers the 'why' and 'how', allowing companies to not only track past performance but also predict and prepare for future trends.

Modern data analytic tools are characterized by their ability to handle vast, diverse datasets and perform complex analyses at speed. These tools, such as GridGain, Neo4j, and SAS, provide features like real-time processing, support for multiple data formats, and advanced visualization capabilities. They facilitate every stage of the analytics process, from data preparation to modeling and evaluation, thus allowing businesses to draw insights from both structured and unstructured data efficiently. By leveraging these tools, organizations can enhance their decision-making processes, optimize operations, and innovate by transforming raw data into valuable insights.

Conventional systems often struggle with scalability, data volume, and real-time processing needs, limiting their ability to handle the demands of modern-day data environments. Big Data technologies, however, are specifically designed to address these challenges by offering distributed computing, high storage capacity, and parallel processing. Technologies like Hadoop allow for efficient handling of large datasets across multiple servers. Additionally, Big Data systems can process diverse data formats, from structured to unstructured, more effectively than conventional systems. This capability provides organizations with deeper insights and faster, more efficient data processing solutions.

Ethical considerations in Big Data primarily revolve around privacy and data usage. Data privacy concerns arise from the vast amount of personal information processed in Big Data applications, potentially leading to misuse or unauthorized exposure. Ethical data usage requires organizations to balance the benefits of data analysis with individuals' rights to privacy. This involves adhering to strict data protection regulations, ensuring transparency in data usage practices, and obtaining informed consent from data subjects. Addressing these ethical concerns is vital to maintaining public trust and preventing legal repercussions while leveraging Big Data's capabilities for societal benefits.

The history of Big Data innovation has significantly shaped contemporary Big Data platforms and architectures. Early advancements focused on improving data storage and computing power to manage large datasets. This evolution has led to the development of sophisticated architectures that support distributed computing, fault tolerance, and scalability. Modern Big Data platforms are designed to accommodate the increasing volume, variety, and velocity of data by incorporating technologies such as Hadoop and cloud computing. These innovations have enabled real-time analytics, enhanced data integration, and improved access to insights, driving more informed and timely decision-making across industries.

The '5 Vs of Big Data' (Volume, Velocity, Variety, Veracity, and Value) define the major characteristics and challenges of managing Big Data. Volume refers to the massive amounts of data generated; Velocity is the speed at which data is produced and must be processed; Variety encompasses the different types of data, from structured to unstructured formats; Veracity highlights the uncertainty of data quality; and Value pertains to the insights and business benefits derived from the data. Each 'V' presents unique challenges, such as storage capacity for Volume, real-time processing for Velocity, integration of diverse data sources for Variety, trustworthiness for Veracity, and extraction of actionable insights for Value.

Big Data analytics involves three main categories of data: structured, unstructured, and semi-structured. Structured data refers to highly organized data that can be easily stored and accessed, such as entries in a database. Unstructured data lacks a predefined format, making it challenging to process and analyze; examples include emails and social media posts. Semi-structured data contains elements of both structured and unstructured data, such as XML files. Understanding these categories is essential for effectively extracting valuable insights, as most enterprise data is unstructured or semi-structured.
