** JSON (JavaScript Object Notation) **
JSON (JavaScript Object Notation) is a lightweight data-interchange format that is easy for
humans to read and write, and easy for machines to parse and generate. JSON is often used
to transmit data between a server and a web application, as well as to store configuration
settings and exchange data between different programming languages.
• JSON represents data as key-value pairs, similar to a dictionary or an associative
array in other programming languages.
• Data is organized in a hierarchical and nested structure using objects and arrays.
• JSON uses a simple and readable syntax. Data is enclosed in curly braces {} for
objects and square brackets [] for arrays.
• Key-value pairs are separated by colons (:), and elements in an array are separated
by commas.
• JSON supports several data types, including strings, numbers, objects, arrays,
booleans, and null.
• JSON is widely used for configuration files, APIs, and as a data storage format for
various applications.
Example of a simple JSON object representing information about a person:
{
  "name": "John Doe",
  "age": 30,
  "city": "New York",
  "isStudent": false,
  "hobbies": ["reading", "traveling"]
}
The JSON file in the example below contains an object with a key “employees”, which maps to
an array of two objects representing employee information. Each employee object has keys
such as “firstName”, “lastName”, “age”, and “department”. The structure of JSON allows for
flexibility and ease of representation for various types of data.
{
  "employees": [
    {
      "firstName": "John",
      "lastName": "Doe",
      "age": 30,
      "department": "Engineering"
    },
    {
      "firstName": "Jane",
      "lastName": "Smith",
      "age": 28,
      "department": "Marketing"
    }
  ]
}
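To make this concrete, here is a minimal sketch using Python’s standard json module that parses the person object from earlier and serializes it back to text; the file name person.json is just an illustrative choice.

import json

# Parse (deserialize) a JSON document: objects become dicts, arrays become
# lists, booleans become True/False, and null becomes None.
text = '{"name": "John Doe", "age": 30, "city": "New York", "isStudent": false, "hobbies": ["reading", "traveling"]}'
person = json.loads(text)
print(person["name"])          # John Doe
print(person["hobbies"][0])    # reading

# Serialize (dump) the Python dict back to JSON text and round-trip it
# through a file (person.json is an example path).
with open("person.json", "w") as f:
    json.dump(person, f, indent=2)
with open("person.json") as f:
    loaded = json.load(f)
print(loaded == person)        # True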
** CSV (Comma Separated Values) **
CSV is a simple and widely used file format for storing tabular data (numbers and text) in
plain text form. In a CSV file, each line of the file represents a row of data, and the values
within each row are separated by commas (or another delimiter).
• Structure — Data is organized in rows, where each row corresponds to a record or
entry.
• Within each row, values are separated by commas or other delimiters (such as
semicolons or tabs).
• Delimiter — The comma is the most common delimiter used in CSV files, but other
delimiters like semicolons or tabs may be used depending on regional conventions
or specific requirements.
• The choice of delimiter is important to avoid conflicts with the data itself. For
example, if the data contains commas, using a comma as a delimiter may cause
parsing issues.
• Text Qualification — If a field value contains the delimiter or special characters, the
value is often enclosed in double quotes to distinguish it from the delimiter used to
separate fields. For example: "John Doe",25,"New York, NY","Male"
• Header Row — CSV files often include a header row at the beginning that contains
the names of the columns. This row helps to identify the meaning of each column.
Name,Age,City,Gender
• CSV files typically have a “.csv” file extension.
• CSV is a platform-independent format and can be easily created and read by a
variety of software applications, including spreadsheet programs like Microsoft
Excel and database systems.
• CSV files store data as plain text, so all values are treated as strings. It’s up to the
interpreting software to recognize and handle data types appropriately.
• CSV is commonly used for data interchange between different systems and
applications.
Name,Age,City,Gender
John Doe,25,New York,Male
Jane Smith,30,San Francisco,Female
Bob Johnson,22,Chicago,Male
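As an illustration, the following sketch uses Python’s standard csv module to write and read the table above; the file name people.csv is illustrative. Note how every value comes back as a string and must be converted by the consuming code.

import csv

rows = [
    {"Name": "John Doe", "Age": 25, "City": "New York", "Gender": "Male"},
    {"Name": "Jane Smith", "Age": 30, "City": "San Francisco", "Gender": "Female"},
    {"Name": "Bob Johnson", "Age": 22, "City": "Chicago", "Gender": "Male"},
]

# Write a header row followed by one line per record; the csv module adds
# double quotes automatically when a value contains the delimiter.
with open("people.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["Name", "Age", "City", "Gender"])
    writer.writeheader()
    writer.writerows(rows)

# Read it back: every field is a plain string, so numeric columns have to be
# converted explicitly.
with open("people.csv", newline="") as f:
    for record in csv.DictReader(f):
        print(record["Name"], int(record["Age"]))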
** Parquet **
Parquet is a columnar storage file format optimized for use with big data processing
frameworks. It is designed to be highly efficient for both storage and processing of large
datasets. Parquet is widely used in the Apache Hadoop ecosystem, particularly with tools
like Apache Spark and Apache Hive.
• Columnar Storage — Unlike row-oriented storage formats, such as CSV or JSON,
Parquet stores data in a columnar format. This means that values from the same
column are stored together, allowing for better compression and improved query
performance for analytical workloads.
• Compression — Parquet uses compression techniques to reduce storage space
requirements. The columnar storage format allows for effective compression
because similar data types and values are grouped together.
• Common compression algorithms used with Parquet include Snappy, Gzip, and
LZO.
• Schema Evolution — Parquet supports schema evolution, allowing changes to the
data schema over time without requiring the entire dataset to be rewritten. This is
beneficial for evolving data structures without significant disruptions to data
processing workflows.
• Predicate PushDown — Parquet enables predicate pushdown, a feature that allows
the filtering of data at the storage level before it is read into memory. This minimizes
the amount of data that needs to be processed, leading to improved query
performance.
• Metadata — Parquet files contain metadata, including schema information and
statistics about the data. This metadata is used by processing engines to optimize
queries and filter data efficiently.
• Data Types — Parquet supports a wide range of data types, including primitive types
(integers, floating-point numbers, strings, etc.) and complex types (arrays, maps,
structs). This flexibility makes it suitable for diverse data processing needs.
• Performance and Scalability — Due to its columnar storage and compression,
Parquet is well-suited for analytical processing on large datasets. It allows for
efficient scanning of specific columns and parallel processing in distributed
environments.
• File Extension — Parquet files typically have a “.parquet” file extension.
Example Parquet File Structure Illustration
<Column 1>
<Value 1>
<Value 2>
...
<Column 2>
<Value 1>
<Value 2>
...
...
A literal binary representation of a Parquet file is not practical to show here, but the logical
structure can be illustrated with some sample data. Keep in mind that the actual binary
format is more complex due to the use of advanced compression and encoding techniques.
Let’s consider a scenario where we have a dataset containing information about users, and
we’ll represent this dataset using a few columns: user_id, name, age, and city.
Here’s a simplified representation of a Parquet file structure with sample data:
+-------------------------------------------+
| Parquet File Header                       |
+-------------------------------------------+
| Metadata (Schema, Compression, etc.)      |
+-------------------------------------------+
| Row Group 1                               |
+---------+---------+-----+-----------------+
| user_id | name    | age | city            |
+---------+---------+-----+-----------------+
| 1       | Alice   | 25  | New York        |
| 2       | Bob     | 30  | San Francisco   |
| 3       | Charlie | 28  | Chicago         |
+---------+---------+-----+-----------------+
| Row Group 2                               |
+---------+---------+-----+-----------------+
| user_id | name    | age | city            |
+---------+---------+-----+-----------------+
| 4       | Dave    | 35  | Los Angeles     |
| 5       | Eve     | 22  | Seattle         |
+---------+---------+-----+-----------------+
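As a rough sketch of how this looks in practice (assuming the pyarrow library is available; the file name users.parquet is illustrative), the snippet below writes the users table as Parquet with Snappy compression and then reads back only the columns a query needs, which is where the columnar layout pays off.

import pyarrow as pa
import pyarrow.parquet as pq

# The same user_id / name / age / city data as in the illustration above.
table = pa.table({
    "user_id": [1, 2, 3, 4, 5],
    "name": ["Alice", "Bob", "Charlie", "Dave", "Eve"],
    "age": [25, 30, 28, 35, 22],
    "city": ["New York", "San Francisco", "Chicago", "Los Angeles", "Seattle"],
})

# Write with Snappy compression; row_group_size controls how many rows land
# in each row group (3 here, mirroring the two row groups shown above).
pq.write_table(table, "users.parquet", compression="snappy", row_group_size=3)

# Columnar read: load only the columns that are actually needed.
subset = pq.read_table("users.parquet", columns=["name", "age"])
print(subset.to_pydict())

# File metadata (schema, row groups, per-column statistics) can be inspected
# without reading the data itself.
meta = pq.ParquetFile("users.parquet").metadata
print(meta.num_row_groups, meta.num_rows)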
** Avro **
Avro is a binary serialization format developed within the Apache Hadoop project. It is
designed to provide a compact and fast serialization mechanism for data exchange
between systems, especially in big data processing environments.
• Schema-Based Serialization — Avro uses a schema to define the structure of the
data being serialized. The schema is often defined in JSON format and is used to
encode and decode the data.
• Data Types — Avro supports a rich set of data types, including primitive types (int,
long, float, double, boolean, string, bytes) and complex types (record, enum, array,
map, union, fixed).
• Binary Format — Avro serializes data in a compact binary format, resulting in
smaller file sizes compared to some text-based formats like JSON or XML. The
binary format also contributes to faster data serialization and deserialization.
• Compression — Avro files can be compressed to further reduce storage
requirements. Common compression algorithms, such as Snappy or deflate, can be
applied to Avro data. Compression helps minimize storage costs and improve data
transfer efficiency.
• Self-Describing Data — Avro data files are self-describing, meaning they include
the schema information along with the serialized data. This makes it easy to
interpret the data without needing the schema in advance. The schema is stored at
the beginning of the Avro file, allowing readers to understand the structure of the
data without external schema files.
• Forward and Backward Compatibility — Avro supports schema evolution, allowing
for changes to the schema over time without breaking compatibility. Both forward
and backward compatibility are maintained, meaning new data can be read by old
readers, and old data can be read by new readers.
• Language-Independent — Avro is designed to be language-independent, meaning
it can be used across various programming languages. Avro schemas can be
defined in JSON, and libraries for reading and writing Avro data are available in
multiple programming languages, including Java, Python, C++, and more.
• File Extension — Avro files typically have a “.avro” file extension.
Avro Schema
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id", "type": "int"},
    {"name": "name", "type": "string"},
    {"name": "age", "type": "int"},
    {"name": "city", "type": "string"}
  ]
}
Avro Data
{"id": 1, "name": "Alice", "age": 25, "city": "New York"}
{"id": 2, "name": "Bob", "age": 30, "city": "San Francisco"}
{"id": 3, "name": "Charlie", "age": 28, "city": "Chicago"}
In this example, the Avro schema defines a record type named “User” with four fields. The
Avro data represents instances of this record with specific values for each field; in an actual
.avro file these records are stored in the compact binary encoding, and are shown here as
JSON only for readability.
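A minimal sketch of writing and reading this data (assuming the fastavro library; the file name users.avro is illustrative) looks like the following. The schema is embedded in the file header, which is what makes the data self-describing.

import fastavro

schema = fastavro.parse_schema({
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
        {"name": "city", "type": "string"},
    ],
})

records = [
    {"id": 1, "name": "Alice", "age": 25, "city": "New York"},
    {"id": 2, "name": "Bob", "age": 30, "city": "San Francisco"},
    {"id": 3, "name": "Charlie", "age": 28, "city": "Chicago"},
]

# Write an Avro container file; the schema goes into the file header and the
# records are stored in compact binary form (deflate-compressed here).
with open("users.avro", "wb") as out:
    fastavro.writer(out, schema, records, codec="deflate")

# Read it back; no external schema is needed because it is stored in the file.
with open("users.avro", "rb") as inp:
    for user in fastavro.reader(inp):
        print(user["name"], user["age"])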
** ORC **
ORC (Optimized Row Columnar) is a columnar storage file format designed for use with the
Apache Hive data warehouse system. It is highly optimized for performance, especially for
complex query processing in big data analytics. ORC files are often used in conjunction
with Apache Hive, Apache Spark, and other big data processing frameworks.
• Columnar Storage — Data is stored in a columnar format, which allows for better
compression and improved query performance. This is particularly advantageous
for analytical workloads where only a subset of columns is often queried.
• Compression — ORC supports various compression algorithms, including Zlib,
Snappy, and LZO. Compression is applied at the column level, providing efficient
storage and reduced I/O.
• Predicate Pushdown — ORC files support predicate pushdown, a feature that
allows the filtering of data at the storage level before it is read into memory. This
reduces the amount of data that needs to be processed during query execution.
• Lightweight Indexing — ORC files include lightweight indexes (such as min/max
statistics for groups of rows, plus optional bloom filters) that help skip irrelevant data
blocks during query execution. This further improves query performance.
• Statistics and Metadata — ORC files store statistics and metadata about the data,
including column statistics like minimum and maximum values. This information is
used by query engines to optimize query execution plans.
• Data Types — ORC supports a wide range of data types, including primitive types
(integers, floating-point numbers, strings, etc.) and complex types (arrays, maps,
structs). This flexibility makes it suitable for diverse data processing needs.
• Hive Integration — ORC is closely integrated with Apache Hive, making it a popular
choice for storing and processing Hive tables.
Sample dataset with information about users
| user_id | name | age | city |
|---------|--------|-----|------------|
| 1 | Alice | 25 | New York |
| 2 | Bob | 30 | San Fran |
| 3 | Charlie| 28 | Chicago |
Example ORC File Structure
<Column 1: user_id>
<Value 1>
<Value 2>
<Value 3>
<Column 2: name>
<Value Alice>
<Value Bob>
<Value Charlie>
<Column 3: age>
<Value 25>
<Value 30>
<Value 28>
<Column 4: city>
<Value New York>
<Value San Fran>
<Value Chicago>
Each column is stored separately, and the values within each column are stored in a
compressed, columnar format. This structure allows for efficient compression and retrieval
of specific columns during query processing.
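As a brief sketch (assuming a pyarrow build with ORC support; the file name users.orc is illustrative), the snippet below writes the sample users table as an ORC file with Snappy compression and then reads back only a subset of columns, along with the stripe and row counts exposed by the file metadata.

import pyarrow as pa
import pyarrow.orc as orc

# Same users data as the sample dataset above.
table = pa.table({
    "user_id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 28],
    "city": ["New York", "San Fran", "Chicago"],
})

# Write the table as an ORC file with Snappy compression.
orc.write_table(table, "users.orc", compression="snappy")

# Open the file: stripe and row counts come from the file metadata, and only
# the requested columns are read back.
orc_file = orc.ORCFile("users.orc")
print(orc_file.nstripes, orc_file.nrows)
subset = orc_file.read(columns=["name", "age"])
print(subset.to_pydict())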