0% found this document useful (0 votes)
42 views70 pages

BA 06 Data Loading Storage and File Formats

Uploaded by

Nguyen Van Kien
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
42 views70 pages

BA 06 Data Loading Storage and File Formats

Uploaded by

Nguyen Van Kien
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 70

Business Analytics

Data Loading, Storage, and File Formats

Phan Xuân Hiếu • VNU-UET • [email protected]


Lecture objectives

§ Describing reading and writing data in text formats


– CSV, JSON, XML, HTML ...

§ Describing reading and writing data in binary formats


– Pickle, Micorosft Excel, HDF5
– Explaining some other file formats: ORC and Apache Parquet

§ Explaining how Python can work with (relational) databases


– Using specific drivers for particular DBMSs, or
– Using SQLAlchemy
– Giving examples of working with SQLite3

March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 2
Contents

§ Reading and Writing Data in Text Format


– Reading and Writing Text Files
– Writing Data to Text Format
– JSON Data
– XML and HTML: Web Scraping

§ Binary Data Formats


– Reading Microsoft Excel Files
– Using HDF5 Format

§ Interacting with Databases

March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 3
Reading and Writing Data in Text Format
– Reading Text Files in Pieces
– Writing Data to Text Format
– JSON Data
– XML and HTML: Web Scraping

March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 4
Text and binary data loading functions in Pandas

March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 5
Text and binary data loading functions in Pandas (2)

March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 6
The optional arguments for these functions
§ Indexing:
– Can treat one or more columns as the returned DataFrame.
§ Type inference and data conversion:
– Includes the user-defined value conversions + custom list of missing value markers.

§ Date and time parsing:


§ Iterating:
– Support for iterating over chunks of very large files.
§ Unclean data issues:
– Includes skipping rows or a footer, comments, ...

March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 7
Several functions are very complicated ...
The function pandas.read_csv has around ... 50 parameters!

March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 8
Reading a small comma-separated values (CSV) file

March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 9
Reading a non-header CSV file

March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 10
Reading and adding names of columns

March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 11
Reading and specifying the index column

March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 12
Reading a CSV file with a hierarchical index

March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 13
Reading data from a space-separated text file

March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 14
Reading and skipping some rows in the file

March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 15
Reading data from a CSV file with some missing values

March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 16
Some pandas.read_csv function arguments

March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 17
Some pandas.read_csv function arguments (2)

March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 18
Some pandas.read_csv function arguments (3)

March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 19
Reading and Writing Data in Text Format
– Reading Text Files in Pieces
– Writing Data to Text Format
– JSON Data
– XML and HTML: Web Scraping

March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 20
Reading data from a larger CSV file

March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 21
Reading a small number of rows

March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 22
Reading a file in pieces by specifying the chunksize = #rows

§ The TextFileReader object returned by pandas.read_csv allows


– Iterate over the parts of the file according to the chunksize.
§ The next slide: Iterating over ex6.csv, aggregating the value counts in
the “key” column.

March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 23
Aggregating the value counts in the “key” column

March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 24
Reading and Writing Data in Text Format
– Reading Text Files in Pieces
– Writing Data to Text Format
– Working with Other Delimited Formats
– JSON Data
– XML and HTML: Web Scraping

March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 25
Writing Pandas objects to CSV files

March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 26
Writing with other options

29 March 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 27
Writing with other options (2)

29 March 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 28
Reading and Writing Data in Text Format
– Reading Text Files in Pieces
– Writing Data to Text Format
– JSON Data
– XML and HTML: Web Scraping

March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 29
JSON data

§ JSON (JavaScript Object Notation):


– One of the standard formats for sending data by HTTP request,
– Much more free-form data format than a tabular text form like CSV.

§ JSON is very nearly valid Python code.

§ Basic types: objects (dictionaries), arrays (lists), strings, numbers, Booleans, nulls.
– All of the keys in an object must be strings.

§ Several Python libraries for reading and writing JSON data


– A popular lib: json

March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 30
Converting a JSON string to Python form

29 March 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 31
Converting a Python object back to JSON

29 March 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 32
Converting a part of JSON object to a DataFrame

March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 33
Using the pandas.read_json to load JSON data

29 March 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 34
Writing a DataFrame to a JSON file or standard output

29 March 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 35
Reading and Writing Data in Text Format
– Reading Text Files in Pieces
– Writing Data to Text Format
– JSON Data
– XML and HTML: Web Scraping

March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 36
Python libs for parsing and reading XML and HTML data

§ Python has many libraries for reading and writing data from XML & HTML
– lxml, Beautiful Soup, and html5lib
§ Pandas:
– A built-in function: pandas.read_html: parsing and reading HTML tables
– Built on top of lxml, Beautiful Soup, and html5lib ...

March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 37
Reading HTML tables using pandas.read_html

29 March 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 38
29 March 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 39
Computing the number of bank failures by year

29 March 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 40
Binary Data Formats
– Reading Microsoft Excel Files
– Using HDF5 Format

March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 41
Reading from and writing (serialization) to pickle file

March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 42
Choosing the binary formats to store data

§ Pickle is recommened only as a short-term storage format


– Hard to guarantee the format will be stable over time
– Pandas has tried to maintain backward compatibility but not alwasy ...

§ Other file formats:


– HDF5
– ORC
– Apache Parquet

29 March 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 43
The ORC file format
§ ORC (Optimized Row Columnar) developed by Hortonworks
§ Use cases:
– Hive or Hadoop-based data lakes
– Heavy analytics on large datasets

§ Key features:
– Columnar storage: efficiently for reading only a subset of columns
– High compression
– Fast performance
– Splittable: support parallel processing. ORC files can be split across multiple workers ...
– Schema support: store rich metadata, including schema definitions ...

29 March 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 44
The Apache Parquet file format
§ Apache Parquet is a columnar storage file format
– Developed by Cloudera and Twitter for efficient big data storage and processing
– Widely used in Hadoop ecosystem with Apache Spark, Hive, Presto, and Pandas

§ Key features:
– Columnar format: great for analytics and read-heavy workloads
– Efficient compression
– Schema evolution: store metadata including schema definitions; support schema evolution,
... you can add new columns over time.
– Splittable: designed for parallel processing ...
– Language agnostic: support many language like Python, Java, C++ ... and many libraries like
Pandas, PyArrow, Spark ...

29 March 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 45
Reading from Apache Parquet format

29 March 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 46
Binary Data Formats
– Reading Microsoft Excel Files
– Using HDF5 Format

March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 47
Working with Microsoft Excel files

§ Pandas supports reading tabular data stored in Excel 2003 (and higher)
§ Using pandas.ExcelFile class or pandas.read_excel function
§ Internally, these tools are based on
– packages like xlrd and openpyxl to read old-style XLS and newer XLSX files
– xlrd and openpyxl must be installed separately from pandas

March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 48
Reading data from Excel files using pandas.ExcelFile

29 March 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 49
Reading data from Excel files using pandas.ExcelFile (2)

29 March 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 50
Reading data from Excel files using pandas.read_excel

29 March 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 51
Writing data to Excel files

March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 52
Writing data to Excel files (with multiple sheets)

29 March 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 53
Writing data to Excel files using pandas.to_excel

29 March 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 54
Binary Data Formats
– Reading Microsoft Excel Files
– Using HDF5 Format

March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 55
HDF5 file format
§ The HDF5 (Hierarchical Data Format version 5) file format (with .h5 or .hdf5)
– efficient for storing and organizing large amount of numerical data
– especially useful in scientific computing, machine learning, and data engineering

§ Key features:
– High-performance I/O with large and complex datasets
– Storing data in a hierarchical structure (like a file system: folders + files)
• Groups (like folders); Datasets (like files: multi-dimensional arrays); Attributes (metadata)
– Self-describing: storing metadata along with the data
– Supporting on-the-fly compression; efficient to read/write small sections of very large arrays

§ Supported in Python, R, C/C++, MATLAB, Java, Julia ...

March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 56
Working with the HDF5 file format

March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 57
Working with the HDF5 file format (2)

March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 58
Working with the HDF5 file format (3)

March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 59
Notes on binary data formats

§ HDF5
– Is NOT a database
– It is best suited for write-once, read-many datasets
– Data can be added to a HDF5 file at any time, but avoiding simultaneous writing
§ If working with large quantities of data locally
– Explore PyTables and h5py
– Many data analysis problems are I/O-bound (rather than CPU-bound)
§ If processing data that is stored on remote servers (e.g., Amazon S3, HDFS)
– Using binary formats designed for distributed storage like Apache Parquet

March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 60
Interacting with Databases

March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 61
SQL-based database management systems (DBMSs)
§ Python can work with various SQL-based DBMSs
– Oracle (via oracledb),
– SQL Server (via pyodbc),
– PostgreSQL (via psycopg),
– MySQL (via mysql.connector)
§ SQLite (the current version: SQLite3) with Python’s built-in sqlite3 driver:
– A lightweight, file-based relational database (does not require server)
– It is embedded: can be included in many applications like browsers / mobile apps
– It is self-contained, zero-configuration
– Very fast for small and medium-scale apps

March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 62
Interacting with SQLite3 using the sqlite3 driver

March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 63
Inserting a few rows of data into the test table

March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 64
Querying the data from the test table

March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 65
Creating a DataFrame from the list of data rows

March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 66
SQLAlchemy

§ A powerful and flexible Python library used to interact with SQL databases
§ SQLAlchemy as a bridge between Python and relational databases like:
– SQLite, PostgreSQL, MySQL, SQL Server, Oracle ...

§ With SQLAlchemy:
– Connect to a database
– Create tables
– Insert, update, and delete data
– Run SQL queries
– Map Python classes to database tables

March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 67
Working with SQLite using SQLAlchemy

29 March 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 68
Lecture summary

§ Described reading and writing data in text formats


– CSV, JSON, XML, HTML ...

§ Described reading and writing data in binary formats


– Pickle, Micorosft Excel, HDF5
– Explaining some other file formats: ORC and Apache Parquet

§ Explained how Python can work with (relational) databases


– Using specific drivers for particular DBMSs, or
– Using SQLAlchemy
– Giving examples of working with SQLite3

29 March 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 69
Lecture materials

§ Allen B. Downey. Think Python: How to Think Like a Computer Scientist. O’Reilly Media
Inc., 2024.
§ Luciano Ramalho. Fluent Python, O’Reilly Media Inc., 2015.
§ Brett Slatkin. Effective Python: 125 Specific Ways to Write Better Python, Pearson
Education, Inc., 2025.
§ Wes McKinney. Python for Data Analysis: Data Wrangling with Pandas, NumPy, and Jupyter,
O’Reilly Media Inc., 2022.
§ Guido van Rossum, Barry Warsaw, and Alyssa Coghlan. Python Enhancement Proposal 8
(PEP 8) – Style Guide for Python Code, 2001. URL: https://peps.python.org/pep-0008/

29 March 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 70

You might also like