Business Analytics
Data Loading, Storage, and File Formats
Lecture objectives
§ Describing reading and writing data in text formats
– CSV, JSON, XML, HTML ...
§ Describing reading and writing data in binary formats
– Pickle, Micorosft Excel, HDF5
– Explaining some other file formats: ORC and Apache Parquet
§ Explaining how Python can work with (relational) databases
– Using specific drivers for particular DBMSs, or
– Using SQLAlchemy
– Giving examples of working with SQLite3
March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 2
Contents
§ Reading and Writing Data in Text Format
– Reading and Writing Text Files
– Writing Data to Text Format
– JSON Data
– XML and HTML: Web Scraping
§ Binary Data Formats
– Reading Microsoft Excel Files
– Using HDF5 Format
§ Interacting with Databases
March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 3
Reading and Writing Data in Text Format
– Reading Text Files in Pieces
– Writing Data to Text Format
– JSON Data
– XML and HTML: Web Scraping
March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 4
Text and binary data loading functions in Pandas
March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 5
Text and binary data loading functions in Pandas (2)
March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 6
The optional arguments for these functions
§ Indexing:
– Can treat one or more columns as the returned DataFrame.
§ Type inference and data conversion:
– Includes the user-defined value conversions + custom list of missing value markers.
§ Date and time parsing:
§ Iterating:
– Support for iterating over chunks of very large files.
§ Unclean data issues:
– Includes skipping rows or a footer, comments, ...
March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 7
Several functions are very complicated ...
The function pandas.read_csv has around ... 50 parameters!
March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 8
Reading a small comma-separated values (CSV) file
March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 9
Reading a non-header CSV file
March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 10
Reading and adding names of columns
March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 11
Reading and specifying the index column
March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 12
Reading a CSV file with a hierarchical index
March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 13
Reading data from a space-separated text file
March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 14
Reading and skipping some rows in the file
March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 15
Reading data from a CSV file with some missing values
March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 16
Some pandas.read_csv function arguments
March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 17
Some pandas.read_csv function arguments (2)
March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 18
Some pandas.read_csv function arguments (3)
March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 19
Reading and Writing Data in Text Format
– Reading Text Files in Pieces
– Writing Data to Text Format
– JSON Data
– XML and HTML: Web Scraping
March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 20
Reading data from a larger CSV file
March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 21
Reading a small number of rows
March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 22
Reading a file in pieces by specifying the chunksize = #rows
§ The TextFileReader object returned by pandas.read_csv allows
– Iterate over the parts of the file according to the chunksize.
§ The next slide: Iterating over ex6.csv, aggregating the value counts in
the “key” column.
March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 23
Aggregating the value counts in the “key” column
March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 24
Reading and Writing Data in Text Format
– Reading Text Files in Pieces
– Writing Data to Text Format
– Working with Other Delimited Formats
– JSON Data
– XML and HTML: Web Scraping
March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 25
Writing Pandas objects to CSV files
March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 26
Writing with other options
29 March 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 27
Writing with other options (2)
29 March 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 28
Reading and Writing Data in Text Format
– Reading Text Files in Pieces
– Writing Data to Text Format
– JSON Data
– XML and HTML: Web Scraping
March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 29
JSON data
§ JSON (JavaScript Object Notation):
– One of the standard formats for sending data by HTTP request,
– Much more free-form data format than a tabular text form like CSV.
§ JSON is very nearly valid Python code.
§ Basic types: objects (dictionaries), arrays (lists), strings, numbers, Booleans, nulls.
– All of the keys in an object must be strings.
§ Several Python libraries for reading and writing JSON data
– A popular lib: json
March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 30
Converting a JSON string to Python form
29 March 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 31
Converting a Python object back to JSON
29 March 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 32
Converting a part of JSON object to a DataFrame
March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 33
Using the pandas.read_json to load JSON data
29 March 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 34
Writing a DataFrame to a JSON file or standard output
29 March 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 35
Reading and Writing Data in Text Format
– Reading Text Files in Pieces
– Writing Data to Text Format
– JSON Data
– XML and HTML: Web Scraping
March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 36
Python libs for parsing and reading XML and HTML data
§ Python has many libraries for reading and writing data from XML & HTML
– lxml, Beautiful Soup, and html5lib
§ Pandas:
– A built-in function: pandas.read_html: parsing and reading HTML tables
– Built on top of lxml, Beautiful Soup, and html5lib ...
March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 37
Reading HTML tables using pandas.read_html
29 March 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 38
29 March 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 39
Computing the number of bank failures by year
29 March 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 40
Binary Data Formats
– Reading Microsoft Excel Files
– Using HDF5 Format
March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 41
Reading from and writing (serialization) to pickle file
March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 42
Choosing the binary formats to store data
§ Pickle is recommened only as a short-term storage format
– Hard to guarantee the format will be stable over time
– Pandas has tried to maintain backward compatibility but not alwasy ...
§ Other file formats:
– HDF5
– ORC
– Apache Parquet
29 March 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 43
The ORC file format
§ ORC (Optimized Row Columnar) developed by Hortonworks
§ Use cases:
– Hive or Hadoop-based data lakes
– Heavy analytics on large datasets
§ Key features:
– Columnar storage: efficiently for reading only a subset of columns
– High compression
– Fast performance
– Splittable: support parallel processing. ORC files can be split across multiple workers ...
– Schema support: store rich metadata, including schema definitions ...
29 March 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 44
The Apache Parquet file format
§ Apache Parquet is a columnar storage file format
– Developed by Cloudera and Twitter for efficient big data storage and processing
– Widely used in Hadoop ecosystem with Apache Spark, Hive, Presto, and Pandas
§ Key features:
– Columnar format: great for analytics and read-heavy workloads
– Efficient compression
– Schema evolution: store metadata including schema definitions; support schema evolution,
... you can add new columns over time.
– Splittable: designed for parallel processing ...
– Language agnostic: support many language like Python, Java, C++ ... and many libraries like
Pandas, PyArrow, Spark ...
29 March 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 45
Reading from Apache Parquet format
29 March 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 46
Binary Data Formats
– Reading Microsoft Excel Files
– Using HDF5 Format
March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 47
Working with Microsoft Excel files
§ Pandas supports reading tabular data stored in Excel 2003 (and higher)
§ Using pandas.ExcelFile class or pandas.read_excel function
§ Internally, these tools are based on
– packages like xlrd and openpyxl to read old-style XLS and newer XLSX files
– xlrd and openpyxl must be installed separately from pandas
March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 48
Reading data from Excel files using pandas.ExcelFile
29 March 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 49
Reading data from Excel files using pandas.ExcelFile (2)
29 March 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 50
Reading data from Excel files using pandas.read_excel
29 March 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 51
Writing data to Excel files
March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 52
Writing data to Excel files (with multiple sheets)
29 March 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 53
Writing data to Excel files using pandas.to_excel
29 March 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 54
Binary Data Formats
– Reading Microsoft Excel Files
– Using HDF5 Format
March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 55
HDF5 file format
§ The HDF5 (Hierarchical Data Format version 5) file format (with .h5 or .hdf5)
– efficient for storing and organizing large amount of numerical data
– especially useful in scientific computing, machine learning, and data engineering
§ Key features:
– High-performance I/O with large and complex datasets
– Storing data in a hierarchical structure (like a file system: folders + files)
• Groups (like folders); Datasets (like files: multi-dimensional arrays); Attributes (metadata)
– Self-describing: storing metadata along with the data
– Supporting on-the-fly compression; efficient to read/write small sections of very large arrays
§ Supported in Python, R, C/C++, MATLAB, Java, Julia ...
March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 56
Working with the HDF5 file format
March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 57
Working with the HDF5 file format (2)
March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 58
Working with the HDF5 file format (3)
March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 59
Notes on binary data formats
§ HDF5
– Is NOT a database
– It is best suited for write-once, read-many datasets
– Data can be added to a HDF5 file at any time, but avoiding simultaneous writing
§ If working with large quantities of data locally
– Explore PyTables and h5py
– Many data analysis problems are I/O-bound (rather than CPU-bound)
§ If processing data that is stored on remote servers (e.g., Amazon S3, HDFS)
– Using binary formats designed for distributed storage like Apache Parquet
March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 60
Interacting with Databases
March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 61
SQL-based database management systems (DBMSs)
§ Python can work with various SQL-based DBMSs
– Oracle (via oracledb),
– SQL Server (via pyodbc),
– PostgreSQL (via psycopg),
– MySQL (via mysql.connector)
§ SQLite (the current version: SQLite3) with Python’s built-in sqlite3 driver:
– A lightweight, file-based relational database (does not require server)
– It is embedded: can be included in many applications like browsers / mobile apps
– It is self-contained, zero-configuration
– Very fast for small and medium-scale apps
March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 62
Interacting with SQLite3 using the sqlite3 driver
March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 63
Inserting a few rows of data into the test table
March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 64
Querying the data from the test table
March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 65
Creating a DataFrame from the list of data rows
March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 66
SQLAlchemy
§ A powerful and flexible Python library used to interact with SQL databases
§ SQLAlchemy as a bridge between Python and relational databases like:
– SQLite, PostgreSQL, MySQL, SQL Server, Oracle ...
§ With SQLAlchemy:
– Connect to a database
– Create tables
– Insert, update, and delete data
– Run SQL queries
– Map Python classes to database tables
March 29, 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 67
Working with SQLite using SQLAlchemy
29 March 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 68
Lecture summary
§ Described reading and writing data in text formats
– CSV, JSON, XML, HTML ...
§ Described reading and writing data in binary formats
– Pickle, Micorosft Excel, HDF5
– Explaining some other file formats: ORC and Apache Parquet
§ Explained how Python can work with (relational) databases
– Using specific drivers for particular DBMSs, or
– Using SQLAlchemy
– Giving examples of working with SQLite3
29 March 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 69
Lecture materials
§ Allen B. Downey. Think Python: How to Think Like a Computer Scientist. O’Reilly Media
Inc., 2024.
§ Luciano Ramalho. Fluent Python, O’Reilly Media Inc., 2015.
§ Brett Slatkin. Effective Python: 125 Specific Ways to Write Better Python, Pearson
Education, Inc., 2025.
§ Wes McKinney. Python for Data Analysis: Data Wrangling with Pandas, NumPy, and Jupyter,
O’Reilly Media Inc., 2022.
§ Guido van Rossum, Barry Warsaw, and Alyssa Coghlan. Python Enhancement Proposal 8
(PEP 8) – Style Guide for Python Code, 2001. URL: https://peps.python.org/pep-0008/
29 March 2025 Data Loading, Storage, and File Formats • Phan Xuân Hiếu • VNU-UET • [email protected] 70