jugad2 - Vasudev Ram on software innovation: data-formats


Friday, November 25, 2016

Processing DSV data (Delimiter-Separated Values) with Python

I needed to process some DSV files recently. Here is a Python program I wrote for it, with a few changes from the original. For example, I do not show the processing of the data here; I only read and print it. Also, the program supports two command-line options, and so two ways, to specify the delimiter character.

DSV (Delimiter-separated values) is a common tabular text data format, with one record per line, and some number of fields per record, where the fields are separated or delimited by some specific character. Some common delimiter characters used in DSV files are tab (which makes them TSV files, Tab-Separated Values, common on Unix), comma (CSV files, Comma-Separated Values, a common spreadsheet and database import-export format), the pipe character (|), the colon (:) and others.
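As a minimal sketch of the format described above (not the program shown later in this post), splitting one record into fields only needs the delimiter character. The helper name split_record and the sample lines are made up for illustration, and Python 3 is assumed:

```python
def split_record(line, delimiter):
    """Split one DSV record into its fields, dropping the line terminator."""
    # The trailing newline ends the record; it is not part of the last field.
    return line.rstrip("\n").split(delimiter)

# The same helper handles pipe-, tab- and colon-delimited records.
print(split_record("a|b||d\n", "|"))   # ['a', 'b', '', 'd']
print(split_record("x\ty\n", "\t"))    # ['x', 'y']
print(split_record("root:x:0\n", ":")) # ['root', 'x', '0']
```

Note that a plain split like this does no quoting or escaping, so a field cannot itself contain the delimiter.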

They - DSV files - are described in this section, Data File Metaformats, under Chapter 5: Textuality, of Eric Raymond (ESR)'s book, The Art of Unix Programming, which is a recommended read for anyone interested in Unix (one of the longest-lived operating systems [1]) and in software design and development.

[1] And, speaking a bit loosely, nowadays Unix is also the most widely used OS in the world, by a fair margin, because Android and iOS mobile devices both run Unix-based systems, as do Apple MacOS and Linux computers. Android devices alone number in the billions.

The program, read_dsv.py, is a command-line utility, written in Python, that allows you to specify the delimiter character in one of two ways:

- with a "-c delim_char" option, in which delim_char is an ASCII character,
- with a "-n delim_code" option, in which delim_code is an ASCII code.

It then reads either the files specified as command-line arguments after the -n or -c option, or if no files are given, it reads its standard input.
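For comparison, the same command-line interface could probably be expressed with the standard library's argparse module. This is only a sketch under that assumption, not how read_dsv.py does it (it inspects sys.argv directly); the function name parse_args and the dest names are made up, and the sketch does not check that the -c argument is a single character:

```python
import argparse

def parse_args(argv):
    """Return (delimiter, files) from an argument list like sys.argv[1:]."""
    parser = argparse.ArgumentParser(description="Read DSV data (sketch).")
    # Exactly one of -c or -n must be given.
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("-c", dest="delim_char", help="delimiter character")
    group.add_argument("-n", dest="delim_code", type=int,
                       help="delimiter ASCII code")
    # Zero or more input files; an empty list means "read standard input".
    parser.add_argument("files", nargs="*", help="DSV files")
    args = parser.parse_args(argv)
    if args.delim_char is not None:
        delimiter = args.delim_char
    else:
        delimiter = chr(args.delim_code)
    return delimiter, args.files

print(parse_args(["-n", "124", "file1.dsv"]))  # ('|', ['file1.dsv'])
```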

Here is the code for read_dsv.py:

"""
read_dsv.py
Author: Vasudev Ram
Web site: https://vasudevram.github.io
Blog: https://jugad2.blogspot.com
Product store: https://gumroad.com/vasudevram
Purpose: Shows how to read DSV data, i.e. 

https://en.wikipedia.org/wiki/Delimiter-separated_values 

from either files or standard input, split the fields of each
line on the delimiter, and process the fields in some way.
The delimiter character is configurable by the user and can
be specified as either a character or its ASCII code.

Reference:
TAOUP (The Art Of Unix Programming): Data File Metaformats:
http://www.catb.org/esr/writings/taoup/html/ch05s02.html
ASCII table: http://www.asciitable.com/
"""

from __future__ import print_function

import sys
import string

def err_write(message):
    sys.stderr.write(message)

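def error_exit(message):
    # End the message with a newline so that the shell prompt
    # starts on a fresh line after the error is printed.
    err_write(message + "\n")
    sys.exit(1)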

def usage(argv, verbose=False):
    usage1 = \
        "{}: read and process DSV (Delimiter-Separated-Values) data.\n".format(argv[0])
    usage2 = "Usage: python" + \
        " {} [ -c delim_char | -n delim_code ] [ dsv_file ] ...\n".format(argv[0])
    usage3 = [
        "where one of either the -c or -n option must be given,\n",  
        "delim_char is a single ASCII delimiter character, and\n", 
        "delim_code is a delimiter character's ASCII code.\n", 
        "Text lines will be read from specified DSV file(s) or\n", 
        "from standard input, split on the specified delimiter\n", 
        "specified by either the -c or -n option, processed, and\n", 
        "written to standard output.\n", 
    ]
    err_write(usage1)
    err_write(usage2)
    if verbose:
        for line in usage3:
            err_write(line)

def str_to_int(s):
    try:
        return int(s)
    except ValueError as ve:
        error_exit(repr(ve))

def valid_delimiter(delim_code):
    return not invalid_delimiter(delim_code)

def invalid_delimiter(delim_code):
    # Codes outside the range 0 to 255 are not allowed.
    # (Strictly, ASCII is 0 to 127; 128 to 255 are extended 8-bit codes.)
    if delim_code < 0 or delim_code > 255:
        return True
    # Also, don't allow some specific ASCII codes;
    # add more, if it turns out they are needed.
    if delim_code in (10, 13):
        return True
    return False

def read_dsv(dsv_fil, delim_char):
    for idx, lin in enumerate(dsv_fil):
        fields = lin.split(delim_char)
        assert len(fields) > 0
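        # Knock off the newline at the end of the last field,
        # since it is the line terminator, not part of the field.
        # Use endswith() rather than indexing, because the last field
        # can be an empty string (e.g. a final line that ends with the
        # delimiter and has no newline), and indexing would then fail.
        if fields[-1].endswith('\n'):
            fields[-1] = fields[-1][:-1]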
        # Treat a blank line as a line with one field,
        # an empty string (that is what split returns).
        print("Line", idx, "fields:")
        for idx2, field in enumerate(fields):
            print(str(idx2) + ":", "|" + field + "|")

def main():
    # Get and check validity of arguments.
    sa = sys.argv
    lsa = len(sa)
    if lsa == 1:
        usage(sa)
        sys.exit(0)
    if lsa == 2:
        # Allow the help option with any letter case.
        if sa[1].lower() in ("-h", "--help"):
            usage(sa, verbose=True)
            sys.exit(0)
        else:
            usage(sa)
            sys.exit(0)

    # If we reach here, lsa is >= 3.
    # Check for valid mandatory options (sic).
    if not sa[1] in ("-c", "-n"):
        usage(sa, verbose=True)
        sys.exit(0)

    # If -c option given ...
    if sa[1] == "-c":
        # If next token is not a single character ...
        if len(sa[2]) != 1:
            error_exit(
            "{}: Error: -c option needs a single character after it.".format(sa[0]))
        if not sa[2] in string.printable:
            error_exit(
            "{}: Error: -c option needs a printable ASCII character after it.".format(\
            sa[0]))
        delim_char = sa[2]
    # else if -n option given ...
    elif sa[1] == "-n":
        delim_code = str_to_int(sa[2])
        if invalid_delimiter(delim_code):
            error_exit(
            "{}: Error: invalid delimiter code {} given for -n option.".format(\
            sa[0], delim_code))
        delim_char = chr(delim_code)
    else:
        # Checking for what should not happen ... a bit of defensive programming here.
        error_exit("{}: Program error: neither -c nor -n option given.".format(sa[0]))

    try:
        # If no filenames given, read sys.stdin ...
        if lsa == 3:
            print("processing sys.stdin")
            dsv_fil = sys.stdin
            read_dsv(dsv_fil, delim_char)
            dsv_fil.close()
        # else (filenames given), read them ...
        else:
            for dsv_filename in sa[3:]:
                print("processing file:", dsv_filename)
                # The with statement closes the file even if reading fails.
                with open(dsv_filename, 'r') as dsv_fil:
                    read_dsv(dsv_fil, delim_char)
    except IOError as ioe:
        error_exit("{}: Error: {}".format(sa[0], repr(ioe)))
        
if __name__ == '__main__':
    main()

Here are test runs of the program (both valid and invalid), and the results of each one:

Run it without any arguments. Gives a brief usage message.

$ python read_dsv.py
read_dsv.py: read and process DSV (Delimiter-Separated-Values) data.
Usage: python read_dsv.py [ -c delim_char | -n delim_code ] [ dsv_file ] ...

Run it with a -h option (for help). Gives the verbose usage message.

$ python read_dsv.py -h
read_dsv.py: read and process DSV (Delimiter-Separated-Values) data.
Usage: python read_dsv.py [ -c delim_char | -n delim_code ] [ dsv_file ] ...
where one of either the -c or -n option must be given,
delim_char is a single ASCII delimiter character, and
delim_code is a delimiter character's ASCII code.
Text lines will be read from specified DSV file(s) or
from standard input, split on the specified delimiter
specified by either the -c or -n option, processed, and
written to standard output.

Run it with a -v option (invalid run). Gives the brief usage message.

$ python read_dsv.py -v
read_dsv.py: read and process DSV (Delimiter-Separated-Values) data.
Usage: python read_dsv.py [ -c delim_char | -n delim_code ] [ dsv_file ] ...

Run it with a -c option but no ASCII character argument (invalid run). Gives the brief usage message.

$ python read_dsv.py -c
read_dsv.py: read and process DSV (Delimiter-Separated-Values) data.
Usage: python read_dsv.py [ -c delim_char | -n delim_code ] [ dsv_file ] ...

Run it with a -c option followed by the pipe character (invalid run). The OS (here, Windows) gives an error message because the pipe character cannot be used to end a pipeline.

$ python read_dsv.py -c |
The syntax of the command is incorrect.

Run it with the -c option and the pipe character as the delimiter character, but protected (by double quotes) from interpretation by the OS shell (CMD).

$ python read_dsv.py -c "|" file1.dsv
processing file: file1.dsv
Line 0 fields:
0: |1|
1: |2|
2: |3|
3: |4|
4: |5|
5: |6|
6: |7|
Line 1 fields:
0: |field1|
1: |fld2|
2: | fld3 with spaces around it |
3: |    fld4 with leading spaces|
4: |fld5 with trailing spaces     |
5: |next field is empty|
6: ||
7: |last field|
Line 2 fields:
0: ||
1: |1|
2: |22|
3: |333|
4: |4444|
5: |55555|
6: |666666|
7: |7777777|
8: |88888888|
Line 3 fields:
0: ||
1: ||
2: ||
3: ||
4: ||
5: |                      |
6: ||
7: ||
8: ||
9: ||
10: ||

Run it with the -n option followed by 124, the ASCII code for the pipe character as the delimiter.

$ python read_dsv.py -n 124 file1.dsv
[Gives exactly the same output as the run above, as it should, because both use the same delimiter and read the same input file.]

Copy file1.dsv to file3.dsv and change all the pipe characters (delimiters) in the copy to colons. Then run the program with the -n option followed by 58, the ASCII code for the colon character.

$ python read_dsv.py -n 58 file3.dsv
[Gives exactly the same output as the run above, as it should, because other than the delimiters (pipe versus colon), the input is the same.]

I added support for the -n option because it makes the program more flexible: you can specify any ASCII character that makes sense as a delimiter by giving its ASCII code.

And of course, to find out the values of the ASCII codes for these delimiter characters, I used the char_to_ascii_code.py program from my recent post:

Trapping KeyboardInterrupt and EOFError for program cleanup

You may have noticed that I mentioned delimiter characters and DSV files in that post too. The char_to_ascii_code.py utility shown in that post was created to find the ASCII code for any character (without having to look it up on the web each time).
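The char_to_ascii_code.py utility itself is in that other post and is not reproduced here. As a rough sketch of the underlying idea only, Python's built-in ord() and chr() convert between a character and its code; the function name delimiter_codes is made up for this example:

```python
def delimiter_codes(chars):
    """Map each character to its ASCII code using the built-in ord()."""
    return {ch: ord(ch) for ch in chars}

codes = delimiter_codes("|:,\t")
for ch, code in codes.items():
    # chr() is the inverse of ord(), which is what the -n option relies on.
    assert chr(code) == ch
    print(repr(ch), code)
```

This gives 124 for the pipe and 58 for the colon, the two codes used with the -n option in the test runs above.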

- Enjoy.

- Vasudev Ram - Online Python training and consulting


Friday, January 9, 2015

Convert TSV (Tab Separated Values) to PDF with xtopdf

By Vasudev Ram

I wrote this program, TSVToPDF.py, as a demo of how to convert TSV data to PDF, using my xtopdf toolkit.

TSV, which stands for Tab Separated Values, is a common data format. From the Wikipedia article linked in the previous sentence:

"TSV is a simple file format that is widely supported, so it is often used to move tabular data between different computer programs that support the format. For example, a TSV file might be used to transfer information from a database program to a spreadsheet.

TSV is an alternative to the common comma-separated values (CSV) format, which often causes difficulties because of the need to escape commas – literal commas are very common in text data, but literal tab stops are infrequent in running text. The IANA standard for TSV achieves simplicity by simply disallowing tabs within fields."
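To illustrate the point in the quote, here is a small sketch that reads TSV data with Python's standard csv module instead of the xtopdf TSVReader module used below. The function name read_tsv and the sample data are made up, and Python 3 is assumed:

```python
import csv
import io

def read_tsv(stream):
    """Return all rows of a TSV stream as lists of field strings."""
    # QUOTE_NONE treats quote characters as ordinary data, so the only
    # special character inside a line is the tab delimiter.
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)

# A literal comma inside a field needs no escaping in TSV.
sample = io.StringIO("name\tcity\nAsha\tPune, India\n")
rows = read_tsv(sample)
print(rows)  # [['name', 'city'], ['Asha', 'Pune, India']]
```

When reading from a real file rather than a StringIO, the csv documentation recommends opening it with newline="".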

TSVToPDF.py uses the TSVReader module, for reading TSV data, and uses the PDFWriter module, for writing the PDF output. Both TSVReader.py and PDFWriter.py are part of my xtopdf toolkit for PDF creation in Python.

Here is TSVToPDF.py:

"""
TSVToPDF.py
A demo program to show how to convert TSV data to PDF, 
where TSV stands for Tab Separated Values, a data format commonly 
used on Unix and other operating systems.
Author: Vasudev Ram - http://www.dancingbison.com
Copyright 2015 Vasudev Ram
"""

import sys
from TSVReader import TSVReader
from PDFWriter import PDFWriter

def usage():
    sys.stderr.write("Usage: python " + sys.argv[0] + " tsv_file pdf_file\n")

def main():
    # check for right # of args
    if (len(sys.argv) != 3):
        usage()
        sys.exit(1)

    # extract tsv and pdf filenames from args -
    # using Python's parallel assignment
    tsv_fn, pdf_fn = sys.argv[1:3]

    # create and open the TSVReader instance
    tr = TSVReader(tsv_fn)
    tr.open()

    # create the PDFWriter instance
    # and set some of its fields:
    pw = PDFWriter(pdf_fn)
    pw.setFont("Courier", 10)
    pw.setHeader("Conversion of TSV data to PDF: Input: " + tsv_fn)
    pw.setFooter("Generated by xtopdf: http://slid.es/vasudevram/xtopdf")

    sep = '=' * 68
    pw.writeLine(sep)

    # print the TSV data to PDF
    rec_num = 0
    try:
        while True:
            row = tr.next_row()
            s = ""
            for col in row:
                s = s + col + " "
            pw.writeLine(str(rec_num).rjust(5) + ": " + s)
            rec_num += 1
    except StopIteration:
        pass

    pw.writeLine(sep)
    tr.close()
    pw.savePage()

if __name__ == '__main__':
    main()

# EOF

I ran the demo program like this:

python TSVToPDF.py file1.tsv file1.pdf

where file1.tsv was a TSV file that I created for the purpose of testing.

[Screenshot of the output PDF file, viewed in Foxit PDF Reader.]

- Vasudev Ram - Dancing Bison Enterprises


Sunday, September 14, 2014

Read a DBF file's metadata and data with Python and xtopdf

By Vasudev Ram

DBF files (a.k.a. XBASE files) were one of the most widely used data formats for storing structured relational data on PCs, because the original products that used them, dBase II and III, were among the most successful database products of their time (the early personal computer era). DBF is probably still widely used as a format in small and medium-sized desktop applications.

DBFReader.py, a program I wrote as part of my xtopdf toolkit for PDF creation using Python, can be used to read the contents of a DBF file, including both the metadata (file and field header information) and the data records, and display them on the screen.

Here is an example invocation of DBFReader.py:

python DBFReader.py test3.dbf | more

This command will read the metadata and data of the specified DBF file and display them on your screen. Here is the output from the above command:

File header :

key: rec_len value: 30
key: ver value: 245
key: hdr_len value: 193
key: last_update value: 02/11/04
key: num_flds value: 5
key: num_recs value: 4

Field headers :

num_flds =  5
       Name |   Type | Length | Decimals
  FIELD1    |       C|       5|       0
  FIELD2    |       N|       5|       0
  FIELD3    |       L|       1|       0
  FIELD4    |       D|       8|       0
  FIELD5    |       M|      10|       0

Data records:

(' ', ['AAAAA', '11111', 'F', '19010101', '          '])
(' ', ['BBBBB', '22222', 'T', '19020202', '          '])
(' ', ['CCCCC', '33333', 'F', '19030303', '          '])
(' ', ['DDDDD', '44444', 'T', '19040404', '          '])
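The file header values above can be related to the on-disk layout with a short sketch. This is not the code of DBFReader.py; it assumes the classic dBase III header layout (version byte, three date bytes, then little-endian record count, header length and record length), and the function name parse_dbf_header and the sample date bytes are made up for illustration:

```python
import struct

def parse_dbf_header(header_bytes):
    """Decode the first 12 bytes of a dBase III style DBF file header."""
    # '<' = little-endian, no padding: version (B), date as 3 bytes (BBB),
    # record count (I), header length (H), record length (H).
    ver, d1, d2, d3, num_recs, hdr_len, rec_len = struct.unpack(
        "<BBBBIHH", header_bytes[:12])
    return {
        "ver": ver,
        "last_update": (d1, d2, d3),
        "num_recs": num_recs,
        "hdr_len": hdr_len,
        "rec_len": rec_len,
        # 32-byte file header + 32 bytes per field descriptor
        # + 1 terminator byte, so the field count can be derived.
        "num_flds": (hdr_len - 33) // 32,
    }

# Build a header in memory with the sizes shown in the output above;
# the three date bytes here are arbitrary placeholders.
sample = struct.pack("<BBBBIHH", 245, 4, 2, 11, 4, 193, 30)
print(parse_dbf_header(sample))
```

With a header length of 193, the derived field count is (193 - 33) // 32 = 5, which agrees with the num_flds value in the output.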

- Vasudev Ram - Python consulting and training - Dancing Bison Enterprises


Thursday, March 20, 2014

JSONLint.com, an online JSON validator

By Vasudev Ram

JSON page on Wikipedia.

JSON, as most developers nowadays know, has become useful as a data format both for web client-server communication and for data interchange between different languages, since most popular programming languages have support for it (see the lower part of the JSON home page linked above in this sentence).

While searching for information about some specific aspects of JSON for some Python consulting work, I came across this site:

JSONLint.com

JSONLint.com is an online JSON validator. It is from the Arc90 Lab. (Arc90 is the creator of Readability, a tool that removes the clutter from web pages and makes a clean view for reading now or later on your computer, smartphone, or tablet.)

You paste some JSON data into a text box on the site and then click the Validate button, and it tells you whether the JSON is valid or not.

JSONLint.com is a useful resource for any language with JSON support, including Python.
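For a quick offline check from Python, a similar valid-or-not answer can be had from the standard json module. This is only a sketch and not how JSONLint.com works internally; the function name is_valid_json is made up:

```python
import json

def is_valid_json(text):
    """Return True if text parses as JSON with Python's json module."""
    try:
        json.loads(text)
    except ValueError:
        # json.JSONDecodeError is a subclass of ValueError.
        return False
    return True

print(is_valid_json('{"a": [1, 2, 3]}'))  # True
print(is_valid_json("{'a': 1}"))          # False: single-quoted strings
print(is_valid_json('{"a": 1,}'))         # False: trailing comma
```

One caveat: by default, json.loads also accepts NaN and Infinity, which strict JSON does not allow, so the two checks may not agree on every input.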

P.S. Arc90 is being acquired by SFX Entertainment, Inc. (NASDAQ:SFXE).

- Vasudev Ram - Python consulting and training


Tuesday, October 23, 2012

YAML site eats its own dog food :)

By Vasudev Ram

Just noticed that the YAML web site eats its own dog food :) Cool. I had used YAML (Wikipedia link) some time earlier when working on Ruby and Ruby on Rails. (Rails used YAML then and may still do.)

- Vasudev Ram - Dancing Bison Enterprises

