jugad2 - Vasudev Ram on software innovation: PDFBuilder


Wednesday, July 17, 2013

PDFBuilder can now handle unlimited input files

By Vasudev Ram

I had blogged about PDFBuilder a couple of times earlier, here:

PDFBuilder can create composite PDFs

and here:

PDFBuilder can now take multiple input files from command line

I modified PDFBuilder to be able to take the list of input files from a filename specified on the command line with a -f option. So it can now handle an unlimited (*) number of input files.

(*) Well, strictly speaking, still not unlimited, but limited only by the available memory and hard disk space, and by the maximum size of a single file (the output PDF file). But for practical purposes, that can be considered as unlimited.

Here is the updated PDFBuilder program:

# Filename: PDFBuilder.py
# Description: To create composite PDF files containing the content from 
# a variety of input sources, such as CSV files, TDV (Tab Delimited 
# Values) files, XLS files, etc.

# Author: Vasudev Ram - http://www.dancingbison.com
# Copyright 2012 Vasudev Ram, http://www.dancingbison.com

# This is open source code, released under the New BSD License -
# see http://www.opensource.org/licenses/bsd-license.php .

# ------------------------- imports -------------------------

import sys
import os
import os.path
import string
import csv
from  CSVReader import CSVReader
from  TDVReader import TDVReader
from  PDFWriter import PDFWriter

# ------------------------ class PDFBuilder ----------------

class PDFBuilder:
 """
 Class to build a composite PDF out of multiple input sources.
 """

 def __init__(self, pdf_filename, font, font_size, 
    header, footer, input_filenames):
  """
  PDFBuilder __init__ method.
  """
  self._pdf_filename = pdf_filename
  self._input_filenames = input_filenames

  # Create a PDFWriter instance
  self._pw = PDFWriter(pdf_filename)
  debug("PDFBuilder.__init__(): Created PDFWriter instance")

  # Set its font
  self._pw.setFont(font, font_size)

  # Set the header and footer for the PDFWriter instance
  self._pw.setHeader(header)
  self._pw.setFooter(footer)
  
 def build_pdf(self, input_filenames):
  """
  PDFBuilder.build_pdf method.
  Builds the PDF using contents of the given input_filenames.
  """
  for input_filename in input_filenames:
   # Check if name ends in ".csv", ignoring upper/lower case
   if input_filename[-4:].lower() == ".csv":
    reader = CSVReader(input_filename)
    debug("Created a CSVReader from " + input_filename)
   # Check if name ends in ".tdv", ignoring upper/lower case
   elif input_filename[-4:].lower() == ".tdv":
    reader = TDVReader(input_filename)
    debug("Created a TDVReader from " + input_filename)
   else:
    sys.stderr.write("Error: Invalid input file. Exiting\n")
    sys.exit(1)

   debug("Reading from %r" % reader.get_description())
   hdr_str = "Data from reader: " + \
    reader.get_description()
   self._pw.writeLine(hdr_str)
   self._pw.writeLine('-' * len(hdr_str))

   reader.open()
   try:
    while True:
     row = reader.next_row()
     debug("row", row)
     s = ""
     for item in row:
      s = s + item + " "
     debug("s", s)
     self._pw.writeLine(s)
   except StopIteration:
    # Close this reader, save this PDF page, and 
    # start a new one for next reader.
    reader.close()
    self._pw.savePage()
    #continue

 def close(self):
  self._pw.close()

# ------------------------- main() --------------------------

def main():

 # global variables

 # program name for error messages
 global prog_name
 # debug flag - if true, print debug messages, else don't
 global DEBUGGING
 
 # Set the debug flag based on environment variable
 debug_env_var = os.getenv("DEBUG")
 if debug_env_var == "1":
  DEBUGGING = True

 sysargv = sys.argv
 lsa = len(sysargv)

 # Save program filename for error messages
 prog_name = sysargv[0]
 debug("Entered " + prog_name + ":main()")

 # check for right args
 debug("lsa =", lsa)
 if lsa < 3:
  usage()
  debug(prog_name + ": Incorrect number of args, exiting.")
  sys.exit(1)

 # Get output PDF filename from the command line.
 pdf_filename = sys.argv[1]
 debug("PDF filename = ", pdf_filename)

 # Check if -f option given; test the argument count first so that
 # sysargv[2] is never indexed when it does not exist.
 if lsa == 4 and sysargv[2] == '-f':
  # If so, read the input filenames from the file given as 
  # sysargv[3] (the input filenames list)
  input_filenames = []
  with open(sysargv[3], "r") as ifl:
   for fn in ifl:
    input_filenames.append(fn.strip('\n'))
 else:
  # Get the input filenames from the command line.
  input_filenames = sys.argv[2:]

 # Create a PDFBuilder instance.
 pdf_builder = PDFBuilder(pdf_filename, "Courier", 10, 
  "Composite PDF", "Composite PDF", input_filenames)

 # Build the PDF using the inputs.
 pdf_builder.build_pdf(input_filenames)

 pdf_builder.close()

 sys.exit(0)

#------------------------- debug ----------------------------

def debug(msg, *args):

 global DEBUGGING
 if not DEBUGGING:
  return
 sys.stderr.write(msg + ": ")
 sys.stderr.write(repr(args) + "\n")

#------------------------- usage ----------------------------

def usage():
 
 global prog_name
 sys.stderr.write("Usage: python " + prog_name + \
  " pdf_filename input_filename(s)\n" + \
  " OR python " + prog_name + " pdf_filename -f input_filename_list\n" + \
  " where input_filename_list is a file containing input filenames\n")

#------------------------- call main ------------------------

if __name__ == "__main__":
 # Set default value for DEBUGGING, override later in main() 
 # based on value of env. var. DEBUG.
 try:
  DEBUGGING = False
  main()
 except Exception as e:
  sys.stderr.write("Caught an exception: " + str(e) + "\n")
  sys.exit(1)


#------------------------- EOF: PDFBuilder.py -----------------

You can run it like this:

python PDFBuilder.py PDFBuilder11.pdf -f input_filename_list.txt

Here, the same input filenames used in the earlier posts are stored in the file input_filename_list.txt, one per line (with no leading or trailing spaces).

This will create the composite PDF file PDFBuilder11.pdf, generated from the contents of all those files, as in the earlier posts.

The difference is that in the previous post about PDFBuilder, the input filenames were specified on the command line, which is subject to a length limit. (In earlier UNIX versions the limit was typically 512 or 1024 bytes, which sometimes led to errors or core dumps; it has been increased in more recent UNIX and Linux versions.)

But this version of PDFBuilder can handle a very large number of input files, since they are not specified on the command line but in another text file, which is given after the -f option in the above command.
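
The list-reading step can be sketched as a small standalone helper. This is only a minimal sketch, not the code in PDFBuilder.py: the name read_filename_list is hypothetical, and unlike the program above it also strips surrounding spaces and skips blank lines, which is my assumption about what would be convenient.

```python
import os
import tempfile


def read_filename_list(list_path):
    """Return the filenames stored one per line in list_path.

    Hypothetical helper, not part of PDFBuilder.py. Blank lines are
    skipped and surrounding whitespace is removed from each name.
    """
    filenames = []
    with open(list_path, "r") as list_file:
        for line in list_file:
            name = line.strip()
            if name:
                filenames.append(name)
    return filenames


def demo():
    # Write a throwaway list file, read it back, then clean up.
    fd, list_path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "w") as f:
        f.write("file1.csv\nfile1.tdv\n\n  file2.csv  \n")
    try:
        return read_filename_list(list_path)
    finally:
        os.remove(list_path)


if __name__ == "__main__":
    print(demo())
```

Because the names are read line by line from a file, the length of the list is not tied to any command-line limit.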

I will upload this new PDFBuilder version to the Bitbucket repository for xtopdf shortly.

To read all my posts about xtopdf, you can use this search:

jugad2.blogspot.com/search/label/xtopdf

and similarly, to read all my posts about Python, use this search:

jugad2.blogspot.com/search/label/python

This is a Blogger feature that I got to know about, thanks to Michael Foord.

- Vasudev Ram - Dancing Bison Enterprises

Contact / Hire me


Monday, November 5, 2012

PDFBuilder can now take multiple input files from command line

By Vasudev Ram

PDFBuilder, which I blogged about recently, can now build a composite PDF from an arbitrary number [1] of input files (CSV and TDV) [2] specified on the command line. (I've removed the hard-coding in the first version.)

I've also cleaned up and refactored the PDFBuilder code some, though I still need to do some more.

UPDATE: I've pasted a few code snippets from PDFBuilder.py at the end of this post.

This version of PDFBuilder can be downloaded here, as a part of xtopdf v1.4, from the Bitbucket repository.

[1] Arbitrary number, that is, subject to the limitations of the length of the command line supported by your OS, of course - whether Unix / Linux, Mac OS X or Windows. However, there is a solution for that.

[2] The design of PDFBuilder allows for easily adding support for other input file formats that are row-oriented. See the method next_row() in the file CSVReader.py in the source package, for an example of how to add support for other compatible input formats. You just have to write a reader class (analogous to CSVReader) for that other format, called, say, FooReader, and provide an open() method and a next_row() method as in the CSVReader class, but adapted to handle Foo data.
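
As a rough illustration of footnote [2], here is a sketch of what such a reader might look like for a made-up pipe-separated format. The class name PSVReader and the file format are hypothetical. The method names (get_description, open, next_row, close) and the use of StopIteration to signal the end of the data are inferred from how build_pdf() uses a reader; the real CSVReader in xtopdf may be implemented differently.

```python
import csv
import os
import tempfile


class PSVReader:
    """Hypothetical reader for pipe-separated values, e.g. "a|b|c"."""

    def __init__(self, filename):
        self._filename = filename
        self._file = None
        self._rows = None

    def get_description(self):
        return "PSVReader: " + self._filename

    def open(self):
        self._file = open(self._filename, "r")
        # csv.reader accepts any single-character delimiter.
        self._rows = csv.reader(self._file, delimiter="|")

    def next_row(self):
        # next() raises StopIteration when the rows are exhausted,
        # which is the end-of-data signal build_pdf() relies on.
        return next(self._rows)

    def close(self):
        self._file.close()


def demo():
    # Create a small sample file, then read it with the same
    # open / next_row / StopIteration pattern used in build_pdf().
    fd, path = tempfile.mkstemp(suffix=".psv")
    with os.fdopen(fd, "w") as f:
        f.write("id|name\n1|alpha\n2|beta\n")
    reader = PSVReader(path)
    lines = []
    reader.open()
    try:
        while True:
            row = reader.next_row()
            lines.append(" ".join(row))
    except StopIteration:
        reader.close()
    os.remove(path)
    return lines


if __name__ == "__main__":
    print(demo())
```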

Some code snippets from PDFBuilder.py:

The PDFBuilder class:

class PDFBuilder:
 """
 Class to build a composite PDF out of multiple input sources.
 """

 def __init__(self, pdf_filename, font, font_size, 
    header, footer, input_filenames):
  """
  PDFBuilder __init__ method.
  """
  self._pdf_filename = pdf_filename
  self._input_filenames = input_filenames

  # Create a PDFWriter instance.
  self._pw = PDFWriter(pdf_filename)

  # Set its font.
  self._pw.setFont(font, font_size)

  # Set its header and footer.
  self._pw.setHeader(header)
  self._pw.setFooter(footer)
  
 def build_pdf(self, input_filenames):
  """
  PDFBuilder.build_pdf method.
  Builds the PDF using contents of the given input_filenames.
  """

  # Loop over all names in input_filenames.
  # Instantiate the appropriate reader for each filename, 
  # based on the filename extension.

  # For each reader, get each row, and for each row,
  # combine all the columns into a string separated by a space,
  # and write that string to the PDF file.

  # Start a new PDF page after each reader's content is written
  # to the PDF file.

  for input_filename in input_filenames:
   # Check if name ends in ".csv", ignoring upper/lower case
   if input_filename[-4:].lower() == ".csv":
    reader = CSVReader(input_filename)
   # Check if name ends in ".tdv", ignoring upper/lower case
   elif input_filename[-4:].lower() == ".tdv":
    reader = TDVReader(input_filename)
   else:
    sys.stderr.write("Error: Invalid input file. Exiting\n")
    sys.exit(1)

   hdr_str = "Data from reader: " + \
    reader.get_description()
   self._pw.writeLine(hdr_str)
   self._pw.writeLine('-' * len(hdr_str))

   reader.open()
   try:
    while True:
     row = reader.next_row()
     s = ""
     for item in row:
      s = s + item + " "
     self._pw.writeLine(s)
   except StopIteration:
    # Close this reader, save this PDF page, and 
    # start a new one for next reader.
    reader.close()
    self._pw.savePage()
    #continue

 def close(self):
  self._pw.close()

The main() function that uses the PDFBuilder class to create a composite PDF:

def main():

 # global variables

 # program name for error messages
 global prog_name
 # debug flag - if true, print debug messages, else don't
 global DEBUGGING
 
 # Set the debug flag based on environment variable DEBUG, 
 # if it exists.
 debug_env_var = os.getenv("DEBUG")
 if debug_env_var == "1":
  DEBUGGING = True

 # Save program filename for error messages
 prog_name = sys.argv[0]

 # check for right args
 if len(sys.argv) < 2:
  usage()
  sys.exit(1)

 # Get output PDF filename from the command line.
 pdf_filename = sys.argv[1]

 # Get the input filenames from the command line.
 input_filenames = sys.argv[2:]

 # Create a PDFBuilder instance.
 pdf_builder = PDFBuilder(pdf_filename, "Courier", 10, 
       "Composite PDF", "Composite PDF", 
       input_filenames)

 # Build the PDF using the inputs.
 pdf_builder.build_pdf(input_filenames)

 pdf_builder.close()

 sys.exit(0)

And a batch file, run.bat, calls the program with input filename arguments:

@echo off
python PDFBuilder.py %1 file1.csv file1.tdv file2.csv file2.tdv file1-repeats5.csv

Run the batch file like this:

C:> run composite.pdf

which will create a PDF file, composite.pdf, from the input CSV and TDV files given as command-line arguments.

Enjoy.

- Vasudev Ram - Dancing Bison Enterprises


Saturday, November 3, 2012

PDFBuilder can create composite PDFs

By Vasudev Ram

PDFBuilder is a tool to create composite PDFs, i.e. PDFs comprising data from multiple different input data formats. It is a new component of my xtopdf toolkit for PDF generation.

At present, for input formats, PDFBuilder supports only CSV (Comma Separated Values, which can be exported from / imported to spreadsheets, among other things) and TDV / TSV (Tab Delimited Values / Tab Separated Values, which many UNIX / Linux tools like sed, grep, and awk can create or process).

But support for more input formats can be added fairly easily, due to the design.
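
One possible way to keep adding formats cheap is to look up the reader class by filename extension in a table. The sketch below is an illustration of that idea only, not how PDFBuilder.py is written; the names READER_CLASSES and make_reader are hypothetical, and the two stub classes merely stand in for xtopdf's real CSVReader and TDVReader.

```python
import os.path


class StubCSVReader:
    """Stand-in for xtopdf's CSVReader (hypothetical, for illustration)."""

    def __init__(self, filename):
        self.filename = filename


class StubTDVReader:
    """Stand-in for xtopdf's TDVReader (hypothetical, for illustration)."""

    def __init__(self, filename):
        self.filename = filename


# Map lower-case filename extensions to reader classes.
# Supporting a new format means adding one entry here.
READER_CLASSES = {
    ".csv": StubCSVReader,
    ".tdv": StubTDVReader,
}


def make_reader(filename):
    """Return a reader for filename, chosen by its extension.

    Raises ValueError if the extension is not in READER_CLASSES.
    """
    ext = os.path.splitext(filename)[1].lower()
    try:
        reader_class = READER_CLASSES[ext]
    except KeyError:
        raise ValueError("Unsupported input file type: %r" % filename)
    return reader_class(filename)


if __name__ == "__main__":
    print(type(make_reader("DATA.CSV")).__name__)
```

With a table like this, the code that loops over the input files does not need to change when a format is added.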

PDFBuilder is included in xtopdf v1.4 (just released on Bitbucket).

To try PDFBuilder:

- Download xtopdf v1.4, then follow the steps in the file README.txt; the steps include installing Python (>= v2.2), if you don't have it already, and ReportLab v1.21. (The steps for installing ReportLab are here.)

Then run this command:

python PDFBuilder.py output.pdf

This will create a composite PDF file, output.pdf, from two CSV files and two TDV files (interleaved). This is hard-coded as of now, but will be changed to take a list of input files from the command-line.

The download includes the 4 input files and the corresponding output PDF file.

Note: The xtopdf links on SourceForge and my site dancingbison.com have not yet been updated for xtopdf v1.4, so don't try to get v1.4 from there, for now.

You can read more about the ReportLab toolkit here.

- Vasudev Ram - Dancing Bison Enterprises

