jugad2 - Vasudev Ram on software innovation: python-docx

Wednesday, October 2, 2013

Convert Microsoft Word files to PDF with DOCXtoPDF


Building on my recent post, Extract text from Word .docx files with python-docx, I came up with the idea of combining the DOCX text extraction functionality of python-docx with my xtopdf toolkit, to create a program that converts the text in Microsoft Word DOCX files to PDF format.

[ Note: The conversion has limitations: only the text is converted. Formatting from the input, such as fonts and tables, is not preserved in the output. ]

Here is the program, called DOCXtoPDF.py. It will become a part of my xtopdf toolkit.

# DOCXtoPDF.py

# Author: Vasudev Ram - http://www.dancingbison.com
# Copyright 2012 Vasudev Ram, http://www.dancingbison.com

# This is open source code, released under the New BSD License -
# see http://www.opensource.org/licenses/bsd-license.php .

import sys
import os.path
from textwrap import TextWrapper
from docx import opendocx, getdocumenttext
from PDFWriter import PDFWriter

def docx_to_pdf(infilename, outfilename):

    # Extract the text from the DOCX file named infilename and write it to
    # the PDF file named outfilename.

    try:
        infil = opendocx(infilename)
    except Exception, e:
        print "Error opening " + infilename
        print "Exception: " + repr(e) + "\n"
        sys.exit(1)

    paragraphs = getdocumenttext(infil)

    pw = PDFWriter(outfilename)
    pw.setFont("Courier", 12)
    pw.setHeader("DOCXtoPDF - convert text in DOCX file to PDF")
    pw.setFooter("Generated by xtopdf and python-docx")
    wrapper = TextWrapper(width=70, drop_whitespace=False)

    # For Unicode handling.
    new_paragraphs = []
    for paragraph in paragraphs:
        new_paragraphs.append(paragraph.encode("utf-8"))

    for paragraph in new_paragraphs:
        lines = wrapper.wrap(paragraph)
        for line in lines:
            pw.writeLine(line)
        pw.writeLine("")

    pw.savePage()
    pw.close()
    
def usage():

    return "Usage: python DOCXtoPDF.py infile.docx outfile.pdf\n"

def main():

    try:
        # Check for correct number of command-line arguments.
        if len(sys.argv) != 3:
            print "Wrong number of arguments"
            print usage()
            sys.exit(1)
        infilename = sys.argv[1]
        outfilename = sys.argv[2]

        # Check for right infilename extension.
        infile_ext = os.path.splitext(infilename)[1]
        if infile_ext.upper() != ".DOCX":
            print "Input filename extension should be .DOCX"
            print usage()
            sys.exit(1)

        # Check for right outfilename extension.
        outfile_ext = os.path.splitext(outfilename)[1]
        if outfile_ext.upper() != ".PDF":
            print "Output filename extension should be .PDF"
            print usage()
            sys.exit(1)

        docx_to_pdf(infilename, outfilename)

    except Exception, e:
        sys.stderr.write("Error: " + repr(e) + "\n")
        sys.exit(1)

if __name__ == '__main__':
    main()

# EOF

To run DOCXtoPDF, give a command of the form:

python DOCXtoPDF.py infilename.docx outfilename.pdf

After this, the text content of the DOCX file will be in the PDF file.
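
The line-wrapping step inside docx_to_pdf can be tried on its own, without python-docx or xtopdf installed. The sketch below uses only the standard library's TextWrapper with the same settings as the program (width 70, drop_whitespace=False). The helper name wrap_paragraphs and the sample text are made up for illustration; they are not part of xtopdf.

```python
from textwrap import TextWrapper

def wrap_paragraphs(paragraphs, width=70):
    """Wrap each paragraph to `width` columns and return a flat list of lines.

    An empty string is appended after each paragraph, mirroring the
    pw.writeLine("") call in docx_to_pdf. (Hypothetical helper.)
    """
    wrapper = TextWrapper(width=width, drop_whitespace=False)
    lines = []
    for paragraph in paragraphs:
        lines.extend(wrapper.wrap(paragraph))
        lines.append("")  # blank line between paragraphs
    return lines

# Made-up sample input: one long paragraph and one short one.
sample = [" ".join(["word"] * 30), "A short paragraph."]
wrapped = wrap_paragraphs(sample)
for line in wrapped:
    print(line)
```

Each string in the returned list is what DOCXtoPDF.py would pass to pw.writeLine(), so this is a quick way to preview how a paragraph will be broken up before it is written to the PDF.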

- Enjoy.

Read other posts about xtopdf on this blog.
Read other posts about Python on this blog.

- Vasudev Ram - Dancing Bison Enterprises


Friday, September 27, 2013

Extract text from Word .docx files with python-docx

By Vasudev Ram

python-docx is a Python library that can be used to extract the text content from Microsoft Word files that are in the .docx format.

Here is a program (modified a bit from the python-docx examples) that shows how to do it:

# extract_docx_text.py

import sys
from docx import opendocx, getdocumenttext

def extract_docx_text(infil, outfil):

    # Extract the text from the DOCX document object infil and write it to
    # the text file object outfil.

    paragraphs = getdocumenttext(infil)

    # For Unicode handling.
    new_paragraphs = []
    for paragraph in paragraphs:
        new_paragraphs.append(paragraph.encode("utf-8"))

    outfil.write('\n'.join(new_paragraphs))

def usage():

    return "Usage: python extract_docx_text.py infile.docx outfile.txt\n"

def main():

    if len(sys.argv) != 3:
        print usage()
        sys.exit(1)

    try:
        infil = opendocx(sys.argv[1])
        outfil = open(sys.argv[2], 'w')
    except Exception, e:
        print "Exception: " + repr(e) + "\n"
        sys.exit(1)

    try:
        extract_docx_text(infil, outfil)
    finally:
        # Close the output file so that all the text is flushed to disk.
        outfil.close()

if __name__ == '__main__':
    main()

# EOF

Save the program as extract_docx_text.py and run it with:

python extract_docx_text.py input_file.docx output_file.txt

That should result in the text of the .docx file being extracted and written to the .txt file.
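
To see roughly what such an extraction involves, here is a minimal sketch that uses only the Python standard library. It relies on the fact that a .docx file is a ZIP archive whose main text lives in word/document.xml, with paragraphs stored as w:p elements and text runs as w:t elements. This is not how python-docx is implemented; the function names docx_paragraphs and make_sample_docx are invented for this example, and the sample archive built here is a stripped-down stand-in for testing, not a file that Word would open.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def docx_paragraphs(docx_file):
    """Return the text of each non-empty paragraph in a .docx file.

    docx_file may be a filename or a binary file-like object.
    (Hypothetical helper, not part of python-docx.)
    """
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is spread over one or more w:t elements.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        if text:
            paragraphs.append(text)
    return paragraphs

def make_sample_docx(paragraph_texts):
    """Build a minimal in-memory archive holding only word/document.xml.

    For demonstration only: a real .docx has more parts, and the texts
    here are assumed to contain no XML special characters.
    """
    body = "".join(
        "<w:p><w:r><w:t>%s</w:t></w:r></w:p>" % text for text in paragraph_texts
    )
    xml = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        '<w:document xmlns:w="http://schemas.openxmlformats.org/'
        'wordprocessingml/2006/main"><w:body>%s</w:body></w:document>' % body
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr("word/document.xml", xml)
    buf.seek(0)
    return buf

sample_docx = make_sample_docx(["First paragraph.", "Second paragraph."])
extracted = docx_paragraphs(sample_docx)
print("\n".join(extracted))
```

For real documents, python-docx remains the more convenient choice; the sketch ignores headers, footers, footnotes and other parts of the file.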

- Vasudev Ram - Dancing Bison Enterprises

