
XML for Source Code Structuring

This document presents srcML, which uses XML to add explicit structure and syntactic information to source code files while preserving comments and formatting. SrcML constructs an XML representation of source code as a structured document rather than a compiler-centric view, supporting program comprehension tools and development environments.


See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/2543984

Source Code Files as Structured Documents

Article · August 2002


Source: CiteSeer


All content following this page was uploaded by Michael Collard on 05 October 2015.



Source Code Files as Structured Documents

Jonathan I. Maletic, Michael L. Collard, Andrian Marcus


Department of Computer Science
Kent State University
Kent, Ohio 44242, USA
[email protected], [email protected] , [email protected]

Abstract

A means to add explicit structure to program source code is presented. XML is used to augment source code with syntactic information from the parse tree. More importantly, comments and formatting are preserved and identified for future use by development environments and program comprehension tools. The focus is to construct a document representation in XML instead of a more traditional data representation of the source code. This type of representation supports a programmer-centric view of the source rather than a compiler-centric view. Our representation is made relevant with respect to other research on XML representations of parse trees and program code. The highlights of the representation are presented and the use of queries and transformations is discussed.

1. Introduction

While program source code is intrinsically structured, the manner in which it is stored has almost no structure at all. That is, source is stored as simple text. While this works quite well for writing code and a little less well for reading code, text is, frankly, a poor medium when it comes to explicitly describing structure. A common solution to this problem, in the field of document engineering, is to add structural information into the text by inserting special characters or tags into the document. The text can then be more easily searched, parsed, and transformed with the aid of these tags. The current standard for marking up documents and information is the Extensible Markup Language (XML), which is being used on a wide variety of document and software engineering problems.

In this paper, we describe an XML application, srcML (SouRce Code Markup Language, pronounced "Source ML"), which is used to add structural information to unstructured source code text files. srcML adds much of the syntactic information found in an abstract syntax tree derived from parsing. However, the representation does not remove the programmer-centric parts of the source; that is, comments, spacing, formatting, macro definitions, etc. are retained in srcML. srcML combines the structure of the syntax (derivation) tree with the generality of the text file.

The programmer does not work directly in srcML; rather, they work in a translation (or view) of a srcML document. This view can be exactly what the programmer originally typed, or alternatively the programmer could define a variety of views, the most obvious being a pretty-printed version.

While srcML may be a convenient representation for reformatting code, that is not our ultimate goal. Source code marked up in srcML is already parsed (at least partially) and the comments are intact. Therefore, doing static analysis, slicing, deriving call graphs, etc. becomes drastically simpler. That is, srcML is an excellent representation for many types of tools for both programming (development) and program comprehension. The primary end product of the software engineering process is usually the compile-able source code and its associated documentation. We propose to use srcML as the form for this product. It is compile-able, with simple pre-processing, while it allows the user to view the source code at a more abstract level.

We now discuss our motivation behind this work and survey the related literature on this topic.

2. Why a structured representation?

Why do we need a document-oriented representation? The parser for a typical programming language (e.g., C++, C, Java) generates an Abstract Syntax Tree (AST) and a symbol table. The format and contents of this output are great for the needs of the compiler but greatly lacking with respect to the needs of software engineering. This decidedly compiler-centric representation lacks large amounts of information that is important for comprehension, the most obvious being comments. Parsing and preprocessing remove this "non-essential" information. However, to the programmer this information is often semantically very important.

Integrated development environments that support the various software engineering tasks (e.g., maintenance, reverse engineering) require a more powerful
representation of the source to be more computationally viable. This is the main purpose of the srcML representation.

A number of options currently exist for representing source code information (e.g., AST or ASG) in XML, namely GXL [5], CppML [8], ATerms [9], and Harmonia [3]. However, these representations are constructed as data exchange languages or for displaying program structural information. None of these representations directly supports the representation of comments or formatting information. The most widely used of these, GXL [5], is an XML-based exchange format for graph-like structures based on GraX (Graph eXchange format) [4] and RSF (Rigi Standard Format) [10]. Software systems are represented as ordered, directed, attributed, and/or typed graphs. While GXL is designed to be a standard exchange format for data that is derived from software, srcML is designed to represent the actual source code. Although srcML can be used as a standard exchange format, the underlying goal of defining and using srcML is to create an intermediate layer of representation between the source code, the developer, and tools that allows easy transformation to a standard exchange format such as GXL.

The most closely related work to srcML is Badros' work on JavaML [2], which is an XML application that provides an alternative representation of Java source code. JavaML is more natural for tools and permits easy specification of numerous software-engineering analyses by leveraging the abundance of XML tools and techniques. However, JavaML does not preserve the original source code document and discards much of the formatting information. As with srcML, it keeps the comments in the text, but it associates them with elements of the program. Therefore, the location of comments is not preserved. We feel that associating comments with constructs should be dictated by coding standards, which change from organization to organization and programmer to programmer. Associating comments is an important step in the program comprehension process, and this should be dealt with separately. Additionally, all formatting information is lost in JavaML and the original source code document cannot be regenerated from JavaML representations.

In the same realm, the Harmonia framework [3] and CppML/JavaML developed at the University of Waterloo [8] are closely related approaches since they encode the AST itself and the actual source code, rather than extracted data (as is the case in GXL). While Harmonia adds tags to source code as metadata, CppML only uses tags and records the additional information as attributes on the tags. The differences mentioned above for Badros' work stand for these approaches as well.

In short, srcML is an attempt to keep the textual semantics of the source code intact while adding explicit structural information. This leaves us with a much richer representation to work with than plain text, but with all of its flexibility.

3. Features of srcML

XML can be used as a document representation (e.g., DocBook, XHTML) or alternatively as a data representation (e.g., SOAP, SMIL, and countless domain-specific formats). srcML is an XML application for representing source code as structured documents and as such has both document and data representation characteristics.

As a document representation, srcML preserves the information of the original text. That is, srcML preserves all information present in the original source code (e.g., formatting and spacing), information that is typically not stored in other representations. Elements occur in the same document order as they do in the original document (as typed by the developer). Document formats often encourage the separation of content from view. This typically means ignoring the white space of the document, since that is considered part of the view. In source code, formatting (e.g., the use of white space) is part of the content: not content that the compiler is interested in, but content inserted explicitly by the programmer who wrote the source code. A srcML document can be used to generate the original source code it came from, or it can be transformed into another srcML document.

As a data format, srcML includes much of the information from an AST of the parsed source. The syntactic structure of the source code is marked up to allow for easy extraction of structural information from the source. srcML encodes data extracted from a partial derivation of the syntax tree. Many comprehension activities do not require a complete parse tree or AST of the source code. Atkinson [1] argues that generating the entire AST is often impractical and that performing incremental or "as needed" parsing is a better approach for many analysis tasks. Parsing the source to a certain granularity level (e.g., expression) still offers the developer sufficient information to carry out most comprehension tasks. If a finer granularity level is desired, then only the parts of the source that were not previously parsed need to be examined and parsed (e.g., the expressions).

The driving principle behind srcML is to provide the user (human or tool) with the ability to view those elements and features of the source code that are needed for their task. A representation of the source code as structured documents directly supports the following:

1. Representation of multiple levels of granularity within the AST;
2. Multiple levels of abstraction (or views);
3. Transformation equality of source to representation and of representation to source;
4. Query-able and search-able representation;
5. Representation of structural information, including macros, templates, and compiler directives (e.g., #include), etc.;
6. Preservation of:
   a) Location of constructs;
   b) Text formatting information;
   c) Comments and their location;
   d) File names and structure;
   e) Macros and macro definitions.

The feature of srcML that differentiates it from other related approaches is its ability to preserve semantic information from the source code.

Every srcML document has a corresponding source code file. This is represented in the srcML document by the element <unit> representing a single compilation unit. This is typically a file or module but can represent any piece of code (POC). A POC is any set of contiguous lines of source code. The attributes of the element <unit> store the file name and directory. Include files (e.g., a .h file in C++) are also stored in their own <unit> element; the contents of the include file are not automatically inserted or applied to the source code files in which they are included. This allows further processing of the documents from the programmer-centric view.

A practical interest here is the difficulty of dealing with macros, templates, and other preprocessor constructs in languages such as C++. srcML does not require complete parsing of this type of information. These types of constructs are simply marked up with specific tags in srcML (e.g., <preproc-stmt>) and not run through the preprocessor. The source is not completely parsed for translation into srcML. We use a partial derivation, stopping before we reach a particular level of syntactic abstraction. For example, in our case we do not completely parse expressions. This allows for on-the-fly generation of srcML. srcML can also represent syntactically incorrect POCs. The issue of syntactical correctness is a compiler problem and is not of supreme importance to the representation.

Each statement in the source code, down to the expression level, has its own element and is marked accordingly. The relationship of a language item in the srcML document to its corresponding item in the associated source code file is:

• All language items appear on the same line in the <unit> element of the srcML document as they would in the source code file.
• White space between srcML elements is exactly the same as the white space between the language elements in the source code document.

White space in XML includes spaces, tabs, and blank lines. While many XML applications consider white space between elements insignificant and normalize it, it can be preserved. White space inside attribute values, however, is normalized by the XML processor (tabs and line breaks become spaces) and so is not preserved. Thus, we only store meta-information about the code in attributes.

Preserving the white space is what allows reformatting to the original source and presenting data and source in a readable format. Often the layout of the source code, as intended by the original developer, conveys a great deal of information (e.g., association or relation by physical proximity).

The issue of associating comments with structural elements of the source code is important in analyzing the semantic information embedded in the source code [7]. The association of comments with the program elements they describe can vary from programmer to programmer. Some programmers like to place the comments describing a function before its implementation, while others place them at the end of the implementation or right after the header. This prompted us to design srcML such that it stores the comments without associating them with a particular program element.

Program comments are stored in a <comment> element with all formatting and location preserved. The user can define rules on how the comments associate with other elements of the source (e.g., methods, classes, etc.), or define special types of comments (e.g., PRE and POST conditions). Once these rules are defined, the user can obtain a view from the srcML document that shows these relations between comments and their associated source code elements.

Sometimes there is further structure to the comments, as in the case of JavaDoc and precondition comments. The comments stored in srcML do not extract the content of this structure directly. However, srcML does provide the comment in a convenient form for extraction and querying.

Because they are XML documents, srcML documents are easily search-able and query-able with standard XML tools. These queries can generate different views of the source code where each view helps the developer solve a particular task. Other source code browsers allow definition of views based on structural information of the source code such as inheritance, visibility, calls, etc. While srcML supports all these elements, it adds the possibility of combining both structural and semantic information extracted from the source code and its associated documentation in one view. Extension of srcML to encode external documentation is currently under investigation.

srcML can be easily used to extract and modify information from source code using the DOM, SAX or XSLT. Selection can be done with XPath using names
that directly relate to the language elements themselves. It is now simple to construct XPath expressions from a programmer-centric view rather than a compiler-centric graph view.

Since elements of srcML are stored in the same order/location as the corresponding source code, and the elements are nested in XML as they would be nested in the source code, filter and extraction processing is straightforward. The srcML manipulation tools can use an event-based interface, such as SAX, to the XML document rather than a DOM interface that requires storing the entire document tree in memory. This is very useful for querying very large (sets of) source code files. The DTD for srcML will be made available on the web page of the Software Development Laboratory <SDML>, at Kent State Univ. (www.sdml.cs.kent.edu).

5. Conclusions and future work

Although srcML relates to research efforts in the standard exchange format community, it proposes a somewhat different approach. We are representing the source code as structured documents. Experiences from the research communities of standard exchange formats, reverse engineering, and document engineering are combined in this proposed format. The srcML document representations can be used as an interface between the developer and the development environment, as well as between analysis tools. The emphasis in srcML is on combining text with both structural and textual information of the source code.

The syntax of the programming language should take two forms: external (easy to understand for the user) and internal (for tool exchange and processing). srcML is our means of internal representation. Through querying, it can generate views in the external representation. In general, a program-understanding tool should support the level of granularity of the comprehension task at hand. Representations in srcML can generate views at different granularity levels to support such tools, while these views also support concepts such as literate programming [6].

A widely used approach to program understanding is plan recognition, bottom-up from the source code to a more abstract description. Essentially, this is done by pattern matching on an internal representation of the code, which leads to detecting patterns of higher-level plans or concepts in lower-level code. Both programmers and tools use this strategy. We strongly believe that representations of the source code such as that provided by srcML directly support this comprehension strategy. We are building tools that rely on the srcML representation, and future research will address the issue of support for program understanding more directly. In addition, ways to represent textual and graphical external documentation within the srcML representation are being investigated.

A set of tools for partial parsing is being developed (a C++ to srcML translator), along with tools to help the user specify views, queries, and rules of association between source code elements (e.g., comments and functions). Since srcML stores any POC, partial programs (or pseudo-code) can be represented. This allows us to develop an editor that can generate much of the srcML on the fly. While not all of srcML may be supported in this manner, it will facilitate many of the features now seen in advanced source code editors.

References

[1] Atkinson, D. C. and Griswold, W. G., "The design of whole-program analysis tools", in Proceedings of 18th International Conference on Software Engineering (ICSE'96), Berlin, Germany, March 25-30 1996, pp. 16-27.

[2] Badros, G. J., "JavaML: A Markup Language for Java Source Code", in Proceedings of 9th International World Wide Web Conference (WWW9), Amsterdam, The Netherlands, May 13-15 2000.

[3] Boshernitsan, M. and Graham, S. L., "Designing an XML-Based Exchange Format for Harmonia", in Proceedings of Seventh Working Conference on Reverse Engineering (WCRE'00), Brisbane, Australia, November 23-25 2000, pp. 287-289.

[4] Ebert, J., Kullbach, B., and Winter, A., "GraX - An Interchange Format for Reengineering Tools", in Proceedings of Sixth Working Conference on Reverse Engineering (WCRE'99), Atlanta, GA, October 6-8 1999, pp. 89-100.

[5] Holt, R. C., Winter, A., and Schürr, A., "GXL: Toward a Standard Exchange Format", in Proceedings of 7th Working Conference on Reverse Engineering (WCRE'00), Brisbane, Queensland, Australia, November 23-25 2000, pp. 162-171.

[6] Knuth, D., "Literate Programming", The Computer Journal, vol. 27, no. 2, 1984, pp. 97-111.

[7] Maletic, J. I. and Marcus, A., "Supporting Program Comprehension Using Semantic and Structural Information", in Proceedings of 23rd International Conference on Software Engineering (ICSE 2001), Toronto, Ontario, Canada, May 12-19 2001, pp. 103-112.

[8] Mammas, E. and Kontogiannis, C., "Towards Portable Source Code Representations using XML", in Proceedings of 7th Working Conference on Reverse Engineering (WCRE'00), Brisbane, Queensland, Australia, November 23-25 2000, pp. 172-182.

[9] van den Brand, M., Sellink, A., and Verhoef, C., "Current Parsing Techniques in Software Renovation Considered Harmful", in Proceedings of 6th International Workshop on Program Comprehension (IWPC'98), Ischia, Italy, June 24-26 1998, pp. 108-117.

[10] Wong, K., "The Rigi User's Manual - Version 5.4.4", The Rigi Group, Date Accessed: 01/20, http://ftp.rigi.csc.uvic.ca/pub/rigi/doc/rigi-5.4.4-manual.pdf, 1998.
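
As an illustration of the properties described in Section 3 (elements in document order, white space kept between elements, comments stored in their own element without a fixed association), the following is a minimal sketch in Python rather than the authors' tooling. Only the element names <unit> and <comment> come from the paper. The other element names (<function>, <type>, <name>, <params>, <block>, <return>), the attribute names filename and dir, and the helper functions are assumptions made for the example, and the "comment directly before a function" rule is just one possible user-defined association rule.

```python
import xml.etree.ElementTree as ET

# Original C++ source, exactly as a developer might have typed it.
SOURCE = (
    "// Return the larger of two values.\n"
    "int max(int a, int b)\n"
    "{\n"
    "    return a > b ? a : b;   /* ties go to b */\n"
    "}\n"
)

# Hypothetical srcML-style markup of SOURCE. Only <unit> and <comment> are
# named in the paper; every other tag and both attribute names are assumed.
SRCML = (
    '<unit filename="max.cpp" dir="src">'
    "<comment>// Return the larger of two values.</comment>\n"
    "<function><type>int</type> <name>max</name>"
    "<params>(int a, int b)</params>\n"
    "<block>{\n"
    "    <return>return a &gt; b ? a : b;</return>"
    "   <comment>/* ties go to b */</comment>\n"
    "}</block></function>\n"
    "</unit>"
)


def to_source(unit):
    """Regenerate the source by joining all text nodes in document order."""
    return "".join(unit.itertext())


def comment_texts(unit):
    """Return the text of every <comment> element, in document order."""
    return [c.text for c in unit.iter("comment")]


def leading_comment(unit, function_name):
    """Assumed association rule: the comment that is the sibling directly
    before the named function. Returns None if there is no such comment."""
    previous = None
    for child in unit:
        if child.tag == "function" and child.findtext("name") == function_name:
            if previous is not None and previous.tag == "comment":
                return previous.text
            return None
        previous = child
    return None


unit = ET.fromstring(SRCML)

# Markup to source: the text and white space between tags is the source.
assert to_source(unit) == SOURCE

# Queries use names that relate directly to the language elements.
print(comment_texts(unit))
print([n.text for n in unit.findall("./function/name")])
print(leading_comment(unit, "max"))
```

Because the markup only wraps the original characters, removing the tags is enough to get the source back. A different association rule (for example, comments placed right after the function header) would only change leading_comment, not the stored document.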
