Some Thoughts on Teaching FP

April 17, 2011

A number of people have asked some very good questions about the details of how we teach certain concepts in our new functional programming class for freshmen at Carnegie Mellon.  Rather than spray responses among the various comments, I’ll summarize a few major points here in hopes of helping others who may wish to teach a similar course.  So this post is not really meant for a broad audience, but rather for the specialist; feel free to skip it if it seems too focused for your interests.  I promise to write some more controversial material of more general interest soon!  Meanwhile, here are a few thoughts presented in no particular order of importance.

Because the class is intended for beginners, we start from scratch with a language-based model of computation.  This means that, with one regrettable misstep on our part, we never talk about extra-linguistic concepts like “run-time stacks” or “compilers.”  The students are taught to think in terms of the language itself, and to reason directly about both the correctness and efficiency of the code they actually write, not the code that it allegedly compiles or translates to.  One beautiful feature of the language-based approach is that we start with a very familiar model of computation: the evaluation of polynomials over the reals.  It’s familiar to all students, and I think they find it satisfying precisely because it has a direct computational flavor.  You plug in for variables and simplify, and out comes the answer.  We can draw on this as our starting point; programs are just generalized polynomials.  In particular, in a functional language variables are variables: the mathematical concept of variable, which is given meaning by substitution, is precisely the programming concept of variable.  It’s not analogous, it’s the same.  So we can draw on their experience and ask them to prove things about programs using methods that build directly on what they already know.  It’s extremely natural, and very beautiful, and leads easily to an integrated view of mathematics and programming.  Moreover, it’s a level playing field.  Students with prior “programming experience” are, if anything, at a disadvantage, because they think they know things that are either wrong or inapplicable.  One consequence is gender equity: even with American society being what it is, the women have no particular disadvantage with respect to the men when it comes to our style of thinking and programming.  It’s a win-win-win situation.
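To make the "programs are generalized polynomials" point concrete, here is a minimal sketch (the function name is ours, for illustration): a polynomial written directly as an ML function, where evaluation is exactly substitution.

```sml
(* The polynomial p(x) = x^2 + 2x + 1, written as an SML function.
   Evaluating p 2 is substitution: plug 2 in for x and simplify,
   just as in high school algebra. *)
fun p (x : int) : int = x * x + 2 * x + 1

val nine = p 2   (* [2/x](x*x + 2*x + 1) = 4 + 4 + 1 = 9 *)
```

Because the variable is given meaning by substitution, equational reasoning about `p` is ordinary algebra; there is no hidden machine state to account for.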

Moving to a more technical level, the use of structural operational semantics is indispensable for providing a rigorous foundation for understanding program execution, reasoning about program correctness, and for defining a cost model to support reasoning about asymptotic complexity.  There is no substitute for this!  Without a crisp formulation of the semantics of the language, it is impossible to discuss any of these issues in a meaningful and technically precise way.  With it you can readily resolve any questions about “what happens if …”, giving the students a tool that they can use themselves to answer such questions.  Moreover, as program verification becomes more important in industrial practice, as well as academic research, it is essential that students be educated in the tools of semantics.  Structural operational semantics is very easy to teach, and presents no problems for the students.  We just use it, and they get it without any fuss or bother.  It is a natural extension of their experience with high school algebra.  Be not afraid of using these tools to teach programming!
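As a taste of how approachable this is, here is a sketch of a structural operational semantics rendered as running code, for a toy expression language (the datatype and function names are ours): `step` performs a single transition e ↦ e′, and `eval` iterates steps until a value is reached.

```sml
(* A one-step transition relation for a toy language of sums,
   evaluated left to right, as a (deterministic) function. *)
datatype exp = Num of int | Plus of exp * exp

fun step (Plus (Num m, Num n)) = Num (m + n)           (* reduce a redex *)
  | step (Plus (Num m, e2))    = Plus (Num m, step e2) (* left is a value: step right *)
  | step (Plus (e1, e2))       = Plus (step e1, e2)    (* otherwise step left *)
  | step (Num n)               = Num n                 (* values do not step *)

(* Multi-step evaluation: iterate the transition relation. *)
fun eval (Num n) = n
  | eval e       = eval (step e)

val seven = eval (Plus (Num 3, Plus (Num 2, Num 2)))   (* 7 *)
```

Questions like "what happens if the left summand is itself a sum?" are answered by reading the clauses, not by appeal to a compiler.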

As I’ve explained previously, it is a very good idea to avoid Booleans as much as possible.  And, above all, don’t mention equality!  The equals sign in programming languages is not the equals sign of mathematics.  Propositions are not Booleans, and it only confuses matters to use notations that encourage this misconception.  Related to this, avoid if-then-else entirely, and instead use only case analysis for branching, even when the value to be discriminated is a Boolean.  We consistently write things like

case Int.compare(x,y) of
  LESS => ...
| GREATER => ...
| EQUAL => ...

rather than a nested conditional branch.  It encourages students to think in terms of pattern matching, and prepares the ground for later developments, including a smooth transition to pattern matching over more complex data structures and reasoning inductively when programming recursively.

Teaching parallelism is completely straightforward, because the model of computation inherently avoids unnatural and unnecessary problems of interference, and focuses attention on the critical issue of data dependencies among computations in a program.  Students have no trouble computing the work (sequential time complexity) or span (parallel time complexity) of a program, and have no problems reading off recurrences for the respective time complexities.  Later, when we introduce sequences, the idea of computing in parallel with the entire sequence, rather than item-by-item (as encouraged by the dreadful iterators so beloved in the OO world), comes naturally and easily.  The key to this, of course, is that data structures in a functional language are naturally persistent; it is monstrously hard to use ephemeral data structures in a parallel computation, and is not something we could hope to teach freshmen.
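A small example of reading off work and span (the code is a plain sequential sketch; in a parallel dialect the two recursive calls would be evaluated in parallel):

```sml
(* Summing a balanced binary tree of n leaves.  The two recursive calls
   have no data dependency on one another, so they may run in parallel.
   Work:  W(n) = 2 W(n/2) + O(1) = O(n)
   Span:  S(n) =   S(n/2) + O(1) = O(log n)                          *)
datatype tree = Leaf of int | Node of tree * tree

fun sum (Leaf n)      = n
  | sum (Node (l, r)) =
      let val (sl, sr) = (sum l, sum r)  (* independent subcomputations *)
      in sl + sr end
```

The recurrences come straight from the code: one clause, one equation.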

A major decision for us is how to teach the expression and enforcement of abstraction in a program.  In a departure from our previous approach, we have decided against using opaque ascription (sealing) as a means of enforcing abstraction.  It has its virtues, but the problem is that it does not mesh well with other language features, in particular with substructures and type classes (views).  For example, consider the signature of a mapping whose domain is an ordered type of keys:

signature MAPPING = sig
  structure Key : ORDERED
  type 'a mapping
  val lookup : Key.t -> 'a mapping -> 'a
  ...
end

Unfortunately, sealing a structure with this signature renders the module useless:

structure IntMapping :> MAPPING = struct
  structure Key = Int
  type 'a mapping = 'a bst
  ...
end

The trouble is that not only is the type ‘a IntMapping.mapping abstract, as intended, but so is IntMapping.Key.t, which is not at all intended!  To get around this we must create a specialization of the signature MAPPING using one of several means, such as

signature INT_MAPPING = MAPPING where type Key.t=int
structure IntMapping :> INT_MAPPING = ...

Of course one need not name the signature, but this illustrates the general problem.  As things get more complicated, you have more and more clauses that specify the types of things (sharing specifications).

The alternative, which has worked very well for us, is to eschew opaque ascription, and instead rely on the datatype mechanism to make types abstract.  So to give an implementation of the abstract type of mappings with keys being integers, we proceed as follows:

structure IntMapping : MAPPING = struct
  structure Key : ORDERED = Int
  datatype 'a bst = Empty | Node of 'a bst * (Key.t * 'a) * 'a bst
  type 'a mapping = 'a bst
  val insert = ...
end

The point is that since the constructors of the type ‘a bst are not exported in the interface, the type ‘a IntMapping.mapping is abstract.  Note as well that the use of transparent ascription on the structure Key ensures that keys really are integers (of type Int.int), and are not abstract, exactly as intended.  This formulation allows us to state simple rules of signature matching (every specification in the signature has a corresponding declaration in the structure), and allows us to enforce abstraction boundaries with a minimum of fuss.  The students have had absolutely no trouble with this at all, and we have had no trouble structuring our code this way.
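For readers who want to see the whole pattern in one piece, here is a self-contained sketch; the ORDERED signature and the particular operations in MAPPING are abbreviated guesses at the course's actual interfaces, and IntOrd is our own wrapper making Int match ORDERED.

```sml
signature ORDERED = sig
  type t
  val compare : t * t -> order
end

(* A structure packaging integers with their usual ordering. *)
structure IntOrd : ORDERED = struct
  type t = int
  val compare = Int.compare
end

signature MAPPING = sig
  structure Key : ORDERED
  type 'a mapping
  val empty  : 'a mapping
  val insert : Key.t * 'a -> 'a mapping -> 'a mapping
  val lookup : Key.t -> 'a mapping -> 'a option
end

structure IntMapping : MAPPING = struct
  structure Key : ORDERED = IntOrd   (* transparent: Key.t is visibly int *)
  datatype 'a bst = Empty | Node of 'a bst * (Key.t * 'a) * 'a bst
  type 'a mapping = 'a bst
  val empty = Empty
  fun insert (k, v) Empty = Node (Empty, (k, v), Empty)
    | insert (k, v) (Node (l, (k', v'), r)) =
        (case Key.compare (k, k') of
           LESS    => Node (insert (k, v) l, (k', v'), r)
         | GREATER => Node (l, (k', v'), insert (k, v) r)
         | EQUAL   => Node (l, (k, v), r))
  fun lookup k Empty = NONE
    | lookup k (Node (l, (k', v), r)) =
        (case Key.compare (k, k') of
           LESS    => lookup k l
         | GREATER => lookup k r
         | EQUAL   => SOME v)
end

(* The bst constructors are not in MAPPING, so 'a IntMapping.mapping is
   effectively abstract, while IntMapping.Key.t is visibly int. *)
val m = IntMapping.insert (3, "three") IntMapping.empty
val r = IntMapping.lookup 3 m   (* SOME "three" *)
```

Notice that no sealing is needed: hiding the datatype's constructors is what makes the type abstract.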

When using functors (parameterized modules) to combine modules it is, of course, necessary to impose sharing constraints to ensure that only coherent compositions are possible.  (Rather than take the space to explain this here, I will rely on the reader’s experience to understand what is involved.)  These sorts of sharing specifications are perfectly natural, easily explained, and have presented absolutely no difficulties for the students.  We illustrated their use in our game tree search example, in which the “referee” module is parameterized by the two “player” modules, which must of course cohere on their concept of a “game” (it’s no use pitting a chess player against a checkers player!).  The code looks like this

functor Referee
  (structure Player1 : PLAYER and Player2 : PLAYER
   sharing type Player1.Game.t = Player2.Game.t) : REFEREE = ...

The sharing specification states precisely and concisely the natural coherence constraint governing the two players.  Here again, the dog we feared never barked, and the students found it all quite intuitive and unproblematic.  This allowed them to expend their time on the actual complexities of the problem at hand, such as how to think about alpha-beta pruning in a parallel game-tree search, rather than get all tied up with the bureaucracy of structuring the code itself.
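A self-contained sketch of the setup (GAME, PLAYER, and the body of Referee are our guesses at the course's interfaces; the real functor also ascribes a REFEREE result signature, omitted here):

```sml
signature GAME = sig
  type t
  val finished : t -> bool
end

signature PLAYER = sig
  structure Game : GAME
  val move : Game.t -> Game.t
end

functor Referee
  (structure Player1 : PLAYER and Player2 : PLAYER
   sharing type Player1.Game.t = Player2.Game.t) =
struct
  (* Alternate moves until the game is over.  The sharing specification
     is exactly what makes it legal to hand Player1's state to Player2. *)
  fun play st =
    if Player1.Game.finished st then st
    else let val st' = Player1.move st
         in if Player2.Game.finished st' then st'
            else play (Player2.move st')
         end
end
```

Delete the sharing specification and `play` no longer typechecks: the two players' notions of game state would be distinct abstract types.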

The virtue of teaching bright young students is that they are bright and they are young.  Their brilliance is, of course, a pleasure.  We have to work hard to come up with sufficiently challenging exercises, and many students challenge us with their solutions to our problems.  Their youth means that they come to us with a minimum of misconceptions and misinformation that they’ve picked up on the street, and are open to learning methods that are entirely non-standard (at least for now) with respect to what their friends are learning at other universities.  What makes Carnegie Mellon a special place is precisely that the students are pushed into thinking hard in ways that they otherwise might not be.  Personally, I hope that more universities worldwide build on what we have started, and give their students the same benefits that ours are already enjoying.


Modules Matter Most

April 16, 2011

When it comes to controlling the complexity of developing and, more importantly, maintaining a large system, the only game in town is modularity.  And as even the strongest proponents of unityped languages have come to accept, modularity is all about types (static types, of course, there being no other kind).  A decomposition of a system into modules consists of an application of the structural principle of substitution (transitivity of entailment, composition of maps) that is fundamental to the very conception of a type system:

\displaystyle{{\Gamma\vdash M : A \qquad \Gamma,x:A \vdash N:B}\over{\Gamma\vdash [M/x]N:B}}

In pedestrian terms the type A is the “header file” describing M, and N is the new code implementing specification B in terms of an unspecified implementation of the functionality specified by A.  The interaction between M and N is mediated entirely by the type A; access to the source code of M is denied to the client, precisely so that the dependence between the components is weakened in anticipation of future changes to M, or to allow for flexibility in development (M need not even exist for the development of N to proceed, as long as the type A is specified).
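The substitution rule, rendered in module-level terms with illustrative names of our own: A is a signature, M a structure implementing it, and N a functor (the client) that depends only on A, never on M's source.

```sml
(* A: the "header file". *)
signature A = sig
  val f : int -> int
end

(* N: the client, mediated entirely by the type A.  It can be written,
   and typechecked, before any M exists. *)
functor N (M : A) = struct
  val g = fn x => M.f x + 1
end

(* M: one implementation of A, of many possible. *)
structure M : A = struct
  fun f x = 2 * x
end

(* [M/x]N: plug the implementation in. *)
structure NofM = N (M)
val five = NofM.g 2   (* 2*2 + 1 = 5 *)
```

Swapping in a different M requires re-elaborating the functor application only; N itself is untouched, which is the whole point of the rule.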

To be most useful, it is important that the relationship between M and A be many-to-many, and not many-to-one or one-to-one.  Think of A as the API for a mapping between keys and values.  It is absolutely essential that the language admit that many different modules M be of type A, and it is absolutely essential that a given module M satisfy many distinct types A, without prior arrangement.  The type A is purely descriptive of the extensional behavior of a module of that type, and cannot be regarded as pinning down any particular implementation of that behavior.  Moreover, a given module may well support a variety of views, for example by “forgetting” certain aspects of it in particular contexts.  For example, one may neglect that an implementation of mapping supports deletion in a setting where only extension and application are required.
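The many-to-many relationship is directly expressible in ML; in this sketch (all names ours) a single structure satisfies two signatures without prior arrangement, the second "forgetting" an operation.

```sml
signature STACK = sig
  type 'a stack
  val empty : 'a stack
  val push  : 'a * 'a stack -> 'a stack
  val pop   : 'a stack -> ('a * 'a stack) option
end

(* A restricted view of the same behavior that forgets pop. *)
signature PUSH_ONLY = sig
  type 'a stack
  val empty : 'a stack
  val push  : 'a * 'a stack -> 'a stack
end

structure ListStack = struct
  type 'a stack = 'a list
  val empty = []
  fun push (x, s) = x :: s
  fun pop []       = NONE
    | pop (x :: s) = SOME (x, s)
end

(* One module, two types: no declaration in ListStack mentions either. *)
structure S1 : STACK     = ListStack
structure S2 : PUSH_ONLY = ListStack
```

Conversely, any number of other structures (an array-backed stack, say) may also match STACK; neither signature pins down an implementation.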

This is all pretty basic, but what surprises me is how few languages support it cleanly and simply.  One particularly awkward method is to rely on extra-linguistic “tools” that manipulate build parameters to switch choices of M for a given N, quickly resulting in an ad hoc language of its own just to manage the build scripts.  Another technique is to manipulate or restrict inheritance so that some degree of modularity can be simulated, much as one can bang in nails using a pair of pliers.  A common methodology, in fact, attempts to cut down inheritance to provide what we had in ML in the early 1980’s (functors), obviating the intervening decades of maladaptations of bad ideas.

More disappointingly, for me at least, is that even relatively enlightened languages, such as Haskell or F#, fail to support this basic machinery.  In Haskell you have type classes, which are unaccountably popular (perhaps because it’s the first thing many people learn).  There are two fundamental problems with type classes.  The first is that they insist that a type can implement a type class in exactly one way.  For example, according to the philosophy of type classes, the integers can be ordered in precisely one way (the usual ordering), but obviously there are many orderings (say, by divisibility) of interest.  The second is that they confound two separate issues: specifying how a type implements a type class and specifying when such a specification should be used during type inference.  As a consequence, using type classes is, in Greg Morrisett’s term, like steering the Queen Mary: you have to get this hulking mass pointed in the right direction so that the inference mechanism resolves things the way you want it to.  In F# the designers started with the right thing (Caml) and eliminated the very thing that matters the most about ML, its module system!  Instead the F# designers added a bunch of object-oriented concepts (for the sake of compatibility with .NET and with the mindset of MS developers), and tricked up the language with features that are more readily, and flexibly, provided by the module system.
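In ML the "many orderings" point is immediate: a type may implement a signature in as many ways as you like, each as a distinct structure (the signature is repeated here so the sketch is self-contained; the divisibility example is a rough illustration, assuming positive integers).

```sml
signature ORDERED = sig
  type t
  val compare : t * t -> order
end

(* The usual ordering of the integers. *)
structure IntUsual : ORDERED = struct
  type t = int
  val compare = Int.compare
end

(* A different ordering of the same type: by divisibility.  Divisibility
   is only a partial order, so incomparable pairs fall back to the usual
   ordering; this is a sketch, assuming positive arguments. *)
structure IntDivides : ORDERED = struct
  type t = int
  fun compare (m, n) =
    if m = n then EQUAL
    else if n mod m = 0 then LESS        (* m properly divides n *)
    else if m mod n = 0 then GREATER
    else Int.compare (m, n)
end
```

Code parameterized over ORDERED (a sorted-set functor, say) can be instantiated with either structure explicitly; there is no global inference mechanism to steer.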

Why bring this up?  Apart from general grousing, my point is that we had little choice in what language to use in teaching our students.  Modularity matters most, and we must have a language that supports flexible modularity in the form I am describing here.  When we examined our options, which we did very carefully, the only contenders were Standard ML and O’Caml.  We could have gone with either, but were persuaded to use Standard ML, which has worked beautifully for our purposes.  The decisive factor in choosing between the two MLs was simply that we have a prior code base in Standard ML on which to draw, and there are two implementations of Standard ML that support parallelism (MLton and Poly/ML), albeit neither optimally for our purposes.  Haskell provides better support for parallelism (by undoing its unfortunate commitment to laziness, which results in an unusable cost model for both time and, especially, space), but wasn’t suitable because of its lack of support for modularity.

As I have mentioned, our plan is to re-do the introductory curriculum in Computer Science to modernize it and to place better emphasis on principles rather than current technologies.  One aspect of this is to re-do the standard Data Structures and Algorithms course to eliminate the over-emphasis on ephemeral data structures, and to treat parallel algorithms as the general case that encompasses the increasingly irrelevant case of one processor.  (Yes, this is a technological trend, but it is more importantly a conceptual change that emerges from focusing on language, rather than machine, models of computation.)  What is a data structure?  It’s a signature, or interface, written in the language you’re programming in.  What’s an algorithm?  It’s a structure, or implementation, of that signature.  A signature, such as that for a mapping, can be implemented in various ways, with different cost trade-offs (logarithmic lookup vs. constant lookup, for example).  A given algorithm, such as a balanced tree, can implement many different data structures, such as mappings or sets.  The distinction between a persistent and an ephemeral mapping is a difference of data structure, that is, of signature.  The demands are different, the algorithms are different.  We should be able to support both forms, each as easily and cleanly as the other, to be able to compare them, and to explain, for example, why the ephemeral case is of limited utility.  It is not too much to ask to be able to write these examples as running code, with a minimum of fuss or bother!

We have for decades struggled with using object-oriented languages, such as Java or C++, to explain these simple ideas, and have consistently failed.  And I can tell those of you who are not plugged into academia at the moment that many of my colleagues worldwide are in the same situation, and are desperate to find a way out.  The awkward methodology, the “design patterns”, the “style guidelines”, all get in the way of teaching the principles.  And even setting that aside, you’re still doing imperative programming on ephemeral data structures.  It just does not work, because it is fundamentally the wrong thing.  Just try to teach, say, binary search tree delete; it’s a horrific mess!  You wind up with absurd “null pointer” nonsense, and a complex mess caused by the methodology, not the problem itself.  Pretty soon you have to resort to “frameworks” and “tools” just to give the students a fighting chance to get anything done at all, distancing them from the essential ideas and giving the impression that programming is painful and ugly, an enormous tragedy.

Our solution is to abandon it entirely, pushing OO techniques to a more advanced level for students who wish to learn them, and concentrating on the basic principles of modularity, parallel and sequential cost analysis, and direct verification of running code at the introductory level.  Dijkstra used to say “beauty is our business”, to which I would add that life is too short, and bright minds too precious, to waste on ugly things.  And if you take this point of view seriously, the best and the brightest are drawn to, rather than repulsed by, the field.  Pace some of my colleagues at a Major Institute of Technology, students absolutely do love to program and do love the beauty of code, provided that the code you ask them to write is, in fact, beautiful.  There is little more dreary than the corporate bureaucracy of OOP, and few things more lovely than the mathematical elegance of FP.

[Update: word-smithing, again]


A Dead Dog

April 12, 2011

In an earlier post I wrote about our approach to teaching students to reason inductively when programming recursively.  This generated some discussion, because teaching induction and recursion is notoriously difficult, to the point that many of my colleagues have all but given up hope of ever getting these ideas across to the majority of their students.  I’ve already explained our methods, and suggested some reasons why we seem to have succeeded where others have not.

We’re nearing the end of semester, and I must say that I am so proud of our students, I just have to brag a bit.  First, some background.  As I’ve mentioned, I am co-teaching a new course in functional programming with Dan Licata this semester as part of a thorough revamping of introductory CS that places a strong emphasis on reasoning about the correctness and efficiency of programs in both a sequential and parallel setting.  Dan and I have worked closely with an amazing team of teaching assistants in developing the new curriculum and course materials.  For the sake of continuity, Dan does all the lecturing, and I sit in the audience, sometimes kibitzing a bit, but mostly just paying attention to the signals that students are sending and thinking about what we could do better next time.  Despite it being a moderately large class of about 83 students, we have managed to maintain an interactive atmosphere in which students freely ask, and respond to, questions, sometimes working in pairs for a few minutes on their own during the lecture.  This has worked very well for everyone, creating a conducive environment for learning to write beautiful code.

Today’s lecture was about persistent and ephemeral data structures.  One goal was simply to draw the distinction, and show how it is expressed in the interface of an abstract type; it is good to have some terminology available for comparing functional and imperative programming styles.  Another goal was to point out the importance of persistence for parallelism, since a parallel computation inherently requires that a data structure have “multiple futures”, rather than the “single future” provided by the ephemeral case.  (Put another way, the exclusive consideration of ephemeral representations in the usual data structures course taught worldwide looks more and more misguided as time goes on.  Indeed, our planned new approach to teaching data structures reverses the emphasis.  This is one reason why I consider object-oriented languages to be unsuitable for a modern introductory curriculum, conventional wisdom notwithstanding.)  A third goal was to re-acquaint the students with imperative programming so that they could make direct comparison with what they have been learning all semester, and recognize the limitations of imperative methods when it comes to parallelism and distribution.  Simply put, imperative methods are useful, if at all, in the “corner case” of sequential execution of programs acting on ephemeral data structures; in all other cases what you want are functional (transformational) methods.

But enough about context.  What made me feel so proud today was the sophistication displayed by our students during today’s interactive program development.  The exercise was to implement binary search tree delete in the persistent (transformational) style, and in the ephemeral (imperative) style in two different ways.  Dan set up the problem, and asked the students to help him write the code.  What impressed me so deeply was that as I listened to the murmur in the classroom, and to the proposals offered aloud, I could hear students saying things like

Well, inductively, we can assume that the problem has been solved for the left subtree, so to continue we need only ….

Outstanding!  (And remember, these are freshmen!)  Just as I had hoped, for these students thinking inductively comes naturally as the obvious way to think about a piece of recursive code!  They know instinctively how to formulate an inductive argument for the correctness of a program, rather than resort to error-prone hand-waving about “going around and around the loop” that is all too typical at this level.  What makes this possible is that functional programs are inherently mathematical, allowing the students to concentrate on the ideas, rather than on the intricacies of coding in less expressive imperative languages.  I think that this gives them the confidence to tackle some quite difficult problems, such as the Barnes-Hut n-body algorithm or the Jamboree game-search algorithm, for a one-week homework assignment.  Material that used to be complicated, or deferred to more advanced courses, can be done readily with beginners if you have the right framework in which to express the ideas.
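For concreteness, here is a sketch of the persistent (transformational) style of binary search tree deletion over integer keys, in the spirit of the classroom exercise; the names and the successor-based strategy are ours, not necessarily what was developed in lecture.

```sml
datatype tree = Empty | Node of tree * int * tree

(* Remove and return the minimum key of a nonempty tree. *)
fun removeMin (Node (Empty, k, r)) = (k, r)
  | removeMin (Node (l, k, r)) =
      let val (m, l') = removeMin l in (m, Node (l', k, r)) end
  | removeMin Empty = raise Fail "removeMin of empty tree"

(* delete returns a NEW tree; the argument persists unchanged.
   Inductively, assume the problem solved for the subtrees. *)
fun delete (x, Empty) = Empty
  | delete (x, Node (l, k, r)) =
      (case Int.compare (x, k) of
         LESS    => Node (delete (x, l), k, r)
       | GREATER => Node (l, k, delete (x, r))
       | EQUAL   =>
           (case (l, r) of
              (Empty, _) => r
            | (_, Empty) => l
            | _ => let val (m, r') = removeMin r  (* successor replaces k *)
                   in Node (l, m, r') end))
```

No null pointers, no mutation, and the inductive correctness argument reads straight off the clauses.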

As I walked back to my office, a group of students on the elevator with me groaned about the pain of having to do imperative programming again (most had learned the old ways last semester), but cheered me up by saying that doing so made a good case for why functional methods are better.  What more could one ask?



Persistence of Memory

April 9, 2011

Salvador Dali’s masterpiece, The Persistence of Memory, is one of the most recognizable works of art in the world.  The critic Dawn Ades described it as “a Surrealist meditation on the collapse of our notions of a fixed cosmic order” induced by Einstein’s penetrating analysis of the concepts of time and space in the physical world.  Just as Dali’s Persistence of Memory demarcated the transition from age-old conceptions of time and space in physics, so does the computational concept of persistence of memory mark the transition from sequential time and mutable space to parallel time and immutable space.

A short while ago I described the distinction between parallelism and concurrency in terms of a cost model that assigns a parallel, as well as sequential, time complexity to a program.  Parallelism is all about efficiency, not semantics; the meaning of a program is not affected by whether it is executed on one processor or many.  Functional languages expose parallelism by limiting sequential dependencies among the parts of a computation; imperative languages introduce inessential dependencies that impede parallelism.

Another critical ingredient for parallelism is the concept of a persistent data structure, one for which its associated operations transform, rather than alter, it.  A persistent dictionary, for example, has the characteristic that inserting an element results in a new dictionary with an expanded domain; the old dictionary persists, and is still available for further transformation.  When calculating the sum of 2 and 2, resulting in 4, no one imagines that the 2’s are “used up” in the process!  Nor does one worry whether the sum of 1 and 3 is the “same” 4 or a “different” 4!  The very question is absurd (or, more precisely, trivial).  So why do we worry about whether the result of inserting 2 into dict “uses up” the old dict?  And why do we worry about whether inserting 2 into the empty dictionary twice results in the “same” dictionary or a “different” one?
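A toy persistent dictionary, to make the point concrete (an association list; the names are ours, for illustration only): `insert` transforms, it does not alter.

```sml
type 'a dict = (string * 'a) list

fun insert (k, v) (d : 'a dict) : 'a dict = (k, v) :: d

fun lookup k ([] : 'a dict)       = NONE
  | lookup k ((k', v) :: d)       = if k = k' then SOME v else lookup k d

val d1 = insert ("two", 2) []
val d2 = insert ("three", 3) d1

(* d1 persists, untouched by the second insert: it remains one of the
   "multiple futures" of the empty dictionary. *)
val a = lookup "three" d1   (* NONE *)
val b = lookup "three" d2   (* SOME 3 *)
```

Asking whether `d2` "used up" `d1` is as absurd as asking whether adding 2 and 2 used up the 2’s.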

Yet both academic and practical computing have all but confined themselves to ephemeral data structures, which exhibit precisely such behavior.  Operations on an ephemeral data structure “use up” the data structure, making it unavailable for further use without going to some trouble to get it back.  The pathologies resulting from this abound.  Standard compiler texts, for example, devote a chapter to concepts like “block structured symbol tables”, which are, in fact, nothing more than persistent dictionaries done the hard way.  More generally, whenever a data structure has multiple futures, such as when backtracking or exploiting parallelism, ephemeral data structures get in the way.  Indeed, the bulk of object-oriented programming, with its absurd over-emphasis on the “message passing” metaphor, stresses the alteration of objects as the central organizing principle, confounding parallelism and complicating simple algorithms.

A prime virtue of functional languages is that persistence is the default case, but they can as readily support ephemeral data structures as any imperative (including object-oriented) language.  All functional languages include types of mutable cells and mutable arrays, and provide support for conventional, sequential, imperative programming with semicolons and even curly braces!  (Some do this better than others; Haskell is, in my view, the world’s best imperative programming language, and second-best functional language, but that’s a subject for another post.)  But why would you want to? Why deprive yourself of the benefits of persistence, and insist instead on an ephemeral data structure?

This question came up recently in our planning for the Functional Programming class that we are teaching this semester for freshmen at Carnegie Mellon.  All semester we have been using functional programming techniques to build clean, verifiable, modular, parallel programs.  The students routinely prove theorems about their code, structure programs using abstract types, and exploit parallelism to improve the asymptotic performance of their programs.  Recent homework assignments include the implementation of the parallel Barnes-Hut algorithm for the n-body problem in physics, and the parallel Jamboree algorithm for game-tree search in perfect information games.  Persistent data structures are the key to making this possible; just try to code Barnes-Hut in an imperative language, and you will find yourself in a morass worrying about concurrency when you should instead be envisioning a recursive tree decomposition of space, and the computation of forces using formulas from high school physics.

We tried hard to find good motivations for using an ephemeral data structure when you can just as easily (actually, much more easily) use a persistent one.  As we went through them, we realized that all of the standard arguments are questionable or false.  The usual one is some vague notion of “efficiency” in either time or space.  While I concede that one can, in principle, solve a particular, confined problem more efficiently by doing absolutely everything by hand (memory management, scheduling, arithmetic), in the overwhelming majority of cases the demands of evolution of code far outweigh the potential advantages of doing everything by hand.  Modularity matters most when it comes to building and evolving large systems; functional languages, with persistent data structures, support modularity the best.  (I’ll have more to say about this in a future post.)

Most of the arguments about efficiency, though, ignore questions of functionality.  It is senseless to compare the “efficiency” of one data structure that provides different functionality than another.  A persistent data structure does more for you than does an ephemeral one.  It allows you to have multiple futures, including those that evolve in parallel with one another.  It makes no sense to insist that some ephemeral approximation of such a data structure is “more efficient” if it does not provide those capabilities!  And making it do so is, invariably, a bitch!  Conventional ephemeral data structures are not readily parallelizable; it’s often a publishable result to get a decent degree of parallelism using imperative methods.  By contrast, even freshmen (admittedly, Carnegie Mellon freshmen) can implement a parallel game tree search or a tree-based solution to the n-body problem in a one-week homework assignment.

So what’s the deal?  Why should we care about ephemeral data structures at all?  I have no idea.  Mostly, it’s a legacy cost imposed on us by the overzealous emphasis on imperative methods and machine-based, rather than language-based, models of computation.  But this will change.  Starting this fall, introductory data structures and algorithms will be liberated from the limitations of imperative, object-oriented programming, and will instead stress persistent (as well as ephemeral) data structures, and parallel (including as a special case sequential) algorithms.  The future of computing depends on parallelism (for efficiency), distribution (for scale), and verification (for quality).  Only functional languages support all three naturally and conveniently; the old ones just don’t cut it.

Update: Here’s a chart summarizing the situation as I see it:

\displaystyle  \begin{array}{|c|c|c|}  \hline  & \text{Ephemeral} & \text{Persistent} \\  \hline  \text{Sequential} & \textit{imperative} & \textit{benign effects} \\  \hline  \text{Parallel} & \textit{hard to get right} & \textit{functional} \\  \hline  \end{array}

Conventional imperative programming works well for the ephemeral, sequential case; it is notoriously hard to use imperative methods for parallelism.  Benign effects, as exemplified by self-adjusting data structures, can be used to give rise to persistent behavior in the sequential setting, but the use of effects impedes parallelism.


Functions Are Values

April 2, 2011

After the midterm the course staff for the Functional Programming class had a T-shirt made up saying “Functions Are Values”.  What prompted this is a curious blind spot shared by nearly every student in the class about the nature of functions in ML (or, for that matter, in math).  The students performed admirably on the midterm examination, capably writing higher-order functions, proving correctness of programs by structural induction, and even using higher-order functions to express staged computation.  And yet nearly every student got the following simple “gimme” question wrong:

What is the type and value of the expression “fn x => raise Fail”?

I sensed trouble when, during the examination, a student asked me to clarify the question for him.  Needless to say, I was at a loss for words!  Sure enough, all but one student got this simple question wrong, sometimes spectacularly.

Many said “it has no value” and offered a guess as to its type.  Others said “it raises an exception” and therefore has type “exn”.  (For those who don’t know ML, the type exn is the type of values associated with an exception.)  Still others said “the compiler will notice that it always raises an exception, and will therefore reject it”, or words to that effect.  Where the hell did they get that bizarre notion?  Whatever the deficiencies of our teaching may be, we certainly never said anything close to that!  Others got the type right, but still could not explain what its value is.
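For the record, here is the intended answer, written out in ML (I use the Basis library’s Fail, which carries a string, so that the expression typechecks as written here):

```sml
val f = fn x => raise (Fail "oops")
(* f : 'a -> 'b.  The fn-expression is itself a value: the raise
   in its body is not evaluated until the function is applied.
   Evaluating this declaration raises nothing; an application
   such as f () would raise the exception Fail "oops". *)
```

The type is fully polymorphic because the argument is unconstrained and a raise may be ascribed any result type; and the value is the function itself.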

Given that they are clearly capable of writing higher-order functional programs, how could this happen?  Obviously the fault is not with our students, but with ourselves.  But what did we do wrong?  How did we manage to stump so many students with what we thought was a free 3-pointer?  To be honest, I don’t know.  But I have a few theories that I thought I would air here in hopes of helping others avoid my mistakes, or nip misunderstandings in the bud.

Throughout the course we have been very careful to develop a rigorous language-based model of computation using structural operational semantics.  We assign meaning to the program you write, not to its translation into some mysterious underlying machine that doesn’t really exist anyway (in other words, we avoided telling the conventional lies).  Nowhere did we ever explain anything by reference to “the compiler”, except at one point, which I now realize was a cardinal mistake.  Don’t make it yourself.  Here’s how it happened.  We meant well, but the flesh is weak, and sometimes we stray.  Forgive me Father, for I have sinned…..

Here’s where we went wrong, and I think invited wild speculation that led to the mistakes on the examination.  One cool use of higher-order functions is to stage computation.  A good example is provided by a continuation-based regular expression matcher, which I will not explain in any detail here.  The crucial point is that it has the type

regexp -> string -> (string -> bool) -> bool

which says that it accepts a regular expression, a string, and a continuation, matching an initial segment of the string, if possible, and passing the rest to the continuation, or returning false if not.  As this description suggests we can think of the matcher as a curried form of a three-argument function, and that’s that.  But a more interesting formulation stages the computation so that, given the regexp, it computes a matcher for that regular expression that may be applied repeatedly to many different candidate strings without reprocessing the regular expression.  (The code for this is quite beautiful; it’s available on our course web site if you’re interested.)
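For those who want to see the shape of such a matcher, here is a sketch (my own reconstruction over a small regexp datatype, not the course code); note the Times case, where the continuation does all the work:

```sml
datatype regexp =
    Zero                       (* matches nothing *)
  | One                        (* matches the empty string *)
  | Char of char
  | Plus of regexp * regexp    (* alternation *)
  | Times of regexp * regexp   (* concatenation *)
  | Star of regexp

(* match r cs k: match an initial segment of cs against r,
   passing the remaining characters to the continuation k. *)
fun match Zero _ _ = false
  | match One cs k = k cs
  | match (Char c) cs k =
      (case cs of [] => false | c' :: cs' => c = c' andalso k cs')
  | match (Plus (r1, r2)) cs k =
      match r1 cs k orelse match r2 cs k
  | match (Times (r1, r2)) cs k =
      match r1 cs (fn cs' => match r2 cs' k)
  | match (Star r) cs k =
      k cs orelse match r cs (fn cs' => match (Star r) cs' k)

fun accepts r s = match r (String.explode s) (fn cs => null cs)
```

This is the unstaged, curried form; the staged variant computes, by recursion on the regexp alone, a matcher of type string -> (string -> bool) -> bool before any candidate string is supplied.  (As written the Star case harbors the well-known subtle bug: if r can match the empty string, the recursion can loop.)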

All this can be explained quite neatly using operational semantics, showing how the recursion over the regexp is completed, yielding a matching function as result.  It’s all very cool, except for one little thing: the result contains numerous beta redices that can be simplified to obtain a clearer and simpler representation of the matching code.  Since the equations involved are all so self-evident, we stated (and here’s our mistake) that “the compiler can simplify these away to obtain the following code”, which was, of course, much clearer.  What we said was perfectly true, but unfortunately it opened the floodgates of incomprehension.  The trouble is, the students have no idea what a compiler is or how it works, so for them it’s a “magic wand” that somehow manages to get their ML code executed on an x86.  And we invited them to speculate on what this mystical compiler does, and thereby (I conjecture) invited them to think that (a) what the compiler does is paramount for understanding anything, and (b) the compiler does whatever they wish to imagine it does.  Somehow the students read our simple three-pointer as a “trick question” that was supposed to involve some mumbo-jumbo about the compiler, and that’s what we got, alas.

So, my first bit of advice is, don’t mention the compiler as an explanation for anything!  They’ll have plenty of chances later in their careers to learn about compilers, and to understand how semantics informs compilation, what is valid equational reasoning, and so forth.  But for freshmen, stick with the model.  Otherwise you’re courting disaster.  (I’ve already mentioned other situations where mentioning “the compiler” only muddies the waters and confuses students, the teaching of recursion being a prime example.  I intend to discuss others in future posts.)

Now that I’ve taken the blame for my mistakes, I feel less guilty about shifting some of it elsewhere.  I have two thoughts about why students resist the idea that functions are values no different in character from any others.  One source, no doubt, is that many of them have “learned” programming on the street, and have developed all sorts of misconceptions that are being applied here.  Most popular languages make it hard to handle functions as values: you have to name them, you have to go to special trouble to pass them as arguments or return them as results, they are called “methods” and have a strange semantics involving state, and on and on.  From this students acquire the sense that functions are “special”, and cannot be thought of like other things.

Another source of misunderstanding is that elementary mathematics (up through freshman year, at least) stresses the pointful style, always speaking of f(x), rather than f itself, and carrying this forward through calculus, writing d f(x) / dx for the derivative, confusing the function with its values, and so forth.  Here again the message is clear: functions are “special” and are not to be treated on the same footing as numbers.  It’s a pity, because I think that the point-free style is clearer and more natural, if non-standard.  The differential is an operator acting on functions that produces a linear approximation to the function at each point in its domain (if such an approximation exists).  Be that as it may, the consequence of the pointful style is that students develop early on a “fear of functions” that is hard to break.


The Holy Trinity

March 27, 2011

The Christian doctrine of trinitarianism states that there is one God that is manifest in three persons, the Father, the Son, and the Holy Spirit, who together form the Holy Trinity.   The doctrine of computational trinitarianism holds that computation manifests itself in three forms: proofs of propositions, programs of a type, and mappings between structures.  These three aspects give rise to three sects of worship: Logic, which gives primacy to proofs and propositions; Languages, which gives primacy to programs and types; Categories, which gives primacy to mappings and structures.  The central dogma of computational trinitarianism holds that Logic, Languages, and Categories are but three manifestations of one divine notion of computation.  There is no preferred route to enlightenment: each aspect provides insights that comprise the experience of computation in our lives.

Computational trinitarianism entails that any concept arising in one aspect should have meaning from the perspective of the other two.  If you arrive at an insight that has importance for logic, languages, and categories, then you may feel sure that you have elucidated an essential concept of computation—you have made an enduring scientific discovery.  Advances in our understanding of computation may arise from insights gained in many ways (any data is useful and relevant), but their essential truth does not depend on their popularity.

Logic tells us what propositions exist (what sorts of thoughts we wish to express) and what constitutes a proof (how we can communicate our thoughts to others).  Languages (in the sense of programming) tell us what types exist (what computational phenomena we wish to express) and what constitutes a program (how we can give rise to that phenomenon).  Categories tell us what structures exist (what mathematical models we have to work with) and what constitutes a mapping between them (how they relate to one another).  In this sense all three have ontological force; they codify what is, not how to describe what is already given to us.  In this sense they are foundational; if we suppose that they are merely descriptive, we would be left with the question of where these previously given concepts arise, leading us back again to foundations.  It is the foundations that I wish to describe here, because I believe it will help to clarify some common misunderstandings about the notions of proposition, type, and structure.  Of particular interest here is that a “type system” is not, under this conception, an arbitrary collection of conditions imposed on a previously given notion of program (whether written with horizontal lines, or not).  It is, rather, a way to say what the programs are in the first place, and what they mean as proofs and as mappings.

Here I will outline the basic correspondences between logic, languages, and categories by examining their structural properties (and, for now, nothing more).

The fundamental notion in logic is that of entailment, written P_1,\dots,P_n\vdash P, expressing derivability of P from P_1,\dots, P_n.  This means that P is derivable from the rules of logic, given the P_i as axioms.  In contrast to admissibility (which I will not discuss further here) this form of entailment does not express implication!  In particular, an entailment is never vacuously true.  Entailment enjoys at least two crucial structural properties, making it a pre-order:

\displaystyle{\strut\over{P\vdash P}}

\displaystyle{{P\vdash Q\quad Q\vdash R}\over{P\vdash R}}.

In addition we often have the following additional structural properties:

\displaystyle{{P_1,\dots,P_n\vdash Q}\over{P_1,\dots,P_n,P_{n+1}\vdash Q}}

\displaystyle{{P_1,\dots,P_i,P_{i+1},\dots,P_n\vdash Q}\over{P_1,\dots,P_{i+1},P_{i},\dots,P_n\vdash Q}}

\displaystyle{{P_1,\dots,P_i,P_i,\dots,P_n\vdash Q}\over{P_1,\dots,P_i,\dots,P_n\vdash Q}}.

These state that “extra” axioms do not affect deduction; the “order” of axioms does not matter; “duplication” of axioms does not matter.  (These may seem inevitable, but in substructural logics any or all of these may be denied.)

In languages we have the fundamental concept of a typing judgement, written x_1{:}A_1,\dots,x_n{:} A_n\vdash M{:}A, stating that M is an expression of type A involving variables x_i of type A_i.  A typing judgement must satisfy the following basic structural properties:

\displaystyle{\strut\over{x:A\vdash x:A}}

\displaystyle{{y:B\vdash N:C \quad x:A\vdash M:B}\over{x:A\vdash [M/y]N:C}}

We may think of the variables as names for “libraries”, in which case the first states that we may use any library we wish, and the second states closure under “linking” (as in the Unix tool ld or its relatives), with [M/y]N being the result of linking y in N to the library M.  Typically we expect analogues of the “extra”, “reordering”, and “duplication” axioms to hold as well, though this ain’t necessarily so.  I will leave their formulation as an exercise for the reader.

In categories we have the fundamental concept of a mapping f:X\longrightarrow Y between structures X and Y.  The most elementary structures, perhaps, are sets, and mappings are functions, but it is more common to consider, say, that X and Y are topological spaces, and f is a continuous function between them.  Mappings satisfy analogous structural properties:

\displaystyle{\strut\over{\textit{id}_X : X \longrightarrow X}}

\displaystyle{{f:X\longrightarrow Y \quad g : Y\longrightarrow Z}\over{g\circ f:X\longrightarrow Z}}

These express, respectively, the existence of the identity map, and the closure of maps under composition.  They correspond to reflexivity and transitivity of entailment, and to the library and linking rules of languages.  As with types, one may expect additional closure conditions corresponding to the “extra”, “reordering”, and “duplication” axioms by giving suitable meaning to multiple assumptions.  I will not go into this here, but numerous standard sources treat these conditions in detail.

What I find captivating about computational trinitarianism is that it is beautiful!  Imagine a world in which logic, programming, and mathematics are unified, in which every proof corresponds to a program, every program to a mapping, every mapping to a proof!  Imagine a world in which the code is the math, in which there is no separation between the reasoning and the execution, no difference between the language of mathematics and the language of computing.  Trinitarianism is the central organizing principle of a theory of computation that integrates, unifies, and enriches the language of logic, programming, and mathematics.  It provides a framework for discovery, as well as analysis, of computational phenomena.  An innovation in one aspect must have implications for the other; a good idea is a good idea, in whatever form it may arise.  If an idea does not make good sense logically, categorially, and typically (sorry for the neologism), then it cannot be a manifestation of the divine.


The Dog That Didn’t Bark

March 21, 2011

The crucial clue that allowed Holmes to solve the Silver Blaze mystery was the dog that didn’t bark: the apparent murderer and horse thief was not a stranger, but the trainer himself, the only one who could have gotten into the stables unnoticed.  As I’ve mentioned, this semester Dan Licata and I are co-teaching a new course in functional programming for first-year students at Carnegie Mellon.  Apart from parallelism, which I’ve already discussed, we are placing a strong emphasis on verification and proof as tools for the practicing programmer, chief among these being the infamous hellhounds of Computer Science, induction and recursion.  And I am happy to say that, as in the Holmes story, these are the dogs that didn’t bark.  The results from homework and an in-class exam make clear that the vast majority of the class of about 80 students have mastered these crucial techniques with a minimum of fuss and bother.  In this post I’d like to explain why I think we’ve succeeded where, apparently, so many others have failed.

One of the pleasures of teaching at Carnegie Mellon is that the students are bright (among the brightest in the country according to the usual metrics), open-minded (they’re game for whatever I wish to throw at them), and hard-working (Carnegie’s motto is “My heart is in the work.”).  I am sure that this is a big part of the story.  Most probably our students could master almost anything, no matter how poorly we explain it!  But it can’t be just that either.  My colleagues at other top schools, both research universities and teaching colleges, public and private, American and international, have often spoken of their frustration and despair at teaching programming using recursion, let alone proof by induction, to beginning (or even advanced) students.  Some even go so far as to say that it can’t be done, that while the brightest students “get it”, the less bright students never will, so we should just give up even trying.  I’m not convinced.  I can’t prove that I’m right, but I do have quite a few years’ experience at pulling this off, so it is at least conceivable that the problem lies not with the students, but with the teaching methods.  So let me offer mine as an example, and perhaps others will be able to use them as well.

Let’s start with the punch line, just to pique your interest (or ire): the answer is type theory.  The great idea of constructive type theory is that there is no distinction between programs and proofs: the code embodies the reasoning that justifies it, and the reasoning is just a form of code.  This is the propositions as types principle (sometimes grandly called the “Curry-Howard isomorphism”, the small problem being that neither Curry nor Howard invented it, nor is it an isomorphism).  Theorems are types (specifications), and proofs are programs.  Programs/proofs are executed by evaluating them by applying simplification rules to reduce them to canonical form.  It’s just generalized high school algebra, in which you plug in values for variables and simplify using equational reasoning, but scaled up to a rich collection of types and programs acting on them.  In other words type theory is a language-based model of computation, which supports reasoning directly about the code you write, not its translation or interpretation in some other terms.

The significance of a language-based model for teaching cannot be overestimated.  One of the classic pitfalls of teaching recursion is to try to explain it by referring to some mysterious entity called a “stack” that is absolutely invisible in the program you write.  There’s a long song and dance about “frames” and “procedure calls” and “pushing” and “popping” things on this “stack” of which the professor speaks, and no one understands a damn thing.  It all sounds very complicated, and impressive in a way, but if there’s one thing people learn, it’s to stay the hell away from it!  And in doing so, students are robbed of one of the most powerful and useful programming tools available to them, the key to writing clean, reliable code (and, as I’ve previously argued, for parallelism as well).  The alternative is to use Plotkin’s structural operational semantics to define the behavior of all language features, including recursion.  To pull it off, you have to have a semantics for the language you’re using (something that’s in short supply for commercial languages), and you have to be facile in using it yourself.

What makes structural operational semantics so useful is that the execution model is directly defined by an inductive definition of a transition relation, M\mapsto M', on programs.  (I don’t have the space to explain it fully here, but you may consult Practical Foundations for Programming Languages for a full account.)  The crucial idea is to use inference rules to give a structural definition of evaluation for programs themselves, not a translation into some underlying abstract machine code.  For example, one rule for evaluation of addition expressions is

\displaystyle{M\mapsto M'\over M+N\mapsto M'+N},

which states that one step of evaluation of the sum M+N consists of a step of evaluation of M to obtain M', with the result being M'+N.  Another rule states that

\displaystyle{N\mapsto N'\over m+N\mapsto m+N'},

where m is a number, and a third states that \displaystyle{m+n\mapsto p}, where p is the sum of m and n.  With these (and similar) rules in place, one may give an execution trace in the form

M_1\mapsto M_2 \mapsto \cdots \mapsto M_k,

which makes explicit the steps of evaluation.  Each step is justified by a “proof tree” consisting of a composition of rules.  Importantly, the size of the expressions M_i correlates directly with the space required by an implementation, allowing for a clean and clear explanation of space issues such as tail recursion.
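These rules transcribe directly into code.  Here is a minimal ML rendering (the datatype and function names are mine, chosen for illustration) of the step relation for addition expressions, together with a function computing the complete trace:

```sml
datatype exp = Num of int | Sum of exp * exp

exception Value  (* raised if asked to step a fully evaluated expression *)

(* One step of the transition relation M |-> M'.  The clause order
   mirrors the rules: add two numbers, else step the right summand
   when the left is a number, else step the left summand. *)
fun step (Sum (Num m, Num n)) = Num (m + n)
  | step (Sum (Num m, e2))    = Sum (Num m, step e2)
  | step (Sum (e1, e2))       = Sum (step e1, e2)
  | step (Num _)              = raise Value

(* The trace M1 |-> M2 |-> ... |-> Mk, ending in a number. *)
fun trace (e as Num _) = [e]
  | trace e            = e :: trace (step e)

val steps = trace (Sum (Sum (Num 1, Num 2), Num 3))
(* [Sum (Sum (Num 1, Num 2), Num 3), Sum (Num 3, Num 3), Num 6] *)
```

Each element of the list is justified by a proof tree composed of the rules above, and the sizes of the intermediate expressions make space behavior directly visible.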

This is a good start for explaining the semantics of a recursive function, but the real beauty of type theory emerges when teaching how to reason about a recursively defined function.  The lesson of type theory is that induction and recursion are the same thing: an inductive proof is a recursive function, and a recursive function is an inductive proof.  This means that there is really only one concept here, and not two, and that the two go hand-in-hand.  It’s not a matter of “eat your spinach, it’s good for you”, but rather of a conceptual unification of proving and programming.  For our students, thinking inductively and programming recursively go together naturally, the dynamic duo of functional programming.

The starting point is, once again, an inductive definition.  For specificity and ease of exposition, let me give an inductive definition of the natural numbers:

\displaystyle{\strut\over\texttt{Zero}\,\textsf{nat}}

\displaystyle{m\,\textsf{nat}\over \texttt{Succ}(m)\,\textsf{nat}}

The definition uses the exact same rule format that we used to define the semantics.  Inductive definitions given by a set of rules become second nature.  Moreover, they transcribe almost literally into ML:

datatype nat = Zero | Succ of nat

So here we have the primordial example of an inductively defined type.  Many data structures, such as various forms of trees, fit naturally into the same format, a fact that we use repeatedly throughout the semester.

Defining a function on an inductive type consists of giving a series of clauses, one for each rule defining the elements of that type:

fun plus x y = case x of Zero => y | Succ x' => Succ (plus x' y)

Whenever we have an inductive type, we use the same format each time for functions defined over it.  Crucially, we allow ourselves “recursive calls” corresponding exactly to the premises of the rule governing a particular clause.  There is no need for any fuss about “pushing” things, nor any discussion of “sizes” of things, just a pattern of programming that is guided directly by the definition of the type on which we are working.

And now comes the best part.  If we want to prove something about a recursively defined function, we write a proof that has exactly the same form, one case for each rule in the inductive definition.  And, just as we get to make a “recursive call” for each premise of each rule, so we also get to make a “recursive call” to the proof for each premise, that is, an appeal to the inductive hypothesis.  The proof is, of course, a program that constructs a proof for every element of an inductive type by composing the inductive steps around a base case.
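As a small worked example (my own, in the notation above), here is the proof that plus x Zero evaluates to x for every natural number, with exactly one case per rule for nat, and the appeal to the inductive hypothesis playing the role of the recursive call:

```latex
\textbf{Theorem.}\ \text{For every } x\,\textsf{nat},\ \texttt{plus}\ x\ \texttt{Zero} \cong x.

\textbf{Proof.}\ \text{By structural induction on } x.

\text{Case } \texttt{Zero}:\quad
  \texttt{plus}\ \texttt{Zero}\ \texttt{Zero} \mapsto \texttt{Zero},
  \text{ by the first clause of } \texttt{plus}.

\text{Case } \texttt{Succ}(x'):\quad
  \texttt{plus}\ (\texttt{Succ}\ x')\ \texttt{Zero}
    \mapsto \texttt{Succ}(\texttt{plus}\ x'\ \texttt{Zero})
    \cong \texttt{Succ}(x'),
  \text{ the last step by the inductive hypothesis applied to } x'.
  \qquad\square
```

The proof has precisely the shape of the program: one clause per rule, one use of the inductive hypothesis per recursive call.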

Importantly, the students “get it”.  On the midterm we ask them to prove (in an in-class setting) a small theorem about a piece of code, or to write a piece of code and prove a small theorem about it, and they are able to do this quite reliably.  In other words, it works!

Perhaps the most important thing, though, is to convey the sense of inevitability and naturality about induction and recursion.  Don’t talk about proof as a separate topic!  Don’t operationalize recursion beyond the bounds of the language model.  Simply do the natural thing, and it will all seem natural and obvious … and fun!

Moreover, it scales.  Our high-water mark prior to the midterm is to develop and debug a continuation-based regular expression matcher that has a subtle bug in it.  We state carefully the specification of the matcher, and use an inductive proof attempt to reveal the error.  The proof does not go through!  And the failure makes clear that there is a bug that can be fixed in a number of ways, repairing the proof along with the code.  I will detail this another time, but the impatient reader may wish to consult my JFP paper entitled “Proof-directed debugging” for an earlier version of what we taught this semester.

Let me close with two remarks.

Matthias Felleisen at Northeastern has been advocating essentially this methodology for decades; his work has strongly influenced mine.  His results are nothing short of spectacular, and with a much broader range of students than I ordinarily face here at Carnegie Mellon.  So, no excuses: read his How to Design Programs, and follow it!

Finally, this article explains why Haskell is not suitable for my purposes: the principle of induction is never valid for a Haskell program!  The problem is that Haskell is a lazy language, rather than a language with laziness.  It forces on you the existence of “undefined” values of every type, invalidating the bedrock principle of proof by induction.  Put in other terms suggested to me by John Launchbury, Haskell has a paucity of types, lacking even the bog standard type of natural numbers.  Imagine!

(And please note the analogy with dynamic languages, as opposed to languages with dynamic types. It’s the same point, albeit in another guise, namely the paucity of types.)


Dynamic Languages are Static Languages

March 19, 2011

While reviewing some of the comments on my post about parallelism and concurrency, I noticed that the great fallacy about dynamic and static languages continues to hold people in its thrall.  So, in the same “everything you know is wrong” spirit, let me try to set this straight: a dynamic language is a straitjacketed static language that affords less rather than more expressiveness.   If you’re one of the lucky ones who already understands this, congratulations, you probably went to Carnegie Mellon!  For those who don’t, or think that I’m wrong, well let’s have at it.  I’m not going to be very technical in this post; the full technical details are available in my book, Practical Foundations for Programming Languages, which is available online.

So-called dynamic languages (“so-called” because I’m going to argue that they don’t exist as a separate class of languages) are perennially popular.  From what I can tell, it’s largely a matter of marketing.  Hey, we all want to be dynamic, don’t we?  A dynamic personality is one that makes things happen, takes risks, has fun, breaks the rules!  And who wants to be static?  It’s synonymous with boring, after all, and no one wants to be a bore (even if they are, or especially if they are).  So, hey, dynamic languages are cool, right?  Or, at any rate, cool people like dynamic languages.  Static languages are for drips and losers.

That’s one argument, and, to be honest, it’s quite a successful bit of marketing.  Like most marketing tactics, it’s designed to confuse, rather than enlighten, and to create an impression, rather than inform.

Another part of the appeal of dynamic languages appears to be that they have acquired an aura of subversion.  Dynamic languages fight against the tyranny of static languages, hooray for us!  We’re the heroic vanguard defending freedom from the tyranny of typing!  We’re real programmers, we don’t need no stinking type system!

Now I’m as susceptible to the seductions of subversion as the next guy (witness this very post), so I can appreciate wanting to strike one for the little guy, overturning the evil establishment.  But I’ve got news for you: when it comes to “dynamic typing”, it’s a futile enterprise.  Not because the establishment is so powerful.  Not because evil always wins.  But because the distinction makes no sense.  Dynamic typing is but a special case of static typing, one that limits, rather than liberates, one that shuts down opportunities, rather than opening up new vistas.  Need I say it?  Something can hardly be opposed to that of which it is but a trivial special case.  So, give it up, and get with the program!  There are ill-defined languages, and there are well-defined languages.  Well-defined languages are statically typed, and languages with rich static type systems subsume dynamic languages as a corner case of narrow, but significant, interest.

Wittgenstein said that all philosophical problems are problems of grammar.  And so it is here as well.  The root of the problem lies in the confusion between a type and a class.  We all recognize that it is often very useful to have multiple classes of values of the same type.  The prototypical example is provided by the complex numbers.  There is one type of complex numbers that represent points in two-dimensional space.  In school we encounter two classes of complex numbers, the rectangular and the polar.  That is, there are two ways to present a complex number using two different systems of coordinates.  They are, of course, interchangeable, because they represent values of the same type.  But the form of presentation differs, and some forms are more convenient than others (rectangular for addition, polar for multiplication, for example).  We could, of course, choose a “coin of the realm”, and represent all complex numbers in one way, and consider coordinates as just a matter of how a number is given.  But it can also be convenient to consider that the type of complex numbers consists of two classes, each of which consists of two real numbers, but interpreted differently according to the coordinate system we are using.

Crucially, the distinction between the two classes of complex number is dynamic in that a given computation may result in a number of either class, according to convenience or convention.  A program may test whether a complex number is in polar or rectangular form, and we can form data structures such as sets of complex numbers in which individual elements can be of either form.  But this does not conflict with the basic fact that there is but one static type of complex numbers!  In type theoretic terms what I am saying is that the type complex is defined to be the sum of two copies of the product of the reals with itself.  One copy represents the class of rectangular representations, the other represents the class of polar representations.  Being a sum type, we can “dispatch” (that is, case analyze) on the class of the value of the type complex, and decide what to do at run-time.  This is no big deal.  In fact, it’s quite simple and natural in languages such as ML or Haskell that support sum types.  (Languages that don’t are hobbled, and this is part of the confusion.)
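In ML the type just described is a few lines of code (the constructor and function names here are mine):

```sml
(* One type of complex numbers, with two classes of values. *)
datatype complex =
    Rect of real * real    (* real and imaginary parts *)
  | Polar of real * real   (* magnitude and angle *)

(* Dispatch on the class at run-time: addition is convenient in
   rectangular coordinates, so convert as needed. *)
fun toRect (Rect (x, y)) = (x, y)
  | toRect (Polar (m, a)) = (m * Math.cos a, m * Math.sin a)

fun add (c1, c2) =
  let val (x1, y1) = toRect c1
      val (x2, y2) = toRect c2
  in Rect (x1 + x2, y1 + y2) end
```

The dispatch on class is entirely dynamic, yet every expression involved has the one static type complex.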

What does this have to do with the concept of a dynamic language?  The characteristics of a dynamic language are the same: values are classified into a variety of forms that can be distinguished at run-time.  A collection of values can be of a variety of classes, and we can sort out at run-time which is which and how to handle each form of value.  Since every value in a dynamic language is classified in this manner, what we are doing is agglomerating all of the values of the language into a single, gigantic (perhaps even extensible) type.  To borrow an apt description from Dana Scott, the so-called untyped (that is “dynamically typed”) languages are, in fact, unityped.  Rather than have a variety of types from which to choose, there is but one!
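In type-theoretic terms, the “one true type” of a dynamic language is a recursive sum type.  Here is a tiny fragment of such a definition in ML (names mine, for illustration), together with the tag checks that every primitive operation must perform:

```sml
(* The single type of a "dynamic" language: every value carries a
   class tag, and the type is recursive (functions act on it). *)
datatype value =
    Int of int
  | Bool of bool
  | Fun of value -> value

exception TypeError

(* Every primitive must check and strip tags at run-time. *)
fun addv (Int m, Int n) = Int (m + n)
  | addv _ = raise TypeError

fun applyv (Fun f, v) = f v
  | applyv _ = raise TypeError
```

The tag checks in addv and applyv are exactly the run-time overhead, and the lost invariants, discussed below.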

And this is precisely what is wrong with dynamically typed languages: rather than affording the freedom to ignore types, they instead impose the bondage of restricting attention to a single type!  Every single value has to be a value of that type, you have no choice!  Even if in a particular situation we are absolutely certain that a particular value is, say, an integer, we have no choice but to regard it as a value of the “one true type” that is classified, not typed, as an integer.  Conceptually, this is just rubbish, but it has serious, tangible penalties.  For one, you are depriving yourself of the ability to state and enforce the invariant that the value at a particular program point must be an integer.  For another, you are imposing a serious bit of run-time overhead to represent the class itself (a tag of some sort) and to check and remove and apply the class tag on the value each time it is used.  (See PFPL for full details of what is involved.)

Now I am fully aware that “the compiler can optimize this away”, at least in some cases, but to achieve this requires one of two things (apart from unreachable levels of ingenuity that can easily, and more profitably, be expressed by the programmer in the first place).  Either you give up on modular development, and rely on whole program analysis (including all libraries, shared code, the works), or you introduce a static type system precisely for the purpose of recording inter-modular dependencies.  The whole-program approach does not scale, and flies in the face of the very advantage that dynamic languages supposedly have, namely dynamic evolution and development of components.  The static type system approach works beautifully, and makes my point nicely.

I am also fully aware that many statically typed languages, particularly the ones widely in commercial use, do not have a sufficiently expressive type system to make it feasible to work dynamically (that is, with multiple classes of values) within a statically typed framework.  This is an argument against bad languages, not an argument against type systems.  The goal of (static, there is no other kind) type systems is precisely to provide the expressive power to capture all computational phenomena, including dynamic dispatch and dynamically extensible classification, within the rich framework of a static type discipline.  More “researchy” languages such as ML or Haskell have no trouble handling dynamic modes of programming.

Why should you care?  Because, I argue, a fully expressive language is one that supports the delicate interplay between static and dynamic techniques.  Languages that force only one perspective on you, namely the dynamic languages, hobble you; languages that admit both modes of reasoning enable you and liberate you from the tyranny of a single type.  Let a thousand flowers bloom!


Parallelism Is Not Concurrency

March 17, 2011

In an earlier post I mentioned that one goal of the new introductory curriculum at Carnegie Mellon is to teach parallelism as the general case of computing, rather than an esoteric, specialized subject for advanced students.  Many people are incredulous when I tell them this, because it immediately conjures in their mind the myriad complexities of concurrency: locks, monitors, deadlock, dining philosophers, ….  And if you look at the recent research literature, you will see countless papers of the form (a) parallelism is important, so (b) let’s talk about concurrency.  The situation has gotten so bad that I think that most computer scientists cannot distinguish the two concepts.  (Or maybe they willfully confuse the two so as to justify funding for work on concurrency by appealing to the importance of parallelism.)

Given this state of affairs, I cannot explain what we are doing at Carnegie Mellon (to anyone who doesn’t already know) without first rebooting you.  Trust me, I’m just going to hit Ctrl-Alt-Delete, and then click Restart.  OK, we’re back.

The first thing to understand is that parallelism has nothing to do with concurrency.  Concurrency is concerned with nondeterministic composition of programs (or their components).  Parallelism is concerned with asymptotic efficiency of programs with deterministic behavior.  Concurrency is all about managing the unmanageable: events arrive for reasons beyond our control, and we must respond to them.  A user clicks a mouse, the window manager must respond, even though the display is demanding attention.  Such situations are inherently nondeterministic, but we also employ pro forma nondeterminism in a deterministic setting by pretending that components signal events in an arbitrary order, and that we must respond to them as they arise.  Nondeterministic composition is a powerful program structuring idea.  Parallelism, on the other hand, is all about dependencies among the subcomputations of a deterministic computation.  The result is not in doubt, but there are many means of achieving it, some more efficient than others.  We wish to exploit those opportunities to our advantage.

Now I can hear you object, but isn’t concurrency required to implement parallelism?  Well, yes, it is, but concurrency is also required to implement sequentiality!  The timing signal on your processor chip is essentially a synchronization mechanism with which to coordinate the otherwise independent activity of the components of the processor.  (Yes, I know that there are asynchronous processors, but this only makes my point stronger, you still need to control the concurrency.)  The point is that concurrency is not relevant to parallelism, even if the engineers who build our parallel computing platforms must deal with concurrency.  Another way to say the same thing is that parallelism is a useful abstraction, and abstractions should never be confused with their implementations.  (Numbers are not certain sets, but they can be implemented in terms of sets.  And so on.)

All well and good, parallelism is not concurrency, but what is it?  What is an appropriate model for parallel computation?  As in the sequential case, there are two answers, a machine-based approach and a language-based approach.  I am going to criticize the machine-based formulation, and argue for the language-based formulation on two grounds: (a) it is inherently more elegant and natural, and (b) it can be taught directly to students without any fuss or bother.  Much of what I have to say is derived from the seminal and beautiful work of Guy Blelloch, which I have absorbed over years of collaboration.  To the extent I’ve gotten it right, give Guy the credit; to the extent that I’ve gotten it wrong, assign me the blame.

The canonical machine-based model is the PRAM, or parallel RAM, which generalizes the classical RAM model to allow p>0 processors that share a common memory.  There are many variations on the PRAM, according to what sorts of interactions are allowed and how conflicts are resolved among the processors.  This is not important for present purposes, so I will not attempt to categorize the variations here.  The key idea is that the number, p, of processors is fixed in advance—results are stated in terms of how efficiently an algorithm executes on a p-processor PRAM as a function of the size, n, of the input.  This is well and good as a cost model for parallel execution, but it totally sucks as a programming model!  To write code for a PRAM, you fix the number of processors, and write the code for that number of processors (often the various cases are “analogous”, but in reality you must fix p and write code for that fixed p).  Moreover, you must (officially, anyway) write low-level imperative RAM code, which introduces spurious dependencies among sub-computations by explicitly scheduling the steps of the computation.  This is horrendous!  Of course typically one writes pidgin C, and sketches how it is compiled, but the fact is you’re still writing sequential, imperative code to implement a parallel algorithm.

As I’ve argued previously, what is needed is a language-based model of computation in which we assign costs to the steps of the program we actually write, not the one it (allegedly) compiles into.  Moreover, in the parallel setting we wish to think in terms of dependencies among computations, rather than the exact order in which they are to be executed.  This allows us to factor out the properties of the target platform, such as the number, p, of processing units available, and instead write the program in an intrinsically parallel manner, and let the compiler and run-time system (that is, the semantics of the language) sort out how to schedule it onto a parallel fabric.  Here’s a useful analogy.  Just as abstract languages allow us to think in terms of data structures such as trees or lists as forms of value and not bother about how to “schedule” the data structure into a sequence of words in memory, so a parallel language should allow us to think in terms of the dependencies among the phases of a large computation and not bother about how to schedule the workload onto processors.  Storage management is to abstract values as scheduling is to deterministic parallelism.

The paradigmatic example is provided by Quicksort, a naturally parallel algorithm.  Many people think that the “big idea” of Quicksort is the in-place swapping technique that Hoare invented in the original formulation.  I argue that this is a detail; the “big idea” is that it is parallelizable.  To see this, we must work at a reasonable level of abstraction: a sort function permutes a non-empty sequence of (comparable) elements to obtain another sequence in sorted order.

fun qs xs = if length xs <= 1 then xs else let val (xsl, xsg) = split xs in concat (qs xsl, qs xsg) end

That is, we split the sequence into two parts (by some means) such that everything in the left part is less than everything in the right part, then sort the parts independently (in parallel!), and concatenate the results.  The exact complexity of Quicksort depends on the representation of sequences, and the complexity of the operations on them.  For now, let us just gloss over this to focus on the overall structure of the computation.  If we can ensure that the sequence is split roughly in half (by any of a number of means, including randomization), then the recursion is of logarithmic depth.  At each stage we do linear work, but the recursive calls are independent of one another.  The algorithm has a natural fork-join structure in which we (a) split the sequence into two parts, (b) in parallel sort the two parts, and (c) join the results together.  (The splitting itself can be done in parallel as well, but we’ll not bother about that here.)
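The fork-join structure can be made concrete.  Here is a hypothetical sequential rendering in Rust, with the pivot made explicit; the two recursive calls are independent, which is precisely the opportunity a parallel scheduler would exploit.

```rust
// Quicksort on sequences, written to expose the fork-join structure.
fn qs(xs: &[i64]) -> Vec<i64> {
    if xs.len() <= 1 {
        return xs.to_vec();
    }
    // Split: everything less than the pivot to the left,
    // everything greater than or equal to it to the right.
    let pivot = xs[0];
    let xsl: Vec<i64> = xs[1..].iter().copied().filter(|&x| x < pivot).collect();
    let xsg: Vec<i64> = xs[1..].iter().copied().filter(|&x| x >= pivot).collect();
    // Fork: these two calls are independent and could run in parallel.
    let mut sorted = qs(&xsl);
    let mut right = qs(&xsg);
    // Join: concatenate the results around the pivot.
    sorted.push(pivot);
    sorted.append(&mut right);
    sorted
}
```

Because `qs` is a mathematical function on sequences, nothing in `qs(&xsl)` can interfere with `qs(&xsg)`; replacing the two sequential calls with a fork-join primitive would not change the result.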

Analysis of Quicksort splits into two parts, the work and the depth (or span).  The work is the overall number of steps required to perform the algorithm.  On a sequence of length n, we perform O(n log n) steps to sort the sequence, as might have been expected.  This is the sequential complexity, the amount of time it would take to run on a single processor.  To account for parallelism, there is a second measure, called the depth, that measures the critical path length of the computation, the length of the longest chain of dependencies among subcomputations.  No matter how much parallelism we may have available, we can never run faster than the depth; it represents the idealized parallel complexity of the algorithm, the amount of time it would take given any number of processors.  Depending on the choice of data structures, the depth of Quicksort is O(log^2 n), or thereabouts, which makes it quite parallelizable.  (Note that, as in our earlier discussion, a step of computation is a transition step in the operational semantics of the language we are using to express the algorithm.)
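Under the simplifying assumption of perfectly balanced splits, the analysis amounts to two recurrences, which the following sketch (hypothetical helper names, constants suppressed) evaluates numerically: work(n) = n + 2·work(n/2), and depth(n) = log n + depth(n/2).

```rust
// Idealized cost recurrences for balanced Quicksort (a sketch, not a proof):
//   work(n)  = n + 2 * work(n/2)     -- solves to O(n log n)
//   depth(n) = log2(n) + depth(n/2)  -- solves to O(log^2 n)
fn log2(n: u64) -> u64 {
    63 - n.leading_zeros() as u64 // floor(log2 n), for n >= 1
}

fn work(n: u64) -> u64 {
    if n <= 1 { 1 } else { n + 2 * work(n / 2) }
}

fn depth(n: u64) -> u64 {
    if n <= 1 { 1 } else { log2(n) + depth(n / 2) }
}
```

For n = 1024 the work is on the order of n log n while the depth is only on the order of log² n; this is the sense in which the recursion is of logarithmic depth with linear work per stage.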

This is very nice and elegant, but now comes the really beautiful part, the concept of a provable implementation of the language.  (The word “provable” is only needed here because of the shabby state of the field; having a proof that your implementation works, and how well it works, remains an oddity to be emphasized, rather than an expectation to be fulfilled by any self-respecting computer scientist.)  The idea is that the language-based model can be implemented on a PRAM (with stated interconnect properties that I will suppress here) with a provable bound on the cost that takes account of the overhead of scheduling onto a p-processor fabric.  Thus, p need not be specified in advance, but is rather determined by the run-time system and compiler, which are provided by the proof of the theorem.

Theorem A computation with work w and depth d can be implemented in a p-processor PRAM in time O(max(w/p, d)).

(Sometimes the second term is d log p, depending on the assumptions we are making about the target PRAM.)

This is called Brent’s Principle, and it states a fairly obvious fact.  First, we can never execute a depth d program in fewer than d steps.  The depth is the critical path length; this is the amount of time we must wait to execute the algorithm, regardless of any available parallelism.  Second, to the extent that parallelism arises from the dependency (rather, independency) structure of the computation, we can at best do p units of work at a time, keeping all processors warm, until we are done.  And that’s it!
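The bound is easy to compute.  Here is a hypothetical helper, ignoring constant factors, with ceiling division standing in for w/p:

```rust
// Brent's Principle, up to constant factors: time on p processors is
// bounded by max(ceil(w / p), d).
fn brent_bound(work: u64, depth: u64, p: u64) -> u64 {
    let per_processor = (work + p - 1) / p; // ceil(work / p)
    per_processor.max(depth)
}
```

For instance, with w = 1,000,000 and d = 100, eight processors are work-bound (125,000 steps), whereas with 100,000 processors the depth dominates and no further processors help: the critical path of 100 steps must be traversed regardless.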

The effectiveness of the language-based model of parallelism lies entirely in its ability to expose the dependency structure of the computation by not introducing any dependencies that are not forced on us by the nature of the computation itself.  And the key to this is functional programming, which manifests itself here in the transformational approach to computation: sorting is conceived of as a mathematical function that transforms a given sequence into another sequence.  It does not destroy the given sequence any more than adding two numbers destroys those numbers!  Since Quicksort is a mathematical function, we need not worry that execution of qs xsl interferes with (depends on) qs xsg; we can readily run them in parallel without fear of untoward consequences.  The payoff is that there are many fewer dependencies among the subcomputations, and hence many more opportunities for parallelism that can be exploited, in accord with Brent’s Principle, when scheduling the work onto a parallel fabric.

The upshot of all this is that functional programming is of paramount importance for parallelism.  That’s (one reason) why we are teaching functional programming to freshmen at Carnegie Mellon.  And it’s the main reason why we are able to.  There’s nothing to it, just program like you do math, and the parallelism will take care of itself!

Update: Simon Marlow made similar points to mine some time ago.


Languages and Machines

March 16, 2011

Models of computation may be divided into two categories, the λ-calculus, and all the others.  Well, maybe Post’s production systems and Herbrand-Goedel equations belong in the former category, but certainly all the well-known, and widely accepted, models such as Turing machines or RAM’s, sit quite apart from the λ-calculus.  Guy Blelloch has characterized the division as between languages and machines.  The machine-based models are all based on the idea of a finite control augmented by unbounded memory.  It’s up to the programmer to manage the memory to implement anything more structured than a bit, and there is no notion, other than via an interpreter, for changing the program during execution; the machine models are all decidedly non-von Neumann because they lack the idea of a stored program.  The sole language-based model, by contrast, does not separate program from data, and offers a foundation on which realistic languages may be built, directly and without encoding.  This would seem like a decisive advantage for the λ-calculus: whereas the machine models, which live in the early chapters of our textbooks, are of essentially no practical importance, the λ-calculus is directly useful for programming and for mechanization of logical systems.  And yet it is the machine models that dominate, especially in the world of algorithms and complexity, and the λ-calculus remains an esoteric oddity studied only by a relative minority of Computer Scientists.  How can that be?

When confronted with this question, my typical colleague (at least on the theory side) dismisses the question as totally uninteresting because “all models of computation are polynomial-time equivalent”, so who cares?  Well, if your work is all done modulo a polynomial factor in complexity, and you only care about computations over natural numbers (or other finitary objects), then I guess it doesn’t matter.  You tell some story once and for all of how the objects of interest are represented on some model, and leave it at that; the details just don’t matter.

But obviously the models do matter, as evidenced by the inescapable fact that the very same colleague would never under any circumstance consider anything other than a classical machine model as the underpinning for his work!  Surely, if they are all equivalent, then it can’t matter, so we can all switch to the λ-calculus, right?  Well, true though that may be, I can assure you that your argument is going nowhere.  Apparently some models are more equivalent than others after all!  Why would that be?

One reason is that a lot of work is not up to a polynomial factor (one might even suspect that most of it isn’t).  So for those purposes the model may matter.  On the other hand, no one actually uses the “official” models in their work.  No one writes Turing machine code to describe an algorithm, nor even RAM code (Knuth notwithstanding).  Rather, they write what is charmingly called pidgin Algol, or pidgin C, or similar imperative programming notation.  That is, they use a programming language, not a machine!  As well they should.  But then why the emphasis on machine models?  And what does that pidgin they’re writing mean?

The answer is that the language is defined by a translation (that is, a compiler) that specifies that such-and-such pidgin Algol stands for such-and-such RAM code or Turing machine code.  The description is informal, to be sure, but precise enough that you get the idea.  Crucially, you reason not about the code you wrote, but the code the compiler writes, because officially the target code is the “real” code, the pidgin being a convenience.  For this to work, you have to hold the compiler in your head: it must be clear what is meant by everything you actually write by imagining its translation onto, say, a RAM.  This is not too bad, provided that the pidgin is simple, so simple that you can compile it in your head.  This works ok, but is increasingly problematic, because it limits the range of algorithms we can conveniently express and analyze by restricting the language in which we express them.

So why the insistence on such an awkward story?  One reason is, evidently, tradition.  If it was good enough for my advisor, it’s good enough for me.  Or, if you want to get published, you’d better write in a style that is familiar to the people making the decisions.

A more substantial reason, though, is that machine models are the basis for establishing the cost of a computation.  A “step” of computation is defined to be a machine step, and we compare algorithms by estimating the number of steps that they take to solve the same problem (up to a multiplicative factor, in most cases).  As long as we can “hand compile” from pidgin Algol into machine code, we can assign a cost to programs written in a higher-than-assembly level language, and we can make rational comparisons between algorithms.  This methodology has served us well, but it is beginning to break down (parallelism is one reason, but there are many others; I’ll leave discussion of this for another occasion).

There is an alternative, which is to provide a cost semantics for the language we write, and conduct our analyses on this basis directly, without appeal to a compiler or reference to an underlying machine.  In short we adopt a linguistic model of computation, rather than a machine model, and life gets better!  There is a wider range of options for expressing algorithms, and we simplify the story of how algorithms are to be analyzed.

To do this requires that we give a cost semantics for the language we’re actually using.  And this is where the λ-calculus comes into play, for it is the paradigmatic example of how to specify a language-based model of computation.  Briefly, the crucial notion in the λ-calculus is the concept of a reduction, written M→M’, expressing one step of computation of program M resulting in program M’.  Execution is defined directly on the programs we write, and the model provides a well-defined notion of computation step that we can count to obtain the time complexity of the program.  Decades of experience shows that this approach scales to realistic languages.
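To make the notion of a counted step concrete, here is a hypothetical call-by-value evaluator for the pure λ-calculus that counts β-reductions.  For brevity the substitution is not capture-avoiding, so the sketch is only correct on terms whose bound variables are distinct (a standard simplification).

```rust
// A tiny cost semantics for the λ-calculus: evaluation counts β-steps.
#[derive(Clone, Debug)]
enum Term {
    Var(String),
    Lam(String, Box<Term>),
    App(Box<Term>, Box<Term>),
}

// Naive substitution [v/x]t; assumes bound variables are distinct
// (not capture-avoiding -- a simplification for this sketch).
fn subst(t: &Term, x: &str, v: &Term) -> Term {
    match t {
        Term::Var(y) => if y == x { v.clone() } else { t.clone() },
        Term::Lam(y, body) => {
            if y == x { t.clone() }
            else { Term::Lam(y.clone(), Box::new(subst(body, x, v))) }
        }
        Term::App(f, a) => Term::App(
            Box::new(subst(f, x, v)),
            Box::new(subst(a, x, v)),
        ),
    }
}

// Call-by-value evaluation; `steps` counts the reductions M -> M',
// yielding the time complexity of the program we actually wrote.
fn eval(t: Term, steps: &mut u64) -> Term {
    match t {
        Term::App(f, a) => {
            let f = eval(*f, steps);
            let a = eval(*a, steps);
            match f {
                Term::Lam(x, body) => {
                    *steps += 1; // one β-reduction, one unit of cost
                    eval(subst(&body, &x, &a), steps)
                }
                f => Term::App(Box::new(f), Box::new(a)),
            }
        }
        t => t,
    }
}
```

Evaluating (λx.x)(λy.y) takes exactly one β-step; the cost is defined directly on the program, with no compiler or machine in sight.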

What’s not to like?  Personally, I have no idea.  I have been mystified by the suspicion with which the λ-calculus seems to be regarded in algorithmic circles.  Just about any machine model is regarded as a perfectly good foundation, yet the behavior of the field itself shows that machine models are of limited utility for the work that is actually done!

The only reason I can think of, apart from powerful social effects, is that there is lingering, yet totally unfounded, doubt about whether the concept of cost that arises from the λ-calculus is suitable for algorithm analysis.  (One researcher recently told me “you can hide an elephant inside a β-reduction!”, yet he could not explain to me why I’m wrong.)  One way to justify this claim is to define, once and for all, a compiler from λ-calculus to RAM code that makes clear that the abstract steps of the λ-calculus can be implemented in constant time on a RAM (in the sense of not varying with the size of the input, only in the size of the static program).  Blelloch and Greiner have carried out exactly this analysis in their seminal work in the early 90’s; see Guy’s web page for details.

In a later post I will take these ideas a bit further, and show (by explaining Blelloch’s work) that the λ-calculus serves not only as a good model for sequential algorithms, but also for parallel algorithms!  And that we can readily extend the model to languages with important abstractions such as higher-order functions or exceptions without compromising their utility for expressing and analyzing algorithms.

