Elegant Coding: Programming

Showing posts with label Programming. Show all posts

23 March 2015

What is Software Correctness?

Software correctness which is really software quality is not one thing. It is comprised of a number of different and sometimes conflicting attributes. As always Gerald J. Sussman provides interesting insights and in this case it is in his "We really don’t know how to compute" talk. He makes the point that "correctness" may not be the most important attribute of software, the example he give is if Google’s page craps out on you, you will probably try again and it will probably work. Actually this example might be more about reliability but it got me thinking as I have always thought that "software correctness" meant the combination of both accuracy and reliability and that was this the "correctness" that I always thought was the most important feature of software. But this correctness is really two software quality attributes and the Google search software exemplifies these two aspects of "software correctness", one is availability and the other is search accuracy or the fulfillment of the users expectations. One could argue that availability is not a software attribute but it is affected by the quality of the software and it is an attribute of a software system even if some of that might be driven by hardware. So these two attributes for Google search do not have to be perfect. The search quality needs to be at least better than the competition and the availability has to be high enough so that people won’t be discouraged from using it and for these two attributes they are both very high. These parameters of quality vary in other domains, a machine administering a dose of radiation to a cancer patient would need to have stringent requirements as to the correctness of the dose, and in fact there have been deaths due to exactly this type of problem. Another is software guiding a spacecraft or a Mars rover which needs to be extremely reliable and correct and when it hasn’t been, well there are a few space bricks floating around the solar system due to software errors. In the real world, especially now, software is everywhere and it now lives in a variety of contexts. So which quality attributes are important will vary based on its domain and its context.

So what makes software good? Google first conquered search with Map-Reduce and other processes running on custom commodity hardware that allowed for flexible fault tolerant distributed massive data crunching. Amazon has commoditized its server infrastructure to sell spare capacity that can be easily used by anyone. Apple created easy to use attractive interfaces for the iPhone and iPad that consumers loved with similar usability now being offered on Android. NASA has successfully landed on the frigid world of Titan, touched the edge of the heliosphere, and has run rovers on Mars. While there are obviously hardware components to these systems, it was the various types of software quality that played a pivotal role in these and the many other great software successes.

It is easy to look back at these successes. With the exception of NASA, I doubt there was a deliberate effort on which quality aspects to focus on. Most were probably chosen intuitively based on business needs. Software quality is essential to any organization that creates software. It is an established subdomain or perhaps a related separate sibling field of software development. It has consortiums, professional organizations and many books written on the subject. So I delved into some of these and you find mention of Agile and TDD. You also start to see things like CMMI, ISO 9000, Six Sigma. Some of the resources seem to be written in some alien language, perhaps called ISO900ese, that seems related to software development but leaves me feeling very disconnected from it. I think that the problem might be that these attempts to formalize software quality as a discipline are suffering from the same problem that plagues the attempt to formalize software engineering as a discipline. It conflates the development process with the final product and the software quality formalization might be further exacerbated by the influence of the domain of manufacturing and its jargon. I think these established methodologies have value, for example makes sense, but it seems like in general with these standards there is a lot to weed through and the good parts seem to be hard to pick out. So I thought it might be fruitful to look at the problem from another angle that might provide a better way to reason about the quality of a system not only after and during its development but also before development begins.

A list of "quality parameters" for software is a requirement for software engineering to be a real engineering discipline. These parameters could then be viewed qualitatively and quantitatively and perhaps even be measured. In engineering disciplines various parameters are tradeoffs, you can make something sturdier but you will probably raise the cost and or the weight, for example plastic bumpers on cars, metal would be more durable but would be heavier and more costly. If one were to think of some parameters in software as tradeoffs you can see that improvements on some parameters may cause others to go down. In my post "The Software Reuse Complexity Curve" I posit that reuse and complexity are interrelated and that the level of reuse can affect understandability and maintainability. Another example is performance, increasing performance of certain modules or functions might lead to an increase in complexity which without proper documentation could lead to harder understandability which might cause a higher long term maintenance cost.

So to come back to the question and to rephrase it: What are the attributes that constitute software quality? I am not the only person to ask this question as I have seen other blog posts that address this. There are quite a few people who have asked this question and have compiled good lists. One post by Peteris Krumins lists out some code quality attributes that Charles E. Leiserson, in his first lecture of MIT's Introduction to Algorithms, mentions as being more important than performance. Additionally Bryan Soliman’s post "Software Engineering Quality Aspects" provides a nice list with a number of references. I thought I’d create my own list derived from those sources and my own experience. I am categorizing mine in three groups, Requirements based (including nonfunctional requirements), structural, and items that fall into both categories.

REQUIREMENTS

Performance
Correctness (Requirements Conformance)
User Expectations
Functionality
Security
Reliability
Scalability
Availability
Financial Cost
Auditability

STRUCTURAL

Modularity
Component Replaceability
Recoverability
Good Descriptive Consistent Naming
Maintainability/ Changeability
Programmer's time
Extensibility/Modifiability
Reusability
Testability
Industry standard best practices (idiomatic code)
Effective library use, leveraging open source (minimal reinvention)
Proper Abstractions and Clear code flow
Good and easy to follow architecture/ Conceptual Integrity
Has no security vulnerabilities
Short low complexity routines

BOTH

Robustness/Recoverability
Usability
Adaptability
Efficiency
Interoperability/Integrate-ability
Standards
Durability
Complexity
Installability

This is by no means intended as a definitive list. The idea here is to put forth the idea of enumerating and classifying these attributes as a way to reason about, plan and remediate the quality of software projects, basically create a methodical way to address software quality throughout the software development lifecycle. In the case of planning software quality attributes can be potentially rated with respect to priority. This could also allow conflicting attributes to be more effectively evaluated.

The attribute of availability is one that is fairly easy to quantify and often gets addressed usually in some type of Service Level Aggreement (SLA). Although availability cannot be predicted, the underlying system that "provides" it can be tested. Netflix has developed chaos monkey to in part to verify (test) this quality attribute of their production system. Tests and QA testers are a way to verify the requirements correctness and the correctness of system level components and services contracts of software. Static analysis tools like PMD, Checkstyle, and Sonarqube can help with code complexity, code conventions and even naming.

Complexity is an elusive idea in this context it also crosses many of the above ideas. Code can be overly complex, as can UIs, Usability (think command line), Installability, etc. Another cross concept is understandability which can affect code, UIs, etc. which is in part related to complexity but can manifest itself in the use of bad or unconventional nomenclature as well. There may be other cross concerns and it seems that the spectrum of quality attributes might better reasoned about by separating out these concepts and developing general ways to reason about them.

I should note that I include financial cost as one of the attributes. While it might not be a real quality attribute, higher quality may necessitate higher cost, but not necessarily. If one were to develop two systems that had the same level of quality across all of the chosen attributes, but one system had twice the cost, the cheaper one would be a better value, hence more desirable. So cost is not a quality attribute per se, but it is always an important factor in software development.

In summary my vision is a comprehensive methodology where the quality attributes of a software system are defined and prioritized and hopefully measured. They are planned for and tracked throughout the life cycle and perhaps adjusted as new needs arise or are discovered. The existing tools would be used in a deliberate manner and perhaps as thinking about the quality attributes becomes more formalized new tools will be developed to cover areas that are not currently covered. It will no longer be about code coverage but quality coverage.

References and Further Reading

Anthony Ferrara, Beyond Clean Code, http://blog.ircmaxell.com/2013/11/beyond-clean-code.html

David Chappell, The Three Aspects of Software Quality: Functional, Structural, and Process, http://www.davidchappell.com/writing/white_papers/The_Three_Aspects_of_Software_Quality_v1.0-Chappell.pdf

Rikard Edgren, Henrik Emilsson and Martin Jansson , Software Quality Characteristics,

http://thetesteye.com/posters/TheTestEye_SoftwareQualityCharacteristics.pdf

Bryan Soliman, Software Engineering QualityAspects, http://bryansoliman.wordpress.com/2011/04/14/40/

Ben Moseley and Peter Marks , Out of the Tar Pit, http://shaffner.us/cs/papers/tarpit.pdf

Brian Foote and Joseph Yoder, Big Ball of Mud, http://laputan.org/mud/

Wikipedia , Software quality, http://en.wikipedia.org/wiki/Software_quality

Franco Martinig, Software Quality Attributes, http://blog.martinig.ch/quotes/software-quality-attributes/

MSDN, Chapter 16: Quality Attributes, http://msdn.microsoft.com/en-us/library/ee658094.aspx

Advoss, Software Quality Attributes, http://www.advoss.com/software-quality-attributes.html

Eric Jacobson, CRUSSPIC STMPL Reborn, http://www.testthisblog.com/2011/08/crusspic-stmpl-reborn.html

11 March 2013

Programming and Order Theory

Covariance, Contravariance and Order Theory

In this post I make the observation that covariance and contravariance in programming are what are known as Order Duals. I am not the first person to make this observation, however, these ideas often tend to be buried in academic research papers like "Adding Axioms to Cardelli-Wegner Subtyping" by Anthony J H Simons, don’t get me wrong I love these types of papers, they give me hope and inspiration that software engineering will someday become a first class engineering citizen. Unfortunately, these types of papers tend to be too theoretical and thus not very accessible for the average developer. This is unfortunate as the idea of covariance and contravariance as order duals both puts these concepts into the mathematical context of order theory and possibly gives programmers some native context for order theory. So hopefully this post will make these ideas more programmer friendly.

I previously wrote a post about lattice theory, which is part of the more general order theory, where I talked about some basic order theory ideas such as duality. Order theory occurs quite a lot in software and programming and this is part of a series of posts to talk about those occurrences.

Covariance and contravariance receive a fair amount of attention, as they should, in software blogs and some of these posts include some interesting observations. One, perhaps slightly off topic observation, is "Liskov Substitution Principle is Contravariance" which is an interesting observation and interesting post if you overlook the disdainful tone towards OO. Another more relevant post which is a nice post about "Covariance and Contravaiance in Scala" relates these ideas to category theory which is relevant especially since apparently you can think of "Category Theory as Coherently Constructive Lattice Theory", warning heavy going in that paper.

Defining Covariance and Contravariance

To me one of the most striking and perhaps apropos examples of order theory in software is that of covariance and contravariance which Eric Lippert defines on his blog as:

The first thing to understand is that for any two types T and U, exactly one of the following statements is true:

T is bigger than U.

T is smaller than U.

T is equal to U.

T is not related to U.

For example, consider a type hierarchy consisting of Animal, Mammal, Reptile, Giraffe, Tiger, Snake and Turtle, with the obvious relationships. (Mammal is a subclass of Animal, etc.) Mammal is a bigger type than Giraffe and smaller than Animal, and obviously equal to Mammal. But Mammal is neither bigger than, smaller than, nor equal to Reptile, it’s just different.

He has an eleven part series on covariance and contravariance, his posts cover some C# implantation details but the ideas are generally applicable and looking at one language’s details can help with comparing and contrasting to other languages.

Wikipedia includes the following definition, the animal example is pretty popular:

Covariant: converting from wider (Animals) to narrower (Cats).

Contravariant: converting from narrower (Triangles) to wider (Shapes).

Invariant: Not able to convert.

Including this is slightly redundant but this definition captures the conversion aspect and defines the relationships explicitly.

Covariance and Contravariance as Order Duals

The above are definitions that have order theory written all over them. In fact that is pretty much a text book definition of an order Relation in that it is reflexive, transitive, and antisymmetric. It is reflexive since Animal = Animal, transitive since Animal ≤ Mammal ≤ Cat implies Animal ≤ Cat, and antisymmetric since Animal ≤ Mammal implies not Animal ≥ Mammal and in the animal example there are cases of both comparability and incomparability as you would find in a partial order.

As you can see from the above definitions both sets of terms, wider/narrower or bigger/smaller, which are the same, define an order dual for comparison. To write it more formally we will call the set C classes of various types in an OO hierarchy. So covariant would be represented by less than or equals ≤ and contravariant would be represented by greater than or equals ≥ and a set of classes with these order relations can be written with mathematical notation as (C, ≤) = (C, ≥)^d .

Types as Sets of Fields

It was in researching this post that I came across the paper "Adding Axioms to Cardelli-Wegner Subtyping". These kinds of discoveries are one of the reasons I write these posts. In that paper they quote another paper "On understanding types, data abstraction and polymorphism" by Luca Cardelli and Peter Wegner:

a type A is included in, or is a subtype of another type B when all the values of type A are also values of B, that is, exactly when A, considered as a set of values, is a subset of B

The ideas about types and subtypes covered in these papers extend beyond which fields a class or object has, however, I thought it would be interesting and beneficial to limit the discussion to that case. One reason is that if you take the fields of an object or class then all subtype collections of fields will be found in the powerset and all subtypes will be a subset relation and these can be drawn as my favorite lattice, yes I have favorite lattice, the powerset lattice. Also in this case covariance and contravariance are now defined as the subset and superset operations on the powerset lattice.

Types as Sets of Fields in the Real World

Now I always feel that a real example helps quite a bit so I have created a set of example classes, Scala traits actually, which illustrate the above ideas using a quasi-real-world example. Please note that these code examples are designed for the purposes of illustrating these ideas and may contain design issues that one would not implement in the real world, but they should be close enough to bridge the conceptual gap, if you will. Also this first example is should be applicable to dynamically typed languages that might use duck typing or structural typing.

The following Scala traits define a possible domain object hierarchy that could be used to persist data to a database and render it back to a web page among other possible uses:

trait BaseDomain {

override def equals(that: Any) : Boolean

override def hashCode : Int

}

trait PersonInfo extends BaseDomain {

var firstName : String

var lastName : String

}

trait Address extends BaseDomain {

var street : String

var street2 : String

var city : String

var state : String

var country : String

var zipCode : String

}

trait PhoneNumber extends BaseDomain {

var phoneNumber : String

var extension : String

}

trait Person extends BaseDomain {

var personInfo : PersonInfo

var address : Address

var phoneNumber : PhoneNumber

}

These traits yield the following field powerset lattice:

Now suppose we would want to define a type for each of the above lattice points, which we probably would not do but there may be cases to do similar types of things in the real world. Let’s define the following Scala traits that wrap the above domain object hierarchy elements:

trait PersonInfoTrait extends BaseDomain {

var personInfo : PersonInfo

}

trait AddressTrait {

var address : Address

}

trait PhoneNumberTrait extends BaseDomain {

var phoneNumber : PhoneNumber

}

trait PersonInfoAddressTrait extends AddressTrait with PersonInfoTrait {

}

trait AddressPhoneNumberTrait extends AddressTrait with PhoneNumberTrait {

}

trait PersonInfoPhoneNumberTrait extends PhoneNumberTrait with PersonInfoTrait {

}

trait PersonTrait extends PersonInfoAddressTrait with AddressPhoneNumberTrait with PersonInfoPhoneNumberTrait {

}

Since Scala traits support multiple inheritance we can define the above type hierarchy which can be drawn as the powerset lattice that it is:

Again we can see covariant and contravariant types defined on this lattice and each relation is actually the subset superset relation on fields.

I feel the basic observations and the above examples on order duals and covariance and contravariance make the ideas pretty strait forward for field set context. In writing this I delved into a number of papers, adding some of them to my "to read list", on Types in programming and Type Theory and I feel there are probably some deeper insights and implications of all this.

07 August 2012

Why You Can’t Have a Real Software Engineering Discipline

As we all know the fields of computer science and software engineering are in their infancies. Many a blog post has been written lamenting the fact that software engineering is not a real engineering discipline, and while I have not written that exact post, I have written about the subject and have deconstructed what others have said about it. A number of these points do apply to why software projects routinely fail, yet another topic that has received considerable attention. Now admittedly all engineering disciplines regardless of their maturity and formalization have project failures and creating a real software engineering discipline will not eliminate this problem but one would hope that it would abate it.

I find it odd that at a time when so many people are involved in IT there seems to be little discernible progress in creating a real software engineering disciple. There are probably millions of people working on creating software. Also there are many researchers trying to solve this problem from many different angles. Practitioners have been attempting to solve it with ideas like Agile and Software Craftsmanship, etc. Additionally there is a long list of failures surrounding the formalization software engineering.

My biggest complaint is the fact that there really are no good formal standards in general software engineering principles and methodologies lack a formal foundation and even the more vague principles can be easily thwarted and misused, and often, as I have previously complained, it often ends up being based on pure opinion which is usually won through positional authority, perseverance or by those who are just more politically savvy.

The intent of this post is not to attempt to solve or even offer solutions to the problem of creating a real software engineering discipline but to look at what I think are some barriers to creating this discipline and in some of these cases offer thoughts on overcoming those barriers.

Low Entry Requirements

If we are to think software engineering as an engineering discipline, I challenge you to find another engineering discipline that routinely has practitioners with no formal training in the field. I very much doubt that someone who did not hold a degree in engineering and who built a shed in his back yard is now working as a structural engineer on a construction project, the mere idea seems ludicrous. Yet I have worked with non technical degree holders who became software developers, one who started by building web pages and now calls himself an "Internet Application Architect", and is probably one of the biggest cargo cult coders I’ve ever had the misfortune to work with. This egalitarian aspect isn’t all bad as it does let good people into the field as well. Another non technical degree holder I once worked with became an accomplished developer and went on to get an advanced degree in CS and is now a CS professor. Although it’s probably the case that for every good non technical degree holder who joins the field it is likely that dozens of mediocre and bad practitioners will also join. It seems that in other engineering disciplines there are much more stringent educational requirements to ensure proper training. To be clear here I am not equating having a degree with being competent because I have met many incompetent people with degrees, but still you need something. I confess I am at a loss for a solution for this one.

The Lack of Differentiation between Science and Engineering

Chemistry is a scientific discipline, chemical engineering is an engineering discipline, and you can make this comparison between other engineering disciplines and the sciences that they employ e.g. electrical, mechanical, and structural map to various areas of physics. Engineering and scientific disciplines are different types of disciplines often taught in separate schools. Although each engineering discipline has some scientific overlap and it would probably be possible to move from the appropriate scientific field to the corresponding engineering field in general you probably would not do so without returning to school. In recent years software engineering curricula have been added to the rosters of many colleges and while this is potentially a step forward, discounting for the moment that software engineering is still not really real engineering, a software engineering degree and a computer science degree in many cases gets you the same job!

So in other fields there would be a differentiation between a degree that would put you on a track to do scientific research and one that would put you on a track to do engineering work. As I understand it, if you are fresh out of school with a BS in CS that qualifies you to be a tester at Microsoft¹, developers hired right out of school need at least a Masters degree in CS. This is an extreme case of how this really breaks down in our industry. In most cases engineering practitioners are the software engineering, MIS, CS or even non technical degrees and the few scientific careers generally go to the advanced degree holders. I admit this probably fairly normal a BS in chemistry or biology is also more likely to get you a job in IT than a job as a research scientist. Still compared to more mature engineering fields there seems to be a lack of real differentiation between the degrees that yield a career as a software engineer versus one as a computer scientist.

Bad Management

The entry requirements for managers in software make the low entry requirements for software engineering practitioners look downright rigorous. I have previously criticized the fact that many organizations find the cheapest people, especially for software management and give them an obligatory certification. Now I know that engineering management regardless how formal or established the engineering discipline is an area that has problems and failures, but again I very much doubt that you would find a twenty something business or liberal arts major with a freshly minted PMP suffix managing construction or civil works projects. Yet in my experience this is pretty normal in IT especially in the government sector. Of course bad software management is executed by older managers as well.

A post by Larry White titled "Engineering Management Is Dying" delves into some these issues. It is definitely the case that methodologies like Agile have changed how software projects work and the traditional corporate management approach to software really doesn’t work. One point he makes is that it is not uncommon to see a two hundred person project at Google headed by an engineer. To me this implies that someone is in some way managing that project. I think all of this is indicative of the need for software engineering mangers to also be practitioners in engineering or at least very well versed in how software development works. The situation he talks about at Google is not the case everywhere, in that case the company is its own client, but there are many cases where software is being built for clients and this does create a need for management that deals with the client, perhaps this should be a role that is separated from management and called a client liaison, which is what often happens. Also I am not sure how things work at Google but I have also seen internal development in organizations where the IT department provides services to an internal client, I have also seen this go horribly awry with contentious unproductive relationships between departments. In short I feel that just as we need a real software engineering discipline we also need a real incarnation of software engineering management.

The Disconnect between Academia and Industry

Most practitioners do not keep up with, or read academic research, actually many practitioners don’t read at all, but that’s another issue, and most academicians don’t work in the field so they lack the hands on knowledge. Unfortunately this rift can take on a somewhat disdainful tone as is the case in my exploration of one practitioner’s attempt to define "Real Engineering". Fortunately some people, I’d like to count myself among them, take a more constructive approach to building bridges across this rift. Daniel Lemire has an interesting critique the quality of software produced in academia.

I think the solution here is for both sides to become more engaged in problems faced on each side. I am optimistic about this one as I feel that these walls are breaking down as the field is growing up and many of the emerging technologies are forcing developers to be more cognizant of and engaged in research topics. I also have encountered much more research that has ties to the practical concerns of day to day software development, some I have mentioned with more to come.

The Lack of an Effective CS Math Curriculum

If you read my blog you know it to be a mix of math and software practices. I would describe myself as a software practitioner and a math enthusiast. My math journey has taken me on some interesting math excursions into areas of math that seem to get little or no mention in CS curricula and I feel that this is a major problem in really applying math to the field of software engineering. Another problem is in the way that it is taught, my experience was that it was not only taught badly but in a way that made it seem irrelevant to many of the programming courses. Specifically I would shift the CS math curriculum to include less differential and integral calculus and include more logic, combinatorics, abstract algebra, graph theory, order theory, category theory and probably some topology among others.

Over the last few years I have been progressively learning more math which has been in part motivated by necessity for understanding research papers. This approach has changed the way I see things, I now see the patterns of math in software. They are there and I believe that they can be exploited to create a real software engineering discipline.

For the math learning problem I do have some ideas many of which are expressed in my blog. Some of my posts like refactoring if statements with Demorgan’s Laws or the String Monoid, the Object Graph, etc. are about making these mathematical ideas more relevant and accessible to programmers. Other posts like my series on naming and my post on generic programming and entropy and complexity are about my own ideas and ideas in the research literature that I feel may help to create a real software engineering discipline. These are my attempts at solutions to these problems, so expect more of both of these and more thoughts on the CS math curriculum.

The Software Architect Debacle

This is one that really bothers me. I feel the software architect role in general has a very negative effect on creating quality software and it dissuades the industry from developing a real engineering discipline. The Software Architect role tends to be broadly defined, you have enterprise architects, software architects, etc., some architects tend to be involved with hardware and networking, some work on configuration management, some do software design, some do all of these things and more. I feel this is a problem as each of these is a separate engineering area with different types of problems. By not breaking these down it often leads to a lack of focus on specific problems, also I have seen cases where the architect focuses on their preference and ignores other issues. The architect role is often divorced from the code. I have met many architects that were responsible for software but had never even looked at the code. To me this is a huge failing that architects that are responsible for building software often lack the interest, time, or even aptitude to know the quality or underlying of structure of the software that they are delivering.

To really do justice to these ideas I might need a separate post, my solution would be to break down the architect role into specific engineering roles, which could include:

Software Process Engineering - this would be involved with software team and resources planning, some aspects of configuration management such as defining policies, requirements analysis and general project management aspects of software construction including the definition and refinement of the SDLC. This role is tightly coupled with software engineering management.
Software Structural Engineering - this would include requirements comprehension, code infrastructure planning, prototyping work, hands on code structural work including custom frameworks and reusable components, third party library products, and general code quality including reviews and static analysis.
Software Quality Engineering - This would include the QA roles, software testing and testing tools, requirements validation and refinement, usability and reliability and general software quality issues.
Software Infrastructure Engineering - This would include hardware, networking, database, app server and general service infrastructure, non functional requirements fulfillment and possibly the implementation of configuration management, continuous integration, etc.

This is a rough "sketch" of these possible roles and each of these areas has some overlap implying that all of the people in these roles would work closely together, also some combination of, or all of these roles might be performed by a single individual depending on the organization’s size and structure. What this approach does is clearly defines roles and responsibilities as opposed to some nebulous software architect role.

The Continuous Turnover of the Work Force

A young CEO once commented that he thinks young programmers are superior. In general companies prefer younger workers as they often don’t have family commitments so they can work more hours and you can pay them lower salaries, DC area government contractors rely on this to keep their profit margins up.

As an older developer I often find this frustrating. I confess that I have not moved up the ladder and still mostly work as a developer or what might be called a hands on architect, yes I know but that’s the current term, actually my preferred title would be: Software Structural Engineer. Working as a contractor I have on several occasions found myself on projects that were dominated by younger developers and unfortunately on more than one occasion I watched as teams would make many mistakes due to a lack of experience, in some cases I was able to help in others I was ignored by the younger developers. This is not to say that I do not still make mistakes, I’ve just been around long enough to have made a lot of them already.

If you buy into the software craftsmanship thing, I admit to being skeptical of this idea, you might be inclined to draw a parallel to craftsman of the past when older established masters took on apprentices and passed on their wisdom to the next generation. Given the access to information these days such process is somewhat antiquated and perhaps impractical. Nevertheless I feel, and the industry supports me here, that older developers who keep their skills up to date have value. I know several younger developers who have sought me out to learn from me so I think I can safely say I still have some value.

Many older developers have let their skills get rusty and have not kept up on current technologies I have seen this many times. Also many companies seem to lack the vision to allow for real hands on technical career growth. I can’t help but feel that this inclination to recycle developers in each new generation is hurting or at least slowing our ability to develop a real software engineering discipline, this potential loss of continuity seems like it leads to some lost wisdom.

The Social Networking Drain and Hollywoodification of Silicon Valley

As I write this Hollywood gears up to deliver a reality TV show that takes place in silicon valley and Mark Zuckerberg now gets the paparazzi treatment. Another article that I previously mentioned was about how CS enrollment was up because the movie The Social Network had enticed many aspiring Zuckerberg wannabes. At present there does seem to be a gold rush mentality and it is noticeable and even discussed on sites like Hacker News.

As someone who has never worked in the valley I am very much an outsider and may not have a good perspective on these things, but it seems that there are a lot of startup companies focusing on crap. Now I get it. We live in a shallow consumer society with a rapacious appetite for crap. But are not these supposed to be some of the smartest people and I am not the only one to wonder why so many smart people who should have better taste and higher standards are settling for working on serving up ads and other vapid crap instead of doing more meaningful work like working on real problems that would advance human society or even just work on advancing software engineering.

¹I believe this was the case in the late 90’s and may have changed, this information was conveyed to me by a former Microsoft employee.

17 July 2012

A Framework for Software System Naming

On the Importance of Naming in Software Systems part IV

In my previous posts on naming I have laid the foundation for a framework for naming in software systems. This post is mostly a rehashing of those previous ideas and is intended to both codify and consolidate the definitions of terms upon which I have chosen and created to build a naming framework. It also serves as a single location to concisely reference these ideas. To gain more insight into the background and development of these ideas see my previous three posts: "A Survey of the Conventional Wisdom on Software Naming", "More Thoughts on Formal Approaches to Naming in Software", and "Software System Morphology". The definitions are as follows:

Structural

Name – The Signifier that consists of a string of characters that refers to a System Referent.
Name Unit Separator – is a mechanism to separate name units within a Name, this can be an explicit character such as "-" or "_". It can also be in implicit mechanism such as the use of Camel Case within the name.
Name Unit – Is the smallest lexical unit of a Name usually separated by a Separator (Name Unit Separator) and is used to describe the lexical structure of the Name.
Named Scope Context – this is the location of the Name within the system, e.g. where the name usage occurs.
System Referent – This is the actual instance of a thing that is named and is used in the system. See System Referent Type for specific types of System Referents.
System Referent Type - This is the specific type or class, as in classification, of what a System Referent is in a system some examples include:
- Classes
- Variables (local, instance, static)
- Methods
- Method Parameters
- Packages
- Database Tables, Columns, Triggers, Stored Procedures, etc.
- HTML Files, CSS Files, Javascript Files, Config Files, all files
- Directories
- Urls
- Documents
- XML Elements and Attributes
- CSS Classes

Semantic

System Morpheme – This is a conceptual unit of meaning in a system that is used in a name, it is distinguished from a Name Unit in that a morpheme can consist of one or more Name Units.
System Natural Language Set – This is the set of Natural Languages used to create the names. It is often English but it need not be and a system could include components or be constructed in a way that might include hybridization between multiple natural languages.
Problem Domain - This is the Semantic Domain that describes problem area for which the system solution is targeted. This would be a set of concepts and System Morphemes that apply to the problem domain. For example a financial system would be built using the relevant financial terms and language concepts.
Solution Domain – This is the Semantic Domain that describes concepts that are used to implement the system including System Morphemes that apply to the solution domain. This might include things like terms that describe Design Patterns, data structures, etc.

As you can see from above these Ideas can be separated into two categories: Structural and Semantic. However the two categories are not completely independent of each other so there is some overlap and interdependence. The structural ideas include: Name Unit, Named Scope Context, System Referent, and a System Referent Type. Each Name has these properties as attributes. A name has a lexical structure made up of Name Units, and refers to a System Referent of some System Referent Type and has one or more locations given by its Named Scope Context(s).

The semantics of the names which has the "unit" described by the System Morpheme idea has semantic properties which will fall into either the Problem Domain or the Solution Domain. The expression of these concepts is dependent on the System Natural Language Set, in other words the natural languages that are used to construct meaningful names for concepts in both the Problem Domains and the Solution Domains.

This is the current state of my thinking on this. These ideas will be refined and might change over time, the objective is to develop something that is useful and that both supports and can be validated by a more analytic approach.

01 April 2012

The Coding Buzz

In the episode of Seinfeld titled "The Frogger" Jerry and George discover a long lost Frogger machine that still has the G.L.C. George Louis Costanza high score of 860,000 of which George reminisces: "I remember that night. The perfect combination of Mountain Dew and mozzarella...just the right amount of grease on the joy stick..."

This essentially sums up what I call the coding buzz, for me it's the perfect combination of a good problem, private time to work on it, and just the right combination of caffeine and music. The coding buzz is what I live for, those narcotic moments of creative bliss when all other distractions and even personal problems melt away into the background and your mind and fingers become one with your editor begetting a symphony of code creation. It's a wonderful thing.

To really achieve the coding buzz one needs to have a level proficiency generally denoted with the quantity 10,000 hours, meaning that you need at least 10,000 hours of practice. It turns out that the coding buzz is what is called "flow". This is from an article on New Scientist:

The first is an intense and focused absorption that makes you lose all sense of time. The second is what is known as autotelicity, the sense that the activity you are engaged in is rewarding for its own sake. The third is finding the "sweet spot", a feeling that your skills are perfectly matched to the task at hand, leaving you neither frustrated nor bored. And finally, flow is characterised by automaticity, the sense that "the piano is playing itself", for example.
...
Flow typically accompanies these actions. It involves a Zen-like feeling of intense concentration, with time seeming to stop as you focus completely on the activity in hand. The experience crops up repeatedly when experts describe what it feels like to be at the top of their game, and with years of practice it becomes second nature to enter that state.
...
Exactly what happens in the brain during flow has been of particular interest, but it has been tricky to measure. Csikszentmihalyi took an early stab at it, using electroencephalography (EEG) to measure the brain waves of expert chess players during a game. He found that the most skilled players showed less activity in the prefrontal cortex, which is typically associated with higher cognitive processes such as working memory and verbalisation. That may seem counter-intuitive, but silencing self-critical thoughts might allow more automatic processes to take hold, which would in turn produce that effortless feeling of flow.

This is why after over two decades I still mostly like to work as a developer, admittedly not all of my career has consisted of those moments, but the times I found myself immersed in those types of projects both personal and professional has well been worth it. I frequently feel that I should have grown up at some point and tried to pursue the management track or a more formal architect position and I have contemporary colleagues that are VPs and CTOs, etc. However, every time I think about such a career path I realize how much I would dislike it and I would then have to let go of being hands on, I know there are probably cases of those types of positions where that is not the case, but you would have to find the right environment.

I have recently found myself in a pair programming environment and while there are some aspects of it that are ok, but I often find myself sneaking away to do the creative experimental work on my own. As an introvert I sometimes find it hard to pair. One problem is that there is a substantial skill differential between my teammates and me in that I am considerably more senior so that's a problem, thus I am not sure if my issues with pairing are truly accurate. Still when I need to build some complex functionality or refactor code that requires you to hold dozens of classes in your head, tasks that require intense concentration and that both require you to and reward you for being in the zone aka achieve flow, I just have trouble seeing doing this effectively with the added pressure of someone looking over your shoulder. I have spoken with developers who like pairing and perhaps it's an introvert/extrovert preference.

When I originally came up with the idea for this post I just wanted to talk about the euphoric bliss of the coding buzz but now I find myself in a situation that seems to be harshing that coding buzz, to put it in the parlance of our times. Coincidently I am writing this at time when the "group" approach seems to be coming under fire, most notably in an article in the New York Times entitled "The Rise of the New Groupthink" by Susan Cain which includes the following:

The reasons brainstorming fails are instructive for other forms of group work, too. People in groups tend to sit back and let others do the work; they instinctively mimic others' opinions and lose sight of their own; and, often succumb to peer pressure. The Emory University neuroscientist Gregory Berns found that when we take a stance different from the group's, we activate the amygdala, a small organ in the brain associated with the fear of rejection. Professor Berns calls this "the pain of independence."

She also includes the following excellent quote from Steve Wozniak:

Most inventors and engineers I've met are like me ... they live in their heads. They're almost like artists. In fact, the very best of them are artists. And artists work best alone .... I'm going to give you some advice that might be hard to take. That advice is: Work alone... Not on a committee. Not on a team.

She also has an interesting TED talk. As an introvert I feel that both Woz's and her sentiments resonate with me, and I really know what it's like to live in my head, I can't imagine doing good creative work any other way.

As I said the groupthink backlash and pair programming skepticism seems to be something of a hot topic at the moment, see this compendium post on InfoQ which includes some these and other links. Obviously communication is essential in a multi-person software project but is pair programming over doing it? It just feels like a coding buzz kill to me.

The following two talks are some additional interesting and relevant neuroscience related resources:

Authors@Google: Sandra Aamodt & Sam Wang

Your Brain at Work by David Rock.

22 October 2011

A Confederacy of Cargo Cult Coders

I hate to say it but I feel that one of the biggest problems with Software Engineering is the prevalence of programmers who are Cargo Cult Coders. Now I know this may seem a bit extreme but it is my feeling that many software projects suffer in part due to the fact that most developers are mediocre at best and there are quite a few developers some who are more senior who are "faking it" . So before you call me a cynic or worse, think about this, it’s pretty well known that really good developers are really rare and good developers are rare. So let’s assume for sake of argument that developer ability falls on the Normal Distribution, I am not asserting that it does, that means the vast majority of developers fall into the average category it also runs along the lines of the 80/20 rule in that most developers range from average to bad.

Now I have seen these types of developers a fair amount and this is one possible down side of frameworks such as Spring and Hibernate. If you read my blog you know that I am a big fan of frameworks and the concept in general, but they can be abused as can any other technique or methodology. The plus side of technologies like Spring and Hibernate is that good developers can quickly create large complex well engineered systems, of course mediocre and bad programmers can, by blindly implementing framework patterns, create large complex monstrosities. I have seen cases where through the use of cutting and pasting of example code, bad programmers can create working systems with a limited understanding of the underlying technologies, I often use googled code snippets myself, but at least I understand the fundamentals and if I don’t fully understand what I am doing at the time I try to go back and learn more about how the underlying technology works. Essentially these Cargo Cult Programmers are mostly just configuring boilerplate code and code snippets by trial and error within the confines of a framework.

On one of my previous contracts I had the misfortune to work with perhaps one of the worst developers I have ever met, he was truly a consummate cargo cult coder, he was musician turned programmer¹ with about eight years of professional experience. He had zero understanding of Computer Science fundamentals and seemingly no understanding of good Software Engineering principles and yet during his career he had accumulated enough basic knowledge of Web Design, Javascript, Java, Spring, Hibernate, SQL and various other technologies and components to cobble together enterprise applications. Ironically his ability to create these applications was actually somewhat impressive, as long as you didn’t look at the code. I think the scariest thing about this developer is that he had garnered a false sense of the level of his abilities which he would project, which in turn would cause others to falsely believe that he was a highly competent developer.

I use this particular developer as an example, but this is a trend that I am seeing, developers who now claim themselves as "senior level" because they have gained proficiency as cargo cult coders, but when you see their non boilerplate code it will often demonstrate an egregious lack of knowledge of basic concepts like cohesion, coupling, inheritance, basic OOP design, threading, the underlying workings of the framework technologies themselves, etc. I think the biggest danger is the complacency and perhaps self delusion or naivety which can even manifest itself as hubris, that these developers acquire from this limited perspective, in fact it is ultimately self limiting behavior because all new technologies and languages are then viewed in the same narrow context, usually as another resume bullet, practitioners of what I call RDD (Resume Driven Development).

I believe that you can boil down what makes a good developer to two relatively simple things, one is ability and the other is desire. Ability is a complex web of inherent skill, intellect including a good memory, experience etc. and desire is the hunger for knowledge and aspiration to want to improve one’s skill and the passion to find better ways to do things. In some ways the two go hand in hand but not always, the "consummate cargo cult coder" from above had a high degree of passion, but he was stubborn and did not work well with others and was unable to take advantage of the opportunity to benefit from the knowledge of others, I actually found his situation to be somewhat tragic. The other side is people who are talented who lack desire, ironically I have been accused of this one, it’s usually not due to my lack of desire, I just sometimes find what I am working on to be boring which in turn can cause me to lag in terms of productivity, fortunately this is fairly aberrant behavior for me.

Sadly I have recently been on some pretty dysfunctional teams and I feel the biggest tragedy is that on those teams there have been developers in whom I saw that they could be more than they were on the project, but they mainly lacked guidance, which I could not provide due to team’s dynamics. This type of situation is really a double tragedy, it is tragic for the developer in that a better opportunity for the developer is missed, and it is a missed opportunity for project, the team, and the management of the team as they could have a had a better developer who was producing better quality code. I guess what I am saying is that there should be a way to try to avoid wasting developer potential, I think this is some of what Agile Process tries to achieve. "Build projects around motivated individuals. Give them the environment and support they need, and trust them to get the job done." I think this should be taken further to try maximize each developer’s skill and inspire their inherent passion and to kindle increased passion to maximize the efficiency of the team.

If you are lucky the cargo cult coders on your team are just plodding along and hopefully not messing up your codebase too much and are the ones who just need that extra guidance or encouragement. Unfortunately they can do far worse damage, there are many out there who have been getting by for many years and have formed overrated opinions of themselves, these are the worst, not only can they disrupt the software structure they can often disrupt the team dynamics and create more substantial problems, in my Driven Development post I mentioned developers who I was in conflict with creating a CDD (Cognitive Dissonance Development) environment, they were cargo cult coders who had taken very defensive and recalcitrant positions, you couldn’t reason with them because if you tried to talk about good design principles or standard practices you were speaking a language they did not understand, they only saw software development as learning some new technology that they could then claim to be experts in. They will pollute your system with redundant inconsistent shoddy code that will degrade its quality, performance and maintainability. Worst yet, they will be excessively defensive in the face of criticism, which makes code reviews hard if not impossible, and since they don’t read or understand good engineering principles they often argue against them. Also they will often disrupt the team dynamics and poison it with their defensive and sometimes arrogant behavior this can result in the team becoming highly dysfunctional which can create a toxic working environment that is hostile to good developers and is the opposite of the optimal environment in that the team will not function as whole that is greater than the sum of its parts it will function as whole that is less than the sum of its parts.

¹ I have worked with plenty of non-technical/non-CS converts and there are some who are good and go the extra mile to learn the basics and beyond.

18 October 2011

O(log(n))

I find O(log(n)) to be a very interesting algorithm complexity, it’s also highly desirable and if you achieve it you are probably doing well, or you are just using a couple of common time tested algorithms and data structures such as Binary Search or [Self] Balanced Binary Trees like Red Black or AVL trees. The order of Big O Complexities is:

O(1)	constant
O(log log n)	Double Logarithmic
O(log n)	Logarithmic
O(n^m) 0 < m < 1	fractional power
O(n)	Linear
O(n log n) = O(log n!)	Loglinear
O(n²)	Quadratic
O(n³)	Cubic
O(n^m) m > 1	(general) polynomial (m is a fixed, non-negative integer; e.g. n⁴, n⁵)
O(mⁿ)	exponential (m >= 2; ; e.g. 2ⁿ, 3ⁿ)
O(n!)	Factorial

When I was growing up and hand held calculators were quickly evolving much like smart phones are now, although perhaps not as rapidly, my father would bring these calculators home and I would play with them, this was my first exposure to logarithms when I got to high school I learned more about them including some of the identities. Actually I find the Logarithmic Identities to be quite interesting and have found them important in understanding logarithms which are useful in understanding other areas of science and math. I even used to use log base 10 to calculate the number of digits of numbers for print formatting way back before we had better ways to do number formatting^†, ⌈log₁₀(n)⌉, of any base 10 number is its number of digits, where ⌈x⌉ is the ceiling function. Also many common measurements like the Richter Scale, pH and Decibel among other things are logarithmic scale. Also we have previously encountered logs in relation to information entropy . Some useful logarithmic identities are:

y = log_b(x) if and only if x = b^y
log_b(1) = 0
log_b(b) = 1
-log_b(x) = log_b(1/x)
log_b(x*y) = log_b(x) + log_b(y)
log_b(x/y) = log_b(x) - log_b(y)
log_b(xⁿ) = n log_b(x)
log_b(x) = log_b(c) * log_c(x)
log_b(x) = log_c(x) / log_c(b)

The first one demonstrates the relation between the functions of logarithm and exponentiation, as you can see the value y, which is log base b of the value x, is the exponent to raise b to get the value x, so a logarithm is just a way to get the exponent. Log base 10 of 100 is 2, log₁₀(100) = 2, since 10 to second power is 100, 10² = 100. Log base b of y, x = log_b(y) is inverse function of raising b to the y power, x=b^y, so:

b^log_b(x) = x

Another interesting observation that one can draw from the logarithmic identities is the answer to the question: What is the difference between O(log₂(x)) and O(log₁₀(x))? It’s a little trick question that I like. The answer is they are the same. Using the second to last identity (#8):

O(log₂(x)) = O(log₂(10) * log₁₀(x))

Since log₂(10) is a constant:

O(log₂(x)) = O(log₁₀(x))

Pretty cool!

This would also work with the second to last identity (#9) as well and remember that Big O notation is about [asymptotic] growth, so equals in the case of Big O notation is not the same as equals e.g. log₂(x) ≠ log₁₀(x). Also for small quantities like the examples in this article, the base does make a difference.

Simlarily exponetiation has its identities as well:

b⁰ = 1
b¹ = b
b^-n = 1/bⁿ
b^mbⁿ= b^m+n
b^m/bⁿ= b^m-n
(b^m)ⁿ = b^mn
(b/b)ⁿ = bⁿ/bⁿ
(b/a)^-n = (a/b)ⁿ

The binary tree structure (a rooted acyclic graph) visually illustrates O(log(n)) and the inverse function relationship between logarithms and exponentiation. The diagram above shows a complete^‡ and therefore balanced binary tree structure. As you can see the number of items in each row grows exponentially as powers of two, shown in red, also the total number of elements in the tree as each row is added grows in the same exponential fashion, denoted in square braces. So in a balanced tree you will fill all of the n rows up to the 2ⁿ – 1 item, and when that number is exceeded (greater than or equal to 2ⁿ) a new (n + 1) row will be added. Now the growth is not necessarily exponential, you can add items at any rate, but the structure in which items are stored are broken down along “exponential row boundaries”. So to search 2⁴ – 1 (15) items for the number 7 value we would traverse 8-4-6-7, shown in green, which is 4 nodes, this is the maximum searches for this tree, a search can also be 1,2, or 3 depending on the depth of the item. Since 4 the exact exponent of the total size of the graph, which is [2⁴-1], and therefore the log: [log₂(2⁴-1) = 3.9…], almost 4, our max traversal, O(log(2ⁿ-1)) = O(log(2ⁿ)). Since we know that these two functions are inverses this illustrates O(log(n)) in visual terms.

The Binary Search algorithm has similar log base 2 characteristics, if you remember the algorithm it takes list which has to be ordered, and it repeatedly halves the length of the remaining items and checks the element at each new position, if it is equal you are done, if it is not then depending on whether the search value is greater or less than the midpoint value you then half the bottom or top of the list respectively and repeat the process. This is better illustrated by the following Java code:

public static int binarySearch(int value, int[] array) {

int min = 0;

int max = array.length - 1;

while (min <= max) {

int mid = min + (max - min) / 2;

if (value < array[mid])

max = mid - 1;

else if (value > array[mid])

min = mid + 1;

else

return mid;

}

return -1;

}

The following recursive code example is both more elegant and gives a more intuitive feel for the Divide and Conquer nature of the algorithm. Also it is this Divide and Conquer behavior that breaks the list apart logarithmically.

public static int binarySearch(int value, int[] array) {

returnbinarySearch(value, array, 0, array.length – 1);

}

public static intbinarySearch(int value, int[] array, int min, int max) {

if (max < min)

return -1;

int mid = min + (max - min) / 2;

if (array[mid] > value)

return binarySearch(value, array, min, mid - 1);

else if (key > array[mid])

return binarySearch(value, array, mid + 1, max);

return mid;

}

Let's view a search visually:

In our search for the value 7 in our 16 item list we first go to 8, which is too large, then to 4 which is too small and then to 6, and then 7, at each point in our search the size of what remains to be searched is half of the size of the previous search space.

In our visual example needed to search 4 positions (8, 4, 6, 7) and log₂(16) = 4 which is O(log(n)).

An interesting side note is that many binary search algorithms are broken due to possible overflow problems with integers, as pointed out in a blog post by Joshua Bloch, I account for the issue, you can read more about it here.

^† Due to computing a logarithm, this may not be optimally efficient.

^‡ Complete in this context means that all positions up to and including 2ⁿ-1 positions are filled, i.e. have a node. Also it should be noted that Self Balancing algorithms may not yield such a well balanced tree and it might have more rows for 15 items, but it is still O(log(n)).

19 August 2011

The Object Graph

In the ORM world, e.g. Hibernate, the term Object Graph is used quite a bit and it is a powerful notion, however, I suspect that Object Graph often gets used without much thought to some of the underlying concepts. Generally the term refers to an Object, usually a Domain Object which often contains other Domain Objects and/or Collections of other Domain Objects via composition. With an ORM like Hibernate very complex Object Graphs can be read, altered and persisted with very little work, it is a effective concept that can make building applications both easy and elegant especially if you have the proficiency and forethought to design your mappings properly.

In order to talk more intelligently about some of the underlying concepts regarding the Object Graph it is helpful to get a few fundamentals of Graph Theory out of the way, if you know them feel free to skip ahead. Actually this is a good opportunity to put them in a context that may be more intuitive for many developers. Graph Theory is an abstraction, a framework if you will, built on top of Set Theory, however, we don’t need to get bogged down in a Set based definition here, I do encourage you to pursue this on your own as it is extremely seminal and becoming increasingly essential. We just need to get four fairly simple graph concepts out of the way which can be explained easily and visually, they are: The difference between a simple graph and a multi graph, the difference between a cyclic graph and an acyclic graph, the definition of a directed graph, and the definition of a tree. Let’s start with the graphs:

As you can see from the above diagram we have several different types of graphs, now it should be noted that these attributes can overlap, the Undirected Graph contains cycles, so it is cyclic as does the Directed Graph, the Multi-Graph¹, by definition contains cycles so it is cyclic this one is also undirected. The Acyclic Graph is undirected. So these attributes can combine, except for an acyclic multi-graph which by definition can’t exist, I think.

In order to limit complexity for our discussion we are going to limit our object graph to the case of a directed acyclic graph, which is also known as a DAG. More specifically we will be considering a specific case of this known as a directed rooted tree, where edges have a direction away from the root. The following example shows a "rooted" DAG aka a tree:

As you can see the above tree is also a domain object graph it is also it is a directed graph where the direction indicates containment via composition, which can be thought of as a "has-a" relation. Now in some cases using ORM’s like hibernate one might include a mapping from Address back to Person so the ORM knows how to manage the relation and this is a cycle but it is an artifact created by the ORM since all accesses to this object graph by the application are from the root (Person node). Now that is not to say that there might not be other structural reasons for such a cycle, perhaps an application needs to hold a list of addresses, but needs to get some of the Person attributes, then in this use case the Address node becomes the "relative" root from which the application access occurs so the object graph would then have two roots and need a cycle to maintain that duality.

So in many cases in regards to the ORM entity model the object graph can be thought of as a tree, a rooted directed acyclic graph (DAG). There is one important divergence from traditional graph theory which is the presence of collections in graphs, these form a special node that can have combinatorial enumerative properties, in the case of the various types of collections also it can have certain mapping properties in the case of a Map object. An example of a modified person object that might hold a collection of addresses is:

There are a number of ways to expand on these ideas further but I think I will leave those for future posts.

¹ This is the graph that Euler came up with for famous the Seven Bridges of Königsberg problem. The bridges are shown in the top image.

17 May 2011

The Combinatorial and Other Math of the Java Collections API

The Java Collections API both involves some explicit Math and presents some nice artifacts which parallel other Math concepts.

Combinatorics is probably one of the more fundamental Maths for Computer Science and is also prevalent throughout other Maths. In one example of Combinatorial counting problems, there is a collection of items, say a deck of cards, from which we want to determine the number of selections, say hands. The formula to compute the possible Combinations is written as:

This can also be stated as “n choose k.” My mnemonic for remembering this, and it is worth remembering, is to look at the left side above note that the n is on top and the k is on the bottom, which visually maps the n! to the numerator and k! to the denominator, and then rotate the left part by 90 degrees counter clockwise so that it forms (n - k)! and put it in the denominator. Also every number calculated by this equation is a Binomial Coefficient found in Pascal’s Triangle, which in my opinion is one of the most amazing mathematical structures.

Our example for possible five card hands from a deck of cards computes as follows:

Combinations give you the number of elements with no repetitions and where order does not matter. The possible cases and Mathematical structures for the criteria of order and repetition are as follows:

	Order Does not Matter	Order Matters
No Repetitions	Combination/Set	Permutation
Repetitions	Multiset/Bag	Enumeration/List/String

The Combinatorial equations are as follows:

	Order Does not Matter	Order Matters
No Repetitions
Repetitions

The interesting thing about lists is that if you make lists of varying lengths of the following characters [0,1,2,3,4,5,6,7,8,9] , you have the natural numbers in base ten, so a list of length 3 gives you 10³ or 1000 possibilities, which are: 000-999. Also lists have the highest cardinality of the four:

|List| > |Multiset| > |Permutation| > |Combinitation|

Where |object| means the cardinality or size of the collection, which is the number of elements in the set, list, etc.

The corresponding Java classes/interfaces are:

	Order Does not Matter	Order Matters
No Repetitions	java.util.Set	java.util.SortedSet
Repetitions	org.apache.commons.collections.Bag	java.util.List/java.lang.String

Sadly the Java Collections API is incomplete in this regard. Also this example is somewhat abstract but it does demonstrate how to view these objects from a Combinatorial perspective, which in turn can help with choosing the appropriate data structure, also it gives insight into possible ways that they may be populated and perhaps even tested.

This is an introductory example to Combinatorics, for more perspective there is an expansion on this known as the Twelvefold way.

The next example is more concrete, in fact it comes directly from the java docs for Comparator and Comparable:

For the mathematically inclined, the relation that defines the imposed ordering that a given comparator c imposes on a given set of objects S is:
{(x, y) such that c.compare(x, y) <= 0}.
 
The quotient for this total order is:
{(x, y) such that c.compare(x, y) == 0}.
 
It follows immediately from the contract for compare that the quotient is an equivalence relation on S, and that the imposed ordering is a total order on S. When we say that the ordering imposed by c on S is consistent with equals, we mean that the quotient for the ordering is the equivalence relation defined by the objects' equals(Object) method(s):
{(x, y) such that x.equals(y)}. 

I have to admit when I first read that a few years back, I was like, WTF does that mean! Obviously whoever wrote that was thinking deeply and mathematically about the underlying principals here which is clearly needed if you are designing fundamental low level framework aspects of base classes and collections.

Comparators and Comparable are about ordering, so what is being defined above is a Total Order which means that all elements can be compared to every other element, compare this to a Partial Order where elements may or may not be comparable to each other, both of these fall under the more general Order Theory. The Quotient aka Equivalence Class concept is a bit more complicated and beyond what I am trying to talk about here.

I believe the above Comparator Javadoc quote was written by Joshua Bloch, the following is from his book Effective Java:

The equals method implements an equivalence relation. It is:
Reflexive: For any non-null reference value x, x.equals(x) must return true.

Symmetric: For any non-null reference values x and y, x.equals(y) must return true if and only if y.equals(x) returns true.

Transitive: For any non-null reference values x, y, z, if x.equals(y) returns true and y.equals(z) returns true, then x.equals(z) must return true.

Consistent: For any non-null reference values x and y, multiple invocations of x.equals(y) consistently return true or consistently return false, provided no information used in equals comparisons on the objects is modified.

For any non-null reference value x, x.equals(null) must return false.

The first three are the Mathematical definition of an Equivalence Relation (which relates back to the Equivalence Class) can be found in “Abstract Mathematics” or “Transition to Advanced Mathematics” type books which describe Functions and Relations. CS Majors usually are exposed to these concepts as part of the intro into discrete math.

What I find interesting about these concepts is how the comparator/comparable and equals relation and their interrelationship for the individual object map back to control the collections behavior of ordering and repetitions.