Moved

Moved. See https://slott56.github.io. All new content goes to the new site. This is a legacy, and will likely be dropped five years after the last post in Jan 2023.

Showing posts with label algorithm. Show all posts
Showing posts with label algorithm. Show all posts

Tuesday, December 13, 2022

On Algorithm Design

Some background: FAERIE DUST™, Obstinate Idiocy, Obstinate Idiocy, Expanded, and even Permutations, Combinations and Frustrations. I want to set up algorithm design as the diametric opposite of Obstinate Stupidity. To do that, let's look at Obstinate Stupidity.

The theme? 

We did something wrong, and we don't want to fix it.

I emphasize this because it takes many forms. Another common variant is "We can't afford to continue the way we are, but we can't afford the time to fix it, either." Sometimes, "Management wants this fixed, but we don't have any budget." You know how it is.

The current go-round is someone who has an algorithm of intensely stupid (and largely irrelevant) complexity. See My algorithm performs badly, do I need asyncio?

The situation is touchy. They have pre-reasoned an answer -- asyncio -- and they're looking for (a) confirmation that they're obviously right and (b) help rewriting something needlessly complex to somehow be faster even when it's compute-bound. Specifically, they want Faerie Dust.

Frivolous Complexity

How do I know it has needless, frivolous complexity?

Here are two symptoms.

  1. The problem has a lot of context. In thise case, there's a hierarchy. The hierarchy may seem irrelevant, but it has this mind-numbingly complex back-story, that they can't seem to ignore or abstract out of the essential problem. There's a (large) number of details that don't really explain what the hierarchy means or why it has to be preserved. but somehow make it essential.
  2. The problem can only be described by repeating the legacy algorithm. 

Let's dwell on this second symptom for a moment. We have two competing issues:

  • The legacy algorithm is too slow. AND,
  • There's no other way to describe the problem.

This should make it clear they are looking at asyncio as a kind of Faerie Dust that will magically make the bad algorithm good. Without fixing the bad algorithm.

I want to emphasize the existence of details which can neither be explained nor removed. The hierarchy must be there simply because it must be there. Bizarre complications to walk the hierarchy are, therefore, essential even if no one can explain them.

Algorithm Design

To actually improve the processing they need a new algorithm.

I can't emphasize this enough: they need a new algorithm. (This often means a new data structure.)

"Tuning" in any form is nothing more than nibbling around the edges to make a bad algorithm run 15% faster.

Rewriting may replace \(\textbf{O}(2^n)\) with \(\textbf{O}(n \log n)\). This would be dramatically better. From seconds to milliseconds. You know, 1,000% faster.

There's a disciplined approach to this. Here are the steps.

  1. Write the post-condition for the processing as a whole.
  2. Write code that achieves the post-condition. (This may involve decomposing the big problem into sub-problems, each of which is approached by the same two-step process.)

The intensely painful part of this is creating the post-condition.

I suggested they "write an assert statement that must be true when the algorithm has completed, and computed the right answer."

Hahahah.

What an idiot I was.

They didn't know how to write an assert statement. And at this point, they stopped. Brick Wall. Dead in the water. Cannot proceed. Done. Failed.

The assert statement has become the end-of-the-line. They can't (or won't) do that. And they won't ask about it.

Me: "Do you have a question?"

Them: "I have to think before I can even begin to ask a question."

Me: "How about think less and ask more. Do you have trouble writing the word assert? What's stopping you?"

Them: [silence]

Okay.

Post-Conditions

The post-condition is true when you're done. Let's look at my favorite, M must be the maximum of A and B.

\[M \geq A \textbf{ and } M \geq B\]

This becomes an assert statement through (what seems to me, but boy was I wrong) the following kind of translation.

assert M >= A and M >= B, f"Algorithm Failed {M=} {A=} {B=}"

Again, I acknowledge I was wrong to think creating an assert statement from a post condition was in any way clear. It's absolutely bewilderingly impossible.

It's also important to note that the above condition is incomplete. The value \(M = A+B\) will also satisfy the condition. We need to test our test cases to be sure they really do what we want.

We really need to be more complete on what the domain of values for \(M\) is.

\[M = A \textbf{ or } M = B \textbf{ and } M \geq A \textbf{ and } M \geq B\]

We could rewrite this slightly to be

\[M \in \{A, B \} \textbf{ and } M \geq A \textbf{ and } M \geq B\]

This version directly suggests a potential set comprehension to compute the result:

M = {m for m in {A, B} if m >= A and m >= B}.pop()

This is the advantage of writing post-conditions. They often map to code.

You can even try it as pseudo-SQL if that helps you get past the assert statement.

SELECT M FROM (TABLE INT(X); A; B) WHERE M >= A AND M >= B

I made up a TABLE INT(X); A; B to describe a two-row table with candidate solutions. I'm sure SQL folks have other sort of "interim table" constructs they like.

The point is to write down the final condition. 

I'll repeat that because the folks I was trying to work with refused to understand the assert statement.

Write down the final condition.

The Current Problem's Post-Condition

The problem at hand seems to involve a result set, \(R\), pulled from nodes of some hierarchy, \(H\), \(R \subseteq H\). Each element of the hierarchy, \(h \in H\) has a set of strings, \(s(h)\). It appears that a target string, \(t\), must be a member of \(t \in s(r), r \in R\). I think.

Note that the hierarchy is nothing more than a collection of identified collections of strings. The parent-childness doesn't seem to matter for the search algorithm. Within the result set, there's some importance to the tier of the hierarchy, \(t(h)\), and a node from tier 1 means all others are ignored or something. Can't be sure. (The endless backstory on the hierarchy was little more than a review of the algorithm to query it.)

If any of this is true, it would be a fairly straightforward map() or filter() what could be parallelized with dask or concurrent.futures.

But we can't know if this really is the post-condition until someone in a position to know writes the post-condition.

Things To Do

The post-condition defines the results of test cases. The assert statement becomes part of the pytest test cases. In a kind of direct copy-and-paste process to shift from design aid to test result condition.

Currently, the algorithm they have seems to have no test cases. They can't write a condition to describe correct answers, which suggests they actually don't know what'a correct.

If they wrote test cases, they might be able to visualize an assert statement that confirms the test worked. Might. It appears to be asking a lot to write test cases for the legacy algorithm.

Indeed, if they wrote a conditional expression that described the results of any working example, they'd have taken giant steps toward the necessary assert statement. But that's asking a lot, it appears.

And Then What?

Once you have a target condition, you can then design code to satisfy some (or all) of the target condition. Dijkstra's A Discipline of Programming has a thorough description of the "weakest precondition" operator. It works like this:

  1. Imagine a statement that might satisfy some or all of your post-condition.
  2. Substitute the effect of the statement into the post-condition. 
  3. What's left is the weakest pre-condition for that statement to work. It's often the post-condition for a statement must precede the statement you wrote.

You write the program from the desired post-condition moving forward until you get a weakest pre-condition of True. Back to front. From goal to initialization.

Post-condition gives you statements. Statements have pre-conditions. You iterate, writing conditions, statements, and more conditions.

(You can also spot useless code because the pre-condition matches the post-condition.)

For the silly "maximum" problem?

Try M := A as a statement. This only works if A >= B. That's the pre-condition that is derived from substituting M = A into the post-condition.

Try M := B as a statement. This only works if B >= A. That's the pre-condition that is derived from substituting M = B into the post-condition.

These two pre-conditions describe an if-elif statement. 

Note that this feels weirdly arbitrary and exploratory. It's a kind of empiricism where we try statements and see if they're helpful. There don't need to be any constraints. The post-condition is all that's required to explore the space of statements that might work, or at least might help.

Of course, we're not stupid. And we're lazy. We don't search the infinite space of statements. We can often imagine the statements without a lot of complex work. The formal weakest pre-condition process is necessary to confirm our intuition. Or to assert that something is free of astonishing side-effects.

It all depends on one thing: a clear, formal statement of the post-condition.

Since I made the mistake of describing the post-condition as a line of code, we've hit some kind of brick wall related to "I won't write code." Or "I don't want to be seen writing code." or "I don't want you to critique my code." 

Dunno.

Tuesday, December 6, 2022

My algorithm performs badly, do I need asyncio?

Real Question (somewhat abbreviated): "My algorithm performs badly, do I need asyncio?"

Short answer: No.

Long answer: Sigh. No. Do you need a slap upside the head?

Here's how it plays out:

Q: "We figured that if we 'parallelize' it, then we can apply multiple cores, and it will run 4x as fast."

Me: "What kind of I/O are you doing?"

Q: "None, really. It's compute-intensive."

Me: "Async is for I/O. A function can be computing while other functions are waiting for I/O to complete."

Q: "Right. We can have lots of them, so they each get a core."

Me: "Listen, please. A function can be computing. That's "A". Singular. One. Take a step back from the asyncio package. What are you trying to do?"

Q: "Make things faster."

Me: "Take a breath. Make what faster?"

Q: "A slow algorithm."

Me: 

Q: "Do you want to know what we're trying do?"

Me: 

Q: "First, we query the database to get categories. Then we query the database to get details for the categories. Then we query the database to organize the categories into a hierarchy. Except for certain categories which are special. So we have if-statements to handle the special cases."

Me: "That's I/O intensive."

Q: "That's not the part that's slow."

Me: 

Q: "Context is important. I feel the need to describe all of the background."

Me: "That's trivia. It's as important as your mother's maiden name. What's the problem?"

Q: "The problem is we don't know how to use asyncio to use multiple cores."

Me: "Do you know how to divide by zero?"

Q: "No. It's absurd."

Me: "We already talked about asyncio for compute-intensive processing. Same level of absurd as dividing by zero. What are you trying to do?"

Q: "We have some for loops that compute a result slowly. We want to parallelize them."

Me: "Every for statement that computes a collection is a generator expression. Every generator expression can be made into a list, set, or dictionary comprehension. Start there."

Q: "But what if the for statement has a super-complex body with lots of conditions?"

Me: "Then you might have to take a step back and redesign the algorithm. What does it do?"

Q: <code> "See all these for statements and if-statements?"

Me: "What does it do? What's the final condition?"

Q: "A set of valid answers."

Me: "Define valid."

Q: "What do you mean? 'Define valid?' It's a set that's valid!"

Me: "Write a condition that defines whether or not a result set is valid. Don't hand-wave, write the condition."

Q: "That's impossible. The algorithm is too complex."

Me: "How do you test this process? How do you create test data? How do you know an answer it produces is correct?"

Q:

Me: "That's the fundamental problem. You need to have a well-defined post-condition. Logic. An assert statement that defines all correct answers. From that you can work backwards into an algorithm. You may not need parallelism; you may simply have a wrong data structure somewhere in <code>."

Q: "Can you point out the wrong data structure?"

Me: 

Q: "What? Why won't you? You read the code, you can point out the problems."

Me: 

Q: "Do I have to do all the work?"

Me:

Tuesday, February 11, 2020

Interesting Data Restructuring Problem

This seemed like an interesting problem. I hope this isn't someone's take-home homework or an interview question. It seemed organic enough when I found out about it.

Given a document like this...

doc = {
    "key": "the key",
    "tag1": ["list", "of", "values"],
    "tag2": ["another", "list", "here"],
    "tag3": ["lorem", "ipsum", "dolor"],
}


We want a document like this...

doc = {
    "key": "the key",
    "values": [
        {"tag1": "list", "tag2": "another", "tag3": "lorem"},
        {"tag1": "of", "tag2": "list", "tag3": "ipsum"},
        {"tag1": "values", "tag2": "here", "tag3": "dolor"},
    ]
}


In effect, rotating the structure from Dict[str, List[Any]] to List[Dict[str, Any]].
Bonus, we need to limiting the rotation to those keys with a value of List[Any], ignoring keys with atomic values (int, str, etc.).

Step 1. Key Partitioning

We need to distinguish the keys to be rotated from the other keys in the dict.
We start with Dict[str, Union[List[Any], Any]]. We need to distinguish the two subtypes in the union.

from itertools import filterfalse
list_of_values = lambda x: isinstance(doc[x], list)
lov_keys = list(filter(list_of_values, doc.keys()))
non_lov_keys = list(filterfalse(list_of_values, doc.keys()))

This gets two disjoint subsets of keys: those which have a list and all the others. The others, presumably, are strings or integers or something irrelevant.

List lengths

There's no requirement for the lists to be the same lengths. We have three choices here:
  • insist on uniformity,
  • truncate the long ones,
  • pad the short ones.

We'll opt for uniformity in this example. Truncating is what zip() normally does. Padding is what itertools.zip_longest() does.

lengths = (len(doc[k]) for k in lov_keys)
sample = next(lengths)
assert all(l == sample for l in lengths), "Inconsistent lengths"

Some folks don't like using assert for this. This can be a more elaborate if-raise ValueError() if that's necessary.

Use zip() to merge data values

We have several List[Any] instances in the document. The intermediate goal is a List[Tuple[Any, ...]] structure where the items from each tuple are chosen from the source lists. This gets us a sequence of tuples that have parallel selections of items from each of the source lists.

The zip(list, list) function produces pairs from each of the two lists. In our case, we have n lists in the original document. A zip(*lists) will produce a sequence of items selected from each list.

Here's what it looks like:

list(zip(*(doc[k] for k in lov_keys)))

We can also use zip(key-list, value-list) to make a list of key-value pairs from a tuple of the keys and a tuple of values. zip(Tuple[Any, ...], Typle[Any, ...]]) gives us a List[Tuple[Any, Any]] structure. These objects can be turned into dictionaries with the dict() function.

It looks like this:

list(dict(zip(lov_keys, row)) for row in zip(*(doc[k] for k in lov_keys)))

Assemble the parts

The final document, then, is built from untouched keys and touched keys.

d1 = {
    k: doc[k] for k in non_lov_keys
}
d2 = {
    "values": list(dict(zip(lov_keys, row)) for row in zip(*(doc[k] for k in lov_keys)))
}
d1.update(d2)

It might be slightly easier to "somehow" build this as s single dictionary, but the two subsets of keys make it seem more sensible to build the resulting document in two parts.

The code I was asked to comment on was quite complex. It built a large number of intermediate structures rather than building a List[Dict] using a list comprehension.

What's important about this problem is the complexity of the list comprehension. In particular, the keys are used twice in the comprehension. One use extracts the source lists from the original document. The second use attaches the key to each value from the original list.

It almost seems like the Python 3.8 "Walrus" operator might be a handy way to shrink this code down from about 14 lines. I'm not sure it's helpful to make this any shorter. Indeed, I'm not 100% sure this compact form is really optimal. The fact that I had to expand things as part of an explanation suggests that separate lines of code are as important as separate subsections of this blog post.

Tuesday, May 16, 2017

Needless Complexity -- Creating Havoc Leads to Mistakes

I received the worst code example ever. The. Worst.

Here's the email.
I have hurriedly created a blog post titled [omitted] at the url below
[also omitted]
...
Unfortunately, I am neither an algorithm expert or a Python expert. However, I am willing to jump in. 
Please review the Python code snippets. I know that they work because I ran them using Python 2.7.6. It was the environment available on my work PC. Speed to get something to the group so that it does not disband outweighs spending time on environments. The goal is not to be Pythonic but to have anyone that has written any code follow the logic.
Also, please review the logic. Somehow, I managed to get the wrong answer. The entire blog post is a build to provide a solution to CLRS exercise 2.3-7. My analysis gave me the answer of O( {n [log n]}**2 ) and the CLRS answer is O(n [log n] ). Where did I screw up my logic?

The referenced blog post is shocking. The "neither an algorithm expert or a Python expert" is an understatement. The "willing to jump in" is perhaps a bad thing. I sent several comments. They were all ignored. I asked for changes a second time. That was also ignored. Eventually, changes were made reluctantly and only after a distressing amount of back-and-forth.

Havoc was created through a process of casually misstating just about everything that can possibly be  misstated. It transcended mere "wrong" and enters that space where the whole thing can't even be falsified. It was beyond simply wrong.

My point here is (partially) to heap ridicule on the author. More importantly, want to isolate a number of issues to show how simple things can become needlessly complex and create havoc.

The Problem

The definition of the problem 2.3-7 seems so straightforward.
"Describe a $\Theta(n \log n)$-time algorithm that, given a set $S$ of $n$ integers and another integer $x$, determines whether or not there exist two elements in $S$ whose sum is exactly $x$."
This brings us to problem #1. The blog post is unable to actually repeat the problem. Here's the quote:
x + y = x0
where
x ∈ N and y ∈ N for some finite set of integer values N
x0: the integer to which x and y sum
x != y
It's not at all clear what's going on here. What's x0? What's N? Why is x!=y even introduced?

This is how havoc starts. The requirements have been restated in a way that makes them more confusing. The original terminology was dropped in favor of random new terminology. There's no reason for restating the problem. The consequence of the bad restatement is to introduce needless features and create confusion.

The Python Nonsense

A quote:
Concepts are demonstrated via code snippets. They code snippets were executed using Python 2.7.6. They were written in such a way that anyone with basic coding skills could read the code. In other words, the goal was not to be Pythonic.
Python 2.7.6 has been obsolete since May of 2014. At the very least, use a current release.

The goal of using Python without being Pythonic seems to be -- well -- schizophrenic. Or it's intentional troll-bait. Hard to say. 

Another paragraph says this.
The Python community will be annoyed because I am using Python 2 and not 3. Their annoyance is appropriate. Unfortunately, I only have Windows machines and can't afford to screw them up at this point in time.
What? That makes no sense at all. It's trivial to install Python 3.6 side-by-side with Python 2. Everyone should. Right now. I'll wait. See https://conda.io/docs/. Start here: https://conda.io/miniconda.html.

If you're going to insist on using the quirky and slow Python 2, you absolutely must use this in all of your code:

from __future__ import print_function, division, absolute_import, unicode_literals

Python 2 code without this is wrong. If you're still using Python 2, add this to all your code, right now. Please. You'll have to fix stuff that breaks; but we'll all thank you for it. pylint --py3k will help you locate and fix this.

The code with a -2/10 pylint score

I'm trying to reproduce this faithfully. It's hard, because the original blog post has issues with layout.

SomeIntegerList = [1, 2, 3, 4, 5, 6]
DesiredSumOfIntegers = 11
for SomeIntegerA in SomeIntegerList:
 for SomeIntegerB in SomeIntegerList:
  if SomeIntegerA == SomeIntegerB: continue
        SumOfIntegers = SomeIntegerA + SomeIntegerB
        print "SomeInteger A = ", SomeIntegerA, ", SomeInteger B = ", SomeIntegerB, ", Sum of Integers = ", SumOfIntegers
  if DesiredSumOfIntegers == SumOfIntegers:
      print "DesiredSumOfIntegers = ", DesiredSumOfIntegers, " was found"

(The original really could not be copied and pasted to create code that could even be parsed. I may have accidentally fixed that. I hope not.)

Almost every line of code has a problem. It gets worse, of course.

There's output in the original blog post that provides a hint as to what's supposed to be happening here.

Addition is Commutative

Yes. There is an entire paragraph plus a spreadsheet which proves that addition is commutative. An. Entire. Paragraph. Plus. A. Spreadsheet.

Meanwhile, factorial, multiplication, and division aren't mentioned. Why do we need a spreadsheet to show that addition is commutative, yet, all other operators are ignored? No clue. Moving on.

Permutations

A quote:
Now, let's talk about the number of computations involved in using nested for loops to examine all the possible addition permutations. Here I am using the term permutation as it is strictly defined in mathematics.
First. The algorithm uses all combinations. $\textbf{O}(n^2)$.

Second. "as it is strictly defined in mathematics" should go without saying. If you feel the need to say this, it calls the entire blog post into question.

It's like "honestly." Anyone who has to establish their honesty with "can I be honest with you?" is still lying.

If we're being strict here, are we not being strict elsewhere? If we're not being strict, why not?

The algorithm enumerates all combinations of n things taken 2 at a time without replacement. For reasons that aren't clear. The original problem statement permits replacement. The restatement of the problem doesn't permit replacement.

The n things taken r or 2 at a time problem

There's a table with values for $\frac{n!}{(n-r)!}$

No hint is given as to what this table is or why it's here.  I think it's supposed to be because of this:

$\frac{n!}{r!(n-r)!} \text{ with } r=2 \equiv \frac{n!}{2(n-2)!} \equiv \frac{n\times(n-1)}{2}$

It's hard to say why commutativity of addition gets a paragraph, but this gets no explanation at all. To me, it shows a disregard for the reader: the reader doesn't understand addition, but they totally get factorial.

Another Perspective

A quote
Another perspective is to note that the nested for loops result in O(n^2). Clearly, the above approach is not scalable.
That's not "another perspective." That's. The. Point. The entire point of the exercise is that the brute force algorithm isn't optimal.

The Worst Code Snippet Ever

This is truly and deeply shocking.

SomeIntegerList = [1, 2, 3, 4, 5, 6]
DesiredSumOfIntegers = 11
i = 0
for SomeIntegerA in SomeIntegerList:
    i = i + 1
    j = 0
 for SomeIntegerB in SomeIntegerList:
        j = j + 1
        if j > i:
            print "i = ", i, ", j = ", j
            SumOfIntegers = SomeIntegerA + SomeIntegerB
            print "SomeInteger A = ", SomeIntegerA, ", SomeInteger B = ", SomeIntegerB, ", Sum of Integers = ", SumOfIntegers
      if DesiredSumOfIntegers == SumOfIntegers:
          print "DesiredSumOfIntegers = ", DesiredSumOfIntegers, " was found"


This is what drove me over the edge. This is unconscionably evil programming. It transcends mere "non-Pythonic" and reaches a realm of hellish havoc that can barely be understood as rational. Seriously. This is evil incarnate.

This is the most baffling complex version of a half-matrix iteration that I think I've ever seen. I can only guess that this is written by someone uncomfortable with thinking. They copied and pasted a block of assembler code changing the syntax to Python. I can't discern any way to arrive at this code.

The Big-O Problem

This quote:
Even though the number of computations is cut in half
The rules for Big-O are in the cited CLRS book.  $\textbf{O}(\frac{n^2}{2}) = \textbf{O}(n^2)$.

The "cut in half" doesn't count when describing the overall worst-case complexity. It needs to be emphasized that "cut in half" doesn't matter. Over and over again.

This code doesn't solve the problem. It doesn't advance toward solving the problem. And it's unreadable. Maybe it's a counter-example? An elaborate "don't do this"?

The idea of for i in range(len(S)): for j in range(i): ... seems to be an inescapable approach to processing the upper half of a matrix, and it seems to be obviously $\textbf{O}(n^2)$.

The Binary Search

This quote is perhaps the only thing in the entire blog post that's not utterly wrong.
we can compute the integer value that we need to find. We can than do a search over an ordered list for the integer that we need to find.
Finally. Something sensible. Followed by more really bad code.

The code starts with this

def binarySearch(alist, item):

instead of this

from bisect import bisect

Why does anyone try to write code when Python already provides it?

There's more code, but it's just badly formatted and has a net pylint score that's below zero. We've seen enough.

There's some further analysis that doesn't make any sense at all:
Since the integers that sum must be distinct, the diagnol on the matrix have values of N/A
And this:
Secondly, we should remove the integer that we are on from the binary search
This is a consequence of the initial confusion that decided that $x \neq y$ was somehow part of the problem. When it wasn't. These two sentences indicate a level of profound confusion about the essential requirements. Which leads to havoc.

Added Complication

The whole story is pretty badly confused. Then this arrives.
Complicate Problem by Having Integer List Not Sorted
It's not clear what this is or why it's here. But there it is. 

It leads eventually to this, which also happens to be true. 
The total computation complexity is O(2 * n [log n] ) = O(n [log n] )
That's not bad. However. The email that asked for help claimed O( {n [log n]}**2 ). I have no idea what the email is talking about. Nor could I find out what any of this meant. 

The Kicker

The kicker is some code that solves the problem in $\textbf{O}(n)$ time. Without using a set, which is interesting.

This was not part of the CLRS exercise 2.3-7. I suppose it's just there to point out something about something. Maybe it's a "other people are smarter than CLRS"? Or maybe it's a "just google for the right answer without too much thinking"? Hard to say.

A sentence or two of introduction might be all that's required to see why the other result is there.

Lessons Learned

Some people like to add complexity to the problem. The $x \neq y$ business is fabricated from thin air. It adds to the code complexity, but is clearly not part of the problem space.

This creates havoc. Simple havoc.

Some people appear to act like they're asking for help. But they're not. They may only want affirmation. A nice pat on the head. "Yes, you've written a blog post." Actual criticism isn't expected or desired. This is easy to detect by the volume and vehemence of the replies.

Given a list of numbers, S, and a target, x, determine of two values exist in the set that sum to x.

>>> S = [1,2,3,4,5,6]
>>> x=11
>>> [(n, x-n) for n in S if (x-n) in S]
[(5, 6), (6, 5)]
>>> bool([(n, x-n) for n in S if (x-n) in S])
True

This follows directly from the analysis. It doesn't add anything new or different. It just uses Python code rather than indented assembler.

This first example is $\textbf{O}(n^2)$ because the in operator is applied to a list. We can, however, use bisect() instead of the in operator.

>>> [(n, x-n) for n in S if S[bisect(S, (x-n))-1] == x-n]
[(5, 6), (6, 5)]
>>> x=13
>>> [(n, x-n) for n in S if S[bisect(S, (x-n))-1] == x-n]
[]

This achieves the goal -- following the parts of the analysis that aren't riddled with errors -- without so much nonsensical code.

This does require some explanation for what bisect(S, i) does. It's important to note that the bisect() function returns the position at which we should insert a new value to maintain order. It doesn't return the location of a found item. Indeed, if the item isn't found, it will still return a position into which a new item should be inserted.

If we want this to be $\textbf{O}(n)$, we can use this:

>>> S = [1,2,3,4,5,6]
>>> S_set = set(S)
>>> x=11
>>> bool([(n, x-n) for n in S_set if (x-n) in S_set])
True

This replaces the linear list with a set, S_set. The (x-n) in S_set operation is $\textbf{O}(1)$, leading to the overall operation being $\textbf{O}(n)$.

If you want to shave a little time, you can use any() instead of bool([]). If you're not returning the pairs, you can reduce it to any(x-n in S_set for n in S_set). Try it with timeit to see what the impact is. It's surprisingly small.

Tuesday, March 29, 2016

The Data Structures and Algorithms Problem

Here's a snippet of an email
In big data / data science, the curse of dimensionality keeps showing up over and over. A good place to start is the wiki article “curse of dimensionality.” The issue seems to be that a lot of these big data / data science people have not taken the time to study fundamental data structures.
There was more about Foundations of Multidimensional and Metric Data Structures by Hanan Samet being too detailed. And Stack Overflow being too high-level.  And more hand-wringing after that, too.

The email was pleading for some book or series of blog posts that would somehow educate data science folks on more fundamental issues of data structures and algorithms. Perhaps getting them to drop some dimensions when doing k-NN problems or perhaps exploit some other data structure that didn't involve 100's of columns.

I think.

I'm guessing because -- like a lot of hand-waving emails -- it didn't involve code. And yes, I'm very bigoted about the distinction between code and hand-waving.

If there is a lack of awareness of appropriate data structures, the real place to start is The Algorithm Design Manual by Steven Skiena.

I harbor my doubts that this is the real problem, however. I think that the broad spectrum of computing applications leads to a lot of specialization. I don't think that it's really prudent to try and think of generalists who can handle deep data science issues as well as algorithm design and performance issues. No one expects them to write JavaScript and tinker with CSS so that the web site which presents the results looks good.

I actually think the real problem is that some folks expect too much from their data scientists.

In fantasy land the rock stars are full stack developers who can span the entire spectrum from OS to CSS. In the real world, developers have different strengths and interests. In some cases, "full stack" means mediocre skills in a lot of areas.

Here's a more useful response: Bridging the Gap Between Data Science and DevOps. I don't think the problem is "big data / data science people have not taken the time to study fundamental data structures". I think the problem is that big data is a cooperative venture. It takes a team to solve a problem.

Thursday, February 5, 2015

Idempotence, Hysteresis and Determinism

Three terms that seem to cause confusion: Idempotence, Hysteresis and Deterministic. The subject came up during my webcast on the Five Kinds of Python Functions. We can use all three terms to describe a function. Two of them are relevant to common design questions in software. The third is a higher-order architectural consideration, and not really part of writing a function definition in Python.

Idempotent -- narrowly defined --  means that f(f(x)) = f(x). In computer science, the meaning is stretched so that we can distinguish functions like random number generators from other functions. A random number generator (Python's random.random(), for example) is not idempotent. It returns a different result each time it's called. Many other functions are idempotent because they always return the same result given the same arguments. The functions in Python's os module are not idempotent. The results change based on external events.

Hysteresis is memory of previous events. A random number generator may have some hidden hysteresis so that it can return the next value in it's random sequence. Note that os.random() is explicitly based on /dev/random; it involves hysteresis where entropy is accumulated. A function that has an internal memoization cache has hysteresis; it will compute subsequent results more quickly after having memorized previous results.

Most functions are simply idempotent and don't generally involve any hysteresis. The result value is a  fixed mapping from the argument values.

A memoization cache preserves idempotence while adding hysteresis. The functools.lru_cache decorator, for example, adds hysteresis to an otherwise idempotent function. The result value still reflects a fixed mapping.

A random number generator cannot have idempotence; it will leverage hysteresis. We should think of a random number generator as an iterator through a sequence of numbers. Given a seed, the sequence is entirely fixed. Between any two numbers, it's very difficult to predict one from the other without also knowing the hidden seed value.

Unrelated Concept

We use idempotence and hysteresis to describe programming language features. These features are entirely deterministic.  There's no room for arbitrary, undefined, unpredictable behavior. Any non-determinism needs to be very tightly boxed in by otherwise deterministic programming constructs.

When writing a Python function, we assume that the language is deterministic. Without this assumption, it's hard to conceive of what the language would be like. What if statements were executed out of the order we wrote them?

External events -- interrupts and the like -- are non-deterministic. They can happen at any time, with no relationship to what's going on inside the software world. Inside the software world, however, we expect that everything is deterministic. This means that a server must always cope with non-deterministic request ordering. Once request processing starts, however, we generally rely on essential non-deterministic software to process the results perfectly consistently.

An important example of bounded non-determinism is in Dijksta's hypothetical programming language described in A Discipline of Programming. Here there is explicit non-determinism among the "elif" and "eldo" clauses. The selection among true alternatives was specifically non-deterministic. Indeed, an evil demon would always strive to select the worst possible choice. There was no "first one that's true" kind of silliness that would allow certain kind of logic errors to survive.

A multiprocessing application leverages the OS to keep all processes separate. Each process can then operate deterministically. Any non-determistic OS events are concealed from the process by the normal OS libraries that generally queue up events into buffers.

A multithreaded application, however, has to use some kind of explicit synchronization to handle the inherent non-determinism of thread scheduling. Thread libraries make no guarantees about the exact sequence of operations between threads; the execution is non-deterministic between threads.

For real fun, read about the non-deterministic memory write orders. The Data Race article from Intel is illuminating. Google "non-deterministic memory write order" for interesting reading on how processors have gotten -- perhaps -- too complex to be completely trustworthy.

This is different, also, from "arbitrary." A word that describes how the C language deals with constructs like a[i]= i++;. There are two unrelated state changes that happen in this statement. The order of those two things is best described as "arbitrary." It's deterministic. But it's not well defined by the language. Depending on the compiler and optimization settings, it will be entirely fixed. A function that uses these constructs could be idempotent. However, the outcome can vary from compiler to compiler and optimization setting to optimization setting. This kind of thing is devoutly to be avoided. It's presence is a less-than-ideal language design choice; writing software around a bad language feature is simply inexcusable.

Thursday, August 21, 2014

Permutations, Combinations and Frustrations

The issue of permutations and combinations is sometimes funny.

Not funny weird. But, funny "haha."

I received an email with 100's of words and 10 attachments. (10. Really.) The subject was how best to enumerate 6! permutations of something or other. With a goal of comparing some optimization algorithm with a brute force solution. (I don't know why. I didn't ask.)

Apparently, the programmer was not aware that permutation creation is a pretty standard algorithm with a standard solution. Most "real" programming languages have libraries which already solve this in a tidy, efficient, and well-documented way.

For example

https://docs.python.org/3/library/itertools.html#itertools.permutations

I suspect that this is true for every language in common use.

In Python, this doesn't even really involve programming. It's a first-class expression you enter at the Python >>> prompt.

>>> import itertools
>>> list(itertools.permutations("ABC"))
[('A', 'B', 'C'), ('A', 'C', 'B'), ('B', 'A', 'C'), ('B', 'C', 'A'), ('C', 'A', 'B'), ('C', 'B', 'A')]

What's really important about this question was the obstinate inability of the programmer to realize that their problem had a tidy, well understood solution. And has had a good solution for decades. Instead they did a lot of programming and sent 100's of words and 10 attachments (10. Really.)

The best I could do was provide this link:

Steven Skiena, The Algorithm Design Manual

It appears that too few programmers are aware of how much already exists. They plunge ahead creating a godawful mess when a few minutes of reading would have provided a very nice answer.

Eventually, they sent me this:

http://en.wikipedia.org/wiki/Heap's_algorithm

As a grudging acknowledgement that they had wasted hours failing to reinvent the wheel.

Saturday, August 9, 2014

Some Basic Statistics

I've always been fascinated by the essential statistical algorithms. While there are numerous statistical libraries, the simple measures of central tendency (mean, media, mode, standard deviation) have some interesting features.

Well.  Interesting to me.

First, some basics.


def s0( samples ):
    return len(samples) # sum(x**0 for x in samples)

def s1( samples ):
    return sum(samples) # sum(x**1 for x in samples)

def s2( samples ):
    return sum( x**2 for x in samples )

Why define these three nearly useless functions? It's the cool factor of how they're so elegantly related.

Once we have these, though, the definitions of mean and standard deviation become simple and kind of cool.

def mean( samples ):
    return s1(samples)/s0(samples)

def stdev( samples ):
    N= s0(samples)
    return math.sqrt((s2(samples)/N)-(s1(samples)/N)**2)

It's not much, but it seems quite elegant. Ideally, these functions could work from iterables instead of sequence objects, but that's impractical in Python. We must work with a materialized sequence even if we replace len(X) with sum(1 for _ in X).

The next stage of coolness is the following version of Pearson correlation. It involves a little helper function to normalize samples.


def z( x, μ_x, σ_x ):
    return (x-μ_x)/σ_x


Yes, we're using Python 3 and Unicode variable names.

Here's the correlation function.

def corr( sample1, sample2 ):
    μ_1, σ_1 = mean(sample1), stdev(sample1)
    μ_2, σ_2 = mean(sample2), stdev(sample2)
    z_1 = (z(x, μ_1, σ_1) for x in sample1)
    z_2 = (z(x, μ_2, σ_2) for x in sample2)
    r = sum( zx1*zx2 for zx1, zx2 in zip(z_1, z_2) )/len(sample1)
    return r

I was looking for something else when I stumbled on this "sum of products of normalized samples" version of correlation. How cool is that? The more text-book versions of this involve lots of sigmas and are pretty bulky-looking. This, on the other hand, is really tidy.

Finally, here's least-squares linear regression.


def linest( x_list, y_list ):
    r_xy= corr( x_list, y_list )
    μ_x, σ_x= mean(x_list), stdev(x_list)
    μ_y, σ_y= mean(y_list), stdev(y_list)
    beta= r_xy * σ_y/σ_x
    alpha= μ_y - beta*μ_x
    return alpha, beta


This, too, was buried at the end of the Wikipedia article. But it was such an elegant formulation for least squares based on correlation. And it leads to a tidy piece of programming. Very tidy.

I haven't taken the time to actually measure the performance of these functions and compare them with more commonly used versions.

But I like the way the Python fits well with the underlying math.

Not shown: The doctest tests for these functions. You can locate sample data and insert your own doctests. It's not difficult.

Thursday, July 17, 2014

New Focus: Data Scientist

Read this: http://www.forbes.com/sites/emc/2014/06/26/the-hottest-jobs-in-it-training-tomorrows-data-scientists/

Interesting subject areas: Statistics, Machine Learning, Algorithms.

I've had questions about data science from folks who (somehow) felt that calculus and differential equations were important parts of data science. I couldn't figure out how they decided that diffeq's were important. Their weird focus on calculus didn't seem to involve using any data. Odd: wanting to be a data scientist, but being unable to collect actual data.

Folks involved in data science seem to think otherwise. Calculus appears to be a side-issue at best.

I can see that statistics are clearly important for data science. Correlation and regression-based models appear to be really useful. I think, perhaps, that these are the lynch-pins of much data science. Use a sample to develop a model, confirm it over successive samples, then apply it to the population as a whole.

Algorithms become important because doing dumb statistical processing on large data sets can often prove to be intractable. Computing the median of a very large set of data can be essentially impossible if the only algorithm you know is to sort the data and find the middle-most item.

Machine learning and pattern detection may be relevant for deducing a model that offers some predictive power. Personally, I've never worked with this. I've only worked with actuaries and other quants who have a model they want to confirm (or deny or improve.)

Thursday, July 3, 2014

Project Euler

This is (was?) an epic web site:

http://projecteuler.net/about

Currently, they're struggling with a security problem.

http://forum.projecteuler.net/viewtopic.php?f=5&t=3591

Years ago, I found the site and quickly reached Level 2 by solving a flood of easy problems.

Recently, a recruiter strongly suggested reviewing problems on Project Euler as preparation for a job interview.

It was fun! I restarted my quest for being a higher-level solver.

Then they took the solution checking (and score-keeping) features off-line.

So now I have to content myself with cleaning up my previous solutions to make them neat and readable and improve the performance in some areas.

I -- of course -- cannot share the answers. But, I can (and will) share some advice on how to organize your thinking as you tackle these kinds of algorithmically difficult problems.

My personal preference is to rewrite the entire thing in Django. It would probably take a month or two. Then migrate the data. That way I could use RST markup for the problems and the MathJax add-on that docutils uses to format math. But. That's just me.

I should probably take a weekend and brainstorm the functionality that I can recall and build a prototype. But I'm having too much fun solving the problems instead of solving the problem of presenting the problems.

Tuesday, August 28, 2012

Password Encryption -- Short Answer: Don't.

First, read this.    Why passwords have never been weaker—and crackers have never been stronger.

There are numerous important lessons in this article.

One of the small lessons is that changing your password every sixty or ninety days is farcical.  The rainbow table algorithms can crack a badly-done password in minutes.  Every 60 days, the cracker has to spend a few minutes breaking your new password.  Why bother changing it?  It only annoys the haxorz; they'll be using your account within a few minutes.  However.  That practice is now so ingrained that it's difficult to dislodge from the heads of security consultants.

The big lesson, however, is profound.

Work Experience

Recently, I got a request from a developer on how to encrypt a password.  We have a Python back-end and the developer was asking which crypto package to download and how to install it.

"Crypto?" I asked.  "Why do we need crypto?"

"To encrypt passwords," they replied.

I spat coffee on my monitor.  I felt like hitting Caps Lock in the chat window so I could respond like this: "NEVER ENCRYPT A PASSWORD, YOU DOLT."

I didn't, but I felt like it.

Much Confusion

The conversation took hours.  Chat can be slow that way.  Also, I can be slow because I need to understand what's going on before I reply.  I'm a slow thinker.  But the developer also needed to try stuff and provide concrete code examples, which takes time.

At the time, I knew that passwords must be hashed with salt.  I hadn't read the Ars Technica article cited above, so I didn't know why computationally intensive hash algorithms are best for this.

We had to discuss hash algorithms.

We had to discuss algorithms for generating unique salt.

We had to discuss random number generators and how to use an entropy source for a seed.

We had to discuss http://www.ietf.org/rfc/rfc2617.txt in some depth, since the algorithms in section 3.2.2. show some best practices in creating hash summaries of usernames, passwords, and realms.

All of this was, of course, side topics before we got to the heart of the matter.

What's Been Going On

After several hours, my "why" questions started revealing things.  The specific user story, for example, was slow to surface.

Why?

Partly because I didn't demand it early enough.

But also, many technology folks will conceive of a "solution" and pursue that technical concept no matter how difficult or bizarre.  In some cases, the concept doesn't really solve the problem.

I call this the "Rat Holes of Lost Time" phenomena: we chase some concept through numerous little rat-holes before we realize there's a lot of activity but no tangible progress.  There's a perceptual narrowing that occurs when we focus on the technology.  Often, we're not actually solving the problem.
IT people leap past the problem into the solution as naturally as they breathe. It's a hard habit to break.
It turned out that they were creating some additional RESTful web services.  They knew that the RESTful requests needed proper authentication.  But, they were vague on the details of how to secure the new RESTful services.

So they were chasing down their concept: encrypt a password and provide this encrypted password with each request.  They were half right, here.  A secure "token" is required.  But an encrypted password is a terrible token.

Use The Framework, Luke

What's most disturbing about this is the developer's blind spot.

For some reason, the existence of other web services didn't enter into this developer's head.  Why didn't they read the code for the services created on earlier sprints?

We're using Django.  We already have a RESTful web services framework with a complete (and high quality) security implementation.

Nothing more is required.  Use the RESTful authentication already part of Django.

In most cases, HTTPS is used to encrypt at the socket layer.  This means that Basic Authentication is all that's required.  This is a huge simplification, since all the RESTful frameworks already offer this.

The Django Rest Framework has a nice authentication module.

When using Piston, it's easy to work with their Authentication handler.

It's possible to make RESTful requests with Digest Authentication, if SSL is not being used.  For example, Akoha handles this.  It's easy to extend a framework to add Digest in addition to Basic authentication.

For other customers, I created an authentication handler between Piston and ForgeRock OpenAM so that OpenAM tokens were used with each RESTful request.  (This requires some care to create a solution that is testable.)

Bottom Lines

Don't encrypt passwords.  Ever.

Don't write your own hash and salt algorithm.  Use a framework that offers this to you.

Read the Ars Technica article before doing anything password-related.

Tuesday, August 21, 2012

Performance "Tuning": running in 1/100th the time

For the 757 Python Meetup group, someone proposed looking at some Python code they had which was slow.  The code implemented a variation on the Elo chess rating system.  It applied the ratings to other sports, and used the points scored as well as basic win/lose/tie to work out a ranking for sports teams.  Very clever stuff.

But.

It was described as horribly slow. Since it was only 400 lines of code, it was a great subject for review in a Python meetup.  I would be able to show some tweaks and performance tips.

Hypothetical Step 1:  Profile

My first thought was to run the profiler and see what popped up as a possible root cause of slowness.

Generally, there are three kinds of performance problems.

  • Wrong data structure.  Or wrong algorithm.  (They're simply opposite sides of the same coin.)  This generally leads to dramatic, earth-shattering improvements.
  • Piss-Poor Programming Practices.  This kind of fine-tuning yields minuscule improvements.  In some cases, no measurable change at all.
  • Bugs.  Every time I've been asked to improve "working" code, I find that it has bugs.  Every time.  These can exacerbate performance problems.  Interestingly, most of these are readily apparent during an initial survey:  i.e., while simply reading the code to see how it works.  Trying to create unit tests for the purposes of refactoring often reveals additional bugs.
I didn't know what I was going to see in this 400-line program.

Hypothetically, profiling will help show what kind of problems we have.

Prior to profiling, of course, we need to run the darn thing.  Thankfully, they provided some typical input files.  For example, 1979 High School Football season results.  159 lines of patiently gathered teams and scores.

It Didn't Run

When we tried to run it with the profiler, we found that we had a bug.  Tracking down the bug, revealed the essential performance problem, also.

The core problem was a failure to make use of Python data structures.  

This manifested itself in a number of really bad design errors.  Plus, it presented as the actual, show-stopping, serious bug.

In order to work out team rankings, the program kept a list of teams.

I'll repeat that for folks who are skimming.
In order to work out team rankings, the program kept a list of teams.
Not a dict mapping from team name to team details.  But a simple list.  Locating a team in the list meant iterating through the list, looking for the matching team.  Really.

for t in team_list:
    if t['name'] == target_name:
        process( t )

This kind of thing was repeated in more than one place.

And one of those repeats had the bug in it.

What we have here is "wrong data structure".  Replacing a list with a dict will have earth-shattering impact on performance.

The Bug

The bug, BTW, was a lookup loop which had the additional requirement of adding missing teams.  It tried to use the for-else structure.  This was the intended code (not the actual code).

for t in team_list:
    if t['name'] == target_name:
        return
else:
    t['name']= init_new_team(target_name)
    
This is a first-class bit of Python.  An else clause on a for statement is a real feature of the language.  However, it's obscure enough that it's easy to get wrong.

However, it's also unique to Python, and the kind of thing that can lead to confusion.  I discourage it's use.  

Test-Driven Reverse Engineering

We're doing TDRE on a little piece of recreational programming.  This means that we need to fabricate unit tests for code that is purported to work.  Some folks like to call this "exploratory testing" (a phrase which is contradictory, and consequently meaningless.)  We're "exploring".  We're not "testing".

Once the core bug is fixed, we can run sample data through the application to get "big picture" results.  We can extract bits from the sample data and exercise various functions in isolation to determine what they actually do now, bugs included.

Since this is so simple--and we're going to review it during a 2-hour meet up--and there's nothing billable going on--we can get by with a few really simple unit tests.  We'll run the darn thing once to create some expected output.  We save the output and use that single output log as the expected results from each refactoring step.

More complex applications require more unit tests.  For a billable TDRE effort last year, I had to create 122 unit tests to successfully refactor a program of over 6,000 lines of code.  It took several weeks.

Real Step 1: Fix The Data Structure

Before profiling (but after running to capture some output) we have to fix the essential data structure.  Repeated scanning of a list is never going to perform well.
The whole point of the study of "Data Structures" is to prevent (or optimize) search.
In this case, we can prevent linear search of a list by using a dictionary.  That, clearly, is job one.

It was a pervasive rewrite.   Essentially, everything in this little program included a goofy loop to lookup a team in the list.  The Elo algorithm itself, which is already O(n2), is dragged down by using the linear search for the opposing team four more times, making it O(n3).

Cyclomatic Complexity

One of the big "issues" is the use of if statements throughout the scoring algorithm.  An if statement creates Cyclomatic Complexity and can lead to performance problems.  Generally, if statements should be avoided.

This algorithm applies some normalization factors to reconcile scoring with win/loss numbers in different sports.  Basketball, specifically, involves generally high scores.  Since there are 2-point and 3-point scoring opportunities, a factor is used to normalize the points into "goals".  Football, similarly, has numerous scoring opportunities with values of 1, 2, 3 and 6 points; the scores here are also normalized.

This normalization was done with an if statement that was evaluated deep inside the Elo algorithm.  Repeatedly. Evaluated.

The two functions that handled the normalizations, plus the normalization factors, are ideal candidates for OO design.  There's a clear hierarchy of classes here.  A superclass handles most sports, and two polymorphic subclasses handle football and basketball normalization.

The if statement is now "pushed up" to the very beginning of the program where an instance of the sports normalization object is created.  This object's methods are then used by the Elo algorithm to normalize scores.

Icing on the Cake

Once we've fixed the bug and replaced a list with a dict, everything else is merely icing.
Some other OO changes.
  1. The "Team" information should not be a flat, anonymous dictionary.  It should be a proper class definition with proper attributes.  There aren't many methods, so it's easy to create. 
  2. The "Game" information is read by csv.DictReader.  However, it should not remain a simple, anonymous dict.  As with a Team, a simple class can be created to handle Game.
  3. The overall structure of the application needs to be broken into two sections.  The command-line interface parses options, opens files, and generally gets everything set up.  The actual ranking algorithm should be a function that is given an open file-like object plus the Sport object for normalization.  This allows the ranking algorithm to be reused in other contexts than the command-line (i.e. a web service).
A more subtle OO design point is the question of "mutability".  A Team in this application is little more than a name.   There are also numerous "stateful" values that are part of the Elo algorithm.   A Game, similarly, is an immutable pair of teams and scores.  However, it has some mutable values that are part of the Elo algorithm.  

Really, we have immutable Team and GameHistory objects, plus a few values that are used as part of the Elo calculation.  I'm a big fan of disentangling these mutable and immutable objects from each other.  

I suspect that the Elo algorithm doesn't really need to update the "state" of an object.  I suspect that it actually creates (and discards) a number of immutable candidate ranking objects.  The iteration that leads to convergence might be a matter of object creation rather than update.  I didn't make this change, since it required real work and we were out of time.

Bottom Line

The more times I do TDRE to improve performance, the more I realize that it's all about bugs and data structures.

This recreational application took 45-60 seconds to process one year's record of games for a given league.  It now takes less than 0.2 seconds to do the same volume of work.  Two test cases involving a complete run of 159 records runs in 0.411 seconds.  That's 1/100th the time simply from switching data structures.

The idea of "tweaking" a working program to improve performance is generally misleading.  It might happen, but the impact is minuscule at best.  

Here's the checklist for getting 100:1 improvements.
  • Remove searches.  
  • Remove deeply-nested if statements.  
Generally, reduce Cyclomatic Complexity.

Tuesday, April 5, 2011

Performance Discussions and Software Design

Read this first: "There is something I find interesting about online discussions around performance issues..." It's about Stack Overflow, specifically. Apparently, someone didn't get their question answered and decided it was better to gripe than to rewrite the question.

Let's look at their response in pieces.

"people try to gang up". Since there's almost no social networking capability, this is a bit much to attribute to people responding to a poorly-worded question. But, if you've worked all day on a bad solution to a poorly-conceived problem, it can feel like being ganged up on. When reality leaks in, it can feel unpleasant.

Hint 1. There are no gangs. It's possible that the question really is poorly written.

"cookie-cutter, patronizing, zero-information responses". I'm guessing these are comments suggesting the approach is bad and asking for clarification. I run afoul of this often because I feel compelled to post comments asking for clarification. Some folks just don't like to clarify. More than once I've been told that their question was very clear. Since I'm asking for clarification, it seems odd to insist the question is perfect. Worse, of course, is asking for help on Stack Overflow, but refusing to clarify the help required.

Hint 2. Clarify. Please. Don't insist that the question is perfect.

"they assume, without any basis, that the person has not (a) benchmarked the code," When the question has no bench mark data, this isn't an assumption. It's a response to the lack of benchmark data.

Hint 3. Provide the facts. Don't complain when folks ask for facts.

"(b) is obviously running an inferior algorithm". Again, this isn't an assumption. It's the response to an incomplete question where the algorithm isn't provided. Also, it's a common response to questions where the algorithm really is inferior.

Hint 4. Consider that -- even after spending days banging your head against the wall -- your question might be poorly-written and require both benchmark data and an algorithm.

"advice about premature optimization... is a well assimilated folklore by now and I dont see how repeating that adds value". Without measurements, profiler results and benchmark data, this is our only possible response. After the profiler results are posted, this advice really is useless. Before profiler results are posted, this advice often turns out to be essential.

Hint 5. Whatever you might know is not well-assimilated folklore on Stack Overflow as a whole. We don't know you, sadly. We don't know how much you know. To avoid useless advice, provide evidence -- in the question -- that the advice has already been followed.

"better served if the discussion shifted to ... Pointed out possible bottlenecks ahead of time," Wouldn't that be nice? What's a "possible bottleneck"? It's a badly-design algorithm. So, the responses to performance questions has to be focused on algorithm choice right away. That means details on the code being used, and profiling information.

Hint 6. There is no hint 6. This would simply repeat hints 3 and 5.

"regardless of the fact whether the code construct is actually a bottleneck in the application or not, it is always good to know what the more efficient alternatives are... there is something called intellectual curiosity."

Reducing a question to a hand-waving hypothetical doesn't improve the question. It doesn't rationalize a poor question. The question still needs to be clarified.

Hint 7. If the question raises a lot of comments and useless advice, please rewrite the question.

Tuesday, December 28, 2010

Amazing Speedup

A library had unit tests that ran for almost 600 seconds. Two small changes dropped the run time to 26 seconds.

I was amazed.

Step 1. I turned on the cProfile. I added two methods to the slowest unit test module.

def profile():
    import cProfile
    cProfile.run( 'main()', 'the_slow_module.prof' )
    report()

def report():
    import pstats
    p = pstats.Stats( 'the_slow_module.prof' )
    p.sort_stats('time').print_callees(24)
Now I can add profiling or simply review the report. Looking at the "callees" provided some hints as to why a particular method was so slow.

Step 2. I replaced ElementTree with cElementTree (duh.) Everyone should know this. I didn't realize how much this mattered. The trick is to note how much time was spent doing XML parsing. In the case of this unit test suite, it was a LOT of time. In the case of the overall application that uses this library, that won't be true.

Step 3. The slowest method was assembling a list. It did a lot of list.append(), and list.__len__(). It looked approximately like the following.


def something( self ):
result= []
for index, value in some_source:
    while len(result)+1 != index:
        result.append( None )
    result.append( SomeClass( value ) )
return result

This is easily replaced by a generator. The API changes, so every use of this method function may need to be modified to use the generator instead of the list object.


def something_iter( self ):
 counter= 0
 for index, value in some_source:
     while counter+1 != index:
         yield None
         counter += 1
     yield SomeClass( value )
     counter += 1

The generator was significantly faster than list assembly.

Two minor code changes and a significant speed-up.