Moved

Moved. See https://slott56.github.io. All new content goes to the new site. This is a legacy, and will likely be dropped five years after the last post in Jan 2023.

Showing posts with label functional python programming. Show all posts
Showing posts with label functional python programming. Show all posts

Tuesday, November 29, 2022

Functional Programming and Finite State Automata (FSA)

When I talk about functional programming in Python, folks like to look for place where functional programming isn't appropriate. They latch onto finite-state automata (FSA) because "state" of an automata doesn't seem to fit with stateless objects used in functional programming.

This is a false dichotomy. 

It's emphatically false in Python, where we don't have a purely functional language.

(In a purely functional language, monads can help make FSA's behave properly and avoid optimization. The use of a recursion to consume an iterable and make state transitions is sometimes hard to visualize. We don't have these constraints.)

Let's look at a trivial kind of FSA: the parity computation. We want to know how many 1-bits are in a given value. Step 1 is to expand an integer into bits.

def bits(n: int) -> Iterable[int]:
    if n < 0:
        raise ValueError(f"{n} must be >= 0")
    while n > 0:
        n, bit = divmod(n, 2)
        yield bit

This will transform a number into a sequence of bits. (They're in order from LSB to MSB, which is the reverse order of the bin() function.)

>>> list(bits(42))
[0, 1, 0, 1, 0, 1]

Given a sequence of bits, is there an odd number or an even number? This is the parity question. The parity FSA is often depicted like this:

When the parity is in the even state, a 1-bit transitions to the odd state. When the parity is in the odd, a 1-bit transitions to the even state.

Clearly, this demands the State design pattern, right?

An OO Implementation

Here's a detailed OO implementation using the State design pattern.

 
class Parity:
    def signal(self, bit: int) -> "Parity":
        ...


class EvenParity(Parity):
    def signal(self, bit: int) -> Parity:
        if bit % 2 == 1:
            return OddParity()
        else:
            return self


class OddParity(Parity):
    def signal(self, bit: int) -> Parity:
        if bit % 2 == 1:
            return EvenParity()
        else:
            return self


class ParityCheck:
    def __init__(self):
        self.parity = EvenParity()

    def check(self, message: Iterable[int]) -> None:
        for bit in message:
            self.parity = self.parity.signal(bit)

    @property
    def even_parity(self) -> bool:
        return isinstance(self.parity, EvenParity)

Each of the Parity subclasses implements one of the states of the FSA. The lonely signal() method implements state-specific behavior. In this case, it's a transition to another state. In more complex examples it may involve side-effects like updating a mutable data structure to log progress.

This mapping from state to diagram to class is pretty pleasant. Folks really like to implement each state as a distinct class. It somehow feels really solid.

It's import to note the loneliness of the lonely signal() method. It's all by itself in that big, empty class.

Hint. This could be a function.

It's also important to note that this kind of design is subject to odd, unpleasant design tweaks. Ideally, the transition is *only* done by the lonely signal() method. Nothing stops the unscrupulous programmer from putting state transitions in other methods. Sigh.

We'll look at more complex kinds of state transitions later. In the UML state chart diagrams sates may also have entry actions and exit actions, a bit more complex behavior than we we're showing in this example.

A Functional Implementation

What's the alternative? Instead of modeling state as an object with methods for behavior, we can model state as a function. The state is a function that transitions to the next state.

def even(bit: int) -> ParityF:
    if bit % 2 == 1:
        return odd
    else:
        return even


def odd(bit: int) -> ParityF:
    if bit % 2 == 1:
        return even
    else:
        return odd


def parity_check(message: Iterable[int], init: ParityF = None) -> ParityF:
    parity = init or even
    for bit in message:
        parity = parity(bit)
    return parity


def even_parity(p: ParityF) -> bool:
    return p is even

Each state is modeled by a function.

The parity_check() function examines each bit, and applies the current state function (either even() or odd()) to compute the next state, and save this as the vakue of the parity variable.

What's the ParityF type? This:

from typing import Protocol


class ParityF(Protocol):
    def __call__(self, bit: int) -> "ParityF":
        ...

This uses a Protocol to define a type with a recursive cycle in it. It would be more fun to use something like ParityF = Callable[[int], "ParityF"], but that's not (yet) supported.

Some Extensions

What if we need each state to have more attributes?

Python functions have attributes. Like this: even.some_value = 2; odd.some_value = 1. We can add all the attributes we require.

What about other functions that happen on entry to a state or exit from a state? This is trickier. My preference is to use a class as a namespace that contains a number of related functions.

class Even:
    @staticmethod
    def __call__(bit: int) -> ParityF:
        if bit % 2 == 1:
            odd.enter()
            return odd
        else:
            return even
    @staticmethod
    def enter() -> None:
        print("even")

even = Even()

This seems to work out well, and keeps each state-specific material in a single namespace. It uses static methods to follow the same design principle as the previous example -- these are pure functions, collected into the class only to provide a namespace so we can use odd.enter() or even.enter().

TL;DR

The State design pattern isn't required to implement a FSA.

Tuesday, November 15, 2022

Generators as Stacks of Operations

See https://towardsdatascience.com/building-generator-pipelines-in-python-8931535792ff 

I'm delighted by this article. 

I was shown only the first, horrible, example. I think the idea was to push back on the idea of complex generators. I fumed.

Then I read the entire article.

Now I'm fuming at someone who posted the first example -- apparently having failed to read the rest of the post.

This idea of building a stack of iterators is very, very good.

The example (using simple operations) can be misleading. A follow-on example doing something like file parsing might be helpful. But, if you go too far, you wind up writing an entire book about Functional Programming in Python.

Tuesday, October 25, 2022

Some Functional Programming in Python material

This is bonus content for the forthcoming Functional Python Programming 3rd edition book. It didn't make it into the book because -- well -- it was just too much of the wrong kind of detail.

See this "Tough TCO" document for some thoughts on Tail-Call Optimization that can be particularly difficult. This isn't terribly original, but I think it's helpful for folks working through more complex problems from a functional perspective.

"Why a PDF?" I've been working with with LaTeX, and the switching to other ways of editing and presenting code seemed like too much work.

Tuesday, July 6, 2021

A Python Roadmap

An interesting tweet. The  roadmap has three sections. I'm not sure this is actually complete, or even grouped correctly. It is a very good list of topics.

Foundations

I want start by quibbling about variables being first. I'm not sold on this.

I think that operators, expressions, and the built-in immutable types are foundational. int, float, str, and tuple are hugely important as core concepts in computing and Python.

I also think that "loops" is a sketchy notion and I kind of wish we wouldn't describe for and while statements as "loops". I think we should call them iterations. They implement two kinds of logical quantifiers "for all" and "there exists." I think we should talk about the final result of a for statement: all of the values in a range are processed. Similarly a for-if-break construct establishes a "for exists" that defines the first value in a range for which a condition is met. And yes, range objects will be central.

I think that a huge amount of programming can be covered with these topics. I'm not sure "basic" is the right term; foundations might be a better idea. 

The use of variables to manage state is part of this. But. Variables, assignment, and state change are a bit more advanced and maybe shouldn't be first.

I also think function definitions are foundational. Mathematics has been defining functions based on other functions. It's a way of providing a mental short-hand for complex concepts. I don't need to know all the details of how to compute a square root to make use of square root as a concept.

The wide varieties of assignment statements, including assignment to decompose collections aren't mentioned in the original post. This may be an important omission, causing me to quibble on "complete."

I agree that files and elements of File IO are part of this foundation. If we limit ourselves to reading and writing files, then they're essentially immutable structures. I think we can safely avoid update-in-place files because this is an application topic more than a language topic. Python offers the minimal level of support via seek and tell, but little more. And most modern application relies on a database for updatable files.

Data Structures

Moving from basic to intermediate. I prefer the term "data structures" which are built on the language foundations. I think that the mutable built-in data structures come next in the roadmap. My preference is to omit terms like Object-Oriented or Functional, and focus on list, dict, and set, and how the iteration works. This means comprehensions and generators are part of this essential data structure section.

No, comprehensions aren't and shouldn't be called "advanced." They're very much a core concept. Thinking about statements to implement a map/filter/reduce over a collection is the essence of a great deal of programming. We don't always learn it that way, but it needs to be presented in that framework even to beginners. A pile of for and if statements and a bunch of variables is a programmer's first step toward a simpler comprehension. In both cases, they're doing a mapping and it needs to be described as mapping one collection to another collection.

This is where the standard library collections module is introduced.  Yes it's part of the library. I think it's too central to be ignored. I think dataclasses belong here, too.

Talking about the mutable data structures means revisiting the for statement and using it on a variety of iterables. The way Python's concepts apply to a variety of data types is an important feature of the language. (In the olden days, they used to talk about "orthogonality" of data and processing; we don't need to dwell on it, but I think it helps to acknowledge it.)

Functional Programming

It appears to me that the functional programming topics can come next. The idea of functional composition via higher-order functions and decorators builds on the existing foundation. This is where map() and filter() belong. Because of the way sorted(), max(), and min() work on collections with a key= function, these are part of the functional programming roadmap. The inconsistency between map() and functions like max() is an important thing to note.

I also think itertools belongs here. We can make the case that it's in the standard library, but then, so is io. I think itertools and functools are as central to practical Python as the math module and collections.

I think typing.NamedTuple and dataclasses belong here, also. A frozen dataclass is stateless, and can be helpful when creating list comprehensions to perform a mapping from one collection to another collection.

Object-Oriented Programming

I think OO programming and related concepts build on the previous material. Class definitions and state management aren't simple, even though they're essential parts of Python.

To an extent, OO programming can be decomposed into two layers. While I hate to overuse "foundation", there seem to be two parts:

OO Foundations -- inheritance, composition, and different kinds of delegation. This tends to expose a number of common design patterns like Strategy, Decorator, and Facade.

OO Features -- this includes metaprogramming, decorators, ABC's, mixins, and the like. These topics are all designed to avoid copy-and-paste in sophisticated edge cases that cross class boundaries.

Concurrency

I'm not sure why concurrency and parallelism are separate topics in the original list. I've had folks try to split this hair a number of ways. The idea is to find a place where async lives that's "concurrency lite" or something.

The concepts here become blurry because threads and processes are OS features, not language features. The async/await language features, however, are clearly part of Python. It becomes particularly awful when working on something practical where asyncio doesn't provide the feature you need. Specifically, blocking file system I/O isn't part of asyncio and requires an explicit appeal to the underlying thread pool for the blocking operation. 

To an extent, async/await needs to be on the roadmap. It's tricky, though, to cover this without also digressing into threads as a way to deal with blocking operations.

Test, Integration, and Deployment

This is where tools show up. This is where pip, unittest, pytest, tox/nox, coverage, etc. live. Are these part of the language? Or are the part of the broader ecosystem?

I submit they're explicitly not part of the language. The roadmap ends just before this topic. The idea is that we should have a Python roadmap that uses the language and the standard library.

Once we've talked about the language (and some of the library) we can move on to pip and packaging. I don't think pip is and "intermediate" topic. I find that premature introduction of pip is a sign of trying to create useful interesting examples. Examples that don't use pip wind up being kind of boring. Everyone wants to play with pygame and pillow and other kinds of projects, but, those aren't foundational to the language. They're interesting and appealing and -- frankly -- a lot of fun.

tl;dr

I'm not a fan of the roadmap. I like some of it. I don't like some of it.

I am a fan of presenting the idea for discussion.

Tuesday, November 3, 2020

"Python doesn’t do tail recursion" -- wait, what?

Yes, that's what the email said.

I was -- well -- shocked. Too shocked to be polite. Content Warning: much snark follows.

BLUF: Tail Recursion is not Tail Recursion Optimization.

Eventually, it became clear they were worried about tail recursion optimization. Maybe I'm too focused on words, but I think words matter. The initial claim was so clearly wrong, I had to challenge it. It took three *more* emails to get to the optimization point.

Hence my overload of snark. Three emails to discover they didn't see the word "optimization."

Tail Recursion

Here's an example. (I wish questions included example code instead of weird assertions that are clearly false.)

def f(x: int) -> int:
    if x == 0: return 1
    return x*f(x-1)

This function demonstrates tail recursion. The "tail" of the function involves a recursive reference to the function. An optimizing compiler can optimize the recursion to make it not recursive.

If Python doesn't do tail recursion, this function would not work.

Since it works, Python does tail recursion.

Python limits recursion so that you crash cleanly before you're out of memory. 

This is important. Older languages would run out of memory for stack frames. Instead of reporting a recursion problem, they'd just crash. Out-of-memory. Sorry. No drop to `pdb`. Nothing. This is bad. 

In the old Mac OS (≤9) the stack and heap memory grew from opposite ends of the available RAM. If they collided, it was a stack overflow, and the running app was crashed.

Here's how the stack limitation plays out in practice:

Python 3.9.0 | packaged by conda-forge | (default, Oct 10 2020, 20:36:05) 
[Clang 10.0.1 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> def f(x: int) -> int:
...     if x == 0: return 1
...     return x*f(x-1)
... 
>>> f(999)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 3, in f
  File "<stdin>", line 3, in f
  File "<stdin>", line 3, in f
  [Previous line repeated 995 more times]
  File "<stdin>", line 2, in f
RecursionError: maximum recursion depth exceeded in comparison
>>> f(997)
403597244616453902342926527652907402110903352461393891430307973735196631901068423726375883385358710213700663198466197719394411318126551961682447808198924968051643330792790545975658652366984953410102994729193397927862707860663312211428139657339199210288839829245965893084586772188847949801354437616450752245066665598898009417557796737695167521343249479413631414534202184726421479392615730781173164526393982880263279118925406206180689438683308644696334133955184235540077242451165903811018277198321800315958279899941381566151490917379981054549852483223292752438981198080270888254399197574536460570473430371595872403169486757166154294941258045311241382930836862005052391967478429035362983199050663230586866257612402804942403442331663944341683350732204123565349869446216232111598995678724462182568501131746383857706790400107507266739002631612931124112227909672935742104968533278074796000335855930432060517447195226436187301231195091058916141500005034486568847599649004940677693185232218378659444854645703908824934015144550035704605317977378620311855095356769488892217130200011250491151641531090120083765159221969755314437880209281708574493693840125338722070514029362985801732618715060934298236579096167095859504053310608725711198457200226544350445941157734863407428532431126485686678788466148681975019174010453297639004006819520704463840773528691224455265229985489764356909675383800245969276679872407757924211918488179598530382266647547907226165479802976547896939656888813256826539067915695278878516257396920983511389029563101112325372395464739783143361362879872578550147571168136083391354242735142803988735616917749898060073075542403509536490539404444972668319521415425667918323473675966566332390993259591959049424070380861864682206986463729281557338747466546627859206287571996491606797979064142819469589200812679026561288124087136359830959867034513441434850212864818601504529520195828528045600869420646442863720485414968365312690523835026508545772659712105161137693595262919371358840019473383802028344531181679417716563013501242477291139042422814166369601152223293596957527530934652046662174154235850073391729650007182794396630407081318880947107940245036774649857429379220776637356890211596540009349092255988047909417594778375705723841918167663026277009033939654785671715045122185315730249393616044737902170116980736000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
>>> 

This shows how a large number of operations (around 1,000) exceeds an internal limit. The exception is not out-of-memory, it's too much recursion.

A slightly smaller number (997 in this example) worked.  999 didn't work because it exceeded the threshold.

Manual Tail Recursion Optimization

New word: Optimization. New concept.

If the tail recursion were optimized into a loop, we'd see code that behaves as if we wrote this:

>>> from math import prod
>>> def f1(x: int) -> int:
...     return prod(range(1, x+1))
    

This unwinds the tail recursion into a loop. A generator, range(1, x+1), creates a sequence of values, which are reduced into a product. This doesn't involve recursion or stack frames. Indeed, because it's a generator, it involves very little memory.

And it works for  numbers well over 1,000. Evidence (if any were needed) that tail recursion optimization is *not* being done in Python.

We'll repeat this for those who feel the need to send me crazy-sounding emails.

Automatic Tail Recursion Optimization

There is no automatic transformation from tail recursion to loop in Python.

I'll repeat that for the folks who send me emails claiming tail recursion isn't supported. (Everyone else probably has a pretty good grip on this.)

There is no automatic transformation from tail recursion to loop. 

Tail recursion works. It has a limit.

Manually optimized tail recursion (to create a loop) also works. It doesn't have the same limit.

Stated yet another way: unlike some languages, Python does not optimize for you. You must do the rewrite yourself.

While I would have thought these ideas (tail recursion and tail recursion optimization) were different. I was wrong. Hopefully, this blog post will help folks read *all* the words.  

I'm also pretty sure it's covered in here: https://www.packtpub.com/product/functional-python-programming-second-edition/9781788627061.

Tuesday, January 7, 2020

Patreon Book Idea


See "Additional, Related Content". It's one of the posts here: https://www.patreon.com/slott

I think there's space for a Building Skills in Functional Python title next to the Building Skills in OO Design

Tuesday, December 17, 2019

Plannng a Linked-in Learning Course (and using the := walrus operator)

I've recorded two courses for LinkedIn Learning https://www.linkedin.com/learning/me
Let me emphasize that their production values take a lot of work. While I think I'm a pretty good live presenter, a few days in the recording booth with a producer, reveals all my weaknesses. so. um. you know?

I'm starting down the road to at least one more, maybe another one or two after that. 

Which leads to code. Of course. And the code uses the assignment expression ("walrus") operator.

Here's what's going on. I've got a directory full of CSV files with the slide-by-slide scripts. Each file has a bunch of tabs, and the relevant tables have a fixed heading that the production folks use. 

target_headings = ['Part', 'Voice', 'Visual Description', 'Storyboard / Description']

The "Voice" column in these tables is the script. Each row is a slide or other visual. The overall management of the resources with all of these spreadsheets doesn't seem ideal to me. However, it's the way skilled professionals prefer to manage these multi-media assets.

The question is: "which sections are too long?"

Generally, we speak at a consistent rate. During rehearsals, I can use my stopwatch to get timing for a particular script. This gives me a seconds/word or words/second rate metric. Given an average rate, and a script, I can predict a likely duration given the text of the script.

The data is in spreadsheets -- generally the root cause of many complications. There's no word-count in Numbers. So. Time to apply Python. (I'm sure someone has a bunch of Excel macros that can do word-counts. Good for you. I don't own a copy of Excel.)

Here's how this shakes out. There are three parts to the analysis, modeling, and applying the model. The first is a functional flattener to turn all of the files and tabs and tables into a single stream of useful rows.

The Data Gathering

The essential data gathering has to flatten the relatively complex file/sheet/table structure into something we can extract features from. A sequence of the final text of the scripts is what we want. Each script can be a mapping from the slide label to the voice content. It's this content -- the script text -- where we'll find the interesting features.

Here's how this starts.

from pathlib import Path
from fractions import Fraction
import csv
import re
from typing import Tuple, Dict, Iterator, List

sheet_table_pattern = re.compile(f"^(\w+): (.+)$")
target_headings = ['Part', 'Voice', 'Visual Description', 'Storyboard / Description']


def script_iter(source: Path) -> Iterator[Tuple[str, Dict[str, str]]]:
    for script_path in sorted(source.glob("*.csv")):
        # print(script_path)
        with script_path.open() as script_file:
            reader = csv.reader(script_file)
            row_iter = iter(reader)
            for row in row_iter:
                if len(row) == 1 and (match := sheet_table_pattern.match(row[0])):
                    if match and match.group(2).startswith('Table '):
                        headings = next(row_iter)
                        if headings == target_headings:
                            section = match.group(1)
                            text = {}
                            # print(f"Analyzing {section}")
                            for sub_row in row_iter:
                                if len(sub_row) == 0:
                                    break
                                dict_sub_row = dict(zip(headings, sub_row))
                                text[dict_sub_row['Part']] = dict_sub_row['Voice']
                            yield section, text


The outermost for statement locates all .csv files. All the rows within a file will belong to a number of sheets and tables within each sheet. The separator is a line with a Sheet: Table string, described by the sheet_table_pattern. The second for statement picks all the rows from a given sheet, looking for the separators.

There are a bunch of irrelevant tables. Hence the tall stack of if-statements. The useful parts of the script all have names that start with 'Table '. Weird, but true. The match.group(2).startswith('Table ') check feels like some casual ad-hoc test and should probably be made more visible and configurable.

Once we've found a table with the right headings, we can iterate over the following rows until we get to a blank line at end-of-table. We accumulate a dictionary, named text, which has the 'Part' and 'Voice' column values as a handy Dict[str, str] mapping.

Note that we're sharing an iterator, the row_iter variable, among two for statements. This is a very handy trick when doing this kind of partitioning. The outermost use of the iterator is rejecting irrelevant rows. The inner use of the iterator is assembling composite objects from a subset of rows, effectively partitioning the raw data.

This *can* be decomposed into separate functions. Further refactoring is left as an exercise for the reader.

The Benchmark Data

The result of benchmarking is a Fraction object with my unique reading pace. And yes, a Fraction makes more sense than a float value. We're working in int space, and introducing float seems wrong.

Here's the benchmarking to create a model.
def rate() -> Fraction:
    Benchmarks = [
        {'time': 3*60 + 29, 'words': 568},  # 01_01
        {'time': 5*60 + 32, 'words': 732},  # 01_04
        {'time': 5*60 + 54, 'words': 985},  # 02_04
        {'time': 4*60 + 58, 'words': 663},  # 02_05
        {'time': 8*60 + 48, 'words': 1192},  # 03_02 (draft)
    ]
    time_bm = sum(b['time'] for b in Benchmarks)
    words_bm = sum(b['words'] for b in Benchmarks)
    time_per_word = Fraction(time_bm/words_bm)
    return time_per_word

For some sample sections, I read through the material in my best NPR professional broadcasting voice. The sums of words and times give us a time-per-word Fraction object. The resulting value is near 31 seconds for 75 words.

I really like using Fraction instead of float for this kind of thing. The data doesn't support even one decimal place of supposed accuracy.

Note that I didn't factor in any slide count. I assumed this is a linear model from words to time. If I was a real scientist I might have tried a bunch of models.

Applying the Model

The model is linear. It's a scaling factor applied to a specific feature, the number of words. Here's one version of the code. I'm not sure I like it.

def main() -> None:
    time_per_word = rate()
    source = Path.cwd()
    print(f"script, slides, words, time")
    for script, body in script_iter(source):
        word_count = sum(len(text.split()) for text in body.values())
        slide_count = sum(1 for text in body.values() if len(text) > 0)
        m, s = divmod(int(word_count*time_per_word), 60)
        print(f"{script}, {slide_count}, {word_count}, {m}:{s:02d}")

There are three mappings going on here. This makes it a little tricky to create a simple function to map from raw data to something the model can use, then applying the model.

The 'word_count' is a mapping from raw data to one feature. The 'slide_count' is another mapping from raw data to a secondary feature. The 'm' and 's' values represent another mapping from the word_count to the estimated time.

We can hack this around to find another use for the assignment operator.  But the following seems insane:

divmod(int(word_count:=sum(len(text.split()) for text in body.values())*time_per_word), 60)

Let's not consider this assignment expression example as particularly helpful. The above turns two simple statements into a mess.

Alternative Implementation

The relationships among the mappings can be built a pure functional programming, but seems flirt with needless complexity. We can have a pair of functions to map the body.values() to some named tuple with feature values. We can use a third function to apply the model.

Something like this is an alternative that's slightly more functional.

class Features(NamedTuple):
    body: Dict[str, str]
    @property
    def word_count(self) -> int:
        return sum(len(text.split()) for text in self.body.values())
    @property
    def slide_count(self) -> int:
        return sum(1 for text in self.body.values() if len(text) > 0)
    def duration(self, time_per_word: Fraction) -> int:
        return int(self.word_count*time_per_word)

def main_2() -> None:
    time_per_word = rate()
    source = Path.cwd()
    print(f"script, slides, words, time")
    for script, body in script_iter(source):
        details = Features(body)
        m, s = divmod(details.duration(time_per_word), 60)
        print(f"{script}, {details.slide_count}, {details.word_count}, {m}:{s:02d}")

I'm not sure this is dramatically "better". It isolates some aspects of feature collection and model application. It also harbors a secret inefficiency. The two feature values should be cached to avoid recomputing them.

I'll leave the refactoring for the interested reader.

The durations over > 5:00 (300 seconds) need some rework. That's the actual useful output: the list of scripts with excessive time becomes the queue of content that needs rework.

Thursday, June 20, 2019

HumbleBundle -- Functional Python Programming -- Through July 1


See this https://www.humblebundle.com

This is amazing to me. 
Humble Bundle sells games, ebooks, software, and other digital content. Our mission is to support charity while providing awesome content to customers at great prices. We launched in 2010 with a single two-week Humble Indie Bundle, but we have humbly grown into a store full of games and bundles, a subscription service, a game publisher, and more.
Currently, Packt is offering some of my books.

Want a *ton* of technical books and donate to charity?

Click Now. Thank me later.

Tuesday, April 24, 2018

Functional Python Programming 2e -- Type Hints!

You might want to look into this: Functional Python Programming - Second Edition.

Let's talk about the type hints, shall we?

Most of the examples have had type hints added. This means running everything through mypy. And it also means running everything through doctest, as well.

More important than the technical steps, there's a change in viewpoint that comes with type hints.

If you follow a variety of Pythonistas on Twitter, you can see some debates on the merits of type-hinting. Some key points:

  • It's hard.
  • It's so hard, only do it if you absolutely need it.
  • It's too verbose
  • It's hard, but it can help.
  • It's really helpful.
  • It represents a "gap" in the language and without run-time type checking, the whole thing is worthless.
The last point a weird view. I work in a shop that's heavily Pythonic. But. You still hear nonsense. Python a very popular language and it's popularity is growing. The popularity of Python isn't like the popularity of a movie where you're not planning on making a living off of it (I know someone who makes their living off the popularity of movies.) The popularity of Python is like the popularity of automobiles or air travel or electricity.

I hear the "a real language would have prevented that with type-checking." And I respond, "Then why do you unit test?" And they don't really have much of an answer. Python has the same workflow as statically type-checked languages, so the "prevention" thing seems to be nonsense.

Moving on.

"It's hard." Anything new is hard. The complaint is vague, so it's *hard* to respond. (Heh.)

Anything like "only do it if you absolutely need it" bothers me because it seems like a passive-aggressive barricade around things. Also. It's vague.

Verbosity

Verbosity in type hints is a real problem. When creating complex objects from built-in types, we often forget to give names to the intermediate object classes.

Consider Dict[Tuple[Tuple[int, int], Tuple[int, int]], float]

It's long. It describes a structure like this {((12, 13), (14, 15)): 2.8284271247461903, ...} 

Writing something like the following d_map() function without hints is easy. Adding hints seems hard.

def d_map(points):
    return {(p1, p2): hypot(p1[0]-p2[0], p1[1]-p2[1]) for p1, p2 in points}

The declaration became L.. O... N... G... because we ignored the intermediate types.

def d_map(points: List[Tuple[Tuple[int, int], Tuple[int, int]]]) -> Dict[Tuple[Tuple[int, int], Tuple[int, int]], float]:
    return {(p1, p2): hypot(p1[0]-p2[0], p1[1]-p2[1]) for p1, p2 in points}

These hints, however, doesn't really describe what's happening. The hints elide important details. The hints don't reflect the underlying semantics of the data structure.

One of Python's strengths is the rich collection of first-class data structures with built-in syntax. We can abbreviate some complex concepts into succinct, expressive code.

However.

We shouldn't lose sight of what the succinct code represents. And in this case, it represents some rather complex concepts.
<rant>
Let me sit in my lawn chair and shake my fist in helpless fury at you kids. When I was your age, we sent half a semester of undergraduate work trying to get linked lists, and simple hash mapping to work. Months of work. Later on, as a professional -- years of actual experience -- it took forever to build a binary tree-based collections.Counter definition to gather simple numbers from a flat file. Nowadays, you just slap a Counter down into your code like it's a nothing. It's not a nothing. It's serious, sophisticated software engineering. It's more than Dict[Any, int]. </rant>

What can we do?

When in doubt, Expose the Intermediate Types.

Point = Tuple[int, int]
Leg = Tuple[Point, Point]
Distances = Dict[Leg, float]
def d_map(points: Iterable[Leg]) -> Distances:
    return {(p1, p2): hypot(p1[0]-p2[0], p1[1]-p2[1]) for p1, p2 in points}

This exposes the details. In some cases, it causes us to rethink using a two-tuple to represent a point. The p1[0] syntax starts to chafe a little. Perhaps this should have been

class Point(NamedTuple):
    x: int
    y: int

That leads to tiny (almost-but-not-quite trivial) simplifications. Instead of building simple tuples for each point, we can now build named Point tuples and use p1.x and p1.y to make the code more civilized.

One consequence of this is actually avoiding (), [], and {} to build tuples and lists. Yes. This is heresy. I seriously recommend using tuple(), list(), dict(), and set() because we can replace them with equivalent types. And yes, I text my mother with the same fingers that wrote that.

"But," you object, "It's objectively LONGER! You didn't save me anything! You're a fraud!"

My first response is, "Correct." It is objectively longer. And "Correct," I didn't really "save" you anything; I'm not sure what you're saving. Lines of code do have a cost, but I think clarity has value. And finally, "Correct," I've often been wrong, and I may be wrong here, too.

I like this because the type definitions are reusable, I think this can add clarity throughout the application.

When this kind of declaration is part of a reusable module, the goodness spreads like smiles and hugs throughout the application. Before long, other functions have been tweaked and everyone is sending each other little teddy-bear hug gifts with rainbow cupcakes.

(Please don't exchange mylar balloons. They're evil. Also, see this.)

tl;dr

When your type hints seem ungainly and large, consider Exposing the Intermediate Types. Break down a big structural type hint into the constituent pieces.

If you had to create a class definition for EVERY variation on list, dict, set, and tuple, what would your new class be named?

If you had to describe the underlying meaning of a class -- separate from it's structure -- what name would you give it?

Picking names is one of the two hardest problems in computing. It isn't easy. (The other hardest problem? Cache invalidation and off-by-one errors.)

Tuesday, October 17, 2017

Why I like Functional Composition

After spending years developing a level of mastery over Object Oriented Design Patterns, I'm having a lot of fun understanding Functional Design Patterns.

The OO Design Patterns are helpful because they're concrete expressions of the S. O. L. I. D. design principles. Much of the "Gang of Four" book demonstrates the Interface Segregation, Dependency Injection, and Liskov Substitution Principles nicely. They point the way for implementing the Open/Closed and the Single Responsibility Principles.

For Functional Programming, there are some equivalent ideas, with distinct implementation techniques. The basic I, D, L, and S principles apply, but have a different look in a functional programming context. The Open/Closed principle takes on a radically different look, because it turns into an exercise in Functional Composition.

I'm building an Arduino device that collects GPS data. (The context for this device is the subject of many posts coming in the future.)

GPS devices generally follow the NMEA 0183 protocol, and transmit their data as sentences with various kinds of formats. In particular, the GPRMC and GPVTG sentences contain speed over ground   (SOG) data.

I've been collecting data in my apartment. And it's odd-looking. I've also collected data on my boat, and it doesn't seem to look quite so odd. Here's the analysis I used to make a more concrete conclusion.

def sog_study(source_path = Path("gps_data_gsa.csv")):
    with source_path.open() as source_file:
        rdr = csv.DictReader(source_file)
        sog_seq = list(map(float, filter(None, (row['SOG'] for row in rdr))))
        print("max {}\tMean {}\tStdev {}".format(
            max(sog_seq), statistics.mean(sog_seq), statistics.stdev(sog_seq)))


This is a small example of functional composition to build a sequence of SOG reports for analysis.

This code opens a CSV file with data extracted from the Arduino. There was some reformatting and normalizing done in a separate process: this resulted in a file in a format suitable for the processing shown above.

The compositional part of this is the list(map(float, filter(None, generator))) processing.

The (row['SOG'] for row in rdr) generator can iterate over all values from the SOG column. The filter(None, generator) will drop all None objects from the results, assuring that irrelevant sentences are ignored.

Given an iterable that can produce SOG values, the map(float, iterable) will convert the input strings into useful numbers. The surrounding list() creates a concrete list object to support summary statistics computations.

I'm really delighted with this kind of short, focused functional programming.

"But wait," you say. "How is that anything like the SOLID OO design?"

Remember to drop the OO notions. This is functional composition, not object composition.

ISP: The built-in functions all have well-segregated interfaces. Each one does a small, isolated job.

LSP: The concept of an iterable supports the Liskov Substitution Principle: it's easy to insert additional or different processing as long as we define functions that accept iterables as an argument and yield their values or return an iterable result.

For example.

def sog_gen(csv_reader):
    for row in csv_reader:
        yield row['SOG']

We've expanded the generator expression, (row['SOG'] for row in rdr), into a function. We can now use sog_gen(rdr) instead of the generator expression. The interfaces are the same, and the two expressions enjoy Liskov Substitution.

To be really precise, annotation with type hints can clarify this.  Something like sog_gen(rdr: Iterable[Dict[str, str]]) -> Iterable[str] would clarify this.

DIP: If we want to break this down into separate assignment statements, we can see how a different function can easily be injected into the processing pipeline. We could define a higher-order function that accepted functions like sog_gen, float, statistics.mean, etc., and then created the composite expression.

OCP: Each of the component functions is closed to modification but open to extension. We might want to do something like this: map_float = lambda source: map(float, source). The map_float() function extends map() to include a float operation. We might even want to write something like this.  map_float = lambda xform, source: map(xform, map(float, source)). This would look more like map(), with a float operation provided automatically.

SRP: Each of the built-in functions does one thing. The overall composition builds a complex operation from simple pieces.

The composite operation has two features which are likely to change: the column name and the transformation function. Perhaps we might rename the column from 'SOG' to 'sog'; perhaps we might use decimal() instead of float(). There are a number of less-likely changes. There might be a more complex filter rule, or perhaps a more complex transformation before computing the statistical summary.  These changes would lead to a different composition of the similar underlying pieces.

Tuesday, July 11, 2017

Extracting Data Subsets and Design By Composition

The request was murky. It evolved over time to this:
Create a function file_record_selection(train.csv, 2, 100, train_2_100.csv)
First parameter: input file name (train.csv)
Second parameter: first record to include (2)
Third parameter: last record to include (100)
Fourth parameter: output file name (train_2_100.csv)
Fundamentally, this is a bad way to think about things. I want to cover some superficial problems first, though.

First superficial dig. It evolved to this. In fairness to people without a technical background, getting to tight, implementable requirements are is difficult. Sadly the first hand-waving garbage was from a DBA. It evolved to this. The early drafts made no sense.

Second superficial whining. The specification -- as written -- is extraordinarily shabby. This seems to be written by someone who's never read a function definition in the Python documentation before. Something I know is not the case. How can someone who is marginally able to code also unable to write a description of a function? In this case, the "marginally able to code" may be a hint that some folks struggle with abstraction: the world is a lot of unique details; patterns don't emerge from related details.

Third. Starting from record 2, seems to show that they don't get the idea that indexes start with zero. They've seen Python. They've written code. They've posted code to the web for comments. And they are still baffled by the start value of indices.

Let's move on to the more interesting topic, functional composition. 

Functional Composition

The actual data file is a .GZ archive. So there's a tiny problem with looking at .CSV extracts from the gzip. Specifically, we're exploding a file all over the hard drive for no real benefit. It's often faster to read the zipped file: it may involve fewer physical I/O operations. The .GZ is small; the computation overhead to decompress may be less than the time waiting for I/O.

To get to functional composition we have to start by decomposing the problem. Then we can build the solution from the pieces. To do this, we'll borrow the interface segregation (ISP) design principle from OO programming.

Here's an application of ISP: Avoid Persistence. It's easier to add persistence than to remove it. This leads peeling off three further tiers of file processing: Physical Format, Logical Layout, and Essential Entities.

We shouldn't write a .CSV file unless it's somehow required. For example, if there are multiple clients for a subset. In this case, the problem domain is exploratory data analysis (EDA) and saving .CSV subsets is unlikely to be helpful. The principle still applies: don't start with persistence in mind. What are the Essential Entities?

This leads away from trying to work with filenames, also. It's better to work with files. And we shouldn't work with file names as strings, we should use pathlib.Path. All consequences of peeling off layers from the interfaces.

Replacing names with files means the overall function is really this. A composition. 

file_record_selection = (lambda source, start, stop, target: 
    file_write(target, file_read_selection(source, start, stop))
)

We applied the ISP again, to avoid opening a named .CSV file. We can work with an open file-like objects, instead of a file names. This doesn't change the overall form of the functions, but it changes the types. Here are the two functions that are part of the composition:

from typing import *
import typing
Record = Any
def file_write(target: typing.TextIO, records: Iterable[Record]):
    pass
def file_read_selection(source: csv.DictReader, start: int, stop: int) -> Iterable[Record]:
    pass

We've left the record type unspecified, mostly because we don't know what it just yet. The definition of Record reflects the Essential Entities, and we'll defer that decision until later. CSV readers can produce either dictionaries or lists, so it's not a complex decision; but we can defer it.

The .GZ processing defines the physical format. The content which was zipped was a .CSV file, which defines the logical layout.

Separating physical format, logical layout, and essential entity, gets us code like the following:

with gzip.open('file.gz') as source:
    reader = csv.DictReader(source)  # Iterator[Record]
    for line in file_read_selection(reader, start, stop):
        print(line)

We've opened the .GZ for reading. Wrapped a CSV parser around that. Wrapped our selection filter around that. We didn't write the CSV output because -- actually -- that's not required. The core requirement was to examine the input.

We can, if we want, provide two variations of the file_write() function and use a composition like the file_record_selection() function with the write-to-a-file and print-to-the-console variants. Pragmatically, the print-to-the-console is all we really need.

In the above example, the Record type can be formalized as  List[Text].  If we want to use csv.DictReader instead, then the Record type becomes Dict[Text, Text].

Further Decomposition

There's a further level of decomposition: the essential design pattern is Pagination. In Python parlance, it's a slice operation. We could use itertools to replace the entirety of file_read_selection() with itertools.takewhile() and itertools.dropwhile(). The problem with these methods is they don't short-circuit: they read the entire file.

In this instance, it's helpful to have something like this for paginating an iterable with a start and stop value.

for n, r in enumerate(reader):
    if n < start: continue
    if n = stop: break
    yield r

This covers the bases with a short-circuit design that saves a little bit of time when looking at the first few records of a file. It's not great for looking at the last few records, however. Currently, the "tail" use case doesn't seem to be relevant. If it was, we might want to create an index of the line offsets to allow arbitrary access. Or use a simple buffer of the required size.

If we were really ambitious, we'd use the Slice class definition to make it easy to specify start, stop, and step values. This would allow us to pick every 8th item from the file without too much trouble.

The Slice class doesn't, however support selection of a randomized subset. What we really want is a paginator like this:

def paginator(iterable, start: int, stop: int, selection: Callable[[int], bool]):
    for n, r in enumerate(iterable):
        if n < start: continue
        if n == stop: break
        if selection(n): yield r

file_read_selection = lambda source, start, stop: paginator(source, start, stop, lambda n: True)

file_read_slice = lambda source, start, stop, step: paginator(source, start, stop, lambda n: n%step == 0)

The required file_read_selection() is built from smaller pieces. This function, in turn, is used to build file_record_selection() via functional composition. We can use this for randomized selection, also.

Here are functions with type hints instead of lambdas.

def file_read_selection(source: csv.DictReader, start: int, stop: int) -> Iterable[Record]:
    return paginator(source, start, stop, lambda n: True)

def file_read_slice(source: csv.DictReader, start: int, stop: int, step: int)  -> Iterable[Record]:
    return paginator(source, start, stop, lambda n: n%step == 0)

Specifying type for a generic iterable and the matching result iterable seems to require a type variable like this:

T = TypeVar('T')
def paginator(iterable: Iterable[T], ...) -> Iterable[T]:

This type hint suggests we can make wide reuse of this function. That's a pleasant side-effect of functional composition. Reuse can stem from stripping away the various interface details to decompose the problem to essential elements.

TL;DR

What's essential here is Design By Composition. And decomposition to make that possible.

We got there by stepping away from file names to file objects. We segregated Physical Format and Logical Layout, also. Each application of the Interface Segregation Principle leads to further decomposition. We unbundled the pagination from the file I/O. We have a number of smaller functions. The original feature is built from a composition of functions.

Each function can be comfortably tested as a separate unit. Each function can be reused.

Changing the features is a matter of changing the combination of functions. This can mean adding new functions and creating new combinations. 

Tuesday, May 9, 2017

Fizz Buzz Overthought, Project Euler #1, and Unit Tests

This. http://www.tomdalling.com/blog/software-design/fizzbuzz-in-too-much-detail/

And many other thoughts on overthinking fizz buzz. I'm going to overthink it, also. Why not?

This is a problem where the obvious unit test may not cover the cases properly. See https://projecteuler.net/problem=1 for a unit test case with a subtle misdirection built in. I love this problem statement deeply because the happy path is incomplete.

Part of my overthinking is overthinking this as a "classification" exercise, where we have simple classifiers that we can apply as functions of an input value. A higher-level function (i.e., map or a generator expression) applies all of these functions to the input. Some match. Some don't.

It shakes out like this.

The core classifier is a function that requires flexible parameter binding.

query = lambda n, t: lambda x: t if x%n == 0 else None

This is a two-step function definition. The outer function binds in two parameters, n, and t. The result of this binding is the inner function. If an argument value is a multiplier of n, return the text, t. We can think of the result of the outer lambda as a partial function, also, with some parameters defined, but not all.

We have two concrete results of this query() function:

fizz = query(3, 'fizz') 
buzz = query(5, 'buzz')

We've bound in a number and some text. Here's how the resulting functions work.


>>> fizz(3) 
'fizz' 
>>> fizz(2)

The "trick" in the fizz buzz problem space is recognizing that we're working with the power set of these two rules. There are actually four separate conditions. This is remarkably easy to get wrong even though the sample code may pass a unit test like the Project Euler #1 sample data.

Here's the power set that contains the complete set of all subsets of the rules.

rule_groups = set(powerset([fizz, buzz]))

"Um," you say, "Is that necessary? And powerset() isn't built-in."

Try to add a third or fourth rule and the $\textbf{O}(2^n)$ growth in complexity of checking all combinations of the rules will become readily apparent. For two rules, $4 = \lvert\mathcal{P}(\{q(3), q(5)\})\rvert$. For a general set of rules, $r$, it's $2^{\lvert r \rvert} = \lvert \mathcal{P}(r)\rvert$. Four rules? sixteen outcomes. It sure seems like the power set is absolutely necessary; it describes the domain of possible outcomes. How could it not be necessary?

Also. There's a nice definition of powerset in the itertools recipes section of the standard library. It's *almost* built-in. 

The domain of possible responses form a power set. However, it's also clear that we aren't obligated to actually enumerate that set for each value we're testing. We do need to be aware that the complexity of the classification output is $\textbf{O}(2^n)$ where $n$ is the number of rules.

The processing to build each set of classifications, however, is $\textbf{O}(n)$. Here's how it looks.

for n in range(20):
    m = set(filter(None, (r(n) for r in [fizz, buzz])))
    print(m if m else n)

This locates the set of all matches, m. We apply the rules, fizz() and buzz(), to the given value. The result of applying the rules is filtered to remove falsy values. The resulting set, m, has the values from all rules which matched. This will be one of the sets from the power set of applying the rules to each value. The match, $m$, is an element of $\mathcal{P} (\{q(3), q(5)\})$.

I'm delighted that Python has some support for creating partial functions in a variety of ways. When things are complex, we can use the def statement. We can use functools partial(). When things are simple, we can even use lambdas.

Tuesday, May 2, 2017

Functional Python, Literate Programming & Trello Board Analysis

The general advice to people using Kanban/Agile Project boards of various kinds is this:

Stop Starting -- Start Finishing


etc.

There's a lot of this advice. Some of it is helpful.

Many tools have various dashboards and metrics computations.

However.

The basic velocity calculations -- starts v. finishes -- is pretty straight-forward. The rules to classify a Trello action as "start" or "finish" are actually nice examples of simple functions or lambdas. Which also means that the basic pipeline required to gather the data can be written as a lazy, functional process.

Which leads to writing a Literate Programming version of a small program that gathers data from a Trello board.

https://github.com/slott56/Trello-Action-Counts

It's a kind of deep-dive into some aspects of Python functional-style programming. It's also a dive into Literate Programming via a longish example. And it has a fair number of type hints. It's not perfectly clean from MyPy-'s analysis. So there's some more to do on that front.

Monday, January 9, 2017

The Depths of Degradation or How to Reduce

Let's talk real-world functional programming. Disclosure: I'm a fan of functional programming in Python. (This: https://www.packtpub.com/application-development/functional-python-programming)

The usual culprits for functional programming are map(), filter(), generator functions, and the various comprehensions. This is very pleasant and can lead to succinct, expressive code.

The reduce operation, however, is sometimes slippery.  The obvious reductions are sum() and prod().  Some slightly less obvious reductions are these three:

sum0 = lambda s: sum(1 for _ in s)
sum1 = lambda s: sum(s)
sum2 = lambda s: sum(n**2 for n in s)

The first is essentially len(s), but stated more formally. It shows how we can add in filter or transformations. If we're working with a collections.Counter object, we can rewrite these three to work with the values() of a counter. This allows us to have a statistics library that works with a sequence of simple items or a Counter of binned items.

(I've left it as an exercise for the reader to create the summaries of Counters.)

The Health Check Question

The context is an RESTful application's /health end-point. When a client does a GET to /health, we want to provide status of the components on which the app depends as well as a summary.

The details are created like this:

components = (component() for component in COMPONENT_LIST)
init_components = [thing.init_app(app) for thing in components]
details = [component.health() for component in init_components]

We have a list of class definitions for each component. We can create instances of each class. We can initialize these by providing the RESTful app. Finally, we can create a list of the various health end-point status codes.

There's a class definition for other RESTful API's. The health check does a transitive GET to a /health end-point. These are all more-or-less identical.

There are also class definitions for the database and the cache and other non-RESTful components. It's all very pretty and very functional.

Note that the three statements aren't adjacent. They're scattered around to fit better with the way Flask works. The component list is in one place. The initialization happens before the first request. The details are computed as requested.

Also. We don't really use a simple list for the details. It's actually a mapping from which we will derive a vector. I've left that detail out because it's a relatively simple complication.

Representation of Health

We represent health with a simple enumeration of values:

from enum import Enum
class Status(Enum):
    OK = "OK"
    DEGRADED = "DEGRADED"
    DOWN = "DOWN"

This provides the essential definition of health for our purposes. We don't drag around details of the degradation; that's something that we have to determine by looking at our consoles and logs and stuff.  Degradation is (a) rare, and (b) nuanced. Some degradations are mere annoyances: one of the servers is being restarted. Other degradations are hints that something else might be going on that needs investigation: database primary server is down and we're running on a secondary.

Summarizing Health

A subset of the details vector, then, looks like this: [Status.OK, Status.OK, Status.DEGRADED].

How can we summarize this?

First, we need some rules.  Like these:

class Status(Enum):
    OK = "OK"
    DEGRADED = "DEGRADED"
    DOWN = "DOWN"

    def depth(self, other):
        if self == self.OK:
            return {self.OK: self.OK,
                    self.DEGRADED: self.DEGRADED,
                    self.DOWN: self.DEGRADED}[other]
        elif self == self.DEGRADED:
            return {self.OK: self.DEGRADED,
                    self.DEGRADED: self.DEGRADED,
                    self.DOWN: self.DEGRADED}[other]
        elif self == self.DOWN:
            return {self.OK: self.DEGRADED,
                    self.DEGRADED: self.DEGRADED,
                    self.DOWN: self.DOWN}[other]


The depth() method implements a comparison operator that defines the relationships. This can be visualized as a table.

depthOKDEGRADEDDOWN
OKOKDEGRADEDDOWN
DEGRADEDDEGRADEDDEGRADEDDEGRADED
DOWNDOWNDEGRADED DOWN


This allows us to define a function that uses reduce to summarize the vector of status values.

from functools import reduce
def summary(sequence): 
    return reduce(lambda a, b: a.depth(b), sequence)

The reduce() function applies a binary operator between items in a vector. We've used lambda a, b: a.depth(b) to turn the the depth() method into a binary operator so it can be used with reduce.

The summary() function is a "depth-reduction" of a vector of status objects. It's defined independently of the actual status objects. The relationships among the status levels are embedded in the class definition where they belong. The actual details of status are pleasantly opaque.

And.

We have an example of map-reduce outside the sphere of big data.

The Integer Alternative

The health rules as shown above are kind of complex. Could they be simplified? The answer is no.

Here's an alternative -- which does not do what we want.

class Status2(IntEnum):
    OK = 1
    DEGRADED = 2
    DOWN = 3
    
summary2 = lambda sequence: max(sequence)

This works in some cases, but it doesn't work in others. Another alternative is to change the order to be OK=1, DOWN=2, DEGRADED=3. This doesn't work, either. I'll leave it as an exercise to write out some of the various combinations of values and see how these differ.

JSON Representation

The final detail is JSONification of the status vector and the summary.

json.dumps({"status": summary(vector).name, "details": [s.name for s in vector]})

This converts the various Status objects to text items that fit the Swagger specification for our /health end-points. The .name attribute reference is required to get the string labels from the enum. An alternative is to customize the JSON encoder to recognize the Enum objects and extract their names.

Conclusion

Map-Reduce is easy. It surfaces in a number of places. The idea helps encapsulate summarization rules.

Tuesday, October 18, 2016

Optimizing complex generator expressions [Updated]

See this: https://twitter.com/jakevdp/status/786920174595158018

The core expression is similar to this

y = (f(x) for x in L if f(x) is not None)

There are a lot of variations on the filter. The point is that the function appears twice in the above expression.

We have a number of alternatives.

  • y = filter(None, f(x) for x in L)
  • y = filter(None, map(f, L))
  • y = (x for x in map(f, L) if x)
  • y = (x for x in (f(y) for y in L) if x is not None)
  • y = (val for x in L for val in (f(x),) if val is not None)


My preference is two steps, even though I don't really have a good reason for this.

y1 = (f(x) for x in L)
y2 = (f for f in y1 if f)


The thread leads to this path: https://twitter.com/TomAugspurger/status/786922167522828289  and the idea of "Let Bindings." We could extend the language slightly to bind a variable within the confines of the generator expression.

Like this:

y = (f(x) as val for x in L if val is not None)

The as clause binds the f(x) to val so that it can be used in the if clause.

Summary: Interesting.

Tuesday, August 23, 2016

On Generator Functions, Yield and Return

Here's the question, lightly edited to remove the garbage. (Sometimes I'm charitable and call it "rambling". Today, I'm not feeling charitable about the garbage writing style filled with strange assumptions instead of questions.)

someone asked if you could have both a yield and a return in the same ... function/iterator. There was debate and the senior people said, let's actually write code. They wrote code and proved that couldn't have both a yield and a return in the same ... function/iterator. .... 
The meeting moved on w/out anyone asking the why question. Why doesn't it make sense to have both a yield and a return. ...

The impact of the yield statement can be confusing. Writing code to mess around with it was somehow unhelpful. And the shocking "proved that couldn't have both a yield and a return in the same ... function" is a serious problem.

(Or a seriously incorrect summary of the conversation; a very real possibility considering the garbage-encrusted email. Or a sign that Python 3 isn't widely-enough used and the emil omitted this essential fact. And yes, I'm being overly sensitive to the garbage. But there's a better way to come to grips with reality and it involves asking questions and parsing details instead of repeating assumptions and writing garbage.)

An example


>>> def silly(n, stop=None):
 for i in range(n):
  if i == stop: return
  yield i

  
>>> list(silly(5))
[0, 1, 2, 3, 4]
>>> list(silly(5, stop=3))
[0, 1, 2]

This works in both Python 3.5.1 and 2.7.10.

Some discussion

A definition with no yield is a conventional function: the parameters from some domain are mapped to a return value in some range. Each mapping is a single evaluation of the function with concrete argument values.

A definition with a yield statement becomes an iterable generator of (potentially) multiple values. The return statement changes its behavior slightly. It no longer defines the one (and only) return value. In a generator function (one that has a yield) the return statement can be thought of as if it raised the StopIteration exception as a way to exit from the generator.

As can be seen in the example above, both statements are in one function. They both work to provide expected semantics.

The code which gets an error is this:

>>> def silly(n, stop=3):
...     for i in range(n):
...         if i == step: return "boom!"
...         yield i


The "why?" question is should -- perhaps -- be obvious at this point.  The return raises an exception; it doesn't provide a value.

The topic, however, remains troubling. The phrase "have both a yield and a return" is bothersome because it fails to recognize that the yield statement has a special role. The yield statement transforms the semantics of the function to make it into a different object with similar syntax.

It's not a matter of having them "both". It's matter of having a return in a generator. This is an entirely separate and trivial-to-answer question.

A Long Useless Rant

The email seems to contain an implicit assumption. It's the notion that programming language semantics are subtle and slippery things. And even "senior people" can't get it right. Because all programming languages (other then the email sender's personal favorite) are inherently confusing. The confusion cannot be avoided.

There are times when programming language semantics are confusing.  For example, the ++ operator in C is confusing. Nothing can be done about that. The original definition was tied to the PDP-11 machine instructions. Since then... Well.... Aspects of the generated code are formally undefined.  Many languages have one or more places where the semantics are "undefined" or only defined by example.

This is not one of those times.

Here's the real problem I have with the garbage aspect of the email.

If you bring personal baggage to the conversation -- i.e., assumptions based on a comparison between some other language and Python -- confusion will erupt all over the place. Languages are different. Concepts don't map from language to language very well. Yes, there are simple abstract principles which have different concrete realizations in different languages. But among the various concrete realizations, there may not be a simple mapping.

It's essential to discard all knowledge of all previous favorite programming languages when learning a new language.

I'll repeat that for the author of the email.

Don't Go To The Well With A Full Bucket.

You won't get anything.

In this specific case, the notion of "function" in Python is expanded to include two superficially similar things. The syntax is nearly identical. But the behaviors are remarkably different. It's essential to grasp the idea that the two things are different, and can't be casually lumped together as "function/iterator".

The crux of the email appears to be a failure to get the Python language rules in a profound way. 

Tuesday, March 8, 2016

The Composite Builder Pattern, an Example of Declarative Programming [Update]

I'm calling this the Composite Builder pattern. This may have other names, but I haven't seen them. It could simply be lack of research into prior art. I suspect this isn't very new. But I thought it was cool way to do some declarative Python programming.

Here's the concept.

class TheCompositeThing(Builder):
    attribute1 = SomeItem("arg0")
    attribute2 = AnotherItem("arg1")
    more_attributes = MoreItems("more args")

The idea is that when we create an instance of TheCompositeThing, we get a complex object, built from various data sources.  We want to use this in the following kind of context:

with some_config_path.open() as config:
    the_thing = TheCompositeThing().substitute(config)
print(json.dumps(the_thing))

We want to open some configuration file -- something that's unique to an environment -- and populate the complex object in one smooth motion. Once we have the complex object, it can then be used in some way, perhaps serialized as a JSON or YAML document.

Each Item has a get() method that accepts the configuration as input. These do some computation to return a useful result. In some cases, the computation is kind of degenerate case:

class LiteralItem(Item):
    def __init__(self, value):
        self.value = value
    def get(self, config):
        return self.value

This shows how we jam a literal value into the output. Other values might involve elaborate computations, or lookups in the configuration, or a combination of the two.

Why Use a Declarative Style?

This declarative style can be handy when each of the Items in TheCompositeThing involves rather complex, but completely independent computations. There's no dependency here, so the substitute() method can fill in the attributes in any order. Or -- perhaps -- not fill the attributes until they're actually requested. This pattern allows eager or lazy calculation of the attributes.

This pattern applies to building complex AWS Cloud Formation Templates as an example. We often need to make a global tweak to a large number of templates so that we can rebuild a server farm. There's little or no dependency among the Item values being filled in. There's no strange "ripple effect" of a change in one place also showing up in another place because of an obscure dependency between items.

We can extend this to have a kind of pipeline with each stage created in a declarative style. In this more complex situation, we'll have several tiers of Items that fill in the composite object. The first-stage Items depend on one source. The second stage Items depend on the first-stage Items.

class Stage1(Builder):
    item_1 = Stage_1_Item("arg")
    item_2 = Stage_1_More("another")

class Stage2(Builder):
    item_a = Stage_2_Item("some_arg")
    item_b = Stage_2_Another(355, 113)

We can then create a Stage1 object from external configuration or inputs. We can create the derived Stage2 object from the Stage1 object.

And yes. This seems like useless metaprogramming.  We could -- more simply -- do something like this::

class Stage2:
    def __init__(self, stage_1, config):
        self.item_a = Stage_2_Item("some_arg", stage_1, config)
        self.item_b = Stage_2_Another(355, 113, stage_1, config)

We've eagerly computed the attributes during __init__() processing.

Or perhaps this::

class Stage2:
    def __init__(self, stage_1, config):
        self.stage_1= stage_1
        self.config= config
    @property
    def item_a(self):
        return Stage_2_Item("some_arg", self.stage_1, self.config)
    @property
    def item_b(self):
        return Stage_2_Another(355, 113, self.stage_1, self.config)

Here we've been lazy and only computed attribute values as they are requested.

Benefits

We've looked at three ways to build composite objects:
  1. As independent attributes with an flexible but terse implementation.
  2. As attributes during __init__() using sequential code that doesn't assure independence.
  3. As properties using wordy code. 
What's the value proposition? Why is this declarative technique interesting?

I find that the the Declarative Builder pattern is handy because it gives me the following benefits.
  • The attributes must be built independently. We can -- without a second thought -- rearrange the attributes and not worry about one calculation interfering with another attribute. 
  • The attributes can be built eagerly or lazily. Details don't matter. We don't expose the implementation details via __init__ or @property.
  • The class definition becomes a configuration item. A support technician without deep Python knowledge can edit the definition of TheCompositeThing successfully.
I think this kind of lazy, declarative programming is useful for some applications. It's ideal in those cases where we need to isolate a number of computations from each other to allow the software to evolve without breaking.

It may be a stretch, but I think this shows the Depedency Inversion Principle. To an extent, we've moved all of the dependencies to the visible list of attributes within these classes. The items classes do not depend on each other; they depend on configuration or perhaps previous stage composite objects. Since there are no methods involved in the class defintion, we can change the class freely. Each subclass of Builder is more like a configuration item than it is like code. In Python, particularly, we can change the class freely without the agony of a rebuild.

A Build Implementation

We're reluctant to provide a concrete implementation for the above examples because it could go anywhere. It could be done eagerly or lazily. One choice for a lazy implementation is to use a substitute() method. Another choice is to use the __init__() method.

We might do something like this:

def substitute(self, config):
    class_dict= self.__class__.__dict__
    for name in class_dict:
        if name.startswith('__') and name.endswith('__'): continue
        setattr(self, name, class_dict[name].get(config))

This allows us to lazily build the composite object by stepping through the dictionary defined at the class level and filling in values for each item. This could be done via __getattr__() also.