Moved

Moved. See https://slott56.github.io. All new content goes to the new site. This is a legacy, and will likely be dropped five years after the last post in Jan 2023.

Showing posts with label python 3. Show all posts
Showing posts with label python 3. Show all posts

Tuesday, July 28, 2020

Modern Python Cookbook 2nd ed -- Advance Copies -- DM me

This is your "why wait" invitation.

Advanced copies will be available.  

IF.

And this is a big "if".

You have to write a blurb. 

I'll be putting you in contact with Packt marketing folks who will get you your advanced copy so you can write blurbs and reviews and -- well -- actually use the content.

It's all updated to Python 3.8. Type hints almost everywhere. F-strings and the walrus operator. Bunches of devops and data science examples. Plus a few personal examples involving sailboat navigation and management.

See me at LinkedIn https://www.linkedin.com/in/steven-lott-029835/ and I'll hook you up with Packt marketing folks.

See https://www.amazon.com/Modern-Python-Cookbook-Updated-programmer/dp/180020745X for the official Amazon Book Link. This is for ordinary "no obligation to write a review" orders.

DM me directly slott56 at gmail to be put into the marketing spreadsheet.

Tuesday, June 30, 2020

Over-Solving or Solving Problems You Don't Have

Sometimes we call them "Belt and Braces" solutions. As a former suspenders person who switched to belts, the idea of wearing both is a little like over-engineering. In the unlikely event of catastrophic failure of one system, your pants can still remain properly hoist. There's a weird, but defensible reason for that. Most over-engineering lacks a coherent reason. 

Sometimes we call them "Bells and Whistles." The solution has both bells and whistles for signaling. This is usually used in a derogatory sense of useless noisemakers, there for show only. Again, there's a really low-value and dumb, but defensible reason for this. 

While colorful, none of this is helpful for describing over-engineered software. Over-engineered software is often over-engineered for incoherent and indefensible reasons.

Over-engineering generally means trying to solve a problem that no user actually has. This leads to throwing around irrelevant features.

Concrete Example

I lived on a boat. I spent a fair amount of time fretting over navigation. 

There are two big questions: 
  1. How far apart are two points, really. 
  2. What's the real bearing from one point to another.
These are -- in some cases -- easy to answer.

If you have a printed, paper chart at the right scale, you can use dividers to compute a distance. It's actually a very easy task. Similarly, you can read the bearing off the chart directly. There's a trick to comparing a course to a nearby compass rose, but it's easy to learn and very accurate.

Of course, we don't want to painstakingly copy our notes from a paper chart to a spreadsheet to add them up to get total distance. And then fold in speed to get time and fuel consumption. These summary computations are a pain.

What you want is to do all of this with a computer.
  1. Plot the points using a piece of software like OpenCPN (https://opencpn.org).
  2. Extract the GPX file.
  3. Compute distances, bearings, and durations to create a route.
"So?" you ask.

So. When I did this, I researched the math and got a grip on the haversine formula for doing the spherical geometry computation of distances between points on a sphere.



For airplanes and powered freighters crossing oceans, this is perfect.

For a small sailboat going from Annapolis, Maryland, to the Bahamas, this level of complexity is craziness. While accurate, it doesn't really solve the problem I have. 

I don't actually need that much accuracy. 

I need this much accuracy.


And no more. This is the essential hypotenuse distance using an R-factor to convert the difference between latitudes and the distance between longitudes into pretty-close distances. For nautical miles, R is 60×180÷π. 

This is simpler and it solves the problem I actually have. Up to about 232 miles, the answer is within 1 mile of correct. The error grows quickly. Double the distance and the error seems to jump to 8 miles. A 464 mile sailing journey (at 6 knots) takes 3 days. Wind, weather, tides and currents will introduce more error than the simplifying assumptions.

What's important is this can be put into a spreadsheet without pain. I don't need to write sophisticated Python apps to apply haversine to sequences of way-points. I can do a simpler hypotenuse computation on waypoints converted to radians.

Is there a lesson learned?

I think there is.

There's the haversine a super-general solution. It handles great-circle routes elegantly. 

But it doesn't solve my actual problem. And that makes it over-engineering.

My problem is what we call rhumb-line sailing. Over short-enough distances the world may as well be flat. Over slightly longer distances, errors in the ship's compass and speedometer make a hyper-accurate great circle route moot. 

(Even with several fancy GPS-based navigation computers, a prudent mariner has paper backups. The list of waypoints, estimated times and directions are essential when the boat's GPS reciever fails.)

I don't really need the sophistication (and the potential for bugs) with haversine. It doesn't solve a problem I actually have.

Tuesday, November 14, 2017

CI/CD DevOps and Python

See https://www.slideshare.net/ITRevolution/does-sfo-2016-topo-pal-devops-at-capital-one for the 16 gates that separate a good idea from secure, productive use of software. While a lot of DevOps folks like the idea, when it comes to implementing it for Python apps, they get confused.

The confusion seems to stem from Python's lack of a proper "build" step in the CI/CD pipeline. I've had the "everything involves a build" argument and the "well setup.py is analogous to a build" arguments. I have to acquiesce to these positions as part of making progress. In this case, reasoning by analogy can be misleading.

I want to focus on the two gates that are part of the code itself, separate from the rest of the pipeline.

  • Static Analysis 
  • >80% Code Coverage (which implies Unit Tests)

Unit Testing

My preference is to run the unit test suite first and get that out of the way. If the unit test suite fails, or fails to cover 80% of the code, any other considerations are moot.

I like Git triggers based on Pull Requests (PR's) and Merge to Master for checking these two conditions. I like the idea that a PR can't be discussed until unit tests pass. They can also be part of whatever other pipeline is going on, but I like them to be done early and often.

(I worked on a sprint team where the PR unit test wasn't trusted by one of the devs: he'd carefully check out the branch and rerun the unit tests. His comments were good, so the extra effort paid off. I guess.)

After flirting with a lot of frameworks, I'm happiest with py.test. I like the py.test-coverage plug-in and the py.test-BDD plug-in.

Yes. We have acceptance tests for our features written in Gherkin. And we have pytest fixtures that are used by pytest-bdd to process the scenarios in the Gherkin feature files. It actually works out nicely because we have a cucumber.json file that makes everyone happy that we've run an acceptance test suite along with our unit test suite.

What's important is the coverage report is painless and automatic.

And it's compatible with the Ruby-based cucumber tool without involving any actual ruby.

For integration testing, we use Behave. This is a bit more cumbersome than pytest-bdd, but it's appropriate for the bigger-picture testing where we have a docker cluster and have to see a number of "Then" steps to confirm operations spread across a suite of microservices.

The goofy question that often leads to endless confusion is the relationship between unit testing and "build." The setup.py setup definition includes a `tests_require` parameter. This *should be* all that's needed to do `python setup.py test`, which *should be* all that's involved in testing.

Is it a "build"? No. But. You can tell the DevOps folks it's a build if it makes them happy.

Static Analysis

There are several kinds of static analysis. Folks who work in Java are used to having Sonar analysis performed. This is above and beyond the static analysis already performed by the compiler. It seems excessive to me, but folks deploying Java seem to like it.

For Python, there are two important static analysis tools. And this is another source of profound confusion for DevOps folks new to Python.

I like to extract the last line of the pylint output and use that numeric score as the "bottom line" on static analysis part 1. While the default setting is 9.5, that can be a challenge, and we prefer 9.0 as well as some local pylintrc modifications to modify some checkers (e.g., set line length to 120.)

For mypy, it's a little bit more complex. We're still fumbling around here.

Ideally, the type hints are all clean and mypy has nothing to say. We can, of course, fix any errors by claiming everything uses Any and returns Any and every assignment statement sets an Any value. But that's so wrong.

There are (still) modules which require typeshed stub definitions. Ideally, we'd provide these. This would be better than using Any as a hack-around. While good, it's a lot of work.

For now, I think it's sensible to have two "pass" rules for mypy: clean or typeshed error. If mypy is silent, that's perfect. If mypy can't find stubs in typeshed, we can let this go for now and log an issue from the CI/CD pipeline to note the presence of technical debt.

In the best of all worlds, we'd fork the package, fix the type hints, and put in a PR. That's a lot better than using typeshed to work around the lack of hints. 

And, of course, there's the "build" question. For mypy to work, the dependencies (or their typeshed stubs) must be present. We wind up doing a `python setup.py install` to build out the requirements. Is this a "build"? Maybe. You can tell the DevOps folks it's a build if it makes them happy.

If you want idempotent server (or container) builds, you'll need to be sure that you pin specific versions. It can help to break this into two parts:
  • a requirements.txt with specific versions 
  • a generic version-free high-level list in setup.py
The reason for this separation is to make it easy to do a `pip install` or `conda create` from the detailed requirements. Once that's out of the way, the `python setup.py` will run very quickly. If you're working with Docker containers, the `pip install` (or `conda create`) can be part of the Dockerfile, and then tests or static analysis can be run separately, after the initial wave of installations.

Tuesday, November 7, 2017

Python Type Hinting -- generally easy until you find your design flaws

Adding type hints is easy and fun. Seriously. It's not a lot of work.

Until.

Until you find a piece of code that does more than what you sort-of thought it kind-of did.

def null_aware_func(x):
    if x is None:
         return x
    return 2.2*x**1.05

This is a stab at a none-aware computation.

Let's add type hints, shall we?

def null_aware_func(x: float) -> float:
    if x is None:
        return None
    return 2.2*x**1.05

This won't fool mypy. Sigh. It passes unit tests, but it's flagged as a problem.

We have a variety of ways of define this function. And that means we need to think carefully about our None-aware design.

Is this really an @overload?

from typing import overload
@overload
def null_aware_func(x: None) -> None:
    ...
def null_aware_func(x: float) -> float:
    if x is None:
        return None
    return 2.2*x**1.05

And yes, the ... is legit Python syntax. (It's a rarely used token that forms the body of the function.)

Or is this a more advanced type?

from typing import Optional
OptFloat = Optional[float]

def null_aware_func(x: OptFloat) -> OptFloat:
    if x is None:
        return None
    return 2.2*x**1.05

I'd argue that OptFloat is a more sensible definition. However, if this is the only function that's none-aware, perhaps it's an overload.

The deeper question is one of underlying meaning. Why are we doing this? What does it mean?

And. Bonus. Will this be working in a SQLAlchemy environment, where they have their own wrappers for database objects, meaning that `is None` doesn't work and `== None` is required?

What's important is that adding type hints forced us to think about what we were doing. Unlike Java we did this without stopping progress for an extended period of "wrestling with the compiler". We can use Any temporarily because the unit tests all pass. Then, we can pay down the technical debt by fixing the type declaration.

Total. Victory.

Tuesday, October 10, 2017

Python Exercises

https://www.ynonperek.com/2017/09/21/python-exercises/amp/

This seems very cool. These look like some pretty cool problems. It includes debugging and unit testing, so there's a lot of core skills covered by these exercises.

Tuesday, July 18, 2017

Yet Another Python Problem List

This was a cool thing to see in my Twitter feed:

Dan Bader (@dbader_org)
"Why Python Is Not My Favorite Language" zenhack.net/2016/12/25/why…

More Problems with Python. Here's the short list.

1. Encapsulation (Inheritance, really.)
2. With Statement
3. Decorators
4. Duck Typing (and Documentation)
5. Types

I like these kinds of posts because they surface problems that are way, way out at the fringes of Python. What's important to me is that most of the language is fine, but the syntaxes for a few things are sometimes irksome. Also important to me is that it's almost never the deeper semantics; it seems to be entirely a matter of syntax.

The really big problem is people who take the presence of a list like this as a reason dismiss Python in its entirety because they found a few blog posts identifying specific enhancements. That "Python must be bad because people are proposing improvements" is madding. And dismayingly common.

Even in a Python-heavy workplace, there are Java and Node.js people who have opinions shaped by little lists like these. The "semantic whitespace" argument coming from JavaScript people is ludicrous, but there they are: JavaScript has a murky relationship with semi-colons and they're complaining about whitespace. Minifying isn't a virtue. It's a hack. Really.

My point in general is not to say this list is wrong. It's to say that these points are minor. In many cases, I don't disagree that these can be seen as problems. But I don't think they're toweringly important.

1. The body of first point seems to be more about inheritance and accidentally overiding something that shouldn't have been overridden. Java (and C++) folks like to use private for this. Python lets you read the source. I vote for reading the source.

2. Yep. There are other ways to do this. Clever approach. I still prefer with statements.

3. I'm not sold on the syntax change being super helpful.

4. People write bad documentation about their duck types. Good point. People need to be more clear.

5. Agree. A lot of projects need to implement type hints to make it more useful.

Tuesday, July 11, 2017

Extracting Data Subsets and Design By Composition

The request was murky. It evolved over time to this:
Create a function file_record_selection(train.csv, 2, 100, train_2_100.csv)
First parameter: input file name (train.csv)
Second parameter: first record to include (2)
Third parameter: last record to include (100)
Fourth parameter: output file name (train_2_100.csv)
Fundamentally, this is a bad way to think about things. I want to cover some superficial problems first, though.

First superficial dig. It evolved to this. In fairness to people without a technical background, getting to tight, implementable requirements are is difficult. Sadly the first hand-waving garbage was from a DBA. It evolved to this. The early drafts made no sense.

Second superficial whining. The specification -- as written -- is extraordinarily shabby. This seems to be written by someone who's never read a function definition in the Python documentation before. Something I know is not the case. How can someone who is marginally able to code also unable to write a description of a function? In this case, the "marginally able to code" may be a hint that some folks struggle with abstraction: the world is a lot of unique details; patterns don't emerge from related details.

Third. Starting from record 2, seems to show that they don't get the idea that indexes start with zero. They've seen Python. They've written code. They've posted code to the web for comments. And they are still baffled by the start value of indices.

Let's move on to the more interesting topic, functional composition. 

Functional Composition

The actual data file is a .GZ archive. So there's a tiny problem with looking at .CSV extracts from the gzip. Specifically, we're exploding a file all over the hard drive for no real benefit. It's often faster to read the zipped file: it may involve fewer physical I/O operations. The .GZ is small; the computation overhead to decompress may be less than the time waiting for I/O.

To get to functional composition we have to start by decomposing the problem. Then we can build the solution from the pieces. To do this, we'll borrow the interface segregation (ISP) design principle from OO programming.

Here's an application of ISP: Avoid Persistence. It's easier to add persistence than to remove it. This leads peeling off three further tiers of file processing: Physical Format, Logical Layout, and Essential Entities.

We shouldn't write a .CSV file unless it's somehow required. For example, if there are multiple clients for a subset. In this case, the problem domain is exploratory data analysis (EDA) and saving .CSV subsets is unlikely to be helpful. The principle still applies: don't start with persistence in mind. What are the Essential Entities?

This leads away from trying to work with filenames, also. It's better to work with files. And we shouldn't work with file names as strings, we should use pathlib.Path. All consequences of peeling off layers from the interfaces.

Replacing names with files means the overall function is really this. A composition. 

file_record_selection = (lambda source, start, stop, target: 
    file_write(target, file_read_selection(source, start, stop))
)

We applied the ISP again, to avoid opening a named .CSV file. We can work with an open file-like objects, instead of a file names. This doesn't change the overall form of the functions, but it changes the types. Here are the two functions that are part of the composition:

from typing import *
import typing
Record = Any
def file_write(target: typing.TextIO, records: Iterable[Record]):
    pass
def file_read_selection(source: csv.DictReader, start: int, stop: int) -> Iterable[Record]:
    pass

We've left the record type unspecified, mostly because we don't know what it just yet. The definition of Record reflects the Essential Entities, and we'll defer that decision until later. CSV readers can produce either dictionaries or lists, so it's not a complex decision; but we can defer it.

The .GZ processing defines the physical format. The content which was zipped was a .CSV file, which defines the logical layout.

Separating physical format, logical layout, and essential entity, gets us code like the following:

with gzip.open('file.gz') as source:
    reader = csv.DictReader(source)  # Iterator[Record]
    for line in file_read_selection(reader, start, stop):
        print(line)

We've opened the .GZ for reading. Wrapped a CSV parser around that. Wrapped our selection filter around that. We didn't write the CSV output because -- actually -- that's not required. The core requirement was to examine the input.

We can, if we want, provide two variations of the file_write() function and use a composition like the file_record_selection() function with the write-to-a-file and print-to-the-console variants. Pragmatically, the print-to-the-console is all we really need.

In the above example, the Record type can be formalized as  List[Text].  If we want to use csv.DictReader instead, then the Record type becomes Dict[Text, Text].

Further Decomposition

There's a further level of decomposition: the essential design pattern is Pagination. In Python parlance, it's a slice operation. We could use itertools to replace the entirety of file_read_selection() with itertools.takewhile() and itertools.dropwhile(). The problem with these methods is they don't short-circuit: they read the entire file.

In this instance, it's helpful to have something like this for paginating an iterable with a start and stop value.

for n, r in enumerate(reader):
    if n < start: continue
    if n = stop: break
    yield r

This covers the bases with a short-circuit design that saves a little bit of time when looking at the first few records of a file. It's not great for looking at the last few records, however. Currently, the "tail" use case doesn't seem to be relevant. If it was, we might want to create an index of the line offsets to allow arbitrary access. Or use a simple buffer of the required size.

If we were really ambitious, we'd use the Slice class definition to make it easy to specify start, stop, and step values. This would allow us to pick every 8th item from the file without too much trouble.

The Slice class doesn't, however support selection of a randomized subset. What we really want is a paginator like this:

def paginator(iterable, start: int, stop: int, selection: Callable[[int], bool]):
    for n, r in enumerate(iterable):
        if n < start: continue
        if n == stop: break
        if selection(n): yield r

file_read_selection = lambda source, start, stop: paginator(source, start, stop, lambda n: True)

file_read_slice = lambda source, start, stop, step: paginator(source, start, stop, lambda n: n%step == 0)

The required file_read_selection() is built from smaller pieces. This function, in turn, is used to build file_record_selection() via functional composition. We can use this for randomized selection, also.

Here are functions with type hints instead of lambdas.

def file_read_selection(source: csv.DictReader, start: int, stop: int) -> Iterable[Record]:
    return paginator(source, start, stop, lambda n: True)

def file_read_slice(source: csv.DictReader, start: int, stop: int, step: int)  -> Iterable[Record]:
    return paginator(source, start, stop, lambda n: n%step == 0)

Specifying type for a generic iterable and the matching result iterable seems to require a type variable like this:

T = TypeVar('T')
def paginator(iterable: Iterable[T], ...) -> Iterable[T]:

This type hint suggests we can make wide reuse of this function. That's a pleasant side-effect of functional composition. Reuse can stem from stripping away the various interface details to decompose the problem to essential elements.

TL;DR

What's essential here is Design By Composition. And decomposition to make that possible.

We got there by stepping away from file names to file objects. We segregated Physical Format and Logical Layout, also. Each application of the Interface Segregation Principle leads to further decomposition. We unbundled the pagination from the file I/O. We have a number of smaller functions. The original feature is built from a composition of functions.

Each function can be comfortably tested as a separate unit. Each function can be reused.

Changing the features is a matter of changing the combination of functions. This can mean adding new functions and creating new combinations. 

Tuesday, July 4, 2017

Python and Performance

Real Question:

One of the standard problems that keeps coming up over and over is the parsing of url's. A sub-problem is the parsing of domain and sub-domains and getting a count.

For example


It would be nice to parse the received file and get counts like

.com had 15,323 count
.google.com had 62 count
.theatlantic.com had 33 count

The first code snippet would be in Python and the other code snippet would be in C/C++ to optimize for performance.

---------

Yes. They did not even try to look in the standard library for urllib.parse. The general problem has already been solved; it can be exploited in a single line of code.

The line can be long-ish, so it can help to use a lambda to make it a little easier to read. The code is below.

The C/C++ point about "optimize for performance" bothers me to no end. Python isn't very slow. Optimization isn't required.

I made 16,000 URL's. These were not utterly random strings, they were random URL's using a pool of 100 distinct names. This provides some lumpiness to the data. Not real lumpiness where there's a long tail of 1-time-only names. But enough to exercise collections.Counter and urllib.parse.urlparse().

Here's what I found. Time to parse 16,000 URLs and pluck out the last two levels of the name?

CPU times: user 154 ms, sys: 2.18 ms, total: 156 ms
Wall time: 157 ms

32,000?

CPU times: user 295 ms, sys: 6.87 ms, total: 302 ms
Wall time: 318 ms

At that pace, why use C?

I suppose one could demand more speed just to demand more speed.

Here's some code that can be further optimized.

top = lambda netloc: '.'.join(netloc.split('.')[-2:])
random_counts = Counter(top(urllib.parse.urlparse(x).netloc) for x in random_urls_32k)

The slow part of this is the top() function. Using rsplit('.', maxsplit=2) might be better than split('.'). A smarter approach might be find all the "." and slice the substring from the next-to-last one. Something like this, netloc[findall('.', netloc)[-2]:], assuming a findall() function that returns the locations of all '.' in a string.

Of course, if there is a problem, using a numpy structure might speed things up. Or use dask to farm the work out to multiple threads.

Wednesday, April 19, 2017

AWS "Serverless" Architecture Update -- Python 3.6 News

This: https://aws.amazon.com/about-aws/whats-new/2017/04/aws-lambda-supports-python-3-6/

You can now use Python 3.6 Lambdas. This changes things dramatically. We can now write faster, simpler, less quirky code using the latest-and-greatest Python.

If you want to configure a server in the cloud, consider this: https://wiki.ubuntu.com/Python. Use Ubuntu as the base image. Faster. Cleaner. Less Quirky.

Tuesday, March 28, 2017

Linked-in Learning: Migrating from Python 2.7 to Python 3

Migrating from Python 2.7 to Python 3
with: Steven Lott

...is now available on LinkedIn Learning:
https://www.linkedin.com/learning/migrating-from-python-2-7-to-python-3

and on Lynda.com:
https://www.lynda.com/Python-tutorials/Migrating-from-Python-2-7-Python-3/560887-2.html

Course Description:
Are you still using Python 2.7? If you've been meaning to make the jump to Python 3, but aren't entirely sure what's different in the latest version of this popular language—or how to migrate your code—then this course is for you. Instructor Steven Lott illuminates the differences between the two Python versions, going over elements such as changes to built-in Python functions and the Python standard library. He also walks through a number of ways to convert your Python 2.7 applications to Python 3, showing how to use packages like six, future, and 2to3. Along the way, Steven shares his own experiences with this transition, and offers helpful suggestions for enhancing the overall quality and performance of your code.

Topics Include:
Reviewing the differences between the two versions
Reviewing the syntax changes introduced with Python 3
Understanding the changes to built-in functions
Reviewing the most important changes to the Python standard library
Understanding which tools are required to migrate from Python 2.7 to Python 3
Using six to handle class definitions
Using six with standard library changes
Using future
Making syntax changes and class changes with futurize
Using 2to3 or modernize

Duration:
2h 40m

Tuesday, November 22, 2016

The Modern Python Cookbook

See https://www.packtpub.com/application-development/modern-python-cookbook

This is a large (!) collection of recipes, focused on Python 3, exclusively.

It's much easier to write about the version of Python I actually use each day, and leave the old, quirky, slow Python 2 behind. This book doesn't have any "this will be different in Python 2" warnings. Those days seem to have passed. Finally.

The clock is counting down. https://pythonclock.org


Tuesday, March 8, 2016

The Composite Builder Pattern, an Example of Declarative Programming [Update]

I'm calling this the Composite Builder pattern. This may have other names, but I haven't seen them. It could simply be lack of research into prior art. I suspect this isn't very new. But I thought it was cool way to do some declarative Python programming.

Here's the concept.

class TheCompositeThing(Builder):
    attribute1 = SomeItem("arg0")
    attribute2 = AnotherItem("arg1")
    more_attributes = MoreItems("more args")

The idea is that when we create an instance of TheCompositeThing, we get a complex object, built from various data sources.  We want to use this in the following kind of context:

with some_config_path.open() as config:
    the_thing = TheCompositeThing().substitute(config)
print(json.dumps(the_thing))

We want to open some configuration file -- something that's unique to an environment -- and populate the complex object in one smooth motion. Once we have the complex object, it can then be used in some way, perhaps serialized as a JSON or YAML document.

Each Item has a get() method that accepts the configuration as input. These do some computation to return a useful result. In some cases, the computation is kind of degenerate case:

class LiteralItem(Item):
    def __init__(self, value):
        self.value = value
    def get(self, config):
        return self.value

This shows how we jam a literal value into the output. Other values might involve elaborate computations, or lookups in the configuration, or a combination of the two.

Why Use a Declarative Style?

This declarative style can be handy when each of the Items in TheCompositeThing involves rather complex, but completely independent computations. There's no dependency here, so the substitute() method can fill in the attributes in any order. Or -- perhaps -- not fill the attributes until they're actually requested. This pattern allows eager or lazy calculation of the attributes.

This pattern applies to building complex AWS Cloud Formation Templates as an example. We often need to make a global tweak to a large number of templates so that we can rebuild a server farm. There's little or no dependency among the Item values being filled in. There's no strange "ripple effect" of a change in one place also showing up in another place because of an obscure dependency between items.

We can extend this to have a kind of pipeline with each stage created in a declarative style. In this more complex situation, we'll have several tiers of Items that fill in the composite object. The first-stage Items depend on one source. The second stage Items depend on the first-stage Items.

class Stage1(Builder):
    item_1 = Stage_1_Item("arg")
    item_2 = Stage_1_More("another")

class Stage2(Builder):
    item_a = Stage_2_Item("some_arg")
    item_b = Stage_2_Another(355, 113)

We can then create a Stage1 object from external configuration or inputs. We can create the derived Stage2 object from the Stage1 object.

And yes. This seems like useless metaprogramming.  We could -- more simply -- do something like this::

class Stage2:
    def __init__(self, stage_1, config):
        self.item_a = Stage_2_Item("some_arg", stage_1, config)
        self.item_b = Stage_2_Another(355, 113, stage_1, config)

We've eagerly computed the attributes during __init__() processing.

Or perhaps this::

class Stage2:
    def __init__(self, stage_1, config):
        self.stage_1= stage_1
        self.config= config
    @property
    def item_a(self):
        return Stage_2_Item("some_arg", self.stage_1, self.config)
    @property
    def item_b(self):
        return Stage_2_Another(355, 113, self.stage_1, self.config)

Here we've been lazy and only computed attribute values as they are requested.

Benefits

We've looked at three ways to build composite objects:
  1. As independent attributes with an flexible but terse implementation.
  2. As attributes during __init__() using sequential code that doesn't assure independence.
  3. As properties using wordy code. 
What's the value proposition? Why is this declarative technique interesting?

I find that the the Declarative Builder pattern is handy because it gives me the following benefits.
  • The attributes must be built independently. We can -- without a second thought -- rearrange the attributes and not worry about one calculation interfering with another attribute. 
  • The attributes can be built eagerly or lazily. Details don't matter. We don't expose the implementation details via __init__ or @property.
  • The class definition becomes a configuration item. A support technician without deep Python knowledge can edit the definition of TheCompositeThing successfully.
I think this kind of lazy, declarative programming is useful for some applications. It's ideal in those cases where we need to isolate a number of computations from each other to allow the software to evolve without breaking.

It may be a stretch, but I think this shows the Depedency Inversion Principle. To an extent, we've moved all of the dependencies to the visible list of attributes within these classes. The items classes do not depend on each other; they depend on configuration or perhaps previous stage composite objects. Since there are no methods involved in the class defintion, we can change the class freely. Each subclass of Builder is more like a configuration item than it is like code. In Python, particularly, we can change the class freely without the agony of a rebuild.

A Build Implementation

We're reluctant to provide a concrete implementation for the above examples because it could go anywhere. It could be done eagerly or lazily. One choice for a lazy implementation is to use a substitute() method. Another choice is to use the __init__() method.

We might do something like this:

def substitute(self, config):
    class_dict= self.__class__.__dict__
    for name in class_dict:
        if name.startswith('__') and name.endswith('__'): continue
        setattr(self, name, class_dict[name].get(config))

This allows us to lazily build the composite object by stepping through the dictionary defined at the class level and filling in values for each item. This could be done via __getattr__() also.

Tuesday, March 1, 2016

Dexy and word-processing toolchains

See http://www.dexy.it

Wow. That seems cool.

I write. A lot.

I've tried a lot of tool chains. A lot. And I mean non-trivial "try". Whole books.  Hundreds of pages.

LEO + my own HTML Templates. A lot of fun at first.  An outliner that generates RST is a very, very handy thing for technical writing.

An XML editor (I forget which one. Maybe http://www.xmlmind.com/xmleditor/?) with the DocBook XML and XSLT tool-chain. This produced HTML from the XML. I think it could also produce LaTeX. Again, the outlining and structuring were kind of handy. What was particularly cool was the diverse semantic markup tags available in DocBook. Getting the tag containment right was a large pain even with a handy GUI editor.

Plain Text using RST "the hard way;" i.e., without LEO. This isn't too bad, it turns out. The outlining features of LEO -- while fun -- aren't essential. A simple RST toolchain is easy to concoct. I used SCONS to rebuild HTML and LaTeX from the RST.

LaTeX. Once I had the base LaTeX from RST, I could then edit that to produce an even richer document using lots of LaTeX add-on packages. I use MacTex and it is truly great. The downside of LaTex -- for me -- was no trivial way to go back to HTML from the LaTeX.  There are some complex back-and-forth toolchains, but it's easier to just produce a PDF. PDF looks great, but wasn't really my goal.

RST with Sphinx. Wow. This is elegant. I often produce chapter drafts here, and then copy and paste the HTML into the word processors preferred by the publishing industry.

[They insist on applying their goofy markup templates to the text. It's a subset of meaningful semantic markup used by Sphinx, but somehow their toolchain must start with .DOCX files, and nothing else will do.]

Dexy was cool right up until I read this: "Dexy is a Python package (Python 2.6-2.7 only)".

Ouch. Web.py Utilities include DBUtils which won't install under Python 3.5. So that put the kibosh on Dexy. Sadface.

Tuesday, January 19, 2016

SQL Hegemony -- the "Pivot Table" problem

As far as I can tell, the Pivot Table Problem™ only exists for people who have actively put on blinders so that they can only see data one way.

This leads to the following.

The context appears to be millions of rows of data. Hundreds of columns.  It appears that someone we'll call DesKtop tried to load a spreadsheet to "pivot" the data. And the spreadsheet -- of course -- breaks because it's too much data.

DesKtop then calls the DBA.

DesKtop: "We need to load a table and use the database to create a spread-sheet like pivot table."

DBA: "Okay. Cool. It's complex SQL, though. Check this out..." They look at the Oracle 11g PIVOT and UNPIVOT.

DesKtop: "Oh my. That's really complex. Okay, I guess that's the only choice, right?"

DBA: "Right, it is our only possible choice."

Wrong.

[I heard about this from DBA who sent me a "humble brag" about something that can only be done with hyper-complex SQL query. DBA had found a Python tutorial on Pandas that mentioned pivoting. The humble brag point was this: the Python stuff was just as hyper-complex as the SQL. Apparently, DBA conflated the entire tutorial with the one line of code that was the pivot example.]

If you restructure the data into (row key, column key, cell value) triples, you don't have a Pivot Table Problem™ any more. You have a SELECT reduction() GROUP BY row vs. SELECT reduction GROUP BY column kind of query. There's no "pivot". Maybe it's a conceptual pivot but there's no hyper-complex SQL.

It requires a non-trivial loader to transform data that's in row order and explode it into triples. This isn't the kind of thing a program like Oracle's SQL*Loader or other bulk loader does particularly well. In Python (without using Pandas) we can expand the data into triples like this:

for row in reader:
    for column in column_names:
        new_row = row['key'], column, row[column]
        insert...

The idea here is that we're using something like a csv DictReader. We have a list of column names we'd like to pivot. In many row-oriented data sets, there are columns we might like to ignore. For example, the row key column itself shouldn't be exploded into a (row key, column key, row key) triple.

This restructuring idea applies in full force to doing Python-based reduction of the data. Forget loading a database in the first place.

by_column = defaultdict(list)
for row in reader:
    for column in column_names:
        by_column[column].append(row[column])

We've summarized each column's data into a list of values. This is the "GROUP BY" part of the SQL. Now we can do reductions on the values in each column-based list.

from statistics import mean
for column in by_column:
    print(column, sum(by_column[column]), mean(by_column[column]))

We've done sum and mean reductions on the values in each column. We can -- of course -- layer in mapping and filtering if that's required.

This works well for millions of individual cells of data. We can comfortably hold several hundred million individual values in memory in a 32Gb desktop computer. You may notice the fan kicks on when this is running.

If this turns out to require too much storage, then the reductions can be computed item-by-item rather than simply accumulating a list of values. This a hair more complex, but not in an interesting way.

sum_by_column = defaultdict(int)
count_by_column = defaultdict(int)

for row in reader:
    for column in column_names:
        sum_by_column[column] += row[column]
        count_by_column[column] += 1

Minima and Maxima are a trifle trickier. We don't want to initialize them to None and have an if current_min is None statement executed millions of times. We have to create an iterator and process the first row specially, using it to initialize all of the values. The remaining rows can then be processed free of any initialization question.

row_iter= iter(reader)
first = next(row_iter)
for column in column_names:
    min_by_column[column]= row[column]
    max_by_column[column]= row[column]
for row in row_iter:
    for column in column_names:
        min_by_column[column] = min( min_by_column[column], row[column])
        max_by_column[column] = max(max_by_column[column], row[column])

I like to call this the Head-Tail design pattern.

The DBA and DesKtop appear to be married to SQL. Even when it appears to be an ineffective solution to their problem.
     


Tuesday, November 24, 2015

Navigation: Latitude, Longitude, Haversine, and all that

For a few years, I was a tech nomad. See Team Red Cruising for some stories of life on a sailboat. Warning: it's pretty dull.

As a tech nomad, I lived and died (literally) by my ability to navigate. Modern GPS devices make the dying part relatively unlikely. So, let's not oversell the danger aspect of this.

The prudent mariner plans a long voyage with a great deal of respect for the many things which can go wrong. One aspect of this is to create a "Float Plan". Read more about it here: http://floatplancentral.cgaux.org.

The idea is to create a summary of the voyage, provide that summary to trusted shore crew, and then check in periodically so that the shore crew can confirm that you're making progress safely. Failure to check in is an indicator of a problem, and action needs to be taken. We use a SPOT Messenger to check in at noon (and sometimes at waypoints.)

Creating a float plan involved an extract of the waypoints from our navigation software (GPS NavX). I would enrich the list of waypoints with estimated travel time between the points.  Folding in a departure time would lead to a schedule that could be tracked. I also include some navigation hints in the form of a bearing between waypoints so we know which way to steer to find the next point.

The travel time is the distance (in  nautical miles) coupled with an assumption about speed (5 knots.) It's a really simple thing. But the core haversine calculation is not a first-class part of any spreadsheet app. Because of the degrees-to-radians conversions required, and the common practice of annotating degrees with a lot of internal punctuation (38°54ʹ57″ 077°13ʹ36″), it becomes right awkward to simply implement this as a spreadsheet.

Some clever software has a good planning mode. The chartplotter on the boat can do a respectable job of estimating time between waypoints. But. It's not connected to a computer or the internet. So we can't upload that information in the form of a float plan. The idea of copying the data from the chart plotter to a spreadsheet is fraught with errors.

Navtools

Enter navtools. This is a library that I use to transform a route into a .csv with distances and bearings that I can use to create a useful float plan. I can add an estimated arrival time calculation so that a change to departure time creates the entire check-in schedule.

This isn't a sophisticated GUI app. It's just enough software to transform a GPS NavX extract file into a more useful form. The GUI was a spreadsheet (i.e., Numbers.) From this we created a PDF with the details.

Practically, we don't have good connectivity on the boat.  So we would create a number of alternative plans ("leave tomorrow", "leave the day after", "leave next Monday", etc.) we would go ashore, find a coffee shop, and email the various plans to ourselves. They could sit in our inbox, waiting for weather and tide to be favorable.

Then, when the weather and tides were finally aligned, we could forward the relevant details to our trusted shore crew. This was a quick spurt of cell phone connectivity to forward an email. It worked out well. When the scheduled departure time arrived, we'd coax Mr. Lehman to life, raise the anchor and away.

Literate Programming

This is an exercise in literate programming. The code that's executed and the HTML documentation are both derived from source ReStructured Text (RST) documents. The documentation for the navigation module includes the math along with the code that implements the math.

I have to say that I'm enthralled with the intimate connection between requirements, design, and implementation that literate programming embodies.

I'm excited to (finally) publish the thing to GitHub. See https://github.com/slott56/navtools.  I'm looking at some other projects that require the navtools module. What I wind up doing is copying and pasting the navigation calculation module into other projects. I had something like three separate copies on my laptop. It was time to fold all of the features together, delete the clones, and focus on one authoritative copy going forward.

I still have to remove some crufty old code. One step at a time. First, get all the tests to pass. Then expunge the old code. Then make progress on the other projects that leverage the navtools.navigation module.

Tuesday, November 10, 2015

Formatting Strings and the str.format() family of functions -- Python 3.4 Notes

I have to be clear that I am obsessed with the str.format() family of functions. I've happily left the string % operator behind. I recently re-discovered the vars() function.

My current go-to technique for providing debugging information is this:

print( "note: local={local!r}, this={this!r}, that={that!r}".format_map(vars)) )

I find this to be handy and expressive. It can be replaced with logging.debug() without a second thought. I can readily expand what's being dumped because all locals are provided by vars().

I also like this as a quick and dirty starting point for a class:

def __repr__(self):
    return "{__class__.__name__}(**{state!r})".format(__class__=self.__class__, state=vars(self))

This captures the name and state. But. There are nicer things we can do. One of the easiest is to use a helper function to reformat the current state in keyword parameter syntax, like this:

def args(obj):
    return ", ".join( "{k}={v!r}".format(k=k,v=v) for k,v in vars(obj).items())

This allows us to dump an object's state in a slightly nicer format. We can replace vars(self) with args(self) in our __repr__ method. We've dumped the state of an object with very little class-specific code. We can focus on the problem domain without having to wrestle with Python considerations.

Format Specifications

The use of !r for formatting is important. I've (frequently) messed up and used things like :s where data might be None. I've discovered that -- starting in Python 3.4 -- the :s format is unhappy with None objects. Here's the exhaustive enumeration of cases. 

>>> "{0} {1}".format("s",None)
's None'
>>> "{0:s} {1:s}".format("s",None)
Traceback (most recent call last):
  File "", line 1, in 
    "{0:s} {1:s}".format("s",None)
TypeError: non-empty format string passed to object.__format__
>>> "{0!s} {1!s}".format("s",None)
's None'
>>> "{0!r} {1!r}".format("s",None)
"'s' None"

Many things are implicitly converted to strings. This happens in a lot of places. Python is riddled with str() function evaluations. But they aren't everywhere. Python 3.3 had one that was removed for Python 3.4 and up.

Bottom Line: be careful where you use :s formatting.  It may do less than you think it should do.

Wednesday, October 14, 2015

Chapters to Edit: What do I do instead?

I'm starting to get chapters back from the technical reviewers. This is an important part of the writing process: correcting my mistakes and clarifying things that confused the reviewers.

Packt has had a uniformly excellent cadre of technical reviewers. At this point, I've worked with something like a dozen people on four different books. It's been great (for me) to get detailed, specific feedback point by point.

Instead of working on my reviewed chapters, however, I'm browsing. It's Python Week.


I'll get to the chapters Thursday, I think.

Tuesday, October 13, 2015

Wait, there's more Python goodness from Packt

This just in...

Here's a link to the actual Python Week page, with all the deals there for the week: https://www.packtpub.com/packt/offers/pythonweek

They also have a week of free Python books too, which change daily: https://www.packtpub.com/packt/offers/free-learning/

Feel free to ruthlessly export their largess and build your personal technical library.