Moved

Moved. See https://slott56.github.io. All new content goes to the new site. This is a legacy, and will likely be dropped five years after the last post in Jan 2023.

Showing posts with label python. Show all posts
Showing posts with label python. Show all posts

Tuesday, December 20, 2022

Christmas Book Offers

Apple Books

Pivot to Python

A Guide for professionals and skilled beginners

https://books.apple.com/us/book/pivot-to-python/id1586977675 

I've recently updated this to fix some cosmetic problems with title pages, the table of contents and stuff like that. The content hasn't changed. Yet. It's still an introduction to Python for folks who already know how to program, they want to pivot to programming in Python. Quickly.

But wait, there's more. 

Unlearning SQL

When your only tool is a hammer, every problem looks like a nail

https://books.apple.com/us/book/unlearning-sql/id6443164060

Many folks know some Python, but struggle with the architectural balance between writing bulk processing in SQL or writing it in Python. For too many developers, SQL is effectively the only tool they can use. With a variety of tools, it becomes easier to solve a wider variety of problems effectively.

Google Play

Also available on Google Play. Here's Unlearning SQL:

https://play.google.com/store/books/details?id=23WAEAAAQBAJ

I've made a clone of Pivot to Python, also.

https://play.google.com/store/books/details/Steven_F_Lott_Unlearning_SQL?id=23WAEAAAQBAJ&hl=en_US&gl=US

Both books are (intentionally) short to help experts make rapid progress.

Tuesday, November 29, 2022

Functional Programming and Finite State Automata (FSA)

When I talk about functional programming in Python, folks like to look for place where functional programming isn't appropriate. They latch onto finite-state automata (FSA) because "state" of an automata doesn't seem to fit with stateless objects used in functional programming.

This is a false dichotomy. 

It's emphatically false in Python, where we don't have a purely functional language.

(In a purely functional language, monads can help make FSA's behave properly and avoid optimization. The use of a recursion to consume an iterable and make state transitions is sometimes hard to visualize. We don't have these constraints.)

Let's look at a trivial kind of FSA: the parity computation. We want to know how many 1-bits are in a given value. Step 1 is to expand an integer into bits.

def bits(n: int) -> Iterable[int]:
    if n < 0:
        raise ValueError(f"{n} must be >= 0")
    while n > 0:
        n, bit = divmod(n, 2)
        yield bit

This will transform a number into a sequence of bits. (They're in order from LSB to MSB, which is the reverse order of the bin() function.)

>>> list(bits(42))
[0, 1, 0, 1, 0, 1]

Given a sequence of bits, is there an odd number or an even number? This is the parity question. The parity FSA is often depicted like this:

When the parity is in the even state, a 1-bit transitions to the odd state. When the parity is in the odd, a 1-bit transitions to the even state.

Clearly, this demands the State design pattern, right?

An OO Implementation

Here's a detailed OO implementation using the State design pattern.

 
class Parity:
    def signal(self, bit: int) -> "Parity":
        ...


class EvenParity(Parity):
    def signal(self, bit: int) -> Parity:
        if bit % 2 == 1:
            return OddParity()
        else:
            return self


class OddParity(Parity):
    def signal(self, bit: int) -> Parity:
        if bit % 2 == 1:
            return EvenParity()
        else:
            return self


class ParityCheck:
    def __init__(self):
        self.parity = EvenParity()

    def check(self, message: Iterable[int]) -> None:
        for bit in message:
            self.parity = self.parity.signal(bit)

    @property
    def even_parity(self) -> bool:
        return isinstance(self.parity, EvenParity)

Each of the Parity subclasses implements one of the states of the FSA. The lonely signal() method implements state-specific behavior. In this case, it's a transition to another state. In more complex examples it may involve side-effects like updating a mutable data structure to log progress.

This mapping from state to diagram to class is pretty pleasant. Folks really like to implement each state as a distinct class. It somehow feels really solid.

It's import to note the loneliness of the lonely signal() method. It's all by itself in that big, empty class.

Hint. This could be a function.

It's also important to note that this kind of design is subject to odd, unpleasant design tweaks. Ideally, the transition is *only* done by the lonely signal() method. Nothing stops the unscrupulous programmer from putting state transitions in other methods. Sigh.

We'll look at more complex kinds of state transitions later. In the UML state chart diagrams sates may also have entry actions and exit actions, a bit more complex behavior than we we're showing in this example.

A Functional Implementation

What's the alternative? Instead of modeling state as an object with methods for behavior, we can model state as a function. The state is a function that transitions to the next state.

def even(bit: int) -> ParityF:
    if bit % 2 == 1:
        return odd
    else:
        return even


def odd(bit: int) -> ParityF:
    if bit % 2 == 1:
        return even
    else:
        return odd


def parity_check(message: Iterable[int], init: ParityF = None) -> ParityF:
    parity = init or even
    for bit in message:
        parity = parity(bit)
    return parity


def even_parity(p: ParityF) -> bool:
    return p is even

Each state is modeled by a function.

The parity_check() function examines each bit, and applies the current state function (either even() or odd()) to compute the next state, and save this as the vakue of the parity variable.

What's the ParityF type? This:

from typing import Protocol


class ParityF(Protocol):
    def __call__(self, bit: int) -> "ParityF":
        ...

This uses a Protocol to define a type with a recursive cycle in it. It would be more fun to use something like ParityF = Callable[[int], "ParityF"], but that's not (yet) supported.

Some Extensions

What if we need each state to have more attributes?

Python functions have attributes. Like this: even.some_value = 2; odd.some_value = 1. We can add all the attributes we require.

What about other functions that happen on entry to a state or exit from a state? This is trickier. My preference is to use a class as a namespace that contains a number of related functions.

class Even:
    @staticmethod
    def __call__(bit: int) -> ParityF:
        if bit % 2 == 1:
            odd.enter()
            return odd
        else:
            return even
    @staticmethod
    def enter() -> None:
        print("even")

even = Even()

This seems to work out well, and keeps each state-specific material in a single namespace. It uses static methods to follow the same design principle as the previous example -- these are pure functions, collected into the class only to provide a namespace so we can use odd.enter() or even.enter().

TL;DR

The State design pattern isn't required to implement a FSA.

Tuesday, July 26, 2022

Bashing the Bash -- The shell is awful and what you can do about it

A presentation I did recently.

https://github.com/slott56/bashing-the-bash

Folks were polite and didn't have too many questions. I guess they fundamentally agreed: the shell is awful, we can use it for a few things.

Safe Shell Scripts Stay Simple: Set the environment, Start the application.

The Seven S's of shell scripting.

Many many thanks to Code & Supply for hosting me.

Tuesday, March 1, 2022

Static Site Blues

I have a very large, static site with 10+ years of stuff about my boat. Most of it is pretty boring. http://www.itmaybeahack.com/TeamRedCruising/

I started with iWeb. It was very -- well -- 2000-ish look and feel. Too many pastels and lines and borders.

In 2012, I switched to Sandvox. I lived on a boat back then. I don't have reliable internet. Using blogger.com, for example, required a sincere commitment to bandwidth. I moved ashore in 2014 and returned to the boat in 2020.

Sandvox's creator seems to be out-of-business.

What's next?

Give up on these fancy editors and switch to a static site generator. Write markdown. Run the tool. Upload when in a coffee shop with Wi-Fi. 

What site generator?

See https://www.fullstackpython.com/static-site-generator.html for some suggestions.

There are three parts to this effort.

  1. Extract the goodness from iWeb and Sandvox. I knew this would be real work. iWeb's site has too much javascript to be easy-to-parse. I have to navigate the underlying XML database. Sandvox is much easier to deal with: their published site is clean, static HTML with useful classes and ids in their tags.
  2. Reformat the source material into Markdown. I've grudgingly grown to accept Markdown, even through RST is clearly superior. Some tools work with RST and I may pandoc the entire thing over to RST from Markdown. For now, though, the content seems to be captured.
  3. Fixup internal links and cross references. This is a godawful problem. Media links -- in particular -- seem to be a nightmare. Since iWeb resolves things via Javascript, the HTML is opaque.  Fortunately, the database's internal cross-references aren't horrible. Maybe this was exacerbated a poor choice of generators. 
  4. Convert to HTML for a local server. Validate.
  5. Convert to HTML for the target server. Upload to a staging server and validate again. This requires a coffee shop. Not doing this with my phone's data plan.

Steps 1 and 2 aren't too bad. I've extracted serviceable markdown from the iWeb database and the published Sandvox site. The material parallels the Site/Blog/Page structure of the originals. The markdown seems to be mostly error-free. (Some images have the caption in the wrong place, ![caption](link) isn't as memorable as I'd like.) 

Step 3, the internal links and cross-references, has been a difficult problem, it turns out. I can, mostly, associate media with postings. I can also find all the cross-references among postings and fix those up. The question that arises is how to reference media from a blog post?

Mynt

I started with mynt. And had to bail. It's clever and very simple. Too simple for blog posts that have a lot of associated media assets.

The issue is what to write in the markdown to refer to the images that go with a specific blog post. I resorted to a master _Media directory. Which means each posting has ![caption][../../../../_Media/image.png) in it.  This is semi-manageable. But exasperating in bulk. 

What scrambled my brain is the way a mynt posting becomes a directory, with an index.html. Clearly, the media could be adjacent to the index.html. But. I can't figure out how to get mynt's generator to put the media into each post's published directory. It seems like each post should not be a markdown file. 

Also, I can trivially change the base URL when generating, but I can't change the domain. When I publish, I want to swap domains *only*, leaving the base URL alone. I tried. It's too much fooling around.

Pelican

Next up. Pelican. We'll see if I can get my media and blog posts neatly organized. This http://chdoig.github.io/create-pelican-blog.html seems encouraging. I think I should have started here first. Lektor is another possibility.

Since my legacy sites have RSS feeds, it may be sensible to turn Pelican loose on the RSS and (perhaps) skip steps 1, 2, and 3, entirely.


Tuesday, July 6, 2021

A Python Roadmap

An interesting tweet. The  roadmap has three sections. I'm not sure this is actually complete, or even grouped correctly. It is a very good list of topics.

Foundations

I want start by quibbling about variables being first. I'm not sold on this.

I think that operators, expressions, and the built-in immutable types are foundational. int, float, str, and tuple are hugely important as core concepts in computing and Python.

I also think that "loops" is a sketchy notion and I kind of wish we wouldn't describe for and while statements as "loops". I think we should call them iterations. They implement two kinds of logical quantifiers "for all" and "there exists." I think we should talk about the final result of a for statement: all of the values in a range are processed. Similarly a for-if-break construct establishes a "for exists" that defines the first value in a range for which a condition is met. And yes, range objects will be central.

I think that a huge amount of programming can be covered with these topics. I'm not sure "basic" is the right term; foundations might be a better idea. 

The use of variables to manage state is part of this. But. Variables, assignment, and state change are a bit more advanced and maybe shouldn't be first.

I also think function definitions are foundational. Mathematics has been defining functions based on other functions. It's a way of providing a mental short-hand for complex concepts. I don't need to know all the details of how to compute a square root to make use of square root as a concept.

The wide varieties of assignment statements, including assignment to decompose collections aren't mentioned in the original post. This may be an important omission, causing me to quibble on "complete."

I agree that files and elements of File IO are part of this foundation. If we limit ourselves to reading and writing files, then they're essentially immutable structures. I think we can safely avoid update-in-place files because this is an application topic more than a language topic. Python offers the minimal level of support via seek and tell, but little more. And most modern application relies on a database for updatable files.

Data Structures

Moving from basic to intermediate. I prefer the term "data structures" which are built on the language foundations. I think that the mutable built-in data structures come next in the roadmap. My preference is to omit terms like Object-Oriented or Functional, and focus on list, dict, and set, and how the iteration works. This means comprehensions and generators are part of this essential data structure section.

No, comprehensions aren't and shouldn't be called "advanced." They're very much a core concept. Thinking about statements to implement a map/filter/reduce over a collection is the essence of a great deal of programming. We don't always learn it that way, but it needs to be presented in that framework even to beginners. A pile of for and if statements and a bunch of variables is a programmer's first step toward a simpler comprehension. In both cases, they're doing a mapping and it needs to be described as mapping one collection to another collection.

This is where the standard library collections module is introduced.  Yes it's part of the library. I think it's too central to be ignored. I think dataclasses belong here, too.

Talking about the mutable data structures means revisiting the for statement and using it on a variety of iterables. The way Python's concepts apply to a variety of data types is an important feature of the language. (In the olden days, they used to talk about "orthogonality" of data and processing; we don't need to dwell on it, but I think it helps to acknowledge it.)

Functional Programming

It appears to me that the functional programming topics can come next. The idea of functional composition via higher-order functions and decorators builds on the existing foundation. This is where map() and filter() belong. Because of the way sorted(), max(), and min() work on collections with a key= function, these are part of the functional programming roadmap. The inconsistency between map() and functions like max() is an important thing to note.

I also think itertools belongs here. We can make the case that it's in the standard library, but then, so is io. I think itertools and functools are as central to practical Python as the math module and collections.

I think typing.NamedTuple and dataclasses belong here, also. A frozen dataclass is stateless, and can be helpful when creating list comprehensions to perform a mapping from one collection to another collection.

Object-Oriented Programming

I think OO programming and related concepts build on the previous material. Class definitions and state management aren't simple, even though they're essential parts of Python.

To an extent, OO programming can be decomposed into two layers. While I hate to overuse "foundation", there seem to be two parts:

OO Foundations -- inheritance, composition, and different kinds of delegation. This tends to expose a number of common design patterns like Strategy, Decorator, and Facade.

OO Features -- this includes metaprogramming, decorators, ABC's, mixins, and the like. These topics are all designed to avoid copy-and-paste in sophisticated edge cases that cross class boundaries.

Concurrency

I'm not sure why concurrency and parallelism are separate topics in the original list. I've had folks try to split this hair a number of ways. The idea is to find a place where async lives that's "concurrency lite" or something.

The concepts here become blurry because threads and processes are OS features, not language features. The async/await language features, however, are clearly part of Python. It becomes particularly awful when working on something practical where asyncio doesn't provide the feature you need. Specifically, blocking file system I/O isn't part of asyncio and requires an explicit appeal to the underlying thread pool for the blocking operation. 

To an extent, async/await needs to be on the roadmap. It's tricky, though, to cover this without also digressing into threads as a way to deal with blocking operations.

Test, Integration, and Deployment

This is where tools show up. This is where pip, unittest, pytest, tox/nox, coverage, etc. live. Are these part of the language? Or are the part of the broader ecosystem?

I submit they're explicitly not part of the language. The roadmap ends just before this topic. The idea is that we should have a Python roadmap that uses the language and the standard library.

Once we've talked about the language (and some of the library) we can move on to pip and packaging. I don't think pip is and "intermediate" topic. I find that premature introduction of pip is a sign of trying to create useful interesting examples. Examples that don't use pip wind up being kind of boring. Everyone wants to play with pygame and pillow and other kinds of projects, but, those aren't foundational to the language. They're interesting and appealing and -- frankly -- a lot of fun.

tl;dr

I'm not a fan of the roadmap. I like some of it. I don't like some of it.

I am a fan of presenting the idea for discussion.

Tuesday, July 28, 2020

Modern Python Cookbook 2nd ed -- Advance Copies -- DM me

This is your "why wait" invitation.

Advanced copies will be available.  

IF.

And this is a big "if".

You have to write a blurb. 

I'll be putting you in contact with Packt marketing folks who will get you your advanced copy so you can write blurbs and reviews and -- well -- actually use the content.

It's all updated to Python 3.8. Type hints almost everywhere. F-strings and the walrus operator. Bunches of devops and data science examples. Plus a few personal examples involving sailboat navigation and management.

See me at LinkedIn https://www.linkedin.com/in/steven-lott-029835/ and I'll hook you up with Packt marketing folks.

See https://www.amazon.com/Modern-Python-Cookbook-Updated-programmer/dp/180020745X for the official Amazon Book Link. This is for ordinary "no obligation to write a review" orders.

DM me directly slott56 at gmail to be put into the marketing spreadsheet.

Tuesday, June 30, 2020

Over-Solving or Solving Problems You Don't Have

Sometimes we call them "Belt and Braces" solutions. As a former suspenders person who switched to belts, the idea of wearing both is a little like over-engineering. In the unlikely event of catastrophic failure of one system, your pants can still remain properly hoist. There's a weird, but defensible reason for that. Most over-engineering lacks a coherent reason. 

Sometimes we call them "Bells and Whistles." The solution has both bells and whistles for signaling. This is usually used in a derogatory sense of useless noisemakers, there for show only. Again, there's a really low-value and dumb, but defensible reason for this. 

While colorful, none of this is helpful for describing over-engineered software. Over-engineered software is often over-engineered for incoherent and indefensible reasons.

Over-engineering generally means trying to solve a problem that no user actually has. This leads to throwing around irrelevant features.

Concrete Example

I lived on a boat. I spent a fair amount of time fretting over navigation. 

There are two big questions: 
  1. How far apart are two points, really. 
  2. What's the real bearing from one point to another.
These are -- in some cases -- easy to answer.

If you have a printed, paper chart at the right scale, you can use dividers to compute a distance. It's actually a very easy task. Similarly, you can read the bearing off the chart directly. There's a trick to comparing a course to a nearby compass rose, but it's easy to learn and very accurate.

Of course, we don't want to painstakingly copy our notes from a paper chart to a spreadsheet to add them up to get total distance. And then fold in speed to get time and fuel consumption. These summary computations are a pain.

What you want is to do all of this with a computer.
  1. Plot the points using a piece of software like OpenCPN (https://opencpn.org).
  2. Extract the GPX file.
  3. Compute distances, bearings, and durations to create a route.
"So?" you ask.

So. When I did this, I researched the math and got a grip on the haversine formula for doing the spherical geometry computation of distances between points on a sphere.



For airplanes and powered freighters crossing oceans, this is perfect.

For a small sailboat going from Annapolis, Maryland, to the Bahamas, this level of complexity is craziness. While accurate, it doesn't really solve the problem I have. 

I don't actually need that much accuracy. 

I need this much accuracy.


And no more. This is the essential hypotenuse distance using an R-factor to convert the difference between latitudes and the distance between longitudes into pretty-close distances. For nautical miles, R is 60×180÷ฯ€. 

This is simpler and it solves the problem I actually have. Up to about 232 miles, the answer is within 1 mile of correct. The error grows quickly. Double the distance and the error seems to jump to 8 miles. A 464 mile sailing journey (at 6 knots) takes 3 days. Wind, weather, tides and currents will introduce more error than the simplifying assumptions.

What's important is this can be put into a spreadsheet without pain. I don't need to write sophisticated Python apps to apply haversine to sequences of way-points. I can do a simpler hypotenuse computation on waypoints converted to radians.

Is there a lesson learned?

I think there is.

There's the haversine a super-general solution. It handles great-circle routes elegantly. 

But it doesn't solve my actual problem. And that makes it over-engineering.

My problem is what we call rhumb-line sailing. Over short-enough distances the world may as well be flat. Over slightly longer distances, errors in the ship's compass and speedometer make a hyper-accurate great circle route moot. 

(Even with several fancy GPS-based navigation computers, a prudent mariner has paper backups. The list of waypoints, estimated times and directions are essential when the boat's GPS reciever fails.)

I don't really need the sophistication (and the potential for bugs) with haversine. It doesn't solve a problem I actually have.

Tuesday, June 2, 2020

Overcoming Incuriosity -- Sailing Over The Horizon

I'm in regular contact with a few folks who seem remarkably incurious.

Seem.

Perhaps they're curious about something other than software. I don't know.

But I do know they're remarkably incurious about software. And are trying to write Python applications.

I know some people don't sail out of sight of their home port. I've sailed over a few horizons. It's not courage. It's curiosity. And patience. And preparation.

I find this frustrating. I refuse to write their code for them.

But any advice I give them devolves to "Do you have an example?" With the implicit "Which I can copy and paste?"

Even the few who claim they don't want examples, suffer from a paralyzing level of incuriosity. They can't seem to make search work because they never read beyond the first few results on their first attempt. A lot of people seem to be able to make search work; and the incurious folks seem uniquely paralyzed by search.

And it's an attribute I don't understand.

Specific example.

They read through the multiprocessing module until they got to examples with apply_async() and appear to have stopped reading.  They've asked for code reviews on two separate module. Both based on apply_async().

One module was so hopelessly broken it was difficult to make the case that it could never be made to work. There's a way the results of apply_async() have to be consumed, and the code not only did not reflect this, it seemed like they had decided specifically never to consider an alternative. (Spoiler alert, it requires an explicit wait().)

The results were sometimes consumed -- by luck -- and the rest of the time, the app was quirky. It wasn't quirky. It was deplorably wrong. And "reread the apply_async()" advice fell on deaf ears. They couldn't have failed to read the page in the standard library documentation, no, it had to be Python or Windows or me or something.

The other module was a trivial map() application. But. Since apply_async() has an incumbency, there was an amazingly elaborate implementation that amounted to rebuilding apply() or map() with globals and callbacks. This was wrapped by queue processing of Byzantine complexity. The whole mess appeared to stem from an unwillingness to read the documentation past the first example.

What to do?

My current suggestion is to exhaustively enumerate each of the methods for putting work into the processing pool. Write an example of each and every one.

In effect: "Learn the methods by building throw-away code."

I anticipate a series of objections. "Why write throw-away code?" and this one: "That's not realistic, what do you do?"

What do I do?

I write throw-away code.

But that's no substitute for a lack of curiosity.

Tuesday, May 26, 2020

Modern Python Cookbook 2nd ed -- big milestone

Whew.

Chapter rewrites finished.

Technical reviews in process.

Things are going pretty well. Look for Packt to publish this in the next few months. Details will be posted.

Now. For LinkedIn Learning course recordings.

Tuesday, April 21, 2020

Why Python is not the programming language of the future -- a response

See https://towardsdatascience.com/why-python-is-not-the-programming-language-of-the-future-30ddc5339b66.

This is an interesting article with some important points. And. It has some points that I disagree with.

  • Speed. This is a narrow perspective. numpy and pandas are fast, dask is fast. A great many Python ecosystem packages are fast. This complaint seems to be unsupported by evidence.
  • Dynamic Scoping Rules. This actually isn't the problem. The problem is something about not being able to change containing scopes. First, I'm not sure changing nesting scopes is of any value at all. Second, the complaint ignores the global and nonlocal statements. The vague "leads to a lot of confusion" seems unsupported by any evidence. 
  • Lambdas. The distinction between expressions and statements isn't really a distinction in Python in general, only in  the bodies of lambdas. I'm not sure what the real problem is, since a lambda with statements seems like a syntactic nightmare better solved with an ordinary, named function.
  • Whitespace. Sigh. I've worked with many people who get the whitespace right but the {}'s wrong in C++. The code looks great but doesn't work. Python gets it right. The code looks great and works.
  • Mobile App Platform. See https://beeware.org
  • Runtime Errors. "coding error manifests itself at runtime" seems to be the problem. I'm not sure what this means, because lots of programming languages have run-time problems. Here's the quote: "This leads to poor performance, time consumption, and the need for a lot of tests. Like, a lot of tests." Performance? See above. Use numpy. Or Cuda. Time consumption? Not sure what this means. A lot of tests? Yes. Software requires tests. I'm not sure that a compiled language like Rust, Go, or Julia require fewer tests. Indeed, I think the testing is essentially equivalent.
I'm interested in ways Python could be better. 

Tuesday, April 7, 2020

Why Isn't COBOL Dead? Or Why Didn't It Evolve?

Here's part of the question:
Why didn't COBOL evolve more successfully?
FORTRAN, OTOH, has survived precisely because it--and more importantly, related tools, esp compilers--has evolved to solve/overcome many (certainly not all!) of the sorts of pain-points you describe, while retaining the significant performance edge that (IMHO, ICBW) prevents challengers (e.g., Python) from dislodging it for tasks like (e.g.) running dynamical models (esp weather forecasting).
In short, why is FORTRAN still OK? Why is COBOL not still OK?

Actually, I'd venture to say the stories of these languages are essentially identical. They're both used because they have significant legacy implementations.

There's a distinction, that I think might be relevant to the "revulsion factor."

Folks don't find Fortran quite so revolting because it's sequestered into libraries where we don't really have to look at it. It's often wrapped into SciPy. The GCC compiler system handles it and we're happy.

COBOL, however, isn't sequestered into libraries with tidy Python wrappers and Conda installers. COBOL is the engine of enterprise applications.

Also. COBOL is used by organizations that suffer from high amounts of technical inertia, which makes the language a kind of bellwether for the rest of the organization. The organization changes slowly (or not at all) and the language changes at an even more tectonic pace.

This is a consequence of very large organizations with regulatory advantages. Governments, for example, regulate themselves into permanence. Other highly-regulated industries like banks and insurance companies can move slowly and tolerate the stickiness of COBOL.

Also.

For a FORTRAN library function that does something useful, it's not utterly mysterious. There's often a crisp mathematical definition, and a way to test the implementation. There are no quirks.

For a COBOL program that does something required by law, there can still be absolutely opaque mysteries and combinations of features without acceptable unit test cases. This isn't for lack of trying. It's the nature of "application" vs. "subroutine."

The special case and exceptions have to live somewhere. They live in the application.

For FORTRAN, the exceptions are in the Python wrapper using numpy using FORTRAN.

For COBOL, the exceptions are in the COBOL  Somewhere.

The COBOL Problem


It's a tweet, so I know there's no room for depth here.

As it is, it's absolutely correct. Allow me to add to it.

First. Replacing COBOL with something shiny and new is more-or-less impossible. Replacing COBOL is a two-step job.

1. Replace the COBOL with something that's nearly identical but written in a new language. Python. Java. Scala. Whatevs. Language doesn't matter. What matters is the hugeness of this leap.

2. Once the COBOL is gone and the mainframe powered off, then you can rebuild things yet again to create RESTful API's and put many shiny things around it.

Second. Replacing COBOL is essential. Software is a form of knowledge capture. If the language (and tools) have become opaque, then the job of knowledge capture has failed. Languages drift. The audience is in a constant state of flux. New translations are required.

Let's talk about the "Nearly Identical But In A New Language."

Nearly Identical

COBOL code has two large issues in general
  • Data. The file layouts are very hard to work with. I know a lot about this. 
  • Processing. The code has crap implementations of common data structures. I know. I wrote some. There's more, we'll get to it.
We have -- for the most part -- two kinds of COBOL code in common use.
  • Batch processing. Once upon a time, we called it "Programming in the Large." The Z/OS Job Control Language (JCL) was a kind of shell script or AWS Step Function state transition map among applications. This isn't easy to deal with because the overall data flow is not a simple Directed Acyclic Graph (DAG.) It has cycles and state changes.
  • Interactive (once called "on-line") processing. We called it OLTP: On-Line Transaction Processing. There are two common frameworks, CICS and IMS, and both are complicated.
Okay. Big Breath. What do we *DO*?

Here's the free consulting part.

You have to run the new and old side-by-side until you're sick of the errors and poor performance of the old machine.

You have to migrate incrementally, one app at a time.

It's hellishly expensive to positively determine what the COBOL really did. You can't easily do a "clean-room" conversion by writing intermediate specifications. You must read the COBOL and rewrite it into Python (or Java or Scala or whatever.)

You cannot unit test your way to success here, because you never really knew what the COBOL does/did. All you can do is extract example records and use those to build Gherkin-language acceptance tests using a template like this. GIVEN a source document WHEN the app runs THEN the output document matches the example. 

In effect, you're going to do TDD on the COBOL, replacing COBOL with Python essentially 1-for-1 until you have a test suite that passes.

Don't do this alphabetically, BTW. 

The processing graph for COBOL will include three essential design patterns for programs. "Edit" programs validate and possibly merge input files. "Update" programs will apply changes to master files or databases. "Report" programs will produce useful reports and feeds for reporting systems that involve yet more data derivation and merging.

  1. Find the updates. Convert them first. They will involve the most knowledge capture, A/K/A "Business Logic."  There will be a lot of special cases and exceptions. You will find latent bugs that have always been there.
  2. Convert the programs that produce files for the updates, working forward in the graph.
  3. The "reporting" is generally a proper DAG, and should be easier to deal with than the updates and edits. You never know, but the reporting apps are filled with redundancy. Tons of reporting programs are minor variations on each other, often built as copy-pasta from some original text and then patched haphazardly. Most of them can be replaced with a tool to emit CSV files as an interim step.
Each converted application requires two new steps injected into the COBOL batch jobs.
  • Before an update runs, the files are pushed to some place where they can be downloaded.
  • The app runs as it always had. For now.
  • After the update, the results are pushed, also.
This changes merely slow things down with file transfers. It provides fodder for parallel testing.

Then. 

Two changes are made so the job now looks like this.
  • Before an update runs, the files are pushed to some place where they can be downloaded. (No change here.)
  • Kill time polling the file location, waiting for the file to be created externally. (The old app is still around. We could run it if we wanted to.) 
  • After the update, download the results from the external location.
This file-copy-and-parallel-run dance can, of course, be optimized if you take whole streams of edit-update processing and convert them as a whole.

Yes, But, The COBOL Is Complicated

No. It's not.

It's a lot of code working around language limitations. There aren't many design patterns, and they're easy to find.
  1. Read, Validate, Write. The validation is quirky, but generally pretty easy to understand. In the long run, the whole thing is a JSONSchema document. But for now, there may be some data cleansing or transformation steps buried in here.
  2. Merged Reading. Execute the Transaction. Write. The transaction execution updates are super important. These are the state changes in object classes. They're often entangled among bad representations of data. 
  3. Cached Data. A common performance tweak is to read reference data ("Lookups") into an array. This was often hellishly complex because... well... COBOL. It was a Python dict, for the love of God, there's nothing to it. Now. Then. Well. It was tricky.
  4. Accumulators. Running totals and counts were essential for audit purposes. The updates could be hidden anywhere. Anywhere. Not part of the overall purpose, but necessary anyway.
  5. Parameter Processing. This can be quirky. Some applications had a standard dataset with parameters like the as-of-date for the processing. Some applications prompted an operator. Some had other quirky ways of handling the parameters.
The bulk of the code isn't very complex. It's quirky. But not complicated.

The absolute worst applications were summary reports with a hierarchy. We called these "control break" reports. I don't know why. Each level of the hierarchy had its own accumulators. The data had to be properly sorted. It was complicated. 

Do Not Convert these. Find any data cleansing or transformation and simply pour the data into a CSV file and let the users put it into a spreadsheet.

Right now. We have to keep the lights on. COBOL apps have to be kept operational to manage unemployment benefits through the pandemic.

But once we're out of this. We need to get rid of the COBOL.

And we need to recognize that all code expires and we need to plan for expiration. 

Tuesday, January 7, 2020

Patreon Book Idea


See "Additional, Related Content". It's one of the posts here: https://www.patreon.com/slott

I think there's space for a Building Skills in Functional Python title next to the Building Skills in OO Design

Tuesday, December 31, 2019

Christmas Ornament

See https://github.com/slott56/cpx-xmas-ornament

You'll need a Circuit Playground Express https://www.adafruit.com/product/3333

Install the code. Enjoy the noise and blinky lights.

The MML translation isn't as complete as you might like. The upper/lower case for the various commands isn't handled quite as cleanly as it could be. AFAIK, case shouldn't matter, but I omitted any lower() functions, making the  MML parser case sensitive. It only mattered for one of the four songs, and it was easier to edit the song.

The processing leaves a great deal of "clickiness" in the start_tone() processing. I think I know how to address it.

There are barely 96 or so different tones available in MML compositions. It might be possible to generate the wave shapes in advance to have a smoother music experience.

One could image having an off-line translator to transform the MML text into a sequence of bytes with note number and duration. This would slightly compress the song, but would speed up processing by eliminating the overhead of parsing.

Additionally, having 96 wave tables could speed up tone production. The tiny bit of time to recompute the sine wave at a given frequency would be eliminated. But. Memory is limited.

Wednesday, December 4, 2019

Creating Palindromes -- if possible -- from a string of letters.

This can be an interesting exercise. I think it is something that can help people learn to code well. I found this in the LinkedIn Python community:  https://www.linkedin.com/groups/25827/

The Palindrome Problem:
Make a function that makes a palindrome out of the letters in a string and
returns -1 if this is not possible.
Convert a list of strings with the function.
Some test cases:

>>> palify('eedd')
'edde' (or 'deed')
>>> palify('wgerar')
>>> palify('uiuiqii')
'uiiqiiu' or several similar variants.



Let's not get too carried away. I like *some* of this problem.

I don't like the idea of Union[str, int] as a return type from this function. Yes, it's valid Python, but it seems like a code smell. Since the intent is to build lists, a None would be more sensible than a number; we'd have Optional[str] which seems better overall.

The solution that was posted was interesting. It did way too much work, but it was acceptable-looking Python. (It started with a big block comment with "#" on each line instead of a docstring, so... there were minor style problems, but otherwise, it was not bad.)

Here's what popped into my head, to act as a concrete response to the request for comments.

"""
Make a function that makes a palindrome out of the letters in a string and
returns -1 if this is not possible.
Convert a list of strings with the function.
Some test cases:

>>> palify('eedd')
'edde'
>>> palify('wgerar')
>>> palify('uiuiqii')
'uiiqiiu'
"""
from typing import Optional, Set


def palify(source: str) -> Optional[str]:
    """Core palindromic conversion."""
    singletons: Set[str] = set()
    pairs = list()
    for c in source:
        if c in singletons:
            pairs.append(c)
            singletons.remove(c)
        else:
            singletons.add(c)

    if pairs and len(singletons) <= 1:
        # presuming a single letter can't be palindromic.
        return ''.join(pairs+list(singletons)+pairs[::-1])
    return None

if __name__ == "__main__":
    s =  ['eedd', 'wgerar', 'uiuiqii']
    p = list(map(palify, s))
    print(f"from {s=}, we get {p=}")


The core problem statement is interesting. And the ancillary requirement is almost as interesting as the problem.

The simple-seeming "Make a palindrome out of the letters of the string" has two parts. First, there's the question of "can it even become a palindrome"? Which implies validating the source data against some set of rules. After that, we have to emit one of the many possible palindromes from the source material.

The original post had a complicated survey of the data. This was followed by an elegant way of creating a palindrome from the survey data. Since we're looking for a bunch of pairs and a singleton, I elided the more complex survey and opted to collect pairs and singletons into two separate collections.

When we've consumed the input, we will have partitioned the characters into their two pools and we can decide if the pools have the right sizes to proceed. The emission of the palindrome is a lazy assembly of the resulting data, first as a list, and then transformed to a single string.

The ancillary requirement is interesting in its own right. When a bundle of letters can't form a palindrome, that seems like a ValueError exception to me. Doing bulk transformations in the presence of ValueErrors seems wrong-ish. I already griefed about the -1 response above: it seems very bad. A None is less bad than -1. An Exception, however, seems like a more right thing to do.

Code Review Response

I think my response to the original code should be follow-up questions on why a defaultdict(int) was used to survey the data in the first place. A Counter() is a better idea, and requires less code.

The survey involved trying to locate singletons -- a laudable goal. There may have been a better approach to looking for the presence of a singleton letter in the Counter values.

More fundamentally, there are few states for each letter. There are two stark algorithmic choices: a structure keyed by letter or collections of letters. I've shown the collections, and hinted at the collection. The student response used a collection.

I think this problem serves as a good discussion for algorithmic alternatives. The core problem of detecting the possibility of palindromicity for a bunch of letters is cool. There are two choices. The handling of the exceptional case (-1, None or ValueError) is another bundle of choices.

Tuesday, December 3, 2019

Functional programming design pattern: Nested Iterators == Flattening

Here's a functional programming design pattern I uncovered. This may not be news to you, but it was a surprise to me. It cropped up when looking at something that needs parallelization to reduced the elapsed run time.

Consider this data collection process.

for h in some_high_level_collection(arg1):
    for l in h.some_low_level_collection(arg2):
        if some_filter(l):
            logger.info("Processing %s %s", h, l)
            some_function(h, l)

This is pretty common in devops world. You might be looking at all repositories in all github organizations. You might be looking at all keys in all AWS S3 buckets under a specific account. You might be looking at all tables owned by all schemas in a database.

It's helpful -- for the moment -- to stay away from taller tree structures like the file system. Traversing the file system involves recursion, and the pattern is slightly different there. We'll get to it, but what made this clear to me was a "simpler" walk through a two-layer hierarchy. 

The nested for-statements aren't really ideal. We can't apply any itertools techniques here. We can't trivially change this to a multiprocessing.map()

In fact, the more we look at this, the worse it is.

Here's something that's a little easier to work with:

def h_l_iter(arg1, arg2):
    for h in some_high_level_collection(arg1):
        for l in h.some_low_level_collection(arg2):
            if some_filter(l):
                logger.info("Processing %s %s", h, l)
                yield h, l

itertools.starmap(some_function, h_l_iter(arg1, arg2))

The data gathering has expanded to a few more lines of code. It gained a lot of flexibility. Once we have something that can be used with starmap, it can also be used with other itertools functions to do additional processing steps without breaking the loops into horrible pieces.

I think the pattern here is a kind of "Flattened Map" transformation. The initial design, with nested loops wrapping a process wasn't a good plan. A better plan is to think of the nested loops as a way to flatten the two tiers of the hierarchy into a single iterator. Then a mapping can be applied to process each item from that flat iterator.

Extracting the Filter

We can now tease apart the nested loops to expose the filter. In the version above, the body of the h_l_iter() function binds log-writing with the yield. If we take those two apart, we gain the flexibility of being able to change the filter (or the logging) without an awfully complex rewrite.

T = TypeVar('T')
def logging_iter(source: Iterable[T]) -> Iterator[T]:
    for item in source:
        logger.info("Processing %s", item)
        yield item

def h_l_iter(arg1, arg2):
    for h in some_high_level_collection(arg1):
        for l in h.some_low_level_collection(arg2):
            yield h, l

raw_data = h_l_iter(arg1, arg2)
filtered_subset = logging_iter(filter(some_filter, raw_data))
itertools.starmap(some_function, filtered_subset)

Yes, this is still longer, but all of the details are now exposed in a way that lets me change filters without further breakage.

Now, I can introduce various forms of multiprocessing to improve concurrency.

This transformed a hard-wired set of nest loops, if, and function evaluation into a "Flattener" that can be combined with off-the shelf filtering and mapping functions.

I've snuck in a kind of "tee" operation that writes an iterable sequence to a log. This can be injected at any point in the processing.

Logging the entire "item" value isn't really a great idea. Another mapping is required to create sensible log messages from each item. I've left that out to keep this exposition more focused.

I'm sure others have seen this pattern, but it was eye-opening to me.

Full Flattening

The h_l_iter() function is actually a generator expression. A function isn't needed.

h_l_iter = (
    (h, l) 
    for h in some_high_level_collection(arg1) 
        for l in h.some_low_level_collection(arg2)
)

This simplification doesn't add much value, but it seems to be general truth. In Python, it's a small change in syntax and therefore, an easy optimization to make.

What About The File System?

When we're working with some a more deeply-nested structure, like the File System, we'll make a small change. We'll replace the h_l_iter() function with a recursive_walk() function.

def recursive_walk(path: Path) -> Iterator[Path]:
    for item in path.glob():
        if item.is_file():
            yield item
        elif item.is_dir():
            yield from recursive_walk(item)

This function has, effectively the same signature as h_l_iter(). It walks a complex structure yielding a flat sequence of items. The other functions used for filtering, logging, and processing don't change, allowing us to build new features from various combinations of these functions.

tl;dr

The too-long version of this is:

Replace for item in iter: process(item) with map(process, iter).

This pattern works for simple, flat items, nested structures, and even recursively-defined trees. It introduces flexibility with no real cost.

The other pattern in play is:

Any for item in iter: for sub-item in item:  processing is "flattening" a hierarchy into a sequence. Replace it with (sub-item for item in iter for sub-item in item).

These felt like blinding revelations to me.

Tuesday, October 22, 2019

State Change and NoSQL Databases

Let's take another look at F. L. Stevens spreadsheet with agencies and agents. It's -- of course -- an unholy mess. Why? It's difficult to handle state change and deduplication.

Let's look at state changes.

The author needs URL's and names and a list of genres the agent is interested in. This is more-or-less static data. It changes rarely. What changes more often is an agent being closed or open to queries.

Another state change is the query itself. Once the email has been sent, the agent (and their agency) should not be bothered again for at least sixty days. After an explicit rejection, there's little point in making any contact with the agent; they're effectively out of the market for a given manuscript.

There are some other stateful rules, we don't need all the details to see the potential complexities here.

A spreadsheet presents a particularly odious non-solution to the problem of state and state change. There's a good and a bad. Mostly bad.
  • On the good side, you can edit a single cell, changing the state. You can define a drop-down list of states, or radio buttons with alternative states. 
  • The be bad side, you're often limited to editing a single cell when you want to change the state. You want to have dates filled in automatically on state change. You want history of state changes. Excel hackers try to write macros to automate filling in the date. History, however... History is a problem.
We can try to spread history across the row. This rapidly becomes horrifying -- the rows are uneven in length, breaking a First Normal Form rule for spreadsheets.

We can try to spread history down the rows of a column. Wow this is bad. We can try to use the hierarchy features to make history a bunch of folded-up details underneath a heading row. This is microscopically better, but still difficult to manage with all the unfolding and folding required to change state after a rejection.

We can blow up a single cell to have non-atomic data -- all of the history with events and dates in a long, ";" delimited list.

There's no good way to represent this in a spreadsheet.

What to do?

The relational database people love the master-detail relationship. Agency has Agent. Agent has History. The history is a bunch of rows in the history table, with a foreign key relationship with the agent.

The rigidity of the SQL schema is a barrier here. We're dealing with some sloppy data handling practices in the legacy spreadsheet. We don't want to have to tweak the SQL each time we find some new subtlety that's poorly represented in the spreadsheet data.

We're also handling a number of data sources, each with a unique schema. We need a way to unify these flexibly, so we can fold in additional data sources, once the broken spreadsheet is behind us.

(There are a yet more problems with the relational model in general, those are material for a separate blog post. For now, the rigidity and complexity are a big enough pair of problems.)

SQL is Out. What Else?

A document store is pretty nice for this.  The rest of this section is an indictment of SQL. Feel free to skip it. It's widely known, and well supported elsewhere.

We have an Agency as the primary document., Within an Agency, there are a number of individual Agents. Within each agent is a series of Events. Some Agents aren't even interested in the genre F. L. Stevens writes, so they're closed. Some Agents are temporarily closed. The rest are open.

The author can get a list of open agents, following a number of rules, including waiting after the last contact, and avoiding working with multiple agents within a single agency. After sending query letters, the event history gets an entry, and those agents are in another state, query pending.

One common complaint I hear about a document store is the "cost" of updating a large-ish document. The implicit assumption seems to be that an update operation can't locate the relevant sub-document, and can't make incremental changes. Having worked with both SQL and NoSQL, this "cost of document update" seems to be unmeasurably small.

Another cluster command question hovers around locking and concurrency. Most of them nonsensical because they come from the world of fragmented data in a SQL database. When the relevant object (i.e. Agency) is spread over a lot of rows of several tables, locking is essential. When the relevant object is a single document, locks aren't as important. If two people are updating the same document at the same time, that's a document design issue, or a control issue in the application.

Finally, there are questions about "update anomalies." This is a sensible question. In the relational world, we often have shared "lookup" data. A single change to a lookup row will have a ripple effect to all rows using the lookup row's foreign key.

Think of changing zip code 12345 from Schenectady, NY to Scotia, NY. Everyone sharing the foreign key reference via the zip code has been moved with a single update. Except, of course, nothing is visible until a query reconstructs the desired document from the fragmented pieces.

We've traded a rare sweeping updated across many documents for a sweeping, complex join operating to build the relevant document from the normalized pieces. Queries are expensive, complex, and often wrong. They're so painful, we use ORM's to mask the queries and give us the documents we wanted all along.

What's It Look Like?

This:

@dataclass
class Agency:
    """A collection of individual agents."""
    name : str
    url : Optional[str] = field(default=None)
    agents : Dict[str, 'Agent'] = field(init=False, default_factory=dict)

@dataclass
class Agent:
    """An Agent with a sequence of events: actions and state changes."""
    name : str
    url : str
    email : str
    fiction_genres : List[str]
    query_details : str = field(default_factory=str)
    events : List['Event'] = field(init=False, default_factory=list)

@dataclass
class Event:
    """An action or state change.
    status = 'open', 'closed', 'query sent', 'query outcome', 'closed until', etc.

    Depending on the status, there may be additional details.
    For 'query sent', there's 'date'.
    For 'query outcome', there's 'outcome' and an optional 'date'.
    for 'closed until', there's 'reason' and an optional 'date'.
    """
    status : str
    date : Optional[datetime.date] = field(default=None)
    outcome : Optional[str] = field(default=None)
    reason : Optional[str] = field(default=None)

    def __repr__(self):
        return f"{self.status} {self.date} {self.outcome} {self.reason}"


We have three classes here. Agency is the parent document. Each Agency contains one or more Agent instances. Each Agent contains one or more Events.

When we fetch an agent's data, we fetch the entire agency, since the "business" rules preclude querying more than one agent in an agency. The queries involve a nuanced state change: a rejection by one agent, opens another in the same agency.  Rather than do some additional SQL queries to locate the parent and other children of the parent, just read the whole thing at once.

In later posts, we'll look at deduplication and some other processing. But this seems to be all the schema we'll ever need.  The type hints provided mypy some evidence of what we intend to do with these documents.

Tuesday, October 15, 2019

Apple's Numbers and the All-in-One CSV export

Author F. L. Stevens has a hellishly complex (and irregular) spreadsheet with agents, agencies, and query status. (This is how fiction gets marketed: querying agents.) The spreadsheet has become unmanageably complex, with multiple pages. Each page has multiple tables. Buried in this are three "interesting" tables with agent query information.

Can we talk about drama? There is the dark night of the soul for anyone interested in regular, normalized data.

We have some fundamental choices for working with this mess:
  1. Export each relevant table to separate files. Lots of manual pointy-clicky and opportunities for making mistakes.
  2. Export the whole thing to separate files. Less pointy-clicky.
  3. Export the whole thing to one file. About the same pointy-clicky and error vulnerability as #2. But. Simpler still because there's one file to take care of. Something a fiction author should be able to handle.
The all-in-one CSV export is (initially) exasperating. Each table is wrapped in a prefix and suffix.
The prefix is a line with "Sheet: Table" Yes. There's a ": " (colon space) separator. The suffix is a simple blank line, essentially indistinguishable from a blank line within a table.

If the table was originally in strict first normal form (1NF) each row would have the same number of commas. If cells are merged, however, the number of commas can be fewer. This makes it potentially difficult to distinguish blank rows in a table from blank lines between tables.

It's generally easiest to ignore the blank lines entirely. We can distinguish table headers because they're a single cell with a sheet: table format. We are left hoping there aren't any tables that have values that have this format.

We have two ways to walk through the values:
  • Preserving the Sheet, Table, Row hierarchy. We can think of this as the for s in sheet: for t in table: for r in rows structure. The sheet iterator is Iterator[Tuple[str, Table_Iterator]]. The Table_Iterator is similar: Iterator[Tuple[str, Row_Iterator]]. The Row_Iterator, is the most granular Iterator[Dict[str, Any]].
  • Flattening this into a sequence of "(Sheet name, Table Name, Row)" triples. Since a sheet and table have no other attributes beyond a name, this seems advantageous to me.
The hierarchical form requires a number of generator functions for Sheet-from-CSV, Table-from-CSV, and Row-from-CSV. Each of these works with a single underlying iterator over the source file and a fairly complex hand-off of state. If we only use the sheet iterator, the tables and rows are skipped. If we use the table within a sheet, the first table name comes from the header that started a sheet; the table names come from distinct headers until the sheet name changes. 

The table-within-sheet iteration is very tricky. The first table is a simple yield of information gathered by the sheet iterator. Any subsequent tables, however, may be based one one of two conditions: either no rows have been consumed, in which case the table iterator consumes (and ignores) rows; or, all the rows of the table have been consumed and the current row is another "sheet: table" header. 

The code sample below involves a fair amount of repetition. It's not appealing to refactor this because it's ungainly in its complexity, and doesn't create any tangible value. (I haven't even tried to get the type hints right.)

class SheetTable:
    def __init__(self, source_path: Path) -> None:
        self.path: Path = source_path
        self.csv_source = None
        self.rdr = None
        self.header = None
        self.row = None

    def __enter__(self) -> None:
        self.csv_source = self.path.open()
        self.rdr = csv.reader(self.csv_source)
        self.header = None
        self.row = next(self.rdr)
        return self

    def __exit__(self, *args) -> None:
        self.csv_source.close()

    def _sheet_header(self) -> bool:
        return len(self.row) == 1 and ': ' in self.row[0]

    def sheet_iter(self):
        while True:
            while not (self._sheet_header()):
                try:
                    self.row = next(self.rdr)
                except StopIteration:
                    return
            self.sheet, _, self.table = self.row[0].partition(": ")
            self.header = next(self.rdr)
            self.row = next(self.rdr)
            yield self.sheet, self.table_iter()

    def table_iter(self):
        yield self.table, self.row_iter()
        while not (self._sheet_header()):
            try:
                self.row = next(self.rdr)
            except StopIteration:
                return
        next_sheet, _, next_table = self.row[0].partition(": ")
        while next_sheet == self.sheet:
            self.table = next_table
            self.header = next(self.rdr)
            self.row = next(self.rdr)
            yield self.table, self.row_iter()
            while not (self._sheet_header()):
                try:
                    self.row = next(self.rdr)
                except StopIteration:
                    return
            next_sheet, _, next_table = self.row[0].partition(": ")

    def row_iter(self):
        while not self._sheet_header():
            yield dict(zip(self.header, self.row))
            try:
                self.row = next(self.rdr)
            except StopIteration:
                return

Clearly, this is craziness.

Flattening is much nicer.

def sheet_table_iter(source_path: Path) -> Iterator[Tuple[str, str, Dict[str, Any]]]:
    with source_path.open() as csv_source:
        rdr = csv.reader(csv_source)
        header = None
        for row in rdr:
            if len(row) == 0:
                continue
            elif len(row) == 1 and ": " in row[0]:
                sheet, table = row[0].split(": ", maxsplit=1)
                header = next(rdr)
                continue
            else:
                # Inject headers to create dict from row
                yield sheet, table, dict(zip(header, row))


This provides a relatively simple way to find the relevant tables and sheets. We can use something as simple as the following to locate the relevant data.

    for sheet, table, row in sheet_table_iter(source_path):
        if sheet == 'AgentQuery' and table == 'agent_query':
            agent = agent_query_row(database, row)
        elif sheet == 'AAR-2019-03' and table == 'Table 1':
            agent = aar_2019_row(database, row)


This lets us write pleasant functions that handle exactly one row from the source table. We'll have one of these for each target table. In the above example, we've only shown two, you get the idea. Each new source table, with its unique headers can be accommodated.

Tuesday, October 8, 2019

Spreadsheet Regrets

I can't emphasize this enough.

Some people, when confronted with a problem, think
“I know, I'll use a spreadsheet.”   Now they have two problems.

(This was originally about regular expressions. And AWK. See http://regex.info/blog/2006-09-15/247)

Fiction writer F. L. Stevens got a list of literary agents from AAR Online. This became a spreadsheet driving queries for representation. After a bunch of rejections, another query against AAR Online provided a second list of agents.

Apple's Numbers product will readily translate the AAR Online HTML table into a usable spreadsheet table. But after initial success the spreadsheet as tool of choice collapses into a pile of rubble. The spreadsheet data model is hopelessly ineffective for the problem domain.

What is the problem domain?

There are two user stories:
  1. Author needs to deduplicate agents and agencies. It's considered poor form to badger agents with repeated queries for the same title. It's also bad form to query two agents at the same agency. You have to get rejected by one before contacting the other. 
  2. Author needs to track activities at the Agent and Agency level to optimize querying. This mostly involves sending queries and tracking rejections. Ideally, an agent acceptance should lead to notification to other agents that the manuscript is being withdrawn. This is so rare as to not require much automation.
Agents come and go. Periodically, an agent will be closed to queries for some period of time, and then reopen. Their interests vary with the whims of the marketplace they're trying to serve. Traditional fiction publishing is quite complex; agents are the gatekeepers.

To an extent, we can decompose the processing like this. 

1. Sourcing. There are several sources: AAR Online and Agent Query are two big sources. These sites have usable query engines and the HTML can be scraped to get a list of currently active agents with a uniform representation. This is elegant Python and Beautiful Soup. 

2. Deduplication. Agency and Agent deduplication is central. Query results may involve state changes to an agent (open to queries, interested in new genres.) Query results may involve simple duplicates, which have to be discarded to avoid repeated queries. It's a huge pain when attempted with a spreadsheet. The simplistic string equality test for name matching is defeated by whitespace variations, for example. This is elegant Python, however. 

3. Agent web site checks. These have to be done manually. Agency web pages are often art projects, larded up with javascript that produces elegant rolling animations of books, authors, agents, background art, and text. These sites aren't really set up to help authors. It's impossible to automate a check to confirm the source query results. This has to be done manually: F. L. is required to click and update status. 

4. State Changes. Queries and Rejections are the important state changes. Open and Closed to queries is also part of the state that needs to be tracked. Additionally, there's a multiple agent per agency check that makes this more complex. The state changes are painful to track in a simple spreadsheet-like data structure: a rejection by one agent can free up another agent at the same agency. This multi-row state change is simply horrible to deal with.

Bonus confusion! Time-to-Live rules: a query over 60 days old is more-or-less a de facto rejection. This means that periodic scans of the data are required to close a query to one agent in an agency, freeing up subsequent agents in the same agency.

Manuscript Wish Lists (MSWLs) are a source for agents actively searching for manuscripts. This is more-or-less a Twitter query. Using the various aggregating web sites seems slightly easier than using Twitter directly. However, additional Twitter lookups are required to locate agent details, so this is interesting web-scraping.

Of course F. L. Stevens has a legacy spreadsheet with at least four "similar" (but not really identical) tabs filled with agencies, agents, and query status.

I don't have an implementation to share -- yet. I'm working on it slowly.

I think it will be an interesting tutorial in cleaning up semi-structured data.