Sunday, December 9, 2012

Better Python APIs

Oz Katz has a good entry over at his blog, Eventual Consistency, about making better APIs for your Python libraries.

Wednesday, December 5, 2012

PP #1: The always-true condition

Python Pitfalls are quirks of the Python programming language that may trip up those new to the language or to programming. First in yet another series.

My experience with Python is that my code usually does what I expect it to—in fact, more so than with any other programming language I've used. But that doesn't mean it's impossible to write code that doesn't do what you expect! Far from it. I see code like this on Stack Overflow pretty much every day:

while True:
    command = raw_input("command> ").lower().strip()
    if command == "stop" or "exit":
         break
    elif command == "help":
        print "quit, exit: exit the program"
        print "help:       display this help message" 
    elif command:
        print "Unrecognized command."

Can you guess the question? It's some variation of "No matter what command I enter, it always exits."

And the reason is always the same: a Boolean expression using or that is always truthy. (See PF #1: Truth and Consequences.) In this case, it's this line:

    if command == "quit" or "exit":

The programmer who wrote that line intended it to match either "quit" or "exit." Instead, it matches any command. The reason for this becomes obvious if we look at how Python evaluates the condition.

As described in PF #2: And and/or or, the value of a Boolean or operation is truthy if either operand is truthy. The operands of or here are:
  • command == "quit"
  • "exit"
The latter operand, being a string with nonzero length, is truthy in Python, and therefore the result of the or is always truthy. Therefore, this test matches everything and the suite under the if is always executed!

This error is particularly pernicious because it reads exactly how you would phrase it in English, and expresses the programmer's intent clearly. It's just that the meaning of or is more specific in Python than in English.

One way to write this test so that it works as expected is to explicitly tell Python to test command against both alternatives:

    if command == "quit" or command == "exit":

Of course, there's a shorter and clearer way to write this in Python:

    if command in ("quit", "exit"):

If you will be testing for a lot of alternatives and performance is important, use a set defined outside the loop. Testing for membership in an existing set is a lot faster than testing for membership in a tuple that's created new each time the test is executed.

exit_synonyms = set(["quit", "halt", "stop", 
                     "end", "whoa", "exit"])

while True:
    command = raw_input("command> ").lower().strip()
    if command in exit_synonyms:
        break
    # and so on as above

Tuesday, December 4, 2012

PF #3: Non-short-circuiting Boolean operations

Third entry in the Python Foundations series, thoroughly exploring the basic building blocks of Python programs.

Wow, it's been months, hasn't it? Our last Python Foundations entry talked about how the Boolean operators and and or work in Python, including their short-circuting behavior. But what if you don't want short-circuiting? What if you want both operands of and or or to be fully evaluated even when the first operand's value is sufficient to determine the expression's truth value?

Here, we can take advantage of the fact that Booleans are a special kind of integer in Python (see IDKTAP #2: Booleans as fancy integers), and perform math on them.

If you create a table of the results of adding the various combinations of 0 and 1, which are the integer values of False and True in Python:

+  | 0  1
----------
0  | 0  1
1  | 1  2

It bears a striking resemblance to the results of the Boolean or operator: the result is truthy (see PF #1: Truth and Consequences) if either of the operands is truthy.

Similarly, the multiplication operator is a dead ringer for Boolean and:

*  | 0  1
----------
0  | 0  0
1  | 0  1

Zero times anything is zero, just as False and anything is False.

If you have studied digital electronics, you may recognize this as Boolean arithmetic: arithmetic using only the values 0 and 1, where and is in fact equivalent to multiplication and or is equivalent to addition. The result of addition is "clamped" to 1 (that is, the highest result an addition can be is 1), and the only valid operations are addition and multiplication. It may seem like a toy arithmetic, but it's the foundation of everything a computer does.

We can use this equivalence of logical and arithmetic operations to implement non-short-circuiting ("long-circuiting"?) and and or in Python. To support truthy and falsy values, which may not actually be integers, we convert the operands to proper Booleans first using the bool() constructor.

bool(x) + bool(y) + bool(z)    # x or y or z
bool(x) * bool(y) * bool(z)    # x and y and z

Here, x, y, and z are always fully evaluated. If one of them is, say, a function call, the function is always called. In the equivalent expression using Boolean operators, remember, the expression will short-circuit as soon as it can, and may not evaluate all arguments. We explored how to exploit this behavior for fun and profit in PF #2.

Conveniently, multiplication has precedence over (is performed before) addition, just as and has precedence over or, so an expression like the following works exactly as you would expect:

bool(a) * bool(b) + bool(x) * bool(y)    # a and b or x and y

In fact, this is an easy way to remember which Boolean operator has precedence: because and is equivalent to multiplication, it has precedence over or, which is equivalent to addition. This is a rule I had trouble remembering when I started programming.

So far, what we've discussed isn't really specific to Python; you can do the same non-short-circuiting Boolean math in virtually any language with only minor syntax changes. We can, however, use Python's any() and all() functions to help make our non-short-circuiting Boolean operators a little more readable. Both functions operate on a sequence.
  • any() - Returns True if any element is truthy (equivalent to Boolean or)
  • all() - Returns True if all elemens are truthy (equivalent to Boolean and)
Technically, any()and all() do in fact short-circuit: they will stop inspecting elements in the sequence as soon as they see a value that definitively determines the result. That is, all() returns False when it sees the first falsy element, because a single such element is enough to know that not all elements are truthy. And any() returns True when it sees the first truthy element, because that's enough to know that there is at least one such.

However, the short-circuiting behavior is apparent only with generators, which "lazily" calculates a sequence one element at a time. (There will surely be a Python Foundations entry on generators in the future!) If you pass a sequence (i.e., a list or a tuple) to any() or all(), all the elements of the sequence will be fully evaluated when the sequence is constructed, before any() or all() is even involved. The fact that any() and all() short-circuit is then merely a performance optimization. Useful, in other words, but not exploitable for side effects as with and and or.

And so we can use any() and all() as non-short-circuiting or and and, respectively. x, y, and z below may be expressions of arbitrary complexity, and will be evaluated fully.

any([x, y])      # x or y
all([x, y])      # x and y
any([x, y, z])   # x or y or z (any number of arguments)
all((x, y, z))   # you may use a tuple instead of a list


As for the last example, I personally prefer using the square brackets and passing a list since it stands out more and makes it clearer what is happening, but constructing a tuple is faster, a fact which may be useful in some circumstances.

any() and all() can be nested to implement compound non-short-circuiting Boolean operations:

any([all([a, b]), all([x, y]])    # a and b or x and y

However, this gets difficult to read pretty quickly (unless, perhaps, you have experience with Lisp, where all operations are prefix).. Fortunately, any() and all() always return True or False (not the deciding element from the sequence, which may be merely truthy or falsy), so you can easily combine their results with + and *.

all([a, b]) + all([x, y])         # a and b or x and y, as above

We're almost finished here; this has become rather more in-depth than I intended! But while were talking about any() and all(), I should mention that it is something of a shame that neither returns the value that made the decision (i.e., the first truthy value for any() or the first falsy value for all()). This keeps them from being used to actually find the first truthy or falsy value in a sequence, which is especially a pity when you're using them with a generator because that value, having been consumed within any()/all(), is forever lost.

You can write your own versions of these functions that do return the deciding value, but because they are implemented in Python rather than in C, they will be slower than the built-ins. Still, here are such implementations.

def all(iterable):
    for item in iterable: 
        if not item: return item
    return True

def any(iterable):
    for item in iterable: 
        if item: return item
    return False