Sunday, December 9, 2012

Better Python APIs

Oz Katz has a good entry over at his blog, Eventual Consistency, about making better APIs for your Python libraries.

Wednesday, December 5, 2012

PP #1: The always-true condition

Python Pitfalls are quirks of the Python programming language that may trip up those new to the language or to programming. First in yet another series.

My experience with Python is that my code usually does what I expect it to—in fact, more so than with any other programming language I've used. But that doesn't mean it's impossible to write code that doesn't do what you expect! Far from it. I see code like this on Stack Overflow pretty much every day:

while True:
    command = raw_input("command> ").lower().strip()
    if command == "quit" or "exit":
        break
    elif command == "help":
        print "quit, exit: exit the program"
        print "help:       display this help message"
    elif command:
        print "Unrecognized command."

Can you guess the question? It's some variation of "No matter what command I enter, it always exits."

And the reason is always the same: a Boolean expression using or that is always truthy. (See PF #1: Truth and Consequences.) In this case, it's this line:

    if command == "quit" or "exit":

The programmer who wrote that line intended it to match either "quit" or "exit." Instead, it matches any command. The reason for this becomes obvious if we look at how Python evaluates the condition.

As described in PF #2: And and/or or, the value of a Boolean or operation is truthy if either operand is truthy. The operands of or here are:
  • command == "quit"
  • "exit"
The latter operand, being a string with nonzero length, is truthy in Python, and therefore the result of the or is always truthy. Therefore, this test matches everything and the suite under the if is always executed!
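To see the evaluation spelled out, here is a minimal sketch. The variable names left, right, and result are only for illustration, and the command is hard-coded instead of being read from the user.

```python
command = "help"                       # neither "quit" nor "exit"

left = command == "quit"               # the first operand: False
right = "exit"                         # the second operand: a non-empty string
result = command == "quit" or "exit"   # parsed as (command == "quit") or "exit"

print(repr(left))     # False
print(repr(result))   # 'exit'
print(bool(result))   # True
```

The comparison fails, so or hands back its second operand, the string "exit", and the if statement treats that as true.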

This error is particularly pernicious because it reads exactly how you would phrase it in English, and expresses the programmer's intent clearly. It's just that the meaning of or is more specific in Python than in English.

One way to write this test so that it works as expected is to explicitly tell Python to test command against both alternatives:

    if command == "quit" or command == "exit":

Of course, there's a shorter and clearer way to write this in Python:

    if command in ("quit", "exit"):

If you will be testing for a lot of alternatives and performance is important, use a set defined outside the loop. Testing for membership in an existing set is a lot faster than testing for membership in a tuple that's created new each time the test is executed.

exit_synonyms = set(["quit", "halt", "stop", 
                     "end", "whoa", "exit"])

while True:
    command = raw_input("command> ").lower().strip()
    if command in exit_synonyms:
        break
    # and so on as above

Tuesday, December 4, 2012

PF #3: Non-short-circuiting Boolean operations

Third entry in the Python Foundations series, thoroughly exploring the basic building blocks of Python programs.

Wow, it's been months, hasn't it? Our last Python Foundations entry talked about how the Boolean operators and and or work in Python, including their short-circuiting behavior. But what if you don't want short-circuiting? What if you want both operands of and or or to be fully evaluated even when the first operand's value is sufficient to determine the expression's truth value?

Here, we can take advantage of the fact that Booleans are a special kind of integer in Python (see IDKTAP #2: Booleans as fancy integers), and perform math on them.

If you create a table of the results of adding the various combinations of 0 and 1, which are the integer values of False and True in Python:

+  | 0  1
----------
0  | 0  1
1  | 1  2

It bears a striking resemblance to the results of the Boolean or operator: the result is truthy (see PF #1: Truth and Consequences) if either of the operands is truthy.

Similarly, the multiplication operator is a dead ringer for Boolean and:

*  | 0  1
----------
0  | 0  0
1  | 0  1

Zero times anything is zero, just as False and anything is False.

If you have studied digital electronics, you may recognize this as Boolean arithmetic: arithmetic using only the values 0 and 1, where and is in fact equivalent to multiplication and or is equivalent to addition. The result of addition is "clamped" to 1 (that is, the highest result an addition can be is 1), and the only valid operations are addition and multiplication. It may seem like a toy arithmetic, but it's the foundation of everything a computer does.

We can use this equivalence of logical and arithmetic operations to implement non-short-circuiting ("long-circuiting"?) and and or in Python. To support truthy and falsy values, which may not actually be integers, we convert the operands to proper Booleans first using the bool() constructor.

bool(x) + bool(y) + bool(z)    # x or y or z
bool(x) * bool(y) * bool(z)    # x and y and z

Here, x, y, and z are always fully evaluated. If one of them is, say, a function call, the function is always called. In the equivalent expression using Boolean operators, remember, the expression will short-circuit as soon as it can, and may not evaluate all arguments. We explored how to exploit this behavior for fun and profit in PF #2.
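The difference is easy to observe with operands that record when they are evaluated. This is only a sketch; operand() and the calls list are hypothetical helpers invented for the demonstration.

```python
calls = []   # records which operands were actually evaluated

def operand(name, value):
    """Hypothetical helper: note that we were called, then return value."""
    calls.append(name)
    return value

# Short-circuiting or: the first operand is truthy, so "b" is never evaluated.
short_result = operand("a", True) or operand("b", True)
short_calls = list(calls)

del calls[:]

# Arithmetic version: both operands are always evaluated.
long_result = bool(operand("a", True)) + bool(operand("b", True))
long_calls = list(calls)

print(short_calls)   # ['a']
print(long_calls)    # ['a', 'b']
print(long_result)   # 2
```

Note that the sum here is 2, not True. It is still truthy, so it works in an if statement, but wrap the whole expression in bool() if you need an actual Boolean.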

Conveniently, multiplication has precedence over (is performed before) addition, just as and has precedence over or, so an expression like the following works exactly as you would expect:

bool(a) * bool(b) + bool(x) * bool(y)    # a and b or x and y

In fact, this is an easy way to remember which Boolean operator has precedence: because and is equivalent to multiplication, it has precedence over or, which is equivalent to addition. This is a rule I had trouble remembering when I started programming.

So far, what we've discussed isn't really specific to Python; you can do the same non-short-circuiting Boolean math in virtually any language with only minor syntax changes. We can, however, use Python's any() and all() functions to help make our non-short-circuiting Boolean operators a little more readable. Both functions operate on a sequence.
  • any() - Returns True if any element is truthy (equivalent to Boolean or)
  • all() - Returns True if all elements are truthy (equivalent to Boolean and)
Technically, any() and all() do in fact short-circuit: they will stop inspecting elements in the sequence as soon as they see a value that definitively determines the result. That is, all() returns False when it sees the first falsy element, because a single such element is enough to know that not all elements are truthy. And any() returns True when it sees the first truthy element, because that's enough to know that there is at least one such.

However, the short-circuiting behavior is apparent only with generators, which "lazily" calculate a sequence one element at a time. (There will surely be a Python Foundations entry on generators in the future!) If you pass a sequence (i.e., a list or a tuple) to any() or all(), all the elements of the sequence will be fully evaluated when the sequence is constructed, before any() or all() is even involved. The fact that any() and all() short-circuit is then merely a performance optimization. Useful, in other words, but not exploitable for side effects as with and and or.
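A small sketch makes the contrast between a generator and a list visible. The probe() function and the seen list are made up for this illustration.

```python
seen = []   # every value that was actually evaluated

def probe(value):
    """Hypothetical helper: record the value, then pass it through."""
    seen.append(value)
    return value

# Generator expression: any() stops pulling values at the first truthy one.
lazy_result = any(probe(v) for v in (0, 1, 2))
lazy_seen = list(seen)

del seen[:]

# List: every element is evaluated while the list is built, before any() runs.
eager_result = any([probe(v) for v in (0, 1, 2)])
eager_seen = list(seen)

print(lazy_seen)    # [0, 1]
print(eager_seen)   # [0, 1, 2]
```

Both calls return True; only the amount of work done on the way there differs.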

And so we can use any() and all() as non-short-circuiting or and and, respectively. x, y, and z below may be expressions of arbitrary complexity, and will be evaluated fully.

any([x, y])      # x or y
all([x, y])      # x and y
any([x, y, z])   # x or y or z (any number of arguments)
all((x, y, z))   # you may use a tuple instead of a list


As for the last example, I personally prefer using the square brackets and passing a list since it stands out more and makes it clearer what is happening, but constructing a tuple is faster, a fact which may be useful in some circumstances.

any() and all() can be nested to implement compound non-short-circuiting Boolean operations:

any([all([a, b]), all([x, y])])    # a and b or x and y

However, this gets difficult to read pretty quickly (unless, perhaps, you have experience with Lisp, where all operations are prefix). Fortunately, any() and all() always return True or False (not the deciding element from the sequence, which may be merely truthy or falsy), so you can easily combine their results with + and *.

all([a, b]) + all([x, y])         # a and b or x and y, as above

We're almost finished here; this has become rather more in-depth than I intended! But while we're talking about any() and all(), I should mention that it is something of a shame that neither returns the value that made the decision (i.e., the first truthy value for any() or the first falsy value for all()). This keeps them from being used to actually find the first truthy or falsy value in a sequence, which is especially a pity when you're using them with a generator because that value, having been consumed within any()/all(), is forever lost.

You can write your own versions of these functions that do return the deciding value, but because they are implemented in Python rather than in C, they will be slower than the built-ins. Still, here are such implementations.

def all(iterable):
    for item in iterable: 
        if not item: return item
    return True

def any(iterable):
    for item in iterable: 
        if item: return item
    return False
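If all you want is the deciding value, another option (a sketch, not a replacement for the functions above) is the built-in next() with a generator expression and a default, available since Python 2.6. The names first_truthy and first_falsy are hypothetical, and this version leaves the built-in any() and all() untouched.

```python
def first_truthy(iterable, default=None):
    """Return the first truthy item, or default if there is none."""
    return next((item for item in iterable if item), default)

def first_falsy(iterable, default=None):
    """Return the first falsy item, or default if there is none."""
    return next((item for item in iterable if not item), default)

print(first_truthy([0, "", "spam", 42]))   # spam
print(first_falsy([1, "a", [], 0]))        # []
print(first_truthy([0, ""]))               # None
```

One caveat: a default of None is itself falsy, so with first_falsy() you may want to pass a distinct sentinel object as the default to tell "found None" apart from "found nothing."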

Wednesday, July 11, 2012

IDKTAP #3: Backticks

Did you know that quoting an expression using backticks is an alias for repr() in Python 2.x? That is, it yields the "representation" of the result of the expression, which is generally what you'd have to paste into your Python source file to produce that same object.

print `"Think, it ain't illegal yet!"`

They took this out in Python 3.
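The spelling that works in both Python 2 and Python 3 is a plain call to repr(). A quick sketch (the variable names are just for illustration):

```python
text = "Think, it ain't illegal yet!"

representation = repr(text)
print(representation)   # "Think, it ain't illegal yet!"  (quotes included)

# For simple objects like strings, evaluating the repr gives the object back.
round_trip = eval(representation)
print(round_trip == text)   # True
```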

Thursday, July 5, 2012

IDKTAP #2: Booleans as fancy integers

I Didn't Know That About Python uncovers surprising tidbits about our favorite programming language. 

In Python, a Boolean (type bool) is a subtype of integers. True is equal to 1 and False is equal to 0.

True  == 1   # True
False == 0   # True

The special methods called __str__ and __repr__ define how an object is converted to a string and how it is represented in an interactive context, respectively. So bool is basically implemented like this:

class bool(int):
    def __str__(self):
        return "True" if self else "False"
    __repr__ = __str__


True, False = bool(1), bool(0)

(There are some other details I'm glossing over, such as the fact that True and False are singletons and only ever have one instance. That is, bool(1) is always the same object. The example above does not have this behavior. For a fun exercise, try adding it.)

The surprising thing is that in Python 2, True and False are simply built-in identifiers assigned to two specific bool singletons. This means that they can be reassigned:

True = 3

This will very likely cause quite odd behavior in your programs, which is why Python 3 reserves True and False as actual language keywords and doesn't let you reassign them. Still, issubclass(bool, int) remains True.
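Because Booleans are integers, they can be used anywhere an integer can. A brief sketch (the scores and names below are invented for the example):

```python
scores = [72, 45, 90, 38]

print(issubclass(bool, int))   # True
print(True + True)             # 2

# Counting: each comparison yields True (1) or False (0), so sum() counts hits.
passed = sum(score >= 50 for score in scores)
print(passed)                  # 2

# Indexing: False selects item 0 and True selects item 1.
label = ("fail", "pass")[scores[0] >= 50]
print(label)                   # pass
```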

Sunday, July 1, 2012

PF #2: And and/or or

Second in the Python Foundations series.

In the previous installment of Python Foundations, I described the informal nature of truth and falsity in Python (dubbed "truthy" and "falsy"). Now let's explore how the Boolean operators and and or interact with such values.

In many languages, you'd expect the result of a Boolean operation (an operation between two truth values) to be True or False. But since Python considers a wide range of values True and False in a Boolean context, it is useful to return one of the operands. That is, when you combine two values using and or or, you get one of the values you put in.

In Python, and and or are short-circuiting. They always evaluate their first operand, and if that result is sufficient to decide the truth value of the expression, they stop and don't evaluate their second operand. Say what?

OK, have a look at the truth table for and:

AND   | True   False
-----------------------
True  | True   False
False | False  False

As you can see, False and anything is always False. Therefore, when the first operand of and is falsy, Python need not evaluate the second operand, and in fact does not. This not only saves time, it can be exploited for useful side effects (more on that later).

OR    | True   False
-----------------------
True  | True   True
False | True   False

This truth table for or similarly reveals that when the first operand is True, the result is always True, so Python doesn't need to evaluate the second operand of or when the first operand is truthy. And once again, since it doesn't need to, it doesn't.

What does it mean to not evaluate the second operand of and or or?  Among other things it means that if the second operand is a function call, that second function is never called:

v = a() or b()

If a() returns something truthy, b() is never even called. But what is the value of v? Python assigns v the return value of the operand that decided the result: a() if a's return value is truthy, or b() if a's return value is falsy. It works similarly with and, except the result is a's return value if it is falsy and b's otherwise. In short, the actual operand values are used in place of the constants True and False.
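A few literal operands show which value comes back in each case. This is only an illustrative sketch using made-up values:

```python
v1 = "" or "fallback"          # first operand falsy: result is the second
v2 = "first" or "fallback"     # first operand truthy: result is the first
v3 = [] and "never reached"    # first operand falsy: result is the first
v4 = [1, 2] and 0              # first operand truthy: result is the second

print(repr(v1))   # 'fallback'
print(repr(v2))   # 'first'
print(repr(v3))   # []
print(repr(v4))   # 0
```

Notice that none of the four results is literally True or False.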

Short-circuiting lets you use and and or as substitutes for if statements in certain circumstances. It's very easy to write unreadable code using short-circuiting and and or; there was an idiom that was commonly used in an attempt to make up for the fact that Python didn't have a proper ternary conditional operator. Then Python added a proper ternary conditional operator (v = a if b else c) and everybody forgot about the hackish workaround. But simpler uses of short-circuiting are pretty readable and useful.
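The workaround alluded to above is most likely the "cond and x or y" idiom (that is my assumption here; it is not spelled out above). The sketch below shows why it was fragile: it silently gives the wrong answer whenever x is falsy.

```python
cond = True

# Old idiom: fine as long as the "true" value is itself truthy...
old_ok = cond and "yes" or "no"       # 'yes'

# ...but when the "true" value is falsy, the or falls through to the wrong branch.
old_broken = cond and "" or "no"      # 'no', although cond is True

# The conditional expression has no such trap.
new_style = "" if cond else "no"      # ''

print(repr(old_ok), repr(old_broken), repr(new_style))
```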

For example, easily provide a default value for blank inputs:

name = raw_input("What is your name? ") or "Jude"
print "Hey,", name, "don't make it bad"

Or check to make sure something is callable before calling it:

callable(func) and func()

I find those generally more readable than the alternatives using if statements:

name = raw_input("What is your name? ")
if not name:
    name = "Jude"

if callable(func):
    func()

Short-circuiting is, by the way, the reason Python doesn't let you override and and or using special methods on your own objects the way you can with arithmetic operators. Both operands would be evaluated before the operation is performed, so short-circuiting simply isn't possible. (There is in fact a proposal to change that, but it hasn't been implemented yet.)

Tuesday, June 26, 2012

Recipe #1: An indexable file class

Python Recipes are simple bits of my code you can use freely in your own programs. First in a series. Yes, I'm gonna have a lot of different series of posts on this blog.

Here's a file-like class that allows access to lines in the file using Python indexing or slicing notation without reading the whole file into memory. It's sort of a list/file hybrid since it does have methods for incrementally reading from the file, too.

class linefile(object):
   
    def __init__(self, path):
        self.file    = open(path)
        self.offsets = []

    def __enter__(self):
        return self

    def __exit__(self, *e):
        self.file.close()

    def readline(self):
        self.offsets.append(int(self.file.tell()))
        text = self.file.readline()
        if not text:
            self.offsets.pop()
        return text

    def readlines(self):
        for text in self:
            pass

        return self

    def next(self):
        text = self.readline()
        if not text:
            raise StopIteration   # end of file, per the iterator protocol
        return text

    def __iter__(self):
        text = self.readline()
        while text:
            yield text
            text = self.readline()

    def __getitem__(self, index):
        offset = self.offsets[index]   # one offset, or a list of them for a slice
        tell   = self.file.tell()      # remember the sequential read position
        if type(index) is slice:
            if offset and index.step in (None, 1):
                # contiguous lines: seek once and read in order (this is fastest)
                self.file.seek(offset[0])
                text = [self.file.readline() for line in offset]
            else:
                text = []
                for o in offset:
                    self.file.seek(o)
                    text.append(self.file.readline())
        else:
            self.file.seek(offset)
            text = self.file.readline()
        self.file.seek(tell)           # put the file back where we found it
        return text

    def __len__(self):
        return len(self.offsets)


Unlike a regular file, a linefile is read-only and text-only. Internally, the instance stores the offset of each line that has been read. The lines themselves are retrieved from the file on demand; the instance does not store any lines. Positive indexes count from the beginning of the file and negative indexes count back from the last line read from the file, which may not be the end of the file. (Index -1, then, is the line most recently read.) You can use the readlines() method to read all the lines and thereby note the offset of each line of the file; this will allow the entire file to be accessed by index or slice as though it were a list. The linefile object is iterable and is also a context manager (i.e., it supports the with statement).
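The core technique, stripped of the class, is to note tell() before each readline() and seek() back later. Here is a minimal standalone sketch of that idea; the temporary file and its three lines are invented for the demonstration.

```python
import os
import tempfile

# Create a small throwaway file to index (hypothetical sample data).
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "w") as f:
    f.write("alpha\nbeta\ngamma\n")

offsets = []
with open(path) as f:
    while True:
        position = f.tell()        # where the next line starts
        if not f.readline():       # empty string means end of file
            break
        offsets.append(position)

    f.seek(offsets[1])             # jump straight to the second line
    second = f.readline()
    f.seek(offsets[-1])            # negative indexes work as with any list
    last = f.readline()

os.remove(path)

print(len(offsets))   # 3
print(repr(second))   # 'beta\n'
print(repr(last))     # 'gamma\n'
```

Only the offsets are kept in memory, which is what makes the approach attractive for large files.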

Sunday, June 24, 2012

The world's stupidest Mac backup software

So it appears that I'm running the world's stupidest backup software on my Mac.

Last night, my "media" hard disk (with all my photos and music) failed. No problem, I have a backup. I look on the backup drive and boom, there's no backup.

Apparently this software was trying to keep the backup in sync with the original disk when I deleted files. The original disk was gone, so it made the backup match... by deleting it.

Yeah, that was real helpful. Thanks, NTI Shadow.

All that stuff can probably (probably) be recovered, since nothing else has been written to the backup drive. And I'm pretty sure the original disk can also be recovered, for a price. Still, what a pain.

Friday, June 22, 2012

PF #1: Truth and Consequences

Intended for those new to the language but not to programming, "Python Foundations" surveys the parts of Python that you need to know about to take full advantage of Python's power. First in a series.

Many programming languages have somewhat loose concepts of truth and falsity. Objects can have a truth value even if they're not Booleans. In C, truth is canonically represented as 1 (the value that comparison operators produce), but any non-zero value is considered true in a Boolean context such as an if statement.

Python takes this a step further, considering empty containers (including strings) to be false. The constant None is also considered false. Other objects are generally considered true, although this can be overridden (we'll discuss how in a moment). This property is useful for making code like this more readable:

name = raw_input("What is your name? ")
while not name:  # instead of while name == ""
     name = raw_input("Seriously, what's your name? ")

Since truth values are a little flexible, Python programmers have adopted the terms truthy and falsy (or falsey) to refer to an object's implicit truth value when used in a Boolean context. (I hasten to add that I don't believe these terms were coined in the Python community.)

In other words, the list [1, 2, 3] is not literally equal to the constant True, but it is truthy because if you tested it with an if statement, that if statement's body would be executed.

Instances of classes are generally truthy unless they are derived from a class that has some other built-in behavior (for example, a list, which, remember, is truthy when it contains any items). Functions, classes, iterators/generators, and modules are also truthy.
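The rules so far can be checked quickly with bool(). This is just a sampler of values chosen for illustration, not an exhaustive list.

```python
import os

falsy_values = [None, False, 0, 0.0, "", [], (), {}, set()]
truthy_values = [1, -1, "0", " ", [0], (None,), {"k": None},
                 len, object, object(), os, iter([])]

print(any(falsy_values))    # False: not one of them is truthy
print(all(truthy_values))   # True: every one of them is truthy
```

Some entries in the second list are worth a second look: the string "0" and the list [0] are truthy because they are not empty, and an iterator is truthy even when it has nothing left to yield.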

You can override the implicit truth value of your own classes by defining either a __len__() or __nonzero__()* special method. If your class has a __len__() method, it is probably a container, and Python will treat its instances like one: false when its length is zero and true when its length is nonzero. The __nonzero__() method is more explicit and can indicate the instance's truth value even for non-container classes. If a class has both of these methods, __nonzero__() takes precedence.
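Here is a sketch of both mechanisms side by side. The Playlist and Connection classes are hypothetical, and the truth method is defined under both its Python 3 name and its Python 2 name so the example runs on either.

```python
class Playlist(object):
    """Container-like: truthiness follows __len__()."""
    def __init__(self, *tracks):
        self.tracks = list(tracks)

    def __len__(self):
        return len(self.tracks)


class Connection(object):
    """Defines both methods; the explicit truth method wins over __len__()."""
    def __len__(self):
        return 0                 # nothing buffered

    def __bool__(self):          # Python 3 name
        return True              # but the connection is still open

    __nonzero__ = __bool__       # Python 2 name


print(bool(Playlist()))            # False
print(bool(Playlist("Hey Jude")))  # True
print(bool(Connection()))          # True
```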

Here is a list subclass that is always truthy, even when empty:

class truthylist(list):
    def __nonzero__(self):
        return True  # always

When would you want to use such a list? Well, consider a situation in which you are reading records from a database or elements from an XML file and will return a list of them. If an error occurs, you have two choices: raise an exception or return an error code. In some scenarios, it's even convenient to just return an empty list on error, since then you can use the same code path to iterate over it whether there was an error or not. But then you don't know why you got the empty list: was it because there was an error, or because there was no data of the type you requested? The truthy list gives us a solution.

def getrecords(key):
    try:
        result = ...  # get the records here
    except Exception:
        return []
    else:
        return result if result else truthylist()
       
Now, when we call this function, we can just check to see if the result is truthy. If it is, we successfully retrieved the records (even if there are none). At the same time, we retain the ability to iterate over the records without regard for what happened, if that's what we want to do.

records = getrecords("DNA")
for record in records: print record
if records == []: print "No records found",
if not records: print "due to error",
print

Admittedly, this verges on a Stupid Python Trick. The "empty container is falsy" convention is so ingrained that other Python programmers will find truthylist more than a little odd.

In the next installment of Python Foundations, we'll look at Python's logical operators and how implicit truth values interact with them.

*In Python 3, the __nonzero__() special method was renamed __bool__() to better match the other type-coercion methods such as __str__() and __int__().

Thursday, June 21, 2012

Tweetery @Engyrus

You can find me on Twitter @Engyrus. I'll post there each time I post something here.

SPT #1: Automatically stripping debug code

"Stupid Python Tricks" explores ways to abuse Python features for fun and profit. This is the first in a series. Stupid Python Tricks are generally not best practices and may well be worst practices. Read at your own risk.

The standard Python interpreter ("CPython") has a command line option -O that triggers "optimization." Currently, the only real optimization performed is to ignore assert statements. And they're not simply ignored at runtime; they are never even compiled into the byte code executed by the Python virtual machine.

The value of this optimization is that you can sprinkle asserts liberally throughout your code to make sure it fails fast when something goes wrong, making it easier to debug. Yet, because all those statements are stripped out when Python is run in optimized mode, there's no performance penalty when the code is put into production.
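You can watch the stripping happen by running the same failing assertion in a child interpreter with and without -O. This is a sketch that assumes CPython and that launching sys.executable as a subprocess is permitted; the snippet string is invented for the demonstration.

```python
import os
import subprocess
import sys

snippet = "assert False, 'this should never happen'"

with open(os.devnull, "w") as devnull:
    # Normal mode: the assertion fails and the interpreter exits with an error.
    normal_status = subprocess.call(
        [sys.executable, "-c", snippet], stderr=devnull)
    # Optimized mode: the assert was never compiled, so nothing fails.
    optimized_status = subprocess.call(
        [sys.executable, "-O", "-c", snippet], stderr=devnull)

print(normal_status != 0)      # True
print(optimized_status == 0)   # True
```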

Wouldn't it be great if you could strip all your debug code just as easily? Many of us write functions like this to let us easily turn off our debug messages:

def debug_print(*args):    
    if DEBUG:
        for arg in args: print arg,
        print

This can be optimized a bit to minimize overhead of the debug statements by checking the DEBUG flag only once and defining a do-nothing function when the flag isn't set:

def debug_print(*args):    
    for arg in args: print arg,
    print
if not DEBUG: debug_print = lambda *args: None

But there are still all those function calls to the dummy function being executed when running in non-debug mode. Plus, of course, you still need to define the DEBUG variable. So running in production requires both that you change that variable and put -O on the command line, doubling your chances of getting it wrong.

How can we abuse Python's optimization to actually strip out the calls to our debug_print function? Simple: by writing it as an assertion. To avoid raising an AssertionError, of course, debug_print must always return True.

def debug_print(*args):
    for arg in args: print arg,
    print
    return True

assert debug_print("Checkpoint 1")

Now we just need to run our script with -O and all those debug_print calls will be stripped automatically like we never even wrote them.

If you're using Python 3 (or Python 2.6 or 2.7 with from __future__ import print_function), print is already a function. Seems like a waste to define a new debug_print in that case. But we need the result to be True and print() returns None, which evaluates as False in a Boolean context. Well, you can just write one of the following, any of which is guaranteed to be True (the first only for functions that return None or another falsey value, the other two always) and prevent assert from sounding an alarm.

assert not print("Checkpoint 1")
assert print("Checkpoint 1") or True
assert [print("Checkpoint 1")]

Of these three, the last is what elevates this trick to the height of stupidity. Exploiting the fact that Python considers any non-empty container True, we simply make a list containing the return value from the function we called. The resulting code looks more like an unfamiliar bit of Python syntax than a dirty hack, but a dirty hack it is.

By the way, this Stupid Python Trick obviously also works for calls to loggers or any other function you want to call in debug mode.

Why this is a bad idea: It's very specific to current CPython behavior, which could change in the future, and may not have the desired effects with other Python implementations like IronPython and Jython. (Although it probably wouldn't hurt anything except possibly performance.) Furthermore, it's not really asserting anything about the program (the truth value is guaranteed to be True after all), but rather using the assert statement for its secondary effects, damaging Python's generally excellent readability.

IDKTAP #1: Omitting "print" in the Python shell

"I Didn't Know That About Python" covers surprising little things I've learned about Python. Nothing earthshattering. First in a series. 

Most Python hackers have spent a lot of time exploring in the Python interactive shell, and therefore know that Python obligingly prints the results of expressions for you when they return a value:

>>> 2+2
4

You might be forgiven for assuming Python is just printing the result of the last expression it evaluated. But in fact, it's printing the result of any and all expressions that you don't do anything else with.

>>> 2+2; 3+3
4
6
>>> for x in xrange(5): x
...
0
1
2
3
4

This behavior, though subtly different from what you might expect, lets you omit print much of the time in Python interactive mode, saving you precious fractions of a second each time. Because life is too short to spend it typing print!
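Under the hood, the interactive shell compiles each entry in "single" mode, in which every expression statement hands its value to sys.displayhook. The sketch below reproduces that outside the shell by temporarily swapping in a hook that collects the values; the shown list is just for illustration, and range stands in for xrange so the code also runs on Python 3.

```python
import sys

shown = []                       # values the "shell" would have displayed
original_hook = sys.displayhook
sys.displayhook = shown.append   # collect values instead of printing them

try:
    exec(compile("2+2; 3+3", "<demo>", "single"))
    exec(compile("for x in range(3): x\n", "<demo>", "single"))
finally:
    sys.displayhook = original_hook   # always restore the real hook

print(shown)   # [4, 6, 0, 1, 2]
```

The real hook also skips None, which is why calling a function that returns nothing doesn't print anything at the prompt.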

Thanks to Joel Cornett for pointing this out in a comment on a StackOverflow answer.

Wednesday, June 20, 2012

Welcome to Engyrus

Engyrus is my one-man consulting sideline specializing in the Python programming language. I created this blog to share Python tips and tricks and information about the Python projects I'm working on, among other things. There'll probably be some Web stuff and some utilities made with AutoHotKey too.

I'm Jerry Kindall, technical writer by trade. At my day job, I use Python to build and validate our documentation. Python has put the fun back in programming for me. Coding in Python reminds me a lot of my Apple II hacking days, but with longer variable names and fewer GOTOs.

About the name: "Engyrus" is a disused scientific name for the genus Python (you know, snakes) dating back to the 1830s. As a collector of words, I really found this one lovely, so I appropriated it. (Much to my pleasure, the domain was available, too!)