On thinking differently

Wed 07 February 2018

This post is about an experience I had while solving a kata-style coding exercise. While the problem itself was very well defined and had a simple solution, I was very taken aback that I did not see the most elegant and simple solution, despite my proclaimed fluency with programmatic problem solving. This experience taught me that I still have a lot to learn about thinking outside the box, and I'm writing it down here mainly to try and articulate my thoughts to myself.

Let's begin

A colleague of my partner regularly posts small coding exercises to the company Slack channel. I think that this is a great idea for several reasons:

  • it gives people practice at translating problem specifications into code,
  • it makes people think about problems that are different to those on which they work day to day,
  • and it provides a central point for discussions about the merits of different ways of attacking problems, in addition to coding style.

The exercises do not even have to be very complicated (in fact I think that this is best); the most recent exercise was as follows:

Write a function that returns the most commonly occurring alphabetic character in a string, treating uppercase and lowercase letters as equivalent.

If two characters occur equally often, the the one that occurs earlier in the alphabet should be returned.

My solution

Seems simple enough, right? I coded up the simplest solution I could think of in a few minutes

from collections import Counter

def most_common(s):
    s = (c for c in s.lower() if c.isalpha())
    most_common, count = max(Counter(s).items(),
                             key=lambda c: (c[1], -ord(c[0])))
    return most_common

I will call the above code "solution 1". I was convinced that this was the optimal solution:

  • We filter out only the characters we care about, so the counting logic does not run for characters that we will later throw away,
  • We use a generator expression to avoid making a copy of the (potentially large) string in memory
  • We make a single pass over the input string

The other solution

This was the solution that was posted to the Slack channel after everyone had submitted theirs:

from string import ascii_lowercase

def most_common(s):
    return max(ascii_lowercase, key=s.lower().count)

I will call this code "solution 2". Just looking at it this is much cleaner than solution 1 (although, embarrassingly, it actually took me a minute to understand how it handles the edge case where two characters have the same count). It also works in a fundamentally different way to solution 1: here we iterate over the characters that we are interested in (ascii_lowercase) and compare them based on the number of times that they occur in the input string, taking the character with the maximum count. If several characters have the same counts, then max will choose the one that occurred first (it has the same semantics as a stable sort).

Despite its readability I was initially skeptical because we make 26 passes over the input string, rather than just 1. It is also the case that even if the input string contains only the character 'a' (for example) we will still iterate through the damn string 25 more times, counting up the occurrences of 'b', 'c' and so on! This is even though we know that it doesn't contain anything but as after the first iteration. My partner had actually tried to solve the problem in a similar manner to this, but I had dismissed it as suboptimal for the aforementioned reason. I said to myself "sure, this seems cleaner, but there's no way that it's more efficient".

This is why it was a huge shock to me that solution 2 actually outperforms solution 1 in terms of run time.

What I failed to account for, is that in this case we don't care about asymptotic complexity. Subconsciously I had been thinking: "hm, if the problem requirements change and we now want to find the most commonly occurring unicode character then we would have to iterate over the input string a hundred thousand times; not cool!". However, the problem very clearly states that we only care about ascii lowercase characters. In this regime solution 2 performs way better because the counting of individual characters is done by the builtin string method str.count, which uses a tight C loop. Compare this to solution 1, where we iterate over the input string in a python loop, incurring the additional cost of a dictionary lookup, integer addition, and an isalpha() check from Python, phew!

Thoughts

This blog post is mainly just me proving to myself that, in this instance, I was wrong. My solution was inferior in every possible metric. This was initially quite hard to swallow, as before seeing solution 2 I was fully convinced that it was impossible within the confines of Python to express a solution more cleanly and efficiently. Boy have I got a lot to learn!

It was also a good reminder for me to make sure that I actually optimise my designs for the intended use case. I have a natural tendency to try and write code that solves a problem more general than the one initially formulated. Although I'll try and justify this as making the code more "reusable" or "extensible", the real reason is probably just that I enjoy extracting the abstract structure of a problem. I really have to work on not abstracting into the stratosphere.