
And now it’s all this

I just said what I said and it was wrong. Or was taken wrong.


# Python Countdown

Permalink - Posted on 2019-12-08 16:20

One of my wastes of time is watching celebrity-based British game shows on YouTube. This started with Never Mind the Buzzcocks, expanded to QI, and now covers a handful of shows, including 8 Out of 10 Cats Does Countdown.1 One of the puzzles in this show is to form the longest possible word from a string of nine letters. I don’t have a head for anagrams, but I do like recreational programming, so I’ve always thought it would be fun to write a program that generates solutions to the show’s “letters round.”

The obvious way to do this—generate all permutations of the nine letters, then eight of the letters, then seven of the letters, etc., and filter them through a list of real words—didn’t appeal to me. There are 362,880 ways to arrange nine letters, the same number of ways to arrange eight of the nine letters, 181,440 ways to arrange seven of the nine letters, and so on. I’m not above brute force programming when I need to get a job done, but I prefer more elegant solutions when programming for fun. So my Countdown script didn’t get written.
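Those counts are easy to verify with the standard library’s `math.perm`, which counts the arrangements of `k` items chosen from `n`:

```python
from math import perm

# Ways to arrange 9, 8, and 7 of the nine letters
nine = perm(9, 9)   # 9! = 362,880
eight = perm(9, 8)  # 9!/1! = 362,880, the same number
seven = perm(9, 7)  # 9!/2! = 181,440

print(nine, eight, seven)
```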

Then last month John D. Cook wrote this post about anagram frequency and the light went on. The trick is to start with a list of words and build a data structure that allows you to look up all the anagrams of a given string of letters. A convenient data structure for this is an associative array, called a dictionary in Python. The clever bit is that the dictionary keys are the letters of the anagrams in alphabetical order and the values are lists of anagrams corresponding to those letters.

For example, one entry for six-letter words would have the key aeprss and the value

'aspers', 'parses', 'passer', 'prases', 'repass', 'spares', 'sparse', 'spears'


So if we were searching for anagrams in the string psasre, we’d first alphabetize it to aeprss and then look it up directly in our dictionary of six-letter anagrams.
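That alphabetize-then-look-up step can be sketched in a few lines. The dictionary here is a tiny hand-built stand-in for the real one constructed below:

```python
# A one-entry stand-in for the full six-letter anagram dictionary
anagrams6 = {'aeprss': {'aspers', 'parses', 'passer', 'prases',
                        'repass', 'spares', 'sparse', 'spears'}}

def sig(word):
    # The "signature" is just the word's letters in alphabetical order
    return ''.join(sorted(word))

# Alphabetize the search string, then look it up directly
found = anagrams6.get(sig('psasre'), set())
print(sorted(found))
```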

The starting point, then, is to build a dictionary of anagrams. Because I wanted a bit more structure, I built four dictionaries, one each for nine-letter, eight-letter, seven-letter, and six-letter words.2 I then assembled those into a higher-level dictionary in which the word lengths were the keys.

I got the lists of words from which to build the dictionaries from this Scrabble help site. It provides long lists of legal Scrabble words of whatever length you like. By copying the pages for words of six through nine letters and doing a little editing, I created files of words of each length, with every word on its own line. I named these files word6.txt through word9.txt. Then I ran this script:

python:
1:  from collections import defaultdict
2:  import pickle
3:
4:  def sig(word):
5:    return ''.join(sorted(word)).strip()
6:
7:  words = {}
8:  for count in range(6, 10):
9:    lines = [ x.strip() for x in open(f'word{count}.txt') ]
10:    words[count] = defaultdict(set)
11:    for word in lines:
12:      words[count][sig(word)].add(word)
13:
14:  wordsfile = open('scrabble-words', 'wb')
15:  pickle.dump(words, wordsfile)


The sig function returns the alphabetized letters of the argument string and is used to generate the key, or “signature,” of each word in the file. It’s a simplified version of the sig function John D. Cook wrote.

The script goes through each of the wordn.txt files, adding the words in that file to the subdictionary associated with its word length. The defaultdict type from the collections module is used to avoid the initialization problem that arises when using regular Python dictionaries. The anagrams are stored as sets of strings (see Line 10 for how the defaultdicts are generated), which we’ll find useful later to avoid repetition when blending sets together.
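The initialization problem is easy to see in miniature: with a plain dict you’d have to check whether each signature already has an entry before adding to its set, while `defaultdict(set)` creates the empty set on first access:

```python
from collections import defaultdict

d = defaultdict(set)
for word in ['spares', 'sparse', 'melon', 'lemon']:
    key = ''.join(sorted(word))
    d[key].add(word)   # no KeyError; an empty set is created as needed

print(dict(d))
```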

When the looping in Lines 8–12 is done, we have a dictionary of dictionaries of sets assembled in words. The anagrams we looked at earlier can be found in words[6]['aeprss'].

This complex data structure is then stored in a file called scrabble-words using Python’s pickle module for data serialization. The idea is that we create this data structure once and then use it later without having to rebuild it every time we play Countdown.3

Which brings us to the script that gives us all the anagrams in a string:

python:
1:  import pickle
2:  from itertools import combinations
3:
4:  wordfile = open('scrabble-words', 'rb')
5:  words = pickle.load(wordfile)
6:
7:  def sig(word):
8:    return ''.join(sorted(word)).strip()
9:
10:  w = input("Letters: ").strip()
11:  print()
12:  for count in range(9, 5, -1):
13:    found = set()
14:    for s in combinations(w, count):
15:      t = ''.join(s)
16:      anagrams = sorted(words[count][sig(t)])
17:      if len(anagrams) > 0:
18:        found |= set(anagrams)
19:    if len(found) > 0:
20:      print(f'{len(found)} {count}-letter word{"s" if len(found)>1 else ""}:')
21:      print(' '.join(sorted(found)))
22:    else:
23:      print(f'No {count}-letter words')
24:    print()


Lines 4–5 read in the scrabble-words file and convert it back into the data structure we want. Line 10 asks for the string of letters, and the rest of the script generates the anagrams.

The loop that begins on Line 12 sets the length of anagrams to search for. It starts at nine letters and works its way down to six, which means we’ll see the highest-scoring answers first. Line 13 creates an empty set of found words and Line 14 starts a loop that goes through all the count-letter combinations of the nine given letters. These combinations are generated by the aptly named combinations function in the itertools library.

Because the combinations function returns a tuple of letters, we use Line 15 to turn the tuple into a string. Line 16 then gets all the anagrams of that string. If there are any (Line 17), they’re added to the found set through the union operation to avoid repeats (Line 18).
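A quick illustration of what combinations hands back, and why the join is needed (a four-letter string is used here to keep the output short):

```python
from itertools import combinations

# combinations yields tuples of letters, preserving their original order
combos = list(combinations('abcd', 3))
print(combos)    # [('a', 'b', 'c'), ('a', 'b', 'd'), ('a', 'c', 'd'), ('b', 'c', 'd')]

# Joining each tuple gives the strings we can take signatures of
strings = [''.join(c) for c in combos]
print(strings)   # ['abc', 'abd', 'acd', 'bcd']
```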

The rest of the script is just output. Here’s an example of a run in Pythonista on my phone:

Is this cheating? Sure, but I’m not a contestant. I think of it more as an expansion of what Susie Dent does.

1. This is, oddly, an amalgam of two shows I find unwatchable: the Family Feud-like celebrity show, 8 Out of 10 Cats, and the long-running traditional game show, Countdown.

2. Words of five or fewer letters are fair game in Countdown and could have been included in my program, but I wasn’t interested in words that short.

3. This could be a false efficiency. While it’s true that reading the scrabble-words file is faster than building the data structure from scratch each time—about twice as fast according to my timings—both are so fast that the difference in speed is practically unnoticeable. And the speed difference comes at a price in storage space. The four wordn.txt files take up less than one megabyte of space in aggregate, while the scrabble-words file is just over five megabytes. I may test other storage schemes to see how they balance space and speed.

[If the formatting looks odd in your feed reader, visit the original article]

# A little more on Galileo’s column

Permalink - Posted on 2019-12-07 15:43

After last week’s post, I thought some more about Galileo’s column problem. The mechanic in Galileo’s story tried to reduce the bending stress when the column is being stored on its side by inserting a third support. A simpler way would be to keep two supports but change the spacing between them. Let’s see how to do that in a way that minimizes the bending stress.

Here’s a uniform beam with supports set in a distance $a$ from its ends. Below it is the corresponding bending moment diagram.

The moment at the supports is

$$M_s = -\frac{wa^2}{2}$$

and the moment at the center is

$$M_c = \frac{wL^2}{8} - \frac{waL}{2}$$

As we discussed last time, the difference between positive and negative moments is whether the tension—and therefore where the cracking starts—is on the bottom or top of the beam. In assessing the overall strength of the beam, top and bottom cracking are equivalent, so we don’t have to worry about the sign of moment, only its magnitude.1 We want to find the value of $a$ that minimizes the maximum moment in absolute terms.

Seeing the word “minimize” might make you think it’s time to do some calculus, but let’s try a different approach. Note that for small values of $a$, $|M_s|$ increases and $M_c$ decreases with increasing $a$. As we increase $a$, there will be a point at which the two are equal. This will minimize the maximum absolute moment in the beam.

So we set

$$\frac{wa^2}{2} = \frac{wL^2}{8} - \frac{waL}{2}$$

and solve for $a$. To nondimensionalize the equation, let’s use the substitution $x = a/L$ to get

$$\frac{w(xL)^2}{2} = \frac{wL^2}{8} - \frac{w(xL)L}{2}$$

Expanding, canceling, and rearranging gives us

$$x^2 + x - \frac{1}{4} = 0$$

which we can solve by completing the square or through the quadratic formula. Either way, we get two solutions:

$$x = -\frac{1 + \sqrt{2}}{2}$$

and

$$x = \frac{\sqrt{2} - 1}{2}$$

The first solution is negative and can be ignored as a mathematical artifact. The second is the solution we want. It’s approximately $x = 0.207$ and gives a maximum absolute moment in the beam of $M_{max} = 0.0214 wL^2$. Compare this with the maximum moment with the supports at the ends ($0.125 wL^2$) and you can see how much value there is in moving the supports in.
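The numbers quoted above are easy to check. The positive root of the quadratic gives $x$, and the governing moment is the support moment $|M_s| = wa^2/2$ (the standard result for a uniformly loaded overhang), so the coefficient on $wL^2$ is $x^2/2$:

```python
from math import sqrt

# Positive root of the quadratic for x = a/L
x = (sqrt(2) - 1) / 2
print(round(x, 3))      # 0.207

# Moment coefficient: |M_s| = w*a^2/2 = (x^2/2) * w*L^2
coeff = x**2 / 2
print(round(coeff, 4))  # 0.0214
```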

If we plot $M_{max}$ against $a$, we can see why calculus wouldn’t have helped us find the minimum. The smallest value of $M_{max}$ is at a cusp, the intersection of the lines for $M_c$ and $M_s$. Setting a derivative equal to zero won’t locate that point.

Note that the graph also shows us something we looked at in the previous post: the maximum moment when the beam is balanced over a support at the center is the same as when it’s supported at its ends.

Apart from being a fun little problem to think about, Galileo’s column illustrates an important concern in structural engineering. Structural elements that would survive perfectly well in the completed structure sometimes fail during construction because they see loading during assembly (or storage or transportation) that they’ll never see afterward. Although engineers’ main concern is how the elements behave when the building is complete, they also have to account for the stresses that arise before then.

1. This is not true in general, but it is true for beams in which the upper and lower halves of the cross-section are symmetric. Recall that this beam is intended to be tilted up and used as a column, which means its cross-section is either circular or circular with flutes and has the required symmetry.


# Galileo and failure

Permalink - Posted on 2019-11-30 17:14

Last week’s episode of the 99% Invisible podcast was about failure and the problems that arise when we try to design safety systems to prevent failure. I found myself disagreeing with a lot of what was said, but a particular example struck me as flatly wrong. As I looked into it in more detail, I learned that it was wrong for more reasons than I originally thought.

The episode was not produced by the usual 99% Invisible team. It was one of Tim Harford’s Cautionary Tales shows with a 99PI frame around it. My main problem with the show was that its thesis—Safety systems add complexity, and that complexity may lead to failure—was overemphasized to such an extent that it transformed into the more fatalistic Safety systems add complexity which leads to failure. The examples given in the show of failures caused by the introduction of complexity were, I thought, too glibly1 presented.

To be fair, I’ve been investigating failure professionally for three decades, and it may be that no simplified journalistic approach to the topic of failure would have satisfied me. But there is a line between justified simplification and misleading oversimplification, and one of the show’s examples went over that line.

Here’s how it was introduced (from 99PI’s transcript):

Tim Harford:
Galileo Galilei is known for his astronomy and because his work was consigned to the church’s ‘Index Librorum Prohibitorum’, the list of forbidden books, but the great man’s final work opens with a less provocative topic, the correct method of storing a stone column on a building site. Bear with me, this book from 1638 is going to explain the Oscar fiasco and much more.

Galileo Galilei:
“I must relate a circumstance which is worthy of your attention as indeed are all events, which happen contrary to expectation, especially when a precautionary measure turns out to be a cause of disaster.”

Tim Harford:
A precautionary measure turns out to be a cause of disaster. That’s very interesting, Galileo, please go on.

Galileo Galilei:
“A large marble column was laid out so that its two ends rested each upon a piece of beam.”

Tim Harford:
I can picture that in my mind, support the column while it’s being stored horizontally ready for use. If you lay it on the ground it may get stained and you’ll probably break it when you try to get ropes underneath it to pull it upright. So yes, store it flat but propped up by a support at one end and a support at the other. But what if the column can’t support its own weight like that and simply snaps in half? Galileo has thought of that.

Galileo Galilei:
“A little later it occurred to a mechanic that in order to be doubly sure of its not breaking in the middle, it would be wise to lay a third support midway. This seemed to all an excellent idea.”

Tim Harford:
Yes. If two supports are good, surely three supports are better.

Galileo Galilei:
“It was quite the opposite, for not many months passed before the column was found cracked and broken exactly above the new middle support.”

Tim Harford:
How did that happen?

Galileo Galilei:
“One of the end supports had after a long while become decayed and sunken, but the middle one remained hard and strong, thus causing one half of the column to project in the air without any support.”

Tim Harford:
The central support didn’t make the column safer. It pressed into it like the central pivot of a seesaw snapping it in half. Galileo’s tale isn’t really about storing columns and neither is mine. It’s about what I’m going to call Galileo’s principle, the steps we take to make ourselves safe sometimes lead us into danger.

I first heard this while driving my car to work and thought “That doesn’t sound right.” I paused the podcast and thought about it some more: “No, it isn’t right.” But it’s easy to lose terms when you’re doing algebra in your head, so when I got to work I sketched out the two ways of supporting the column and redid the work. Harford (and Galileo?) were still wrong: the stress in the column with a central support is no larger than the stress in the column with end supports. There’s no reason to believe that putting in a central support made things worse.

It’s easy to see why, but we need a few preliminaries:

1. When a column is resting horizontally, it’s being bent by its own weight and the reaction forces at the supports. Therefore, it’s acting more like a beam than a column, and I will refer to it as a beam.
2. The stresses in a beam are related to the bending moments in that beam. The stresses will be highest where the bending moments are highest (in absolute value).
3. Structural engineers use a sign convention for bending moments: those that put the bottom of a beam in tension and top of the beam in compression are taken as positive; those that put the bottom of the beam in compression and the top of the beam in tension are taken as negative.
4. Stone is strong in compression but weak in tension. All other things being equal, it will fail where the tension is the highest.

For the purposes of this post, all of these will be taken as given. I’ve discussed bending moments and stresses in beams briefly here and here, but you can find better treatments in any strength of materials book.

Here’s Galileo’s beam with supports at its ends and the associated moment diagram:

The moment diagram is parabolic and positive along the entire length of the beam. If we call the beam’s weight per unit length $w$, then the peak moment at the center of the beam is

$$M_{max} = \frac{wL^2}{8}$$

and if the beam were to fail, we would expect it to fail starting at the bottom center where the tension is the highest.

The goal of Galileo’s mechanic was to support the beam in the center as well as at the ends, reducing the maximum (absolute) moment. Here’s what that would look like:

The maximum moment in absolute terms is still at the center, but now it’s a negative moment with this magnitude:

$$M = -\frac{wL^2}{32}$$

This is one-fourth of the moment when the beam has just end supports, so the mechanic’s action makes sense.

Unfortunately, one of the end supports gave way, leading to this condition:

I’ve drawn the middle support as being a bit off-center; we’ll see why in a bit.

The maximum moment in absolute terms is at the middle support and is negative:

$$M = -\frac{wx^2}{2}$$

The largest value of the overhang length, $x$, is $L/2$. Any longer than that and the beam will tip clockwise, with the left end lifting off its support and the right end coming to rest on its lowered support. For the purposes of calculating the maximum moment, this would be structurally equivalent to (albeit a mirror image of) what we’ve shown above.

Setting $x$ to its maximum value of $L/2$, we get

$$M = -\frac{w}{2}\left(\frac{L}{2}\right)^2 = -\frac{wL^2}{8}$$

This is exactly the same as the maximum moment with supports at the ends of the beam. The only difference is that this beam will start fracturing on the top surface (as Galileo said) instead of at the bottom surface.
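The comparison can be checked numerically with the two standard formulas: $wL^2/8$ for the midspan moment of a simply supported uniform beam, and $wx^2/2$ for the moment over a support with a uniformly loaded overhang of length $x$:

```python
# Unit weight per length and unit length; only the ratio matters
w, L = 1.0, 1.0

# Midspan moment with supports at both ends
m_ends = w * L**2 / 8

# Moment over the middle support with half the length overhanging
x = L / 2
m_overhang = w * x**2 / 2

print(m_ends, m_overhang)   # both 0.125
```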

So why didn’t the beam crack before the mechanic put in the middle support? I can think of a few reasons:

1. Because stone isn’t uniform in strength, it may be that the top of the beam is a bit weaker than the bottom. If so, a beam that’s just barely able to survive a high positive moment will fail when it sees a high negative moment of the same magnitude. This means that the beam’s survival when it was briefly on two end supports only was pure luck. If it had been set down with the weak side on the bottom, it would have failed then.
2. There is some time dependence to the cracking of the stone. If so, the beam would have cracked eventually on its two end supports.
3. The story is slightly off, and the mechanic had the middle support in place right from the start.
4. The whole story is just made up, either by Galileo or whoever told it to him.

None of these scenarios support the idea that what the mechanic did was add complexity that caused the failure.

I got curious about why Galileo even told this story. It’s in his Dialogues Concerning Two New Sciences (that’s an affiliate link), and is part of an introduction to the square-cube law. It’s preceded by

Thus, for example, a small obelisk or column or other solid figure can certainly be laid down or set up without danger of breaking, while the very large ones will go to pieces under the slightest provocation, and that purely on account of their own weight.

and succeeded by

This is an accident which could not possibly have happened to a small column, even though made of the same stone and having a length corresponding to its thickness, i.e., preserving the ratio between thickness and length found in the large pillar.

But the story of the inserted middle support has nothing to do with the square-cube law. It’s not a story about small structures being better able to support their own weight than large structures. Even weirder, Galileo knew perfectly well that the beam with half its length overhanging a support had the same strength as one supported at its ends. Later on in the book, near the end of the second day,2 Galileo explicitly explains this:

Hitherto we have considered the moments and resistances of prisms and solid cylinders fixed at one end with a weight applied at the other end; three cases were discussed, namely, that in which the applied force was the only one acting, that in which the weight of the prism itself is also taken into consideration, and that in which the weight of the prism alone is taken into consideration.

He’s referring to an earlier discussion on the behaviour of cantilever beams. Here’s the charming drawing that goes along with that portion of the book:

Let us now consider the same prisms and cylinders when supported at both ends or at a single point placed somewhere between the ends.

Here’s the drawing that goes with the new discussion:

I should point out that for the situation in which the right support has sunk out of the way and the middle support is at exactly $x = L/2$, the reaction at the left support will be zero, so it’s as if the beam were balanced on the central support, as Galileo shows in the upper drawing.

Here’s the nut:

In the first place, I remark that a cylinder carrying only its own weight and having the maximum length, beyond which it will break, will, when supported either in the middle or at both ends, have twice the length of one which is mortised into a wall and supported only at one end.

There you have it. The maximum length of a beam that’s balanced on a support at its center is equal to the maximum length of a beam that’s on supports at both ends. Both of them are twice the maximum length of a cantilever beam.

For completeness, here’s his explanation. It’s tough sledding.

This is very evident because, if we denote the cylinder by ABC and if we assume that one-half of it, AB, is the greatest possible length capable of supporting its own weight with one end fixed at B, then, for the same reason, if the cylinder is carried on the point G, the first half will be counterbalanced by the other half BC. So also in the case of the cylinder DEF, if its length be such that it will support only one-half this length when the end D is held fixed, or the other half when the end F is fixed, then it is evident that when supports, such as H and I, are placed under the ends D and F respectively the moment of any additional force or weight placed at E will produce fracture at this point.

In the introduction to Two New Sciences, the translators, Henry Crew and Alfonso de Salvio, say that they have “made this translation as literal as is consistent with clearness and modernity.” We may take it, then, that Galileo wrote like a patent attorney.

Having gone through all of the second day and a good chunk of the first day of Two New Sciences, I still have no idea why Galileo included the story of the column and the mechanic. It has nothing to do with the topic being covered where he included it, and it doesn’t match his own (generally correct) understanding of the strength of beams. I wonder if it was written before he’d done the work on the maximum length of beams, and he just never went back to check it.

I think we can forgive Galileo this lapse. He was creating new knowledge and, given his trouble with the Vatican, was desperate to get it published. Editing was of secondary concern at best.

I’m less forgiving of Tim Harford. Anyone who’s taken a statics class could have told him that the story on which he was basing “Galileo’s Principle” didn’t demonstrate that principle.

1. Yes, I thought the show oversimplified complexity.

2. The book’s conceit is that it is a series of discussions between a teacher and two pupils. The discussions take place over four days.


# The key to sorting in Python

Permalink - Posted on 2019-11-23 19:09

A couple of weeks ago, a member of The Incomparable Slack complained that the list of offerings on the new Disney+ service wasn’t convenient for scanning. Too many titles are clumped together in the Ds because they start with “Disney’s.” To waste some time when I should have been raking leaves, I played around with ways to sort titles that would get around that problem. Later that week, I was able to recoup some of that wasted time by using what I’d learned to sort a long list in a program I was writing for work.

When computer scientists discuss sorting, they usually focus on the efficiency of various algorithms. Here, I’m more interested in the conventions we use to decide which items should be sorted before others. In the US, for example, the convention is that people’s names should be sorted alphabetically in lastname firstname order, even when they are presented in firstname lastname order. When you are used to that convention, it’s jarring to see it reversed. I’ve noticed this happening more often as it’s become more common for lists to be sorted by programs than by people.

When I look for Kindle book deals at Amazon, for example, I often scan the lists of authors whose books are on sale. Amazon presents the author list sorted by first name.

Ben Winters doesn’t belong in B; he should be in W with Donald Westlake. Obviously, Amazon knows the right way to do this but has decided it isn’t worth the effort.1

A simple way to avoid the “too many Disneys” problem is to extend the longstanding convention of moving definite and indefinite articles to the end of a title: “A Winter’s Tale” gets alphabetized as “Winter’s Tale A” and ends up in the W section instead of the A section. All we have to do is treat “Disney’s” the same way we treat “a,” “an,” and “the.”

As I was thinking about implementing this in Python, the apostrophe in “Disney’s” alerted me to another problem: non-ASCII characters. I wasn’t sure how Python treated them and whether that was how I wanted them treated. So I did some experimenting.

Sorting in Python is done by either the sort method (if you want to sort a list in place) or the sorted function (if you want to create a new list that’s sorted). By default, a list of strings will be sorted by comparing each character in turn, but you can change that by specifying a key function through the key parameter:

mylist.sort(key=myfunction)


The key function must be written to take a single instance of the kind of thing being sorted and return a value. The returned value can be a number, a string, a list, or anything else that Python already knows how to sort.
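For instance, the lastname-firstname convention mentioned earlier takes only a small key function. This is an illustrative sketch, not the script below; the names are a made-up sample:

```python
names = ['Donald Westlake', 'Ben Winters', 'Agatha Christie']

def lastname_key(name):
    # Compare by the last word (surname) first, then the rest of the name
    parts = name.casefold().split()
    return [parts[-1]] + parts[:-1]

print(sorted(names, key=lastname_key))
```

With this key, Ben Winters lands in the Ws next to Donald Westlake instead of in the Bs.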

Here was my first script for sorting Disney titles:

python:
1:  #!/usr/bin/env python
2:
3:  import re
4:  import sys
5:
6:  articles = ['a', 'an', 'the', 'disney', 'disneys']
7:  punct = re.compile(r'[^\w ]')
8:
9:  titles = sys.stdin.readlines()
10:  titles = [ t.strip() for t in titles ]
11:
12:  def titlekey(title):
13:          s = title.casefold()
14:          s = punct.sub('', s)
15:          w = s.split()
16:          while w[0] in articles:
17:                  w = w[1:] + w[:1]
18:          return w
19:
20:  titles.sort(key=titlekey)
21:  print('\n'.join(titles))


It expects standard input to consist of a series of titles, one per line, and outputs a similar series of lines but with titles in alphabetical order. This input:

A Tiger Walks
Adventures in Babysitting
The Løve Bug
Måry Poppîns
That Darn Cát!
One Hundred and One Dalmatians
Pollyanna
Kidnapped
Dumbo
The Sign of “Zörro”
The Prinçess and the Frog
The Parent Trap
Kim Poßible
Boy Meets World
Disney’s The Kid
Disney’s The Christmas Carol
Disney’s A Christmas Carol
Disney’s Fairy Tale Weddings
James and the Giant Peach
Moana
Melody Time
Mulan
The Many Adventures of Winnie the Pooh


produces this output:

Adventures in Babysitting
Boy Meets World
Disney’s A Christmas Carol
Disney’s The Christmas Carol
Dumbo
Disney’s Fairy Tale Weddings
James and the Giant Peach
Disney’s The Kid
Kidnapped
Kim Poßible
The Løve Bug
The Many Adventures of Winnie the Pooh
Melody Time
Moana
Mulan
Måry Poppîns
One Hundred and One Dalmatians
The Parent Trap
Pollyanna
The Prinçess and the Frog
The Sign of “Zörro”
That Darn Cát!
A Tiger Walks


Note that I’ve changed a character or two in many titles to see how non-ASCII characters get sorted.

The bulk of the script’s logic is in the titlekey function, which gets passed as the key parameter in the sort call on Line 20. titlekey starts by applying the casefold method to the input, which the documentation describes as “similar to lowercasing but more aggressive because it is intended to remove all case distinctions in a string.” We don’t want uppercase letters sorting before lowercase, and I was hoping casefold would also handle non-ASCII characters gracefully.

Line 14 then gets rid of all the punctuation in the title. For our purposes, punctuation is defined on Line 7 as everything that isn’t a word character (letter, numeral, or underscore) or a space. I thought I could use the punctuation item defined in Python’s string module, but it doesn’t include curly quotes, em and en dashes, or other non-ASCII punctuation.

Line 15 splits the string into words, and Lines 16–17 loop through the words, moving articles (as defined in Line 6 to include both “disney” and “disneys” in addition to actual articles) to the end of the list. Line 18 returns the list of rearranged title words.

An interesting thing about titlekey is that it can deal with more than one article. As you can see from the sorted list, “Disney’s The Kid” was put in the K group. When passed to titlekey, it returned the list ['kid', 'disneys', 'the'].

But all is not well with titlekey. Note that “Måry Poppîns” got placed after “Mulan,” which doesn’t seem right to me. Clearly, Python thinks the non-ASCII å comes after the ASCII u, which is not how I think of it. I think characters with diacritical marks should sort like their unadorned cousins.2
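The reason is that Python’s default string ordering is by Unicode code point, character by character, and å (U+00E5) comes after every unaccented Latin letter:

```python
# Strings compare code point by code point
print(ord('u'), ord('å'))   # 117 229

# So 'måry' sorts after 'mulan' even though m-a < m-u alphabetically
print(sorted(['måry', 'mulan', 'melody']))
```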

So here’s a revised version of the script:

python:
1:  #!/usr/bin/env python
2:
3:  from unidecode import unidecode
4:  import re
5:  import sys
6:
7:  articles = ['a', 'an', 'the', 'disney', 'disneys']
8:  punct = re.compile(r'[^\w ]')
9:
10:  titles = sys.stdin.readlines()
11:  titles = [ t.strip() for t in titles ]
12:
13:  def titlekey(title):
14:          s = unidecode(title).lower()
15:          s = punct.sub('', s)
16:          w = s.split()
17:          while w[0] in articles:
18:                  w = w[1:] + w[:1]
19:          return w
20:
21:  titles.sort(key=titlekey)
22:  print('\n'.join(titles))


There are only two changes: I imported the unidecode module in Line 3 and used its unidecode function in Line 14 instead of casefold. What unidecode does is transliterate non-ASCII characters into their ASCII “equivalent.” It’s a Python port of a Perl module and shares its advantages and disadvantages. For accented characters it does a good job.

The new version of titlekey returns ['mary', 'poppins'] when given Måry Poppîns, so it sorts the list of titles the way I expect.

Of course, unidecode’s transliteration is not an unalloyed success. If we feed the new script

Måry Poppîns
Märy Poppîns II
Máry Poppîns II
Märy Poppîns
Måry Poppîns II
Máry Poppîns
Moana
Melody Time
Mulan
The Many Adventures of Winnie the Pooh


we get back

The Many Adventures of Winnie the Pooh
Måry Poppîns
Märy Poppîns
Máry Poppîns
Märy Poppîns II
Máry Poppîns II
Måry Poppîns II
Melody Time
Moana
Mulan


This is pretty good, but notice that because the accented a’s are all treated the same, they don’t get sorted among themselves. In the “Mary Poppins” set, the order goes

å ä á


while in the “Mary Poppins II” set, the order goes

ä á å


This is because Python’s sort is stable—items that have the same value come out of the sort in the same order they went in. In the original list, “Måry Poppîns” came before “Märy Poppîns” but “Märy Poppîns II” came before “Måry Poppîns II,” so that’s the order they come out.
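Stability is easy to demonstrate with a deliberately coarse key; items whose keys compare equal come out in their input order:

```python
# Sort by length only; the 4-letter words tie, so they keep their input order
fruits = ['pear', 'plum', 'fig', 'kiwi', 'date']
print(sorted(fruits, key=len))   # ['fig', 'pear', 'plum', 'kiwi', 'date']
```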

Because our older version of the script—the one that uses casefold—doesn’t replace the accented characters, it does what might be considered a better job with all the “Mary Poppins” titles. Here’s how it sorts the list:

The Many Adventures of Winnie the Pooh
Melody Time
Moana
Mulan
Máry Poppîns
Máry Poppîns II
Märy Poppîns
Märy Poppîns II
Måry Poppîns
Måry Poppîns II


Obviously, it still has the problem of putting them all after “Mulan,” but there’s a nice regularity to its “Mary Poppins” series.

What would be the best sorting? I’d say

The Many Adventures of Winnie the Pooh
Máry Poppîns
Märy Poppîns
Måry Poppîns
Máry Poppîns II
Märy Poppîns II
Måry Poppîns II
Melody Time
Moana
Mulan


This puts all the “Marys” before “Melody Time,” puts all the “IIs” after the originals, and puts the variously accented as in the same order in both the “Mary Poppins” and “Mary Poppins II” sections. Can we do this? Yes, by making two lists in titlekey, one that transliterates via unidecode and another that just does casefold. Then we sort according to a compound list of lists:

python:
1:  #!/usr/bin/env python
2:
3:  from unidecode import unidecode
4:  import re
5:  import sys
6:
7:  articles = ['a', 'an', 'the', 'disney', 'disneys']
8:  suffixes = {'ii':2, 'iii':3, 'iv':4, 'v':5, 'vi':6, 'vii':7, 'viii':8, 'ix':9, 'x':10}
9:  punct = re.compile(r'[^\w ]')
10:
11:  titles = sys.stdin.readlines()
12:  titles = [ t.strip() for t in titles ]
13:
14:  def titlekey(title):
15:          s = unidecode(title).lower()
16:          r = title.casefold()
17:          s = punct.sub('', s)
18:          r = punct.sub('', r)
19:          w = s.split()
20:          v = r.split()
21:          if w[-1] in suffixes:
22:                  number = f'{suffixes[w[-1]]:02d}'
23:                  w = w[:-1] + [number]
24:                  v = v[:-1]
25:          while w[0] in articles:
26:                  w = w[1:] + w[:1]
27:          while v[0] in articles:
28:                  v = v[1:] + v[:1]
29:          return [w, v]
30:
31:  titles.sort(key=titlekey)
32:  print('\n'.join(titles))


Note also the suffixes dictionary, which we use to better handle the possibility of Roman numerals at the end of a title. The idea is to convert them to Arabic so IX doesn’t come before V. I stopped at X, but that dictionary could easily be extended if necessary.3

The value returned by titlekey in Line 29 is a list of two lists. The first is basically what the unidecode version of the script gave us but with a two-digit Arabic number (see Lines 22–23) at the end if the title ended with a Roman numeral. The second is the list returned by the casefold version of the script, but with any Roman numeral stripped off.

By setting up the return value in this nested way, sort compares titles first by their unidecoded version and then by their casefolded version. That gives the ordering I like, with accented characters generally treated as unaccented but with different accents sorted consistently.
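The effect of the two-part key can be seen in miniature. Here’s a toy example (not the script above; the transliteration is faked with a couple of replace calls and a made-up function name so the example stays self-contained) showing how Python compares the first sub-list and falls back to the second:

```python
# Toy illustration of a two-part sort key: Python compares the first
# sub-list, and only on a tie moves on to the second.
titles = ['Märy II', 'Måry', 'Märy', 'Måry II']

def titlekey2(t):
    # First part: å and ä both mapped to plain a, so accents tie.
    translit = t.replace('å', 'a').replace('ä', 'a').lower().split()
    # Second part: casefolded, so the tie is broken consistently.
    folded = t.casefold().split()
    return [translit, folded]

result = sorted(titles, key=titlekey2)
print(result)  # → ['Märy', 'Måry', 'Märy II', 'Måry II']
```

Both “Märy”s sort before both “Måry”s, and the originals come before the sequels, just as in the full script.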

I’m certain you can come up with lists that won’t sort properly with this final version of the script. If you feel compelled to send them to me, make sure you include a script that can handle them.

I mentioned at the top that I used what I learned during my raking avoidance in a script for work. The work script had nothing to do with alphabetizing non-ASCII characters, but it did use the key parameter.

What I had was a list of strings that looked like “402-1,” “804-13,” and “1201-2.” Because of how they were entered, they were jumbled up, and I needed them sorted numerically by the first number, then the second number. Because there were no leading zeros, an alphabetical sort wouldn’t work. I did it by passing a lambda function as the key:

python:
mylist.sort(key=lambda x: [ int(y) for y in x.split('-') ])


The key function splits the string on the hyphen, converts each substring to an integer, and returns the list of integer pairs. Simple and easy to write, but something I would have spent more time on if I hadn’t been thinking about key just a few days earlier.
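Here it is at work on a few made-up strings of that form:

```python
# Sample data invented for illustration; the real list came from work files.
mylist = ['804-13', '402-10', '1201-2', '402-1']
# Split on the hyphen, convert to integers, sort by the resulting pairs.
mylist.sort(key=lambda x: [int(y) for y in x.split('-')])
print(mylist)  # → ['402-1', '402-10', '804-13', '1201-2']
```

An alphabetical sort would have put '1201-2' first and '402-10' before '402-1'; the numeric key gets both right.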

1. Or they’ve done some horrible A/B comparison and decided that people stay on the site longer when lists are alphabetized the wrong way.

2. An argument certainly can be made for leaving “Måry Poppîns” where it is. An å is not the same as an a, and maybe it should be sorted after u. But because I’m writing this for my own purposes (to avoid raking leaves), I get to decide the most reasonable sorting order. By revising the script to get å to sort like a, I stay at my computer longer.

3. I suspect there’s a Roman numeral conversion module, too, which would be better than a dictionary.

[If the formatting looks odd in your feed reader, visit the original article]

# Prompt forever

Permalink - Posted on 2019-11-16 16:41

A lot of my work on the iPad isn’t really on the iPad. It’s command line work that’s executed on a Mac or a Linux server while I use my iPad as a terminal via SSH. A feature recently added to Prompt has made that much easier.

The problem I’ve always had with using Prompt is that it would disconnect from the server if it was the background app for more than a few minutes. This was a “feature” of iOS that the folks at Panic couldn’t seem to get around. Because it’s common for me to jump between three or four apps while working, I couldn’t always keep Prompt as one of the active apps in Split View, and I’d often have to reconnect.

I tried making the reconnection as painless as possible by using tmux. This worked, but working in a tmux session didn’t allow me to scroll back freely to see the results of old commands. Using the mobile shell, Mosh, and the Blink app had the same scrolling problem. (I should make it clear that when using tmux or mosh I could use keyboard commands to scroll back a page at a time, but I didn’t want to work as if I were sitting at a text-only terminal in 1982. I like working at the command line, but I want to do so in a modern setting.)

In recent months, the terminal feature built in to Textastic has been my workhorse. Because I tend to keep Textastic active in one of the Split View panes as I take notes, edit a program, or write a report, its connection to the server seldom gets culled by the operating system. Unfortunately, its terminal emulation isn’t as full-featured as Prompt’s—it’s fine for simple tasks, but not great for a Jupyter console session.

Recently, though, I learned that a new feature of Prompt may give me everything I want. It’s called Connection Keeper, and when I first heard of it I had two misconceptions:

1. I thought it was a temporary workaround to the terrible background process culling that iOS 13.2.1 was doing and wouldn’t be helpful once Apple fixed that in 13.2.2.
2. I thought it was essentially built into Prompt, not a setting that has to be turned on.

I was set straight earlier this week by Athanasios Alexadrides and Anders Borum.

Here’s what Panic says in the latest release notes:

Prompt’s new Connection Keeper feature lets you audit exactly when and where you’ve connected to your servers. It also helps keep your connections alive while Prompt is in the background.

You can turn this feature on in “Settings > Connection Keeper”

While I’m sure there are plenty of people who want to track where and when they connected to the server, it’s the part tucked after the “also” that excites me. Since turning Connection Keeper on several days ago, every SSH session in Prompt has had an essentially permanent connection, no matter how long I’ve had Prompt in the background.

When Prompt is connected to a server and in the background, this flag appears in the status bar:

It’s similar to the background flags that appear when you’re talking on the phone or getting directions in Maps. As with those flags, tap it and you’re taken back to Prompt.

As Panic says in the release notes, you turn Connection Keeper on in Prompt’s settings, but you might have trouble finding it. Like many gear/hamburger “menus” in iPadOS, Prompt’s Settings popup is limited in height. On my 12.9″ iPad, only the items on the left are shown when I tap the gear icon.

At first, I didn’t notice the line under the Keyboard item that indicates there are more items available. Scrolling up revealed all the other items shown on the right.

Turning Connection Keeper on is pretty much what you’d expect: flip the slider button to the right.

As with the release notes, the instructions here focus more on connection history than connection maintenance. I suspect Panic doesn’t want to oversell connection maintenance because it’s not entirely under their control; they know Apple could kill it with another point release.

But until that happens, I’m enjoying SSH connections that last as long as I want them to. One more step in the direction of making the iPad into a full-featured computer.

[If the formatting looks odd in your feed reader, visit the original article]

# A little chart adjustment

Permalink - Posted on 2019-11-14 04:11

One of the best things about today’s introduction of the 16″ MacBook Pro was that it inspired Marco Arment to write a new blog post—a rare treat. I am perhaps unusual in that I think Marco is a better writer than he is a talker. I like his talking; I just like his writing more.

Of course he commented on the new keyboard, and his comments were accompanied by this cute graph comparing the new keyboard with the old one and some others:

It’s nicely done. Displaying the key spacing and travel as a stacked pair of charts was a good idea. Putting them together in a single chart with the spacing and travel columns next to one another would have been the easy choice, but it wouldn’t be nearly as clear. And the decision to show the key travel columns growing down—the direction of travel—was inspired. It does what good charts do: give you an instant understanding of the point being made while also providing a clear way to explore the data further.

But I have opinions about chart style, and I think a handful of small improvements could be made. Most of these could not be made within the charting software Marco was using (Numbers, I think); the charts would have to be imported into a drawing package and manipulated there. But the small extra effort would be worth it.

I don’t have the data Marco was working from, and I wanted some practice editing images in Pixelmator on my iPad, so I edited his PNG image instead of creating a new chart from scratch. Here’s his chart with my edits:

First, I made it seem more like a single chart by getting rid of the superfluous second set of categories and moving the lower chart up to make it more obvious that it and the upper chart are sharing the categories. The legends were Marco’s only bad idea, so I got rid of them. A chart with only one data set doesn’t need a legend, it needs a title. I confess my placement of the titles could be better.

I also changed two of the grid lines in the lower chart, thickening the one at 0 and thinning the one at 5 to make it clear which was the base from which the red columns were growing. And I deleted the minus signs in front of the ordinate labels. I’m sure Marco used negative values to trick the charting software into making his columns grow downward, but we think of key travel as a positive number, and that’s how it should be labeled.

One change I couldn’t make in Pixelmator, but is the change I first thought of when looking at Marco’s original, is to adjust the ticks and grid lines to whole millimeters. This is a pet peeve of mine: charting algorithms often make unnatural decisions about default tick spacing, setting the marks at places that no human would. In this case, the algorithm no doubt felt that four divisions was best, leaving us with a silly grid spacing of 1.25 mm. It is true that adding an extra grid line would make the chart a little busier, but the advantage of having simple numbers as the labels outweighs the additional clutter. This is especially true given that one of the main points of the new keyboard is that its travel is 1.0 mm.

My apologies to Marco for this. His chart really is good. I just can’t help myself.

[If the formatting looks odd in your feed reader, visit the original article]

# Accidents and estimates

Permalink - Posted on 2019-11-10 14:53

I came very close to being in a car accident on Friday, which got me thinking about kinematics and estimation in engineering calculations.

I was stopped in a line of cars at a traffic light when something—probably the squeal of brakes, although I may have heard that only after the fact—made me look up into my rear view mirror. A car going way too fast was coming up behind me. I leaned back in my seat, put my hands on my head, and closed my eyes, waiting for the impact.

Which never came.

After a couple of seconds, I opened my eyes and looked in the mirror. There was a car back there, slanted out toward the shoulder, but it was much further away than I expected. Then I noticed a car on the shoulder to my right and ahead of me. That was the car I had expected to hit me. The driver had managed to swerve to the right and avoid me.

That led to some conflicting feelings. I was pleased he was skillful enough to steer out of the accident but angry at his stupidity in needing to exercise that skill. Then the engineer in me took over. If he came to a stop ahead of me, how fast would he have hit me if he hadn’t veered to the right?

It’s a pretty simple calculation, the kind you learn in high school physics. There are two equations of kinematics we need:

$$d = v_0 t - \frac{1}{2} \alpha g t^2$$

and

$$v_0 = \alpha g t$$

These cover the period of time from when his front bumper passed my rear bumper to when he came to rest. The distance traveled is $d$, his speed at the beginning of this period is $v_0$, the duration is $t$, and the deceleration (assumed constant) is $\alpha g$. It’s common in situations like this to express the acceleration or deceleration as a fraction of the acceleration due to gravity; $\alpha$ is a pure number.

We don’t really care about $t$, so with a little algebra we can turn these into a single formula with only the variables of interest:

$$v_0 = \sqrt{2 \alpha g d}$$

Based on where the car ended up, I’d say $d$ is about 25 feet. The deceleration factor, $\alpha$, is a bit more dicey to estimate, but it’s likely to be somewhere around 0.6 to 0.8. And since we’re using feet for distance, we’ll use 32.2 $\mathrm{ft/s^2}$ for $g$. That gives us a range of values from 31 to 36 $\mathrm{ft/s}$ for $v_0$. Converting to more conventional units for car speeds, that puts him between 21 and 24 mph. That would have been a pretty good smack. Not only would my trunk have been smashed in, I likely would have had damage to my front bumper from being pushed into the car ahead of me.
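With constant deceleration $\alpha g$ over a stopping distance $d$, the initial speed works out to $v_0 = \sqrt{2 \alpha g d}$, so the range above is easy to check numerically. A quick sketch:

```python
from math import sqrt

# Back-of-the-envelope check of the numbers in the text.
d = 25.0    # stopping distance, ft (estimated)
g = 32.2    # acceleration due to gravity, ft/s^2
v0 = {alpha: sqrt(2 * alpha * g * d) for alpha in (0.6, 0.8)}
for alpha, v in v0.items():
    # 3600 s/hr ÷ 5280 ft/mi converts ft/s to mph
    print(f'alpha = {alpha}: v0 = {v:.0f} ft/s = {v * 3600 / 5280:.0f} mph')
```

This prints 31 ft/s (21 mph) for $\alpha = 0.6$ and 36 ft/s (24 mph) for $\alpha = 0.8$, matching the range quoted above.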

This was a simple calculation, but it illustrates an interesting feature of estimation. Despite starting with a fairly wide range in our estimate of $\alpha$ (0.8 is 33% higher than 0.6), we ended up with a much narrower range in the estimate of $v_0$ (24 is only about 15% higher than 21). For this we have the square root to thank. It cuts the relative error in half.

Why? Let’s say we have this simple relationship:

$$a = \sqrt{b}$$

We can express our uncertainty in the value of $b$ by saying our estimate of it is $b (1 + \epsilon)$, where $\epsilon$ is the relative error in the estimate. We can then say

$$\sqrt{b(1 + \epsilon)} = \sqrt{b}\,\sqrt{1 + \epsilon}$$

and using the Taylor series expansion of the second term about $\epsilon = 0$, we get

$$\sqrt{b(1 + \epsilon)} = \sqrt{b}\left(1 + \frac{\epsilon}{2} + \mathrm{h.o.t.}\right)$$

If the absolute value of $\epsilon$ is small, the higher order terms (h.o.t) won’t amount to much, and the relative error of $a$ will be about half that of $b$.
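The halving is easy to verify numerically. A sketch, using the 33% spread in $\alpha$ from the example above:

```python
from math import sqrt

# A 33% relative error in b becomes roughly half that in sqrt(b).
eps = 0.33
# For a = sqrt(b), the relative error is independent of b, so take b = 1.
rel_err_a = sqrt(1 + eps) - 1
print(f'{rel_err_a:.3f}')  # → 0.153
```

The result, 15.3%, is slightly under half of 33% because the higher-order terms, though small, pull it down a bit.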

Lots of engineering estimates aren’t as forgiving as this one, so it’s important to know when your inputs have to be known precisely and when you can be a little loose with them.

Speaking of forgiving, I searched for rear end crash test results for my car to see how much damage it would have taken. I came up empty, but here’s a more recent model undergoing an impact at twice the speed.

[If the formatting looks odd in your feed reader, visit the original article]

# I don't want nobody nobody sent

Permalink - Posted on 2019-11-02 13:57

I have been known to complain bitterly about Apple’s decline in software quality. Sometimes the complaints have been typed into Twitter; more often they’ve been spit out between clenched teeth as yet another damned thing doesn’t do what it should. But iOS 13 has added one new feature that is both incredibly valuable and works.

It’s Silence Unknown Callers, an option you can find in the Phone section of the Settings app.

In my experience, it does exactly what it says and has already saved me lots of time and frustration. It’s not that I often answered spam calls; I had already trained myself to almost never pick up a call that my phone didn’t associate with a contact. But I still had to stop what I was doing and look at my phone or my watch whenever one came in. Now that’s a thing of the past.

I’ve seen many people say they can’t use Silence Unknown Callers because they often need to take cold calls. I pity those people. I did wonder myself whether it was OK to silence business calls from prospective new clients who aren’t yet in my contacts list, but a little thought led me to the conclusion that those callers always leave messages and aren’t offended by having to do so.

Oddly enough, my first and only bad experience with Silence Unknown Callers was the exact opposite of missing an important call. A day or two after I had turned it on, a spam call rang on my phone. My initial reaction was that Apple had screwed up (yet again), but no. The call rang because the number was in my Contacts. Over several years I had collected spam numbers into a special contact—called AAASpammer to put it at the top of the list—that was blocked. I had apparently mistakenly unblocked him,1 and now because he was in my list of contacts, and the caller was reusing a number associated with that contact, the call rang through. I deleted AAASpammer from Contacts and have not been bothered by a spam call since.

If you have any sense of the history of Chicago machine politics, you will recognize the source of the post’s title. Spam callers are nobodies that nobody sent.

1. In my experience, when adding a new number to a blocked contact, you had to unblock the caller and then reblock him to get the newly added number to “take” (whether this was a bug or an idiotic design choice by Apple, I never knew). I must have missed a step in this dance the last time I added a number and left the contact unblocked.

[If the formatting looks odd in your feed reader, visit the original article]

# Data cleaning from the command line

Permalink - Posted on 2019-11-01 01:02

If you’ve been reading John D. Cook’s blog recently, you know he’s been writing about “computational survivalism,” using the basic Unix text manipulation tools to process data. I know Kieran Healy has also written about using these tools instead of his beloved R for preprocessing data; this article talks a little about it, but I feel certain there’s a longer discussion somewhere else. You should probably just read all his posts. And John Cook’s, too. I’ll be here when you’re done.

I use Unix tools to manipulate data all the time. Even though a lot of my analysis is done via Python and Pandas, those old command-line programs just can’t be beat when it comes to wrangling a mass of text-formatted data into a convenient shape for analysis. I have a particular example from work I was doing last week.

I needed to analyze dozens of data files from a series of tests run on a piece of equipment. The data consisted of strain gauge and pressure gauge readings taken as the machine was run. After a brief but painful stint of having the test data provided as a set of Excel files, I got the technician in charge of the data acquisition to send me the results as CSV. That’s where we’ll start.

Each data file starts with 17 lines of preamble information about the software, the data acquisition channels, the calibration, and so on. This is important information for documenting the tests—and I make sure the raw files are saved for later reference—but it gets in the way of importing the data as a table.

I figured a simple sed command would delete these preamble lines, but I figured wrong. For God knows what reason, the CSV file that came from the computer that collected the data was in UTF-16 format (is this common in Windows?), even though there wasn’t a single non-ASCII character in the file. UTF-16 is not something sed likes.

So I took a quick look at the iconv man page and wrote this one-liner to get the files into a format I could use:

for f in *.csv; do iconv -f UTF-16 -t UTF-8 "$f" > "$f.new"; done


I suppose I could have chosen ASCII as the “to” (-t) format, but I’m in the habit of calling my files UTF-8 even when there’s nothing outside of the ASCII range.
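If you’d rather stay in Python for this step, the same batch conversion can be sketched there (the filename and helper name here are made up; 'utf-16' handles the BOM that Windows tools typically write):

```python
from pathlib import Path
import tempfile

def utf16_to_utf8(path):
    """Rewrite a UTF-16 file as UTF-8, saving to path + '.new'."""
    text = path.read_text(encoding='utf-16')   # decodes and strips the BOM
    out = path.with_name(path.name + '.new')
    out.write_text(text, encoding='utf-8')
    return out

# Demonstrate in a temporary directory with an invented file.
tmp = Path(tempfile.mkdtemp())
f = tmp / 'run1.csv'
f.write_text('Scan,Time\n1,0.0\n', encoding='utf-16')
print(utf16_to_utf8(f).read_text(encoding='utf-8'))
```

In practice you’d loop over `Path('.').glob('*.csv')` the same way the shell loop does.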

After a quick check with BBEdit to confirm that the files had indeed converted, I got rid of the UTF-16 versions and replaced them with the UTF-8 versions

rm *.csv
rename 's/\.new//' *.new


The rename command I use is my adaptation of Larry Wall’s old Perl script. The first argument is a Perl-style substitution command.

With the files in the proper format, it was time to delete the 17-line preamble. The easiest way to do this is with the GNU version of sed, which can be installed through Homebrew:

gsed -i '1,17d' *.csv


The sed that comes with macOS requires a bit more typing:

for f in *.csv; do sed -i.bak '1,17d' "$f"; done


The built-in sed forces you to do two things:

1. Include an extension with the -i switch so you have backups of the original files.
2. Use a loop to go through the files. A command like sed -i.bak '1,17d' *.csv would concatenate all the CSV files together and delete the first 17 lines of that. The upshot is that only the first CSV file would have its preamble removed.

Hence my preference for gsed.

With the preambles gone, I now had a set of real CSV files. Each one had a header line followed by many lines of data. I could have stopped the preprocessing here, but there were a couple more things I wanted to change.

First, the data acquisition software inserts an “Alarm” item after every data channel, effectively giving me twice as many columns as necessary. To get rid of the Alarm columns, I needed to know which ones they were. I could have opened one of the files in BBEdit and started counting through the header line, but it was more conveniently done via John Cook’s header numbering one-liner:

head -1 example.csv | gsed 's/,/\n/g' | nl


Because all the CSVs have the same header line, I could run this on any one of them. The head -1 part extracts just the first line of the file. That gets piped to the sed substitution command, which converts every comma into a newline. Finally, nl prefixes each of the lines sent to it with its line number.
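For what it’s worth, the same numbered listing can be produced in Python; here’s a sketch using a shortened, made-up header line:

```python
def number_header(line):
    """Number the comma-separated fields of a header line, nl-style."""
    return [f'{i:2}  {name}' for i, name in enumerate(line.split(','), start=1)]

demo = 'Scan,Time,101 <SG1> (VDC),Alarm 101'
for row in number_header(demo):
    print(row)
```

The real file’s header has 20 fields; the pipeline and this sketch produce the same kind of listing.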
What I got was this:

     1  Scan
     2  Time
     3  101 <SG1> (VDC)
     4  Alarm 101
     5  102 <SG2> (VDC)
     6  Alarm 102
     7  103 <SG3> (VDC)
     8  Alarm 103
     9  104 <SG4> (VDC)
    10  Alarm 104
    11  105 <SG5> (VDC)
    12  Alarm 105
    13  106 <SG6> (VDC)
    14  Alarm 106
    15  107 <SG7> (VDC)
    16  Alarm 107
    17  108 <SG8> (VDC)
    18  Alarm 108
    19  109 <Pressure> (VDC)
    20  Alarm 109


With the headers laid out in numbered rows, it was easy to construct a cut command to pull out only the columns I needed:

for f in *.csv; do cut -d, -f 1-3,5,7,9,11,13,15,17,19 "$f" > "$f.new"; done


I’d like to tell you I did something clever like

seq -s, 3 2 19 | pbcopy


to get the list of odd numbers from 3 to 19 and then pasted them into the cut command, but I just typed them out like an animal. Once again, I ran

rm *.csv
rename 's/\.new//' *.new


to get rid of the old versions of the files.

The last task was to change the header line to a set of short names that I’d find more convenient when processing the data in Pandas. For that, I used sed’s “change” command:

gsed -i '1cScan,Datetime,G1,G2,G3,G4,G5,G6,G7,P' *.csv


With this change, I could access the data in Pandas using simple references like

df.G1


instead of long messes like

df['101 <SG1> (VDC)']


All told, I used nine commands to clean up the data. If I had just a few data files, I might have done the editing by hand. It is, after all, quite easy to delete rows and columns of a spreadsheet. But I had well over a hundred of these files, and even if I had the patience to edit them all by hand, there’s no way I could have done it without making mistakes here and there.

Which isn’t to say I wrote each command correctly on the first try. But after testing each command on a single file, I could then apply it with confidence to the full set.1

The tests were run in several batches over a few days. Once I had the commands set up, I could clean each new set of data in no time.

1. And it took a lot less time to develop and test the commands than it did to write this post.
[If the formatting looks odd in your feed reader, visit the original article]

# Old bugs never die or fade away

Permalink - Posted on 2019-10-11 15:53

A few days ago, I ran into an odd and very old bug that isn’t even Apple’s fault. Imagine that!

My discovery did start with an Apple bug, though. I was working on my larger iPad Pro and saved a couple of email attachments to a new folder in iCloud Drive. At least, that’s what I thought I’d done. When I opened Files on my smaller iPad a few minutes later, neither the files nor their enclosing folder, which I had created on the spot while saving the files, were visible. Probably just a brief delay in syncing, I thought, but I was wrong. When I looked again half an hour later, nothing had synced. I could see the new files and folder on the large iPad but not the small one.

I wondered if the files had synced to my Macs. So I opened up Transmit on one of the iPads and tried to SFTP into my home iMac to look for the files. Couldn’t connect. Tried to SFTP into my office iMac and couldn’t connect there, either. I wondered if maybe iOS 13 was finally the end of Transmit for iOS, but a quick test showed I was able to connect to the server that hosts this blog. Also, I was able to use Prompt to log into both iMacs via SSH, so it was clear that their SSH server daemons were up and running.

Several unfruitful Googlings later, I sat down at my home iMac and tried to diagnose the problem. When I entered

ssh localhost


into Terminal, I got this response:

Received message too long 1113682792
Ensure the remote shell produces no output for non-interactive sessions.


I didn’t understand how the remote shell could be set up to produce no output—shells are supposed to produce output, aren’t they?—but I figured this sort of raw, system-level error message would be a better search term than the Transmit error messages, which were written by Panic.
And sure enough, I quickly found this Stack Exchange question, for which the first answer was the solution. And it led to this OpenSSH FAQ, which has the official answer:

sftp and/or scp may fail at connection time if you have shell initialization (.profile, .bashrc, .cshrc, etc) which produces output for non-interactive sessions. This output confuses the sftp/scp client.

So the problem is not that the shell produces output, it’s that it produces output upon initialization. And I knew immediately why my shells were doing that. A month or so ago, I was messing around with different shells and different versions of shells, and to keep track of what was running, I had added these lines to the top of .bashrc:

echo "Loading .bashrc..."
echo $BASH_VERSION


These were the culprits. They were fine in an interactive SSH session, but were triggering a long-standing bug in an SFTP session. I commented out the lines and could make SFTP connections via Transmit (and ShellFish and FileBrowser) to both iMacs again.

The OpenSSH people don’t seem to think of this as a bug. The Stack Exchange answer says it’s been around at least ten years, and the FAQ says the fix is “you need to modify your shell initialization.”

In other words, “you’re initializing it wrong.”

By the way, once I got this SFTP stuff straightened out, I learned that the new files and folder (remember those? this is a song about Alice) weren’t on either of my Macs, so some combination of Files and iCloud Drive had screwed up. While still on my Mac, I created a new folder and saved the email attachments into it. They appeared in Files on both iPads immediately, except that the new folder was called untitled folder on the larger iPad.

I hope it takes Apple less than ten years to recognize and clean up these iCloud/Files bugs.

[If the formatting looks odd in your feed reader, visit the original article]

# Discontinuous ranges in Python

Permalink - Posted on 2019-09-22 15:58

This will be the third in an unplanned trilogy of posts on generating sequences. The first two, the jot and seq one and the brace expansion one, were about making sequences in the shell. But most of the sequences I make are in Python programs, and Python has some interesting quirks.

The fundamental sequence maker is range. In Python 2 and earlier, range created a list. For example,

python:
range(2, 10)


returns

[2, 3, 4, 5, 6, 7, 8, 9]


For very long sequences, you could save space by using the xrange function, which would generate the sequence on demand (“lazy evaluation” is the term of art) rather than creating it in full all at once.

python:
r = xrange(5000)


In this case, r would not be a list but would be of the special xrange type.

In Python 3, range stopped generating lists and became essentially what xrange used to be. And the now-redundant xrange was removed from the language. So

python:
r = range(5000)


would make r into a variable of the range class, and

python:
r2 = xrange(5000)


would return an error.

For most uses, the change in range made very little difference in how I write Python scripts. But there is one use I’ve had to modify.

I mentioned in the previous two posts that I often have to create a list of apartment or unit numbers for a building. I use the list to assist in developing inspection plans and keeping track of the inspection results. In the simplest case, the list could be made like this:

python:
units = [ '{}{:02d}'.format(f, u) for f in range(2, 10) for u in range(1, 6) ]


with units having the value

['201', '202', '203', '204', '205', '301', '302', '303',
'304', '305', '401', '402', '403', '404', '405', '501',
'502', '503', '504', '505', '601', '602', '603', '604',
'605', '701', '702', '703', '704', '705', '801', '802',
'803', '804', '805', '901', '902', '903', '904', '905']


But for taller buildings, a simple range for the floors wasn’t possible, because residential buildings generally don’t have 13th floors, at least as far as addressing is concerned. The numbering scheme of the units skips directly from the 12xx set to the 14xx set.

In Python 2, I handled this by adding the lists created by range:

python:
floors = range(2, 13) + range(14, 25)


which gave floors the value

[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15, 16, 17, 18,
19, 20, 21, 22, 23, 24]


The + operator concatenated the lists produced by range, and it made for compact and easy-to-read code.

In Python 3, this doesn’t work because the new range class doesn’t understand the + operator.

python:
floors = range(2, 13) + range(14, 25)


returns a TypeError because + is unsupported for ranges.

How do we get around this? One way is to turn the ranges into lists before concatenation:

python:
floors = list(range(2, 13)) + list(range(14, 25))


This is certainly clear, but it’s ugly. Another way to do it, assuming we don’t need floors to be a list, is to use the chain function from the itertools library:

python:
from itertools import chain

floors = chain(range(2, 13), range(14, 25))


This is less ugly than the list() construct, but still not to my taste. I would use it if I had a huge discontinuous sequence to deal with, but not when I have only dozens of items.

With Python 3.5, a new way to unpack iterators was introduced to the language, extending the definition of the unary * operator. I didn’t learn about it until I was already on Python 3.7, but I’ve been making up for lost time.

You probably know about using * to unpack a list variable when calling a function. Say you have a function f that takes a list of five positional variables and a five-element list variable x that has its items ordered just the way f wants. Instead of calling f like this:

python:
a = f(x[0], x[1], x[2], x[3], x[4])


you can call it like this:

python:
a = f(*x)


The extension introduced in Python 3.5 allows us to unpack more than one list in the function call. If list variable y has two items (corresponding to the first two arguments to f) and list variable z has three (corresponding to the final three arguments), we can call f this way:

python:
a = f(*y, *z)
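
As a concrete, runnable illustration (my own toy function, not from the post), here’s what multi-list unpacking in a call looks like:

```python
def label(building, floor, unit, wing, suffix):
    # Join five positional arguments into one address string.
    return f"{building}-{floor}{unit} {wing}{suffix}"

y = ["B", 12]            # corresponds to the first two arguments
z = ["A", "North", "!"]  # corresponds to the final three arguments

# Equivalent to label(y[0], y[1], z[0], z[1], z[2])
a = label(*y, *z)
print(a)  # B-12A North!
```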


This by itself doesn’t help with the problem of discontinuous floor numbering, but the unpacking extension also allowed the multiple * construct to be used outside of function calls. Thus,

python:
b = *y, *z


will assign to b a tuple consisting of the concatenated elements of y and z. And this works for other iterables, too. So for the floor problem, I can do

python:
floors = *range(2, 13), *range(14, 25)


to get a tuple of the floors without 13. If I want a list, it’s

python:
floors = [ *range(2, 13), *range(14, 25) ]


This is neither as compact nor as clear as the old Python 2 way, but it’s not too bad, and it avoids the cluster of parentheses I was using with list() and chain().
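
As an aside (my own examples, not from the post), the star works with any iterable on the right-hand side, not just lists and ranges:

```python
# Strings, tuples, and ranges are all iterables,
# so they can be star-unpacked into a tuple or a list.
t = *"ab", *range(2)
print(t)  # ('a', 'b', 0, 1)

mixed = [*"xy", *(10, 20)]
print(mixed)  # ['x', 'y', 10, 20]
```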

I’ve made a promise to myself to read the release notes when I switch to Python 3.8.

Update Sep 22, 2019 2:15 PM
Joe Lion suggested this:

python:
floors = [ f for f in range(2, 25) if f != 13 ]


What I like about this is how explicit it is that we are excluding 13 from the list. What I’m less enthused about is the f for f in part, which is a lot of typing for essentially a no-op.

So it got me thinking about other ways to exclude the 13. To my surprise, I’m beginning to favor this:

python:
floors = list(range(2, 25))
floors.remove(13)


I still don’t like the nesting of range within list, and I have the common tendency to dislike using two lines when it can be done in one, but the intent of this—just like with Joe’s comprehension—is very clear: I want a list that goes from 2 through 24 but without 13. I’ll have to give it some thought.
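
A quick sanity check (mine, not from the original post) that the unpacking, comprehension, and remove approaches all produce the same list:

```python
# Three ways to build the floor list without 13.
a = [*range(2, 13), *range(14, 25)]          # star unpacking
b = [f for f in range(2, 25) if f != 13]     # comprehension
c = list(range(2, 25))                       # build, then remove
c.remove(13)

assert a == b == c
print(a[0], a[-1], 13 in a)  # 2 24 False
```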

[If the formatting looks odd in your feed reader, visit the original article]

# Brace yourself, I’m in an expansive mood

Permalink - Posted on 2019-09-07 18:08

A longstanding truth of this blog is that whenever I write a post about shell t̸r̸i̸c̸k̸s̸ features, I get a note from Aristotle Pagaltzis letting me know of a shorter, faster, or better way to do it. Normally, I add a short update to the post with Aristotle’s improvements, or I explain why his faster way wouldn’t be faster for me (because some things just won’t stick in my head). But his response to my jot and seq post got me exploring an area of the shell that I’ve seen but never used before, and I thought it deserved a post of its own. I even learned something useful without his help.

Here’s Aristotle’s tweet:

@drdrang Bash/zsh brace expansion can replace 99% of jot/seq uses (though bash < 4.x doesn’t support padding or step size 🙁). Your last example becomes much simpler:

printf '%s\n' 'Apt. '{2..5}{A..D}

You even get your preferred argument order:

printf '%s\n' {10..40..5}

Brace expansion in bash and zsh doesn’t seem like a very important feature because it takes up so little space in either manual. The brief exposure I’ve had to it has been in articles that talked about using it to run an operation on several files at once. For example, if I have a script called file.py that generates text, CSV, PDF, and PNG output files, all named file but with different extensions, I might want to delete all the output files while leaving the script intact. I can’t do

rm file.*


because that would delete the script file. What works is

rm file.{txt,csv,pdf,png}


The shell expands this into

rm file.txt file.csv file.pdf file.png


and then runs the command.

This is cute, but I never thought it worth committing to memory because tab completion and command-line editing through the Readline library make it very easy to generate the file names interactively.

What I didn’t realize until Aristotle’s tweet sent me to the manuals was that the expansion could also be specified as a numeric or alphabetic sequence using the two-dot syntax. Thus,

mkdir folder{A..T}


creates 20 folders in one short step, which is the sort of thing that can be really useful.

And you can use two sets of braces to do what is effectively a nested loop. With apologies to Aristotle, here’s how I would do the apartment number generation from my earlier post:

printf "Apt. %s\n" {2..5}{A..D}


This gives output of

Apt. 2A
Apt. 2B
Apt. 2C
Apt. 2D
Apt. 3A
Apt. 3B
Apt. 3C
Apt. 3D
Apt. 4A
Apt. 4B
Apt. 4C
Apt. 4D
Apt. 5A
Apt. 5B
Apt. 5C
Apt. 5D


just like my more complicated jot/seq command.
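
For comparison (my own translation, not in the original post), the same nested-loop expansion can be written in Python with itertools.product:

```python
from itertools import product

# Mirrors the shell's {2..5}{A..D} brace expansion:
# the cartesian product of floors 2-5 and units A-D.
apts = ["Apt. %d%s" % (n, c) for n, c in product(range(2, 6), "ABCD")]
print(apts[0], apts[-1], len(apts))  # Apt. 2A Apt. 5D 16
```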

The main limitation of brace expansion compared to jot and seq is that you can’t generate sequences with fractional steps. If you want numbers from 0 through 100 with a step size of 0.5,

seq 0 .5 100


is the way to go.

And if you’re using the stock version of bash that comes with macOS (bash version 3.2.57), you’ll run into other limitations.

First, you won’t be able to left-pad the generated numbers with zeros. In zsh and more recent versions of bash, you can say

echo {005..12}


and get1

005 006 007 008 009 010 011 012


where the prefixed zeros (which can be put in front of either number or both) tell the expansion to zero-pad the results to the same length. If you run that same command in the stock bash, you just get

5 6 7 8 9 10 11 12


Similarly, the old bash that comes with macOS doesn’t understand the three-parameter type of brace sequence expansion (mentioned by Aristotle), in which the third parameter is the (integer) step size:

echo {5..100..5}


which gives

5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100


in zsh and newer versions of bash. Old bash doesn’t understand the three-parameter form at all and just outputs the input string:

{5..100..5}


We’ve been told that Catalina will ship with zsh as the default shell, which means we shouldn’t have to worry about these deficiencies for long. Because I don’t want to learn a new system of configuration files, I’m sticking with bash, but I’ve switched to version 5.0.11, installed by Homebrew. My default shell is now /usr/local/bin/bash.2

One more thing. I said last time that seq needs a weirdly long formatting code to get zero-padded numbers embedded in another string. The example was

seq -f "file%02.0f.txt" 5


to get

file01.txt
file02.txt
file03.txt
file04.txt
file05.txt


What I didn’t understand was how the %g specifier works. Based on my skimming of the printf man page, I thought it just chose the shorter output of the equivalent %f and %e specifiers. But it turns out to do further shortening, eliminating all trailing zeros and the decimal point if there’s no fractional part to the number. Therefore, we can use the simpler

seq -f "file%02g.txt" 5


to get the output above. Because printf-style formatting is used in lots of places, this is good to know outside the context of seq.
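
As one example of that wider applicability (my own, not from the post), Python’s % string formatting uses the same printf-style codes, so the %02g trick carries over directly:

```python
# %g drops trailing zeros and the decimal point for whole numbers,
# and the 02 zero-pads the result to a width of 2.
names = ["file%02g.txt" % n for n in range(1, 6)]
print(names)  # ['file01.txt', 'file02.txt', 'file03.txt', 'file04.txt', 'file05.txt']
```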

Of course, now that I understand brace expansion, I wouldn’t use seq at all. I’d go with something like

echo file{01..5}.txt


1. I’m using echo here to save vertical space in the output.

2. Fair warning: I will ignore or block all attempts to get me to change to zsh. I’m glad you like it, but I’m not interested.
