Colemak: 0 to 40 WPM in 40 Hours

On April 1st my first child was born and I started a wonderful month of paternity leave. Holding a sleeping infant leaves you with lots of sleepy hours where it’s (sometimes) possible to do repetitive tasks, so I decided to join the 10% of my Automattic colleagues who use either Dvorak or Colemak. My love of natural language processing led me to build word lists based on English word frequency and on the word/character frequency of my code and command line history. I chose Colemak over Dvorak because only 17 keys change location, and most of those move only slightly. A lot of the key combinations ingrained from 15 years of using emacs are still pretty much the same, and standard commands like Cmd-Q, Cmd-W, Cmd-Z, Cmd-X, Cmd-C, and Cmd-V are all in the same places.

Why Would You Do This?

Well, needless to say, a layout designed in 1878 is probably not optimized for computers. Colemak, by contrast, was designed to place the most frequent letters right under your fingers, and the fluidity is unnerving. That said, there is only sparse evidence that you can type any faster with Colemak if you are already a great QWERTY touch typist; if you want to read more, this StackOverflow thread is interesting. I also know and work with a lot of folks who don’t regret moving to either Colemak or Dvorak.

For myself, I was not a great touch typist. I knew the theory, but practicing typing was never something I did. Before I started Colemak I had a QWERTY typing speed of about 60 words per minute when copying text on TypeRacer. That’s about average. I don’t like being average. And I’ve never practiced typing code for speed. My most common three-character sequence when coding is not ‘the’, it is ‘( $’… sigh, PHP. I bet I can be faster with some practice.

So, if I’m going to try to get faster, why not go all out? I make my living by tapping keys in a precise order. Why not learn a modern layout that has been well designed? I’ve also occasionally had pain in my hands, and my knuckles sometimes like to crack in ominous ways. Altogether, now seemed like a good time to give it a try.

And the most important reason: Never stop learning.

Learning Strategy

My strategy evolved over time, but this is where I ended up and what I would recommend.

  • This article made me think about typing as analogous to learning a musical instrument. Research has shown that learning music requires “accurate, consistent repetition, while maintaining perfect technique”. In short, strive for accuracy and focus on improving the parts you do poorly.
  • Your brain needs time to process and learn. I made a habit of practicing Colemak for at least one minute every day. Some days I practiced for an hour, rarely longer.
  • Start out by learning the keyboard layout. I used The Typing Cat for about two hours over the course of a week.
  • Get a program that can take arbitrary lists of words and track and analyze where you are slow. I used Amphetype. It’s not a great UI, but it worked well enough. When practicing word lists, repeat the same three words three times in a row before moving on to the next three (the, of, and, the, of, and, the, of, and, to, in, a, …). That just felt like a good mix of repetition and variety to me. Your mileage may vary.
  • Then focus on practicing frequent English key sequences (or those of whatever language you prefer).
  • The top 5 bi-grams (the two-letter sequences ‘th’, ‘he’, ‘in’, ‘er’, and ‘an’) comprise 10% of all bi-grams. You should be extraordinarily fast and accurate at the top 30 bi-grams.
  • Similarly get fast at 3-grams, 4-grams, and 5-grams. I built my lists from Peter Norvig’s analysis of the Google N-Gram Corpus.
  • Learn the most frequent words. Also from the N-Gram Corpus, the top 50 English words are about 40% of all words. Get fast at those, and you are well on your way.
  • When you are typing the above lists at 30+ WPM start practicing the top 500 words.
  • Along the way, focus on your mistakes. With Amphetype you can analyze the words and tri-grams that you make the most mistakes with. Build new lists based on these, slow down, and practice them till you are doing them perfectly. Speed will come. Focus on not needing to make corrections.
  • Rinse and repeat. Take breaks.
  • Go cold turkey and switch over completely. This was a lot easier because I was on leave from work. It wasn’t until I had a month of practice behind me that I completely switched. My QWERTY speed is now about as slow as my Colemak speed because my brain is confused.
  • I’ve also moved beyond plain English words and am now practicing the 200 most common terms in my code, the 40 most common UNIX command terms, and the most common 3-, 4-, and 5-grams in my code.

All of my word lists are available in this GitHub project, along with instructions for building your own. Writing this post was my trigger for cleaning up my lists so I can be more efficient at getting from 40 WPM to 80 WPM.
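
If you just want the flavor of how those lists get built, here is a minimal sketch using the same kind of UNIX pipelines described later in this post. The file locations and extensions (~/.bash_history, src/*.php) are assumptions; the scripts in the GitHub project are more careful.

# 40 most frequent terms in my shell history (history file location is an assumption)
cat ~/.bash_history | tr -s ' ' '\n' | sort | uniq -c | sort -rn | head -40

# 200 most frequent terms in a directory of code (path and extension are assumptions)
cat src/*.php | tr -cs '[:alnum:]_' '\n' | sort | uniq -c | sort -rn | head -200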

Analysis of Time Spent

I use RescueTime to track all of my time on my computer. In April I spent a total of 46 hours on my computer. Looking only at the time where I was actually typing (rather than editing adorable photos of my daughter), that works out to roughly 21 hours of Colemak.

In the first six days of May, as I slowly ramped back up at work, I spent 26 hours on my computer with the keyboard layout entirely set to Colemak. About an hour of that time was spent practicing in Amphetype (still doing at least a minute of practice per day). Total time spent with Colemak has been about 47 hours, but I’m pretty sure I am undercounting how often I switched back to QWERTY for writing email in April. On May 6th I reached 41 WPM on TypeRacer for the first time.

Forty WPM is not very impressive, but my typing is noticeably more fluid and continues to improve steadily. At this point it is good enough that I can return to work and be productive (if a little terse).

UNIX, Bi-Grams, Tri-Grams, and Topic Modeling

I’ve built up a list of UNIX commands over the years for doing basic text analysis on written language. The list comes from a number of sources (Jim Martin’s NLP class, StackOverflow, web searches), but I haven’t seen it collected in one place. With these commands I can analyze everything from log files to user poll responses.

Mostly this just comes down to how cool UNIX commands are (which you probably already know), but the magic is in how you mix them together. Hopefully you find these recipes useful. I’m always looking for more, so please drop into the comments and tell me what I’m missing.

For all of these examples I assume that you are analyzing a series of user responses with one response per line in a single file: data.txt. With a few cut and paste commands I often apply the same methods to CSV files and log files.
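
For example, if the free-text answers live in the third column of a CSV export, the cut command will get them into that shape; the file name and column number here are just placeholders, and this naive split will mangle quoted fields that contain commas.

cut -d ',' -f 3 survey.csv > data.txt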

Generating a Random Sample

Sometimes you are confronted with a set of results far larger than you want to analyze. If you want to pull out just a few lines from a file while eliminating any biased ordering, it’s very helpful to create a random sample.

awk 'BEGIN {srand()} {printf "%05.0f %s \n",rand()*99999, $0; }' data.txt | sort -n | head -100 | sed 's/^[0-9]* //'

This just adds a random number to the beginning of each line, sorts the list, takes the top 100 lines, and removes the random number. A quick and easy way to get a random sample. I also use this when testing new commands where I want to just try the command on 10 lines to verify I got the command right. I’m a big believer that randomized testing will find corner cases faster than you can think of them.
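
If GNU coreutils is available (it is not on a stock Mac), shuf does the same job in one step; the awk pipeline has the advantage of working anywhere awk and sort do.

shuf -n 100 data.txt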

Most Frequent Response

cat data.txt | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -rn

This is pretty straightforward: take each line (lowercased) and sort alphabetically, then use the awesome uniq -c command to count the number of identical responses, and finally sort by most frequent response.

This is why sort | uniq -c | sort -rn is easily my favorite UNIX command.

Most Frequent Words (Uni-Grams)

Along with the most frequent responses you often want to look at the most frequent words. This is just a natural extension of the previous command, but we want to remove stop words (“the”, “of”, “and”, etc.) since they provide no useful information.

cat data.txt | tr '[:upper:]' '[:lower:]' | tr -d '[:punct:]' | tr ' ' '\n' | grep -v -w -f stopwords_en.txt | sort | uniq -c | sort -rn

Pretty much the same command, except we are replacing spaces with newlines to break the document into words rather than lines. It would be a good improvement to do better tokenization than just splitting on whitespace, but for most purposes this works well.
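
One simple improvement along those lines is to let tr do the tokenizing: with -c (complement) and -s (squeeze) it collapses every run of non-alphanumeric characters into a single newline, which handles punctuation, tabs, and repeated spaces in one step (it does still split contractions like “don’t” in two). A sketch of the same pipeline with that change:

cat data.txt | tr '[:upper:]' '[:lower:]' | tr -cs '[:alnum:]' '\n' | grep -v -w -f stopwords_en.txt | sort | uniq -c | sort -rn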

The stopwords_en.txt file is just one stop word per line. I usually pull my list of stop words from Ranks.nl which also has stopwords in many other languages besides English.

Most Frequent Bi-Grams

Most Frequent words are great, but they throw away a lot of context (and hence meaning). By examining pairs of words (bi-grams) you can retain a lot more of the context.

cat data.txt | tr '[:upper:]' '[:lower:]' | tr -d '[:punct:]' | sed 's/,//' | sed G | tr ' ' '\n' > tmp.txt
tail -n+2 tmp.txt > tmp2.txt
paste -d ',' tmp.txt tmp2.txt | grep -v -e "^," | grep -v -e ",$" | sort | uniq -c | sort -rn

Here we take our list of words and concatenate sequential words together separated by a comma. To mark the end/beginning of a response we use sed G to add an extra blank line between responses before we split them into words. Then we filter out the pairs that span a response boundary, which show up with an empty first or second word (grep -v -e "^," | grep -v -e ",$"), so that we are left with only genuine bi-grams.

I’m not doing any removal of stop words in this case. To do that you would want to remove all bi-grams where both words are stop words, which would probably mean creating an exhaustive list of those pairs. Not too hard to do; I just haven’t found it necessary yet.
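
If you did want to filter them, here is a minimal sketch of one way to do it, assuming the same stopwords_en.txt file as above: generate every stop-word pair with awk, then add one more grep to the pipeline to drop those pairs before counting.

# build the exhaustive list of stop-word-only bi-grams, one "word,word" pair per line
awk 'NR==FNR {a[NR]=$0; next} {for (i in a) print a[i] "," $0}' stopwords_en.txt stopwords_en.txt > stop_bigrams.txt

# same pipeline as before, with grep -v -x -F -f dropping exact matches against that list
paste -d ',' tmp.txt tmp2.txt | grep -v -e "^," | grep -v -e ",$" | grep -v -x -F -f stop_bigrams.txt | sort | uniq -c | sort -rn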

Tri-Grams

Why stop at bi-grams?

tail -n+2 tmp2.txt > tmp3.txt
paste -d ',' tmp.txt tmp2.txt tmp3.txt | grep -v -e "^," | grep -v -e ",$" | grep -v -e ",," | sort | uniq -c | sort -rn

All you need to do is create a third file to concatenate together; everything else is pretty much the same. Of course we could continue to expand this to 4-grams, 5-grams, etc., but if your documents are short then the results won’t differ very much from your most frequent responses.
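
For completeness, the 4-gram version follows exactly the same pattern, with one more shifted file and one more column in the paste:

tail -n+2 tmp3.txt > tmp4.txt
paste -d ',' tmp.txt tmp2.txt tmp3.txt tmp4.txt | grep -v -e "^," | grep -v -e ",$" | grep -v -e ",," | sort | uniq -c | sort -rn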

Topic Modeling

This is not a UNIX command, but is such a great, easy way to get better information about the ideas in a set of responses that I have to include it.

I’m not going to explain the math behind topic modeling, but essentially it groups words that co-occur in documents to create a list of topics across the entire document set. Each “topic” is a weighted list of words associated with it, and each topic has a weight that indicates how frequent that topic is across all documents. By looking at this weighted list of words you can easily pick out the most common themes across your responses.

The easiest way I’ve found to run topic modeling is to download and install Mallet. You can follow Mallet’s main topic modeling instructions, but I’ve reduced them down to a couple of command lines that almost always work for me:

#Import data that has one "document" per line:
bin/mallet import-file --input data.txt --output data.mallet --keep-sequence --remove-stopwords

#Import data that has one "document" per file:
bin/mallet import-dir --input data/* --output data.mallet --keep-sequence --remove-stopwords

bin/mallet train-topics \
    --input data.mallet \
    --alpha 50.0 \
    --beta 0.01 \
    --num-topics 100 \
    --num-iterations 1000 \
    --optimize-interval 10 \
    --output-topic-keys data.topic-keys.out \
    --topic-word-weights-file data.topic-word-weights.out

#sort by most frequent topic, and remove the topic number
cat data.topic-keys.out | cut -f 2-20 | sort -rn > data.sorted-topics

Depending on the size of your dataset, you will almost certainly need to play with the number of topics you generate. 50 or 100 is often fine, but if you were generating topics across something as diverse as Wikipedia you would clearly need many more. If you don’t have enough topics, it is very easy for them to look like meaningless groupings of words. I usually look at the results with 50, 100, and 300 topics to get a feel for the data.

Once you decide how many topics make sense for your dataset, this technique is a powerful way to extract and rank the meaning in a large set of responses.