Colemak: 0 to 40 WPM in 40 Hours

On April 1st my first child was born and I started a wonderful month of paternity leave. Holding a sleeping infant leaves you with lots of sleepy hours where it’s (sometimes) possible to do repetitive tasks, so I decided to follow the 10% of my Automattic colleagues who use either Dvorak or Colemak. My love of natural language processing led me to build word lists based on English word frequency and on the word/character frequency of my code and command line history.

I chose Colemak over Dvorak because only 17 keys change location and most of those only move slightly. A lot of the key combinations that are ingrained from 15 years of using emacs are still pretty much the same. Standard commands like Cmd-Q, Cmd-W, Cmd-Z, Cmd-X, Cmd-C, and Cmd-V are all in the same places.

Why Would You Do This?

Well, needless to say, a layout designed in 1878 is probably not optimized for computers. Colemak was actually designed to place the most frequent letters right at your fingers. The fluidity is unnerving. There is very sparse evidence that you can type any faster with Colemak if you are already a great QWERTY touch typist. If you want to read more, this StackOverflow thread is interesting. I also know and work with a lot of folks who don’t regret moving to either Colemak or Dvorak.

For myself, I was not a great touch typist. I knew the theory. But practicing typing was never something I did. Before I started Colemak I had a QWERTY typing speed of about 60 words per minute when copying text using TypeRacer. That’s about average. I don’t like being average. And I’ve never practiced typing code for speed. My most common three character sequence when coding is not ‘the’, it is ‘( $’… sigh PHP. I bet I can be faster with some practice.

So, if I’m going to try and get faster why not go all out? I make my living by tapping keys in a precise order. Why not learn a modern layout that has been well designed? I’ve also occasionally had pain in my hands, and my knuckles like to crack in ominous ways sometimes. Altogether, now seemed like a good time to give it a try.

And the most important reason: Never stop learning.

Learning Strategy

My strategy evolved over time, but this is where I ended up and what I would recommend.

  • This article made me think about typing as analogous to learning a musical instrument. Research has shown that learning music requires: “accurate, consistent repetition, while maintaining perfect technique”. In short, strive for accuracy and focus on the parts that you are not doing well at to improve.
  • Your brain needs time to process and learn. I had a habit of practicing Colemak for at least one minute each day. Some days I practiced for an hour, rarely longer.
  • Start out by learning the keyboard layout. I used The Typing Cat for about two hours over the course of a week.
  • Get a software program that can take arbitrary lists of words, and track and analyze where you are slow. I used Amphetype. It’s not a great UI, but it worked well enough. When practicing word lists, practice the same three words in a row, repeated three times, before moving on to the next three (the, of, and, the, of, and, the, of, and, to, in, a, …). This just felt like a good mix of repetition and mixing words to me. Your mileage may vary.
  • Then focus on practicing frequent English key sequences (or those of whatever language you prefer).
  • The top 5 bigrams (the two-letter sequences ‘th’, ‘he’, ‘in’, ‘er’, and ‘an’) comprise 10% of all bigrams. You should be extraordinarily fast and accurate at the top 30 bigrams.
  • Similarly get fast at 3-grams, 4-grams, and 5-grams. I built my lists from Peter Norvig’s analysis of the Google N-Gram Corpus.
  • Learn the most frequent words. Also from the N-Gram Corpus, the top 50 English words are about 40% of all words. Get fast at those, and you are well on your way.
  • When you are typing the above lists at 30+ WPM start practicing the top 500 words.
  • Along the way, focus on your mistakes. With Amphetype you can analyze the words and tri-grams that you make the most mistakes with. Build new lists based on these, slow down, and practice them till you are doing them perfectly. Speed will come. Focus on not needing to make corrections.
  • Rinse and repeat. Take breaks.
  • Go cold turkey and switch over completely. This was a lot easier because I was on leave from work. It wasn’t really until after a month of practice that I switched completely. My QWERTY speed is now about as slow as my Colemak speed because my brain is confused.
  • I’ve also moved beyond simple English words and am practicing the 200 most common terms in my code, the 40 most common Unix command terms, and the most common 3-, 4-, and 5-grams in my code.
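
The drill order described above (three words at a time, each group repeated three times) is easy to generate from any word list. This is only a minimal sketch in Python: the function name practice_sequence and its parameters are made up for illustration and are not part of Amphetype or of my lists.

```python
def practice_sequence(words, group_size=3, repeats=3):
    """Return words in drill order.

    Each group of `group_size` words is repeated `repeats` times
    before moving on to the next group. Both defaults are assumptions
    taken from the "three words, three times" habit described above.
    """
    sequence = []
    for start in range(0, len(words), group_size):
        group = words[start:start + group_size]
        sequence.extend(group * repeats)
    return sequence


# Hypothetical input: the first six words of a frequency-sorted list.
top_words = ["the", "of", "and", "to", "in", "a"]
drill = practice_sequence(top_words)
print(", ".join(drill))
```

The printed text can be pasted into any typing program that accepts arbitrary text. Changing group_size or repeats is a cheap way to experiment if three-by-three doesn’t feel right to you.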

All of my word lists are available in this GitHub project. There are also instructions for building your own lists. Writing this post was my trigger for cleaning up my lists so I can be more efficient at getting from 40 WPM to 80 WPM.
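
If you want a feel for how such lists can be built, here is a rough sketch in Python that counts letter n-grams inside words and reports how much of the text the top few cover. It is not the code from my project; the names char_ngrams and coverage and the sample sentence are invented for illustration. It also only looks at letters, so for source code (where sequences like ‘( $’ matter) you would want to count raw characters instead.

```python
import re
from collections import Counter


def char_ngrams(text, n):
    """Count n-letter sequences that occur inside words.

    Assumption: only the letters a-z matter, so punctuation,
    digits, and spaces split the text into separate words.
    """
    counts = Counter()
    for word in re.findall(r"[a-z]+", text.lower()):
        for i in range(len(word) - n + 1):
            counts[word[i:i + n]] += 1
    return counts


def coverage(counts, top_k):
    """Fraction of all counted n-grams covered by the top_k most common."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    top = sum(count for _, count in counts.most_common(top_k))
    return top / total


# Hypothetical sample text; a real list needs a much larger corpus.
sample = "the cat then"
bigrams = char_ngrams(sample, 2)
print(bigrams.most_common(5))
print(coverage(bigrams, 2))
```

Run against a large corpus, the coverage function is how you would check claims like the top 5 bigrams making up 10% of all bigrams. On a tiny sample like the one here the numbers are meaningless.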

Analysis of Time Spent

I use RescueTime to track all of my time on my computer. In April I spent a total of 46 hours on my computer. Looking at only the time I spent where I was typing (rather than editing adorable photos of my daughter):

In the first six days of May, as I slowly ramped back up at work, I spent 26 hours on my computer with the keyboard layout entirely set to Colemak. About an hour of that time was spent practicing in Amphetype (still doing at least a minute of practice per day). Total time spent with Colemak has been about 47 hours, but I’m pretty sure I am undercounting how often I switched back to QWERTY for writing email in April. On May 6th I reached 41 WPM on TypeRacer for the first time.

Forty WPM is not very impressive, but it is noticeably more fluid and continuing to improve steadily. At this point it is good enough that I can return to work and be productive (if a little terse).

Elasticsearch, Open Source, and the Future

This essay started as a response to a comment on my multilingual indexing post. The comment is mostly an advertisement, but brings up some interesting points so I decided to publish it and turn my response into a full post.

For some context here’s the key part of the comment:

I thought readers might be interested in Rosette Search Essentials for Elasticsearch, from Basis Technologies, which we launched last night at hack/reduce in Cambridge, MA. It’s a plugin that does three neat things that improve multilingual search quality:

– Intelligent CJKT tokenization/segmentation
– Lemmatization: performs morphological analysis to find the “lemma” or dictionary form of the words in your documents, which is far superior to stemming.
– Decompounding: languages like German contain compound words that don’t always make great index terms. We break these up into their constituents so you can index them too.

There are many areas where Elasticsearch could benefit from better handling of multi-lingual text. And the NLP geek in me would love to see ES get some more modern Natural Language Processing techniques – such as these – applied within it.

Unfortunately I’m a lot less excited about this system because it is closed source, which limits its impact on the overall ES ecosystem. I love that Basis Technologies has whitepapers and seems to be doing a great job of evaluating its system, but even the whitepapers require registering with them. This just seems silly.

Search engine technology has been around for a while. A big part of Elasticsearch’s success is due to it being open source, and particularly due to Lucene being an open source, collaborative effort over the past 15 years. I think ES has the potential to become a phenomenal NLP platform over the next five years, bringing many amazing NLP technologies coming out of academia to a massive number of developers. NLP researchers have done tremendous science in an open and collaborative manner. We should work to scale that technology in open and collaborative ways as well.

Building a platform on closed source solutions is not sustainable.

Humans express their dreams, opinions, and ideas in hundreds of languages. Bridging that gap between humans and computers – and ultimately between humans – is a noble endeavor that will subtly shape the next century. I’d like to see Elasticsearch be a force in democratizing the use of natural language processing and machine learning. These methods will impact how we understand the world, how we communicate with each other, and ultimately our democracy. We should not build that future on licensing that explicitly prevents citizens of some countries from participating.

I recognize that working for a successful open source company makes me luckily immune to certain pressures of business, investors, and government contracts. But I have seen the unfortunate cycle of cool NLP technology getting trapped within a closed source company and eventually being completely shut down. I’ve recently been reading “The Theory That Would Not Die”. Where would we be now if the efficacy of Bayesian probability had not been locked inside classified government organizations for 40 years after World War II?

Basis Technologies has been around a long time, and has far more NLP talent and experience than I do, but the popularity of my very brief multi-lingual post tells me that there is also a huge opportunity for the community to improve the multi-lingual capabilities of Elasticsearch. A company leading the way could build a strong business providing all the support that inevitably will be needed when dealing with multiple languages. I’d be happy to talk with anyone about how’s rather large set of multi-lingual data could help in such an endeavor. I bet other organizations would be interested also.

New Blog

I’m building this blog as a place to talk about technology that I find interesting. I wanted to build something that was separate from my personal life so I can have an excuse to ignore one or the other depending on what is happening in real life.