Colemak: 0 to 40 WPM in 40 Hours

On April 1st my first child was born and I started a wonderful month of paternity leave. Holding a sleeping infant leaves you with lots of sleepy hours where it’s (sometimes) possible to do repetitive tasks, so I decided to follow the 10% of my Automattic colleagues who are using either Dvorak or Colemak. My love of natural language processing led me to build word lists based on English word frequency and the word/character frequency of my code and command line history. I chose Colemak over Dvorak because only 17 keys change location, and most of those move only slightly. A lot of the key combinations ingrained from 15 years of using emacs are still pretty much the same, and standard commands like Cmd-Q, Cmd-W, Cmd-Z, Cmd-X, Cmd-C, and Cmd-V are all in the same places.

Why Would You Do This?

Well, needless to say, a layout designed in 1878 is probably not optimized for computers. Colemak, by contrast, was designed to place the most frequent letters right under your fingers. The fluidity is unnerving. There is only sparse evidence that you can type any faster with Colemak if you are already a great QWERTY touch typist; if you want to read more, this StackOverflow thread is interesting. I also know and work with a lot of folks who don’t regret moving to either Colemak or Dvorak.

For myself, I was not a great touch typist. I knew the theory, but practicing typing was never something I did. Before I started Colemak I had a QWERTY typing speed of about 60 words per minute when copying text on TypeRacer. That’s about average. I don’t like being average. And I’ve never practiced typing code for speed. My most common three-character sequence when coding is not ‘the’, it is ‘( $’… sigh, PHP. I bet I can be faster with some practice.

So, if I’m going to try to get faster, why not go all out? I make my living by tapping keys in a precise order; why not learn a modern, well-designed layout? I’ve also occasionally had pain in my hands, and my knuckles sometimes like to crack in ominous ways. Altogether, now seemed like a good time to give it a try.

And the most important reason: Never stop learning.

Learning Strategy

My strategy evolved over time, but this is where I ended up and what I would recommend.

  • This article made me think about typing as analogous to learning a musical instrument. Research has shown that learning music requires “accurate, consistent repetition, while maintaining perfect technique”. In short, strive for accuracy, and focus your practice on the parts you do poorly.
  • Your brain needs time to process and learn. I had a habit of practicing Colemak for at least one minute each day. Some days I practiced for an hour, rarely longer.
  • Start out by learning the keyboard layout. I used The Typing Cat for about two hours over the course of a week.
  • Get a software program that can take arbitrary lists of words, and track and analyze where you are slow. I used Amphetype. It’s not a great UI, but it worked well enough. When practicing word lists, practice the same three words repeated three times in a row before moving on to the next three (the, of, and, the, of, and, the, of, and, to, in, a, …). This just felt like a good mix of repetition and word variety to me. Your mileage may vary.
  • Then focus on practicing frequent English key sequences (or whatever your preferred language).
  • The top 5 bi-grams (the two-letter sequences ‘th’, ‘he’, ‘in’, ‘er’, and ‘an’) comprise 10% of all bi-grams. You should be extraordinarily fast and accurate at the top 30 bi-grams.
  • Similarly get fast at 3-grams, 4-grams, and 5-grams. I built my lists from Peter Norvig’s analysis of the Google N-Gram Corpus.
  • Learn the most frequent words. Also from the N-Gram Corpus, the top 50 English words are about 40% of all words. Get fast at those, and you are well on your way.
  • When you are typing the above lists at 30+ WPM start practicing the top 500 words.
  • Along the way, focus on your mistakes. With Amphetype you can analyze the words and tri-grams that you make the most mistakes with. Build new lists based on these, slow down, and practice them till you are doing them perfectly. Speed will come. Focus on not needing to make corrections.
  • Rinse and repeat. Take breaks.
  • Go cold turkey and switch over completely. This was a lot easier because I was on leave from work. It wasn’t really until a month of practice that I completely switched. My QWERTY speed is now about as slow as Colemak because my brain is confused.
  • I’ve also moved beyond simply English words and am practicing the 200 most common terms in my code, the 40 most common UNIX command terms, and the most common 3-, 4-, and 5-grams in my code.

All of my word lists are available in this GitHub project, along with instructions for building your own lists. Writing this post was my trigger for cleaning up my lists so I can be more efficient at getting from 40 WPM to 80 WPM.
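If you want a rough idea of how such a list gets built before digging into that project, here’s a minimal sketch that pulls the most common terms out of a shell history using standard UNIX tools; the path, the cutoff of 40 terms, and the output file name are my assumptions:

# A sketch: turn shell history into a frequency-sorted practice list.
cat ~/.bash_history \
  | tr '[:upper:]' '[:lower:]' \
  | tr -cs '[:alnum:]' '\n' \
  | sort | uniq -c | sort -rn \
  | head -40 \
  | awk '{print $2}' > unix_terms.txt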

Analysis of Time Spent

I use RescueTime to track all of my time on my computer. In April I spent a total of 46 hours on my computer, and looking only at the time where I was actually typing (rather than editing adorable photos of my daughter), roughly 21 of those hours were spent typing in Colemak.

In the first six days of May, as I slowly ramped back up at work, I spent 26 hours on my computer with the keyboard layout set entirely to Colemak. About an hour of that time was spent practicing in Amphetype (still doing at least a minute of practice per day). Total time spent with Colemak has been about 47 hours, but I’m pretty sure I am undercounting how often I switched back to QWERTY for writing email in April. On May 6th I reached 41 WPM on TypeRacer for the first time.

Forty WPM is not very impressive, but my typing is noticeably more fluid and continues to improve steadily. At this point it is good enough that I can return to work and be productive (if a little terse).

Three Principles for Multilingual Indexing in Elasticsearch

Recently I’ve been working on how to build Elasticsearch indices for WordPress blogs in a way that will work across multiple languages. Elasticsearch has a lot of built-in support for different languages, but there are a number of configuration options to wade through, and a few plugins that improve on the built-in support.

Below I’ll lay out the analyzers I am currently using. Some caveats before I start. I’ve done a lot of reading on multi-lingual search, but since I’m really only fluent in one language, there are lots of details about how fluent speakers of other languages use a search engine that I’m sure I don’t understand. This is almost certainly still a work in progress.

In total we have 30 analyzers configured and we’re using the elasticsearch-langdetect plugin to detect 53 languages. For WordPress blogs, users have sometimes set their language to the same language as their content, but very often they have left it as the default of English. So we rely heavily on the language detection plugin to determine which language analyzer to use.

Update: In the comments, Michael pointed out that since this post was written the langdetect plugin has added a custom mapping which the mapping example below does not use. I’d highly recommend checking it out for any new implementations.

For configuring the analyzers there are three main principles I’ve pulled from a number of different sources.

1) Use very light or minimal stemming to avoid losing semantic information.

Stemming removes the endings of words to make searches more general; however, it can lose a lot of meaning in the process. For instance, the (quite popular) Snowball Stemmer will do the following:

computation -> comput
computers -> comput
computing -> comput
computer -> comput
computes -> comput

international -> intern
internationals -> intern
intern -> intern
interns -> intern

A lot of information is lost in such a zealous transformation. There are some cases though where stemming is very helpful. In English, stemming off the plurals of words should rarely be a problem, since the plural still refers to the same concept. This article on SearchWorkings discusses the pitfalls of the Snowball Stemmer further, and leads to Jacques Savoy’s excellent paper on stemming and stop words as applied to French, Italian, German, and Spanish. Savoy found that minimal stemming of plurals and feminine/masculine word forms performed well for these languages. The minimal_* and light_* stemmers included in Elasticsearch implement these recommendations, allowing us to take a limited stemming approach.

So when there is a minimal stemmer available for a language we use it; otherwise we do no stemming at all.
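If you want to see exactly what an analyzer does to a piece of text, the _analyze API makes it easy to check. A sketch, assuming an index named my-index configured with the en_analyzer defined at the end of this post:

# "my-index" is a placeholder for an index using the settings below.
curl -XGET 'localhost:9200/my-index/_analyze?analyzer=en_analyzer&pretty' \
  -d 'computers computing international'
# minimal_english only strips plurals: "computers" -> "computer", while
# "computing" and "international" should pass through unstemmed.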

2) Use stop words for the languages where we have them.

This ensures that we reduce the size of the index and speed up searches by not trying to match on very frequent terms that provide very little information. Unfortunately, stop words will break certain searches. For instance, searching for “to be or not to be” will not get any results.

The cutoff_frequency parameter on the match query (new in 0.90) may provide a way to allow indexing stop words, but I am still unsure whether it has implications for other types of queries, or how I would decide what cutoff frequency to use given the wide range of documents and languages in a single index. The very high number of English documents compared to, say, Hebrew also means that Hebrew stop words may not be frequent enough to trigger the cutoff frequency correctly when searching across all documents.

For the moment I’m sticking with the stop words approach. Weaning myself off of them will require a bit more experimentation and thought, but I am intrigued by finding an approach that would allow us to avoid the limitations of stop words and enable finding every blog post referencing Shakespeare’s most famous quote.
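For reference, this is the shape of a match query that uses cutoff_frequency; the index, the field name, and the 0.01 threshold are placeholders rather than tested recommendations:

curl -XGET 'localhost:9200/my-index/_search?pretty' -d '{
  "query": {
    "match": {
      "content": {
        "query": "to be or not to be",
        "cutoff_frequency": 0.01
      }
    }
  }
}'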

3) Try to retain term consistency across all analyzers.

We use the ICU Tokenizer for all cases where the language won’t do significantly better with a custom tokenizer. Japanese, Chinese, and Korean all require smarter tokenization, but using the ICU Tokenizer ensures we treat other languages in a consistent manner. Individual terms are then filtered using the ICU Folding and Normalization filters to ensure consistent terms.

Folding converts a character to an equivalent standard form. The most common conversion that ICU Folding provides is lowercasing, as defined in this exhaustive definition of case folding. But folding goes far beyond lowercasing: in many languages there are multiple characters that essentially mean the same thing (particularly from a search perspective). UTR30-4 defines the full set of foldings that ICU Folding performs.

Where Folding converts a single character to a standard form, Normalization converts a sequence of characters to a standard form. A good example of this, straight from Wikipedia, is “the code point U+006E (the Latin lowercase “n”) followed by U+0303 (the combining tilde “◌̃”) is defined by Unicode to be canonically equivalent to the single code point U+00F1 (the lowercase letter “ñ” of the Spanish alphabet).” Another entertaining example of character normalization is that some Roman numerals (Ⅸ) can be expressed as a single Unicode character, but of course for search you’d rather have that converted to “IX”. The ICU Normalization sections have links to the many docs defining how normalization is handled.

By indexing using these ICU tools we can be fairly sure that searching across all documents, regardless of language, with just a default analyzer will give results for most queries.
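As a quick sanity check that the folding and normalization behave as described, you can run text through the index’s default analyzer via the _analyze API (a sketch; the index name and sample text are mine):

# With no analyzer parameter, _analyze uses the index's default analyzer.
curl -XGET 'localhost:9200/my-index/_analyze?pretty' -d 'Señor Ⅸ'
# Normalization expands the Roman numeral to "IX", and folding lowercases
# and strips the diacritic, so the terms should come back as "senor" and "ix".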

The Details (there are always exceptions to rules)

  • Asian languages that do not use whitespace for word separation present a non-trivial problem when indexing content. ES comes with the built-in CJK analyzer that indexes every pair of symbols as a term, but there are plugins that are much smarter about how to tokenize the text.
    • For Japanese (ja) we are using the Kuromoji plugin built on top of the seemingly excellent library by Atilika. I don’t know any Japanese, so really I am probably just impressed by their level of documentation, slick website, and the fact that they have an online tokenizer for testing tokenization.
    • There are a couple of different versions of written Chinese (zh), and the language detection plugin distinguishes between zh-tw and zh-cn. For analysis we use the ES Smart Chinese Analyzer for all versions of the language. This is done out of necessity rather than any analysis on my part. The ES plugin wraps the Lucene analyzer which performs sentence and then word segmentation using a Hidden Markov Model.
    • Unfortunately there is currently no custom Korean analyzer for Elasticsearch that I have come across. For that reason we are only using the CJK Analyzer which takes each bi-gram of symbols as a term. However, while writing this post I came across a Lucene mailing list thread from a few days ago which says that a Korean analyzer is in the process of being ported into Lucene. So I have no doubt that will eventually end up in ES or as an ES plugin.
  • Elasticsearch doesn’t have any built-in stop words for Hebrew (he), so we define a custom list pulled from an online list (Update: this site doesn’t exist anymore, our list of stopwords is located here). I had some co-workers cull the list to remove a few terms they deemed redundant. I’ll probably end up doing this for some other languages as well if we stick with the stop words approach.
  • Testing 30 analyzers was pretty non-trivial. The ES Inquisitor plugin’s Analyzers tab was incredibly useful for interactively testing text tokenization and stemming against all the different language analyzers to see how they functioned differently.

Finally we come to defining all of these analyzers. Hope this helps you in your multi-lingual endeavors.

Update [Feb 2014]: The PHP code we use for generating analyzers is now open sourced as a part of the wpes-lib project. See that code for the latest methods we are using.

Update [May 2014]: Based on the feedback in the comments and some issues we’ve come across running in production I’ve updated the mappings below. The changes we made are:

  • Perform ICU normalization before removing stopwords, and ICU folding after stopwords. Otherwise stopwords such as “même” in French will not be correctly removed.
  • Adjusted our Japanese language analysis based on a slightly adjusted use of GMO Media’s methodology. We were seeing a significantly lower click through rate on Japanese related posts than for other languages, and there was pretty good evidence that the morphological language analysis would help.
  • Added the Elision token filter to French: “l’avion” => “avion”.

Potential improvements I haven’t had a chance to test yet, because we’d need to run real performance tests to be sure they actually help:

  • Duplicate tokens to handle different spellings (e.g. “recognize” vs “recognise”)
  • Morphological analysis of en and ru
  • Spell checking or phonetic analysis
  • Include all stop words and rely on cutoff_frequency to avoid the performance problems this will introduce
  • Index bigrams with the shingle analyzer
  • Duplicate terms, stem them, then unique the terms to index both stemmed and non-stemmed forms

Thanks to everyone in the comments who have helped make our multi-lingual indexing better.

{
  "filter": {
    "ar_stop_filter": {
      "type": "stop",
      "stopwords": ["_arabic_"]
    },
    "bg_stop_filter": {
      "type": "stop",
      "stopwords": ["_bulgarian_"]
    },
    "ca_stop_filter": {
      "type": "stop",
      "stopwords": ["_catalan_"]
    },
    "cs_stop_filter": {
      "type": "stop",
      "stopwords": ["_czech_"]
    },
    "da_stop_filter": {
      "type": "stop",
      "stopwords": ["_danish_"]
    },
    "de_stop_filter": {
      "type": "stop",
      "stopwords": ["_german_"]
    },
    "de_stem_filter": {
      "type": "stemmer",
      "name": "minimal_german"
    },
    "el_stop_filter": {
      "type": "stop",
      "stopwords": ["_greek_"]
    },
    "en_stop_filter": {
      "type": "stop",
      "stopwords": ["_english_"]
    },
    "en_stem_filter": {
      "type": "stemmer",
      "name": "minimal_english"
    },
    "es_stop_filter": {
      "type": "stop",
      "stopwords": ["_spanish_"]
    },
    "es_stem_filter": {
      "type": "stemmer",
      "name": "light_spanish"
    },
    "eu_stop_filter": {
      "type": "stop",
      "stopwords": ["_basque_"]
    },
    "fa_stop_filter": {
      "type": "stop",
      "stopwords": ["_persian_"]
    },
    "fi_stop_filter": {
      "type": "stop",
      "stopwords": ["_finnish_"]
    },
    "fi_stem_filter": {
      "type": "stemmer",
      "name": "light_finish"
    },
    "fr_stop_filter": {
      "type": "stop",
      "stopwords": ["_french_"]
    },
    "fr_stem_filter": {
      "type": "stemmer",
      "name": "minimal_french"
    },
    "he_stop_filter": {
      "type": "stop",
      "stopwords": [/*excluded for brevity*/]
    },
    "hi_stop_filter": {
      "type": "stop",
      "stopwords": ["_hindi_"]
    },
    "hu_stop_filter": {
      "type": "stop",
      "stopwords": ["_hungarian_"]
    },
    "hu_stem_filter": {
      "type": "stemmer",
      "name": "light_hungarian"
    },
    "hy_stop_filter": {
      "type": "stop",
      "stopwords": ["_armenian_"]
    },
    "id_stop_filter": {
      "type": "stop",
      "stopwords": ["_indonesian_"]
    },
    "it_stop_filter": {
      "type": "stop",
      "stopwords": ["_italian_"]
    },
    "it_stem_filter": {
      "type": "stemmer",
      "name": "light_italian"
    },
    "ja_pos_filter": {
      "type": "kuromoji_part_of_speech",
      "stoptags": ["\\u52a9\\u8a5e-\\u683c\\u52a9\\u8a5e-\\u4e00\\u822c", "\\u52a9\\u8a5e-\\u7d42\\u52a9\\u8a5e"]
    },
    "nl_stop_filter": {
      "type": "stop",
      "stopwords": ["_dutch_"]
    },
    "no_stop_filter": {
      "type": "stop",
      "stopwords": ["_norwegian_"]
    },
    "pt_stop_filter": {
      "type": "stop",
      "stopwords": ["_portuguese_"]
    },
    "pt_stem_filter": {
      "type": "stemmer",
      "name": "minimal_portuguese"
    },
    "ro_stop_filter": {
      "type": "stop",
      "stopwords": ["_romanian_"]
    },
    "ru_stop_filter": {
      "type": "stop",
      "stopwords": ["_russian_"]
    },
    "ru_stem_filter": {
      "type": "stemmer",
      "name": "light_russian"
    },
    "sv_stop_filter": {
      "type": "stop",
      "stopwords": ["_swedish_"]
    },
    "sv_stem_filter": {
      "type": "stemmer",
      "name": "light_swedish"
    },
    "tr_stop_filter": {
      "type": "stop",
      "stopwords": ["_turkish_"]
    }
  },
  "analyzer": {
    "ar_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "ar_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "bg_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "bg_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "ca_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "ca_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "cs_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "cs_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "da_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "da_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "de_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "de_stop_filter", "de_stem_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "el_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "el_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "en_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "en_stop_filter", "en_stem_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "es_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "es_stop_filter", "es_stem_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "eu_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "eu_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "fa_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "fa_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "fi_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "fi_stop_filter", "fi_stem_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "fr_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "elision", "fr_stop_filter", "fr_stem_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "he_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "he_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "hi_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "hi_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "hu_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "hu_stop_filter", "hu_stem_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "hy_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "hy_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "id_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "id_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "it_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "it_stop_filter", "it_stem_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "ja_analyzer": {
      "type": "custom",
      "filter": ["kuromoji_baseform", "ja_pos_filter", "icu_normalizer", "icu_folding", "cjk_width"],
      "tokenizer": "kuromoji_tokenizer"
    },
    "ko_analyzer": {
      "type": "cjk",
      "filter": []
    },
    "nl_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "nl_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "no_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "no_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "pt_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "pt_stop_filter", "pt_stem_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "ro_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "ro_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "ru_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "ru_stop_filter", "ru_stem_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "sv_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "sv_stop_filter", "sv_stem_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "tr_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "tr_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "zh_analyzer": {
      "type": "custom",
      "filter": ["smartcn_word", "icu_normalizer", "icu_folding"],
      "tokenizer": "smartcn_sentence"
    },
    "lowercase_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "icu_folding"],
      "tokenizer": "keyword"
    },
    "default": {
      "type": "custom",
      "filter": ["icu_normalizer", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    }
  },
  "tokenizer": {
    "kuromoji": {
      "type": "kuromoji_tokenizer",
      "mode": "search"
    }
  }
}
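To put these definitions to work, the filter/analyzer blocks go under settings.analysis when creating an index (the ICU filters require the ICU plugin to be installed). A minimal sketch trimmed down to just the English analyzer, with a placeholder index name:

curl -XPUT 'localhost:9200/my-index' -d '{
  "settings": {
    "analysis": {
      "filter": {
        "en_stop_filter": { "type": "stop", "stopwords": ["_english_"] },
        "en_stem_filter": { "type": "stemmer", "name": "minimal_english" }
      },
      "analyzer": {
        "en_analyzer": {
          "type": "custom",
          "filter": ["icu_normalizer", "en_stop_filter", "en_stem_filter", "icu_folding"],
          "tokenizer": "icu_tokenizer"
        }
      }
    }
  }
}'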


UNIX, Bi-Grams, Tri-Grams, and Topic Modeling

I’ve built up a list of UNIX commands over the years for doing basic text analysis on written language. The list comes from a number of sources (Jim Martin‘s NLP class, StackOverflow, web searches), but I haven’t seen it collected in one place. With these commands I can analyze everything from log files to user poll responses.

Mostly this just comes down to how cool UNIX commands are (which you probably already know). But the magic is in how you mix them together. Hopefully you find these recipes useful. I’m always looking for more, so please drop into the comments to tell me what I’m missing.

For all of these examples I assume that you are analyzing a series of user responses with one response per line in a single file: data.txt. With a few cut and paste commands I often apply the same methods to CSV files and log files.

Generating a Random Sample

Sometimes you are confronted with a set of results far larger than you want to analyze. If you want to cull out a few lines from a file while eliminating any biased ordering in the file, it’s very helpful to create a random sample.

awk 'BEGIN {srand()} {printf "%05.0f %s \n",rand()*99999, $0; }' data.txt | sort -n | head -100 | sed 's/^[0-9]* //'

This just adds a random number to the beginning of each line, sorts the list, takes the top 100 lines, and removes the random number. A quick and easy way to get a random sample. I also use this when testing new commands where I want to just try the command on 10 lines to verify I got the command right. I’m a big believer that randomized testing will find corner cases faster than you can think of them.
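As an aside, if your system has GNU coreutils, shuf does the same sampling in a single step:

shuf -n 100 data.txt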

Most Frequent Response

cat data.txt | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -rn

This is pretty straightforward: take each line (lowercased) and sort alphabetically. Then use the awesome uniq -c command to count the number of identical responses. Finally, sort by most frequent response.

This is why sort | uniq -c | sort -rn is easily my favorite UNIX command.

Most Frequent Words (Uni-Grams)

Along with most frequent response you often want to look at most frequent words. This is just a natural extension of our previous command, but we want to remove stop words (“the”, “of”, “and”, etc) since they provide no useful information.

cat data.txt | tr '[:upper:]' '[:lower:]' | tr -d '[:punct:]' | tr ' ' '\n' | grep -v -w -f stopwords_en.txt | sort | uniq -c | sort -rn

Pretty much the same command, except we replace spaces with newlines to break the document into words rather than lines. It would be a good improvement to do better tokenization than just splitting on whitespace, but for most purposes this works well.

The stopwords_en.txt file is just one stop word per line. I usually pull my list of stop words from Ranks.nl which also has stopwords in many other languages besides English.

Most Frequent Bi-Grams

Most frequent words are great, but they throw away a lot of context (and hence meaning). By examining pairs of words (bi-grams) you can retain a lot more of the context.

cat data.txt | tr '[:upper:]' '[:lower:]' | tr -d '[:punct:]' | sed 's/,//' | sed G | tr ' ' '\n' > tmp.txt
tail -n+2 tmp.txt > tmp2.txt
paste -d ',' tmp.txt tmp2.txt | grep -v -e "^," | grep -v -e ",$" | sort | uniq -c | sort -rn

Here we take our list of words and concatenate sequential words together separated by a comma. To indicate the end/beginning of a response we use sed G to add an extra line between each response before we split the responses into words. Then we filter out those beginning and ending words (grep -v -e "^," | grep -v -e ",$") so that we are left with only the bi-grams.

I’m not doing any removal of stop words in this case. To do that you would want to remove all bi-grams where both words were stop words which would probably mean creating an exhaustive list of them. Not too hard to do, just haven’t found it necessary yet.

Tri-Grams

Why stop at bi-grams?

tail -n+2 tmp2.txt > tmp3.txt
paste -d ',' tmp.txt tmp2.txt tmp3.txt | grep -v -e "^," | grep -v -e ",$" | grep -v -e ",," | sort | uniq -c | sort -rn

All you need to do is create a third file to concatenate together. Everything else is pretty much the same. Of course we could continue expanding this to 4-grams, 5-grams, etc., but if your documents are short then this won’t differ very much from your most frequent response results.
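If you’d rather skip the temporary files, the same counting can be done in a single awk pass. This is just a sketch; the -v n flag sets the gram size:

# Count n-grams (n=3 here) in one pass; awk handles one response per
# record, so n-grams never cross response boundaries.
cat data.txt | tr '[:upper:]' '[:lower:]' | tr -d '[:punct:]' \
  | awk -v n=3 '{ for (i = 1; i + n - 1 <= NF; i++) {
        g = $i
        for (j = 1; j < n; j++) g = g "," $(i + j)
        print g
      } }' \
  | sort | uniq -c | sort -rn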

Topic Modeling

This is not a UNIX command, but is such a great, easy way to get better information about the ideas in a set of responses that I have to include it.

I’m not going to explain the math for how topic modeling works, but essentially it groups words that co-occur together in a document to create a list of topics across the entire document set. Each “topic” is a weighted list of words associated with the topic, and each topic has a weight that indicates how frequent that topic is across all documents. By looking at this weighted list of words you can easily pick out the most common themes across your responses.

The easiest way I’ve found to run topic modeling is to download and install Mallet. You can follow Mallet’s main topic modeling instructions, but I’ve reduced them down to a couple of command lines that almost always work for me:

#Import data that has one "document" per line:
bin/mallet import-file --input data.txt --output data.mallet --keep-sequence --remove-stopwords

#Import data that has one "document" per file:
bin/mallet import-dir --input data/* --output data.mallet --keep-sequence --remove-stopwords

bin/mallet train-topics \
    --input data.mallet \
    --alpha 50.0 \
    --beta 0.01 \
    --num-topics 100 \
    --num-iterations 1000 \
    --optimize-interval 10 \
    --output-topic-keys data.topic-keys.out \
    --topic-word-weights-file data.topic-word-weights.out

#sort by most frequent topic, and remove the topic number
cat data.topic-keys.out | cut -f 2-20 | sort -rn > data.sorted-topics

Depending on the size of your dataset, you almost certainly will need to play with the number of topics you generate. 50 or 100 is often fine, but if you were generating topics across something as diverse as Wikipedia you’d clearly need many more. If you don’t have enough topics then it is very easy for the topics to seem like a meaningless grouping of words. I usually look at the data results with 50, 100, and 300 topics to get a feel for the data.

Once you decide how many topics make sense with your dataset this technique is a powerful way to extract and rank the meaning from a large set of responses.