UNIX, Bi-Grams, Tri-Grams, and Topic Modeling

I’ve built up a list of UNIX commands over the years for doing basic text analysis on written language. I’ve built this list from a number of sources (Jim Martin‘s NLP class, StackOverflow, web searches), but haven’t seen it much in one place. With these commands I can analyze everything from log files to user poll responses.

Mostly this just comes down to how cool UNIX commands are (which you probably already know). But the magic is how you mix them together. Hopefully you find these recipes useful. I’m always looking for more so please drop into the comments to tell me what I’m missing.

For all of these examples I assume that you are analyzing a series of user responses with one response per line in a single file: data.txt. With a few cut and paste commands I often apply the same methods to CSV files and log files.

Generating a Random Sample

Sometimes you get confronted with a set of results that are far larger than you want to analyze. If you want to cull out a few lines from a file, but you want to eliminate the biased ordering in the file, it’s very helpful to create a random sample of them.

awk 'BEGIN {srand()} {printf "%05.0f %s \n",rand()*99999, $0; }' data.txt | sort -n | head -100 | sed 's/^[0-9]* //'

This just adds a random number to the beginning of each line, sorts the list, takes the top 100 lines, and removes the random number. A quick and easy way to get a random sample. I also use this when testing new commands where I want to just try the command on 10 lines to verify I got the command right. I’m a big believer that randomized testing will find corner cases faster than you can think of them.

Most Frequent Response

cat data.txt | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -rn

This is pretty straight-forward, take each line (lowercased) and sort alphabetically. Then use the awesome uniq -c command to count number of identical responses. Finally, sort by most frequent response.

This is why sort | uniq -c | sort -rn is easily my favorite UNIX command.

Most Frequent Words (Uni-Grams)

Along with most frequent response you often want to look at most frequent words. This is just a natural extension of our previous command, but we want to remove stop words (“the”, “of”, “and”, etc) since they provide no useful information.

cat data.txt | tr '[:upper:]' '[:lower:]' | tr -d '[:punct:]' | tr ' ' '\n' | grep -v -w -f stopwords_en.txt | sort | uniq -c | sort -rn

Pretty much the same command except we are replacing spaces with carriage returns to break the document into words rather than lines. It would be a good improvement to do better tokenization than just splitting on whitespace, but for most purposes this works well.

The stopwords_en.txt file is just one stop word per line. I usually pull my list of stop words from Ranks.nl which also has stopwords in many other languages besides English.

Most Frequent Bi-Grams

Most Frequent words are great, but they throw away a lot of context (and hence meaning). By examining pairs of words (bi-grams) you can retain a lot more of the context.

cat data.txt | tr '[:upper:]' '[:lower:]' | tr -d '[:punct:]' | sed 's/,//' | sed G | tr ' ' '\n' > tmp.txt
tail -n+2 tmp.txt > tmp2.txt
paste -d ',' tmp.txt tmp2.txt | grep -v -e "^," | grep -v -e ",$" | sort | uniq -c | sort -rn

Here we take our list of words and concatenate sequential words together separated by a comma. To indicate the end/beginning of a response we use sed G to add an extra line between each response before we split the responses into words. Then we filter out those beginning and ending words (grep -v -e "^," | grep -v -e ",$") so that we are left with only the bi-grams.

I’m not doing any removal of stop words in this case. To do that you would want to remove all bi-grams where both words were stop words which would probably mean creating an exhaustive list of them. Not too hard to do, just haven’t found it necessary yet.


Why stop at bi-grams?

tail -n+2 tmp2.txt > tmp3.txt
paste -d ',' tmp.txt tmp2.txt tmp3.txt | grep -v -e "^," | grep -v -e ",$" | grep -v -e ",," | sort | uniq -c | sort -rn

All you need to do is create a third file to concatenate together. Everything else is pretty much the same. Of course we could continue to expand this to 4-gram, 5-grams, etc, but if your documents are short then this won’t differ very much from your most frequent response results.

Topic Modeling

This is not a UNIX command, but is such a great, easy way to get better information about the ideas in a set of responses that I have to include it.

I’m not going to explain the math for how topic modeling works, but essentially it groups words that co-occur together in a document to create a list of topics across the entire document set. Each “topic” is a weighted list of words associated with the topic, and each topic has a weight that indicates how frequent that topic is across all documents. By looking at this weighted list of words you can easily pick out the most common themes across your responses.

The easiest way I’ve found to run topic modeling is to download and install Mallet. You can follow Mallet’s main topic modeling instructions, but I’ve reduced them down to a couple of command lines that almost always works for me:

#Import data that has one "document" per line:
bin/mallet import-file --input data.txt --output data.mallet --keep-sequence --remove-stopwords

#Import data that has one "document" per file:
bin/mallet import-dir --input data/* --output data.mallet --keep-sequence --remove-stopwords

lib/mallet-2.0.6/bin/mallet train-topics \
    --input data.mallet \
    --alpha 50.0 \
    --beta 0.01 \
    --num-topics 100 \
    --num-iterations 1000 \
    --optimize-interval 10 \
    --output-topic-keys data.topic-keys.out \
    --topic-word-weights-file data.topic-word-weights.out

#sort by most frequent topic, and remove the topic number
cat data.topics | cut -f 2-20 | sort -rn > data.sorted-topics

Depending on the size of your dataset, you almost certainly will need to play with the number of topics you generate. 50 or 100 is often fine, but if you were generating topics across something as diverse as Wikipedia you’d clearly need many more. If you don’t have enough topics then it is very easy for the topics to seem like a meaningless grouping of words. I usually look at the data results with 50, 100, and 300 topics to get a feel for the data.

Once you decide how many topics make sense with your dataset this technique is a powerful way to extract and rank the meaning from a large set of responses.