UNIX, Bi-Grams, Tri-Grams, and Topic Modeling

I’ve built up a list of UNIX commands over the years for doing basic text analysis on written language. I’ve built this list from a number of sources (Jim Martin‘s NLP class, StackOverflow, web searches), but haven’t seen it much in one place. With these commands I can analyze everything from log files to user poll responses.

Mostly this just comes down to how cool UNIX commands are (which you probably already know). But the magic is how you mix them together. Hopefully you find these recipes useful. I’m always looking for more so please drop into the comments to tell me what I’m missing.

For all of these examples I assume that you are analyzing a series of user responses with one response per line in a single file: data.txt. With a few cut and paste commands I often apply the same methods to CSV files and log files.

Generating a Random Sample

Sometimes you get confronted with a set of results that are far larger than you want to analyze. If you want to cull out a few lines from a file, but you want to eliminate the biased ordering in the file, it’s very helpful to create a random sample of them.

awk 'BEGIN {srand()} {printf "%05.0f %s \n",rand()*99999, $0; }' data.txt | sort -n | head -100 | sed 's/^[0-9]* //'

This just adds a random number to the beginning of each line, sorts the list, takes the top 100 lines, and removes the random number. A quick and easy way to get a random sample. I also use this when testing new commands where I want to just try the command on 10 lines to verify I got the command right. I’m a big believer that randomized testing will find corner cases faster than you can think of them.

Most Frequent Response

cat data.txt | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -rn

This is pretty straight-forward, take each line (lowercased) and sort alphabetically. Then use the awesome uniq -c command to count number of identical responses. Finally, sort by most frequent response.

This is why sort | uniq -c | sort -rn is easily my favorite UNIX command.

Most Frequent Words (Uni-Grams)

Along with most frequent response you often want to look at most frequent words. This is just a natural extension of our previous command, but we want to remove stop words (“the”, “of”, “and”, etc) since they provide no useful information.

cat data.txt | tr '[:upper:]' '[:lower:]' | tr -d '[:punct:]' | tr ' ' '\n' | grep -v -w -f stopwords_en.txt | sort | uniq -c | sort -rn

Pretty much the same command except we are replacing spaces with carriage returns to break the document into words rather than lines. It would be a good improvement to do better tokenization than just splitting on whitespace, but for most purposes this works well.

The stopwords_en.txt file is just one stop word per line. I usually pull my list of stop words from Ranks.nl which also has stopwords in many other languages besides English.

Most Frequent Bi-Grams

Most Frequent words are great, but they throw away a lot of context (and hence meaning). By examining pairs of words (bi-grams) you can retain a lot more of the context.

cat data.txt | tr '[:upper:]' '[:lower:]' | tr -d '[:punct:]' | sed 's/,//' | sed G | tr ' ' '\n' > tmp.txt
tail -n+2 tmp.txt > tmp2.txt
paste -d ',' tmp.txt tmp2.txt | grep -v -e "^," | grep -v -e ",$" | sort | uniq -c | sort -rn

Here we take our list of words and concatenate sequential words together separated by a comma. To indicate the end/beginning of a response we use sed G to add an extra line between each response before we split the responses into words. Then we filter out those beginning and ending words (grep -v -e "^," | grep -v -e ",$") so that we are left with only the bi-grams.

I’m not doing any removal of stop words in this case. To do that you would want to remove all bi-grams where both words were stop words which would probably mean creating an exhaustive list of them. Not too hard to do, just haven’t found it necessary yet.

Tri-Grams

Why stop at bi-grams?

tail -n+2 tmp2.txt > tmp3.txt
paste -d ',' tmp.txt tmp2.txt tmp3.txt | grep -v -e "^," | grep -v -e ",$" | grep -v -e ",," | sort | uniq -c | sort -rn

All you need to do is create a third file to concatenate together. Everything else is pretty much the same. Of course we could continue to expand this to 4-gram, 5-grams, etc, but if your documents are short then this won’t differ very much from your most frequent response results.

Topic Modeling

This is not a UNIX command, but is such a great, easy way to get better information about the ideas in a set of responses that I have to include it.

I’m not going to explain the math for how topic modeling works, but essentially it groups words that co-occur together in a document to create a list of topics across the entire document set. Each “topic” is a weighted list of words associated with the topic, and each topic has a weight that indicates how frequent that topic is across all documents. By looking at this weighted list of words you can easily pick out the most common themes across your responses.

The easiest way I’ve found to run topic modeling is to download and install Mallet. You can follow Mallet’s main topic modeling instructions, but I’ve reduced them down to a couple of command lines that almost always works for me:

#Import data that has one "document" per line:
bin/mallet import-file --input data.txt --output data.mallet --keep-sequence --remove-stopwords

#Import data that has one "document" per file:
bin/mallet import-dir --input data/* --output data.mallet --keep-sequence --remove-stopwords

lib/mallet-2.0.6/bin/mallet train-topics \
    --input data.mallet \
    --alpha 50.0 \
    --beta 0.01 \
    --num-topics 100 \
    --num-iterations 1000 \
    --optimize-interval 10 \
    --output-topic-keys data.topic-keys.out \
    --topic-word-weights-file data.topic-word-weights.out

#sort by most frequent topic, and remove the topic number
cat data.topics | cut -f 2-20 | sort -rn > data.sorted-topics

Depending on the size of your dataset, you almost certainly will need to play with the number of topics you generate. 50 or 100 is often fine, but if you were generating topics across something as diverse as Wikipedia you’d clearly need many more. If you don’t have enough topics then it is very easy for the topics to seem like a meaningless grouping of words. I usually look at the data results with 50, 100, and 300 topics to get a feel for the data.

Once you decide how many topics make sense with your dataset this technique is a powerful way to extract and rank the meaning from a large set of responses.

Leave a comment

5 Comments

  1. ramsi haddad

     /  March 18, 2013

    Loved the trigram generator. I expanded it to go to quad-gram, penta-gram (should my priest be worried?) and hexagram. Thanks a million for the help. Ramsi

    Like

    Reply
    • Jose CW

       /  November 28, 2014

      How to avoid the linkage between the last words of line 1 link to first word of line 2 when building the bi gram?

      Like

      Reply
      • Greg Ichneumon Brown

         /  December 1, 2014

        Hmmm, if I remember correctly I thought this didn’t happen because I ended up with an extra line between the words in each response in the intermediate tmp.txt file. But depending on your file format its possible that won’t happen.

        Another way to deal with that would be to add another token to each string. An easy way to do that would be with awk by piping the original data through:

        awk '{print $0," END"}'
        

        Or completely add beginning and ending tokens:

        awk '{print "BEGIN ", $0," END"}'
        

        This is often a nice solution because you can easily analyze which words are often used first and last.

        Like

  2. eloj

     /  April 17, 2016

    ‘shuf’ seems easier than the sed setup for generating a random sample, unless there’s a bias there I’m not aware of.

    Liked by 1 person

    Reply
    • Greg Ichneumon Brown

       /  April 18, 2016

      Ya, though ‘shuf’ is not always available (eg OS X). I’ve also started using ‘sort –random-sort’

      Like

      Reply

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: