The Linguistics of ReTweets

I’ve done a bunch of research into the characteristics of ReTweets in an effort to understand what makes them viral. ReTweets are the first entirely observable and analyzable mechanism for spreading viral content in the history of mankind, and as such they offer an unparalleled window into what makes humans spread ideas.

Over the past few weeks I’ve begun a much deeper analysis than I have done in the past, with more advanced tools and a much larger dataset. At present I have a database of over 10 million ReTweets, and I’ve gained access to Twitter’s new streaming API, which allows me to build a very large (10 million and growing) random sample of all Tweets (not just ReTweets).

In revisiting a data point that I looked at 6 months ago (this time with a larger dataset), I found that in a random sample of normal (non-ReTweet) Tweets, 18.96% contained a link, whereas roughly 3 times that share of ReTweets (56.69%) included a link.
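As a rough illustration, a link-share comparison like this could be sketched in Python as follows. The regular expression and the function name `link_share` are assumptions made for the example; the post does not say how links were actually detected.

```python
import re

# Assumption: a "link" is any token starting with http://, https://, or www.
LINK_PATTERN = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)

def link_share(tweets):
    """Return the percentage of tweets that contain at least one link."""
    if not tweets:
        return 0.0
    with_link = sum(1 for text in tweets if LINK_PATTERN.search(text))
    return 100.0 * with_link / len(tweets)

# Made-up sample texts, for illustration only
sample = [
    "Reading a great post http://example.com/post",
    "Lunch was good today",
    "RT @someone: new data at www.example.org",
    "Stuck in traffic again",
]
print(link_share(sample))  # 2 of the 4 sample texts contain a link
```

Running the same function over the random sample and the ReTweet sample would give the two percentages to compare.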



Then I tested the assumption that simplicity is a vital component of ReTweets (as has been observed in other types of viral content). I found that random Tweets had 1.58 syllables per word on average, while ReTweets had an average of 1.62 syllables per word. Longer words with more syllables are typically more complex, which suggests that ReTweets may be slightly more complex than their less viral counterparts.
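Counting syllables automatically usually relies on a heuristic. Here is a minimal sketch that counts groups of consecutive vowels and discounts a silent trailing "e". This heuristic is an assumption for illustration, not necessarily the method behind the figures above, and it will miscount some English words.

```python
import re

VOWEL_GROUP = re.compile(r"[aeiouy]+")

def count_syllables(word):
    """Estimate syllables as the number of vowel groups in the word."""
    word = word.lower()
    groups = len(VOWEL_GROUP.findall(word))
    # Treat a trailing "e" as silent ("make"), except in "-le" endings ("simple")
    if word.endswith("e") and not word.endswith("le") and groups > 1:
        groups -= 1
    return max(groups, 1)

def average_syllables_per_word(texts):
    """Average the syllable estimate over every alphabetic word in the texts."""
    words = [w for text in texts for w in re.findall(r"[a-z]+", text.lower())]
    if not words:
        return 0.0
    return sum(count_syllables(w) for w in words) / len(words)

print(average_syllables_per_word(["Simple ideas spread", "Make it short"]))
```

In a real run, URLs, @usernames and hashtags would probably need to be stripped first, since this tokenizer would split them into pseudo-words.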


Two different reading grade level analyses both showed that ReTweets, in general, are less “readable” and require a higher level of education to understand. A Flesch-Kincaid test gave ReTweets a reading grade level of 6.47 years of education, while random Tweets required only 6.04 years. The similar SMOG test (Simple Measure of Gobbledygook) indicated that ReTweets required 6.13 years of schooling, while random Tweets needed only 5.88 years.
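For reference, both grade levels come from simple formulas over word, sentence and syllable counts. The sketch below uses the standard published coefficients; how the counts themselves were produced for the Tweet samples (for example, how sentences were split) is not stated in this post, so the inputs here are plain numbers.

```python
import math

def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid grade level from total word, sentence and syllable counts."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

def smog_grade(polysyllables, sentences):
    """SMOG grade from the number of words with 3+ syllables and the sentence count."""
    return 1.0430 * math.sqrt(polysyllables * (30 / sentences)) + 3.1291

# Made-up counts for illustration: 100 words, 10 sentences, 150 syllables
print(round(flesch_kincaid_grade(100, 10, 150), 2))
print(round(smog_grade(12, 30), 2))
```

Both formulas reward short sentences and short words, which is why a higher average syllable count pushes the grade level up.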


Another characteristic commonly found in viral content is novelty; that is, the “newness” of the ideas and information presented. I created a measure of novelty by counting how many other times each word in my sample sets occurred. In the random Tweet sample, each word was found an average of 89.19 other times, while in the ReTweet sample each word was found only 16.37 other times. This suggests that while simplicity may not be very important to ReTweetability, novelty is.
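One possible reading of this novelty measure is sketched below: for every distinct word in a sample, count how many additional times it appears, then average those counts. Averaging over distinct words (rather than over every word occurrence) and the simple tokenizer are both assumptions; the description above leaves those details open.

```python
import re
from collections import Counter

def mean_other_occurrences(texts):
    """Average number of *other* occurrences per distinct word in the sample."""
    tokens = [w for text in texts for w in re.findall(r"[a-z0-9']+", text.lower())]
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    # A word seen n times has n - 1 "other" occurrences
    return sum(n - 1 for n in counts.values()) / len(counts)

# "new" appears 3 times (2 other occurrences); the other 3 words appear once
print(mean_other_occurrences(["new phone new plan", "new idea"]))
```

A lower value means words are repeated less often, that is, the sample is more novel by this measure. Raw repeat counts grow with sample size, so the two samples should be of comparable size for the comparison to be meaningful.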


Part-of-speech (POS) tagging is an analysis technique in which an algorithm labels each word in a piece of content with a specific part of speech: noun, verb, adjective, etc. The graph below shows what percentage of the words in each sample was labeled with each part of speech. It lists only the most interesting tags from the much larger list of POS tags.

Interesting points from this data include the heavy use of nouns and the third person in ReTweets, which indicates a focus on subject matter and a headline-like style.
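Once a tagger has labeled the words, the percentages are a simple tally. The sketch below assumes the tagger's output is a list of (word, tag) pairs using Penn Treebank tags (for example `NN` for a singular noun and `VBZ` for a third-person singular present-tense verb); the post does not say which tagger or tag set was used, and the hand-tagged sample is made up.

```python
from collections import Counter

def tag_shares(tagged_tokens):
    """Return each tag's share of all tagged words, as a percentage."""
    counts = Counter(tag for _word, tag in tagged_tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag: 100.0 * n / total for tag, n in counts.items()}

# Hand-tagged example headline, for illustration only
tagged = [
    ("Google", "NNP"),
    ("launches", "VBZ"),
    ("new", "JJ"),
    ("search", "NN"),
    ("tool", "NN"),
]
for tag, share in sorted(tag_shares(tagged).items()):
    print(tag, share)
```

Computing these shares separately for the random sample and the ReTweet sample gives the per-tag comparison described above.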


I also used the two linguistic lexicons currently in use on TweetPsych: RID and LIWC.

First up is the more “Freudian” Regressive Imagery Dictionary (RID). This coding scheme is designed to measure the amount and type of three categories of content: primordial (the unconscious way you think, like in dreams); conceptual (logical and rational thought); and emotional.

Significantly more primordial content has been found in the poetry of poets who exhibit signs of psychopathology than in that of poets who exhibit no such signs (Martindale, 1975).

The first RID graph shows that ReTweets contain less primordial and emotional content than random Tweets and more conceptual content.


Looking at specific RID codes, we see that social and instrumental (constructive words like build and create) behavior are ReTweetable, while abstract thought and sensation-based words are not.
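Dictionary-based coding of this kind boils down to matching each word against per-category word lists. Here is a minimal sketch, assuming a lexicon that maps each category to words or word stems, with a trailing `*` meaning "any ending". `TOY_LEXICON` is a hypothetical stand-in invented for the example; it does not contain actual RID entries.

```python
import re
from collections import Counter

# Hypothetical toy lexicon for illustration; NOT the real RID word lists
TOY_LEXICON = {
    "social": ["friend", "share", "help*"],
    "instrumental": ["build*", "create*", "make"],
    "sensation": ["taste*", "smell*", "touch*"],
}

def compile_lexicon(lexicon):
    """Turn each category's entries into one regex; 'stem*' matches any ending."""
    compiled = {}
    for category, entries in lexicon.items():
        parts = []
        for entry in entries:
            if entry.endswith("*"):
                parts.append(re.escape(entry[:-1]) + r"\w*")
            else:
                parts.append(re.escape(entry))
        compiled[category] = re.compile("|".join(parts))
    return compiled

def category_shares(texts, lexicon):
    """Percentage of all words in the texts that fall into each category."""
    patterns = compile_lexicon(lexicon)
    hits = Counter()
    total = 0
    for text in texts:
        for word in re.findall(r"[a-z']+", text.lower()):
            total += 1
            for category, pattern in patterns.items():
                if pattern.fullmatch(word):
                    hits[category] += 1
    return {c: (100.0 * hits[c] / total if total else 0.0) for c in lexicon}

print(category_shares(["Help me build this", "friends share"], TOY_LEXICON))
```

Comparing the resulting category percentages between the two samples is what produces statements like "instrumental words are more common in ReTweets".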


The last analysis I performed used LIWC (pronounced “Luke”). This is a lexicon similar to RID, but it is based on more thoroughly reviewed and widely accepted research and has been refined over 15 years. LIWC measures the cognitive and emotional properties of a person based on the words they use.

In the words of LIWC’s developers: “In order to provide an efficient and effective method for studying the various emotional, cognitive, and structural components present in individuals’ verbal and written speech samples, we originally developed a text analysis application called Linguistic Inquiry and Word Count, or LIWC.”

LIWC analysis shows that Tweets about work, religion, money and media/celebrities are more ReTweetable than Tweets that express negative emotions or sensations, contain swear words or refer to the author themselves.