The Linguistics of ReTweets





I’ve done a bunch of research into the characteristics of ReTweets in an effort to understand what makes them viral. ReTweets are the first entirely observable and analyzable viral content spreading mechanism in the history of mankind and as such they offer an unparalleled window into what makes humans spread ideas.

Over the past few weeks I’ve begun delving into much deeper analysis than I have in the past with more advanced tools and a much larger dataset. At present I have a database of over 10 million ReTweets and I’ve gained access to Twitter’s new streaming API which allows me to build a very large (10 million and growing) random sample of all tweets (not just ReTweets).

In re-visiting a data point that I looked at 6 months ago (this time with a larger data set), I found that in a random sample of normal (non-ReTweet) Tweets, 18.96% contained a link, whereas 3 times that many ReTweets (56.69%) included a link.
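As a rough illustration of how a link share like this could be measured (a sketch, not the code used for this post), the snippet below counts the fraction of tweets that contain at least one link. The regular expression and the sample tweets are assumptions made up for the example; the post does not say how links were detected.

```python
import re

# Assumption: a "link" is any token starting with http://, https:// or www.
LINK_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)

def link_share(tweets):
    """Return the fraction of tweets containing at least one link."""
    tweets = list(tweets)
    if not tweets:
        return 0.0
    with_link = sum(1 for text in tweets if LINK_RE.search(text))
    return with_link / len(tweets)

# Hypothetical sample, for illustration only.
sample = [
    "RT @someone: great read http://example.com/post",
    "just had lunch",
    "check www.example.org",
    "no links here",
]
print(link_share(sample))
```

Running the same function over the random sample and the ReTweet sample separately would give the two percentages being compared.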

 

 

Then I tested the assumption that simplicity is a vital component of ReTweets (as has been observed in other types of viral content) and I found that random Tweets have 1.58 syllables per word on average, while ReTweets have an average of 1.62 syllables per word. Longer, higher syllable-count words are typically more complex, indicating that ReTweets may be slightly more complex than their less viral counterparts.
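For readers who want to try a syllables-per-word measure themselves, here is a minimal sketch. It assumes a simple vowel-group heuristic for English, which is only an approximation and not necessarily the method behind the numbers above; the function names are hypothetical.

```python
import re

VOWEL_GROUP = re.compile(r"[aeiouy]+")
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")

def count_syllables(word):
    """Estimate syllables by counting runs of vowels (rough heuristic)."""
    word = word.lower()
    n = len(VOWEL_GROUP.findall(word))
    # Treat a trailing 'e' as silent ("make"), except in "-le" endings ("table").
    if word.endswith("e") and not word.endswith("le") and n > 1:
        n -= 1
    return max(n, 1)

def syllables_per_word(text):
    """Average estimated syllables per word in a piece of text."""
    words = WORD_RE.findall(text.lower())
    if not words:
        return 0.0
    return sum(count_syllables(w) for w in words) / len(words)

print(syllables_per_word("make the table"))
```

A heuristic like this miscounts some words, so small differences between samples should be read with that error in mind.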

 

Comparing two different types of reading grade level analysis revealed that ReTweets, in general, are less “readable” and require a higher level of education to understand. A Flesch-Kincaid test gave ReTweets a reading grade level of 6.47 years of education, while random Tweets only required 6.04 years. The similar SMOG test (Simple Measure of Gobbledygook) indicated that ReTweets required 6.13 years of schooling, with random Tweets only needing 5.88 years.
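For reference, both grade levels come from published formulas. The sketch below implements the standard Flesch-Kincaid grade level and SMOG grade formulas from raw counts; how words, sentences and syllables are counted in short Tweets is left out, and the function names are hypothetical.

```python
import math

def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid grade level from total word, sentence and syllable counts."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

def smog_grade(polysyllables, sentences):
    """SMOG grade from the count of words with 3+ syllables and the sentence count."""
    return 1.0430 * math.sqrt(polysyllables * (30 / sentences)) + 3.1291

# Illustrative counts only, not taken from the Tweet samples.
print(flesch_kincaid_grade(words=100, sentences=10, syllables=150))
print(smog_grade(polysyllables=3, sentences=10))
```

Note that in the Flesch-Kincaid formula the syllables-per-word term is multiplied by 11.8, so a gap of roughly 0.04 syllables per word (1.62 versus 1.58, both rounded) moves the grade by about 0.47 on its own, which is close to the size of the whole gap between 6.47 and 6.04.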

 

 
Another characteristic commonly found in viral content is novelty; that is, the “newness” of the ideas and information presented. I created a measure of novelty by counting how many other times each word in my sample sets occurred. In the random Tweet sample, each word was found an average of 89.19 other times, while in the ReTweet sample each word was only found 16.37 other times. This shows us that while simplicity may not be very important to ReTweetability, novelty certainly is.
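One way to read this novelty measure is sketched below. It assumes the average is taken over every word occurrence in a sample, and that "other times" means the word's total count minus one; the tokenizer and function name are hypothetical, and the post's actual implementation may differ.

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z0-9']+")

def mean_other_occurrences(tweets):
    """Average, over all word occurrences, of how many other times that word appears."""
    tokens = [w for text in tweets for w in TOKEN_RE.findall(text.lower())]
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    return sum(counts[w] - 1 for w in tokens) / len(tokens)

# Toy example: "a" appears twice, "b" and "c" once each.
print(mean_other_occurrences(["a b", "a c"]))
```

A measure like this grows with the size of the sample, so the two samples would need to contain a comparable number of words for the averages to be comparable.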

 

 
Part of speech (POS) tagging is an analysis technique in which an algorithm labels each word in a piece of content with a specific part of speech: noun, verb, adjective, etc. The graph below shows what percentage of words in each sample was labeled with each part of speech. It lists only the most interesting tags from the much larger list of POS tags.

Interesting points from this data include how heavily ReTweets lean on nouns and third-person forms, which suggests that they are subject-matter driven and headline-like.

 

 
I also used the two linguistic lexicons currently in use on TweetPsych: RID and LIWC.

First up is the more “Freudian” Regressive Imagery Dictionary (RID). This coding scheme is designed to measure the amount and type of three categories of content: primordial (the unconscious way you think, like in dreams); conceptual (logical and rational thought); and emotional.

Significantly more primordial content has been found in the poetry of poets who exhibit signs of psychopathology than in that of poets who exhibit no such signs (Martindale, 1975).

The first RID graph shows that ReTweets contain less primordial and emotional content than random Tweets and more conceptual content.

 

 
Looking at specific RID codes, we see that words for social behavior and instrumental behavior (constructive words like build and create) are ReTweetable, while abstract thought and sensation-based words are not.

 

 
The last analysis I performed used LIWC (pronounced “Luke”). This is a lexicon similar to RID, but grounded in more widely reviewed and accepted research and refined over 15 years. LIWC measures the cognitive and emotional properties of a person based on the words they use.

In order to provide an efficient and effective method for studying the various emotional, cognitive, and structural components present in individuals’ verbal and written speech samples, we originally developed a text analysis application called Linguistic Inquiry and Word Count, or LIWC.

LIWC analysis shows that Tweets about work, religion, money and media/celebrities are more ReTweetable than Tweets about negative emotions, sensations, swear words and self-reference.
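Lexicon-based coding of the kind RID and LIWC perform boils down to checking each word against category word lists and reporting what share of the text falls into each category. The sketch below shows that idea with a tiny, made-up lexicon; the categories and words are hypothetical stand-ins, not the actual RID or LIWC dictionaries, and real lexicons also handle word stems and much larger vocabularies.

```python
import re

# Hypothetical toy lexicon for illustration; NOT the RID or LIWC word lists.
TOY_LEXICON = {
    "work": {"job", "office", "boss", "meeting"},
    "money": {"cash", "price", "pay", "dollar"},
    "self": {"i", "me", "my", "mine"},
}

def category_percentages(text, lexicon=TOY_LEXICON):
    """Return the percentage of words in the text that fall into each category."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return {category: 0.0 for category in lexicon}
    return {
        category: 100.0 * sum(1 for w in words if w in vocab) / len(words)
        for category, vocab in lexicon.items()
    }

print(category_percentages("My boss will pay me"))
```

Comparing these percentages between a ReTweet sample and a random sample is what yields statements like "Tweets about work are more ReTweetable".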


{ 33 comments }

C.W. Anderson July 1, 2009 at 9:50 am

Great analysis; would be good to get significance levels of your various findings, though. Is a difference in reading grade level of 6.47 years of education and 6.04 years statistically significant? How much so? (I would imagine all this stuff has a low p value because of the largeness of the sample, but I just don’t know.)

Sorry to nitpick, all in all this looks like an awesome project.

Patrick Jarrett July 1, 2009 at 10:27 am

Interesting stuff. I would also think it could be informative to look at the retweets sorted by the number of times they are retweeted, split them into X parts, and then run these analyses on those different parts to see how the reading level, word usage, and other choices affect the effectiveness of the continued transmission of the message.

Brandon Mendelson July 1, 2009 at 11:48 am

Great stuff Dan. I think, maybe it’s a bit too much analysis;-), but useful information to have.

The SUCCESS acronym from Made To Stick, I’ve found anyway, helps a great deal in getting a message re-tweeted (for what it’s worth).

Jack Repenning July 1, 2009 at 2:39 pm

When you quote stats for “retweets,” are you referring to the actual retweet, or the original tweet that got retweeted?

As a fairly frequent retweeter, I’m sure the actual retweets have higher linguistic complexity, by any measure: it’s a real struggle to squeeze in any thought of your own alongside a pruned-down copy of the original, and yet stay within the magical 140! But that obviousness makes stats on the fact less interesting (to me, at any rate). If your data mean “a tweet is more likely to be retweeted if it has higher linguistic complexity,” I _would_ find that surprising, interesting, and possibly useful!

Michelle K. Gross July 1, 2009 at 2:54 pm

Your corpus refers only to English language tweets? Do you assess the native language or language proficiency of the tweeter?

It would be interesting to look at what happens when information is re-tweeted incorrectly. Does the original tweet spread more, or does the inaccurate re-tweet take precedence, or is there no pattern? Just as a virus mutates, the RT may also.

Adrian Bailey July 1, 2009 at 3:20 pm

That “Average Syllables per Word” graph is very naughty.

Dan Zarrella July 1, 2009 at 3:22 pm

@adrian naughty how, because of the scale?

Jeff Heuer July 1, 2009 at 5:01 pm

Seriously? “ReTweets may be more complex than their less viral counterparts” because on average they have 0.035 more syllables per word? That’s a very tiny difference. The readability results don’t seem to be meaningful either; according to Wikipedia, the standard error of SMOG is more than 1.5 grades, meaning the difference of ~0.2 between retweets and tweets is totally statistically insignificant. Creating Excel bar charts doesn’t make something “science”.

Richard July 1, 2009 at 5:57 pm

This could have been an excellent post; there is a great deal of interesting territory to investigate with regard to the linguistics of retweets (which is not capitalised for reasons that should be obvious to someone that just did a linguistic analysis), such as: how drastically the text of retweets changes as they get passed on and on by people editing them down to stay within the 140 character limit, or perhaps what specific qualities of tweets contribute to their viralness or virality or whatever you social media types call it. The difference in parts of speech used is not an acceptable way to gauge this. As it is, it says almost nothing, and I find your analysis to be severely lacking. I hope this doesn’t come across as insulting, because I don’t mean it to be. The fact is: everything you have said means almost nothing.

The fact that retweets tend to contain links is somewhat interesting, but should be pretty obvious, and can be easily determined by anyone with an inkling of insight into social media. Filtering millions of tweets through whatever they were filtered through was hardly necessary.

The difference in the average syllables of words in tweets is so minimal that it says absolutely nothing. And I find it extremely cheap that the scale of the graph has been extended so that a 0.03 difference appears more significant than it is.

Looking at the reading level of tweets that contain so few words is also meaningless. It’s not the headline that needs to be readable, it’s the body text. This kind of analysis is done on large bodies of text to assess its readability over an extended period. Furthermore, as far as I know, all of the standard tests for this use the number of sentences as one of the primary metrics. Very few tweets contain more than two or three sentences. That’s kind of like polling 3 people for their favourite ice cream and declaring strawberry the best because two of them preferred it. It’s not an acceptable sample size.

On top of that, the standard error rate in ALL of those tests is greater than the differences you found, supporting the fact that they are absolutely meaningless.

Your average word occurrence findings began to approach something moderately interesting but as with all your other tests, you failed to actually provide any good analysis of your findings. One sentence that essentially just sums up your results is not an analysis, it’s a summary. And average word occurrence really doesn’t suggest a great deal about novelty at all. I could tweet “spling splong puddingflaps,” three words I’m sure do not appear anywhere else in your sample set, and I doubt that would get retweeted, despite its apparent novelty.

With your part of speech percentages, the difference is, again, minimal enough to be considered negligible and doesn’t prove a great deal. You did manage to discover a lot of nouns, but again, that’s hardly news if you’ve used Twitter for more than ten minutes. And it barely qualifies as news if you haven’t.

I don’t know enough about the psychological analysis you performed to comment, so I won’t. However I will say this: automated analysis such as that is rarely accurate, and is probably not deemed authoritative by many people. An interesting way to scratch the surface, sure, but not to be taken seriously.

And the fact that retweets exhibit social behaviour? Come on. Who couldn’t have guessed that? Ditto on tweets being self-referential. I realise those are minor nitpicks.

Net gain from this: absolutely nothing. Maybe you should ask a linguist for his assistance next time? Or anyone, really.

Dan Zarrella July 1, 2009 at 6:10 pm

@richard not sure what you mean about nouns and watching twitter for any length of time; notice I’m comparing retweets with random tweets, and retweets contain more nouns. Also, your analysis of my novelty metric is reductio ad absurdum I believe and therefore a logical fallacy. I agree that on many of the other points the differences are very small, but they are differences.

Natalie July 1, 2009 at 8:35 pm

It’s really interesting to see that tweets with more syllables per word and a higher reading level (however small the difference is) are retweeted more often. Your study seems to show that some people don’t mind a little vocabulary.

Trefor Walters July 2, 2009 at 8:32 am

Great post!

Just wanted to chime in and say that I love your work in general.

Also, I would conclude that this data should be looked upon as data alone. Yes, conclusions can be drawn that there are linguistic pattern differences in Tweets vs Re-Tweets, but this research doesn’t mean that by adjusting my status updates so that each word contains 0.04 more syllables, I will produce a magical viral RT.

Of course the next step would be to test the data, which I will leave to your kind self. It is entirely possible that very small changes in combinations of certain linguistic constructs could lead to exponential retransmission.

Hey, you’re the one calling yourself a “social and viral marketing scientist”!

(Which, despite the disdain amongst some web users for “Social Marketing Experts” etc, is a really cool title to promote yourself with. Here’s to the scientific method)

Brad July 2, 2009 at 11:01 am

Dan, after reading a couple of the comments above, I feel compelled to tell you that many of us think your thought process and creativity are AWESOME. Sure, perhaps there are some debatable issues with the data, but folks, there are debatable issues with any data, any time. I can’t believe some people above have the gall to be downright rude about what they’d rather see.

From one generous and optimistic human being to another, thanks for this article, and all of your other articles. Don’t ignore the criticisms of jerks — their ideas are worth considering and often valid — but please don’t let their psychotic rudeness hinder your generosity and intent. You have GREAT insights and LOTS of us appreciate your angle of thinking. Keep up the great work Dan!

Dan Zarrella July 2, 2009 at 11:52 am

@brad thanks for the kind words, I’m aware that presenting data and science is bound to draw a special kind of attack, but it’s always nice to read a comment like yours. thank you.

Ronnie Sullivan July 3, 2009 at 3:04 am

Yeah Brad, you are absolutely right. Dan is very creative and always has new ideas.

jonah lopin July 3, 2009 at 7:32 am

Zarrella, I think it’s a Good Thing to start running the numbers and putting them out there, even if you plan to iterate on your analysis later.

One cut that I think would be interesting would be to focus less on the tweets and more on the humans that tweet them. Does TG have an API such that you could pull the humans who tweet and retweet and try to figure out the differences between the humans? I wonder if tendency to retweet correlates with Twitter Grade, whether retweeters tend to tweet often using keywords in the tweet they retweeted, etc. That would be interesting because then if you want to drive retweets you could try to build a network of humans that fit the “profile” of a retweeter.

Keep the data coming.

Skip Shuda July 3, 2009 at 8:32 am

Dan – I’m loving your work, emerging toolset and research. Kudos for tackling the difficult hybrid domain of social media – anthropology – linguistics – psychology which is called for in today’s exploding communication realm. Please keep it coming!

Like Jack Repenning’s response above, my initial reaction was that the vehicle of Twitter, and the added constraint on a Retweet of maintaining the original reference plus adding the reference to the retweeted source inside the 140 character Twitter limit, forces me to use higher content language. Are we correct in assuming that you were actually analyzing the retweet vs. the original source that was retweeted? Seems that the latter analysis might yield deeper insights, but it would be tougher to extract a dataset automatically.

At any rate, I’m locked into following your work – and look forward to more goodies from you.

nihonjon July 3, 2009 at 12:04 pm

*Pro Tip*
Statistics is more than just graphing numbers.

mand July 3, 2009 at 1:20 pm

Fascinating. Without trying to get anything to go viral, for commercial reasons or whatever, i am intrigued by the whole social/statistical nature of it. Surely hashtags would have as much effect as links? Among my own tweets the most-retweeted are those with hashtags such as #writing that are presumably put into search engines, so non-followers are picking up my tweets.

I happened here from Twitter and very glad i did! :0)

Adrian Bailey July 3, 2009 at 1:59 pm

Yeah – creating misleading graphs like that suggests that either (a) you’re a noob or (b) you don’t mind spinning the data.

Trefor Walters July 4, 2009 at 8:47 am

@Adrian – In what way specifically were the graphs misleading?

Again, assuming you can supply an answer to the above question (surely a graph shows a graphical representation of numeric data, which are just numbers that do not have the ability to lead or mislead without interpretation), what is the relationship between the “misleading graphs” part of your statement and “you’re a noob”? How does one cause the other?

I’m also curious to understand how you apparently came to the conclusion that the data has been spun, and that Dan, should the data have been spun, didn’t mind doing it.

Especially considering that humans can’t do anything but subjectively experience and interpret everything, even “objective data”, based on internal belief and value filters.

The graphs themselves are not misleading – the narrative surrounding them perhaps, yes….

You are right though – the Average Syllables Per Word graph is naughty and should go and sit in the naughty corner.

Samuel Carrijo July 6, 2009 at 9:36 am

About the readability grade, I believe it is because many news items are retweeted, and newspapers usually use a “harder to understand” vocabulary because they can’t be too informal.

Mat July 6, 2009 at 11:54 am

Just wondering, what do you see the lifespan of twitter to be? Do you think it will morph into something else, or will it last for a limited amount of time then vanish?

Kenny July 8, 2009 at 3:48 am

I just love statistics. This is great stuff. Thanks for compiling all this data. Do you have any trend data over time?

Bill July 8, 2009 at 2:07 pm

A lot of the comments to this point seem, to me, to focus on only a single metric and to wonder how that can make a difference.

You have to look at the whole picture.

If I were going to summarize all the stats and charts in a tweet, it would have to be:

“Intelligent ideas, with links to valuable content, get re-tweeted. Crap dies where it is.”

Bill July 8, 2009 at 2:11 pm

In fact, I did just tweet that, less the “Crap dies where it is” (in order to include a link to this blog post).

Might as well test the idea. Let’s see if it gets re-tweeted.

darya July 9, 2009 at 11:44 pm

Very cool idea, Dan. As a scientist, however, I want to reiterate the point about p-values and significance. I cannot interpret these numbers without error bars on the graphs. If the differences aren’t statistically significant then they aren’t, well, significant. Shouldn’t be too hard to run a quick t-test.

If you think there’s a trend but significance isn’t reached, increase your n.

Looking forward to more data!

Ann Wylie July 19, 2009 at 11:02 am

Dan, what were some of the most/least common words you found? I’m intrigued by the novelty aspect of retweeting. Thanks!

brettburky August 27, 2009 at 7:52 am

hello Dan.

I have to say that i love the graphs. Is there any chance you could share where you are getting these wonderful graphs from. I would like to run some of the same tests for some research I am doing. Let me know if that is ok. If so please contact me through my email.

thibaud March 5, 2010 at 12:00 pm

Great analysis indeed, as I am using Twitter to help our website grow bigger and bigger!

Freestyling July 20, 2010 at 5:00 pm

That was excellent, w/ applications about language that undoubtedly extend beyond Twitter but to changes in linguistic expression generally, or at a minimum, in social media.
