Using Text Mining to Find Out What @RDataMining Tweets are About

This post gives an example of text mining Twitter data with the R packages twitteR, tm and wordcloud. Package twitteR provides access to Twitter data, tm provides functions for text mining, and wordcloud visualizes the result as a word cloud.

If you have no access to Twitter, the tweet data can be downloaded as the file “rdmTweets.RData” from http://www.rdatamining.com/data, and you can then skip the first step below.
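For the download route, an .RData file can be read with base R's load(). The following is a minimal sketch, assuming the file contains a single object named rdmTweets (the file name suggests this, but it is not confirmed here). A stand-in object and a temporary file are used so that the snippet runs without the download.

```r
# Stand-in for the downloaded data: a hypothetical list of three "tweets".
rdmTweets <- list("Text Mining Tutorial",
                  "R cookbook with examples",
                  "twitteR package")

# Save to a temporary .RData file (in practice this is the downloaded file).
path <- file.path(tempdir(), "rdmTweets.RData")
save(rdmTweets, file = path)
rm(rdmTweets)

# load() restores the saved objects under their original names
# and returns those names.
loaded <- load(path)
print(loaded)
print(length(rdmTweets))
```

With the real file, only the load() call is needed, with the path pointing at the downloaded copy.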

Retrieving Text from Twitter

Packages:

> library(twitteR)
> # retrieve the first 100 tweets (or all tweets if fewer than 100)
> # from the user timeline of @rdatamining
> rdmTweets <- userTimeline("rdatamining", n=100)
> n <- length(rdmTweets)
> rdmTweets[1:3]
[[1]]
Text Mining Tutorial http://t.co/jPHHLEGm
[[2]]
R cookbook with examples http://t.co/aVtIaSEg
[[3]]
Access large amounts of Twitter data for data mining and other tasks within
R via the twitteR package. http://t.co/ApbAbnxs

Transforming Text

The tweets are first converted to a data frame and then to a corpus.
> df <- do.call("rbind", lapply(rdmTweets, as.data.frame))
> dim(df)
[1] 79 10
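The do.call("rbind", lapply(...)) idiom may be easier to follow on a small example. The sketch below uses hypothetical records stored as plain named lists, which are only stand-ins; real twitteR status objects are a different class. Recent versions of twitteR also provide twListToDF() for this conversion.

```r
# Hypothetical stand-ins for status objects: each record is a named list.
records <- list(
  list(text = "Text Mining Tutorial", retweetCount = 2),
  list(text = "R cookbook with examples", retweetCount = 5),
  list(text = "twitteR package", retweetCount = 1)
)

# Convert every record to a one-row data frame ...
rows <- lapply(records, as.data.frame, stringsAsFactors = FALSE)
# ... and stack the rows; do.call passes the list elements
# as separate arguments to rbind.
df <- do.call("rbind", rows)

print(dim(df))
print(df$text)
```

Each record becomes one row and each field one column, which is why dim(df) above reports one row per tweet.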

> library(tm)
> # build a corpus, which is a collection of text documents
> # VectorSource specifies that the source is character vectors.
> myCorpus <- Corpus(VectorSource(df$text))

After that, the corpus needs a few transformations: changing letters to lower case, removing punctuation and numbers, and removing stop words. The general English stop-word list is tailored by adding “available” and “via” and removing “r”.
> # convert to lower case
> # (recent versions of tm may need content_transformer(tolower) here)
> myCorpus <- tm_map(myCorpus, tolower)
> # remove punctuation
> myCorpus <- tm_map(myCorpus, removePunctuation)
> # remove numbers
> myCorpus <- tm_map(myCorpus, removeNumbers)
> # remove stopwords
> # keep "r" by removing it from stopwords
> myStopwords <- c(stopwords('english'), "available", "via")
> # setdiff() is safe even if "r" is absent, whereas x[-which(...)]
> # would return an empty vector in that case
> myStopwords <- setdiff(myStopwords, "r")
> myCorpus <- tm_map(myCorpus, removeWords, myStopwords)
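To see what these transformations do to the text, the same steps can be sketched in base R on a plain character vector. This is an illustration only, not how tm implements them; the short stop-word list here is hypothetical and stands in for the tailored list above.

```r
tweets <- c("Text Mining Tutorial http://t.co/jPHHLEGm",
            "R cookbook with examples http://t.co/aVtIaSEg")

# A tiny hypothetical stop-word list (the real one comes from stopwords("english")).
myStopwords <- c("with", "the", "via", "available")

clean <- tolower(tweets)                  # lower case
clean <- gsub("[[:punct:]]", "", clean)   # remove punctuation
clean <- gsub("[[:digit:]]", "", clean)   # remove numbers

# Remove whole stop words only (\\b marks a word boundary).
pattern <- paste0("\\b(", paste(myStopwords, collapse = "|"), ")\\b")
clean <- gsub(pattern, "", clean, perl = TRUE)

# Collapse the whitespace left behind by the removals.
clean <- gsub("\\s+", " ", trimws(clean))

print(clean)
```

Note that removing punctuation turns a URL into a single token such as httptcojphhlegm, which is also what appears in the corpus output below.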

Stemming Words

In many cases, words need to be stemmed to retrieve their radicals. For instance, “example” and “examples” are both stemmed to “exampl”. However, after that, one may want to complete the stems to their original forms, so that the words would look “normal”.

> dictCorpus <- myCorpus
> # stem words in a text document with the snowball stemmers,
> # which requires packages Snowball, RWeka, rJava, RWekajars
> myCorpus <- tm_map(myCorpus, stemDocument)
> # inspect the first three "documents"
> inspect(myCorpus[1:3])
(Some details are removed to make it short. Same applies to inspect() below.)
[[1]]
text mine tutori httptcojphhlegm
[[2]]
r cookbook exampl httptcoavtiaseg
[[3]]
access amount twitter data data mine task r twitter packag httptcoapbabnx

> # stem completion
> myCorpus <- tm_map(myCorpus, stemCompletion, dictionary=dictCorpus)

Print the first three documents in the built corpus.
> inspect(myCorpus[1:3])
[[1]]
text miners tutorial httptcojphhlegm
[[2]]
r cookbook examples httptcoavtiaseg
[[3]]
access amounts twitter data data miners task r twitter package httptcoapbabnxs

Something unexpected in the above stemming and stem completion is that the word “mining” is first stemmed to “mine” and then completed to “miners” instead of “mining”, even though the tweets contain many instances of “mining” and only one of “miners”.
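One plausible explanation, offered as an assumption rather than a statement about tm's internals: if completion looks for dictionary words that start with the stem, then “mining” can never be returned for “mine”, because “mining” does not begin with “mine”, however often it occurs. The hypothetical complete() helper below sketches that rule in base R.

```r
# Hypothetical dictionary of unstemmed words, with "mining" far more frequent.
dictionary <- c("mining", "mining", "mining", "miners", "data", "examples")

# Assumed completion rule: candidates are dictionary words
# that begin with the stem.
complete <- function(stem, dictionary) {
  unique(grep(paste0("^", stem), dictionary, value = TRUE))
}

print(complete("mine", dictionary))    # candidates for the stem "mine"
print(complete("exampl", dictionary))  # candidates for the stem "exampl"
```

Under this rule the frequency of “mining” is irrelevant; only “miners” qualifies as a candidate for “mine”.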

Building a Term-Document Matrix

> myDtm <- TermDocumentMatrix(myCorpus, control = list(minWordLength = 1))
> inspect(myDtm[266:270,31:40])
A term-document matrix (5 terms, 10 documents)
Non-/sparse entries: 9/41
Sparsity : 82%
Maximal term length: 12
Weighting : term frequency (tf)
              Docs
Terms          31 32 33 34 35 36 37 38 39 40
  r             0  0  1  1  1  0  1  2  1  0
  ramachandran  0  0  0  0  0  0  1  0  0  0
  ranked        0  0  0  1  0  0  0  0  0  0
  rapidminer    0  0  0  0  0  0  0  0  0  0
  rdatamining   0  0  1  0  0  0  0  0  0  0
Based on the above matrix, many data mining tasks can be done, for example, clustering, classification and association analysis.
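The structure of such a matrix can be sketched by hand in base R: one row per term, one column per document, and each cell holding a term frequency. The three documents below are hypothetical, and the code is only an illustration of the idea, not of how TermDocumentMatrix() works internally.

```r
docs <- c("r data mining", "r cookbook examples", "twitter data data mining")

# Split each document into words.
words <- strsplit(docs, " ", fixed = TRUE)
terms <- sort(unique(unlist(words)))

# Count every term in every document; sapply binds the counts
# into a matrix with one column per document.
tdm <- sapply(words, function(w) table(factor(w, levels = terms)))
dimnames(tdm) <- list(Terms = terms, Docs = as.character(seq_along(docs)))

print(tdm)
```

Most cells are zero even in this tiny example, which is why the real matrix above reports a high sparsity.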

Frequent Terms and Associations

> findFreqTerms(myDtm, lowfreq=10)
[1] "analysis" "data"     "examples" "miners"   "package"  "r"        "slides"
[8] "tutorial" "users"

> # which words are associated with "r"?
> findAssocs(myDtm, 'r', 0.30)
       r    users examples  package canberra     cran     list
    1.00     0.44     0.34     0.31     0.30     0.30     0.30

> # which words are associated with "mining"?
> # Here "miners" is used instead of "mining",
> # because the latter is stemmed and then completed to "miners". :-(
> findAssocs(myDtm, 'miners', 0.30)
        miners           data classification   httptcogbnpv         mahout
          1.00           0.56           0.47           0.47           0.47
recommendation           sets       supports       frequent        itemset
          0.47           0.47           0.47           0.40           0.39
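
The association scores above can be read as correlations between the occurrence patterns of terms across documents, with the third argument acting as a lower correlation limit. A minimal base R sketch of that idea, on a small hypothetical matrix:

```r
# A small hypothetical term-document matrix (rows: terms, columns: documents).
m <- rbind(
  r      = c(1, 1, 0, 2, 0, 1),
  users  = c(1, 0, 0, 1, 0, 1),
  mahout = c(0, 0, 1, 0, 1, 0)
)

# Correlate the row for "r" with every term's row across documents.
assoc <- apply(m, 1, function(term) cor(m["r", ], term))

# Keep terms at or above a correlation limit, strongest first.
print(sort(assoc[assoc >= 0.30], decreasing = TRUE))
```

A term always correlates perfectly with itself, which is why the queried term heads each list above with a score of 1.00.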

Word Cloud

After building a term-document matrix, we can show the importance of words with a word cloud (also known as a tag cloud). In the code below, the word “miners” is changed back to “mining”.
> library(wordcloud)
> m <- as.matrix(myDtm)
> # calculate the frequency of words
> v <- sort(rowSums(m), decreasing=TRUE)
> myNames <- names(v)
> k <- which(names(v)=="miners")
> myNames[k] <- "mining"
> d <- data.frame(word=myNames, freq=v)
> wordcloud(d$word, d$freq, min.freq=3)

The above word cloud clearly shows that “r”, “data” and “mining” are the three most important words, which confirms that the @RDataMining tweets are about R and data mining. The other important words are “analysis”, “examples”, “slides”, “tutorial” and “package”, which shows that the tweets focus on documents and examples on analysis and R packages.

More examples of data mining with R are available at the RDataMining website, and also at my Twitter account and the groups below.
Twitter: http://twitter.com/RDataMining
Group on Linkedin: http://group.rdatamining.com
Group on Google: http://group2.rdatamining.com

About Yanchang Zhao

I am a data miner, using R for data mining applications. My work on R and data mining: RDataMining.com; Twitter; Group on Linkedin; and Group on Google.

24 Responses to Using Text Mining to Find Out What @RDataMining Tweets are About

  1. Ben says:

    Nicely documented, thanks. I prefer to remove stopwords using the text file in the tm package and editing that file to add/remove words, the code to use the tm stopword file is:
    a <- tm_map(a, removeWords, stopwords("english"))

    Quite a bit of interesting work like this going on like this at the moment, have you seen these?

    http://heuristically.wordpress.com/2011/04/08/text-data-mining-twitter-r/

    http://jeffreybreen.wordpress.com/2011/07/04/twitter-text-mining-r-slides/

    http://practicalquant.blogspot.com/2010/04/text-mining-and-twitter.html

  8. I’ve walked through this example, and now I’d like to export all the text I imported into a .txt file. How do I do that?

  11. Ganttic says:

    Thank you very much for sharing this informative post. I have learned a lot from here. I hope that there will be an update soon. Keep up the good work!

  12. David says:

    Hi, I’m completely new to R, but everything works fine until I get to this line:
    df <- do.call(“rbind”, lapply(rdmTweets, as.data.frame))

    I get this error:
    Error in as.data.frame.default(X[[1L]], …) :
    cannot coerce class "structure("status", package = "twitteR")" to a data.frame

    Any idea why? I can't find anyone else with my problem via Google.
    Thanks for any help.

  15. buvnesh says:

    HI i have downloaded tweets using R and i have them in a csv file how to import those files in R and perform all the similar operations u have done above like building term document matrix and etc etc pls reply soon will be very help ful for my college project

  16. buvnesh says:

    Your above project resembles my whole project till term document matrix but i cannot proceed even from first step i have download tweets as csv files and pls help me out with coding. i cannot understand the data frame concept pls help me out i need to submit my project report by next week.

  17. buvnesh says:

    I get the following error pls help me out plsssssssssssssssssssssss replyyyyyyyyy
    df <- do.call(“rbind”, lapply(rdmTweets, as.data.frame))
    Error: unexpected input in "df <- do.call(“"

  18. buvnesh says:

    After generating a term document matrix how to export that matrix in to a excel or csv file

    • tdm <- TermDocumentMatrix(yourCorpus) # Create your Term Document Matrix
      tdm.matrix <- as.matrix(tdm) # Turn it into a matrix
      sorted.matrix <- sort(rowSums(tdm.matrix),decreasing=TRUE) # Sort the matrix by frequency of words (optional)
      tdf.df<- data.frame(Word = names(sorted.matrix),Frequency=sorted.matrix) # Create a data frame
      write.csv(tdf.df,"my_csv_data.csv") # write your data frame to a csv file.

  19. swapandr says:

    Nice,
    How do you get the reply to a tweet in R ?
