I have been following Chris Albon on Twitter and have seen some really nice-looking machine learning flash cards in his feed. One can go to his website and buy all the cards he has produced, but I was curious to see if I could download those flash cards in R. So, I started looking for an R package that would help me download the tweets by Chris Albon. I ended up using the rtweet package for my analysis.

The libraries that I will be using for this analysis are as follows:

  • rtweet : To import the tweets from Twitter into R.
  • dplyr : To manipulate the tweets.
  • rvest : To extract information from web data.

Let’s get started with the rtweet package. First, I am going to search for tweets that match chrisalbon, which includes his own tweets as well as replies to him.

# Load Libraries
library(rtweet)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(rvest)
## Loading required package: xml2

rtweet package usage

I am going to use the search_tweets function from the rtweet package to find the tweets.

albon <- rtweet::search_tweets(q = "chrisalbon", include_rts = FALSE, retryonratelimit = TRUE, n = 18000)
## Searching for tweets...
## This may take a few seconds...
## Finished collecting tweets!
# Look at head of albon dataframe
head(albon)
## # A tibble: 6 x 68
##            status_id          created_at    user_id  screen_name
##                <chr>              <dttm>      <chr>        <chr>
## 1 943732332824969216 2017-12-21 06:38:16  473718208      SHiggan
## 2 943732329226305536 2017-12-21 06:38:15    6024272 sergeimuller
## 3 943732314663776257 2017-12-21 06:38:12   19340488      jortheo
## 4 943730962243862528 2017-12-21 06:32:49 2990872965     SETIEric
## 5 943729686881996800 2017-12-21 06:27:45  614046734    jdparaujo
## 6 943727710685163520 2017-12-21 06:19:54   14643231    alanmimms
## # ... with 64 more variables: text <chr>, source <chr>,
## #   display_text_width <dbl>, reply_to_status_id <chr>,
## #   reply_to_user_id <chr>, reply_to_screen_name <chr>, is_quote <lgl>,
## #   is_retweet <lgl>, favorite_count <int>, retweet_count <int>,
## #   hashtags <list>, symbols <list>, urls_url <list>, urls_t.co <list>,
## #   urls_expanded_url <list>, media_url <list>, media_t.co <list>,
## #   media_expanded_url <list>, media_type <list>, ext_media_url <list>,
## #   ext_media_t.co <list>, ext_media_expanded_url <list>,
## #   ext_media_type <chr>, mentions_user_id <list>,
## #   mentions_screen_name <list>, lang <chr>, quoted_status_id <chr>,
## #   quoted_text <chr>, quoted_created_at <dttm>, quoted_source <chr>,
## #   quoted_favorite_count <int>, quoted_retweet_count <int>,
## #   quoted_user_id <chr>, quoted_screen_name <chr>, quoted_name <chr>,
## #   quoted_followers_count <int>, quoted_friends_count <int>,
## #   quoted_statuses_count <int>, quoted_location <chr>,
## #   quoted_description <chr>, quoted_verified <lgl>,
## #   retweet_status_id <chr>, retweet_text <chr>,
## #   retweet_created_at <dttm>, retweet_source <chr>,
## #   retweet_favorite_count <int>, retweet_user_id <chr>,
## #   retweet_screen_name <chr>, retweet_name <chr>,
## #   retweet_followers_count <int>, retweet_friends_count <int>,
## #   retweet_statuses_count <int>, retweet_location <chr>,
## #   retweet_description <chr>, retweet_verified <lgl>, place_url <chr>,
## #   place_name <chr>, place_full_name <chr>, place_type <chr>,
## #   country <chr>, country_code <chr>, geo_coords <list>,
## #   coords_coords <list>, bbox_coords <list>
dim(albon)
## [1] 2100   68
# We could also use the following but I wanted to see the tweets from Chris Albon
# albon1 <- rtweet::search_tweets(q = "machinelearningflashcards.com", 
#                                 include_rts = FALSE, 
#                                 retryonratelimit = TRUE)

We have a tibble with 2,100 observations and 68 columns. Now, let’s look at the actual tweets. We will look at the text column, as it holds the text of each tweet.

# Text of tweet
albon[, "text"]
## # A tibble: 2,100 x 1
##                                                                                  text
##                                                                                 <chr>
##  1        @chrisalbon ..... for the love of god.. we tried, but the client escalated 
##  2 "@chrisalbon \xf0\u009f\u0091\u008d\xf0\u009f\u008f<U+00BE> my company has a code 
##  3                                                   Wise man https://t.co/eGpk9gjuZ9
##  4                                    Been there.  Done that. https://t.co/FNd21DZD8y
##  5        "When you have to work from your significant other’s old bedroom \xf0\u009f
##  6                                                 @chrisalbon Been there. Done that.
##  7                                      "@chrisalbon Too late \xf0\u009f\u0098\u008a"
##  8                                @chrisalbon @MystyVander We deployed last night tho
##  9        @chrisalbon Pelican. Hugo is pretty nice too, but I only use it with R’s bl
## 10        or deploy and artfully walk away from that spouse &amp; racist uncle https:
## # ... with 2,090 more rows
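
The same column can also be pulled out with dplyr, which is loaded above. Here is a minimal sketch on a small made-up data frame; `toy_tweets` and its rows are invented stand-ins for `albon`, not real data:

```r
library(dplyr)

# Invented stand-in for the `albon` tibble; these rows are not real tweets.
toy_tweets <- data.frame(
  screen_name = c("user_a", "user_b", "user_c"),
  text = c("@chrisalbon Been there.",
           "Some card title",
           "Wise man"),
  stringsAsFactors = FALSE
)

# Keep only the columns of interest (still a data frame)
toy_tweets %>% select(screen_name, text)

# Extract the text column as a plain character vector
tweet_text <- toy_tweets %>% pull(text)
tweet_text
```

`select()` keeps the result rectangular, while `pull()` returns a character vector, which is the form that functions such as grep expect.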

We are interested in the tweets that have images and the URL machinelearningflashcards.com. You will notice that all the images and links in tweets are rewritten with the prefix “https://t.co/”. After some observation, I found that Twitter shortened the website machinelearningflashcards.com to https://t.co/eZ2bbpDzwV. So, let’s use this link as our pattern and find the tweets that contain it. We are going to use the grep function to find the pattern in the text column.

pattern <- "https://t.co/eZ2bbpDzwV"
# Keep only the text column for rows whose text contains the link
# (fixed = TRUE matches the URL literally rather than as a regex)
machine_learning <- albon[grep(pattern, albon$text, fixed = TRUE), "text"]
machine_learning
## # A tibble: 16 x 1
##                                                                           text
##                                                                          <chr>
##  1 SVC Radial Basis Function Kernel https://t.co/eZ2bbpDzwV https://t.co/l8Zhh
##  2             Standardization https://t.co/eZ2bbpDzwV https://t.co/sfZ4tOamRv
##  3          Adjusted R-Squared https://t.co/eZ2bbpDzwV https://t.co/fNzk1xC8Pn
##  4               Weak Learners https://t.co/eZ2bbpDzwV https://t.co/D0LSHzlJ3m
##  5        Total Sum-Of-Squares https://t.co/eZ2bbpDzwV https://t.co/ROQxeKKEbb
##  6 Sigmoid Activation Function https://t.co/eZ2bbpDzwV https://t.co/HW3haErLxn
##  7                    Boosting https://t.co/eZ2bbpDzwV https://t.co/4X3NOqLuKT
##  8            Interaction Term https://t.co/eZ2bbpDzwV https://t.co/8fokl8KJfh
##  9                  Hinge Loss https://t.co/eZ2bbpDzwV https://t.co/C0gFuRQnt6
## 10            One-Hot Encoding https://t.co/eZ2bbpDzwV https://t.co/jd2yOf8p5c
## 11   Issues With Platt Scaling https://t.co/eZ2bbpDzwV https://t.co/ziGuhNBycz
## 12               Interpolation https://t.co/eZ2bbpDzwV https://t.co/qZzIZIdyNx
## 13                Determinants https://t.co/eZ2bbpDzwV https://t.co/jTABNspxZz
## 14          Standard Deviation https://t.co/eZ2bbpDzwV https://t.co/Kf4YBHcbV3
## 15          Manhattan Distance https://t.co/eZ2bbpDzwV https://t.co/S3IahqLsBz
## 16                  Notation 4 https://t.co/eZ2bbpDzwV https://t.co/NZsUMwGGr5
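
To see how the pattern match behaves in isolation, here is a small sketch on three tweet texts copied from the output shown earlier. It uses grepl with `fixed = TRUE`, which is one possible choice rather than a requirement: it treats the pattern as a literal string, so the dots in the URL are not interpreted as regex wildcards.

```r
# The pattern comes from the post; the texts are copied from the printed output.
pattern <- "https://t.co/eZ2bbpDzwV"
texts <- c(
  "Hinge Loss https://t.co/eZ2bbpDzwV https://t.co/C0gFuRQnt6",
  "@chrisalbon Been there. Done that.",
  "Wise man https://t.co/eGpk9gjuZ9"
)

# TRUE where the text contains the literal link, FALSE otherwise
has_link <- grepl(pattern, texts, fixed = TRUE)

# Keep only the matching texts
texts[has_link]
```

Only the first text contains the flash card website link, so it is the only one kept.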

After some manipulation, we have found 16 tweets that have a machine learning term, the website link, and the flash card image link. As you can see in all the tweets above, the first link is the website link and the last link is the image link.
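
The structure described above (title, then website link, then image link) can be split apart with base R. This is only a sketch on a single tweet text copied from the output; the variable names are my own, and the regular expression assumes that t.co short links consist of letters and digits only.

```r
tweet <- "Hinge Loss https://t.co/eZ2bbpDzwV https://t.co/C0gFuRQnt6"

# Find every t.co short link in the tweet text
links <- regmatches(tweet, gregexpr("https://t\\.co/[A-Za-z0-9]+", tweet))[[1]]

website_link <- links[1]              # first link: the website
image_link   <- links[length(links)]  # last link: the flash card image

# Everything before the first link is the card title
title <- trimws(sub("https://t\\.co/.*$", "", tweet))

title
image_link
```

Taking the last element rather than the second keeps the sketch working if a tweet ever carries more than two links.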

If we try to download the image link, R will download an HTML document rather than the image. We will have to process these links a little so that we can download all the images directly from R.
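
As a rough illustration of what that processing can look like, the sketch below parses a made-up HTML snippet with rvest and pulls out an image address. The HTML, the example.com address, and the assumption that the page exposes its image through an `og:image` meta tag are all hypothetical; the real page behind a t.co link may be structured differently.

```r
library(rvest)

# Made-up HTML standing in for the page behind a t.co image link.
html <- '<html><head>
  <meta property="og:image" content="https://example.com/card.jpg">
</head><body></body></html>'

# Parse the HTML string into a document
page <- read_html(html)

# Select the (assumed) meta tag and read its content attribute
image_url <- page %>%
  html_node("meta[property=\'og:image\']") %>%
  html_attr("content")

image_url

# With a real image address, the file could then be saved with, for example:
# download.file(image_url, destfile = "card.jpg", mode = "wb")
```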