Machine Learning Cards from Twitter in R
I have been following Chris Albon on Twitter and have seen some really nice-looking machine learning flash cards on his feed. One can go to his website and buy all the cards he has produced, but I was curious to see whether I could download those flash cards in R. So I started looking for an R package that would help me download Chris Albon's tweets, and I ended up using the rtweet
package for my analysis.
The libraries I will be using for this analysis are as follows:
- rtweet : To import tweets from Twitter into R.
- dplyr : To manipulate the tweet data.
- rvest : To extract information from web pages.
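If these packages are not already installed, the sketch below is one way to get them from CRAN (this setup step is my addition, not part of the original analysis). Note that rtweet also needs Twitter API authorization the first time you request tweets; in my understanding it will prompt you to authorize via the browser, or you can supply your own token.
# One-time setup: install the packages used in this post
install.packages(c("rtweet", "dplyr", "rvest"))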
Let’s get started with the rtweet
package. First, I am going to search for tweets that mention Chris Albon.
# Load Libraries
library(rtweet)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(rvest)
## Loading required package: xml2
rtweet package usage
I am going to use the search_tweets
function from the rtweet
package to find the tweets.
albon <- rtweet::search_tweets(q = "chrisalbon", include_rts = FALSE, retryonratelimit = TRUE, n = 18000)
## Searching for tweets...
## This may take a few seconds...
## Finished collecting tweets!
# Look at head of albon dataframe
head(albon)
## # A tibble: 6 x 68
## status_id created_at user_id screen_name
## <chr> <dttm> <chr> <chr>
## 1 943732332824969216 2017-12-21 06:38:16 473718208 SHiggan
## 2 943732329226305536 2017-12-21 06:38:15 6024272 sergeimuller
## 3 943732314663776257 2017-12-21 06:38:12 19340488 jortheo
## 4 943730962243862528 2017-12-21 06:32:49 2990872965 SETIEric
## 5 943729686881996800 2017-12-21 06:27:45 614046734 jdparaujo
## 6 943727710685163520 2017-12-21 06:19:54 14643231 alanmimms
## # ... with 64 more variables: text <chr>, source <chr>,
## # display_text_width <dbl>, reply_to_status_id <chr>,
## # reply_to_user_id <chr>, reply_to_screen_name <chr>, is_quote <lgl>,
## # is_retweet <lgl>, favorite_count <int>, retweet_count <int>,
## # hashtags <list>, symbols <list>, urls_url <list>, urls_t.co <list>,
## # urls_expanded_url <list>, media_url <list>, media_t.co <list>,
## # media_expanded_url <list>, media_type <list>, ext_media_url <list>,
## # ext_media_t.co <list>, ext_media_expanded_url <list>,
## # ext_media_type <chr>, mentions_user_id <list>,
## # mentions_screen_name <list>, lang <chr>, quoted_status_id <chr>,
## # quoted_text <chr>, quoted_created_at <dttm>, quoted_source <chr>,
## # quoted_favorite_count <int>, quoted_retweet_count <int>,
## # quoted_user_id <chr>, quoted_screen_name <chr>, quoted_name <chr>,
## # quoted_followers_count <int>, quoted_friends_count <int>,
## # quoted_statuses_count <int>, quoted_location <chr>,
## # quoted_description <chr>, quoted_verified <lgl>,
## # retweet_status_id <chr>, retweet_text <chr>,
## # retweet_created_at <dttm>, retweet_source <chr>,
## # retweet_favorite_count <int>, retweet_user_id <chr>,
## # retweet_screen_name <chr>, retweet_name <chr>,
## # retweet_followers_count <int>, retweet_friends_count <int>,
## # retweet_statuses_count <int>, retweet_location <chr>,
## # retweet_description <chr>, retweet_verified <lgl>, place_url <chr>,
## # place_name <chr>, place_full_name <chr>, place_type <chr>,
## # country <chr>, country_code <chr>, geo_coords <list>,
## # coords_coords <list>, bbox_coords <list>
dim(albon)
## [1] 2100 68
# We could also use the following but I wanted to see the tweets from Chris Albon
# albon1 <- rtweet::search_tweets(q = "machinelearningflashcards.com",
# include_rts = FALSE,
# retryonratelimit = TRUE)
We now have a tibble with 2,100 observations and 68 columns. Now let's look at the actual tweets. We will be looking at the text
column, as it holds the text of each tweet.
# Text of tweet
albon[, "text"]
## # A tibble: 2,100 x 1
## text
## <chr>
## 1 @chrisalbon ..... for the love of god.. we tried, but the client escalated
## 2 "@chrisalbon \xf0\u009f\u0091\u008d\xf0\u009f\u008f<U+00BE> my company has a code
## 3 Wise man https://t.co/eGpk9gjuZ9
## 4 Been there. Done that. https://t.co/FNd21DZD8y
## 5 "When you have to work from your significant others old bedroom \xf0\u009f
## 6 @chrisalbon Been there. Done that.
## 7 "@chrisalbon Too late \xf0\u009f\u0098\u008a"
## 8 @chrisalbon @MystyVander We deployed last night tho
## 9 @chrisalbon Pelican. Hugo is pretty nice too, but I only use it with Rs bl
## 10 or deploy and artfully walk away from that spouse & racist uncle https:
## # ... with 2,090 more rows
We are interested in the tweets that have images and the URL machinelearningflashcards.com
. You will notice that all the images / links on Twitter are shortened with the prefix “https://t.co/”. After some observation, I found out that Twitter shortens the website machinelearningflashcards.com
to https://t.co/eZ2bbpDzwV
. So, let's use this link as our pattern and find the tweets that contain it. We are going to use the grep
function to find the pattern in the text
column.
pattern <- "https://t.co/eZ2bbpDzwV"
# Create a new dataframe with only text as the column
machine_learning <- albon[grep(pattern, albon[,"text"] %>% .$text), "text"]
machine_learning
## # A tibble: 16 x 1
## text
## <chr>
## 1 SVC Radial Basis Function Kernel https://t.co/eZ2bbpDzwV https://t.co/l8Zhh
## 2 Standardization https://t.co/eZ2bbpDzwV https://t.co/sfZ4tOamRv
## 3 Adjusted R-Squared https://t.co/eZ2bbpDzwV https://t.co/fNzk1xC8Pn
## 4 Weak Learners https://t.co/eZ2bbpDzwV https://t.co/D0LSHzlJ3m
## 5 Total Sum-Of-Squares https://t.co/eZ2bbpDzwV https://t.co/ROQxeKKEbb
## 6 Sigmoid Activation Function https://t.co/eZ2bbpDzwV https://t.co/HW3haErLxn
## 7 Boosting https://t.co/eZ2bbpDzwV https://t.co/4X3NOqLuKT
## 8 Interaction Term https://t.co/eZ2bbpDzwV https://t.co/8fokl8KJfh
## 9 Hinge Loss https://t.co/eZ2bbpDzwV https://t.co/C0gFuRQnt6
## 10 One-Hot Encoding https://t.co/eZ2bbpDzwV https://t.co/jd2yOf8p5c
## 11 Issues With Platt Scaling https://t.co/eZ2bbpDzwV https://t.co/ziGuhNBycz
## 12 Interpolation https://t.co/eZ2bbpDzwV https://t.co/qZzIZIdyNx
## 13 Determinants https://t.co/eZ2bbpDzwV https://t.co/jTABNspxZz
## 14 Standard Deviation https://t.co/eZ2bbpDzwV https://t.co/Kf4YBHcbV3
## 15 Manhattan Distance https://t.co/eZ2bbpDzwV https://t.co/S3IahqLsBz
## 16 Notation 4 https://t.co/eZ2bbpDzwV https://t.co/NZsUMwGGr5
After this filtering, we have found 16 tweets that contain the machine learning terminology, the website link, and the flash card image link. As you can see in the tweets above, the first link is the website link and the last link is the image link.
If we try to download from the image link as it is, R will download an HTML document instead of the image. We will have to process these links a little so that we can download all the images directly from R.
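A quick aside on the filtering step above (this variant is my own, not from the original post): the t.co pattern contains a ".", which grep treats as a regular-expression wildcard. It makes no practical difference for this link, but if you prefer to match the pattern literally, grepl with fixed = TRUE gives an equivalent filter:
# Equivalent filter that treats the t.co link as a literal string rather than a regex
machine_learning <- albon[grepl(pattern, albon$text, fixed = TRUE), "text"]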
Separate terminology and image links
# Flash Card URL as url column (the last token of each tweet is the image link)
machine_learning$url <- lapply(
  1:nrow(machine_learning),
  FUN = function(x)
    tail(strsplit(machine_learning$text[x], split = " ")[[1]], n = 1)
) %>%
  as.character()
# Machine learning terminology as name column (drop everything from the first link onwards)
machine_learning$name <- lapply(
  1:nrow(machine_learning),
  function(x)
    gsub(x = machine_learning$text[x], pattern = "https://.+", replacement = "")
) %>%
  as.character()
machine_learning[, c("url", "name")]
## # A tibble: 16 x 2
## url name
## <chr> <chr>
## 1 https://t.co/l8ZhhzAytA SVC Radial Basis Function Kernel
## 2 https://t.co/sfZ4tOamRv Standardization
## 3 https://t.co/fNzk1xC8Pn Adjusted R-Squared
## 4 https://t.co/D0LSHzlJ3m Weak Learners
## 5 https://t.co/ROQxeKKEbb Total Sum-Of-Squares
## 6 https://t.co/HW3haErLxn Sigmoid Activation Function
## 7 https://t.co/4X3NOqLuKT Boosting
## 8 https://t.co/8fokl8KJfh Interaction Term
## 9 https://t.co/C0gFuRQnt6 Hinge Loss
## 10 https://t.co/jd2yOf8p5c One-Hot Encoding
## 11 https://t.co/ziGuhNBycz Issues With Platt Scaling
## 12 https://t.co/qZzIZIdyNx Interpolation
## 13 https://t.co/jTABNspxZz Determinants
## 14 https://t.co/Kf4YBHcbV3 Standard Deviation
## 15 https://t.co/S3IahqLsBz Manhattan Distance
## 16 https://t.co/NZsUMwGGr5 Notation 4
We have now separated the terminology and the image links. When we save the images, we will use the text in the name
column as the file name for each flash card.
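One small detail worth noting (this clean-up is my addition): because the gsub() call above removes everything from “https://” onwards, each value in name keeps a trailing space, which would otherwise end up in the file names. Trimming it first keeps the file names tidy:
# Optional clean-up: drop the trailing space left behind by the gsub() above
machine_learning$name <- trimws(machine_learning$name)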
Using the rvest package to extract info from the links
While we have separated out the link that points to each image, the actual link that can be used to download the image needs to be extracted from the HTML of the page behind it.
Let's use the link https://t.co/l8ZhhzAytA
from the data frame above to work through the extraction.
url <- "https://t.co/l8ZhhzAytA"
url %>%
read_html() %>%
rvest::html_nodes('div.js-adaptive-photo') %>%
as.character()
## [1] "<div class=\"AdaptiveMedia-photoContainer js-adaptive-photo \" data-image-url=\"https://pbs.twimg.com/media/DRg-8nNUMAAh7_g.png\" data-element-context=\"platform_photo_card\" style=\"background-color:rgba(64,47,59,1.0);\" data-dominant-color=\"[64,47,59]\">\n <img data-aria-label-part src=\"https://pbs.twimg.com/media/DRg-8nNUMAAh7_g.png\" alt=\"\" style=\"width: 100%; top: -0px;\">\n</div>"
From the output above, we can see the link https://pbs.twimg.com/media/DRg-8nNUMAAh7_g.png
. This is the link we will use to download the flash card. We want to process the HTML cleanly so that we can use the same approach for all 16 cards. Let's extract only the image link.
url %>%
read_html() %>%
rvest::html_nodes('div.js-adaptive-photo') %>%
as.character() %>%
strsplit(split = "\\ ") %>%
unlist()
## [1] "<div"
## [2] "class=\"AdaptiveMedia-photoContainer"
## [3] "js-adaptive-photo"
## [4] "\""
## [5] "data-image-url=\"https://pbs.twimg.com/media/DRg-8nNUMAAh7_g.png\""
## [6] "data-element-context=\"platform_photo_card\""
## [7] "style=\"background-color:rgba(64,47,59,1.0);\""
## [8] "data-dominant-color=\"[64,47,59]\">\n"
## [9] ""
## [10] "<img"
## [11] "data-aria-label-part"
## [12] "src=\"https://pbs.twimg.com/media/DRg-8nNUMAAh7_g.png\""
## [13] "alt=\"\""
## [14] "style=\"width:"
## [15] "100%;"
## [16] "top:"
## [17] "-0px;\">\n</div>"
From the output above, we can see that two elements, 5
and 12
, contain the image download link that we are interested in. I am going to use the 5th element to extract the link. The complete code for the URL we chose is as follows:
url %>%
read_html() %>%
rvest::html_nodes('div.js-adaptive-photo') %>%
as.character() %>%
strsplit(split = "\\ ") %>%
unlist() %>%
.[5] %>%
gsub(pattern = "data-image-url", replacement = "") %>%
gsub(pattern = "\\=", replacement = "") %>%
gsub(pattern = '\"', replacement = "")
## [1] "https://pbs.twimg.com/media/DRg-8nNUMAAh7_g.png"
We got the link that we were looking for.
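Before generalizing, we could check that the extracted link really gives us the image by downloading this one card. This is a quick sketch of my own; the file name is just an example:
# Download a single flash card to verify the extracted link works
img_url <- "https://pbs.twimg.com/media/DRg-8nNUMAAh7_g.png"
download.file(img_url, destfile = "svc_radial_basis_function_kernel.png", mode = "wb")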
Extracting image links for all elements
We figured out how to extract the link for one instance. Let's use the R sapply
function to generalize the above procedure and write code to download all 16 images.
# Make sure the output folder exists before downloading
dir.create("machine_learning", showWarnings = FALSE)
# Use rvest to process each t.co link and download the flash card image
sapply(1:nrow(machine_learning), function(x)
  machine_learning$url[x] %>%
    as.character() %>%
    # Read the page behind the t.co link
    xml2::read_html() %>%
    rvest::html_nodes('div.js-adaptive-photo') %>%
    as.character() %>%
    strsplit(split = "\\ ") %>%
    unlist() %>%
    .[5] %>%
    gsub(pattern = "data-image-url", replacement = "") %>%
    gsub(pattern = "\\=", replacement = "") %>%
    gsub(pattern = '\"', replacement = "") %>%
    download.file(destfile = paste0(
      "machine_learning/", machine_learning$name[x], ".png"
    ), mode = 'wb'))
I will have to do some more research to find the earlier flash cards; he might have posted those images without the link to his website.
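One possible starting point, sketched below but untested here, is to pull tweets directly from his timeline with rtweet's get_timeline() instead of searching, and then look for the flash card site in the urls_expanded_url column we saw in the tibble earlier:
# A possible follow-up: pull Chris Albon's own timeline (the API caps this at roughly 3,200 tweets)
albon_timeline <- rtweet::get_timeline("chrisalbon", n = 3200)
# Keep tweets whose expanded URLs mention the flash card site
has_card <- sapply(albon_timeline$urls_expanded_url,
                   function(u) any(grepl("machinelearningflashcards", u), na.rm = TRUE))
cards <- albon_timeline[has_card, ]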