• Published: 07 Jan 2016

  • Filed under: r-english, datascience

More Republican debate analysis with R

A few weeks late, here is a follow-up analysis, using R, of the transcript of the latest Republican primary debate, held in Las Vegas, Nevada.

As in the previous post, it should be interesting to see some word-clouds and some trends for the front-runners (and, of course, Donald Trump).

Getting and cleaning the data

As in the last post, we’re going to import the data and clean it with a function that was nicely improved by Alan Jordan:

# some packages for scraping and cleaning the data
library(rvest)
library(plyr)
library(dplyr)
library(stringi)
library(magrittr)

# function to partially separate and clean into a data.frame a debate from the presidency project
MakeDebateDF<-function(df){
  newdf <- data.frame(
    person = apply(df, 
                   MARGIN = 1, 
                   function(x){
                     stri_extract_first_regex(x, 
                                              "[A-Z'-]+(?=(:\\s))")
                   }),
    message = apply(df, 
                    MARGIN = 1, 
                    function(x){
                      stri_replace_first_regex(x,
                                               "[A-Z'-]+:\\s+", 
                                               "")
                    }),
    stringsAsFactors=FALSE
  )
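  # paragraphs without a speaker label belong to the previous speaker
  # (seq_len(...)[-1] is empty for a one-row input, unlike 2:nrow(newdf))
  for (j in seq_len(nrow(newdf))[-1]) {
    if (is.na(newdf[j, 'person'])) {
      newdf[j, 'person'] <- newdf[j - 1, 'person']
    }
  }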

  return(newdf)
}
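
To see what the two regular expressions in MakeDebateDF do, here is a minimal sketch on a made-up transcript. The three sample paragraphs are invented for illustration (they are not taken from the debate), and the sketch assumes only that stringi is installed.

```r
library(stringi)

# toy transcript: invented lines, one element per paragraph
toy <- c("TRUMP: We will win. Believe me.",
         "And we will keep winning.",
         "BUSH: That is not a serious plan.")

# speaker label: a run of capitals (plus ' or -) directly before ": "
person <- stri_extract_first_regex(toy, "[A-Z'-]+(?=(:\\s))")

# message: the same text with the leading "NAME: " removed
msg <- stri_replace_first_regex(toy, "[A-Z'-]+:\\s+", "")

# paragraphs without a label belong to the previous speaker
for (j in seq_along(person)[-1]) {
  if (is.na(person[j])) person[j] <- person[j - 1]
}

toy_df <- data.frame(person = person, message = msg, stringsAsFactors = FALSE)
print(toy_df)
```

Note that the pattern is not anchored to the start of the paragraph, so an all-caps word followed by a colon in the middle of a sentence would also be picked up as a speaker.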

This time I’m only downloading one debate and joining it with the four I had already parsed…

# Importing debates --- 
# url for all debates
url <- "http://www.presidency.ucsb.edu/ws/index.php?pid="

### -------- debate in Las Vegas, Nevada (fifth debate)
lasvegas <- "111177"

debate_v <- read_html(paste0(url, lasvegas)) %>% 
  html_nodes("p") %>%
  html_text()

debate_v <- ldply(debate_v, rbind)
debate_v <- MakeDebateDF(debate_v)

Analyzing

Let’s join this data with the previous debates and see some stats and wordclouds…

# the last 4 debates were stored in "all_debates" object...
all_debates <- rbind(all_debates, 
                     debate_v)

Because he’s the most interesting to watch, let’s see what Trump says overall and in this debate…

library(ggplot2)
# this is for order_axis and theme_eem
# it can be downloaded using 
# devtools::install_github("eflores89/eem")
library(eem)
# all debates
trump_words <- apply(subset(all_debates, person == "TRUMP")['message'],
                    1,
                    paste)
# cloud
# rquery.wordcloud() must be defined before running this; the function is taken from:
# http://www.sthda.com/english/wiki/word-cloud-generator-in-r-one-killer-function-to-do-everything-you-need
trump_cloud <- rquery.wordcloud(trump_words, 
    "text", 
    max.words = 300,
    excludeWords = c("going","and",
                    "applause","get",
                    "got","let"))

trump_freq <- trump_cloud$freqTable

# debate in Las Vegas
trump_words_l <- apply(subset(debate_v, person == "TRUMP")['message'],
                    1,
                    paste)
trump_cloud_l <- rquery.wordcloud(trump_words_l, 
    "text", 
    max.words = 300,
    excludeWords = c("going","and",
                    "applause","get",
                    "got","let"))

trump_freq_l <- trump_cloud_l$freqTable

Overall word-cloud

Donald Trump all debates wordcloud

Las Vegas

Donald Trump las vegas debate word cloud

Shifts in speech

Of course, over these five debates, the topics have shifted tremendously, both for Trump and for the other contenders.

For example, let’s see what the most spoken words were by debate…

# using previous data for each debate....
# only the frequency table is kept from each call (a bit lazy to compute it myself!)
debate_words_h <- rquery.wordcloud(x = debate_h$message) # ohio, 1st
debate_words_h <- debate_words_h$freqTable %>% mutate(Debate = "Ohio")
debate_words_c <- rquery.wordcloud(x = debate_c$message) # cali, 2nd
debate_words_c <- debate_words_c$freqTable %>% mutate(Debate = "California")
debate_words_b <- rquery.wordcloud(x = debate_b$message) # boulder, 3rd
debate_words_b <- debate_words_b$freqTable %>% mutate(Debate = "Boulder")
debate_words_w <- rquery.wordcloud(x = debate_w$message) # wisc, 4th
debate_words_w <- debate_words_w$freqTable %>% mutate(Debate = "Wisconsin")
debate_words_v <- rquery.wordcloud(x = debate_v$message) # vegas, 5th
debate_words_v <- debate_words_v$freqTable %>% mutate(Debate = "LasVegas")

# join all
all_debate_words <- rbind.data.frame(debate_words_h, debate_words_c) 
all_debate_words <- rbind.data.frame(all_debate_words, debate_words_b) 
all_debate_words <- rbind.data.frame(all_debate_words, debate_words_w) 
all_debate_words <- rbind.data.frame(all_debate_words, debate_words_v) 

# graph with some interesting words...
interesting_words <- subset(all_debate_words, word %in% c("government",
                                              "isis","president","senator",
                                              "money", "jobs", "tax", "obama",
                                              "clinton", "america"))

interesting_words$Debate <- factor(interesting_words$Debate, 
                          levels = c("Ohio","California",
                                     "Boulder","Wisconsin",
                                     "LasVegas"))

ggplot(data = interesting_words, 
        aes(x = Debate, 
            y = freq, 
            group = word)) + 
        geom_line(aes(colour = word)) +
        theme_eem() +
        scale_colour_eem(20) + 
        labs(x = "Debate", 
             y = "Frequency", 
             title = "Shifts in speech")

Apparently, “tax” is out: it wasn’t even mentioned in this latest debate, in contrast with the increasingly present “isis”. “Clinton” and “obama” remain constant:

shifts in speech republican debates
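
For anyone who would rather not rely on the word-cloud function for the frequency tables, here is a minimal sketch of counting words directly. CountWords is a hypothetical helper (it does not exist in the original analysis), the sample messages are invented, and, unlike a text-mining pipeline, this sketch does no stop-word removal, so common words are kept in the counts.

```r
library(stringi)

# hypothetical helper: word frequencies for one debate, labelled with its name
CountWords <- function(messages, debate_name) {
  # split into words and lower-case them so "Taxes" and "taxes" count together
  words <- tolower(unlist(stri_extract_all_words(messages)))
  counts <- sort(table(words), decreasing = TRUE)
  data.frame(word = names(counts),
             freq = as.integer(counts),
             Debate = debate_name,
             stringsAsFactors = FALSE)
}

# invented messages standing in for a debate's `message` column
toy_messages <- c("Taxes are too high.", "We must cut taxes and create jobs.")
toy_freq <- CountWords(toy_messages, "Toy")
print(subset(toy_freq, word %in% c("taxes", "jobs")))
```

The result has the same word, freq and Debate columns as the tables joined above, so it could be passed to the same rbind and ggplot calls.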

Aggregate stats

Now let’s see some aggregate stats by contender.

This function is a bit confusing and/or unnecessary; I’ll probably find a better way to do this in the future…

UnlistAndExtractInfo <- function(candidate){
# this function is not general - it only applies to these particular debates...
# all the debates must be named the same in the parent env.
# for example: debate_h ...

allwords_1 <- tolower(unlist(
              stri_extract_all_words(
              apply(
              subset(debate_h, person == candidate)['message'],
                    1,
                    paste))))
allwords_2 <- tolower(unlist(
              stri_extract_all_words(
              apply(
              subset(debate_c, person == candidate)['message'],
                    1,
                    paste))))
allwords_3 <- tolower(unlist(
              stri_extract_all_words(
              apply(
              subset(debate_b, person == candidate)['message'],
                    1,
                    paste))))
allwords_4 <- tolower(unlist(
              stri_extract_all_words(
              apply(
              subset(debate_w, person == candidate)['message'],
                    1,
                    paste))))
allwords_5 <- tolower(unlist(
              stri_extract_all_words(
              apply(
              subset(debate_v, person == candidate)['message'],
                    1,
                    paste))))
df_insights <- data.frame(
debate = c("Ohio", "California", "Colorado", "Wisconsin","Vegas"),
average_intervention = c(mean(stri_count_words(
                        apply(
                          subset(debate_h, person == candidate)['message'],
                                  1,
                        paste))),
                        mean(stri_count_words(
                        apply(
                          subset(debate_c, person == candidate)['message'],
                                  1,
                        paste))),
                        mean(stri_count_words(
                        apply(
                          subset(debate_b, person == candidate)['message'],
                                  1,
                        paste))),
                        mean(stri_count_words(
                        apply(
                          subset(debate_w, person == candidate)['message'],
                                  1,
                        paste))),
                        mean(stri_count_words(
                        apply(
                          subset(debate_v, person == candidate)['message'], # fifth debate: Las Vegas
                                  1,
                        paste)))
                        ),
words_total = c(length(allwords_1),
                length(allwords_2),
                length(allwords_3),
                length(allwords_4),
                length(allwords_5)),
words_unique = c(length(unique(allwords_1)),
                 length(unique(allwords_2)),
                 length(unique(allwords_3)),
                 length(unique(allwords_4)),
                 length(unique(allwords_5))
                 ),
words_repeated_fromfirst = c(0, sum(allwords_2 %in% allwords_1), 
                            sum(allwords_3 %in% allwords_1),
                            sum(allwords_4 %in% allwords_1),
                            sum(allwords_5 %in% allwords_1)),
unique_words_repeated_fromfirst = c(0,
                            length(unique(allwords_2[allwords_2 %in% allwords_1])),
                            length(unique(allwords_3[allwords_3 %in% allwords_1])),
                            length(unique(allwords_4[allwords_4 %in% allwords_1])),
                            length(unique(allwords_5[allwords_5 %in% allwords_1]))
                            ),
words_repeated_fromsecond = c(0, 0, 
                            sum(allwords_3 %in% allwords_2),
                            sum(allwords_4 %in% allwords_2),
                            sum(allwords_5 %in% allwords_2)),
unique_words_repeated_fromsecond = c(0, 0,
                            length(unique(allwords_3[allwords_3 %in% allwords_2])),
                            length(unique(allwords_4[allwords_4 %in% allwords_2])),
                            length(unique(allwords_5[allwords_5 %in% allwords_2]))
                            ),
words_repeated_fromthird = c(0, 0, 0,
                            sum(allwords_4 %in% allwords_3),
                            sum(allwords_5 %in% allwords_3)),
unique_words_repeated_fromthird = c(0, 0, 0,
                            length(unique(allwords_4[allwords_4 %in% allwords_3])),
                            length(unique(allwords_5[allwords_5 %in% allwords_3]))
                            )
, stringsAsFactors = FALSE)
return(df_insights)
}
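
One possible simplification, as a minimal sketch rather than a drop-in replacement: pass the debates as a named list and loop over them, so each statistic is written once. DebateStats and the toy data are hypothetical, and only some of the columns (not the "fromsecond" and "fromthird" ones) are reproduced here.

```r
library(stringi)

# hypothetical generalisation: `debates` is a named list of data frames,
# each with `person` and `message` columns, in chronological order
DebateStats <- function(debates, candidate) {
  # lower-cased words spoken by the candidate in each debate
  words <- lapply(debates, function(d) {
    msgs <- d$message[d$person %in% candidate]
    tolower(unlist(stri_extract_all_words(msgs)))
  })
  first <- words[[1]]
  data.frame(
    debate = names(debates),
    average_intervention = sapply(debates, function(d) {
      mean(stri_count_words(d$message[d$person %in% candidate]))
    }),
    words_total = sapply(words, length),
    words_unique = sapply(words, function(w) length(unique(w))),
    unique_words_repeated_fromfirst = sapply(seq_along(words), function(k) {
      if (k == 1) 0 else length(unique(words[[k]][words[[k]] %in% first]))
    }),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

# invented two-debate example with speakers "A" and "B"
toy_debates <- list(
  First  = data.frame(person = c("A", "B", "A"),
                      message = c("We will win", "No you will not", "We win big"),
                      stringsAsFactors = FALSE),
  Second = data.frame(person = c("A", "B"),
                      message = c("We will build and win", "Taxes are high"),
                      stringsAsFactors = FALSE)
)
stats <- DebateStats(toy_debates, "A")
print(stats)
```

Because every statistic reads from the same list element, there is no way for one debate's column to pick up another debate's data frame by accident.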

# going to create a data frame with all the counts from the top candidates...
candidates <- c("TRUMP","CARSON","RUBIO",
                "KASICH","CRUZ","BUSH",
                "FIORINA","PAUL","CHRISTIE")
info <- NULL
info_all <- NULL
for(i in seq_along(candidates)){
info <- UnlistAndExtractInfo(candidates[i])
info$CANDIDATE <- candidates[i]
info_all <- rbind(info_all, info)
}

# i'm going to add a few more columns...
info_all %<>% mutate(carry_over_p1 = unique_words_repeated_fromfirst/words_unique,
                     word_repeat = words_total/words_unique)
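
The two derived columns can be checked by hand on a toy example. The word vectors below are invented and simply stand in for one candidate's first and latest debate; the variable names mirror the columns above.

```r
# invented word vectors for one candidate
first_debate  <- c("we", "will", "win", "we", "win", "big")
latest_debate <- c("we", "will", "build", "and", "win", "win")

# total and distinct words in the latest debate
words_total  <- length(latest_debate)
words_unique <- length(unique(latest_debate))

# distinct words in the latest debate that also appeared in the first one
carried_over <- length(unique(latest_debate[latest_debate %in% first_debate]))

# share of the latest debate's vocabulary already used in the first debate
carry_over_p1 <- carried_over / words_unique

# average number of times each distinct word is said
word_repeat <- words_total / words_unique

print(c(carry_over_p1 = carry_over_p1, word_repeat = word_repeat))
```

Here three of the five distinct words ("we", "will", "win") carry over, so carry_over_p1 is 3/5, and six words spread over five distinct ones give a word_repeat of 6/5.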

Using this information to graph…

# graph of the share of unique words carried over from the first debate, by candidate and debate
ggplot(order_axis(
  subset(info_all, debate != "Ohio" & CANDIDATE != "CHRISTIE"), # christie didn't go to wisconsin
    CANDIDATE, carry_over_p1), 
       aes(x = CANDIDATE_o, 
           y = carry_over_p1)) + 
  geom_bar(stat = "identity", 
           aes(fill = CANDIDATE_o)) + 
  facet_grid(debate ~.) + 
  theme_eem() +
  scale_fill_eem(20) + 
  labs(title = "Repetition of words by candidate", 
       x = "Candidate", 
       y = "Share of unique words repeated from first debate")

As the graph shows, Trump continues to lead in repetitiveness. In the latest debate, 44.8% of the unique words the Donald used had already appeared in his first debate, followed by 38% for Kasich and 36% for Bush.

This is a key metric Trump has been consistently winning…

repetitions_trump

Again, if we plot total words versus unique words to see how often each individual word is repeated, we find Mr. Trump consistently below the trend: he says each word many more times than the average candidate.

On the other hand, Carson and Fiorina tend to use a larger vocabulary.

ggplot(subset(info_all,CANDIDATE != "CHRISTIE"), 
       aes(x = words_total, 
           y = words_unique)) + 
    geom_point(aes(colour = CANDIDATE), size = 3, shape = 2) +
    stat_smooth()+
    theme_eem()+ # uses "eflores89/eem"
    scale_colour_eem(20) + # uses "eflores89/eem"
    labs(title = "Words per Debate",
         x = "Total Words", 
         y = "Unique Words")

unique words vs words by debate

Aggregating over all the debates shows this difference much more clearly:

# average times unique word is repeated...

ggplot(info_all, 
       aes(x = factor(CANDIDATE), 
           y = word_repeat, fill = eem_colors[1])) +
  geom_boxplot() +
  theme_eem()+
  labs(title = "Average repetition of unique words",
       x = "Candidate", 
       y = "Repetitions") + theme(legend.position = "none")

words per unique word

Speed of Intervention

This last debate also widened the gap between Trump and his opponents in terms of how brief his interventions are. Every time he talks, he says fewer words than the others, and this was even more apparent in Las Vegas…

# order the debates...
info_all$debate <- factor(info_all$debate, 
                          levels = c("Ohio","California",
                                     "Colorado","Wisconsin",
                                     "Vegas"))

# average length of interventions
ggplot(info_all, 
       aes(x = debate, 
           y = average_intervention, 
           group = CANDIDATE)) + 
  geom_path(aes(colour = CANDIDATE)) + 
  theme_eem() + 
  scale_colour_eem(20) + 
  labs(x = "Debate", 
       y = "Words", 
       title = "Average words per intervention")

words by intervention

This can also be an indication of how popular he is or how many “hits” he’s taking. When you need to counter an argument, sometimes a few words are enough. If you do this more often than the others, your average is bound to go down.
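
A quick illustration of that effect, using invented interventions (none of these lines come from the transcripts): adding two short rebuttals to one long answer cuts the average words per intervention by more than half.

```r
library(stringi)

# invented interventions: one long answer, then the same answer plus two rebuttals
long_only     <- c("We are going to rebuild the military and take care of our veterans")
with_rebuttal <- c(long_only, "Excuse me", "That is wrong")

# average words per intervention in each case
avg_long     <- mean(stri_count_words(long_only))
avg_rebuttal <- mean(stri_count_words(with_rebuttal))

print(c(long_only = avg_long, with_rebuttal = avg_rebuttal))
```

The total number of words barely changes between the two cases, which is why average_intervention alone cannot tell a terse speaker apart from one who is interrupted and responds often.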

The Data

As Alan Jordan suggested, I’ve left this data openly available via GitHub, so anyone can play around with it and find a few more insights. Here is the link.