• Published: 19 Jan 2016

  • Filed under: r-english, datascience

South Carolina Republican Debate with R

Continuing the series analyzing the Republican debates, the latest one in South Carolina confirms a few of the trends I’ve been observing, mainly that:

For those following along, this is the new URL ending:

south_carolina <- "111395"

To save some precious real estate, I’ve already done the work explained in the previous posts and built the data.frame with the debate. (You can see the rough scripts here)

Some word-clouds

I’ll start with the usual word-clouds. In this case, it’s actually interesting to see how the “war” between Cruz and Trump has been playing out.

First we see an overall word-cloud (all the debates) for Mr. Cruz:

Ted Cruz all debates wordcloud

But in this last debate, the Donald made it into Ted’s most frequent words:

Ted Cruz all debates wordcloud

Trump, on the other hand, barely flinched. China was at the top of his agenda:

Donald Trump south carolina debate word cloud

Debate-cloud

As for the entire debate, the most frequent words were “people” (intuitively makes sense) and “president”.

# South Carolina debate only: one string per message
scarolina <- apply(subset(all_debates, debate == "SouthCarolina")['message'],
                    1,
                    paste)
debate_cloud <- rquery.wordcloud(scarolina, 
    "text", 
    max.words = 300,
    excludeWords = c("going","and",
                    "applause","get",
                    "got","let"))

Republican debate in South Carolina word cloud

Individual obsessions

It’s also pretty clear after a few minutes of watching the debate that the candidates have their own quirky obsessions and mostly resort to talking about them as much as they can.

First, a small function to give me the counts of words…

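library(stringi) # stri_count_regex()
library(dplyr)   # %>%, group_by(), summarise()

TopMentions <- function(x){
  df <- all_debates
  # collapse several keywords into one pattern ("gun|guns"); otherwise
  # stri_count_regex() recycles the patterns across the messages
  counts <- stri_count_regex(df$message, 
                            pattern = paste(x, collapse = "|"),
                            case_insensitive = TRUE)
  df$counts <- counts
  # total mentions per candidate
  df <- df %>% dplyr::group_by(person) %>%
                 dplyr::summarise("mentions" = sum(counts))
  df <- as.data.frame(df)
  # keep only candidates with at least one mention
  df <- subset(df, mentions>0)
  return(df)
}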
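As a self-contained illustration of the counting idea behind TopMentions, here is a minimal base-R sketch on a made-up transcript. The toy_debate data and the count_mentions helper are hypothetical and not part of the original scripts. The keywords are joined into a single alternation so that every message is checked against every keyword.

```r
# Toy transcript; the speakers and messages are made up for illustration.
toy_debate <- data.frame(
  person  = c("A", "B", "A", "B"),
  message = c("Guns and more guns.",
              "No comment on that.",
              "A gun is a gun.",
              "Gun rights matter."),
  stringsAsFactors = FALSE
)

# Hypothetical helper: count case-insensitive keyword matches per speaker.
count_mentions <- function(df, keywords) {
  # Join the keywords into one alternation, e.g. "gun|guns".
  pattern <- paste(keywords, collapse = "|")
  hits <- gregexpr(pattern, df$message, ignore.case = TRUE)
  # gregexpr() returns -1 when nothing matches, so count positive positions only.
  df$counts <- vapply(hits, function(h) sum(h > 0), integer(1))
  totals <- aggregate(counts ~ person, data = df, FUN = sum)
  names(totals)[2] <- "mentions"
  # Keep only speakers with at least one mention.
  totals[totals$mentions > 0, , drop = FALSE]
}

mentions <- count_mentions(toy_debate, c("gun", "guns"))
print(mentions)
```

Note that this is plain substring matching, so "gun" would also be counted inside a word such as "begun". Wrapping the pattern in word boundaries (\\b) is one possible refinement.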

Let’s see who likes to talk the most about guns…

# guns 
w <- c("gun","guns")
ggplot(order_axis(data = TopMentions(w),
                  axis = person, 
                  column = mentions), 
        aes(x = person_o, 
            y = mentions)) + 
        geom_bar(stat = "identity", 
                 fill = eem_colors[1]) +
        theme_eem() +
        labs(x = "Person", y = "Mentions", 
        title = paste0("Top mentions of ", paste(w, collapse = ", ")))

Top contenders mentioning guns
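The order_axis helper used in these plots is not defined in this post. Assuming its job is to add an ordered copy of the axis column (person_o) so the bars are sorted by mentions, a rough base-R equivalent could look like the sketch below. The toy_mentions data is made up, and the descending direction is my assumption, not necessarily what order_axis does.

```r
# Made-up counts, for illustration only.
toy_mentions <- data.frame(
  person   = c("A", "B", "C"),
  mentions = c(3, 10, 6),
  stringsAsFactors = FALSE
)

# Build a factor whose levels follow descending mentions; ggplot2 draws
# discrete axes in level order, so the bars would come out sorted.
level_order <- toy_mentions$person[order(-toy_mentions$mentions)]
toy_mentions$person_o <- factor(toy_mentions$person, levels = level_order)

print(levels(toy_mentions$person_o))
```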

I think we know who is gonna win here…

# mexico
w <- "mexico"
ggplot(order_axis(data = TopMentions(w),
                  axis = person, 
                  column = mentions), 
        aes(x = person_o, 
            y = mentions)) + 
        geom_bar(stat = "identity", 
                 fill = eem_colors[1]) +
        theme_eem() +
        labs(x = "Person", y = "Mentions", 
        title = paste0("Top mentions of ",w))

Top contenders mentioning mexico

Republicans are also characteristically strong on military-speak, so let’s see who likes this subject the most…

# isis, military
w <- c("isis","military")
ggplot(order_axis(data = TopMentions(w),
                  axis = person, 
                  column = mentions), 
        aes(x = person_o, 
            y = mentions)) + 
        geom_bar(stat = "identity", 
                 fill = eem_colors[1]) +
        theme_eem() +
        labs(x = "Person", y = "Mentions", 
        title = paste0("Top mentions of ", paste(w, collapse = ", ")))

Top contenders mentioning isis

On the economic front, a natural contender pops up…

# taxes
w <- c("taxes")
ggplot(order_axis(data = TopMentions(w),
                  axis = person, 
                  column = mentions), 
        aes(x = person_o, 
            y = mentions)) + 
        geom_bar(stat = "identity", 
                 fill = eem_colors[1]) +
        theme_eem() +
        labs(x = "Person", y = "Mentions", 
        title = paste0("Top mentions of ",w))

Top contenders mentioning taxes

And finally… this is a surprise!

# muslim
w <- "muslim"
ggplot(order_axis(data = TopMentions(w),
                  axis = person, 
                  column = mentions), 
        aes(x = person_o, 
            y = mentions)) + 
        geom_bar(stat = "identity", 
                 fill = eem_colors[1]) +
        theme_eem() +
        labs(x = "Person", y = "Mentions", 
        title = paste0("Top mentions of ",w))

Top contenders mentioning muslim

I guess it’s worth mentioning a caveat here: this has nothing to do with sentiment. Mr. Bush is arguably much nicer in tone than Trump…

Also, some candidates prefer another term…

w <- "islamic"
ggplot(order_axis(data = TopMentions(w),
                  axis = person, 
                  column = mentions), 
        aes(x = person_o, 
            y = mentions)) + 
        geom_bar(stat = "identity", 
                 fill = eem_colors[1]) +
        theme_eem() +
        labs(x = "Person", y = "Mentions", 
        title = paste0("Top mentions of ",w))

Top contenders mentioning islamic

Aggregate stats

Let’s see what the aggregate stats have to say about this new debate…

candidates <- c("TRUMP","CARSON","RUBIO",
                "KASICH","CRUZ","BUSH",
                "FIORINA","PAUL","CHRISTIE")
info <- NULL
info_all <- NULL
for(i in seq_along(candidates)){
  info <- UnlistAndExtractInfo(candidates[i])
  info$CANDIDATE <- candidates[i]
  info_all <- rbind(info_all, info)
}

# I'm going to add a few more columns... (%<>% comes from magrittr)
info_all %<>% mutate(carry_over_p1 = unique_words_repeated_fromfirst/words_unique,
                     word_repeat = words_total/words_unique)
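To make the two ratios concrete, here is a small worked example on a made-up sentence. It is only a sketch: the tokenization rule is my own simplification, and I am assuming that unique_words_repeated_fromfirst counts the unique words of a debate that the candidate also used in the first debate. The first_debate vocabulary below is hypothetical.

```r
# Made-up snippet of speech, purely for illustration.
speech <- "We will win. We will win big, and we will keep winning."

# Lower-case, split on anything that is not a letter or apostrophe, drop empties.
words <- unlist(strsplit(tolower(speech), "[^a-z']+"))
words <- words[nzchar(words)]

words_total  <- length(words)          # every word spoken
words_unique <- length(unique(words))  # distinct words only

# Average number of times each unique word is used.
word_repeat <- words_total / words_unique

# Hypothetical vocabulary from a first debate (assumption, see above).
first_debate <- c("we", "will", "win", "jobs")
carried <- length(intersect(unique(words), first_debate))

# Share of this debate's unique words that were carried over from the first one.
carry_over_p1 <- carried / words_unique

print(c(word_repeat = word_repeat, carry_over_p1 = carry_over_p1))
```

Here 12 words are spoken but only 7 are distinct, so each unique word is used about 1.7 times on average, and 3 of the 7 unique words also appear in the hypothetical first-debate vocabulary.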

Using this information to graph…

ggplot(subset(info_all, CANDIDATE != "CHRISTIE" & words_unique>90), 
       aes(x = words_total, 
           y = words_unique)) + 
    geom_point(aes(colour = CANDIDATE), size = 3, shape = 2) +
    stat_smooth()+
    theme_eem()+ # uses "eflores89/eem"
    scale_colour_eem(20) + # uses "eflores89/eem"
    labs(title = "Words per Debate",
         x = "Total Words", 
         y = "Unique Words")

unique words vs words by debate

I’m not an expert, but I would argue that the longer the debate format, the worse Trump is likely to do. Why?

There seems to be a declining curve once about 3,000 words are spoken. Whereas Ted Cruz can fluently speak about more things (he says more unique words), Trump seems to be struggling to find new words.

Of course, I’m assuming saying the same thing over and over again is bad for your campaign. Perhaps the formula is the other way around.

But again, let’s see this trend more clearly with another graph…

# average times unique word is repeated...
ggplot(subset(info_all, CANDIDATE != "CHRISTIE" & words_unique>90), 
       aes(x = factor(CANDIDATE), 
           y = word_repeat, 
           fill = eem_colors[1])) +
  geom_boxplot() +
  theme_eem()+
  labs(title = "Average repetition of unique words",
       x = "Candidate", 
       y = "Repetitions") + theme(legend.position = "none")

words per unique word

The trend continues. Trump repeats each unique word considerably more often than the other candidates.

Speed of Intervention

As in the Las Vegas debate, Trump’s words spoken per “intervention” (a stretch in which he speaks without interruption) were similar in South Carolina. As I said earlier, this can simply be due to the fact that he has to constantly play defense…

Kasich, on the other hand, delivers his remarks and follows the rules: he does not speak when not spoken to (which is most of the debate).

# order the debates...
info_all$debate <- factor(info_all$debate, 
                          levels = c("Ohio","California",
                                     "Colorado","Wisconsin",
                                     "Vegas", "SCarolina"))

# average length of interventions
ggplot(info_all, 
       aes(x = debate, 
           y = average_intervention, 
           group = CANDIDATE)) + 
  geom_path(aes(colour = CANDIDATE)) + 
  theme_eem() + 
  scale_colour_eem(20) + 
  labs(x = "Debate", 
       y = "Words", 
       title = "Average words per intervention")

words by intervention
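The average_intervention column comes from UnlistAndExtractInfo, which is not shown in this post. As a minimal sketch of how such a metric could be computed, assuming each transcript row is one uninterrupted turn by a speaker, the example below averages the word count per turn. The toy_turns data is made up.

```r
# Made-up transcript: one row per uninterrupted turn ("intervention").
toy_turns <- data.frame(
  person  = c("A", "B", "A", "B", "A"),
  message = c("Thank you very much.",
              "No.",
              "Let me finish my point please.",
              "I will be brief.",
              "Wrong."),
  stringsAsFactors = FALSE
)

# Count whitespace-separated words in each turn.
toy_turns$n_words <- lengths(strsplit(trimws(toy_turns$message), "\\s+"))

# Mean words per turn for each speaker.
avg_intervention <- aggregate(n_words ~ person, data = toy_turns, FUN = mean)
names(avg_intervention)[2] <- "average_intervention"

print(avg_intervention)
```

A speaker who is interrupted often or throws in short rebuttals gets many short turns, which pulls this average down.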

The Data

As usual, I’ve left this data openly available via GitHub, so anyone can play around with it and find a few more insights. Here is the link.

You can also contact me via Twitter with any questions: @eflores89, or via an issue on GitHub.