Objectivity analysis of Rubio's Iowa speech

By special request, here is a topical breakdown of Rubio's Feb 1, 2016 "Iowa" speech, delivered after his surprising showing at the Iowa caucuses, using the topic model trained on GOP debate speeches.

Given the small sample size, some of the topics are a bit out of whack, but I cleaned things up as much as I could and tried to focus on longer statements with strong topic representation. Below are some of the more significant ones.

I also used a simple, off-the-shelf sentiment analyzer that measures the objectivity of a sentence on a purely lexical basis -- e.g., "I think" is less objective than "I know." It isn't weighted purely by verb mood as in that example, but it's not terribly sophisticated either.

Clearly, objectivity is not the same as substance -- it's easy to make long, sweeping objective statements that have low topical substance. Perhaps there is a way to create an overall "substance" index, weighted by lexical objectivity and some metric of intra-topic coverage, to provide a more semantically meaningful measure of a statement. Indeed, at first glance, "valueless" statements appear to have many more uniformly weak topics, rather than a few strong signals -- suggesting some sort of topic cohesion measure. Food for thought...
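
To make that concrete, here's a minimal sketch of what such a "substance" index might look like. The entropy-based cohesion term is just one possible choice, and the objectivity score leans on the pattern library from CLiPS (the same research group whose packages I used in the debate analysis below) -- treat all of it as a thought experiment, not a finished design:

```python
from math import log

from pattern.en import sentiment  # returns (polarity, subjectivity)

def cohesion(topic_dist):
    """~1.0 when mass sits in a few strong topics, ~0.0 when spread uniformly."""
    probs = [p for p in topic_dist if p > 0]
    entropy = -sum(p * log(p) for p in probs)
    return 1.0 - entropy / log(len(topic_dist))  # normalize by uniform entropy

def substance(statement, topic_dist):
    """Lexical objectivity weighted by how concentrated the topic signal is."""
    polarity, subjectivity = sentiment(statement)
    return (1.0 - subjectivity) * cohesion(topic_dist)

# The same statement scores lower when its topic mass is spread thinly:
stmt = "When my parents first arrived here in this country..."
print(substance(stmt, [0.81] + [0.01] * 19))  # a few strong topics -> higher
print(substance(stmt, [0.05] * 20))           # uniformly weak topics -> 0.0
```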

Anyway, here are the most significant phrases in Rubio's Iowa speech, by overall lexical objectivity. The top bar chart shows each overall topic and its median objectivity; the table below shows the actual statements and their topic distributions in the form of a pie chart. The color of each slice represents the topic (you can hover over it for the value), and the size of the slice is proportional to its objectivity.

For example, the first statement in the table, "I thank my Lord and Savior Jesus Christ...", has an objectivity score of zero. Scroll to the bottom of the table and you will see that the statement "When my parents first arrived here in this country..." scores as "purely" objective, with a score of 1.

If you scroll the bar chart, you can see which topics are generally more or less objective than others. There are some interesting things in there.


It's hard to make this work well on a mobile browser, but I tried.

Semantic analysis of GOP debates

A few weeks ago, I played around with extracting key phrases from the GOP debates, using statistical measures to identify significant groups of words spoken by the candidates. Some of the phrases were pretty good, but they lacked a clear topical direction; a purely lexical approach fails to exploit similarities between phrases such as "islamic_extremism" and "islamic_radicals," among other deficiencies.

This time, I pushed the GOP debate transcripts through Mallet, a popular implementation of LDA topic modeling written in Java by Andrew McCallum at the University of Massachusetts Amherst and made available under the Common Public License.

I still had lists of "key phrases" from the last GOP analysis I did, so with a simple search and replace I was able to turn multi-word phrases into single tokens by replacing spaces with underscores -- that is, "islamic terror" becomes the single word "islamic_terror" -- prior to vectorizing and training the topic model. This ensured that key phrases wouldn't get split apart by the model.
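
The preprocessing amounts to something like the following sketch (the phrase list here is illustrative; mine came from the earlier analysis):

```python
import re

# Hypothetical sample of key phrases carried over from the earlier analysis.
key_phrases = ["islamic terror", "radical islamic", "second amendment"]

def merge_phrases(text, phrases):
    """Rewrite each multi-word phrase as one underscore-joined token so the
    vectorizer treats it as a single word."""
    for phrase in sorted(phrases, key=len, reverse=True):  # longest first
        text = re.sub(re.escape(phrase), phrase.replace(" ", "_"),
                      text, flags=re.IGNORECASE)
    return text

print(merge_phrases("We must confront Islamic terror.", key_phrases))
# -> We must confront islamic_terror.
```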

I eyeballed the topic model results and settled on a 20-topic model. I didn't perform any sophisticated quantitative measurements -- with such a small corpus, it was fairly easy to judge topic cohesion from the top words at the end of each sampling run and adjust accordingly.
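
For reference, the Mallet workflow boils down to two commands; here it is wrapped in Python, with made-up file names (the flags themselves are standard Mallet 2.x options):

```python
import os
import subprocess

MALLET = os.path.expanduser("~/mallet-2.0.8/bin/mallet")  # hypothetical path

# Import the underscore-merged statements, one document per line.
subprocess.run([MALLET, "import-file",
                "--input", "debate_statements.txt",
                "--output", "debates.mallet",
                "--keep-sequence",             # required for topic training
                "--remove-stopwords"], check=True)

# Train the 20-topic model and dump per-document topic proportions.
subprocess.run([MALLET, "train-topics",
                "--input", "debates.mallet",
                "--num-topics", "20",
                "--num-iterations", "1000",
                "--optimize-interval", "10",   # learn asymmetric topic priors
                "--output-topic-keys", "topic_keys.txt",
                "--output-doc-topics", "doc_topics.txt"], check=True)
```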

With the trained model, I was able to assign topic proportions across all candidate statements; for example, Trump speaks more about "winning elections, campaigning, Hillary Clinton" than any other topic, whereas Bush talks most about "country, people, family" and Cruz's largest topic is "borders, immigration, people."
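
Rolling the per-statement proportions up to the candidate level is a quick aggregation over Mallet's doc-topics output. This sketch assumes the newer one-column-per-topic output format and a hypothetical file-naming scheme that encodes the speaker:

```python
import pandas as pd

# doc_topics.txt: doc index, doc name, then one proportion column per topic.
cols = ["doc", "name"] + [f"topic_{i}" for i in range(20)]
df = pd.read_csv("doc_topics.txt", sep="\t", comment="#", names=cols)

# Hypothetical naming convention, e.g. "trump_03.txt" -> "trump".
df["candidate"] = df["name"].str.extract(r"([a-z]+)_", expand=False)

# Mean topic proportion per candidate, then each candidate's dominant topic.
by_candidate = df.groupby("candidate")[cols[2:]].mean()
print(by_candidate.idxmax(axis=1))  # e.g. trump -> the "winning elections" topic
```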

Using off-the-shelf lexicons of subjective adjectives and modifiers, as well as open-source research packages from the Computational Linguistics and Psycholinguistics Research Center (CLiPS), I took the additional step of parsing statements by part of speech and calculating subjectivity, positivity, and specificity as pure heuristics based on the presence of pre-tagged lexical terms. (A code sketch follows the three definitions below.)

Objectivity
Objectivity is measured as the proportion of parts of speech that carry subjective or objective mood, plotted between 0.0 and 1.0. An example of an objective statement is "I'm going to talk about my record" (1.0); a subjective statement is "I think it's great that businesses start a 401k" (0.0). You can play with these values to select the most (or least) objective statements made by a candidate on any topic. (Both of those are actual statements from the debates, by the way.)

Positivity
This is measured by searching for the presence of certain key words from a lexicon (e.g., "good" or "bad") with polarity values between -1.0 (negative) and 1.0 (positive). This should be somewhat self-evident. Note that lexical polarity is not necessarily a measurement of quality: "I hate cancer" is a lexically negative statement, but one with positive semantic overtones that are lost in this kind of simple analysis. That's another fun problem for another day.

Specificity
Measured as a value between -1.0 (non-specific) and 1.0 (specific), the specificity of a statement depends on the grammatical mood of auxiliary verbs and adverbs that imply certainty or ambiguity (e.g., "definitely," "maybe," or the use of subjunctive verbs such as "wish").
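
For the curious, all three heuristics can be approximated with CLiPS's pattern package. This is my best guess at the relevant calls; the mapping of pattern's scores onto my three metrics is an assumption rather than a spec:

```python
from pattern.en import parse, Sentence, sentiment, modality

def scores(statement):
    """Lexical heuristics: positivity, objectivity, and specificity."""
    polarity, subjectivity = sentiment(statement)  # lexicon-based
    sent = Sentence(parse(statement, lemmata=True))
    return {
        "positivity": polarity,             # -1.0 negative .. 1.0 positive
        "objectivity": 1.0 - subjectivity,  #  0.0 subjective .. 1.0 objective
        "specificity": modality(sent),      # -1.0 uncertain .. 1.0 certain
    }

print(scores("I'm going to talk about my record."))
print(scores("I think it's great that businesses start a 401k."))
print(scores("Maybe we could win. I wish."))  # hedged auxiliaries drag modality down
```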

Up next
I think next I'm going to run a similar analysis on the Democratic debates. Hopefully, we'll soon have primary election results, and perhaps we can correlate some of these debate features with the outcomes. Also, I had a good time playing with modality/subjectivity scores, so I think I'm going to dive into that a bit.

As always, I'll put the code for this up on GitHub as soon as I get it cleaned up enough to have company. Feel free to follow that repo so you can get updated when I finally push.

Key Phrases From the GOP Debates

For this project, I started with the transcripts of the first six GOP debates (including moderators) and split the statements by speaker and debate occurrence. I then used a statistical test to learn the significant phrases among the statements, and calculated an overall "strength of signal" index based on the information gain of each phrase relative to the entire debate performance.
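
As one illustration of the kind of test involved: NLTK ships collocation measures (the log-likelihood ratio among them) that surface word pairs occurring together far more often than chance. I'm using it here as a stand-in; my actual scoring differed in the details:

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Hypothetical input file: one speaker's debate statements, concatenated.
tokens = open("trump_statements.txt").read().lower().split()

finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(3)  # drop phrases seen fewer than 3 times

# Log-likelihood ratio: how much more often the pair co-occurs than chance
# would predict; higher scores mean stronger, more "significant" phrases.
for phrase, score in finder.score_ngrams(BigramAssocMeasures.likelihood_ratio)[:10]:
    print(" ".join(phrase), round(score, 1))
```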

From this, I was able to present the significant phrases grouped by speaker and rank each phrase's overall importance to the candidate's message. Selecting a phrase from the bubble chart shows the specific context in which the speaker used it, and displays a histogram of the overall sum of signal strength per candidate.

The large field of candidates in the GOP debates made them a good starting point for experimenting with data mining of political debates; stay tuned for a similar analysis of the Democratic debates, and for further topical analysis of the debate performances. There is a lot of overlap among phrases ("radical_islamic" and "islamic_terrorism," for example); grouping phrases by topic should help identify similarity of message between candidates.