
Election Day

Good morning. In less than an hour, volunteers from here will join others from around the United States. And you will be launching the largest GOTV effort in the history of mankind.

Mankind – that word should have new meaning for all of us today. We can’t be consumed by our petty differences anymore. We will be united in our common interests.

Perhaps it’s fate that today is the 8th of November, and you will once again be fighting for our freedom, not from tyranny, oppression, or persecution – well actually, you will be.

We’re fighting for our right to live, to exist.

And should we win the day, the 8th of November will no longer be known as an American election, but as the day when the world declared in one voice:

“We will not go quietly into the night! We will not vanish without a fight! We’re going to live on! We’re going to survive!”

Today, we celebrate our Election Day!

Tampa Update

things I’ve drastically improved my knowledge of:
git
shiny
rselenium
votebuilder
sql
hamilton lyrics
snapchat
coffee consumption

A Million Things I Haven't Done

So much for writing about the election this year. I just moved to Tampa to join Hillary Clinton and the Florida Democrats, where I’ll be working on the data and analytics team for the duration of the election. Volunteering as a fellow for HRC during the California primary will always be a highlight of my life, and I’m really thankful for all the great people I met.

But now comes the hardest part. I’ll be working with some really smart people on all sorts of projects; my primary focus will be aiding the organizers and ground game to maximize their impact here in Florida. I’m excited to use my growing data skills in the largest (and, according to recent polls, the most contested) battleground state with some amazingly gifted people. Posting R code and graphs will have to wait for a while. Wish me luck, I’m excited to learn and do all the things.

When times get hard, and they will, I have to remind myself of why I started all this. One of my first posts here a year ago was a little short-story joke I wrote about the Don, but it’s no longer funny. I take my politics very seriously, and I’m eager to do whatever it takes to put Hillary into office. This isn’t the post where I tell you why I support her (maybe another time), but I’ll briefly say that I am proud to stand with my fellow Democrats behind such a qualified candidate.

My boss gave me a survey to complete before I arrived: a list of questions about my professional identity and personality. I’ve been thinking a lot about what I want to do with my life, and it was refreshing to rethink some of my goals about what I want out of this campaign for myself. It’s really important to put my pride and ego aside and concentrate on my daily tasks, but I get that I’m supposed to grow as a person too. I’ll list some of those goals here to remind myself.

Professional goals: Master VAN/VoteBuilder, better understand the landscape of Democratic Party data infrastructure, learn how to do more data science things like random forests, practice SQL/Tableau, absorb all of Daniel Kreiss’s new book

Personal goals: Eat healthy, practice self-care, send postcards home, call mom/dad more, memorize the Hamilton soundtrack

Unlikely goals: Post more updates on Snapchat/Instagram, 2K MMR on DOTA, try more EDM music, sleep, date hahahahahaha okay let’s just stop here.

There’s so much left to learn, and I am so eager for this challenge. I won’t come home ’til the job is finished, and I hope to be a better version of myself by then. I hope this election will change me for the better, and I look forward to the grind. See you in November!

Data Perfectionism is the Enemy

Back in graduate school, my development economics professor assigned us a story by Jorge Luis Borges. In “Funes the Memorious”, the narrator meets Funes, a teenage prodigy who developed a perfect memory after a horse-riding accident. The boy could recall an entire day’s worth of memories, and would spend a whole day reliving his thoughts from the day before.

As we discussed the story, a student remarked, “Wow, what a cool ability!” But I recognized the lesson right away. The boy’s ability was a blessing yet a curse: his perfectionist need to get every single detail correct prevented him from thinking abstractly or broadly. Hence the relevance to a class on economics, as our professor was stressing the necessity of sacrificing a tiny bit of detail for broader policy validity. Or something like that; I assumed that was his point. Sorry, Jeff.

I’ve been working on aggregating all the primary results by congressional district, and it’s been increasingly frustrating to see the disparities between state reporting methods. Tuesday’s results in Pennsylvania are a fine example: the state only reported results by county.

I messaged the Green Papers, and they pointed me toward an AP press release for the Democrats that was much more helpful. To arrive at their estimated delegate counts, they told me, their method was “to go from county to CD – say, 30% of a county is in CDa, 60% in CDb and 10% in CDc: we take 30% of the county vote and apply it to CDa, 60% of the county vote and apply it to CDb, and 10% of the county vote and apply it to CDc. We did that for each county. We have found that the results much more usually end up pretty close to what the final delegate numbers per CD turn out to be”.

It should be good enough for me.
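
In R, that weighting scheme is only a few lines. Here’s a minimal sketch, assuming made-up county totals and county-to-CD shares (all the names and numbers below are hypothetical, not the AP’s):

library(dplyr)

# Hypothetical county totals: one row per county per candidate
county_votes <- data.frame(
  county    = c("Adams", "Adams"),
  candidate = c("Clinton", "Sanders"),
  votes     = c(3000, 2000)
)

# Hypothetical county-to-CD shares: the estimated fraction of each county in each CD
county_cd_shares <- data.frame(
  county = c("Adams", "Adams"),
  cd     = c("CD-04", "CD-13"),
  share  = c(0.3, 0.7)
)

# Apply each county's CD shares to its vote totals, then sum within each CD
cd_estimates <- county_votes %>%
  inner_join(county_cd_shares, by = "county") %>%
  mutate(weighted_votes = votes * share) %>%
  group_by(cd, candidate) %>%
  summarise(est_votes = sum(weighted_votes))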

Sigh. Okay, so there are 67 counties and 18 congressional districts in Pennsylvania. Some counties are located entirely within one district; some are split across more than one. But I now know that Precinct 1271 of Whitehall Dist 1 in Allegheny County, Pennsylvania (literally the smallest atomic unit of political geography in the United States) is actually split across two congressional districts.

I know this because I’m going to all 67 county websites, downloading their data directly, and filtering it into R. I figured I could determine which district a particular precinct belongs to by checking which congressional race it votes in. That’s when I realized that 1271 was voting for candidates and delegates in both the 14th and the 18th districts. When I checked 1272-1288 and the rest of Whitehall Dists 2-16, they were all in the 18th. You can double-check my results with Ctrl+F for “1271 Whitehall Dist 1” at this website.
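
Here’s a sketch of how that check works in R, on a toy version of one county’s results (the data frame and column names below are my own, not the county’s):

library(dplyr)

# Hypothetical slice of one county's results: one row per precinct
# per congressional race it reported votes in
allegheny <- data.frame(
  precinct = c("1271 Whitehall Dist 1", "1271 Whitehall Dist 1", "1272 Whitehall Dist 2"),
  cd       = c(14, 18, 18)
)

# A precinct voting in more than one congressional race is a split precinct
split_precincts <- allegheny %>%
  group_by(precinct) %>%
  summarise(n_districts = n_distinct(cd)) %>%
  filter(n_districts > 1)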

So that ONE particular precinct has a few people living in the 14th. Just to confirm, I called Allegheny County’s elections department this morning, and they told me that because of redistricting that occurred during even-numbered years (when congressional representatives are elected), a precinct such as 1271 may be split. It’s literally the only one in the county.

After doing more online digging, I finally found an updated spreadsheet matching precincts to districts on the state website. Remembering the lesson of Funes and grad school, I will just recode 1271 as part of the 18th district and move on. Chasing perfect granularity in election data is a Sisyphean task and an endless timesuck.
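
The recode itself is a one-liner, shown here on the same hypothetical data frame from the sketch above:

# Recode the lone split precinct as belonging entirely to the 18th, and move on
allegheny$cd[allegheny$precinct == "1271 Whitehall Dist 1"] <- 18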

I just thought I should write a disclaimer, because while I theoretically accept and understand what I just wrote, it’s still frustrating to know that the data out there is incomplete. Just learn to live with it, Tim.

Election Analysis, Pt 2: Web-scraping with rvest and gdata using Alabama

So I’ve been relying on the excellent Green Papers site for the majority of the data that I want. Here is what’s on their website, and it’s generally the most reliable source (you’ll notice that a lot of major media sites, like FiveThirtyEight, reference their work).

If I want to scrape this exact data frame as it is, I can use “rvest” to do so.

library(rvest)
al <- "http://www.thegreenpapers.com/P16/AL-R" # URL for the Alabama GOP primary page
al <- read_html(al)
al <- html_table(html_nodes(al, "table"), fill = TRUE)[7] # extract the 7th table, which holds the results
al <- data.frame(al)
al <- al[c(-1:-2, -12), ] # drop the extraneous header and total rows
al[2:13] <- apply(al[2:13], 2, function(x) as.numeric(gsub(",", "", x))) # delete thousands separators and convert to numeric, will be handy
names(al) <- c("CD", "Pop_Vote", "Qual_Vote", "Total_Del", "Trump_Pop", "Trump_Alloc", "Trump_Del", "Cruz_Pop", "Cruz_Alloc", "Cruz_Del", "Rubio_Pop", "Rubio_Alloc", "Rubio_Del") # rename columns
print(al)

So the data is ready…right? Sadly, I have a slight OCD complex with political data, and we have a much larger task. I want the complete data set, direct from the Alabama Secretary of State’s office, which offers a much more detailed breakdown: the rest of the candidates, precinct totals, and absentee vote percentages.

Fortunately the Green Papers provided a direct link to Alabama’s website. There are two different data sets available. One is labeled “Results Spreadsheet, Certified March 11 2016” and the other is a ZIP archive of Excel files titled “County-By-County Precinct Level Primary Election Results”.

This is what the first file looks like.

Scrolling down to row 75, we’ll see the totals divided by Congressional District. Jackpot, it has all of the candidates!

Notice that there are 66 more sheets that look the same, one for each county. They might be relevant for calculating county totals, but more on that later.

If you want to use this data, we can use “gdata” to read XLS files directly from a URL into our environment. Alternatively, you can download the Excel file into your own directory, open it, save it as a CSV, and read it back into R with a simple read.csv call. You may find that option simpler and faster than learning how to read an Excel document into R. However, the spirit of this blog is to maintain reproducibility as much as possible, and I heed Karl Broman’s guidelines on avoiding absolute paths.
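
For the record, the manual route would be a single call once you’ve saved the sheet by hand (the file name here is hypothetical, relative to your working directory):

# Read the hand-converted CSV; blank cells become NA
alabama_by_congress <- read.csv("alabama_gop_primary_2016.csv", na.strings = "")

Here’s the reproducible version instead: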

library(gdata)
library(tools)
library(stringr)
library(dplyr)

url <- "http://www.alabamavotes.gov/downloads/election/2016/primary/primaryResultsCertified-Republican-Spreadsheet-2016-03-11.xlsx"
alabama_by_congress <- read.xls(url, na.strings = "")
alabama_by_congress <- alabama_by_congress[1:3] # only need the first 3 columns
alabama_by_congress <- na.omit(alabama_by_congress[74:183, ]) # only need the presidential races; ignore House/Senate/etc. races
names(alabama_by_congress) <- c("Candidate", "Votes", "Percent") # rename the columns for ease
alabama_by_congress$State <- "Alabama" # a "State" column is useful when combining with other states later
alabama_by_congress$Votes <- as.numeric(gsub(",", "", alabama_by_congress$Votes)) # two steps in one: delete all comma separators, then convert to numeric
alabama_by_congress$Candidate <- toTitleCase(word(tolower(alabama_by_congress$Candidate), -1)) # three steps in one: "tolower" turns JEB BUSH into lower case, "word" (stringr) takes the last word ("bush"), and "toTitleCase" (tools) capitalizes it
alabama_by_congress <- alabama_by_congress %>%
  group_by(Candidate) %>%
  slice(1:7) %>%
  filter(Candidate != "Total_cd", Votes != "Votes")
# dplyr deserves its own tutorial, but group_by and slice parse this down further
alabama_by_congress$Percent <- as.numeric(sub("%", "", alabama_by_congress$Percent)) # delete the % sign, convert to numeric
alabama_by_congress$District <- str_pad(1:7, 2, "left", pad = "0") # create a "District" column; relevant for geocoding in later tutorials
alabama_by_congress <- alabama_by_congress[, c(4, 5, 1, 2, 3)] # reorder the columns so State comes first

Stop here if you’re satisfied! Part 3 will go into my OCD complex because (spoilers) this data actually isn’t the most up-to-date data available. Also, this spreadsheet only contains Republicans! We’ll be reading the ZIP archive in the next post. If you really don’t care, move along…