Clif Reeder

Vox Product and Developing With Production Data


As you may know, I’m currently a developer at Vox Media, and a proud member of their Product Team. This is the term we use to describe all of the designers, developers, community/product managers and ops folk that produce the platform that powers SB Nation, The Verge and the upcoming Polygon.

We’ve recently launched a blog to talk about the work we do. I happen to think it’s pretty cool, and you should check it out here.

I wrote one of the inaugural posts entitled Developing with Production Data, which gives a brief overview of a talk I gave at the DC Ruby Users Group in April. Since the post is pretty short, I’ll just say that you should check it out. If you are too lazy for that, the slides are here.

I’ll probably be writing more for the Vox Product blog, so expect to see more links back there in the future.

Google Code Jam 2012

This past weekend, I competed in Round 1 of Google Code Jam, a programming competition. Although it was my second time participating in GCJ, things didn’t really go much better than last year. I did reuse the tools I created last year to test solutions, which was very helpful.

I’ll talk a little about each round/problem to the best of my recollection, and as always, my (attempted) solutions can be found on GitHub here.

Notes From Rails Conf

This last week I attended Rails Conf 2012 in Austin, which was a welcome contrast to my SXSW experience. I found lots of talks that were interesting and applicable to problems and questions I have in day-to-day Rails development. If you are curious about specific content, I highly recommend checking out this wiki of presentation notes created by the New Haven Ruby group.


The other thing is that I bought an iPad recently and used the Paper app to take notes for a number of the talks. Although the process was a little slow, I’d like to try it with a stylus, and I expect to keep doing this in the future.

Here are some quick thoughts/notes about talks I found particularly interesting. I took these using the Paper app for iPad, which I quite enjoyed and would recommend.

SXSWi 2012 Retrospective

Having just returned from my first SXSW Interactive experience, it’s time to finally take a breather and reflect on a few things.

It’s a hard event to fully describe, just because of the sheer size of the conference and its many facets. To make things a little easier, I’m going to split this into talking about the panels, people, and parties.

SXSW 2012 by cloudioweb, on Flickr

Why I Use DuckDuckGo, and You Should Too

I use DuckDuckGo (DDG) because it allows me to work faster and be more productive. If you are a developer or spend a lot of time using Google search, I think it can help you too.

DDG does this by replacing GUI/mouse interactions with text directives. Assuming you can type faster than you can move and click a cursor, this is a big difference. To me, it’s like the difference between using Vim and a GUI-based text editor.

DDG calls this the bang syntax. What it allows you to do is pipe your search directly to another site’s search - for example, Amazon, Wikipedia, Google Image Search, or hundreds of other places.
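As a small illustration of what a bang query looks like, here is a sketch that builds a DDG search URL. The helper name `ddg_bang_url` is made up for this example, and it assumes `!w` is the bang for Wikipedia; the bang is simply text at the front of the query.

```ruby
require "uri"

# Hypothetical helper: build a DuckDuckGo URL whose query starts with a bang.
# DDG reads the bang and forwards the remaining terms to the target site.
def ddg_bang_url(bang, terms)
  query = URI.encode_www_form(q: "!#{bang} #{terms}")
  "https://duckduckgo.com/?#{query}"
end

puts ddg_bang_url("w", "Ann Arbor")
```

In practice you would just type `!w Ann Arbor` into the search box or your browser’s address bar, which is the whole point: no clicking through to Wikipedia first.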

Introducing Attribute Delegator

Today I created and released my first gem: attribute_delegator. It’s basically a small plugin for ActiveRecord that generates getter/setter methods that allow you to treat the attributes of a has_one relationship like they are native to another model. It also makes sure that any changes to the has_one model are seamlessly saved.
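To make the idea concrete, here is a minimal plain-Ruby sketch of attribute delegation. It is not the gem’s actual API: the `DelegatesAttributes` module, the `delegates_attributes` method, and the `Entry`/`VideoDetails` classes are all hypothetical stand-ins, and it leaves out the ActiveRecord saving behavior entirely.

```ruby
# Hypothetical mixin: defines readers and writers that forward to an
# associated object, so its attributes look native to the owner.
module DelegatesAttributes
  def delegates_attributes(*names, to:)
    names.each do |name|
      define_method(name) { public_send(to).public_send(name) }
      define_method("#{name}=") do |value|
        public_send(to).public_send("#{name}=", value)
      end
    end
  end
end

# Stand-in for the model on the other side of a has_one relationship.
VideoDetails = Struct.new(:embed_code, :duration)

class Entry
  extend DelegatesAttributes
  delegates_attributes :embed_code, :duration, to: :video_details

  # Stand-in for the has_one association; builds the record on first use.
  def video_details
    @video_details ||= VideoDetails.new
  end
end

entry = Entry.new
entry.duration = 90
puts entry.duration                # 90
puts entry.video_details.duration  # 90
```

A form that only knows about `Entry` can read and write `duration` without caring that the value lives on another object.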

The main code of the gem started off as a mixin in Vox Media’s CMS application. We have a few particularly large (column- and row-wise) STI tables, and wanted to limit their bloat. Specifically, we have a large ‘entries’ table, and many different subtypes of entry. Different subtypes sometimes require additional fields, but I wanted to avoid the classic problem of ending up with unused columns for certain subtypes.

Adding these subtype-specific fields to a different model is nothing new, but I also didn’t want to put complicated logic into our story editor (in essence, a massive form) to deal with these fields. Attribute Delegator lets us treat these fields the same as any other field in our form, which is a big win.

To avoid duplicating the documentation, I’ll stop here; check out the source and README on GitHub for more info.

PS: Big thanks to the Bundler gem development guide for making the publication process really easy!

Scraping Wikipedia for Death Pool 2012

About a month ago, one of my friends invited me to join a 2012 death pool. Our league was to be made up entirely of Michigan alums, and the rules stipulated a number of bonuses for Michigan births, deaths, and various degrees of attendance at the University of Michigan. The other important rule is that all candidates had to be ‘notable’, the criterion being that the person has a Wikipedia page. Clearly, this was a problem to be solved with programming.

Strategy

Scoring for the pool works like this: if a draft pick dies in 2012, whoever drafted them gets 115 points minus the age of the deceased. There is a slew of minor, unpredictable bonuses, but the biggest one is that any deceased Michigan alum automatically earns 11 extra points.
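The two rules that matter can be written down as a tiny function. This is only a sketch of the scoring as described here; the method and constant names are invented, and the minor bonuses are ignored.

```ruby
BASE_POINTS = 115         # points before subtracting the age at death
MICHIGAN_ALUM_BONUS = 11  # flat bonus for a deceased Michigan alum

# Points awarded for a pick who died at the given age.
def pick_points(age_at_death, michigan_alum: false)
  points = BASE_POINTS - age_at_death
  points += MICHIGAN_ALUM_BONUS if michigan_alum
  points
end

puts pick_points(90)                       # 25
puts pick_points(90, michigan_alum: true)  # 36
```

For a pick in their nineties, the alum bonus is a large fraction of the total, which is why the rest of this post focuses on Michigan alums.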

Given this knowledge, my strategy was to scrape Wikipedia to compile a list of the oldest Michigan alums I could find. Wikipedia has several lists that cover the topic, so I figured finding elderly Wolverines would be no problem.

Development

While I probably should have used the actual Wikipedia API, I opted to screen scrape instead so that I could get things going as fast as possible. I knew the hardest part would be parsing differently formatted pages to find a birth/death date, and didn’t think the API would be any easier than plain HTML.

For scraping, I used Nokogiri, and an excellent gem called VCR to speed up my unit tests. VCR records and plays back HTTP requests during tests, meaning that I wasn’t hitting Wikipedia every time I ran my tests. This proved useful because I could tweak my functions and test them again quickly.

The meat of the program is in two functions: one that parses a set of alumni lists, and another that, given a Wikipedia URL, creates a Person model (stored in MongoDB, because I wanted to try it out) and tries to figure out the person’s name, birthday, deathday, and whether they are indeed a Michigan alum. I used a rake task to generate the list of names, and then Resque to scrape the individual pages. I chose Resque because I expected a lot of random failures from rate limiting or parsing problems, both of which did happen.

The full source of the program is up on GitHub here.

Results

With all said and done, I ended up with sixteen Michigan alums on my roster who were born between 1918 and 1926. Actuarial tables suggest that the chance of death within one year is 20% for someone born in 1918 and 8.7% for someone born in 1926. Not bad.
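To see what those odds are worth, here is a back-of-the-envelope calculation of expected points per pick. It is a sketch under simplifying assumptions: age at death is approximated as 2012 minus the birth year, only the 115-minus-age rule and the 11-point alum bonus are counted, and the method name is made up.

```ruby
# Rough expected points for one Michigan alum pick:
# (chance of dying in 2012) x (points awarded if they do).
def expected_points(birth_year, chance_of_death)
  age = 2012 - birth_year           # approximation; ignores birthdays
  points_if_dead = 115 - age + 11   # base score plus the alum bonus
  chance_of_death * points_if_dead
end

puts expected_points(1918, 0.20).round(2)   # 6.4
puts expected_points(1926, 0.087).round(2)  # 3.48
```

Under these assumptions the oldest picks are worth more in expectation, even though each one pays out fewer points.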

However, I know that I incorrectly scraped many entries because I relied on the ‘infobox’ in the top right of the entry to get the age, and probably missed a number of good candidates this way. I also neglected to add the list of Michigan football players, some of whom were good picks. More time and testing probably could have fixed this.

Time will tell if this strategy will work well, but I already have big plans for next year’s pool.

Learning From My First Google Code Jam

This past weekend, I competed in Rounds 1A and 1B of Google Code Jam. This was my first year competing in this, or any other, programming competition. I went in with no idea what to expect (I barely scraped through qualification and didn’t look at any old problems), and while my results were pretty dismal (0 and 8 points, respectively), it was fun and has me interested in digging a bit deeper into competitive programming.

The Good

The only prep work I did was creating a Rakefile to set up my working folder and run tests. Upon starting on a problem, I would create a directory, copy this Rakefile into it, and run ‘rake setup’. This would create a barebones Ruby file to read the input file, and two folders, “works” and “fails”. Each of these folders is supposed to contain sets of ‘name.in’ and ‘name.out’ files that are known to be either correct input/output pairs or incorrect ones. So for example, I would copy ‘sample.in’ and ‘sample.out’, which contained the problem’s sample data, into ‘works’. Then, whenever I thought my program should be working, I could run ‘rake’ and be sure.

The real benefit of this was doing regression testing after getting a small input incorrect. I could move that input/output pair into “fails”, and know that after modifying my program, I wasn’t producing that same incorrect output.
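The core check can be sketched in a few lines of plain Ruby. This is not the actual Rakefile: the `failing_cases` method and the toy `solve` lambda are hypothetical, and the only thing taken from the description above is the layout of ‘name.in’/‘name.out’ pairs in “works” and “fails”.

```ruby
# Run the solver (passed as a block) on every NAME.in file in dir and
# compare its output with NAME.out. With expect_match: true the outputs
# must be equal ("works"); with expect_match: false they must differ
# ("fails"). Returns the names of the cases that misbehave.
def failing_cases(dir, expect_match:)
  Dir.glob(File.join(dir, "*.in")).sort.reject do |input_path|
    expected = File.read(input_path.sub(/\.in\z/, ".out"))
    actual = yield File.read(input_path)
    (actual.strip == expected.strip) == expect_match
  end.map { |path| File.basename(path, ".in") }
end

# Toy solver standing in for a real solution: sums the numbers in the input.
solve = ->(input) { input.split.map(&:to_i).sum.to_s }

broken = failing_cases("works", expect_match: true, &solve) +
         failing_cases("fails", expect_match: false, &solve)

if broken.empty?
  puts "all cases behave as expected"
else
  puts "check these cases: #{broken.join(', ')}"
end
```

Treating “fails” as outputs the program must not reproduce is a cheap regression test: it cannot prove the new answer is right, only that the known-wrong answer is gone.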

While still very rough around the edges, I think this tool helped me, and I would like to iterate on it in the future. The full source is available on GitHub.

The Bad

I think my biggest mistake was not practicing beforehand or looking over old problems. While I conceptually understand enough algorithms, recognizing which ones to adapt and use, and then implementing them quickly and correctly, is hard. I suspect practice is the only real remedy for this.

Implementation-wise, I also made a big mistake while working on the RPI problem. I knew I was going to be fetching a list of a particular team’s games over and over again, and I started off extracting it from a hash of every game for each query. I reasoned that while this was slow, if it became a time issue, I would just start saving results (à la dynamic programming), so they would only have to be built once.

But what’s faster than once? Never. After time was up, I realized that I should have just built a list of games for every team as I read them in. Although I think the dynamic approach would have worked fine time-wise, I had a simple array access error in my implementation, which I never caught and which caused every lookup to miss.
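Building the per-team lists while reading the input might look something like the sketch below. The game format here (a `[team_a, team_b, winner]` triple) and the method name are assumptions made for illustration; this is not the actual Code Jam input format or my contest code.

```ruby
# Group games by team in a single pass, so later lookups are plain hash
# reads with no per-query filtering and no cache to get wrong.
def index_games_by_team(games)
  by_team = Hash.new { |hash, team| hash[team] = [] }
  games.each do |game|
    team_a, team_b, _winner = game
    by_team[team_a] << game
    by_team[team_b] << game
  end
  by_team
end

# Hypothetical data: each game is [team_a, team_b, winner].
games = [
  ["A", "B", "A"],
  ["B", "C", "C"],
  ["A", "C", "A"],
]

by_team = index_games_by_team(games)
puts by_team["B"].length  # 2
```

There is no lookup-and-store step here at all, which is exactly the code where my array access bug was hiding.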

The lesson? Shorter, simpler, and more concise code means less room for a bug to hide in.