January 2016

Analyzing ~425 days of Hacker News posts with standard shell commands

(About) 425 days ago (at the time of this writing) I started scraping Hacker News via its shiny new API. And then I promptly forgot about it. That is, until I noticed my cronjob had been throwing errors constantly for a few weeks:

Traceback (most recent call last):
  File "/home/dummy/projects/hn-cron/hn.py", line 62, in <module>
    main()
  File "/home/dummy/projects/hn-cron/hn.py", line 53, in main
    log_line = str(details['id']) + "\t" + details['title'] + "\t" + details['url'] + "\t" + str(details['score']) + "\n"
KeyError: 'url'

Instead of fixing anything, I just commented out the cronjob. But now I feel somewhat obligated to do at least a rudimentary analysis of this data. In keeping with my extreme negligence/laziness throughout this project, I hacked together a few bash commands to do just that.

A few notes about this data, and the (in)accuracy thereof:

  1. The script ran once every 40 minutes, collecting the 30 most popular stories (i.e. those on the front page), and adding them to the list if they were new
  2. I only know I started roughly 425 days ago because the first link in log.txt was this one right here (Who needs timestamps? I have IDs!)
  3. A not-insignificant percent (probably ~10%) of the time, the script would fail because the stupid(, stupid, stupid) Python 2 script I banged out in 10 minutes didn’t know how to handle Unicode characters properly (oops).
  4. I saved everything to a flat file with tab delineation. I probably should’ve used something else, but I didn’t, so here we are.
  5. I only saved the score from the first time a story was found, so theoretically any given post only had an arbitrary 40 minute window to accumulate points, at most. This is probably not strictly true for a number of reasons, but I’m going to pretend it is.
  6. These bash commands grew organically (often with much help from StackOverflow), so they made sense to me at the time, but YMMV
  7. The data is probably inaccurate in a million small ways, but overall, it’s at least worth poking at.

Okay, let’s get down to it!

Continue reading