Analyzing ~425 days of Hacker News posts with standard shell commands

About 425 days ago (at the time of this writing), I started scraping Hacker News via its shiny new API. And then I promptly forgot about it. That is, until I noticed my cronjob had been throwing errors constantly for a few weeks:

Traceback (most recent call last):
  File "/home/dummy/projects/hn-cron/hn.py", line 62, in <module>
    main()
  File "/home/dummy/projects/hn-cron/hn.py", line 53, in main
    log_line = str(details['id']) + "\t" + details['title'] + "\t" + details['url'] + "\t" + str(details['score']) + "\n"
KeyError: 'url'
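
(The culprit, I later realized: self-posts like Ask HN come back from the API with no url field at all, so details['url'] throws. The fix would have been a one-liner with .get(), something like this sketch, with the field names as they were in my script:)

# Self-posts (e.g. Ask HN) have no 'url' key in the API item,
# so fall back to an empty string instead of crashing.
log_line = "\t".join([
    str(details['id']),
    details.get('title', ''),
    details.get('url', ''),    # the missing key that killed the cron job
    str(details.get('score', 0)),
]) + "\n"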

Instead of fixing anything, I just commented out the cronjob. But now I feel somewhat obligated to do at least a rudimentary analysis of this data. In keeping with my extreme negligence/laziness throughout this project, I hacked together a few bash commands to do just that.

A few notes about this data, and the (in)accuracy thereof:

  1. The script ran once every 40 minutes, collecting the 30 most popular stories (i.e. those on the front page) and adding them to the list if they were new (a rough sketch of the loop follows these notes)
  2. I only know I started roughly 425 days ago because the first link in log.txt was this one right here (Who needs timestamps? I have IDs!)
  3. A not-insignificant percentage of the time (probably ~10%), the script would fail because the stupid(, stupid, stupid) Python 2 script I banged out in 10 minutes didn’t know how to handle Unicode characters properly (oops).
  4. I saved everything to a tab-delimited flat file. I probably should’ve used something else, but I didn’t, so here we are.
  5. I only saved the score from the first time a story was found, so theoretically any given post only had an arbitrary 40 minute window to accumulate points, at most. This is probably not strictly true for a number of reasons, but I’m going to pretend it is.
  6. These bash commands grew organically (often with much help from StackOverflow), so they made sense to me at the time, but YMMV
  7. The data is probably inaccurate in a million small ways, but overall, it’s at least worth poking at.
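
Since note 1 promised a sketch: here’s roughly what that loop looked like, reconstructed from memory in Python 3 rather than the original Python 2. The Firebase API endpoints are the real ones; everything else is approximate, and Python 3’s native Unicode strings conveniently dodge the problem from note 3.

# Approximate reconstruction of the scraper, not the original hn.py.
# Grabs the current front page and appends unseen stories to log.txt.
import json
import urllib.request

API = "https://hacker-news.firebaseio.com/v0"

def fetch(path):
    with urllib.request.urlopen("%s/%s.json" % (API, path)) as resp:
        return json.load(resp)

def main():
    try:
        with open("log.txt", encoding="utf-8") as f:
            seen = {line.split("\t")[0] for line in f}
    except FileNotFoundError:
        seen = set()
    with open("log.txt", "a", encoding="utf-8") as log:
        for story_id in fetch("topstories")[:30]:  # the 30 front-page stories
            if str(story_id) in seen:
                continue
            item = fetch("item/%d" % story_id)
            log.write("\t".join([
                str(item["id"]),
                item.get("title", ""),
                item.get("url", ""),   # see the KeyError above
                str(item.get("score", 0)),
            ]) + "\n")

if __name__ == "__main__":
    main()

Cron it every 40 minutes, forget about it for 425 days, and you too can have a log.txt.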

Okay, let’s get down to it!

15 Most Popular Domains

Script

cat log.txt | uniq | awk 'BEGIN {FS = "\t+" }; {print $3}' | grep -o '^h.*' | sed 's/https\?:\/\///' | grep -o '^[^/]*' | sed 's/^www\.//' | sort | uniq -c | sort -nr | head -15

WTF does that do?

  1. Drops adjacent duplicate lines in the file (couldn’t trust myself to actually get de-duping right in the script)
  2. Gets the link field, keeps only lines that look like URLs, then chops off the junk (the http(s):// prefix, everything after the first slash, and any leading www.)
  3. Sorts the results so identical domains end up adjacent (uniq -c only counts consecutive duplicates, so the sort is mandatory, not cosmetic)
  4. Gets unique items, outputting the number of repeats for each domain (i.e. the number of links containing that domain)
  5. Sorts by numeric, not lexicographic, value (i.e. so that 100 > 11), in reverse (descending) order
  6. Gets the first 15 lines/domains
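
(If I’d been less committed to the bit, a hypothetical urlparse-based version would do the same job without the sed gymnastics. It assumes the same id<TAB>title<TAB>url<TAB>score format, and de-dupes by story ID instead of by line:)

# Hypothetical urlparse rewrite of the domain count.
from collections import Counter
from urllib.parse import urlparse

seen, domains = set(), Counter()
with open("log.txt", encoding="utf-8") as f:
    for line in f:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4 or fields[0] in seen:
            continue  # malformed row or duplicate story ID
        seen.add(fields[0])
        if not fields[2].startswith("http"):
            continue  # self-posts have no URL
        host = urlparse(fields[2]).netloc.lower()
        domains[host[4:] if host.startswith("www.") else host] += 1

for domain, count in domains.most_common(15):
    print("%5d %s" % (count, domain))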

And?

 2152 github.com
 1387 nytimes.com
 916 medium.com
 731 techcrunch.com
 486 washingtonpost.com
 477 bbc.com
 472 theguardian.com
 420 wired.com
 406 bloomberg.com
 388 nautil.us
 354 youtube.com
 329 bbc.co.uk
 324 newyorker.com
 323 theatlantic.com
 316 arstechnica.com

Most of these aren’t exactly shocking, though I suppose I didn’t realize just how popular nautil.us had become. Well done, chaps.

50 Most Popular Words

Script

cat log.txt | awk 'BEGIN {FS = "\t+" }; {print $2}' | grep -o "[^ ]*" | tr '[:upper:]' '[:lower:]' | tr -cd '[[:alnum:]\n]' | sort | uniq -c | sort -nr | head -50

WTF does that do?

  1. Gets titles
  2. Splits into words by spaces
  3. Converts to lowercase
  4. Deletes anything that’s not a letter, number, or newline (the literal [ and ] in the tr set sneak through too, which is why “[pdf]” makes the list below)
  5. Sorts so identical words end up adjacent (“mashes” them together)
  6. Counts instances
  7. Sorts in reverse, numeric order
  8. Gets 50 lines/words (I couldn’t settle on where to draw the line on useless words, so I figured I’d just include the top 50)
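
(One quirk worth flagging before the results: the blank entry below is most likely title “words” that were pure punctuation — a lone en dash, say — which step 4 reduces to empty strings that all get counted together. A hypothetical collections.Counter version that drops the empties:)

# Hypothetical Counter rewrite of the word count.
import re
from collections import Counter

words = Counter()
with open("log.txt", encoding="utf-8") as f:
    for line in f:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue
        for word in fields[1].lower().split():
            word = re.sub(r"[^a-z0-9]", "", word)  # mirrors the tr -cd step (minus the stray brackets)
            if word:  # drop punctuation-only "words" instead of counting one big blank bucket
                words[word] += 1

for word, count in words.most_common(50):
    print("%6d %s" % (count, word))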

And?

 12446 the
 7474 a
 7166 of
 6208 to
 5296 in
 5022 and
 4513 for
 3220               <- wtf? oops..
 2561 is
 2557 hn
 2501 on
 2392 with
 2134 how
 1877 show
 1358 from
 1347 new
 1325 an
 1313 [pdf]
 1209 why
 1103 your
 978 are
 969 you
 920 at
 896 what
 830 by
 816 data
 811 that
 805 i
 729 google
 692 as
 678 ask
 655 it
 624 using
 608 its
 607 we
 604 be
 590 can
 551 about
 547 programming
 545 web
 543 us
 543 not
 524 code
 510 do
 501 my
 483 open
 471 go
 471 first
 467 c
 465 language

Of course “The” is at the top of the list. But the order of common question words is (maybe) more interesting:

  1. How – 2134
  2. Why – 1209
  3. What – 896
  4. Who – 366
  5. When – 347
  6. Where – 157

So we care a lot about how stuff works, and why, and just what that stuff is, but we’re a global group of post-linear-time robots, so we don’t care about whos/whens/wheres.

Top 20 Hacker News Posts

Script

cat log.txt | uniq | awk 'BEGIN {FS = "\t+" }; {print $4" "$2" - "$3" ("$1")"}' | sort -nr | uniq | grep -vE "\((85|90)" | sed -r "s/\([0-9]+\)$//g" | head -20

WTF does that do?

  1. Prints the fields in a different order (score title – URL (ID))
  2. Sorts in reverse, numeric order (on the leading score)
  3. De-dupes again (the sort makes non-adjacent duplicates adjacent, so this catches repeats the first uniq missed)
  4. Removes a few stories that got lucky (i.e. I started the script when they were already on the front page), matched by ID prefix
  5. Strips the ID from the output
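
(Same deal, one more hypothetical rewrite: collapse duplicates by ID, sort by the first-seen score, apply the same “got lucky” ID-prefix filter, print 20.)

# Hypothetical rewrite of the top-20 list.
rows = {}
with open("log.txt", encoding="utf-8") as f:
    for line in f:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 4:
            rows[fields[0]] = fields  # keyed by ID, so duplicates collapse

top = [r for r in sorted(rows.values(), key=lambda r: int(r[3]), reverse=True)
       if not r[0].startswith(("85", "90"))]  # drop the stories that got lucky
for r in top[:20]:
    print("%s %s - %s" % (r[3], r[1], r[2]))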

And?

  1. Sir Terry Pratchett has died – 448
  2. Pro Rata – 330
  3. Unreal Engine 4 is now available to everyone for free – 326
  4. “Swift will be open source later this year” – 289
  5. Leonard Nimoy, Spock of ‘Star Trek,’ Dies at 83 – 285
  6. Airbnb, My $1B Lesson – 263
  7. Announcing Rust 1.0 – 257
  8. JRuby 9000 released – 246
  9. US to ban soaps and other products containing microbeads – 244
  10. Handwriting Generation with Recurrent Neural Networks – 217
  11. Snowden Meets the IETF – 187
  12. Fired – 178
  13. Symple Introduces the $89 Planet Friendly Ubuntu Linux Web Workstation – 167
  14. Jessica Livingston – 166
  15. FCC Passes Strict Net Neutrality Regulations on 3-2 Vote – 164
  16. Ellen Pao Is Stepping Down as Reddit’s Chief – 164
  17. YC Research – 158
  18. New Star Trek Series Premieres January 2017 – 158
  19. Just doesn’t feel good – 154
  20. Gay Marriage Upheld by Supreme Court – 154

From this list, it’s clear that there are a few things you can do to ensure you fit in with the HN zeitgeist and make it to the top of the front page:

  1. Be famous (BONUS: be Paul Graham)
  2. Be a life-changing programming language or framework
  3. Change politics forever (in America)
  4. Die or get fired

So there you have it, everything you never wanted to know about Hacker News! Thanks for reading, and I hope you enjoyed this slightly tongue-in-cheek analysis as much as I enjoyed writing it 🙂

 
