2 Apr 2018

Web Scraping for Fun

This post is all about how to scrape the web with Ruby. I’ll be covering the four main ways to interact with the webserver and get the data you want.

Best Case Scenario

This is the absolute dream: the site offers a JSON API, so you don’t need anything outside of the standard library (although something like Curb could also be used to handle cases where the HTTP requests go bad).

Sadly, not all websites give you a nice API like this, and sometimes it’s only accessible for a fee (lots of sports data is like this).

So, let’s look at a bit of code that fetches a random image from Reddit:

#!/usr/bin/env ruby

require 'json'
require 'open-uri'

def get_threads
  # URI.open comes from open-uri (Ruby 2.5+). A bare open() stopped accepting
  # URLs in Ruby 3.0.
  JSON.parse(URI.open('https://www.reddit.com/r/EarthPorn/hot.json').read)
rescue OpenURI::HTTPError => error
  raise error unless error.io.status.first == '429'
  puts 'Got a 429, guess Reddit is busy right now. I\'ll try again in a bit <3'
  sleep 60
  get_threads
end

def get_random_thread
  get_threads['data']['children'].map { |thread|
    thread['data'] if thread['data']['domain'] == 'i.redd.it'
  }.compact.sample
end

def download(thread)
  url = thread['url']
  puts "Downloading #{url}"
  URI.open(url) do |stream|
    # 'wb' writes the image bytes untouched on every platform
    File.open(File.expand_path('~/background.jpg'), 'wb') do |fout|
      while (buffer = stream.read(8192))
        fout.write buffer
      end
    end
  end
end

download(get_random_thread)

The first thing you’ll notice is that we require json and open-uri, which are both part of the standard library. json gives us the tools for parsing and generating JSON. open-uri lets us open a URL and read it just like we would a local file (through URI.open on Ruby 2.5 and later; older versions also accepted URLs in a plain open, which stopped working in Ruby 3.0). This is really cool because it leads to some very clean and simple code.

get_threads is a funny little method. It was initially called try_really_hard_to_get_threads, but I thought that wasn’t very professional so I renamed it. This is a very simplistic implementation that trusts I’m always going to get back the JSON I expect, or that there will be an HTTPError I can respond to. The reason I catch the 429 here is that Reddit will very frequently return 429 Too Many Requests, so we just wait a minute and try again.
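
One thing to be aware of: as written, get_threads will keep calling itself for as long as Reddit keeps answering with a 429. If you’d rather give up after a few tries, here is a minimal sketch of a bounded retry. The with_retries helper and its parameters are hypothetical (they are not part of the script above), and the flaky block is only a stand-in for the real request.

```ruby
# Hypothetical helper: run a block, retrying a limited number of times and
# waiting between attempts. `retry_on` is the error class that triggers a retry.
def with_retries(max_attempts: 5, wait: 60, retry_on: StandardError)
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue retry_on
    raise if attempts >= max_attempts # give up and re-raise the last error
    sleep wait
    retry
  end
end

# Stand-in for a flaky request: fails twice, then succeeds.
calls = 0
result = with_retries(max_attempts: 3, wait: 0, retry_on: IOError) do
  calls += 1
  raise IOError, 'pretend this was a 429' if calls < 3
  'threads'
end

puts result
#=> threads
```

In the real script the block would wrap the JSON.parse call, and retry_on would be OpenURI::HTTPError.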

The next method, get_random_thread, is all about pulling out the relevant data from the API. The parts of the JSON we care about are structured like this:

{
  "kind": "Listing",
  "data": {
    "children": [
      {
        "kind": "t3",
        "data": {
          ...
          "domain": "self.EarthPorn",
          ...
          "url": "https://www.reddit.com/r/EarthPorn/comments/80mjqv/today_rearthporn_is_joining_operation_onemorevote/",
          ...
        }
      },
      {
        "kind": "t3",
        "data": {
          ...
          "domain": "i.redd.it",
          ...
          "url": "https://i.redd.it/ebcibjlwidp01.jpg",
          ...
        }
      }
      ...

We loop through every thread and keep only the ones with the domain i.redd.it, because those are the ones we know we can download really easily. When the domain matches, the block returns that thread’s data hash to map; otherwise it returns nil. At the end the result of map looks like [nil, data, nil, data], so we call compact on it to get rid of the nil results. We then call sample to get a single random result to return.
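
If the map and compact combination feels a bit indirect, the same filtering can be written with select. Here is a small sketch on made-up data shaped like the listing above; the URLs are placeholders, not real posts.

```ruby
# Sample data shaped like the "children" array (URLs are placeholders).
children = [
  { 'data' => { 'domain' => 'self.EarthPorn', 'url' => 'https://example.com/text-post' } },
  { 'data' => { 'domain' => 'i.redd.it', 'url' => 'https://i.redd.it/example-one.jpg' } },
  { 'data' => { 'domain' => 'imgur.com', 'url' => 'https://example.com/elsewhere' } },
  { 'data' => { 'domain' => 'i.redd.it', 'url' => 'https://i.redd.it/example-two.jpg' } }
]

# map + compact, as in get_random_thread
with_compact = children.map { |thread|
  thread['data'] if thread['data']['domain'] == 'i.redd.it'
}.compact

# The same result: unwrap each thread first, then keep the matching ones
with_select = children
  .map { |thread| thread['data'] }
  .select { |data| data['domain'] == 'i.redd.it' }

p with_compact == with_select
#=> true
p with_select.map { |data| data['url'] }
#=> ["https://i.redd.it/example-one.jpg", "https://i.redd.it/example-two.jpg"]
```

Both versions give the same array, so which one you use is mostly a matter of taste.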

download is slightly over-engineered to give you an example of how to stream a download to disk. This function was originally just File.write(File.expand_path('~/background.jpg'), open(url).read), but that requires reading the entire download into memory before flushing it to the disk, which is very bad if you want to download anything of a decent size.
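
If you don’t want to write the read loop by hand, the standard library’s IO.copy_stream does the same kind of chunked copy. This is only a sketch: a StringIO stands in for the stream open-uri would give us, and a Tempfile stands in for ~/background.jpg, so nothing is downloaded.

```ruby
require 'stringio'
require 'tempfile'

# StringIO stands in for the stream that open-uri would hand us.
stream = StringIO.new('pretend this is image data' * 1000)

copied = nil
written = nil
Tempfile.create('background') do |fout|
  fout.binmode
  # Copies from one IO-like object to another without us holding the
  # whole body in a Ruby string first.
  copied = IO.copy_stream(stream, fout)
  fout.flush
  written = File.size(fout.path)
end

puts "Copied #{copied} bytes, file is #{written} bytes"
```

The hand-written loop is still worth knowing, since it’s the place to hook in things like a progress counter.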

For the full version of this script, which does some nice naming of the file and actually sets it as the background, you can check it out on GitHub here.

Usual Scenario

This is when there is no API but all you need to do is parse some HTML and turn it into data. 90% of the scripts I write fall into this and the previous category. I very rarely need to touch the last two but they’re still a great learning experience.

OpenGraph data / Twitter card

We’ll start off with a really simple example to get the Open Graph data from sites.

#!/usr/bin/env ruby

require 'open-uri'
require 'nokogiri'

def get_opengraph_data(uri)
  # URI.open (Ruby 2.5+) is needed here; a bare open() no longer accepts URLs in Ruby 3.0
  document = Nokogiri::HTML(URI.open(uri).read)
  document.css('meta[property^="og:"]').map { |element|
    [element['property'].gsub('og:', ''), element['content']]
  }.to_h
end

p get_opengraph_data('http://www.imdb.com/title/tt0117500/')
#=> {
#     "url"=>"http://www.imdb.com/title/tt0117500/",
#     "image"=>"https://ia.media-imdb.com/images/M/MV5BZDJjOTE0N2EtMmRlZS00NzU0...",
#     "type"=>"video.movie",
#     "title"=>"The Rock (1996)",
#     "site_name"=>"IMDb",
#     "description"=>"Directed by Michael Bay.  With Sean Connery, Nicolas Cage..."
#   }

This time we’re using the Nokogiri gem, which handles parsing the HTML and lets us navigate it using CSS selectors and Ruby itself.

The first thing we do is download the webpage and pass it into the Nokogiri HTML parser. This gives us an object which we can query against in really nice ways.

In the example we use a single CSS query, meta[property^="og:"], which, if you’re not very familiar with CSS, means “all meta elements with a property attribute that starts with ‘og:’”.

So if our HTML looks like this:

<meta name="twitter:card" content="summary" />
<meta name="twitter:site" content="@nytimesbits" />
<meta name="twitter:creator" content="@nickbilton" />
<meta property="og:url" content="http://bits.blogs.nytimes.com/2011/12/0..." />
<meta property="og:title" content="A Twitter for My Sister" />
<meta property="og:description" content="In the early days, Twitter grew..." />
<meta property="og:image" content="http://graphics8.nytimes.com/images/2..." />

It will only return the last four meta tags.

Next we loop over all the results with map. For each element we return a two-element array that looks like ['title', 'A Twitter for My Sister'] (the gsub strips the og: prefix from the property name), so the result of the map is:

[
  ['url', 'http://bits.blogs.nytimes.com/2011/12/0...'],
  ['title', 'A Twitter for My Sister'],
  ['description', 'In the early days, Twitter grew...'],
  ['image', 'http://graphics8.nytimes.com/images/2...'],
]

Now, when you call to_h on an array of two-element arrays, it turns it into a hash like this:

{
  'url' => 'http://bits.blogs.nytimes.com/2011/12/0...',
  'title' => 'A Twitter for My Sister',
  'description' => 'In the early days, Twitter grew...',
  'image' => 'http://graphics8.nytimes.com/images/2...',
}
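
There are a couple of to_h behaviours worth knowing before you rely on it. The sketch below uses a cut-down copy of the pairs above; the 'first' and 'second' values are made up for illustration.

```ruby
pairs = [
  ['url', 'http://bits.blogs.nytimes.com/2011/12/0...'],
  ['title', 'A Twitter for My Sister']
]

p pairs.to_h.keys
#=> ["url", "title"]

# Hash[] is the older spelling of the same conversion
p pairs.to_h == Hash[pairs]
#=> true

# Anything that is not a two-element array raises
bad_pairs_error = nil
begin
  [['title', 'A Twitter for My Sister', 'extra']].to_h
rescue ArgumentError => error
  bad_pairs_error = error
  puts "ArgumentError: #{error.message}"
end

# When a key repeats, the last pair wins
repeated = [['image', 'first'], ['image', 'second']].to_h
p repeated
#=> {"image"=>"second"}
```

That last point matters here: a page with several og:image tags will only keep the final one in the hash.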

So this little method has given us a pretty hash of all the Open Graph data for this page. This is useful for things like forums, where you can show a little more information whenever a user submits a link.

I’ve got an example of this script which also handles Twitter card data, you can find it on GitHub here.

When the data isn’t very pretty

A lot of the time you’ll be working with fairly awkward data. For instance, this little script gets the front page of Hacker News and returns a hash for each submission.

#!/usr/bin/env ruby

require 'open-uri'
require 'nokogiri'

def hackernews_front_page
  document = Nokogiri::HTML(URI.open('https://news.ycombinator.com/').read)
  # Each hackernews item is actually contained in 3 tr elements without nice
  # classes on them. And there are two lines at the end we don't care about,
  # hence the check for how many elements we have.
  results = []
  document.css('.itemlist tr').each_slice(3) { |element|
    next unless element.size == 3
    user_element = element[1].at_css('.hnuser')

    results << {
      rank: element[0].at_css('.rank').text.gsub('.', '').to_i,
      story: {
        title: element[0].at_css('.title a').text,
        link: element[0].at_css('.title a')[:href]
      },
      user: {
        name: user_element ? user_element.text : nil,
        link: user_element ? user_element[:href] : nil
      },
      comments: {
        count: element[1].css('.subtext a').last.text.to_i,
        link: element[1].css('.subtext a').last[:href]
      }
    }
  }

  results
end

p hackernews_front_page[0]
#=> {
#     :rank=>1,
#     :story=>{
#       :title=>"How Cambridge Analytica’s Facebook targeting model really worked",
#       :link=>"http://www.niemanlab.org/2018/03/this-is-how-cambridge-analyti..."
#     },
#     :user=>{
#       :name=>"Dowwie",
#       :link=>"user?id=Dowwie"
#     },
#     :comments=>{
#       :count=>128,
#       :link=>"item?id=16719403"
#     }
#   }

Let’s break this down piece by piece. The first thing we do is fetch the HTML and initialize a Nokogiri object.

Secondly, we select all elements which match .itemlist tr and loop over them three at a time, because all the data for a single submission is contained in three table rows. This is done using the lovely each_slice method from Ruby core.

The first thing we do inside the loop is check that we actually have three elements, because the last two rows are actually the “more” link at the bottom. We want to ignore those and skip over them.
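
To see why the size check is needed, here is each_slice on a plain array of eight made-up row names: two full submissions followed by the two leftover rows.

```ruby
# Eight stand-in "rows": two submissions of three rows each, plus two leftovers.
rows = %w[title1 subtext1 spacer1 title2 subtext2 spacer2 morespace more]

submissions = []
rows.each_slice(3) do |slice|
  # The final slice only has two rows, so it gets skipped here
  next unless slice.size == 3
  title, subtext, _spacer = slice
  submissions << "#{title} / #{subtext}"
end

puts submissions
#=> title1 / subtext1
#=> title2 / subtext2
```

each_slice never pads the final slice, so anything that doesn’t divide evenly shows up as a short slice at the end.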

Next we grab the user element. I’ve done it like this because sometimes this one won’t exist, and defining it here makes things a bit cleaner later on. The at_css method returns the first element that matches the selector, which is handy when you know you only have one or know you want the first.

After that we start populating the hash for this submission. I’m going to go through this quickly as it’s all pretty self-explanatory, with only minor differences between the fields. For rank we look for an element with the rank class on it and get its text contents, then remove any . from it and turn it into an integer.

Next up we get the story details. We only care about the a element under the element with the title class, and we grab both its text and its link. When you have a single Nokogiri element you can access its attributes as you would a hash, which is really nice.

This is why we got the user_element earlier. We don’t have try since we haven’t included ActiveSupport, so we just do a simple ternary. I could have used user_element&.text, which was introduced in Ruby 2.3, but at the time of writing I wanted to remain compatible with Ruby 2.2 since it was still supported.
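
If you’re on Ruby 2.3 or later, the two spellings behave the same for this case. Here is a quick sketch using a Struct as a stand-in for a Nokogiri element; the method names are made up for the comparison.

```ruby
# Stand-in for a Nokogiri element: anything that responds to `text`.
fake_element = Struct.new(:text)

def name_with_ternary(user_element)
  user_element ? user_element.text : nil
end

def name_with_safe_navigation(user_element)
  user_element&.text
end

present = fake_element.new('Dowwie')

p name_with_ternary(present)
#=> "Dowwie"
p name_with_safe_navigation(present)
#=> "Dowwie"
p name_with_ternary(nil)
#=> nil
p name_with_safe_navigation(nil)
#=> nil
```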

And lastly we want to get information about the comments. Here we use the css method rather than at_css so we can get the last matching element. I also use a bit of trickery with to_i: if you pass in something like 123blah456 you’ll get 123. This is because to_i stops converting at the very first non-digit character. If the first non-whitespace character it encounters is not a digit (or a leading sign), it’ll return zero. For example:

"123abc456".to_i
#=> 123
"one".to_i
#=> 0
"  1337   test".to_i
#=> 1337
"  a1b2c3".to_i
#=> 0
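
The flip side of this leniency is that to_i can’t tell “zero comments” apart from “this wasn’t a number at all”. If that distinction matters to you, Integer() is the strict alternative. In this sketch both helper names are made up, and the 'discuss' link text is an assumption about what a submission without comments might show.

```ruby
# Lenient: whatever leading digits exist, or 0
def lenient_count(text)
  text.to_i
end

# Strict: the whole string must be a number, otherwise nil
def strict_count(text)
  Integer(text)
rescue ArgumentError
  nil
end

p lenient_count('128 comments')
#=> 128
p lenient_count('discuss')
#=> 0
p strict_count('128')
#=> 128
p strict_count('128 comments')
#=> nil
```

For scraping a comment count the lenient version is exactly what we want, which is why the script uses to_i.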

Posting and Sessions

When you need to keep your session data and cookies it can be troublesome to use the more lightweight approaches above. Using Mechanize is a good way to handle it.

Let’s have a look at the code below:

#!/usr/bin/env ruby

require 'mechanize'

session = Mechanize.new
# Log in as a user
session.get('https://thredded.org/user_sessions/new')
form = session.page.forms.last
form['name'] = 'Jane'
form.submit

# Confirm we are logged in
session.page.parser.css('.thredded--flash-message').map(&:text).map(&:strip)
#=> ["Signed in as Jane, an admin."]

# Go into the off-topic area
session.page.links[8].click

# Create a new thread
form = session.page.forms.last
form['topic[title]'] = 'Web Scraping with Mechanize'
form['topic[content]'] =
  "It's _super_ cool, you should check out the [GitHub](https://github.com/sparklemotion/mechanize)!"
form.submit

For this demo we’re using the demo site of Thredded, which is a simple, mobile-friendly, Rails-backed forum. I chose Thredded because its demo site doesn’t require email verification, captcha, or even a password!

We start off by requiring Mechanize and then go on to initialize a new instance of it.

Next we use session.get to go to the login page of the Thredded demo. From here we use session.page.forms.last to get a Mechanize::Form object we can populate with our data. For this example we set our name to ‘Jane’ and we don’t touch the admin checkbox since it’s already set to true. Then we submit the form, which is the equivalent of clicking the ‘Sign in’ button.

Now just to confirm we’ve logged in we spit out what we find in the flash messages. You may have recognised the css method used here, that’s because behind the scenes Mechanize uses Nokogiri for HTML parsing.

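session = Mechanize.new
# A page has to be loaded first, otherwise session.page is nil
session.get('https://thredded.org/user_sessions/new')
session.page.parser.class
#=> Nokogiri::HTML::Document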

So you can treat it like my examples up above once you’ve gotten to the page you’re interested in.

We then click the 9th link on the page to take us into the Off-Topic area of the forum.

And finally we get the form responsible for creating a topic and populate it with a bit of data, just like we did with the login form, then we submit it. If you go to the Off-Topic category you should be able to find the thread, but only for a little while since the demo site refreshes its database regularly.

Worst Case Scenario

I consider this the worst case scenario, and the only time I see it as actually necessary is when you are running a JavaScript testing framework from RSpec.

The gem used here is selenium-webdriver, and it lets you interact with websites using an actual web browser. Chrome, Firefox, Safari, and Internet Explorer are all supported. About a month ago I would have recommended using PhantomJS, but since it’s been deprecated I can no longer suggest it.

I’m going to keep this section quite short since I really don’t condone using it as it’s been very flaky for me.

#!/usr/bin/env ruby

require "selenium-webdriver"

driver = Selenium::WebDriver.for :chrome
driver.manage.window.resize_to(1024, 768)
sleep 5

# Log in as a user
driver.navigate.to 'https://thredded.org/user_sessions/new'
element = driver.find_element(name: 'name')
# Send backspace three times to clear the form
3.times { element.send_keys :backspace }
element.send_keys 'Selenium'
element.submit

# Go into the off-topic area
driver.find_element(partial_link_text: 'Off-Topic').click

# Wait for turbolinks to finish loading the next page
# (My internet obviously sucks)
sleep 5

# Create a new thread
element = driver.find_element(id: 'topic_title')
element.send_keys 'Web Scraping with Selenium'
# Wait for CSS animations to finish
sleep 1
element = driver.find_element(id: 'topic_content')
element.send_keys "It's _super_ cool, you should check out the [GitHub](https://github.com/SeleniumHQ/selenium/wiki/Ruby-Bindings)!"
element.submit

# Close the browser
driver.quit

You’ll notice this is pretty much the same as the Mechanize example but with quite a few sleep statements. This is because if something isn’t visible yet you can’t interact with it.

The basic way of interacting using Selenium is to select the expected element and send whatever keys you’d like to it. There are also ways of faking mouse interaction if you need to drag and drop or hover.
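
Fixed sleeps are a big part of why this kind of script ends up flaky: too short and the element isn’t there yet, too long and you’re just wasting time. The usual alternative is to poll until a condition is met, and selenium-webdriver ships a Selenium::WebDriver::Wait class for exactly that. To show the idea without needing a browser, here is a hypothetical pure-Ruby wait_until helper (not part of Selenium), with a counter standing in for “the element has appeared”.

```ruby
# Hypothetical helper: poll a block until it returns something truthy,
# or raise once the timeout has passed.
def wait_until(timeout: 5, interval: 0.1)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    result = yield
    return result if result
    if Process.clock_gettime(Process::CLOCK_MONOTONIC) > deadline
      raise "condition not met within #{timeout} seconds"
    end
    sleep interval
  end
end

# Stand-in for "the element is on the page": truthy on the third check.
checks = 0
found = wait_until(timeout: 2, interval: 0.01) do
  checks += 1
  checks >= 3 ? 'topic_title' : nil
end

puts "Found #{found} after #{checks} checks"
#=> Found topic_title after 3 checks
```

In the Selenium script the block would be the find_element call, with the lookup error rescued so that “not there yet” counts as a falsy result.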

You can see a full run of the demo below, you’ll also notice it took me a few tries to get this recording right!

Protecting Yourself from Users

Users are the worst, they’ll do strange things that’ll break whatever code you write but thankfully there are some things you can do to protect yourself!

#!/usr/bin/env ruby

require 'net/https'

def should_i_fetch?(path)
  uri = URI(path)

  # We only handle HTTP and HTTPS
  return false unless ['http', 'https'].include?(uri.scheme)

  connection = Net::HTTP.new(uri.host, uri.port)
  connection.use_ssl = (uri.scheme == 'https')
  # request_uri falls back to '/' when the URL has no path, and keeps any query string
  head = connection.request_head(uri.request_uri)
  return false unless head.content_type == 'text/html'
  return false unless (head.content_length || 0) <= 1024 * 1024 * 5 # 5 Megabytes
  true
end

ftp_site = 'ftp://speedtest.tele2.net/'
big_file = 'http://mirror.filearena.net/pub/speed/SpeedTest_2048MB.dat'
regular_site = 'https://www.adam.com.au/support/blank-test-files'
json_api = 'https://www.reddit.com/r/EarthPorn/hot.json'

puts should_i_fetch? ftp_site
#=> false
puts should_i_fetch? json_api
#=> false
puts should_i_fetch? big_file
#=> false
puts should_i_fetch? regular_site
#=> true

Basically all this does is check that we are requesting a site over HTTP or HTTPS, that the data we are getting back is HTML, and that the amount of data is no more than 5MB.

It’s pretty rudimentary but should help protect you at a pretty basic level.
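
One way to make these rules easier to test is to separate the decision from the network call. The sketch below is a hypothetical variation on should_i_fetch?: the acceptable? method and its keyword arguments are made up, and it takes the header values as arguments rather than fetching them. It also treats a malformed URL as a plain “no” instead of letting the exception escape.

```ruby
require 'uri'

# Hypothetical variation on should_i_fetch?: the caller supplies the header
# values, so the rules can be checked without touching the network.
def acceptable?(url, content_type:, content_length:, max_bytes: 5 * 1024 * 1024)
  uri = URI.parse(url)
  return false unless %w[http https].include?(uri.scheme)
  return false unless content_type == 'text/html'
  (content_length || 0) <= max_bytes
rescue URI::InvalidURIError
  false # not a parseable URL at all
end

puts acceptable?('ftp://speedtest.tele2.net/', content_type: 'text/html', content_length: 10)
#=> false
puts acceptable?('https://example.com/', content_type: 'application/json', content_length: 10)
#=> false
puts acceptable?('https://example.com/', content_type: 'text/html', content_length: 2048 * 1024 * 1024)
#=> false
puts acceptable?('https://example.com/', content_type: 'text/html', content_length: nil)
#=> true
puts acceptable?('not a url', content_type: 'text/html', content_length: 10)
#=> false
```

Note the fourth case: just like the original, a response with no Content-Length header is let through, so you’d still want to cap how much you read when you actually download it.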

Closing Thoughts

So in this article you’ve learnt how to read data from sources all over the web, but keep in mind people pay good money to keep those sites up. Don’t hammer them too hard and if you’re going to build a spider for a search engine, make sure to respect the robots.txt.