2 Apr 2018 · 19 minutes to read
This post is all about how to scrape the web with Ruby. I’ll be covering the four main ways to interact with a web server and get the data you want.
This is the absolute dream: you don’t need anything outside of the standard library (though something like Curb could also be used for handling cases where the HTTP requests go bad).
Sadly, not all websites give you a nice API like this, or sometimes it’s only accessible for a fee (lots of sports data is like this).
So, let’s look at a bit of code that fetches a random image from Reddit:
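The original snippet isn’t reproduced here, but based on the walkthrough that follows it would look roughly like this. The subreddit, the image domain, and the JSON field names are assumptions; the method names come from the post.

```ruby
require 'json'
require 'open-uri'

# Fetch the subreddit's front-page JSON, retrying when Reddit rate limits us.
def get_threads
  JSON.parse(open('https://www.reddit.com/r/EarthPorn.json').read) # subreddit assumed
rescue OpenURI::HTTPError => error
  raise unless error.io.status.first == '429' # 429 Too Many Requests
  sleep 60
  retry
end

# Pick one random, directly downloadable image URL out of the API response.
def get_random_thread(json)
  json['data']['children'].map { |thread|
    thread['data']['url'] if thread['data']['domain'] == 'i.redd.it' # domain assumed
  }.compact.sample
end

# Stream the download straight to disk instead of buffering it all in memory.
def download(url)
  File.open(File.expand_path('~/background.jpg'), 'wb') do |file|
    open(url) { |remote| IO.copy_stream(remote, file) }
  end
end

# download(get_random_thread(get_threads))
```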
The first thing you’ll notice is that we require json and open-uri, which are both part of the standard library. json obviously gives us all the tools for parsing and creating JSON objects. open-uri allows us to pass URLs to open and read them just like we would a local file, which is really cool because it leads to some very clean and simple code.
get_threads is a funny little method; it was initially called try_really_hard_to_get_threads, but I thought that wasn’t very professional so I renamed it. This is a very simplistic implementation that trusts I’m always going to get back the JSON I expect, or that there will be an HTTPError I can respond to. The reason I catch the 429 here is that Reddit will very frequently return 429 Too Many Requests, so we just wait a minute and try again.
The next method, get_random_thread, is all about pulling the relevant data out of the API response. The parts of the JSON we care about are structured like this:
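The original snippet of the response isn’t shown here; trimmed down to just the fields the script cares about, it looks roughly like this (the values and the domain are invented for illustration):

```json
{
  "data": {
    "children": [
      {
        "data": {
          "domain": "i.redd.it",
          "url": "https://i.redd.it/example.jpg"
        }
      }
    ]
  }
}
```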
We loop through every thread and keep only the ones with the domain we know we can download really easily. Once we match on the domain we return the url from the map block. At the end the result of map looks like [nil, url, nil, url], so we call compact on it to get rid of the nil results. We then call sample to get a single random result to return.
download is slightly over-engineered to give you an example of how to stream a download to disk. This function was originally just File.write(File.expand_path('~/background.jpg'), open(url).read), but that requires reading the entire download into memory before flushing it to disk, which is very bad if you want to download anything of a decent size.
For the full version of this script, which does some nice naming of the file and actually sets it as the background, you can check it out on GitHub here.
This is when there is no API but all you need to do is parse some HTML and turn it into data. 90% of the scripts I write fall into this and the previous category. I very rarely need to touch the last two but they’re still a great learning experience.
We’ll start off with a really simple example to get the Open Graph data from sites.
This time we’re using the Nokogiri gem, which will handle parsing the HTML and allow us to navigate it using CSS selectors and Ruby itself.
The first thing we do is download the webpage and pass it into the Nokogiri HTML parser. This gives us an object which we can query against in really nice ways.
In the example we use a single CSS query, meta[property^="og:"], which, if you’re not very familiar with CSS, means “all meta elements with a property attribute that starts with ‘og:’”.
So if our HTML looks like this:
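The original markup isn’t shown; a hypothetical head section (all values invented for illustration, apart from the title mentioned later in the post) would be:

```html
<meta charset="utf-8">
<meta name="description" content="A blog post about my sister">
<meta property="og:title" content="A Twitter for My Sister">
<meta property="og:type" content="article">
<meta property="og:url" content="https://example.com/a-twitter-for-my-sister">
<meta property="og:image" content="https://example.com/cover.jpg">
```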
It will only return the last four meta elements.
Next we loop over all the results and turn them into an array. For each element we return an array that looks like ['title', 'A Twitter for My Sister'], so the result of the map is:
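Filled in with the title from the post plus invented values for the other properties, the mapped array would look something like:

```ruby
og_pairs = [
  ['title', 'A Twitter for My Sister'],
  ['type', 'article'],
  ['url', 'https://example.com/a-twitter-for-my-sister'],
  ['image', 'https://example.com/cover.jpg']
]
```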
Now, when you call to_h on an array of two-element arrays, it turns them into a hash like this:
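Seen in isolation (values invented for illustration):

```ruby
pairs = [['title', 'A Twitter for My Sister'], ['type', 'article']]
pairs.to_h
# => { "title" => "A Twitter for My Sister", "type" => "article" }
```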
So this little method has given us a pretty hash of all the Open Graph data for the page. This is useful for things like forums: when a user submits a link, you can include a little more information about it.
I’ve got an example of this script which also handles Twitter card data, you can find it on GitHub here.
A lot of the time you’ll be working with fairly awkward data; for instance, this little script gets the front page of Hacker News and returns hashes of each submission.
Let’s break this down piece by piece. The first thing we do is fetch the HTML and initialize a Nokogiri object.
Secondly we select all elements that match .itemlist tr and loop over them three at a time, because all the data for a single submission is contained in three table rows. This is done using the lovely each_slice from Ruby core.
So the first thing we do is check that we actually have three elements; the reason is that the last two rows are actually the “more” link at the bottom, so we want to ignore them and skip over them.
Next we grab the user element. The reason I’ve done it like this is that sometimes this element won’t exist, and defining it here makes things a bit cleaner later. The at_css method returns the first element that matches the selector, which is handy when you know you only have one or only want the first.
After that we start populating the hash for this submission. I’m going to go through this quickly as it’s all pretty self-explanatory, with only minor differences between the fields. For the rank we find the element carrying it, get its text contents, remove any . from it, and turn it into an integer.
Next up we get the story details; we only care about the a element under the element with the title class. We grab both the text and the link. When you have a single element, Nokogiri lets you access its attributes like a hash, which is really nice.
This is why we got the user_element earlier: we don’t have try because we haven’t included ActiveSupport, so we just do a simple ternary. I could use user_element&.text, which was introduced in Ruby 2.3, but I wanted to remain compatible with Ruby 2.2 since it’s still supported.
And lastly we want to get information about the comments; here we use a broader selector so we can grab the last matching element. I also use a bit of trickery with to_i: if you pass in something like 123blah456 you’ll get 123, because to_i stops converting to an integer at the very first non-digit character. If the first non-whitespace character it encounters is not a digit, it’ll return zero. For example:
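A few quick examples of that behaviour:

```ruby
'123blah456'.to_i   # => 123
'  42 comments'.to_i # => 42 (leading whitespace is skipped)
'blah123'.to_i       # => 0
```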
When you need to keep your session data and cookies it can be troublesome to use the more lightweight approaches above. Using Mechanize is a good way to handle it.
Let’s have a look at the code below:
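The original code isn’t included here; a hedged sketch of the flow described below might look like this. The demo URL, the field order, and the flash-message selector are all assumptions.

```ruby
require 'mechanize'

session = Mechanize.new

# Navigate to the login page of the Thredded demo (URL assumed).
session.get('https://thredded-demo.herokuapp.com/users/sign_in')

# Grab the last form on the page and fill in a name -- no password needed.
form = session.page.forms.last
form.fields.first.value = 'Jane'        # the name field; leave the admin box alone
session.submit(form, form.buttons.last) # the 'Sign in' button

# Print the flash messages to confirm the login worked (selector assumed).
puts session.page.css('.flash-message').map(&:text)

# The 9th link takes us into the Off-Topic area.
session.page.links[8].click

# Fill in and submit the new-topic form, just like we did with the login form.
form = session.page.forms.last
form.fields[0].value = 'Hello from Mechanize'       # field order assumed
form.fields[1].value = 'A topic created by a script.'
session.submit(form, form.buttons.last)
```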
For this demo we’re using the demo site of Thredded which is a simple Rails backed forum that is mobile friendly. The reason for choosing Thredded is because its demo site doesn’t require email verification, captcha, or even a password!
We start off by requiring Mechanize and then go on to initialize a new instance of it.
Next we use session.get to navigate to the login page of the Thredded demo. From there we use session.forms.last to get a Mechanize::Form object we can populate with our data. For this example we set our name to ‘Jane’ and we don’t touch the admin checkbox, since it’s already set to true. Then we click the ‘Sign in’ button.
Now, just to confirm we’ve logged in, we spit out what we find in the flash messages. You may have recognised the css method used here; that’s because behind the scenes Mechanize uses Nokogiri for its HTML parsing. So once you’ve gotten to the page you’re interested in, you can treat it just like my examples up above.
We then click the 9th link on the page to take us into the Off-Topic area of the forum.
And finally we get the form responsible for creating a topic, populate it with a bit of data just like we did with the login form, and submit it. If you go to the Off-Topic category you should be able to find the thread we created, but only for a little while, since the demo site refreshes its database regularly.
The gem used here is Selenium, and it lets you interact with websites using an actual web browser; Chrome, Firefox, Safari, and Internet Explorer are all supported. About a month ago I would have recommended PhantomJS, but since it has been deprecated I can no longer suggest it.
I’m going to keep this section quite short since I really don’t recommend this approach; it’s been very flaky for me.
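I haven’t reproduced the original script; a rough sketch of the same login flow (the URL and CSS selectors are assumptions) looks like this:

```ruby
require 'selenium-webdriver'

driver = Selenium::WebDriver.for :chrome
driver.get 'https://thredded-demo.herokuapp.com/users/sign_in' # URL assumed

sleep 2 # give the page time to render; hidden elements can't be interacted with

driver.find_element(css: 'form input[type="text"]').send_keys('Jane') # selector assumed
driver.find_element(css: 'form input[type="submit"]').click

sleep 2 # wait for the redirect before poking at the next page

puts driver.find_element(css: '.flash-message').text # selector assumed
driver.quit
```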
You’ll notice this is pretty much the same as the Mechanize example, but with quite a few sleep statements. This is because if something isn’t visible, you can’t interact with it.
The basic way of interacting using Selenium is to select the expected element and send whatever keys you’d like to it. There are also ways of faking mouse interaction if you need to drag and drop or hover.
You can see a full run of the demo below, you’ll also notice it took me a few tries to get this recording right!
Users are the worst, they’ll do strange things that’ll break whatever code you write but thankfully there are some things you can do to protect yourself!
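The original code isn’t shown here; a minimal sketch matching the description below might be (the method name and the 5MB constant are mine):

```ruby
require 'open-uri'
require 'uri'

MAX_BYTES = 5 * 1024 * 1024 # 5MB

# Fetch a URL with some basic sanity checks applied first.
def safe_fetch(url)
  uri = URI.parse(url)
  unless %w[http https].include?(uri.scheme)
    raise ArgumentError, 'only HTTP and HTTPS URLs are allowed'
  end

  # content_length_proc fires as soon as the Content-Length header arrives,
  # letting us bail out before downloading an enormous response.
  response = open(uri.to_s, content_length_proc: lambda { |size|
    raise 'response is too large' if size && size > MAX_BYTES
  })
  raise 'response is not HTML' unless response.content_type == 'text/html'

  response.read
end
```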
Basically all this does is check that we are requesting a site over HTTP or HTTPS, that the amount of data is under 5MB, and that the data we are getting back is HTML.
It’s pretty rudimentary but should help protect you at a pretty basic level.
So in this article you’ve learnt how to read data from sources all over the web, but keep in mind people pay good money to keep those sites up. Don’t hammer them too hard and if you’re going to build a spider for a search engine, make sure to respect the robots.txt.