An Amateur Guide to Web Scraping

I hate to turn this website into a place where I write tutorials. I’m not good enough of a programmer to really be guiding people on the “right” way to do things, and on almost every topic a good tutorial already exists. However, I get asked about web scraping a lot, so I thought I’d give it a try writing this. I should note that web scraping raises a number of tricky legal issues. I’m not advocating you do anything unlawful, infringe anyone’s rights, or breach any contract. With that out of the way, let’s get started. There is some data on a website that you want to have access to, and you want to know how to get it. Whenever I have this problem, there are three approaches that I use.

Approach 1: Is Scraping Even Necessary?

One lesson I’ve learned, after many wasted hours, is that sometimes the data is already in a computer-readable format that you can download if you know where to look. This is obviously true if the website actually has a “download data” button somewhere, but often the data is still available even if the website doesn’t make that known. Politico, for example, displays the US election results by county for each state, but there is no “download this data” button anywhere.

The trick, if you want to call it that, is to inspect the website’s source code. In Chrome, which I will be using for this tutorial, you do this by going to the Chrome menu (top right), clicking “More Tools”, and then clicking “Developer Tools”. What we want to know is whether the page is loading some source that we can access directly. Click on the “Network” tab and refresh the page. Start looking through the network requests that the site makes, and see if the data you’re looking for is there. There will be a lot of files that clutter things up, so click through some of the file types that might be relevant (namely “XHR”, “JS”, “Doc”, and “Other”). Click the “Response” tab to see the data returned from each request. In this case, if we click “XHR”, we’ll find a request whose response is exactly the data we want. Now go to the “Headers” tab and figure out where that response came from. In this case, all you need is the URL. Copy and paste the URL into your web browser, press enter, and you’ll have an XML file of the data. And that’s it! You can now parse the XML and use it however you’d like.

This approach can also work for data that isn’t displayed on the site, such as the data used to create a visualization. For example, say you want the election results for Texas’ 24th congressional district, which is displayed on a map I made. Just follow the steps above and you will find three CSV files (one for each of the three counties in the congressional district). Copy and paste the URL for each and then save each page as a CSV. That’s it.
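
If you’d rather pull that file programmatically than save it by hand, it only takes a few lines. Here’s a minimal sketch, assuming the URL you copied from the “Headers” tab points to an XML feed (the URL below is just a placeholder; substitute whatever request you actually found):

import requests
from lxml import etree

# Placeholder URL - paste in the request URL you found in the Network tab
url = 'https://www.example.com/path/to/election_results.xml'

response = requests.get(url)
tree = etree.fromstring(response.content)

# Print every tag and its attributes to get a feel for how the feed is structured
for element in tree.iter():
    print(element.tag, element.attrib)

If the request returns a CSV instead (as in the Texas example), pandas will take the URL directly: pd.read_csv(url).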

Approach 2: Requests

Unfortunately, data is not always made available in a convenient format that you can find a direct link to. Sometimes you actually need to scrape it. The second, and most basic, approach is to use the requests library to fetch the page’s html, and then parse that html for the data you need. I’ll be using Python for this tutorial. If you prefer some other language like R, you’ve made a mistake. Come back when you know Python.

I am going to use the Runescape Grand Exchange as an example. For those of you who weren’t cool enough to play Runescape as a kid, it’s an online game, and one aspect of the game is buying items with coins. The items trade on an in-game exchange, and the prices can be found on the Runescape website. So let’s say you want to track the price of three different items, pulling the prices and storing them in a CSV. Here is how you would do that:

import pandas as pd
import requests
from lxml import html
from datetime import date

# Read the CSV of items we want to track, indexed by item name
df = pd.read_csv('runescape_data.csv', index_col='item_name')

# Map each item name to its Grand Exchange item ID (the ID appears in the item's URL)
item_dict = {'Yew_logs': '1515', 'Rune_hatchet': '1359', 'Small_fishing_net': '303'}
for i in df.index:
    # Request the item's Grand Exchange page and parse the html
    page = requests.get('http://services.runescape.com/m=itemdb_rs/' + i + '/viewitem?obj=' + item_dict.get(i))
    tree = html.fromstring(page.content)
    # xpath() returns a list of matches; take the first (and only) one and grab its text
    price = tree.xpath('//*[@id="grandexchange"]/div/div[2]/main/div[2]/div[2]/h3/span')[0].text
    # Store the price under a dated column and in a running 'current_price' column
    df.loc[i, date.today()] = int(price.replace(',', ''))
    df.loc[i, 'current_price'] = int(price.replace(',', ''))

df.to_csv('runescape_data.csv')
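
One thing this snippet assumes is that runescape_data.csv already exists with an item_name column. If you’re starting from scratch, you can create it once with something like this (the item names just mirror item_dict above):

import pandas as pd

# One row per item we want to track; price columns get added over time
items = ['Yew_logs', 'Rune_hatchet', 'Small_fishing_net']
pd.DataFrame({'item_name': items}).to_csv('runescape_data.csv', index=False)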

First, we read a CSV containing the name of each item we are interested in. For each item, we make a GET request with requests.get(), filling in the URL with the item name and ID (taken from item_dict, a dictionary I created manually). We then use html from the lxml package to convert the string response from that request into an html tree. From there, we can parse the html to find the information we want. In this case, we are looking for the price, which appears near the top of the item’s page.

To find its xpath, go to Developer Tools, click the far top left button (“select an element to inspect it”), and then click on the price. In the html, you will see it brings you to a span tag containing the price. Right click that tag, hover over “Copy”, and click “Copy XPath”. Now just copy and paste the xpath into the tree.xpath() part of the code. At the end of that line, you will see that it says [0].text. This is because xpath() returns a list of matching objects; we want the first (and only) object, and we want the text from that object (i.e. the price of the item). We then save the price to various columns of the dataframe, and at the end write the CSV with the updated data.

This is, of course, an incredibly simple example. With that said, extending this example to more complex tasks is actually fairly straightforward. For example, if you need more than one element off a page, you might write the xpath to return a list of multiple objects. On the Runescape page, you might want the data for today’s change, the 1 month change, the 3 month change, and the 6 month change. The xpath for each looks like this (in order):

//*[@id="grandexchange"]/div/div[2]/main/div[2]/div[2]/ul/li[1]/span/span[1]
//*[@id="grandexchange"]/div/div[2]/main/div[2]/div[2]/ul/li[2]/span/span[1]
//*[@id="grandexchange"]/div/div[2]/main/div[2]/div[2]/ul/li[3]/span/span[1]
//*[@id="grandexchange"]/div/div[2]/main/div[2]/div[2]/ul/li[4]/span/span[1]

To get each of these elements, notice that the only thing changing is the index on the li tag (li[1], li[2], and so on). So, simply remove the index from the li tag and use this xpath:

//*[@id="grandexchange"]/div/div[2]/main/div[2]/div[2]/ul/li/span/span[1]

Then, drop the [0].text from the end of the xpath() line and instead iterate over price, which is now a list of four elements. You can then do whatever you want with each item (if you do for x in price:, you will likely be doing something with x.text); there’s a short sketch of this below, after the session example. Writing xpaths isn’t always this easy, unfortunately, and sometimes the xpath returned by Chrome isn’t what you want. When that’s the case, you probably need to start playing with the xpath yourself, trying to construct it from some parent element you can access. You can find various tutorials for writing xpaths online.

The last thing to note on requests is that sometimes you need to log in to view the data you want, and in that case you will want to use a session. A session lets cookies persist across requests, so you stay logged in while you do your scraping. To use a session, your code will look something like this:

import requests
from lxml import html

with requests.Session() as s:
    # Fill in your credentials and whatever field names the login form expects
    login_payload = {
        'username': '',
        'password': ''
    }
    # POST the login form to the site's login URL
    login_response = s.post('', data=login_payload)
    # Make later requests through the session (not plain requests) so the login cookies persist
    page = s.get('')
    tree = html.fromstring(page.content)
    target_element = tree.xpath('')
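
And here, roughly, is the multi-element loop described a moment ago, reusing the tree object from the Grand Exchange example (a sketch, not something you can run on its own):

# One xpath, four matching spans: today's change, 1 month, 3 month, and 6 month
changes = tree.xpath('//*[@id="grandexchange"]/div/div[2]/main/div[2]/div[2]/ul/li/span/span[1]')
for x in changes:
    print(x.text)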

That is the general idea behind using requests for web scraping. A number of tricky problems can arise, like finding the correct URL to make a request to, posting the correct form data, and writing correct xpaths. However, the general idea always remains the same, and it is hard to generalize about these types of problems; it’s best to just deal with them as they arise.

Approach 3: Selenium Webdriver

Using requests is great for the simple stuff, but unfortunately it doesn’t always work. Specifically, if data is loaded dynamically on the page with JavaScript, you’ll probably need some other tool. This is where Selenium WebDriver comes in. I’ll admit now, I’m very partial towards using a webdriver for scraping. It’s a more versatile tool than requests, and it is often more intuitive as well.

Let’s look at a relatively complex task: I want to get information on every tax treaty that appears in the UN’s treaty database. Ultimately, what I want is the name of each tax treaty, the dates of signing and entry into force, and the name of each country that is a party to the treaty. (Oh yeah, and let’s ignore that the UN provides a “download to Excel” button.) What we want to do is search the UN Treaty Series Online for tax treaties. Having searched for “tax”, the URL doesn’t change, and there’s no intelligible POST request to be found in Developer Tools. Even if there were such a request, it seems that the response wouldn’t include the table data, which is what we need.

This isn’t a problem with a webdriver, as we can simply tell our browser to do all of the things we would do if we were searching this website manually. In the code below, I initialize a webdriver by pointing Selenium to where I have chromedriver installed, and then tell the webdriver to type “tax” in the search box, click “Search”, and change the number of results displayed per page to 500. You will also notice that I have a number of sleep statements in the code; without these sleeps, the code might get ahead of the browser and try to access elements before they have loaded.

Now that we have all of the results on one page, we can start scraping. All we need to do is look at xpaths in Developer Tools to figure out how to access the objects we want. This is no different from what we would do with requests, except I locate each object using driver.find_element_by_xpath(). In the code below, I iterate over the table column-wise, and in each case store the attribute I want into a list (which I then put into a pandas dataframe). I can access various attributes of an element using get_attribute(), which is how I grabbed the link to each agreement (the href) instead of the text “See Details”. Here is the code:

import pandas as pd
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from time import sleep


def get_table_with_links():
    driver = webdriver.Chrome('/Applications/chromedriver')
    url = 'https://treaties.un.org/Pages/UNTSOnline.aspx?id=2'
    driver.get(url)
    search_box = driver.find_element_by_xpath('//*[@id="ctl00_ctl00_ContentPlaceHolder1_ContentPlaceHolderInnerPage_txtTitle"]')
    search_box.send_keys('tax')
    submit_search = driver.find_element_by_xpath('//*[@id="ctl00_ctl00_ContentPlaceHolder1_ContentPlaceHolderInnerPage_btnSubmit"]')
    submit_search.click()
    sleep(5)
    results_per_page = Select(driver.find_element_by_xpath('//*[@id="ctl00_ctl00_ContentPlaceHolder1_ContentPlaceHolderInnerPage_drpPageSize"]'))
    results_per_page.select_by_value('500')
    sleep(5)
    # Read first column
    # Yes I know I'm repeating myself here. I think this actually helps readability for beginner programmers.
    treaty_names = []
    for i in driver.find_elements_by_xpath('//*[@id="ctl00_ctl00_ContentPlaceHolder1_ContentPlaceHolderInnerPage_dgSearch"]/tbody/tr/td[2]'):
        try:
            name = i.get_attribute('title')
            treaty_names.append(name)
        except:
            treaty_names.append('')
    df = pd.DataFrame(index=treaty_names)
    # Read second column
    treaty_links = []
    for i in driver.find_elements_by_xpath('//*[@id="ctl00_ctl00_ContentPlaceHolder1_ContentPlaceHolderInnerPage_dgSearch"]/tbody/tr/td[3]/a'):
        try:
            link = i.get_attribute('href')
            treaty_links.append(link)
        except:
            treaty_links.append('')
    df['link'] = treaty_links
    # Read third column
    treaty_signed = []
    for i in driver.find_elements_by_xpath('//*[@id="ctl00_ctl00_ContentPlaceHolder1_ContentPlaceHolderInnerPage_dgSearch"]/tbody/tr/td[4]'):
        try:
            treaty_signed.append(i.text)
        except:
            treaty_signed.append('')
    df['signed'] = treaty_signed
    # Read fourth column
    treaty_entry = []
    for i in driver.find_elements_by_xpath('//*[@id="ctl00_ctl00_ContentPlaceHolder1_ContentPlaceHolderInnerPage_dgSearch"]/tbody/tr/td[5]'):
        try:
            treaty_entry.append(i.text)
        except:
            treaty_entry.append('')
    df['entry'] = treaty_entry
    driver.quit()
    df.to_csv('tax_table.csv')

get_table_with_links()

Note that Selenium gives us a number of options for locating elements, so we don’t always have to use xpaths. Elements can be found by id, name, xpath, link text, partial link text, tag name, class name, and css selector. We can also choose to find a single element, or to find all elements with that identifier (find_element_by vs. find_elements_by). There are also a number of options for navigating pages: when working with dropdown boxes, we can select by value, text, or index, and we can send keys to scroll down a page if data is loaded by scrolling (as is the case on Twitter). There’s a short sketch of some of these alternatives after the next code block.

Of course, we still don’t have everything we want for this tax treaty problem, as we don’t have the names of the countries that are party to each treaty. That is what the links are for. I tried using requests, but the party information on each treaty page isn’t part of the response data; as such, we need to use a webdriver again. All we need to do is iterate over each link, find the parties (using xpaths), and add them to the pandas dataframe. The code looks like this:

import pandas as pd
from selenium import webdriver

def get_participants():
    df = pd.read_csv('tax_table.csv')
    df = df.head(5)  # just the first five treaties here
    driver = webdriver.Chrome('/Applications/chromedriver')
    for i in df.index:
        parties = ''
        url = df.loc[i, 'link']
        driver.get(url)
        # Each cell of the participants table holds one party; join them into one string
        for x in driver.find_elements_by_xpath('//*[@id="dgParticipants"]/tbody/tr/td'):
            parties += (x.text + ', ')
        df.loc[i, 'parties'] = parties
    df.to_csv('tax_table.csv')
    driver.quit()

get_participants()
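
As promised, here is a quick sketch of some of those other ways to locate elements and navigate a page. None of this is specific to the treaty example, and the selectors below are made up, so substitute your own:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select

driver = webdriver.Chrome('/Applications/chromedriver')
driver.get('https://www.example.com')

# A few of the alternative locator strategies (all of these selectors are hypothetical)
element = driver.find_element_by_id('some-id')
element = driver.find_element_by_class_name('some-class')
element = driver.find_element_by_css_selector('div.results > a')
elements = driver.find_elements_by_tag_name('a')  # plural form returns a list

# Dropdowns: select an option by value, visible text, or index
dropdown = Select(driver.find_element_by_name('some-dropdown'))
dropdown.select_by_visible_text('500')

# Scrolling: send keys to the body to trigger content that only loads as you scroll
driver.find_element_by_tag_name('body').send_keys(Keys.END)

driver.quit()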

You should probably add some safety measures into the code so you save the data you’ve scraped after every loop rather than only writing to a CSV at the end. This is because the code might hit a dead link or find some unexpected formatting, or the server might kick you out. You don’t want to lose your scraping progress if that occurs.

That is basically everything I want to say about Selenium. You’ll note that I never told you how to install any of this, and that’s because it’s kind of a pain and depends on your setup. In short, though, you’ll need to install Selenium (you can use pip), a web browser, and a driver for that browser. I use Chrome with Chromedriver.

Some people have mentioned that they don’t like Selenium because it physically opens windows when it is running, which is annoying; this is where a headless browser comes in handy. I use PhantomJS as a headless browser. It works just like a normal browser, but there is no graphical user interface. However, I’ve heard that Chrome now supports headless browsing too. I’ve never tried it, but I’ll update this post once I have.
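
In the meantime, here is what I understand to be the usual setup for headless Chrome with Selenium. I haven’t tried it myself, so treat this as a sketch rather than gospel:

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')  # run Chrome without opening a window

driver = webdriver.Chrome('/Applications/chromedriver', options=options)
driver.get('https://treaties.un.org/Pages/UNTSOnline.aspx?id=2')
print(driver.title)
driver.quit()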

Scrapers Gonna Scrape

This post was long and boring, but hopefully someone out there finds it helpful. Using the three approaches above, I’m able to scrape pretty much any data that I want. The only challenge arises on sites that use a paid service to block webdrivers, but that is very uncommon. And, if you’re really evil, DM me on Twitter and we can talk about how one might hypothetically get around that. Anyways, let me know on Twitter if you found this helpful or if you think I'm dumb.