Web Scraping with Selenium
This tutorial will show how to scrape dynamic websites using Selenium.
Setup
Setting up Selenium requires two steps: installing the Selenium library and installing a browser driver.
- The easiest way to install the Selenium library is to run pip install selenium
- To install the browser driver, check the version of your browser and download the driver that matches it. This tutorial uses the Chrome driver. Either put the downloaded driver in a directory that is already listed in the PATH environment variable, or put it in a new directory and add that directory to PATH.
To check that the driver was added correctly, run chromedriver.exe from the command line (chromedriver on macOS and Linux). If it starts, the driver is on the PATH.
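The same check can also be done from Python. The following is a minimal sketch that only uses the standard library. The helper name find_driver is made up for this tutorial, and the sketch assumes the executable is called chromedriver (on Windows, the .exe extension is resolved automatically).

```python
import shutil


def find_driver(name="chromedriver"):
    """Return the full path of the driver if it is on PATH, otherwise None."""
    # shutil.which searches the directories listed in PATH, like the shell does
    return shutil.which(name)


if __name__ == "__main__":
    path = find_driver()
    if path:
        print("chromedriver found at", path)
    else:
        print("chromedriver is not on PATH")
```

If this reports that the driver is not on PATH, re-check the directory you added before moving on.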
Simple Example
To get started, we'll do a simple scrape of a static site (example.com). The following code gets the page title and the text of the h1 tag:
from selenium import webdriver
from selenium.webdriver.common.by import By
#start the session
driver = webdriver.Chrome()
#navigate to the web page
driver.get('https://www.example.com')
#extract page title
title = driver.title
#find the first h1 tag on the page
h1 = driver.find_element(by=By.TAG_NAME, value='h1')
print(title, h1.text)
#end the session
driver.quit()
Of course, we don't need Selenium for such a basic example on a static site. But this demonstrates the key commands that nearly every Selenium script will use.
Scraping Dynamic Web Pages
If all we wanted was to scrape static sites like example.com, we could just send GET requests with the Python requests library (a third-party package, installed with pip install requests). But we're interested in getting data from dynamic websites: sites where information is constantly changing and is loaded by JavaScript after the initial page load.
If we sent a simple GET request to such a site, we'd likely get back only the bare-bones structure of the page, perhaps just the loading screen. This is because JavaScript code has to run to make AJAX requests to the backend, and the data that comes back is then rendered on the page without the page reloading. Thus, we need a tool like Selenium, which drives a real browser in which that JavaScript runs.
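To make the "bare-bones structure" point concrete, here is a small sketch using only the standard library. The HTML string is a hypothetical response body invented for illustration, not markup from any real site. It stands in for what a plain GET request might return from a JavaScript-rendered page. The class and function names are also made up for this example.

```python
from html.parser import HTMLParser

# Hypothetical response body for a JavaScript-rendered page (an assumption
# for illustration only, not taken from a real site)
RAW_HTML = """
<html>
  <head><title>Loading...</title></head>
  <body>
    <div id="root"></div>
    <script src="/static/app.js"></script>
  </body>
</html>
"""


class VisibleText(HTMLParser):
    """Collect text that appears inside <body>, skipping <script> contents."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.in_script = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Keep only non-empty text that a reader would actually see
        if self.in_body and not self.in_script and data.strip():
            self.chunks.append(data.strip())


def body_text(html):
    """Return the list of visible text fragments found in the page body."""
    parser = VisibleText()
    parser.feed(html)
    return parser.chunks


print(body_text(RAW_HTML))  # the body has no text until the script runs
```

The body contains an empty container and a script tag, so there is nothing to extract. The data only appears once a browser has executed the script, which is the step Selenium provides.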
Example Overview
In this example we'll look at data from the online sportsbook FanDuel (https://sportsbook.fanduel.com). Odds and sporting events are constantly changing, and this data is updated without the page being reloaded, so we need a tool like Selenium to access it.
Our goal for this tutorial will be the following:
- Find an event currently in progress
- Record the type of the event (MLB, NFL, etc.)
- Record the names of the teams (or players if not a team sport)
- Record the moneyline (i.e. who will win) odds for both teams involved at regular intervals
By doing this we will get a record of how the odds for each team change throughout the match. Perhaps this data could be useful for training a model in the future.
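The overall shape of the recording step can be sketched before any scraping code exists. The following is a rough outline under stated assumptions, not the final implementation: record_odds and fake_read_odds are hypothetical names, the CSV column names are our own choice, and the odds values are made-up placeholders. In the real script, read_odds would be a function that pulls the current values from the page through Selenium.

```python
import csv
import time
from datetime import datetime, timezone


def record_odds(read_odds, out_path, samples=3, interval_seconds=60.0):
    """Call read_odds() `samples` times at a fixed interval and write CSV rows.

    read_odds is a hypothetical callable returning a dict with the keys
    'event_type', 'team_a', 'team_b', 'odds_a' and 'odds_b'.
    """
    fields = ["timestamp", "event_type", "team_a", "team_b", "odds_a", "odds_b"]
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        for i in range(samples):
            row = dict(read_odds())
            # Stamp each sample so changes can be tracked over the match
            row["timestamp"] = datetime.now(timezone.utc).isoformat()
            writer.writerow(row)
            # Sleep between samples, but not after the last one
            if i < samples - 1:
                time.sleep(interval_seconds)


def fake_read_odds():
    # Placeholder values, invented for illustration only
    return {"event_type": "MLB", "team_a": "Team A", "team_b": "Team B",
            "odds_a": "-120", "odds_b": "+100"}
```

Calling record_odds(fake_read_odds, "odds.csv", samples=3, interval_seconds=1) would write a header plus three timestamped rows. Keeping the reader function separate from the loop means the polling logic can be tried out without opening a browser.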