Webscraping With Selenium



In this post I’ll talk about the RSelenium package as a tool to navigate websites and how it can be combined with the rvest package to scrape dynamic web pages. To understand this post, you’ll need basic knowledge of rvest, HTML and CSS. You can download the full R script HERE!


Observation: Even if you are not familiar with these tools, I explained everything I did in as much detail as possible. For that reason, those who already know about this stuff might find some parts of the post redundant. Feel free to read what you need and skip what you already know!

Let’s compare the following websites:

On IMDb, if you search for a particular movie (for example, this one), you can see that the URL changes, and that URL is different from any other movie (for example, this one). The same behavior is shown if you search for different actors.

On the other hand, if you go to Premier League Player Stats, you will notice that modifying the filters or clicking the pagination button to access more data doesn’t produce changes on the URL.

As I understand it, the first website is an example of a static web page, while the second one is an example of a dynamic webpage.

The following definitions were taken from https://www.pcmag.com/.

  • Static Web Page: A Web page (HTML page) that contains the same information for all users. Although it may be updated periodically, it does not change with each user retrieval.

  • Dynamic Web Page: A Web page that provides custom content for the user based on the results of a search or some other request. Also known as “dynamic HTML” or “dynamic content”, the “dynamic” term is used when referring to interactive Web pages created for each user.

rvest is a great tool to scrape data from static web pages (check out Creating a Movies Dataset to see an example!).

But when it comes to dynamic web pages, rvest alone can’t get the job done. This is when RSelenium joins the party…

Java

You need to have Java installed. You can use Windows’ Command Prompt to check this. Just type java -version and press Enter. You should see something that looks like this:

If it throws an error, it might mean that you don’t have Java installed. You can download it from HERE.

R Packages

The following packages need to be installed and loaded in order to run the code written in this post.
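The post doesn't list the packages at this point, so the chunk below is my reconstruction based on the functions used later on; adjust it to whatever you actually need.

```r
# Packages assumed throughout this post
library(RSelenium)  # start a Selenium server and drive the browser
library(rvest)      # parse and scrape the page source
library(dplyr)      # general data wrangling
library(purrr)      # map() and reduce()
library(stringr)    # string helpers such as str_to_title()
library(tidyr)      # replace_na()
```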

Starting a Selenium server and browser is pretty straightforward using rsDriver().
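A minimal sketch (the browser and port are just the values I'd try first):

```r
# Start a Selenium server plus a Chrome browser and keep the client handle
rD <- RSelenium::rsDriver(browser = "chrome", port = 4444L)
remDr <- rD$client
```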

However, when you run the code above it may produce the following error:

This error is addressed in this StackOverflow post. Basically, it means that there is a mismatch between the ChromeDriver and the Chrome Browser versions. As mentioned in the post, each version of ChromeDriver supports Chrome with matching major, minor, and build version numbers. For example, ChromeDriver 73.0.3683.20 supports all Chrome versions that start with 73.0.3683.


The parameter chromever defined this way always uses the latest compatible ChromeDriver version (the code was edited from this StackOverflow post).
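The chunk itself isn't reproduced above, but the idea (adapted from the StackOverflow post) looks roughly like this. Treat it as a sketch: how you obtain the installed Chrome build is system specific, and "108.0.5359" below is only a placeholder.

```r
library(binman)
library(stringr)

# ChromeDriver versions that RSelenium/wdman is able to download
available_versions <- unlist(binman::list_versions("chromedriver"))

# Placeholder: the major.minor.build of the Chrome browser installed on your machine
my_chrome_build <- "108.0.5359"

# Latest downloadable ChromeDriver whose version matches that build
matching <- available_versions[str_detect(available_versions, str_c("^", my_chrome_build))]
matching <- matching[order(numeric_version(matching), decreasing = TRUE)]
chrome_driver_version <- matching[1]

rD <- RSelenium::rsDriver(browser = "chrome", chromever = chrome_driver_version)
remDr <- rD$client
```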

After you run rD <- RSelenium::rsDriver(...), if everything worked correctly, a new chrome window will open. This window should look like this:

You can find more information about rsDriver() in the Basics Vignette.

In this section I’ll apply different methods to the remDr object created above. I’m only going to describe the methods that I think will be used most frequently. For a complete reference, check the package documentation.

  • navigate(url): Navigate to a given url.
  • goBack(): Equivalent to hitting the back button on the browser.
  • goForward(): Equivalent to hitting the forward button on the browser.
  • refresh(): Reload the current page.
  • getCurrentUrl(): Retrieve the url of the current page.
  • maxWindowSize(): Set the size of the browser window to maximum. By default, the browser window size is small, and some elements of the website you navigate to might not be available right away (I’ll talk more about this in the next section).
  • getPageSource()[[1]]: Get the current page source. This method, combined with rvest, is what makes it possible to scrape dynamic web pages: the xml document it returns can then be read using rvest::read_html(). The method returns a list object, which is the reason behind the [[1]] (see the short example after this list).
  • open(silent = FALSE): Send a request to the remote server to instantiate the browser. I use this method when the browser closes for some reason (for example, inactivity). If you have already started the Selenium server, you should run this instead of rD <- RSelenium::rsDriver(...) to re-open the browser.
  • close(): Close the current session.
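To make the list concrete, here is a tiny sketch of the method I use the most, getPageSource(), handing the rendered page over to rvest (the URL is just the stats page scraped later in the post):

```r
library(rvest)

remDr$navigate("https://www.premierleague.com/stats/top/players/goals")
remDr$maxWindowSize()
Sys.sleep(3)  # give the page a moment to load

# Hand the rendered HTML over to rvest
page_html <- remDr$getPageSource()[[1]] %>% read_html()
```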

Working with Elements

  • findElement(using, value). Search for an element on the page, starting from the document root. The located element will be returned as an object of webElement class. To use this function you need some basic knowledge of HTML and CSS (or xpath, etc). This chrome extension, called SelectorGadget, might help.

  • highlightElement(): Utility function to highlight current Element. This helps to check that you selected the wanted element.

  • sendKeysToElement(): Send a sequence of key strokes to an element. The key strokes are sent as a list: plain text is entered as an unnamed element of the list, while keyboard entries are defined in 'selKeys' and should be listed with the name 'key'.

  • clearElement(): Clear a TEXTAREA or text INPUT element’s value.

  • clickElement(): Click the element. You can click links, check boxes, dropdown lists, etc. A short example combining these element methods follows this list.
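A short sketch combining these element methods; the selectors are hypothetical and only meant to show how the calls chain together:

```r
# Find a (hypothetical) text input, check it, clear it and type into it
search_box <- remDr$findElement(using = "css selector", value = "input[type='text']")
search_box$highlightElement()    # confirm we grabbed the right element
search_box$clearElement()
search_box$sendKeysToElement(list("selenium", key = "enter"))

# Find a (hypothetical) button and click it
button <- remDr$findElement(using = "css selector", value = "button.submit")
button$clickElement()
```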

Other Methods

Even though I have never used them, I believe these methods are worth mentioning. For more information, check the package documentation.

In this example, I’ll scrape data from Premier League Player Stats. This is what the website looks like:

You will notice that when you modify the Filters, the URL does not change, so you can't use rvest alone to scrape this website dynamically. Also, if you scroll down to the end of the table you'll see that there are pagination buttons. If you click them, you get more data, but again, the URL does not change. Here you can see what those pagination buttons look like:

Observation: Even though choosing a different stat does change the URL, I’ll work as if it didn’t.

Target Dataset

The dataset I want will have the following variables:

  • Player: Indicates the player name.
  • Nationality: Indicates the nationality of the player.
  • Season: Indicates the season the stats correspond to.
  • Club: Indicates the club the player belonged to in the season.
  • Position: Indicates the player position in the season.
  • Stats: One column for each Stat.

For simplicity, I’ll scrape data from seasons 2017/18 and 2018/19, and only from the Goals, Assists, Minutes Played, Passes, Shots and Fouls stats. This means that our dataset will have a total of 11 columns.

Before we start…

Scraping

In order to run the code below, you have to start a Selenium server and browser, and create the remDr object. This procedure was described in the Start Selenium section.

First Steps

The code chunk below navigates to the website, increases the window size to find elements that might be hidden (for example, when the window is small I can’t see the Filters) and then clicks the “Accept Cookies” button.
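The chunk isn't shown here, so below is a sketch of what it does; the cookie-button selector is a placeholder you'd replace after inspecting the page.

```r
remDr$navigate("https://www.premierleague.com/stats/top/players/goals")
Sys.sleep(5)            # give the website enough time to load
remDr$maxWindowSize()   # otherwise some elements (e.g. the Filters) may be hidden

# Placeholder selector for the "Accept Cookies" button
accept_cookies <- remDr$findElement(using = "css selector", value = "button.accept-cookies")
accept_cookies$clickElement()
Sys.sleep(2)
```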

You might notice two things:

  • The use of the Sys.sleep() function. Here, this function is used to give the website enough time to load. Sometimes, if the element you want to find isn’t loaded when you search for it, it will produce an error.

  • The use of CSS selectors. To select an element using CSS you can press F12 and inspect the page source (right clicking the element and selecting Inspect will show you which part of that code refers to the element) and/or use this Chrome extension, called SelectorGadget. I recommend learning a little about HTML and CSS and using these two approaches simultaneously. SelectorGadget helps, but sometimes you will need to inspect the source to get exactly what you want. In the next subsection I’ll show how I selected certain elements by inspecting the page source.

Getting Values to Iterate Over

I know that in order to get the data, I’ll have to iterate over different lists of values. In particular, I need a list of stats, seasons, and player positions.

We can use rvest to scrape the website and get these lists. To do so, we need to find the corresponding nodes. As an example, after the code I’ll show where I searched for the required information in the page source for the stats and seasons lists.

The code below uses rvest to create the lists we’ll use in the loops.
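A sketch of that chunk. The seasons selector follows the data-dropdown-list / data-option-name attributes shown in the Seasons subsection below; the stats selector is an assumption on my part, and the positions are simply the four fixed values.

```r
library(rvest)
library(stringr)

page_html <- remDr$getPageSource()[[1]] %>% read_html()

# Seasons: <li> items inside the FOOTBALL_COMPSEASON dropdown (see below)
list_seasons <- page_html %>%
  html_elements("ul[data-dropdown-list='FOOTBALL_COMPSEASON'] li") %>%
  html_attr("data-option-name")

# Stats: same idea; the attribute value 'stats' is a placeholder
list_stats <- page_html %>%
  html_elements("ul[data-dropdown-list='stats'] li") %>%
  html_attr("data-option-name") %>%
  str_to_title()

# Positions are fixed
list_positions <- c("GOALKEEPER", "DEFENDER", "MIDFIELDER", "FORWARD")
```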

Observation: Even though in the source we don’t see that each word has its first letter uppercased, when we check the dropdown list we see exactly that (for example, we have “Clean Sheets” instead of “Clean sheets”). I was getting an error when trying to scrape these types of stats, and making them look like the dropdown list solved the issue. That’s the reason behind str_to_title().


Stats

This is my view when I open the stats dropdown list and right click and inspect the Clean Sheets stat.

Taking a closer look at the source where that element is present, we get:

Seasons

This is my view when I open the seasons dropdown list and right click and inspect the 2016/17 season.

Taking a closer look at the source where that element is present, we get:

As you can see, we have an attribute named data-dropdown-list whose value is FOOTBALL_COMPSEASON and inside we have li tags where the attribute data-option-name changes for each season. This will be useful when defining how to iterate using RSelenium.


Positions

The logic behind getting the CSS for the positions is similar to the one described above, so I won’t be showing it.

Webscraping Loop

The code has comments on each step, so you can check it out! But before that, I’ll give an overview of the loop.

  1. Preallocate stats vector. This list will have a length equal to the number of stats to be scraped.

  2. For each stat:

    1. Click the stat dropdown list
    2. Click the corresponding stat
    3. Preallocate seasons vector. This list will have a length equal to the number of seasons to be scraped.
    4. For each season inside stat:
      1. Click the seasons dropdown list
      2. Click the corresponding season
      3. Preallocate positions vector. This list will have length = 4 (positions are fixed: GOALKEEPER, DEFENDER, MIDFIELDER and FORWARD).
      4. For each position inside season inside stat
        1. Click the position dropdown list
        2. Click the corresponding position
        3. Check that there is a table with data (if not, go to next position)
        4. Scrape the first table
        5. While “Next Page” button exists
          1. Click “Next Page” button
          2. Scrape new table
          3. Append the new table to the one scraped so far
        6. Change stat colname and add position data
        7. Go to the top of the website
      5. Rowbind each position table
      6. Add season data
    5. Rowbind each season table
    6. Assign the table to the corresponding stat element.

The result of this loop is a populated list with a number of elements equal to the number of stats scraped. Each of these elements is a tibble.

This may take some time to run, so you can choose fewer stats to try it out.

As I mentioned, you can check the code!
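To make the overview more concrete, here is a condensed sketch of the innermost part of the loop (iterating over positions and paging through the table) for a single stat and season. The CSS selectors are assumptions; the real ones come from inspecting the page source as described above.

```r
library(rvest)
library(dplyr)
library(stringr)

# Parse the table currently rendered in the browser
scrape_current_table <- function(remDr) {
  remDr$getPageSource()[[1]] %>%
    read_html() %>%
    html_element("table") %>%
    html_table()
}

positions <- c("GOALKEEPER", "DEFENDER", "MIDFIELDER", "FORWARD")
data_positions <- vector("list", length(positions))

# Placeholder selectors -- replace with the ones found in the page source
position_dropdown_css <- "div.dropDown.position"
next_button_css       <- "div.paginationNextContainer:not(.inactive)"

for (i in seq_along(positions)) {
  # Click the position dropdown, then the corresponding position
  remDr$findElement("css selector", position_dropdown_css)$clickElement()
  Sys.sleep(2)
  remDr$findElement("css selector",
                    str_c("li[data-option-name='", positions[i], "']"))$clickElement()
  Sys.sleep(3)

  # Check that there is a table with data; if not, go to the next position
  if (length(remDr$findElements("css selector", "table tbody tr")) == 0) next

  # Scrape the first table
  table_position <- scrape_current_table(remDr)

  # While the "Next Page" button exists, click it and append the new table
  while (length(remDr$findElements("css selector", next_button_css)) > 0) {
    remDr$findElement("css selector", next_button_css)$clickElement()
    Sys.sleep(3)
    table_position <- bind_rows(table_position, scrape_current_table(remDr))
  }

  # Add the position and store the result
  data_positions[[i]] <- mutate(table_position, Position = positions[i])
}

# Row-bind each position table (season and stat handling wrap around this)
data_season <- bind_rows(data_positions)
```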

Observation: Be careful when you add more stats to the loop. For example, Clean Sheets has the Position filter hidden, so the code should be modified (for example, by adding some “if” statement).

Data Wrangling

Finally, some data wrangling is needed to create our dataset. data_topStats is a list with 6 elements, and each one of those elements is a tibble. The next code chunk removes the Rank column from each tibble, reorders the columns and then makes a full join by all the non-stat variables using reduce() (the reason behind this full join is that not all players have all stats). In the last line of code I replace NA values with zero in the stats variables.
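A sketch of that wrangling step, assuming the column names shown in the table below:

```r
library(dplyr)
library(purrr)
library(tidyr)

id_vars <- c("Season", "Position", "Club", "Player", "Nationality")

data_players <- data_topStats %>%
  # Drop the Rank column and put the identifying variables first
  map(~ .x %>% select(-Rank) %>% select(all_of(id_vars), everything())) %>%
  # Full join by the non-stat variables (not all players have all stats)
  reduce(full_join, by = id_vars) %>%
  # Replace NA values with zero in the stats variables
  mutate(across(-all_of(id_vars), ~ replace_na(.x, 0)))
```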

This is what the data looks like.

Season  | Position | Club                     | Player       | Nationality   | Goals | Assists | Minutes Played | Passes | Shots | Fouls
2018/19 | DEFENDER | Brighton and Hove Albion | Shane Duffy  | Ireland       | 5     | 1       | 3088           | 1305   | 37    | 22
2018/19 | DEFENDER | AFC Bournemouth          | Nathan Aké   | Netherlands   | 4     | 0       | 3412           | 1696   | 25    | 28
2018/19 | DEFENDER | Cardiff City             | Sol Bamba    | Cote D’Ivoire | 4     | 1       | 2475           | 550    | 22    | 35
2018/19 | DEFENDER | Wolverhampton Wanderers  | Willy Boly   | France        | 4     | 0       | 3168           | 1715   | 24    | 29
2018/19 | DEFENDER | Everton                  | Lucas Digne  | France        | 4     | 4       | 2966           | 1457   | 34    | 39
2018/19 | DEFENDER | Wolverhampton Wanderers  | Matt Doherty | Ireland       | 4     | 5       | 3147           | 1399   | 46    | 30

The framework described here is an approach to working in parallel with RSelenium.

First, we load the libraries we need.

The function defined below stops Selenium on each core.

We determine the number of cores we’ll use. In this example, I use four cores.

We have to list the ports that are going to be used to start Selenium.

We use clusterApply() to start Selenium on each core. Pay attention to the use of the Superassignment operator. When you run this function, you will see that four chrome windows are opened.

This is an example of pages that we will open in parallel. This list will change depending on the particular scenario.

Use parLapply() to work in parallel. When you run this, you will see that each browser opens one website, and one is still blank. This is a simple example; I haven’t defined any scraping here, but of course you can!

When you are done, stop Selenium on each core and stop the cluster.
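Putting the steps above together, here is a sketch of the whole parallel workflow; the ports and example pages are arbitrary, and the browser could just as well be Firefox.

```r
library(parallel)
library(RSelenium)

# Function that stops Selenium on a core (called on every worker at the end)
stop_selenium <- function() {
  remDr$close()
  rD$server$stop()
}

# Number of cores and one port per core
cores <- 4
cl <- makeCluster(cores)
ports <- list(4567L, 4444L, 4445L, 5555L)

# Start Selenium on each core. The superassignment (<<-) keeps rD and remDr
# alive in each worker's global environment between calls.
clusterApply(cl, ports, function(port) {
  library(RSelenium)
  rD    <<- RSelenium::rsDriver(browser = "chrome", port = port)
  remDr <<- rD$client
})

# Example pages to open in parallel
pages <- list("https://www.google.com",
              "https://www.nytimes.com",
              "https://www.wikipedia.org")

# Each worker navigates to one page; any scraping code would go inside this function
parLapply(cl, pages, function(page) {
  remDr$navigate(page)
  remDr$getCurrentUrl()[[1]]
})

# When done, stop Selenium on each core and stop the cluster
clusterCall(cl, stop_selenium)
stopCluster(cl)
```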

Observation: Sometimes, when working in parallel some of the browsers close for no apparent reason (or at least a reason that I don’t understand).

Working around the browser closing for no reason

Consider the following scenario: your loop navigates to a certain website, clicks some elements and then gets the page source to scrape using rvest. If in the middle of that loop the browser closes, you will get an error (for example, it won’t navigate to the website, or the element won’t be found). You can work around these errors using tryCatch(), but when you skip the iteration where the error occurred, when you try to navigate to the website in the following iteration, an error would occur again (because there is no browser open!).

You could, for example, use remDr$open() at the beginning of the loop and remDr$close() at the end, but I think that will open and close many browsers and make the process slower.

So I created this function that handles part of the problem (even though the iteration where the browser closed will not finish, the next one will and the process won’t stop).

It basically tries to get the current URL using remDr$getCurrentUrl(). If no browser is open, this will throw an error, and if we get an error, it will open a browser.
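A sketch of that helper (the name is my own):

```r
# If the browser is still open, getCurrentUrl() succeeds and nothing happens;
# if it was closed, the error handler re-opens a browser on the existing server.
check_browser <- function(remDr) {
  tryCatch(
    remDr$getCurrentUrl(),
    error = function(e) remDr$open(silent = TRUE)
  )
}
```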

Closing Selenium

Sometimes, even if the browser window is closed, when you re-run rD <- RSelenium::rsDriver(...) you might encounter an error like:

This means that the connection was not completely closed. You can execute the lines of code below to stop Selenium.
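The lines in question are roughly these (the last one is Windows-specific and comes from the StackOverflow post referenced below; it kills the Java process that is still holding the port):

```r
rD$server$stop()   # stop the Selenium server
rm(rD)             # remove the driver object
gc()               # force garbage collection

# Windows only: kill any java.exe process still holding the port
system("taskkill /im java.exe /f", intern = FALSE, ignore.stdout = FALSE)
```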

You can check this StackOverflow post for more information.

Wrapper Functions

You can create functions in order to type less. Suppose that you navigate to a certain website where you have to click one link that sends you to a site with different tabs. You can use something like this:
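Something along these lines; as the Observation below says, this is purely illustrative, and the URL and selectors are made up:

```r
# Navigate to a site, enter it, and open one of its tabs by name
open_tab <- function(remDr, tab_name) {
  remDr$navigate("https://www.some-website.com")
  Sys.sleep(3)
  remDr$findElement("css selector", "a.enter-site")$clickElement()
  Sys.sleep(3)
  remDr$findElement("link text", tab_name)$clickElement()
}
```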

Observation: this function is theoretical, it won’t work if you run it.

I won’t show it here, but you can create functions to find elements, check if an element exists in the DOM (Document Object Model), try to click an element if it exists, parse the data table you are interested in, etc. You can check this StackOverflow post for examples.

The following list contains different videos, posts and StackOverflow posts that I found useful when learning and working with RSelenium.

Resources

  • The ultimate online collection toolbox: Combining RSelenium and Rvest ( Part I and Part II ). If you know about rvest and just want to learn about RSelenium, I’d recommend watching Part II. It gives an overview of what you can do when combining RSelenium and rvest, and it has nice and practical examples. As a final comment regarding these videos, I wouldn’t pay too much attention to setting up Docker, because I didn’t need to work that way in order to get RSelenium going. In fact, at least now, getting it going is pretty straightforward.

  • RSelenium Tutorial: A Tutorial to Basic Web Scraping With RSelenium. I found this post really useful when trying to set up RSelenium. The solution given in this StackOverflow post, which is mentioned in the article, seems to be enough.

  • Dungeons and Dragons Web Scraping with rvest and RSelenium. This is a great post! It starts with a general tutorial for scraping with rvest and then dives into RSelenium. If you are not familiar with rvest, you can start here.

  • RSelenium Tutorial. This post might be helpful too.

  • RSelenium Package Website. It has more advanced and detailed content. I just took a look at the Basics Vignette.

  • These StackOverflow posts helped me when working with dropdown lists:

  • RSelenium: server signals port is already in use. This post gives a solution to the “port already in use” problem. Even though it is not marked as best, the last line of code of the second answer is useful.

  • Data Scraping in R. Thanks to this post I found the Premier League Stats website, which was exactly what I was looking for to write a post about RSelenium. Also, I took some hints from the answer marked as best.

  • CSS Tutorials:

Advanced

Web scraping is a very useful mechanism to either extract data or automate actions on websites. Normally we would use urllib or requests to do this, but things start to fail when websites use JavaScript to render the page rather than static HTML. For many websites the information is stored in static HTML files, but for others the information is loaded dynamically through JavaScript (e.g. from ajax calls). The reason may be that the information is constantly changing, or it may be to prevent web scraping! Either way, you need more advanced techniques to scrape the information – this is where the selenium library can help.

What is web scraping?

To align on terminology: web scraping, also known as web harvesting or web data extraction, is data scraping used for extracting data from websites. The web scraping script may access the URL directly using HTTP requests or through simulating a web browser. The second approach is exactly how selenium works – it simulates a web browser. The big advantage of simulating the browser is that you can have the website fully rendered – whether it uses JavaScript or static HTML files.


What is selenium?

According to the official Selenium web page, it is a suite of tools for automating web browsers. The project is a member of the Software Freedom Conservancy. Selenium is made up of three projects, each providing different functionality; if you are interested in them, visit the official website. The scope of this blog is limited to the Selenium WebDriver project.

When should you use selenium?

Selenium provides us with the tools to perform web scraping, but when should it be used? You can generally use Selenium in the following scenarios:

  • When the data is loaded dynamically – for example Twitter. What you see in “view source” is different to what you see on the page (The reason is that “view source” just shows the static HTML files. If you want to see under the covers of a dynamic website, right click and “inspect element” instead)
  • When you need to perform an interactive action in order to display the data on screen – a classic example is infinite scrolling. For some websites, you need to scroll to the bottom of the page, and then more entries will show. What happens behind the scene is that when you scroll to the bottom, javascript code will call the server to load more records on screen.

So why not use selenium all the time? It is a bit slower than using requests and urllib. The reason is that selenium simulates running a full browser, including the overhead that this brings with it. There are also a few extra steps required to use selenium, as you can see below.

Once you have the data extracted, you can still use similar approaches to process the data (e.g. using tools such as BeautifulSoup)

Pre-requisites for using selenium

Step 1: Install selenium library

Before starting with a web scraping sample, ensure that all requirements have been set up. Selenium requires pip or pip3 to be installed; if you don’t have it installed, you can follow the official guide to install it based on the operating system you have.

Once pip is installed, you can proceed with the installation of selenium with the following command:
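```bash
pip install selenium
```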

Alternatively, you can download the PyPI source archive (selenium-x.x.x.tar.gz) and install it using setup.py:
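```bash
python setup.py install
```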


Step 2: Install web driver

Selenium simulates an actual browser. It won’t use your Chrome installation, but it will use a “driver”, which is the browser engine used to run a browser. Selenium supports multiple web browsers, so you may choose which web browser to use (read on).

Selenium WebDriver refers to both the language bindings and the implementations of the individual browser controlling code. This is commonly referred to as just a web driver.

The web driver needs to be downloaded, and then it can either be added to the PATH environment variable or initialized with a string containing the path to the downloaded web driver. Environment variables are out of the scope of this blog, so we are going to use the second option.

From here to the end, the Firefox web driver is going to be used, but here is a table containing information regarding each web driver. You are able to choose any of them, but Firefox is recommended if you want to follow this blog.


Download the driver to a common folder which is accessible. Your script will refer to this driver.

You can follow our guide on how to install the web driver here.

A Simple Selenium Example in Python

Ok, we’re all set. To begin with, let’s start with a quick example to ensure things are all working. Our first example will involve collecting a website title. In order to achieve this goal, we are going to use selenium. Assuming it is already installed in your environment, just import webdriver from selenium in a Python file, as shown in the following.

Running the code below will open a Firefox window, which looks a little bit different (as can be seen in the following image), and in the end it prints the title of the website to the console; in this case, it is collecting data from ‘Google’. Results should be similar to the following images:
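A minimal version of that first example, assuming the Selenium 3-style API used elsewhere in this post (Selenium 4 replaces executable_path with a Service object) and that geckodriver sits next to the script:

```python
from selenium import webdriver

# Path to the geckodriver downloaded earlier; adjust it to your system
driver = webdriver.Firefox(executable_path="./geckodriver")

driver.get("https://www.google.com")
print(driver.title)
```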

Note that this was run in the foreground so that you can see what is happening. Now we are going to manually close the Firefox window we opened; it was intentionally opened this way so you can see that the web driver actually navigates just like a human would. But now that this is known, we can add driver.quit() at the end of our code so the window is automatically closed after the job is done. The code now looks like this.
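The same sketch with the extra line at the end:

```python
from selenium import webdriver

driver = webdriver.Firefox(executable_path="./geckodriver")

driver.get("https://www.google.com")
print(driver.title)

driver.quit()  # close the window automatically once the job is done
```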

Now the sample will open the Firefox web driver, do its job and then close the window. With this little and simple example, we are ready to go deeper and learn from a more complex sample.

How To Run Selenium in background

In case you are running your environment in console only, or through putty or another terminal, you may not have access to a GUI. Also, in an automated environment, you will certainly want to run selenium without the browser popping up – e.g. in silent or headless mode. This is where you add the “options” and “--headless” snippet shown below at the start of your script.
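A sketch of the headless setup for Firefox (the same idea works for Chrome with its own Options class):

```python
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.add_argument("--headless")   # run the browser without opening a window

driver = webdriver.Firefox(options=options, executable_path="./geckodriver")
```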

The remaining examples will be run in ‘online’ mode so that you can see what is happening, but you can add the above snippet to help.

Example of Scraping a Dynamic Website in Python With Selenium

Up to here, we have figured out how to scrape data from a static website. With a little bit of time and patience, you are now able to collect data from static websites. Let’s now dive a little bit more into the topic and build a script to extract data from a webpage which is dynamically loaded.

Imagine that you were requested to collect a list of YouTube videos regarding “Selenium”. With that information, we know that we are going to gather data from YouTube, that we need the searching result of “Selenium”, but this result will be dynamic and will change all the time.

The first approach is to replicate what we have done with Google, but now with YouTube, so a new file, yt-scraper.py, needs to be created.

Now we are retrieving the YouTube title and printing it, but we are about to add some magic to the code. Our next step is to edit the search box and fill it with the word that we are looking for, “Selenium”, by simulating a person typing this into the search. This is done by using the Keys class:

from selenium.webdriver.common.keys import Keys.

The driver.quit() line is going to be commented out temporarily so we are able to see what we are performing:
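A sketch of yt-scraper.py at this point; the search-box XPath is the one discussed in the next section:

```python
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Firefox(executable_path="./geckodriver")

driver.get("https://www.youtube.com")
print(driver.title)

# Locate the search box by its XPath, type the query and press Enter
search_box = driver.find_element_by_xpath('//input[@id="search"]')
search_box.send_keys("Selenium")
search_box.send_keys(Keys.RETURN)

# driver.quit()  # commented out temporarily so we can watch what happens
```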

The Youtube page shows a list of videos from the search as expected!


As you might notice, a new function has been called, named find_element_by_xpath, which could be kind of confusing at the moment as it uses strange xpath text. Let’s learn a little bit about XPath to understand a bit more.

What is XPath?

XPath is an XML path used for navigation through the HTML structure of the page. It is a syntax for finding any element on a web page using XML path expression. XPath can be used for both HTML and XML documents to find the location of any element on a webpage using HTML DOM structure.

The above diagram shows how it can be used to find an element. In the above example we had '//input[@id="search"]'. This finds all <input> elements which have an attribute called “id” whose value is “search”. See the image below – under “inspect element” for the search box from YouTube, you can see there’s a tag <input id=”search” … >. That’s exactly the element we’re searching for with XPath.

There are a great variety of ways to find elements within a website, here is the full list which is recommended to read if you want to master the web scraping technique.

Looping Through Elements with Selenium

Now that XPath has been explained, we can move on to the next step: listing videos. Until now we have code that is able to open https://youtube.com, type the word “Selenium” in the search box and hit the Enter key so the search is performed by the YouTube engine, resulting in a bunch of videos related to Selenium. So let’s now list them.

Firstly, right click and “inspect element” on the video section and find the element which is the start of the video section. You can see in the image below that it’s a <div> tag with “id=’dismissable'”

We want to grab the title, so within the video, find the tag that covers the title. Again, right click on the title and “inspect element” – here you can see the element “id=’video-title'”. Within this tag, you can see the text of the title.

One last thing: let’s remember that we are working with the internet and web browsing, so sometimes it is necessary to wait for the data to be available. In this case, we are going to wait 5 seconds after the search is performed and then retrieve the data we are looking for. Keep in mind that the results could vary due to internet speed and device performance.
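A sketch of the full listing, using the id="dismissable" and id="video-title" elements identified above (Selenium 3-style calls):

```python
import time

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Firefox(executable_path="./geckodriver")

driver.get("https://www.youtube.com")
print(driver.title)

# Search for "Selenium"
search_box = driver.find_element_by_xpath('//input[@id="search"]')
search_box.send_keys("Selenium")
search_box.send_keys(Keys.RETURN)

# Wait for the results page to load before collecting the videos
time.sleep(5)

# Each video lives inside a <div id="dismissable">
videos = driver.find_elements_by_xpath('//div[@id="dismissable"]')
print(f"Collected {len(videos)} videos")

# The title sits inside an element with id="video-title"
for video in videos:
    title = video.find_element_by_xpath('.//*[@id="video-title"]').text
    print(title)

driver.quit()
```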

Once the code is executed you are going to see a list printed containing videos collected from YouTube as shown in the following image, which firstly prints the website title, then it tells us how many videos were collected and finally, it lists those videos.

Waiting for 5 seconds works, but then you have to adjust for each internet speed. There’s another mechanism you can use, which is to wait for the actual element to be loaded – you can use this with a try/except block instead.

So instead of the time.sleep(5), you can then replace the code with:
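A sketch using WebDriverWait with the same XPath; wrap it in try/except so a timeout doesn’t crash the script:

```python
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

try:
    # Wait up to 5 seconds for the video containers to be present
    WebDriverWait(driver, 5).until(
        EC.presence_of_element_located((By.XPATH, '//div[@id="dismissable"]'))
    )
except TimeoutException:
    print("Timed out waiting for the videos to load")
    driver.quit()
```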

This will wait up to a maximum of 5 seconds for the videos to load, otherwise it’ll timeout

Conclusion

With Selenium you are going to be able to perform an endless number of tasks, from automation to automated testing; the sky is the limit here. You have learned how to scrape data from static and dynamic websites and how to perform JavaScript-driven actions such as sending keys like “Enter”. You can also look at BeautifulSoup to extract and search for data next.

