Web scraping using R - Part 1

Posted by Fouad Kouzmane on July 20, 2018

What is web scraping?

Web scraping is a commonly used technique in the data science field that allows us to make use of the data available on websites. We could define it as the automated extraction of data from web pages.

Tools we will need

For the sake of simplicity, we will be using the R language, which is widely used for data analysis and data visualization. We will be using RStudio as an editor for our scraping script.

Download RStudio

The RStudio workspace

Step 1: installing rvest

When scraping data using R, we will need the rvest library, a package that simplifies data harvesting with a set of predefined functions.

installing the rvest package

In the console section we will execute this command line:

install.packages('rvest')
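A small convenience worth knowing (an optional pattern, not required by the tutorial): install the package only when it is missing, so the script can be re-run without reinstalling every time.

```r
# Install rvest only if it is not already available
if (!requireNamespace('rvest', quietly = TRUE)) {
  install.packages('rvest')
}
```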

Step 2: reading the webpage

Now we need to set up our script by creating a new file. Then we will load the rvest package, and after that we are going to read the webpage we want to scrape. In this tutorial we will be using this webpage
from the famous IMDb website.


library('rvest')
url <- 'https://www.imdb.com/search/title?year=2017&title_type=feature&page=1&ref_=adv_prv'
webpage <- read_html(url)

Now the webpage variable contains the parsed HTML source code.
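To confirm the page was read correctly, we can peek at a node that exists on any HTML page. A minimal self-contained sketch (the title tag is a safe choice):

```r
library('rvest')

url <- 'https://www.imdb.com/search/title?year=2017&title_type=feature&page=1&ref_=adv_prv'
webpage <- read_html(url)

# html_node selects the first matching node; html_text extracts its text
page_title <- webpage %>% html_node('title') %>% html_text()
print(page_title)
```

If this prints the page's title instead of an error, read_html succeeded and the document is ready to query.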

Step 3: choosing the selectors

The next step is extracting the information we want. Let's say, for example, we want to extract the names of the 50 movies on this webpage. To achieve that, we will use the html_nodes function to select nodes from an HTML document using XPath or CSS selectors (be careful to choose the right CSS selector), and the html_text function to harvest the text of the extracted nodes. The HTML code of the movie title is:

HTML source code of the movie title

As we can see, the associated CSS selector is an a tag nested in the lister-item-header
class, so we will use that in the html_nodes function, then extract the text with html_text.

title_html <- html_nodes(webpage,'.lister-item-header a')
title <- html_text(title_html)
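A quick way to sanity-check the extraction, repeating the steps above in one self-contained snippet (assuming the IMDb search page still lists 50 movies per page):

```r
library('rvest')

url <- 'https://www.imdb.com/search/title?year=2017&title_type=feature&page=1&ref_=adv_prv'
webpage <- read_html(url)

title_html <- html_nodes(webpage, '.lister-item-header a')
title <- html_text(title_html)

length(title)    # should be 50 if the page lists 50 movies
head(title, 3)   # first few movie names
```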

If we repeat the same approach for the metascore, the runtime and the rating:


title_html <- html_nodes(webpage,'.lister-item-header a')
title <- html_text(title_html)

rating_html <- html_nodes(webpage,'.lister-item.mode-advanced strong')
rating <- html_text(rating_html)

metascore_html <- html_nodes(webpage,'.metascore')
metascore <- html_text(metascore_html)

runtime_html <- html_nodes(webpage,'span.runtime')
runtime <- html_text(runtime_html)
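Note that html_text always returns character strings. For analysis we usually want numeric columns, so a cleaning pass is needed. A hedged sketch, using toy inputs that mimic what this page returns (the exact formats, like '120 min' or trailing whitespace in metascores, are assumptions about the page's markup):

```r
# Toy vectors standing in for the scraped strings; in the real script
# these come from html_text() as shown above
rating    <- c('7.5', '6.8')
runtime   <- c('120 min', '95 min')
metascore <- c('81  ', '67  ')

# Convert scraped strings to numeric values
rating_num    <- as.numeric(rating)                     # "7.5" -> 7.5
runtime_num   <- as.numeric(gsub(' min', '', runtime))  # "120 min" -> 120
metascore_num <- as.numeric(trimws(metascore))          # "81  " -> 81
```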

Step 4: structuring the scraped data

After collecting the data we need and putting it into different variables, the only thing left is organizing our data into a more readable format, and we do that by defining a data frame.


data.frame(Title = title, Rating = rating, Runtime = runtime, Metascore = metascore)
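One practical caveat: data.frame() requires all vectors to have the same length, and not every movie on this page has a metascore, so the call can fail. A hedged sketch of one workaround, padding the shorter vectors with NA (note that padding at the end does not realign values with the right movies; robust alignment would require scraping each movie block individually):

```r
# Toy vectors standing in for the scraped data; in the real script
# these come from html_text() as shown above
title     <- c('Movie A', 'Movie B', 'Movie C')
rating    <- c('7.5', '6.8', '8.1')
runtime   <- c('120 min', '95 min', '110 min')
metascore <- c('81', '67')   # one movie has no metascore

n <- length(title)
pad <- function(x) { length(x) <- n; x }   # `length<-` pads with NA

movies_df <- data.frame(Title = title,
                        Rating = pad(rating),
                        Runtime = pad(runtime),
                        Metascore = pad(metascore))
```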

Finally, we get structured data:

data structured in a data frame

Disclaimer: scraping websites may sometimes be illegal because of their terms of service, so to avoid any problems you need to check the rules set by the website owners.

How to check if a website is legally harvestable
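One common starting point is the site's robots.txt file, which states which paths crawlers may fetch. A minimal sketch using the robotstxt package (an assumption on my part: this package is not part of the tutorial and must be installed separately with install.packages('robotstxt'); also note robots.txt is advisory and does not replace reading the terms of service):

```r
library('robotstxt')

# paths_allowed() downloads the site's robots.txt and reports whether
# crawlers are allowed to fetch the given path
paths_allowed(paths = '/search/title', domain = 'imdb.com')
```

A result of TRUE means the path is not disallowed for generic crawlers; FALSE means the site explicitly forbids it.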