What is web scraping?
Web scraping is a commonly used technique in the data science field that lets us make use of the data available on websites: in short, it is the automated extraction of data from web pages.
Tools we will need
For the sake of simplicity, we will be using the R language, which is widely used for data analysis and data visualization. We will use RStudio as the editor for our scraping script.
Download RStudio
Step 1: Installing rvest
To scrape data with R we will need the rvest library, a package that simplifies data harvesting with a set of predefined functions.
In the console, execute this command:
install.packages('rvest')
Step 2: Reading the webpage
Now we set up our script by creating a new file. We then load the rvest package and read the webpage we want to scrape; in this tutorial we will use this search results page from the well-known IMDb website.
library('rvest')
url <- 'https://www.imdb.com/search/title?year=2017&title_type=feature&page=1&ref_=adv_prv'
webpage <- read_html(url)
Now the webpage variable contains the parsed HTML source code of the page.
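To confirm the page was fetched and parsed correctly, a quick sanity check is to inspect the object and print the page's title tag (a minimal sketch; the exact title text depends on what IMDb serves at the time):

```r
library(rvest)

url <- 'https://www.imdb.com/search/title?year=2017&title_type=feature&page=1&ref_=adv_prv'
webpage <- read_html(url)

# The parsed document is an xml_document, not raw text
print(class(webpage))

# Print the contents of the page's <title> tag
print(html_text(html_node(webpage, 'title')))
```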
Step 3: Choosing the selectors
The next step is extracting the information we want. Say, for example, we want to extract the names of the 50 movies on this page. To achieve that we will use the html_nodes function, which selects nodes from an HTML document using XPath or CSS selectors (be careful to choose the right CSS selector), and the html_text function to harvest the text of the extracted nodes. The HTML code of the movie title is:
As we can see, the title is an a element nested in the lister-item-header class, so we will use that selector in the html_nodes function and then extract the text with html_text:
title_html <- html_nodes(webpage,'.lister-item-header a')
title <- html_text(title_html)
Repeating the same approach for the Metascore, the runtime and the rating:
# Movie titles
title_html <- html_nodes(webpage, '.lister-item-header a')
title <- html_text(title_html)

# IMDb ratings
rating_html <- html_nodes(webpage, '.lister-item.mode-advanced strong')
rating <- html_text(rating_html)

# Metascores
metascore_html <- html_nodes(webpage, '.metascore')
metascore <- html_text(metascore_html)

# Runtimes
runtime_html <- html_nodes(webpage, 'span.runtime')
runtime <- html_text(runtime_html)
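One caveat worth noting (an assumption about how such listing pages are often laid out, not something guaranteed above): if some of the 50 movies have no Metascore, the '.metascore' selector returns fewer than 50 nodes and the vectors no longer line up. A defensive sketch is to select the per-movie containers first and use html_node (singular), which yields NA for items without a match, keeping the vectors aligned:

```r
library(rvest)

url <- 'https://www.imdb.com/search/title?year=2017&title_type=feature&page=1&ref_=adv_prv'
webpage <- read_html(url)

# One node per movie listing
items <- html_nodes(webpage, '.lister-item.mode-advanced')

# html_node returns a missing node when an item has no match,
# so html_text yields NA there and all vectors keep the same length
title     <- html_text(html_node(items, '.lister-item-header a'))
metascore <- html_text(html_node(items, '.metascore'))
```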
Step 4: Structuring the scraped data
After collecting the data we need and putting it into separate variables, all that is left is to organize it into a more convenient format by defining a data frame:
movies <- data.frame(Title = title, Rating = rating, Runtime = runtime, Metascore = metascore)
Finally, we get our data in a structured form:
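Note that the scraped columns are still character strings. An optional cleanup step (a sketch using hypothetical sample values in place of the scraped vectors) converts them to numbers for analysis:

```r
# Hypothetical sample values standing in for the scraped data
movies <- data.frame(Title = c('Movie A', 'Movie B'),
                     Rating = c('7.5', '8.1'),
                     Runtime = c('120 min', '95 min'),
                     Metascore = c('70', '85'),
                     stringsAsFactors = FALSE)

# Convert the character columns to numeric values
movies$Rating    <- as.numeric(movies$Rating)
movies$Metascore <- as.numeric(movies$Metascore)

# Strip the " min" suffix from runtimes such as "120 min"
movies$Runtime <- as.numeric(gsub(' min', '', movies$Runtime))
```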
Disclaimer: scraping a website may violate its terms of service, so to avoid any problems you should check the rules set by the website owners before harvesting data.
How to check if a website is legally harvestable
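A common first check is the site's robots.txt file, which lists the paths the owners allow or disallow for automated clients. The robotstxt package (an assumption: a separate CRAN package, not part of rvest) can query it directly:

```r
# install.packages('robotstxt')
library(robotstxt)

# Returns TRUE if a generic crawler is allowed to fetch this path
paths_allowed('https://www.imdb.com/search/title')
```

Keep in mind that robots.txt expresses the owners' crawling preferences; the legally binding rules are in the site's terms of service.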