
Web Scraping Techniques

Learn how to extract data from websites

Web scraping using R (Part 1)

What is web scraping?

Web scraping is a method commonly used in data science to make use of the data available on websites. We can define it as the accurate and reliable extraction of data from web pages.

Tools we will need

For the sake of simplicity, we will use the R language, which is widely used for data analysis and data visualization. We will use RStudio as the editor for our scraping script.

Step 1: installing rvest

To scrape data with R we need rvest, a package that simplifies data harvesting with a set of predefined functions.

In the console, we run this command:
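The command is not reproduced here. A minimal sketch, assuming the package is installed from CRAN with the standard install.packages function; the requireNamespace guard is an addition that skips the download when rvest is already installed.

```r
# Install rvest from CRAN only if it is not already available.
if (!requireNamespace("rvest", quietly = TRUE)) {
  install.packages("rvest")
}
```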

Step 2: reading the webpage

Now we set up our script by creating a new file and loading the rvest package. After that, we read the webpage we want to scrape. In this tutorial we will use this webpage from the famous IMDb website.
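A minimal sketch of this setup step. The address of the IMDb page is not reproduced here, so the sketch parses a small inline HTML document as a stand-in. The markup, the "Movie A" and "Movie B" titles, and the html_source name are assumptions for illustration only.

```r
library(rvest)

# Stand-in for the real page: a tiny inline HTML document, so the sketch
# runs without network access. For a live page, pass its address to
# read_html() instead of this string.
html_source <- '<html><body>
  <h3 class="lister-item-header"><a href="/title/1">Movie A</a></h3>
  <h3 class="lister-item-header"><a href="/title/2">Movie B</a></h3>
</body></html>'

# Parse the HTML into a document object that rvest can query.
webpage <- read_html(html_source)

# Inspect the type of the parsed document.
class(webpage)
```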

Now the webpage variable contains the parsed HTML source code.

Step 3: choosing the selectors

The next step is extracting the information we want. Say, for example, that we want to extract the names of the 50 movies on this webpage. To do that, we use the html_nodes function to select nodes from an HTML document using XPath or CSS selectors (be careful to choose the right CSS selector), and the html_text function to harvest the text of the extracted nodes. The HTML code of the movie title is:

As we can see, the associated CSS selector targets the a element nested inside the element with the lister-item-header class, so we pass that selector to the html_nodes function and then extract the text with html_text.
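The selection described above can be sketched as follows. The inline markup is a hypothetical stand-in that only mirrors the described structure (an a element inside an element carrying the lister-item-header class); the h3 tag, the links, and the titles are assumptions, not the real page content.

```r
library(rvest)

# Hypothetical markup mirroring the structure described in the text.
webpage <- read_html('<html><body>
  <h3 class="lister-item-header"><a href="/title/1">Movie A</a></h3>
  <h3 class="lister-item-header"><a href="/title/2">Movie B</a></h3>
</body></html>')

# Select every <a> nested inside .lister-item-header ...
title_nodes <- html_nodes(webpage, ".lister-item-header a")

# ... then keep only the text content of each selected node.
titles <- html_text(title_nodes)

titles
```

In rvest 1.0.0 and later, html_elements is the newer name for html_nodes; the older name still works.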


If we repeat the same process for the metascore, the runtime, and the rating:
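A sketch of the same pattern applied to the other fields. The selectors .metascore, .runtime, and .ratings-imdb-rating strong are assumptions about the page markup, as are the inline stand-in document and its values; verify the real selectors in your browser's inspector. The conversion to numbers at the end is an optional addition, since html_text returns character strings.

```r
library(rvest)

# Hypothetical stand-in markup with one block per movie.
webpage <- read_html('<html><body>
  <div class="lister-item">
    <span class="runtime">120 min</span>
    <div class="ratings-imdb-rating"><strong>8.1</strong></div>
    <span class="metascore">75</span>
  </div>
  <div class="lister-item">
    <span class="runtime">95 min</span>
    <div class="ratings-imdb-rating"><strong>7.4</strong></div>
    <span class="metascore">68</span>
  </div>
</body></html>')

# Same two-step pattern as for the titles: select nodes, then extract text.
metascore <- html_text(html_nodes(webpage, ".metascore"))
runtime   <- html_text(html_nodes(webpage, ".runtime"))
rating    <- html_text(html_nodes(webpage, ".ratings-imdb-rating strong"))

# Optional cleanup: turn the extracted strings into numbers.
metascore <- as.numeric(trimws(metascore))
runtime   <- as.numeric(gsub(" min", "", runtime, fixed = TRUE))
rating    <- as.numeric(rating)
```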

Step 4: structuring the scraped data

After collecting the data we need and storing it in separate variables, the only thing left is to organize it into a more convenient format by defining a data frame.
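A minimal sketch of this step. The vectors below are placeholder values standing in for the scraped results, and the column names are assumptions. The length check is there because data.frame needs one entry per movie in every column, which can fail if a field is missing for some movies on the page.

```r
# Placeholder vectors standing in for the scraped results.
titles    <- c("Movie A", "Movie B")
runtime   <- c(120, 95)
rating    <- c(8.1, 7.4)
metascore <- c(75, 68)

# Every column must have the same length: one entry per movie.
stopifnot(length(unique(lengths(list(titles, runtime, rating, metascore)))) == 1)

# Combine the vectors into a single data frame, one row per movie.
movies <- data.frame(
  title       = titles,
  runtime_min = runtime,
  rating      = rating,
  metascore   = metascore,
  stringsAsFactors = FALSE
)
```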

Finally, we get structured data:
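The original output is not reproduced here. As a sketch, the result can be inspected with base R's str and head functions; the data frame below uses placeholder values and assumed column names, not real scraped results.

```r
# Placeholder data frame standing in for the scraped result.
movies <- data.frame(
  title       = c("Movie A", "Movie B"),
  runtime_min = c(120, 95),
  rating      = c(8.1, 7.4),
  metascore   = c(75, 68),
  stringsAsFactors = FALSE
)

# Show the column types and a preview of each column.
str(movies)

# Show the first rows as a table.
head(movies)
```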

Disclaimer: scraping a website is not always legal, since its terms of service may forbid it. To avoid any problems, check the rules set by the website owners before scraping.


Author: INSEA IT Team