
Data Scraping in R Programming: Part 3 (Importing Tables from HTML, Cleaning, and more)

  • Lalit Salunkhe
  • Oct 04, 2020

In the previous two articles in this series on “Data Scraping in R Programming”, we covered scraping CSV, Excel, and ZIP files from a web URL, as well as scraping HTML data from the web. As I mentioned in the first article of the series, this topic is vast: you can go very deep into it and still feel as though you are only knee-deep in water. This part, however, is the last one, and it focuses on scraping tables from HTML webpages and cleaning the data scraped from them.


 

Scraping HTML Tables

 

One of the most common formats in which data is stored on a webpage is a table. If you have gone through the previous article in this series, Data Scraping in R Programming: Part 2, you may remember that elements in HTML are written between tags (a start tag and an end tag). In other words, the content (a paragraph, a header, and so on) is stored between tags. Tables, likewise, live under the <table> tag. We will use the same package (rvest) that we used in the previous article to scrape HTML tables.

 

In this article, the data we are going to use comes from the BLS Employee Statistics webpage. If you follow the link and browse through the page, you will find around 10 HTML data tables; some additional table elements are used purely to format the webpage (for example, the table of contents).

 

To extract tables from this webpage, we use the same functions as in the previous article. The read_html() function reads the webpage into the R environment, while html_nodes() extracts the elements under the <table> node. The example below shows how to find out how many elements are present under the <table> tag/node on the given webpage.


Extracting the table nodes from the webpage with html_nodes() and listing them in the R workspace
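The code in the screenshot is not reproduced in the text, so here is a minimal sketch of this step, assuming rvest is installed. The bls_url value is only a placeholder for the BLS Employee Statistics page used in the article; the parsed page is stored in an object called data, as in the screenshot.

# Load rvest to work with HTML data
library(rvest)

# Placeholder: replace with the URL of the BLS Employee Statistics webpage
bls_url <- "https://www.bls.gov/"

data <- read_html(bls_url)                  # read the webpage into the R environment
table_nodes <- html_nodes(data, "table")    # select every element under the <table> tag
table_nodes                                 # on the article's page this lists 16 table nodes
length(table_nodes)                         # count of <table> elements found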


In this example, you first need to load the “rvest” package into the R workspace using the library() function in order to work with HTML data. The package is already installed on my system; if you are not sure how to install a package in R, you should read our article Packages in R Programming.

 

Here, read_html() allows us to read the data from the given HTML page, and we store the result in an object named “data”. The html_nodes() function then returns the “table” nodes from “data”. Note that html_nodes() only parses the table tags from the given webpage; it does not capture the tables' contents. To access the actual tables from the list of 16 table nodes found on the webpage, we need the html_table() function.

 

Remember, there are 16 table nodes on this webpage (see the output of the last code we ran). Do we really need to access all of these tables right away? Not necessarily!

 

We will be looking at specific tables among these 16. We will extract the first and second data tables from the list: the first represents total employment in the nonfarm sector, adjusted, for 2019, and the second represents the adjusted employment data for the major industry sectors as of 2019. The code below gives us these two tables.


Reading specific tables from the list of all tables on the webpage in R
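Sketching the code from the screenshot: we keep the second and third table nodes (the first node is the table of contents), convert them with html_table(), and store the result as tables_list.

# Extract the 2nd and 3rd <table> nodes and turn them into data frames;
# the pipe (%>%) is re-exported by rvest
tables_list <- data %>%
  html_nodes("table") %>%
  .[2:3] %>%
  html_table()

str(tables_list)   # a list of two tables, each with 24 observations and 6 variables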


One thing to note here is that these tables are stored as elements of a list. The list, named tables_list, contains two elements; each element is a table with 24 observations and 6 variables.

 

If you look at the code above, the pipe operator plays an important role: we extract the second and third elements from the set of table nodes. Remember that the first data table actually sits at the second position in the list (the first position is taken by the table of contents).

 

The str() function lets us see the structure of the list we just created. It reports the number of observations, the number of variables, the data type of each variable, and so on.

 

Well, if you are curious about functions in R programming, you can go through the following articles of ours on functions:

 

  1. User-Defined Functions in R

  2. R Programming Vector Functions

  3. Character Functions in R

  4. Matrix Functions in R

  5. Recursive Functions in R

 

Now we can view the first few rows of the tables we extracted using the head() function. See the code and output below:


Using the head() function to get the first four rows from each table
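Assuming the list is named tables_list as above, the head() calls from the screenshot look roughly like this; the 4 matches the four rows shown in the output.

head(tables_list[[1]], 4)   # first four rows of the total nonfarm employment table
head(tables_list[[2]], 4)   # first four rows of the major industry sectors table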


We can do a bit of cleaning on this output. The first row of each table is part of a split heading, which is one of the major drawbacks of the html_table() function: it does not handle split headers nicely. The piece of code below does the clean-up for us.


Cleaning the tables extracted from the webpage
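A minimal sketch of that clean-up: drop the first row (the split-header fragment) from each table and assign proper column names. The names generated below are illustrative placeholders, not the actual headers from the BLS page.

# Remove the split-header row and rename the columns of the first table
table1 <- tables_list[[1]][-1, ]
names(table1) <- paste0("col_", seq_len(ncol(table1)))   # replace with meaningful names

# The same treatment for the second table
table2 <- tables_list[[2]][-1, ]
names(table2) <- paste0("col_", seq_len(ncol(table2)))

head(table1)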


As you can see, we removed the first line of each table, which contained the unnecessary header fragments, and then renamed the actual headers.


 

Scraping HTML Tables Using the XML Package

 

The XML package in R also allows us to read table data from a webpage. For that to work, you first need to install the package and load it into the R workspace.


Installing and loading the “XML” package in the R workspace
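If the package is not yet on your system, installing and loading it looks like this:

install.packages("XML")   # install once
library(XML)              # load it into the current session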


This package has a function, readHTMLTable(), which is more convenient than the html_table() function from rvest. It stores a table as a data frame; if there are multiple tables, it stores each one as a separate element of a list.

 

Let's see an example where we extract tables using the readHTMLTable() function from the XML package.


Reading tables from the given webpage using the readHTMLTable() function from the XML package
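A sketch of that step: readHTMLTable() cannot fetch an https page directly, so here the raw HTML is downloaded first with RCurl::getURL() (my own workaround, not confirmed by the article), and the URL is only a placeholder for the Statistics Times page mentioned below.

library(XML)
library(RCurl)   # used here only to download the page over https

# Placeholder: replace with the Statistics Times projected-GDP page used in the article
gdp_url <- "https://statisticstimes.com/"

raw_html  <- getURL(gdp_url)           # fetch the raw HTML as a character string
XML_tabls <- readHTMLTable(raw_html)   # parse every <table> into a data frame

length(XML_tabls)                      # one list element per table found on the page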


As the output shows, reading tables from a webpage with the readHTMLTable() function from the XML package is straightforward. The link we used here is provided by Statistics Times and shows a table of countries ranked by projected GDP.

 

In the next step, we will look at what kind of tables we have extracted using the str() function. After that, we will pull each table out of the list, using list slicing in combination with head(), to get a better overview of each one independently. See the output below:


Inspecting each of the extracted tables one by one
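Assuming the list created above is XML_tabls, the inspection step can be sketched as:

str(XML_tabls, max.level = 1)   # what kind of tables did we get back?

head(XML_tabls[[1]], 4)         # first table: the table-of-contents style summary
head(XML_tabls[[2]], 4)         # second table: the projected GDP figures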


Here, as you can see, the first table is nothing but a table of contents with some basic information about the actual data table. If you want, you can eliminate it by truncating the first element of the list XML_tabls.
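Dropping that table of contents is a one-liner:

XML_tabls <- XML_tabls[-1]   # remove the first element (the table of contents)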


 

Advantages of XML over rvest

 

  1. We can specify the column headers within the readHTMLTable() function itself instead of creating a separate names object.

  2. The classes for each column can be specified within the readHTMLTable() function as well.

  3. The same goes for rows to skip: we can pass the rows to skip to the function itself (see the short sketch after this list).
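Here is a sketch of how those arguments look in practice; the header names, colClasses, and skip.rows values below are illustrative assumptions, not taken from the article.

# Read only the second table, naming its columns, setting their classes,
# and skipping the first (split-header) row, all inside readHTMLTable()
gdp_table <- readHTMLTable(
  raw_html,
  which      = 2,                                     # pick a single table
  header     = c("rank", "country", "gdp", "share"),  # illustrative column names
  colClasses = c("integer", "character", "numeric", "numeric"),
  skip.rows  = 1                                      # drop the first row
)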


 

Summary

 

  • To read tables from an HTML webpage, the most commonly used and most popular package is rvest.

  • The html_table() function allows you to read the tables stored under HTML <table> tags.

  • We can clean the data up using simple techniques such as truncating the unnecessary tables, rows, columns, etc.

  • If multiple tables are extracted from the webpage, they are stored as individual elements of a list object; a single table is returned as a data frame.

  • The XML package is more convenient to use than the rvest package if you prefer a simpler coding structure.

  • The readHTMLTable() function reads the table(s) from the given webpage and stores them as either a data frame or a list.

 

This article ends here. In our next article, we will cover an interesting topic focused on exporting data from the R workspace. Until then, stay safe!
