Scraping coral reef data from a JS website

Web scraping is a powerful tool for data extraction and analysis. It involves using software to gather and process data from websites, social media platforms, and other online sources. Why is web scraping so useful? For starters, it lets us collect and analyze large amounts of data quickly and efficiently. With a web scraper, you can gather information from multiple sources and compile it into a single dataset for purposes such as market research, lead generation, social media monitoring, and content aggregation.

But that’s just the tip of the iceberg. One of the most significant advantages of a web scraper is that it can gather data from sources that don’t offer an API, or where the data you need is not available in a public database or downloadable as CSV files.

Gathering data from ReefBase

ReefBase is a data repository that provides access to information on coral reefs around the world. It was created by the international organization WorldFish Center to ease the monitoring and analysis of coral reef health and the quality of life of reef-dependent people. However, it has one small issue: there are no data download options. The solution for taking advantage of such a great online repository was to build a web scraper that extracts reef and marine data for each country. This is our page of interest:

[Screenshot: ReefBase home page]

The main characteristic of ReefBase is that it is a JavaScript website: users interact with the interface to search across the repository, for example by region or country through dropdown menus. Because of this, I used Selenium to create a bot that mimics human interaction with the web page: it locates specific elements on the page, selects each country from the dropdown, and clicks the search button. The program then automatically walks through the reef data for each country and downloads the information we need.

The result is amazing! First, let’s see this video where we show how the bot works. Look at how the dropdown and the data change after each iteration. After that, I show how I built the scraper.
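Before the loop below can run, a browser session has to be started, pointed at the ReefBase search page, and we need to know how many options the country dropdown holds. Here is a minimal setup sketch assuming the RSelenium package with a local Firefox driver; the port is arbitrary, and the URL shown is the repository's home page rather than the exact search page, which you would navigate to yourself.

library(RSelenium)
library(rvest)
library(dplyr)

# Start a local Selenium session with Firefox (port is arbitrary)
rD <- rsDriver(browser = "firefox", port = 4545L, verbose = FALSE)
driver <- rD$client

# Navigate to ReefBase (replace with the advanced-search page you want to scrape)
driver$navigate("http://www.reefbase.org/")

# Count the options in the country dropdown to know how many iterations we need
country_options <- driver$findElements(
  using = "xpath",
  value = "//td//select[@name = 'ctl00$Content$cboCountry']//option"
)
n_countries <- length(country_options)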

# Collect the reef statistics table for every country in the dropdown
full_data = data.frame()

for (j in seq_len(n_countries)){
  print(j)
  
  # Skip dropdown options that are separators or have no data behind them
  if (j %in% c(1, 39, 46, 63, 94, 112, 123)) {next}
  
  # Select the j-th country in the dropdown
  xpath = paste0("//td//select[@name = 'ctl00$Content$cboCountry']//option[", j, "]")
  Sys.sleep(2)
  driver$findElement(value = xpath)$clickElement()
  
  # Click the search button and wait for the page to refresh
  Sys.sleep(2)
  driver$findElement(value = paste0('//div/input[@type = "submit" and ',
                                    '@name = "ctl00$Content$btnAdvancedSearch"]'))$clickElement()
  
  # Read every HTML table on the rendered page
  page_source <- driver$getPageSource()[[1]]
  all_tables = page_source %>% 
     read_html() %>% 
     html_table() 
  
  # The reef statistics live in table 143 of the page
  data <- all_tables[[143]]
  
  # The first row holds the column names; drop it after using it
  colnames(data) = tolower(data[1, ])
  data = data[-1, ]
  
  data = data %>% 
    rename(area_statistics = 1, unit = 3) %>% 
    select(!c(graph, table)) %>% 
    mutate(value = as.numeric(value),
           ctr_code = j)
  
  full_data = full_data %>%
    rbind(data)

}
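The loop returns a long dataframe with one row per statistic and country (area_statistics, unit, value, ctr_code). To get the one-row-per-country table shown below, the statistics need to be spread into columns. Here is a minimal sketch of that reshaping, assuming tidyr and stringr and the column names produced by the loop; joining the country names onto ctr_code, and the reef_summary name itself, are my own additions and not part of the original code.

library(dplyr)
library(tidyr)
library(stringr)

reef_summary <- full_data %>%
  # Make the statistic labels usable as column names
  mutate(area_statistics = str_replace_all(area_statistics, "\\s+", "_")) %>%
  select(ctr_code, area_statistics, value) %>%
  # One row per country, one column per statistic
  pivot_wider(names_from = area_statistics, values_from = value)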

Once we had a dataframe with the reef information by country, some cleaning was applied (as sketched above), and the result is a dataset ready for analysis and visualization:

country Marine_Area Shelf_Area Coastline Land_Area Reef_Area Reefs_At_Risk
Australia 6664100 2065200 25760 7682300 48960 32
Bahrain 8000 10000 161 665 760 82
Bangladesh 39900 59600 580 130168 50 100
Barbados 48800 320 97 430 90 100
Belize 12800 8700 386 22806 1420 63
Brazil 3442500 711500 7491 8459417 1200 84
Colombia 706100 16200 3208 1038700 2060 44
Costa Rica 542100 14800 1290 51060 30 93
Dominican Rep 246500 5900 360 48320 1350 89
Egypt 185300 50100 2450 995450 3800 61
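As a quick illustration of the kind of visualization this dataset enables, the sketch below plots the share of reefs at risk per country. It assumes reef_summary is the hypothetical cleaned dataframe from the reshaping step above, with country names already joined on, and that Reefs_At_Risk is a percentage.

library(ggplot2)

reef_summary %>%
  # 'country' assumes country names were joined onto ctr_code beforehand
  ggplot(aes(x = reorder(country, Reefs_At_Risk), y = Reefs_At_Risk)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Reefs at risk (assumed %)",
       title = "Share of coral reefs at risk by country")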

The full code is available on GitHub.