Scraping coral reef data from a JS website

Web scraping is a powerful tool for data extraction and analysis. It involves using software to gather and process data from websites, social media platforms, and other online sources. Why is web scraping so useful? For starters, it lets us collect and analyze large amounts of data quickly and efficiently. With a web scraper, you can gather information from multiple sources and compile it into a single dataset for purposes such as market research, lead generation, social media monitoring, and content aggregation.

But that’s just the tip of the iceberg. One of the most significant advantages of a web scraper is that it can gather data from sources that don’t offer an API, or where the data you need is not available in a public database or downloadable as CSV files.

Gathering data from ReefBase

ReefBase is a data repository that provides access to information on coral reefs around the world. It was created by the international organization WorldFish Center to ease the monitoring and analysis of coral reef health and the quality of life of reef-dependent people. However, it has one small issue: there are no data download options. The solution for taking advantage of such a great online repository was to build a web scraper that extracts reef and marine data for each country. This is our page of interest:

[Screenshot: ReefBase home page]

The main characteristic of ReefBase is that it is a JavaScript website: users interact with the interface to search across the repository, for example by region or country through dropdown menus. Because of this, I used Selenium to create a bot that mimics human interaction with the web page: it locates specific elements on the page, selects each country from the dropdown, and clicks the search button. The program then automatically walks through the reef data for each country and downloads the information we need.

The result is amazing! First, let’s see this video where we show how the bot works. Look at how the dropdown and the data change after each iteration. After that, I show how I built the scraper.
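Before the loop below can run, a browser session has to be started, pointed at the ReefBase search page, and we need to know how many options the country dropdown holds. Here is a minimal setup sketch assuming the RSelenium package with a local Firefox driver; the port is arbitrary, and the URL shown is the repository's home page rather than the exact search page, which you would navigate to yourself.

library(RSelenium)
library(rvest)
library(dplyr)

# Start a local Selenium session with Firefox (port is arbitrary)
rD <- rsDriver(browser = "firefox", port = 4545L, verbose = FALSE)
driver <- rD$client

# Navigate to ReefBase (replace with the advanced-search page you want to scrape)
driver$navigate("http://www.reefbase.org/")

# Count the options in the country dropdown to know how many iterations we need
country_options <- driver$findElements(
  using = "xpath",
  value = "//td//select[@name = 'ctl00$Content$cboCountry']//option"
)
n_countries <- length(country_options)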

# Collect the reef statistics table for every country in the dropdown
full_data = data.frame()

for (j in seq_len(n_countries)){
  print(j)
  
  # Skip dropdown options that are separators or have no data behind them
  if (j %in% c(1, 39, 46, 63, 94, 112, 123)) {next}
  
  # Select the j-th country in the dropdown
  xpath = paste0("//td//select[@name = 'ctl00$Content$cboCountry']//option[", j, "]")
  Sys.sleep(2)
  driver$findElement(value = xpath)$clickElement()
  
  # Click the search button and wait for the page to refresh
  Sys.sleep(2)
  driver$findElement(value = paste0('//div/input[@type = "submit" and ',
                                    '@name = "ctl00$Content$btnAdvancedSearch"]'))$clickElement()
  
  # Read every HTML table on the rendered page
  page_source <- driver$getPageSource()[[1]]
  all_tables = page_source %>% 
     read_html() %>% 
     html_table() 
  
  # The reef statistics live in table 143 of the page
  data <- all_tables[[143]]
  
  # The first row holds the column names; drop it after using it
  colnames(data) = tolower(data[1, ])
  data = data[-1, ]
  
  data = data %>% 
    rename(area_statistics = 1, unit = 3) %>% 
    select(!c(graph, table)) %>% 
    mutate(value = as.numeric(value),
           ctr_code = j)
  
  full_data = full_data %>%
    rbind(data)

}
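The loop returns a long dataframe with one row per statistic and country (area_statistics, unit, value, ctr_code). To get the one-row-per-country table shown below, the statistics need to be spread into columns. Here is a minimal sketch of that reshaping, assuming tidyr and stringr and the column names produced by the loop; joining the country names onto ctr_code, and the reef_summary name itself, are my own additions and not part of the original code.

library(dplyr)
library(tidyr)
library(stringr)

reef_summary <- full_data %>%
  # Make the statistic labels usable as column names
  mutate(area_statistics = str_replace_all(area_statistics, "\\s+", "_")) %>%
  select(ctr_code, area_statistics, value) %>%
  # One row per country, one column per statistic
  pivot_wider(names_from = area_statistics, values_from = value)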

Once we had a dataframe with the reef information by country, some cleaning was applied (as sketched above), and the result is a dataset ready for analysis and visualization:

country Marine_Area Shelf_Area Coastline Land_Area Reef_Area Reefs_At_Risk
Australia 6664100 2065200 25760 7682300 48960 32
Bahrain 8000 10000 161 665 760 82
Bangladesh 39900 59600 580 130168 50 100
Barbados 48800 320 97 430 90 100
Belize 12800 8700 386 22806 1420 63
Brazil 3442500 711500 7491 8459417 1200 84
Colombia 706100 16200 3208 1038700 2060 44
Costa Rica 542100 14800 1290 51060 30 93
Dominican Rep 246500 5900 360 48320 1350 89
Egypt 185300 50100 2450 995450 3800 61
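As a quick illustration of the kind of visualization this dataset enables, the sketch below plots the share of reefs at risk per country. It assumes reef_summary is the hypothetical cleaned dataframe from the reshaping step above, with country names already joined on, and that Reefs_At_Risk is a percentage.

library(ggplot2)

reef_summary %>%
  # 'country' assumes country names were joined onto ctr_code beforehand
  ggplot(aes(x = reorder(country, Reefs_At_Risk), y = Reefs_At_Risk)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Reefs at risk (assumed %)",
       title = "Share of coral reefs at risk by country")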

The full code is available on GitHub.