Scraping dl, dt, dd HTML Data with Rvest and Hidden API Endpoints

Introduction

When scraping web data, particularly from websites that use HTML structures like dl, dt, and dd elements, we often encounter issues with extracting the desired information. This post aims to provide an overview of two approaches for scraping this type of HTML data using R programming language.

Understanding the Problem

The question at hand concerns scraping publicly available house descriptions from an online property search using rvest and SelectorGadget, where nothing is returned. The code snippet below attempts to accomplish this task but doesn’t yield any results.

Background and Context

Before diving into the solutions, it’s essential to understand some basic concepts related to web scraping:

  • HTML Structure: HTML elements like dl (definition list), dt (definition term), and dd (definition description) are used to group related information in a structured manner.
  • Web Scraping Libraries: rvest is a popular library for extracting data from the web using R, while SelectorGadget can be used to locate specific elements on a webpage.
  • Hidden API: Some websites provide an alternative endpoint or API for accessing their data. This approach avoids the need for rendering the webpage in a browser and can lead to faster results.
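To make the dl/dt/dd structure concrete, here is a small sketch of extracting term/description pairs with rvest from an inline HTML fragment (the fragment below is illustrative, not taken from the real site):

```r
library(rvest)

# A minimal dl/dt/dd fragment, parsed from an inline string
page <- read_html('
  <dl>
    <dt>Assessment Year</dt><dd>2024</dd>
    <dt>Location</dt><dd>Fredericton</dd>
  </dl>')

terms  <- page %>% html_nodes("dt") %>% html_text()
values <- page %>% html_nodes("dd") %>% html_text()

# Pair each definition term with its description
setNames(values, terms)
```

Because each dt is paired positionally with the following dd, extracting the two node sets separately and zipping them together recovers the key/value structure of the list.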

Method 1: Using rvest and SelectorGadget

While using rvest and SelectorGadget is technically possible, it’s not the recommended approach here, for several reasons:

  • Browser Rendering: The most significant limitation of this method is that it relies on rendering the webpage in a browser, which can be slower and less efficient than other approaches.
  • Specificity Issues: Using SelectorGadget might lead to specificity issues when locating the desired elements on the webpage.

The original attempt looked like this:

library(rvest)

Site <- "https://paol.snb.ca/paol.html?lang=en&pan=00100004"
snb <- read_html(Site)
snb %>% html_nodes("dd") %>% html_text()
# returns character(0): the dd elements are populated by JavaScript,
# so they are absent from the raw HTML that read_html sees

Method 2: Using the Hidden API with rvest and httr

The recommended approach is to use the hidden API endpoint, which bypasses the need for rendering the webpage in a browser. Here’s how you can accomplish this using rvest and httr:

library(rvest)
library(httr)

myurl <- "https://paol.snb.ca/pas-shim/api/paol/dossier/00100004"

# User Agent Header
ua <- user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36")

# Cookie Header (paste your own cookie from the browser;
# without it the request fails with a "no cookie available" error)
my_cookie <- "copy_your_cookie_from_browser"

# Create a session with the API URL, user agent, and cookie
# (note: in rvest >= 1.0, html_session() was renamed session())
my_session <- html_session(myurl, ua,
                           add_headers(Cookie = my_cookie))

# Parse the JSON body of the response into an R list
result_list <- httr::content(my_session$response, as = "parsed")

Example Usage

To demonstrate the usage of the hidden API endpoint, run the code above. The resulting result_list is an ordinary R list parsed from the JSON response, containing the details of the requested property.
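Since the real endpoint requires a session cookie, the shape of the parsed result can be sketched offline with jsonlite. The field names below (description, imageKey) are assumptions about the API’s JSON shape, chosen to match the extraction code:

```r
library(jsonlite)

# Hypothetical JSON payload standing in for the API response;
# the real field names and values may differ
payload <- '{"description": "2-storey dwelling", "imageKey": "abc123"}'
result_list <- fromJSON(payload)

result_list$description
result_list$imageKey
```

With the list in hand, individual fields are accessed with the usual `$` operator, exactly as in the extraction snippet below.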

# Extract specific fields from the result list
description <- result_list$description
imageKey <- result_list$imageKey

# Print the house description
print(paste("Description:", description))

By following this approach, you can efficiently scrape HTML data using the hidden API endpoint and improve your web scraping workflow.


Last modified on 2024-09-21