R - Read from URL/Web

Key Insights

  • R provides multiple approaches for reading web data: base R functions like readLines() and download.file(), the httr package for RESTful APIs, and rvest for web scraping HTML content
  • Authentication, error handling, and rate limiting are critical considerations when building production-ready web data pipelines in R
  • Modern R workflows benefit from httr2 for API interactions and polite for ethical web scraping with automatic throttling and robots.txt compliance

Reading Plain Text from URLs

Base R handles simple URL reading through readLines() and url() connections. This works for plain text, CSV files, and basic HTTP requests without authentication.

# Read plain text from URL
csv_url <- "https://raw.githubusercontent.com/datasets/covid-19/master/data/countries-aggregated.csv"
data <- read.csv(csv_url)  # read.csv() accepts a URL string directly
head(data)

# Alternative using readLines for raw content
connection <- url("https://www.example.com/data.txt")
lines <- readLines(connection)
close(connection)

# Download file to disk
download.file(
  url = "https://example.com/dataset.zip",
  destfile = "dataset.zip",
  mode = "wb"  # Use binary mode for non-text files
)

The url() function creates a connection object that R’s standard reading functions accept. Always close connections explicitly or use functions that handle cleanup automatically.
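The cleanup pattern can be wrapped in a small helper: on.exit() closes the connection even when readLines() throws partway through. read_url_lines() is an illustrative name, not a base R function.

```r
# Close the connection on both normal exit and error.
# read_url_lines() is an illustrative helper, not part of base R.
read_url_lines <- function(target) {
  con <- url(target)
  on.exit(close(con), add = TRUE)
  readLines(con, warn = FALSE)
}

# lines <- read_url_lines("https://www.example.com/data.txt")
```

Because on.exit() is registered immediately after the connection is created, there is no window in which an error can leak the connection.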

Working with REST APIs Using httr

The httr package provides a comprehensive interface for HTTP operations with proper header management, authentication, and response parsing.

library(httr)
library(jsonlite)

# Basic GET request
response <- GET("https://api.github.com/users/hadley/repos")

# Check response status
if (status_code(response) == 200) {
  content <- content(response, as = "text", encoding = "UTF-8")
  repos <- fromJSON(content)
  print(repos[1:3, c("name", "stargazers_count")])
} else {
  stop(paste("Request failed with status:", status_code(response)))
}

# GET with query parameters
response <- GET(
  "https://api.example.com/search",
  query = list(
    q = "data science",
    page = 1,
    per_page = 50
  )
)

# POST request with JSON body
api_response <- POST(
  "https://api.example.com/data",
  body = list(
    name = "test",
    values = c(1, 2, 3)
  ),
  encode = "json",
  add_headers(
    # encode = "json" already sets the Content-Type header
    "Authorization" = paste("Bearer", Sys.getenv("API_TOKEN"))
  )
)

Store API keys in environment variables rather than hardcoding them. Use .Renviron file or Sys.setenv() for local development.
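One defensive pattern, sketched here with a hypothetical get_api_token() helper: fail fast with a clear message when the variable is unset, rather than silently sending an empty credential.

```r
# Look up a token from the environment; stop early if it is missing.
# The variable name API_TOKEN is whatever you defined in ~/.Renviron.
get_api_token <- function(var = "API_TOKEN") {
  token <- Sys.getenv(var, unset = "")
  if (!nzchar(token)) {
    stop("Environment variable ", var,
         " is not set; add it to ~/.Renviron and restart R.",
         call. = FALSE)
  }
  token
}
```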

Modern API Interactions with httr2

The httr2 package introduces a pipe-friendly interface with built-in retry logic and better error handling.

library(httr2)

# Build and execute request
response <- request("https://api.github.com") |>
  req_url_path_append("repos/tidyverse/dplyr") |>
  req_headers(
    "Accept" = "application/vnd.github.v3+json",
    "User-Agent" = "R-script"
  ) |>
  req_retry(max_tries = 3) |>
  req_throttle(rate = 10 / 60) |>  # 10 requests per minute
  req_perform()

# Parse JSON response
repo_data <- response |>
  resp_body_json()

print(repo_data$full_name)
print(repo_data$stargazers_count)

# Handle pagination
get_all_pages <- function(base_url, max_pages = 5) {
  results <- list()
  
  for (page in 1:max_pages) {
    resp <- request(base_url) |>
      req_url_query(page = page, per_page = 100) |>
      req_perform()
    
    page_data <- resp_body_json(resp)
    if (length(page_data) == 0) break
    
    results[[page]] <- page_data
  }
  
  return(results)
}

The req_retry() function automatically handles transient failures, while req_throttle() prevents rate limit violations.
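req_retry() also accepts a backoff function controlling the wait between attempts. A sketch against a made-up endpoint, using exponential backoff; note that building a request object does not perform it.

```r
library(httr2)

# Wait roughly 2, 4, then 8 seconds between retries; the formula
# receives the attempt number as .x. The URL is a placeholder.
req <- request("https://api.example.com/flaky") |>
  req_retry(max_tries = 4, backoff = ~ 2^.x) |>
  req_throttle(rate = 30 / 60)  # at most 30 requests per minute

# resp <- req_perform(req)
```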

Web Scraping with rvest

For extracting data from HTML pages, rvest provides CSS selector and XPath support built on xml2.

library(rvest)
library(dplyr)
library(purrr)  # for map_df() used below

# Read HTML page
page <- read_html("https://en.wikipedia.org/wiki/List_of_R_packages")

# Extract tables
tables <- page |>
  html_elements("table.wikitable") |>
  html_table()

# Extract specific elements with CSS selectors
headings <- page |>
  html_elements("h2") |>
  html_text2()

# Extract links
links <- page |>
  html_elements("a") |>
  html_attr("href")

# More complex extraction
articles <- read_html("https://example.com/blog") |>
  html_elements(".article") |>
  map_df(~{
    tibble(
      title = html_element(.x, ".title") |> html_text2(),
      author = html_element(.x, ".author") |> html_text2(),
      date = html_element(.x, ".date") |> html_attr("datetime"),
      url = html_element(.x, "a") |> html_attr("href")
    )
  })

Use html_text2() instead of html_text() for better whitespace handling and more readable output.
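The difference is easy to see on a small document built with rvest's minimal_html():

```r
library(rvest)

# html_text() concatenates raw text nodes; html_text2() renders <br>
# as a newline and collapses repeated spaces.
doc  <- minimal_html("<p>First line<br>Second   line</p>")
node <- html_element(doc, "p")

html_text(node)   # no line break, spacing preserved as-is
html_text2(node)  # line break for <br>, whitespace normalized
```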

Ethical Scraping with polite

The polite package enforces web scraping best practices by checking robots.txt and implementing automatic delays.

library(polite)
library(rvest)

# Create polite session
session <- bow(
  url = "https://example.com",
  user_agent = "Educational R Script (contact@example.com)",
  delay = 5  # seconds between requests
)

# Check what's allowed
print(session)

# Scrape pages politely
page1 <- scrape(session, query = list(page = 1))
data1 <- page1 |>
  html_elements(".item") |>
  html_text2()

# Navigate to another page
page2 <- nod(session, path = "/page2") |>
  scrape()

The bow() function reads robots.txt and respects crawl delays. The session object maintains rate limiting across multiple requests.

Error Handling and Robustness

Production code requires comprehensive error handling for network failures, malformed responses, and rate limiting.

library(httr2)
library(purrr)

safe_fetch <- function(url) {
  tryCatch({
    response <- request(url) |>
      req_timeout(30) |>
      req_retry(
        max_tries = 3,
        is_transient = ~resp_status(.x) %in% c(429, 500, 502, 503)
      ) |>
      req_error(is_error = ~FALSE) |>  # Don't throw on HTTP errors
      req_perform()
    
    if (resp_status(response) >= 400) {
      warning(paste("HTTP error:", resp_status(response), "for", url))
      return(NULL)
    }
    
    return(resp_body_json(response))
    
  }, error = function(e) {
    warning(paste("Failed to fetch", url, ":", e$message))
    return(NULL)
  })
}

# Fetch multiple URLs with error handling
urls <- c(
  "https://api.example.com/data1",
  "https://api.example.com/data2",
  "https://api.example.com/data3"
)

results <- map(urls, safe_fetch)
data_list <- compact(results)  # drop the NULLs from failed requests

Because safe_fetch() traps its own errors and returns NULL, compact() is enough to separate successes from failures. For a fetch function that can still throw, wrap it in purrr's safely() so one failure cannot stop the entire pipeline.

Caching and Performance

Implement caching for expensive API calls to improve performance and reduce server load.

library(memoise)
library(cachem)
library(httr2)

# Create disk cache
cache <- cache_disk("./api_cache", max_size = 1024 * 1024^2)  # 1GB

# Memoize fetch function
fetch_api_data <- memoise(
  function(endpoint) {
    response <- request(paste0("https://api.example.com/", endpoint)) |>
      req_perform()
    resp_body_json(response)
  },
  cache = cache
)

# First call hits API
data1 <- fetch_api_data("users/123")

# Second call uses cache
data2 <- fetch_api_data("users/123")

# Clear cache when needed
forget(fetch_api_data)

Disk caching persists across R sessions, making it ideal for data that changes infrequently. Set appropriate max_age parameters for time-sensitive data.
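For example, a cache whose entries expire after an hour (the directory and sizes are illustrative):

```r
library(cachem)

# Entries older than max_age seconds are treated as missing on read,
# so the next call re-fetches fresh data.
cache <- cache_disk(
  dir = "./api_cache",
  max_size = 1024^3,  # 1 GiB
  max_age = 3600      # expire entries after one hour
)
```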

Handling Authentication

Different APIs require various authentication methods. Here are common patterns:

library(httr2)

# Bearer token
response <- request("https://api.example.com/protected") |>
  req_auth_bearer_token(Sys.getenv("API_TOKEN")) |>
  req_perform()

# Basic authentication
response <- request("https://api.example.com/data") |>
  req_auth_basic(
    username = Sys.getenv("API_USER"),
    password = Sys.getenv("API_PASS")
  ) |>
  req_perform()

# OAuth 2.0 (using httr)
library(httr)

app <- oauth_app(
  "my_app",
  key = Sys.getenv("CLIENT_ID"),
  secret = Sys.getenv("CLIENT_SECRET")
)

token <- oauth2.0_token(
  oauth_endpoints("google"),
  app,
  scope = "https://www.googleapis.com/auth/analytics.readonly"
)

response <- GET(
  "https://www.googleapis.com/analytics/v3/data/ga",
  config(token = token)
)

Never commit credentials to version control. Use environment variables, secure vaults, or the keyring package for credential management.
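A sketch with the keyring package: the OS keychain is the default backend, while the environment-variable backend (shown here with made-up service and user labels) keeps headless CI runs non-interactive.

```r
library(keyring)

# Default backend stores secrets in the OS keychain:
#   key_set("my_api")            # prompts for the secret interactively
#   token <- key_get("my_api")

# Environment-variable backend, useful on headless CI machines.
kr <- backend_env$new()
kr$set_with_value(service = "my_api", username = "bot", password = "s3cret")
token <- kr$get(service = "my_api", username = "bot")
```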
