R - Read from URL/Web
Key Insights
- R provides multiple approaches for reading web data: base R functions like readLines() and download.file(), the httr package for RESTful APIs, and rvest for web scraping HTML content
- Authentication, error handling, and rate limiting are critical considerations when building production-ready web data pipelines in R
- Modern R workflows benefit from httr2 for API interactions and polite for ethical web scraping with automatic throttling and robots.txt compliance
Reading Plain Text from URLs
Base R handles simple URL reading through readLines() and url() connections. This works for plain text, CSV files, and basic HTTP requests without authentication.
# Read a CSV from a URL (read.csv accepts a URL string directly)
csv_url <- "https://raw.githubusercontent.com/datasets/covid-19/master/data/countries-aggregated.csv"
data <- read.csv(csv_url)
head(data)
# Alternative using readLines for raw content
connection <- url("https://www.example.com/data.txt")
lines <- readLines(connection)
close(connection)
# Download file to disk
download.file(
  url = "https://example.com/dataset.zip",
  destfile = "dataset.zip",
  mode = "wb" # Use binary mode for non-text files
)
The url() function creates a connection object that R’s standard reading functions accept. Always close connections explicitly or use functions that handle cleanup automatically.
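One way to guarantee cleanup is on.exit(), which closes the connection whether the read succeeds or fails. A minimal sketch (the helper name and URL are illustrative):

```r
# Read lines from a URL, closing the connection even if readLines() errors
read_url_lines <- function(u) {
  con <- url(u)
  on.exit(close(con)) # runs on any exit from the function, error or not
  readLines(con)
}

lines <- read_url_lines("https://www.example.com/data.txt")
```

Note that readLines() called directly on a URL string also opens and closes the connection for you; explicit connections are mainly useful when you read incrementally.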
Working with REST APIs Using httr
The httr package provides a comprehensive interface for HTTP operations with proper header management, authentication, and response parsing.
library(httr)
library(jsonlite)
# Basic GET request
response <- GET("https://api.github.com/users/hadley/repos")
# Check response status
if (status_code(response) == 200) {
  content <- content(response, as = "text", encoding = "UTF-8")
  repos <- fromJSON(content)
  print(repos[1:3, c("name", "stargazers_count")])
} else {
  stop(paste("Request failed with status:", status_code(response)))
}
# GET with query parameters
response <- GET(
  "https://api.example.com/search",
  query = list(
    q = "data science",
    page = 1,
    per_page = 50
  )
)
# POST request with JSON body
api_response <- POST(
  "https://api.example.com/data",
  body = list(
    name = "test",
    values = c(1, 2, 3)
  ),
  encode = "json",
  add_headers(
    "Content-Type" = "application/json",
    "Authorization" = paste("Bearer", Sys.getenv("API_TOKEN"))
  )
)
Store API keys in environment variables rather than hardcoding them. Use an .Renviron file or Sys.setenv() for local development.
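A common pattern is to read the key from .Renviron and fail fast when it is missing; the variable name here is illustrative:

```r
# ~/.Renviron (one assignment per line, no quotes needed):
# API_TOKEN=abc123

# Sys.getenv() returns "" when the variable is unset
token <- Sys.getenv("API_TOKEN")
if (!nzchar(token)) {
  stop("API_TOKEN is not set; add it to ~/.Renviron and restart R")
}
```

R reads .Renviron once at startup, so restart the session after editing it.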
Modern API Interactions with httr2
The httr2 package introduces a pipe-friendly interface with built-in retry logic and better error handling.
library(httr2)
# Build and execute request
response <- request("https://api.github.com") |>
  req_url_path_append("repos/tidyverse/dplyr") |>
  req_headers(
    "Accept" = "application/vnd.github.v3+json",
    "User-Agent" = "R-script"
  ) |>
  req_retry(max_tries = 3) |>
  req_throttle(rate = 10 / 60) |> # 10 requests per minute
  req_perform()
# Parse JSON response
repo_data <- response |>
  resp_body_json()
print(repo_data$full_name)
print(repo_data$stargazers_count)
# Handle pagination
get_all_pages <- function(base_url, max_pages = 5) {
  results <- list()
  for (page in 1:max_pages) {
    resp <- request(base_url) |>
      req_url_query(page = page, per_page = 100) |>
      req_perform()
    page_data <- resp_body_json(resp)
    if (length(page_data) == 0) break
    results[[page]] <- page_data
  }
  return(results)
}
The req_retry() function automatically handles transient failures, while req_throttle() prevents rate limit violations.
Web Scraping with rvest
For extracting data from HTML pages, rvest provides CSS selector and XPath support built on xml2.
library(rvest)
library(dplyr)
library(purrr) # needed for map_df() below
# Read HTML page
page <- read_html("https://en.wikipedia.org/wiki/List_of_R_packages")
# Extract tables
tables <- page |>
  html_elements("table.wikitable") |>
  html_table()
# Extract specific elements with CSS selectors
headings <- page |>
  html_elements("h2") |>
  html_text2()
# Extract links
links <- page |>
  html_elements("a") |>
  html_attr("href")
# More complex extraction
articles <- read_html("https://example.com/blog") |>
  html_elements(".article") |>
  map_df(~{
    tibble(
      title = html_element(.x, ".title") |> html_text2(),
      author = html_element(.x, ".author") |> html_text2(),
      date = html_element(.x, ".date") |> html_attr("datetime"),
      url = html_element(.x, "a") |> html_attr("href")
    )
  })
Use html_text2() instead of html_text() for better whitespace handling and more readable output.
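The difference is easiest to see on a small fragment with a line break and repeated spaces (constructed here with rvest's minimal_html() helper):

```r
library(rvest)

# A tiny in-memory document with a <br> and irregular spacing
html <- minimal_html("<p>First line<br>  Second   line</p>")
node <- html_element(html, "p")

html_text(node)  # raw text: whitespace kept exactly as in the source
html_text2(node) # browser-like text: <br> becomes a newline,
                 # runs of whitespace collapse to single spaces
```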
Ethical Scraping with polite
The polite package enforces web scraping best practices by checking robots.txt and implementing automatic delays.
library(polite)
library(rvest)
# Create polite session
session <- bow(
  url = "https://example.com",
  user_agent = "Educational R Script (contact@example.com)",
  delay = 5 # seconds between requests
)
# Check what's allowed
print(session)
# Scrape pages politely
page1 <- scrape(session, query = list(page = 1))
data1 <- page1 |>
  html_elements(".item") |>
  html_text2()
# Navigate to another page
page2 <- nod(session, path = "/page2") |>
  scrape()
The bow() function reads robots.txt and respects crawl delays. The session object maintains rate limiting across multiple requests.
Error Handling and Robustness
Production code requires comprehensive error handling for network failures, malformed responses, and rate limiting.
library(httr2)
library(purrr)
safe_fetch <- function(url) {
  tryCatch({
    response <- request(url) |>
      req_timeout(30) |>
      req_retry(
        max_tries = 3,
        is_transient = ~resp_status(.x) %in% c(429, 500, 502, 503)
      ) |>
      req_error(is_error = ~FALSE) |> # Don't throw on HTTP errors
      req_perform()
    if (resp_status(response) >= 400) {
      warning(paste("HTTP error:", resp_status(response), "for", url))
      return(NULL)
    }
    return(resp_body_json(response))
  }, error = function(e) {
    warning(paste("Failed to fetch", url, ":", e$message))
    return(NULL)
  })
}
# Fetch multiple URLs with error handling
urls <- c(
  "https://api.example.com/data1",
  "https://api.example.com/data2",
  "https://api.example.com/data3"
)
results <- map(urls, safely(safe_fetch))
successful <- keep(results, ~is.null(.x$error))
data_list <- map(successful, "result")
The safely() wrapper from purrr prevents one failure from stopping the entire pipeline.
Caching and Performance
Implement caching for expensive API calls to improve performance and reduce server load.
library(memoise)
library(cachem)
library(httr2) # for request() and resp_body_json()
# Create disk cache
cache <- cache_disk("./api_cache", max_size = 1024 * 1024^2) # 1 GB
# Memoise fetch function
fetch_api_data <- memoise(
  function(endpoint) {
    response <- request(paste0("https://api.example.com/", endpoint)) |>
      req_perform()
    resp_body_json(response)
  },
  cache = cache
)
# First call hits API
data1 <- fetch_api_data("users/123")
# Second call uses cache
data2 <- fetch_api_data("users/123")
# Clear cache when needed
forget(fetch_api_data)
Disk caching persists across R sessions, making it ideal for data that changes infrequently. Set appropriate max_age parameters for time-sensitive data.
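For time-sensitive data, cachem's max_age makes cached entries expire automatically; a sketch assuming a one-hour freshness window (the directory name and function are illustrative):

```r
library(cachem)
library(memoise)

# Entries older than one hour are treated as missing and refetched
hourly_cache <- cache_disk(
  dir = "./api_cache_hourly",
  max_age = 60 * 60 # seconds
)

fetch_fresh <- memoise(function(endpoint) {
  # ... perform the request as above ...
}, cache = hourly_cache)
```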
Handling Authentication
Different APIs require various authentication methods. Here are common patterns:
# Bearer token
response <- request("https://api.example.com/protected") |>
  req_auth_bearer_token(Sys.getenv("API_TOKEN")) |>
  req_perform()
# Basic authentication
response <- request("https://api.example.com/data") |>
  req_auth_basic(
    username = Sys.getenv("API_USER"),
    password = Sys.getenv("API_PASS")
  ) |>
  req_perform()
# OAuth 2.0 (using httr)
library(httr)
app <- oauth_app(
  "my_app",
  key = Sys.getenv("CLIENT_ID"),
  secret = Sys.getenv("CLIENT_SECRET")
)
token <- oauth2.0_token(
  oauth_endpoints("google"),
  app,
  scope = "https://www.googleapis.com/auth/analytics.readonly"
)
response <- GET(
  "https://www.googleapis.com/analytics/v3/data/ga",
  config(token = token)
)
Never commit credentials to version control. Use environment variables, secure vaults, or the keyring package for credential management.
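The keyring package stores secrets in the operating system's credential store rather than in files. A sketch, where the service and username labels are illustrative:

```r
library(keyring)
library(httr2)

# One-time, interactive: prompts for the secret and stores it securely
# key_set(service = "my_api", username = "me")

# In scripts: retrieve the secret without it appearing in code or history
token <- key_get(service = "my_api", username = "me")

response <- request("https://api.example.com/protected") |>
  req_auth_bearer_token(token) |>
  req_perform()
```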