Pandas - Read HTML Table from URL
Pandas' `read_html()` function parses every table on a web page into a list of DataFrames, making it one of the fastest routes from raw HTML to analysis-ready data. This guide covers basic extraction, targeting specific tables, header handling, authentication, error handling, and type conversion.
Key Insights
- Pandas' `read_html()` function automatically parses all HTML tables from a URL or HTML string, returning a list of DataFrames that you can filter and manipulate immediately
- The function uses lxml or html5lib parsers under the hood and handles common table structures including headers, multi-level columns, and merged cells without manual BeautifulSoup parsing
- Production implementations require error handling for network failures, table identification logic when multiple tables exist, and data type conversion for proper analysis
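These behaviors are easy to verify offline by feeding read_html() an HTML string instead of a URL. A minimal sketch with made-up table content:

```python
from io import StringIO

import pandas as pd

# A minimal HTML document with two tables (made-up data)
html = """
<table>
  <tr><th>City</th><th>Population</th></tr>
  <tr><td>Oslo</td><td>700000</td></tr>
</table>
<table>
  <tr><th>Code</th><th>Name</th></tr>
  <tr><td>NOR</td><td>Norway</td></tr>
</table>
"""

# read_html returns one DataFrame per <table> element, in document order
tables = pd.read_html(StringIO(html))
print(len(tables))                 # 2
print(tables[0].columns.tolist())  # ['City', 'Population']
```

Wrapping the string in StringIO avoids the deprecation warning that newer pandas versions emit for literal HTML strings.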
Basic HTML Table Extraction
The read_html() function returns a list of all tables found in the HTML source. Each table becomes a separate DataFrame, indexed by its position in the document.
import pandas as pd
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)'
tables = pd.read_html(url)
print(f"Found {len(tables)} tables")
print(tables[0].head())
This approach works for simple cases but lacks control. In production, you’ll want to identify the specific table you need rather than guessing indices.
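One lightweight way to avoid guessing is to survey every table's shape and columns first, then select by content. A sketch using an inline HTML string as a stand-in for a real page:

```python
from io import StringIO

import pandas as pd

# Stand-in for a real page that mixes layout and data tables (made-up content)
html = """
<table><tr><th>Nav</th></tr><tr><td>Home</td></tr></table>
<table><tr><th>Rank</th><th>Country</th></tr><tr><td>1</td><td>USA</td></tr></table>
"""

tables = pd.read_html(StringIO(html))

# Survey every table before committing to an index
for i, t in enumerate(tables):
    print(i, t.shape, t.columns.tolist())

# Select by content rather than position
gdp = next(t for t in tables if 'Country' in t.columns)
print(gdp.head())
```

Selecting by column name keeps the code working even if the site adds or removes tables above the one you want.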
Targeting Specific Tables with Match Parameter
The match parameter accepts a regex pattern to filter tables containing specific text, significantly reducing the number of DataFrames returned.
import pandas as pd
url = 'https://www.basketball-reference.com/leagues/NBA_2024_per_game.html'
# Find tables containing "Player" in any cell
try:
    tables = pd.read_html(url, match='Player')
    stats_df = tables[0]
    print(stats_df.columns)
    print(stats_df.head(10))
except ValueError:
    # read_html raises ValueError when no table matches the pattern
    print("No matching table found")
The match parameter searches through table contents, not just headers. This makes it powerful for isolating tables in pages with multiple similar structures.
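A small offline sketch demonstrates this: the match text below appears only in a body cell, never in a header, yet the filter still isolates the right table (HTML content is made up):

```python
from io import StringIO

import pandas as pd

# Made-up page: a layout table plus a stats table
html = """
<table><tr><th>Menu</th></tr><tr><td>navigation links</td></tr></table>
<table><tr><th>Team</th></tr><tr><td>Lakers</td></tr></table>
"""

# 'Lakers' occurs only in a <td> cell, but match scans all table text
tables = pd.read_html(StringIO(html), match='Lakers')
print(len(tables))                 # 1
print(tables[0].columns.tolist())  # ['Team']
```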
Handling Headers and Column Names
HTML tables often have complex header structures. Pandas attempts automatic header detection, but you’ll frequently need manual intervention.
import pandas as pd
url = 'https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list/'
tables = pd.read_html(url)
bank_failures = tables[0]
# Check if first row is actual header
print("Columns:", bank_failures.columns.tolist())
print("\nFirst row:", bank_failures.iloc[0].tolist())
# If headers are correct, proceed with cleaning
bank_failures.columns = bank_failures.columns.str.strip()
# Convert date columns
bank_failures['Closing Date'] = pd.to_datetime(bank_failures['Closing Date'])
# Filter recent failures
recent = bank_failures[bank_failures['Closing Date'] > '2020-01-01']
print(f"\nBank failures since 2020: {len(recent)}")
For tables without proper `<th>` tags, use the header parameter:
# Skip first row as header
df = pd.read_html(url, header=0)[0]
# Use specific row as header
df = pd.read_html(url, header=1)[0]
# No header row
df = pd.read_html(url, header=None)[0]
df.columns = ['Custom', 'Column', 'Names']
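When a page has stacked header rows, read_html builds a MultiIndex on the columns, which many downstream steps can't handle. A sketch (with hypothetical HTML) of flattening it:

```python
from io import StringIO

import pandas as pd

# Two stacked header rows (made-up data); the colspan creates grouped columns
html = """
<table>
  <tr><th colspan="2">Scores</th></tr>
  <tr><th>Home</th><th>Away</th></tr>
  <tr><td>3</td><td>1</td></tr>
</table>
"""

df = pd.read_html(StringIO(html))[0]
print(df.columns)  # MultiIndex: ('Scores', 'Home'), ('Scores', 'Away')

# Flatten the MultiIndex into single-level names
df.columns = ['_'.join(col) for col in df.columns]
print(df.columns.tolist())  # ['Scores_Home', 'Scores_Away']
```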
Authentication and Custom Headers
Many websites require authentication or specific request headers. read_html() has no headers argument and can't take a requests session directly, so fetch the page yourself and pass the response text to read_html(); recent pandas versions also accept a storage_options dict of extra HTTP headers.
import pandas as pd
import requests
# Method 1: Using requests session
session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
})
url = 'https://example.com/data/table'
response = session.get(url)
if response.status_code == 200:
    tables = pd.read_html(response.text)
    df = tables[0]
else:
    print(f"Failed to fetch: {response.status_code}")
# Method 2: For authenticated endpoints
session.auth = ('username', 'password')
response = session.get(url)
tables = pd.read_html(response.text)
For APIs requiring tokens:
import pandas as pd
import requests
headers = {
    'Authorization': 'Bearer YOUR_TOKEN_HERE',
    'User-Agent': 'Mozilla/5.0'
}
response = requests.get(url, headers=headers)
tables = pd.read_html(response.text)
df = tables[0]
Error Handling and Validation
Production code must handle network failures, parsing errors, and missing tables gracefully.
import pandas as pd
import requests
from requests.exceptions import RequestException
def fetch_table_from_url(url, table_index=0, match='.+', timeout=30):
    """
    Fetch an HTML table with comprehensive error handling.

    Args:
        url: Target URL
        table_index: Which table to return (default: 0)
        match: Regex pattern to filter tables (default matches any table)
        timeout: Request timeout in seconds

    Returns:
        DataFrame or None
    """
    try:
        # Fetch with timeout
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()
        # Parse tables
        tables = pd.read_html(
            response.text,
            match=match,
            flavor='lxml'  # Explicitly specify parser
        )
        if not tables:
            print(f"No tables found at {url}")
            return None
        if table_index >= len(tables):
            print(f"Table index {table_index} out of range. Found {len(tables)} tables.")
            return None
        return tables[table_index]
    except RequestException as e:
        print(f"Network error: {e}")
        return None
    except ValueError as e:
        print(f"Parsing error: {e}")
        return None
    except Exception as e:
        print(f"Unexpected error: {e}")
        return None

# Usage
df = fetch_table_from_url(
    'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies',
    match='Symbol'
)
if df is not None:
    print(f"Successfully loaded {len(df)} rows")
Data Type Conversion and Cleaning
HTML tables contain string data by default. Convert types explicitly for numerical operations.
import pandas as pd
import numpy as np
url = 'https://www.ssa.gov/oact/babynames/decades/century.html'
tables = pd.read_html(url)
names_df = tables[0]
# Remove commas from numbers (read_html's thousands=',' default may have
# already parsed these as integers, so only clean string columns)
if 'Number' in names_df.columns and names_df['Number'].dtype == object:
    names_df['Number'] = names_df['Number'].str.replace(',', '').astype(int)
# Handle percentage strings
if 'Percent' in names_df.columns and names_df['Percent'].dtype == object:
    names_df['Percent'] = names_df['Percent'].str.rstrip('%').astype(float) / 100
# Convert rank to integer, handling non-numeric values
names_df['Rank'] = pd.to_numeric(names_df['Rank'], errors='coerce')
# Remove rows where conversion failed
names_df = names_df.dropna(subset=['Rank'])
names_df['Rank'] = names_df['Rank'].astype(int)
print(names_df.dtypes)
print(names_df.head())
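Some of this cleanup can happen at parse time instead: read_html accepts a thousands separator (',' is already the default) and per-column converters. A self-contained sketch with made-up data:

```python
from io import StringIO

import pandas as pd

# Made-up table with formatted numbers and percentages
html = """
<table>
  <tr><th>Name</th><th>Number</th><th>Percent</th></tr>
  <tr><td>James</td><td>4,700,542</td><td>2.47%</td></tr>
</table>
"""

df = pd.read_html(
    StringIO(html),
    thousands=',',  # strip thousands separators (',' is the default)
    converters={'Percent': lambda s: float(s.rstrip('%')) / 100},
)[0]

print(df.dtypes)
print(df.iloc[0]['Number'], df.iloc[0]['Percent'])
```

Converters receive the raw cell text, so they run before pandas attempts its own type inference.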
Advanced Parsing with Multiple Tables
When pages contain multiple related tables, process them together for comprehensive analysis.
import pandas as pd
def extract_financial_tables(url):
    """Extract and combine multiple financial tables."""
    tables = pd.read_html(url)
    result = {}
    for table in tables:
        # Identify table by content
        if 'Revenue' in str(table.columns):
            result['income_statement'] = table
        elif 'Assets' in str(table.columns):
            result['balance_sheet'] = table
        elif 'Operating Activities' in str(table.values):
            result['cash_flow'] = table
    return result

# Process multiple tables
url = 'https://example.com/financial-statements'
financial_data = extract_financial_tables(url)
if 'income_statement' in financial_data:
    income = financial_data['income_statement']
    # Clean and process
    income.columns = income.columns.str.strip()
    print(income.head())
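When the extracted tables share a schema, pd.concat can stack them into one frame while recording each row's origin. A sketch with hypothetical mini-tables:

```python
from io import StringIO

import pandas as pd

# Two same-schema tables from one page (made-up data)
html = """
<table><tr><th>Item</th><th>Q1</th></tr><tr><td>Revenue</td><td>100</td></tr></table>
<table><tr><th>Item</th><th>Q1</th></tr><tr><td>Assets</td><td>500</td></tr></table>
"""

tables = pd.read_html(StringIO(html))

# Stack the tables, tagging each row with the table it came from
combined = pd.concat(tables, keys=['income', 'balance'], names=['statement', 'row'])
print(combined)
```

The keys argument builds a MultiIndex, so rows stay traceable to their source table after concatenation.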
Performance Optimization for Multiple URLs
When scraping multiple pages, implement rate limiting and caching to avoid overwhelming servers.
import pandas as pd
import time
from functools import lru_cache
@lru_cache(maxsize=100)
def cached_read_html(url):
    """Cache results to avoid redundant requests."""
    return pd.read_html(url)

def batch_fetch_tables(urls, delay=1.0):
    """
    Fetch tables from multiple URLs with rate limiting.

    Args:
        urls: List of URLs to process
        delay: Seconds to wait between requests

    Returns:
        Dictionary mapping URLs to DataFrames
    """
    results = {}
    for url in urls:
        try:
            tables = cached_read_html(url)
            results[url] = tables[0] if tables else None
            time.sleep(delay)  # Rate limiting
        except Exception as e:
            print(f"Failed to fetch {url}: {e}")
            results[url] = None
    return results

# Usage
urls = [
    'https://example.com/page1',
    'https://example.com/page2',
    'https://example.com/page3'
]
data = batch_fetch_tables(urls, delay=2.0)
valid_data = {k: v for k, v in data.items() if v is not None}
print(f"Successfully fetched {len(valid_data)}/{len(urls)} tables")
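One caveat with the caching approach: lru_cache hands back the same objects on every hit, so mutating a cached DataFrame silently corrupts later results. A sketch of the safe pattern, working on copies (cached_parse here is a hypothetical stand-in for cached_read_html):

```python
from functools import lru_cache
from io import StringIO

import pandas as pd

HTML = "<table><tr><th>A</th></tr><tr><td>1</td></tr></table>"

@lru_cache(maxsize=10)
def cached_parse(html):
    # Hypothetical stand-in for cached_read_html, parsing a string for demo
    return tuple(pd.read_html(StringIO(html)))

first = cached_parse(HTML)[0]
second = cached_parse(HTML)[0]
print(first is second)  # True: cache hits return the very same DataFrame

# Mutate a copy so later cache hits still see the original data
safe = cached_parse(HTML)[0].copy()
safe['A'] = safe['A'] * 10
print(cached_parse(HTML)[0]['A'].iloc[0])  # still 1
```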
The read_html() function provides a streamlined approach to web scraping tabular data without manual HTML parsing. Combine it with proper error handling, data validation, and type conversion for production-ready data pipelines.