Python - Read/Write CSV Files
Key Insights
- Python's `csv` module handles CSV operations with built-in support for delimiters, quoting, and dialects, while pandas offers high-performance operations on large datasets
- `DictReader` and `DictWriter` provide intuitive column access by name, eliminating index-based errors and improving code readability
- Proper error handling and encoding specification (UTF-8 BOM for Excel compatibility) prevent data corruption and ensure cross-platform reliability
Basic CSV Reading with the csv Module
The `csv` module provides straightforward methods for reading CSV files. The `csv.reader()` function returns an iterator that yields each row as a list of strings.
```python
import csv

with open('employees.csv', 'r', encoding='utf-8') as file:
    csv_reader = csv.reader(file)
    # Skip header row
    next(csv_reader)
    for row in csv_reader:
        employee_id = row[0]
        name = row[1]
        department = row[2]
        salary = float(row[3])
        print(f"{name} ({department}): ${salary:,.2f}")
```
For CSV files with headers, `DictReader` provides cleaner access to columns by name:
```python
import csv

with open('employees.csv', 'r', encoding='utf-8') as file:
    csv_reader = csv.DictReader(file)
    for row in csv_reader:
        print(f"{row['name']} - {row['department']}")
        print(f"Salary: ${float(row['salary']):,.2f}\n")
```
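`DictReader` also handles ragged rows gracefully: `restval` fills in missing fields, and `restkey` collects any extra ones. A small in-memory sketch (the data here is illustrative):

```python
import csv
import io

# Rows with a missing and an extra field (illustrative data)
data = io.StringIO(
    "name,department,salary\n"
    "John Smith,Engineering\n"               # salary missing
    "Sarah Johnson,Marketing,78000,extra\n"  # one field too many
)

reader = csv.DictReader(data, restval='N/A', restkey='overflow')
rows = list(reader)

print(rows[0]['salary'])    # 'N/A' — filled from restval
print(rows[1]['overflow'])  # ['extra'] — collected under restkey
```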
Writing CSV Files
The `csv.writer()` function writes rows to CSV files. Pass `newline=''` when opening the file to prevent blank lines between rows on Windows.
```python
import csv

data = [
    ['employee_id', 'name', 'department', 'salary'],
    ['E001', 'John Smith', 'Engineering', '95000'],
    ['E002', 'Sarah Johnson', 'Marketing', '78000'],
    ['E003', 'Mike Chen', 'Engineering', '102000']
]

with open('output.csv', 'w', newline='', encoding='utf-8') as file:
    csv_writer = csv.writer(file)
    csv_writer.writerows(data)
```
For dictionary-based writing, use `DictWriter`:
```python
import csv

employees = [
    {'employee_id': 'E001', 'name': 'John Smith', 'department': 'Engineering', 'salary': 95000},
    {'employee_id': 'E002', 'name': 'Sarah Johnson', 'department': 'Marketing', 'salary': 78000},
    {'employee_id': 'E003', 'name': 'Mike Chen', 'department': 'Engineering', 'salary': 102000}
]
fieldnames = ['employee_id', 'name', 'department', 'salary']

with open('employees_dict.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.DictWriter(file, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(employees)
```
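One detail worth knowing: by default, `DictWriter.writerow()` raises a `ValueError` if a row dictionary contains a key not listed in `fieldnames`. Passing `extrasaction='ignore'` drops the extra keys silently instead. A minimal sketch using an in-memory buffer (the row data is illustrative):

```python
import csv
import io

# Row with an extra 'notes' key not present in fieldnames
row = {'employee_id': 'E004', 'name': 'Ana Ruiz', 'department': 'Sales',
       'salary': 88000, 'notes': 'remote'}
fieldnames = ['employee_id', 'name', 'department', 'salary']

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames, extrasaction='ignore')
writer.writeheader()
writer.writerow(row)  # 'notes' is silently dropped

print(buffer.getvalue())
```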
Handling Custom Delimiters and Quoting
CSV files don’t always use commas. Handle tab-delimited or pipe-delimited files by specifying the delimiter:
```python
import csv

# Read tab-delimited file
with open('data.tsv', 'r', encoding='utf-8') as file:
    reader = csv.reader(file, delimiter='\t')
    for row in reader:
        print(row)

# Write pipe-delimited file
with open('output.txt', 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file, delimiter='|')
    writer.writerow(['ID', 'Name', 'Status'])
    writer.writerow(['001', 'Product A', 'Active'])
```
Control quoting behavior for fields containing special characters:
```python
import csv

data = [
    ['name', 'description', 'price'],
    ['Widget', 'A "premium" product', '29.99'],
    ['Gadget', 'Includes, commas', '49.99']
]

with open('products.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file, quoting=csv.QUOTE_NONNUMERIC)
    writer.writerows(data)
```
Quoting options:

- `csv.QUOTE_MINIMAL`: Quote only fields containing special characters (the default)
- `csv.QUOTE_ALL`: Quote all fields
- `csv.QUOTE_NONNUMERIC`: Quote all non-numeric fields
- `csv.QUOTE_NONE`: Never quote (raises an error if special characters are present)
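The quoting constants also affect reading: with `csv.QUOTE_NONNUMERIC`, the reader automatically converts every unquoted field to `float` (and raises `ValueError` if an unquoted field isn't numeric). A small in-memory example:

```python
import csv
import io

# Quoted fields stay strings; unquoted fields are converted to float
data = io.StringIO('"Widget",29.99\n"Gadget",49.99\n')
reader = csv.reader(data, quoting=csv.QUOTE_NONNUMERIC)
rows = list(reader)

print(rows)  # [['Widget', 29.99], ['Gadget', 49.99]]
```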
Using CSV Dialects
Dialects define CSV formatting rules. Python includes predefined dialects for common formats:
```python
import csv

# Excel dialect (the default)
with open('excel_format.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file, dialect='excel')
    writer.writerow(['Name', 'Value'])
    writer.writerow(['Test', '123'])

# Unix dialect (quotes all fields, uses \n line terminator)
with open('unix_format.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file, dialect='unix')
    writer.writerow(['Name', 'Value'])

# Custom dialect
csv.register_dialect('custom', delimiter='|', quoting=csv.QUOTE_ALL,
                     lineterminator='\n', skipinitialspace=True)

with open('custom_format.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file, dialect='custom')
    writer.writerow(['Field1', 'Field2'])
```
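Going the other direction, `csv.Sniffer` can often deduce the dialect of an incoming file from a sample of its text. A sketch with illustrative pipe-delimited data:

```python
import csv
import io

sample = "ID|Name|Status\n001|Product A|Active\n002|Product B|Inactive\n"

sniffer = csv.Sniffer()
dialect = sniffer.sniff(sample)          # Infer delimiter, quoting, etc.
has_header = sniffer.has_header(sample)  # Heuristic header detection

print(dialect.delimiter)  # '|'

# Use the inferred dialect to parse the data
reader = csv.reader(io.StringIO(sample), dialect)
rows = list(reader)
print(rows[1])  # ['001', 'Product A', 'Active']
```

Note that `has_header()` is a heuristic and can misjudge files whose header row resembles the data rows.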
Error Handling and Data Validation
Production code requires robust error handling for malformed CSV files:
```python
import csv
import sys

def read_csv_safely(filename):
    try:
        with open(filename, 'r', encoding='utf-8') as file:
            reader = csv.DictReader(file)
            data = []
            for line_num, row in enumerate(reader, start=2):
                try:
                    # Validate required fields
                    if not row.get('employee_id'):
                        print(f"Warning: Missing employee_id on line {line_num}")
                        continue
                    # Validate data types
                    salary = float(row['salary'])
                    if salary < 0:
                        print(f"Warning: Negative salary on line {line_num}")
                        continue
                    data.append(row)
                except ValueError as e:
                    print(f"Error on line {line_num}: Invalid number format - {e}")
                except KeyError as e:
                    print(f"Error on line {line_num}: Missing column - {e}")
            return data
    except FileNotFoundError:
        print(f"Error: File '{filename}' not found")
        sys.exit(1)
    except csv.Error as e:
        print(f"CSV Error: {e}")
        sys.exit(1)

employees = read_csv_safely('employees.csv')
print(f"Successfully loaded {len(employees)} employees")
```
Working with Large CSV Files Using Pandas
For large datasets, pandas provides superior performance and functionality:
```python
import pandas as pd

# Read CSV with pandas
df = pd.read_csv('large_dataset.csv',
                 encoding='utf-8',
                 parse_dates=['date_column'],
                 dtype={'zip_code': str})

# Display basic info
print(df.head())
print(f"Shape: {df.shape}")
print(df.info())

# Filter and transform
high_earners = df[df['salary'] > 100000]
dept_summary = df.groupby('department')['salary'].agg(['mean', 'count'])
print(dept_summary)

# Write to CSV
df.to_csv('output.csv', index=False, encoding='utf-8')
```
Process large files in chunks to manage memory:
```python
import pandas as pd

chunk_size = 10000
chunks = []

for chunk in pd.read_csv('huge_file.csv', chunksize=chunk_size):
    # Process each chunk
    filtered = chunk[chunk['status'] == 'active']
    chunks.append(filtered)

# Combine results
result = pd.concat(chunks, ignore_index=True)
result.to_csv('filtered_output.csv', index=False)
```
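If even the filtered result is too large to hold in memory, each chunk can be appended to the output file as it is processed, writing the header only once. A sketch (the sample data generated here merely stands in for a real large file with a `status` column):

```python
import pandas as pd

# Build a small stand-in for the large input file (illustration only)
pd.DataFrame({
    'id': range(1, 101),
    'status': ['active' if i % 2 == 0 else 'inactive' for i in range(1, 101)],
}).to_csv('huge_file.csv', index=False)

first_chunk = True
for chunk in pd.read_csv('huge_file.csv', chunksize=25):
    filtered = chunk[chunk['status'] == 'active']
    # Append each chunk; write the header only for the first one
    filtered.to_csv('filtered_output.csv',
                    mode='w' if first_chunk else 'a',
                    header=first_chunk,
                    index=False)
    first_chunk = False
```

This keeps peak memory bounded by the chunk size instead of the full result.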
Excel Compatibility and Encoding
Excel expects a byte-order mark (BOM) at the start of a UTF-8 file to display Unicode characters correctly; Python's `utf-8-sig` encoding writes it automatically:
```python
import csv

data = [
    ['Name', 'City', 'Country'],
    ['François', 'Paris', 'France'],
    ['José', 'Madrid', 'España'],
    ['李明', 'Beijing', '中国']
]

# Write with UTF-8 BOM for Excel
with open('excel_compatible.csv', 'w', newline='', encoding='utf-8-sig') as file:
    writer = csv.writer(file)
    writer.writerows(data)
```
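When reading such a file back, `utf-8-sig` also strips the BOM on input, whereas plain `utf-8` leaves the `\ufeff` character attached to the first field. A self-contained sketch:

```python
import csv

rows = [['Name', 'City', 'Country'], ['François', 'Paris', 'France']]

# Write with a BOM, as above
with open('excel_compatible.csv', 'w', newline='', encoding='utf-8-sig') as f:
    csv.writer(f).writerows(rows)

# 'utf-8-sig' strips the BOM on input
with open('excel_compatible.csv', 'r', newline='', encoding='utf-8-sig') as f:
    header = next(csv.reader(f))
print(header)  # ['Name', 'City', 'Country']

# Plain 'utf-8' leaks the BOM into the first field name
with open('excel_compatible.csv', 'r', newline='', encoding='utf-8') as f:
    raw_header = next(csv.reader(f))
print(raw_header[0])  # '\ufeffName'
```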
Convert between different CSV formats:
```python
import csv

def convert_csv_format(input_file, output_file,
                       input_delimiter=',', output_delimiter='\t'):
    with open(input_file, 'r', encoding='utf-8') as infile, \
         open(output_file, 'w', newline='', encoding='utf-8') as outfile:
        reader = csv.reader(infile, delimiter=input_delimiter)
        writer = csv.writer(outfile, delimiter=output_delimiter)
        for row in reader:
            writer.writerow(row)

convert_csv_format('comma_delimited.csv', 'tab_delimited.tsv', ',', '\t')
```
Performance Optimization
For maximum performance with pure Python, use `csv.reader()` with minimal per-row processing:
```python
import csv
from collections import defaultdict

def fast_csv_aggregation(filename):
    dept_salaries = defaultdict(list)
    with open(filename, 'r', encoding='utf-8') as file:
        reader = csv.reader(file)
        next(reader)  # Skip header
        for row in reader:
            dept = row[2]
            salary = float(row[3])
            dept_salaries[dept].append(salary)
    # Calculate averages
    return {dept: sum(salaries) / len(salaries)
            for dept, salaries in dept_salaries.items()}

averages = fast_csv_aggregation('employees.csv')
for dept, avg in averages.items():
    print(f"{dept}: ${avg:,.2f}")
```
Choose the right tool for your requirements: use the `csv` module for simple operations and when dependencies must stay minimal; use pandas for complex data analysis and transformations. Both approaches handle CSV files reliably when combined with explicit encoding and proper error handling.