Pandas DataFrame Tutorial - Complete Guide with Examples
Key Insights
- DataFrames are two-dimensional labeled data structures with columns of potentially different types, forming the core of pandas data manipulation
- Creating DataFrames from dictionaries, lists, CSV files, and databases requires understanding different constructor patterns and parameter options
- Mastering selection, filtering, grouping, and merging operations enables efficient data transformation workflows in production environments
Creating DataFrames from Different Sources
The most common way to create a DataFrame is from a dictionary where keys become column names:
import pandas as pd
import numpy as np
# From dictionary with lists
data = {
    'product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor'],
    'price': [1200, 25, 75, 350],
    'quantity': [5, 50, 30, 10]
}
df = pd.DataFrame(data)
print(df)
For CSV files, use read_csv() with specific parameters for real-world scenarios:
# Read CSV with custom settings
df = pd.read_csv(
    'sales_data.csv',
    sep=',',
    encoding='utf-8',
    parse_dates=['order_date'],
    dtype={'product_id': str},
    na_values=['NA', 'missing', '']
)
# Read from URL
url = 'https://example.com/data.csv'
df = pd.read_csv(url)
Creating DataFrames from lists of dictionaries handles JSON-like structures:
records = [
    {'name': 'Alice', 'age': 30, 'city': 'NYC'},
    {'name': 'Bob', 'age': 25, 'city': 'LA'},
    {'name': 'Charlie', 'age': 35, 'city': 'Chicago'}
]
df = pd.DataFrame.from_records(records)
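Loading from a database, mentioned above as another common source, works through read_sql(), which takes a query and an open connection. A minimal sketch using an in-memory SQLite database (the table name and contents here are hypothetical, for illustration only):

```python
import sqlite3
import pandas as pd

# Hypothetical in-memory database standing in for a real data store
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE products (product TEXT, price REAL)")
conn.executemany("INSERT INTO products VALUES (?, ?)",
                 [('Laptop', 1200), ('Mouse', 25)])
conn.commit()

# read_sql runs the query and returns the result set as a DataFrame
df = pd.read_sql('SELECT * FROM products WHERE price > 100', conn)
print(df)
conn.close()
```

The same call pattern applies to any DB-API connection or SQLAlchemy engine.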
Selecting and Accessing Data
Use .loc[] for label-based indexing and .iloc[] for position-based indexing:
# Label-based selection
df.loc[0] # First row
df.loc[0:2, 'product'] # Rows 0-2, product column
df.loc[df['price'] > 100, ['product', 'price']] # Conditional selection
# Position-based selection
df.iloc[0] # First row
df.iloc[0:3, 1:3] # First 3 rows, columns 1-2
df.iloc[:, -1] # All rows, last column
# Column selection
df['product'] # Single column as Series
df[['product', 'price']] # Multiple columns as DataFrame
# Boolean indexing
expensive_items = df[df['price'] > 100]
multi_condition = df[(df['price'] > 50) & (df['quantity'] < 20)]
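One subtlety worth checking for yourself: .loc[] slices are inclusive of the end label, while .iloc[] slices follow Python's usual exclusive convention. A small demonstration (the product DataFrame is rebuilt here so the snippet stands alone):

```python
import pandas as pd

df = pd.DataFrame({
    'product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor'],
    'price': [1200, 25, 75, 350]
})

# .loc slices by label and INCLUDES the end label
print(len(df.loc[0:2]))   # 3 rows: labels 0, 1, and 2

# .iloc slices by position and EXCLUDES the end position
print(len(df.iloc[0:2]))  # 2 rows: positions 0 and 1
```

Forgetting this difference is a common source of off-by-one bugs when switching between the two indexers.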
Filtering and Querying Data
The query() method provides readable filtering syntax:
# Query method for complex filters
result = df.query('price > 50 and quantity < 40')
# Using variables in queries
min_price = 100
result = df.query('price > @min_price')
# String operations in queries
df_with_text = pd.DataFrame({
    'product': ['Laptop Pro', 'Mouse Wireless', 'Keyboard Mech'],
    'price': [1200, 45, 150]
})
result = df_with_text.query('product.str.contains("Pro")', engine='python')
# isin() for membership testing
categories = ['Laptop', 'Monitor']
filtered = df[df['product'].isin(categories)]
Data Transformation and Manipulation
Apply transformations using apply(), map(), and vectorized operations:
# Vectorized operations (fastest)
df['total_value'] = df['price'] * df['quantity']
df['discounted_price'] = df['price'] * 0.9
# Apply function to column
df['price_category'] = df['price'].apply(
    lambda x: 'High' if x > 500 else 'Medium' if x > 100 else 'Low'
)
# Apply function to entire DataFrame
def calculate_metrics(row):
    return row['price'] * row['quantity'] * 1.1
df['revenue_with_tax'] = df.apply(calculate_metrics, axis=1)
# Map for value replacement
category_map = {'Laptop': 'Electronics', 'Mouse': 'Accessories'}
df['category'] = df['product'].map(category_map)
# String operations
df['product_upper'] = df['product'].str.upper()
df['product_length'] = df['product'].str.len()
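The chained-lambda apply() shown above for price categories also has a vectorized equivalent: np.select evaluates a list of boolean conditions in order, much like the if/elif chain, but without a Python-level loop. A sketch of the same High/Medium/Low logic:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'price': [1200, 25, 75, 350]})

# Conditions are checked in order; the first match wins, like if/elif
conditions = [df['price'] > 500, df['price'] > 100]
choices = ['High', 'Medium']
df['price_category'] = np.select(conditions, choices, default='Low')
print(df)
```

On large DataFrames this tends to be noticeably faster than row-wise apply().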
Handling Missing Data
Real-world data almost always contains missing values, so handling them systematically is essential:
# Create DataFrame with missing values
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': [9, 10, 11, 12]
})
# Detect missing values
print(df.isnull()) # Boolean DataFrame
print(df.isnull().sum()) # Count per column
# Drop rows/columns with missing values
df_dropped_rows = df.dropna() # Drop any row with NaN
df_dropped_cols = df.dropna(axis=1) # Drop columns with NaN
df_threshold = df.dropna(thresh=2) # Keep rows with at least 2 non-NaN
# Fill missing values
df_filled = df.fillna(0) # Fill with constant
df_filled = df.fillna(df.mean()) # Fill with column mean
df_filled = df.ffill() # Forward fill (fillna(method=...) is deprecated)
df_filled = df.bfill() # Backward fill
# Fill with different values per column
df_filled = df.fillna({'A': 0, 'B': df['B'].median()})
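For ordered numeric data, interpolation is often a better fill strategy than a constant or a column-wide mean, since it estimates each gap from its neighbors. A minimal sketch (linear interpolation assumes the values change smoothly between observations):

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, np.nan, 4.0])

# Linear interpolation fills each NaN from the surrounding values
filled = s.interpolate()
print(filled.tolist())  # [1.0, 2.0, 3.0, 4.0]
```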
Grouping and Aggregation
GroupBy operations enable powerful data summarization:
# Sample sales data
sales = pd.DataFrame({
    'region': ['East', 'East', 'West', 'West', 'East'],
    'product': ['A', 'B', 'A', 'B', 'A'],
    'sales': [100, 150, 200, 175, 120],
    'quantity': [10, 15, 20, 17, 12]
})
# Basic groupby
grouped = sales.groupby('region')['sales'].sum()
# Multiple aggregations
agg_result = sales.groupby('region').agg({
    'sales': ['sum', 'mean', 'count'],
    'quantity': ['sum', 'max']
})
# Custom aggregation functions
def range_calc(x):
    return x.max() - x.min()
custom_agg = sales.groupby('region')['sales'].agg([
    ('total', 'sum'),
    ('average', 'mean'),
    ('range', range_calc)
])
# Multiple grouping columns
multi_group = sales.groupby(['region', 'product'])['sales'].sum()
# Transform to broadcast results back
sales['region_avg'] = sales.groupby('region')['sales'].transform('mean')
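The region-and-product groupby above returns a Series with a MultiIndex; pivot_table() produces the same numbers laid out as a 2-D grid, which is often easier to read. A sketch using the same sales data (rebuilt here so the snippet is self-contained):

```python
import pandas as pd

sales = pd.DataFrame({
    'region': ['East', 'East', 'West', 'West', 'East'],
    'product': ['A', 'B', 'A', 'B', 'A'],
    'sales': [100, 150, 200, 175, 120]
})

# One row per region, one column per product, cells hold summed sales
pivot = sales.pivot_table(index='region', columns='product',
                          values='sales', aggfunc='sum')
print(pivot)
```

pivot_table(..., aggfunc='sum') is equivalent to groupby(['region', 'product']).sum() followed by unstack().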
Merging and Joining DataFrames
Combine multiple DataFrames using merge, join, and concat:
# Sample DataFrames
customers = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie']
})
orders = pd.DataFrame({
    'order_id': [101, 102, 103],
    'customer_id': [1, 2, 1],
    'amount': [250, 150, 300]
})
# Inner join (default)
merged = pd.merge(customers, orders, on='customer_id')
# Left join - keep all customers
left_merged = pd.merge(customers, orders, on='customer_id', how='left')
# Merge on different column names
df1 = pd.DataFrame({'id': [1, 2], 'value': [10, 20]})
df2 = pd.DataFrame({'customer_id': [1, 2], 'score': [5, 8]})
merged = pd.merge(df1, df2, left_on='id', right_on='customer_id')
# Concatenate DataFrames vertically
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
concatenated = pd.concat([df1, df2], ignore_index=True)
# Concatenate horizontally
horizontal = pd.concat([df1, df2], axis=1)
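When auditing a join, merge's indicator=True option adds a _merge column recording whether each output row came from the left side, the right side, or both, which makes unmatched rows easy to spot. A sketch with the customers/orders data from above:

```python
import pandas as pd

customers = pd.DataFrame({'customer_id': [1, 2, 3],
                          'name': ['Alice', 'Bob', 'Charlie']})
orders = pd.DataFrame({'order_id': [101, 102, 103],
                       'customer_id': [1, 2, 1],
                       'amount': [250, 150, 300]})

# _merge is 'both', 'left_only', or 'right_only' for each row
audited = pd.merge(customers, orders, on='customer_id',
                   how='left', indicator=True)
print(audited[['name', '_merge']])  # Charlie has no orders: 'left_only'
```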
Sorting and Ranking
Organize data with sort operations:
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'age': [25, 30, 25, 35],
    'salary': [50000, 60000, 55000, 70000]
})
# Sort by single column
sorted_df = df.sort_values('age')
# Sort by multiple columns
sorted_df = df.sort_values(['age', 'salary'], ascending=[True, False])
# Sort by index
sorted_df = df.sort_index()
# Rank values
df['salary_rank'] = df['salary'].rank(ascending=False)
df['age_rank'] = df['age'].rank(method='dense')
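For "top N" questions, nlargest() and nsmallest() are usually clearer than sorting the whole DataFrame and taking a head(). A sketch with the same salary data:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'salary': [50000, 60000, 55000, 70000]
})

# Returns the two rows with the highest salaries, highest first
top2 = df.nlargest(2, 'salary')
print(top2['name'].tolist())  # ['David', 'Bob']
```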
Performance Optimization Techniques
Optimize DataFrame operations for large datasets:
# Use categorical data types for repeated strings
df['category'] = df['category'].astype('category')
# Specify dtypes when reading CSV
df = pd.read_csv('large_file.csv', dtype={
    'id': 'int32',
    'value': 'float32',
    'category': 'category'
})
# Use chunking for large files
chunk_size = 10000
chunks = []
for chunk in pd.read_csv('large_file.csv', chunksize=chunk_size):
    processed = chunk[chunk['value'] > 100]
    chunks.append(processed)
result = pd.concat(chunks, ignore_index=True)
# Vectorized operations instead of apply
# Slow
df['result'] = df.apply(lambda row: row['A'] + row['B'], axis=1)
# Fast
df['result'] = df['A'] + df['B']
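You can verify the payoff of the categorical conversion above with memory_usage(deep=True), which counts the actual bytes behind Python string objects. A quick sketch on synthetic data:

```python
import pandas as pd

# 100,000 rows drawn from only four distinct strings
s = pd.Series(['East', 'West', 'North', 'South'] * 25000)

as_object = s.memory_usage(deep=True)
as_category = s.astype('category').memory_usage(deep=True)

# The categorical stores each string once plus small integer codes,
# so it should use a fraction of the object dtype's memory
print(as_object, as_category)
```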
These patterns cover the essential DataFrame operations needed for production data analysis workflows. Master these techniques to handle data transformation, cleaning, and analysis tasks efficiently.