Pandas DataFrame Tutorial - Complete Guide with Examples

Key Insights

  • DataFrames are two-dimensional labeled data structures with columns of potentially different types, forming the core of pandas data manipulation
  • Creating DataFrames from dictionaries, lists, CSV files, and databases requires understanding different constructor patterns and parameter options
  • Mastering selection, filtering, grouping, and merging operations enables efficient data transformation workflows in production environments

Creating DataFrames from Different Sources

The most common way to create a DataFrame is from a dictionary where keys become column names:

import pandas as pd
import numpy as np

# From dictionary with lists
data = {
    'product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor'],
    'price': [1200, 25, 75, 350],
    'quantity': [5, 50, 30, 10]
}
df = pd.DataFrame(data)
print(df)

For CSV files, use read_csv() and tune its parameters to match real-world files:

# Read CSV with custom settings
df = pd.read_csv(
    'sales_data.csv',
    sep=',',
    encoding='utf-8',
    parse_dates=['order_date'],
    dtype={'product_id': str},
    na_values=['NA', 'missing', '']
)

# Read from URL
url = 'https://example.com/data.csv'
df = pd.read_csv(url)

Creating DataFrames from lists of dictionaries handles JSON-like structures:

records = [
    {'name': 'Alice', 'age': 30, 'city': 'NYC'},
    {'name': 'Bob', 'age': 25, 'city': 'LA'},
    {'name': 'Charlie', 'age': 35, 'city': 'Chicago'}
]
df = pd.DataFrame.from_records(records)
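When records contain nested dictionaries, from_records() keeps the inner dicts as single object-valued cells. A sketch using pd.json_normalize, which flattens nested keys into dotted column names (the nested records below are illustrative):

```python
import pandas as pd

# Nested records: each row carries an inner 'address' dict
nested = [
    {'name': 'Alice', 'address': {'city': 'NYC', 'zip': '10001'}},
    {'name': 'Bob', 'address': {'city': 'LA', 'zip': '90001'}},
]

# json_normalize flattens nested keys into dotted column names
df = pd.json_normalize(nested)
print(df.columns.tolist())  # ['name', 'address.city', 'address.zip']
```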

Selecting and Accessing Data

Use .loc[] for label-based indexing and .iloc[] for position-based indexing:

# Label-based selection
df.loc[0]  # First row
df.loc[0:2, 'product']  # Rows 0-2 (loc slices include the endpoint), product column
df.loc[df['price'] > 100, ['product', 'price']]  # Conditional selection

# Position-based selection
df.iloc[0]  # First row
df.iloc[0:3, 1:3]  # First 3 rows, columns 1-2
df.iloc[:, -1]  # All rows, last column

# Column selection
df['product']  # Single column as Series
df[['product', 'price']]  # Multiple columns as DataFrame

# Boolean indexing
expensive_items = df[df['price'] > 100]
multi_condition = df[(df['price'] > 50) & (df['quantity'] < 20)]

Filtering and Querying Data

The query() method provides readable filtering syntax:

# Query method for complex filters
result = df.query('price > 50 and quantity < 40')

# Using variables in queries
min_price = 100
result = df.query('price > @min_price')

# String operations in queries
df_with_text = pd.DataFrame({
    'product': ['Laptop Pro', 'Mouse Wireless', 'Keyboard Mech'],
    'price': [1200, 45, 150]
})
result = df_with_text.query('product.str.contains("Pro")', engine='python')

# isin() for membership testing
categories = ['Laptop', 'Monitor']
filtered = df[df['product'].isin(categories)]
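As a complement to the comparison operators above, Series.between() expresses an inclusive range filter in one call. A minimal sketch on the same sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor'],
    'price': [1200, 25, 75, 350],
})

# between() is inclusive on both ends by default
mid_range = df[df['price'].between(50, 400)]
print(mid_range['product'].tolist())  # ['Keyboard', 'Monitor']
```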

Data Transformation and Manipulation

Apply transformations using apply(), map(), and vectorized operations:

# Vectorized operations (fastest)
df['total_value'] = df['price'] * df['quantity']
df['discounted_price'] = df['price'] * 0.9

# Apply function to column
df['price_category'] = df['price'].apply(
    lambda x: 'High' if x > 500 else 'Medium' if x > 100 else 'Low'
)

# Apply function to entire DataFrame
def calculate_metrics(row):
    return row['price'] * row['quantity'] * 1.1

df['revenue_with_tax'] = df.apply(calculate_metrics, axis=1)

# Map for value replacement
category_map = {'Laptop': 'Electronics', 'Mouse': 'Accessories'}
df['category'] = df['product'].map(category_map)

# String operations
df['product_upper'] = df['product'].str.upper()
df['product_length'] = df['product'].str.len()
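The nested lambda used for price_category works, but it runs row by row. A vectorized sketch of the same categorization with numpy's np.select, where conditions are evaluated in order and the first match wins:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor'],
    'price': [1200, 25, 75, 350],
})

# Conditions are checked in order; the first match wins
conditions = [df['price'] > 500, df['price'] > 100]
choices = ['High', 'Medium']
df['price_category'] = np.select(conditions, choices, default='Low')
print(df['price_category'].tolist())  # ['High', 'Low', 'Low', 'Medium']
```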

Handling Missing Data

Production data almost always contains missing values, so handle them systematically:

# Create DataFrame with missing values
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': [9, 10, 11, 12]
})

# Detect missing values
print(df.isnull())  # Boolean DataFrame
print(df.isnull().sum())  # Count per column

# Drop rows/columns with missing values
df_dropped_rows = df.dropna()  # Drop any row with NaN
df_dropped_cols = df.dropna(axis=1)  # Drop columns with NaN
df_threshold = df.dropna(thresh=2)  # Keep rows with at least 2 non-NaN

# Fill missing values
df_filled = df.fillna(0)  # Fill with constant
df_filled = df.fillna(df.mean())  # Fill with column mean
df_filled = df.ffill()  # Forward fill (fillna(method='ffill') is deprecated)
df_filled = df.bfill()  # Backward fill

# Fill with different values per column
df_filled = df.fillna({'A': 0, 'B': df['B'].median()})
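For numeric columns, interpolate() estimates each gap from its neighbors instead of filling with a fixed value (linear interpolation by default). A minimal sketch:

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

# Linear interpolation fills each NaN from its neighboring values
filled = s.interpolate()
print(filled.tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0]
```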

Grouping and Aggregation

GroupBy operations enable powerful data summarization:

# Sample sales data
sales = pd.DataFrame({
    'region': ['East', 'East', 'West', 'West', 'East'],
    'product': ['A', 'B', 'A', 'B', 'A'],
    'sales': [100, 150, 200, 175, 120],
    'quantity': [10, 15, 20, 17, 12]
})

# Basic groupby
grouped = sales.groupby('region')['sales'].sum()

# Multiple aggregations
agg_result = sales.groupby('region').agg({
    'sales': ['sum', 'mean', 'count'],
    'quantity': ['sum', 'max']
})

# Custom aggregation functions
def range_calc(x):
    return x.max() - x.min()

custom_agg = sales.groupby('region')['sales'].agg([
    ('total', 'sum'),
    ('average', 'mean'),
    ('range', range_calc)
])

# Multiple grouping columns
multi_group = sales.groupby(['region', 'product'])['sales'].sum()

# Transform to broadcast results back
sales['region_avg'] = sales.groupby('region')['sales'].transform('mean')
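Alongside transform, groupby.filter() drops entire groups that fail a predicate while keeping the original rows of groups that pass. A sketch on the same sales data (the 370 threshold is an illustrative cutoff):

```python
import pandas as pd

sales = pd.DataFrame({
    'region': ['East', 'East', 'West', 'West', 'East'],
    'sales': [100, 150, 200, 175, 120],
})

# Keep only rows from regions whose total sales exceed 370
# East totals 370 (dropped), West totals 375 (kept)
big_regions = sales.groupby('region').filter(lambda g: g['sales'].sum() > 370)
print(big_regions['region'].unique().tolist())  # ['West']
```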

Merging and Joining DataFrames

Combine multiple DataFrames using merge, join, and concat:

# Sample DataFrames
customers = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie']
})

orders = pd.DataFrame({
    'order_id': [101, 102, 103],
    'customer_id': [1, 2, 1],
    'amount': [250, 150, 300]
})

# Inner join (default)
merged = pd.merge(customers, orders, on='customer_id')

# Left join - keep all customers
left_merged = pd.merge(customers, orders, on='customer_id', how='left')

# Merge on different column names
df1 = pd.DataFrame({'id': [1, 2], 'value': [10, 20]})
df2 = pd.DataFrame({'customer_id': [1, 2], 'score': [5, 8]})
merged = pd.merge(df1, df2, left_on='id', right_on='customer_id')

# Concatenate DataFrames vertically
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
concatenated = pd.concat([df1, df2], ignore_index=True)

# Concatenate horizontally
horizontal = pd.concat([df1, df2], axis=1)
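When auditing a join, merge's indicator parameter adds a _merge column that labels each row as 'both', 'left_only', or 'right_only'. A sketch on the customers and orders frames from above:

```python
import pandas as pd

customers = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie']
})
orders = pd.DataFrame({
    'order_id': [101, 102, 103],
    'customer_id': [1, 2, 1],
    'amount': [250, 150, 300]
})

# indicator=True flags the source of every row in the outer join
audit = pd.merge(customers, orders, on='customer_id', how='outer', indicator=True)
# Charlie has no orders, so his row is marked 'left_only'
print(audit[audit['_merge'] == 'left_only']['name'].tolist())  # ['Charlie']
```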

Sorting and Ranking

Organize data with sort operations:

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'age': [25, 30, 25, 35],
    'salary': [50000, 60000, 55000, 70000]
})

# Sort by single column
sorted_df = df.sort_values('age')

# Sort by multiple columns
sorted_df = df.sort_values(['age', 'salary'], ascending=[True, False])

# Sort by index
sorted_df = df.sort_index()

# Rank values
df['salary_rank'] = df['salary'].rank(ascending=False)
df['age_rank'] = df['age'].rank(method='dense')
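Related to sorting, nlargest() returns the top rows by a column directly, which avoids sorting the whole frame just to take the head. A sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'salary': [50000, 60000, 55000, 70000],
})

# Top two earners, highest first
top2 = df.nlargest(2, 'salary')
print(top2['name'].tolist())  # ['David', 'Bob']
```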

Performance Optimization Techniques

Optimize DataFrame operations for large datasets:

# Use categorical data types for repeated strings
df['category'] = df['category'].astype('category')

# Specify dtypes when reading CSV
df = pd.read_csv('large_file.csv', dtype={
    'id': 'int32',
    'value': 'float32',
    'category': 'category'
})

# Use chunking for large files
chunk_size = 10000
chunks = []
for chunk in pd.read_csv('large_file.csv', chunksize=chunk_size):
    processed = chunk[chunk['value'] > 100]
    chunks.append(processed)
result = pd.concat(chunks, ignore_index=True)

# Vectorized operations instead of apply
# Slow
df['result'] = df.apply(lambda row: row['A'] + row['B'], axis=1)

# Fast
df['result'] = df['A'] + df['B']
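To verify that dtype choices actually pay off, memory_usage(deep=True) reports per-column bytes including string storage. A sketch comparing object vs. category storage for a repetitive string column (the data is illustrative):

```python
import pandas as pd

# A column with many repeats of a few distinct strings
s_obj = pd.Series(['Electronics', 'Accessories'] * 5000)
s_cat = s_obj.astype('category')

obj_bytes = s_obj.memory_usage(deep=True)
cat_bytes = s_cat.memory_usage(deep=True)
# Category stores small integer codes plus two strings, so it is far smaller
print(cat_bytes < obj_bytes)  # True
```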

These patterns cover the essential DataFrame operations needed for production data analysis workflows. Master these techniques to handle data transformation, cleaning, and analysis tasks efficiently.
