Pandas Interview Questions and Answers (Top 50)
Key Insights
- Pandas interviews test three distinct skill levels: syntax recall (beginner), data manipulation fluency (intermediate), and performance optimization awareness (advanced)—prepare for all three
- The most common interview trap is using loops instead of vectorized operations; always demonstrate you understand Pandas’ native methods first
- Real-world scenario questions matter most: interviewers want to see you chain operations together to solve business problems, not just recite API documentation
Introduction & How to Use This Guide
Pandas remains the backbone of data manipulation in Python. Whether you’re interviewing for a data scientist, data engineer, or backend developer role that touches analytics, expect Pandas questions. Interviewers use these questions to gauge not just your syntax knowledge, but your understanding of efficient data processing patterns.
This guide organizes 50 questions into tiers. Beginners should nail questions 1-15 before moving on—these appear in screening calls and take-home assessments. Intermediate questions (16-30) dominate technical interviews. Advanced and performance questions (31-47) separate senior candidates from the pack. Scenario-based questions (48-50) reflect real interview formats where you’ll code live.
Study approach: Don’t memorize. Type out each code example, break it, fix it. Interviewers can tell when you’ve only read about groupby() versus actually used it.
Beginner Questions (Questions 1-15)
Q1: What’s the difference between a Series and a DataFrame? A Series is a one-dimensional labeled array. A DataFrame is a two-dimensional table—essentially a dictionary of Series sharing the same index.
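A minimal sketch of that relationship, using illustrative data:

```python
import pandas as pd

# A Series: one-dimensional labeled array
ages = pd.Series([25, 30, 35], name='age')

# A DataFrame: essentially a dict of Series sharing one index
df = pd.DataFrame({'name': pd.Series(['Alice', 'Bob', 'Charlie']),
                   'age': ages})

print(type(df['age']))  # selecting one column gives back a Series
print(df.shape)         # (rows, columns) -> (3, 2)
```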
Q2: How do you create a DataFrame from a dictionary?
import pandas as pd
data = {'name': ['Alice', 'Bob', 'Charlie'],
        'age': [25, 30, 35],
        'city': ['NYC', 'LA', 'Chicago']}
df = pd.DataFrame(data)
Q3: What’s the difference between loc and iloc?
loc uses labels (row/column names). iloc uses integer positions. This is the most common beginner question—get it wrong and the interview ends early.
df.loc[0, 'name'] # Label-based: row with label 0, column 'name'
df.iloc[0, 0] # Position-based: first row, first column
df.loc[0:2, 'name'] # Inclusive on both ends
df.iloc[0:2, 0] # Exclusive on end (Python standard)
Q4: How do you read a CSV file with common parameters?
df = pd.read_csv('data.csv',
                 sep=',',
                 header=0,
                 index_col='id',
                 usecols=['id', 'name', 'value'],
                 dtype={'value': float},
                 parse_dates=['date_column'],
                 na_values=['NA', 'missing'])
Q5: What do head(), tail(), info(), and describe() do?
head(n) returns the first n rows (default 5); tail(n) returns the last n rows. info() shows column dtypes, non-null counts, and memory usage. describe() provides a statistical summary (count, mean, std, quartiles) for numeric columns by default.
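A quick demonstration on a small sample frame (illustrative data):

```python
import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie'],
                   'age': [25, 30, 35]})

print(df.head(2))     # first 2 rows
print(df.tail(1))     # last row
df.info()             # dtypes and non-null counts per column
print(df.describe())  # count, mean, std, min, quartiles, max
```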
Q6: How do you select multiple columns?
df[['name', 'age']] # Returns DataFrame
df['name'] # Returns Series
df.filter(items=['name', 'age']) # Alternative method
Q7: How do you check data types of columns?
Use df.dtypes for all columns or df['column'].dtype for a single column.
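For example (sample data):

```python
import pandas as pd

df = pd.DataFrame({'name': ['Alice'], 'age': [25]})
print(df.dtypes)        # one dtype per column: object, int64
print(df['age'].dtype)  # dtype of a single column
```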
Q8: How do you rename columns?
df.rename(columns={'old_name': 'new_name'}, inplace=True)
df.columns = ['col1', 'col2', 'col3'] # Replace all at once
Q9-15 cover: adding/dropping columns, resetting index, checking shape, unique values (nunique(), unique()), value counts, and basic arithmetic operations on columns.
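A few of those operations sketched together (illustrative data):

```python
import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob', 'Bob'],
                   'age': [25, 30, 30]})

df['age_plus_one'] = df['age'] + 1       # column arithmetic
df = df.drop(columns=['age_plus_one'])   # drop a column
print(df.shape)                          # check shape: (3, 2)
print(df['name'].nunique())              # count of distinct values: 2
print(df['name'].value_counts())         # frequency per value
df = df.reset_index(drop=True)           # fresh 0..n-1 index
```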
Intermediate Questions (Questions 16-30)
Q16: How do you filter rows with multiple conditions?
# Use parentheses and bitwise operators
filtered = df[(df['age'] > 25) & (df['city'] == 'NYC')]
# Alternative with query() - cleaner for complex conditions
filtered = df.query('age > 25 and city == "NYC"')
Q17: How do you handle missing values?
df.isna().sum() # Count missing per column
df.dropna(subset=['critical_col']) # Drop rows missing critical data
df.fillna({'age': df['age'].median(),
           'city': 'Unknown'})       # Fill with specific values
df.interpolate(method='linear') # For time series
Q18: Explain groupby operations.
# Basic aggregation
df.groupby('city')['age'].mean()
# Multiple aggregations
df.groupby('city').agg({
    'age': ['mean', 'min', 'max'],
    'salary': 'sum'
})
# Named aggregations (cleaner output)
df.groupby('city').agg(
    avg_age=('age', 'mean'),
    total_salary=('salary', 'sum'),
    count=('name', 'count')
)
Q19: What are the different merge types?
# Inner: only matching keys
pd.merge(df1, df2, on='key', how='inner')
# Left: all from left, matching from right
pd.merge(df1, df2, on='key', how='left')
# Outer: all from both
pd.merge(df1, df2, on='key', how='outer')
# Merge on different column names
pd.merge(df1, df2, left_on='id', right_on='user_id')
Q20: How does apply() work?
# Apply to column
df['age_bracket'] = df['age'].apply(lambda x: 'young' if x < 30 else 'senior')
# Apply to rows (axis=1)
df['full_info'] = df.apply(lambda row: f"{row['name']}, {row['age']}", axis=1)
# Apply with custom function
def categorize(value):
    if value < 25: return 'junior'
    elif value < 35: return 'mid'
    return 'senior'
df['level'] = df['age'].apply(categorize)
Q21-30 cover: sorting (sort_values, sort_index), string methods (str.contains(), str.replace()), concat() vs merge(), handling duplicates, value_counts(normalize=True), conditional column creation with np.where(), and transform() for group-level operations.
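Of these, transform() trips up the most candidates: it broadcasts a group-level aggregate back to every row, keeping the original shape. A short sketch combining it with np.where() (illustrative columns):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'city': ['NYC', 'NYC', 'LA'],
                   'salary': [100, 120, 90]})

# Group mean, aligned back to each row (same length as df)
df['city_avg'] = df.groupby('city')['salary'].transform('mean')

# Conditional column via np.where
df['above_avg'] = np.where(df['salary'] > df['city_avg'], 'yes', 'no')
print(df)
```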
Advanced Questions (Questions 31-42)
Q31: How do you create a pivot table?
pivot = df.pivot_table(
    values='sales',
    index='region',
    columns='quarter',
    aggfunc=['sum', 'mean'],
    fill_value=0,
    margins=True  # Adds row/column totals
)
Q32: Explain multi-indexing.
df.set_index(['region', 'city'], inplace=True)
df.loc[('West', 'LA')] # Access specific combination
df.xs('LA', level='city') # Cross-section at level
df.reset_index() # Flatten back
Q33: How do window functions work?
# Rolling calculations
df['rolling_avg'] = df['value'].rolling(window=7).mean()
df['rolling_std'] = df['value'].rolling(window=7).std()
# Expanding (cumulative)
df['cumulative_max'] = df['value'].expanding().max()
# Shift for lag features
df['previous_value'] = df['value'].shift(1)
df['pct_change'] = df['value'].pct_change()
Q34: How do you optimize memory usage?
# Check current memory
df.memory_usage(deep=True)
# Convert to categorical for low-cardinality strings
df['status'] = df['status'].astype('category')
# Downcast numeric types
df['count'] = pd.to_numeric(df['count'], downcast='integer')
df['price'] = pd.to_numeric(df['price'], downcast='float')
Q35-42 cover: query() method internals, method chaining patterns, time series resampling (resample('M').sum()), melt() for unpivoting, explode() for list columns, and cut()/qcut() for binning.
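Two of these reshaping methods deserve a quick sketch, since interviewers often ask for them by behavior rather than name (illustrative data):

```python
import pandas as pd

# melt(): wide -> long (unpivot quarter columns into rows)
wide = pd.DataFrame({'region': ['West', 'East'],
                     'q1': [10, 20], 'q2': [30, 40]})
long_df = wide.melt(id_vars='region', var_name='quarter',
                    value_name='sales')
print(long_df.shape)  # 2 regions x 2 quarters -> (4, 3)

# explode(): one row per element of a list-valued column
orders = pd.DataFrame({'order': [1, 2], 'items': [['a', 'b'], ['c']]})
print(orders.explode('items').shape)  # (3, 2)
```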
Performance & Best Practices Questions (Questions 43-47)
Q43: Why avoid loops in Pandas?
import numpy as np
# Bad: Loop (slow)
for i in range(len(df)):
    df.loc[i, 'doubled'] = df.loc[i, 'value'] * 2
# Good: Vectorized (fast)
df['doubled'] = df['value'] * 2
# Good: NumPy for complex conditions
df['category'] = np.where(df['value'] > 100, 'high', 'low')
Q44: itertuples() vs iterrows()?
itertuples() is 10-100x faster because it yields lightweight namedtuples. iterrows() builds a full Series for every row, which is expensive and can silently upcast dtypes. Only iterate when vectorization is truly impossible.
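A side-by-side sketch of both iteration styles (toy data):

```python
import pandas as pd

df = pd.DataFrame({'value': [1, 2, 3]})

# iterrows(): yields (index, Series) pairs - each row boxed into a Series
total = 0
for _, row in df.iterrows():
    total += row['value']

# itertuples(): yields namedtuples - far cheaper per row
total2 = 0
for row in df.itertuples():
    total2 += row.value

print(total, total2)  # both sum to 6
```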
Q45: How do you process files too large for memory?
# Chunked processing
chunks = pd.read_csv('huge_file.csv', chunksize=100000)
results = []
for chunk in chunks:
    processed = chunk.groupby('category')['value'].sum()
    results.append(processed)
final = pd.concat(results).groupby(level=0).sum()
Q46-47 cover: eval() for fast expression evaluation, choosing appropriate dtypes upfront, and why inplace=True is usually not worth it (it is rarely faster and it breaks method chaining).
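A brief eval() sketch: it evaluates a string expression over the frame's columns, avoiding intermediate temporary arrays for each operator on large frames (illustrative data):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# Equivalent to df['a'] + df['b'] * 2, but evaluated as one expression
df['c'] = df.eval('a + b * 2')
print(df['c'].tolist())  # [21, 42, 63]
```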
Scenario-Based Questions (Questions 48-50)
Q48: Write a data cleaning pipeline.
def clean_sales_data(filepath):
    return (pd.read_csv(filepath)
            .rename(columns=str.lower)
            .drop_duplicates(subset=['order_id'])
            .dropna(subset=['customer_id', 'amount'])
            .assign(
                amount=lambda x: pd.to_numeric(x['amount'], errors='coerce'),
                order_date=lambda x: pd.to_datetime(x['order_date']),
                region=lambda x: x['region'].str.upper().str.strip()
            )
            .query('amount > 0')
            .reset_index(drop=True))
Q49: Perform EDA on a dataset.
def quick_eda(df):
    print(f"Shape: {df.shape}")
    print(f"\nMissing values:\n{df.isna().sum()}")
    print(f"\nNumeric summary:\n{df.describe()}")
    print("\nCategorical columns:")
    for col in df.select_dtypes(include='object'):
        print(f"  {col}: {df[col].nunique()} unique values")
Q50: Calculate month-over-month growth by category.
monthly_sales = (df
                 .assign(month=df['date'].dt.to_period('M'))
                 .groupby(['category', 'month'])['revenue']
                 .sum()
                 .unstack(level=0)
                 .pct_change()
                 .stack()
                 .rename('mom_growth'))
Quick Reference Cheat Sheet
| Operation | Syntax |
|---|---|
| Read CSV | pd.read_csv('file.csv') |
| Filter rows | df[df['col'] > value] |
| Select columns | df[['col1', 'col2']] |
| Group & aggregate | df.groupby('col').agg({'val': 'sum'}) |
| Merge | pd.merge(df1, df2, on='key', how='left') |
| Handle nulls | df.fillna(value) / df.dropna() |
| Sort | df.sort_values('col', ascending=False) |
| Pivot | df.pivot_table(values, index, columns, aggfunc) |
Interview day tips: Verbalize your thought process. Start with the simplest solution, then optimize. When stuck, describe what you’d Google—interviewers value problem-solving over memorization.