SQL vs Pandas - Equivalent Operations


Key Insights

  • SQL and Pandas share nearly identical conceptual models—once you internalize the mapping between them, switching contexts becomes effortless and you’ll write better code in both
  • Pandas often requires more verbose syntax for operations that are one-liners in SQL, but compensates with powerful chaining and Python ecosystem integration
  • Choose SQL for large-scale data processing at the database level; choose Pandas for exploratory analysis, complex transformations, and when your data already lives in Python

Introduction & Setup

Data professionals constantly switch between SQL and Pandas. You might query a data warehouse in the morning and clean CSVs in a Jupyter notebook by afternoon. Knowing both isn’t optional—it’s table stakes.

The good news: these tools think about data the same way. Tables are DataFrames. Columns are Series. Once you see the patterns, translation becomes mechanical.

Let’s establish a sample dataset we’ll use throughout. We’ll work with employee and department data—small enough to reason about, complex enough to demonstrate real operations.

import pandas as pd
import numpy as np

# Employees DataFrame
employees = pd.DataFrame({
    'emp_id': [1, 2, 3, 4, 5, 6],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank'],
    'dept_id': [1, 1, 2, 2, 3, None],
    'salary': [75000, 82000, 65000, 91000, 78000, 55000],
    'hire_date': pd.to_datetime(['2020-01-15', '2019-06-01', '2021-03-20', 
                                  '2018-11-10', '2022-07-05', '2023-01-30'])
})

# Departments DataFrame
departments = pd.DataFrame({
    'dept_id': [1, 2, 3, 4],
    'dept_name': ['Engineering', 'Sales', 'Marketing', 'HR'],
    'budget': [500000, 300000, 200000, 150000]
})

The equivalent SQL tables would have identical schemas. Assume they exist in your database as employees and departments.
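If your data starts in a database, pandas can move tables in both directions. A minimal sketch using Python's built-in sqlite3 as a stand-in for your warehouse (the small DataFrame here is a cut-down version of the employees table above; any DB-API or SQLAlchemy connection works the same way):

```python
import sqlite3
import pandas as pd

# Cut-down stand-in for the employees table above
employees = pd.DataFrame({
    'emp_id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie'],
    'salary': [75000, 82000, 65000],
})

# In-memory SQLite database standing in for your warehouse
conn = sqlite3.connect(':memory:')
employees.to_sql('employees', conn, index=False)

# read_sql turns any query result into a DataFrame
high_earners = pd.read_sql(
    'SELECT name, salary FROM employees WHERE salary > 70000', conn)
```

This round trip is also a handy way to sanity-check that a SQL query and its pandas translation agree.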

Selecting & Filtering Data

The most common operation: grab specific columns and rows.

SQL:

-- Select specific columns
SELECT name, salary FROM employees;

-- Filter with WHERE
SELECT name, salary 
FROM employees 
WHERE salary > 70000 AND dept_id = 1;

Pandas:

# Select specific columns
employees[['name', 'salary']]

# Filter with boolean indexing
employees.loc[(employees['salary'] > 70000) & (employees['dept_id'] == 1), 
              ['name', 'salary']]

# Or use query() for SQL-like syntax
employees.query('salary > 70000 and dept_id == 1')[['name', 'salary']]

The query() method deserves attention. It accepts a string expression that feels like a WHERE clause. For complex filtering, it’s more readable than chained boolean conditions. One gotcha: column names with spaces or special characters require backticks.
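For example, with a hypothetical column whose name contains a space:

```python
import pandas as pd

df = pd.DataFrame({'base salary': [50000, 90000], 'name': ['A', 'B']})

# Backticks let query() reference columns that aren't valid Python identifiers
result = df.query('`base salary` > 60000')
```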

# query() handles variables with @
min_salary = 70000
employees.query('salary > @min_salary')

Pattern: SQL’s WHERE maps to Pandas boolean indexing or query(). Use query() when conditions get complex—your future self will thank you.

Sorting & Limiting Results

SQL:

-- Single column sort
SELECT * FROM employees ORDER BY salary DESC;

-- Multi-column sort
SELECT * FROM employees ORDER BY dept_id ASC, salary DESC;

-- Limit results
SELECT * FROM employees ORDER BY salary DESC LIMIT 3;

Pandas:

# Single column sort
employees.sort_values('salary', ascending=False)

# Multi-column sort
employees.sort_values(['dept_id', 'salary'], ascending=[True, False])

# Limit results
employees.sort_values('salary', ascending=False).head(3)

# Or use nlargest/nsmallest for top-N queries
employees.nlargest(3, 'salary')

nlargest() and nsmallest() are optimized for top-N queries—they’re faster than sorting the entire DataFrame when you only need a few rows. Use them.
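A related pattern SQL usually solves with ROW_NUMBER() per partition is top-N within each group. One sketch, using stand-in data, combines sort_values() with groupby().head():

```python
import pandas as pd

# Stand-in data for illustration
employees = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'dept_id': [1, 1, 2, 2],
    'salary': [75000, 82000, 65000, 91000],
})

# Top earner per department: sort once, then keep the first row of each group
top_per_dept = (employees
    .sort_values('salary', ascending=False)
    .groupby('dept_id')
    .head(1))
```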

Aggregations & Grouping

This is where both tools shine. Aggregations follow the same split-apply-combine pattern.

SQL:

-- Basic aggregations
SELECT 
    dept_id,
    COUNT(*) as emp_count,
    AVG(salary) as avg_salary,
    MAX(salary) as max_salary
FROM employees
GROUP BY dept_id;

-- Filter groups with HAVING
SELECT dept_id, AVG(salary) as avg_salary
FROM employees
GROUP BY dept_id
HAVING AVG(salary) > 70000;

Pandas:

# Basic aggregations
employees.groupby('dept_id').agg(
    emp_count=('emp_id', 'count'),
    avg_salary=('salary', 'mean'),
    max_salary=('salary', 'max')
).reset_index()

# Filter groups (HAVING equivalent)
(employees
    .groupby('dept_id')['salary']
    .mean()
    .reset_index(name='avg_salary')
    .query('avg_salary > 70000'))

The named aggregation syntax (emp_count=('emp_id', 'count')) arrived in Pandas 0.25 and is now the preferred approach. It’s explicit about which column gets which function and what the output column should be named.

For HAVING equivalents, you filter after aggregation. Chain a query() or boolean filter on the aggregated result.

# Multiple aggregations with filtering
result = (employees
    .groupby('dept_id')
    .agg({'salary': ['mean', 'count']})
    .droplevel(0, axis=1)  # Drop the 'salary' level from the MultiIndex columns
    .reset_index())
result.columns = ['dept_id', 'avg_salary', 'emp_count']
result.query('emp_count >= 2')

Joins & Merging Data

Joins are fundamental. The syntax differs more here than anywhere else.

SQL:

-- Inner join
SELECT e.name, d.dept_name
FROM employees e
INNER JOIN departments d ON e.dept_id = d.dept_id;

-- Left join (keep all employees)
SELECT e.name, d.dept_name
FROM employees e
LEFT JOIN departments d ON e.dept_id = d.dept_id;

-- Find employees without departments
SELECT e.name
FROM employees e
LEFT JOIN departments d ON e.dept_id = d.dept_id
WHERE d.dept_id IS NULL;

Pandas:

# Inner join
employees.merge(departments, on='dept_id', how='inner')[['name', 'dept_name']]

# Left join
employees.merge(departments, on='dept_id', how='left')[['name', 'dept_name']]

# Find employees without departments
merged = employees.merge(departments, on='dept_id', how='left', indicator=True)
merged.query('_merge == "left_only"')['name']

The indicator=True parameter adds a _merge column showing where each row came from: left_only, right_only, or both. It’s invaluable for debugging joins and finding unmatched records.

Common gotcha: If join columns have different names, use left_on and right_on:

# When column names differ
employees.merge(departments, left_on='dept_id', right_on='department_id')
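A second merge gotcha: non-key columns that exist on both sides get _x/_y suffixes by default. Setting suffixes explicitly keeps the output readable. A small sketch with made-up tables:

```python
import pandas as pd

# Made-up tables that share a non-key column name
left = pd.DataFrame({'dept_id': [1, 2], 'budget': [100, 200]})
right = pd.DataFrame({'dept_id': [1, 2], 'budget': [150, 250]})

# Explicit suffixes beat the default '_x'/'_y'
merged = left.merge(right, on='dept_id', suffixes=('_plan', '_actual'))
```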

Advanced Operations

Window Functions

Window functions compute values across related rows without collapsing them. SQL has elegant syntax here; Pandas requires more work.

SQL:

-- Row number within department
SELECT name, dept_id, salary,
    ROW_NUMBER() OVER (PARTITION BY dept_id ORDER BY salary DESC) as rank_in_dept
FROM employees;

-- Running total
SELECT name, salary,
    SUM(salary) OVER (ORDER BY hire_date) as running_total
FROM employees;

Pandas:

# Row number within department
# Note: Frank's dept_id is NaN, so his rank is missing. Use the nullable
# 'Int64' dtype; .astype(int) would raise on the NaN.
employees['rank_in_dept'] = (employees
    .groupby('dept_id')['salary']
    .rank(method='first', ascending=False)
    .astype('Int64'))

# Running total
employees_sorted = employees.sort_values('hire_date')
employees_sorted['running_total'] = employees_sorted['salary'].cumsum()
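SQL's AVG(salary) OVER (PARTITION BY dept_id), an aggregate broadcast back to every row, maps to groupby().transform(). A sketch with stand-in data:

```python
import pandas as pd

# Stand-in data for illustration
employees = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'dept_id': [1, 1, 2],
    'salary': [70000, 80000, 65000],
})

# transform() returns a result aligned to the original rows,
# so each row gets its department's average without collapsing groups
employees['dept_avg'] = employees.groupby('dept_id')['salary'].transform('mean')
```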

Conditional Logic

SQL:

SELECT name, salary,
    CASE 
        WHEN salary >= 80000 THEN 'Senior'
        WHEN salary >= 60000 THEN 'Mid'
        ELSE 'Junior'
    END as level
FROM employees;

Pandas:

# Using np.select for multiple conditions
conditions = [
    employees['salary'] >= 80000,
    employees['salary'] >= 60000
]
choices = ['Senior', 'Mid']
employees['level'] = np.select(conditions, choices, default='Junior')

# Or with apply for complex logic
def assign_level(salary):
    if salary >= 80000: return 'Senior'
    if salary >= 60000: return 'Mid'
    return 'Junior'

employees['level'] = employees['salary'].apply(assign_level)

Use np.select() for performance with simple conditions. Reserve apply() for complex logic that can’t be vectorized—it’s significantly slower.
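When the CASE expression is really just binning a numeric column, pd.cut is a third vectorized option. A sketch using bin edges that mirror the thresholds above:

```python
import pandas as pd

salaries = pd.Series([55000, 65000, 91000])

# Right-open bins: [0, 60k) Junior, [60k, 80k) Mid, [80k, inf) Senior
levels = pd.cut(salaries,
                bins=[0, 60000, 80000, float('inf')],
                labels=['Junior', 'Mid', 'Senior'],
                right=False)
```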

Subqueries

SQL:

-- Employees earning above average
SELECT name, salary
FROM employees
WHERE salary > (SELECT AVG(salary) FROM employees);

Pandas:

# Subquery equivalent via chaining
avg_salary = employees['salary'].mean()
employees.query('salary > @avg_salary')[['name', 'salary']]

Pandas handles “subqueries” through variable assignment or method chaining. It’s often more readable than nested SQL.
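A correlated subquery, say employees earning above their own department's average, works the same way: broadcast the grouped aggregate back to each row, then filter. A sketch with stand-in data:

```python
import pandas as pd

# Stand-in data for illustration
employees = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'dept_id': [1, 1, 2, 2],
    'salary': [75000, 82000, 65000, 91000],
})

# Each row gets its department's mean, then a plain boolean filter
dept_avg = employees.groupby('dept_id')['salary'].transform('mean')
above_avg = employees[employees['salary'] > dept_avg]
```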

Performance & When to Use Each

Choose SQL when:

  • Data lives in a database and is large (millions+ rows)
  • You need to filter before transferring data
  • Multiple users need consistent, optimized queries
  • Your operations are standard: filters, joins, aggregations

Choose Pandas when:

  • Data is already in Python (APIs, files, scraping)
  • You need exploratory iteration—try something, inspect, adjust
  • Transformations require Python libraries (regex, datetime parsing, ML preprocessing)
  • You’re building a pipeline that feeds into other Python code

Quick Reference:

Operation           SQL                        Pandas
Select columns      SELECT a, b                df[['a', 'b']]
Filter rows         WHERE x > 5                df.query('x > 5')
Sort                ORDER BY x DESC            df.sort_values('x', ascending=False)
Limit               LIMIT 10                   df.head(10)
Group & aggregate   GROUP BY x                 df.groupby('x').agg()
Join                JOIN ... ON                df.merge(..., on=)
Window function     OVER (PARTITION BY)        df.groupby().transform()
Conditional         CASE WHEN                  np.select()

Master both. The mental model transfers. Your data work gets faster regardless of which tool you’re holding.
