How to Filter NaN Values in Pandas

Key Insights

  • Use dropna() for quick removal of rows or columns with missing data, but understand its parameters (how, subset, thresh) to avoid accidentally deleting valuable data.
  • Boolean indexing with isna() and notna() gives you fine-grained control when you need to filter based on specific columns or combine multiple conditions.
  • Filtering NaN values isn’t always the right choice—consider whether filling missing values with fillna() better preserves your dataset’s integrity for the analysis at hand.

Understanding NaN Values in Pandas

NaN values are the silent saboteurs of data analysis. They creep into your datasets from incomplete API responses, failed data entry, sensor malfunctions, or mismatched joins. Left unchecked, they’ll corrupt your calculations, crash your models, and produce misleading visualizations.

Pandas represents missing data as NaN (Not a Number), a special floating-point value from NumPy. Understanding how to detect and filter these values is fundamental to any data cleaning workflow. This article covers the practical techniques you’ll use daily when wrangling real-world data.
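One property of NaN worth internalizing before going further: it's an ordinary float, and it never compares equal to anything, including itself. That's why equality checks can't find missing data and dedicated detection methods exist. A quick sketch:

```python
import numpy as np
import pandas as pd

# NaN is a float, not a separate "missing" type
print(type(np.nan))  # <class 'float'>

# NaN never compares equal to anything, including itself
print(np.nan == np.nan)  # False

# This is why `value == np.nan` can't detect missing data;
# use pd.isna() instead
print(pd.isna(np.nan))  # True
```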

Detecting NaN Values

Before filtering, you need to know what you’re dealing with. Pandas provides three primary methods for NaN detection: isna(), isnull(), and notna().

The methods isna() and isnull() are identical—they’re aliases for the same function. Use whichever reads more naturally to you, but pick one and stick with it for consistency.
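You can verify the equivalence yourself: on any Series or DataFrame, the two methods produce identical results.

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, 3.0])

# isna() and isnull() return identical boolean results
print(s.isna().equals(s.isnull()))  # True
```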

import pandas as pd
import numpy as np

# Sample dataset with missing values
df = pd.DataFrame({
    'name': ['Alice', 'Bob', None, 'Diana', 'Eve'],
    'age': [25, np.nan, 35, 28, np.nan],
    'salary': [50000, 60000, np.nan, 75000, 55000],
    'department': ['Engineering', None, 'Sales', 'Engineering', 'Marketing']
})

# Check for NaN values across the entire DataFrame
print(df.isna())

This returns a DataFrame of boolean values where True indicates a NaN. More useful is counting NaNs per column:

# Count NaN values per column
print(df.isna().sum())

# Output:
# name          1
# age           2
# salary        1
# department    1
# dtype: int64

For a quick overview of your dataset’s completeness, calculate the percentage of missing values:

# Percentage of NaN values per column
print((df.isna().sum() / len(df)) * 100)

# Output:
# name          20.0
# age           40.0
# salary        20.0
# department    20.0
# dtype: float64

The notna() method returns the inverse—True where values exist. This becomes essential when filtering with boolean indexing.
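A minimal sketch of that inverse relationship:

```python
import pandas as pd

s = pd.Series(['Alice', None, 'Diana'])

print(s.isna().tolist())   # [False, True, False]
print(s.notna().tolist())  # [True, False, True]

# notna() is exactly the negation of isna()
print((~s.isna()).equals(s.notna()))  # True
```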

Filtering Rows with NaN Values

The dropna() method is your primary tool for removing rows containing NaN values. Its behavior changes dramatically based on the parameters you pass.

Basic Usage

Without arguments, dropna() removes any row containing at least one NaN value:

# Remove rows with any NaN values
clean_df = df.dropna()
print(clean_df)

# Output:
#     name   age   salary   department
# 0  Alice  25.0  50000.0  Engineering
# 3  Diana  28.0  75000.0  Engineering

This is aggressive—you went from five rows to two. In many cases, that’s too much data loss.

The how Parameter

Control whether to drop rows with any NaN or only rows where all values are NaN:

# Only drop rows where ALL values are NaN
df_with_empty_row = pd.concat([df, pd.DataFrame([[np.nan, np.nan, np.nan, np.nan]], 
                                                  columns=df.columns)])

# 'any' drops rows with at least one NaN (default)
print(df_with_empty_row.dropna(how='any'))

# 'all' only drops rows where every value is NaN
print(df_with_empty_row.dropna(how='all'))

Use how='all' when you want to remove completely empty rows while preserving partially complete data.

The subset Parameter

This is the parameter you’ll use most often. It lets you specify which columns to check for NaN values:

# Only drop rows where 'name' or 'salary' is NaN
filtered_df = df.dropna(subset=['name', 'salary'])
print(filtered_df)

# Output:
#     name   age   salary   department
# 0  Alice  25.0  50000.0  Engineering
# 1    Bob   NaN  60000.0         None
# 3  Diana  28.0  75000.0  Engineering
# 4    Eve   NaN  55000.0    Marketing

Notice that rows with NaN in age or department are preserved because those columns aren’t in our subset. This is how you target specific required fields.

The thresh Parameter

The thresh parameter specifies the minimum number of non-NaN values a row must have to survive:

# Keep rows with at least 3 non-NaN values
filtered_df = df.dropna(thresh=3)
print(filtered_df)

# Output:
#     name   age   salary   department
# 0  Alice  25.0  50000.0  Engineering
# 3  Diana  28.0  75000.0  Engineering
# 4    Eve   NaN  55000.0    Marketing

Rows 1 and 2 were dropped because each contains only two non-NaN values: Bob's row is missing age and department (None counts as missing), and row 2 is missing name and salary. This approach works well when you need a minimum level of data completeness without being rigid about which specific columns must be present.
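Because thresh counts non-NaN values rather than naming columns, you can derive it from the DataFrame's shape. One common pattern (a convention, not a pandas feature) is tolerating at most a fixed number of missing values per row:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'name': ['Alice', 'Bob', None, 'Diana', 'Eve'],
    'age': [25, np.nan, 35, 28, np.nan],
    'salary': [50000, 60000, np.nan, 75000, 55000],
    'department': ['Engineering', None, 'Sales', 'Engineering', 'Marketing']
})

# Allow at most one NaN per row: require len(columns) - 1 non-NaN values
max_missing = 1
filtered = df.dropna(thresh=len(df.columns) - max_missing)
print(filtered)
```

This keeps rows 0, 3, and 4, and adapts automatically if columns are added or removed later.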

Filtering Columns with NaN Values

Sometimes the problem isn’t rows—it’s entire columns with too much missing data. Use axis=1 to drop columns instead of rows:

# Create a DataFrame with a mostly-empty column
df_with_sparse_column = df.copy()
df_with_sparse_column['notes'] = [np.nan, np.nan, np.nan, np.nan, 'Some note']

# Drop columns containing any NaN
print(df_with_sparse_column.dropna(axis=1))

# Output: only columns with no NaN values remain. In this example every
# column contains at least one NaN, so the result has no columns left.

Combine with how and thresh for more nuanced filtering:

# Drop columns where ALL values are NaN
df_with_empty_column = df.copy()
df_with_empty_column['empty'] = np.nan
print(df_with_empty_column.dropna(axis=1, how='all'))

# Drop columns with fewer than 4 non-NaN values
print(df.dropna(axis=1, thresh=4))

A practical pattern for exploratory data analysis is dropping columns that exceed a missing data threshold:

# Drop columns with 30% or more missing values
threshold = 0.3
cols_to_keep = df.columns[df.isna().mean() < threshold]
df_filtered = df[cols_to_keep]

Boolean Indexing to Filter NaN

When dropna() doesn’t give you enough control, boolean indexing lets you build custom filtering logic.

Single Column Filtering

# Keep only rows where 'age' is not NaN
df_with_age = df[df['age'].notna()]
print(df_with_age)

# Keep only rows where 'age' IS NaN (useful for investigation)
df_missing_age = df[df['age'].isna()]
print(df_missing_age)

Combining Multiple Conditions

Boolean indexing shines when you need complex logic:

# Keep rows where BOTH 'age' and 'salary' are not NaN
df_complete = df[df['age'].notna() & df['salary'].notna()]

# Keep rows where EITHER 'age' or 'salary' is not NaN
df_partial = df[df['age'].notna() | df['salary'].notna()]

# Keep rows where 'name' exists AND ('age' is not NaN OR salary > 55000)
df_custom = df[df['name'].notna() & (df['age'].notna() | (df['salary'] > 55000))]

This approach is particularly useful when your filtering logic depends on the actual values, not just the presence of NaN:

# Keep rows where age is not NaN AND age is greater than 26
df_filtered = df[(df['age'].notna()) & (df['age'] > 26)]
print(df_filtered)

# Output:
#     name   age   salary   department
# 2   None  35.0      NaN        Sales
# 3  Diana  28.0  75000.0  Engineering

Filtering vs. Filling NaN Values

Here’s the decision that trips up many analysts: should you filter out NaN values or fill them?

When to Filter

Filter NaN values when:

  • The missing data represents truly absent information that can’t be reasonably estimated
  • You have enough data that losing some rows won’t impact your analysis
  • The values aren’t missing at random, and filling them would mask a systematic bias
  • You’re preparing data for algorithms that can’t handle any missing values

# Filtering: Remove incomplete records before analysis
analysis_df = df.dropna(subset=['age', 'salary'])
average_salary_by_age = analysis_df.groupby('age')['salary'].mean()

When to Fill

Fill NaN values when:

  • You can make reasonable assumptions about what the missing value should be
  • Losing rows would significantly reduce your dataset
  • The analysis method is sensitive to sample size
  • You’re using time series data where interpolation makes sense

# Filling: Replace NaN with appropriate values
df_filled = df.copy()

# Fill numeric columns with median (robust to outliers)
df_filled['age'] = df_filled['age'].fillna(df_filled['age'].median())
df_filled['salary'] = df_filled['salary'].fillna(df_filled['salary'].median())

# Fill categorical columns with mode or a placeholder
df_filled['department'] = df_filled['department'].fillna('Unknown')
df_filled['name'] = df_filled['name'].fillna('Anonymous')
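For time series, interpolation is often a better fill strategy than a constant. A minimal sketch with hypothetical daily sensor readings:

```python
import pandas as pd
import numpy as np

# Hypothetical daily readings with a two-day gap
ts = pd.Series([10.0, np.nan, np.nan, 16.0],
               index=pd.date_range('2024-01-01', periods=4, freq='D'))

# Linear interpolation estimates missing points from their neighbors
print(ts.interpolate().tolist())  # [10.0, 12.0, 14.0, 16.0]
```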

A common hybrid approach filters rows missing critical identifiers while filling optional fields:

# Hybrid: Filter on required fields, fill optional ones
df_clean = df.dropna(subset=['name'])  # Name is required
df_clean['age'] = df_clean['age'].fillna(df_clean['age'].median())
df_clean['department'] = df_clean['department'].fillna('Unassigned')

Conclusion

Filtering NaN values in Pandas comes down to choosing the right tool for your specific situation. Use dropna() with subset when you need rows complete in specific columns. Use thresh when you care about overall row completeness rather than specific fields. Use boolean indexing when your filtering logic gets complex or depends on actual values.

The key insight is that filtering should be intentional. Don’t reflexively call dropna() at the start of every analysis. Understand your missing data first—why it’s missing, how much you have, and whether filtering or filling better serves your analytical goals. Your future self debugging a model trained on inappropriately cleaned data will thank you.
