Spark Scala - DataFrame Filter Rows

Key Insights

  • Spark’s filter() and where() methods are functionally identical—pick one and use it consistently across your codebase for readability
  • Filter operations push down to the data source when possible, so placing filters early in your transformation chain can dramatically reduce data shuffling and memory usage
  • String operations like contains() and rlike() are powerful but expensive—prefer exact matches or startsWith() when working with large datasets

Introduction

DataFrame filtering is the bread and butter of Spark data processing. Whether you’re cleaning messy data, extracting subsets for analysis, or implementing business logic, you’ll spend a significant portion of your Spark code filtering rows. Getting this right matters—not just for correctness, but for performance.

Spark Scala offers multiple syntaxes for filtering, which can be confusing when you’re starting out. This article cuts through the noise and gives you practical patterns you can apply immediately. We’ll cover everything from basic equality checks to complex multi-condition filters, with attention to the performance implications of each approach.

Basic Filter Syntax

Spark provides two methods for filtering DataFrames: filter() and where(). They’re completely interchangeable—where() exists purely for SQL familiarity. Pick one and stick with it.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder()
  .appName("FilterExamples")
  .master("local[*]")
  .getOrCreate()

import spark.implicits._

// Sample data
val employees = Seq(
  ("Alice", "Engineering", 85000, 32),
  ("Bob", "Sales", 65000, 28),
  ("Carol", "Engineering", 95000, 45),
  ("David", "Marketing", 55000, 23),
  ("Eve", "Engineering", 78000, 35)
).toDF("name", "department", "salary", "age")

// These are identical
val engineers1 = employees.filter($"department" === "Engineering")
val engineers2 = employees.where($"department" === "Engineering")
val engineers3 = employees.filter(col("department") === "Engineering")

// String expression syntax (less type-safe, but concise)
val engineers4 = employees.filter("department = 'Engineering'")

The $"column" syntax requires import spark.implicits._ and creates a Column object. The col("column") function from org.apache.spark.sql.functions does the same thing without the import. I prefer $"column" for brevity, but col() is more explicit and works in more contexts.

Note the triple equals === for equality comparison. This is Spark’s column equality operator. Scala’s own == would compare the Column object itself with the string at driver time, yielding a plain Boolean rather than a Column expression—never the row-wise comparison you want.
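To make the distinction concrete, here is a minimal sketch (using the example DataFrame above) contrasting the column operators with plain Scala equality:

```scala
// === builds a Column expression that Spark evaluates per row
val isEng: org.apache.spark.sql.Column = $"department" === "Engineering"

// =!= is the column-level "not equal"
val notEng = $"department" =!= "Engineering"

// == is plain Scala equality between a Column object and a String,
// evaluated once on the driver rather than per row -- never what you
// want in a filter:
// $"department" == "Engineering"   // Boolean, not a Column
```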

Comparison Operators

Spark supports all standard comparison operators, though the syntax differs slightly from SQL:

// Greater than, less than
val highEarners = employees.filter($"salary" > 70000)
val juniorStaff = employees.filter($"age" < 30)

// Not equal (two syntaxes)
val nonEngineers = employees.filter($"department" =!= "Engineering")
val alsoNonEngineers = employees.filter(!($"department" === "Engineering"))

// Greater/less than or equal
val midCareer = employees.filter($"age" >= 30 && $"age" <= 50)

// Between (inclusive)
val salaryRange = employees.filter($"salary".between(60000, 90000))

The between() method is inclusive on both ends, equivalent to >= AND <=. For numeric ranges, it’s cleaner than chaining two comparisons.

// These produce identical results
val range1 = employees.filter($"salary" >= 60000 && $"salary" <= 90000)
val range2 = employees.filter($"salary".between(60000, 90000))

String Filtering

String operations are where Spark filtering gets interesting. You have several options depending on your matching needs:

// Exact substring matching
val hasA = employees.filter($"name".contains("a"))  // Case-sensitive: Carol, David
val startsWithC = employees.filter($"name".startsWith("C"))  // Carol
val endsWithE = employees.filter($"name".endsWith("e"))  // Alice, Eve

// SQL LIKE patterns (% = any chars, _ = single char)
val likePattern = employees.filter($"name".like("_a%"))  // Second char is 'a': Carol, David

// Regular expressions
val regexMatch = employees.filter($"name".rlike("^[A-D].*"))  // Names starting A-D

For case-insensitive matching, convert to lowercase first:

val caseInsensitive = employees.filter(lower($"name").contains("alice"))
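Newer Spark versions also provide a case-insensitive LIKE. The `ilike` SQL function and the `Column.ilike` method were added around Spark 3.3—verify availability against your version before relying on it:

```scala
// Case-insensitive LIKE (Spark 3.3+; check your version)
val ciLike = employees.filter($"name".ilike("%ALICE%"))

// Equivalent approach that works on older versions
val ciLower = employees.filter(lower($"name").like("%alice%"))
```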

A word of caution: rlike() (regex) and even contains() can be expensive on large datasets because they can’t leverage indexes or partition pruning. If you’re filtering billions of rows, prefer startsWith() or exact matches when possible—these can push down to columnar formats like Parquet.

// Prefer this (can use statistics/indexes)
val efficient = employees.filter($"department" === "Engineering")

// Over this (requires scanning all values)
val lessEfficient = employees.filter($"department".contains("Engineer"))

Handling Null Values

Null handling trips up many developers. Spark follows SQL semantics where comparisons with null return null (not true or false), which means rows with null values often disappear unexpectedly.

val dataWithNulls = Seq(
  ("Alice", Some(85000)),
  ("Bob", None),
  ("Carol", Some(95000)),
  (null, Some(60000))
).toDF("name", "salary")

// Filter for nulls
val nullSalaries = dataWithNulls.filter($"salary".isNull)
val nullNames = dataWithNulls.filter($"name".isNull)

// Filter out nulls
val validSalaries = dataWithNulls.filter($"salary".isNotNull)

// Drop rows with any null values
val noNulls = dataWithNulls.na.drop()

// Drop rows with nulls in specific columns
val validNames = dataWithNulls.na.drop(Seq("name"))
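Beyond the defaults shown above, `na.drop` also accepts a mode and a non-null threshold. A brief sketch of the overloads:

```scala
// Drop rows only when ALL columns are null ("any" is the default)
val allNull = dataWithNulls.na.drop("all")

// Keep rows that have at least 2 non-null values
val atLeastTwo = dataWithNulls.na.drop(2)

// Require at least 1 non-null value among specific columns
val scoped = dataWithNulls.na.drop(1, Seq("name", "salary"))
```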

When combining null checks with other conditions, be explicit:

// This won't include rows where salary is null
val highEarners = dataWithNulls.filter($"salary" > 70000)

// If you want nulls treated as failing the condition (same behavior, but explicit)
val highEarnersExplicit = dataWithNulls.filter($"salary".isNotNull && $"salary" > 70000)

// If you want to include nulls in results
val highOrUnknown = dataWithNulls.filter($"salary" > 70000 || $"salary".isNull)
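Spark also offers a null-safe equality operator, `<=>` (method form `eqNullSafe`), which treats two nulls as equal and always returns true or false—useful when null is a meaningful value rather than missing data:

```scala
// Standard equality returns null when either side is null,
// so rows with null salary never pass the filter
val exact = dataWithNulls.filter($"salary" === 85000)

// Null-safe equality: null <=> null is true, null <=> 85000 is false
val matchNulls = dataWithNulls.filter($"salary" <=> lit(null))

// Equivalent method form
val matchNulls2 = dataWithNulls.filter($"salary".eqNullSafe(lit(null)))
```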

Combining Multiple Conditions

Real-world filters rarely involve single conditions. Spark uses && for AND, || for OR, and ! for NOT:

// AND: Both conditions must be true
val seniorEngineers = employees.filter(
  $"department" === "Engineering" && $"age" > 35
)

// OR: Either condition can be true
val techOrSales = employees.filter(
  $"department" === "Engineering" || $"department" === "Sales"
)

// For multiple OR conditions on the same column, use isin()
val selectedDepts = employees.filter($"department".isin("Engineering", "Sales", "Marketing"))

// NOT: Invert a condition
val notEngineering = employees.filter(!($"department" === "Engineering"))

// Complex combinations with parentheses
val complexFilter = employees.filter(
  ($"department" === "Engineering" && $"salary" > 80000) ||
  ($"department" === "Sales" && $"age" < 30)
)
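When the allowed values live in a Scala collection—say, loaded from configuration—expand it into isin() with `: _*`, since isin takes varargs:

```scala
// Values might come from config or another dataset (hypothetical example)
val allowed = Seq("Engineering", "Sales")

// Splat the Seq into isin's varargs
val dynamicFilter = employees.filter($"department".isin(allowed: _*))

// Negate for a NOT IN filter
val excluded = employees.filter(!$"department".isin(allowed: _*))
```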

You can also chain multiple filter calls. Spark optimizes these into a single filter operation:

// These are equivalent after optimization
val chained = employees
  .filter($"department" === "Engineering")
  .filter($"salary" > 80000)
  .filter($"age" > 30)

val combined = employees.filter(
  $"department" === "Engineering" && $"salary" > 80000 && $"age" > 30
)

I prefer chaining when each filter represents a distinct logical step (e.g., “active users” then “premium tier” then “recent activity”). Use combined conditions when the logic is inherently coupled.

Performance Considerations

Filter placement matters enormously for performance. Spark’s Catalyst optimizer performs predicate pushdown—moving filters as close to the data source as possible—but you can help it along.

// Good: Filter early, reduce data volume for subsequent operations
val result = employees
  .filter($"department" === "Engineering")  // Reduces rows first
  .groupBy($"age")
  .agg(avg($"salary"))

// Suboptimal: Aggregating all data before filtering
val suboptimal = employees
  .groupBy($"department", $"age")
  .agg(avg($"salary"))
  .filter($"department" === "Engineering")

Use explain() to verify predicate pushdown is working:

employees.filter($"department" === "Engineering").explain(true)

The physical plan will show PushedFilters when reading from sources like Parquet or JDBC that support pushdown:

== Physical Plan ==
*(1) Filter (isnotnull(department#1) AND (department#1 = Engineering))
+- FileScan parquet [name#0,department#1,salary#2,age#3] 
   PushedFilters: [IsNotNull(department), EqualTo(department,Engineering)]

Avoid UDFs in filters when built-in functions suffice. UDFs prevent many optimizations:

// Bad: UDF prevents pushdown and optimization
val isEngineering = udf((dept: String) => dept == "Engineering")
val withUdf = employees.filter(isEngineering($"department"))

// Good: Native function, fully optimized
val native = employees.filter($"department" === "Engineering")

For filters against a small lookup dataset, broadcast the lookup and use a left semi join. A semi join preserves filter semantics—it keeps only matching rows from the left side, never duplicates them (as an inner join would if the lookup contained duplicate keys), and adds no columns to drop:

val validDepartments = Seq("Engineering", "Sales").toDF("dept")

// Broadcast the small lookup; left_semi keeps only matching employee rows
val filtered = employees.join(
  broadcast(validDepartments),
  $"department" === $"dept",
  "left_semi"
)

Filter operations are fundamental to efficient Spark applications. Master the syntax, understand null semantics, and pay attention to where you place filters in your transformation chain. The patterns in this article will serve you well whether you’re processing megabytes or petabytes.