PySpark - Show DataFrame Contents with show()

Key Insights

• The show() method triggers immediate DataFrame evaluation despite PySpark's lazy execution model, making it essential for debugging but potentially expensive on large datasets

• Control output precision with three key parameters: n for row count (default 20), truncate for column width (default 20 chars), and vertical for transposed display of wide DataFrames

• Always combine show() with limit() or filtered views when working with production-scale data to avoid accidentally materializing millions of rows

Introduction to DataFrame Display in PySpark

When working with PySpark DataFrames, you're operating in a distributed computing environment where data transformations are lazily evaluated. This means your code defines an execution plan rather than immediately processing data. The show() method breaks this pattern—it's an action that forces evaluation and displays DataFrame contents to your console.

Unlike pandas, where you can simply print a DataFrame or use head(), PySpark requires explicit display methods because your data might be distributed across hundreds of nodes in a cluster. The show() method collects just the first rows it needs from across the cluster and formats them for human-readable output, making it indispensable during development, debugging, and exploratory data analysis.

Here’s the simplest possible example:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ShowExample").getOrCreate()

data = [
    (1, "Alice", "Engineer"),
    (2, "Bob", "Data Scientist"),
    (3, "Charlie", "Manager"),
    (4, "Diana", "Analyst")
]

df = spark.createDataFrame(data, ["id", "name", "role"])
df.show()

Output:

+---+-------+--------------+
| id|   name|          role|
+---+-------+--------------+
|  1|  Alice|      Engineer|
|  2|    Bob|Data Scientist|
|  3|Charlie|       Manager|
|  4|  Diana|       Analyst|
+---+-------+--------------+

Basic show() Syntax and Default Behavior

The show() method signature is straightforward: show(n=20, truncate=True, vertical=False). By default, it displays 20 rows and truncates any column content exceeding 20 characters. This conservative approach prevents overwhelming your console with massive outputs.

The default behavior is sensible for most interactive development scenarios. Twenty rows provide enough context to understand data structure and spot obvious issues without scrolling through excessive output. The truncation prevents single cells with long strings from destroying your table formatting.

For developers coming from pandas, here’s a comparison:

# Pandas approach
import pandas as pd
pandas_df = pd.DataFrame(data, columns=["id", "name", "role"])
print(pandas_df.head())  # Shows 5 rows by default

# PySpark equivalent
df.show(5)  # Explicitly request 5 rows

The key difference: pandas head() returns a new DataFrame you can assign or chain, while PySpark’s show() is purely for display and returns None. If you need the actual data for further processing, use take() or limit() instead.

Controlling Row Display

The first parameter n controls how many rows to display. This seems simple but has important implications for performance and usability.

Display fewer rows when quickly checking schema or validating transformations:

# Quick validation after transformation
df_transformed = df.filter(df.id > 2).select("name", "role")
df_transformed.show(5)

Output:

+-------+-------+
|   name|   role|
+-------+-------+
|Charlie|Manager|
|  Diana|Analyst|
+-------+-------+

For larger previews during exploratory analysis:

# More comprehensive view
from pyspark.sql.functions import col

df_large = spark.range(0, 1000).withColumn("value", col("id") * 2)
df_large.show(100)  # Shows first 100 rows

Critical warning: Avoid show(df.count()) on large datasets. This materializes the entire DataFrame, which defeats PySpark’s distributed processing advantages. If you absolutely need to see all rows (which you rarely do), at least be aware you’re pulling potentially gigabytes of data to the driver node:

# DANGEROUS on large datasets
# df.show(df.count())  # Don't do this in production!

# Better approach: sample first
df.limit(1000).show(100)  # Sample 1000, display 100

Managing Column Truncation

The truncate parameter accepts three types of values: True (default, truncates at 20 chars), False (no truncation), or an integer specifying custom truncation length.

This becomes critical when working with columns containing long strings, JSON, or concatenated values:

data_long = [
    (1, "This is a very long description that will definitely exceed twenty characters"),
    (2, "Short text"),
    (3, "Another extremely long piece of text that contains important information we need to see")
]

df_long = spark.createDataFrame(data_long, ["id", "description"])

# Default truncation
print("Default (truncate=True):")
df_long.show()

# No truncation
print("\nNo truncation (truncate=False):")
df_long.show(truncate=False)

# Custom truncation
print("\nCustom truncation (truncate=50):")
df_long.show(truncate=50)

Output comparison:

Default (truncate=True):
+---+--------------------+
| id|         description|
+---+--------------------+
|  1|This is a very lo...|
|  2|          Short text|
|  3|Another extremely...|
+---+--------------------+

No truncation (truncate=False):
+---+---------------------------------------------------------------------------------------+
|id |description                                                                            |
+---+---------------------------------------------------------------------------------------+
|1  |This is a very long description that will definitely exceed twenty characters          |
|2  |Short text                                                                             |
|3  |Another extremely long piece of text that contains important information we need to see|
+---+---------------------------------------------------------------------------------------+

Custom truncation (truncate=50):
+---+--------------------------------------------------+
| id|                                       description|
+---+--------------------------------------------------+
|  1|This is a very long description that will defin...|
|  2|                                        Short text|
|  3|Another extremely long piece of text that conta...|
+---+--------------------------------------------------+

Use truncate=False when debugging data quality issues in text columns, inspecting JSON payloads, or validating string transformations. Use custom integer values when you need more context than 20 characters but don’t want completely unwieldy output.

Vertical Display Mode

When working with wide DataFrames containing many columns, horizontal display becomes unreadable. The vertical=True parameter transposes the output, showing one column per line:

data_wide = [(1, "Alice", 30, "Engineer", "New York", "alice@example.com", "555-0001", "Full-time", 95000, "Engineering", "Active")]

columns = ["id", "name", "age", "role", "city", "email", "phone", "employment_type", "salary", "department", "status"]

df_wide = spark.createDataFrame(data_wide, columns)

# Horizontal (default) - hard to read
print("Horizontal display:")
df_wide.show()

# Vertical - much clearer
print("\nVertical display:")
df_wide.show(vertical=True)

Vertical output:

-RECORD 0---------------------------
 id              | 1                 
 name            | Alice             
 age             | 30                
 role            | Engineer          
 city            | New York          
 email           | alice@example.com 
 phone           | 555-0001          
 employment_type | Full-time         
 salary          | 95000             
 department      | Engineering       
 status          | Active            

This format is particularly valuable when examining individual records in detail or when your DataFrame has more than 6-7 columns. You can combine vertical mode with row limiting to inspect specific records:

# Inspect first 3 records in detail
df_wide.show(3, vertical=True)

Performance Considerations and Best Practices

Every call to show() triggers a Spark job. On large datasets, this means actual computation across your cluster. Understanding this has important implications:

# This triggers TWO separate jobs
df.filter(df.id > 100).show()  # Job 1
df.filter(df.id > 100).count()  # Job 2

# Better: cache if you'll inspect multiple times
df_filtered = df.filter(df.id > 100).cache()
df_filtered.show()
df_filtered.count()  # Uses cached data

For large datasets, always limit before showing:

# BAD: After wide transformations (joins, aggregations), even show(20)
# must run the full upstream computation before 20 rows exist
# large_df.show(20)

# GOOD: Explicitly limit computation
large_df.limit(100).show(20)

# BETTER: Combine with sampling for truly large datasets
large_df.sample(0.01).limit(100).show(20)

The take() method is more efficient when you need the actual data rather than just display:

# For display only
df.show(10)

# When you need the data
rows = df.take(10)  # Returns list of Row objects
for row in rows:
    print(row.name, row.role)

Common Use Cases and Practical Tips

Debugging transformation chains: Insert show() calls at each step to validate logic:

df.filter(df.age > 25) \
    .show(5)  # Checkpoint 1
    
df.filter(df.age > 25) \
    .groupBy("department") \
    .count() \
    .show()  # Checkpoint 2

Validating data quality: Check for nulls, unexpected values, or format issues:

# Show rows with null values
df.filter(df.email.isNull()).show(truncate=False)

# Inspect specific problematic columns
df.select("email", "phone").show(50, truncate=False)

Schema exploration: Combine with select() to focus on relevant columns:

# Instead of showing all 50 columns
# df.show()

# Focus on what matters
df.select("id", "name", "created_at", "status").show(20)

Quick statistics check: Use with aggregations for immediate feedback:

from pyspark.sql import functions as F

df.groupBy("department") \
    .agg(
        F.count("*").alias("count"),
        F.avg("salary").alias("avg_salary")
    ) \
    .show(truncate=False)

The show() method is your primary window into PySpark DataFrames. Master its parameters, understand its performance implications, and use it strategically throughout your development workflow. It's simple on the surface, but knowing when to truncate, when to go vertical, and when to limit your data makes the difference between frustrating debugging sessions and smooth development.
