How to Register a Temp View in PySpark
Key Insights
- Use createOrReplaceTempView() as your default method—it safely handles existing views without throwing errors, making your code more robust in iterative development and production pipelines.
- Global temp views persist across SparkSessions but require the global_temp database prefix when querying, which catches many developers off guard.
- Temp views are metadata pointers, not data copies—they don’t duplicate your DataFrame in memory, making them an efficient bridge between DataFrame API and SQL syntax.
Introduction
Temp views in PySpark let you query DataFrames using SQL syntax. Instead of chaining DataFrame transformations, you register a DataFrame as a named view and write familiar SQL against it. This is particularly valuable when you’re working with analysts who prefer SQL, migrating legacy SQL code to Spark, or simply find certain operations more readable in SQL.
The concept is straightforward: you take a DataFrame, give it a name, and then reference that name in SQL queries. PySpark offers two flavors—session-scoped temp views and global temp views—each with distinct lifetimes and access patterns. Understanding when to use each will save you debugging time and make your Spark applications more maintainable.
Prerequisites
You need PySpark installed and a SparkSession initialized. If you’re running locally for development, here’s the minimal setup:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("TempViewDemo") \
    .master("local[*]") \
    .getOrCreate()
The master("local[*]") setting runs Spark locally using all available cores. In production or cluster environments, you’d typically omit this and let your cluster manager handle resource allocation.
Verify your session is working:
print(spark.version)
If you see a version number, you’re ready to proceed.
Creating a Sample DataFrame
Let’s create a DataFrame to work with throughout this article. We’ll use employee data—simple enough to understand but realistic enough to demonstrate meaningful queries.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType
# Define schema explicitly for clarity
schema = StructType([
    StructField("employee_id", IntegerType(), False),
    StructField("name", StringType(), False),
    StructField("department", StringType(), True),
    StructField("salary", DoubleType(), True),
    StructField("years_experience", IntegerType(), True)
])
# Sample data
data = [
    (1, "Alice Chen", "Engineering", 95000.0, 5),
    (2, "Bob Martinez", "Engineering", 87000.0, 3),
    (3, "Carol White", "Marketing", 72000.0, 4),
    (4, "David Kim", "Engineering", 110000.0, 8),
    (5, "Eva Johnson", "Marketing", 68000.0, 2),
    (6, "Frank Brown", "Sales", 75000.0, 6),
    (7, "Grace Lee", "Sales", 82000.0, 4)
]
employees_df = spark.createDataFrame(data, schema)
employees_df.show()
Output:
+-----------+------------+-----------+--------+----------------+
|employee_id| name| department| salary|years_experience|
+-----------+------------+-----------+--------+----------------+
| 1| Alice Chen|Engineering| 95000.0| 5|
| 2|Bob Martinez|Engineering| 87000.0| 3|
| 3| Carol White| Marketing| 72000.0| 4|
| 4| David Kim|Engineering|110000.0| 8|
| 5| Eva Johnson| Marketing| 68000.0| 2|
| 6| Frank Brown| Sales| 75000.0| 6|
| 7| Grace Lee| Sales| 82000.0| 4|
+-----------+------------+-----------+--------+----------------+
Registering a Temp View with createOrReplaceTempView()
The createOrReplaceTempView() method is your workhorse for creating session-scoped temporary views. It registers the DataFrame as a SQL table that exists for the duration of your SparkSession.
# Register the DataFrame as a temp view
employees_df.createOrReplaceTempView("employees")
# Now query it with SQL
result = spark.sql("SELECT * FROM employees WHERE department = 'Engineering'")
result.show()
Output:
+-----------+------------+-----------+--------+----------------+
|employee_id| name| department| salary|years_experience|
+-----------+------------+-----------+--------+----------------+
| 1| Alice Chen|Engineering| 95000.0| 5|
| 2|Bob Martinez|Engineering| 87000.0| 3|
| 4| David Kim|Engineering|110000.0| 8|
+-----------+------------+-----------+--------+----------------+
You can run any valid Spark SQL against the view:
# Aggregation query
dept_stats = spark.sql("""
    SELECT
        department,
        COUNT(*) as employee_count,
        ROUND(AVG(salary), 2) as avg_salary,
        MAX(years_experience) as max_experience
    FROM employees
    GROUP BY department
    ORDER BY avg_salary DESC
""")
dept_stats.show()
Output:
+-----------+--------------+----------+--------------+
| department|employee_count|avg_salary|max_experience|
+-----------+--------------+----------+--------------+
|Engineering| 3| 97333.33| 8|
| Sales| 2| 78500.0| 6|
| Marketing| 2| 70000.0| 4|
+-----------+--------------+----------+--------------+
The view persists until you explicitly drop it or the SparkSession ends. You can also join multiple temp views:
# Create a second DataFrame and view
departments_data = [
    ("Engineering", "Building A", "Jane Smith"),
    ("Marketing", "Building B", "Tom Wilson"),
    ("Sales", "Building B", "Sarah Davis")
]
departments_df = spark.createDataFrame(
    departments_data,
    ["dept_name", "location", "manager"]
)
departments_df.createOrReplaceTempView("departments")
# Join the views
joined = spark.sql("""
    SELECT
        e.name,
        e.department,
        e.salary,
        d.location,
        d.manager
    FROM employees e
    JOIN departments d ON e.department = d.dept_name
    WHERE e.salary > 80000
""")
joined.show()
Global Temp Views with createGlobalTempView()
Global temp views survive across multiple SparkSessions within the same Spark application. They’re stored in a special system database called global_temp, and you must use this prefix when querying them.
# Create a global temp view
employees_df.createGlobalTempView("global_employees")
# Query requires the global_temp prefix
global_result = spark.sql("SELECT * FROM global_temp.global_employees WHERE salary > 90000")
global_result.show()
Output:
+-----------+----------+-----------+--------+----------------+
|employee_id| name| department| salary|years_experience|
+-----------+----------+-----------+--------+----------------+
| 1|Alice Chen|Engineering| 95000.0| 5|
| 4| David Kim|Engineering|110000.0| 8|
+-----------+----------+-----------+--------+----------------+
The critical difference: if you create a new SparkSession (not a new Spark application), the global temp view remains accessible:
# Create a new session within the same application
new_session = spark.newSession()
# Global temp view is still accessible
new_session.sql("SELECT COUNT(*) FROM global_temp.global_employees").show()
# But session-scoped views are NOT accessible from the new session
# This would fail: new_session.sql("SELECT * FROM employees")
Key Differences and When to Use Each
| Aspect | createOrReplaceTempView() | createGlobalTempView() |
|---|---|---|
| Scope | Single SparkSession | All SparkSessions in application |
| Lifetime | Until session ends or view dropped | Until application ends or view dropped |
| Query syntax | SELECT * FROM view_name | SELECT * FROM global_temp.view_name |
| Overwrites existing | Yes (that’s the “OrReplace” part) | No—throws error if exists |
| Primary use case | Standard SQL queries, notebooks, single-session jobs | Shared reference data across sessions |
Use session-scoped views for the vast majority of cases: ETL pipelines, data exploration in notebooks, and any single-session application. They’re simpler to work with and clean up automatically.
Use global temp views when you genuinely need to share data across SparkSessions—for example, reference tables that multiple concurrent sessions need to access without recomputing.
Common Pitfalls and Best Practices
Handle Existing Views Gracefully
The createTempView() method (without “OrReplace”) throws an error if the view already exists:
# First call succeeds
employees_df.createTempView("staff")
# Second call throws AnalysisException
# employees_df.createTempView("staff") # Error!
# Safe approach: use createOrReplaceTempView
employees_df.createOrReplaceTempView("staff") # Always works
For global temp views, use createOrReplaceGlobalTempView() to avoid similar issues:
# Safe global view creation
employees_df.createOrReplaceGlobalTempView("global_staff")
Clean Up When Appropriate
While temp views don’t duplicate data, they do consume catalog metadata. In long-running applications, explicitly drop views you no longer need:
# Drop a session-scoped view
spark.catalog.dropTempView("employees")
# Drop a global temp view
spark.catalog.dropGlobalTempView("global_employees")
# Check if a view exists before dropping (tableExists is available since Spark 3.3)
if spark.catalog.tableExists("employees"):
    spark.catalog.dropTempView("employees")
Use Descriptive Names
Treat view names like variable names—make them descriptive:
# Poor naming
df.createOrReplaceTempView("temp1")
# Better naming
df.createOrReplaceTempView("filtered_active_customers_2024")
Understand That Views Are Lazy
Temp views don’t materialize data. They’re essentially named query plans. Each SQL query against a view re-executes the underlying DataFrame transformations:
# This transformation runs every time you query the view
expensive_df = spark.read.parquet("huge_dataset.parquet") \
    .filter("status = 'active'") \
    .groupBy("category").agg({"amount": "sum"})
expensive_df.createOrReplaceTempView("category_totals")
# Both queries re-execute the full transformation
spark.sql("SELECT * FROM category_totals WHERE category = 'A'").show()
spark.sql("SELECT * FROM category_totals WHERE category = 'B'").show()
If you’re querying the same view repeatedly, consider caching the underlying DataFrame:
expensive_df.cache()
expensive_df.createOrReplaceTempView("category_totals")
Conclusion
Temp views bridge the gap between PySpark’s DataFrame API and SQL. Use createOrReplaceTempView() as your default, reserve global temp views for genuine cross-session needs, and remember that views are metadata, not materialized data. With these fundamentals, you can seamlessly mix DataFrame transformations and SQL queries in your Spark applications.