PySpark - Create Temporary View (createOrReplaceTempView)
Key Insights
- Temporary views let you query DataFrames using SQL syntax without persisting data to disk, making them ideal for complex transformations and team members familiar with SQL
- createOrReplaceTempView() is safer than createTempView() because it won't throw errors when re-registering the same view name, which commonly happens in iterative development and production pipelines
- Temporary views are session-scoped and automatically cleaned up when your SparkSession ends, while global temporary views persist across sessions and require the global_temp database prefix
Introduction to Temporary Views in PySpark
Temporary views bridge the gap between PySpark’s DataFrame API and SQL queries. When you register a DataFrame as a temporary view, you’re creating a named reference that allows you to query that data using standard SQL syntax through spark.sql(). This is particularly valuable when you’re working with data analysts who prefer SQL, migrating from traditional SQL-based ETL systems, or when SQL simply expresses your transformation logic more clearly than chained DataFrame operations.
Unlike permanent tables, temporary views don’t write data to disk. They’re metadata references to DataFrames that exist only within your SparkSession. When your session ends, these views disappear automatically. This session-scoped lifecycle makes them perfect for intermediate transformations in data pipelines without cluttering your metastore.
Basic Syntax and Usage
The createOrReplaceTempView() method is called directly on a DataFrame object. It takes a single argument: the name you want to assign to the view. This name becomes the table identifier you’ll use in SQL queries.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
# Initialize Spark session
spark = SparkSession.builder.appName("TempViewDemo").getOrCreate()
# Create sample data
data = [
("Alice", 34, "Engineering"),
("Bob", 45, "Sales"),
("Catherine", 29, "Engineering"),
("David", 38, "Marketing"),
("Eve", 42, "Sales")
]
schema = StructType([
StructField("name", StringType(), True),
StructField("age", IntegerType(), True),
StructField("department", StringType(), True)
])
df = spark.createDataFrame(data, schema)
# Create temporary view
df.createOrReplaceTempView("employees")
# Query the view using SQL
result = spark.sql("SELECT * FROM employees WHERE age > 35")
result.show()
Once you’ve registered the view, you can reference it in any SQL query within the same SparkSession. The DataFrame itself remains unchanged—you’re just creating an alias that the SQL engine can resolve.
Querying Temporary Views
The real power of temporary views emerges when you need complex SQL operations. You can use the full range of SQL syntax including subqueries, CTEs, window functions, and aggregations.
# Filter and project specific columns
young_engineers = spark.sql("""
SELECT name, age
FROM employees
WHERE department = 'Engineering' AND age < 35
""")
young_engineers.show()
# Aggregation with GROUP BY
dept_stats = spark.sql("""
SELECT
department,
COUNT(*) as employee_count,
AVG(age) as avg_age,
MAX(age) as max_age
FROM employees
GROUP BY department
ORDER BY avg_age DESC
""")
dept_stats.show()
# Complex query with subquery
above_avg_age = spark.sql("""
SELECT name, age, department
FROM employees
WHERE age > (SELECT AVG(age) FROM employees)
""")
above_avg_age.show()
# Create another DataFrame for joining
projects_data = [
("Alice", "Project Alpha"),
("Bob", "Project Beta"),
("Alice", "Project Gamma"),
("Eve", "Project Beta")
]
projects_df = spark.createDataFrame(projects_data, ["name", "project"])
projects_df.createOrReplaceTempView("projects")
# Join multiple views
employee_projects = spark.sql("""
SELECT
e.name,
e.department,
p.project
FROM employees e
INNER JOIN projects p ON e.name = p.name
ORDER BY e.name, p.project
""")
employee_projects.show()
This SQL-based approach is often more readable for complex joins and aggregations compared to equivalent DataFrame API calls, especially for teams with strong SQL backgrounds.
createOrReplaceTempView vs createTempView
PySpark provides two methods for creating temporary views: createTempView() and createOrReplaceTempView(). The difference is in how they handle existing views with the same name.
createTempView() throws an AnalysisException if a view with that name already exists. This strict behavior can be useful in scenarios where you want to ensure you’re not accidentally overwriting a view, but it’s problematic in iterative development and production pipelines that might re-run.
# First call succeeds
df.createTempView("strict_employees")
# Second call with same name throws error
try:
    df.createTempView("strict_employees")
except Exception as e:
    print(f"Error: {e}")
# Prints an AnalysisException message (exact wording varies by Spark version),
# e.g.: Temporary view 'strict_employees' already exists
# createOrReplaceTempView handles this gracefully
df.createOrReplaceTempView("flexible_employees")
df.createOrReplaceTempView("flexible_employees") # No error - view is replaced
print("View successfully replaced")
In practice, createOrReplaceTempView() is almost always the better choice. It provides idempotent behavior, meaning you can run the same code multiple times without errors. This is critical for:
- Notebook environments where you re-run cells during development
- Production jobs that might retry on failure
- Scheduled pipelines that run repeatedly with updated data
Use createTempView() only when you specifically want to enforce that a view doesn’t already exist as a safety check.
Temporary Views vs Global Temporary Views
Temporary views created with createOrReplaceTempView() are session-scoped. They’re only visible within the SparkSession that created them. Global temporary views, created with createOrReplaceGlobalTempView(), are application-scoped and can be accessed across different SparkSessions within the same application.
# Create a session-scoped temporary view
df.createOrReplaceTempView("session_employees")
# Create a global temporary view
df.createOrReplaceGlobalTempView("global_employees")
# Query session-scoped view (no prefix needed)
spark.sql("SELECT * FROM session_employees").show()
# Query global view (requires global_temp prefix)
spark.sql("SELECT * FROM global_temp.global_employees").show()
# Create a new SparkSession
new_spark = spark.newSession()
# Session-scoped view is NOT accessible
try:
    new_spark.sql("SELECT * FROM session_employees").show()
except Exception as e:
    print(f"Cannot access session view: {type(e).__name__}")
# Global view IS accessible
new_spark.sql("SELECT * FROM global_temp.global_employees").show()
print("Global view accessible across sessions")
Global temporary views are useful when you need to share data across multiple concurrent SparkSessions in the same application, but they come with caveats. They persist until the Spark application terminates, so they can consume memory longer than expected. They also require the global_temp database prefix in queries, which can be less intuitive.
For most use cases, stick with regular temporary views unless you have a specific need for cross-session access.
Practical Use Cases
Temporary views shine in multi-step transformations where SQL expresses the logic more clearly than DataFrame operations. Consider a data pipeline that needs to join multiple data sources, apply business rules, and generate aggregated reports.
# Load multiple data sources
customers_data = [
(1, "Acme Corp", "Enterprise"),
(2, "TechStart", "Startup"),
(3, "MegaRetail", "Enterprise")
]
customers_df = spark.createDataFrame(customers_data, ["customer_id", "name", "segment"])
customers_df.createOrReplaceTempView("customers")
orders_data = [
(101, 1, "2024-01-15", 5000),
(102, 2, "2024-01-16", 1200),
(103, 1, "2024-01-17", 3000),
(104, 3, "2024-01-18", 8000),
(105, 2, "2024-01-19", 1500)
]
orders_df = spark.createDataFrame(orders_data, ["order_id", "customer_id", "order_date", "amount"])
orders_df.createOrReplaceTempView("orders")
# Complex analytical query using SQL
customer_analysis = spark.sql("""
WITH customer_metrics AS (
SELECT
c.customer_id,
c.name,
c.segment,
COUNT(o.order_id) as order_count,
SUM(o.amount) as total_revenue,
AVG(o.amount) as avg_order_value
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.name, c.segment
),
segment_benchmarks AS (
SELECT
segment,
AVG(total_revenue) as segment_avg_revenue
FROM customer_metrics
GROUP BY segment
)
SELECT
cm.name,
cm.segment,
cm.order_count,
cm.total_revenue,
cm.avg_order_value,
sb.segment_avg_revenue,
CASE
WHEN cm.total_revenue > sb.segment_avg_revenue THEN 'Above Average'
ELSE 'Below Average'
END as performance
FROM customer_metrics cm
JOIN segment_benchmarks sb ON cm.segment = sb.segment
ORDER BY cm.total_revenue DESC
""")
customer_analysis.show()
This SQL approach with CTEs is often clearer than the equivalent chain of DataFrame joins, groupBys, and window functions.
Best Practices and Cleanup
While temporary views are automatically cleaned up when your SparkSession ends, you should explicitly drop views that you no longer need in long-running applications to free up memory and avoid namespace pollution.
# List all temporary views
print("Current temporary views:")
for table in spark.catalog.listTables():
    if table.tableType == "TEMPORARY":
        print(f" - {table.name}")
# Drop a specific temporary view
spark.catalog.dropTempView("employees")
# Verify it's gone
try:
    spark.sql("SELECT * FROM employees").show()
except Exception as e:
    print(f"View successfully dropped: {type(e).__name__}")
# Drop global temporary view
spark.catalog.dropGlobalTempView("global_employees")
# Check if a view exists before dropping (returns True/False)
if spark.catalog.tableExists("projects"):
    spark.catalog.dropTempView("projects")
    print("Projects view dropped")
Key best practices:
- Use createOrReplaceTempView() by default for idempotent behavior
- Drop views explicitly in long-running applications to manage memory
- Use descriptive view names that indicate their purpose and scope
- Avoid creating hundreds of temporary views—consolidate when possible
- Prefer temporary views over global temporary views unless you need cross-session access
- Document which views are expected to exist at each stage of your pipeline
Temporary views are a powerful tool for making PySpark more accessible to SQL users and simplifying complex transformations. Use them strategically to write clearer, more maintainable data pipelines.