PySpark - Create Global Temporary View
Key Insights
- Global temporary views in PySpark persist across SparkSessions within the same application, unlike local temporary views which are session-scoped, making them essential for sharing data between different execution contexts.
- All global temporary views must be accessed through the global_temp database prefix (e.g., global_temp.my_view), which is a reserved system database that exists for the lifetime of the Spark application.
- Use createOrReplaceGlobalTempView() instead of createGlobalTempView() in production code to avoid errors when views already exist, and always validate view existence before operations to prevent runtime failures.
Introduction to Temporary Views in PySpark
Temporary views in PySpark provide a SQL-like interface to query DataFrames without persisting data to disk. They’re essentially named references to DataFrames that you can query using Spark SQL syntax. PySpark offers two types: local temporary views (session-scoped) and global temporary views (application-scoped).
Local temporary views exist only within the SparkSession that created them. When you close that session, the view disappears. Global temporary views, however, survive across multiple SparkSessions within the same Spark application. This distinction matters when you’re working with complex pipelines that spawn multiple sessions or when you need to share data between different execution contexts without writing to persistent storage.
Use global temporary views when you need to share intermediate results across different parts of your application, coordinate between parallel processing streams, or maintain reference data accessible to multiple sessions. They’re particularly valuable in notebook environments where different cells might create separate sessions, or in applications that dynamically create sessions for isolation.
Understanding Global Temporary Views
Global temporary views live in a special system-preserved database called global_temp. This database is automatically created by Spark and exists for the entire lifetime of your Spark application. Unlike user-created databases, you cannot drop or modify global_temp itself—it’s managed entirely by Spark.
The lifecycle of a global temporary view is tied to your Spark application, not individual sessions. Once created, the view remains accessible until explicitly dropped or the application terminates. This makes them perfect for sharing data structures that multiple sessions need to reference, but it also means you need to be deliberate about cleanup to avoid memory bloat.
Here’s the fundamental difference between local and global temporary views:
from pyspark.sql import SparkSession
# Create a SparkSession
spark = SparkSession.builder.appName("TempViewDemo").getOrCreate()
# Sample data
data = [("Alice", 34), ("Bob", 45), ("Charlie", 29)]
df = spark.createDataFrame(data, ["name", "age"])
# Create a local temporary view
df.createTempView("local_people")
# Create a global temporary view
df.createGlobalTempView("global_people")
# Query local view (no prefix needed)
spark.sql("SELECT * FROM local_people").show()
# Query global view (requires global_temp prefix)
spark.sql("SELECT * FROM global_temp.global_people").show()
The key difference: local views are accessed directly by name, while global views require the global_temp. prefix. Forget this prefix, and you’ll get a “Table or view not found” error.
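Because the prefix is easy to forget, some codebases centralize it in a small helper. Here is a minimal sketch; the `gview` function is a hypothetical convenience, not part of the PySpark API:

```python
def gview(name: str) -> str:
    """Return the fully qualified name of a global temporary view.

    Hypothetical helper -- not part of the PySpark API.
    """
    if name.startswith("global_temp."):
        return name  # already qualified
    return f"global_temp.{name}"

# Example usage (sketch):
# spark.sql(f"SELECT * FROM {gview('global_people')}").show()
```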
Creating a Global Temporary View
Creating global temporary views is straightforward. You call createGlobalTempView() on any DataFrame. Here’s a practical example loading data from a CSV file:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType
spark = SparkSession.builder.appName("GlobalViewExample").getOrCreate()
# Define schema for better performance
schema = StructType([
    StructField("product_id", StringType(), True),
    StructField("product_name", StringType(), True),
    StructField("category", StringType(), True),
    StructField("price", DoubleType(), True),
    StructField("stock", IntegerType(), True)
])
# Load data from CSV
products_df = spark.read.csv(
    "products.csv",
    header=True,
    schema=schema
)
# Create global temporary view
products_df.createGlobalTempView("products_catalog")
print("Global temporary view 'products_catalog' created successfully")
You can also create global temporary views from SQL query results, which is useful for sharing filtered or aggregated data:
# Execute a SQL query and create a global view from results
expensive_products = spark.sql("""
    SELECT
        category,
        product_name,
        price,
        stock
    FROM global_temp.products_catalog
    WHERE price > 100
    ORDER BY price DESC
""")
# Create another global view from the filtered results
expensive_products.createGlobalTempView("premium_products")
# Verify the view was created
spark.sql("SELECT COUNT(*) as premium_count FROM global_temp.premium_products").show()
Accessing Global Temporary Views
Accessing global temporary views requires the global_temp database prefix. You can query them using either Spark SQL or the DataFrame API:
# Method 1: Using spark.sql()
result_sql = spark.sql("""
    SELECT
        category,
        AVG(price) as avg_price,
        SUM(stock) as total_stock
    FROM global_temp.products_catalog
    GROUP BY category
    ORDER BY avg_price DESC
""")
result_sql.show()
# Method 2: Using spark.table()
result_table = spark.table("global_temp.products_catalog")
# Apply DataFrame transformations
filtered_result = result_table.filter(result_table.stock > 0) \
    .select("product_name", "price", "stock") \
    .orderBy("price")
filtered_result.show(10)
Both methods are functionally equivalent, but spark.sql() is more convenient for complex SQL queries, while spark.table() is better when you want to continue with DataFrame API transformations.
Here’s how global views work across multiple sessions:
# Create a new SparkSession (within the same application)
# Note: SparkSession.builder.getOrCreate() would just return the existing
# session, so use newSession() to get a genuinely separate session
spark2 = spark.newSession()
# Access the global view from the new session
# This works because global views are application-scoped
cross_session_data = spark2.sql("SELECT * FROM global_temp.products_catalog")
cross_session_data.show(5)
# Local views from the original session would NOT be accessible here
# This would fail: spark2.sql("SELECT * FROM local_people")
Managing Global Temporary Views
Managing the lifecycle of global temporary views is crucial for maintaining application health. Always use createOrReplaceGlobalTempView() in production code to handle cases where views might already exist:
# Safe approach - replaces view if it exists
products_df.createOrReplaceGlobalTempView("products_catalog")
# This won't throw an error even if run multiple times
updated_data = spark.sql("""
    SELECT * FROM global_temp.products_catalog
    WHERE stock > 0
""")
updated_data.createOrReplaceGlobalTempView("available_products")
Check if a view exists before performing operations:
# List all global temporary views
global_views = [table.name for table in spark.catalog.listTables("global_temp")]
print(f"Global temporary views: {global_views}")
# Check if specific view exists
if "products_catalog" in global_views:
    print("View exists - proceeding with query")
    spark.sql("SELECT * FROM global_temp.products_catalog").show(5)
else:
    print("View not found - creating it now")
    products_df.createGlobalTempView("products_catalog")
Drop global temporary views when they’re no longer needed to free memory:
# Drop a specific global temporary view
spark.catalog.dropGlobalTempView("products_catalog")
# Verify it's been dropped
remaining_views = [table.name for table in spark.catalog.listTables("global_temp")]
print(f"Remaining views: {remaining_views}")
# Attempting to drop a non-existent view returns False (no error)
result = spark.catalog.dropGlobalTempView("non_existent_view")
print(f"Drop result: {result}") # False
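If a view is only needed for a bounded piece of work, one way to guarantee cleanup is a small context manager that creates the view on entry and drops it on exit, even when the block raises. This is a sketch of the pattern; `global_temp_view` is a hypothetical helper, not a PySpark feature:

```python
from contextlib import contextmanager

@contextmanager
def global_temp_view(spark, df, name):
    """Create a global temp view for the duration of a block, then
    drop it on exit. Hypothetical helper, not part of the PySpark API.
    """
    df.createOrReplaceGlobalTempView(name)
    try:
        # Hand back the qualified name so callers never forget the prefix
        yield f"global_temp.{name}"
    finally:
        spark.catalog.dropGlobalTempView(name)

# Usage (sketch):
# with global_temp_view(spark, products_df, "scratch_products") as ref:
#     spark.sql(f"SELECT COUNT(*) FROM {ref}").show()
```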
Practical Use Cases and Best Practices
Global temporary views excel in several scenarios. Use them for reference data that multiple processing stages need to access:
# Load reference data once, share across pipeline stages
reference_data = spark.read.parquet("reference/customer_segments.parquet")
reference_data.createOrReplaceGlobalTempView("customer_segments")
# Any session in the application can now join against this reference
# (assumes a "transactions" table or view is already registered)
enriched_transactions = spark.sql("""
    SELECT t.*, s.segment_name
    FROM transactions t
    JOIN global_temp.customer_segments s ON t.customer_id = s.customer_id
""")
segment_counts = spark.sql("""
    SELECT s.segment_name, COUNT(*) as customer_count
    FROM global_temp.customer_segments s
    GROUP BY s.segment_name
""")
For large DataFrames that you’ll query multiple times, consider caching the underlying DataFrame before creating the view:
# Cache the DataFrame for better performance
large_df = spark.read.parquet("large_dataset.parquet")
large_df.cache()
# Create global view from cached DataFrame
large_df.createOrReplaceGlobalTempView("cached_large_dataset")
# Subsequent queries will benefit from caching
spark.sql("SELECT * FROM global_temp.cached_large_dataset WHERE date = '2024-01-01'").show()
Avoid global temporary views when you need persistence beyond application lifetime—use tables instead. Also avoid them for very large datasets that don’t need cross-session access; local temporary views or direct DataFrame references are more efficient.
Common Pitfalls and Troubleshooting
The most common error is forgetting the global_temp prefix:
# WRONG - This will fail
try:
    spark.sql("SELECT * FROM products_catalog").show()
except Exception as e:
    print(f"Error: {e}")  # Table or view not found: products_catalog
# CORRECT - Always use global_temp prefix
spark.sql("SELECT * FROM global_temp.products_catalog").show()
Handle view creation errors gracefully with proper validation:
def safe_create_global_view(df, view_name):
    """
    Safely create a global temporary view with error handling
    """
    try:
        # Check if view already exists
        existing_views = [t.name for t in spark.catalog.listTables("global_temp")]
        if view_name in existing_views:
            print(f"View '{view_name}' already exists - replacing")
            df.createOrReplaceGlobalTempView(view_name)
        else:
            df.createGlobalTempView(view_name)
        print(f"View '{view_name}' created successfully")
        return True
    except Exception as e:
        print(f"Failed to create view '{view_name}': {e}")
        return False

# Usage
success = safe_create_global_view(products_df, "products_catalog")
if success:
    spark.sql("SELECT COUNT(*) FROM global_temp.products_catalog").show()
Remember that global temporary views are tied to the Spark application, not the cluster. If your application restarts, all global temporary views are lost. Plan your data pipeline accordingly, and don’t rely on global views for true persistence.
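One defensive pattern is to treat every global view as rebuildable: look it up in the catalog and recreate it from source when it is missing, for example after a restart. A minimal sketch, where `get_or_build_view` and its `build_df` loader callback are hypothetical names, not PySpark APIs:

```python
def get_or_build_view(spark, name, build_df):
    """Return the qualified name of a global temp view, recreating it
    from the build_df() loader if it does not exist (e.g. after an
    application restart). Hypothetical pattern, not a PySpark API.
    """
    existing = [t.name for t in spark.catalog.listTables("global_temp")]
    if name not in existing:
        # View is gone (or never existed) -- rebuild it from source
        build_df().createOrReplaceGlobalTempView(name)
    return f"global_temp.{name}"

# Usage (sketch):
# ref = get_or_build_view(spark, "products_catalog",
#                         lambda: spark.read.parquet("products.parquet"))
# spark.sql(f"SELECT COUNT(*) FROM {ref}").show()
```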