Spark SQL - Views (Temporary and Permanent)
Key Insights
• Temporary views exist only within the current Spark session and are automatically dropped when the session ends, while global temporary views persist across sessions within the same application and permanent views are stored in the metastore catalog
• Views in Spark SQL are logical constructs that don't store data physically: they are saved query definitions that execute on demand, making them ideal for data abstraction and reusable query logic
• Permanent views require an active Hive metastore connection and create catalog entries that survive application restarts, enabling cross-application data sharing and standardized business logic
Understanding Spark SQL Views
Views in Spark SQL function as virtual tables defined by query expressions. Unlike physical tables, views don’t materialize data to disk. Instead, they store the query definition and execute it whenever referenced. This makes views powerful tools for simplifying complex queries, enforcing data access patterns, and creating logical data layers.
Spark supports three types of views: temporary views (session-scoped), global temporary views (application-scoped), and permanent views (catalog-scoped). Each serves distinct use cases based on persistence requirements and sharing needs.
Creating Temporary Views
Temporary views exist only within the current SparkSession. They’re ideal for intermediate query results and session-specific data transformations.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("ViewsDemo").getOrCreate()
# Create sample DataFrame
data = [
    (1, "Alice", "Engineering", 95000),
    (2, "Bob", "Marketing", 75000),
    (3, "Charlie", "Engineering", 105000),
    (4, "Diana", "Sales", 85000),
    (5, "Eve", "Engineering", 98000)
]
df = spark.createDataFrame(data, ["id", "name", "department", "salary"])
# Create temporary view
df.createOrReplaceTempView("employees")
# Query the view
result = spark.sql("""
    SELECT department, AVG(salary) as avg_salary
    FROM employees
    GROUP BY department
    ORDER BY avg_salary DESC
""")
result.show()
Output:
+------------+----------+
| department|avg_salary|
+------------+----------+
| Engineering| 99333.33|
| Sales| 85000.0|
| Marketing| 75000.0|
+------------+----------+
The createOrReplaceTempView() method overwrites an existing view with the same name. Use createTempView() if you instead want an AnalysisException raised when the view already exists:
from pyspark.sql.utils import AnalysisException

# This raises AnalysisException if the view already exists
try:
    df.createTempView("employees")
except AnalysisException as e:
    print(f"View already exists: {e}")
Working with Global Temporary Views
Global temporary views are visible across all sessions within the same Spark application. They’re stored in the special global_temp database and remain available until the application terminates.
# Create global temporary view
df.createOrReplaceGlobalTempView("global_employees")
# Access from same session
spark.sql("SELECT * FROM global_temp.global_employees WHERE salary > 90000").show()
# Access from a second session (newSession() creates a separate SparkSession that shares the SparkContext and global temp database)
new_session = spark.newSession()
new_session.sql("SELECT COUNT(*) as high_earners FROM global_temp.global_employees WHERE salary > 90000").show()
Output:
+------------+
|high_earners|
+------------+
| 3|
+------------+
Global temporary views enable data sharing between different parts of your application without persisting data to storage. This is particularly useful in multi-threaded applications or when coordinating between different processing stages.
Creating Permanent Views
Permanent views are stored in the Hive metastore and persist beyond application lifecycles. They require a configured metastore and provide catalog-level data abstraction.
# Configure Spark with Hive support
spark = SparkSession.builder \
    .appName("PermanentViews") \
    .enableHiveSupport() \
    .getOrCreate()
# Permanent views cannot reference temporary views, so persist the
# data as a metastore table first
df.write.mode("overwrite").saveAsTable("employees_tbl")

# Create permanent view using SQL
spark.sql("""
    CREATE OR REPLACE VIEW engineering_team AS
    SELECT id, name, salary
    FROM employees_tbl
    WHERE department = 'Engineering'
""")

# A second permanent view over the same table
spark.sql("""
    CREATE OR REPLACE VIEW sales_team AS
    SELECT id, name, salary
    FROM employees_tbl
    WHERE department = 'Sales'
""")
Permanent views appear in the metastore catalog and can be queried from any Spark application connected to the same metastore:
# List all views in current database
spark.sql("SHOW VIEWS").show()
# Describe view
spark.sql("DESCRIBE EXTENDED engineering_team").show(truncate=False)
Advanced View Patterns
Parameterized View Logic
While views don’t support direct parameterization, you can create reusable view patterns using string formatting:
def create_department_view(dept_name):
    view_name = f"{dept_name.lower()}_department"
    # Caution: interpolated values are not escaped, so only use this
    # pattern with trusted input
    spark.sql(f"""
        CREATE OR REPLACE TEMP VIEW {view_name} AS
        SELECT name, salary, salary * 0.15 as tax
        FROM employees
        WHERE department = '{dept_name}'
    """)
    return view_name
# Create department-specific views
eng_view = create_department_view("Engineering")
sales_view = create_department_view("Sales")
spark.sql(f"SELECT * FROM {eng_view}").show()
Nested Views
Views can reference other views, creating abstraction layers:
# Base view
spark.sql("""
    CREATE OR REPLACE TEMP VIEW employee_base AS
    SELECT *, salary * 12 as annual_salary
    FROM employees
""")
# Derived view
spark.sql("""
    CREATE OR REPLACE TEMP VIEW high_earners AS
    SELECT name, department, annual_salary
    FROM employee_base
    WHERE annual_salary > 1000000
""")
spark.sql("SELECT * FROM high_earners").show()
View with Complex Transformations
Views can encapsulate complex business logic:
spark.sql("""
    CREATE OR REPLACE TEMP VIEW employee_analytics AS
    SELECT
        department,
        COUNT(*) as headcount,
        AVG(salary) as avg_salary,
        MIN(salary) as min_salary,
        MAX(salary) as max_salary,
        PERCENTILE_APPROX(salary, 0.5) as median_salary,
        SUM(salary) as total_compensation
    FROM employees
    GROUP BY department
""")
# Use the view in subsequent queries
spark.sql("""
    SELECT department, headcount, avg_salary
    FROM employee_analytics
    WHERE headcount > 1
    ORDER BY avg_salary DESC
""").show()
Managing Views
Listing and Inspecting Views
# Show all views in current database
spark.sql("SHOW VIEWS").show()
# Show views matching pattern
spark.sql("SHOW VIEWS LIKE 'employee*'").show()
# Get view definition
view_def = spark.sql("SHOW CREATE TABLE engineering_team").collect()
print(view_def[0][0])
# Check if view exists
def view_exists(view_name):
    views = spark.sql("SHOW VIEWS").collect()
    return any(row.viewName == view_name for row in views)
print(f"View exists: {view_exists('employees')}")
Dropping Views
# Drop temporary view
spark.catalog.dropTempView("employees")
# Drop global temporary view
spark.catalog.dropGlobalTempView("global_employees")
# Drop permanent view
spark.sql("DROP VIEW IF EXISTS engineering_team")
# Drop multiple views
views_to_drop = ["sales_team", "marketing_team"]
for view in views_to_drop:
    spark.sql(f"DROP VIEW IF EXISTS {view}")
Performance Considerations
Views don’t cache results. Each query against a view re-executes the underlying query definition. For frequently accessed views with expensive computations, consider caching:
# Cache the view's data (cache() is lazy; the first action materializes it)
spark.sql("SELECT * FROM employee_analytics").cache().count()
# Or cache the result under a new name with CACHE TABLE
spark.sql("""
    CACHE TABLE cached_analytics AS
    SELECT * FROM employee_analytics
""")
For permanent materialization, use tables instead of views:
# Create table from view
spark.sql("""
    CREATE TABLE employee_analytics_table AS
    SELECT * FROM employee_analytics
""")
Views are not optimization boundaries: Spark's Catalyst optimizer inlines view definitions at analysis time and can push predicates through them. Deeply nested views can still increase query-planning time, so monitor execution plans with explain() to confirm the optimizer produces the plan you expect.