Spark SQL - Views (Temporary and Permanent)
Key Insights
• Temporary views exist only within the current Spark session and are automatically dropped when the session ends, while global temporary views persist across sessions within the same application and permanent views are stored in the metastore catalog
• Views in Spark SQL are logical constructs that don't store data physically: they are saved query definitions that execute on demand, making them ideal for data abstraction and reusable query logic
• Permanent views require an active Hive metastore connection and create catalog entries that survive application restarts, enabling cross-application data sharing and standardized business logic
Understanding Spark SQL Views
Views in Spark SQL function as virtual tables defined by query expressions. Unlike physical tables, views don’t materialize data to disk. Instead, they store the query definition and execute it whenever referenced. This makes views powerful tools for simplifying complex queries, enforcing data access patterns, and creating logical data layers.
Spark supports three types of views: temporary views (session-scoped), global temporary views (application-scoped), and permanent views (catalog-scoped). Each serves distinct use cases based on persistence requirements and sharing needs.
Creating Temporary Views
Temporary views exist only within the current SparkSession. They’re ideal for intermediate query results and session-specific data transformations.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("ViewsDemo").getOrCreate()
# Create sample DataFrame
data = [
    (1, "Alice", "Engineering", 95000),
    (2, "Bob", "Marketing", 75000),
    (3, "Charlie", "Engineering", 105000),
    (4, "Diana", "Sales", 85000),
    (5, "Eve", "Engineering", 98000)
]
df = spark.createDataFrame(data, ["id", "name", "department", "salary"])
# Create temporary view
df.createOrReplaceTempView("employees")
# Query the view
result = spark.sql("""
    SELECT department, AVG(salary) as avg_salary
    FROM employees
    GROUP BY department
    ORDER BY avg_salary DESC
""")
result.show()
Output:
+------------+----------+
| department|avg_salary|
+------------+----------+
| Engineering| 99333.33|
| Sales| 85000.0|
| Marketing| 75000.0|
+------------+----------+
The createOrReplaceTempView() method overwrites an existing view with the same name. Use createTempView() if you instead want an AnalysisException raised when the view already exists:
from pyspark.sql.utils import AnalysisException

# This raises AnalysisException if the view already exists
try:
    df.createTempView("employees")
except AnalysisException as e:
    print(f"View already exists: {e}")
Working with Global Temporary Views
Global temporary views are visible across all sessions within the same Spark application. They’re stored in the special global_temp database and remain available until the application terminates.
# Create global temporary view
df.createOrReplaceGlobalTempView("global_employees")
# Access from same session
spark.sql("SELECT * FROM global_temp.global_employees WHERE salary > 90000").show()
# Access from a second session (newSession() creates a separate SparkSession that shares the SparkContext and global temp database)
new_session = spark.newSession()
new_session.sql("SELECT COUNT(*) as high_earners FROM global_temp.global_employees WHERE salary > 90000").show()
Output:
+------------+
|high_earners|
+------------+
| 3|
+------------+
Global temporary views enable data sharing between different parts of your application without persisting data to storage. This is particularly useful in multi-threaded applications or when coordinating between different processing stages.
Creating Permanent Views
Permanent views are stored in the Hive metastore and persist beyond application lifecycles. They require a configured metastore and provide catalog-level data abstraction.
# Configure Spark with Hive support
spark = SparkSession.builder \
    .appName("PermanentViews") \
    .enableHiveSupport() \
    .getOrCreate()
# Permanent views cannot reference temporary views, so persist the
# data as a metastore table first
df.write.mode("overwrite").saveAsTable("employees_tbl")

# Create permanent view using SQL
spark.sql("""
    CREATE OR REPLACE VIEW engineering_team AS
    SELECT id, name, salary
    FROM employees_tbl
    WHERE department = 'Engineering'
""")

# A second permanent view over the same table
spark.sql("""
    CREATE OR REPLACE VIEW sales_team AS
    SELECT id, name, salary
    FROM employees_tbl
    WHERE department = 'Sales'
""")
Permanent views appear in the metastore catalog and can be queried from any Spark application connected to the same metastore:
# List all views in current database
spark.sql("SHOW VIEWS").show()
# Describe view
spark.sql("DESCRIBE EXTENDED engineering_team").show(truncate=False)
Advanced View Patterns
Parameterized View Logic
While views don’t support direct parameterization, you can create reusable view patterns using string formatting:
def create_department_view(dept_name):
    view_name = f"{dept_name.lower()}_department"
    # Caution: interpolated values are not escaped, so only use this
    # pattern with trusted input
    spark.sql(f"""
        CREATE OR REPLACE TEMP VIEW {view_name} AS
        SELECT name, salary, salary * 0.15 as tax
        FROM employees
        WHERE department = '{dept_name}'
    """)
    return view_name
# Create department-specific views
eng_view = create_department_view("Engineering")
sales_view = create_department_view("Sales")
spark.sql(f"SELECT * FROM {eng_view}").show()
Nested Views
Views can reference other views, creating abstraction layers:
# Base view
spark.sql("""
    CREATE OR REPLACE TEMP VIEW employee_base AS
    SELECT *, salary * 12 as annual_salary
    FROM employees
""")
# Derived view
spark.sql("""
    CREATE OR REPLACE TEMP VIEW high_earners AS
    SELECT name, department, annual_salary
    FROM employee_base
    WHERE annual_salary > 1000000
""")
spark.sql("SELECT * FROM high_earners").show()
View with Complex Transformations
Views can encapsulate complex business logic:
spark.sql("""
    CREATE OR REPLACE TEMP VIEW employee_analytics AS
    SELECT
        department,
        COUNT(*) as headcount,
        AVG(salary) as avg_salary,
        MIN(salary) as min_salary,
        MAX(salary) as max_salary,
        PERCENTILE_APPROX(salary, 0.5) as median_salary,
        SUM(salary) as total_compensation
    FROM employees
    GROUP BY department
""")
# Use the view in subsequent queries
spark.sql("""
    SELECT department, headcount, avg_salary
    FROM employee_analytics
    WHERE headcount > 1
    ORDER BY avg_salary DESC
""").show()
Managing Views
Listing and Inspecting Views
# Show all views in current database
spark.sql("SHOW VIEWS").show()
# Show views matching pattern
spark.sql("SHOW VIEWS LIKE 'employee*'").show()
# Get view definition
view_def = spark.sql("SHOW CREATE TABLE engineering_team").collect()
print(view_def[0][0])
# Check if view exists
def view_exists(view_name):
    views = spark.sql("SHOW VIEWS").collect()
    return any(row.viewName == view_name for row in views)
print(f"View exists: {view_exists('employees')}")
Dropping Views
# Drop temporary view
spark.catalog.dropTempView("employees")
# Drop global temporary view
spark.catalog.dropGlobalTempView("global_employees")
# Drop permanent view
spark.sql("DROP VIEW IF EXISTS engineering_team")
# Drop multiple views
views_to_drop = ["sales_team", "marketing_team"]
for view in views_to_drop:
    spark.sql(f"DROP VIEW IF EXISTS {view}")
Performance Considerations
Views don’t cache results. Each query against a view re-executes the underlying query definition. For frequently accessed views with expensive computations, consider caching:
# Cache the view's data (cache() is lazy; the first action materializes it)
spark.sql("SELECT * FROM employee_analytics").cache().count()
# Or cache the result under a new name with CACHE TABLE
spark.sql("""
    CACHE TABLE cached_analytics AS
    SELECT * FROM employee_analytics
""")
For permanent materialization, use tables instead of views:
# Create table from view
spark.sql("""
    CREATE TABLE employee_analytics_table AS
    SELECT * FROM employee_analytics
""")
Views are not optimization boundaries: Spark's Catalyst optimizer inlines view definitions at analysis time and can push predicates through them. Deeply nested views can still increase query-planning time, so monitor execution plans with explain() to confirm the optimizer produces the plan you expect.