Spark SQL - Managed vs External Tables
Key Insights
- Managed tables store both metadata and data in Spark’s warehouse directory, while external tables only track metadata and reference data in an external location—this fundamental difference affects data lifecycle, portability, and disaster recovery strategies.
- DROP TABLE behavior is the critical distinction: dropping a managed table deletes all data permanently, whereas dropping an external table only removes metadata, preserving the underlying data files.
- External tables provide better separation of concerns for production environments, enabling multiple processing engines to access the same data and supporting incremental data lake architectures without coupling storage to compute.
Understanding Table Types in Spark SQL
Spark SQL supports two table types that differ in how they manage data lifecycle and storage. Managed tables (also called internal tables) give Spark full control over both metadata and data files. External tables separate these concerns—Spark manages only the metadata while data files remain in a location you specify.
When you create a table without specifying a LOCATION clause, Spark creates a managed table and stores data in its default warehouse directory (configured via spark.sql.warehouse.dir). With a LOCATION clause, you create an external table pointing to existing or new data at that path.
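This rule can be sketched in a few lines of plain Python. This is a simplified illustration only, not Spark's actual resolution logic; `resolve_table_path` is a hypothetical helper and ignores database qualifiers and URI normalization:

```python
from typing import Optional

# Simplified model of the managed/external rule (not Spark's implementation):
# no LOCATION at CREATE time -> managed, data under spark.sql.warehouse.dir;
# LOCATION given -> external, data stays at that path.
def resolve_table_path(table_name: str,
                       warehouse_dir: str = "spark-warehouse",
                       location: Optional[str] = None) -> tuple:
    if location is None:
        return ("MANAGED", f"{warehouse_dir}/{table_name}")
    return ("EXTERNAL", location)

print(resolve_table_path("managed_sales"))
# ('MANAGED', 'spark-warehouse/managed_sales')
print(resolve_table_path("external_sales",
                         location="s3://my-bucket/sales-data/"))
# ('EXTERNAL', 's3://my-bucket/sales-data/')
```

The same distinction drives everything that follows: who owns the path determines who owns the data.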
// Managed table - Spark controls everything
spark.sql("""
  CREATE TABLE managed_sales (
    transaction_id STRING,
    amount DECIMAL(10,2),
    transaction_date DATE
  )
  USING parquet
""")

// External table - you control the data location
spark.sql("""
  CREATE TABLE external_sales (
    transaction_id STRING,
    amount DECIMAL(10,2),
    transaction_date DATE
  )
  USING parquet
  LOCATION 's3://my-bucket/sales-data/'
""")
Data Lifecycle and DROP Behavior
The most critical operational difference emerges when dropping tables. This behavior has significant implications for data governance and recovery procedures.
// Setup: Create both table types with sample data
import spark.implicits._ // required for .toDF on a local Seq

val salesData = Seq(
  ("TXN001", 150.50, "2024-01-15"),
  ("TXN002", 275.00, "2024-01-16")
).toDF("transaction_id", "amount", "transaction_date")

// Managed table
salesData.write.saveAsTable("managed_sales")

// External table
val externalPath = "/tmp/external_sales"
salesData.write.parquet(externalPath)
spark.sql(s"""
  CREATE TABLE external_sales (
    transaction_id STRING,
    amount DOUBLE,
    transaction_date STRING
  )
  USING parquet
  LOCATION '$externalPath'
""")
// Drop managed table - data is DELETED
spark.sql("DROP TABLE managed_sales")
// The data files in spark-warehouse/managed_sales are gone
// Drop external table - only metadata removed
spark.sql("DROP TABLE external_sales")
// Files in /tmp/external_sales still exist
// You can recreate the table pointing to the same location
This behavior makes external tables safer for production environments where accidental table drops could cause catastrophic data loss. With external tables, you maintain an additional safety layer—even if metadata is corrupted or accidentally dropped, your raw data remains intact.
Creating Tables from Existing Data
External tables excel when working with data produced by other systems or when building data lakes where multiple tools access the same datasets.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("TableTypes").getOrCreate()

# Scenario: Data already exists from an ETL pipeline
existing_data_path = "s3://data-lake/processed/customer_events/"

# Create external table over existing data
spark.sql(f"""
    CREATE EXTERNAL TABLE customer_events (
        customer_id BIGINT,
        event_type STRING,
        event_timestamp TIMESTAMP,
        metadata MAP<STRING, STRING>
    )
    PARTITIONED BY (event_date DATE)
    STORED AS PARQUET
    LOCATION '{existing_data_path}'
""")

# Discover partitions if data already exists
spark.sql("MSCK REPAIR TABLE customer_events")

# Now query immediately without data movement
result = spark.sql("""
    SELECT event_type, COUNT(*) AS event_count
    FROM customer_events
    WHERE event_date >= '2024-01-01'
    GROUP BY event_type
""")
For managed tables, Spark expects to control the data lifecycle from creation. Converting existing data to a managed table requires data movement:
# Reading existing data into a managed table requires copying
spark.read.parquet(existing_data_path) \
.write.mode("overwrite") \
.saveAsTable("managed_customer_events")
# Data is now copied to spark.sql.warehouse.dir
Partitioning Strategies and Table Types
Both table types support partitioning, but the implications differ for data management and performance tuning.
// Managed partitioned table
spark.sql("""
  CREATE TABLE managed_logs (
    log_level STRING,
    message STRING,
    application STRING
  )
  USING parquet
  PARTITIONED BY (log_date DATE, log_hour INT)
""")

// Insert creates partition directories in warehouse
spark.sql("""
  INSERT INTO managed_logs
  PARTITION (log_date='2024-01-15', log_hour=10)
  VALUES ('ERROR', 'Connection timeout', 'api-service')
""")

// External partitioned table with explicit location
spark.sql("""
  CREATE EXTERNAL TABLE external_logs (
    log_level STRING,
    message STRING,
    application STRING
  )
  PARTITIONED BY (log_date DATE, log_hour INT)
  STORED AS PARQUET
  LOCATION 'hdfs://cluster/logs/structured/'
""")

// Add partition pointing to existing data
spark.sql("""
  ALTER TABLE external_logs
  ADD PARTITION (log_date='2024-01-15', log_hour=10)
  LOCATION 'hdfs://cluster/logs/structured/log_date=2024-01-15/log_hour=10'
""")
External tables provide flexibility to reorganize or reprocess partitions without affecting table definitions. You can drop and recreate partitions, run backfill jobs, or point to data processed by other frameworks.
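The partition-to-directory mapping used in the ALTER TABLE example above follows Hive's key=value layout. A small helper (hypothetical, not a Spark API) makes the convention explicit:

```python
# Hypothetical helper (not part of Spark): build the Hive-style
# key=value directory path that a partition spec maps to.
def partition_path(base: str, spec: dict) -> str:
    parts = "/".join(f"{k}={v}" for k, v in spec.items())
    return f"{base.rstrip('/')}/{parts}"

print(partition_path("hdfs://cluster/logs/structured/",
                     {"log_date": "2024-01-15", "log_hour": 10}))
# hdfs://cluster/logs/structured/log_date=2024-01-15/log_hour=10
```

Backfill jobs that write data with other frameworks only need to produce directories matching this layout before registering them with ADD PARTITION.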
Migration and Conversion Patterns
Converting between table types requires understanding the data movement implications and metadata management.
# Convert managed to external (preserves data at a new location)
# Capture the managed table's current warehouse path for reference
managed_location = spark.sql("DESCRIBE FORMATTED managed_sales") \
    .filter("col_name = 'Location'") \
    .collect()[0]["data_type"]

new_external_location = "s3://archive/sales/"

# Copy data to the new location
spark.table("managed_sales").write \
    .mode("overwrite") \
    .parquet(new_external_location)

# Drop managed table (removes the old warehouse data)
spark.sql("DROP TABLE managed_sales")

# Create an external table at the new location
# (with the USING syntax, LOCATION alone makes the table external)
spark.sql(f"""
    CREATE TABLE managed_sales (
        transaction_id STRING,
        amount DECIMAL(10,2),
        transaction_date DATE
    )
    USING parquet
    LOCATION '{new_external_location}'
""")
# Convert external to managed (moves data to warehouse)
external_df = spark.table("external_sales")
external_df.write.mode("overwrite").saveAsTable("external_sales_managed")
# Original external table and data remain unchanged
Production Recommendations
For production data platforms, external tables provide better operational characteristics:
# Production pattern: External tables with versioned locations
base_path = "s3://data-lake/production"
version = "v2"

# Create external table with the version in the path
# (with the USING syntax, LOCATION alone marks the table external)
spark.sql(f"""
    CREATE TABLE product_catalog (
        product_id BIGINT,
        name STRING,
        category STRING,
        price DECIMAL(10,2)
    )
    USING delta
    LOCATION '{base_path}/product_catalog/{version}/'
""")
# Enable multiple consumers (Spark, Presto, Athena)
# Data remains accessible even if Spark cluster is terminated
# Implement disaster recovery
backup_location = "s3://data-lake-backup/product_catalog/"
spark.table("product_catalog").write \
    .mode("overwrite") \
    .parquet(backup_location)

# Recreate the table from backup if needed
spark.sql(f"""
    CREATE TABLE product_catalog_restored (
        product_id BIGINT,
        name STRING,
        category STRING,
        price DECIMAL(10,2)
    )
    USING parquet
    LOCATION '{backup_location}'
""")
Use managed tables for temporary tables, staging areas, or when Spark is the sole consumer and you want simplified cleanup. Use external tables for production datasets, shared data lakes, or when data lifecycle extends beyond individual Spark applications.
The choice between managed and external tables fundamentally impacts your data architecture’s resilience, flexibility, and operational complexity. External tables align with modern data lake patterns where storage and compute are decoupled, while managed tables offer simplicity for Spark-centric workflows.