Apache Spark - Log4j Configuration
Key Insights
- Spark 3.3 migrated from Log4j 1.x to Log4j 2.x, requiring a configuration file change from log4j.properties to log4j2.properties with updated syntax
- Driver and executor logging require separate configuration strategies: driver logs follow your local config while executor logs need explicit JVM options passed through Spark configuration
- Verbose logging in production Spark jobs can devastate performance; set root logger to WARN or ERROR and selectively enable DEBUG only for specific packages during troubleshooting
Introduction to Spark Logging
Debugging distributed applications is painful. When your Spark job fails across 200 executors processing terabytes of data, you need logs that actually help you find the problem. Poor logging configuration leads to two equally frustrating outcomes: either you’re drowning in millions of useless INFO messages, or you’re staring at silent failures with zero context.
Spark’s logging infrastructure sits on top of Log4j, giving you fine-grained control over what gets logged, where it goes, and how verbose each component should be. The challenge is that Spark runs code in multiple JVMs simultaneously—your driver process and potentially hundreds of executor processes—each needing proper logging configuration.
Getting this right means faster debugging, cleaner production logs, and jobs that don’t waste I/O bandwidth writing log messages nobody will read.
Log4j Versions in Spark
Spark 3.3 introduced a breaking change that catches many developers off guard: the migration from Log4j 1.x to Log4j 2.x. This isn't just a version bump, it's a complete configuration syntax overhaul.
If you're running Spark 3.2 or earlier, you're using Log4j 1.x with log4j.properties files. Spark 3.3 and later use Log4j 2.x with log4j2.properties files. The property names, appender syntax, and configuration structure all changed.
Check your Spark version and corresponding Log4j dependency:
```xml
<!-- pom.xml for Spark 3.3+ projects -->
<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.12</artifactId>
    <version>3.5.0</version>
  </dependency>
  <!-- Log4j 2.x is pulled in transitively -->
  <dependency>
    <groupId>org.apache.logging.log4j</groupId>
    <artifactId>log4j-api</artifactId>
    <version>2.20.0</version>
  </dependency>
</dependencies>
```
For SBT projects:
```scala
// build.sbt for Spark 3.3+
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.5.0",
  "org.apache.spark" %% "spark-sql" % "3.5.0"
)

// Log4j 2.x comes with Spark 3.3+ - don't mix versions
excludeDependencies ++= Seq(
  ExclusionRule("log4j", "log4j") // Exclude any Log4j 1.x
)
```
When upgrading to Spark 3.3 or later, you must migrate your logging configuration. The old log4j.properties file will be silently ignored, leaving you with default (verbose) logging behavior.
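To make the migration concrete, here is a sketch of how a typical Log4j 1.x console setup maps onto its Log4j 2.x equivalent (the pattern string and package names mirror Spark's defaults, but treat the details as illustrative):

```properties
# Log4j 1.x (Spark 3.2 and earlier) - log4j.properties
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
log4j.logger.org.apache.spark=WARN

# Log4j 2.x (Spark 3.3+) - log4j2.properties
rootLogger.level = warn
rootLogger.appenderRef.stdout.ref = console
appender.console.type = Console
appender.console.name = console
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = %d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
logger.spark.name = org.apache.spark
logger.spark.level = warn
```

Notice the shift from class names (`org.apache.log4j.ConsoleAppender`) to declarative types (`type = Console`), and from dotted logger keys to named logger sections.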
Configuring log4j2.properties
Spark looks for logging configuration in $SPARK_HOME/conf/log4j2.properties. The distribution ships with a template file named log4j2.properties.template—copy and rename it to get started.
Here’s a production-ready configuration template:
```properties
# log4j2.properties - Spark 3.x logging configuration

# Root logger configuration
rootLogger.level = warn
rootLogger.appenderRef.stdout.ref = console
rootLogger.appenderRef.rolling.ref = RollingFile

# Console appender - useful for development
appender.console.type = Console
appender.console.name = console
appender.console.target = SYSTEM_ERR
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = %d{yyyy-MM-dd HH:mm:ss.SSS} %p %c{1}: %m%n

# Rolling file appender - for production
appender.rolling.type = RollingFile
appender.rolling.name = RollingFile
appender.rolling.fileName = ${sys:spark.yarn.app.container.log.dir}/spark.log
appender.rolling.filePattern = ${sys:spark.yarn.app.container.log.dir}/spark-%d{yyyy-MM-dd}-%i.log.gz
appender.rolling.layout.type = PatternLayout
appender.rolling.layout.pattern = %d{yyyy-MM-dd HH:mm:ss.SSS} [%t] %-5p %c{2} - %m%n
appender.rolling.policies.type = Policies
appender.rolling.policies.size.type = SizeBasedTriggeringPolicy
appender.rolling.policies.size.size = 100MB
appender.rolling.strategy.type = DefaultRolloverStrategy
appender.rolling.strategy.max = 10

# Spark-specific loggers - reduce noise from internal components
logger.spark.name = org.apache.spark
logger.spark.level = warn
logger.jetty.name = org.sparkproject.jetty
logger.jetty.level = warn
logger.hadoop.name = org.apache.hadoop
logger.hadoop.level = warn

# Your application logger - set to DEBUG during development
logger.myapp.name = com.yourcompany.spark
logger.myapp.level = info
```
Key configuration decisions in this template:
The root logger is set to WARN, which silences the flood of INFO messages from Spark internals. Individual loggers for Spark, Jetty, and Hadoop components are explicitly set to WARN to catch any that might bypass the root logger. Your application package gets INFO level so you can see your own log statements without the noise.
The rolling file appender uses ${sys:spark.yarn.app.container.log.dir} which YARN sets automatically for each container. In local mode, you’ll need to set this system property or use a hardcoded path.
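In local mode, one option is to supply that system property yourself on the driver JVM; the directory below is an arbitrary local path, not anything Spark mandates:

```bash
# Local mode: point the rolling appender at a local directory
# (spark.yarn.app.container.log.dir is normally set by YARN per container)
spark-submit \
  --master "local[*]" \
  --driver-java-options "-Dspark.yarn.app.container.log.dir=/tmp/spark-logs" \
  your-application.jar
```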
Setting Log Levels Programmatically
Sometimes you need to adjust logging at runtime without modifying configuration files. Spark provides a simple API for this:
```scala
// Scala - Setting log level on SparkContext
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("MyApplication")
  .getOrCreate()

// Set log level for the entire application
spark.sparkContext.setLogLevel("WARN")

// Valid levels: ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, WARN
```
```python
# PySpark - Setting log level
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("MyApplication") \
    .getOrCreate()

# Reduce verbosity
spark.sparkContext.setLogLevel("WARN")

# For debugging a specific section, temporarily increase verbosity
spark.sparkContext.setLogLevel("DEBUG")
# ... problematic code ...
spark.sparkContext.setLogLevel("WARN")
```
This approach affects the driver’s Log4j configuration. It propagates to executors for Spark’s internal logging, but your custom application loggers on executors won’t be affected. For those, you need executor-specific configuration.
Driver vs Executor Logging
Here’s where Spark logging gets tricky: driver and executor processes are separate JVMs with independent logging configurations.
The driver uses whatever log4j2.properties file is in your classpath or $SPARK_HOME/conf/. Executors, however, are spawned by the cluster manager and need explicit configuration passed through Spark properties.
```bash
#!/bin/bash
# spark-submit with separate driver and executor logging
spark-submit \
  --class com.yourcompany.MainApp \
  --master yarn \
  --deploy-mode cluster \
  --driver-java-options "-Dlog4j2.configurationFile=driver-log4j2.properties" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j2.configurationFile=executor-log4j2.properties" \
  --files /path/to/driver-log4j2.properties,/path/to/executor-log4j2.properties \
  your-application.jar
```
The --files flag distributes your executor logging config to all worker nodes. The path in spark.executor.extraJavaOptions should reference where the file lands on executors (typically the working directory, so just the filename works).
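As a starting point, an executor-side config can be as small as a console appender writing to stderr, which YARN captures into the container's stderr file (a minimal sketch; the application package is a placeholder):

```properties
# executor-log4j2.properties - minimal executor configuration
rootLogger.level = warn
rootLogger.appenderRef.stdout.ref = console

appender.console.type = Console
appender.console.name = console
appender.console.target = SYSTEM_ERR
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = %d{yyyy-MM-dd HH:mm:ss} %p %c{1}: %m%n

# Keep your own package chatty enough to be useful
logger.myapp.name = com.yourcompany.spark
logger.myapp.level = info
```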
For simpler setups where driver and executor use the same config:
```bash
spark-submit \
  --class com.yourcompany.MainApp \
  --master yarn \
  --deploy-mode client \
  --conf "spark.driver.extraJavaOptions=-Dlog4j2.configurationFile=log4j2.properties" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j2.configurationFile=log4j2.properties" \
  --files log4j2.properties \
  your-application.jar
```
Aggregating Logs in Cluster Mode
In cluster deployments, logs scatter across dozens of machines. You need aggregation to make them useful.
For YARN deployments, configure log aggregation in spark-defaults.conf:
```properties
# spark-defaults.conf - YARN log aggregation

# History server and event logs
spark.yarn.historyServer.address=hadoop-history:18080
spark.eventLog.enabled=true
spark.eventLog.dir=hdfs:///spark-logs

# Clean up staging files after application completion
spark.yarn.preserve.staging.files=false

# Rolling executor logs to prevent disk exhaustion
spark.executor.logs.rolling.maxRetainedFiles=5
spark.executor.logs.rolling.strategy=size
spark.executor.logs.rolling.maxSize=128m

# For S3 log aggregation (EMR or custom setup),
# use these instead of the HDFS paths above:
# spark.eventLog.dir=s3a://your-bucket/spark-logs/
# spark.history.fs.logDirectory=s3a://your-bucket/spark-logs/
```
For Kubernetes deployments, configure sidecar containers or use a logging operator:
```bash
# Kubernetes logging via spark-submit
spark-submit \
  --master k8s://https://kubernetes-api:6443 \
  --deploy-mode cluster \
  --conf spark.kubernetes.container.image=your-spark-image:latest \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.spark-logs.mount.path=/var/log/spark \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.spark-logs.options.claimName=spark-logs-pvc \
  your-application.jar
```
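Regardless of volume setup, driver and executor stdout/stderr remain reachable through the Kubernetes API; Spark labels its pods with `spark-role`, and the pod names below are placeholders:

```bash
# Find the driver pod, then stream its logs
kubectl get pods -l spark-role=driver
kubectl logs -f <driver-pod-name>

# Executor pods carry their own label
kubectl get pods -l spark-role=executor
kubectl logs <executor-pod-name>
```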
Troubleshooting Common Issues
“No appenders could be found for logger” (Log4j 1.x) or “ERROR StatusLogger ... no Log4j 2 configuration file found” (Log4j 2.x): Either message means Log4j can’t find your configuration file. Verify the file is named correctly (log4j2.properties for Spark 3.3+), exists in $SPARK_HOME/conf/, and is readable. For executors, ensure the file is distributed via --files and referenced correctly in spark.executor.extraJavaOptions.
Duplicate log messages: Usually caused by multiple appenders inheriting from the root logger. Set additivity = false on child loggers:
```properties
logger.myapp.name = com.yourcompany.spark
logger.myapp.level = info
logger.myapp.additivity = false
logger.myapp.appenderRef.stdout.ref = console
```
Performance degradation from logging: Excessive logging, especially at DEBUG level, can slow jobs significantly. Each log statement requires string formatting and I/O. In tight loops processing millions of records, even INFO-level logging adds measurable overhead. Profile your jobs with logging disabled versus enabled if you suspect logging is the bottleneck.
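The cost is often in building the message, not writing it. Log4j's parameterized messages (`log.debug("value: {}", x)`) skip formatting entirely when the level is disabled, and Python's standard logging on the PySpark side behaves the same way with %-style arguments. A small illustration of the principle, pure Python with no Spark required (the class and logger name are made up for the demo):

```python
import logging

class ExpensiveSummary:
    """Stands in for a costly computation; records whether it was ever formatted."""
    def __init__(self):
        self.formatted = False

    def __str__(self):
        self.formatted = True  # only runs if some handler actually renders the message
        return "summary of 1,000,000 records"

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("com.yourcompany.spark")

summary = ExpensiveSummary()
log.debug("batch done: %s", summary)    # DEBUG is disabled: %s is never rendered
assert summary.formatted is False

log.warning("batch done: %s", summary)  # WARNING is enabled: now it renders
assert summary.formatted is True
```

An f-string (`log.debug(f"batch done: {summary}")`) would pay the formatting cost unconditionally, which is exactly the overhead to avoid in hot paths.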
Logs not appearing in YARN aggregation: Check that yarn.log-aggregation-enable is true in your YARN configuration, and that the aggregation destination (HDFS path) is writable. Logs aggregate only after the application completes—running applications store logs locally on node managers.
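Once the application finishes and aggregation has run, the combined logs are retrievable from any node with the YARN CLI (the application and container IDs below are placeholders):

```bash
# Fetch aggregated logs for a finished application
yarn logs -applicationId application_1700000000000_0001

# Narrow to a single container
yarn logs -applicationId application_1700000000000_0001 \
  -containerId container_1700000000000_0001_01_000001
```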
Proper logging configuration isn’t glamorous, but it’s the difference between spending five minutes finding a bug and spending five hours. Get it right once, template it, and move on to more interesting problems.