Apache Spark - Install on AWS EMR
Key Insights
- AWS EMR eliminates Spark infrastructure management while providing seamless S3 integration, auto-scaling, and pay-per-minute pricing that makes it the most practical option for production Spark workloads
- Always use Spot instances for task nodes (up to 90% savings), configure cluster termination protection, and set up S3 logging before your first production deployment
- The AWS CLI approach to cluster creation should be your default—it enables version control, reproducibility, and integration with CI/CD pipelines that console clicking never will
Why EMR for Spark
Apache Spark is the de facto standard for large-scale data processing, but running it yourself is painful. You need to manage HDFS, coordinate node failures, handle software updates, and tune JVM settings across dozens of machines. AWS EMR abstracts all of this.
EMR gives you a managed Spark cluster in minutes. It handles Hadoop/YARN resource management, integrates natively with S3 (so you can skip HDFS entirely), and provides automatic cluster termination when jobs complete. You pay only for the EC2 instances while they run.
The real advantage is operational: EMR clusters are ephemeral by design. Spin up a cluster, run your job, terminate. No idle infrastructure burning money overnight. This transient cluster pattern is how modern data engineering teams operate.
Prerequisites and AWS Setup
Before creating your first cluster, you need proper AWS foundations in place.
Account requirements:
- An AWS account with billing enabled
- A VPC with at least one subnet (public subnet simplifies initial setup)
- An S3 bucket for logs and input/output data
- An EC2 key pair for SSH access to cluster nodes
IAM permissions are where most people get stuck. EMR requires two roles: a service role for EMR itself and an instance profile for the EC2 nodes. Create a policy that grants the minimum permissions needed:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "EMRFullAccess",
      "Effect": "Allow",
      "Action": [
        "elasticmapreduce:*",
        "ec2:AuthorizeSecurityGroupEgress",
        "ec2:AuthorizeSecurityGroupIngress",
        "ec2:CreateSecurityGroup",
        "ec2:DescribeInstances",
        "ec2:DescribeSecurityGroups",
        "ec2:DescribeSubnets",
        "ec2:DescribeVpcs",
        "ec2:RunInstances",
        "ec2:TerminateInstances",
        "iam:GetRole",
        "iam:PassRole",
        "s3:GetObject",
        "s3:ListBucket",
        "s3:PutObject"
      ],
      "Resource": "*"
    }
  ]
}
For production, scope these permissions down to specific resources. The iam:PassRole permission is critical—it allows your user to assign the EMR service role to the cluster.
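If you'd rather not hand-craft the roles, the AWS CLI can generate the standard pair for you. This creates EMR_DefaultRole and EMR_EC2_DefaultRole, the names used in the cluster-creation examples later in this guide:

```shell
# Create the default EMR service role and EC2 instance profile
# (EMR_DefaultRole and EMR_EC2_DefaultRole) if they don't exist yet
aws emr create-default-roles

# Confirm both roles are in place
aws iam get-role --role-name EMR_DefaultRole --query 'Role.RoleName'
aws iam get-role --role-name EMR_EC2_DefaultRole --query 'Role.RoleName'
```

The generated roles are permissive; for production, replace them with scoped-down versions of the policy above.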
Create your S3 bucket structure:
aws s3 mb s3://your-company-emr-bucket
aws s3api put-object --bucket your-company-emr-bucket --key logs/
aws s3api put-object --bucket your-company-emr-bucket --key input/
aws s3api put-object --bucket your-company-emr-bucket --key output/
aws s3api put-object --bucket your-company-emr-bucket --key scripts/
Creating an EMR Cluster via AWS Console
The console is useful for learning and one-off experiments. Navigate to EMR in the AWS Console and click “Create cluster.”
Key configuration choices:
- Release version: Choose the latest EMR 6.x release (e.g., emr-6.15.0). EMR 6.x runs on Amazon Linux 2 and includes Spark 3.x.
- Applications: Select “Spark” from the application bundle. You can add Hive, Presto, or JupyterHub if needed.
- Instance configuration:
  - Master node: m5.xlarge minimum for production (4 vCPUs, 16 GB RAM)
  - Core nodes: Start with 2x m5.2xlarge for typical workloads
  - Task nodes: Optional, use Spot instances here
- Cluster scaling: Enable managed scaling with min/max boundaries. EMR will add nodes when YARN resources are constrained.
- Networking: Select your VPC and subnet. For initial testing, use a public subnet and assign a public IP.
- Security: Select your EC2 key pair. Use the default EMR-managed security groups initially.
- Logging: Enable logging to your S3 bucket. This captures Spark logs, YARN logs, and step output.
Click “Create cluster” and wait approximately 8-12 minutes for provisioning.
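Rather than watching the console spinner, you can poll readiness from a terminal. The cluster ID appears on the cluster summary page; j-XXXXXXXXXXXXX below is a placeholder:

```shell
# Block until the cluster reaches a running state
# (the waiter fails if the cluster terminates with an error first)
aws emr wait cluster-running --cluster-id j-XXXXXXXXXXXXX

# Then confirm its state
aws emr describe-cluster --cluster-id j-XXXXXXXXXXXXX \
  --query 'Cluster.Status.State' --output text
```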
Creating an EMR Cluster via AWS CLI
The CLI approach is what you should use for anything beyond experimentation. It’s reproducible, scriptable, and can be version-controlled.
aws emr create-cluster \
  --name "spark-production-cluster" \
  --release-label emr-6.15.0 \
  --applications Name=Spark \
  --instance-groups '[
    {
      "InstanceGroupType": "MASTER",
      "InstanceType": "m5.xlarge",
      "InstanceCount": 1,
      "Name": "Master"
    },
    {
      "InstanceGroupType": "CORE",
      "InstanceType": "m5.2xlarge",
      "InstanceCount": 2,
      "Name": "Core"
    },
    {
      "InstanceGroupType": "TASK",
      "InstanceType": "m5.2xlarge",
      "InstanceCount": 4,
      "Market": "SPOT",
      "BidPrice": "0.10",
      "Name": "Task-Spot"
    }
  ]' \
  --ec2-attributes '{
    "KeyName": "your-key-pair",
    "SubnetId": "subnet-abc123",
    "InstanceProfile": "EMR_EC2_DefaultRole",
    "EmrManagedMasterSecurityGroup": "sg-master123",
    "EmrManagedSlaveSecurityGroup": "sg-slave456"
  }' \
  --service-role EMR_DefaultRole \
  --log-uri s3://your-company-emr-bucket/logs/ \
  --enable-debugging \
  --auto-terminate \
  --steps '[
    {
      "Name": "Run Spark Job",
      "ActionOnFailure": "TERMINATE_CLUSTER",
      "Type": "Spark",
      "Args": ["--deploy-mode", "cluster", "s3://your-company-emr-bucket/scripts/my-job.py"]
    }
  ]'

Note that the EC2 instance profile is passed as InstanceProfile inside --ec2-attributes; there is no separate top-level flag for it.
Key flags explained:
- --auto-terminate: Cluster shuts down after all steps complete. Essential for cost control.
- --enable-debugging: Enables step debugging in the EMR console.
- "Market": "SPOT": Task nodes use Spot pricing. Never pay on-demand for task nodes.
- "ActionOnFailure": "TERMINATE_CLUSTER": Don’t leave failed clusters running.
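In a CI/CD pipeline you will want the new cluster's ID back for later wait and describe calls. create-cluster returns it as JSON, and --query extracts it. A minimal sketch (the shorthand --instance-type/--instance-count and --use-default-roles flags stand in for the full flag set above):

```shell
# Capture the new cluster's ID for follow-up commands
CLUSTER_ID=$(aws emr create-cluster \
  --name "spark-production-cluster" \
  --release-label emr-6.15.0 \
  --applications Name=Spark \
  --instance-type m5.xlarge --instance-count 3 \
  --use-default-roles \
  --query 'ClusterId' --output text)
echo "Started cluster: ${CLUSTER_ID}"
```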
For bootstrap actions (installing custom packages, configuring environment):
--bootstrap-actions '[
  {
    "Name": "Install Python packages",
    "Path": "s3://your-company-emr-bucket/scripts/bootstrap.sh"
  }
]'
Your bootstrap.sh might look like:
#!/bin/bash
set -e  # fail the bootstrap (and cluster startup) if any install step breaks
sudo pip3 install pandas pyarrow boto3
sudo yum install -y htop
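Bootstrap actions run on every node, master and workers alike. A common pattern, not specific to this setup but widely used on EMR, is to gate master-only work on the per-node metadata file EMR writes at startup:

```shell
#!/bin/bash
# Run a section only on the master node. EMR writes per-node metadata to
# /mnt/var/lib/info/instance.json, including an isMaster flag.
if grep -q '"isMaster":\s*true' /mnt/var/lib/info/instance.json; then
  # master-only setup, e.g. notebook tooling (package is illustrative)
  sudo pip3 install jupyterlab
fi
```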
Configuring Spark Settings
Default Spark settings are conservative. For production workloads, you need to tune memory allocation and parallelism.
Pass configurations during cluster creation using the --configurations flag:
[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.executor.memory": "8g",
      "spark.executor.cores": "4",
      "spark.driver.memory": "4g",
      "spark.dynamicAllocation.enabled": "true",
      "spark.dynamicAllocation.minExecutors": "2",
      "spark.dynamicAllocation.maxExecutors": "20",
      "spark.sql.shuffle.partitions": "200",
      "spark.serializer": "org.apache.spark.serializer.KryoSerializer"
    }
  },
  {
    "Classification": "spark-env",
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "PYSPARK_PYTHON": "/usr/bin/python3"
        }
      }
    ]
  }
]
Save this as spark-config.json and reference it:
aws emr create-cluster \
  ... \
  --configurations file://spark-config.json
Critical settings explained:
- spark.dynamicAllocation.enabled: Let Spark scale executors based on workload. Always enable this on EMR.
- spark.sql.shuffle.partitions: Default is 200. Increase for large datasets, decrease for small ones.
- spark.serializer: Kryo is faster than Java serialization. Use it.
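To sanity-check values like spark.executor.memory, it helps to do the executor-packing arithmetic. A rough sketch, assuming EMR hands YARN about 24 GiB per m5.2xlarge core node and Spark's default overhead rule of max(384 MiB, 10% of executor memory); both figures are assumptions, so check your cluster's actual yarn-site values:

```python
# Back-of-envelope executor packing for the spark-defaults above.
# Assumed: ~24 GiB of YARN-allocatable memory per m5.2xlarge core node.
yarn_mem_mb = 24 * 1024

# spark.executor.memory = 8g, plus Spark's default off-heap overhead:
# max(384 MiB, 10% of executor memory)
executor_mem_mb = 8 * 1024
overhead_mb = max(384, int(executor_mem_mb * 0.10))
per_executor_mb = executor_mem_mb + overhead_mb

executors_per_node = yarn_mem_mb // per_executor_mb
print(executors_per_node)  # 2: two 8g executors (plus overhead) fit per node
```

With 2 core and up to 4 task nodes, that bounds how far dynamicAllocation.maxExecutors can realistically stretch.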
Submitting Your First Spark Job
Upload a simple PySpark script to S3:
# wordcount.py
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split, col

spark = SparkSession.builder.appName("WordCount").getOrCreate()

# Read text file from S3
text_df = spark.read.text("s3://your-company-emr-bucket/input/sample.txt")

# Word count using the DataFrame API
words = text_df.select(
    explode(split(col("value"), "\\s+")).alias("word")
)
word_counts = words.groupBy("word").count().orderBy(col("count").desc())

# Write results back to S3
word_counts.write.mode("overwrite").parquet(
    "s3://your-company-emr-bucket/output/wordcount/"
)

spark.stop()
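The script and some input need to be in S3 before the cluster can see them. For example (bucket name as used throughout; the sample text is illustrative):

```shell
# Create a tiny input file and push script plus input to the bucket
printf 'the quick brown fox\nthe quick lazy dog\n' > sample.txt
aws s3 cp sample.txt s3://your-company-emr-bucket/input/sample.txt
aws s3 cp wordcount.py s3://your-company-emr-bucket/scripts/wordcount.py
```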
Submit via spark-submit (if SSH’d into the master node):
spark-submit \
  --deploy-mode cluster \
  --master yarn \
  --num-executors 4 \
  --executor-memory 4g \
  --executor-cores 2 \
  s3://your-company-emr-bucket/scripts/wordcount.py
Or add it as a step to a running cluster:
aws emr add-steps \
  --cluster-id j-XXXXXXXXXXXXX \
  --steps '[{
    "Name": "Word Count Job",
    "Type": "Spark",
    "ActionOnFailure": "CONTINUE",
    "Args": [
      "--deploy-mode", "cluster",
      "--executor-memory", "4g",
      "s3://your-company-emr-bucket/scripts/wordcount.py"
    ]
  }]'
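add-steps prints the new step's ID. In scripts, block on it and read back the final state so a failure surfaces immediately (both IDs below are placeholders; the step ID comes from the add-steps output):

```shell
# Wait for the step to finish, then report its final state
aws emr wait step-complete \
  --cluster-id j-XXXXXXXXXXXXX --step-id s-XXXXXXXXXXXXX

aws emr describe-step \
  --cluster-id j-XXXXXXXXXXXXX --step-id s-XXXXXXXXXXXXX \
  --query 'Step.Status.State' --output text
```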
Monitoring, Costs, and Cleanup
Monitoring options:
Access the Spark UI by setting up an SSH tunnel to the master node:
ssh -i your-key.pem -N -L 18080:localhost:18080 hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com
Then open http://localhost:18080 for the Spark History Server.
Check cluster status programmatically:
aws emr describe-cluster --cluster-id j-XXXXXXXXXXXXX \
  --query 'Cluster.Status.State'
aws emr list-steps --cluster-id j-XXXXXXXXXXXXX \
  --query 'Steps[*].[Name,Status.State]' \
  --output table
Cost optimization:
- Use Spot instances for task nodes (60-90% savings)
- Enable auto-termination for batch jobs
- Right-size instances based on actual utilization
- Use S3 instead of HDFS to enable cluster termination between jobs
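To see what those choices are worth, here is a back-of-envelope hourly cost for the example cluster (one m5.xlarge master, 2 core and 4 task m5.2xlarge nodes). The EC2 prices, EMR surcharges, and the 70% Spot discount are assumptions for illustration; check current pricing for your region:

```python
# Assumed us-east-1-style hourly prices (illustrative, not current quotes)
ec2 = {"m5.xlarge": 0.192, "m5.2xlarge": 0.384}   # on-demand EC2 $/hr
emr = {"m5.xlarge": 0.048, "m5.2xlarge": 0.096}   # EMR surcharge $/hr
SPOT_DISCOUNT = 0.70  # assume Spot runs ~70% below on-demand

def node_cost(itype, spot=False):
    # the EMR surcharge applies either way; only the EC2 price is discounted
    ec2_price = ec2[itype] * (1 - SPOT_DISCOUNT) if spot else ec2[itype]
    return ec2_price + emr[itype]

with_spot = (node_cost("m5.xlarge")                  # 1 master, on-demand
             + 2 * node_cost("m5.2xlarge")           # 2 core, on-demand
             + 4 * node_cost("m5.2xlarge", spot=True))  # 4 task, Spot
all_on_demand = node_cost("m5.xlarge") + 6 * node_cost("m5.2xlarge")
print(f"Spot tasks: ${with_spot:.2f}/hr vs all on-demand: ${all_on_demand:.2f}/hr")
```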
Cleanup:
Terminate clusters when done:
aws emr terminate-clusters --cluster-ids j-XXXXXXXXXXXXX
Verify termination:
aws emr list-clusters --active
Set up billing alerts in AWS Budgets. EMR costs can escalate quickly with large clusters: a 20-node cluster of m5.2xlarge instances runs roughly $8/hour in EC2 charges alone, before the EMR surcharge.
EMR makes Spark accessible, but it doesn’t make it free. Design for transient clusters, automate everything with the CLI, and always check that your clusters terminated successfully.