
How to Run a PySpark Application on AWS Lambda

A proof of concept to see if we can run Spark on AWS Lambda.

In collaboration with Harshith Acharya.


With container support, we can run any runtime (within resource limitation) on AWS Lambda. In this blog, we will see how we can run a PySpark application on AWS Lambda.

The project structure for the experiment is as below:

├─spark-on-lambda
    ├───spark-class
    ├───spark-defaults.conf
    ├───Dockerfile
    └───spark_lambda_demo.py

spark-class

This shell script is the Spark command-line launcher responsible for setting up the JVM environment and executing a Spark application. By default, spark-class loads $SPARK_HOME/bin/load-spark-env.sh, collects the Spark assembly jars, and executes org.apache.spark.launcher.Main. In our case, we have slightly modified it to load a spark-env.sh file directly.

#!/usr/bin/env bash

SPARK_ENV_SH=${SPARK_HOME}/spark-env.sh
if [[ -f "${SPARK_ENV_SH}" ]]; then
    set -a
    . ${SPARK_ENV_SH}
    set +a
fi

exec /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.302.b08-0.amzn2.0.1.x86_64/jre/bin/java -cp /var/lang/lib/python3.8/site-packages/pyspark/conf/:/var/lang/lib/python3.8/site-packages/pyspark/jars/* -Xmx1g "$@"

spark-defaults.conf

This is the default properties file for the Spark application.

spark.driver.bindAddress 127.0.0.1
spark.hadoop.fs.s3.impl org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.endpoint s3-us-east-1.amazonaws.com
spark.hadoop.fs.s3a.aws.credentials.provider org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider

In this file, we can define other properties as well, such as spark.serializer and spark.driver.memory. In our case, we have limited it to the S3-specific properties.
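If you would rather keep these settings next to the application code, the same properties can also be passed programmatically when the session is built. Here is a minimal sketch; the Kryo serializer line is illustrative and not part of the original setup, and spark.driver.memory is best left to spark-defaults.conf or the -Xmx flag in spark-class, since it must be known before the JVM starts.

from pyspark.sql import SparkSession

# Programmatic equivalent of the spark-defaults.conf entries above.
# The serializer setting is illustrative only.
spark = SparkSession.builder \
    .appName("spark-on-lambda-demo") \
    .config("spark.driver.bindAddress", "127.0.0.1") \
    .config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .config("spark.hadoop.fs.s3a.endpoint", "s3-us-east-1.amazonaws.com") \
    .config("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .getOrCreate()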

Dockerfile

We build the image on top of the AWS Lambda Python base image.

FROM public.ecr.aws/lambda/python:3.8

RUN yum -y install java-1.8.0-openjdk wget curl
RUN pip install pyspark

ENV JAVA_HOME="/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.302.b08-0.amzn2.0.1.x86_64/jre"
ENV PATH=${PATH}:${JAVA_HOME}/bin
ENV SPARK_HOME="/var/lang/lib/python3.8/site-packages/pyspark"
ENV PATH=$PATH:$SPARK_HOME/bin
ENV PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.9-src.zip:$PYTHONPATH
ENV PATH=$SPARK_HOME/python:$PATH

RUN mkdir $SPARK_HOME/conf
RUN echo "SPARK_LOCAL_IP=127.0.0.1" > $SPARK_HOME/conf/spark-env.sh
RUN chmod 777 $SPARK_HOME/conf/spark-env.sh

ARG HADOOP_VERSION=3.2.2
ARG AWS_SDK_VERSION=1.11.887
ARG HADOOP_JAR=https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/${HADOOP_VERSION}/hadoop-aws-${HADOOP_VERSION}.jar
ARG AWS_SDK_JAR=https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/${AWS_SDK_VERSION}/aws-java-sdk-bundle-${AWS_SDK_VERSION}.jar

ADD $HADOOP_JAR  ${SPARK_HOME}/jars/
ADD $AWS_SDK_JAR ${SPARK_HOME}/jars/

COPY spark-class $SPARK_HOME/bin/spark-class
RUN chmod 777 $SPARK_HOME/bin/spark-class
COPY spark-defaults.conf $SPARK_HOME/conf/spark-defaults.conf
COPY spark_lambda_demo.py ${LAMBDA_TASK_ROOT}

CMD [ "spark_lambda_demo.lambda_handler" ]

Through this Dockerfile, we install PySpark, the hadoop-aws JAR, and the AWS SDK bundle JAR; we need these because we are talking to the S3 filesystem. We also set some common environment variables used by Spark. Finally, we point the container command at the Lambda handler function.
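Before wiring the function up, it can help to verify this environment inside the built image, for example by starting a container with the entrypoint overridden to the image's Python interpreter. The following check script is hypothetical and not part of the project:

# check_env.py - hypothetical sanity check, not part of the original project.
# Run it with the image's Python interpreter to confirm the Spark/Java wiring.
import os

# The Dockerfile exports these; all three should resolve inside the image.
for var in ("JAVA_HOME", "SPARK_HOME", "PYTHONPATH"):
    print(var, "=", os.environ.get(var))

# hadoop-aws and aws-java-sdk-bundle were ADDed to $SPARK_HOME/jars,
# so both should show up in this listing.
jars = os.listdir(os.path.join(os.environ["SPARK_HOME"], "jars"))
print(sorted(j for j in jars if "hadoop-aws" in j or "aws-java-sdk" in j))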

spark_lambda_demo.py

The sample app reads an ORC file from S3 with Spark, prints its schema, and returns the schema as JSON in the response body.

import os
import json
from pyspark.sql import SparkSession


def lambda_handler(event, context):
    spark = SparkSession.builder \
        .appName("spark-on-lambda-demo") \
        .config("spark.hadoop.fs.s3a.access.key", os.environ['AWS_ACCESS_KEY_ID']) \
        .config("spark.hadoop.fs.s3a.secret.key", os.environ['AWS_SECRET_ACCESS_KEY']) \
        .config("spark.hadoop.fs.s3a.session.token", os.environ['AWS_SESSION_TOKEN']) \
        .getOrCreate()

    q = spark.read.orc("s3://<S3_BUCKET>/<PREFIX>/<FILE_NAME>")
    q.printSchema()                # prints the schema to the CloudWatch logs
    schema_json = q.schema.json()  # capture the schema before stopping the session
    spark.stop()
    return {
        'statusCode': 200,
        'body': json.dumps(schema_json)
    }
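The handler can also be smoke-tested outside Lambda, assuming PySpark is installed locally and temporary credentials (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN) are exported, since the handler reads all three. This small driver is hypothetical and not part of the project:

# local_test.py - hypothetical local driver, not part of the original project.
# Assumes pyspark is installed and temporary AWS credentials are exported,
# because lambda_handler reads them from the environment.
from spark_lambda_demo import lambda_handler

if __name__ == "__main__":
    response = lambda_handler(event={}, context=None)
    print(response["statusCode"])
    print(response["body"])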

Docker build and push

aws ecr create-repository --repository-name <REPO_NAME> --region us-east-1
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin <REPO_URI>
docker build -t <IMAGE_NAME>:<IMAGE_TAG> .
docker image tag <IMAGE_NAME>:<IMAGE_TAG> <REPO_URI>:<IMAGE_TAG>
docker image push <REPO_URI>:<IMAGE_TAG>

Lambda Function

Once the image is pushed, we create a Lambda function from it (a boto3 sketch follows the list below). The Lambda execution role needs the following access policies:

AmazonS3ReadOnlyAccess (S3 read access)
AWSLambdaENIManagementAccess (needed when the function runs in a VPC)
AWSLambdaBasicExecutionRole (CloudWatch Logs)
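As referenced above, here is a minimal sketch of creating the function from the pushed image with boto3. The function name, role ARN, memory, and timeout are placeholders or illustrative values; the console or the aws lambda create-function CLI work just as well.

# create_function.py - hypothetical deployment sketch, not part of the original project.
import boto3

lambda_client = boto3.client("lambda", region_name="us-east-1")

response = lambda_client.create_function(
    FunctionName="spark-on-lambda-demo",                  # placeholder name
    Role="arn:aws:iam::<ACCOUNT_ID>:role/<LAMBDA_ROLE>",  # role carrying the policies above
    PackageType="Image",
    Code={"ImageUri": "<REPO_URI>:<IMAGE_TAG>"},
    MemorySize=3008,   # illustrative; leave headroom for the 1 GB JVM heap plus Python
    Timeout=300,       # illustrative, in seconds
)
print(response["FunctionArn"])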

And there we have it. Thanks for reading.

Happy building on AWS!



