Let's say you've got a prediction model built in R and you'd like to productionize it, for example by serving it in a web application. One way is to expose the model through an API that returns the predicted result as a web service. However, there are several issues. Firstly, R is not a language for API development, although there are some options - eg the plumber package. More importantly, developing an API is not the end of the story: the API can't be served in a production system unless it is deployed/managed/upgraded/patched/… appropriately on a server, and unless it is scalable, protected via authentication/authorization and so on. Therefore it requires quite a wide range of skill sets that cover both development and DevOps (engineering).
A developer can be relieved of the overwhelming DevOps work if his/her model is deployed in a serverless environment provided by cloud computing companies - Amazon Web Services, Microsoft Azure, Google Cloud Platform and IBM OpenWhisk. They provide FaaS (Function as a Service), which, simply put, allows you to run code on demand without provisioning or managing servers. Furthermore, an application can be developed and managed more efficiently if the workflow is streamlined by events. Let's say the model has to be updated periodically. It requires saving new raw data somewhere, exporting it to a database, manipulating it and saving it back to another place for modelling… This kind of workflow can be managed efficiently with events, where a function is configured to subscribe to a specific event and its code runs accordingly. In this regard, I see huge potential for serverless event-driven architecture in data product development.
This is the first post of the Serverless Data Product POC series, where I plan to introduce a data product in a serverless environment. For the backend, a simple logistic regression model is packaged and tested for AWS Lambda - R is not included in the Lambda runtime, so it is packaged and run via the Python rpy2 package. Then the model is deployed to AWS Lambda and the Lambda function is exposed via Amazon API Gateway. For the frontend, a simple single page application is served from Amazon S3.
[EDIT 2017-04-11] Deploying at AWS Lambda and exposing via API Gateway are split into 2 posts (Part II and III).
[EDIT 2017-04-17] The Lambda function handler (handler.py) has been modified to resolve an issue with Cross-Origin Resource Sharing (CORS). See Part IV for further details.
The data is from the LOGIT REGRESSION - R DATA ANALYSIS EXAMPLES of the UCLA: Statistical Consulting Group. It is hypothetical data about graduate school admission and has 3 features (gre, gpa, rank) and 1 binary response (admit).
A GLM is fit to the data and the fitted object is saved as admission.rds. Logistic regression is chosen because it is included in the stats package, one of the default packages, and I'd like to keep R as small as possible for this POC application. Note that AWS Lambda has a deployment package size limit (50MB compressed), so it is important to keep the deployment package small - see AWS Lambda Limits for further details. Then the saved file is uploaded to an S3 bucket named serverless-poc-models - the Lambda function handler will use this object for prediction as described in the next section.
Note that, if the data is transformed for better performance, a model object alone may not be sufficient, as the transformed records are necessary as well. A way to handle this situation is the caret package, which provides preProcess() and an associated predict() method, so that a separate object can be created to transform records for prediction - see this page for further details.
Lambda function handler
A Lambda function handler is the function that AWS Lambda invokes when the service executes the code. In this example, it downloads the model object from S3, predicts admission status and returns the result - handler.py and test_handler.py can be found in the GitHub repo.
This and the next sections are based on the following posts with necessary modifications.
- Analyzing Genomics Data at Scale using R, AWS Lambda, and Amazon API Gateway
- Run ML predictions with R on AWS Lambda
handler.py begins with importing packages and setting up environment variables. The above posts indicate that the C shared libraries of R must be loaded manually. When I tested the handler with the for-loop that loads those libraries uncommented, however, I encountered the following error - OSError: lib/libRrefblas.so: undefined symbol: xerbla_. Only when the for-loop is commented out does the script run through to the handler. I guess the necessary C shared libraries are loaded via the Lambda environment variables, although I'm not sure why manual loading causes such an error. According to Lambda Execution Environment and Available Libraries, the following environment variables are available.
- LAMBDA_TASK_ROOT - contains the path to your Lambda function code.
- LAMBDA_TASK_ROOT/lib - used to store helper libraries and function code.
As can be seen in the next section, the shared libraries are saved in LAMBDA_TASK_ROOT/lib so that they are loaded appropriately.
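The environment setup at the top of handler.py can be sketched as below. This is a minimal sketch, not the exact code of handler.py: it assumes rpy2 reads R_HOME and R_LIBS to locate the bundled R installation and packages, and that Lambda itself has already put $LAMBDA_TASK_ROOT/lib on LD_LIBRARY_PATH so the bundled C shared libraries load without the manual for-loop.

```python
import os

# Lambda exposes LAMBDA_TASK_ROOT and already includes
# $LAMBDA_TASK_ROOT/lib on LD_LIBRARY_PATH; fall back to the current
# directory when running outside Lambda (e.g. on a test EC2 instance).
task_root = os.environ.get('LAMBDA_TASK_ROOT', os.getcwd())

# rpy2 needs R_HOME to locate the bundled R installation; R_LIBS points
# R at the copied default packages (these values are assumptions).
os.environ['R_HOME'] = task_root
os.environ['R_LIBS'] = os.path.join(task_root, 'library')

# import rpy2.robjects as robjects  # import only after R_HOME is set
```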
Then 3 functions are defined as follows.
- get_file_path - given an S3 object key, it returns a file name or file path. Note that only /tmp has write access, so files should be downloaded to this folder.
- download_file - given bucket and key names, it downloads the S3 object with that key (eg admission.rds). Note that it does nothing if the object file already exists.
- pred_admit - given gre, gpa and rank, it returns True or False depending on the predicted probability.
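The three functions above can be sketched in plain Python. This is an illustration only: the real pred_admit calls R's predict() on the downloaded admission.rds via rpy2, whereas here the logistic decision rule is written out directly with placeholder coefficients (not the fitted values), rank is treated as numeric for simplicity, and the S3 client is passed in so the sketch runs without boto3.

```python
import math
import os

def get_file_path(key):
    # Only /tmp is writable inside a Lambda container, so downloaded
    # objects must live there.
    return os.path.join('/tmp', os.path.basename(key))

def download_file(s3_client, bucket, key):
    # Skip the download when a previous (warm) invocation already
    # fetched the object - /tmp survives between warm invocations.
    path = get_file_path(key)
    if not os.path.exists(path):
        s3_client.download_file(bucket, key, path)
    return path

def pred_admit(gre, gpa, rank, coef=(-3.99, 0.0023, 0.804, -0.675)):
    # Placeholder logistic decision rule: the real handler obtains the
    # probability from R's predict() on the fitted GLM instead.
    eta = coef[0] + coef[1] * gre + coef[2] * gpa + coef[3] * rank
    prob = 1.0 / (1.0 + math.exp(-eta))
    return prob > 0.5
```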
This is the Lambda function handler for this POC application. It returns the prediction result if there is no error; otherwise a 400 HTTP error is returned.
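The overall shape of such a handler might look like the following sketch. The event field names, the stubbed prediction result and the 'BadRequest' error prefix (which an API Gateway error regex could map to HTTP 400) are all assumptions, not the actual handler.py code.

```python
def handler(event, context=None):
    # Pull the features from the event, run the prediction and surface
    # errors so that API Gateway can map them to a 400 response.
    try:
        gre = float(event['gre'])
        gpa = float(event['gpa'])
        rank = int(event['rank'])
        # download_file(...) and pred_admit(...) would be called here;
        # a fixed stub stands in for the rpy2-backed prediction.
        result = True
        return {'result': result}
    except Exception as e:
        # Raising with a 'BadRequest' prefix lets an API Gateway error
        # regex map this to HTTP 400 (assumed convention).
        raise Exception('BadRequest: {}'.format(e))
```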
Optionally, the code of the test handler is shown below.
According to Lambda Execution Environment and Available Libraries, Lambda functions run on AMI name: amzn-ami-hvm-2016.03.3.x86_64-gp2. A t2.medium EC2 instance launched from this AMI is used to create the Lambda deployment package. In order to use R in AWS Lambda, R itself, some of its C shared libraries, the Lambda function handler (handler.py) and the handler's dependent packages should be included in a zip deployment package file.
In this step, R and the necessary libraries are installed, followed by cloning the project repository. The package folder is created as /home/ec2-user/handler. The subfolder $HOME/$HANDLER/library is for copying the necessary R default packages separately - remember that I'd like to keep R as small as possible.
Copy R and shared libraries
Firstly, all files and folders in /usr/lib64/R except for library are copied to the Lambda package folder. By default 29 packages are installed, as can be seen in /usr/lib64/R/library, but not all of them are necessary. Actually, only the 7 packages listed below are loaded at startup and 1 package is additionally required by the rpy2 package. Therefore only these 8 default R packages are copied to the library subfolder of the package folder.
- Loaded at startup - stats, graphics, grDevices, utils, datasets, methods, base
- Required by rpy2 - tools
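The copy step above can be sketched in Python with the standard library (the post itself presumably uses shell commands; the paths and the package list follow the text above).

```python
import os
import shutil

def copy_r_runtime(src='/usr/lib64/R', dst='/home/ec2-user/handler',
                   keep_packages=('stats', 'graphics', 'grDevices', 'utils',
                                  'datasets', 'methods', 'base', 'tools')):
    # Copy everything under the R installation except library/, then
    # copy only the 8 required default packages into dst/library.
    shutil.copytree(src, dst, ignore=shutil.ignore_patterns('library'),
                    dirs_exist_ok=True)
    for pkg in keep_packages:
        pkg_src = os.path.join(src, 'library', pkg)
        if os.path.isdir(pkg_src):
            shutil.copytree(pkg_src, os.path.join(dst, 'library', pkg),
                            dirs_exist_ok=True)
```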
In relation to C shared libraries, the default installation includes 4 libraries, as can be checked in /usr/lib64/R/lib, and 1 library is additionally required by the rpy2 package - this additional library is for regex processing.
- Default C shared libraries - libRblas.so, libRlapack.so, libRrefblas.so, libR.so
- Required by rpy2 - libtre.so.5
Together with the above 5 shared libraries, the following 3 libraries are added: libgomp.so.1, libgfortran.so.3 and libquadmath.so.0. Further investigation is necessary to determine to what extent these are used. Note that the above posts include 2 libraries for linear algebra (libblas.so.3 and liblapack.so.3), but they are not added as equivalent libraries seem to exist - I guess they are necessary if R is built from source with the following options: --with-blas and --with-lapack. A total of 8 C shared libraries are added to the deployment package.
Install rpy2 and copy to Lambda package folder
Python virtualenv is used to install the rpy2 package. The idea is straightforward, but it was actually a bit tricky, as rpy2 and its dependent packages can be found in either the site-packages or dist-packages folder even on a single EC2 instance - the AWS documentation doesn't explain this clearly. pip install rpy2 -t folder-path was tricky as well, because sometimes the rpy2 package was not installed while its dependent packages were. One way to check is executing pip list in the virtualenv; if the rpy2 package is not shown, it is in dist-packages.
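A quick way to check where a package actually resolves from is the standard library's importlib - a small helper for diagnosis, not part of the original post.

```python
import importlib.util

def locate_package(name):
    # Returns the file a package resolves from, or None when it is not
    # importable - handy for telling site-packages from dist-packages.
    spec = importlib.util.find_spec(name)
    return spec.origin if spec else None

# e.g. locate_package('rpy2') would show which folder pip installed it to
```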
Copy handler.py/test_handler.py, compress and copy to S3 bucket
handler.py and test_handler.py are copied to the Lambda package folder and all contents of the folder are compressed. Note that handler.py should exist at the root of the compressed file, so it is necessary to run zip inside the deployment package folder. The size of admission.zip is about 27MB, which is well within the limit. Finally the package file is copied to an S3 bucket called serverless-poc-handlers - note that the AWS CLI should be configured to copy the file to S3.
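The zip step can also be sketched with Python's zipfile module, writing each file with a path relative to the package folder so that handler.py lands at the root of the archive (the function name and folder layout are illustrative).

```python
import os
import zipfile

def build_package(package_dir, zip_path):
    # Walk the deployment folder and add every file with a path
    # relative to the folder itself, so handler.py ends up at the
    # zip root rather than nested under a directory.
    with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED) as zf:
        for root, _, files in os.walk(package_dir):
            for name in files:
                full = os.path.join(root, name)
                zf.write(full, arcname=os.path.relpath(full, package_dir))
```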
For testing, an EC2 instance without R is necessary, so testing is done on a separate t2.micro instance from the same AMI. Configure the AWS CLI and install the boto3 package - the boto3 package is available in the Lambda execution environment, so it doesn't need to be added to the deployment package. LD_LIBRARY_PATH is an environment variable that points to the C shared libraries of the Lambda package. After downloading and decompressing the package, testing can be done by running test_handler.py.
This is example test output where the model object has already been downloaded.
This is all that I've prepared for this post, and I hope you didn't find it boring. The next posts will be much more interesting, as this package will be exposed via an API.