Zachary Jablons

The state of SageMaker: It's a work in progress.

I have been working on evaluating AWS SageMaker with TensorFlow and integrating it into our Python-heavy ML stack to provide the model training step. The idea is to avoid buying GPU-heavy servers to train and deploy models, and to take advantage of co-located S3 buckets for training and data storage. What follows is a review based on our experiences over the past quarter.

I want to preface this by saying that SageMaker was released only five months ago. It is early in its lifecycle, and we expect it to keep improving quickly. We are still planning to use SageMaker and to build our ML infrastructure around it despite everything below -- a lot of the issues are minor or easily worked around.

Some of the following are pitfalls that might not be obvious from the documentation, and some are legitimate bugs, many of which AWS has under consideration. Either way, if you're considering SageMaker, take note of what we've run into:

Performance

It takes ~5-6 minutes to start a training job or an endpoint. The only ways to get predictions out of a model trained on SageMaker are to either download and recreate it locally or to start up an endpoint (which is a TensorFlow Serving instance, afaict). I try not to keep endpoints running since they cost money even when you're not making predictions against them, and even if they're cheap, managing them is a bit more overhead than I'd like.
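When I do need an endpoint, I tear it down as soon as I'm done with it. A minimal sketch using the low-level API (the endpoint name here is a placeholder for whatever deploy() named yours):

    import boto3

    sm = boto3.client('sagemaker')

    # Endpoints bill while they're running, even with zero traffic, so delete
    # them when you're done. 'my-tf-endpoint' stands in for your endpoint's name.
    sm.delete_endpoint(EndpointName='my-tf-endpoint')

    # The endpoint config sticks around after this; it doesn't cost anything,
    # but you can clean it up too (the SDK seems to name it after the endpoint).
    sm.delete_endpoint_config(EndpointConfigName='my-tf-endpoint')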

They're not the best about keeping TensorFlow up to date -- currently you can use 1.6. You can always specify newer versions in your requirements.txt, but that adds to the startup time.

Inconveniences

When you create a TensorFlow training job, the idea is that you give SageMaker the code needed to define a tf.Estimator (i.e. a model_fn, train_input_fn, etc.), and it goes and does the training with the args you provide. If you have dependencies, you can either specify them in requirements.txt (if possible) or include them in the source_dir, which is tar'd up and sent over. One note about the train_input_fn: SageMaker seems to expect train_input_fn to have the signature () -> (dict of features, targets), i.e. it should return the features and labels directly when called, whereas in plain tf.Estimator code you'd normally build and pass around an input function the way tf.estimator.inputs.numpy_input_fn does. This is a little annoying if you want a way to test these functions without hitting SageMaker.
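Here is a minimal sketch of the difference as I understand it (the feature and label arrays are made up for illustration):

    import numpy as np
    import tensorflow as tf

    x = np.random.rand(100, 4).astype(np.float32)
    y = np.random.randint(0, 2, size=100)

    # Plain tf.Estimator usage: numpy_input_fn builds an input function, and you
    # hand that callable to estimator.train(input_fn=...); TF calls it for you.
    plain_input_fn = tf.estimator.inputs.numpy_input_fn(
        x={'features': x}, y=y, batch_size=32, num_epochs=None, shuffle=True)

    # What the SageMaker TF container seems to want in your entry-point script:
    # a train_input_fn that returns (dict of feature tensors, labels) directly
    # when called. (In the container it actually gets training_dir and
    # hyperparameters arguments, which I'm ignoring here.)
    def train_input_fn(training_dir=None, hyperparameters=None):
        return tf.estimator.inputs.numpy_input_fn(
            x={'features': x}, y=y, batch_size=32, num_epochs=None, shuffle=True)()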

Their TensorBoard wrapper (i.e. setting run_tensorboard_locally=True in the fit call) is hilariously bad. I'd recommend instead just running TensorBoard yourself and pointing it at the S3 bucket where you store your checkpoints (it can read S3 by default -- just give it the URI as the logdir). TensorBoard is a bit slow at this, though, and you don't really know when it's done loading new summaries (that's not Amazon's fault).

Their 'local' mode works great once you have the Docker setup in place, but it's not obvious how to get there -- it turns out the setup is tucked away in one of their examples. Once that's taken care of, you can just specify instance_type='local'.
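A minimal sketch of what that looks like with their TensorFlow estimator, assuming the 1.x Python SDK (the entry point, role ARN, and file:// path are placeholders):

    from sagemaker.tensorflow import TensorFlow

    # Local mode: the training job runs in the SageMaker TF container on your
    # own machine (via docker-compose) instead of on a managed ML instance.
    estimator = TensorFlow(entry_point='model.py',   # your model_fn / input_fn code
                           role='arn:aws:iam::123456789012:role/SageMakerRole',
                           training_steps=1000,
                           evaluation_steps=100,
                           train_instance_count=1,
                           train_instance_type='local')   # 'local_gpu' if you have nvidia-docker

    # file:// inputs skip S3 entirely in local mode.
    estimator.fit('file:///path/to/training/data')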

And, for the frustrated practitioner like me who just wants to know what the hell the TF container is doing: if you have the local-mode container, you can pull that code out of it at /usr/local/lib/python2.7/dist-packages/tf_container. You're welcome.

It does, however, leave uncleanable root-owned files lying around after training, which is less than ideal, as one might imagine. It also stores everything under /tmp unless you specify 'local': {'container_root': '/not/tmp'} in ~/.sagemaker/config.yaml.

One of my favorite things about AWS is its detailed permissions system. It's also one of the most annoying things, and SageMaker doesn't make it any easier.

Their Python SDK is weirdly incomplete in places. For example, you can't delete a model through it; you have to grab your boto3 session and call the API directly.
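Something like this works (the model name below is a placeholder for whatever auto-generated name the SDK gave yours):

    import sagemaker

    # The SDK has no delete_model call (as of this writing), so drop down to
    # the low-level SageMaker API through the underlying boto3 session.
    sm = sagemaker.Session().boto_session.client('sagemaker')
    sm.delete_model(ModelName='sagemaker-tensorflow-2018-04-02-12-00-00-000')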

Bugs

Don’t update tensorflow-serving-api to version 1.7. It breaks the sagemaker-python-sdk.

You know that thing about requirements.txt? Nope. Sorry. It's been in the documentation for like a month now, but it doesn't seem to do anything ¯\_(ツ)_/¯.

Limitations

Ultimately there's no way (afaict) to 'extend' the TensorFlow container and still use it with the TensorFlow estimator the SDK provides, since that will always call upon the default TensorFlow image. You can of course write your own SageMaker containers, but that's a lot more work...

Apparently, there's a tiny size limit on the hyperparameters you send to SageMaker to start a training job (256 characters). Seems a little insane to me, but here we are; we're working with AWS on getting this fixed. The workaround, according to AWS Support, is to store the hyperparameters in a channel, much like how you'd store model initialization data. This is awkward, but it works.
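A rough sketch of how I read that workaround (the bucket, key, and channel name are all made up, and estimator is a TensorFlow estimator like the one in the local-mode sketch above): dump the oversized hyperparameters into S3 as a file, hand it to fit() as an extra channel, and read it back inside the container.

    import json
    import boto3

    # The hyperparameters that won't fit within the 256-character limit...
    params = {'layer_sizes': [512, 256, 128], 'dropout': 0.4, 'learning_rate': 1e-3}

    # ...go into S3 as a plain file instead.
    boto3.client('s3').put_object(Bucket='my-training-bucket',
                                  Key='config/hyperparams.json',
                                  Body=json.dumps(params).encode('utf-8'))

    # Then pass that file to fit() as an additional channel alongside the data.
    estimator.fit({'training': 's3://my-training-bucket/data',
                   'hyperparams': 's3://my-training-bucket/config/hyperparams.json'})

Inside the container each channel lands under /opt/ml/input/data/<channel name>, so the training code should be able to json.load the file from /opt/ml/input/data/hyperparams/.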

There's no VPC support for training jobs. What this means for us is that we can't have training jobs pull data directly from our database, since we'd need to set up VPC peering to do so. Instead, I currently have a job that pulls and processes the data elsewhere and then uploads it to the appropriate S3 bucket.

Passing training / evaluation hooks is basically impossible, since anything you send over has to be serializable. You can have them put into the EstimatorSpec that your model_fn returns, but since that function gets called every time a checkpoint is restored, things get super awkward real quick if the hook needs to preserve state.

What does work

It's really easy to stand up an endpoint from the SDK given just the information needed to create the model.
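For instance, if you already have a trained model artifact sitting in S3, something like this should stand one up -- a minimal sketch assuming the 1.x Python SDK, with placeholder paths and role:

    from sagemaker.tensorflow.model import TensorFlowModel

    # All you need is the trained artifact, an execution role, and the entry
    # point that defines your serving functions.
    model = TensorFlowModel(model_data='s3://my-training-bucket/output/model.tar.gz',
                            role='arn:aws:iam::123456789012:role/SageMakerRole',
                            entry_point='model.py')

    predictor = model.deploy(initial_instance_count=1,
                             instance_type='ml.m4.xlarge')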

GPU training is easy-peasy -- just tell it to use a GPU instance type. You do have to get a service limit increase to use more than one, however.

Distributed training is also easy: just tell it to use more instances!
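For example (same caveats as above about placeholder names; the instance count and type here are arbitrary):

    from sagemaker.tensorflow import TensorFlow

    # Two GPU instances -> SageMaker handles the distributed training setup.
    # (Going beyond one ml.p2.xlarge is where the service limit increase comes in.)
    estimator = TensorFlow(entry_point='model.py',
                           role='arn:aws:iam::123456789012:role/SageMakerRole',
                           training_steps=10000,
                           evaluation_steps=100,
                           train_instance_count=2,
                           train_instance_type='ml.p2.xlarge')

    estimator.fit('s3://my-training-bucket/data')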

Other notes

It seems to store the tar'd source dirs on a per-run basis in an S3 bucket I never told it to use (and it keeps telling me it's creating that bucket every time it runs). Looking at the code, this is actually a pretty minor bug: in some cases it just asks for the default bucket rather than using the one you provided.

The tf.Estimator framework with train_and_evaluate (not really a SageMaker problem, but that's what they use in the TF container as of 1.6) is, in my opinion, awkward to use and has some large holes -- early stopping, for example, is kind of impossible.
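For context, this is roughly the shape of the loop the container drives -- a sketch against plain TF 1.6, with the estimator and input functions assumed to be defined elsewhere:

    import tensorflow as tf

    # estimator, train_input_fn, and eval_input_fn defined as usual elsewhere.
    train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn, max_steps=10000)
    eval_spec = tf.estimator.EvalSpec(input_fn=eval_input_fn, steps=100,
                                      throttle_secs=120)

    # You get periodic evaluation for free, but in TF 1.6 there's no clean place
    # to say "stop if the eval metric hasn't improved" -- hence no early stopping.
    tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)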

The README on the sagemaker-python-sdk repo is much better documentation than, like, the actual AWS documentation for SageMaker. I'd go to that first.

One last thing -- the order of responsiveness I’ve found for support channels has been:

  1. Our actual customer support line.
  2. Filing an issue against their SDK's GitHub repo.
  3. The SageMaker developer forum. Don't even bother with this one.

Hope this helps anyone out there looking to start doing their ML training in the cloud!