The fightback starts here: Stop your Machine Learning quick wins becoming a long-term drain on resources

Machine learning (ML) is quickly becoming a fundamental building block of business operations, resulting in improved processes, increased efficiency, and accelerated innovation. It is a powerful tool that can be used to build complex prediction systems quickly and affordably; however, it is naive to believe these quick wins won’t have repercussions further down the line.

ML has matured over the last decade and become much more accessible, thanks to the availability of high-performance compute, inexpensive storage and elastic compute services in the cloud. However, the development and operations processes needed to apply, enforce, manage and maintain a standard process for ML systems are still an emerging capability for most organisations. Some embark on the journey with confidence, secure in the knowledge that their mature DevOps process will ensure success, only to find that the ML development process has nuances that traditional DevOps does not consider. This often becomes apparent only after a significant investment has been made in ML projects, and it can result in failure to deliver.

One of the most effective ways of avoiding many of these pitfalls is containerisation. Containers provide a standardised environment for ML development that can be provisioned rapidly across devices and platforms.

What are Containers?

Containers provide an abstraction layer between the application and the underlying infrastructure. This abstraction allows software to run reliably when moved between environments, for example from a developer’s laptop to a test environment, from a staging environment into production, or from a physical machine in a data centre to a virtual machine in a private or public cloud.

Put simply, a container consists of an entire runtime environment: an application, plus all its dependencies, libraries and other binaries, and configuration files needed to run it, bundled into one package. By containerising the application platform and its dependencies, differences in OS distributions and underlying infrastructure are abstracted away.

Why use Containers for ML?

Containers are particularly effective for MLOps as they ensure the consistency and repeatability of ML environments. This simplifies the deployment process for ML models by removing the complexity involved in building and optimising the ML development and test environments while addressing the risk of inconsistencies introduced by manual environment provisioning.

Some of the immediate benefits of containerising MLOps pipelines include:

  1. Rapid deployment. Using pre-packaged Docker images to deploy ML environments saves time and ensures standardisation and consistency across development and testing.
  2. Performance. Container images can ship with tested builds of powerful ML frameworks such as TensorFlow, PyTorch and Apache MXNet, providing flexibility, speed and consistency in ML development.
  3. Ease of use. Orchestrate ML applications using Kubernetes (K8s), an open-source container-orchestration system for automating application deployment, scaling and management. For example, with an application deployed on K8s on Amazon EC2, you can quickly add machine learning as a microservice using AWS Deep Learning (DL) Containers.
  4. Reduced management overhead of ML workflows. Using containers tightly integrated with cloud ML tools gives you choice and flexibility to build custom ML workflows for training, validation, and deployment.
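
To make the consistency point concrete, here is a minimal Python sketch of an environment check, assuming you want to confirm that two environments (for example a development container and a test container) carry the same packages. The function name `environment_fingerprint` is hypothetical and not part of any container tooling; it simply hashes the Python version and the package versions so that two environments can be compared with a single string.

```python
import hashlib
import sys
from importlib import metadata


def environment_fingerprint(packages=None):
    """Return a SHA-256 digest of the Python version and package versions.

    `packages` is an optional {name: version} mapping. When omitted, the
    packages installed in the current environment are used.
    """
    if packages is None:
        packages = {}
        for dist in metadata.distributions():
            name = dist.metadata.get("Name")
            if name:  # skip distributions with broken metadata
                packages[name] = dist.version
    lines = [f"python=={sys.version_info.major}.{sys.version_info.minor}"]
    # Sort and lower-case names so the digest ignores ordering and casing.
    lines += sorted(f"{name.lower()}=={version}" for name, version in packages.items())
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()
```

Running the function inside each container and comparing the two digests is one simple way to spot drift introduced by manual provisioning. Images built from the same definition should produce the same value.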

Here are examples of how containers can help resolve key challenges that stop ML projects running efficiently and cost-effectively:

1. Complex model building and selection of the most suitable models

While in theory it makes sense to experiment with models to get the desired predictions from your data, this process is very time- and resource-intensive. You want the best model while minimising complexity and keeping control over a never-ending influx of data.

Resolution: ML models can be built using pre-packaged machine images which enable developers to test multiple models quickly. These images (e.g. Amazon Machine Images) can contain pre-tested ML framework libraries (e.g. TensorFlow, PyTorch) to reduce the time and effort required. This lets you tweak and adjust the ML models for different sets of data without adding complexity to the final models and gives you more control over monitoring, compliance and data processing.
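
As an illustration of what "testing multiple models quickly" can look like once the environment is in place, the sketch below fits several candidate models on the same training split and ranks them by error on a held-out validation split. It uses only the Python standard library and two deliberately simple candidates. The function names are hypothetical, and in practice the candidates would come from a framework such as TensorFlow or PyTorch.

```python
import statistics


def mean_model(train):
    """Baseline candidate: always predict the mean of the training targets."""
    mean_y = statistics.fmean([y for _, y in train])
    return lambda x: mean_y


def linear_model(train):
    """Candidate: least-squares fit of y = slope * x + intercept on one feature."""
    xs = [x for x, _ in train]
    ys = [y for _, y in train]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in train) / var_x
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def mean_squared_error(model, data):
    """Average squared difference between predictions and targets."""
    return statistics.fmean([(model(x) - y) ** 2 for x, y in data])


def select_model(candidates, train, validation):
    """Fit each candidate on `train`; return (name, error) pairs, best first."""
    scores = [
        (name, mean_squared_error(fit(train), validation))
        for name, fit in candidates.items()
    ]
    return sorted(scores, key=lambda item: item[1])
```

Calling `select_model({"mean": mean_model, "linear": linear_model}, train, validation)` returns the candidates ordered by validation error, so the comparison is made on identical data rather than on whichever sample each experiment happened to use.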

2. Rapid configuration changes and the integration of tools and frameworks

ML models are much easier to design, train and deploy when this work starts early in the project. The catch is controlling configuration changes while making sure that the data used for training doesn’t become stale in the process. Stale data (an artefact of caching, in which an object in the cache is not the most recent version committed to the data source) is one of the reasons many ML models never leave the training stage to see the light of day.
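
The definition of stale data above can be turned into a simple check. The following sketch assumes, purely for illustration, that both the cached copy and the source of truth are files that can be read locally. It compares content digests, and the helper names `file_digest` and `is_stale` are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_stale(cached_copy, source):
    """True when the cached training file no longer matches the source."""
    return file_digest(cached_copy) != file_digest(source)
```

A training job could run a check like this before each run and refresh its cache whenever `is_stale` returns `True`. For data held in object storage, comparing a version identifier exposed by the storage service would avoid downloading the source just to hash it.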

Resolution: Using containers enables the orchestration and management of ML application clusters. One example of this approach uses Amazon EC2 instances with K8s. A major benefit is that pre-packaged ML AMIs are pre-tested on instance types ranging from small CPU-only instances to powerful multi-GPU instances. These AMIs are regularly updated with recent releases of popular DL frameworks, which reduces the configuration changes needed for training ML models. Cloud-based storage such as Amazon S3 addresses the storage requirement for ever-changing and growing data sets. Using K8s, you can then orchestrate application deployment and add ML as a microservice for those applications.
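
To show roughly what "adding ML as a microservice" on K8s involves, the sketch below builds a Kubernetes Deployment manifest in Python and prints it as JSON, a format `kubectl apply -f` accepts alongside YAML. The service name, image reference and port are hypothetical placeholders, and the GPU limit assumes the NVIDIA device plugin is installed on the cluster.

```python
import json


def inference_deployment(name, image, replicas=2, port=8080, gpus=0):
    """Build a Kubernetes Deployment manifest (apps/v1) for a model-serving container."""
    container = {
        "name": name,
        "image": image,
        "ports": [{"containerPort": port}],
    }
    if gpus:
        # Assumption: the cluster exposes GPUs via the NVIDIA device plugin.
        container["resources"] = {"limits": {"nvidia.com/gpu": gpus}}
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            # The selector must match the labels on the pod template.
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [container]},
            },
        },
    }


manifest = inference_deployment(
    name="sentiment-model",                      # hypothetical service name
    image="registry.example.com/sentiment:1.0",  # hypothetical image reference
)
print(json.dumps(manifest, indent=2))
```

Because the model is packaged as an image, scaling the service becomes a matter of changing `replicas`, and moving from CPU to GPU nodes is a change to the resource limits rather than a rebuild of the environment.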

3. Creating self-learning models and managing data sets

The best way to achieve self-learning capabilities in ML is by using a wide range of parameters to test, train and deploy models. You need to be able to handle rapid configuration changes, have a monitoring platform for ML models, and set up an autonomous error-handling process. You also need enough storage to integrate ML clusters with the inevitably expanding data sets and the continuous influx of new data.
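
The monitoring requirement can be illustrated with a small sketch. Assuming that true outcomes eventually become available for the model's predictions, the hypothetical `AccuracyMonitor` class below tracks accuracy over a rolling window and flags when it falls below a threshold, which could be used as a trigger for retraining. The window size and threshold are illustrative values, not recommendations.

```python
from collections import deque


class AccuracyMonitor:
    """Track recent prediction outcomes and flag when retraining looks necessary."""

    def __init__(self, window=100, threshold=0.9):
        self.outcomes = deque(maxlen=window)  # oldest outcomes drop off automatically
        self.threshold = threshold

    def record(self, predicted, actual):
        """Store whether a single prediction matched the true outcome."""
        self.outcomes.append(predicted == actual)

    @property
    def accuracy(self):
        """Accuracy over the current window, or None if nothing is recorded yet."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_retraining(self):
        # Only decide once the window is full, to avoid reacting to a few samples.
        full = len(self.outcomes) == self.outcomes.maxlen
        return full and self.accuracy < self.threshold
```

In a containerised pipeline, a check like `needs_retraining()` could launch a training job from the same image that produced the current model, so the retrained model is built in an identical environment.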

Resolution: An increasingly popular and proven approach is to use Amazon Elastic Kubernetes Service (EKS), Amazon Elastic Container Service (Amazon ECS) and Amazon SageMaker. EKS enables you to monitor, scale and load-balance your applications, and provides a Kubernetes-native experience for consuming service mesh features and bringing rich observability, traffic controls and security features to applications. Additionally, EKS provides a scalable and highly available control plane that runs across multiple Availability Zones to eliminate a single point of failure. Amazon Elastic Container Service is a fully managed container orchestration service trusted with mission-critical applications because of its security, reliability and scalability. Amazon SageMaker is a fully managed service that gives developers the ability to build, train and deploy machine learning (ML) models quickly. SageMaker removes the heavy lifting from each step of the machine learning process to make it easier to develop high-quality models.

How can we help?

Organisations can overcome their ML worries by partnering with ECS to deploy MLOps using containers. No matter where organisations are on their ML journey, ECS can guide them to take the next step to ML success.

Our expert team has a track record of deploying and managing complex ML environments for large enterprises, including highly regulated financial services (FS) institutions. ECS’s ML engineering team uses AWS DL Containers, which provide Docker images pre-installed with DL frameworks. This enables a highly efficient, consistent and repeatable MLOps process by removing complexity and reducing the risk associated with building, optimising and maintaining ML environments.

By Harry Miller, Head of Data and Analytics Practice & Mehul Karia, Senior Cloud Consultant
