Taking ML models from PoC to Production with MLOps


Gartner research shows that only 53% of projects make it from artificial intelligence (AI) prototype to production. This is attributed to the impediments that technology and business leaders face in moving ML models to production. This eBook discusses how to approach MLOps effectively using tried and tested methods.


Table of Contents

1. Implications of large-scale ML models in operations
2. Misconceptions about productionizing ML models
3. Overcoming challenges in machine learning operations (MLOps)
4. Building scalable infrastructure of the future
5. Case study: Productionizing ML models
6. Conclusion

1. Implications of large-scale ML models in operations

The ability to utilize insights from data has extended beyond the realm of influencing strategic decisions to making business decisions on a daily basis. The need for advanced analytics systems that aid clear and timely decision making is now a key mandate for justifying business spends.
The appetite to uncover business intelligence from large volumes of data is ever growing, and this has driven the need for machine learning systems to be flexible in accommodating changing data types, to scale with increasing data volume, and to consistently deliver accurate results despite the uncertainties that accompany live data. Mike Gualtieri, Principal Analyst at Forrester, reveals that only 6% of the businesses interviewed have a mature capability to deploy ML models. This primarily refers to the inability of their ML solutions to be used in production environments with rapidly scaling data.

Scaling ML models

Since the inception of the Apache Spark project, MLlib has revolutionized the way scalable ML models are created. A major advantage is that it allows data scientists to focus more on scientific research rather than on solving complexities like those of infrastructure and configuration that surround distributed data. Data engineers, on the other hand, can focus more on distributed systems engineering using Spark's easy-to-use API, while at the same time enabling data scientists to leverage the scale and speed of the Spark core. It is important to note that Spark MLlib is a general-purpose library containing algorithms for most of the common use cases.
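As a minimal sketch of this division of labour, the PySpark pipeline below trains a model with MLlib; the input path and column names are placeholders, not values from the eBook. The same code runs unchanged on a laptop or a large cluster, with Spark handling the distribution.

```python
# Minimal PySpark MLlib pipeline sketch; paths and columns are placeholders.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()
df = spark.read.parquet("s3a://my-bucket/training/")  # placeholder dataset

# Assemble raw columns into the single feature vector MLlib expects.
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
lr = LogisticRegression(labelCol="label", featuresCol="features")

# Spark distributes both the feature assembly and the training transparently.
model = Pipeline(stages=[assembler, lr]).fit(df)
model.write().overwrite().save("s3a://my-bucket/models/lr")  # placeholder path
```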

Also, there are full-fledged services such as Amazon SageMaker, Google Cloud ML, Azure ML, and others that not only come with the benefit of auto-scaling but also offer algorithm-specific features like automatic tuning of hyperparameters, monitoring dashboards, easy deployment with rolling updates, and so on.
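To make the auto-tuning point concrete, here is a hedged sketch using the SageMaker Python SDK; the role ARN, S3 paths, container version, and objective metric are illustrative assumptions, not details from the eBook.

```python
# Hedged sketch of SageMaker automatic hyperparameter tuning.
# The role ARN and S3 paths are placeholders.
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.tuner import (
    HyperparameterTuner, ContinuousParameter, IntegerParameter,
)

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerRole"  # placeholder role

# Built-in XGBoost container; the version is an example choice.
image = sagemaker.image_uris.retrieve(
    "xgboost", session.boto_region_name, version="1.7-1"
)
estimator = Estimator(
    image_uri=image,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/model-artifacts/",  # placeholder
)
estimator.set_hyperparameters(objective="binary:logistic", num_round=200)

# The tuner explores the ranges and keeps the best job by validation AUC.
tuner = HyperparameterTuner(
    estimator,
    objective_metric_name="validation:auc",
    hyperparameter_ranges={
        "eta": ContinuousParameter(0.01, 0.3),
        "max_depth": IntegerParameter(3, 10),
    },
    max_jobs=20,
    max_parallel_jobs=4,
)
tuner.fit({"train": "s3://my-bucket/train/",        # placeholder channels
           "validation": "s3://my-bucket/val/"})
```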


2. Misconceptions about productionizing ML models

Traditionally, productionizing ML models referred to:

  • Hosting a machine learning model in an API
  • Ways to update ML models, and the nuances involved

However, a holistic approach for putting ML models in production includes the following:

  • Creating a scalable environment for deployment
  • Capturing and reporting detailed statistics in a transparent manner
  • Serving the model – providing API and hosting
  • Adopting a reliable experimentation framework

Productionizing ML models encompasses all activities to realize tangible business gains and automate machine learning systems while minimizing the associated risks.

3. Overcoming challenges in machine learning operations (MLOps)

The main challenges covered in this book are those that pose business risk and implementation risk, challenges that Sigmoid has typically addressed while putting ML models into production.

ML teams that start off without a data lake begin by building machine learning models on top of their traditional databases or data warehouses. This makes access to the right data an arduous task and invariably affects the productivity of data scientists.

Some common issues with enterprise data should be ironed out before building a machine learning model.


Data science teams often spend 80% of their time cleaning and managing data rather than focusing on their core task of developing ML models. The following foundations address this:
    1. Setting up the right data lake
      An environment with easy and powerful access to a variety of data sources frees data scientists to experiment with different data sets and understand what information lies in them. Robust, accessible systems such as Parquet-based warehouses facilitate easy access and availability of data (see the sketch after this list).
    2. Making scalable compute resources available
      This prevents data scientists from being bottlenecked by provisioning or requesting access. A compute environment that can be scaled up and down to ETL and process the data being analyzed helps them progress much faster and be more productive.
    3. Building a data cataloging system
      A strong cataloging system can fully leverage the work that data science teams perform. Once the data is cleaned up and structured appropriately, combining different data sources, cataloguing them, and capturing them correctly allows swift access for other data science teams and internal systems that need this data. Otherwise, teams usually invest in recreating the same data pipelines repeatedly because they have not designed a repeatable process.
    4. Having robust data governance and security
      Depending on the size of the organization, it is not uncommon to find resistance to sharing data among different teams in the absence of a strong governance system. A strong governance system makes teams comfortable sharing data and allows data science teams quick access.
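As a small illustration of point 1, a Parquet-based lake lets a data scientist pull just the columns and rows they need. The path, column names, and filter below are placeholders; pandas with the pyarrow engine and s3fs is one common setup, not one the eBook prescribes.

```python
# Hedged sketch: selective reads from a Parquet data lake.
# Path, columns, and filter values are placeholders; reading from
# s3:// requires the s3fs package alongside pandas/pyarrow.
import pandas as pd

orders = pd.read_parquet(
    "s3://data-lake/orders/",                       # placeholder location
    columns=["customer_id", "order_total", "order_date"],
    filters=[("order_date", ">=", "2021-01-01")],   # pushed down to pyarrow
)
print(orders.head())
```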

Successful data teams can mitigate feasibility risks by addressing questions such as:

  • Is the data available early enough to make meaningful predictions?
  • Are all the relevant data sets in place to develop the machine learning system?

Once all the necessary data is available, models can be built and tweaked. Next come the phase-wise approaches of:

  • Model selection
  • Model testing & deployment

1. Model selection

In this stage, data science teams start tweaking and building different models. After a model is selected, model weights are generated and assigned. Next, the idea is to mitigate modeling risk, which primarily deals with answering questions such as:

  • Is the data sufficient to actually make the predictions at this point?
  • Does it contain all the inputs needed to bring about the business change that is in question?

Choosing the right tech stack:

Diligently selected technologies allow for interoperability across different modeling technologies, provided the models are compatible across multiple stacks. Data scientists need to be given the freedom to choose from a range of technology stacks or modeling technologies so that they can explore conveniently. At the same time, there should be a check to steer away from technology that complicates productionizing the model in question.
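One way to keep models portable across stacks, offered here as an illustration rather than a recommendation from the eBook, is to export them to an exchange format such as ONNX:

```python
# Hedged sketch: exporting a scikit-learn model to ONNX so any
# ONNX-capable runtime can serve it, regardless of the training stack.
# skl2onnx and onnxruntime are example tools, not ones named in the eBook.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
import onnxruntime as ort

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Convert once; the serving side no longer needs scikit-learn at all.
onnx_model = convert_sklearn(
    model, initial_types=[("input", FloatTensorType([None, 10]))]
)
sess = ort.InferenceSession(onnx_model.SerializeToString())
preds = sess.run(None, {"input": X[:5].astype(np.float32)})[0]
print(preds)
```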


2. Model testing and deployment

As models are being built, testing is a critical process: models are eventually integrated into operational environments where they work on new data every day, and the output cannot be wildly inconsistent. The tests may be statistical in nature and may seem constrained, but they cannot be side-lined.
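As one example of the kind of statistical test meant here (the eBook does not name a specific method), a two-sample Kolmogorov-Smirnov test can flag when the model's score distribution on live data drifts from a reference window:

```python
# Hedged sketch: drift check on model scores with a KS test.
# The distributions and the alert threshold are illustrative.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
reference_scores = rng.beta(2, 5, size=5_000)   # scores at validation time
live_scores = rng.beta(2.3, 5, size=5_000)      # scores on today's data

stat, p_value = ks_2samp(reference_scores, live_scores)
if p_value < 0.01:                              # illustrative threshold
    print(f"Possible drift: KS={stat:.3f}, p={p_value:.4f}")
else:
    print("Score distribution is consistent with the reference window.")
```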

Model Deployment

After ensuring that the modeling risks have been eliminated, that the data is predictive in nature, and that the right modeling techniques have been selected and can be taken into production, the model is ready for the next step: deployment. Deployment is a very engineering-driven activity. Below are some of the key aspects that need attention to ensure a smooth process.

a. Codebase: The code base that has been written so far needs to be polished so that it is battle-tested before being put into production.

b. Integration: Next the proper integration approach needs to be determined. Some questions that need to be answered are:

  • Will there be an API endpoint that people are going to use for obtaining results from the model? (See the sketch after this list.)
  • Is it going to be a bulk-process model that will be integrated with ETL tools?
  • How will the workflow be orchestrated? Will a cloud scheduler on Google be used? If Airflow is used, how will it be automated so that teams are comfortable carrying out new integrations into the workflow system?
  • How will detailed access and monitoring be provided to track SLA reliability?
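To make the first question concrete, a model-serving endpoint can be as small as the sketch below; FastAPI and a pickled scikit-learn model are illustrative assumptions, since the eBook does not prescribe a framework.

```python
# Hedged sketch of a minimal model-serving API endpoint.
# "model.pkl" is a placeholder artifact path.
import pickle

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

with open("model.pkl", "rb") as f:      # placeholder model artifact
    model = pickle.load(f)

class Features(BaseModel):
    values: list[float]                 # flat feature vector for one record

@app.post("/predict")
def predict(features: Features):
    prediction = model.predict([features.values])[0]
    return {"prediction": float(prediction)}
```

Run with `uvicorn app:app` and the service answers POST requests at `/predict`; a bulk-process model would instead invoke the same predict call inside an ETL job.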

c. Coding: What coding practices need to be employed?

d. Data Scientists:

  • Are they going to be involved in discussions while selecting the development stack?
  • Will they have full control over the system (Sigmoid recommends this) or will they just be able to check-in code and see production results?
  • How to ensure deep involvement of data scientists from the start, and prevent siloed activities?

The choices made at this stage are about eliminating integration risk, which involves integrating different teams with one another and with operational systems. Teams also witness many results that start appearing at this stage, e.g., a lift in clicks in the case of recommendation systems.


Going beyond deployment

Once the models are deployed, it is important to assess how they will be run and monitored in detail. If there is a sophisticated experimentation system, it is essential to measure the results of those experiments and update the business teams on the different models that are running, along with the results they generate. It is also very important at this stage to start involving more stakeholders to make sure that the results of the models and the ROI of the system are transparently reported for continued operations.



4. Building scalable infrastructure of the future

The central idea is to have systems that are sufficiently mature to handle the use case being considered. This ensures that analytics teams focus on their core tasks and stay productive.

Key considerations while building data systems that set the foundation for robust MLOps


Involvement of interdependent data teams

There is a need for strong collaboration between the different data teams for successful deployment of ML initiatives. Over-communication is always better, as it is vital to identify different types of risks across different phases of the project. Open plans, an open culture of sharing, and structuring a cross-functional team with deep ML deployment experience can set the project on the path to success right from the start. The table below illustrates the roles and responsibilities of a winning data team.

Data Engineer: Makes the appropriate data available for data scientists; focuses on data integration, modeling, optimization, quality, and self-service. Is aware of the technology stacks, their advantages, and their limitations.

Data Scientist: Identifies use cases, determines appropriate datasets and algorithms, experiments, and builds AI models. Involving data scientists early in the process avoids redundancies and enables the creation of large epics right at the beginning. This allows execution teams to have a clear understanding of the completion milestones of each stage of model building before moving to production.

AI Architect: Is the glue between data scientists, data engineers, developers, operations (DataOps, DevOps, MLOps), and business unit leaders to govern and scale AI initiatives.

ML Engineer: Deploys AI models through effective scaling, ensures production readiness, and maintains a continuous feedback loop.

DataOps Engineer: Is involved in development and deployment to deliver analytics to end users. Manages tools and processes to support the data infrastructure and has a fair understanding of how models go into production.

5. Case study: Productionizing ML models for 1:1 personalized email marketing at scale

Objective:

The client, a leading US-based restaurant chain, had over 12 million registered customers. They wanted to optimize profit and long-term sales by presenting the right message, at the right time, to the right person, via email.

Existing Scenario:

  • A marketing bundle was developed to include discounts or special promotions running for a certain length of time
  • A common email was sent out to the entire customer base; in a hit-and-forget approach, marketers hoped for the best results after the email was sent
  • They had to retroactively understand which offers did well and which didn't, and try to do better next time

Desired Scenario:

  • The restaurant chain wanted to make the existing scenario smarter by applying machine learning
  • They wanted to implement a feedback loop through which they could understand whether a campaign was successful
  • They also wanted to explore the idea of introducing multiple campaigns at once

Sigmoid’s Solution Approach

Several different aspects of the email were parameterised, enabling the creation of over 200 variants instead of the single email variant that had been used. Parameters included the tonality and the offer contained in the subject line, e.g., does it have a discount? What is the discount value?

To achieve this, the entire customer base had to be categorized into eight segments based on their purchase behaviors so that each message was more contextually relevant to them. This enabled accurate customization of the initial email variants to the identified customer segments.
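The eBook does not say which segmentation method was used; as one hedged illustration, k-means over purchase-behavior features could produce eight segments like so:

```python
# Illustrative segmentation into eight clusters; k-means and the
# synthetic features are assumptions, not the client's actual method.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Stand-in purchase-behavior features: frequency, recency, average ticket.
X = rng.normal(size=(12_000, 3))

X_scaled = StandardScaler().fit_transform(X)
segments = KMeans(n_clusters=8, n_init=10, random_state=1).fit_predict(X_scaled)
print(np.bincount(segments))            # customers per segment
```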

MAB Models

  • Campaign effectiveness was measured from the number of clicks all the way through to conversions during the campaign period. The solution aimed at measuring and optimizing the messages to drive a particular segment to purchase what was being offered to them.
  • The system used a Multi-Armed Bandit (MAB), a type of reinforcement learning technique, which ensured that variants that performed well were overweighted in the subsequent iteration of the experiment, while variants that did not perform well were underweighted in the next period (see the sketch after this list)
  • Over time the system optimized to promote messages that led to higher conversions
  • The model also continuously learned which variants were performing and which were not
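The eBook names MAB but not the exact algorithm; Thompson sampling is one common choice, sketched below with simulated click feedback to show how well-performing variants get overweighted over time.

```python
# Hedged sketch: Thompson-sampling bandit over email variants.
# Variant count, CTRs, and iteration count are simulated placeholders.
import numpy as np

rng = np.random.default_rng(42)
n_variants = 8
alpha = np.ones(n_variants)             # Beta prior: clicks + 1
beta = np.ones(n_variants)              # Beta prior: non-clicks + 1
true_ctr = rng.uniform(0.02, 0.10, n_variants)  # simulated ground truth

for _ in range(10_000):                 # one email send per iteration
    theta = rng.beta(alpha, beta)       # sample a CTR belief per variant
    arm = int(np.argmax(theta))         # send the variant believed best
    clicked = rng.random() < true_ctr[arm]
    alpha[arm] += clicked               # posterior update on the outcome
    beta[arm] += 1 - clicked

estimated = alpha / (alpha + beta)
print("estimated CTRs:", np.round(estimated, 3))
print("most-promoted variant:", int(np.argmax(estimated)))
```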

Productionizing the ML models

The solution went beyond establishing the MAB model; it also set up a strong foundation that could be leveraged for multiple ML deployment projects in the future.


The various components used to productionize the ML model:

  1. All the different models were created by the machine learning teams, using platforms like Jupyter notebooks to write and submit them as modules, and GitHub to check them in.
  2. A CI system pulled in these models, Dockerized them, and set them up.
  3. The Docker containers were then pushed to a Docker container registry and integrated into workflow systems.
  4. It is important to note here that ML teams may already be working on Docker containers, and some exchanges may happen only at the Python-package level. The process may vary depending on the maturity of the data science team and the technology stack in use.
  5. Airflow's KubernetesPodOperator was used to launch pods with the defined specifications, which loaded the Docker images with all the necessary environment variables, secrets, and dependencies (see the sketch after this list).
  6. These environments were composed of EKS systems and Airflow, which fetched these Docker containers, put them into the workflow, and turned their output into customer data sets that could be pushed into Breeze (or any other CRM system being used).
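A hedged sketch of the pattern in step 5 follows; the DAG id, image name, namespace, and environment variable are placeholders rather than the client's actual configuration.

```python
# Hedged sketch: Airflow DAG launching a model container on Kubernetes.
# All names and the image URI are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)
from kubernetes.client import models as k8s

with DAG(
    dag_id="ml_model_scoring",              # placeholder DAG id
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    score_customers = KubernetesPodOperator(
        task_id="score_customers",
        name="score-customers",
        namespace="ml-jobs",                # placeholder namespace
        image="registry.example.com/ml/scoring:latest",  # placeholder image
        env_vars=[k8s.V1EnvVar(name="MODEL_VERSION", value="v3")],
        get_logs=True,
    )
```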

The intermediate data sets were all stored in S3. These were contained in a standalone AWS account, which could be spawned off in order to manage governance and security.
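As a small sketch of that pattern (bucket, key, and columns are placeholders), each workflow step can persist its output as Parquet in the dedicated account:

```python
# Hedged sketch: writing an intermediate data set to S3 as Parquet.
# Bucket and key are placeholders; s3fs is assumed to be installed.
import pandas as pd

scored = pd.DataFrame(
    {"customer_id": [101, 102], "variant": [3, 7], "score": [0.81, 0.64]}
)
scored.to_parquet("s3://mlops-intermediate/scored/part-0.parquet")
```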

Technology, people, and process essentials for successful MLOps

Technology

There were multiple technologies in place within the client's IT environment, and some parts of the business requirements were distributed across different cloud environments. It was very important to use cloud-agnostic tools that avoided lock-ins. This largely meant open-source tools and cloud services that can be used across cloud providers: for instance, Terraform for infrastructure creation and maintenance, Helm charts to automate deployments, and Airflow for scheduling and execution.

Parameters for success:

  1. Usage of cloud-agnostic tools to avoid vendor lock-in
  2. Execution of on-demand and scheduled jobs
  3. Systems designed for high availability, fault tolerance, auto-scaling, and disaster recovery
  4. Support for varying requirements in programming language, libraries, CPU & GPU, and memory
  5. Terraform scripts that enable spinning the infrastructure up and down
  6. Usage of a GitLab CI pipeline setup

People

ML models on multiple environments created the need to identify the diverse data engineering specialisations required within the team. There had to be a plan for training data scientists on using new tools where needed. New technologies being used for deployment and CI meant identifying and plugging skills gaps.

Parameters for success:

  1. Requirements gathering from senior/lead data scientists for setting up the data science laboratory
  2. Volunteering by lead data scientists to try the setup and identify any feature gaps
  3. Conducting multiple training sessions on various topics pertinent to the use case

Process

Sigmoid maintains a systematic method while productionizing ML models, which involves testing all processes, eliminating associated risks, and ensuring stakeholder acceptance at each stage. This in turn becomes a playbook for future teams to take models into production. From a machine learning standpoint, the role of each resource is clear, and that helps them move swiftly through the different stages.

Parameters for success:

  1. Setting up infrastructure for monitoring, logging, and alerting for maintenance and debugging
  2. Tracking model performance metrics and artifacts (see the sketch after this section)
  3. Promoting a model and using it for inferences
  4. Obtaining acceptance from cloud security & network teams
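For the process item on tracking model performance metrics and artifacts, MLflow is one hedged illustration of how such tracking can look; the tool, run name, and values are assumptions, not details from the case study.

```python
# Hedged sketch: logging metrics and artifacts for a model run.
# MLflow is an example tool; the values are illustrative only.
import mlflow

with mlflow.start_run(run_name="email-mab-v3"):     # placeholder run name
    mlflow.log_param("n_variants", 200)
    mlflow.log_metric("validation_auc", 0.87)       # illustrative metric
    mlflow.log_artifact("model.pkl")                # previously saved file
```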

Results

There was a 9% lift in sales transactions and an 8% sales lift across the business, which, when projected across 52 weeks, leads to a lift of $15 to $30 million in revenue annually.



6. Conclusion

Data and analytics leaders can greatly reduce the risk of spending time and resources on ML projects that never go into production by employing a robust framework for MLOps that allows seamless integration of AI solutions with existing live applications.

Almost every business relying on data greatly benefits from a scalable technology environment. Most of them need some model-serving system, especially if they are serving out APIs. However, this need not always translate into a very sophisticated investment, especially in the case of bulk processing. Enterprises need to invest in detailed access to key data, serving it up to the business, and finally in an experimentation framework. It most certainly depends on the use case, and a lot of businesses currently consider this something that is 'good to have'. But this will soon transform into a 'must-have' for any enterprise that is looking not only at uncovering seemingly imperceptible opportunities with data but also at enhancing its daily business operations across domains.
