3. MLOps Challenges and Ways to Overcome Them
Despite advancements in tools and technologies, ML modeling is hard to translate into active business gains. A plethora of engineering, data, and business concerns can hinder putting ML models into production. In a poll conducted by Sigmoid, 43% of respondents said they hit roadblocks in ML model production and integration.
Only 53% of projects make it from artificial intelligence (AI) prototypes to production. AI leaders find it hard to scale AI projects because they lack the tools to create and manage a production-grade AI pipeline.
The main challenges covered in this book are those that pose business and implementation risks, which Sigmoid has typically addressed while putting ML models into production. The following sections discuss some of the challenges that enterprises face.
3.1 Challenges that enterprises face
Data availability and quality
ML model training requires large volumes of relevant records, and it cannot be just any data: getting relevant data sets fast enough to make accurate predictions is not straightforward, and getting contextual data is also a problem. Most ML teams still train models on top of their traditional data warehouses, with the result that 80% of their time is spent cleaning and managing data rather than training models. These complexities drive up the cost of maintaining ML models in production.
Engineering and deployment
Even if the data is available, ML systems can be difficult to engineer. It is also important to standardize the different technology stacks in use so that they do not hinder productionizing ML models, which is crucial to a model's success. For instance, data scientists may code in Python and use tools like Pandas, but these don't necessarily translate well to a production environment where Spark or PySpark is more desirable. Improperly engineered technical solutions can be costly.
Model drift
Model drift refers to the degradation of an ML model's predictive ability, caused by changes in the digital environment and, in turn, in the data the model sees. Model drift occurs in machine learning models over time simply by the nature of the models themselves. It comes in two main types: concept drift, where the relationship between the input variables and the predicted target changes, and data drift, where the statistical properties of the input data themselves change.
Lack of standardized processes
A scalable production environment that is well integrated with different data sets and modeling technologies is crucial to an ML model's success. Complicated codebases have to be turned into well-structured systems ready to be pushed into production. In the absence of a standardized process for taking a model to production, the team can get stuck at any stage.
Complex testing requirements
Testing machine learning models is difficult but is an important step of the production process. Running health checks and watching out for data anomalies keeps a check on the overall performance of the ML models. However, testing ML models is more complex than conventional software testing: ML-specific testing also includes data and model validation and evaluation of trained model quality.
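To make the idea concrete, here is a minimal, hedged sketch of the two ML-specific checks mentioned above: row-level data validation and a trained-model quality gate. All function names, fields, and thresholds are illustrative assumptions, not a specific framework's API.

```python
# Minimal sketch: data validation plus a trained-model quality gate.
# Field names and the 0.8 accuracy threshold are illustrative only.

def validate_rows(rows, required_fields, numeric_ranges):
    """Flag rows with missing fields or out-of-range values."""
    errors = []
    for i, row in enumerate(rows):
        for field in required_fields:
            if row.get(field) is None:
                errors.append((i, f"missing {field}"))
        for field, (lo, hi) in numeric_ranges.items():
            value = row.get(field)
            if value is not None and not (lo <= value <= hi):
                errors.append((i, f"{field} out of range"))
    return errors

def model_quality_gate(predictions, labels, min_accuracy=0.8):
    """Fail the build if accuracy drops below an agreed baseline."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    accuracy = correct / len(labels)
    return accuracy >= min_accuracy, accuracy

rows = [
    {"age": 34, "income": 52_000},
    {"age": 210, "income": 48_000},   # invalid age
    {"age": 29, "income": None},      # missing income
]
errors = validate_rows(rows, ["age", "income"], {"age": (0, 120)})
ok, acc = model_quality_gate([1, 0, 1, 1], [1, 0, 0, 1])
print(errors)   # two data issues flagged
print(ok, acc)  # gate fails: 0.75 < 0.8
```

In practice, checks like these run in CI alongside conventional unit tests, so a model that regresses on held-out data never reaches production.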
Lack of skilled resources to drive MLOps
Driving MLOps initiatives requires specialized resources, such as data engineers and data scientists, to perform a number of complex functions: developing models, assessing and analyzing data, and exploring ML use cases. There should also be transparent communication across data science, data engineering, DevOps, and other relevant teams to drive ML success. But assigning roles, granting fine-grained access, and monitoring every team is complex.
Continuous pipeline visibility
ML model deployment requires multi-step pipelines that are critical for automated retraining and deployment. Building them is complex because various manual steps in the process must be automated before data scientists and engineers can deploy the ML models, and poor coding and evolving data profiles create further productionizing challenges. Pipelines therefore demand continuous performance tracking, so that deviations from expectations can be caught and performance improved.
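The multi-step, tracked pipeline described above can be sketched as follows. This is a toy illustration under stated assumptions: step names and the recorded metrics are hypothetical, and the "training" step is a stand-in for a real fit, not any particular orchestration tool.

```python
# Sketch of a multi-step retraining pipeline that records per-step
# metrics, so deviations from expectations surface early.

class Pipeline:
    def __init__(self):
        self.steps = []    # (name, fn) pairs, run in order
        self.history = []  # (step, metrics) records for monitoring

    def add_step(self, name, fn):
        self.steps.append((name, fn))
        return self

    def run(self, data):
        for name, fn in self.steps:
            data, metrics = fn(data)       # each step returns (output, metrics)
            self.history.append((name, metrics))
        return data

def ingest(raw):
    cleaned = [x for x in raw if x is not None]
    return cleaned, {"rows_in": len(raw), "rows_out": len(cleaned)}

def train(rows):
    mean = sum(rows) / len(rows)           # stand-in for a real model fit
    return {"model_mean": mean}, {"train_rows": len(rows)}

pipeline = Pipeline().add_step("ingest", ingest).add_step("train", train)
model = pipeline.run([3, None, 5, 4])
print(model)             # {'model_mean': 4.0}
print(pipeline.history)  # per-step metrics for deviation checks
```

Comparing `history` across runs is what turns manual steps into a monitorable, automatable pipeline: a sudden drop in `rows_out`, for example, is caught before a degraded model is deployed.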
3.2 Best practices to address MLOps challenges
Data assessment and data feasibility checks ensure that data teams have the right data sets to run machine learning models and that they are getting data fast enough to make predictions. Common issues with enterprise data should be ironed out before starting to build a machine learning model.
Setting up the right data lake
Building machine learning models on top of traditional data warehouses hurts data scientists' productivity. A data lake environment provides easy and powerful access to a variety of data sources while saving the team a great deal of bureaucratic and manual overhead. It gives data scientists the opportunity to experiment with different data sets and understand what information they contain.
Evaluation of the right technology stack and scalable compute resources
Selecting the right technology to build and productionize ML models is a crucial step. The data team can experiment with a range of technology stacks and pick the ones that make productionizing ML easier. The chosen technology should be benchmarked against stability, the business use case, future scenarios, and cloud readiness. Moreover, a scalable compute environment that can be scaled up and down for ETL and data processing helps the team progress much faster and be more productive.
Post-deployment support and testing
Once the ML models are deployed, the environment should be tested in real time and monitored closely. In a sophisticated experimentation system, test results can be fed back to the data engineering teams to update the models. For instance, the data engineers can decide to overweight the variants that overperform in the next iteration while underweighting the underperforming ones. Teams should also watch out for negative or wildly wrong results, and the right SLAs need to be met.
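The overweight/underweight idea above can be sketched in a few lines. The proportional-to-performance rule below is one simple assumption for illustration, not the method the text prescribes; real systems often use bandit algorithms with exploration built in.

```python
# Illustrative sketch: shift next-iteration traffic toward model
# variants that overperform. The proportional rule is an assumption.

def reweight(observed_rates):
    """Next-iteration traffic weights, proportional to observed success rates."""
    total = sum(observed_rates.values())
    return {variant: rate / total for variant, rate in observed_rates.items()}

# Conversion rates measured for three deployed model variants.
rates = {"model_a": 0.10, "model_b": 0.30, "model_c": 0.10}
weights = reweight(rates)
print(weights)  # model_b gets the largest share next round
```

A production version would also cap how quickly weights can move and keep a minimum share per variant, so that a noisy week of results cannot starve a variant of traffic entirely.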
Team collaboration and communication
Running successful ML models requires clear communication between the various cross-functional teams to mitigate risks at the right step. While data scientists take full control of the system to check in code and review production results, the DevOps team contributes by maintaining the pipelines. Transparent communication and strong collaboration between the teams can set the project up for success right from the start.
| Role | Responsibilities |
| --- | --- |
| Data Engineer | Makes the appropriate data available for data scientists; focuses on data integration, modeling, optimization, quality, and self-service. Data engineers know the technology stacks, their advantages, and their limitations. |
| Data Scientist | Identifies use cases, determines appropriate data sets and algorithms, experiments, and builds AI models. Involving data scientists early in the process avoids redundancies and enables the creation of large epics right at the beginning, giving execution teams a clear understanding of the completion milestones of each stage of model building before moving to production. |
| AI Architect | Acts as the glue between data scientists, data engineers, developers, operations (DataOps, DevOps, MLOps), and business unit leaders to govern and scale AI initiatives. |
| ML Engineer | Deploys AI models through effective scaling, ensures production readiness, and maintains a continuous feedback loop. |
| DataOps Engineer | Is involved in development and deployment to deliver analytics to end users; manages tools and processes that support the data infrastructure and has a fair understanding of how models go into production. |
Building a data governance and cataloging system
A strong cataloging system lets the organization fully leverage the data science teams' work. After data from different sources is cleaned and combined into the right structure, cataloging and capturing it correctly makes it available to the other data science teams and internal systems that need it. A strong governance system makes data comfortable to share and lets data science teams access it quickly. A variety of technology systems support both, but what separates successful teams from others is the ability to mitigate feasibility risks.
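A cataloging layer of the kind described above can be sketched minimally: data sets are registered once with their metadata and are then discoverable by any team. The class, fields, and example entries here are hypothetical illustrations, not a specific catalog product.

```python
# Minimal sketch of a data catalog: register curated data sets with
# metadata so other teams and systems can discover and trust them.
# All field names and the example entry are illustrative.

class DataCatalog:
    def __init__(self):
        self._entries = {}

    def register(self, name, owner, schema, source):
        self._entries[name] = {
            "owner": owner,    # team accountable for the data
            "schema": schema,  # column -> type, for downstream validation
            "source": source,  # where the cleaned data lives
        }

    def find(self, name):
        return self._entries.get(name)

catalog = DataCatalog()
catalog.register(
    name="customer_events",
    owner="data-engineering",
    schema={"customer_id": "string", "event_ts": "timestamp"},
    source="s3://lake/curated/customer_events/",
)
print(catalog.find("customer_events")["owner"])  # data-engineering
```

Even this toy version shows the governance payoff: a consumer can check the owner and schema before using a data set, instead of reverse-engineering it from raw files.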
Feasibility risk primarily deals with questions such as:
Addressing model drift
Early detection of model drift is critical to maintaining model accuracy, because accuracy decreases over time as predicted values deviate further and further from actual ones. The longer this goes on, the more irreparable damage is done to the model as a whole, so catching the problem early is essential. Refitting models based on past experience can help create a predictive timeline for when drift might occur in a model. With this in mind, models can be redeveloped at regular intervals to head off impending model drift.
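One common way to operationalize early drift detection is to compare a feature's distribution in production against its distribution at training time. The sketch below uses the Population Stability Index (PSI); the 0.2 alert threshold is a widely used rule of thumb, not a universal standard, and the binned distributions are made up for illustration.

```python
import math

# Sketch of early data-drift detection with the Population Stability
# Index (PSI) on one binned feature. Threshold 0.2 is a rule of thumb.

def psi(expected, actual):
    """PSI between two binned distributions (bin fractions summing to 1)."""
    return sum(
        (a - e) * math.log(a / e)
        for e, a in zip(expected, actual)
        if e > 0 and a > 0
    )

# Binned feature distribution at training time vs. in production.
training = [0.25, 0.25, 0.25, 0.25]
production = [0.10, 0.20, 0.30, 0.40]

score = psi(training, production)
print(round(score, 3), "drift" if score > 0.2 else "stable")
```

Scheduling this check on fresh production data and alerting when the score crosses the threshold gives the team the early warning the paragraph above calls for, well before accuracy metrics visibly degrade.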