Artificial Intelligence as a theory began in the last century in academic spheres. Still, the real, applicable principles began to take shape only with the development of computing devices, which gave more and more opportunities for its implementation and application. And although the basic concept remains mainly uniform, some sets of applied principles have gained enough approaches and realizations to stand out as an independent discipline. One such highly successful discipline became machine learning and, a little later, deep learning.
And today, machine learning is booming in popularity for more than deserved reasons. Indeed, it allows us to analyze, predict, and generate many things that would be extremely difficult or impossible to do without it.
But while the results of its work are obvious, and we can directly feel them for ourselves, the principles of its work and the specifics of its use remain a mystery to many people. Therefore, there is a false belief that it is enough to implement machine learning in any service, and it will immediately change the world, turn the market upside down and be sure to be a resounding success.
Alas, statistics for 2020 show that for only 22% of companies, implementing machine learning was a successful solution that justified the costs and brought real value.
So why hasn't this wonderful technology worked for so many companies? The answer is that any technology requires quality development, implementation, and maintenance. More precisely, good ML requires good MLOps pipelines and MLOps practices.
So today, we're going to talk about what are MLOps and what's the difference between DevOps vs MLOps vs AIOps? We'll also talk about the important and necessary MLOps processes we build and apply at Mad Devs, best practices, and successful case studies from specific companies. And, of course, we'll take a close look at the MLOps market itself, its dynamics, and trends in precise figures. Please, get a bunch of useful information from the near future, after which it will be much clearer to you why and when to employ MLOps.
What is MLOps, and why do we need it?
MLOps is the set of practices and tools needed to build the infrastructure to automate machine learning models' learning, testing, deployment, and monitoring in end products and services.
A machine learning model that is not properly updated but continues to be used can cause a lack of success and even be a source of loss. However, an constantly improving machine learning model and its proper infrastructure can transform a business, which is best expressed in numbers.
Netflix believes that the combined effect of ML-based personalization and recommendations saves the company $1 billion annually via increased customer retention.
So let's be specific, what do we get with or without quality MLOps?
The availability of quality MLOps
The model's capabilities are constantly increasing, allowing you to process more new types of data and make more profound conclusions, starting from the behavior of the entire market and ending with the behavior of a particular consumer.
The model's accuracy is constantly increasing, making more comprehensive predictions and conclusions and more complex decisions with greater certainty.
Automatically tested and qualitatively updated versions of the model seamlessly integrate into the used services and allow you to have the most relevant tool for analysis and prediction at any given time.
The model continues to add more value to the business, improving the quality of decisions previously made or even enabling decisions that were formerly not even discussed.
Lack of or poor quality MLOps
Model capabilities remain flat at best, not withstanding competition.
Model accuracy slowly decreases, leading to increasingly inaccurate conclusions and more erroneous decisions.
Updates to the model do not reach the service with the necessary frequency or quality, making the service less and less usable.
The model continues to require at least minimal resources for its maintenance but cannot even recoup them, bringing no real value to the business and even producing losses due to inaccurate conclusions and erroneous decisions.
AIOps vs MLOps
Before we move away from the idea of MLOps and move on to a more detailed breakdown and comparison with DevOps, let's eliminate the confusion that may arise. What is the difference between AIOps and MLOps?
Well, the difference may not seem so obvious since ML is a subset of AI so MLOps can be similar to AIOps in some aspects. Moreover, both disciplines are still far from their final formation, which can also be confusing. But there are some important points that best demonstrate the difference.
AIOps is a set of practices and tools that uses big data and machine learning to automate all the operations, services, and applications your company uses or develops. AIOps allows you to automate event correlation, anomaly detection, and cause and effect finding, providing in-depth diagnostics of all company processes and projects.
MLOps is a set of practices and tools to best build, train, integrate, and deliver machine learning models into end products and services, designed to provide simple ways to solve complex, specialized problems.
That is, roughly speaking, you could see AIOps and MLOps as having a similar proportion to AI and ML as disciplines. But that would be a very simplified view because AIOps is an incredibly promising and expansive discipline and certainly deserves its own article. Share in the comments if you'd like us to break it down in detail.
DevOps vs MLOps
It's worth noting that the end goals of DevOps and MLOps are very similar: to automate the processes of making and deploying updates, ensuring continuous quality improvement. But there is a difference, and it is significant.
The reason is that machine learning models, unlike traditional software, can require changes not only to extend features, improve quality, and so on but even to keep them. The reason is that machine learning models can change and degrade over time. They need to be retrained by specially pre-processed new data.
So DevOps builds an infrastructure that includes practices like automated testing, continuous integration, and continuous delivery. Whereas MLOps uses a similar infrastructure, adopting most of its practices, but supplementing them with practices such as continuous training, building more complex processes not only with code but also with data.
One person can't know everything equally well, even if it's a related field. There can be an insurmountable chasm between the knowledge, skills, and experience of data scientists, data engineers, machine learning engineers, and software developers.
So build hybrid teams where everyone will strictly do their part of the job but will do it perfectly. To ensure this, hire a good MLOps engineer who perfectly understands the nature of each process and the role of each hybrid team member and knows how to design, implement, and maintain high standards.
Model precision should be agreed upon in advance, from the design phase through implementation and retraining.
Compliance with certain metrics influences even the choice of algorithms for data processing, even before the model itself is built and even more so before its integration and deployment. Therefore, it is necessary to define absolutely all desired metrics as precisely as possible and start development on their basis.
Focus on processes and consistency
The fact is that machine learning models have a black box effect, i.e., their creators often cannot understand why they got exactly the same result in some cases. Lately, it has even begun to gain the opinion that the quality of a model should not be determined firstly by its transparency and secondary by its accuracy.
Therefore it is necessary to keep the focus on the processes to have the best chance to understand which of them could lead to what results if we need to influence the results in a certain way. And systematicity, which makes it clear which of the constant processes in principle could not affect the result and which could.
Models always require some type of data transformation, which can be difficult if the data comes in real-time. Therefore, there are data pipelines in which you can prescribe to transform data before giving them to a model. There are many tools that let you create, manage and run such pipelines.
The main advantage is that such pipelines do not depend on the data itself. They can be automatically deployed within the CI/CD pipeline, which allows you to prescribe several pipelines for different data transformations. For example, in one case, automate their transformations to train the model, and in another case, take ready data from the database to run the model in real-time.
As mentioned earlier, models are mutable and much more defined by data than by code. So, you should not misunderstand directly associating a specific model version with a particular code version but rather treat it more comprehensively.
Comprehensive unit testing of models
Here's a pretty simple thought: unlike traditional software, models can't just pass or fail a test because they're never 100% accurate. So static tests are not suitable for them, and it is reasonable to test them not relative to such parameters as passed or failed but relative to their own previous iterations.
Just as any quality unit test does testing for multiple cases, it is necessary to do model testing relative to several different kinds of data. You can even automate model training online if such tests cover as many variants as possible.
Comprehensive unit testing of data
Testing data is a very important practice in the case of machine learning models. After all, a machine learning model is always code plus data. And if the data is not tested separately, it may distort the model's performance.
Therefore, separate unit testing of data should also be exhaustive, providing a variety of data types and types of data processing, for which, again, the same data pipelines provide very broad opportunities.
A good practice, which gets rid of legal problems and creates a good reputation, is the licensing of data on which the model was trained.
Of course, we're not talking about the internal data that the company generates but selected public data for the initial training of the model. Nevertheless, licensing any data is becoming an increasingly strong word in the reputation race.
An essential practice, some solutions for which have already been listed above, is to pay special attention to building an infrastructure for quality monitoring of model operation in a working environment. Because in the case of machine learning models, we want to deal not only with the things we have full control over but also with the data over which we have less control.
Consequently, in addition to monitoring standard metrics such as latency, traffic, and errors, we also need to monitor the model's prediction performance. All of this is achieved through data pipelines, additional pipelines, and advanced unit testing.
MLOps platforms are software products that help automate and manage all phases of the machine model lifecycle, from their initial build and training to their deployment and retraining, as well as the necessary data and operations on it.
Typically, MLOps platforms provide the following list of features:
MLOps frameworks integrations to build and train models
Tools for version control of datasets and pipelines
Tools for controlling and training versions of models
Tools for systematic optimization of hyperparameter values
Tools for deploying and monitoring a model in a production environment
Top MLOps platforms
Let's look at the list of top MLOps platforms, which can include both global companies providing a variety of platforms and tools, as well as exclusively MLOps companies and MLOps startups.
One of the most popular MLOps platforms for building, training, deploying, and managing ML models and great for any level solution, especially good for enterprise. The special thing is that it works great with AWS; if you're already familiar with it, you'll be twice as comfortable.
Also, one of the biggest MLOps platforms for building, training, deploying, and managing ML models and is great for businesses of all sizes.
Supports model and experiment versioning
Supports hyperparameter tuning
Supports model deployment and monitoring
Great for using Azure services
Supports TensorFlow, Scikit-learn, PyTorch, Microsoft Cognitive Toolkit
Strong focus on Azure services
Supports fewer frameworks, no support for Keras
May be more expensive than Amazon's solution
Google Cloud AI Platform
That is great all-in-one platform for data engineers, data scientists, machine learning engineers, and so on. It's easy enough to use and would also be great for businesses of all sizes.
Supports model versioning and experimentation
Supports hyperparameter tuning
Supports model deployment and monitoring
Supports TensorFlow, Scikit-learn, Keras
Easy to use
Does not support Anomaly Detection and Ranking
Supports few frameworks
A full-fledged MLOps open source platform greatly simplifies several machine learning stages, including training, pipeline development, and Jupyter laptop maintenance. Also, Kubeflow offers many specialized services and integration. It is great for businesses of all sizes, especially for small and medium ones.
Another extremely handy MLOps open-source platform is designed to let you quickly and efficiently do experiments and work with machine learning libraries, algorithms, and deployment tools. Perfect for small and medium-sized businesses.
Fast end efficient experiments
Supports tracking and versioning models
Supports Python, Java, R
More specialized tool
Main focus on storing and organizing models, running experiments
It is one of the most powerful MLOps platforms for research and delivering models quickly, securely, and effectively.
Allows for very high quality monitoring of the model
These platforms are listed in descending order of range of functionality and number of tools. Still, it's essential to understand that those at the bottom focus more on specific stages of the model lifecycle, which they handle just fine.
In fact, looking at what MLOps platforms are and what feature set they can provide and what features they have is a vast separate topic. If you'd like us to expand on this list and do a detailed breakdown and comparison of them, share it in the comments.
Who needs MLOps
The market for Machine Learning Model Management and Operations (MLOps) is estimated to grow over 10x (from $350 million in 2019 to $4 billion by 2025).
In fact, 10% of enterprises now use 10 or more AI applications. Plus, 73% of all CEOs and CHROs in the US plan to use more AI in the next 3 years.
And for a good reason – according to Salesforce Research, 69% of IT leaders believe ML is transforming their business.
Of course, the numbers are constantly changing as the desire to implement ML for various reasons in different industries and companies keeps growing. However, anyone who wants to develop or implement ML into their company or products in a way that brings value have to also implement quality MLOps. Let's take a closer look at examples of industries and companies in them.
Of course, here it's the first use because, within the IT industry, it's much easier to implement and realize than in others. Also, IT companies accumulate a crazy amount of data, which gives them a tremendous opportunity to build and train their own models.
If divided by the percentage, about 75% of companies implement MLOps for service operations, 45% for product or service development, 38% for marketing and sales, 26% for supply-chain management, 22% for manufacturing, 23% for risk predictions, 17% for increasing human capacity, and 17% for strategy and corporate finance.
Sharper Shape is a US IT company that used the Valohai platform for implementing MLOps.
Automation of infrastructure and experiment management tasks that takes a third of data scientists' time
New data scientists can be onboarded in a quarter of the time.
Another important area is E-Commerce because the large flow of users and the ever-expanding set of services require more complex and accurate machine learning models, thus good MLOps.
You can see this statistic that about 23% of companies implement MLOps for service operations, 13% for product or service development, 52% for marketing and sales, 38% for supply-chain management, 7% for manufacturing, 9% for risk predictions, and 8% for increasing human capacity.
Booking.com, a company everyone knows, has implemented MLOps with great results.
Ability to scale AI with 150 customer-facing ML models
We also have great examples from finance, where there has always been a huge flow of users and transactions, the accuracy of which has always had to remain high. As financial services become more functional, they also need quality MLOps to manage the increasing number of models that apply there.
Statistics say that 49% of companies implement MLOps for service operations, 26% for product or service development, 33% for marketing and sales, 7% for supply-chain management, 6% for manufacturing, 40% for risk predictions, 9% for increasing human capacity, and 14% for strategy and corporate finance.
Another famous company Payoneer implemented MLOps using Iguazio.
Built a scalable and reliable fraud prediction and prevention model that analyzes fresh data in real-time and adapts to new threats
Another incredibly important area where machine learning is simply necessary, and sometimes it's amazing how it was handled before it came along, is Insurance. Forecasting events and building solutions with maximum accuracy directly determine the profit of this industry. There are some interesting cases too.
Topdanmark, a large European insurance company, has also implemented MLOps for a number of their machine learning models
It saves us significant time previously spent on maintenance and investigation
Allows us to track model performance in real-time and compare it to our expectations
Automatically detected drift that previously would have taken months to detect
Manufacturing is one area where machine learning is not just making more money but also solving previously unsolvable problems of incredible complexity. So with the increasing development of ML, manufacturing is adopting it and it requires more quality MLOps, which provides huge benefits.
Oyak, a Turkish cement manufacturing company, implemented MLOps using the DataRobot platform and got great results.
Increased alternative fuel usage by 7 times
Cut 2% of total CO2 emissions
Reduced costs by $39 million
Traditionally, the transportation industry is very complex, as it involves many variables, most of which are incredibly difficult to analyze and predict. Of course, this is also an industry where ML not only increases profits but also solves a huge number of previously unsolvable problems. As model adoption in this industry grows, so does the models' complexity and asks for the adoption of quality MLOps.
Statistics say that 51% of companies implement MLOps for service operations, 34% for product or service development, 34% for marketing and sales, 18% for supply-chain management, 4% for manufacturing, 4% for risk predictions, 2% for increasing human capacity, and 3% for strategy and corporate finance.
The famous US company Uber uses a lot of ML models to ensure quality and profitability. But since their investment opportunities are great, they decided to develop everything from scratch, which was apparently a good decision.
Developed their own ML platform, Michelangelo
From zero to hundreds of ML products in three years, thanks to MLOps practices.
Another vital industry where ML is solving problems that until recently were beyond human capabilities. The excellent results of ML implementation are so telling there that its adoption is growing rapidly, and hence the demand for quality MLOps implementation is also growing.
Statistics show that 46% of companies implement MLOps for service operations, 48% for product or service development, 17% for marketing and sales, 21% for supply-chain management, 9% for manufacturing, 19% for risk predictions, 18% for increasing human capacity, and 13% for strategy and corporate finance.
Philips, a famous Dutch company, is probably best known to many people as a manufacturer of home electronics, but it's also a very big medical equipment provider. The company has many areas in different industries, and making it all work properly is extremely difficult. So they've been actively implementing ML and MLOps using the ClearML platform, which has yielded good results.
Hours saved through streamlined experiment tracking and automatic documentation
Steward Health Care US is a company that has implemented MLOps using the DataRobot platform and has shared some amazing results in numbers, which gives an extremely clear indication of how valuable a truly quality MLOps is.
$2 million/year in savings from nurses' hours paid per patient day
$10 million/year savings from reducing patient length of stay
And another US company called Theator, which has implemented MLOps through ClearML, also showed staggering numbers, once again proving the earlier point.
$130K-$170K annual savings directly related to MLOps
Chemical and pharm
When we talk about big things, they can seem infinitely complicated. But sometimes, the small things and dealing with them can be many times more complicated, and that's the case with the chemical and pharmaceutical industry. This is one industry where ML not only allows you to make a lot more money or even solve unsolvable problems but has already managed to change the industry fundamentally.
For so many companies, developing new chemical components and drugs is not a long and costly manual job with a lot of trial and error in the real world. Now it's modeling the behavior of complex molecules and their properties with more advanced machine learning models, for which maximum quality MLOps are vital.
Statistics tell us that 31% of companies implement MLOps for service operations, 31% for product or service development, 27% for marketing and sales, 13% for supply-chain management, 28% for manufacturing, 3% for risk predictions, 6% for increasing human capacity, and 4% for strategy and corporate finance.
For example, US company Ecolab has implemented MLOps using the Iguazio platform to improve its models. The results are amazing, especially for such an important industry.
Decreased model deployment times from 12 months to 30-90 days.
How to implement MLOps
Of course, it's not that simple. Artificial intelligence and machine learning, in particular, is not magic pill that will solve all problems. Implementing it requires a lot of effort and some investment. James Manyika and David Schwartz's talk on the McKinsey Podcast develops this theme nicely.
But avoiding AI and machine learning altogether can be a big problem for so many companies to ensure their future success and ability to compete in the market, and for some industries, it can be a stumbling block that can bring growth to a halt.
So let's look at the general process of MLOps implementation relevant to most companies and industries based on Mad Devs' experience.
By the way, we have an excellent book on software development processes and practices in principle, where we've detailed all of our experiences that we've been building up and improving over the years. And we share it all with you. Enjoy reading!
Data collection and processing
The model starts with the data, not the model itself. A team of Data Scientists and Data Engineers must first collect and analyze the business data to most accurately identify the problem, choose the algorithms to solve it and pre-process the data properly.
Data Cleansing is the process of pre-processing data for errors, inaccuracies, and heterogeneities in it. This process is essential because the quality of the data will directly affect the quality of the model.
Next, missing data is collected, data with errors are corrected, and all of this is entered into the database.
Then the data is further processed and structured into something that can be operated efficiently.
The data are processed, analyzed, prepared for presentation to stakeholders and approved by them.
Then a bit of magic starts, where they take the features from the subject data area and transform them into vectors that are useful for the future machine learning model.
Then we split up different data groups into different features, depending on the complexity and layering of the future model.
Only when we have the data we need, processed and transformed to the right form and divided by the necessary attributes we begin to build a model that we will train on them.
Building a model
Here the model is written, but the data are not yet applied. The architecture is defined, including the most appropriate algorithms defined in the beginning, and the number of layers and weights are specified.
Here, a main set of data is trained in the model, hyperparameters are analyzed and tuned to increase accuracy and efficiency.
Here the model is tested in detail for various parameters. The model results are shown to the customer if they satisfy the team. If it performs the tasks they want with the accuracy they want, the team receives approval, and it moves on to the next stage.
Training at scale
Here the model is trained on the entire data set prepared earlier, maximizing its capabilities. Also, the model is tested here, and if its performance has not deteriorated, it is prepared for implementation.
Now for what we are all here to do - bringing the model into the production environment and evaluating it in real-world conditions.
This is where we deploy the model into a production environment. This could be a Cloud or Edge infrastructure, depending on your business.
Next, we use the model in a production environment where it continuously receives and processes new data.
While it's running, we monitor its performance to what extent it's improving or enhancing relative to its past iterations.
Fintune & improvement
Here, we can tweak its scales to improve its performance and make more fundamental changes in its structure.
It is important to note that the latter steps are mostly possible only with CI/CD and CT pipelines, without which real-time model training, automatic testing and updates, data and model versioning, and so on are not entirely possible. However, companies can choose different scenarios and levels of MLOps implementation for one reason or another. Let's consider them in a little more detail.
MLOps 0 Level
This choice is for companies that are just starting to introduce ML and are not yet sure how much they need it and how to use it. Also, if the company is unsure how often they will update the model.
All processes are done manually, from data collection and processing to model development and deployment.
The model training and operations teams don't work as closely. The work is done sequentially on model building, training, and testing, others on model deployment to their infrastructure via APIs.
Rare iterations and limited versioning because the model is rarely updated and modified.
Lack of CI/CD and CT, since the model is infrequently tested, retrained, and redeployed, the automation of such processes is redundant.
Weak monitoring of the model since it does not require the most comprehensive collection of model metrics for improvement.
Models often break during deployment, fail in some cases, and can quickly degrade.
MLOps Level 1
The primary goal here is to introduce the CT model by automating the ML pipeline. In this way, predicting model behavior and improving it becomes a challenge.
The speed of experiments increases, and training is faster because it is done automatically.
Thanks to CT, models are trained in production, obtaining, and pre-processing new data.
Modular components and pipelines are suitable for reuse and reconfiguration.
Continuous model delivery allows you to deliver models always tested and trained on fresh data to the service.
You deploy not just the model itself in a working environment but the entire training pipeline that will serve to run and train the model.
Data validation is automatically set up with appropriate pipelines.
The feature repository is implemented as a separate repository that you provide access to for the pipeline.
Metadata management is also automated to help with data and artifact origin, reproducibility, and comparison.
Configured machine learning triggers to run automatic learning all the time, or on a schedule, or from new data arrivals or performance degradation.
This works great when you only have one model pipeline, but if you have multiple pipelines and models, you already need a CI/CD to deploy them. So we move on to the next level.
MLOps Level 2
Here we look at the complete implementation of MLOps if you have multiple data pipelines and models with entirely different algorithms that are frequently updated and come into production.
Quick and efficient automated development and experimentation with new algorithms and building new models.
Continuous Integration allows you to build pipeline components that can be reused later.
Continuous Delivery lets you automate the deployment of artifacts and entire pipelines with new model iterations.
Real-time monitoring allows you to collect various data about the performance of models and run pipelines or experiments in the production environment, depending on the set triggers.
How much cost to implement MLOps?
It's worth noting right away here that the numbers can be as individual as possible. It all depends on the industry, company, model requirements, choice of MLOps levels, and choice of infrastructure. For example, the website phdata.io provides very clear statistics about it.
But choosing a good vendor that will provide the right processes and use the most profitable framework can greatly reduce the cost of the final models and greatly increase the quality and speed of development.
Quality MLOps platform costs serious money, some might think it's easier to develop your own infrastructure. It may be a good idea if your resource pool is large and your own infrastructure will allow the creation of specific models generating huge profits in your case.
However, if you look at most cases, although building MLOps infrastructure from existing solutions requires a large investment, in the beginning, it greatly reduces the cost of building new models and maintaining existing ones in the long run.
It is clear that machine learning is with us for the long term. It is showing excellent results, receiving huge investments, and even special hardware is being developed for such tasks, which is now installed even in mobile chips. All this leads to the fact that artificial intelligence and machine learning will solve even more problems going deeper and deeper, which means its improvement requires the right practices like MLOps.
We hope you've received enough information to think about implementing promising technologies that provide a place in the future market and your company and products. If you still have questions, you can always contact us for a free consultation, and we'll look at the best options specifically for your case.