Machine Learning Ops (MLOps) describes a suite of best practices that help a business run artificial intelligence successfully. It encompasses the skills, workflows, and processes needed to create, run, and maintain machine learning models that support operational processes across an organization.
| Aspect | Explanation |
|---|---|
| Concept Overview | MLOps (Machine Learning Operations) is a set of practices and techniques that aim to operationalize and automate the end-to-end machine learning lifecycle. It combines aspects of machine learning, software engineering, and DevOps to streamline the development, deployment, monitoring, and maintenance of machine learning models in production. MLOps ensures that machine learning projects are scalable, maintainable, and reliable. |
| Key Principles | MLOps is guided by several key principles: 1. Automation: Automate repetitive tasks in the machine learning pipeline, such as data preprocessing, model training, and deployment. 2. Collaboration: Foster collaboration between data scientists, engineers, and operations teams to ensure smooth integration of ML models into production. 3. Reproducibility: Maintain a record of all experiments, code, and data to enable model reproducibility and version control. 4. Continuous Integration and Deployment (CI/CD): Implement CI/CD pipelines for model testing and deployment. 5. Monitoring and Governance: Continuously monitor model performance, drift, and compliance with regulations. 6. Scalability: Design systems that can handle increased workloads as machine learning adoption grows. |
| MLOps Lifecycle | The MLOps lifecycle typically involves the following stages: 1. Data Acquisition and Preparation: Collect and preprocess data for model training. 2. Model Development: Build and experiment with machine learning models. 3. Model Training: Train models on the prepared data. 4. Model Evaluation: Assess model performance and select the best model. 5. Model Deployment: Deploy the selected model to a production environment. 6. Model Monitoring and Maintenance: Continuously monitor model performance and retrain as needed. 7. Model Governance and Compliance: Ensure models comply with ethical, legal, and regulatory requirements. |
| Benefits | Implementing MLOps offers several benefits: 1. Increased Efficiency: Streamlined processes reduce development and deployment times. 2. Improved Model Reliability: Continuous monitoring and automated testing catch issues early. 3. Collaboration: Cross-functional teams collaborate more effectively. 4. Scalability: Scalable systems accommodate growing machine learning workloads. 5. Reproducibility: Easy replication of experiments and models. 6. Cost-Efficiency: Reduced manual intervention lowers operational costs. |
| Challenges and Risks | Challenges in adopting MLOps include the complexity of integrating machine learning with existing infrastructure, the need for specialized skills, and the potential for data privacy and ethical concerns, particularly in AI applications. |
| Applications | MLOps is primarily applied in fields where machine learning models are central, including predictive analytics, natural language processing, computer vision, and recommendation systems. It is used in industries such as finance, healthcare, e-commerce, and manufacturing. |
| Tools and Technologies | Various tools and technologies support MLOps, including Docker and Kubernetes for containerization and orchestration, Jenkins, GitLab CI/CD, and Travis CI for CI/CD pipelines, and specialized MLOps platforms like MLflow, Kubeflow, and TFX (TensorFlow Extended). |
Understanding Machine Learning Ops
Machine Learning Ops is a relatively new concept because the commercial application of artificial intelligence (AI) is itself still emerging.
Indeed, AI burst onto the scene less than a decade ago, when researchers used it to win a major image-recognition contest.
Since then, artificial intelligence has been applied to:
- Translating websites into different languages.
- Calculating credit risk for mortgage or loan applications.
- Re-routing customer service calls to the appropriate department.
- Assisting hospital staff in analyzing X-rays.
- Streamlining supermarket logistics and supply chain operations.
- Automating the generation of text for customer support, SEO, and copywriting.
As AI becomes more ubiquitous, the machine learning that powers it must keep pace. MLOps was created in response to businesses' need for a well-developed machine learning framework.
Based on DevOps practices, MLOps seeks to address a fundamental disconnect between carefully crafted code and unpredictable real-world data. This disconnect can lead to issues such as slow or inconsistent deployment, low reproducibility, and a reduction in performance.
The four guiding principles of Machine Learning Ops
As noted, MLOps is not a single technical solution but a suite of best practices, or guiding principles.
Following is a look at each in no particular order:
Machine learning should be reproducible
That is, teams must be able to audit, verify, and reproduce every production model.
Version control for code is standard in software development, but in machine learning, data, parameters, and metadata must all be versioned as well.
Storing model training artifacts also ensures that a model can be reproduced if required.
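A minimal sketch of this versioning idea, using only the standard library. Dedicated tools such as MLflow or DVC handle this properly; the `build_manifest` helper below is purely illustrative:

```python
import hashlib
import json

def fingerprint(payload: bytes) -> str:
    """Content hash used to version a data snapshot or artifact."""
    return hashlib.sha256(payload).hexdigest()[:12]

def build_manifest(data: bytes, params: dict, metrics: dict) -> dict:
    """Record everything needed to reproduce a training run:
    the exact data version, hyperparameters, and results."""
    return {
        "data_version": fingerprint(data),
        "params": params,
        "metrics": metrics,
        # Hashing the config too makes accidental parameter drift detectable.
        "config_version": fingerprint(json.dumps(params, sort_keys=True).encode()),
    }

manifest = build_manifest(
    data=b"raw training data snapshot",
    params={"learning_rate": 0.01, "n_estimators": 200},
    metrics={"rmse": 4.2},
)
print(json.dumps(manifest, indent=2))
```

Because the fingerprints are deterministic, two runs over the same data and parameters yield identical manifests, which is exactly what makes an audit possible.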
Machine learning should be collaborative
MLOps advocates that machine learning model production be visible and collaborative.
Everything from data extraction to model deployment should be approached by transforming tacit knowledge into code.
Machine learning should be tested and monitored
Since machine learning is an engineering practice, testing and monitoring should not be neglected.
Performance in the context of MLOps incorporates predictive quality as well as technical performance.
Model adherence standards must be set and expected behavior made visible.
The team should not rely on gut feelings.
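As an illustration, adherence standards can be codified as explicit thresholds that a monitoring job checks on every evaluation cycle. The metric names and limits below are invented for the example:

```python
# Codified "model adherence standards": explicit thresholds replace
# gut feelings. The threshold values here are illustrative only.
STANDARDS = {
    "accuracy_min": 0.90,        # predictive performance floor
    "latency_p95_max_ms": 250,   # technical performance ceiling
}

def check_model(metrics: dict) -> list:
    """Return a list of violated standards (empty means healthy)."""
    violations = []
    if metrics["accuracy"] < STANDARDS["accuracy_min"]:
        violations.append("accuracy below floor")
    if metrics["latency_p95_ms"] > STANDARDS["latency_p95_max_ms"]:
        violations.append("p95 latency above ceiling")
    return violations

# An underperforming model is flagged with evidence, not intuition.
print(check_model({"accuracy": 0.87, "latency_p95_ms": 180}))
```

In practice the violation list would feed an alerting system rather than a print statement, but the principle is the same: the standard lives in code, where it is visible to the whole team.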
Machine learning should be continuous
It’s important to realize that a machine learning model is temporary: its lifecycle depends on the use case and on how dynamic the underlying data is.
Since even a fully automated system may degrade over time, machine learning must be treated as a continuous process in which retraining is made as easy as possible.
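One way to make retraining routine is to trigger it from a drift check. The sketch below uses a crude mean-shift test on a single feature; real systems use richer statistics (PSI, KL divergence), and the data here is made up:

```python
from statistics import mean, stdev

def should_retrain(reference, live, z_threshold=3.0):
    """Crude drift check: flag retraining when the live feature mean
    drifts more than z_threshold standard errors from the reference mean."""
    ref_mean, ref_sd = mean(reference), stdev(reference)
    standard_error = ref_sd / (len(live) ** 0.5)
    return abs(mean(live) - ref_mean) > z_threshold * standard_error

# Feature values observed at training time vs. in production (invented).
reference = [10.0, 11.0, 9.5, 10.5, 10.2, 9.8, 10.1, 10.4]
stable = [10.1, 9.9, 10.3, 10.0]     # mean barely moved
drifted = [14.2, 15.1, 13.8, 14.7]   # large shift should trigger retraining

print(should_retrain(reference, stable))
print(should_retrain(reference, drifted))
```

A scheduled job running a check like this can kick off the retraining pipeline automatically, which is what makes the process genuinely continuous rather than ad hoc.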
Implementing MLOps into business operations
In a very broad sense, businesses can implement MLOps by following a few steps:
Step 1 – Recognize stakeholders
MLOps projects are often large, complex, multi-disciplinary initiatives that necessitate the contributions of different stakeholders.
These include obvious stakeholders such as machine learning engineers, data scientists, and DevOps engineers.
However, these projects will also require collaboration and cooperation from IT, management, and data engineers.
Step 2 – Invest in infrastructure
There is a raft of infrastructure products on the market, and not all are created equal.
In deciding which product to adopt, a business should consider:
Reproducibility
The product must make data science knowledge retention easier.
Indeed, ease of reproducibility is governed by data version control and experiment tracking.
Efficiency
Does the product result in time or cost savings? For example, can machine learning remove manual work to increase pipeline capability?
Integrability
Will the product integrate nicely with existing processes or systems?
Step 3 – Automation
Before moving into production, machine learning projects must be split into smaller, more manageable components.
These components must be related but able to be developed separately.
The process of separating a problem into various components forces the product team to follow a joined-up process.
This encourages the formation of a well-defined language between engineers and data scientists, who work collaboratively to create a product capable of updating itself automatically.
This ability is akin to the DevOps practice of continuous integration (CI).
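The component idea can be sketched as a chain of independently developed steps that share a simple interface, so the whole pipeline can be re-run automatically. All step implementations here are toy stand-ins:

```python
# Each component can be owned and developed by a different team member,
# as long as it consumes the previous step's output.

def extract(raw: str) -> list:
    """Parse a raw record into numeric values."""
    return [float(x) for x in raw.split(",")]

def prepare(values: list) -> list:
    """Min-max scale features into [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def train(features: list) -> float:
    """Stand-in 'model fit': here, just the feature mean."""
    return sum(features) / len(features)

PIPELINE = [extract, prepare, train]

def run(raw: str) -> float:
    result = raw
    for step in PIPELINE:  # each step depends only on the previous output
        result = step(result)
    return result

print(run("2,4,6,8"))
```

Because every step has the same call shape, a CI job can re-run the whole chain whenever any single component changes, mirroring continuous integration for code.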
MLOps and AIaaS

MLOps consists of various phases built on top of an AI platform, where models need to be prepared (via data labeling, BigQuery datasets, and Cloud Storage), built, validated, and deployed.
MLOps is a vast world made up of many moving parts.

Indeed, as Google Cloud highlights, before the ML code itself can be operated, considerable effort is spent on “configuration, automation, data collection, data verification, testing and debugging, resource management, model analysis, process and metadata management, serving infrastructure, and monitoring.”
The ML Process
ML models follow several steps. A typical sequence is: Data extraction > Data analysis > Data preparation > Model training > Model evaluation > Model validation > Model serving > Model monitoring.
Machine learning ops examples
Below are a couple of examples of how machine learning ops is applied at companies such as Uber and Booking.com.
Uber

Michelangelo is Uber’s machine learning platform; it standardizes the workflow across teams and improves coordination.
Before Michelangelo was developed, Uber faced difficulties implementing machine learning models because of the vast size of the company and its operations.
While data scientists were developing predictive models, engineers were also creating bespoke, one-off systems that used these models in production.
Ultimately, the impact of machine learning at Uber was limited to whatever scientists and engineers could build in a short timeframe with predominantly open-source tools.
Michelangelo was conceived to provide a system where reliable, uniform and reproducible pipelines could be built for the creation and management of prediction and training data at scale.
Today, the MLOps platform standardizes workflows and processes via an end-to-end system where users can easily build and operate ML systems.
While Michelangelo manages dozens of models across the company for countless use cases, its application to UberEATS is worth a quick mention.

Here, machine learning was incorporated into meal delivery time predictions, restaurant rankings, search rankings, and search autocomplete.
Calculating meal delivery time is seen as particularly complex and involves many moving parts, with Michelangelo using tree regression models to make end-to-end delivery estimates based on multiple current and historical metrics.
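As an illustration only — Michelangelo’s actual models are far richer gradient-boosted tree ensembles over many current and historical signals — a single regression “stump” already captures the core idea of a tree splitting on a feature such as distance. The records below are invented:

```python
# (distance_km, observed_delivery_minutes) — made-up historical data.
records = [
    (1.0, 18), (1.5, 20), (2.0, 24),
    (5.0, 38), (6.0, 41), (7.5, 47),
]

def fit_stump(data, split_km=3.0):
    """Average delivery time on each side of a distance threshold."""
    near = [t for d, t in data if d <= split_km]
    far = [t for d, t in data if d > split_km]
    return {"split": split_km,
            "near": sum(near) / len(near),
            "far": sum(far) / len(far)}

def predict(stump, distance_km):
    return stump["near"] if distance_km <= stump["split"] else stump["far"]

stump = fit_stump(records)
print(predict(stump, 2.2))  # short trip: near-side average
print(predict(stump, 6.5))  # long trip: far-side average
```

A real tree ensemble stacks thousands of such splits over many features, but each individual split works exactly like this stump.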
Booking.com
Booking.com is the largest online travel agent website in the world with users able to search for millions of different accommodation options.
Like Uber, Booking.com needed a complex machine learning solution that could be deployed at scale.
To understand the company’s predicament, consider a user searching for accommodation in Paris.
At the time of writing, there are over 4,700 establishments – but it would be unrealistic to expect the user to look at all of them.
So how does Booking.com know which options to show?
At a somewhat basic level, machine learning algorithms list hotels based on inputs such as location, review rating, price, and amenities.
The algorithms also consider available data about the user, such as their propensity to book certain types of accommodation and whether or not the trip is for business or pleasure.
More complex machine learning is used to avoid the platform serving up results that consist of similar hotels.
It would be unwise for Booking.com to list 10 3-star Parisian hotels at the same price point on the first page of the results.
To counter this, machine learning incorporates aspects of behavioral economics such as the exploration-exploitation trade-off.
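The exploration-exploitation trade-off is often sketched with an epsilon-greedy policy: mostly show the best-scoring option (exploit), occasionally surface another to learn about it (explore). The hotel names and scores below are made up:

```python
import random

def epsilon_greedy_pick(scores: dict, epsilon: float = 0.1, rng=random) -> str:
    """With probability epsilon, explore a random option;
    otherwise exploit the current best-scoring one."""
    if rng.random() < epsilon:
        return rng.choice(list(scores))    # explore: any hotel
    return max(scores, key=scores.get)     # exploit: best known hotel

hotel_scores = {"Hotel A": 0.82, "Hotel B": 0.79, "B&B Tour Eiffel": 0.61}
rng = random.Random(0)  # seeded for a repeatable demonstration
picks = [epsilon_greedy_pick(hotel_scores, 0.2, rng) for _ in range(10)]
print(picks)  # exploit steps return "Hotel A"; exploration surfaces others
```

Production ranking systems use more sophisticated bandit and learning-to-rank methods, but the tension is the same: pure exploitation would fill the first page with near-identical top scorers.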
The algorithm will also collect data on the user as they search for a place to stay.
Perhaps they spend more time looking at family-friendly hotels with a swimming pool, or maybe they are showing a preference for a bed and breakfast near the Eiffel Tower.
An important but sometimes overlooked aspect of the Booking.com platform is its accommodation owners and hosts.
This user group has its own set of interests, which sometimes conflict with those of holidaymakers and the company itself.
For the company, machine learning will play an increasingly important role in Booking.com’s relationship with its vendors and, by extension, its long-term viability.
Booking.com today is the culmination of 150 successful customer-centric machine learning applications developed by dozens of teams across the company.
These were exposed to hundreds of millions of users and validated via randomized controlled trials.
The company concluded that the iterative, hypothesis-driven process that looked to other disciplines for inspiration was key to the initiative’s success.
Key takeaways
- Machine Learning Ops encompasses a set of best practices that help organizations successfully incorporate artificial intelligence.
- Machine Learning Ops seeks to address a disconnect between carefully written code and unpredictable real-world data. In so doing, MLOps can improve the efficiency of machine learning release cycles.
- Machine Learning Ops implementation can be complex and as a result, relies on input from many different stakeholders. Investing in the right infrastructure and focusing on automation are also crucial.
Machine Learning Ops (MLOps) Highlights:
- Definition: Machine Learning Ops (MLOps) encompasses best practices for effectively running artificial intelligence in business operations, including skills, workflows, and processes to create, manage, and maintain machine learning models.
- Emergence: MLOps is relatively new due to the commercial adoption of artificial intelligence, which emerged less than a decade ago with applications like image recognition.
- Applications: AI is now widely used in various fields including language translation, credit risk assessment, customer service, healthcare, logistics, and more.
- Challenges: MLOps emerged to address challenges like slow deployment, low reproducibility, and inconsistent performance caused by the disconnect between code and real-world data.
- Guiding Principles:
- Reproducibility: Ability to audit, verify, and reproduce production models by versioning data, parameters, and metadata.
- Collaboration: Promotes visibility and collaboration in machine learning model production by transforming tacit knowledge into code.
- Testing and Monitoring: Emphasizes testing, monitoring, and setting model adherence standards for both predictive importance and technical performance.
- Continuous Improvement: Treats machine learning as a continuous process, accommodating retraining as needed.
- Implementation Steps:
- Recognize Stakeholders: Involve various stakeholders including machine learning engineers, data scientists, DevOps engineers, IT, management, and data engineers.
- Invest in Infrastructure: Choose infrastructure products based on reproducibility, efficiency, and integrability.
- Automation: Divide machine learning projects into manageable components, fostering collaboration and enabling automatic updates.
- AI as a Service (AIaaS): AI as a Service offers AI functionalities to organizations without requiring in-house AI expertise, utilizing cloud-based platforms (e.g., Amazon AWS, Google Cloud, Microsoft Azure) that provide various services for different use cases.
- MLOps Phases: MLOps involves phases like model preparation, building, validation, and deployment on top of an AI platform.
- Machine Learning in Uber and Booking.com:
- Uber: Michelangelo is Uber’s machine learning platform that standardizes workflows, coordinates efforts, and manages models across various use cases.
- Booking.com: Uses machine learning to recommend accommodations based on factors like location, price, and user behavior, improving user experience and host relationships.
- Key Takeaway: MLOps is a set of practices that ensures efficient integration of artificial intelligence into business operations. It addresses challenges related to deploying, managing, and improving machine learning models, enabling organizations to benefit from AI effectively.
| Related Concepts | Description | When to Apply |
|---|---|---|
| MLOps (Machine Learning Operations) | MLOps (Machine Learning Operations) is a set of practices and principles that aim to streamline and automate the end-to-end lifecycle of machine learning (ML) models, from development and training to deployment and monitoring in production environments. MLOps integrates software engineering, data engineering, and DevOps methodologies to enable scalable, reliable, and reproducible ML workflows, addressing challenges related to model versioning, reproducibility, scalability, and performance monitoring. MLOps emphasizes collaboration, automation, and continuous improvement across cross-functional teams involved in ML projects, facilitating efficient model development, deployment, and maintenance at scale. | – When operationalizing machine learning models or implementing AI solutions in production environments. – Particularly in situations where there is a need to streamline ML workflows, ensure model scalability and reliability, or improve collaboration between data scientists, engineers, and operations teams. Implementing MLOps practices enables organizations to accelerate ML model deployment, automate model monitoring, and optimize model performance in real-world applications, driving business value and innovation in AI-driven initiatives. |
| DevOps | DevOps is a software development methodology and cultural approach that emphasizes collaboration, automation, and integration between development (Dev) and operations (Ops) teams to improve the speed, quality, and efficiency of software delivery and deployment. DevOps practices include continuous integration (CI), continuous delivery (CD), infrastructure as code (IaC), and automated testing, enabling organizations to automate software development pipelines, deploy changes more frequently, and enhance feedback loops between development and operations teams. DevOps fosters a culture of collaboration, transparency, and continuous improvement, driving agility, innovation, and resilience in software development and deployment processes. | – When streamlining software development and deployment processes to improve agility and reliability. – Particularly in situations where there is a need to accelerate software delivery, enhance collaboration between development and operations teams, or automate infrastructure provisioning and deployment. Adopting DevOps practices enables organizations to optimize software development lifecycles, reduce time-to-market, and improve software quality and reliability in application development, IT operations, and digital transformation initiatives. |
| Continuous Integration (CI) | Continuous Integration (CI) is a software development practice that involves frequently integrating code changes from multiple developers into a shared repository, followed by automated build and testing processes to detect integration errors early and ensure code quality and stability. CI aims to improve collaboration, reduce integration risks, and accelerate software development cycles by automating the process of code integration, compilation, and testing in a controlled environment. CI systems automatically trigger build and test processes whenever code changes are committed to the version control repository, providing rapid feedback to developers and facilitating early detection and resolution of integration issues. | – When automating code integration and testing processes to improve software quality and development velocity. – Particularly in situations where there are multiple developers working on the same codebase or where there is a need to detect integration errors early in the development lifecycle. Implementing Continuous Integration enables organizations to streamline development workflows, enhance collaboration between development teams, and deliver high-quality software with greater speed and efficiency in software development, DevOps, and agile methodologies. |
| Continuous Deployment (CD) | Continuous Deployment (CD) is an extension of continuous integration (CI) that automates the process of deploying code changes to production environments once they pass automated tests and quality checks. CD aims to further accelerate software delivery cycles, reduce manual intervention, and improve deployment reliability and repeatability by automating the deployment pipeline from development to production. CD systems automatically release software changes to production environments after successful testing, enabling organizations to deliver new features, updates, and fixes to users rapidly and reliably. CD emphasizes automation, monitoring, and rollback mechanisms to ensure safe and seamless deployment of changes in production environments. | – When automating software deployment processes to streamline release cycles and improve deployment reliability. – Particularly in situations where there is a need to accelerate time-to-market, reduce deployment errors, or ensure consistent delivery of software changes. Implementing Continuous Deployment enables organizations to automate deployment pipelines, reduce manual intervention, and deliver software updates to production environments quickly and reliably in software development, DevOps, and cloud computing initiatives. |
| Model Deployment | Model Deployment refers to the process of deploying machine learning (ML) models into production environments where they can make predictions or perform tasks based on new data inputs. Model deployment involves packaging trained ML models along with associated dependencies and deploying them to scalable and reliable production infrastructure, such as cloud platforms or containerized environments. Model deployment ensures that ML models are available for inference or usage by end-users or downstream applications, enabling organizations to derive insights, make decisions, or automate tasks based on predictive analytics or machine learning algorithms. Model deployment encompasses considerations such as scalability, reliability, latency, security, and monitoring to ensure that deployed models perform effectively and meet business requirements in real-world applications. | – When operationalizing machine learning models for real-world usage in production environments. – Particularly in situations where there is a need to deploy trained ML models to make predictions, automate tasks, or support decision-making processes. Model deployment enables organizations to leverage ML capabilities, derive actionable insights, and drive business value through predictive analytics, recommendation systems, or intelligent automation in AI-driven initiatives and data-driven applications. |
| Model Monitoring | Model Monitoring is the process of continuously tracking and evaluating the performance, behavior, and quality of deployed machine learning (ML) models in production environments to ensure that they operate as intended and deliver reliable predictions or outputs. Model monitoring involves collecting and analyzing data on model inputs, outputs, predictions, and feedback over time, detecting deviations, drifts, or anomalies in model behavior, and triggering alerts or actions to address issues or maintain model performance. Model monitoring enables organizations to identify and mitigate issues such as concept drift, data quality issues, or model degradation that may affect the accuracy, fairness, or effectiveness of deployed ML models in real-world applications. | – When monitoring the performance and behavior of deployed ML models in production environments. – Particularly in situations where there is a need to ensure that deployed models operate reliably, perform effectively, and deliver accurate predictions or outputs over time. Implementing model monitoring enables organizations to detect and address issues or anomalies in model behavior, maintain model performance, and ensure compliance with business requirements and regulatory standards in AI-driven initiatives and data-driven applications. |
| Model Versioning | Model Versioning is the practice of systematically managing and tracking different versions or iterations of machine learning (ML) models throughout their lifecycle, from development and training to deployment and maintenance. Model versioning involves capturing metadata, code, configurations, and dependencies associated with each model version, enabling reproducibility, traceability, and accountability in ML workflows. Model versioning facilitates collaboration, experimentation, and governance in ML projects by providing a structured framework for managing model changes, comparisons, and deployments across different environments and stakeholders. Model versioning systems integrate with version control tools and platforms to ensure consistency, integrity, and visibility of ML model artifacts and streamline model development, deployment, and maintenance processes. | – When managing changes and iterations of machine learning models throughout their lifecycle. – Particularly in situations where there are multiple iterations or versions of ML models, or where there is a need to track model changes, comparisons, or deployments across different environments or stakeholders. Implementing model versioning enables organizations to maintain model integrity, facilitate collaboration, and ensure reproducibility and traceability in ML workflows, enhancing transparency and governance in AI-driven initiatives and data science projects. |
| Model Governance | Model Governance refers to the policies, processes, and controls established by organizations to ensure the responsible development, deployment, and management of machine learning (ML) models throughout their lifecycle. Model governance encompasses considerations such as ethics, fairness, transparency, accountability, and regulatory compliance in ML projects, addressing risks related to bias, privacy, security, and legal implications. Model governance frameworks define roles and responsibilities, establish guidelines and standards, and implement controls and oversight mechanisms to mitigate risks, foster trust, and ensure the ethical and responsible use of ML technologies in real-world applications. Model governance aims to align ML initiatives with organizational values, regulatory requirements, and societal expectations, promoting transparency, fairness, and accountability in AI-driven decision-making processes. | – When establishing policies and controls to ensure the responsible use and management of machine learning models. – Particularly in situations where there are ethical, legal, or regulatory considerations in ML projects, or where there is a need to mitigate risks related to bias, fairness, or privacy. Implementing model governance enables organizations to establish trust, ensure compliance, and mitigate risks associated with ML technologies, fostering responsible AI adoption and promoting ethical and accountable practices in AI-driven initiatives and data science projects. |
| Model Explainability | Model Explainability is the ability to interpret and understand the decisions and predictions made by machine learning (ML) models, providing insights into the factors, features, or patterns influencing model outputs. Model explainability techniques aim to uncover the internal mechanisms and decision-making processes of ML models, enabling users to interpret model predictions, assess model behavior, and identify factors contributing to model performance. Model explainability enhances transparency, trust, and accountability in ML models by making them more interpretable and understandable to stakeholders, such as domain experts, regulators, or end-users. Explainable AI (XAI) methods include feature importance analysis, model interpretation techniques, and post-hoc explanation approaches, which provide insights into model predictions and enable users to validate, debug, or improve model performance and reliability in real-world applications. | – When interpreting and understanding the decisions and predictions made by machine learning models. – Particularly in situations where there is a need to explain model behavior, assess model reliability, or gain insights into factors influencing model outputs. Implementing model explainability techniques enables organizations to enhance transparency, trust, and accountability in ML models, empowering stakeholders to understand and validate model predictions and decisions in AI-driven initiatives and data science projects. |
| Model Bias and Fairness | Model Bias and Fairness refers to the assessment and mitigation of biases and unfairness in machine learning (ML) models, ensuring that models make predictions or decisions that are equitable, unbiased, and non-discriminatory across different demographic groups or protected characteristics. Model bias and fairness considerations address issues such as algorithmic bias, data bias, and fairness disparities that may result in discriminatory outcomes or perpetuate societal inequalities in ML applications. Techniques for assessing and mitigating model bias and fairness include fairness-aware algorithms, bias detection methods, and fairness interventions, which aim to identify, measure, and mitigate biases in training data, model outputs, and decision-making processes. Model bias and fairness assessments help organizations identify and address ethical and social risks associated with ML models, promote fairness and inclusivity, and ensure equitable outcomes in AI-driven decision-making and automated systems. | – When assessing and mitigating biases and fairness disparities in machine learning models. – Particularly in situations where there are ethical, legal, or social implications of biased or unfair model predictions or decisions. Addressing model bias and fairness enables organizations to promote ethical AI practices, mitigate risks of discrimination, and ensure equitable outcomes in AI-driven initiatives and decision-making processes, fostering trust and inclusivity in AI applications and data-driven systems. |