MLOps Support Team Lead at CloudFactory May, 2026

Posted: May 18, 2026

Deadline: Not specified
CloudFactory is changing the way the world works by providing an on-demand, digital workforce for scaling critical business processes in the cloud. We’re also on a mission to create meaningful work for as many people as possible.

MLOps Support Team Lead
- Job Type Full Time
- Qualification BA/BSc/HND
- Experience
- Location Nairobi
- Job Field Data, Business Analysis and AI, ICT / Computer
Role Summary
- As the MLOps Operations Lead, you will own the day-to-day reliability, supportability, and operational maturity of CloudFactory's MLOps service. You will lead a global support team responsible for monitoring, triaging, and resolving issues across production ML systems, while driving improvements in observability, incident management, and service delivery.
- You will work closely with Engineering, Platform Ops, and external partners to ensure AI/ML solutions are not only functional, but stable, measurable, and trusted in production. This role is critical in transitioning MLOps from reactive support to a proactive, scalable service capability.
Responsibilities
Service Ownership & Reliability
- Own the operational performance of all production ML systems and pipelines
- Ensure reliability, availability, and supportability across client and internal MLOps workloads
- Establish and enforce SLAs, SLOs, and operational standards
- Act as the escalation point for major incidents and service degradation
Team Leadership & Delivery
- Lead a global MLOps Support team (L1/L2) across regions (Colombia, Kenya, Nepal)
- Define shift patterns, on-call rotations, and coverage models
- Set clear expectations, performance metrics, and development plans
- Foster a strong operational culture focused on accountability and continuous improvement
Incident Management & RCA
- Own incident response processes, including triage, communication, and resolution
- Ensure high-quality Root Cause Analysis (RCA) and follow-through on corrective actions
- Drive reduction in repeat incidents through structured problem management
- Improve time to detect (TTD) and time to resolve (TTR) metrics
Monitoring, Observability & MLOps Maturity

- Drive implementation and evolution of monitoring across:
  - pipelines and data flows
  - infrastructure and compute
  - model performance and drift
- Ensure visibility extends beyond system health to model accuracy, bias, and data integrity
- Partner with Engineering to improve instrumentation, logging, and alerting
Support Model & Process Design
- Define and evolve the MLOps support operating model
- Clearly establish boundaries between Support, Engineering, and external partners
- Build and maintain runbooks, playbooks, and escalation paths
- Standardize intake, triage, and resolution workflows (e.g. Slack, ticketing systems)
Stakeholder & Partner Management

- Act as the primary operational interface for:
  - Engineering teams
  - Platform Operations
  - External partners
- Reduce reliance on individuals by formalizing ownership and knowledge sharing
- Provide clear communication during incidents and service updates
Continuous Improvement & Scaling
- Identify trends in incidents and operational inefficiencies
- Drive improvements in:
  - automation
  - alert quality
  - self-healing capabilities
- Support onboarding of new MLOps projects into a standardized support model
- Contribute to building MLOps as a scalable, repeatable service offering
Reporting & Service Health

- Define and track key operational metrics:
  - incident volume and severity
  - SLA adherence
  - system uptime and reliability
- Support regular service reviews and model health reporting
- Provide leadership visibility into risks, trends, and improvement areas
Requirements
Must Have Skills (Required)
- Proven experience in operations leadership, SRE, DevOps, or platform support environments
- Strong understanding of production support models, incident management, and escalation frameworks
- Experience leading or mentoring technical support or operations teams
- Working knowledge of ML systems in production, including:
  - pipelines and batch processing
  - model lifecycle and deployment
  - common failure modes
- Strong analytical and troubleshooting skills in complex environments
- Experience with monitoring and observability tools
- Proficiency in:
  - SQL
  - Python or scripting (Bash)
- Ability to operate in a high-pressure, incident-driven environment while maintaining structure and clarity
- Strong stakeholder management and communication skills
Nice To Have Skills (Preferred)
- Experience supporting AI/ML platforms at scale
- Familiarity with tools such as:
  - Databricks
  - MLflow
  - Grafana
  - Power BI
  - New Relic
- Exposure to model monitoring (drift, bias, performance validation)
- Experience working with external partners or vendors in delivery models
- Understanding of cloud platforms (AWS, GCP, Azure)
- Experience with containerized environments (Docker / Kubernetes)
- Background in building or scaling support functions from early-stage to maturity
General Requirements
- Strong service ownership mindset — takes accountability for outcomes, not just activity
- Calm, structured, and decisive during incidents
- Ability to balance operational delivery with strategic improvement
- Passion for building reliable, trustworthy AI/ML systems
- Highly collaborative across Engineering, Platform, and Delivery teams
- Focus on reducing risk related to:
  - model performance
  - bias
  - data integrity
- Commitment to documentation, knowledge sharing, and eliminating single points of failure

Method of Application

Interested and qualified? Go to CloudFactory on www.linkedin.com to apply
