Provide oversight and strategic coordination of end-to-end service delivery across critical platforms and systems.
Proactively identify service trends, recurring issues, and systemic failures, and lead efforts to drive permanent resolutions.
Lead root cause analysis (RCA) and post-incident reviews with stakeholders, identifying patterns and continuous improvement opportunities.
Mentor and guide junior team members in incident and problem resolution techniques, ensuring knowledge transfer and skills development.
Act as the primary escalation point for complex incidents, owning resolution and customer communication at the senior level.
Drive continuous improvement across monitoring, automation, and system reliability to reduce operational noise and increase system resiliency.
Lead incident bridges and engage with engineering teams and senior stakeholders to ensure timely resolution and high-quality communications.
Champion best practices in service management including SLAs/OLAs, change management, and problem management processes.
Contribute to tooling strategy and capability enhancements for observability, incident management, and analytics.
Own key relationships with cross-functional partners including DevOps, Cloud Engineering, and Product teams to ensure operational readiness and service alignment.
Represent the team in technical leadership forums and contribute to operational strategy and planning.
Ensure consistent shift readiness by reviewing and refining runbooks, escalation paths, and shift documentation.
Promote a culture of quality by embedding service excellence in operational procedures, ensuring processes are optimized for consistency, performance, and reliability.
Measure and track key quality indicators and ensure feedback loops are in place for ongoing improvement.
Required Knowledge, Skills, and Qualities
Bachelor's degree or equivalent experience with 6 to 9 years in IT operations, site reliability, or service delivery within enterprise or SaaS environments.
Deep understanding of Cloud architectures (Microsoft Azure, AWS, or GCP), infrastructure monitoring, and incident response.
Demonstrated experience managing incidents in high-availability, high-throughput, mission-critical environments.
Strong technical background with ability to lead troubleshooting across infrastructure, networking, application, and platform services.
Advanced knowledge of monitoring, alerting, and observability tools (e.g., Grafana, Opsgenie, Datadog, Prometheus).
Expert-level understanding of ITIL processes, particularly Incident, Problem, and Change Management.
Experience conducting technical postmortems, producing RCA reports, and implementing service improvement plans.
Proven ability to influence and collaborate with cross-functional technical teams and senior management.
Strong leadership presence during high-impact events; comfortable leading conversations with engineering leaders and executive stakeholders.
Demonstrated mentoring and coaching experience; ability to develop junior engineers and promote a culture of operational excellence.
Strong focus on quality assurance within service delivery, with a commitment to maintaining high standards in documentation, execution, and outcomes.
Excellent verbal and written communication skills with the ability to tailor messages to technical and non-technical audiences.
Adaptability to evolving technologies and a strong drive to automate and improve existing processes.
Willingness to participate in on-call rotation and provide senior-level support during critical incidents.
EQUAL OPPORTUNITY EMPLOYER
All prospective and current employees must remain vigilant in following security policies in the workplace. This includes:
- Following workplace security protocols and training programs to become familiar with ways to maintain a safe workplace.
- Following security procedures to report any suspicious activity.
- Respecting corporate security procedures so that those procedures remain effective.
- Adhering to the company's compliance requirements and regulations.
- Supporting a zero-tolerance policy for workplace violence.
- Basic knowledge of information security and data privacy requirements (e.g., how to protect and handle data).
- Demonstrated knowledge of information security through internal training programs.
Job Classification
Industry: IT Services & Consulting
Functional Area / Department: Engineering - Software & QA
Role Category: Software Development
Role: Practice Manager / Head
Employment Type: Full time