Role Overview
We are looking for a Lead Cloud Operations Engineer to join our growing team supporting key
supply-side technology platforms, including Atlas Integration, GMX, Hotel APIs, and related
microservices in Azure. This is a high-impact technical leadership role focused on Azure cloud
operations, monitoring, performance, security, and incident resolution.
You will be responsible for ensuring the availability, scalability, and reliability of cloud-hosted
systems, mentoring a small operations team, and collaborating with developers, architects, and
business stakeholders to drive continuous improvement.
Key Responsibilities
Own day-to-day operations and health of production and pre-production environments hosted
in Azure.
Monitor infrastructure and applications using Azure Monitor, Application Insights, and
Grafana.
Lead the team in proactive incident detection, triage, resolution, and post-incident
reviews (RCA, documentation).
Implement and enhance automation for common operational tasks using PowerShell,
Python, Azure CLI, and Terraform/Ansible.
Act as escalation point for complex issues and high-severity incidents.
Create, improve, and maintain runbooks, dashboards, alerts, and performance tuning
metrics.
Collaborate with development and DevOps teams to ensure operational readiness,
deployment hygiene, and system resilience.
Maintain strong governance around Azure resources, RBAC, policy enforcement, and
tagging strategy.
Lead disaster recovery planning, testing, and execution across critical systems.
Drive cost optimization initiatives using Azure Cost Management and FinOps
principles.
Ensure compliance with security policies (ISO 27001, GDPR, SOC 2) and assist in audits
or security reviews.
Support team mentoring and training, and promote a strong culture of ownership and
accountability.
Required Skills & Experience
Azure IaaS: Virtual Machines, Scale Sets, Load Balancer, Disks, Networking (VNets,
NSGs, UDRs, Private Link, Service Endpoints)
Azure PaaS: App Services, Azure Functions, Logic Apps, Key Vault, Event Grid, Azure
SQL, Application Gateway, Azure Front Door, Traffic Manager
Azure Kubernetes Service (AKS) deployment, scaling, security & troubleshooting
Azure Site Recovery (ASR), Azure Backup, and Disaster Recovery architecture
Deep understanding of Azure Monitor, Application Insights, Log Analytics
Ability to write and optimize KQL queries for diagnostics and dashboards
Experience with Grafana, Prometheus, and alerting pipelines
Hands-on experience with Terraform, Ansible, ARM templates
Proficiency in scripting with PowerShell, Bash, and/or Python
Experience with Azure DevOps Pipelines or similar CI/CD tooling is a plus
RBAC, Managed Identities, Conditional Access, Key Vault integration
Awareness of ISO 27001, SOC 2, and GDPR requirements in cloud environments
Proven experience leading 2-4 engineers (including juniors/mid-levels)
Strong verbal and written communication skills; able to interact with technical and
non-technical stakeholders
Experience participating in on-call rotations, owning major incidents, and delivering
RCA reports
Ability to train, mentor, and guide junior engineers
Collaborative mindset with a strong sense of accountability and urgency
Nice to Have
Experience with multi-cloud (AWS or GCP) environments and hybrid cloud networking
Experience working with microservices-based systems and APIs
Exposure to FinOps practices and cloud cost management tools
Certifications: AZ-305, AZ-104, AZ-500, AZ-700, AZ-400 preferred
Key Skills: Cloud Operations, Terraform, Azure Cloud, Grafana, Python, PowerShell, Shell Scripting, Ansible, Prometheus, Bash, Kubernetes