Drive the flow-level observability strategy, including instrumentation and operations, to enhance detection and mitigation capabilities
Drive initiatives independently to fix root causes identified from repeat issues observed across monitoring platforms; challenge the status quo and follow through to completion
Build proactive alerting and real-time monitoring tools to identify issues early and, in collaboration with Product Engineering teams, resolve them in a timely manner
Develop observability standards and frameworks for new product readiness to ensure service reliability in SOA and distributed systems
Build domain expertise to achieve scalability by understanding the nuances of Payments across processing, compliance, and infrastructure
Drive large-scale migration and adoption projects for observability and reliability by collaborating with various Payments teams
Collaborate with a large set of stakeholders across engineering, infrastructure, and operations teams to align on and implement foundational operational programs
Automate alert configuration across various observability tools (e.g., Watchpoint, Kibana, Datadog) that work across signals: metrics, logs, and traces
Bring ideas to life (i.e., into production) to make the lives of engineers better
Partner with the broader Airbnb organization to learn from incidents through a blameless postmortem process
Automate as much as humanly possible and always configure as code
About You:
8+ years of technical experience, with 5+ years of relevant industry experience in a fast-paced tech environment
Experience in building and implementing Observability/SRE, along with expertise in building availability and reliability tools in a similar environment
Experience in driving end-to-end SRE initiatives (L2/L3) and improving observability and reliability, preferably in the payments space
You have strong working knowledge of observability tools (e.g., Datadog, OpenTelemetry) and SRE practices
Experience in application and tool development (Java, .NET, Python) in microservices environments. Previous experience in AI/ML is a plus.
Experience with initiatives across auto-scaling, self-healing mechanisms, chaos engineering, and performance optimization techniques is a plus
You have excellent communication skills and the ability to work well within a team and with teams across time zones
You are a strong problem solver and have worked in a team that is on-call for production systems before
Technical leadership: hands-on experience leading project teams and setting technical direction and strategy
You are passionate about efficiency, availability, technical quality and system quality
Job Classification
Industry: Travel & Tourism
Functional Area / Department: Engineering - Software & QA
Role Category: DevOps
Role: Site Reliability Engineer
Employment Type: Full time