Evaluate and ensure availability of components within their teams and identify how to bring all services within SLO (99.XX).
Monitor systems for implemented automation and set SLIs/SLOs together with the respective stakeholders.
Implement an observability platform.
Review all ownership data and ensure it is current and complete.
Review the volume and accuracy of bugs assigned to the team and identify opportunities to improve automated triage.
Identify CFBT (Customer Flow Based Testing) eligible flows, develop CFBT tests, and train the team on how to write and maintain them.
Lead postmortems for any P1 or greater incidents during the rotation.
Train the team on the distributed problem management process.
Provide operations and design consultation to drive high reliability.
Handle emergency incident response with action-oriented postmortems, RCAs, and incident debriefs.
Drive continuous improvement through toil reduction and automation.
Analyze application performance and availability.

Technology/Programming Language:
React, Node.js, Java, JS, Python, Angular, TypeScript, HTML5
Event Streaming (Kafka)
Shell scripting / PowerShell

Hosting/Technical Environment:
AWS Technologies
Kubernetes
Docker Containers
CI/CD, Jenkins pipelines
Basics of content delivery networks (CDN and caching concepts)
Artifactory / Container Registry
Web Server Gateways
REST API / API Endpoints
Employment Category:
Employment Type: Full time
Industry: IT - Software
Role Category: General / Other Software, Site Engineering / Project Management
Functional Area: Not Applicable
Role/Responsibilities: Site Reliability Engineer