We are looking for a highly skilled Site Reliability Engineer (SRE) with strong Python development expertise to optimize, automate, and scale our infrastructure. This role requires deep experience in Python coding, DevOps practices, cloud infrastructure, and observability tools. You will work closely with engineering teams to build highly available, scalable, and reliable systems.
Key Responsibilities:
Develop and maintain Python automation tools for deploying, monitoring, and scaling infrastructure.
Build and manage CI/CD pipelines for faster and more efficient releases.
Optimize system performance, reliability, and availability through proactive monitoring and observability tools.
Troubleshoot and resolve production incidents with a focus on root cause analysis and automation to prevent recurrence.
Implement Infrastructure-as-Code (IaC) using Terraform, Ansible, or similar tools.
Manage cloud infrastructure (AWS, GCP, or Azure) with a strong focus on automation and security.
Enhance monitoring and alerting using Prometheus, Grafana, Datadog, or similar tools.
Collaborate with developers to implement best practices for performance, security, and scalability.
With its global headquarters in California, TRUGlobal is a top IT services firm serving clients ranging from Fortune 500 companies to startups. Our talented team of business consultants and technologists averages over 18 years of industry-honed experience and has conducted many engagements worldwide...