1.1 Responsibilities (Principal SRE)
- Be responsible for both on-premises and cloud-based infrastructure.
- Function as an extension of the existing staff and teams.
- Possess deep expertise in:
- Troubleshooting
- Infrastructure (physical and cloud)
- Automation
- Enterprise-level system administration
- Agile workflows
- Enterprise change management
Technical Proficiencies
Cloud Technologies:
- AWS native load balancers
- AWS EC2, ECS, EKS, Containers
- Terraform
Monitoring & Observability:
- Splunk Cloud Observability
- CloudWatch
DevOps & Automation:
- CI/CD
- Jenkins
- Automation with Python
2. AWS Infrastructure Management
- Deploy and maintain shared platform team assets (e.g., ECS clusters, ALBs)
- Deploy and maintain unique or non-standard infrastructure assets
- Assist developer teams in standardizing deployments
3. Cost Containment
- Minimize service operational costs
- Perform periodic cost analysis to identify cost-saving opportunities
4. Capacity Planning
- Conduct capacity analysis for production and non-production environments
- Right-size domain assets for performance and availability
- Leverage automation (e.g., autoscaling)
5. Monitoring
- Collaborate on defining SLOs for service availability
- Coordinate deployment quality objectives
- Develop pattern-based service monitors
- Implement uniform service measurement and monitoring
6. Performance Optimization
- Develop performance measurement techniques
- Assist in refining and improving service efficiency over time
7. Incident Management
- Provide on-call support and drive service restoration
8. Security
- Implement InfoSec-recommended patterns
- Monitor anomalies using internal tools
9. NOC Services
- Establish targeted alerting and predefined NOC response procedures
1.2 Key Responsibilities
- Develop and maintain monitoring and alerting systems
- Manage the incident response lifecycle (runbooks, dashboards, automation)
- Automate operational tasks for efficiency
- Participate in on-call rotations
- Design performance testing and capacity planning strategies
- Collaborate across teams to troubleshoot and resolve issues
1.3 Required Qualifications
- Strong problem-solving skills
Hands-on experience with:
- Cloud Platforms: AWS, Azure, GCP
- IaC Tools: Terraform or CloudFormation
- Programming Languages: Python, Java, C/C++, Go, JavaScript, or Ruby
- Log Aggregation: Splunk, ELK, or SumoLogic
- Monitoring Tools: SignalFx, Datadog, Dynatrace, AppDynamics
- Prior roles in SRE, Software Engineering, or Production Engineering
- Passion for learning and improving systems
- Interest in SLIs, SLOs, resilience, scaling, system design, and performance
1.4 Desired Qualifications
- Experience with large-scale distributed systems
Familiarity with configuration and automation tools:
- Terraform, Puppet, Ansible
Experience with CI/CD and DevOps toolchain:
- Git, Jenkins, Docker, Nexus, Artifactory, Selenium
Knowledge of cloud security practices, including:
- Intrusion detection
- Penetration testing
- Vulnerability scanning

Key skills: Git, AWS, DevOps, Ansible, CI/CD Pipeline, Python, Terraform, CloudWatch, Agile, Splunk