Job Description
Key Responsibilities
Own the reliability, scalability, and performance of our core backend systems in a high-growth, fast-moving startup environment.
Architect and implement backend services that can handle rapid scaling and unpredictable traffic patterns.
Proactively monitor production systems, detect anomalies, and resolve issues before they impact users.
Lead live debugging during critical incidents, providing quick resolutions and implementing preventive measures.
Optimize APIs, database queries, and service workflows for maximum efficiency and minimal latency.
Design and enforce best practices for deployment, monitoring, and disaster recovery.
Collaborate with product and engineering teams to build resilient, fault-tolerant systems from the ground up.
Automate operational tasks to reduce manual intervention and speed up delivery cycles.
Mentor and guide junior engineers in SRE and backend development best practices.
Own postmortems, root cause analysis, and long-term reliability improvements.
Qualifications
8+ years of experience in backend development (Node.js, Python, Go, or similar).
Solid experience with cloud platforms (AWS, GCP, or Azure).
Strong knowledge of containerization (Docker, Kubernetes) and CI/CD pipelines.
Familiarity with distributed systems and microservices architecture.
Proficiency in relational database technologies (SQL, PostgreSQL).
Experience with observability tools (Prometheus, Grafana, ELK stack, etc.).
Strong debugging skills in live production environments.
Knowledge of performance tuning and system optimization techniques.
Nice to Have
Experience with infrastructure-as-code tools (Terraform, Pulumi).
Background in load testing and capacity planning.
Understanding of networking concepts and security best practices.
Job Classification
Industry: IT Services & Consulting
Functional Area / Department: Engineering - Software & QA
Role Category: DevOps
Role: Site Reliability Engineer
Employment Type: Full-time
Contact Details:
Company: Valuecoders
Location(s): Noida, Gurugram
Keyskills:
Performance tuning
Networking
Load testing
PostgreSQL
Debugging
Disaster recovery
Distributed systems
SQL
Python
Capacity planning