Role & responsibilities
Job Title: Mainframe Site Reliability Engineer (SRE)
Location: Pune/Hyderabad
Employment Type: Full-Time
---
About the Role
We are seeking a visionary Mainframe Site Reliability Engineer (SRE) to redefine the reliability, automation, and efficiency of our mission-critical z/OS systems. This role combines deep mainframe expertise with cutting-edge SRE practices, focusing on innovations in observability, AI-driven operations, and DevOps integration to transform legacy workflows into modern, self-healing systems. You will drive initiatives to eliminate manual toil, optimize performance, and ensure the platform's resilience aligns with business-critical service level objectives (SLOs).
---
Key Responsibilities
1. SRE-Centric Innovation & Automation
- Automation Engineering:
  - Design and deploy Infrastructure-as-Code (IaC) solutions using Ansible, Zowe CLI, and z/OSMF workflows to automate system provisioning, configuration management, and recovery processes.
  - Develop self-healing workflows for critical subsystems (CICS, Db2, IMS) to auto-resolve incidents such as JVM failures or transaction bottlenecks.
  - Convert legacy operational scripts (REXX, NCL) into modern, version-controlled pipelines integrated with Git and CI/CD tools like Jenkins.
- AI-Driven Observability:
  - Implement predictive analytics tools (e.g., IBM Watson AIOps, Splunk ITSI) to detect anomalies in system metrics, logs, and message queues.
  - Build dashboards using Grafana and Prometheus to visualize the Four Golden Signals (latency, traffic, errors, saturation) across mainframe workloads.
  - Centralize alert management to reduce noise and prioritize actionable alerts using AI-driven correlation.
2. DevOps Integration & Modernization
- CI/CD for Mainframe:
  - Streamline software delivery pipelines for COBOL/PL/I applications using IBM Dependency-Based Build (DBB) and UrbanCode Deploy (UCD).
  - Integrate mainframe SDLC processes with enterprise Git repositories (GitHub, GitLab) to enable collaborative development and audit trails.
  - Enable automated testing and phased rollouts for z/OS middleware updates.
- Performance & Capacity Engineering:
  - Optimize CPU/MIPS utilization through runtime tuning (e.g., CICS Threadsafe, AT-TLS offloading) to reduce software licensing costs.
  - Forecast capacity demands using historical SMF/RMF data and propose dynamic hardware scaling strategies.
  - Conduct load testing for batch and OLTP workloads to validate system limits and error budgets.
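As a rough illustration of the error budgets mentioned above, the underlying arithmetic can be sketched in Python. The 99.9% availability target and 30-day window in the usage note are hypothetical figures chosen for the example, not targets defined for this role, and the function names are illustrative.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Return the allowed downtime, in minutes, for an availability SLO.

    slo_target is a fraction (e.g. 0.999 for 99.9%); both values here are
    hypothetical inputs, not figures taken from the role description.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo_target)


def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Return the unspent fraction of the error budget (negative if overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget
```

For example, `error_budget_minutes(0.999)` gives roughly 43.2 minutes of allowed downtime over 30 days, and a load test or incident that consumes 21.6 of those minutes leaves about half the budget.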
3. Incident Management & Reliability
- Lead blameless postmortems for critical incidents, focusing on root cause analysis (RCA) and preventive actions (e.g., monitoring gaps, automation fixes).
- Reduce MTTR by implementing automated incident response playbooks (e.g., auto-restarting failed subsystems, rerouting traffic).
- Maintain 24/7 operational readiness through on-call rotations and cross-training in z/OS, CICS, Db2, and storage management.
4. Platform Hardening & Knowledge Sharing
- Enforce security best practices (RACF, TLS) and vulnerability remediation for z/OS and middleware.
- Develop reusable workbooks and runbooks to document system configurations, troubleshooting steps, and automation workflows.
- Mentor teams on SRE principles, fostering a T-shaped skill model (deep mainframe + DevOps/Agile practices).
5. Batch Optimization & Resource Management
- Design dynamic resource allocation strategies (e.g., WLM policies, enclaves) to prioritize critical batch jobs and minimize contention for CPU, memory, and I/O resources.
- Implement parallel processing (e.g., multi-task JCL, SYSAFF routing) to reduce runtime and avoid bottlenecks in long-running batch cycles.
- Streamline job dependencies using graph-based scheduling tools (e.g., IWS, CA7, Control-M) to eliminate idle wait times between interdependent jobs.
6. Proactive Batch Health Monitoring
- Develop automated checks for batch job SLAs, including real-time alerts for delays, resource starvation, or dataset contention.
- Integrate predictive analytics (e.g., historical SMF data analysis) to forecast and mitigate delays caused by seasonal peaks or data volume spikes.
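An automated batch SLA check of the kind described above might look like the following Python sketch. It is a minimal illustration only: the `BatchJob` fields, the status labels, and the idea of comparing a projected end time with the deadline are assumptions made for the example, and in practice the timestamps would come from the scheduler or SMF records rather than being supplied by hand.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class BatchJob:
    """Hypothetical view of one scheduled job; all field names are illustrative."""
    name: str
    sla_deadline: datetime
    started_at: Optional[datetime] = None
    ended_at: Optional[datetime] = None
    expected_runtime: timedelta = timedelta(minutes=30)


def sla_status(job: BatchJob, now: datetime) -> str:
    """Classify a job as 'met', 'breached', 'at_risk', or 'on_track'."""
    if job.ended_at is not None:
        return "met" if job.ended_at <= job.sla_deadline else "breached"
    if now > job.sla_deadline:
        return "breached"
    # Project the end time from the actual start, or from now if not yet started.
    start = job.started_at if job.started_at is not None else now
    # A job that has overrun its expected runtime cannot finish before now.
    projected_end = max(start + job.expected_runtime, now)
    return "at_risk" if projected_end > job.sla_deadline else "on_track"


def jobs_needing_alert(jobs: List[BatchJob], now: datetime) -> List[str]:
    """Return the names of jobs whose SLA is at risk or already breached."""
    return [j.name for j in jobs if sla_status(j, now) in ("at_risk", "breached")]
```

A scheduler hook or a periodic job could call `jobs_needing_alert` and forward the result to the alerting pipeline; the static `expected_runtime` is the natural place to substitute a forecast derived from historical data.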
---
Required Skills
- Technical Expertise:
  - xx+ years in z/OS system programming, performance tuning, or infrastructure support.
  - Proficiency in JCL, REXX, Python, and mainframe automation tools (IBM Z System Automation, Broadcom OPS/MVS).
  - Hands-on experience with Zowe, Ansible, Git, and CI/CD pipelines.
  - Mastery of SRE tenets: SLOs/SLIs, error budgets, and Infrastructure-as-Code (IaC).
- Innovation Focus:
  - Proven track record in implementing AI/ML-driven monitoring or auto-remediation for mainframe environments.
  - Experience modernizing legacy workflows (e.g., replacing CA Endevor with Git-based SDLC).
- Soft Skills:
  - Ability to lead cross-functional teams during high-severity incidents.
  - Strong communication to align technical execution with business objectives.
- Education:
  - Bachelor's degree in Computer Science, Engineering, or a related field.
---
Preferred Qualifications
- Experience with AI-driven automation platforms (e.g., AMELIA AIOps) to standardize and migrate legacy workflows, integrate with event management systems (e.g., BigPanda), and orchestrate ITIL processes (incident, change) via ServiceNow.
- Certifications: IBM z/OS System Programming, Broadcom Mainframe SRE, or HashiCorp Terraform.
- Familiarity with Zowe Desktop for modern IDE-driven development or Dynatrace APM for CICS/Db2 monitoring.
- Knowledge of mainframe open-source ecosystems (Zowe, Feilong) or hybrid-cloud integrations.
Key skills: Site Reliability Engineering, Mainframes, z/OS