Specialist, AI Site Reliability Engineer - Core Enterprise Services
Your opportunity
At Charles Schwab, our purpose is simple: we champion clients’ goals with passion and integrity. Guided by honesty, mutual respect, and a commitment to doing what’s right, we bring innovation, education, and service together to help shape financial futures. Our people are the foundation of our success – they approach their work with curiosity and collaboration, coming together to create solutions that make a meaningful impact for clients and communities. As we expand into India, we are bringing this same culture of inclusion, learning, and opportunity to new talent. Joining us means becoming part of a global team where your work matters and your future can take shape.
Our Hyderabad location is central to Schwab’s growth, bringing together talented people and technology to drive innovation, scale, and efficiency. Here, you will work alongside teams who create solutions that support millions of clients every day. The work you do is more than daily operations – it’s a chance to experiment, learn, and build within a value-driven, supportive environment. This is a unique opportunity to be part of our early growth phase and shape something new, backed by the stability and strength of a Fortune 500 company. Your impact begins on day one, and your contributions will help define our future in the region.
We are seeking an AI Site Reliability Engineer to join a forward-thinking engineering team that builds intelligent observability, monitoring, and deployment automation solutions using AI-augmented development practices while also providing production support for Pega Infinity applications. This role combines hands-on software engineering with operational ownership, requiring the ability to build software solutions for operational challenges, leverage Gen AI to accelerate workflows, automate incident response, and support the stability, availability, and performance of business-critical Pega platforms. You will contribute to architecture, code, AI-driven tooling, and production readiness while partnering closely with development, platform, and business teams to resolve incidents, drive root cause analysis, and implement preventive improvements. Ideal for engineers who are curious, adaptable, and excited about working at the intersection of software engineering, AI, and enterprise application support.
Key Responsibilities:
High Availability, Resilience & Pega Platform Stability
Design and implement architectures that achieve and sustain 99.999% uptime across critical systems and support the reliability of Pega Infinity production applications
Define, measure, and track SLOs, SLIs, error budgets, and operational health metrics for both platform services and Pega applications
Build AI-powered self-healing systems with automated failover, redundancy, and graceful degradation
Perform AI-assisted capacity planning, demand forecasting, and load testing, including performance monitoring for Pega workloads
Monitor and support Pega platform components such as queues, job schedulers, agents, node types, search indexing, and background processing to maintain application stability
Production Support & Incident Management
Provide production support for Pega Infinity applications, including incident triage, troubleshooting, service restoration, and coordination during major incidents
Monitor application health, background processing, integrations, APIs, and batch performance to proactively identify and resolve issues before they impact users
Analyze failures across Pega case processing, data pages, connectors, services, decisioning flows, and asynchronous processing
Partner with development, infrastructure, database, and business teams to diagnose production defects, implement fixes, and drive preventive actions
Support application releases, hotfix deployments, patch validation, rollback planning, and operational readiness for Pega Infinity platforms
Observability & Monitoring
Design, build, and maintain AI-enhanced observability platforms covering metrics, logs, traces, and intelligent alerting
Implement AI-powered anomaly detection, predictive alerting, and proactive system health management
Leverage Gen AI to auto-generate and refine dashboards, alert rules, runbooks, and support playbooks
Build real-time availability dashboards with AI-driven trend analysis tracking reliability and support targets
Use Pega diagnostic tools, alerts, and performance data to identify bottlenecks, failed processing, and platform health issues
Root Cause Analysis & Continuous Improvement
Build AI-accelerated root cause analysis with thorough postmortems and actionable remediation
Build AI-powered diagnostic tools that automatically correlate logs, metrics, and traces
Use Gen AI to analyze incident patterns, predict recurring failures, and recommend preventive actions
Continuously reduce MTTD and MTTR through AI-assisted workflows, operational automation, and platform improvements
Drive corrective actions related to guardrail compliance, rule efficiency, query performance, clipboard usage, and overall Pega application stability
Deployment Automation
Design and maintain AI-enhanced CI/CD pipelines with AI-driven deployment risk scoring
Implement progressive delivery with canary releases, blue-green deployments, and automated rollbacks triggered by AI anomaly detection
Leverage Gen AI to generate deployment scripts, IaC templates, and pipeline configurations
Eliminate manual toil by building AI agents for repetitive operational and support tasks
Support Pega deployment pipelines, product rule packaging, branch merge processes, and environment promotion activities across lower and production environments
What you have
Required Qualifications:
- 5+ years of software engineering experience with a strong focus on SRE, DevOps, platform engineering, or enterprise application support
- Bachelor's degree in Computer Engineering, Computer Science, Information Technology, or equivalent
- Hands-on technical ability to architect, review code, and contribute to critical engineering and operational decisions
- Proven experience operating systems at 99.99% or higher availability
- Hands-on experience with GenAI coding tools and demonstrated ability to apply them effectively
- Proficiency in Java or .NET, with scripting skills in Python, Bash, or PowerShell
- Strong experience providing production support for enterprise applications, including incident triage, root cause analysis, and service restoration
- Hands-on experience with Pega Infinity applications, including troubleshooting case processing, data pages, connectors, services, background jobs, queues, agents, node types, and environment issues in production or pre-production environments
- Experience working with Pega diagnostic and support tooling, including tracer, clipboard, logs, alerts, performance analysis, and monitoring dashboards
- Experience supporting integrations such as REST, SOAP, messaging, and database interactions within Pega applications
- Experience supporting application releases, hotfixes, and patch deployments with strong change management discipline
- Hands-on experience with observability platforms (Datadog, Splunk, Grafana, ELK, or similar)
- Strong experience with CI/CD pipelines (Jenkins, GitLab CI, GitHub Actions)
- Proven track record leading RCA efforts and driving reliability improvements
- Proven experience diagnosing production issues end-to-end, identifying root cause, and partnering with development teams on corrective and preventative improvements
- Strong understanding of Pega guardrails, performance tuning concepts, and operational considerations for highly available enterprise applications
- Strong CS fundamentals including system design, networking, concurrency, and algorithms
Preferred Qualifications:
- Experience with MongoDB, Oracle, and SQL
- Experience with cloud platforms (PCF, GCP, AWS, or Azure)
- Experience with containerization and orchestration (Docker, Kubernetes)
- Pega Infinity certification or prior experience supporting Pega-based applications in production environments
- Experience with ITSM, incident/problem management, and operational support processes
- Familiarity with Pega deployment pipelines, product rule packaging, branch strategies, and environment promotion processes
- Familiarity with Pega guardrails, PAL, tracer, clipboard, and alert log analysis
- Experience supporting Pega integrations, including APIs, messaging, and external system connectivity
- Knowledge of Pega background processing constructs such as job schedulers, queues, and asynchronous processing
- Experience with performance tuning, search indexing, node classification, and operational support for highly available Pega platforms
What’s in it for you
At Schwab India, you’re empowered to shape your future. We support your growth through meaningful work, continuous learning, and a culture rooted in trust and collaboration – so you can build the skills to make a lasting impact. Our benefits are designed to care for your wellbeing, your family, and your long-term financial security.
Our base benefits, wellbeing, and total rewards include:
- Competitive compensation and retirement programs including Employee Provident Fund (EPF), Gratuity, and optional National Pension System (NPS) contributions
- Robust Paid Time Off, including annual/privilege leave, sick and casual leave, public holidays, maternity/paternity leave, and more
- Education assistance for continued learning to help you grow
- Comprehensive medical insurance with Outpatient Department (OPD) services, including vaccination, pharmacy, dental, and vision coverage
- Annual reimbursement for health check-ups and mental health support through our Employee Assistance Program (EAP)
- Childcare (creche) reimbursement for eligible employees
- Transportation and meal benefits that support your day-to-day work
- Group life, personal accident, and critical illness insurance