Lead SRE Engineer - Remote

EPAM

Information

RESPONSIBILITIES
- Lead development teams through architectural reviews and recommendations
- Define what it means for a service to be available, and develop, monitor, and alert on SLIs/SLOs
- Define, track, and enforce error budgets
- Review code instrumentation with development teams and ensure the necessary dashboards are created to monitor SLIs/SLOs/SLAs
- Establish, test, and tune alerting for varying tiers of applications
- Participate in the on-call rotation
- Document and maintain runbooks and procedures, automating as much as possible
- Plan and execute periodic Disaster Recovery exercises, including both tabletop and simulated failures (fault injection)
- Perform periodic load and scalability testing to establish baselines, detect drift, and support capacity planning
- Design and implement peak readiness reviews for anticipated high-volume times
- Lead weekly operational state reviews covering performance trends, anomalies, errors, and other availability events with SREs, product owners, and development teams
- Participate in quarterly business and operational reviews aligning on roadmaps, development velocity, efficiency, growth trends, etc.
- Socialize SRE culture across teams within the organization to publicize the value of SRE, and mentor and train other engineers in proactive reliability decision making and planning
Job description
REQUIREMENTS
- 5+ years of SRE or Systems Engineering experience
- 2+ years as a team lead or SRE champion
- Bachelor's degree in Computer Science or a similar technical field of study, or equivalent practical experience
- Proven experience troubleshooting, mitigating, and resolving issues in a distributed system
- Strong communication and collaboration skills across varied groups of stakeholders
- Self-motivated, with the ability to prioritize effectively among competing priorities
- Experience implementing SRE practices for services and applications deployed in production in the cloud
- Must understand most SRE concepts, including SLI/SLO/SLA, Error Budget, MTTD/MTTR/MTBF, Toil, Capacity Planning, Observability, Monitoring/Alerting, Release Engineering, and Incident Management/Blameless Post-Mortems
