Lead SRE Engineer - Remote

EPAM

Information

RESPONSIBILITIES
- Lead development teams through architectural reviews and recommendations
- Define what it means for a service to be available, and develop, monitor, and alert on SLIs/SLOs
- Define, track, and enforce error budgets
- Review code instrumentation with development teams and ensure the necessary dashboards are created to monitor SLIs/SLOs/SLAs
- Establish, test, and tune alerting for varying tiers of applications
- Participate in the on-call rotation
- Document and maintain runbooks and procedures, automating as much as possible
- Plan and execute periodic Disaster Recovery exercises, including both tabletop and simulated failures (fault injection)
- Perform periodic load and scalability testing to establish baselines, detect drift, and support capacity planning
- Design and implement peak readiness reviews for anticipated high-volume times
- Lead weekly operational state reviews covering performance trends, anomalies, errors, and other availability events with SREs, product owners, and development teams
- Participate in quarterly business and operational reviews aligning on roadmaps, development velocity, efficiency, growth trends, etc.
- Socialize SRE culture across teams within the organization to publicize the value of SRE, and mentor and train other engineers in proactive reliability decision making and planning

Job description

REQUIREMENTS
- 5+ years of SRE or Systems Engineering experience
- 2+ years as team lead or SRE champion
- Bachelor's degree in Computer Science or a similar technical field of study, or equivalent practical experience
- Proven experience troubleshooting, mitigating, and resolving issues in a distributed system
- Strong communication and collaboration skills for varying groups of stakeholders
- Self-motivated and able to prioritize effectively among competing priorities
- Experience implementing SRE practices for services and applications deployed in production in the cloud
- Must understand most SRE concepts, including SLI/SLO/SLA, Error Budget, MTTD/MTTR/MTBF, Toil, Capacity Planning, Observability, Monitoring/Alerting, Release Engineering, and Incident Management/Blameless Post-Mortems

Learn more at this link:

https://epa.ms/lead-sre-engineer

Table 6 - EPAM

Since 1993, we have helped our customers navigate the waves of digital transformation, building solutions that help them level the playing field and stay competitive through constant market disruption. As consultants, designers, architects, engineers and trainers, at EPAM we focus on building long-term partnerships with our customers, enabling them to reimagine their businesses through a digital lens. We help our customers become faster, more agile and more adaptive enterprises, by delivering solutions through best-in-class engineering, strategy, design, consulting, education and innovation services.
