Senior Site Reliability Engineer, Series B AI Startup

Remote

Information

Your Impact
Automation! Capability! Performance! Scale!
Have a strong influence in defining our engineering best practices and deployment process
Help automate the continuous integration and testing processes to enable and scale
Manage and maintain infrastructure
Own, design and implement monitoring systems such as Prometheus and Grafana
Optimize Linux systems for performance, reliability, and security
Own configuration management process(es) and build product features as appropriate
Investigate and dig into data to find the root of a problem and strategize with our engineers on solutions
Participate in on-call rotation
Company
Recruiting From Scratch
Requirements
We're looking for someone with the following background:
You have 5+ years of experience in SRE, Production Engineering and/or DevOps
Expert-level experience architecting, developing, and troubleshooting large-scale systems
Advanced proficiency with one or more programming languages (e.g., Python, Golang)
Deep experience with data structures and with Linux systems internals (e.g., filesystems, system calls) and administration
Extensive experience with CI/CD pipelines and infrastructure as code (e.g., Terraform, Ansible)
You have a strong familiarity with AWS services (e.g., ECS, S3, ALB, VPC)
You have knowledge of containers and orchestration using Kubernetes
Experience building production-quality cloud infrastructure that enables reliable and rapid deployment of large-scale systems with effective monitoring and resilient operations
You thrive working in a fast-paced startup environment
You have a proven track record of taking projects from inception to launch
Bachelor's in Computer Science or Electrical Engineering (MS preferred)
