Senior Site Reliability Engineer, Series B AI Startup
Remote
Information
Your Impact
Automation! Capability! Performance! Scale!
Strongly influence our engineering best practices and deployment process
Help automate the continuous integration and testing processes to enable and scale
Manage and maintain infrastructure
Own, design and implement monitoring systems such as Prometheus and Grafana
Optimize Linux systems for performance, reliability, and security
Own configuration management process(es) and build product features as appropriate
Dig into data to find the root cause of problems and work with our engineers on solutions
Participate in the on-call rotation
Company
Recruiting From Scratch
Requirements
What we're looking for
You have 5+ years of experience in SRE, Production Engineering, and/or DevOps
Expert-level experience architecting, developing, and troubleshooting large-scale systems
Advanced proficiency with one or more programming languages (e.g., Python, Golang)
Deep experience with data structures and Linux systems internals (e.g., filesystems, system calls) and administration
Extensive experience with CI/CD pipelines and infrastructure as code (e.g., Terraform, Ansible)
You have strong familiarity with AWS services (e.g., ECS, S3, ALB, VPC)
You have knowledge of containers and orchestration with Kubernetes
Experience building production-quality cloud infrastructure that enables reliable and rapid deployment of large-scale systems with effective monitoring and resilient operations
You thrive in a fast-paced startup environment
You have a proven track record of taking projects from inception to launch
Bachelor's degree in Computer Science or Electrical Engineering (MS preferred)