Site Reliability Engineer (Remote)

Engineering | Full-time, Remote

Astronomer is the commercial developer of Apache Airflow, a community-driven open-source tool that’s leading the market in data orchestration. We’re a globally distributed and rapidly growing venture-backed team of learners, innovators, and collaborators. Our mission is to build an enterprise-grade product that makes it easy for data teams at Fortune 500 companies and startups alike to adopt Apache Airflow. As a member of our team, you will be at the forefront of the industry as we strive to make Apache Airflow the de facto standard in data orchestration.

We're looking for SREs to join our Infrastructure team, which is responsible for operating and maintaining the Astronomer products deployed on our hosted cloud and for supporting the same products in our customers' environments. The ideal candidate will be passionate about an operations role that involves deep knowledge of distributed computing and will believe that automation is a key component of operating large-scale systems. SREs at Astronomer own the full infrastructure stack, from host I/O performance debugging to application deployment pipelines, up through application performance and cluster operations. Our responsibilities are both broad and deep.

Our team is collaborative; we work closely with the development teams we support to deliver the best results. For each engineering challenge we face, we think critically and strive to balance the best solution with the need to get things done.

Responsibilities

  • Serve as a primary point of contact responsible for the overall health, performance, and capacity of our platform.
  • Assist in the roll-out and deployment of new product features and installations to facilitate our rapid iteration and growth.
  • Develop tools to improve our ability to rapidly deploy and effectively monitor applications in a large-scale environment.
  • Work closely with development teams to ensure the platform is designed with operability in mind.
  • Identify and lead efforts to improve automation.
  • Perform root cause analysis and document results in the form of post-mortems.
  • Write and maintain documentation around key systems and processes.
  • Participate in an on-call rotation.
  • Function well in a fast-paced, rapidly changing environment.

Key Qualifications

  • 3+ years of hands-on experience operating Kubernetes clusters in a production environment.
  • Experience managing and scaling distributed systems on at least one of the three major cloud providers (AWS, Azure, GCP).
  • Strong experience with at least one Continuous Integration system such as CircleCI or Jenkins.
  • Understanding of the Linux Operating System, standard networking protocols, and components.
  • Experience deploying, supporting, and monitoring new and existing services, platforms, and application stacks.
  • Automation/scripting experience with Shell, Python, or something similar.
  • Familiarity with Infrastructure as Code (IaC) tools (Terraform, CloudFormation, etc.).
  • Excellent troubleshooting and problem-solving skills.

Nice to Have

  • Experience with scale testing, disaster recovery, and capacity planning.
  • Experience with at least one of the following: Node.js, Go, Python.
  • Familiarity with Apache Airflow.
  • Experience with OpenShift and the Red Hat Marketplace.
  • Experience with the Prometheus/Grafana and ELK stacks.


At Astronomer, we value diversity. We are an equal opportunity employer: we do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status. Astronomer is a remote-first company.