Site Reliability Engineer

  • Starling Bank
  • London, United Kingdom
  • Oct 17, 2017
Full time Developer JAVA

Job Description


Our SRE team proactively ensures the stability, resilience and scale of our services through automation, testing and engineering. We build on expertise from systems / operations (OS & DB), cloud infrastructure (AWS), pipeline / release engineering (TeamCity), software development and stress / load testing to make sure our services are available 24 hours a day, seven days a week.

We're looking for engineers with a passion for infrastructure and delivery to join the team, who are equally happy:

  • working with developers to ensure a principled approach to delivering change in a safe and secure way
  • working with third parties to ensure our comms are reliable
  • working with other SREs to hit our service level objectives and prove our systems and environments

The ideal candidate will strive for continual improvement by contributing and assessing new ideas and innovations to meet short-term and longer-term goals, whilst accepting responsibility for the day-to-day health of our environments.



You will work in our SRE team, or embedded in our engineering teams, to deliver our SRE mission:

  • Change management and delivery pipeline into production
    • Ensure safety, predictability, repeatability and auditability of all build and deploy processes
    • Enable ownership by platform and application engineers of tech-specific build plans
    • Enable maximum velocity without violating service level objectives
  • Monitoring, alerting, SLO tracking
    • Proactively manage delivery of service level objectives
    • Detection / early warning / self-heal
    • On-call management
  • Facilitate emergency / incident response
    • Create, maintain and test for recovery (backup & restore, infra automation etc.)
  • Provisioning / automating deployment infrastructure
    • Demand forecasting and capacity management
    • Efficiency and cost management
    • Performance and scalability of the services
    • Ownership of some cross-cutting implementation like logs / metrics infrastructure
  • Automation of security checks, break-glass procedures, etc.
    • Provide a level of audit and control to security personnel


The ideal candidate will have some or all of:

  • Software development experience: ideally Java / JVM, but not essential; JavaScript, Python and Bash all beneficial
  • AWS expertise; familiarity with core services (S3, EC2, ELB, ASG) and CloudFormation
  • Good understanding of traditional ops areas of expertise: Linux, Disk I/O, Networking, VPNs
  • Good familiarity with Docker and the container ecosystem
  • Continuous delivery - principles and pragmatics of dealing with build pipelines, artefact repositories, zero-downtime deployment and so on
  • Proving resilience via failure injection (Chaos Monkey) and scalability via load and stress testing
  • Experience with any of the following: CoreOS, ELK, Prometheus, Elasticsearch, PostgreSQL, PagerDuty, Gatling, JMeter, Kubernetes
  • Some understanding of iOS or Android also beneficial
  • Sensitivity to (but also boldness to influence) culture and behaviour across an organisation


Above market-rate salaries, c. £50-100k for experienced developers

Potential for equity incentives

25 days holiday