Full-time, IT & Engineering, Permanent
Site Reliability Engineer
Bristol
Job Responsibilities:
- Collaborate with Software Engineers to improve reliability and performance in their subsystems.
- Partner with System Administrators in automating toil and eliminating alerts.
- Evolve observability and monitoring capabilities to identify and solve problems before they impact the business.
- Support development environments to help us achieve our delivery and quality goals.
- Research and evaluate technologies, tools, and services to influence buy-vs-build decisions.
- Develop expertise in diverse technical and business domains.
- Expand your knowledge of the technical stacks used.
Requirements:
- Experience using modern configuration management tools (such as Ansible, Chef or similar).
- Experience working with Terraform.
- Experience working with Docker containers and container orchestration tools (such as Kubernetes, OpenShift or Docker Swarm).
- Experience both using and maintaining CI/CD tools (such as Jenkins or similar).
- Experience with monitoring tools such as InfluxDB, Prometheus or Grafana.
- Experience of event-driven integration using message queues (RabbitMQ or a similar AMQP solution).
- Good understanding of relational databases and SQL.
- Proficiency with the Linux command line, system administration, and shell scripting.
- Working knowledge of network security protocols.
- Experience using, developing, and maintaining cloud hosting services (ideally AWS EC2, RDS, S3, Lambda).
- Industry experience of writing well-tested code in one of our platform languages (Java, Go, Python or similar).
- Knowledge of cross-domain principles and technologies.
- Experience of working in a service management environment.
- Practical experience applying observability patterns in previous systems.
- Experience creating and monitoring system availability metrics and using them to drive work that reduces downtime.