Site Reliability Engineer
About the role
We are looking for a Site Reliability Engineer to join our rapidly growing team.
The Site Reliability Engineer will help to deploy, manage, fix and reinvent the tools, services and components that our software engineers rely on to automate our services and keep them operational. Your internal customers are your engineering colleagues. Through close collaboration, support and the exchange of ideas, we share a common goal: to serve our external customers and to grow through learning and innovation.
About the team
Reporting to the Software Development Manager, you will be part of our Site Reliability and DevOps team. You will have the opportunity to use a wide range of technologies and tools. Training and support will be provided where relevant, so you do not need exhaustive experience across all our technologies. At Brightpearl we pride ourselves on providing a collaborative environment that ensures we produce leading products across web and native applications.
You will work with our product delivery teams across the business, providing them with the support, tooling and knowledge they need to achieve great results. Ultimately, you will be passionate about the quality of software developed at Brightpearl. Your aim will be to ensure the systems we develop are highly available, low latency, robust to unexpected failures, scalable to high levels of load, cost-effective and secure.
- Working with software engineering teams to help plan and deliver solutions, ensuring they are highly performant, reliable and secure.
- Developing tooling and libraries and investigating new approaches and technologies to support our development teams in gaining improved observability, performance, reliability and security.
- Proactively monitoring performance and reliability of systems at Brightpearl. Helping the Site Reliability team to define acceptable standards for key metrics. Identifying necessary improvements and working with developers to deliver them.
- Gathering data and presenting it to the wider business to help share understanding of the reliability and performance of our systems.
- Documenting our tooling and best practices for both technical and non-technical audiences.
- Supporting the technical response to outages and incidents, and designing and implementing improvements to our systems to prevent recurrence.
- Working closely with the DevOps team to ensure the infrastructure to run our systems is in place, that it is secure, and that we have the means to ship code in a safe, reliable and continuous manner.
- Linux services
- Configuration management tools (Ansible, Chef or similar)
- Previous experience of collaborating with a global team to roll out new features, oversee frequent releases to production and improve the infrastructure within a Product Delivery environment
- Experience designing, implementing and maintaining site reliability processes and systems that increase efficiency, eliminate downtime and maintain performance at scale across platforms
- Proven experience of diagnosing, resolving and escalating service-impacting issues
- Experience using CI/CD pipelines to support automated testing and deployments
- A good team player capable of delivering to deadlines.
- Ability to work calmly under pressure to help diagnose performance issues affecting customers in production.
- Comfortable proactively communicating with colleagues and stakeholders.
- Quality focused and value driven.
Ideally you’ll have some of the following:
- Knowledge of AWS and its various services
- Familiarity with maintaining and supporting varied production environments
- A customer-centric approach to creating and maintaining services
- Familiarity with Kubernetes and Docker
- Experience working with Terraform
- Comfortable with scripting and programming languages (such as Bash, Ruby, Python or Go)