SRE Architect - R01548544
About Brillio:
Brillio is one of the fastest-growing digital technology service providers and a partner of choice for many Fortune 1000 companies seeking to turn disruption into a competitive advantage through innovative digital adoption. Brillio is renowned for its world-class professionals, referred to as "Brillians", who combine cutting-edge digital and design thinking skills with an unwavering dedication to client satisfaction.
Brillio takes pride in its status as an employer of choice, consistently attracting exceptional talent through its emphasis on contemporary, groundbreaking technologies and exclusive digital projects. Brillio's commitment to providing an exceptional experience to its Brillians and nurturing their full potential earns it the Great Place to Work® certification year after year.
SRE Architect
Dallas, TX – 3 days Hybrid
Required Skills & Experience
As an SRE Architect, you will lead the implementation, optimization, and maintenance of production systems at the customer site. You will work closely with cross-functional teams, including development, operations, and business stakeholders, to ensure high availability, performance, and resilience of applications and infrastructure. Your expertise in automation, monitoring, incident management, and cloud cost optimization will be critical to driving operational excellence and financial efficiency.
Key Responsibilities:
1. System Reliability and Performance
• Design, implement, and maintain highly available, scalable, and resilient systems.
• Monitor system health and performance using tools like Splunk, Dynatrace, Prometheus, Grafana, or similar platforms.
• Define and track Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets to measure system reliability.
• Perform root cause analysis (RCA) for incidents and implement preventive measures to avoid recurrence.
2. Automation and Tooling
• Automate repetitive tasks such as deployments, scaling, and monitoring using scripting languages (e.g., Python, Bash, PowerShell).
• Develop and maintain Infrastructure as Code (IaC) using tools like Terraform, Ansible, or CloudFormation.
• Build and optimize CI/CD pipelines to streamline application delivery processes.
3. Incident Management and On-Call Support
• Lead incident response efforts, coordinating with internal and customer teams to resolve issues quickly.
• Participate in an on-call rotation to provide 24x7 support for critical systems.
• Reduce Mean Time to Detect (MTTD) and Mean Time to Resolve (MTTR) through proactive monitoring and automation.
4. FinOps and Cost Optimization
• Implement FinOps practices to manage and optimize cloud and infrastructure costs effectively.
• Analyze and monitor cloud spending using tools like AWS Cost Explorer, Azure Cost Management, or third-party solutions (e.g., CloudHealth, Spot.io).
• Identify opportunities to reduce costs through resource optimization, reserved instances, spot instances, and auto-scaling policies.
• Collaborate with finance and engineering teams to establish budgets, forecasts, and cost allocation strategies.
• Educate and train teams on cost-aware development and operational practices.
5. Collaboration and Leadership
• Act as the primary technical point of contact at the customer site, fostering strong relationships with stakeholders.
• Mentor junior engineers and guide them in adopting SRE best practices, including cost optimization.
• Collaborate with development teams to embed observability, scalability, reliability, and cost efficiency into the software development lifecycle (SDLC).
6. Compliance and Security
• Ensure compliance with security standards and regulatory requirements (e.g., GDPR, HIPAA, SOC 2).
• Implement and enforce security best practices across systems and processes.
• Conduct regular audits and vulnerability assessments to maintain a secure environment.
Required Qualifications:
Experience:
• 9+ years of experience in IT operations, DevOps, or Site Reliability Engineering roles.
• Proven experience leading SRE initiatives in customer-facing or on-site roles.
• Hands-on experience with cloud platforms (AWS, Azure, GCP) and containerization technologies (Docker, Kubernetes).
• Strong understanding of distributed systems, microservices architecture, and serverless computing.
• Experience with cloud cost optimization and FinOps practices.
Technical Skills:
• Proficiency in automation tools.
• Expertise in monitoring and observability tools (e.g., Splunk, Dynatrace, Prometheus, Grafana).
• Experience with configuration management tools (e.g., Ansible, Puppet, Chef).
• Knowledge of scripting and programming languages (e.g., Python, Bash, Go).
• Familiarity with database technologies (e.g., MySQL, PostgreSQL, MongoDB).
• Hands-on experience with cloud cost management tools (e.g., Azure Cost Management or CloudHealth).
Why should you apply for this role?
As Brillio continues to gain momentum as a trusted partner for our clients in their digital transformation journey, we strive to set new benchmarks for speed and value creation. The DI team at Brillio is at the forefront of leading this charge by reimagining and executing how we structure, sell and deliver our services to better serve our clients.
Equal Employment Opportunity Declaration
Brillio is an equal opportunity employer to all, regardless of age, ancestry, colour, disability (mental and physical), exercising the right to family care and medical leave, gender, gender expression, gender identity, genetic information, marital status, medical condition, military or veteran status, national origin, political affiliation, race, religious creed, sex (includes pregnancy, childbirth, breastfeeding, and related medical conditions), and sexual orientation.
Salary: 80-85 USD per hour