Lead Site Reliability Engineer

Basic Information

Country: United States
State: NA
City: Remote
Date Published: 21-Jul-2021
Job ID: 30823
Travel Amount: up to 10%

Description and Requirements

Thanks to our ongoing expansion, we have the opportunity to grow our Site Reliability Engineering team. We’re looking for people who are as passionate about solving issues with distributed systems as they are about automating, coding, and collaborating to tackle problems.

Primary Roles and Responsibilities:


In this role you will:

  • Participate in SRE software engineering, writing code to continually reduce human intervention in operational tasks and to automate processes.

  • Manage cloud provider infrastructure, system deployments, and product release operations.

  • Monitor the Elasticsearch platform, respond to incidents, correct and improve systems to prevent further incidents, and plan capacity.

  • Help resolve Elasticsearch customer issues.

  • Participate in 24x365 on-call schedules.

Qualifications

  • You are either a Software Engineer with a real interest in, and ideally some experience with, Linux systems, networking, monitoring, and automation; or an experienced sysadmin or systems engineer with professional skills in Linux, preferably on distributed systems at scale, and proven interest and experience in using software engineering to solve operational problems.
  • You are comfortable writing software to automate API-driven tasks at scale. Python preferred.
  • Ability to define and enhance SRE observability and monitoring by leveraging the AIOps service.
  • Ability to develop SRE runbook automation and self-service playbooks for the Helix SaaS platform.
  • Working understanding of system monitoring tools and technologies.
  • Experience with monitoring and observability to proactively predict failures; experience with logging and monitoring tools such as ELK, Prometheus, Kibana, and Grafana is a plus.
  • Experience with CI/CD pipelines and DevOps tools such as Jenkins, Docker, Git, Kubernetes, Terraform, and Ansible.
  • You have a passion for collaborating cross-functionally and cross-product to identify and own the RCA and mitigation plan and to reduce MTTR.
  • Experience optimizing existing deployment workflows.

BMC helps customers run and reinvent their businesses in the digital age by tackling their IT management challenges, championing their innovation, and celebrating their success.
Every BMC employee has the potential to have a tremendous impact on customer success—and when customers thrive, we all do.

BMC offers bold and fearless career-seekers like you the opportunity to expand your skills, your network, and your horizons as you work to enable customer growth and innovation every day. You will be surrounded by peers who inspire you, drive you, support you, and make you laugh out loud, in an environment that fosters individuality, respect, and personal ambition.

It is the policy of BMC Software to afford equal opportunity for employment to all individuals regardless of race, color, sex, age, national origin, physical or mental disability, history of disability, ancestry, citizenship status, political affiliation, religion, gender, transgender, gender identity, gender expression, marital status, status as a parent, sexual orientation, protected veteran status, genetic information or other factors prohibited by law, and to prohibit harassment or retaliation based on any of these factors.

If you need a reasonable accommodation for any part of the application and hiring process, visit the accommodation request page.