Site Reliability Engineer II, SRE
United States, Washington, Redmond
Overview

Are you interested in working for one of the most exciting teams at Microsoft? Then look no further than the Microsoft Teams SRE team. You will build solutions that leverage state-of-the-art technologies to deliver the next evolution in collaboration and teamwork.

What is a Site Reliability Engineer (SRE)? SRE is what you get when you treat operations as a software engineering problem. Our mission is to improve the availability, latency, performance, and security of the Microsoft Teams services. Like traditional operations, we keep important revenue-critical systems up and running, even when natural disasters, bandwidth outages, and configuration problems occur. Unlike traditional operations groups, we identify and address these problems directly through software improvements, innovative technologies, and systems automation.

As a Site Reliability Engineer II in Teams, you will provide leadership, direction, and accountability for networking, infrastructure design, end-to-end implementation, and security for Teams services. Strong collaboration skills are required, as you will work closely with other engineering teams to ensure that services and systems are highly stable and performant and meet the expectations of internal stakeholders and external customers and users. This opportunity will allow you to learn what it takes to deploy and run software as a 24x7 enterprise-grade cloud service, hone your security expertise, and become an expert in web services optimization.

Microsoft's mission is to empower every person and every organization on the planet to achieve more. As employees, we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.
Responsibilities

- Design, write, and deliver software to improve the availability, scalability, latency, and efficiency of Microsoft's Identity services.
- Help define the next generation of Teams services infrastructure and routing design, and drive its implementation.
- Troubleshoot complex infrastructure and network issues, and proactively implement methods to reduce the recurrence and impact of future incidents.
- Develop code, scripts, systems, or platforms that automate complex operations processes (e.g., monitoring, alerting, routing, debugging) at scale.
- Identify security issues and recommend potential mitigation strategies to address underlying causes. Develop security guidance and models to address issues and contribute to the definition of best practices. Suggest and drive appropriate guidance, models, response, and remediation for issues.
- Participate in regular on-call rotations, and share details of incidents and their resolution through post-mortem reports and regular review meetings.