The goal of the education and outreach function of the SRE team is to improve reliability and resilience of Netflix services by focusing on the people within the company, since it's the normal, everyday work of Netflix employees that creates our availability. To cultivate operational excellence, we reveal risks, identify opportunities, and facilitate the transfer of skills and expertise among our staff by sharing experiences.
The nature of our work is interdisciplinary, so we recognize that a successful candidate can come from a wide variety of backgrounds (e.g., software engineering, SRE, human factors, safety science, systems engineering, technical product/program management, UX research, organizational psychology, cultural anthropology). We encourage you to apply even if you feel uncertain that you have the "right" background.
You may also be interested in the Senior Site Reliability Engineer opening on our team.
We think about:
Netflix as a socio-technical system is formed from the interaction of people and software. This system has many components and is constantly undergoing change. Unforeseen interactions are common and operational surprises arise from perfect storms of events.
_Surprises over incidents and recovery more than prevention._ We encourage highlighting good catches, the things that help make us better, and the capacity we develop to successfully minimize the consequences of encountering inevitable failure. A holistic view of our work involves paying attention to how we are confronted with surprises every day and the actions we take to cope with them.
Discovering new information and actionable outcomes over tracking stagnant action items. We aspire to pursue the ways that help us learn, not chase after numbers. Building a learning organization is a real way for us to improve proactively and continually.
- Increase Netflix's capacity to adapt to changes and surprises
- Enhance operational expertise at Netflix
- Advance Netflix as a learning organization
- Change the ways internal tool builders think about how people and tools interact
- Improve team health by empowering teams to balance operational responsibilities with development
- Exploring contributors versus constructing causes
- "I see how that action was reasonable" versus "you shouldn't have done that"
- ‘Human error’ as symptom versus ‘human error’ as cause
- Automation as a team player versus automation as a replacement for humans
- How things went right versus why things went wrong
- Adapting to new surprises over remediating prior incidents
- Narrative descriptions of surprising events versus out-of-context quantitative data
- Deep conversations versus shallow collections of timestamps
- Identifying weak signals versus broadly categorizing incidents
- Decisions driven by expert judgment versus decisions driven by superficial metrics
- Influence through developing relationships over exercising authority
- Investigate operational surprises
- Facilitate reviews and conversations to surface risks and opportunities
- Share context and develop holistic techniques that change how people work
- Design and execute on programs to socialize findings and drive operational change
- Education and training on identified risks and operational gaps
- Inform product and tooling roadmaps based on findings
- Experimentation to try new approaches for reaching an audience
- Use qualitative and quantitative data to inform recommendations and decisions
- Familiarity with resilience engineering concepts
- Software and systems engineering
- Technical product/program management in this specific domain
- Experience within systems that encounter complex failure modes
- Proficient with qualitative research methods
Here are some resources that explain more about what we do and how we think:
- How Did Things Go Right? Learning More from Incidents
- Antics, Drift, and Chaos