Sr. Site Reliability Engineer (SRE)

As a Principal Site Reliability Engineer (SRE) overseeing our digital application portfolio, you will lead efforts to ensure the reliability, scalability, and performance of the platforms behind our web, mobile, and OTT experiences. You’ll work across a diverse ecosystem of products and technologies, helping with architectural decisions, shaping reliability standards, and championing operational excellence at scale.

You will serve as a strategic partner to engineering, product, security, and infrastructure teams: guiding system design for high availability, leading incident response across critical services, and embedding SRE best practices across the software development lifecycle. Your role will include evolving observability frameworks, advancing infrastructure-as-code maturity, and automating tooling to accelerate delivery while maintaining stability.

Success in this role is defined by your ability to influence engineering culture, mentor teams, and drive systemic improvements that raise the bar for operational resilience. You’ll take a proactive, data-driven approach to identifying and addressing risks before they impact users. Collaboration across teams, including video engineering, content delivery, data, and customer experience, is key to delivering digital products that are not only innovative but consistently reliable.

What We Value

Site Reliability Engineers are the champions of reliability and customer trust in production. We value engineers who are driven by a desire to deliver the best possible customer experience, ensuring that every interaction across our web, mobile, CTV, and video platforms is fast, seamless, and dependable. We look for systems thinkers who act with urgency, collaborate deeply, and apply a data-driven mindset to everything they do. Curiosity, clear communication, and continuous improvement are at the heart of our culture.
As a Principal SRE, you’ll lead by example: mentoring others, shaping best practices, and helping us build resilient systems that scale.

Responsibilities:

Design and implement tools, processes, and frameworks to proactively monitor, measure, and improve the performance, availability, and reliability of production applications.
Define and maintain key Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to uphold system reliability and user experience targets.
Evaluate applications and services for production readiness, ensuring they meet operational, security, and customer experience requirements before launch.
Establish comprehensive observability practices, including real-time monitoring, alerting, and telemetry, to ensure deep visibility into system health and user impact.
Serve as a feedback loop to engineering teams by analyzing production behavior, identifying reliability gaps, and driving architectural and operational improvements.
Collaborate with security and infrastructure teams to proactively address vulnerabilities and maintain compliance across production systems.
Partner with product and platform teams to ensure operational insights inform development priorities and release strategies.
Lead post-incident reviews and foster a culture of continuous learning, improvement, and resilience.
Participate in a 24/7 on-call rotation to support critical services and ensure rapid incident response.

Job ID

744000072470976

Business

News Group HQ

News Group Digital Tech, Prod & Growth

DetailURL

https://jobs.smartrecruiters.com/NBCUniversal3/744000072470976

Job Level

Mid-Senior Level

Job Location

United States

New Jersey