Posted 15 hours ago
Hiring: Production Reliability & Operations Lead at Snappfood
On-site
Education level: not specified
Experience: required (2 years)
Salary: negotiable
Gender: men and women
Full-time
Snappfood, based in Tehran, invites candidates who meet the requirements below to join its team.
Production Reliability & Operations Lead

Job Description
Mission: Own the reliability and operational health of all production systems. Lead the team responsible for real-time incident response, release safety, and continuous improvement of operational standards across all engineering teams.

Role Overview
The Production Reliability & Operations Lead is a senior technical leadership role responsible for establishing and scaling reliability practices across the engineering organisation. This person owns the full operational lifecycle, from change management and safe deployment to major incident response, postmortem governance, and long-term service reliability improvement. The role requires equal strength in hands-on technical depth and cross-functional stakeholder management.

Key Responsibilities

Incident Management
- Own the end-to-end Major Incident Management process, including severity classification, escalation paths, and war-room facilitation for high-severity events.
- Lead incident communications to technical and business stakeholders during active events, ensuring clarity, accuracy, and timeliness.
- Define and continuously refine on-call rotation structure, escalation policies, and paging thresholds across all engineering teams.
- Track and report on MTTD, MTTR, incident frequency, and recurrence trends, using data to drive systemic improvements.

Postmortem & RCA Programme
- Establish and champion a blameless postmortem culture across all 20 engineering teams.
- Own the Root Cause Analysis (RCA) process, ensuring depth, consistency, and actionability of every postmortem output.
- Track corrective actions to closure and surface recurring failure patterns to leadership as part of monthly reliability reviews.
- Publish a monthly reliability digest covering incident trends, SLO performance, and top systemic risks.

Release & Change Management
- Define CI/CD governance standards: deployment gates, health checks, automated rollback triggers, and promotion criteria.
- Own the release orchestration process, including deployment freeze windows during peak business hours (lunch and dinner rush periods).
- Lead change risk assessment for high-impact deployments and enforce production readiness gates before promotion to production.
- Partner with Release Engineering to drive progressive delivery adoption: canary releases, blue/green deployments, and feature flags.

Operational Monitoring & Visibility
- Ensure comprehensive production monitoring coverage across all critical services, with particular focus on the order-to-delivery critical path.
- Drive alert quality improvement initiatives: reducing alert fatigue, eliminating false positives, and improving signal-to-noise ratio.
- Own the operational health review process, including weekly service health reports and executive-level reliability dashboards.
- Partner with the Reliability Platform Engineering team to align monitoring standards with platform capabilities.

Reliability Engineering Partnership
- Act as the primary reliability advisor to engineering team leads, identifying gaps, recommending improvements, and co-prioritising reliability work alongside feature delivery.
- Define and enforce service reliability standards, including the Service Maturity Model (SMM) tiers and associated production readiness requirements.
- Drive error budget policy enforcement, escalating to leadership when teams breach budget thresholds and recommending feature freeze periods where appropriate.

Load Testing & Capacity Strategy
- Define load testing strategies for critical services ahead of peak traffic events (promotional campaigns, seasonal peaks).
- Review scalability risks and performance bottlenecks identified through capacity planning reviews.
- Coordinate with engineering teams to validate readiness for expected traffic growth.

Service Governance & Standards
- Maintain service ownership standards and drive Service Catalogue adoption and accuracy across all teams.
- Define and enforce production readiness requirements: no service goes to production without meeting baseline observability, alerting, and runbook standards.
- Chair or co-chair the Architecture Review Board (ARB) for operational and reliability concerns.

Reporting & Stakeholder Communication
- Produce weekly, monthly, and quarterly reliability reports covering SLA compliance, error budget status, incident trends, and operational risk.
- Communicate reliability risks and platform health status to the CTO, VP Engineering, and business stakeholders.
- Represent the SRE function in engineering leadership forums and strategic planning cycles.

Required Experience
- 5+ years in Site Reliability Engineering, Production Engineering, Platform Engineering, or a directly equivalent discipline.
- 2+ years leading technical operations or reliability teams in a high-growth or high-traffic environment.
- Proven experience managing large-scale distributed systems in cloud-native environments (microservices, event-driven architectures).
- Deep understanding of incident management frameworks, blameless postmortem practices, and SLO/error budget models.
- Demonstrated experience defining and enforcing CI/CD governance, deployment strategies, and change management processes.
- Strong track record of cross-functional stakeholder management, including communication with C-suite and business leadership during incidents.
- Experience in high-throughput transactional domains (e-commerce, ride-sharing, food delivery, or fintech) strongly preferred.
Qualified applicants can submit their résumé by following the link to the application form.
Contact information
- Registration: complete the application form via the link on the original listing
- Deadline: 1405/06/07 (Solar Hijri calendar)