Hiring: Production Reliability & Operations Lead at SnappFood
SnappFood
Tehran

On-site
Education level: not specified
Experience required (2 years)
Salary: negotiable
Open to men and women
Full-time

SnappFood, based in Tehran, invites qualified candidates who meet the requirements below to join its team.

Production Reliability & Operations Lead
Job Description
Mission: Own the reliability and operational health of all production systems. Lead the team responsible for real-time incident response, release safety, and continuous improvement of operational standards across all engineering teams.
Role Overview
The Production Reliability & Operations Lead is a senior technical leadership role responsible for establishing and scaling reliability practices across the engineering organisation. This person owns the full operational lifecycle, from change management and safe deployment to major incident response, postmortem governance, and long-term service reliability improvement. The role requires equal strength in hands-on technical depth and cross-functional stakeholder management.
Key Responsibilities
Incident Management
Own the end-to-end Major Incident Management process, including severity classification, escalation paths, and war-room facilitation for high-severity events.
Lead incident communications to technical and business stakeholders during active events, ensuring clarity, accuracy, and timeliness.
Define and continuously refine on-call rotation structure, escalation policies, and paging thresholds across all engineering teams.
Track and report on MTTD, MTTR, incident frequency, and recurrence trends, using data to drive systemic improvements.
Postmortem & RCA Programme
Establish and champion a blameless postmortem culture across all 20 engineering teams.
Own the Root Cause Analysis (RCA) process — ensuring depth, consistency, and actionability of every postmortem output.
Track corrective actions to closure and surface recurring failure patterns to leadership as part of monthly reliability reviews.
Publish a monthly reliability digest covering incident trends, SLO performance, and top systemic risks.
Release & Change Management
Define CI/CD governance standards: deployment gates, health checks, automated rollback triggers, and promotion criteria.
Own the release orchestration process, including deployment freeze windows during peak business hours (lunch and dinner rush periods).
Lead change risk assessment for high-impact deployments and enforce production readiness gates before promotion to production.
Partner with Release Engineering to drive progressive delivery adoption — canary releases, blue/green deployments, and feature flags.
Operational Monitoring & Visibility
Ensure comprehensive production monitoring coverage across all critical services, with particular focus on the order-to-delivery critical path.
Drive alert quality improvement initiatives — reducing alert fatigue, eliminating false positives, and improving signal-to-noise ratio.
Own the operational health review process, including weekly service health reports and executive-level reliability dashboards.
Partner with the Reliability Platform Engineering team to align monitoring standards with platform capabilities.
Reliability Engineering Partnership
Act as the primary reliability advisor to engineering team leads — identifying gaps, recommending improvements, and co-prioritising reliability work alongside feature delivery.
Define and enforce service reliability standards, including the Service Maturity Model (SMM) tiers and associated production readiness requirements.
Drive error budget policy enforcement — escalating to leadership when teams breach budget thresholds and recommending feature freeze periods where appropriate.
Load Testing & Capacity Strategy
Define load testing strategies for critical services ahead of peak traffic events (promotional campaigns, seasonal peaks).
Review scalability risks and performance bottlenecks identified through capacity planning reviews.
Coordinate with engineering teams to validate readiness for expected traffic growth.
Service Governance & Standards
Maintain service ownership standards and drive Service Catalogue adoption and accuracy across all teams.
Define and enforce production readiness requirements — no service goes to production without meeting baseline observability, alerting, and runbook standards.
Chair or co-chair the Architecture Review Board (ARB) for operational and reliability concerns.
Reporting & Stakeholder Communication
Produce weekly, monthly, and quarterly reliability reports covering SLA compliance, error budget status, incident trends, and operational risk.
Communicate reliability risks and platform health status to the CTO, VP Engineering, and business stakeholders.
Represent the SRE function in engineering leadership forums and strategic planning cycles.
Required Experience
5+ years in Site Reliability Engineering, Production Engineering, Platform Engineering, or a directly equivalent discipline.
2+ years leading technical operations or reliability teams in a high-growth or high-traffic environment.
Proven experience managing large-scale distributed systems in cloud-native environments (microservices, event-driven architectures).
Deep understanding of incident management frameworks, blameless postmortem practices, and SLO/error budget models.
Demonstrated experience defining and enforcing CI/CD governance, deployment strategies, and change management processes.
Strong track record of cross-functional stakeholder management, including communication with C-suite and business leadership during incidents.
Experience in high-throughput transactional domains (e-commerce, ride-sharing, food delivery, or fintech) strongly preferred.

Qualified applicants can submit their résumé by clicking the link to complete the application form.

https://iranestekhdam.ir/?p=3097181