Posted 1 day ago

Observability Engineer opening at Snappfood in Tehran
Snappfood


On-site
Education level: not specified
Prior experience required (3 years)
Salary: negotiable
Open to men and women
Full-time (Saturday to Wednesday, 10:00 to 19:00)

More information

Snappfood in Tehran invites candidates who meet the qualifications below to join its team.

Observability Engineer
At Snappfood, we believe in creating value that goes beyond the ordinary. We embrace innovation and continuously challenge ourselves to build reliable and scalable technology that serves millions of users every day.
We are looking for an experienced Observability Engineer to join our Production Reliability & Operations team and help us improve the reliability, visibility, and operational excellence of our production platforms. If you enjoy solving complex operational problems, building monitoring solutions, and enabling engineering teams with better observability, we would love to have you continue this story with us.
Job Description
As an Observability Engineer, you will be responsible for designing, implementing, and continuously improving monitoring, alerting, and observability practices across our production systems. You will work closely with engineering teams to ensure that services are measurable, actionable, and operationally mature.
You will play a key role in improving incident detection, reducing Mean Time to Detect (MTTD), and enabling faster and more effective incident response.

Role Summary

Monitoring & Observability
Design, implement, and maintain monitoring solutions for applications, infrastructure, and business-critical services.
Build and maintain dashboards, service health indicators, and operational reports.
Define and promote observability standards, including metrics, logs, traces, and service instrumentation.
Ensure critical systems have adequate monitoring coverage and operational visibility.
Continuously improve telemetry quality and monitoring effectiveness.
Alert Engineering
Design and maintain actionable alerts and escalation policies.
Reduce alert fatigue by improving signal-to-noise ratio and eliminating duplicate or low-value alerts.
Define alert standards and thresholds based on service reliability objectives.
Develop proactive monitoring mechanisms to identify issues before they impact customers.
Incident Detection & Response
Continuously monitor production environments and respond to operational incidents.
Participate in incident response activities and support major incident investigations.
Analyze monitoring data during incidents to assist troubleshooting and root cause identification.
Collaborate with engineering teams to implement preventive actions and improve service resilience.
Reliability Improvement
Identify monitoring gaps and recommend improvements to system reliability and operational readiness.
Partner with engineering teams to improve instrumentation, observability, and service maturity.
Support the implementation of Service Level Indicators (SLIs), Service Level Objectives (SLOs), and reliability reporting.
Documentation & Reporting
Maintain monitoring documentation, runbooks, dashboards, and operational procedures.
Produce reports on service health, incidents, alert trends, and monitoring coverage.
Ensure incident records and operational documentation remain accurate and up to date.
Operational Support
Participate in a 24/7 shift rotation to ensure continuous operational visibility and timely incident response.
Participate in on-call rotations and emergency response activities when required.
Requirements
3+ years of experience in Observability Engineering, Site Reliability Engineering (SRE), Production Operations, NOC, Systems Engineering, or related fields.
Experience operating and supporting production systems in a 24/7 environment.
Hands-on experience with monitoring, troubleshooting, and incident response processes.
Strong experience with monitoring and observability platforms such as Prometheus, Grafana, and Zabbix.
Experience with centralized logging solutions such as ELK, Loki, and Splunk.
Familiarity with distributed tracing and observability concepts, including OpenTelemetry and Tempo.
Experience configuring dashboards, alerts, service health reports, and monitoring automation.
Solid understanding of Linux/Unix systems and troubleshooting methodologies.
Good understanding of networking fundamentals and distributed systems concepts.
Familiarity with cloud-native environments and container platforms is a plus.
Preferred Qualifications
Experience with Kubernetes and containerized environments.
Understanding of SLI/SLO concepts and reliability engineering practices.
Experience with automation and scripting using Python, Bash, or Go.
Experience working in high-traffic, mission-critical production environments.

Qualified applicants can submit their résumé by clicking the link to the application form.

https://iranestekhdam.ir/?p=3097184