3 weeks ago

Snappfood is hiring an Observability Engineer in Tehran

Snappfood

Tehran

On-site

Education level: not specified

Experience required (3 years)

Salary: negotiable

Open to men and women

Full-time (Saturday to Wednesday, 10:00 to 19:00)

More information

Snappfood in Tehran invites candidates who meet the qualifications below to join its team.

Observability Engineer
Job Description
At Snappfood, we believe in creating value that goes beyond the ordinary. We embrace innovation and continuously challenge ourselves to build reliable and scalable technology that serves millions of users every day. We are looking for an experienced Observability Engineer to join our Production Reliability & Operations team and help us improve the reliability, visibility, and operational excellence of our production platforms. If you enjoy solving complex operational problems, building monitoring solutions, and enabling engineering teams with better observability, we would love to have you continue this story with us.
Role Summary
As an Observability Engineer, you will be responsible for designing, implementing, and continuously improving monitoring, alerting, and observability practices across our production systems. You will work closely with engineering teams to ensure that services are measurable, actionable, and operationally mature. You will play a key role in improving incident detection, reducing Mean Time to Detect (MTTD), and enabling faster and more effective incident response.
Responsibilities

Monitoring & Observability
Design, implement, and maintain monitoring solutions for applications, infrastructure, and business-critical services.
Build and maintain dashboards, service health indicators, and operational reports.
Define and promote observability standards, including metrics, logs, traces, and service instrumentation.
Ensure critical systems have adequate monitoring coverage and operational visibility.
Continuously improve telemetry quality and monitoring effectiveness.

Alert Engineering
Design and maintain actionable alerts and escalation policies.
Reduce alert fatigue by improving signal-to-noise ratio and eliminating duplicate or low-value alerts.
Define alert standards and thresholds based on service reliability objectives.
Develop proactive monitoring mechanisms to identify issues before they impact customers.

Incident Detection & Response
Continuously monitor production environments and respond to operational incidents.
Participate in incident response activities and support major incident investigations.
Analyze monitoring data during incidents to assist troubleshooting and root cause identification.
Collaborate with engineering teams to implement preventive actions and improve service resilience.

Reliability Improvement
Identify monitoring gaps and recommend improvements to system reliability and operational readiness.
Partner with engineering teams to improve instrumentation, observability, and service maturity.
Support the implementation of Service Level Indicators (SLIs), Service Level Objectives (SLOs), and reliability reporting.

Documentation & Reporting
Maintain monitoring documentation, runbooks, dashboards, and operational procedures.
Produce reports on service health, incidents, alert trends, and monitoring coverage.
Ensure incident records and operational documentation remain accurate and up to date.

Operational Support
Participate in a 24/7 shift rotation to ensure continuous operational visibility and timely incident response.
Participate in on-call rotations and emergency response activities when required.
Requirements
3+ years of experience in Observability Engineering, Site Reliability Engineering (SRE), Production Operations, NOC, Systems Engineering, or related fields.
Experience operating and supporting production systems in a 24/7 environment.
Hands-on experience with monitoring, troubleshooting, and incident response processes.
Strong experience with monitoring and observability platforms such as Prometheus, Grafana, and Zabbix.
Experience with centralized logging solutions such as ELK, Loki, and Splunk.
Familiarity with distributed tracing and observability concepts, including OpenTelemetry and Tempo.
Experience configuring dashboards, alerts, service health reports, and monitoring automation.
Solid understanding of Linux/Unix systems and troubleshooting methodologies.
Good understanding of networking fundamentals and distributed systems concepts.
Familiarity with cloud-native environments and container platforms is a plus.
Preferred Qualifications
Experience with Kubernetes and containerized environments.
Understanding of SLI/SLO concepts and reliability engineering practices.
Experience with automation and scripting using Python, Bash, or Go.
Experience working in high-traffic, mission-critical production environments.