We are seeking a Senior Observability Engineer to ensure the performance and efficiency of our Monitoring platform. You'll work closely with engineers and data scientists to develop systems for near real-time data collection and analysis. As an expert in high-scale monitoring and observability, you will utilize Elastic Stack, OpenTelemetry (OTEL), Grafana, Prometheus, and other telemetry frameworks.
Your role includes designing, enriching, and maintaining these tools to manage customer AI workloads, automate engineering processes, and enhance overall efficiency.
Responsibilities
- Analyze metrics, logs, and traces to identify performance bottlenecks and errors.
- Instrument metrics when desired telemetry is unavailable.
- Drive actions to improve the system’s health, performance, and key metrics.
- Collaborate with cross-functional teams to improve service reliability and performance.
- Develop and refine metrics to assess the performance and effectiveness of runtime inferencing.
- Lead efforts to reduce latency, improve throughput, and increase the efficiency and utilization of the system.
- Track industry trends and emerging technologies in observability and performance engineering to keep our platform state of the art.
Required/Minimum Qualifications:
- Bachelor's degree in Computer Science or a related technical discipline.
- 4+ years of experience with high-scale monitoring and observability systems.
- Experience with Elastic Stack, Grafana, Prometheus, or other telemetry frameworks.