Observability and monitoring

We set up observability with Sentry, Datadog, Grafana or Honeycomb — not as a chart museum, but so you can concretely answer what's happening, why it's happening and who's affected.

Observability without a chart museum

What makes a difference at two in the morning:

  • Sentry: exceptions, releases and user context
  • Traces: OpenTelemetry across services
  • SLO: alerts on user experience, not on thresholds
  • Runbook: every alert has a concrete action

How we think about observability

Answers to the questions you actually have — not a chart museum.

Most systems aren't under-monitored — they're over-instrumented and under-thought. Dashboards nested ten levels deep, alerts that fire twice a day and get ignored, and when something actually breaks you're stuck digging through logs for a stack trace anyway. We build observability that answers the questions you actually have at two in the morning: what broke, how many users are affected, and what should we do now.

Our typical setup is Sentry for exceptions and frontend errors, a metrics system (Datadog, Grafana Cloud or Honeycomb) for operations data and latency, structured logging via OpenTelemetry for traces across services, and a runbook per service with concrete action steps per alert. We aim for SLOs instead of thresholds — we alert when user experience is at risk, not when a random metric crosses a random number.

We also set up the things that make a difference for the day-to-day team: per-deploy markers on all charts (so you can see which deploy broke something), per-tenant views (when a large customer complains, you can immediately see their specific metrics), and an on-call rotation with clear handovers so no one is stuck with a burning system without help.

What we deliver

Sentry, traces, SLO-based alerts and runbooks.

Built so every alert demands concrete action, and every incident leaves learning in the runbook.

  • Sentry for exceptions and frontend errors

    Sentry set up with source maps, release tracking, performance monitoring and user context. Releases linked to Git commits so you can immediately see which code change introduced an error. A minimal init sketch follows after this list.

  • Metrics, dashboards and business KPIs

    Datadog, Grafana Cloud or Honeycomb depending on your stack and budget. Dashboards designed per service and per business-critical flow — not as a museum of every conceivable metric.

  • Distributed tracing via OpenTelemetry

    OpenTelemetry instrumentation so you can follow a request across frontend, API, database and third-party services. When something is slow, you know exactly where the time is spent — not just that it's slow. See the tracing sketch after this list.

  • Structured logging and log aggregation

    JSON-structured logs with trace-id, request-id, tenant-id and user-id on every line. Aggregated in Datadog, Grafana Loki or your own ELK stack — so search actually finds something. A logging sketch follows after this list.

  • SLO-based alerts that don't burn out the team

    Alerts based on Service Level Objectives (e.g., '99.5% of checkout requests under 500ms over 30 days'), not on arbitrary thresholds. We configure error budgets and burn-rate alerts, and remove alerts that don't lead to action. The burn-rate arithmetic is worked through after this list.

  • Runbooks and on-call

    Every alert has a runbook: what does it mean, what steps to take, who to escalate to. On-call rotation set up in Opsgenie or PagerDuty with clear handovers. Post-mortems after serious incidents — without blame, with learning.
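
To make the Sentry item above concrete, here is a minimal sketch of the initialisation in a TypeScript frontend. The DSN, the RELEASE_SHA variable and the currentUser object are placeholders for whatever your build pipeline and auth layer provide:

```typescript
import * as Sentry from "@sentry/browser";

// Assumed to come from your auth layer; the shape is hypothetical.
declare const currentUser: { id: string; tenantId: string };

Sentry.init({
  dsn: "https://examplePublicKey@o0.ingest.sentry.io/0", // placeholder DSN
  release: process.env.RELEASE_SHA, // injected at build time, e.g. the Git SHA
  environment: "production",
  tracesSampleRate: 0.1, // sample 10% of transactions for performance data
});

// Attach user context so every error report says who was affected.
// Kept to ids only; see the GDPR note under "Before you commit".
Sentry.setUser({ id: currentUser.id });
Sentry.setTag("tenant", currentUser.tenantId);
```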
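
For the distributed-tracing item, a sketch of the Node-side wiring, assuming the standard OpenTelemetry packages; the service name is a placeholder:

```typescript
import { NodeSDK } from "@opentelemetry/sdk-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";

const sdk = new NodeSDK({
  serviceName: "checkout-api", // placeholder service name
  traceExporter: new OTLPTraceExporter(), // endpoint comes from OTEL_EXPORTER_OTLP_ENDPOINT
  instrumentations: [getNodeAutoInstrumentations()], // auto-instruments HTTP, Express, pg, ...
});

sdk.start(); // load this file before the rest of the app

// Flush remaining spans on shutdown so the last requests aren't lost.
process.on("SIGTERM", () => {
  sdk.shutdown().finally(() => process.exit(0));
});
```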
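
For the structured-logging item, a sketch using pino plus the OpenTelemetry API so every log line can be joined to its trace. The request, tenant and user ids are illustrative values; in practice they come from your request middleware:

```typescript
import pino from "pino";
import { trace } from "@opentelemetry/api";

const logger = pino({
  mixin() {
    // Stamp the active trace context on every line so logs join to traces.
    const ctx = trace.getActiveSpan()?.spanContext();
    return ctx ? { trace_id: ctx.traceId, span_id: ctx.spanId } : {};
  },
});

// A child logger carries the per-request ids on every subsequent line.
const requestLogger = logger.child({
  request_id: "req_123", // placeholder: set by your middleware
  tenant_id: "acme",     // placeholder
  user_id: "u_42",       // placeholder
});

requestLogger.info("checkout completed"); // one JSON line with all ids attached
```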
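
And for the SLO item, the burn-rate arithmetic worked through on the example objective above. The thresholds are the conventional ones, not something specific to your system:

```typescript
// SLO: 99.5% of checkout requests under 500ms over 30 days, i.e. a 0.5% error budget.
const slo = 0.995;
const errorBudget = 1 - slo; // 0.005

// Burn rate = observed bad-event ratio / allowed bad-event ratio.
// 1 means the budget lasts exactly the 30-day window; 14.4 means it is
// gone in about two days, the conventional "page someone now" threshold.
function burnRate(badEvents: number, totalEvents: number): number {
  return badEvents / totalEvents / errorBudget;
}

// 120 slow or failed checkout requests out of 10,000 in the last hour:
console.log(burnRate(120, 10_000)); // 2.4, i.e. burning the budget 2.4x too fast
```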

Before you commit

What you should consider first.

  • How many tools do you actually need?

    It's tempting to have separate tools for logs, metrics, traces, exceptions and uptime. That gives best-in-class per layer, but also five dashboards to juggle during an incident. We often recommend consolidating on fewer tools (Datadog covers most; Grafana Cloud too) — unless you have specific requirements at one layer.

  • Sampling and cost

    Full instrumentation on everything can get unmanageably expensive. We design a sampling strategy: 100% of exceptions, 100% of 5xx responses, 5–10% of successful requests — with smart sampling that always keeps slow requests and failing traces. Budget alarms on telemetry spend itself so you don't get a surprise bill. The policy is sketched in code after this list.

  • GDPR and user data in logs

    Structured logs must not contain PII (email addresses, names, physical addresses, national IDs). We set up automatic redaction in the logging layer and audit it at setup. User context in error reports is kept to user-id and tenant-id — anything else requires an explicit decision. A redaction sketch follows after this list.

  • Alert fatigue is a real risk

    An alert that fires three times a day and gets no reaction has lost its meaning. We review all alerts quarterly: those no one acted on are removed or adjusted. The goal is that every pager alert requires a concrete action — otherwise it's noise.
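
The sampling policy from the cost item above, spelled out as a decision function. This is illustrative logic for a tail-sampling step (in practice it would live in OpenTelemetry Collector configuration); the TraceSummary shape and the thresholds are assumptions:

```typescript
// Hypothetical summary of a finished trace, as a tail sampler would see it.
interface TraceSummary {
  hasException: boolean;
  httpStatus: number;
  durationMs: number;
}

const SLOW_THRESHOLD_MS = 1_000; // assumption: "slow" means over one second
const BASELINE_RATE = 0.1;       // keep 10% of ordinary successful requests

function keepTrace(t: TraceSummary): boolean {
  if (t.hasException) return true;                   // 100% of exceptions
  if (t.httpStatus >= 500) return true;              // 100% of 5xx responses
  if (t.durationMs > SLOW_THRESHOLD_MS) return true; // always keep slow requests
  return Math.random() < BASELINE_RATE;              // sample the healthy rest
}
```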
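
And for the GDPR item, a sketch of redaction in the logging layer using pino's built-in redact option. The paths are examples; the audit at setup determines the real list:

```typescript
import pino from "pino";

const logger = pino({
  redact: {
    paths: ["email", "user.email", "user.name", "req.headers.authorization"],
    censor: "[REDACTED]",
  },
});

// The PII never reaches the log backend:
logger.info(
  { user_id: "u_42", email: "jane@example.com" },
  "password reset requested"
); // emits {"user_id":"u_42","email":"[REDACTED]", ...}
```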

FAQ

What people usually ask.

  • How long does an observability setup take?

    A basic setup with Sentry, structured logging, three core dashboards and an on-call rotation is typically 3–5 weeks. A full setup with distributed tracing, SLOs, runbooks per service and post-mortem process takes 6–10 weeks. We recommend starting with the basics and building out — full instrumentation on day one is often noise without learning.

  • Which tool should we choose?

    Sentry is almost always part of it — it covers exceptions and frontend errors well and is a reasonable fit for most stacks. For metrics and traces it depends on your team: Datadog is the most comprehensive but requires discipline on usage; Grafana Cloud gives good value for small teams; Honeycomb is excellent for complex distributed systems. We choose together during discovery based on your stack and operations team.

  • Can we use OpenTelemetry and avoid vendor lock-in?

    Yes — and we recommend it as the default. OpenTelemetry instrumentation exports to whichever backend you want. You can start with, say, Datadog and later switch to Grafana Cloud without re-instrumenting the code. It's a small upfront investment that buys flexibility later; the wiring is sketched after this FAQ.

  • How do you handle on-call without burning out the team?

    With strict discipline on alerts — only what requires action pages anyone. With a rotation that isn't heavier than the team can handle (typically one week per person, and not necessarily 24/7 if the business doesn't require it). With follow-the-sun if you have a distributed team. And with post-mortems after every serious incident that actually change the process — not just assign blame.

  • Can you take on-call for us?

    Yes, on a monthly operations agreement. We typically run a hybrid model where your team is primary on-call (they know the product best) and we're secondary, backing up outside business hours or when an incident escalates. We've also taken primary on-call for customers without an internal operations team.
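
A sketch of the vendor-neutral wiring mentioned in the OpenTelemetry answer above: the code never names a backend, so switching vendors is a deployment change. The endpoint and header values are placeholders; OTEL_EXPORTER_OTLP_ENDPOINT and OTEL_EXPORTER_OTLP_HEADERS are the standard OpenTelemetry environment variables:

```typescript
// The exporter reads its destination from standard environment variables:
//
//   OTEL_EXPORTER_OTLP_ENDPOINT=https://otlp.your-vendor.example   (placeholder)
//   OTEL_EXPORTER_OTLP_HEADERS="api-key=<your key>"                (placeholder)
//
// Point those at a different backend and redeploy; the code stays identical.
import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter(), // no endpoint hard-coded here
});
sdk.start();
```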

Ready to get started?

Let's have a no-pressure conversation.

We'll get back within one business day with concrete input — not a stock proposal.