OpenTelemetry

  • OpenTelemetry is an open source framework for creating and managing telemetry data. It provides APIs, SDKs, and integrations to export telemetry data (metrics, logs, and traces) from applications to backends.
  • The three pillars of observability are: Logs, Metrics, and Traces.

Logs

  • A log is a textual record of an event that happened in the system at a specific time.
  • Logs are emitted explicitly by the application code: the code decides when a log is generated.
type Log = {
    timestamp: Date;
    payload: {
        message: string;
        context: Record<string, any>;
        metadata: Record<string, any>;
    };
};
  • Log categories:
    • Structured logs - logs written in a fixed, machine-readable format (usually JSON) that is easy to query and analyze.
    • Unstructured logs - logs written for human consumption, usually free-form text with no fixed format.
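
To make the two categories concrete, here is a minimal sketch that renders the same event both ways. The helper names `toUnstructured` and `toStructured` and the sample field values are assumptions made for illustration; they are not part of any OpenTelemetry API.

```typescript
// Same shape as the Log type above, repeated so the snippet stands alone.
type Log = {
    timestamp: Date;
    payload: {
        message: string;
        context: Record<string, unknown>;
        metadata: Record<string, unknown>;
    };
};

// Unstructured: free-form text meant for a human reader (hypothetical helper).
function toUnstructured(log: Log): string {
    return `${log.timestamp.toISOString()} ${log.payload.message}`;
}

// Structured: one JSON object per entry, so fields can be queried by key
// (hypothetical helper).
function toStructured(log: Log): string {
    return JSON.stringify({
        timestamp: log.timestamp.toISOString(),
        message: log.payload.message,
        context: log.payload.context,
        metadata: log.payload.metadata,
    });
}

// Sample event; every value here is made up for the example.
const log: Log = {
    timestamp: new Date("2024-01-01T00:00:00.000Z"),
    payload: {
        message: "payment failed",
        context: { orderId: "o-42" },
        metadata: { service: "checkout", level: "error" },
    },
};

console.log(toUnstructured(log));
console.log(toStructured(log));
```

With the structured form, a query such as "all logs where metadata.level is error" becomes a key lookup instead of a text search.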

Metrics

  • A metric is a series of timestamped data values, i.e. a time series.
  • Examples: CPU usage, memory usage, number of requests per second.
  • Data values are usually numeric; boolean or string states are typically encoded as numbers (e.g. 0/1) before being recorded.
  • Metrics are usually aggregated over time to reduce the resources needed to store them: instead of the raw data points, we keep only the avg, min, max, and sum of the metric over each time window.
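
The windowed aggregation described above can be sketched as follows. This is a simplified illustration under stated assumptions: the `DataPoint` and `WindowSummary` types and the `aggregate` function are hypothetical names, windows are fixed-size and aligned to multiples of the window length, and real backends may use different strategies.

```typescript
// A raw sample: timestamp in milliseconds, numeric value (assumed shape).
type DataPoint = { timestamp: number; value: number };

// What is kept per time window instead of the raw samples.
type WindowSummary = {
    windowStart: number;
    count: number;
    sum: number;
    min: number;
    max: number;
    avg: number;
};

// Groups points into fixed windows of windowMs and summarizes each window.
function aggregate(points: DataPoint[], windowMs: number): WindowSummary[] {
    const windows = new Map<number, WindowSummary>();
    for (const { timestamp, value } of points) {
        // Align each point to the start of the window it falls into.
        const windowStart = Math.floor(timestamp / windowMs) * windowMs;
        const w = windows.get(windowStart);
        if (w === undefined) {
            windows.set(windowStart, {
                windowStart, count: 1, sum: value, min: value, max: value, avg: value,
            });
        } else {
            w.count += 1;
            w.sum += value;
            w.min = Math.min(w.min, value);
            w.max = Math.max(w.max, value);
            w.avg = w.sum / w.count;
        }
    }
    return Array.from(windows.values()).sort((a, b) => a.windowStart - b.windowStart);
}

// Made-up CPU usage samples (percent), one every 30 seconds.
const cpu: DataPoint[] = [
    { timestamp: 0, value: 20 },
    { timestamp: 30000, value: 40 },
    { timestamp: 60000, value: 90 },
    { timestamp: 90000, value: 70 },
];

// Four raw points collapse into two one-minute summaries.
const perMinute = aggregate(cpu, 60000);
console.log(perMinute);
```

The trade-off is precision for storage: once only the summaries are kept, individual samples inside a window can no longer be recovered.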

Traces

  • A trace is a series of related events: it describes the entire journey of a request across a distributed system.
  • As the request passes through the system's components, each component generates a span.
  • All spans belonging to the same request share the same trace ID, and each span points to its parent span (the span of the caller, i.e. its predecessor). The root span has no parent.
type Span = {
    id: string;
    traceId: string; // shared between all spans in the same trace
    parentId?: string; // id of the parent span; absent on the root span
    name: string; // usually function or component name
    startTime: Date;
    endTime: Date;
    status: string;
    metadata: Record<string, any>;
    events: (Log | Error)[]; // logs and errors recorded while the span was active
    links: Record<string, any>;
};
  • Aggregating spans and traces yields metrics that Site Reliability Engineers (SREs) use to understand the health of the system.
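
As a rough illustration of both ideas, the sketch below links spans of one request through `traceId` and `parentId`, then aggregates them into per-component metrics. It uses a trimmed-down `Span`; `makeSpan`, `summarizeByName`, the ID scheme, and the `"ok" | "error"` status values are all assumptions made for the example, not the OpenTelemetry SDK.

```typescript
// Trimmed-down version of the Span type above, so the snippet stands alone.
type Span = {
    id: string;
    traceId: string;
    parentId?: string;
    name: string;
    startTime: Date;
    endTime: Date;
    status: "ok" | "error"; // assumed status values
};

let nextId = 0;

// Hypothetical helper: a child inherits the parent's traceId and stores the parent's id.
function makeSpan(name: string, startMs: number, endMs: number,
                  status: Span["status"], parent?: Span): Span {
    nextId += 1;
    const id = `span-${nextId}`;
    const span: Span = {
        id,
        traceId: parent ? parent.traceId : `trace-${id}`,
        name,
        startTime: new Date(startMs),
        endTime: new Date(endMs),
        status,
    };
    if (parent) span.parentId = parent.id;
    return span;
}

type SpanStats = { count: number; errors: number; totalMs: number; maxMs: number };

// Aggregates spans into per-name metrics: call count, error count, and latency.
function summarizeByName(spans: Span[]): Map<string, SpanStats> {
    const stats = new Map<string, SpanStats>();
    for (const span of spans) {
        const durationMs = span.endTime.getTime() - span.startTime.getTime();
        const s = stats.get(span.name) ?? { count: 0, errors: 0, totalMs: 0, maxMs: 0 };
        s.count += 1;
        if (span.status === "error") s.errors += 1;
        s.totalMs += durationMs;
        s.maxMs = Math.max(s.maxMs, durationMs);
        stats.set(span.name, s);
    }
    return stats;
}

// One request: the root span calls two components (all values made up).
const root = makeSpan("GET /checkout", 0, 120, "ok");
const auth = makeSpan("auth.verify", 5, 25, "ok", root);
const db = makeSpan("db.query", 30, 110, "error", root);

const summary = summarizeByName([root, auth, db]);
console.log(summary);
```

Walking `parentId` links from any span leads back to the root, which is how a trace viewer can reconstruct the call tree; the per-name summary is the kind of derived metric an SRE dashboard might chart.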
