Node Monitoring and Observability


You can enable metrics on StrongDM gateways and relays to assist with monitoring and observability. When visualized on monitoring dashboards and mapped to alerts, metrics provide valuable insight into the status of gateways and relays, including connection failures, disconnects, and availability. Monitoring gateways and relays helps you detect, understand, and address problems as soon as they arise.

This guide defines gateway and relay metrics, describes common terminology related to such metrics, and provides a configuration example for enabling Prometheus-formatted metrics on a gateway or relay.

After configuration is complete, you can request metrics from the gateway or relay on the specified port. With the example address used in this guide, the /metrics endpoint can be reached at:

http://127.0.0.1:9999/metrics

Terminology

Common terminology related to gateway and relay metrics is described below.

Chunk
  A data blob representing a portion of a long-running SSH, RDP, or Kubernetes interactive session recording.

Egress
  The act of a gateway or relay making an outbound network connection (called an egress connection) directly to a target resource outside the StrongDM relay network. Of the many relay hops that may make up a route from client to resource, only the last hop creates the egress connection.

Link
  A secure network connection between a gateway and a client, relay, or other gateway. There is generally only one link between any two entities. A link serves as a tunnel through which streams can flow.

Query
  A single client request to a resource, such as a SQL query. Long-running SSH, RDP, or Kubernetes interactive sessions count as queries.

Stream
  A single logical network connection between a client and a resource. One stream can be tunneled through multiple links across multiple gateways and relays. One link can contain multiple streams. There can be multiple simultaneous streams between a client and a resource.

Metrics

Gateway and relay metrics are described below. Each entry shows the metric name, metric type, description, and labels (if any).

go_gc_duration_seconds (Summary)
  Summary of the pause duration of garbage collection cycles
go_goroutines (Gauge)
  Number of goroutines that currently exist
go_info (Gauge)
  Information about the Go environment
go_memstats_alloc_bytes (Gauge)
  Number of bytes allocated and still in use
go_memstats_alloc_bytes_total (Counter)
  Total number of bytes allocated, even if freed
go_memstats_buck_hash_sys_bytes (Gauge)
  Number of bytes used by the profiling bucket hash table
go_memstats_frees_total (Counter)
  Total number of frees
go_memstats_gc_sys_bytes (Gauge)
  Number of bytes used for garbage collection system metadata
go_memstats_heap_alloc_bytes (Gauge)
  Number of heap bytes allocated and still in use
go_memstats_heap_idle_bytes (Gauge)
  Number of heap bytes waiting to be used
go_memstats_heap_inuse_bytes (Gauge)
  Number of heap bytes that are in use
go_memstats_heap_objects (Gauge)
  Number of allocated objects
go_memstats_heap_released_bytes (Gauge)
  Number of heap bytes released to the OS
go_memstats_heap_sys_bytes (Gauge)
  Number of heap bytes obtained from the system
go_memstats_last_gc_time_seconds (Gauge)
  Number of seconds since 00:00:00 UTC on January 1, 1970 of the last garbage collection
go_memstats_lookups_total (Counter)
  Total number of pointer lookups
go_memstats_mallocs_total (Counter)
  Total number of mallocs
go_memstats_mcache_inuse_bytes (Gauge)
  Number of bytes in use by mcache structures
go_memstats_mcache_sys_bytes (Gauge)
  Number of bytes used for mcache structures obtained from the system
go_memstats_mspan_inuse_bytes (Gauge)
  Number of bytes in use by mspan structures
go_memstats_mspan_sys_bytes (Gauge)
  Number of bytes used for mspan structures obtained from the system
go_memstats_next_gc_bytes (Gauge)
  Number of heap bytes when the next garbage collection will take place
go_memstats_other_sys_bytes (Gauge)
  Number of bytes used for other system allocations
go_memstats_stack_inuse_bytes (Gauge)
  Number of bytes in use by the stack allocator
go_memstats_stack_sys_bytes (Gauge)
  Number of bytes obtained from the system for the stack allocator
go_memstats_sys_bytes (Gauge)
  Number of bytes obtained from the system
go_threads (Gauge)
  Number of OS threads created
promhttp_metric_handler_requests_in_flight (Gauge)
  Current number of scrapes being served
promhttp_metric_handler_requests_total (Counter)
  Total number of scrapes by HTTP status code
sdmcli_chunk_completed_count (Counter)
  Number of chunks processed by the gateway or relay
  Labels: type=<RESOURCE_TYPE> (example: type=postgres)
sdmcli_credential_load_count (Counter)
  Total number of times the gateway or relay has attempted to load credentials for a resource
  Labels: type=<RESOURCE_TYPE> (example: type=postgres), source=store|cache|api, success=true|false
sdmcli_egress_count (Gauge)
  Current number of active egress connections
  Labels: type=<RESOURCE_TYPE> (example: type=postgres)
sdmcli_egress_attempt (Counter)
  Total number of times the gateway or relay has attempted to establish an egress connection to a resource
  Labels: type=<RESOURCE_TYPE> (example: type=postgres), successful=true|false
sdmcli_link_attempt_count (Counter)
  Total number of attempts to establish links with other gateways, relays, and listeners
  Labels: direction=inbound|outbound, success=true|false
sdmcli_link_count (Gauge)
  Current number of active links
sdmcli_link_latency (Gauge)
  Round-trip network latency (in seconds) to a peer gateway
  Labels: peer_id=<UUID_OF_GATEWAY>, peer_addr=<HOST:PORT_OF_GATEWAY>
sdmcli_node_heartbeat_duration (Histogram)
  Count and duration of heartbeat attempts from the gateway or relay to the StrongDM backend
sdmcli_node_heartbeat_error_count (Counter)
  Total number of times a heartbeat attempt has failed
  Labels: error=invalid operation|permission denied|item already exists|item does not exist|internal error|canceled|deadline exceeded|unauthenticated|failed precondition|aborted|out of range|unimplemented|unavailable|resource exhausted
sdmcli_node_lifecycle_state_change_count (Counter)
  Total number of times the gateway or relay has changed its lifecycle state
  Labels: state=verifying_restart|awaiting_restart|restarting|started|stopped
sdmcli_query_completed_count (Counter)
  Number of queries processed by the gateway or relay
  Labels: type=<RESOURCE_TYPE> (example: type=postgres)
sdmcli_stream_count (Gauge)
  Current number of active streams
sdmcli_upload_backlog_bytes (Gauge)
  Current size of the gateway or relay’s upload backlog in bytes
  Labels: type=query_batch|chunk
sdmcli_upload_bytes (Counter)
  Number of bytes the gateway or relay has attempted to upload
  Labels: type=query_batch|chunk, successful=true|false
sdmcli_upload_count (Counter)
  Number of query batches and chunks the gateway or relay has attempted to upload
  Labels: type=query_batch|chunk, successful=true|false
sdmcli_upload_dropped_count (Counter)
  Number of uploads the gateway or relay has given up retrying
  Labels: type=query_batch|chunk
sdmcli_upload_retried_count (Counter)
  Number of uploads the gateway or relay has retried
  Labels: type=query_batch|chunk
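
When scraped, metrics are returned in the standard Prometheus text exposition format. The following excerpt is purely illustrative; the metric names come from the list above, but the values, labels, and help text shown here are placeholders, not real output:

# HELP sdmcli_stream_count Current number of active streams
# TYPE sdmcli_stream_count gauge
sdmcli_stream_count 4
# HELP sdmcli_query_completed_count Number of queries processed by the gateway or relay
# TYPE sdmcli_query_completed_count counter
sdmcli_query_completed_count{type="postgres"} 128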

Prerequisites

Before you begin configuration, ensure that you have the following:

  • StrongDM client version 34.96.0 or higher
  • A StrongDM account with the Administrator permission level
  • A StrongDM gateway or relay up and running
  • Existing accounts and familiarity with the following:
    • A monitoring system and time series database, such as Prometheus
    • A monitoring dashboard, such as Grafana
    • An alerting tool, such as Prometheus Alertmanager or Rapid7

Configuration Example

You can use the /metrics endpoint to feed metrics into any monitoring solution. This particular example shows how to enable Prometheus-formatted metrics on a gateway or relay; the steps are provided as an example only and may differ from your own setup.

Configuration involves these general steps:

1. Enable Prometheus-formatted metrics on your gateway or relay

This section explains the various ways to enable Prometheus-formatted metrics on your gateway or relay. You need to specify the port, and optionally the IP address, for the gateway or relay to listen on. To do so, set the SDM_METRICS_LISTEN_ADDRESS environment variable (with or without an IP address), or pass the setting as a command-line argument.

Once metrics are enabled, the gateway or relay starts listening on the specified port.

Enable metrics using environment variable with port

Set the SDM_METRICS_LISTEN_ADDRESS environment variable in the gateway or relay’s environment so that it listens on port 9999:

SDM_METRICS_LISTEN_ADDRESS=:9999
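
For example, if you launch the relay manually from a shell, a minimal sketch looks like the following; adapt it to however your gateway or relay process is actually started, such as a systemd unit or container entrypoint:

export SDM_METRICS_LISTEN_ADDRESS=:9999
sdm relay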

Enable metrics using environment variable with IP and port

To specify an IP address to listen on, set the variable to the IP address and port, as in the following example:

SDM_METRICS_LISTEN_ADDRESS=127.0.0.1:9999

Enable metrics using CLI setting

The following example shows how to pass the metrics setting as a command-line argument:

sdm relay --prometheus-metrics=:9999
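
Whichever method you use, you can confirm that the listener is up by requesting the endpoint from the gateway or relay host. The address below assumes the loopback interface and the port used in the examples above:

curl -s http://127.0.0.1:9999/metrics | head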

2. Configure Prometheus

  1. Open your Prometheus configuration YAML file (typically prometheus.yml) for editing.

  2. In the scrape_configs section, add jobs for each gateway or relay, as in the following example:

    scrape_configs:
      - job_name: "StrongDM Relay 01"
        static_configs:
          - targets: ["<RELAY_BOX_URL>:9999"]
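
     If you monitor multiple gateways or relays, you can add one job per node or list several targets under a single job. The following sketch extends the example above; the job name, target hostnames, and 15-second scrape interval are assumptions to adapt to your environment:

    global:
      scrape_interval: 15s

    scrape_configs:
      - job_name: "strongdm-nodes"
        static_configs:
          - targets:
              - "<GATEWAY_01_HOST>:9999"
              - "<RELAY_01_HOST>:9999"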
    

3. Set up your monitoring dashboard

Configure a monitoring dashboard such as Grafana to visualize your Prometheus metrics. For information on creating a Prometheus data source in Grafana, please see the Prometheus documentation.
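
As a starting point, a dashboard panel might graph the number of active streams per node or the rate of failed uploads. The following PromQL queries are illustrative sketches built from the metrics in the Metrics section above; adjust the label matchers to your environment:

sum by (instance) (sdmcli_stream_count)
rate(sdmcli_upload_count{successful="false"}[5m])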

4. Set up alerts

Configure your desired alerts in a tool such as Prometheus Alertmanager or Rapid7 to ensure reliability and stay aware of gateway and relay performance issues.

You may, for example, wish to set alerts for gateway health, resource health and reachability, new gateways that fail to connect, and connected gateways that disconnect.
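
For example, the following Prometheus alerting rule is a minimal sketch that fires when a gateway or relay stops responding to metrics scrapes. The group name, job label (taken from the earlier scrape configuration), five-minute window, and severity label are assumptions to tune for your environment:

groups:
  - name: strongdm-nodes
    rules:
      - alert: StrongDMNodeDown
        expr: up{job="StrongDM Relay 01"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "StrongDM gateway or relay has stopped responding to metrics scrapes"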

How to Request Metrics

After configuration is complete, you can request metrics from the gateway or relay on the specified port by accessing the /metrics endpoint.

For example:

curl http://127.0.0.1:9999/metrics
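
To narrow the output to StrongDM-specific metrics, you can filter on the sdmcli_ prefix:

curl -s http://127.0.0.1:9999/metrics | grep sdmcli_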