Metrics

The iota-metrics crate defines a Metrics struct with various IntGaugeVec metrics to monitor running tasks, pending futures, channel sizes, and scope activities. A gauge is a type of metric that represents a single numerical value that can go up or down.

#[derive(Debug)]
pub struct Metrics {
    pub tasks: IntGaugeVec,
    pub futures: IntGaugeVec,
    pub channels: IntGaugeVec,
    pub scope_iterations: IntGaugeVec,
    pub scope_duration_ns: IntGaugeVec,
    pub scope_entrance: IntGaugeVec,
}

The Metrics::new function creates each gauge as an IntGaugeVec and registers it with the given prometheus::Registry.

impl Metrics {
    fn new(registry: &Registry) -> Self {
        Self {
            tasks: register_int_gauge_vec_with_registry!(
                "monitored_tasks",
                "Number of running tasks per callsite.",
                &["callsite"],
                registry,
            )
            .unwrap(),
            futures: register_int_gauge_vec_with_registry!(
                "monitored_futures",
                "Number of pending futures per callsite.",
                &["callsite"],
                registry,
            )
            .unwrap(),
            channels: register_int_gauge_vec_with_registry!(
                "monitored_channels",
                "Size of channels.",
                &["name"],
                registry,
            )
            .unwrap(),
            // More metrics initialized here..
        }
    }
}
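To make the gauge semantics and the per-label bookkeeping concrete, the following is a minimal sketch that uses only the standard library. ToyGaugeVec and the label values are hypothetical names chosen for illustration. It is a simplified model of a labelled integer gauge, not the prometheus crate's IntGaugeVec.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Simplified stand-in for an integer gauge vector: one signed value per label.
/// This is a model for illustration, NOT the prometheus crate's IntGaugeVec.
#[derive(Default)]
struct ToyGaugeVec {
    values: Mutex<HashMap<String, i64>>,
}

impl ToyGaugeVec {
    /// Adds `delta` (which may be negative) to the gauge for `label`.
    fn add(&self, label: &str, delta: i64) {
        let mut values = self.values.lock().unwrap();
        *values.entry(label.to_string()).or_insert(0) += delta;
    }

    fn inc(&self, label: &str) {
        self.add(label, 1);
    }

    fn dec(&self, label: &str) {
        self.add(label, -1);
    }

    /// Returns the current value for `label`, or 0 if it was never touched.
    fn get(&self, label: &str) -> i64 {
        self.values.lock().unwrap().get(label).copied().unwrap_or(0)
    }
}

/// Two tasks start at one (hypothetical) callsite and one of them finishes;
/// a second callsite is never touched.
fn running_tasks_demo() -> (i64, i64) {
    let tasks = ToyGaugeVec::default();
    tasks.inc("handler.rs:10");
    tasks.inc("handler.rs:10");
    tasks.dec("handler.rs:10");
    (tasks.get("handler.rs:10"), tasks.get("other.rs:1"))
}
```

Unlike a counter, the value goes down again when a task finishes, which is why a gauge suits quantities such as running tasks or channel sizes.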

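The Metrics instance is stored in a global OnceCell, which ensures that it is initialized at most once and can be accessed safely from multiple threads. If init_metrics is called again, OnceCell::set returns an error, the existing Metrics instance is kept, and a warning is logged. Typically, the init_metrics function is called at the beginning of a main function or in a test setup.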

static METRICS: OnceCell<Metrics> = OnceCell::new();

pub fn init_metrics(registry: &Registry) {
    let _ = METRICS
        .set(Metrics::new(registry))
        // this happens many times during tests
        .tap_err(|_| warn!("init_metrics registry overwritten"));
}

pub fn get_metrics() -> Option<&'static Metrics> {
    METRICS.get()
}
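The at-most-once behavior can be reproduced with the standard library's OnceLock, whose set method, like OnceCell::set, returns an error when a value is already present. The sketch below is an illustration under that substitution. ToyMetrics, init_toy_metrics, and get_toy_metrics are hypothetical stand-ins, not part of the crate.

```rust
use std::sync::OnceLock;

/// Hypothetical stand-in for the crate's Metrics struct.
#[derive(Debug)]
struct ToyMetrics {
    registry_name: &'static str,
}

static TOY_METRICS: OnceLock<ToyMetrics> = OnceLock::new();

/// Tries to initialize the global; returns true only if this call set the value.
fn init_toy_metrics(registry_name: &'static str) -> bool {
    TOY_METRICS.set(ToyMetrics { registry_name }).is_ok()
}

/// Returns the global value, or None if it was never initialized.
fn get_toy_metrics() -> Option<&'static ToyMetrics> {
    TOY_METRICS.get()
}

/// Initializes twice and reports which call succeeded and which value survived.
fn init_twice_demo() -> (bool, bool, &'static str) {
    let first = init_toy_metrics("first");
    let second = init_toy_metrics("second");
    let name = get_toy_metrics().map(|m| m.registry_name).unwrap_or("");
    (first, second, name)
}
```

The second call does not replace anything: the value from the first call stays in place, so callers of the getter always see the same instance.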

To monitor futures and tasks consistently, the crate defines macros such as monitored_future! and spawn_monitored_task!, which wrap a given future or task and update the corresponding gauges. These macros call the global get_metrics function to retrieve the Metrics struct.

To monitor code scopes, the monitored_scope function creates a named scope that keeps track of:

The number of times the scope is entered, in the monitored_scope_iterations metric.
The total time spent in the scope, in the monitored_scope_duration_ns metric. When the scope guard is dropped, the elapsed time is added to this metric and the scope_entrance gauge is decremented, as shown below:

impl Drop for MonitoredScopeGuard {
    fn drop(&mut self) {
        self.metrics
            .scope_duration_ns
            .with_label_values(&[self.name])
            .add(self.timer.elapsed().as_nanos() as i64);
        self.metrics
            .scope_entrance
            .with_label_values(&[self.name])
            .dec();
    }
}
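The Drop implementation above is one half of a guard pattern. The sketch below models the whole lifecycle with standard-library atomics. The entry behavior (incrementing the iteration count and the entrance gauge when the guard is created) is an assumption inferred from the Drop code, and all Toy names are hypothetical.

```rust
use std::sync::atomic::{AtomicI64, Ordering};
use std::time::Instant;

/// Hypothetical std-only counters standing in for the three scope gauges.
#[derive(Default)]
struct ToyScopeMetrics {
    iterations: AtomicI64,
    duration_ns: AtomicI64,
    entrance: AtomicI64,
}

/// Guard that records the scope's duration when it goes out of scope.
struct ToyScopeGuard<'a> {
    metrics: &'a ToyScopeMetrics,
    timer: Instant,
}

/// Assumed entry behavior: count the iteration and mark the scope as entered.
fn toy_monitored_scope(metrics: &ToyScopeMetrics) -> ToyScopeGuard<'_> {
    metrics.iterations.fetch_add(1, Ordering::Relaxed);
    metrics.entrance.fetch_add(1, Ordering::Relaxed);
    ToyScopeGuard { metrics, timer: Instant::now() }
}

impl Drop for ToyScopeGuard<'_> {
    fn drop(&mut self) {
        // Same two steps as the Drop impl above: add elapsed time, leave the scope.
        let elapsed = self.timer.elapsed().as_nanos() as i64;
        self.metrics.duration_ns.fetch_add(elapsed, Ordering::Relaxed);
        self.metrics.entrance.fetch_sub(1, Ordering::Relaxed);
    }
}

/// Enters the scope three times and returns
/// (iterations, entrance while inside, entrance afterwards, total duration).
fn scope_demo() -> (i64, i64, i64, i64) {
    let metrics = ToyScopeMetrics::default();
    let mut inside = 0;
    for _ in 0..3 {
        // Bind to `_scope`, not `_`, so the guard lives until the end of the block.
        let _scope = toy_monitored_scope(&metrics);
        inside = metrics.entrance.load(Ordering::Relaxed);
    }
    (
        metrics.iterations.load(Ordering::Relaxed),
        inside,
        metrics.entrance.load(Ordering::Relaxed),
        metrics.duration_ns.load(Ordering::Relaxed),
    )
}
```

Binding the guard to a named variable such as `_scope` matters: `let _ = ...` would drop the guard immediately and record a near-zero duration.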

Monitored scopes are used in multiple parts of the node. For example, the consensus_handler module uses monitored scopes to track the duration and number of iterations of the handle_consensus_output function:

async fn handle_consensus_output(&mut self, consensus_output: ConsensusOutput) {
    let _scope = monitored_scope("HandleConsensusOutput");
    self.handle_consensus_output_internal(consensus_output)
        .await;
}

`RegistryService`

To make Prometheus registries and their metrics easier to manage, the crate provides a RegistryService struct that holds a default registry and a collection of additional registries, each identified by a UUID:

/// A service to manage the prometheus registries. This service allows us to
/// create a new Registry on demand and keep it accessible for
/// processing/polling.
#[derive(Clone)]
pub struct RegistryService {
    // Holds a Registry that is supposed to be used
    default_registry: Registry,
    registries_by_id: Arc<DashMap<Uuid, Registry>>,
}

This RegistryService allows for the creation, addition, and removal of Prometheus registries and also provides a function to gather all metrics from all registries. It works as follows:

        // Create a default registry
        let default_registry = Registry::new_custom(Some("default".to_string()), None).unwrap();

        // Create a registry service with the default registry
        let registry_service = RegistryService::new(default_registry.clone());
        let default_counter = IntCounter::new("counter", "counter_desc").unwrap();
        default_counter.inc();
        default_registry
            .register(Box::new(default_counter))
            .unwrap();

        // Create another registry and add a metric to it
        let registry_1 = Registry::new_custom(Some("consensus".to_string()), None).unwrap();
        registry_1
            .register(Box::new(
                IntCounter::new("counter_1", "counter_1_desc").unwrap(),
            ))
            .unwrap();

        // Add the new registry to the registry service
        let registry_1_id = registry_service.add(registry_1);

        // Gather all metrics from all registries
        let mut metrics = registry_service.gather_all();
        metrics.sort_by(|m1, m2| Ord::cmp(m1.get_name(), m2.get_name()));

        // There should be two metrics
        assert_eq!(metrics.len(), 2);

        // Check the first metric
        let metric_default = metrics.remove(0);
        assert_eq!(metric_default.get_name(), "default_counter");
        assert_eq!(metric_default.get_help(), "counter_desc");

        // Check the second metric
        let metric_2 = metrics.remove(0);
        assert_eq!(metric_2.get_name(), "consensus_counter_1");
        assert_eq!(metric_2.get_help(), "counter_1_desc");
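The example above adds a registry but never removes one. The following std-only sketch models how adding, removing, and gathering could interact, including the registry prefix that turns counter into default_counter. It is a toy model under stated assumptions: the Toy types are hypothetical, a u64 counter stands in for the Uuid keys, and the method signatures are not taken from the crate.

```rust
use std::collections::HashMap;

/// Toy registry: a name prefix plus the metric names registered with it.
struct ToyRegistry {
    prefix: String,
    metric_names: Vec<String>,
}

impl ToyRegistry {
    fn new(prefix: &str) -> Self {
        Self { prefix: prefix.to_string(), metric_names: Vec::new() }
    }

    fn register(&mut self, name: &str) {
        self.metric_names.push(name.to_string());
    }

    /// Returns the registered names with the registry prefix applied.
    fn gather(&self) -> Vec<String> {
        self.metric_names
            .iter()
            .map(|name| format!("{}_{}", self.prefix, name))
            .collect()
    }
}

/// Toy service: a default registry plus extra registries keyed by an id.
/// A plain u64 counter stands in for the Uuid keys of the real struct.
struct ToyRegistryService {
    default_registry: ToyRegistry,
    registries_by_id: HashMap<u64, ToyRegistry>,
    next_id: u64,
}

impl ToyRegistryService {
    fn new(default_registry: ToyRegistry) -> Self {
        Self { default_registry, registries_by_id: HashMap::new(), next_id: 0 }
    }

    /// Stores the registry and returns the id under which it can be removed.
    fn add(&mut self, registry: ToyRegistry) -> u64 {
        let id = self.next_id;
        self.next_id += 1;
        self.registries_by_id.insert(id, registry);
        id
    }

    /// Returns true if a registry with this id existed and was removed.
    fn remove(&mut self, id: u64) -> bool {
        self.registries_by_id.remove(&id).is_some()
    }

    /// Collects metric names from the default registry and all added ones, sorted.
    fn gather_all(&self) -> Vec<String> {
        let mut names = self.default_registry.gather();
        for registry in self.registries_by_id.values() {
            names.extend(registry.gather());
        }
        names.sort();
        names
    }
}

/// Gathers once with the extra registry present and once after removing it.
fn registry_service_demo() -> (Vec<String>, Vec<String>) {
    let mut default_registry = ToyRegistry::new("default");
    default_registry.register("counter");
    let mut service = ToyRegistryService::new(default_registry);

    let mut consensus = ToyRegistry::new("consensus");
    consensus.register("counter_1");
    let id = service.add(consensus);

    let before = service.gather_all();
    service.remove(id);
    let after = service.gather_all();
    (before, after)
}
```

Keeping the id returned by add is what makes later removal possible, which is the role registry_1_id plays in the example above.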

Exposing the Prometheus Metrics

To expose the Prometheus metrics, the start_prometheus_server function creates a RegistryService, starts an axum HTTP server, and serves the metrics gathered from the registries at the /metrics endpoint.

pub fn start_prometheus_server(addr: SocketAddr) -> RegistryService {
    let registry = Registry::new();

    let registry_service = RegistryService::new(registry);

    if cfg!(msim) {
        // prometheus uses difficult-to-support features such as
        // TcpSocket::from_raw_fd(), so we can't yet run it in the simulator.
        warn!("not starting prometheus server in simulator");
        return registry_service;
    }

    let app = Router::new()
        .route(METRICS_ROUTE, get(metrics))
        .layer(Extension(registry_service.clone()));

    tokio::spawn(async move {
        axum::Server::bind(&addr)
            .serve(app.into_make_service())
            .await
            .unwrap();
    });

    registry_service
}
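The metrics handler itself is not shown here. Assuming it responds in the standard Prometheus text exposition format, a scrape of a labelled gauge would look roughly like the output of the sketch below. ToySample and render_gauge are hypothetical helpers written with the standard library only, and the label value is made up for illustration.

```rust
/// One labelled sample of a gauge family (hypothetical model type).
struct ToySample<'a> {
    label_name: &'a str,
    label_value: &'a str,
    value: i64,
}

/// Renders one gauge family in the Prometheus text exposition format.
/// Simplification: label values are not escaped, so they must not contain
/// quotes, backslashes, or newlines.
fn render_gauge(name: &str, help: &str, samples: &[ToySample]) -> String {
    let mut out = format!("# HELP {name} {help}\n# TYPE {name} gauge\n");
    for s in samples {
        out.push_str(&format!(
            "{name}{{{}=\"{}\"}} {}\n",
            s.label_name, s.label_value, s.value
        ));
    }
    out
}

/// Renders the monitored_tasks gauge with a single made-up callsite.
fn exposition_demo() -> String {
    let samples = [ToySample {
        label_name: "callsite",
        label_value: "handler.rs:10",
        value: 2,
    }];
    render_gauge(
        "monitored_tasks",
        "Number of running tasks per callsite.",
        &samples,
    )
}
```

Each metric family contributes a HELP line, a TYPE line, and one line per label combination, which is what a Prometheus server parses when it scrapes the endpoint.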

Typically, the Prometheus server is started in the main function of the node as follows:

#[tokio::main]
async fn main() -> Result<()> {
    let _guard = telemetry_subscribers::TelemetryConfig::new()
        .with_env()
        .init();

    ...

    let registry_service = iota_metrics::start_prometheus_server(
        format!(
            "{}:{}",
            config.client_metric_host, config.client_metric_port
        )
        .parse()
        .unwrap(),
    );
    let registry: Registry = registry_service.default_registry();
    iota_metrics::init_metrics(&registry);

    ...
}

In addition to being served over HTTP, the metrics can also be pushed to a remote endpoint, for example, a given push_url. The iota-node crate, for example, starts a task, start_metrics_push_task, that regularly pushes all metrics to an endpoint defined in the NodeConfig. This push task assigns the current timestamp to each metric, encodes the metrics in the Protobuf format, compresses the result, and sends the compressed data in an HTTP POST request.
