Prometheus Metrics and API structure

26.10.2021 — monitoring, salesforce — 5 min read

This is part of a series of blog posts about advanced monitoring techniques for your Salesforce.com application, so your business can rely on the tech team.

So far, we've covered:

Alright, so we've got Prometheus up and running. Now let's get it pulling data from a Salesforce instance. That's right, Prometheus calls an API to collect metrics. It uses a pull mechanism rather than asking us to push data to it.

In a way, that's easier for us, because the only major thing a pull model impacts is our API limits. Pushing data by invoking its API would be tricky in Salesforce, because we'd need a job that collects metrics and calls an API frequently. Even if we separate the concerns, collecting metrics and storing them as records of an object, we'd still have to call an API frequently. Best case, we'd do that every few minutes with no guarantee that the job will actually run (the queue might be busy!). Worst case, we'd have a perennial job that keeps re-scheduling itself, occupying one job slot all the time.

Prometheus also has an FAQ about why it thinks pulling is better than pushing, if you are interested.

Metrics

Let's begin with understanding what we want to send to Prometheus - metrics.

Straight from Prometheus's documentation, here's a very good definition:

In layperson terms, metrics are numeric measurements. Time series means that changes are recorded over time.

It can be the total storage space, the number of processes running, the percentage of CPU used, etc.

In Salesforce, some of the org-specific metrics are defined by the governor limits themselves, like how many API calls were made in the last 24 hours, total data usage, the number of batch jobs that ran, etc.

In Prometheus's data model, each metric is stored in a time-series database. You can read more about it here, but in short, it means that data is stored in time order, which gives us more control when querying it and aggregating the results.

A metric follows this notation:

<metric name>{<label name>=<label value>, ...} <value>

Ex:
# => total num of api requests of the type "POST" = 2308
api_http_requests_total{method="POST"} 2308

The example above gives the current value of the total number of POST API requests as 2308. The _total suffix means it's a running total since the application exposing the metric started up (or since the counter was last reset).

Each metric can have labels attached to it, which greatly help us understand more about the metric. In our example, the method label tells us that only POST API requests are counted. We could even add a status="404" or status="200" label if we wish to capture the responses too.
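To make the notation concrete, here's a minimal Python sketch that builds one sample line from a metric name, a set of labels and a value. The format_sample helper and the status example value of 12 are made up for illustration, and the sketch assumes label values don't contain quotes, backslashes or newlines (those would need escaping).

```python
def format_sample(name, labels, value):
    """Render one sample line, e.g. name{key="value"} 42.

    Hypothetical helper for illustration; label values are assumed
    to need no escaping.
    """
    if labels:
        # Sorting the labels keeps the output deterministic.
        pairs = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
        return f"{name}{{{pairs}}} {value}"
    return f"{name} {value}"


# The example from the text, then the same metric with an extra status label.
print(format_sample("api_http_requests_total", {"method": "POST"}, 2308))
print(format_sample("api_http_requests_total", {"method": "POST", "status": "404"}, 12))
```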

All of this allows us to build detailed monitoring rules, like asking to be alerted when the number of 404s crosses 30% of total API requests.
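In Prometheus itself, a rule like that would be written as an alerting rule over the stored time series. Purely to illustrate the arithmetic behind it, here's a small Python sketch. The error_ratio function and the counts are made-up assumptions, not anything Prometheus provides.

```python
def error_ratio(counts_by_status, status="404"):
    """Share of requests that ended with the given status (0.0 if no requests)."""
    total = sum(counts_by_status.values())
    if total == 0:
        return 0.0
    return counts_by_status.get(status, 0) / total


# Made-up request counts, keyed by the value of a "status" label.
counts = {"200": 60, "404": 35, "500": 5}

ratio = error_ratio(counts)
should_alert = ratio > 0.30  # the 30% threshold from the example above
print(f"404 ratio: {ratio:.0%}, alert: {should_alert}")
```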

Sample journey

Let's take a look at a sample sequence of calls to Prometheus and see how it all works.

We know that every time the API is called, the application returns the total number of API requests. Let's say at the beginning this is 0. Let's assume that the scrape_interval is 1 minute, that is, Prometheus calls this API every minute to collect metrics. Here's what some sample responses look like:

First time - T1

api_http_requests_total{method="POST"} 0

Assuming there are 21 API calls made in the first minute. Second time - T61

api_http_requests_total{method="POST"} 21

Assuming there are 25 API calls made in the last minute. Third time - T121

api_http_requests_total{method="POST"} 46

Assuming there are 30 API calls made in the last minute. Fourth time - T181

api_http_requests_total{method="POST"} 76

As you can see, the number keeps growing. This type of metric is called a Counter.
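Since a counter only reports the running total, the per-minute activity has to be derived from consecutive scrapes. Here's a rough Python sketch of that idea using the numbers from the journey above. The increases function is a simplified illustration and not how Prometheus is implemented. Treating a drop as a reset to zero is an assumption of this sketch.

```python
def increases(samples):
    """Return the increase between each pair of consecutive counter readings."""
    deltas = []
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            deltas.append(current - previous)
        else:
            # The counter went down, so assume it was reset to zero
            # (for example after a restart) and count from there.
            deltas.append(current)
    return deltas


scraped = [0, 21, 46, 76]  # values seen at T1, T61, T121 and T181
print(increases(scraped))  # [21, 25, 30]
```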

Types of Metrics

There are different types of metrics and I've summarized them here.

🔶 Counter
It represents a single monotonically increasing counter. It can only increase in value or be reset to zero on restart. Use this to represent the number of API requests served, the number of error logs generated, etc.
🔶 Gauge
It represents a numerical value that can go up and down. For example, you can use it to track remaining storage space, rolling 24-hour limits, etc.
🔶 Histogram
Both Histogram and Summary sample observations and track the number of observations and the sum of observed values, which allows us to track averages. A Histogram counts observations in configurable buckets, while a Summary calculates configurable quantiles over a sliding time window. Please take a look at the Prometheus documentation for examples and the differences between them.
🔶 Summary
See Histogram above.

For our examples, we'll use Counter and Gauge.
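To show the difference in behaviour between the two, here's a minimal Python sketch. The Counter and Gauge classes below are hand-rolled for illustration only (they are not a Prometheus client library), and the storage numbers are made up.

```python
class Counter:
    """A value that only goes up while the process is running."""

    def __init__(self):
        self.value = 0

    def inc(self, amount=1):
        if amount < 0:
            raise ValueError("a counter can only increase")
        self.value += amount


class Gauge:
    """A value that can go up and down."""

    def __init__(self):
        self.value = 0

    def set(self, value):
        self.value = value


# Counter: accumulate the API requests seen so far.
api_requests = Counter()
api_requests.inc(21)
api_requests.inc(25)

# Gauge: overwrite with the latest reading each time (hypothetical MB values).
remaining_storage_mb = Gauge()
remaining_storage_mb.set(512)
remaining_storage_mb.set(498)

print(api_requests.value, remaining_storage_mb.value)
```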

API structure

Metrics are exposed to Prometheus using a text-based format in the API:

each metric sample is on a separate line, with lines separated by a line feed character (\n)
empty lines are ignored
the last line must end with a line feed character
Content-Type should be text/plain

Sample API response

# HELP http_requests_total The total number of HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="post",code="200"} 1027 1395066363000
http_requests_total{method="post",code="400"}    3 1395066363000
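As a rough sketch of how a response body like this could be assembled, here's a small Python function. The render_metric name and its parameters are hypothetical, the optional timestamps are left out, and label values are assumed to need no escaping.

```python
def render_metric(name, help_text, metric_type, samples):
    """Build the text for one metric.

    samples is a list of (labels_dict, value) tuples.
    """
    lines = [
        f"# HELP {name} {help_text}",
        f"# TYPE {name} {metric_type}",
    ]
    for labels, value in samples:
        if labels:
            pairs = ",".join(f'{key}="{val}"' for key, val in labels.items())
            lines.append(f"{name}{{{pairs}}} {value}")
        else:
            lines.append(f"{name} {value}")
    # Lines are separated by \n, and the last line ends with one too.
    return "\n".join(lines) + "\n"


body = render_metric(
    "http_requests_total",
    "The total number of HTTP requests.",
    "counter",
    [
        ({"method": "post", "code": "200"}, 1027),
        ({"method": "post", "code": "400"}, 3),
    ],
)
print(body, end="")
```

The resulting string would then be returned as the response body with a text/plain Content-Type.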

That's it for now! In the next chapter, we'll take a stab at building an API in Salesforce that'll expose metrics to Prometheus.