Prometheus Metrics and API structure
— monitoring, salesforce — 5 min read
This is part of a series of blog posts about advanced monitoring techniques for your Salesforce.com application, so your business can rely on the tech team.
So far, we've covered:
- A case for better monitoring
- Slack alerts for Salesforce.com
- Taking it to the next level for monitoring the Salesforce.com Enterprise
- Installing Prometheus on Heroku
Alright, so we've got Prometheus up and running. Now let's get it pulling data from a Salesforce instance. That's right, Prometheus calls an API to collect metrics. It uses a pull mechanism, rather than asking us to push data to it.
In a way, that's easier for us, because the only major thing it affects is our API limits. Pushing data to Prometheus by invoking its API would be tricky in Salesforce, since we'd need a job that collects metrics and calls an API frequently. Even if we separate the concerns, collecting the metrics and storing them as records of an object, we'd still have to call an API frequently. Best case, we'd do that every few minutes with no guarantee that it actually runs (the queue might be busy!). Worst case, we'd have a perpetual job that keeps re-scheduling itself, occupying one job slot all the time.
Prometheus also has an FAQ about why it thinks pulling is better than pushing, if you are interested.
Metrics
Let's begin by understanding what we want to send to Prometheus: metrics.
Straight from Prometheus's documentation, here's a very good definition:
In layperson terms, metrics are numeric measurements, time series mean that changes are recorded over time
It can be the total storage space, the number of processes running, the percentage of CPU used, etc.
In Salesforce, some of the org-specific metrics are defined by the governor limits themselves, like how many API calls were made in the last 24 hours, total data usage, the number of batch jobs that ran, etc.
In Prometheus's data model, each metric is stored in a time-series database. You can read more about it here, but in short, it means that the data is stored in time order, which gives us more control over querying it and aggregating the results in a nice way.
A metric follows this notation:
<metric name>{<label name>=<label value>, ...} <optional_value>
Ex:
# => total num of api requests of the type "POST" = 2308
api_http_requests_total{method="POST"} 2308
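To make the notation concrete, here's a minimal Python sketch that builds one such line from a name, a set of labels, and a value. The `format_sample` helper is hypothetical (it isn't part of Prometheus or any client library), and it assumes the usual escaping of backslashes, double quotes, and newlines inside label values.

```python
def format_sample(name: str, labels: dict, value) -> str:
    """Render one sample as: name{label="value",...} value"""
    if labels:
        parts = []
        for key, raw in labels.items():
            # Label values are quoted, so escape backslash, quote and newline.
            escaped = (
                str(raw)
                .replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
            )
            parts.append(f'{key}="{escaped}"')
        label_block = "{" + ",".join(parts) + "}"
    else:
        # A metric without labels is just the name followed by the value.
        label_block = ""
    return f"{name}{label_block} {value}"


line = format_sample("api_http_requests_total", {"method": "POST"}, 2308)
print(line)  # api_http_requests_total{method="POST"} 2308
```

Adding a second label is just another dictionary entry, for example `{"method": "POST", "status": "404"}`.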
The example above gives the current value of the total number of POST API requests as 2308. The total suffix means it's a running total, counted since the application exposing the metric started up (or since the counter was last reset).
Each metric can have labels attached to it, which greatly helps us understand more about the metric. In our example, we can see that the number of POST API requests is captured. We could even add a status=404 or status=200 label if we wish to capture the responses too.
All of this allows us to build detailed monitoring rules, like asking to be alerted when the number of 404s crosses 30% of total API requests.
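The arithmetic behind such a rule is simple. In Prometheus itself, alerting rules are written in its query language, so the Python below is only an illustration of the calculation. The `error_share` helper, the sample counts, and the way the 30% threshold is applied are all assumptions made for this sketch.

```python
def error_share(counts_by_status: dict, status: str = "404") -> float:
    """Return the fraction of all requests that ended with `status`."""
    total = sum(counts_by_status.values())
    if total == 0:
        # No traffic yet, so there is nothing to alert on.
        return 0.0
    return counts_by_status.get(status, 0) / total


# Hypothetical request counts, keyed by the value of a `status` label.
counts = {"200": 65, "404": 35}
share = error_share(counts)
should_alert = share > 0.30  # alert once 404s cross 30% of all requests
print(share, should_alert)
```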
Sample journey
Let's take a look at a sample sequence of calls to Prometheus and see how it all works.
We know that every time the API is called, the application gives out the total number of API requests. Let's say at the beginning this is 0. Let's assume that the scrape_interval is 1 minute, that is, Prometheus calls this API every minute to collect metrics. Here's what some sample responses look like:
First time - T1
api_http_requests_total{method="POST"} 0
Assuming 21 API calls are made in the next minute.
Second time - T61
api_http_requests_total{method="POST"} 21
Assuming 25 API calls are made in the minute after that.
Third time - T121
api_http_requests_total{method="POST"} 46
Assuming 30 API calls are made in the minute after that.
Fourth time - T181
api_http_requests_total{method="POST"} 76
As you can see, the number keeps growing. This type of metric is called a Counter.
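Since a counter only gives us the running total, the per-minute numbers have to be derived from consecutive scrapes. Here's a minimal Python sketch of that idea, using the values from the journey above. The `increases` function is hypothetical and much simpler than what Prometheus's query functions actually do, but it shows the principle, including what happens when the counter resets to zero after a restart.

```python
def increases(samples: list) -> list:
    """Return the increase between each pair of consecutive counter samples."""
    deltas = []
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            deltas.append(current - previous)
        else:
            # The counter went down, so the application must have restarted
            # and begun counting from zero again.
            deltas.append(current)
    return deltas


scraped = [0, 21, 46, 76]  # values seen at T1, T61, T121, T181
print(increases(scraped))  # [21, 25, 30]
```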
Types of Metrics
There are different types of metrics and I've summarized them here.
- 🔶 Counter: It represents a single monotonically increasing counter. Its value can only increase, or be reset to zero on restart. Use it to represent the number of API requests served, the number of error logs generated, etc.
- 🔶 Gauge: It represents a numerical value that can go up and down. For example, you can use it to track remaining storage space, rolling 24-hour limits, etc.
- 🔶 Histogram: Both Histogram and Summary sample observations and track the number of observations and the sum of observed values, which lets us compute averages. A Histogram counts the observations in configurable buckets. Please take a look at the documentation linked above for examples and the differences between them.
- 🔶 Summary: See above. Instead of buckets, it calculates configurable quantiles over a sliding time window.
For our examples, we'll use Counter and Gauge.
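To see how the two differ in behaviour, here's a small Python sketch. These `Counter` and `Gauge` classes are hypothetical stand-ins written for this post, not the official client library: the counter refuses to go down, while the gauge can be set, raised, or lowered freely.

```python
class Counter:
    """A value that only ever goes up (it starts again from zero on restart)."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("a counter can only increase")
        self.value += amount


class Gauge:
    """A value that can go up and down."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value

    def inc(self, amount=1.0):
        self.value += amount

    def dec(self, amount=1.0):
        self.value -= amount


api_requests = Counter()
api_requests.inc()      # one request served
api_requests.inc(20)    # twenty more

storage_left_mb = Gauge()
storage_left_mb.set(512)
storage_left_mb.dec(12)  # some space got used up

print(api_requests.value, storage_left_mb.value)  # 21.0 500.0
```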
API structure
Metrics are exposed to Prometheus using a text-based format in the API:
- each metric in a separate line (\n)
- empty lines are ignored
- last line must end with a line feed character (\n)
- Content-Type should be text/plain
Sample API response
# HELP http_requests_total The total number of HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="post",code="200"} 1027 1395066363000
http_requests_total{method="post",code="400"} 3 1395066363000
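Here's a rough Python sketch of how a response body like this could be put together. The `render_metric` function and its parameters are made up for illustration, and it skips label escaping and the optional trailing timestamp to keep things short. Note that the body ends with a line feed, as the format requires.

```python
def render_metric(name, help_text, metric_type, samples):
    """Render one metric family in the text format.

    `samples` is a list of (labels, value) pairs, where labels is a dict.
    """
    lines = [
        f"# HELP {name} {help_text}",
        f"# TYPE {name} {metric_type}",
    ]
    for labels, value in samples:
        if labels:
            label_block = ",".join(f'{k}="{v}"' for k, v in labels.items())
            lines.append(f"{name}{{{label_block}}} {value}")
        else:
            lines.append(f"{name} {value}")
    # One line per entry, and the last line must end with a line feed too.
    return "\n".join(lines) + "\n"


body = render_metric(
    "http_requests_total",
    "The total number of HTTP requests.",
    "counter",
    [
        ({"method": "post", "code": "200"}, 1027),
        ({"method": "post", "code": "400"}, 3),
    ],
)
print(body, end="")
```

Whatever serves this string would send it with a Content-Type of text/plain.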
That's it for now! In the next chapter, we'll take a stab at building an API in Salesforce that'll send metrics to Prometheus.