Are your team Prometheus experts?
What do you mean by monitoring? Why do you need it? What are the real needs and are you monitoring them? Ask yourself these questions. Can you answer them? If not, you’re probably doing monitoring wrong.
This post asks the basic question: what is monitoring, and how does it compare to logging and tracing? Let’s find out.
We use logging to represent state transformations within an application. When things go wrong, we need logs to establish what change in state caused the error.
But obtaining, transferring, storing and parsing logs is expensive, so it is crucial to log only what is necessary. Only logs that can be acted upon should be stored. Log only actionable information.
This usually results in two types of data: panic-level information for humans and structured data for machines. I would also question whether you really need the structured data, although there are legitimate use cases, e.g. security.
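As a rough illustration of that split, here is a minimal sketch using only Python's standard `logging` module. The logger name `payments`, the `JsonFormatter` class, the `build_logger` helper and the two in-memory streams are all hypothetical names chosen for this example; a real service would write to stderr, a file or a log shipper instead.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line for machine consumers."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


def build_logger(name, human_stream, machine_stream):
    """Hypothetical helper: one logger, two audiences."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False   # keep the example isolated from the root logger
    logger.handlers.clear()    # avoid duplicate handlers if called twice

    # Humans only see panic-level events, in a plain readable format.
    human = logging.StreamHandler(human_stream)
    human.setLevel(logging.ERROR)
    human.setFormatter(logging.Formatter("%(levelname)s %(name)s: %(message)s"))

    # Machines receive structured records, e.g. for a security audit trail.
    machine = logging.StreamHandler(machine_stream)
    machine.setLevel(logging.INFO)
    machine.setFormatter(JsonFormatter())

    logger.addHandler(human)
    logger.addHandler(machine)
    return logger


# In-memory streams stand in for real destinations (assumption for the demo).
human_out, machine_out = io.StringIO(), io.StringIO()
log = build_logger("payments", human_out, machine_out)

log.info("login failed for user_id=%s", 42)   # structured stream only
log.critical("database unreachable")          # both streams
```

The point of the sketch is the routing, not the formats: the human stream stays quiet until something needs attention, while the structured stream exists only because a concrete consumer needs it.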
A trace represents a single user’s journey through the entire stack of an application. It is often used for optimisation purposes. For example, you would use it to identify little-used parts of a stack or bottlenecks within specific parts of the stack.
But it adds significant complexity. Tracing often requires a significant amount of implementation code, and it is often designed as a push model, which means that applications can be affected by load on the monitoring system.
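To make the idea of a trace and the push model concrete, here is a minimal sketch in plain Python. It is not how any particular tracing library works: `InMemoryCollector`, `span` and the span names are hypothetical, and the collector's `push` method stands in for what would normally be a network call to a tracing backend.

```python
import time
import uuid
from contextlib import contextmanager


class InMemoryCollector:
    """Hypothetical stand-in for a remote tracing backend."""

    def __init__(self):
        self.spans = []

    def push(self, span_record):
        # In a real push-based system this would be a network call made by
        # the application, so a slow collector can slow the application down.
        self.spans.append(span_record)


@contextmanager
def span(collector, trace_id, name):
    """Time one step of a request and push the result when the step ends."""
    start = time.perf_counter()
    try:
        yield
    finally:
        collector.push({
            "trace_id": trace_id,
            "name": name,
            "duration_s": time.perf_counter() - start,
        })


collector = InMemoryCollector()
trace_id = uuid.uuid4().hex  # one ID follows the request through every layer

with span(collector, trace_id, "handle_request"):
    with span(collector, trace_id, "query_database"):
        time.sleep(0.01)  # simulated slow step
    with span(collector, trace_id, "render_response"):
        pass

# Ignore the root span, which contains the others, to find the bottleneck.
children = [s for s in collector.spans if s["name"] != "handle_request"]
slowest = max(children, key=lambda s: s["duration_s"])
print(slowest["name"])
```

Even in this toy version every layer has to be wrapped by hand and has to carry the trace ID along, which hints at where the implementation cost comes from.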
The libraries intended to simplify tracing are often more complicated than the code they are serving.
Therefore tracing tends to be quite expensive. Think long and hard about whether the added complexity is warranted. Are you falling into the trap of premature optimisation? Is optimisation that important when you could just scale horizontally?
Instrumenting an application and monitoring the results represents the use of a system. It is most often used for diagnostic purposes. For example, we would use monitoring systems to alert developers when the system is not operating “normally”.
Instrumentation tends to be very cheap to compute. Metrics take nanoseconds to update and some monitoring systems operate on a “pull” model, which means that the service is not affected by monitoring load.
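The sketch below shows why instrumentation is cheap and what "pull" means in practice, using only the Python standard library. It is an illustration under assumptions, not a replacement for an official Prometheus client library: the `Counter` class, the metric name `app_requests_total`, the `MetricsHandler` class and port 8000 are all hypothetical. The output follows the Prometheus text exposition format.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class Counter:
    """Hand-rolled counter (assumption: stands in for a client library)."""

    def __init__(self, name, help_text):
        self.name = name
        self.help_text = help_text
        self._value = 0
        self._lock = threading.Lock()

    def inc(self, amount=1):
        # Updating a metric is just an in-memory increment.
        with self._lock:
            self._value += amount

    def render(self):
        # Text is only produced when a scraper asks for it.
        with self._lock:
            value = self._value
        return (
            f"# HELP {self.name} {self.help_text}\n"
            f"# TYPE {self.name} counter\n"
            f"{self.name} {value}\n"
        )


REQUESTS = Counter("app_requests_total", "Requests handled by the service.")


def handle_request():
    """Hypothetical business logic that counts each call."""
    REQUESTS.inc()


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the current metric values when the monitoring system pulls."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = REQUESTS.render().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve_metrics(port=8000):
    """Blocks forever; run in a background thread next to the service."""
    HTTPServer(("", port), MetricsHandler).serve_forever()


for _ in range(3):
    handle_request()
print(REQUESTS.render(), end="")
```

The service never contacts the monitoring system. It only keeps numbers in memory and answers when scraped, so a slow or overloaded monitoring system does not sit in the request path.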
Generally the more data you have, the more useful monitoring becomes.
So typically you would want to instrument all of your services. But make sure you pick a simple, scalable monitoring system like Prometheus!
In short, Monitoring != Logging != Tracing.