Logging is one of the most important parts of the distributed systems. Many things can break, but when the logging breaks, then we are completely lost. In this blog post, we will understand log levels and how to log efficiently in distributed systems.
Logging Levels
Whenever we log a message, we need to specify the log level or log severity. It’s an indicator of how important the message is and who should be concerned.
Log levels have been with us for a long time. Since the very early days of computer systems, we had to provide log files and define the levels. Typically, we can use the following levels:
- Error
- Warning
- Information
- Debug
- Trace
Sometimes a longer hierarchy is used and we may have additional categories like:
- Emergency
- Alert
- Critical
- Notice
We use the log levels in two ways:
- We define the level for each message that we log. In other words, when logging a message in the application source code, we need to specify the level of this particular message
- We specify which messages are stored in the log storage. In other words, when adding a new log storage (like a log file or log table), we specify the maximum log level that we want to store
No matter what levels we can use (which depends on the libraries and monitoring systems we use), we have some best practices that we should follow. Let’s go through them one by one.
Trace Level
Logs at this level typically show “everything” that happened. They are supposed to explain everything to debug any issue. This includes showing raw data packets that are sent over the network or hexadecimal representation of memory bytes.
Logs on this level may consume a lot of space due to the raw data that we log. For instance, if your application needs to send 1GB of data over the network, it would store the content of the data in the log file which would be at least 1GB.
This level may also pose a security risk as we emit everything, including passwords or sensitive information. Therefore, this level shouldn’t be used in production systems.
Messages on this level are typically useful for the developers as they help debug bugs. They are rarely helpful for database administrators and nearly never interesting for regular users.
Debug Level
This level is similar to the trace level but focuses on the curated messages helpful for programmers. On this level, we rarely store the raw content of everything. We rather focus on very detailed pieces of information showing crucial data that may help the programmers. For instance, we wouldn’t store the actual password sent over the network, but we would store the metadata of the connection and which encryption schemes are used.
Logs on this level are still very big and shouldn’t be enabled in production. They may also contain sensitive data, so they may pose a security risk. Messages are useful for programmers and are very rarely useful for others.
Information Level
Logs on this level typically show “what happens”. They just inform that something happened, typically presented from the business perspective. This is the level we typically use initially and only change it if we generate too many logs.
Messages on this level focus on business context and business signals. They very rarely show actual data sent by the users. More often they focus on the facts that happened.
Messages on this level are very helpful for the typical users of the system who want to monitor its behavior. This includes database administrators and programmers. Messages on this level are rarely helpful for regular users.
This is the level that we should use by default unless we get too many logs (which may happen if our system handles thousands of transactions per second).
Warning Level
On this level, we log messages related to potentially erroneous situations that the system handles automatically. For instance, we may inform that the user logged in with an insecure password. The system still worked correctly and it’s not an error in theory, but still, this situation requires some attention.
Logs on this level are useful for database administrators and users. For instance, they may indicate issues that can be prevented if taken care of early enough or things that may break after updating the system to the next edition.
We should always store these logs in the log file.
Error Level
This level shows situations that are errors and the system couldn’t handle them automatically. Depending on the context, one situation may be an “error” from the user's perspective and not an “error” from the system's perspective or the other way around. For instance, if the user provides an incorrect password when logging in, the user should get an error message (and so the database client should log an error), whereas the system should only log a warning message (as this is potentially a brute-force attack).
Logs on this level should always be stored in the log file. Depending on the error severity, we may want to build alerts around these logs and notify the users accordingly. Error logs are useful for users and administrators.
What to Log
When logging information, we should always include everything that “is needed” and not more. We need to find a balance between logging too much (which consumes the disk space) and logging not enough (so we can’t investigate the problem later).
Having distributed systems in mind, we should log the following:
- Timestamp - human-friendly timestamp of the message. For instance, “2024.06.21 13:08:23.125422”. It’s better to start the timestamp with the year to make sorting easier. Depending on the system we’re dealing with, we should log at least hours and minutes, but we can go as far down as nanoseconds
- Application name - the name of the application that we understand. It shouldn’t be generic like “database” but rather something with business meaning like “tickets database”
- Server identifier - when we scale to multiple servers, we should always log the server identifier to understand where the issue occurred
- Process identifier - the identifier of the process as reported by the operating system. This is helpful when we scale out our application by forking it into multiple processes
- Thread identifier - the identifier of the thread that logged the message. We may have multiple identifiers of this kind (the operating system level identifier and the runtime thread identifier) and we should log all of them. The name of the thread can be helpful as well. Thread identifiers are helpful when analyzing memory dumps.
- Log level - obviously, the level of the message
- Correlation identifier - the unique identifier of the operation we’re executing. This can be Trace ID and Span ID from the OpenTelemetry or something equivalent
- Activity - the human-friendly name of the workflow we’re executing. For instance, “ticket purchase”
- Logical time - any other ordering identifier that we have, for instance, Lamport’s happened-before time or vector time in the distributed system
- Logger ID - the identifier of the logger that we used (especially if we have multiple subsystems within the process)
There are more things that we can log depending on our needs. Consider this a good starting point.
We should typically store the logs up to the Information level in the log files. If we face issues with logs becoming too big too early or with too much disk activity due to the log writing, we should consider limiting the log files to store messages up to the Warning level.
How to Effectively Use Logging With Observability
There are some best practices that we should follow when configuring the logging. Let’s see them.
Rolling Old Files
We typically store the log messages in the log files. They may easily become very big (in tens or hundreds of gigabytes), so we need to roll the files often. Typically, each day or each hour.
After we roll the file, we should compress the old file and send it over to a centralized place. Rolling files each hour gives us some more reliability (as in the worst case we will lose one hour of logs) but also makes the number of files higher.
Asynchronous Logging and Centralization
Logging is expensive. When your system handles thousands of transactions each second, we simply can’t log all the data synchronously. Therefore, always offload logging to a background thread and log asynchronously.
However, logging asynchronously may break the order of logs. To fix that, always include logical ordering ID (like Lamport’s happened-before) to be able to reconstruct the proper order. Remember that the timestamp is good enough only when compared within one thread (and even then it may be incorrect) and so should be avoided in distributed systems.
It’s better to log to local log files (to minimize the writing time), and then periodically extract the logs and send them to some external storage. This is typically achieved with a background daemon that reads the files, compresses them, and sends them over to some central monitoring place.
Anomaly Detection
Once we have the log files in a centralized store, we can look for anomalies. For instance, we can simply search the logs and look for words like “exception”, “error’, “issue”, or “fatal”. We can then correlate these logs with other signals like metrics (especially business dimensions) or traces and data distribution.
We can make anomaly detection better with the help of artificial intelligence. We can analyze the patterns and look for suspicious trends in the data. This is typically provided out-of-the-box in the cloud infrastructures.
Filtering by the Correlation ID
The correlation ID can help us debug scenarios much more easily. Whenever there is an issue, we can show the correlation ID to the user and ask them to mention the ID when reaching back to use.
Once we have the correlation ID, we simply filter the logs based on the correlation ID, order the log entries by the logical clock (like vector clock), then order them by the timestamp, and finally visualize. This way, we can easily see all the interactions within our system.
Tracing on Demand
It’s good to have the ability to enable tracing on demand, for instance, based on the cookie, query string parameter, or some well-known session ID. This helps us live debug issues when they happen.
With such a feature, we can dynamically reconfigure the log file to store all the log messages (up to the trace level) only for some of the transactions. This way, we don’t overload the logging system but we still get the crucial logs for the erroneous workflows.
Logging In Database
Logging in databases doesn’t differ much from the regular distributed systems. We need to configure the following:
- What log collector to use - this specifies where we log messages. This can be for instance the output stream or the file
- Where to store logs - in case of logging to a file, we need to specify the directory, names, and file formats (like CSV or JSON)
- How to handle log files - whether they are rotated daily, hourly, or even more often, or if the files are trimmed on the system restart
- Message formats - what information to include, for instance, the transaction identifier or some sequence numbering, what timezone to use, etc.
- Log level - which messages to store in the log file
- Sampling - for instance if log only longer queries or slower transactions
- Categories - what to log, for instance, whether to log connection attempts or only transactions
As mentioned before, we need to find a balance between logging too much and not logging enough. For instance, take a look at PostgreSQL documentation to understand which things can be configured and how they may affect the logging system.
Typically, you can check your log sizes with SQL queries. For instance for PostgreSQL, this is how you can check the size of the current log file:
1SELECT size FROM pg_stat_file(pg_current_logfile());
size
94504316
This query shows you the total size of all log files:
You can verify the same with the command line. Go to your database installation directory (that could be /var/lib/postgresql/data/log or similar and you can check it with SHOW data_directory;) and run this command;
1du -cb
54893969026 .
54893969026 total
Depending on how you host the database, you may get your logs delivered to external systems out of the box. For instance, you can configure PostgreSQL in RDS to send logs to Amazon CloudWatch and then you can see the logs within the log group /aws/rds/instance/instance_name/postgresql
There are many types of logs you can see, for instance:
- Startup and shutdown logs - these describe your database startup and shutdown procedures.
- Query logs - these show the SQL queries that are executed
- Query duration logs - they show how long queries took to complete
- Error logs - they show errors like invalid SQL queries, typos, constraint violations, etc.
- Connection and disconnection logs - they show how users connect to the database
- Write-Ahead Logs (WAL) - they show checkpoints and when the data is saved to the drive
Depending on your configuration, your log files may easily grow in size. For instance, if you decide to log every query and you have thousands of transactions per second, then the log file will store all the history of the queries which may be enormous after a couple of hours.
Summary
Logs are the most important part of the distributed system. Anything can break, but if logs break, then we are blind. Therefore, it’s crucial to understand how to log, what to log, and how to deal with logs to debug issues efficiently.