How to keep track of what’s happening in your cloud
Observability allows us to measure the current state of a system using data such as transactions, metrics, and logs. Its role has become increasingly important lately, as cloud environments grow more complex and finding the causes of failures within them becomes more challenging. Trask builds the observability of integration platforms for its clients on three fundamental pillars:
- Central log management (CLM) - Collection of application and system logs to a central repository, their processing, and continuous evaluation.
- Metrics (application monitoring) - Collection and assessment of metrics from applications and systems, often focused on performance characteristics.
- Tracing - Tracking requests both outside the application and internally within it.
Our colleagues Marcel Hnilka and Adam Morávek focused on the first two pillars in their presentations.
Why and how to use CLM
There are several reasons to opt for central log management:
- Effective and quick root cause analysis: When an error occurs in an application, you can quickly trace and correlate the individual logged events.
- Correlation and tracing of logged events from multi-layered architectures and distributed systems (especially with the proliferation of microservice architecture): “Without a central repository, it is difficult today to find the connection (correlation) between the events that have occurred,” pointed out Marcel Hnilka, our CLM expert.
- Legislative requirements imposed on banks (NÚKIB): Next year will bring the NIS2 directive, which extends cybersecurity obligations to sectors not previously covered by the cybersecurity act.
- Archiving and recovery of logs for criminal proceedings: This is particularly relevant for banks, which are required to provide logs to the police and other authorities if necessary.
[.infobox][.infobox-heading]What are logs?[.infobox-heading]Logs are streams of aggregated and time-sequenced events collected from the output streams of all running processes and support services. They are typically in text format with one event per line, and they flow continuously throughout the application runtime. They are not limited to application records written to files; they often also come from various black boxes or network elements.[.infobox]
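To make that concrete, here is a minimal sketch in Python of an application emitting one JSON-formatted event per line to its output stream. The field names and the “payments” logger are our illustration, not a prescribed schema; the important detail is the correlation ID, which later lets a central repository connect events belonging to the same request.

```python
import json
import logging
import sys
import uuid
from datetime import datetime, timezone


class JsonLineFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        event = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # ties together all events belonging to one request across services
            "correlation_id": getattr(record, "correlation_id", None),
        }
        return json.dumps(event)


handler = logging.StreamHandler(sys.stdout)  # events go to the process output stream
handler.setFormatter(JsonLineFormatter())
logger = logging.getLogger("payments")       # "payments" is an illustrative service name
logger.addHandler(handler)
logger.setLevel(logging.INFO)

correlation_id = str(uuid.uuid4())
logger.info("payment request received", extra={"correlation_id": correlation_id})
logger.info("request forwarded to core banking", extra={"correlation_id": correlation_id})
```

The exact schema is up to you; what matters is one event per line, a timestamp, and an identifier that survives across services.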
Various products falling into two categories can be used to implement CLM: closed commercial solutions (such as Splunk, Loggly, Sumo Logic, Sematext Logs, or Better Stack) and open-source solutions (Elastic Stack, OpenSearch, Grafana Loki, VictoriaLogs, or Graylog).
All of these solutions do the job. The differences lie mainly in price, and you need to decide whether that is worth the investment. At Trask, we use Elastic Stack / OpenSearch in combination with Kafka. We view Kafka as a form of the transaction log known from databases. It also broadens the range of supported integrations.
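As a rough sketch of that pattern (assuming the kafka-python client, an illustrative broker address, and an “app-logs” topic of our own naming, not a fixed part of the stack), an application publishes each log event to Kafka, where it is durably buffered until an indexing consumer such as Logstash or an OpenSearch ingest pipeline picks it up:

```python
import json
from kafka import KafkaProducer  # pip install kafka-python

# Kafka acts as a durable buffer (the "transaction log") in front of
# Elastic Stack / OpenSearch; broker address and topic name are illustrative.
producer = KafkaProducer(
    bootstrap_servers=["kafka-broker:9092"],
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

event = {
    "timestamp": "2024-05-01T12:00:00+02:00",
    "service": "payments",
    "level": "ERROR",
    "message": "core banking call timed out",
    "correlation_id": "9f2c1e7a-0d2b-4c55-9a1e-3f6d2b8c4e10",
}

# Append the event to the topic; an indexing consumer (e.g. Logstash or an
# OpenSearch ingest pipeline) reads it from there and stores it for search.
producer.send("app-logs", value=event)
producer.flush()
```

Because the broker retains events, a temporary outage of the search cluster does not lose logs, which is exactly the transaction-log role mentioned above.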
Marcel Hnilka, Integration Services Engineer
Anticipate problems with metrics
A different set of information is obtained from metrics, which are crucial for modern infrastructure. They provide concise information about the current state of an application, are exposed by the monitored application itself, and there are usually many of them. Each metric value is tied to the time of its collection.
“If we want to use metrics effectively, we need to monitor them constantly. They are used in combination with an alert-management system, which ensures that changes in key values are propagated to the responsible users, who then address the problem,” points out Adam Morávek, who deals with metrics at Trask.
In his view, the number of metrics is practically unlimited (there can be thousands), and applications usually define them on their own.
We need to consider whether we really need all the metrics, because they place demands on our infrastructure. Let’s define the key ones among them.
Adam Morávek, DevOps Engineer
[.infobox][.infobox-heading]Why do we collect metrics?[.infobox-heading]- They enable us to anticipate problems.
- Thanks to metrics, we can monitor queue depths, automatically scale pods in Kubernetes, perform performance monitoring, etc.
- Metrics can be monitored and generate an alert when threshold values are exceeded. Knowing them gives us the ability to detect trends, prepare for potential outages, and most importantly, prevent them.[.infobox]
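A minimal sketch of how an application can expose such metrics, using the prometheus_client library for Python (the metric names and port 8000 are our illustration); a collection tool such as Prometheus, discussed in the next section, then scrapes the /metrics endpoint periodically:

```python
import random
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server  # pip install prometheus-client

# Metric names are illustrative; a real application exposes its own.
REQUESTS_TOTAL = Counter("app_requests_total", "Total number of handled requests")
QUEUE_DEPTH = Gauge("app_queue_depth", "Messages currently waiting in the queue")
REQUEST_LATENCY = Histogram("app_request_duration_seconds", "Request processing time in seconds")

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for the collector to scrape

    while True:
        with REQUEST_LATENCY.time():                # record processing time
            time.sleep(random.uniform(0.01, 0.1))   # simulated work
        REQUESTS_TOTAL.inc()
        QUEUE_DEPTH.set(random.randint(0, 50))      # simulated queue depth
```

A Counter only grows, a Gauge can go up and down (queue depth), and a Histogram records a distribution (request latency); choosing the right type is part of defining the key metrics.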
Trask’s own solution
Today, the standard tool for metric collection is Prometheus, which actively pulls metrics from applications. Its advantages include simple configuration, easy integration with applications, and good scalability. However, it also has drawbacks: it lacks support for storing historical data in separate long-term storage and offers only limited support for high availability.
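Once Prometheus has pulled and stored the values, they can be queried back over its HTTP API. The threshold alerting mentioned earlier is in practice handled by Prometheus alerting rules together with Alertmanager; the simplified sketch below (with an assumed address, metric name, and threshold) only illustrates the underlying idea:

```python
import requests  # pip install requests

PROMETHEUS_URL = "http://prometheus:9090"  # illustrative address
QUERY = "app_queue_depth"                  # assumed metric name (see the sketch above)
THRESHOLD = 1000                           # illustrative threshold

# Ask Prometheus for the current value of the metric via its HTTP query API.
response = requests.get(
    f"{PROMETHEUS_URL}/api/v1/query",
    params={"query": QUERY},
    timeout=5,
)
response.raise_for_status()

for series in response.json()["data"]["result"]:
    _, value = series["value"]             # [unix_timestamp, "value as string"]
    if float(value) > THRESHOLD:
        # In a real deployment an alerting rule plus Alertmanager would
        # notify the responsible team instead of printing to the console.
        print(f"ALERT: {QUERY} = {value} exceeds threshold {THRESHOLD}")
```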
To address these drawbacks, an extension of Prometheus called Thanos was developed. At Trask, we have built and deployed a Thanos-based monitoring solution for our clients in AWS. What does it offer? Better scalability, theoretically unlimited storage for historical data, easier management for vendors, and readiness for hybrid infrastructure. Moreover, the entire solution is fully automated.
Observability involves many more challenges and approaches to them. If you are interested in improving the operation of the cloud infrastructure at your company, do not hesitate to contact us. We will be happy to help.
Author
Marcel Hnilka
Integration Services Engineer
mhnilka@thetrask.com
Adam Morávek
DevOps Engineer
amoravek@thetrask.com