How to Keep Track of What’s Happening in Your Cloud
One key aspect of integration is achieving observability in a hybrid cloud environment. An effective monitoring strategy provides a comprehensive overview and makes these complex setups far easier to manage. How can you gain that overview and monitor a hybrid cloud environment? Read on to find out.
Observability allows us to monitor and measure the current state of a system using data such as transactions, metrics, and logs. Its role keeps growing in importance: as cloud environments become more complex, finding the causes of failures within them becomes more challenging. At Trask, for example, we build the observability of integration platforms for our clients on three fundamental pillars:
- Central log management (CLM) – Collecting application and system logs into a central repository, where they are processed and continuously evaluated.
- Metrics (application monitoring) – Collecting and assessing metrics from applications and systems, typically with a focus on performance characteristics.
- Tracing – Tracking communication with external systems as well as between the internal components of an application.
Let's explore the first two pillars in more depth.
Why and How to Use CLM
There are several reasons to opt for central log management:
- Effective and quick root cause analysis: When an error occurs in an application, you can quickly trace and correlate the individual logged events.
- Correlation and tracing of logged events from multi-layered architectures and distributed systems (especially important as microservice architectures grow more complex): Without a central repository, it is hard to find the connection (correlation) between events that occurred in different parts of the environment.
- Legislative requirements imposed on banks (NIS2): The NIS2 directive expands cybersecurity obligations into areas that were not previously covered by cybersecurity law.
- Archiving and retrieval of logs for criminal proceedings: This is particularly relevant for banks, which are required to provide logs to the police and other authorities when requested.
What Are Logs?
Logs are streams of aggregated, time-sequenced events collected from the output of all running processes and supporting services. They are typically in text format, with one event per line, and they flow continuously for as long as the application runs. Logs are produced by applications, systems, and infrastructure components such as network devices.
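For illustration, here is a minimal sketch of such a stream: a small Python program (the service name is a hypothetical placeholder) writing one JSON-formatted event per line to stdout, which is exactly the shape of output a CLM pipeline picks up.

```python
import json
import logging
import sys
from datetime import datetime, timezone

class JsonLineFormatter(logging.Formatter):
    """Render each log record as a single JSON line, ready for a log shipper."""
    def format(self, record):
        event = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        return json.dumps(event)

handler = logging.StreamHandler(sys.stdout)  # emit to stdout, one event per line
handler.setFormatter(JsonLineFormatter())

logger = logging.getLogger("payment-service")  # hypothetical service name
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("order accepted")  # -> {"timestamp": "...", "level": "INFO", ...}
```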
Various products falling into two categories can be used to implement CLM: commercial solutions (such as Splunk, Loggly, Sumo Logic, Sematext Logs, or Better Stack) and open-source solutions (Elastic Stack, OpenSearch, Grafana Loki, VictoriaLogs, or Graylog).
All these solutions are adequate; the differences lie mainly in total cost of ownership (TCO), and you need to decide whether the investment is worth it. At Trask, we use Elastic Stack / OpenSearch in combination with Kafka. We view Kafka as a form of the “transaction log” known from databases; with Kafka Connect, it also expands the spectrum of systems we can integrate with.
Marcel Hnilka, Integration Services Engineer
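To make the pipeline concrete, here is a minimal sketch of handing a log event to Kafka on its way to Elastic Stack / OpenSearch. The broker address, topic name, and the kafka-python client are illustrative assumptions, not a prescription for any particular production setup:

```python
import json
from kafka import KafkaProducer  # pip install kafka-python

# Broker address and topic name are illustrative; adjust for your environment.
producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

event = {
    "timestamp": "2024-05-01T12:00:00+00:00",
    "level": "ERROR",
    "service": "payment-service",
    "message": "downstream timeout",
}

# Kafka acts as a durable buffer (the "transaction log") in front of the
# log store; a sink connector or ingest pipeline consumes the topic on
# the other side and writes into Elastic Stack / OpenSearch.
producer.send("app-logs", value=event)
producer.flush()
```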
Anticipate Problems With Metrics
A different kind of information is obtained from metrics, which are crucial for operating modern infrastructure. Metrics give concise information about the current state of an application; they are always exposed by the monitored application or its underlying runtime, and there are usually many of them, covering the various facets of the system. Every metric value is tied to the time of its collection.
If we want to use metrics effectively, we need to monitor them constantly. They are combined with an alert-management system, which notifies the responsible users whenever certain key values change, so that they can address the problem.
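To show what this looks like from the application side, here is a minimal sketch using the official Python client, prometheus-client; the metric names and the port are illustrative assumptions:

```python
import random
import time

from prometheus_client import Counter, Gauge, start_http_server  # pip install prometheus-client

# Metric names are illustrative.
REQUESTS_TOTAL = Counter("app_requests_total", "Total number of handled requests")
QUEUE_DEPTH = Gauge("app_queue_depth", "Current number of messages waiting in the queue")

if __name__ == "__main__":
    start_http_server(8000)  # serves http://<host>:8000/metrics for scraping
    while True:
        REQUESTS_TOTAL.inc()
        QUEUE_DEPTH.set(random.randint(0, 50))  # stand-in for a real queue measurement
        time.sleep(1)
```

A collector such as Prometheus then scrapes the /metrics endpoint of this process on its regular collection interval.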
In our view, the number of metrics is effectively unlimited (there can be thousands), and applications usually define them on their own. We need to consider whether we really need all of them, because every metric places demands on our infrastructure. Let's identify the key ones among them.
Adam Morávek, DevOps Engineer
Why Do We Collect Metrics?
- They enable us to anticipate problems.
- Thanks to metrics, we can monitor queue depths, automatically scale pods in Kubernetes, track performance, and more.
- Metrics can be monitored continuously and can trigger an alert when a threshold value is exceeded (see the sketch after this list). Knowing our metrics allows us to detect trends, prepare for potential outages and, most importantly, prevent them.
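Production alerting normally lives in Prometheus alerting rules and Alertmanager, but the threshold idea can be sketched by hand against Prometheus's standard instant-query endpoint, /api/v1/query. In the sketch below, the server URL, metric name, and threshold are illustrative assumptions:

```python
import json
import urllib.parse
import urllib.request

# Server URL, metric name, and threshold are illustrative assumptions.
PROMETHEUS_URL = "http://prometheus:9090"
QUERY = "app_queue_depth"
THRESHOLD = 40

params = urllib.parse.urlencode({"query": QUERY})
with urllib.request.urlopen(f"{PROMETHEUS_URL}/api/v1/query?{params}") as response:
    result = json.load(response)

# The instant-query response carries one sample per matching series:
# result["data"]["result"] == [{"metric": {...}, "value": [<ts>, "<value>"]}, ...]
for series in result["data"]["result"]:
    value = float(series["value"][1])
    if value > THRESHOLD:
        print(f"ALERT: {QUERY}={value} exceeds threshold {THRESHOLD}")
```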
Meet Trask’s Own Solution
Today, the standard tool for metric collection is Prometheus, which actively scrapes metrics from applications. Its advantages include simple configuration and easy integration with applications, which aids scalability. However, it also has drawbacks: it lacks support for keeping historical data in separate storage, and its support for high availability is limited.
Therefore, an extension of Prometheus called Thanos was developed. At Trask, we have built and deployed a Thanos-based monitoring solution for our clients in AWS. What benefits does it offer? Improved scalability, theoretically unlimited storage for historical data, easier management for vendors, and readiness for hybrid infrastructure. Moreover, the entire solution is fully automated.
Observability brings many more challenges, and just as many approaches to solving them. If you are interested in improving the operation of the cloud infrastructure at your company, do not hesitate to contact us.
Authors
Marcel Hnilka
Integration Services Engineer
mhnilka@thetrask.com
Adam Morávek
DevOps Engineer
amoravek@thetrask.com