What is Stackdriver in Google Cloud Platform?
Stackdriver works across multiple clouds and on-premises infrastructure, so it does not lock developers into a particular cloud provider.
Google Cloud Platform offers Stackdriver, a comprehensive set of services for collecting data on the state of applications and infrastructure. Specifically, it supports three ways of collecting and receiving information.
Monitoring. This service is used to help understand the performance and utilization of applications and resources.
Logging. This service is used to collect service-specific details about the operations of services.
Alerting. This service is used to notify responsible parties about issues with applications or infrastructure that need attention.
A reliable system continuously provides its service. Reliability is closely related to availability. It is a probability: specifically, the probability that a system will be able to process a specified workload for a specified period of time.
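Because reliability is defined here as a probability, it can be estimated from observations. The following is a minimal sketch, assuming you have recorded whether the system handled its workload in each of several equal-length observation windows. The function name and the sample data are hypothetical and are not part of Stackdriver.

```python
def estimate_reliability(window_outcomes):
    """Estimate reliability as the fraction of observation windows
    in which the system processed its specified workload."""
    if not window_outcomes:
        raise ValueError("at least one observation window is required")
    successes = sum(1 for ok in window_outcomes if ok)
    return successes / len(window_outcomes)


# Hypothetical data: eight windows, one of which saw a failure.
windows = [True, True, True, False, True, True, True, True]
reliability = estimate_reliability(windows)
print(f"estimated reliability: {reliability:.3f}")
```

This is only an empirical estimate. More windows, and windows that match the period of interest, give a more trustworthy figure.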
Improving Reliability with Stackdriver
It is difficult, if not impossible, to provide reliable software services without insights into how that software is functioning. The state of software systems is constantly changing, especially in the cloud, where infrastructure, as well as code, can change frequently. Also, the demands on applications change.
■ A new service might be more popular than anticipated, so additional compute infrastructure is needed to meet demand.
■ Seasonal variations, such as holiday shopping, can lead to expected high workloads.
■ An error in a service may disrupt a workflow, resulting in a backlog of unprocessed tasks.
■ A database runs out of persistent storage and can no longer execute critical transactions.
■ The cache hit ratio is dropping for an application because the memory size is no longer sufficient to meet the needs of the service.
Monitoring with Stackdriver
Monitoring is the practice of collecting measurements of key aspects of infrastructure and applications. Examples include average CPU utilization over the last minute, the number of bytes written to a network interface, and the maximum memory utilization over the past hour. These measurements, which are known as metrics, are made repeatedly over time and constitute a time series of measurements.
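To make the idea of a time series concrete, here is a small sketch that computes two of the measurements mentioned above from raw samples: average CPU utilization over the last minute and maximum memory utilization over the past hour. The sample values, variable names, and helper functions are assumptions made for illustration. This is plain Python and does not call any Stackdriver API.

```python
from datetime import datetime, timedelta


def values_in_window(series, end, width):
    """Return the values whose timestamps fall in the window (end - width, end]."""
    start = end - width
    return [value for timestamp, value in series if start < timestamp <= end]


def mean_over(series, end, width):
    """Average of the samples in the window, or None if the window is empty."""
    values = values_in_window(series, end, width)
    return sum(values) / len(values) if values else None


def max_over(series, end, width):
    """Largest sample in the window, or None if the window is empty."""
    values = values_in_window(series, end, width)
    return max(values) if values else None


now = datetime(2020, 1, 1, 12, 0, 0)

# Hypothetical CPU utilization samples (percent), one every 15 seconds.
cpu_percent = [
    (now - timedelta(seconds=75), 90),  # older than one minute, ignored
    (now - timedelta(seconds=45), 40),
    (now - timedelta(seconds=30), 50),
    (now - timedelta(seconds=15), 60),
    (now, 70),
]

# Hypothetical memory utilization samples (percent).
memory_percent = [
    (now - timedelta(minutes=90), 95),  # older than one hour, ignored
    (now - timedelta(minutes=50), 61),
    (now - timedelta(minutes=20), 83),
    (now - timedelta(minutes=5), 72),
]

avg_cpu_last_minute = mean_over(cpu_percent, now, timedelta(minutes=1))
max_memory_last_hour = max_over(memory_percent, now, timedelta(hours=1))
print(avg_cpu_last_minute, max_memory_last_hour)
```

Each metric is the same list of timestamped samples viewed through a different window and aggregation function.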
Each metric follows a common pattern: a property of an entity, a time range, and a numeric value. GCP has defined metrics for a wide range of entities, including the following:
■ GCP services, such as BigQuery, Cloud Storage, and Compute Engine.
■ Operating system and application metrics, which are collected by Stackdriver agents that run on VMs.
■ Anthos metrics, which include Kubernetes and Istio metrics.
■ AWS metrics, which measure the performance of Amazon Web Services resources, such as EC2 instances.
■ External metrics, including metrics defined in Prometheus, a popular open-source monitoring tool.
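The metric pattern described above (a property of an entity, a time range, and a numeric value) can be modeled as a simple record. The sketch below is an illustration only: the class, field names, entity name, and metric name are hypothetical and do not reflect Stackdriver's actual data model or API.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class MetricPoint:
    entity: str      # what is being measured, e.g. a VM (hypothetical name)
    metric: str      # the property of the entity being measured
    start: datetime  # beginning of the time range the value covers
    end: datetime    # end of the time range the value covers
    value: float     # the numeric measurement


def group_into_series(points):
    """Group points by (entity, metric) and order each group by end time."""
    series = defaultdict(list)
    for point in sorted(points, key=lambda p: p.end):
        series[(point.entity, point.metric)].append(point)
    return dict(series)


points = [
    MetricPoint("vm-1", "cpu_utilization",
                datetime(2020, 1, 1, 12, 1), datetime(2020, 1, 1, 12, 2), 0.64),
    MetricPoint("vm-1", "cpu_utilization",
                datetime(2020, 1, 1, 12, 0), datetime(2020, 1, 1, 12, 1), 0.58),
    MetricPoint("vm-1", "bytes_written",
                datetime(2020, 1, 1, 12, 0), datetime(2020, 1, 1, 12, 1), 20480.0),
]

series = group_into_series(points)
cpu_values = [p.value for p in series[("vm-1", "cpu_utilization")]]
print(cpu_values)
```

Grouping by entity and metric turns a stream of individual points into the time series that monitoring tools chart and alert on.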
Stackdriver groups its capabilities under several features:
1. Monitoring
■ Dynamic configuration and intelligent defaults
■ Alerts and dashboards
2. Logging
■ Platform, system, and application logs
■ Monitoring alerts and data can be exported to BigQuery
3. Error Reporting
■ Error Notifications
■ Error Dashboard
4. Trace
■ Display data in near real time
5. Debugger
■ Inspect an application without stopping it
Alerting with Stackdriver
Alerting is the process of monitoring metrics and sending notifications when user-defined conditions are met. The goal of alerting is to notify someone when there is an incident or condition that cannot be automatically remediated and that puts service-level objectives at risk. For example, if you are concerned about having enough CPU capacity for intermittent spikes in workload, you may want to run your application servers at an average of 70 percent utilization or less, and to be notified if utilization rises above 80 percent.
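The CPU example above can be expressed as a simple threshold check. This is a minimal sketch of the logic only, assuming utilization is supplied as a percentage. The function names, status labels, and the notify callback are hypothetical; a real Stackdriver alerting policy is configured in the service itself and is not written this way.

```python
def classify_cpu(utilization_pct, target_pct=70.0, alert_pct=80.0):
    """Classify a CPU utilization reading against the two thresholds."""
    if utilization_pct > alert_pct:
        return "notify"        # above the alert threshold: someone should be told
    if utilization_pct > target_pct:
        return "above_target"  # over the desired average, but no alert yet
    return "ok"


def check_and_notify(utilization_pct, notify, alert_pct=80.0):
    """Send a notification through the given callback when the alert threshold is exceeded."""
    status = classify_cpu(utilization_pct, alert_pct=alert_pct)
    if status == "notify":
        notify(
            f"CPU utilization {utilization_pct:.0f}% is above "
            f"the {alert_pct:.0f}% alert threshold"
        )
    return status


# Hypothetical readings; a list stands in for a real notification channel.
sent = []
statuses = [check_and_notify(sample, sent.append) for sample in (65.0, 75.0, 85.0)]
print(statuses)
print(sent)
```

Keeping a gap between the 70 percent target and the 80 percent alert threshold avoids notifying people about brief, harmless excursions above the target.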