Mastering Node.js App Monitoring: Tips & Best Practices

General Information

Session Description: Ever struggled with monitoring in your apps? Not anymore! By sharing the good, the bad, and the hair-pulling from our own experiences, I want to help you steer clear of monitoring chaos. We’ll see how truly knowing how your apps work helps you build more focused monitoring. That focus lets you dodge the black holes of checkbox monitoring, where important metrics and alerts get swallowed. Additionally, we’ll see how strategic and focused logging, monitoring, and alerting with tools like Graylog, Grafana and Prometheus can supercharge your app’s resilience. Join to uncover how reliability and monitoring patterns and anti-patterns can help improve app quality. You will return armed with invaluable insights that can skyrocket your monitoring game!

Conference: DevOps.JS (ONLINE, February 15-16, 2024)

IMPORTANT ANNOUNCEMENT

Inspired by my experiences at Infobip in logging, monitoring, troubleshooting, alerting, and observability, and drawing on my work inside the Web Infrastructure team, I will write a series of blog posts about good and bad practices in monitoring. The posts will be informative and example-based, giving an in-depth and relatable look at the topic. Look out for them in March, April, and May. I'll be working on them together with Infobip's DevRel team and posting them on the Infobip Developers blog. So don't expect me to give away everything about anti-patterns and patterns here; I have to save some things as a surprise. :)

Suggested Resources

I would like to share some resources that inspired this talk and can provide more information about this topic. The talk is based on all the good and bad practices I've come across and experienced while troubleshooting our web infrastructure. Many of you may have already encountered these practices without recognizing them for what they were.

Michael T. Nygard - Release It!: Design and Deploy Production-Ready Software (Pragmatic Programmers) -> Amazon
- I didn't have time in the talk to cover anti-patterns and patterns in application development, but I highly recommend this book; it is really good.
Mike Julian - Practical Monitoring: Effective Strategies for the Real World -> Amazon
- This book highly influenced this talk and it has some amazing insights you should check!
Stephen Townshend - Bad Observability -> SquaredUp Blog
Useful Microsoft articles about anti-patterns and patterns:
Rob Ewaschuk - My Philosophy on Alerting -> Article Link
Martin Kleppmann - Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems -> Amazon

Monitoring Anti-Patterns

Tool Obsession

Tool Obsession occurs when teams concentrate heavily on using software tools and technologies to solve issues. This reliance can cause teams to forget about their primary objectives and goals. Believing that simply purchasing specific tools or platforms will guarantee success is a mistake. Instead, teams should use these tools as a means to accomplish their set goals, rather than letting the tools dictate their actions.

Monitoring-as-a-Job

Monitoring-as-a-Job is the mistaken belief that monitoring is a role for specific individuals, so that one person or a small group ends up taking care of all monitoring tasks. Assigning monitoring responsibilities to just a few people can reduce its effectiveness. Instead, the entire team should collaborate in creating and maintaining monitoring for the services they are in charge of.

Checkbox Monitoring

Checkbox monitoring refers to having a monitoring system just to say you have one. It often leads to unreliable, untrustworthy, and ineffective monitoring that may be worse than not monitoring at all. Some of the standard signs of Checkbox Monitoring are:

Tracking basic metrics like system load, CPU usage, and memory use, but still having service outages without knowing why
Ignoring alerts frequently because there are many false alarms
Checking system metrics less often than every 5 minutes
Not saving historical metric data to spot trends

Using Monitoring as a Crutch

Sometimes, teams depend on monitoring tools to handle problems in their systems or processes, rather than working on improvements. This is known as "using monitoring as a crutch." It means the team responds to issues after they happen, rather than preventing them in the first place. The issue here is that the focus is on finding problems, not fixing them. For instance, a team might continue to monitor a weak application rather than improving the code to make it stronger.

Manual Configuration

Manual configuration means setting up monitoring systems by hand instead of using automated processes. This can waste time, lead to mistakes, and make monitoring less effective overall. Doing everything manually makes it harder to monitor systems well. Teams spend too much time on basic setup and maintenance. This makes it less likely they will enhance or update monitoring when needed.
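
As a rough sketch of the automated alternative, the snippet below generates scrape targets from a service inventory instead of having someone type them by hand. It assumes Prometheus file-based service discovery, which reads JSON target groups in the shape shown. The inventory, its field names, and the `buildTargetGroups` helper are made up for illustration.

```javascript
// Hypothetical service inventory; in practice this might come from a
// deployment manifest or a service registry.
const services = [
  { name: "checkout", env: "prod", hosts: ["10.0.0.11", "10.0.0.12"], metricsPort: 9100 },
  { name: "search", env: "prod", hosts: ["10.0.0.21"], metricsPort: 9100 },
];

// Build one target group per service in the shape Prometheus file-based
// service discovery reads: [{ "targets": ["host:port"], "labels": {...} }]
function buildTargetGroups(inventory) {
  return inventory.map((svc) => ({
    targets: svc.hosts.map((host) => `${host}:${svc.metricsPort}`),
    labels: { service: svc.name, env: svc.env },
  }));
}

const targetGroups = buildTargetGroups(services);
console.log(JSON.stringify(targetGroups, null, 2));
```

Writing this output to a file referenced from a `file_sd_configs` entry would let Prometheus pick up new hosts whenever the inventory changes, with no manual edits.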

Unnecessary Alerts

This one is pretty much self-explanatory. Getting too many alerts about things that don't require quick action can cause some problems:

Engineers start ignoring alerts, even critical ones
Lots of pointless alerts distract from high-priority issues
Harder to spot real emergencies needing immediate response

The Big Dumb Metric

Some metrics oversimplify complex systems. They combine many measurements into one number. But a single, simplified number cannot show how different parts of the system work. These too-simple metrics do not help people understand what is really happening inside a system.

Examples include:

Reporting one response time for software with hundreds of services
Showing only total % CPU usage without details

To make this concrete:

Imagine your monitoring system says the 95th percentile response time is 780 milliseconds. That single number does not provide enough detail to know:

If certain services are slow
How to improve response times
Which parts need help
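
A small sketch of the problem, using made-up response times for three hypothetical services (`auth`, `search`, `reports`). The nearest-rank `percentile` helper is an assumption of this example, not something from the talk.

```javascript
// Nearest-rank percentile: the smallest sample such that at least p% of
// all samples are less than or equal to it.
function percentile(values, p) {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Hypothetical response-time samples in milliseconds, tagged by service.
const samples = [
  ...[40, 45, 50, 55].map((ms) => ({ service: "auth", ms })),
  ...[60, 70, 80, 90].map((ms) => ({ service: "search", ms })),
  ...[700, 750, 780, 800].map((ms) => ({ service: "reports", ms })),
];

// Group samples by service and compute a p95 for each group.
function p95ByService(allSamples) {
  const grouped = new Map();
  for (const { service, ms } of allSamples) {
    if (!grouped.has(service)) grouped.set(service, []);
    grouped.get(service).push(ms);
  }
  const result = {};
  for (const [service, values] of grouped) {
    result[service] = percentile(values, 95);
  }
  return result;
}

// The single "big dumb" number versus the per-service breakdown.
const overallP95 = percentile(samples.map((s) => s.ms), 95);
const perServiceP95 = p95ByService(samples);

console.log("overall p95:", overallP95);
console.log("per service:", perServiceP95);
```

Here the overall p95 is 800 ms, exactly the p95 of `reports`, while `auth` (55 ms) and `search` (90 ms) are healthy. Only the per-service breakdown shows where to look.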

A Plague of Dashboards

Using too many ready-made dashboards for monitoring can cause several issues. These issues include not being able to look at data quickly, difficulty answering unique questions, and slower development of data analysis skills. The main problem is that relying too much on dashboards can make it hard to find and fix problems fast. This is because engineers might only focus on the existing dashboards and struggle to answer questions that aren't already included.

Reactive Monitoring

Reactive monitoring mainly involves waiting for issues to show up in production environments before taking action. This can cause unneeded downtime, unhappy customers, and missed chances to make the system more stable. Reactive monitoring does not actively find, predict, or stop potential problems, which leaves the system less dependable and less robust.

Monitoring should be proactive, using techniques such as ongoing testing, efficient alerts, and automatic problem detection. This helps find issues early on during development and deployment.

Ignoring Customers

This means that you mainly pay attention to how well your technology systems and services are working, but you forget about how they affect your customers and their experiences. This can cause a mismatch between how well you think your systems are working and how happy your customers actually are. To make sure your monitoring is focused on customers, you should keep an eye on how they experience and interact with your services, while still checking your technology systems to identify and fix problems.

Monitoring Patterns

Easy-to-Combine Monitoring

Easy-to-Combine Monitoring is an approach that lets you mix various monitoring tools, creating a flexible and adjustable monitoring system. Rather than relying on one tool that does everything, it combines specialized tools that each focus on a specific task. This way, you get a wider view of your system's health and can quickly detect and solve problems. As a result, you can build a flexible, strong, and expandable monitoring system better suited to handle your platform's growing complexity.

Monitor from the User Perspective

Users just want the app to work well for them, and they aren't worried about the behind-the-scenes details that make it happen. If you look at things from their point of view, you'll get useful information on how good the user experience is. This can help you make the app better and keep users happy. Pay attention to how the app works for the user when you collect data, instead of only focusing on how the system is performing internally.

Buy, Not Build

Select ready-to-use, often SaaS-based solutions for your monitoring needs, rather than creating custom platforms in-house. This saves time and resources and reduces complexity, allowing your company to concentrate on its products and goals. The benefits of this approach include:

Cost-effective: It's usually less expensive than an in-house solution, because development and maintenance costs are much lower and you avoid the opportunity cost of building it yourself.
Expertise: SaaS providers tend to be better at creating and maintaining scalable, reliable, and high-performance systems than individual companies.
Easy to deploy: SaaS solutions can be set up quickly and include high-availability, automation, and documentation from the beginning.

Continuous Improvement

Continuous improvement involves regularly updating and enhancing monitoring methods, tools, and systems. This is done to keep up with changing needs and advancements in the industry. Achieving top-quality monitoring takes time and effort, and it's important to understand that it cannot be accomplished quickly. The process may take several months or even years of steady dedication and growth.

Choosing Important Metrics

Choosing Important Metrics is an approach to monitoring that emphasizes picking and tracking the most significant, high-impact metrics, based on how the system is built, how its parts interact, and any concerns about stability. The goal is to pay attention to the metrics that signal potential problems. This helps make monitoring more effective, solve issues quickly, and improve the system's performance. By focusing on metrics that directly affect the system's stability, performance, and overall health, you reduce the chances of overlooking important details and missing problems.

Health Endpoint Monitoring

Health Endpoint Monitoring is a way to constantly check if an application is working properly. It does this by sending requests to specific parts of the application called endpoints. These endpoints then run tests on the application's internal and external services and give a report on their status. This helps make sure the application is available, reliable, and working at its best.

To use health endpoint monitoring, first choose points in the application for health checks, like '/health' or '/status'. Then, add code to the application that runs these tests and gives a detailed report on how the important parts are working. This helps find and fix problems quickly, and it gives useful information to keep the system running smoothly.
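
A minimal sketch of what such an endpoint could look like, using only Node's built-in `http` module. The `checks` map, the dependency names in it, and the `runChecks` and `createHealthServer` helpers are hypothetical. The empty check functions are placeholders for real probes such as a database ping.

```javascript
const http = require("node:http");

// Hypothetical dependency checks. Each one should resolve when the
// dependency is reachable and reject (throw) when it is not.
const checks = {
  database: async () => {},
  paymentsApi: async () => {},
};

// Run every check and report a status per dependency, not just one boolean.
async function runChecks(checkMap) {
  const entries = await Promise.all(
    Object.entries(checkMap).map(async ([name, check]) => {
      const startedAt = Date.now();
      try {
        await check();
        return [name, { status: "up", durationMs: Date.now() - startedAt }];
      } catch (err) {
        const error = err instanceof Error ? err.message : String(err);
        return [name, { status: "down", durationMs: Date.now() - startedAt, error }];
      }
    })
  );
  const healthy = entries.every(([, result]) => result.status === "up");
  return { status: healthy ? "ok" : "degraded", checks: Object.fromEntries(entries) };
}

// Serve the report on GET /health: 200 when everything is up, 503 otherwise.
function createHealthServer(checkMap) {
  return http.createServer(async (req, res) => {
    if (req.method !== "GET" || req.url !== "/health") {
      res.writeHead(404);
      res.end();
      return;
    }
    const report = await runChecks(checkMap);
    res.writeHead(report.status === "ok" ? 200 : 503, { "Content-Type": "application/json" });
    res.end(JSON.stringify(report));
  });
}

// To start serving: createHealthServer(checks).listen(3000);
```

Returning 503 when any dependency is down lets an external prober or load balancer act on the status code alone, while the JSON body tells a human which part failed.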

Symptom-based Monitoring

This method of monitoring focuses on finding and giving a heads-up about problems that directly impact users, instead of just looking at the root causes. The aim is to create a better and more detailed system for detecting issues, which helps avoid false alarms and makes sure important problems affecting users are dealt with quickly and given the right priority. This approach looks into issues users face like:

Availability: things like error messages or a website being down
Latency: slow loading pages, for example
Data issues: missing or outdated information
Feature functionality: broken parts of a website or services not working properly
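
To illustrate the idea, here is a small sketch of an alert check driven by user-facing symptoms (failed requests and slow responses) rather than by causes such as CPU usage. The limits, the field names, and the `evaluateSymptoms` helper are assumptions made for this example.

```javascript
// Hypothetical limits; derive real ones from your own service-level objectives.
const limits = { maxErrorRatio: 0.01, maxP95LatencyMs: 1000 };

// `stats` summarizes user-facing traffic over a recent period (say, 5 minutes).
// Cause-level signals such as CPU usage are deliberately not consulted here.
function evaluateSymptoms(stats, symptomLimits) {
  const alerts = [];
  const errorRatio =
    stats.totalRequests === 0 ? 0 : stats.failedRequests / stats.totalRequests;

  if (errorRatio > symptomLimits.maxErrorRatio) {
    alerts.push({ symptom: "availability", value: errorRatio });
  }
  if (stats.p95LatencyMs > symptomLimits.maxP95LatencyMs) {
    alerts.push({ symptom: "latency", value: stats.p95LatencyMs });
  }
  return alerts;
}

// Busy CPU, but users are unaffected: nothing to page anyone about.
const busyButHealthy = { totalRequests: 2000, failedRequests: 4, p95LatencyMs: 320, cpuPercent: 95 };
// Idle CPU, but 5% of requests fail: users are hurting.
const quietButFailing = { totalRequests: 2000, failedRequests: 100, p95LatencyMs: 300, cpuPercent: 20 };

console.log(evaluateSymptoms(busyButHealthy, limits)); // no alerts
console.log(evaluateSymptoms(quietButFailing, limits)); // one availability alert
```

A cause-based rule such as "CPU above 90%" would have fired on the first window and stayed silent on the second, which is the opposite of what users experienced.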

Overall look at observability and monitoring

The key message here is that we must work together as developers and team members. By learning from one another and sharing best practices within the team, we can grow and improve. Sharing knowledge helps us enhance our infrastructure day by day. So, let's learn, experiment, and grow together, as it brings many benefits!

Ante Tomić