Troubleshoot#2: Core Dumps for Network Engineers
Got a core dump file? Please contact your TAC engineers for further support!
This is the common statement that we all see on vendors' websites and in their recommendations, and it is definitely true. The reason is that making use of a core dump file requires deep knowledge of the code of the application that dumped core.
However, this doesn't mean that core dump files are none of your business. Network operations teams can use core dumps strategically to help their vendor's support find and fix problems quicker.
I will try to bridge the gap here between networkers and programmers. Let me first take a step back and define a core dump for those who are not very familiar with it, and then I will give examples of how you can still use core dumps to your advantage as a network operations engineer.
Core Dumps for non-programmers:
A core dump is simply a snapshot of a process's state at the moment the kernel terminates it. It is an image of the process's memory space, containing the data and stack segments, certain memory-mapped segments, and the CPU registers at the time the dump was created.
Core dump files are then loaded in a debugger like GDB to inspect the state of the process at the time of the termination, usually to troubleshoot a bug.
Core dumps are typically triggered by the delivery of specific signals that the code does not handle. But they can also be generated programmatically or at a user's request, and that is what we are interested in here.
When core dumps are generated by the system, there is not much you can do with them without access to the code. But intentionally generated core dumps can be used strategically to help your vendor's engineers resolve your problems faster.
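To make "specific signals" a bit more concrete, here is a minimal Python sketch. It assumes the network OS is built on Linux, where signal(7) documents which signals terminate a process and dump core by default. The function name core_dumping_signals is made up for this illustration, and other kernels may behave differently.

```python
import signal

# Signals whose default action on Linux is "terminate and dump core",
# as listed in signal(7). Assumption: a Linux-based system.
CORE_SIGNAL_NAMES = [
    "SIGQUIT", "SIGILL", "SIGABRT", "SIGFPE", "SIGSEGV",
    "SIGBUS", "SIGSYS", "SIGTRAP", "SIGXCPU", "SIGXFSZ",
]


def core_dumping_signals():
    """Return {name: number} for the core-dumping signals this platform defines."""
    found = {}
    for name in CORE_SIGNAL_NAMES:
        sig = getattr(signal, name, None)  # skip names missing on this platform
        if sig is not None:
            found[name] = int(sig)
    return found


if __name__ == "__main__":
    for name, number in sorted(core_dumping_signals().items(), key=lambda kv: kv[1]):
        print(f"{name:8} {number}")
```

A daemon that receives one of these signals and has no handler installed for it is the typical source of the unplanned core files you find on a box.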
At a high level, there are three types of core dumps seen on networking platforms, depending on the operating system and the hardware architecture. These are:
- Kernel core dumps.
- Daemon or process core dumps.
- Forwarding plane core dumps, if the modules run a micro-kernel.
These dump files are typically generated automatically by an unhandled error in the program flow, but they can also be generated on purpose without interrupting the process. These are known as live core dumps.
Generating Live Core dump files:
Core dumps are useful in a variety of situations, so it's good practice to generate them whenever you think they can support other logs. I will consider two different problems seen on network equipment. Core dumps might not be the best or the only way to approach them, but the examples will clarify how core dump files can be used to your advantage, and not just when daemons crash.
1- High CPU utilization problems
High CPU utilization is a nasty, common problem on network boxes, and it can cause a lot of side effects in the network. Usually you can find the cause of the high utilization by examining the logs on your system, but it's not always clear why the daemon or kernel is consuming so many CPU cycles.
In these cases, generating a live core dump is a useful way to help your vendor's engineers identify exactly what the daemon is doing that is consuming the CPU.
For this to be really effective, multiple live core dumps are needed while the process is in that bad state, so the engineers can examine what the daemon is doing and whether it is stuck in a certain function or instruction.
You can generate multiple live core dumps a few minutes apart. For example, you generate a core dump, wait until it's completely written to disk, wait a few minutes, and repeat the process a couple of times.
Having multiple live cores collected in this manner will help the engineers find out at least which instructions or functions might be causing the high CPU, and examine them more deeply.
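The "dump, wait, repeat" routine above can be sketched in Python. This is only an illustration under some loud assumptions: you have shell access to a Linux-based box with GDB's gcore utility installed, and your vendor allows you to use it. Most network operating systems provide their own CLI command for live cores, and that is the one to prefer. The function names and the live-core-N file prefix are invented for this example. Note also that gcore pauses the process briefly while the dump is being written.

```python
import subprocess
import time


def gcore_command(pid, prefix):
    """Build the gcore command line; gcore names its output '<prefix>.<pid>'."""
    return ["gcore", "-o", prefix, str(pid)]


def collect_live_cores(pid, count=3, interval_s=180,
                       run=subprocess.run, sleep=time.sleep):
    """Take `count` live core dumps of `pid`, `interval_s` seconds apart.

    `run` and `sleep` are injectable so the loop can be dry-run
    on a machine that has no gcore installed.
    """
    core_files = []
    for i in range(count):
        prefix = f"live-core-{i + 1}"  # hypothetical naming scheme
        # run() returns only after gcore exits, i.e. after the dump is written.
        run(gcore_command(pid, prefix), check=True)
        core_files.append(f"{prefix}.{pid}")
        if i < count - 1:
            sleep(interval_s)  # let the process keep running in its bad state
    return core_files
```

Calling collect_live_cores(1234) would, under these assumptions, leave three core files taken three minutes apart, ready to be attached to the support case together with the other logs.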
2- Memory Leaks
Memory leaks are another type of problem where looking at patterns over time is useful. A memory leak happens when a program has a bug related to memory allocation or de-allocation, so that memory is allocated but never released, which eventually exhausts the system's available memory.
For a software engineer to identify the source of the leak, besides other logs they will need to look at the memory utilization as it grows and look for patterns in the process's memory consumption.
Although core dump files are not the best way to find memory leaks, they can still help as part of the overall debugging strategy. Depending on how fast the memory is growing, taking multiple live core dumps of the process might help in identifying what is not being released. Restarting the process afterwards to release the memory, and then taking a core dump in the clean state, also gives a baseline to cross-check against.
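To judge how fast the memory is growing, and therefore how far apart to space the live cores, you can sample the process's resident memory over time. Below is a minimal Python sketch, assuming a Linux-based system where /proc/<pid>/status exposes a VmRSS line. The function names are made up for this illustration, and your platform's own CLI may already show the same figure.

```python
import re


def parse_vmrss_kb(status_text):
    """Extract VmRSS (resident memory, in kB) from /proc/<pid>/status text."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def read_vmrss_kb(pid):
    """Read the current resident memory of `pid` (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        return parse_vmrss_kb(f.read())


def growth_kb_per_min(samples):
    """Average growth between the first and last (seconds, kB) samples."""
    (t0, m0), (t1, m1) = samples[0], samples[-1]
    return (m1 - m0) * 60.0 / (t1 - t0)
```

For instance, two samples of 1000 kB and 1600 kB taken ten minutes apart give growth_kb_per_min([(0, 1000), (600, 1600)]) == 60.0, that is, about 60 kB per minute. A slow leak like that suggests spacing the live cores hours apart rather than minutes.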
So, how do you generate core dumps?
That depends on the network operating system the boxes are running, and probably on its underlying kernel. You should always be able to obtain this information from the vendor's support website.
The main point here is that you generate a LIVE core dump, one that doesn't result in restarting the process. If the process is restarted, it is very likely that the problem will be "temporarily" corrected by the restart, and the bad state you wanted to capture will be gone.
P.S.: In this series, called Troubleshoot, I will be exploring network problems from a system's point of view. Hopefully it will be interesting to some.