1. Overview

Although software and hardware have improved over the years, there are still cases where a program becomes stuck and unresponsive:

  • system resources are insufficient for the program to run
  • the program performs a never-ending computation
  • the program is waiting indefinitely for a network resource

In this tutorial, we’ll learn tools and techniques to find out why a program isn’t running as expected.

2. Use Case Program

Firstly, we’ll create a C program that reads a file and copies its contents to a character array:

$ cat -n filecopy.c
     1  #include <stdio.h>
     2  #include <stdlib.h>
     3  #include <string.h>
     4  #include <unistd.h>
     5
     6  int main(int argc, char* argv[]){
     7      FILE *fd = fopen(argv[1], "r");
     8      int SIZE = 1024*1000;
     9      char *content = (char *)calloc(1, sizeof(char));
    10      char *buffer = (char *)malloc(SIZE*sizeof(char));
    11      long i = 0;
    12      while(!feof(fd)){
    13          fgets(buffer, SIZE, fd);
    14          char* tmp = (char *)realloc(content, (i+1)*SIZE*sizeof(char));
    15          if(tmp != NULL){
    16              content = tmp;
    17              strcat(content, buffer);
    18          }
    19          i++;
    20      }
    21      fclose(fd);
    22  }

Notably, we used the -n option of the cat command to number the lines of the snippet for easier reference.

As we can see, the program opens the file named by the first command-line argument in read-only mode. Then, on each iteration of the while loop, fgets reads one line of at most SIZE-1 characters into the buffer variable, and realloc grows the content array by SIZE bytes. Finally, strcat appends the contents of buffer to content.

In essence, our objective is to have the program read the /dev/random special file. This file is a random number generator that never reaches end-of-file, so feof always returns 0 and the while loop runs indefinitely. Furthermore, since every iteration enlarges the content array via realloc and the program never frees it, the program’s memory grows continuously.
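
For contrast, a terminating variant of the read loop can be sketched as follows. It’s an illustration rather than a replacement for filecopy.c: the function name read_capped and the max_bytes limit are assumptions introduced here. The loop relies on the return value of fread instead of feof, and it stops once the cap is reached, so even an endless source can’t keep it spinning:

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper (not part of filecopy.c): reads at most max_bytes
 * from fp into a freshly allocated buffer and stores the byte count in
 * *out_len. Returns NULL on allocation failure; the caller frees the result. */
char *read_capped(FILE *fp, size_t max_bytes, size_t *out_len) {
    assert(fp != NULL && out_len != NULL);

    char *content = malloc(max_bytes + 1);  /* +1 for the terminating NUL */
    if (content == NULL) {
        return NULL;
    }

    size_t total = 0;
    while (total < max_bytes) {
        /* fread returns 0 at end-of-file or on error, which ends the loop */
        size_t n = fread(content + total, 1, max_bytes - total, fp);
        if (n == 0) {
            break;
        }
        total += n;
    }

    content[total] = '\0';
    *out_len = total;
    return content;
}
```

With this shape, a call such as read_capped(fp, 4096, &len) on a stream opened from /dev/random should return once 4096 bytes have been read, whereas a regular file shorter than the cap is read up to its end.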

Next, let’s save the code to filecopy.c, compile with gcc, and run the program in the background:

$ gcc -o filecopy filecopy.c
$ ./filecopy /dev/random &

Indeed, our program is now running. At this point, we’ve created a non-responsive program that runs in the background indefinitely.

Furthermore, let’s find the PID value of the process running the filecopy program using the ps and grep commands:

$ ps | grep filecopy
  11809 pts/1    00:00:08 filecopy

In this case, we can see that the PID value of the process that runs filecopy is 11809.

3. GNU Debugger

The first tool that we’ll examine is the GNU Debugger, or GDB. We can use GDB to obtain a stack trace of the program, which is a snapshot of the chain of function calls that are active at a given moment. Thus, we can get an insight into what the program is doing while it isn’t responding.

So, let’s attach GDB to a running process that we can identify via its PID:

$ sudo gdb
GNU gdb (Ubuntu 12.1-0ubuntu1~22.04) 12.1
...
(gdb) attach 11809
Attaching to process 11809
Reading symbols from .../filecopy...
...
(gdb) backtrace
#0  __strcat_avx2 () at ../sysdeps/x86_64/multiarch/strcat-avx2.S:192
#1  0x0000557d038502cb in main ()
(gdb)

In the above example, we first start gdb with superuser privileges. As a result, we enter the interactive mode of gdb. Next, we use the attach command with the PID of the process we want to link to and debug. Finally, we enter the backtrace command to get a stack trace of the running program.

In the stack trace, we can see that the filecopy program currently executes the __strcat_avx2 function, which is glibc’s AVX2-optimized implementation of strcat. Next, the second line shows us that strcat() is called from the main() function of our program.

4. GDB Program Symbols

An interesting point is that we can have extra information in the stack trace, like the line number of function calls, as well as the values of variables inside our code.

We can add such details by compiling our program with the -ggdb option:

$ gcc -ggdb -o filecopy filecopy.c

Now, if we load the filecopy program with GDB’s file command, the output of the backtrace command will be enriched:

$ sudo gdb
...
(gdb) file filecopy
Reading symbols from filecopy...
(gdb) attach 11809
Attaching to program: .../filecopy, process 11809
...
(gdb) backtrace
#0  __strcat_avx2 () at ../sysdeps/x86_64/multiarch/strcat-avx2.S:194
#1  0x0000557d038502cb in main (argc=2, argv=0x7ffc3ca846c8) at filecopy.c:17

Indeed, we can see that strcat() is called in line 17 of the main() function. Moreover, the stack trace contains the values of the argc and argv variables.

Notably, getting one stack trace isn’t sufficient to understand what is going wrong with a program. Most of the time, we’ll have to examine multiple stack traces of the program’s execution to troubleshoot the problem. A function call that keeps appearing in the stack trace could be indicative of an infinite loop, a time-consuming operation, or the program’s execution being blocked.

5. The strace Command

The strace command is a useful tool that reports system calls and the signals received by a running program. The output of strace can help us understand what a program is doing during its execution.

We can attach to a running process via the strace command and monitor the system calls performed:

$ sudo strace -p 11809
strace: Process 11809 attached
mremap(0x7dc4fc216000, 1258930180096, 1258931204096, MREMAP_MAYMOVE) = 0x7dc4fc216000
mremap(0x7dc4fc216000, 1258931204096, 1258932228096, MREMAP_MAYMOVE) = 0x7dc4fc216000
mremap(0x7dc4fc216000, 1258932228096, 1258933252096, MREMAP_MAYMOVE) = 0x7dc4fc216000
mremap(0x7dc4fc216000, 1258933252096, 1258934276096, MREMAP_MAYMOVE) = 0x7dc4fc216000
mremap(0x7dc4fc216000, 1258934276096, 1258935300096, MREMAP_MAYMOVE) = 0x7dc4fc216000
mremap(0x7dc4fc216000, 1258935300096, 1258936324096, MREMAP_MAYMOVE) = 0x7dc4fc216000
read(3, "\35\202\2\216i\7\357\30\3160\253s\332xL\317\250\f\203\23n\362]/\264\251\3466f\255\1\31"..., 4096) = 4096
...

As we can see, there are two system calls recorded by strace:

  • mremap: expanding or shrinking an existing memory mapping
  • read: reading a number of bytes from a file

In our case, strace keeps printing these two system calls to stderr. The mremap system call corresponds to the re-allocation of memory for the content variable: each call enlarges the mapping by 1,024,000 bytes, which matches the SIZE constant in our program. The read system call performs the reading of the file.

Thus, through the recorded system calls, we can have an understanding of what actions a program is performing while it’s not responding.

6. Getting the Status of a Process

There are commands that we can use to find information about a process, like how much memory it consumes and what state it is in. Sometimes, this hints at the reasons behind a given program getting stuck.

6.1. The ps Command

We can find the status and the consumed memory of a process via its PID using the ps command with the v option:

$ ps v -11809
    PID TTY      STAT   TIME  MAJFL   TRS   DRS   RSS %MEM COMMAND
  11809 pts/1    R      1:46      0     0 180298779 23820  0.2 ./filecopy /dev/random

Let’s explore the meaning of each column:

  • PID: the PID of the process
  • TTY: the associated terminal
  • STAT: the process’s status
  • TIME: CPU time
  • MAJFL: number of major page faults
  • TRS: text resident set size, the memory occupied by the program’s executable code
  • DRS: data resident set size, the memory occupied by everything other than executable code
  • RSS: resident set size, the non-swapped physical memory used, in kilobytes
  • %MEM: the ratio of the process’s resident set size to the physical memory of the machine
  • COMMAND: the command that initiated the process

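In our case, the STAT column has the value of R, which denotes that our process is running or runnable. The value S means that the process is in an interruptible sleep, waiting for an event to complete, while D means an uninterruptible sleep, usually waiting for I/O. Furthermore, the RSS column shows that the process currently holds about 23 MB (23820 KB) of physical memory. The DRS value is several orders of magnitude larger, which suggests that it reflects the address space reserved by the repeated realloc calls rather than memory that is actually resident in RAM.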

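Consequently, we can assess this information to troubleshoot the problem. For instance, a process that stays in the R state while its memory keeps growing points to a busy loop rather than a blocked wait.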
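
On Linux, the state and the resident memory of a process are also exposed through the /proc filesystem. As a rough illustration, the following sketch reads the State and VmRSS fields of /proc/<pid>/status directly. The struct and function names are hypothetical, and the code assumes a Linux system with /proc mounted:

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical container for the two fields we care about. */
struct proc_info {
    char state;    /* e.g. 'R', 'S', or 'D' */
    long rss_kib;  /* VmRSS in kB as reported by the kernel, -1 if absent */
};

/* Reads State and VmRSS from /proc/<pid>/status, where pid is given as a
 * string such as "11809" or "self".
 * Returns 0 on success, -1 if the file can't be opened or State is missing. */
int read_proc_info(const char *pid, struct proc_info *info) {
    assert(pid != NULL && info != NULL);

    char path[64];
    snprintf(path, sizeof path, "/proc/%s/status", pid);

    FILE *fp = fopen(path, "r");
    if (fp == NULL) {
        return -1;
    }

    info->state = '?';
    info->rss_kib = -1;

    char line[256];
    while (fgets(line, sizeof line, fp) != NULL) {
        /* Lines that don't match a pattern leave the field untouched */
        sscanf(line, "State: %c", &info->state);
        sscanf(line, "VmRSS: %ld", &info->rss_kib);
    }
    fclose(fp);

    return info->state == '?' ? -1 : 0;
}
```

Calling read_proc_info("11809", &info) in a loop with a short pause between calls would show whether the state ever leaves R and how quickly the resident memory grows.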

6.2. The top Command

Similarly to ps, we can use the top command to obtain data about a process:

$ top -p 11809
top - 15:49:10 up 15:46,  2 users,  load average: 1.00, 1.00, 0.89
Tasks:   1 total,   1 running,   0 sleeping,   0 stopped,   0 zombie
%Cpu(s): 99.7 us,  0.3 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  10607.8 total,   9840.9 free,    484.4 used,    282.6 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.   9860.6 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  11809 ubuntu    20   0  581.6g  77468   1320 R  99.0   0.7  28:43.99 filecopy

In the above example, we ran top with the -p option to display information about a specific process. Moreover, the first part of the command’s output contains information about the system, like the load average, total memory consumption, and others.

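Concerning the information displayed per process, we can find the process’s resident memory under the RES column, its virtual memory under the VIRT column, and its status under the S column. Here, the %CPU value of 99.0 together with the R status confirms that the process is busy computing rather than waiting.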

7. Examining the Logs

Most applications use logging to record their activity and their possible failures. In addition, it’s quite common that we can set the logging detail level of an application.

Many applications create a sub-directory in the /var/log directory to store their log files:

$ ls -l /var/log
total 3332
drwxr-xr-x  2 root      root               4096 Sep 18 10:34 apt
drwxr-xr-x  2 root      root               4096 Feb 10  2023 dist-upgrade
drwxr-xr-x  2 root      adm                4096 Sep 18 00:00 nginx
-rw-r-----  1 syslog    adm              257328 Sep 19 16:04 syslog
...

In this example, the NGINX server saves its log files to the nginx sub-directory.

So, reviewing a program’s logs may give us insight into why the program is stuck. In general, logs serve as an important source of information when troubleshooting a malfunctioning program.

8. Conclusion

In this article, we learned ways to troubleshoot an unresponsive program. In summary, we reviewed tools that provide information in several categories:

  1. the program’s stack trace
  2. memory consumption
  3. process information
  4. application errors

Ideally, data from each of these categories can be combined to provide an overall view of what is happening inside a program.