packagecloud.io: How does strace work?

This blog post explains how strace works, internally. We’ll examine the ptrace system call, which strace relies on, at the API layer and internally to understand how exactly strace can get information about the system calls being made in a running process.

ptrace

ptrace is a system call which a program can use to:

trace system calls
read and write memory and registers
manipulate signal delivery to the traced process

As you can see, ptrace is a really useful system call for tracing and manipulating other programs.

It is used by strace, GDB, and other tools as well.

You should take a look at the man page for ptrace for a lot of useful documentation.

ptrace and syscalls

For simplicity, let’s call the program that traces the system calls made by another program the tracer, and the program being traced the tracee.

The tracer calls ptrace with the PTRACE_ATTACH request and the process ID of the tracee. Once the tracee has stopped, the tracer calls ptrace again, this time with the PTRACE_SYSCALL request and the same process ID.

The tracee will run until it enters a system call, at which point it will be stopped by the Linux kernel. For the tracer, it will appear as if the tracee has been stopped because it received a SIGTRAP signal. The tracer can then inspect the arguments to the system call and print any relevant information.

Next, the tracer can call ptrace with PTRACE_SYSCALL again, which will resume the tracee but cause it to be stopped by the kernel when it exits the system call.

This pattern continues to trace the entry and exit from system calls allowing the tracer to inspect the tracee and print arguments, return values, timing information, and more.
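
To make the sequence above concrete, here is a minimal user-space sketch in C. It assumes Linux with glibc on x86-64, where the system call number is saved in orig_rax. The names first_traced_syscall and reap_tracee, and the pipe used to synchronize with the child, are hypothetical and exist only for this illustration. The function forks a child that calls getppid in a loop, attaches to it, and watches one system call enter and exit. It is not how strace itself is organized, only the bare request sequence.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: kill the tracee and collect its exit status. */
static void reap_tracee(pid_t pid)
{
    kill(pid, SIGKILL);
    waitpid(pid, NULL, 0);
}

/*
 * Returns the number of the first system call the child is seen entering
 * after PTRACE_ATTACH, or -1 if any step fails (for example, because
 * ptrace is not permitted or the CPU is not x86-64).
 */
long first_traced_syscall(void)
{
    pid_t parent = getpid();
    int ready[2];
    char byte = 0;
    int status;
    long nr = -1;
    pid_t pid;

    if (pipe(ready) == -1)
        return -1;
    pid = fork();
    if (pid < 0) {
        close(ready[0]);
        close(ready[1]);
        return -1;
    }
    if (pid == 0) {
        /* Tracee: announce readiness, then repeat one system call. */
        if (write(ready[1], &byte, 1) != 1)
            _exit(1);
        while (syscall(SYS_getppid) == parent)
            ;
        _exit(0);
    }

    /* Wait until the tracee is about to enter its loop. */
    close(ready[1]);
    if (read(ready[0], &byte, 1) != 1) {
        close(ready[0]);
        reap_tracee(pid);
        return -1;
    }
    close(ready[0]);

    /* 1. Attach. The kernel stops the tracee; wait for that stop. */
    if (ptrace(PTRACE_ATTACH, pid, NULL, NULL) == -1 ||
        waitpid(pid, &status, 0) != pid || !WIFSTOPPED(status)) {
        reap_tracee(pid);
        return -1;
    }

    /* 2. Resume until the tracee enters a system call (seen as SIGTRAP). */
    if (ptrace(PTRACE_SYSCALL, pid, NULL, NULL) == 0 &&
        waitpid(pid, &status, 0) == pid &&
        WIFSTOPPED(status) && WSTOPSIG(status) == SIGTRAP) {
#if defined(__x86_64__)
        /* On x86-64 the system call number is saved in orig_rax. */
        nr = ptrace(PTRACE_PEEKUSER, pid,
                    (void *)offsetof(struct user_regs_struct, orig_rax), NULL);
#endif
        /* 3. Resume again; the next stop is the exit from that call. */
        if (ptrace(PTRACE_SYSCALL, pid, NULL, NULL) == -1 ||
            waitpid(pid, &status, 0) != pid || !WIFSTOPPED(status))
            nr = -1;
    }

    reap_tracee(pid);
    return nr;
}
```

On a system where the attach is allowed, the returned value should equal SYS_getppid, since that is the only system call the child makes once it has signalled readiness.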

Now that the order of operations for working with ptrace has been outlined, let’s take a look at how this actually works under the hood in the kernel.

PTRACE_ATTACH

A good place to start looking is the code in the kernel for the ptrace system call. The code samples below will refer to the Linux Kernel 3.13 and there will be links to the source on GitHub.

We’ll start by understanding what PTRACE_ATTACH does.

The generic ptrace system call code can be found in kernel/ptrace.c. Take a look at the source on GitHub.

A few lines in, the source for ptrace checks the request parameter for PTRACE_ATTACH:

    if (request == PTRACE_ATTACH || request == PTRACE_SEIZE) {
        ret = ptrace_attach(child, request, addr, data);
        /*
         * Some architectures need to do book-keeping after
         * a ptrace attach.
         */
        if (!ret)
            arch_ptrace_attach(child);
        goto out_put_task_struct;
    }

ptrace_attach

If it matches, ptrace_attach is called, which you can find toward the top of the file.

ptrace_attach does a few things early on:

sets flags which will eventually be stored on a structure in the kernel representing the process that is being attached to
ensures that the task it will attach to is not a kernel thread
ensures that the task it will attach to is not a thread of the current process
performs some security checks by calling __ptrace_may_access

After that, the flags are stored on the task and, for PTRACE_ATTACH, a SIGSTOP is sent to the process to stop it.

In our case, that flag is PT_PTRACED.

The ptrace_attach function completes and execution resumes in ptrace.
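
The thread-group check is easy to observe from user space. The sketch below asks ptrace to attach to the calling process itself, which on Linux is expected to be refused with EPERM. The function name attach_to_self_errno is hypothetical and used only for this illustration.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Tries to attach to the calling process itself and returns the errno
 * that ptrace reports, or 0 if the call unexpectedly succeeds.
 */
int attach_to_self_errno(void)
{
    errno = 0;
    if (ptrace(PTRACE_ATTACH, getpid(), NULL, NULL) == -1)
        return errno;
    return 0;
}
```

A caller could compare the result against EPERM to confirm that a process cannot trace a thread in its own thread group.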

Finishing the PTRACE_ATTACH from ptrace

Back in ptrace, arch_ptrace_attach gives the architecture a chance to do any book-keeping it needs. Then, as the goto in the attach snippet shown earlier indicates, execution jumps straight to out_put_task_struct, the reference on the task structure is dropped, and the system call returns.

Note that an attach request never reaches ptrace_check_attach or arch_ptrace. Those are only used for requests made against a task that is already being traced.

Now that we’ve seen how PTRACE_ATTACH works, we can start examining PTRACE_SYSCALL.

PTRACE_SYSCALL

The ptrace system call begins as it did in the PTRACE_ATTACH case. Since this is not an attach request, the attach branch is skipped and the first thing the source does is call ptrace_check_attach to ensure that the process is stopped and ready for ptrace operations.

Next, ptrace calls arch_ptrace, which is a function supplied by the CPU architecture specific code. For our purposes, this is the x86 ptrace code, which can be found in arch/x86/kernel/ptrace.c.

If you follow the huge switch statement in arch_ptrace to the bottom, you’ll see that there is no case for PTRACE_SYSCALL, so the default case will execute, which simply hands execution back up to the generic ptrace code in a function called ptrace_request.

In ptrace_request for the case of PTRACE_SYSCALL, ptrace_resume is called.

ptrace_resume

This function starts by setting the TIF_SYSCALL_TRACE flag on the thread info structure for the tracee.

A few possible states are checked (as other functions might call ptrace_resume) and finally the tracee is woken up and execution resumes until the tracee enters a system call.

Entering system calls

So, we’ve seen how the kernel tracks when a process’ system calls should be traced by setting a flag (TIF_SYSCALL_TRACE) on a thread info structure associated with a process.

The question then follows: when is that flag checked and acted upon?

Whenever a system call is made by a program, there is CPU architecture specific code that is executed on the kernel side prior to the execution of the system call itself.

The code that is executed on x86 CPUs when a system call is made is written in assembly and can be found in arch/x86/kernel/entry_64.S.

_TIF_WORK_SYSCALL_ENTRY

If you take a look at the code for the assembly function system_call, you will see that about 20 lines into the function a flag called _TIF_WORK_SYSCALL_ENTRY is checked:

    testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
    jnz tracesys

If this flag is set, execution moves to tracesys. Looking at the definition of this flag:

    /* work to do in syscall_trace_enter() */
    #define _TIF_WORK_SYSCALL_ENTRY \
        (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_EMU | _TIF_SYSCALL_AUDIT | \
         _TIF_SECCOMP | _TIF_SINGLESTEP | _TIF_SYSCALL_TRACEPOINT | \
         _TIF_NOHZ)

You can see that the flag is actually a combination of several flags, including the one we saw getting set earlier: _TIF_SYSCALL_TRACE.

And so: on every system call made, the thread info structure for a process has its flags checked against this mask, which includes _TIF_SYSCALL_TRACE. If any of those flags is set, execution moves to tracesys.
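
The mask test itself is ordinary bit arithmetic, and can be sketched in C as follows. The bit positions and all names here (FLAG_SYSCALL_TRACE, WORK_SYSCALL_ENTRY, takes_slow_path, reports_to_tracer) are made up for the example and are not copied from any kernel header. Only the shape of the check is meant to match.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit assignments, not taken from the kernel sources. */
#define FLAG_SYSCALL_TRACE  (1u << 0)
#define FLAG_SYSCALL_EMU    (1u << 1)
#define FLAG_SYSCALL_AUDIT  (1u << 2)
#define FLAG_SECCOMP        (1u << 3)
#define FLAG_UNRELATED      (1u << 4)  /* stands in for any other flag */

/* Combined mask, in the spirit of _TIF_WORK_SYSCALL_ENTRY. */
#define WORK_SYSCALL_ENTRY \
    (FLAG_SYSCALL_TRACE | FLAG_SYSCALL_EMU | \
     FLAG_SYSCALL_AUDIT | FLAG_SECCOMP)

/* Like the testl/jnz pair: any bit in the mask selects the slow path. */
bool takes_slow_path(uint32_t flags)
{
    return (flags & WORK_SYSCALL_ENTRY) != 0;
}

/* The narrower question asked later: is syscall tracing itself enabled? */
bool reports_to_tracer(uint32_t flags)
{
    return (flags & FLAG_SYSCALL_TRACE) != 0;
}
```

With these definitions, a task that only has the seccomp bit set takes the slow path but is not reported to a tracer, while a task with the tracing bit set does both.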

tracesys

If you read the source for tracesys, you can see that about 10 lines in, the function syscall_trace_enter is called.

This function is actually defined in the CPU specific ptrace code found in arch/x86/kernel/ptrace.c, which you can find here.

This function checks to see if the _TIF_SYSCALL_TRACE flag is set and if so, tracehook_report_syscall_entry is called:

    if ((ret || test_thread_flag(TIF_SYSCALL_TRACE)) &&
        tracehook_report_syscall_entry(regs))
        ret = -1L;

tracehook_report_syscall_entry is a static inline wrapper function from include/linux/tracehook.h and it has some great documentation.

It calls ptrace_report_syscall which is defined in the same file, but near the top.

ptrace_report_syscall

The ptrace_report_syscall function matches what was described earlier: a SIGTRAP is generated when a traced process enters a system call:

    ptrace_notify(SIGTRAP | ((ptrace & PT_TRACESYSGOOD) ? 0x80 : 0));

ptrace_notify is implemented in kernel/signal.c here.

As you can see, ptrace_notify calls ptrace_do_notify, which prepares a signal info structure for delivery to the process by ptrace_stop, which is also found in kernel/signal.c.
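
From the tracer's side, the 0x80 bit is what makes a system call stop distinguishable from an ordinary SIGTRAP. A tracer sets PT_TRACESYSGOOD by passing PTRACE_O_TRACESYSGOOD to PTRACE_SETOPTIONS, and can then classify the status returned by waitpid roughly like this. The helper names is_syscall_stop and is_plain_sigtrap_stop are hypothetical.

```c
#include <signal.h>
#include <stdbool.h>
#include <sys/wait.h>

/* True if status is a stop reported as SIGTRAP | 0x80 (a syscall stop). */
bool is_syscall_stop(int status)
{
    return WIFSTOPPED(status) && WSTOPSIG(status) == (SIGTRAP | 0x80);
}

/* True if status is a stop reported as a bare SIGTRAP. */
bool is_plain_sigtrap_stop(int status)
{
    return WIFSTOPPED(status) && WSTOPSIG(status) == SIGTRAP;
}
```

Without the option, both kinds of stop are reported as a bare SIGTRAP and the tracer has to tell them apart by other means.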

SIGTRAP

Once the tracee receives a SIGTRAP, the tracee is stopped and the tracer is notified that a signal is pending for the process. The tracer can then examine the state of the tracee and print register values, timestamps, or other information.

This is how strace prints its information to the terminal when you trace a process.

syscall_trace_leave for the exit path

A similar code path is executed for the exit path of the system call:

syscall_trace_leave is called by the assembly code
this function calls tracehook_report_syscall_exit
which also calls ptrace_report_syscall, just like the entry path.

And this is how tracing processes are notified when a system call completes so they can harvest the return value, timing information, or anything else needed to print useful output for the user.
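
As a rough illustration of harvesting a return value at the exit stop, the following C sketch traces a child through a single getppid call. It makes several assumptions: Linux with glibc on x86-64 (where the return value is in rax and the system call number in orig_rax), and a child that opts in with PTRACE_TRACEME rather than being attached with PTRACE_ATTACH, which keeps the example short. The function name traced_getppid_result is hypothetical.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Runs getppid() in a traced child and returns the value the tracer reads
 * from the child's registers at the syscall-exit stop, or -1 on failure.
 */
long traced_getppid_result(void)
{
    int status;
    int in_syscall = 0;   /* flips on every syscall stop: 1 = entered */
    long result = -1;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Tracee: ask to be traced by the parent, then stop and wait. */
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) == -1)
            _exit(1);
        raise(SIGSTOP);
        syscall(SYS_getppid);
        _exit(0);
    }

    /* Wait for the SIGSTOP the child raised on itself. */
    if (waitpid(pid, &status, 0) != pid || !WIFSTOPPED(status))
        return -1;

    /* Ask for syscall stops to be reported as SIGTRAP | 0x80. */
    if (ptrace(PTRACE_SETOPTIONS, pid, NULL,
               (void *)(long)PTRACE_O_TRACESYSGOOD) == -1) {
        kill(pid, SIGKILL);
        waitpid(pid, NULL, 0);
        return -1;
    }

    for (;;) {
        if (ptrace(PTRACE_SYSCALL, pid, NULL, NULL) == -1 ||
            waitpid(pid, &status, 0) != pid)
            break;
        if (!WIFSTOPPED(status))
            return result;      /* child exited and has been reaped */
        if (WSTOPSIG(status) != (SIGTRAP | 0x80))
            continue;           /* not a syscall stop */
        in_syscall = !in_syscall;
        if (in_syscall)
            continue;           /* entry stop: wait for the matching exit */
#if defined(__x86_64__)
        {
            struct user_regs_struct regs;

            /* Exit stop: rax now holds the system call's return value. */
            if (ptrace(PTRACE_GETREGS, pid, NULL, &regs) == 0 &&
                regs.orig_rax == SYS_getppid)
                result = (long)regs.rax;
        }
#endif
    }

    kill(pid, SIGKILL);
    waitpid(pid, NULL, 0);
    return result;
}
```

When tracing is permitted, the value read from the child's registers should be the tracer's own process ID, because the tracer is the child's parent.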

Conclusion

ptrace is an incredibly useful system call for debuggers, tracers, and other system programs that need to extract useful information from programs. strace is implemented primarily by relying on ptrace.

ptrace internals are a bit tricky, as execution is transferred between a set of files, but the implementation itself is relatively straightforward.

I also encourage you to check out the source code for your favorite debugger and see how it uses ptrace to examine program state, modify registers and memory, etc.
