APUE2e - Chapter 1 - UNIX System Overview

I'm taking the plunge and starting a detailed reading of Advanced Programming in the UNIX Environment, Second Edition, by Stevens and Rago, also known as just APUE2e. I've had the book for a couple of years and done little more than flip through it. The spine still cracks. It's a massive book. This could take a while...

There are some big programs that I'd like to write. Ideas for a language interpreter, a text editor, and an HTTP server keep distracting me, and all three are intimately involved with the operating system's process, thread, and networking features. I have a general understanding of most of the principles but many holes that need filling. I'm not saying that I'm going to write these programs, but I'd at least like to know how an implementation might be written. So I'm approaching my reading of APUE2e from an application programmer's perspective.

User Accounts

Section 1.3 discusses the login process and user accounts.

The /etc/passwd file contains a line for each user account with the user's id, group id, home directory and shell program, etc. One of the features about UNIX that I like is it's easy to poke around and read these sorts of system files because they are in plain text. On my Mac OS X 10.5 system, I took a quick look at the /etc/passwd file.

$ cat /etc/passwd 
##
# User Database
# 
# Note that this file is consulted directly only when the system is running
# in single-user mode.  At other times this information is provided by
# Open Directory.
#
# This file will not be consulted for authentication unless the BSD local node
# is enabled via /Applications/Utilities/Directory Utility.app
# 
# See the DirectoryService(8) man page for additional information about
# Open Directory.
##
nobody:*:-2:-2:Unprivileged User:/var/empty:/usr/bin/false
root:*:0:0:System Administrator:/var/root:/bin/sh
daemon:*:1:1:System Services:/var/root:/usr/bin/false
_uucp:*:4:4:Unix to Unix Copy Protocol:/var/spool/uucp:/usr/sbin/uucico
_lp:*:26:26:Printing Services:/var/spool/cups:/usr/bin/false
...

I was disappointed to see my own "peter" account was not listed here. It seems to commonly be the case that OS X is set up almost like most UNIX systems but not quite. There are three files in /etc with "pass" in the name: kcpassword, master.passwd, passwd. None of these files contain my "peter" account. After some more poking around, the following command seemed to show my user's data.

$ sudo cat /var/db/dslocal/nodes/Default/users/peter.plist
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
...
  <key>gid</key>
  <array>
    <string>20</string>
  </array>
  <key>home</key>
  <array>
    <string>/Users/peter</string>
  </array>
...
  <key>name</key>
  <array>
    <string>peter</string>
  </array>
  <key>passwd</key>
  <array>
    <string>********</string>
  </array>
...
  <key>realname</key>
  <array>
    <string>Peter Michaux</string>
  </array>
  <key>shell</key>
  <array>
    <string>/bin/bash</string>
  </array>
  <key>uid</key>
  <array>
    <string>501</string>
  </array>
</dict>
</plist>

Looking on my Debian Etch server, things look more reasonable. The /etc/passwd file is just as advertised in APUE2e and it contains a line for each user account.

$ cat /etc/passwd
root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/bin/sh
...
peter:x:1011:1011:,,,:/home/peter:/bin/bash

I suspect my appreciation for Debian (and perhaps FreeBSD) will grow throughout my APUE2e reading.

Files and Directories

There wasn't much surprising information in section 1.4 for UNIX command line users: directories are also files, . is the current directory, .. is the parent directory, files in the paths are separated by slashes, etc. There is a discussion of "current working directory" and how it relates to relative file paths.

The bonus of section 1.4 is the book's first code listing for a small ls-type program to list the files in a directory. I can remember when I was learning UNIX as a user, I found the programs in /bin and similar directories to be mysterious wonders. The 20-line example here takes away almost all of that mystery. It shows how a program like ls can be just a little C program that interacts with the system. In this case, the interaction is through the dirent.h header. The fact that the standard /bin/ls program has bloated over time is just the inevitable result of many programmers using ls and wanting it to do many things.

Below is how I implemented the book's example. This is a self-contained version (it doesn't use the apue.h header that the book uses and supplies in Appendix B).

#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <errno.h>
#include <dirent.h>

int main(int argc, char *argv[]) {
  
  DIR *dp;
  struct dirent *dirp;
  
  if (argc != 2) {
    fprintf(stderr, "usage: ls directory_name\n");
    exit(1);
  }
  
  if ((dp = opendir(argv[1])) == NULL) {
    fprintf(stderr, "can't open %s: %s\n", argv[1], strerror(errno));
    exit(1);
  }
  
  while ((dirp = readdir(dp)) != NULL) {
    printf("%s\n", dirp->d_name);
  }
  
  closedir(dp);
  exit(0);  
}

It is simple to compile and run.

$ gcc -o myls myls.c
$ ./myls /etc 
.
..
adduser.conf
adjtime
aliases
aliases.db
alternatives
apache2
apt
bash.bashrc
...

So there you have it. If you install this compiled program in /usr/local/bin, you would have a system-wide, command-line tool that you wrote. No mystery.

An Operating System Provides Services to Application Programs

The main point of APUE2e chapter 1 seems to be that an operating system is a collection of software that runs the hardware and provides a set of services to application programs. The services are accessed by calling functions provided in the system's C header files.

The functions provided by the system can be broken into "system calls" and "library routines". The system calls are the lowest-level, fundamental functions provided by the kernel of the operating system. It shouldn't be possible to rewrite a system call in terms of other system calls. (Maybe in practice it is possible in some cases.) The library routines, on the other hand, are divisible and are written in terms of system calls and other library routines.

From an application programmer's perspective, it shouldn't be important which functions are system calls and which are library routines. This can be the case with all bottom-up implementations. It is enough to know that the operating system provides the functions, which header file contains which function, and what each function does.

One example in the chapter uses memory allocation. The sbrk system call increases the memory allocated to a program. The C standard library malloc library routine is a wrapper function which uses sbrk to make memory management easier for the application programmer than it would be to use sbrk directly.

stdin, stdout, and stderr

stdin, stdout, and stderr are file pointers defined in C's standard stdio.h header. These are likely familiar to anyone who has programmed in C.

Taking the opportunity to play around with little bits of code while the book topics are still simple, here is a program which reads from stdin and writes to both stdout and stderr.

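#include <stdio.h>
#include <string.h>

#define MAXIN 1000

int main(void) {
  char input[MAXIN];
  /* fgets returns NULL on end-of-file or error, leaving input unset */
  if (fgets(input, MAXIN, stdin) == NULL) {
    fprintf(stderr, "no input\n");
    return 1;
  }
  /* strlen returns size_t, so the matching conversion is %zu, not %d */
  fprintf(stdout, "This message goes to stdout. (The input length was %zu)\n", strlen(input));
  fprintf(stderr, "This is an error message\n");
  return 0;
}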

Running this from the command line:

$ ./a.out
some input
This message goes to stdout. (The input length was 11)
This is an error message

It is possible to redirect all three of stdin, stdout, and stderr from the command line so that stdin is taken from the input.txt file, stdout is written to the out.txt file, and stderr is written to err.txt. The bash shell makes this redirection easy, as discussed in UNIX in a Nutshell, fourth edition, by Robbins (page 355).

$ ./a.out < input.txt  > out.txt 2> err.txt
$ cat input.txt 
some input file
$ cat out.txt
This message goes to stdout. (The input length was 16)
$ cat err.txt 
This is an error message

It seems this can be done slightly more verbosely with the three file descriptors all included.

$ ./a.out 0< input.txt  1> out.txt 2> err.txt

What surprised me about the first example in APUE2e's section 1.5 was the use of the concepts of standard input and output without the use of stdio.h's stdin and stdout. Instead this section uses unistd.h's STDIN_FILENO and STDOUT_FILENO. Why not just use stdin and stdout? I think the reason is that the function calls in the example are to read and write, which are system calls declared in unistd.h. They take integer file descriptors, whereas stdin and stdout are FILE pointers for the stdio.h functions, so the two aren't interchangeable. The example stays entirely at the file descriptor level and doesn't use anything from stdio.h.

Looking in /usr/include/unistd.h the three macro definitions appear:

#define	 STDIN_FILENO	0	/* standard input file descriptor */
#define	STDOUT_FILENO	1	/* standard output file descriptor */
#define	STDERR_FILENO	2	/* standard error file descriptor */

Processes and Threads

Parallel processing is a hot topic these days and is set to grow with the future of multi-core processors. I'm interested in how the UNIX process and thread facilities are manipulated from inside a C program and it looks like APUE2e will cover these topics extensively.

Section 1.6 introduces the idea of a process and how one process can start another process. It is amazing to see the section's 30-line, second example which is a rudimentary shell. Programs in C are supposed to be verbose, aren't they? The example uses the fork, execlp, and waitpid functions from the unistd.h and sys/wait.h headers. The way one process starts another process is pretty weird in UNIX.

Calling the fork function creates a copy of the current process. The call to fork returns in each of the processes and they continue executing right where the fork call was in the source code. fork returns 0 in the new (child) process and returns the child's process id in the original (parent) process. Inspecting the returned value lets the child and parent processes continue on different paths.

If the parent process is using a lot of memory, making a complete copy of the parent process seems very expensive. I vaguely remember reading that some implementations make the copy of the parent's memory in a lazy fashion (copy-on-write). That is, a page of memory is copied only when one of the processes first modifies it. Even with this implementation, it seems to make sense to keep the parent process as light as possible.

Since the objective isn't to have two copies of the same program executing in two processes, the execlp function is called in the child process. This replaces the program in the child process with a new program.

These two steps are a strange way to spawn a new process but it's the UNIX way.

The mention of threads in this chapter is short and scary. Synchronization of threads in a single process accessing shared memory is known to be a difficult problem. (It is interesting to note that a new program like Google's Chrome browser has each browser window/tab in a separate process to avoid some of the problems with threads: especially one thread crashing and taking down a whole process.)

Error Handling

C's errno global variable makes me squirm like nothing else in code can. Using a global variable as a way to "return" information to the caller not only makes using the code in a threaded environment impossible, it is confusing for the programmer to follow the logic. Better to come up with some struct with multiple fields, pass a pointer to the struct to the function and read the fields after the function has returned.

Section 1.7 discusses how errno can be accessed in a thread-safe way under Linux, with errno defined as a macro.

extern int *__errno_location(void);
#define errno (*__errno_location())

On OS X 10.5 the sys/errno.h header file does something very similar.

extern int * __error(void);
#define errno (*__error())

So it looks like the implementations have made errno safe to set and get without any need to program with the threading in mind. All this errno business makes me think of thread locals in Java.

Signals

When you press control-c to stop a terminal process you are sending an interrupt signal to the process. (If you've never needed this try find / -name "foo.c" -print. You're either patient or you'll be reaching for control-c quickly.) A program's default behavior when receiving an interrupt signal is to terminate. If, inside the program, you define an interrupt signal handler function then you can have the program do anything you want when that signal arrives. Remember the first time you tried to quit Vi or Emacs? I do: control-c, Control-C, CONTROL-C, CONTROL-C!!!!!!!!!

Terminating main by Calling exit

One small implementation detail of the chapter 1 examples really stuck out like a sore thumb to me: the APUE2e authors like to terminate the main function by explicitly calling the exit function from the stdlib.h header file of the C standard library.

For example, the simple "hello, world" program written APUE2e style would be:

#include "apue.h"

int
main(void)
{
  printf("hello, world\n");
  exit(0);
}

Note that the apue.h header (in APUE2e Appendix B) includes both stdio.h and stdlib.h for the printf and exit functions respectively.

The above program appears to terminate quite differently than the "hello, world" of The C Programming Language second edition by Kernighan and Ritchie (a.k.a. K&R or K&R2e) which terminates by simply falling off the end of main.

#include <stdio.h>

main()
{
  printf("hello, world\n");
}

Section 5.1.2.2.3 of the C99 Standard discusses terminating main with a call to exit:

If the return type of the main function is a type compatible with int, a return from the initial call to the main function is equivalent to calling the exit function with the value returned by the main function as its argument; reaching the } that terminates the main function returns a value of 0. If the return type is not compatible with int, the termination status returned to the host environment is unspecified.

So the C99 spec doesn't really provide any reason to choose to call exit explicitly to end main.

K&R2e section 7.6 also discusses terminating main with a call to exit:

Within main, return expr is equivalent to exit(expr). exit has the advantage that it can be called from other functions, and that calls to it can be found with a pattern-searching program....

Well, I suppose that is at least a small reason: if using exit for all terminations in main then grep exit filename will find all terminations in all functions of the program including main.

If I find myself liking the explicit call to exit, then I suppose I'd find myself writing "hello, world" as:

#include <stdlib.h>
#include <stdio.h>

int main(void) {
  printf("hello, world\n");
  exit(0);
}

Summary

APUE2e is a highly regarded book and I can see why. It helps that I've been picking away at some of these ideas over the past few years but the book is clear and easy to read. I'm looking forward to the nitty-gritty details to come in the following 2 kg of book.

Notes

APUE2e site

Comments


Stefan Weiss June 9, 2009

Welcome to Unix programming. I haven't gotten around to reading APUE2e yet, but it looks very interesting. I'm looking forward to the summaries of the next sections, if you decide to continue posting them.

Regarding return vs exit at the end of main() in a C program, IMO it's more a question of personal preference. C++ is another matter - the important difference here is that exit will never return, which means that the destructors for any objects in the local scope will not be executed. _exit and abort are two more ways to end the program, each with their own specialities. There is a good summary of the differences in the second reply here and a code example in the link given there.

Peter Michaux June 9, 2009

Stefan,

Thanks for the information about exiting a program. I was hoping my post would gain a comment just like yours.

I do plan on posting more summaries as it is a good way for me to engage the text more actively.

Flooey June 11, 2009

As a slight correction/expansion, when you type Ctrl-C, it's actually your TTY that sees the keystroke and turns it into SIGINT. You can change that if you wish (via a command like `stty intr ^E`, which would switch it to Ctrl-E). Because of this, Emacs probably doesn't install a signal handler for SIGINT; instead, it probably changes the TTY settings to deliver the Ctrl-C keystroke directly instead of transforming it into a signal.

Peter Michaux June 9, 2009

Flooey, Thanks for the comment. I'm sure I'll be learning a lot more about this sort of thing as APUE2e unfolds.

Greg A. Woods June 26, 2011

First off let me thank you for sharing your thoughts and perspectives on your first encounters with lower-level systems programming. Those of us who've "always" known these things are often confused when someone doesn't immediately "get it". Your story here sheds light on some of the things that could perhaps be better explained.

Your discussion about separate fork() and exec() is a perfect example of this. The elegance of this situation is lost unless you know a great deal more about the underlying issues. I.e. I think it's very difficult to explain the "obvious" reasons for why they are separate calls without going into a huge amount of background material and looking much deeper under the hood. This is one of the issues I remember having to learn about myself, and in retrospect I'm not sure when the lightbulb came on, but at some point it just seemed obvious and I couldn't imagine why I didn't think so before.

You've chosen a very good guide though to help you along! It's one of the best books there is on the topic.

Now, on to the issue of terminating a program by calling exit(), even from main(). Indeed the state of the literature and common folklore is far from ideal on this issue.

Technically a "return X;" statement in "main()" is not precisely equivalent to a call to "exit(X);", since the automatic storage of "main()" vanishes when "main()" returns, but it does not vanish if a call to "exit()" is made instead. It is entirely possible and legal for a program to reference automatic storage in the context of "main()" during its last breath, so to speak. The classic example is a function registered through "atexit()" to be called during process shutdown which might reference storage for an automatic variable declared in "main()", usually via some other global setting, e.g. via "setbuf()" et al.

Furthermore, in C or any C-like language a "return" statement strongly hints to the reader that execution will continue in the calling function, and while this continuation of execution is usually technically true if you count the C startup routine which called your "main()" function, it's not exactly what _you_ mean when you mean to end the process.

After all, if you want to end your program from within any other function except "main()" you _must_ call "exit()" (and not try "return"). Using "exit()" consistently in "main()" as well makes your code much more readable, and it also makes it much easier for anyone to re-factor your code; i.e. code copied from "main()" to some other function won't misbehave because of accidental "return" statements that _should_ have been "exit()" calls.

So, combining all of these points together the conclusion is that it's a _bad habit_, at least for C, to use a "return" statement to end the program in "main()".
