System-level routines for the UNIX operating system and their applications

We discuss here some system-level implementations of functions in C.

Context

Kernighan and Ritchie’s book on the C Programming Language is widely recognized as a standard reference for learning C. The book is culminated in Chapter 8, where it presents implementations of several system-level routines for the UNIX operating system and the applications of them. Providing a concise summary of how they’re used can be useful for quick reference.

1) Fopen

First of all, fopen utilizes a lower level, finer control routine called ‘open.’ They are slightly different:

int open(char* name, int flags, int perms)
FILE* fopen(char *name, char* mode)

As we can see, open returns file descriptor, while fopen returns a pointer to a higher-level file stream object. Further, fopen uses a string mode such as “r”, “w”, “a” for read, write, append, etc. In the end, fopen is part of the C standard library and is used by a user directly, while open is specific to systems that support POSIX. With this in mind, let us dive into the implementation of fopen.

#include <fcntl.h>
#include "syscalls.h"
#define PERMS 0666 /* RW for owner, group, others */

FILE *fopen(char *name, char *mode)
{
    int fd;
    FILE *fp;

    if (*mode != 'r' && *mode != 'w' && *mode != 'a')
        return NULL;
    for (fp = _iob; fp < _iob + OPEN_MAX; fp++)
        if ((fp->flag & (_READ | _WRITE)) == 0)
            break; /* found free slot */
    if (fp >= _iob + OPEN_MAX) /* no free slots */
        return NULL;

    if (*mode == 'w')
        fd = creat(name, PERMS);
    else if (*mode == 'a') {
        if ((fd = open(name, O_WRONLY, 0)) == -1)
            fd = creat(name, PERMS);
        lseek(fd, 0L, 2);
    } else
        fd = open(name, O_RDONLY, 0);
    if (fd == -1) /* couldn't access name */
        return NULL;
    fp->fd = fd;
    fp->cnt = 0;
    fp->base = NULL;
    fp->flag = (*mode == 'r') ? _READ : _WRITE;
    return fp;
}

In words, it can be described as follows.

In order to manage file opening, the function initially ensures that the mode parameter corresponds to one of the accepted modes (validated in the first if statement). It then scans for an unused entry within an array of file structures (_iob array) to use this file (executed in the for loop). If all entries are in use, the function yields null return (checked in the second if statement).

Depending on the specified mode, we generate file descriptor (the third if statement), and in append mode (‘a’), the file’s offset for write position is adjusted (lseek).

Subsequently, the function populates the relevant data to the file and return its pointer (last part of the code).

Inquiry points

1) what is (fp->flag & (_READ _WRITE)) == 0?

If this evaluates to true, it means fp->flag is not _READ nor _WRITE, which says that FILE structure is available.

2) what is PERMS?

There is an array of flags available for file operations. For example, O_RDONLY is for read-only access, O_WRONLY is for write-only access, and O_RDWR is for both reading and writing. The flag O_CREAT says “create a file if it does not exist.” When the O_CREAT flag is employed, the PERMS parameter is used to set the file’s access permissions. This permission parameter is composed of nine bits, grouped into three sets of three bits each, which define the access rights for the file’s owner, the group the owner belongs to, and all other users, respectively.

2) fillbuf

This routine is utilized within the routine getchar() function, which retrieves a single character from the standard input (usually keyboard input). It is treated as if it were a file stream by the C standard I/O.

#define getchar() getc(stdin)
#define getc(p) (--(p)->cnt >= 0 \
? (unsigned char) *(p)->ptr++ : _fillbuf(p))

getc macro says that ‘if there are characters left in the buffer ($cnt>=0$), increment buffer pointer; if the buffer is empty, refill the buffer with new data from the file.’ With this motivation, let us look at the implementation of _fillbuf.

#include "syscalls.h"
/* _fillbuf: allocate and fill input buffer */
int _fillbuf(FILE *fp)
{
    int bufsize;

    if ((fp->flag & (_READ | _EOF | _ERR)) != _READ)
        return EOF;
    bufsize = (fp->flag & _UNBUF) ? 1 : BUFSIZ;
    if (fp->base == NULL) /* no buffer yet */
        if ((fp->base = (char *) malloc(bufsize)) == NULL)
            return EOF; /* can't get buffer */
    fp->ptr = fp->base;
    fp->cnt = read(fp->fd, fp->ptr, bufsize);
    if (--fp->cnt < 0) {
        if (fp->cnt == -1)
            fp->flag |= _EOF;
        else
            fp->flag |= _ERR;
        fp->cnt = 0;
        return EOF;
    }
    return (unsigned char) *fp->ptr++;
}

In words, the _fillbuf function performs a series of checks before returning a character from the current position of the pointer (*fp->ptr) and then advancing the pointer.

Initially, the function verifies if the file is open for reading (READ flag) and whether it has not reached the end of the file (EOF) or encountered an error (ERR); if the condition is not met, the function returns EOF immediately (the first if statement).

Next, the function determines the size of the buffer to use. It sets bufsize to 1 if the file is marked as unbuffered (UNBUF flag is set), and to BUFSIZ (1024) otherwise.

The second if statement checks whether a buffer has already been allocated for this file stream (fp->base). If not, the function attempts to allocate it using malloc. Should memory allocation fail (indicating no available memory), it then returns EOF.

Subsequently, the function updates the file stream structure (fp) with the necessary information, including setting up the buffer pointers.

The last if statement examines whether the count of characters read (cnt) indicates an error or end-of-file condition. If cnt is less than zero after decrementing, it checks whether this is because the end of the file has been reached or because of an error, updating the file stream’s status flags accordingly.

Finally, if no errors are present, it returns the character currently pointed to by the buffer pointer (*fp->ptr).

Inquiry points

1) why do we have – in ‘if (–fp->cnt <0)’?

*fp->ptr++ is responsible for actually reading the character from the buffer, and a priori, we decrement fp->cnt for efficiency.

2) how does read routine work?

int read(int fd, char *buf, int n)

It reads from file descriptor fd to a buffer. n is a maximum bytes to be read, and the returned integer is a number of bytes that were actually read.

3) fsize

In the context of file system operations, directories are essentially treated as a special type of file. Each directory has an associated inode, which functions as a unique identifier within the file system.

#define NAME_MAX 14 /* longest filename component; */
                /* system-dependent */

typedef struct { /* portable directory entry */
    long ino;                /* inode number */
    char name[NAME_MAX+1];   /* name + '\0' terminator */
} Dirent;

typedef struct { /* minimal DIR: no buffering, etc. */
    int fd;                 /* file descriptor for the directory */
    Dirent d;               /* the directory entry */
} DIR;

As we can see DIR contains all the information on Dirent plus file descriptor fd.

DIR *opendir(char *dirname);
Dirent *readdir(DIR *dfd);
void closedir(DIR *dfd);
char *name;
struct stat stbuf;
int stat(char *, struct stat *);

stat(name, &stbuf);

Analogous to the way we open and close files, there are distinct functions designed for working with directories: opendir, readdir, and closedir.

It might seem confusing why we would need readdir, where we could simply return Dirent within DIR. The reason we nontrivially need readdir is because accessing Dirent via DIR is protected, and so we need other ways to get around it.

The stat structure serves as a repository for a file’s metadata, and the stat function is tasked with populating this structure with the file’s details.

struct stat { /* inode information returned by stat */
    dev_t st_dev;     /* device of inode */
    ino_t st_ino;     /* inode number */
    short st_mode;    /* mode bits */
    short st_nlink;   /* number of links to file */
    short st_uid;     /* owners user id */
    short st_gid;     /* owners group id */
    dev_t st_rdev;    /* for special files */
    off_t st_size;    /* file size in characters */
    time_t st_atime;  /* time last accessed */
    time_t st_mtime;  /* time last modified */
    time_t st_ctime;  /* time originally created */
};

#define S_IFMT 0160000 /* type of file: */
#define S_IFDIR 0040000 /* directory */
#define S_IFCHR 0020000 /* character special */
#define S_IFBLK 0060000 /* block special */
#define S_IFREG 0100000 /* regular */
/* ... */

The last five macros are entries for st_mode.

With this motivation, let us investigate the implementation of fsize.

int stat(char *, struct stat *);
void dirwalk(char *, void (*)(char *));

/* fsize: print the name of file "name" */
void fsize(char *name)
{
    struct stat stbuf;

    if (stat(name, &stbuf) == -1) {
        fprintf(stderr, "fsize: can't access %s\n", name);
        return;
    }
    if ((stbuf.st_mode & S_IFMT) == S_IFDIR)
        dirwalk(name, fsize);
    
    printf("%8ld %s\n", stbuf.st_size, name);
}

In words, if we could not populate the information on stbuf into name, there was an error, so we use stderr to print out the error message (the first if statement). If the file itself is directory, we use recursion to traverse through underlying files (the second if statement). Finally, stbuf has a correct information of the size of file.

Inquiry points

In the format $\pm *, *$, the first $\pm$ determines left/right adjustment, the first * determines maximum width, and the second * determines precision. So in our case, it means the maximum width of 81 characters.

Let us see how dirwalk, which was used in fsize, is defined:

#define MAX_PATH 1024

/* dirwalk: apply fcn to all files in dir */
void dirwalk(char *dir, void (*fcn)(char *))
{
    char name[MAX_PATH];
    Dirent *dp;
    DIR *dfd;

    if ((dfd = opendir(dir)) == NULL) {
        fprintf(stderr, "dirwalk: can't open %s\n", dir);
        return;
    }
    while ((dp = readdir(dfd)) != NULL) {
        if (strcmp(dp->name, ".") == 0
            || strcmp(dp->name, "..") )
            continue; /* skip self and parent */
        if (strlen(dir)+strlen(dp->name)+2 > sizeof(name))
            fprintf(stderr, "dirwalk: name %s/%s too long\n",
                dir, dp->name);
        else {
            sprintf(name, "%s/%s", dir, dp->name);
            (*fcn)(name);
        }
    }
    closedir(dfd);
}

Just like before, this code conducts a series of checks to ensure that the operation is error-free before applying the function.

Initially, it verifies whether the directory specified by dir can be opened successfully (the first if statement). Inside the while loop, it determines if the current directory (denoted by ‘.’) or the parent directory (‘..’) is being referenced, and skips them (the first if statement within while loop). Subsequently, it assesses if the concatenation of the directory name and the file name exceeds the maximum allowable length (the second if statement within while loop). If these conditions are met without errors, the function is executed within the else block by calling (*fcn)(name).

Inquiry points

1) Why is there +2 when checking whether string literal is too long? well, 1 is for directory separation(/) and 1 is for null terminator \0.

2) The function readdir(dfd) progresses through the directory entries within the while loop. This occurs because readdir internally calls the read system call to fetch each entry from the directory stream.

Finally, let us check how opendir, closedir, and readdir, used in dirwalk, are implemented. opendir and closedir are relatively straightforward.

int fstat(int fd, struct stat *);

/* opendir: open a directory for readdir calls */
DIR *opendir(char *dirname)
{
    int fd;
    struct stat stbuf;
    DIR *dp;

    if ((fd = open(dirname, O_RDONLY, 0)) == -1
        || fstat(fd, &stbuf) == -1
        || (stbuf.st_mode & S_IFMT) != S_IFDIR
        || (dp = (DIR *)malloc(sizeof(DIR))) == NULL)
        return NULL;

    dp->fd = fd;
    return dp;
}

/* closedir: close directory opened by opendir */
void closedir(DIR *dp)
{
    if (dp) {
        close(dp->fd);
        free(dp);
    }
}

In the opendir, the if statement verifies several conditions: first, it checks if the directory can be successfully opened; second, it assesses if the stbuf structure can be correctly filled with information; third, it ensures that the item being opened is indeed a directory; and lastly, it confirms that sufficient memory can be allocated for the operation.

In the closedir, we use close function with a parameter of file descriptor accessed via dp->fd. We also need to free memory.

#include <sys/dir.h> /* local directory structure */

/* readdir: read directory entries in sequence */
Dirent *readdir(DIR *dp)
{
    struct direct dirbuf; /* local directory structure */
    static Dirent d;      /* return: portable structure */

    while (read(dp->fd, (char *) &dirbuf, sizeof(dirbuf))
           == sizeof(dirbuf)) {

        if (dirbuf.d_ino == 0) /* slot not in use */
            continue;

        d.ino = dirbuf.d_ino;
        strncpy(d.name, dirbuf.d_name, DIRSIZ);
        d.name[DIRSIZ] = '\0'; /* ensure termination */
        return &d;
    }
    return NULL;
}

First let us look at the definition of direct.

#ifndef DIRSIZ
#define DIRSIZ 14
#endif

struct direct { /* directory entry */
    ino_t d_ino;                /* inode number */
    char d_name[DIRSIZ];        /* long name does not have '\0' */
};

As we can see, this is very similar to the definition of Dirent; both store inode number and a character string. direct is typically a lower-level, system-specific structure that may be used internally by the operating system’s filesystem routines, while Dirent is provided by the standard C library. All in all, in our case, direct is used as an intermediary to be copied its information into Dirent.

readdir takes DIR * and outputs Dirent *. This means that given directory, the function essentially traverses through its content to output all the entries of the directory.

Inquiry points

4) A storage Allocator

The malloc function in C allows for dynamic memory allocation, which is advantageous because it lets us release memory with free, optimizing memory utilization. The operating system typically orchestrates this management scheme. Each block of memory allocated by malloc is prefaced with a header containing metadata about the block, including its size.

typedef long Align; /* for alignment to long boundary */

union header { /* block header */
    struct {
        union header *ptr; /* next block if on free list */
        unsigned size;     /* size of this block */
    } s;
    Align x; /* force alignment of blocks */
};

typedef union header Header;

To comply with the system’s alignment requirements, which is crucial for ensuring efficient and error-free memory access, the header is structured using a union. This approach aligns the memory to the most restrictive type within the union, often a long, to accommodate any data type that might be stored in the allocated block.

Let us now see how malloc is implemented.

static Header base;                     /* empty list to get started */
static Header *freep = NULL;            /* start of free list */

/* malloc: general-purpose storage allocator */
void *malloc(unsigned nbytes)
{
    Header *p, *prevp;
    Header *morecore(unsigned);
    unsigned nunits;

    nunits = (nbytes+sizeof(Header)-1)/sizeof(Header) + 1;
    if ((prevp = freep) == NULL) {      /* no free list yet */
        base.s.ptr = freep = prevp = &base;
        base.s.size = 0;
    }

    for (p = prevp->s.ptr; ; prevp = p, p = p->s.ptr) {
        if (p->s.size >= nunits) {      /* big enough */
            if (p->s.size == nunits) {  /* exactly */
                prevp->s.ptr = p->s.ptr;
            } else {                    /* allocate tail end */
                p->s.size -= nunits;
                p += p->s.size;
                p->s.size = nunits;
            }
            freep = prevp;
            return (void *)(p+1);
        }
        if (p == freep)                 /* wrapped around free list */
            if ((p = morecore(nunits)) == NULL)
                return NULL;            /* none left */
    }
}

Note that freep is a global pointer. The first if statement initializes the free list if it’s empty. In the for loop, the function iterates over the free list looking for a block of memory that’s large enough to satisfy the request. The if statement within for loop checks exactly this and it makes necessary adjustment accordingly. If no block is found after checking all entries (second if statement in the for loop), p will eventually be equal to freep and so we use morecore function to request for more memory (from pagaging space or disk, but details are abstract away).

Inquiry points

1) Why do we use nunits instead of pure bytes?

This is because we are comparing nunits with p->s.size.

2) What do the -1 and +1 mean here?

nunits = (nbytes+sizeof(Header)-1)/sizeof(Header) + 1;

They ensure that we round up to the next whole unit.

Now, how is this procedure morecore implemented?

#define NALLOC 1024 /* minimum #units to request */

/* morecore: ask system for more memory */
static Header *morecore(unsigned nu)
{
    char *cp, *sbrk(int);
    Header *up;

    if (nu < NALLOC)
        nu = NALLOC;
    cp = sbrk(nu * sizeof(Header));
    if (cp == (char *) -1) /* no space at all */
        return NULL;
    up = (Header *) cp;
    up->s.size = nu;
    free((void *)(up+1));
    return freep;
}

Here, sbrk stands for space break, which is the system call to increase the program’s data space. We do nu * size(Header) since nu is in the unit of the size of Header. the last four lines represent putting block in a free list. In particular, free inserts the block just after the header up+1 into the free list, making it available for future allocations.

Finally, let’s see how free is implemented.

/* free: put block ap in free list */
void free(void *ap)
{
    Header *bp, *p;

    bp = (Header *)ap - 1; /* point to block header */
    for (p = freep; !(bp > p && bp < p->s.ptr); p = p->s.ptr)
        if (p >= p->s.ptr && (bp > p || bp < p->s.ptr))
            break; /* freed block at start or end of arena */

    if (bp + bp->s.size == p->s.ptr) { /* join to upper nbr */
        bp->s.size += p->s.ptr->s.size;
        bp->s.ptr = p->s.ptr->s.ptr;
    } else {
        bp->s.ptr = p->s.ptr;
    }

    if (p + p->s.size == bp) { /* join to lower nbr */
        p->s.size += bp->s.size;
        p->s.ptr = bp->s.ptr;
    } else {
        p->s.ptr = bp;
    }
    freep = p;
}

The for loop starts with the pointer p set to freep, the beginning of the free list, and checks whether the block pointer bp falls between the current block p and the next block p->s.ptr. If bp is not in this range, the loop continues to the next block in the free list. Subsequent two if statements are executed to merge adjacent free blocks in the memory arena; this process is known as coalescing.

Inquiry points

1) What is meant by the following line of code within a for loop?

if (p >= p->s.ptr && (bp > p || bp < p->s.ptr))

This means that the iteration reached a full cycle. It suggests that bp does not fit between any two existing blocks and should be inserted at the edges of the memory arena managed by the free list.

2) Why do we need to subtract by 1 when assigning bp?

Because we need to move the pointer back to the start of the header, instead of start of the memory block itself.

References

M.Ritchie, B. W. K. & D. (1988). C Programming Language: Vol. Second Edition. AT&T Bell Laboratories.