Knowledge is power: UNIX System Calls

A system call is just what its name implies -- a request for the

operating system to do something on behalf of the user's program. The

system calls are functions used in the kernel itself. To the

programmer, the system call appears as a normal C function call.

However, since a system call executes code in the kernel, there must be a

mechanism to change the mode of a process from user mode to kernel mode.

The C compiler uses a predefined library of functions (the C library)

that have the names of the system calls. The library functions

typically invoke an instruction that changes the process execution mode

to kernel mode and causes the kernel to start executing code for system

calls. The instruction that causes the mode change is often referred to

as an "operating system trap" which is a software generated interrupt.

The library routines execute in user mode, but the system call interface

is a special case of an interrupt handler. The library functions pass

the kernel a unique number per system call in a machine dependent way --

either as a parameter to the operating system trap, in a particular

register, or on the stack -- so that the kernel can determine which

system call the user is invoking. In handling the operating system

trap, the kernel looks up the system call number in a table to find the

address of the appropriate kernel routine that is the entry point for

the system call and to find the number of parameters the system call

expects. The kernel calculates the (user) address of the first

parameter to the system call by adding (or subtracting, depending on the

direction of stack growth) an offset to the user stack pointer,

corresponding to the number of the parameters to the system call.

Finally, it copies the user parameters to the "u area" and calls the

appropriate system call routine. After executing the code for the

system call, the kernel determines whether there was an error. If so,

it adjusts register locations in the saved user register context,

typically setting the "carry" bit for the PS (processor status) register

and copying the error number into the register 0 location. If there were no

errors in the execution of the system call, the kernel clears the

"carry" bit in the PS register and copies the appropriate return values

from the system call into the locations for registers 0 and 1 in the

saved user register context. When the kernel returns from the operating

system trap to user mode, it returns to the library instruction after

the trap instruction. The library interprets the return values from the

kernel and returns a value to the user program.
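The split between the library routine and the trap can be seen on systems that offer a generic syscall() entry point. The following is a minimal sketch, assuming a Linux-style C library that provides syscall() and the SYS_ constants in <sys/syscall.h>. The helper names raw_getpid() and raw_close() are made up for this illustration.

```c
#define _GNU_SOURCE          /* assumption: glibc; makes syscall() visible */
#include <errno.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Invoke the getpid system call by number instead of through the
   getpid() library routine. */
long raw_getpid(void)
{
    return syscall(SYS_getpid);
}

/* Invoke the close system call by number.  When the kernel reports an
   error, the library returns -1 and stores the error number in errno. */
long raw_close(int fd)
{
    return syscall(SYS_close, fd);
}
```

On such a system raw_getpid() should return the same value as getpid(), and raw_close(-1) should return -1 with errno set to EBADF.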

UNIX system calls are used to manage the file system, control processes,

and to provide interprocess communication. The UNIX system interface

consists of about 80 system calls (as UNIX evolves this number will

increase). The following table lists about 40 of the more important

system calls:

GENERAL CLASS           SPECIFIC CLASS              SYSTEM CALL
----------------------------------------------------------------------
File Structure          Creating a Channel          creat()
Related Calls                                       open()
                                                    close()
                        Input/Output                read()
                                                    write()
                        Random Access               lseek()
                        Channel Duplication         dup()
                        Aliasing and Removing       link()
                        Files                       unlink()
                        File Status                 stat()
                                                    fstat()
                        Access Control              access()
                                                    chmod()
                                                    chown()
                                                    umask()
                        Device Control              ioctl()
----------------------------------------------------------------------
Process Related         Process Creation and        exec()
Calls                   Termination                 fork()
                                                    wait()
                                                    exit()
                        Process Owner and Group     getuid()
                                                    geteuid()
                                                    getgid()
                                                    getegid()
                        Process Identity            getpid()
                                                    getppid()
                        Process Control             signal()
                                                    kill()
                                                    alarm()
                        Change Working Directory    chdir()
----------------------------------------------------------------------
Interprocess            Pipelines                   pipe()
Communication           Messages                    msgget()
                                                    msgsnd()
                                                    msgrcv()
                                                    msgctl()
                        Semaphores                  semget()
                                                    semop()
                        Shared Memory               shmget()
                                                    shmat()
                                                    shmdt()
----------------------------------------------------------------------

[NOTE: The system call interface is that aspect of UNIX that has

changed the most since the inception of the UNIX system. Therefore,

when you write a software tool, you should protect that tool by putting

system calls in other subroutines within your program and then calling

only those subroutines. Should the next version of the UNIX system

change the syntax and semantics of the system calls you've used, you

need only change your interface routines.]
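As an illustration of such an interface routine, here is a small sketch that hides write() behind a function of the program's own. The name tool_write_all() is hypothetical. The rest of the program would call only this routine, so a change in the behavior of write() would be handled in one place.

```c
#include <errno.h>
#include <unistd.h>

/* Interface routine (hypothetical name): write all len bytes of buf to
   fd, retrying after partial writes and interrupted calls.
   Returns 0 on success, -1 on error with errno set by write(). */
int tool_write_all(int fd, const char *buf, size_t len)
{
    while (len > 0)
    {
        ssize_t n = write(fd, buf, len);

        if (n == -1)
        {
            if (errno == EINTR)
                continue;          /* interrupted by a signal; try again */
            return -1;
        }
        buf += n;                  /* skip what was already written */
        len -= (size_t)n;
    }
    return 0;
}
```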

When a system call discovers an error, it returns -1 and stores the

reason the call failed in an external variable named "errno". The

"/usr/include/errno.h" file maps these error numbers to manifest

constants, and it is these constants that you should use in your programs.

When a system call returns successfully, it returns something other than

-1, but it does not clear "errno". "errno" only has meaning directly

after a system call that returns an error.

When you use system calls in your programs, you should check the value

returned by those system calls. Furthermore, when a system call

discovers an error, you should use the "perror()" subroutine to print a

diagnostic message on the standard error file that describes why the

system call failed. The syntax for "perror()" is:

void perror(string)

char *string;

"perror()" displays the argument string, a colon, and then the error

message, as directed by "errno", followed by a newline. The output of

"perror()" is displayed on "standard error". Typically, the argument

given to "perror()" is the name of the program that incurred the error,

argv[0]. However, when using subroutines and system calls on files, the

related file name might be passed to "perror()".
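The checks described above might be combined as in the following sketch. It tests the return value first, compares errno against a manifest constant rather than a number, and passes the file name to perror(). The helper name check_readable() and its return codes are assumptions made for this example.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Try to open path for reading (hypothetical helper).
   Returns 0 if it opens, 1 if the file does not exist, and 2 for any
   other failure.  A diagnostic is printed on standard error whenever
   the open fails. */
int check_readable(const char *path)
{
    int fd = open(path, O_RDONLY);

    if (fd == -1)
    {
        int saved = errno;     /* save it: later calls may change errno */

        perror(path);          /* prints the path, a colon, and the reason */
        return (saved == ENOENT) ? 1 : 2;
    }
    close(fd);
    return 0;
}
```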

There are occasions where you the programmer might wish to maintain more

control over the printing of error messages than "perror()" provides --

such as with a formatted screen where the newline printed by "perror()"

would destroy the formatting. In this case, you can directly access the

same system external (global) variables that "perror()" uses. They are:

extern int errno;

extern char *sys_errlist[];

extern int sys_nerr;

"errno" has been described above. "sys_errlist" is an array (table) of

pointers to the error message strings. Each message string is null

terminated and does not contain a newline. "sys_nerr" is the number of

messages in the error message table, so valid values of "errno" run from

0 to sys_nerr - 1. "errno" is used as the index into the table of error

messages.
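Some newer C libraries no longer make "sys_errlist" and "sys_nerr" available to programs. The standard strerror() function returns the same message text for a given error number. The sketch below assumes an ANSI C library, and format_error() is a hypothetical helper that builds the message in a buffer without a trailing newline, which suits a formatted screen.

```c
#include <errno.h>      /* errno and the E* constants for callers */
#include <stdio.h>
#include <string.h>

/* Build "<what>: <message>" in out, with no trailing newline, so the
   caller decides where and how the text is displayed. */
void format_error(char *out, size_t size, const char *what, int err)
{
    snprintf(out, size, "%s: %s", what, strerror(err));
}
```

A caller would typically pass the value of errno saved right after the failing system call as the last argument.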

Following are two sample programs that display all of the system error

messages on standard error.

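/* errmsg1.c
   print all system error messages using "perror()"
*/
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>      /* declares errno */

extern int sys_nerr;    /* historical global; some newer libraries omit it */

int main()
{
    int i;

    for (i = 0; i < sys_nerr; ++i)
    {
        fprintf(stderr, "%3d", i);
        errno = i;      /* perror() reports whatever errno currently holds */
        perror(" ");
    }
    exit (0);
}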

/* errmsg2.c
   print all system error messages using the global error message table.
*/
#include <stdio.h>
#include <stdlib.h>

int main()
{
    int i;
    extern int sys_nerr;            /* number of entries in the table */
    extern char *sys_errlist[];     /* the message table itself */

    fprintf(stderr,"Here are the current %d error messages:\n\n",sys_nerr);
    for (i = 0; i < sys_nerr; ++i)
        fprintf(stderr,"%3d: %s\n", i, sys_errlist[i]);
    exit (0);
}

Following are some examples in the use of the most often used system

calls.

File Structure Related System Calls

The file structure related system calls available in the UNIX system let

you create, open, and close files, read and write files, randomly access

files, alias and remove files, get information about files, check the

accessibility of files, change protections, owner, and group of files,

and control devices. These operations either use a character string

that defines the absolute or relative path name of a file, or a small

integer called a file descriptor that identifies the I/O channel. A

channel is a connection between a process and a file that appears to the

process as an unformatted stream of bytes. The kernel presents and

accepts data from the channel as a process reads and writes that

channel. To a process then, all input and output operations are

synchronous and unbuffered.

When doing I/O, a process specifies the file descriptor for an I/O

channel, a buffer to be filled or emptied, and the maximum size of data

to be transferred. An I/O channel may allow input, output, or both.

Furthermore, each channel has a read/write pointer. Each I/O operation

starts where the last operation finished and advances the pointer by the

number of bytes transferred. A process can access a channel's data

randomly by changing the read/write pointer.
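The way the read/write pointer advances can be observed directly. The following is a small sketch, assuming a scratch file the program is allowed to create. The helper name offset_after_two_writes() is hypothetical. It writes the same data twice and then asks lseek() where the pointer is.

```c
#include <fcntl.h>
#include <unistd.h>

/* Create path, write len bytes of data to it twice, and return the
   read/write pointer afterwards, or -1 on error. */
long offset_after_two_writes(const char *path, const char *data, unsigned len)
{
    long pos = -1;
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);

    if (fd == -1)
        return -1;
    if (write(fd, data, len) == (ssize_t)len &&
        write(fd, data, len) == (ssize_t)len)
        pos = (long)lseek(fd, 0L, SEEK_CUR);   /* move by 0: just report */
    close(fd);
    return pos;
}
```

Because the second write starts where the first one finished, the value returned for a 5-byte buffer is 10.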

All input and output operations start by opening a file using either the

"creat()" or "open()" system calls. These calls return a file

descriptor that identifies the I/O channel. Recall that file

descriptors 0, 1, and 2 refer to standard input, standard output, and

standard error files respectively, and that file descriptor 0 is a

channel to your terminal's keyboard and file descriptors 1 and 2 are

channels to your terminal's display screen.

creat()

The prototype for the creat() system call is:

int creat(file_name, mode)

char *file_name;

int mode;

where file_name is a pointer to a null terminated character string that

names the file and mode defines the file's access permissions. The mode

is usually specified as an octal number such as 0666, which means

read/write permission for owner, group, and others. The mode may also

be given using manifest constants defined in the "/usr/include/sys/stat.h"

file. If the file named by file_name does not exist, the UNIX system creates

it with the specified mode permissions. However, if the file does exist, its

contents are discarded and the mode value is ignored. The permissions of the

existing file are retained.

Following is an example of how to use creat():

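/* creat.c */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>      /* declares creat() */
#include <unistd.h>     /* declares close() */
#include <sys/types.h>  /* defines types used by sys/stat.h */
#include <sys/stat.h>   /* defines S_IREAD & S_IWRITE */

int main()
{
    int fd;

    fd = creat("datafile.dat", S_IREAD | S_IWRITE);
    if (fd == -1)
        printf("Error in opening datafile.dat\n");
    else
    {
        /* creat() opens the new channel for writing only */
        printf("datafile.dat opened for write access\n");
        printf("datafile.dat is currently empty\n");
        close(fd);      /* close only a descriptor that was opened */
    }
    exit (0);
}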

The following is a sample of the manifest constants for the mode

argument as defined in /usr/include/sys/stat.h:

#define S_IRWXU 0000700 /* -rwx------ */

#define S_IREAD 0000400 /* read permission, owner */

#define S_IRUSR S_IREAD

#define S_IWRITE 0000200 /* write permission, owner */

#define S_IWUSR S_IWRITE

#define S_IEXEC 0000100 /* execute/search permission, owner */

#define S_IXUSR S_IEXEC

#define S_IRWXG 0000070 /* ----rwx--- */

#define S_IRGRP 0000040 /* read permission, group */

#define S_IWGRP 0000020 /* write " " */

#define S_IXGRP 0000010 /* execute/search " " */

#define S_IRWXO 0000007 /* -------rwx */

#define S_IROTH 0000004 /* read permission, other */

#define S_IWOTH 0000002 /* write " " */

#define S_IXOTH 0000001 /* execute/search " " */

Multiple mode values may be combined by or'ing (using the | operator)

the values together as demonstrated in the above sample program.
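As a further sketch of combining mode constants, the following helper creates a file with owner read/write and group read permission and reports its size afterwards. The name create_empty() is hypothetical. The permissions actually applied to a new file are also filtered by the process's umask.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>

/* Create (or truncate) path with permissions rw-r----- and return the
   size the file has afterwards, or -1 on error. */
long create_empty(const char *path)
{
    struct stat sb;
    int fd = creat(path, S_IRUSR | S_IWUSR | S_IRGRP);   /* 0640 before umask */

    if (fd == -1)
        return -1;
    if (fstat(fd, &sb) == -1)
    {
        close(fd);
        return -1;
    }
    close(fd);
    return (long)sb.st_size;
}
```

Calling create_empty() on a file that already exists returns 0 as well, since the old contents are discarded, while the file keeps the permissions it had before the call.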

open()

Next is the open() system call. open() lets you open a file for

reading, writing, or reading and writing.

The prototype for the open() system call is:

#include <fcntl.h>

int open(file_name, option_flags [, mode])

char *file_name;

int option_flags, mode;

where file_name is a pointer to the character string that names the

file, option_flags represent the type of channel, and mode defines the

file's access permissions if the file is being created.

The allowable option_flags as defined in "/usr/include/fcntl.h" are:

#define O_RDONLY 0 /* Open the file for reading only */

#define O_WRONLY 1 /* Open the file for writing only */

#define O_RDWR 2 /* Open the file for both reading and writing*/

#define O_NDELAY 04 /* Non-blocking I/O */

#define O_APPEND 010 /* append (writes guaranteed at the end) */

#define O_CREAT 00400 /*open with file create (uses third open arg) */

#define O_TRUNC 01000 /* open with truncation */

#define O_EXCL 02000 /* exclusive open */

Multiple values are combined using the | operator (i.e. bitwise OR).

Note: the access modes O_RDONLY, O_WRONLY, and O_RDWR are mutually

exclusive, so exactly one of them should be given. If the O_CREAT flag is used,

then a mode argument is required. The mode argument may be specified in

the same manner as in the creat() system call.

Following is an example of how to use open():

/* open.c */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>     /* declares read(), write(), lseek(), close() */
#include <fcntl.h>      /* defines options flags */
#include <sys/types.h>  /* defines types used by sys/stat.h */
#include <sys/stat.h>   /* defines S_IREAD & S_IWRITE */

static char message[] = "Hello, world";

int main()
{
    int fd;
    char buffer[80];

    /* open datafile.dat for read/write access   (O_RDWR)
       create datafile.dat if it does not exist  (O_CREAT)
       return error if datafile already exists   (O_EXCL)
       permit read/write access to file  (S_IWRITE | S_IREAD)
    */
    fd = open("datafile.dat",O_RDWR | O_CREAT | O_EXCL, S_IREAD | S_IWRITE);
    if (fd != -1)
    {
        printf("datafile.dat opened for read/write access\n");
        write(fd, message, sizeof(message));
        lseek(fd, 0L, 0);   /* go back to the beginning of the file */
        if (read(fd, buffer, sizeof(message)) == sizeof(message))
            printf("\"%s\" was written to datafile.dat\n", buffer);
        else
            printf("*** error reading datafile.dat ***\n");
        close (fd);
    }
    else
        printf("*** datafile.dat already exists ***\n");
    exit (0);
}
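Another common combination of option flags is O_WRONLY | O_CREAT | O_APPEND, which adds data to the end of a file and creates the file if necessary. The following is a minimal sketch. The helper name append_text() and the mode 0644 are choices made for this example.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Append text to path, creating the file if needed.
   Returns 0 on success and -1 on error. */
int append_text(const char *path, const char *text)
{
    size_t len = strlen(text);
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    int ok;

    if (fd == -1)
        return -1;
    ok = (write(fd, text, len) == (ssize_t)len);
    if (close(fd) == -1)
        ok = 0;
    return ok ? 0 : -1;
}
```

Because O_CREAT is present, the third argument to open() is supplied. It is used only when the file does not already exist.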

close()

To close a channel, use the close() system call. The prototype for the

close() system call is:

int close(file_descriptor)

int file_descriptor;

where file_descriptor identifies a currently open channel. close()

fails if file_descriptor does not identify a currently open channel.

read() write()

The read() system call does all input and the write() system call does

all output. When used together, they provide all the tools necessary to

do input and output sequentially. When used with the lseek() system

call, they provide all the tools necessary to do input and output

randomly.

Both read() and write() take three arguments. Their prototypes are:

int read(file_descriptor, buffer_pointer, transfer_size)

int file_descriptor;

char *buffer_pointer;

unsigned transfer_size;

int write(file_descriptor, buffer_pointer, transfer_size)

int file_descriptor;

char *buffer_pointer;

unsigned transfer_size;

where file_descriptor identifies the I/O channel, buffer_pointer points

to the area in memory where the data is stored for a read() or where

the data is taken for a write(), and transfer_size defines the maximum

number of characters transferred between the file and the buffer.

read() and write() return the number of bytes transferred.

There is no limit on transfer_size, but you must make sure it's safe to

copy transfer_size bytes to or from the memory pointed to by

buffer_pointer. A transfer_size of 1 is used to transfer a byte at a

time for so-called "unbuffered" input/output. The most efficient value

for transfer_size is the size of the largest physical record the I/O

channel is likely to have to handle. Therefore, 1K bytes -- the disk

block size -- is the most efficient general-purpose buffer size for a

standard file. However, if you are writing to a terminal, the transfer

is best handled in lines ending with a newline.

For an example using read() and write(), see the above example of

open().
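A sequential read loop built on these rules might look like the sketch below. It uses the 1K buffer size discussed above and relies on read() returning 0 at end-of-file. The helper name count_bytes() is hypothetical.

```c
#include <fcntl.h>
#include <unistd.h>

#define BUFFER_SIZE 1024    /* 1K bytes, as suggested for standard files */

/* Read path sequentially and return its length in bytes, or -1 on error.
   read() may return fewer bytes than requested and returns 0 at
   end-of-file. */
long count_bytes(const char *path)
{
    char buffer[BUFFER_SIZE];
    long total = 0;
    long n;
    int fd = open(path, O_RDONLY);

    if (fd == -1)
        return -1;
    while ((n = (long)read(fd, buffer, sizeof buffer)) > 0)
        total += n;
    close(fd);
    return (n == -1) ? -1 : total;
}
```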

lseek()

The UNIX system file system treats an ordinary file as a sequence of

bytes. No internal structure is imposed on a file by the operating

system. Generally, a file is read or written sequentially -- that is,

from beginning to the end of the file. Sometimes sequential reading and

writing is not appropriate. It may be inefficient, for instance, to

read an entire file just to move to the end of the file to add

characters. Fortunately, the UNIX system lets you read and write

anywhere in the file. Known as "random access", this capability is made

possible with the lseek() system call. During file I/O, the UNIX system

uses a long integer, also called a File Pointer, to keep track of the

next byte to read or write. This long integer represents the number of

bytes from the beginning of the file to that next character. Random

access I/O is achieved by changing the value of this file pointer using

the lseek() system call.

The prototype for lseek() is:

long lseek(file_descriptor, offset, whence)

int file_descriptor;

long offset;

int whence;

where file_descriptor identifies the I/O channel and offset and whence

work together to describe how to change the file pointer according to

the following table:

whence new position

------------------------------

0 offset bytes into the file

1 current position in the file plus offset

2 current end-of-file position plus offset

If successful, lseek() returns a long integer that defines the new file

pointer value measured in bytes from the beginning of the file. If

unsuccessful, lseek() returns -1 and the file position does not change.

Certain devices are incapable of seeking, namely terminals and the

character interface to a tape drive. lseek() does not change the file

pointer to these devices.

Following is an example using lseek():

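/* lseek.c */
#include <stdio.h>
#include <fcntl.h>      /* declares open(), defines O_RDONLY */
#include <unistd.h>     /* declares lseek() and close() */

int main()
{
    int fd;
    long position;

    fd = open("datafile.dat", O_RDONLY);
    if ( fd != -1)
    {
        position = lseek(fd, 0L, 2);  /* seek 0 bytes from end-of-file */
        if (position != -1)
            printf("The length of datafile.dat is %ld bytes.\n", position);
        else
            perror("lseek error");
        close(fd);      /* close only a descriptor that was opened */
    }
    else
        printf("can't open datafile.dat\n");
    return 0;
}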

Many UNIX systems have defined manifest constants for use as the

"whence" argument of lseek(). The definitions can be found in the

"file.h" and/or "unistd.h" include files. For example, the University

of Maryland's HP-9000 UNIX system has the following definitions:

from file.h we have:

#define L_SET 0 /* absolute offset */

#define L_INCR 1 /* relative to current offset */

#define L_XTND 2 /* relative to end of file */

and from unistd.h we have:

#define SEEK_SET 0 /* Set file pointer to "offset" */

#define SEEK_CUR 1 /* Set file pointer to current plus "offset" */

#define SEEK_END 2 /* Set file pointer to EOF plus "offset" */

The definitions from unistd.h are the most "portable" across UNIX and

MS-DOS C compilers.
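With the unistd.h constants, a negative offset from SEEK_END positions the file pointer a given number of bytes before end-of-file. The following sketch reads the tail of a file this way. The helper name read_tail() is hypothetical.

```c
#include <fcntl.h>
#include <unistd.h>

/* Read the last n bytes of path into buf.
   Returns the number of bytes read, or -1 on error, for instance when
   the file is shorter than n bytes and the seek is rejected. */
long read_tail(const char *path, char *buf, long n)
{
    long got = -1;
    int fd = open(path, O_RDONLY);

    if (fd == -1)
        return -1;
    if (lseek(fd, -n, SEEK_END) != -1)      /* n bytes before end-of-file */
        got = (long)read(fd, buf, (size_t)n);
    close(fd);
    return got;
}
```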

link()

The UNIX system file structure allows more than one named reference to a

given file, a feature called "aliasing". Making an alias to a file

means that the file has more than one name, but all names of the file

refer to the same data. Since all names refer to the same data,

changing the contents of one file changes the contents of all aliases to

that file. Aliasing a file in the UNIX system amounts to the system

creating a new directory entry that contains the alias file name and

then copying the i-number of an existing file to the i-number position of

this new directory entry. This action is accomplished by the link()

system call. The link() system call links an existing file to a new

file.

The prototype for link() is:

int link(original_name, alias_name)

char *original_name, *alias_name;

where both original_name and alias_name are character strings that name

the existing and new files respectively. link() will fail and no link

will be created if any of the following conditions holds:

a path name component is not a directory.

a path name component does not exist.

a path name component is off-limits.

original_name does not exist.

alias_name does exist.

original_name is a directory and you are not the superuser.

a link is attempted across file systems.

the destination directory for alias_name is not writable.

the destination directory is on a mounted read-only file system.

Following is a short example:

/* link.c */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>     /* declares link() */

int main()
{
    if ((link("foo.old", "foo.new")) == -1)
    {
        perror(" ");
        exit (1);       /* return a non-zero exit code on error */
    }
    exit(0);
}
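Since an alias shares the i-number of the original file, two names can be tested for aliasing by comparing the device and i-number that stat() reports for each. This is a minimal sketch, and the helper name same_file() is hypothetical.

```c
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>

/* Return 1 if both names refer to the same file (same device and
   i-number), 0 if they do not, and -1 if either stat() fails. */
int same_file(const char *name_a, const char *name_b)
{
    struct stat a, b;

    if (stat(name_a, &a) == -1 || stat(name_b, &b) == -1)
        return -1;
    return a.st_dev == b.st_dev && a.st_ino == b.st_ino;
}
```

After a successful link("foo.old", "foo.new"), a call such as same_file("foo.old", "foo.new") would be expected to return 1.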

unlink()

The opposite of the link() system call is the unlink() system call.

unlink() removes a file by zeroing the i-number part of the file's

directory entry, reducing the link count field in the file's inode by 1,

and releasing the data blocks and the inode if the link count field

becomes zero. unlink() is the only system call for removing a file in

the UNIX system.

The prototype for unlink() is:

int unlink(file_name)

char *file_name;

where file_name names the file to be unlinked. unlink() fails if any of

the following conditions holds:

a path name component is not a directory.

a path name component does not exist.

a path name component is off-limits.

file_name does not exist.

file_name is a directory and you are not the superuser.

the directory for the file named by file_name is not writable.

the directory is contained in a file system mounted read-only.

It is important to understand that a file's contents and its inode are

not discarded until all processes close the unlinked file.

Following is a short example:

/* unlink.c */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>     /* declares unlink() */

int main()
{
    if ((unlink("foo.bar")) == -1)
    {
        perror(" ");
        exit (1);       /* return a non-zero exit code on error */
    }
    exit (0);
}
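The rule that a file's contents survive until every process has closed it can be demonstrated with a scratch file that is unlinked while still open. The following is a sketch, and the helper name survives_unlink() is hypothetical.

```c
#include <fcntl.h>
#include <unistd.h>

/* Create a scratch file at path, unlink it while it is still open, and
   then write and read it back through the open descriptor.
   Returns 1 if the data survives the unlink, 0 if not, -1 on error. */
int survives_unlink(const char *path)
{
    char back[4];
    int ok = 0;
    int fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);

    if (fd == -1)
        return -1;
    if (unlink(path) == -1)          /* the name is gone from now on */
    {
        close(fd);
        return -1;
    }
    if (write(fd, "data", 4) == 4 &&
        lseek(fd, 0L, SEEK_SET) == 0 &&
        read(fd, back, 4) == 4)
        ok = (back[0] == 'd' && back[3] == 'a');
    close(fd);                       /* last reference: storage is released */
    return ok;
}
```

This is a common way to get a temporary file that disappears automatically, even if the program terminates abnormally.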

File Status

stat() - fstat()

The i-node data structure holds all the information about a file except the

file's name and its contents. Sometimes your programs need to use the

information in the i-node structure to do some job. You can access this

information with the stat() and fstat() system calls. stat() and fstat()

return the information in the i-node for the file named by a string and by a

file descriptor, respectively. The format for the i-node struct returned by

these system calls is defined in /usr/include/sys/stat.h. stat.h uses types

built with the C language typedef construct and defined in the file

/usr/include/sys/types.h, so it too must be included and must be included

before the inclusion of the stat.h file.

The prototypes for stat() and fstat() are:

#include <sys/types.h>
#include <sys/stat.h>

int stat(file_name, stat_buf)

char *file_name;

struct stat *stat_buf;

int fstat(file_descriptor, stat_buf)

int file_descriptor;

struct stat *stat_buf;

where file_name names the file as an ASCII string and file_descriptor names

the I/O channel and therefore the file. Both calls return the file's

specifics in stat_buf. stat() and fstat() fail if any of the following

conditions hold:

a path name component is not a directory (stat() only).

file_name does not exist (stat() only).

a path name component is off-limits (stat() only).

file_descriptor does not identify an open I/O channel (fstat() only).

stat_buf points to an invalid address.

Following is an extract of the stat.h file from the University's HP-9000. It

shows the definition of the stat structure and some manifest constants used

to access the st_mode field of the structure.

/* stat.h */

struct stat

{

dev_t st_dev; /* The device number containing the i-node */

ino_t st_ino; /* The i-number */

unsigned short st_mode; /* The 16 bit mode */

short st_nlink; /* The link count; 0 for pipes */

ushort st_uid; /* The owner user-ID */

ushort st_gid; /* The group-ID */

dev_t st_rdev; /* For a special file, the device number */

off_t st_size; /* The size of the file; 0 for special files */

time_t st_atime; /* The access time. */

int st_spare1;

time_t st_mtime; /* The modification time. */

int st_spare2;

time_t st_ctime; /* The status-change time. */

int st_spare3;

long st_blksize;

long st_blocks;

uint st_remote:1; /* Set if file is remote */

dev_t st_netdev; /* ID of device containing */

/* network special file */

ino_t st_netino; /* Inode number of network special file */

long st_spare4[9];

};

#define S_IFMT 0170000 /* type of file */

#define S_IFDIR 0040000 /* directory */

#define S_IFCHR 0020000 /* character special */

#define S_IFBLK 0060000 /* block special */

#define S_IFREG 0100000 /* regular (ordinary) */

#define S_IFIFO 0010000 /* fifo */

#define S_IFNWK 0110000 /* network special */

#define S_IFLNK 0120000 /* symbolic link */

#define S_IFSOCK 0140000 /* socket */

#define S_ISUID 0004000 /* set user id on execution */

#define S_ISGID 0002000 /* set group id on execution */

#define S_ENFMT 0002000 /* enforced file locking (shared with S_ISGID)*/

#define S_ISVTX 0001000 /* save swapped text even after use */

Following is an example program demonstrating the use of the stat() system

call to determine the status of a file:

/* status.c */
/* demonstrates the use of the stat() system call to determine the
   status of a file.
*/
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>  /* must come before sys/stat.h */
#include <sys/stat.h>

#define ERR (-1)
#define TRUE 1
#define FALSE 0

int main(int argc, char *argv[])
{
    int isdevice = FALSE;
    struct stat stat_buf;

    if (argc != 2)
    {
        printf("Usage: %s filename\n", argv[0]);
        exit (1);
    }
    if ( stat( argv[1], &stat_buf) == ERR)
    {
        perror("stat");
        exit (1);
    }
    printf("\nFile: %s status:\n\n",argv[1]);
    if ((stat_buf.st_mode & S_IFMT) == S_IFDIR)
        printf("Directory\n");
    else if ((stat_buf.st_mode & S_IFMT) == S_IFBLK)
    {
        printf("Block special file\n");
        isdevice = TRUE;
    }
    else if ((stat_buf.st_mode & S_IFMT) == S_IFCHR)
    {
        printf("Character special file\n");
        isdevice = TRUE;
    }
    else if ((stat_buf.st_mode & S_IFMT) == S_IFREG)
        printf("Ordinary file\n");
    else if ((stat_buf.st_mode & S_IFMT) == S_IFIFO)
        printf("FIFO\n");

    /* major number in the high byte, minor number in the low byte */
    if (isdevice)
        printf("Device number:%d, %d\n", (int)((stat_buf.st_rdev >> 8) & 0377),
               (int)(stat_buf.st_rdev & 0377));
    printf("Resides on device:%d, %d\n", (int)((stat_buf.st_dev >> 8) & 0377),
           (int)(stat_buf.st_dev & 0377));
    printf("I-node: %ld; Links: %ld; Size: %ld\n", (long)stat_buf.st_ino,
           (long)stat_buf.st_nlink, (long)stat_buf.st_size);
    if ((stat_buf.st_mode & S_ISUID) == S_ISUID)
        printf("Set-user-ID\n");
    if ((stat_buf.st_mode & S_ISGID) == S_ISGID)
        printf("Set-group-ID\n");
    if ((stat_buf.st_mode & S_ISVTX) == S_ISVTX)
        printf("Sticky-bit set -- save swapped text after use\n");
    printf("Permissions: %o\n", stat_buf.st_mode & 0777);
    exit (0);
}
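POSIX systems also define test macros such as S_ISDIR() and S_ISREG() in sys/stat.h, which perform the same masking against S_IFMT. The sketch below assumes those macros are available. The helper name type_letter() is hypothetical.

```c
#include <sys/types.h>
#include <sys/stat.h>

/* Return a one-letter file type in the style of "ls -l":
   'd' directory, 'b' block special, 'c' character special,
   'p' FIFO, '-' ordinary file, and '?' if stat() fails or the
   type is none of these. */
char type_letter(const char *path)
{
    struct stat sb;

    if (stat(path, &sb) == -1)
        return '?';
    if (S_ISDIR(sb.st_mode))
        return 'd';
    if (S_ISBLK(sb.st_mode))
        return 'b';
    if (S_ISCHR(sb.st_mode))
        return 'c';
    if (S_ISFIFO(sb.st_mode))
        return 'p';
    if (S_ISREG(sb.st_mode))
        return '-';
    return '?';
}
```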

access()

To determine if a file is accessible to a program, the access() system call

may be used. Unlike any other system call that deals with permissions,

access() checks the real user-ID or group-ID, not the effective ones.

The prototype for the access() system call is:

int access(file_name, access_mode)

char *file_name;

int access_mode;

where file_name is the name of the file to which access permissions given in

access_mode are to be applied. Access modes are often defined as manifest

constants in /usr/include/sys/file.h. The available modes are:

Value Meaning file.h constant

----- ------ ------

00 existence F_OK

01 execute X_OK

02 write W_OK

04 read R_OK

These values may be ORed together to check for more than one access

permission. The call to access() returns 0 if the program has the given

access permissions, otherwise -1 is returned and errno is set to the reason

for failure. This call is somewhat useful in that it makes checking for a

specific permission easy. However, it only answers the question "do I have

this permission?" It cannot answer the question "what permissions do I

have?"

The following example program demonstrates the use of the access() system

call to remove a file. Before removing the file, a check is made to make

sure that the file exists and that it is writable (it will not remove a

read-only file).

/* remove.c */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>     /* access(), unlink(); F_OK and W_OK on POSIX systems */

#define ERR (-1)

int main(int argc, char *argv[])
{
    if (argc != 2)
    {
        printf("Usage: %s filename\n", argv[0]);
        exit (1);
    }
    if (access (argv[1], F_OK) == ERR)  /* check that file exists */
    {
        perror(argv[1]);
        exit (1);
    }
    if (access (argv[1], W_OK) == ERR)  /* check for write permission */
    {
        fprintf(stderr,"File: %s is write protected!\n", argv[1]);
        exit (1);
    }
    if (unlink (argv[1]) == ERR)
    {
        perror(argv[1]);
        exit (1);
    }
    exit (0);
}
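Although access() answers only one question per call, a program can approximate "what permissions do I have?" by asking once for each permission. The following is a sketch. The helper name my_permissions() is hypothetical.

```c
#include <unistd.h>

/* Fill out with "rwx"-style letters describing what the real user may
   do with path; a '-' marks a permission that access() denies.
   Returns 0, or -1 if access() cannot find the file. */
int my_permissions(const char *path, char out[4])
{
    if (access(path, F_OK) == -1)
        return -1;
    out[0] = (access(path, R_OK) == 0) ? 'r' : '-';
    out[1] = (access(path, W_OK) == 0) ? 'w' : '-';
    out[2] = (access(path, X_OK) == 0) ? 'x' : '-';
    out[3] = '\0';
    return 0;
}
```

The answer reflects the real user-ID and group-ID, as described above, and it may be out of date by the time the program acts on it.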

Friday, November 26, 2010
