Document OMJWP02950926
----------------------------------------------------------------------------
Document Id: OMJWP02950926
Date Loaded: 01-26-96
Description: HP-UX 10.0 Process Management White Paper
----------------------------------------------------------------------------

HP-UX 10.0 Process Management White Paper

HP 9000 Series 700/800 Computers

Printed in U.S.A. January 1995
First Edition
E0195

LEGAL NOTICES

The information in this document is subject to change without notice.

Hewlett-Packard makes no warranty of any kind with regard to this manual, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose. Hewlett-Packard shall not be held liable for errors contained herein or direct, indirect, special, incidental or consequential damages in connection with the furnishing, performance, or use of this material.

Warranty. A copy of the specific warranty terms applicable to your Hewlett-Packard product and replacement parts can be obtained from your local Sales and Service Office.

Restricted Rights Legend. Use, duplication, or disclosure by the U.S. Government Department is subject to restrictions as set forth in subparagraph (c) (1) (ii) of the Rights in Technical Data and Computer Software clause at DFARS 252.227-7013 for DOD agencies, and subparagraphs (c) (1) and (c) (2) of the Commercial Computer Software Restricted Rights clause at FAR 52.227-19 for other agencies.

HEWLETT-PACKARD COMPANY
3000 Hanover Street
Palo Alto, California 94304
U.S.A.

Copyright Notices. (C)copyright 1995 Hewlett-Packard Company, all rights reserved.

Reproduction, adaptation, or translation of this document without prior written permission is prohibited, except as allowed under the copyright laws.

Trademark Notices. UNIX is a registered trademark in the United States and other countries, licensed exclusively through X/Open Company Limited.
First Edition: January 1995 (HP-UX Release 10.0)
=============================================================================

HP-UX 10.0 Process Management White Paper
=========================================

Process Management
==================

Understanding how HP-UX manages processes will help you better interpret how your system carries out its computations. This paper discusses:

* What a process is.
* How processes are created.
* How processes are killed.
* Commands for managing processes.
* How the kernel manages processes.
* HP-UX multiprocessing.

What Is a Process?
==================

A process is a running program, managed by such system components as the scheduler and the memory management subsystem. Although processes appear to the user to run simultaneously, in fact a single processor is executing only one process at any given moment.

Described most simply, a process consists of text (the code that the process runs), data (used by the code), and stack (a region of memory that holds the process's temporary state, such as local variables and return addresses, while the process is running). Two stacks are associated with a process -- kernel stack and user stack. The process uses the user stack when in user space and the kernel stack when in kernel space.

In addition, a process includes:

* The program's data structures (variables, arrays, records).
* A process ID, parent process ID, and process group ID.
* The process's user and group IDs (both real and effective IDs).
* A group access list.
* Information on the process's open files.
* The process's current working directory.
* An audit ID (on trusted systems only).

Process Relationships
=====================

Processes maintain hierarchical, parent-child relationships. Every process has one parent, but a parent process can have many child processes. Processes can create processes, which in turn can create more processes. A child process inherits its parent's environment (including environment variables, current working directory, open files).
Also, all processes except system processes (such as init, pagedaemon, and swapper) belong to process groups. Process groups are explained later in this discussion.

Process and Parent Process IDs
==============================

When a process is created, HP-UX assigns the process a unique integer number known as a process ID (PID). The HP-UX kernel identifies each process by its process ID when executing commands and system calls. A process also has a parent process ID (PPID), which is the PID of the parent process.

Using the ps command, you can see the process and parent process IDs of processes currently running on your system (see ps(1) in the HP-UX Reference). For example, if your username were terry, you might see something like this when executing ps:

    $ ps -f
         UID   PID  PPID  C    STIME TTY      TIME COMMAND
       terry  3865  3699  2 13:35:43 ttyp3    0:00 ps -f
       terry  3699  3698  0 12:58:21 ttyp3    0:00 ksh

This indicates that the user, terry, has two processes running, the ps -f command and ksh (the Korn shell). Notice that the PPID of the ps -f command is the same as the PID of ksh. This is because ps -f was spawned from the ksh command line; thus, ksh is the parent of ps -f, and ps -f is the child of ksh. (For details on the other columns shown, see ps(1) in the HP-UX Reference.)

User and Group IDs (Real and Effective)
=======================================

In addition to the process ID, a process has other identification numbers:

* a real user ID
* a real group ID
* an effective user ID
* an effective group ID.

A real user ID is an integer value that identifies the owner of the process -- that is, the user ID of the user who invoked the process.

A real group ID is an integer value that identifies the group to which the user belongs. The real group ID is shared by all users who belong to the group. It allows members of the same group to share files, and to deny access to users who do not belong to the group.
The id(1) command displays both the integers and the names associated with the real user ID and real group ID. The /etc/passwd file assigns the real user ID and real group ID to the user at login.

    % id
    uid=513(terry) gid=20(users)
    % grep 513 /etc/passwd
    terry:EqqHevH:513:20:Terry Ho,[44MY],495-0601,:/home/terry:/usr/bin/csh

The effective user ID and effective group ID allow a process running a program to act as the program's owner while the program executes. The effective IDs are usually identical to the user's real IDs. However, the effective user ID and group ID can be set individually to protect a program, by making the effective IDs of the program's processes equal to the real IDs of the program's owner.

The effective IDs are used to allow a user to access or modify a data file or to execute a program in a limited manner. When the effective user ID is zero, the user is allowed to execute system calls as the superuser (described in the following section).

For example, suppose the dean of a university keeps all student records in a file on the system. He wishes to allow an English professor to modify student records, but only for the professor's own class. The dean first sets the permissions on the file containing the records so that only he can read and write it. He then writes a program that takes as input the professor's modifications, verifies that the professor, as user, may change the record, then modifies the record if allowed. Finally, the dean protects the program by arranging for the effective IDs of any process that runs it to be set to the dean's real IDs when the program is executed.

Thus, when the program accesses the student record file, the system allows the program to read from or write to the file because it believes that the dean is accessing the file (the effective user and group IDs are those of the dean). However, the real user and group IDs of the process remain the same as those of the user invoking the program.
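As an illustration only, the distinction between real and effective IDs can be observed from inside a process. The sketch below uses Python's os module, which wraps the getuid, geteuid, getgid, and getegid system calls; Python is not part of HP-UX 10.0, and the function names describe_ids and runs_with_borrowed_identity are hypothetical names chosen for this example.

```python
import os


def describe_ids():
    """Return the real and effective user and group IDs of this process."""
    return {
        "real_uid": os.getuid(),
        "effective_uid": os.geteuid(),
        "real_gid": os.getgid(),
        "effective_gid": os.getegid(),
    }


def runs_with_borrowed_identity(ids):
    """True when an effective ID differs from the matching real ID,
    as happens while a setuid or setgid program is executing."""
    return (ids["real_uid"] != ids["effective_uid"]
            or ids["real_gid"] != ids["effective_gid"])


if __name__ == "__main__":
    ids = describe_ids()
    print(ids)
    print("effective IDs differ from real IDs:",
          runs_with_borrowed_identity(ids))
```

Run from an ordinary login shell, the real and effective values normally match. In a situation like the dean's program, the effective values would be the dean's while the real values would still be those of the invoking professor.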
The program can use the real IDs to verify access permission to the data (that is, ensuring that an English professor is accessing English class records).

The effective user and group IDs remain set until:

* The process terminates.
* The effective IDs are reset by an overlaying process, if the setuid or setgid bit is set. (See exec(2) in the HP-UX Reference.)
* The effective, real, and saved IDs are reset by the system calls setuid, setgid, setresuid, or setresgid. (See section 2 in the HP-UX Reference.)

Audit ID (Trusted Systems Only)
===============================

If you have converted to a trusted system, each user has an audit ID. This audit ID does not change, even when the user executes programs that use a different effective user ID.

Jobs
====

A job is any command or command pipeline invoked from an HP-UX shell. When you execute a command, the operating system interprets the entire task required by that command as a job.

Job Control
===========

HP-UX supports job control for both the Korn shell and C shell. Job control provides users with greater flexibility in managing and controlling jobs. For example, you can:

* Temporarily stop (suspend) a foreground job, by pressing CTRL-Z. The suspend character can be customized using the stty command.
* Bring a background job into the foreground, using the fg built-in shell command.
* Move a foreground job into the background, using the bg built-in shell command.

For more information, see stty(1), csh(1), glossary(9), and ksh(1) in the HP-UX Reference.

Process Groups
==============

Every process (except system processes, such as init and swapper) belongs to a process group. When you create a job, the shell assigns all the processes in the job to the same process group. Signals can propagate to all processes in a process group; this is a principal advantage of job control. (Signals are discussed later in this paper.)

Each process group is uniquely identified by an integer called a process group ID.
Each process group also has a process group leader. The process group's ID is the same as the process ID of the process group leader. Every process in a process group has the same process group ID.

A process group ID cannot be re-used by the system until its process group lifetime ends. The process group lifetime is a period of time beginning when a process group is created and ending when the last remaining process in the group leaves. A process leaves a process group either:

* When another process executes a wait() or waitpid() function for the process after it has become inactive, or
* When the process calls the setsid or setpgid system calls (see setsid(2) and setpgid(2) in the HP-UX Reference).

Group Access Lists
==================

Each process has a group access list of up to NGROUPS_MAX groups to which the process belongs. NGROUPS_MAX is defined in /usr/include/limits.h, and is typically 20. Programmatically, you can use sysconf() to obtain the value.

A process is permitted to access the files of any group in this list as though that group were the process's effective group ID. The access list is assigned at login based on the group memberships specified in the file /etc/group (Series 700) or /etc/logingroup (Series 800). You can control group access by using the chgrp command. (See chown(1) in the HP-UX Reference.)

Sessions
========

Each process group is a member of a session. All processes started during a single login session belong to a session. Or, think of a session as a login shell with all the jobs the login shell spawns. On a system running windows or shell layers, each window or layer is a session.

A process belongs to the same session as its creator (parent). A process can alter its session affiliation using the setsid system call. The shell uses the setpgid system call to create jobs and ensure that all processes in a job belong to the same process group. (See setsid(2) and setpgid(2) in the HP-UX Reference.)
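The relationship between process IDs, process group IDs, and sessions can be sketched as follows. This is an illustrative example in Python, whose os module wraps the fork, setsid, getpgrp, and getsid calls; it is not taken from HP-UX documentation, and ids_after_setsid is a hypothetical name. A forked child calls setsid and reports its IDs back to the parent through a pipe.

```python
import os


def ids_after_setsid():
    """Fork a child that calls setsid() and reports its new IDs.

    Returns (child_pid, child_pgid, child_sid) as seen inside the child.
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: a freshly forked child is not a process group leader,
        # so it is allowed to create a new session.
        try:
            os.close(read_fd)
            os.setsid()
            report = "%d %d %d" % (os.getpid(), os.getpgrp(), os.getsid(0))
            os.write(write_fd, report.encode())
        finally:
            os._exit(0)  # never fall through into the parent's code
    # Parent: collect the report, then reap the child.
    os.close(write_fd)
    with os.fdopen(read_fd) as pipe:
        data = pipe.read()
    os.waitpid(pid, 0)
    return tuple(int(field) for field in data.split())


if __name__ == "__main__":
    child_pid, child_pgid, child_sid = ids_after_setsid()
    print("child pid:", child_pid, "pgid:", child_pgid, "sid:", child_sid)
    print("parent sid:", os.getsid(0))
```

After setsid, the child is the leader of both a new session and a new process group, so its process ID, process group ID, and session ID are the same number, and its session differs from the parent's.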
A session leader is the process that created the session via the setsid system call (normally the login shell). Session lifetime is the period between when a session is created and the end of the lifetime of all process groups that belong to the session.

A special instance of process groups is the orphaned process group. This is a process group in which the parent process of every member is either itself a member of the group or is not a member of the group's session. Terminal signals (such as ^Z) cannot stop an orphaned process group.

Processes and Terminal Affiliation
==================================

Every session has one controlling terminal. The session leader connected to the controlling terminal is called the controlling process. The exceptions to this are daemon processes (such as cron and inetd); they have no controlling terminal. All processes belonging to the session use the controlling terminal for standard input, standard output, and standard error.

At any one time, one process group in a session is the foreground process group. Typically, the foreground process group is the job that the login shell runs in the foreground, or the shell itself if no foreground job has been spawned. The foreground process group has primary access to the controlling terminal, beyond that of all background process groups. The foreground process group can read from and write to the controlling terminal without restrictions.

Attempted Read by Background Process Group
------------------------------------------

If a process in a background process group attempts to read from its controlling terminal, its process group is signaled with SIGTTIN, which suspends the process to which it is sent, except in two instances:

* When the reading process ignores or blocks SIGTTIN.
* When the reading process belongs to an orphaned process group.

In either case, the read returns -1 and no signal is sent.
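A program that wants to avoid being stopped by SIGTTIN can check whether it is in the foreground process group before reading from the terminal. The following is a minimal sketch, assuming a POSIX system with Python available; in_foreground is a hypothetical helper name, and os.tcgetpgrp wraps the tcgetpgrp call that returns the terminal's foreground process group ID.

```python
import os


def in_foreground(fd=0):
    """Report whether this process is in the terminal's foreground group.

    Returns True or False when fd refers to a terminal that can be
    queried, and None when fd is not a terminal at all.
    """
    if not os.isatty(fd):
        return None
    try:
        # Compare the terminal's foreground process group with our own.
        return os.tcgetpgrp(fd) == os.getpgrp()
    except OSError:
        # fd is a terminal, but not one this process can query.
        return None


if __name__ == "__main__":
    print("standard input in foreground:", in_foreground(0))
```

Started normally from a shell, the sketch would be expected to report True; started with a trailing & from a job-control shell, False; and with standard input redirected from a file or pipe, None.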
Attempted Write by Background Process Group
-------------------------------------------

If a process in a background process group attempts to write to the controlling terminal, its process group is signaled with SIGTTOU which, by default, stops (suspends) the process to which it is sent. There are three cases where SIGTTOU is not sent:

* If the TOSTOP terminal characteristic is not set, the process is allowed to write.
* If the TOSTOP terminal characteristic is set, and if the writing process ignores or blocks SIGTTOU, the process is allowed to write.
* If the TOSTOP characteristic is set, and the writing process belongs to an orphaned process group, and the writing process does not ignore or block SIGTTOU, the call to write returns -1.

Process Creation
================

A process can create another process to:

* Concurrently execute another program.
* Execute another program and wait for its completion.

At a shell prompt, you create a new process by typing the command's name and pressing RETURN. The shell is the parent process; the process you create is the child process. The child process can then create its own child processes, of which it is the parent.

At the system programming level, a new process is created when a program calls either the fork or vfork system call. The parent process is the calling process; the child process is the created process. The system subroutine can also spawn a child process; system's calling process waits for the child process to terminate before continuing.

The following section briefly discusses the key system calls initiated when you run a program. For detailed information on the intricacies of processes and programs, consult the manual Programming on HP-UX, and system(3S), fork(2), vfork(2), and exec(2) in the HP-UX Reference.

The fork System Call
====================

The fork system call creates ("forks") a new process.
In earlier implementations of HP-UX, the system copied the entire data segment of a process every time fork was issued. fork time increased as the size of the data and stack segments increased. Because processes can be larger than available memory, copy time could increase dramatically, degrading system performance.

To avoid the penalty and potential waste of the copy, HP-UX implements what is known as copy-on-write, which allows the parent and child processes to share a common data page until either one writes to that page. As a result, only the pages that are modified are copied.

When the parent process is duplicated into the child's address space, the child differs from the parent in its process ID and parent process ID; it has exact copies of the parent's code, data, and current variable values. (See fork(2) in the HP-UX Reference for a detailed list of similarities and differences.)

When the fork system call is executed, the system must have enough free swap space to duplicate the parent process or fork fails.

Once the child process is created, both processes begin to execute from the program statement immediately following the call to fork. The fork system call returns the child's process ID (a non-zero value) to the parent process, while the identical call in the child's copy of the code always returns zero. Since the values returned by fork are distinguishable, each process can determine whether it is the parent process or the child process.

For example, suppose a process consists of a program that tests the life of car batteries. The program has read 1000 data values from a voltmeter and is ready to print and plot the data. The program could have been written to do one task completely (such as printing the data) and then perform the second task (plotting the data). However, the programmer has included a fork system call in his program at a location after the data has been read.
When the program completes the statement containing the fork system call, two nearly identical processes exist. Each process examines the value returned by its fork system call to determine whether it is the child process or the parent process. Following the fork statement is a conditional branch statement that states, "If the process is the child process, print the data. If the process is the parent process, plot the data." The fork statement and the conditional branch statement enable both printing and plotting to be done simultaneously. And since each process has its own copy of the test data, each can modify the data without affecting the other process.

copy-on-write and fork
----------------------

Current restrictions in HP-UX allow only one virtual translation to a physical page. Because of this, the copy-on-write implementation used by fork is not a true copy-on-write, but what HP refers to as copy-on-access. The parent retains the translation as read-only, but the child is given only a page number. If the child accesses any copy-on-write data (read or write), the page is copied, hence the term copy-on-access. For more information, see the Memory Management White Paper.

The vfork System Call
---------------------

Applications that need to create a small independent process might do so faster and more efficiently using vfork rather than fork. The use of vfork is appropriate only when the child process performs an immediate exec() or _exit() call.

With fork, the child process gets a copy of the parent's virtual address space. As a result, the child process and the parent process can run at the same time.

With vfork, however, the child process borrows the parent process's virtual address space; the child process gets no copy of its own. As a result, there is no copy time or swap allocation. Since the child process uses the parent process's virtual address space, the two processes cannot run concurrently. The parent process sleeps while the child process runs.
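The pattern that vfork is intended for, a child that immediately calls exec or _exit while the parent waits, can be sketched as follows. Python's os module exposes fork but not vfork, so this illustration substitutes fork followed by waitpid; it shows the control flow (including the zero and non-zero return values of fork), not the address-space sharing that distinguishes vfork. The name run_and_wait is hypothetical.

```python
import os
import sys


def run_and_wait(argv):
    """Fork, exec argv in the child, and return the child's exit status.

    Returns None if the child was ended by a signal rather than exiting.
    """
    pid = os.fork()
    if pid == 0:
        # Child: fork returned zero. Replace this process image at once.
        try:
            os.execvp(argv[0], argv)
        except OSError:
            pass
        # Reached only if exec failed; leave without running any of the
        # parent's cleanup code.
        os._exit(127)
    # Parent: fork returned the child's PID. Sleep until the child is done.
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) if os.WIFEXITED(status) else None


if __name__ == "__main__":
    code = run_and_wait([sys.executable, "-c", "print('hello from child')"])
    print("child exited with status", code)
```

The exit status 127 for a failed exec follows a common shell convention and is a choice made for this sketch.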
For example, vi uses vfork to create a child process which then immediately uses exec() to execute sh. The child process first uses the parent vi virtual address space, and then switches to the virtual address space of sh.

If the child modifies data before it does an exec() or _exit(), these changes will be visible to the parent when it resumes execution.

The exec System Call
====================

A fork system call is often followed by an exec of another program. exec is a system call that overlays separate code and data on top of already existing process code and data. In this manner a parent process is able to create a new process using fork, and subsequently execute an entirely different program via exec.

For example, suppose you are editing a file and need to verify the name of another file. You might use a shell escape to fork a shell and exec the program ls.

Open Files
==========

For a process to access files, it must first open them. A process inherits all open files from the parent. Three files that are usually open are standard input (stdin), standard output (stdout), and standard error (stderr). When a process terminates, the system closes any files opened by the process.

Two configurable operating-system parameters govern the number of open files permitted per process, maxfiles and maxfiles_lim.

The parameter maxfiles defines the soft limit, representing how many open files a process can have. The soft limit is inherited across forks and execs. The parameter maxfiles_lim defines the hard limit of open files permitted per process.

Both maxfiles and maxfiles_lim are set when the kernel is generated, to a value between 30 and 2048, inclusive. maxfiles must be less than or equal to maxfiles_lim. An absolute limit of 2048 is imposed on both limits.

The soft limit is per-process and derives from the process's parent (unless the process was process 1, in which case the soft limit was originally maxfiles).
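A process can inspect its own soft and hard open-file limits. The sketch below uses Python's resource module, which wraps the getrlimit call with the RLIMIT_NOFILE resource; it is an illustration for a generic POSIX system rather than HP-UX-specific code, and the function names are hypothetical.

```python
import resource


def open_file_limits():
    """Return (soft, hard) limits on open files for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def within_hard_limit(soft, hard):
    """Check the rule that the soft limit may not exceed the hard limit.

    resource.RLIM_INFINITY stands for "no limit" and is handled
    separately because its numeric value is platform dependent.
    """
    if hard == resource.RLIM_INFINITY:
        return True
    if soft == resource.RLIM_INFINITY:
        return False
    return soft <= hard


if __name__ == "__main__":
    soft, hard = open_file_limits()
    print("soft limit:", soft, "hard limit:", hard)
    print("soft <= hard:", within_hard_limit(soft, hard))
```

The values reported on a given system depend on its configuration and on any limits changed by the process's ancestors.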
Consumption of system resources can be limited by using the setrlimit system call. For example, a calling process's soft limit can be increased to the hard limit or decreased to any value greater than the number of the highest file descriptor allocated.

When the system boots, the hard limit begins at maxfiles_lim. Only root processes can increase the hard limit; however, the maximum hard limit is 2048. (Programmatically, you can use sysconf() or getrlimit() to get the current value.)

Note that soft and hard limits apply to a process unless or until the process calls setrlimit(). If that occurs, the process's child processes inherit the new limits. If none of a process's ancestors have changed the soft or hard limits, the soft limit is maxfiles and the hard limit is maxfiles_lim.

When considering the appropriate setting of soft and hard limits, consider too the operating-system parameters nfile and ninode, because they are affected as well. Operating-system parameters are defined in /usr/conf/master.d/core-hpux.

Process Termination
===================

At the system programming level, a process terminates (dies) when:

* The process successfully finishes running.
* The process intentionally terminates itself by calling the exit system call (see exit(2) in the HP-UX Reference).
* The process receives a signal whose default action is fatal.

When a process dies, all its open files are closed and most of the resources the process holds are deallocated. The process then enters an inactive state in which it still holds some system resources. The remaining resources are then returned to the system, the last being the process's entry in the proc table. At that point the process is dead.

At the shell level, processes can be terminated via the kill command, described in more detail in the next section.

Process Management Commands
===========================

You can manage and monitor processes either from the command level (from a shell) or from SAM (Process Management area).
The most useful tools are described in this section. HP-UX can also provide system accounting information for terminated processes.

Understanding Process Status -- ps
==================================

The ps command, run from a shell or from SAM, displays the following information about processes currently running on the system:

* User ID of the user who spawned the process
* Process ID
* Parent process ID
* Command line used to spawn the process
* tty (terminal) line from which command was invoked
* Length of time the process has been running (real CPU time).

The ps command also allows you to query the system about processes selectively. It is a very versatile command which is quite useful to system administrators and general users alike.

When invoked without options, ps displays the process ID, terminal ID (tty), real CPU time usage, and name of all commands a user is running. If the -f (full) option is specified, ps also displays login name, parent process ID, and time the process was forked.

    $ ps -f
         UID   PID  PPID  C    STIME TTY      TIME COMMAND
      mickey  3286  2016  9 16:19:03 ttyp1    0:00 ps -f
      mickey 25705 25649  0 08:47:58 ttyp1    0:02 -ksh /home/michael [ksh]
      mickey  2016 25705  0 15:13:02 ttyp1    0:24 vi processes.tag

Including the -e option causes ps to display information for all processes in the system, not just those of the user who invoked it:

    $ ps -ef
         UID   PID  PPID  C    STIME TTY      TIME COMMAND
      mickey 25737 25715 14 08:48:13 ttyp3    0:00 xcal /usr/bin/X11/xcal
        root 13322     1  0    Jun 6 ?        0:00 cron
      mickey 24357     1  0 19:45:46 console  0:01 -ksh [ksh]
        root     4     0  0    Jun 6 ?        7:15 netisr
        . . .

If the -l (long) option is specified, ps displays user ID, state (S), nice value (NI), memory or disk address of the process (ADDR), priority (PRI), and size (SZ) in blocks of the process core image. (State and priority are discussed in some detail later in this paper.)
    $ ps -l
      F S   UID   PID  PPID  C PRI NI     ADDR  SZ    WCHAN TTY    TIME COMD
      1 R   513 11009  7793  5 179 20   d6e200  16          ttyu4  0:00 ps
      1 S     0  7792   133 15 154 20   e06100  13   214fb0 ttyu4  0:00 rlogind
      1 S   513  7793  7792 16 168 20   df5a80  52 7ffe6000 ttyu4  0:00 csh

The -l option is useful for isolating problems on a sluggish system. For example, a runaway process might show up with an excessively large entry in the TIME column. You can trace the process back to its owner via the UID column. Similarly, if a process is not responding or getting time to run, the S (state) column might indicate whether the process is sleeping.

For details on using ps, see ps(1) in the HP-UX Reference.

Understanding Process Termination -- kill
=========================================

The kill command, run from a shell or from SAM, terminates a running process when you specify the process's PID. By default, kill terminates the process gracefully via the SIGTERM termination signal. "Graceful" means that the process first handles any additional signals pending. By specifying a signal number, you can kill processes more abruptly if necessary (for example, kill -9, where 9 specifies the SIGKILL signal). See signal(5) in the HP-UX Reference for a complete description of the signals handled by the multiple signal interfaces supported by HP-UX.

For example, suppose a user is running a program that has hung the terminal; that is, the program does not respond to keyboard input from the user and the user cannot exit the program. As system administrator, you can kill the user's program from another terminal by doing the following:

* Determine the process ID of the program, by running ps.
* Kill the process via kill.
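The same escalation, SIGTERM first and SIGKILL only if needed, can also be carried out programmatically. The following is a minimal sketch, assuming a POSIX system with Python available; kill_gracefully is a hypothetical helper, the two-second grace period is an arbitrary choice, and the target here is a child started by the sketch itself so that it can be waited for.

```python
import os
import signal
import subprocess
import sys


def kill_gracefully(proc, grace_seconds=2.0):
    """Send SIGTERM to a child; escalate to SIGKILL if it keeps running.

    Returns the number of the signal that ended the process, or None if
    the process exited on its own.
    """
    os.kill(proc.pid, signal.SIGTERM)        # equivalent to: kill PID
    try:
        proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        os.kill(proc.pid, signal.SIGKILL)    # equivalent to: kill -9 PID
        proc.wait()
    # subprocess reports death by signal N as a return code of -N.
    return -proc.returncode if proc.returncode < 0 else None


if __name__ == "__main__":
    sleeper = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"])
    print("ended by signal", kill_gracefully(sleeper))
```

Trying SIGTERM first gives the target a chance to clean up; SIGKILL cannot be caught or ignored, which is why it is kept as the fallback.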
If the program was named list_data and the user's user ID was 265, you might see the following:

    $ ps -f -u 265
         UID   PID  PPID  C    STIME TTY      TIME COMMAND
       terry  3677  3638  0 12:27:29 ttyp1    0:53 list_data
       terry  3638  3632  0 12:25:02 ttyp1    0:00 ksh

To kill the process gracefully, type:

    $ kill 3677

This would, in most cases, kill the process and the user would again be able to use the terminal. If this does not work, you might try:

    $ kill -9 3677

This form of kill should be used cautiously because it bears a higher risk of data loss.

Understanding Relative Process Priority -- nice and renice
==========================================================

All processes have a priority, set when the process is invoked and based on factors such as who is running the process (user, system) and whether the process is created in a time-share or real-time environment.

The nice command can be used to set a process to run at a lower priority than would be set by default. nice does not lower the priority of an already running process.

nice is useful for running programs whose execution time is not critical. For example, suppose you have a program, named numcrunch, that manipulates large arrays of data, but the data is not critical to your work at the moment. How long it takes for the program to manipulate the data is unimportant; more critical programs should have greater access to CPU resources. To run numcrunch as a low priority background process, type:

    $ nice numcrunch &

Note that the Korn and C shells handle nice slightly differently: ksh automatically lowers the priority of background processes by four; this behavior can be modified using the bgnice option. If you specify nice from ksh, it executes /usr/bin/nice and lowers priority by ten. If you specify nice from csh, it executes its built-in command and lowers priority by four; however, if you specify /usr/bin/nice, csh lowers priority by ten.
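A process can also lower its own priority through the nice system call (see nice(2)). The sketch below, written in Python for a generic POSIX system, forks a child that raises its nice value by a given increment and reports the result, leaving the parent unchanged. The name child_niceness is hypothetical, and the numeric nice values reported differ between systems.

```python
import os


def child_niceness(increment):
    """Fork a child, raise its nice value, and report the result.

    Returns (parent_nice, child_nice). A larger nice value means a
    lower scheduling priority.
    """
    parent_nice = os.nice(0)  # an increment of zero just reads the value
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: lower our own priority, then report the new nice value.
        try:
            os.close(read_fd)
            os.write(write_fd, str(os.nice(increment)).encode())
        finally:
            os._exit(0)
    # Parent: read the child's report and reap it.
    os.close(write_fd)
    with os.fdopen(read_fd) as pipe:
        child_nice = int(pipe.read())
    os.waitpid(pid, 0)
    return parent_nice, child_nice


if __name__ == "__main__":
    before, after = child_niceness(10)
    print("parent nice value:", before, "child nice value:", after)
```

The increment of ten mirrors what the text describes for /usr/bin/nice. The system clamps the result at its maximum nice value, so the observed difference can be smaller than the increment requested.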
The renice command (/usr/sbin/renice) allows you to alter the priority of running processes. The priority of running processes can also be altered from the Process Management area of SAM.

For details, see nice(1), ksh(1), csh(1), renice(1M), and nice(2) in the HP-UX Reference.

Tools for Monitoring Process Management Performance
===================================================

The following tools may provide additional information to help you examine your system's operation. Although they are commands, some (top and sar) are accessible from the Process Management area of SAM:

top(1)
    Displays and updates information about the top processes on the system. Summarizes the general state of the system (load average), quantifies amount of memory in use and free, and reports on individual processes active on the system. Whereas ps gives a single "snapshot" of the system, top samples the system and updates its display at intervals. On multi-processing systems, top reports on the state of each CPU.

sar(1M)
    Reports on cumulative system activity, including CPU utilization, buffer activity, transfer of data to and from devices, terminal activity, number of specific system calls used, amount of swapping and switching activity, queue lengths, and other kernel tables.

vmstat(1)
    Quantifies the use of virtual memory by processes on the system; also reports on traps and CPU activity.

iostat(1)
    Reports I/O statistics for active disks, terminal, and processor.

Process Management and the Kernel
=================================

You can think of the process control system as containing the kernel's scheduling subsystem, memory management subsystem, and interprocess communication (IPC) subsystem.

The process control system interacts with the file system when reading files into memory before executing them.

Several processes may be instances of the same program; for example, more than one person might be using vi at the same time; each invocation of vi is a separate process running the same program.
A process reads and writes its own data and stack, but cannot read or write those of any other process. (Note, however, that shared memory can be read and written by several processes.)

Processes communicate with other processes via shared memory or system calls. Communication between processes (IPC) includes asynchronous signaling of events and synchronous transmission of messages between processes.

System calls are requests by a process for some service from the kernel, such as I/O, process coordination, system status, and data exchange. All HP-UX system calls are documented in section 2 of the HP-UX Reference.

Much valuable information about CPU usage can be obtained using the Process Management area of SAM.

Process Modes
=============

An HP-UX process can execute in two modes -- kernel or user mode -- and through its process lifetime, switches between them. Information about the process (such as variables, process addresses, buffer counts) accumulates in a "stack" for each mode, and it is through these stacks that the process executes instructions and switches modes.

Certain kinds of instructions trigger mode changes; for example, when a program invokes a system call it goes through system call stub code, passing the system call number through a gateway page that adjusts privilege bits to switch to kernel mode. When a process switches mode to the kernel, it executes kernel code, and uses the kernel stack.

Every process has an entry in a kernel process table and a u_area structure, which contains private data such as control and status information. The context of a process is defined by all the unique elements identifying it -- the contents of its user and kernel stacks, values of its registers, data structures, and variables.

Although HP-UX gives the illusion of executing multiple processes simultaneously, each CPU actually executes one process at a time, according to its priority. (In multiprocessing systems, several SPUs can each be executing a different process at the same time.)
The distribution of CPU time is governed by the time-slice -- the amount of time a process can run before the kernel checks to see if there is an equal-priority process ready to run.

Process Priorities
==================

Processes are generally chosen by the scheduler to execute a time-slice based on their priority. Priorities range from highest priority to lowest priority and are classified by need.

Priorities are set in two separate ranges: a range of POSIX standard priorities and a range of other HP-UX priorities. The POSIX standard priorities are always higher than all other HP-UX priorities.

Within a given range of priority numbers, the process assigned the lowest number has the highest priority. For example, a process assigned a priority of 1 takes precedence over a process assigned a priority of 6.

You can see the priority at which a process is set to run by looking in the PRI column when you invoke ps or top. (The lowest number within the range represents the highest priority.)

The following lists the four categories of priority, from highest to lowest:

1. POSIX standard priority (tunable parameter)
   POSIX standard priorities, known as RTSCHED priorities, are the highest priorities. RTSCHED processes have a range of priorities separate from other HP-UX priorities. The number of RTSCHED priorities is a user tunable parameter (rtsched_numpri), set between 32 and 512 (default 32).

2. Real-time priority (0-127)
   Reserved for SCHED_RTPRIO processes started with rtprio() system calls.

3. System priority (128-177)
   Used by system processes.

4. User priority (178-255)
   Assigned to user processes.

The kernel can alter the priority of time-share processes (128-255) but not of real-time processes (0-127).

Run Queues
==========

A process must be on a queue of runnable processes before the scheduler can choose it to run. Run queues are linked in order of decreasing priority.
Each process is represented by its header on the list of run queue headers; each entry in the list of run queue headers points to the process table entry for its respective process. Processes are linked into the run queue according to their priority, which is set in the process table.

The kernel maintains separate queues for system-mode and user-mode execution.

When a time-share process is not running, the kernel improves the process's priority (lowers its number). When a process is running, its priority worsens. The kernel does not alter priorities on real-time processes. Time-share processes (both system and user) lose priority as they execute and regain priority when they do not execute.

The scheduler chooses the process with the highest priority to run for a given time-slice. System-mode priorities take precedence for CPU time. User-mode priorities can be preempted -- stopped and swapped out to secondary storage; kernel-mode priorities cannot.

Processes run until they have to wait for a resource (such as data from a disk), until the kernel preempts them when their run time exceeds a time-slice limit, until an interrupt occurs, or until they exit. The scheduler then chooses a new eligible highest-priority process to run; eventually, the original process will run again when it has the highest priority of any runnable process.

Process Context
===============

When the kernel switches from executing one process to executing another process, it switches context. That is, the kernel switches from executing in the context of one process to the context of another process. When this happens, the kernel saves information about the interrupted process so that it can later resume execution. The kernel also restores the context of the process it is about to start. Note that context switching can only take place when the process is executing in system mode.
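
The priority bands and the aging of time-share priorities described under Process Priorities and Run Queues can be sketched as a toy simulation. This is an illustration in Python under stated assumptions, not the HP-UX scheduler: the names Proc, priority_class, run_one_slice, and simulate are hypothetical, the penalty and bonus amounts are made up, and RTSCHED priorities (which occupy a separate range) are left out.

```python
from dataclasses import dataclass

# Priority bands from the list under "Process Priorities"
# (lower number = higher priority). RTSCHED is omitted from this sketch.
BANDS = {
    "real-time": (0, 127),
    "system": (128, 177),
    "user": (178, 255),
}


def priority_class(pri):
    """Name the band that a priority number falls in."""
    for name, (low, high) in BANDS.items():
        if low <= pri <= high:
            return name
    raise ValueError(f"priority {pri} is outside 0-255")


@dataclass
class Proc:
    name: str
    pri: int


def run_one_slice(procs, penalty=4, bonus=1):
    """Run the best-priority process for one slice, then age priorities.

    penalty and bonus are invented amounts, not HP-UX values.
    """
    chosen = min(procs, key=lambda p: p.pri)  # lowest number runs first
    for p in procs:
        band = priority_class(p.pri)
        if band == "real-time":
            continue  # the kernel does not alter real-time priorities
        low, high = BANDS[band]
        if p is chosen:
            p.pri = min(high, p.pri + penalty)  # running: priority worsens
        else:
            p.pri = max(low, p.pri - bonus)     # waiting: priority improves
    return chosen.name


def simulate(procs, slices):
    """Return the name of the process chosen for each successive slice."""
    return [run_one_slice(procs) for _ in range(slices)]
```

For example, simulate([Proc("a", 180), Proc("b", 182)], 4) returns ['a', 'b', 'a', 'b']: each process loses priority while it runs and regains it while it waits, so the two alternate. A real-time process placed in the same list would be chosen for every slice, because its priority number is lower and is never adjusted.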
Process States and Transitions
==============================

Throughout the course of its lifetime, a process transits through several states. Queues in main memory keep track of the process by its process ID. A process resides on a queue according to its state; process states are defined in the /usr/include/sys/proc.h file. Events, such as receipt of a signal, cause the process to transit from one state to another.

Processes can be described as being in one of the following states:

* idle - Process has just been forked; an idle process can be scheduled to run.
* run - Process is on a run queue, available to execute in either kernel or user mode.
* stopped - Executing process is stopped by a signal or parent process.
* sleep - Process is not executing, while on a sleep queue (for example, awaiting I/O to complete).
* zombie - Having exited, the process no longer exists, but leaves behind for the parent process some record of its execution.

When a program starts up a process, the kernel allocates a structure for it from the process table; the process is now in idle state, waiting for system resources. Once it acquires the resources, the process is linked onto a run queue and made runnable. When the process acquires a time-slice, it runs, switching as necessary between kernel mode and user mode.

If a running process receives a stop signal such as SIGSTOP or SIGTSTP (as with control-Z in vi) or is being traced, it enters the stopped state. On receiving a SIGCONT signal, the process returns to a run queue (in-core, runnable).

If a running process must wait for a resource (such as a semaphore or completion of I/O), the process goes on a sleep queue (sleep state) until getting the resource, at which time the process wakes up and is put on a run queue (in-core, runnable).

A sleeping process might also be swapped out, in which case, when it receives its resource (or wakeup signal), the process might be made runnable but remain swapped out. The process is then swapped in and put on a run queue.
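
The state transitions described in this section can be summarized as a small table-driven state machine. This is a sketch in Python for illustration only: the event names (resources_acquired, wait_for_resource, wakeup, exit) and the functions next_state and replay are hypothetical, and the actual state definitions live in /usr/include/sys/proc.h.

```python
from enum import Enum


class State(Enum):
    IDLE = "idle"
    RUN = "run"
    STOPPED = "stopped"
    SLEEP = "sleep"
    ZOMBIE = "zombie"


# (current state, event) -> next state. Event names are illustrative.
TRANSITIONS = {
    (State.IDLE, "resources_acquired"): State.RUN,
    (State.RUN, "SIGSTOP"): State.STOPPED,
    (State.STOPPED, "SIGCONT"): State.RUN,
    (State.RUN, "wait_for_resource"): State.SLEEP,
    (State.SLEEP, "wakeup"): State.RUN,
    (State.RUN, "exit"): State.ZOMBIE,
}


def next_state(state, event):
    """Look up the state that follows `state` when `event` occurs."""
    key = (state, event)
    if key not in TRANSITIONS:
        raise ValueError(f"no transition from {state.value} on {event}")
    return TRANSITIONS[key]


def replay(events, state=State.IDLE):
    """Apply a sequence of events, starting from a freshly forked process."""
    for event in events:
        state = next_state(state, event)
    return state
```

A process that is created, sleeps on I/O, wakes, and exits would follow replay(["resources_acquired", "wait_for_resource", "wakeup", "exit"]) and end in State.ZOMBIE. Swapping is not modeled here; a swapped-out process is still in one of the five states listed.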
Once a process ends, it exits into a zombie state.

Interrupts
----------

Here is an example of how interrupts work: Process A might make a system call requiring disk I/O. While waiting for the disk I/O to complete, Process A goes to sleep. Meanwhile Process B starts up. The disk I/O completion generates an interrupt. In the course of completing the disk I/O, the driver controlling it also puts the sleeping Process A onto a run queue, making it runnable at a high priority, because Process A now holds a resource. The interrupt activity "belongs" to Process A, but happens while Process B is running. The interrupt is asynchronous to the thread of execution.

Signals
=======

Signals are transmissions between processes and are used for interprocess communication (IPC). Signals are also viewed as events to which processes respond. An example of signal use is the kill command, which by default sends a SIGTERM signal to terminate the specified processes.

When a process receives a signal, the process might ignore the signal, terminate, defer the signal, or execute a signal handler. HP-UX defines several signal interfaces that allow a process to specify the action taken upon receipt of a signal. (See sigaction(2), signal(2), sigvector(2), bsdproc(2), and sigset(2V) in the HP-UX Reference for the various HP-UX signal interfaces. See signal(5) for a description of HP-UX signals. Acceptable signal values are defined in /usr/include/sys/signal.h.)

Multiprocessing
===============

Standard HP-UX executes in a uniprocessing system architecture consisting of one central processing unit (CPU), memory, and peripherals. Some PA-RISC systems support symmetrical multiprocessing (SMP) -- two, three, or four system processing units (SPUs), memory shared among the processors, and peripherals. Symmetrical multiprocessing is a limited form of parallel processing.

Header files that define SMP include /usr/include/machine/mp.h, /usr/include/sys/spinlock.h, and several semaphore-related header files.
SMP header files can be read, but they define operating-system behavior that is not configurable.

How Multiprocessing Compares to Uniprocessing
=============================================

Despite architectural differences, multiprocessing systems execute code identically to standard HP-UX. Although multiple processes can run concurrently on the several processors, all processors use the same kernel code.

Symmetry
--------

Symmetry in SMP has to do with use of the SPU. With some asymmetrical MP systems, only a primary SPU can execute I/O and system code; other SPUs execute only user code. With symmetrical MP, every SPU can execute I/O and system code. One copy of the kernel resides in memory and controls all processors. Since any kernel task can execute on any processor in the SMP system, HP-UX MP is termed symmetrical.

Multiprocessor Boot-Up
----------------------

The boot-up procedure in an SMP system is almost identical to that of a uniprocessing system. At boot-up, the processor-dependent code (PDC) arbitrates among the multiple processors to determine the "monarch processor." The serf (non-monarch, or remaining) processors wait in "rendezvous code," while the monarch processor alone executes the boot-up code.

Upon reaching the SMP portion of the configuration code (near the end of the boot process), the monarch processor signals each serf processor in turn to bring it into the multiprocessing environment, at which point the serf goes idle, ready to run processes. Once all serfs are brought in, the monarch processor can go idle also.

From idle, processors do their work: they look in run queues for processes that need CPU time, grab locks, execute processes, release locks, and thus perform their normal operations on behalf of the kernel.

Scheduling Processes Using Semaphores and Spinlocks
===================================================

When processes execute simultaneously, it becomes harder for the operating system to maintain the integrity of data.
To solve this problem, HP-UX controls access to data by using semaphores. Semaphores (and spinlocks, a kind of semaphore) are a locking mechanism for ensuring that only one processor accesses critical kernel resources at a time. Critical kernel resources include run queues, file and inode tables, linked lists, and any kind of table that processes might want to access or change. A spinlock is held very briefly; other semaphores lock resources for longer durations and allow the processor to perform another task (such as switch processes) while the lock is held.

The SMP system kernel uses semaphores to protect global data structures while multiple processors are executing concurrently. Locks are also used in a uniprocessing system to ensure data integrity.

Processes run concurrently, but each lock may only be held by one process at a time. If another process attempts to acquire an already-held lock, the process must wait for the lock to be released.

The following example demonstrates how the kernel allocates critical regions to concurrently running processes. A process (P1), executing in user mode on SPU 1, might take a trap, which interrupts the process's execution and sends it into kernel mode, where it starts executing system code. P1 puts itself on a run queue, which pulls another process (P2) off the run queue. P2 starts to run. Meanwhile, a third process (P3), executing in user mode on SPU 2, takes a trap into kernel mode and wants to go on the run queue. Since P2 is executing in the critical region and controls the spinlock, P3 must wait, spinlocked, until P2 releases the lock; then P3 can continue to run.

When multiple processes want to access critical resources at the same time, there is potential for a race condition or failure. Race conditions might occur any time a program takes an action dependent on data that another process can change at the same time. Spinlocks enable the kernel to arbitrate between competing processes so that each process gets to access resources in a prioritized order.
For example, suppose there are two processes, executing on two SPUs, each process accessing the same location in kernel memory simultaneously (in lockstep). Each process is counting objects, with the statement

    nobjects=nobjects+1;

The value of nobjects is the number of objects counted, and currently equals 10. If the same code is running simultaneously without spinlocks, both processors independently read 10, compute 11, and store 11, so one of the two increments is lost. With spinlocks, the first process must complete and release its lock before the second process can compute. The total is thus incremented twice, first to 11, then to 12.

Each processor uses spinlocks to control access to data structures on behalf of its executing process. Likewise, competing processors must wait to obtain a spinlock held by another processor, until the lock is available.

Processor Affinity
==================

Processor affinity is the tendency of a process to execute from beginning to end on the same processor. The kernel is inclined to allocate processing time so that this happens, because a process running on the same processor executes faster. The kernel, however, must also balance the workloads among the SPUs, and will move a process from one SPU to another if necessary to achieve the best overall system throughput.

Uniprocessor Emulation Issues
=============================

Underlying the design of HP-UX multiprocessing is the concept of uniprocessor emulation. In all user interactions, HP-UX on SMP behaves exactly as if the operating system were running on a single processor.

As noted earlier, widespread use of spinlocks requires that device drivers, in I/O operations and in virtually all areas of program design, share kernel data structures sequentially to ensure their integrity.

Because of SMP's greater complexity, programmers must take care to write code more carefully: timing hazards not seen in uniprocessing can occur in an SMP environment.
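
The nobjects example above is one such timing hazard. The following sketch, written in Python purely for illustration, replays it: the first function mimics two SPUs executing the statement in lockstep with no lock, and the second serializes the two increments. Note the assumptions: threading.Lock is a blocking mutex standing in for a kernel spinlock (it does not spin), and the function names are hypothetical.

```python
import threading


def lockstep_without_lock(nobjects):
    """Both SPUs read the old value before either one stores its result."""
    read_by_spu1 = nobjects        # SPU 1 loads nobjects
    read_by_spu2 = nobjects        # SPU 2 loads the same, stale value
    nobjects = read_by_spu1 + 1    # SPU 1 stores its sum
    nobjects = read_by_spu2 + 1    # SPU 2 overwrites it with the same sum
    return nobjects


def serialized_with_lock(nobjects):
    """Each worker must hold the lock for its whole read-modify-write."""
    lock = threading.Lock()
    counter = {"nobjects": nobjects}

    def count_one():
        with lock:  # only one worker at a time may be in this region
            counter["nobjects"] = counter["nobjects"] + 1

    workers = [threading.Thread(target=count_one) for _ in range(2)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return counter["nobjects"]
```

Starting from 10, lockstep_without_lock(10) returns 11 (one increment is lost), while serialized_with_lock(10) returns 12.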
Careful regression testing on SMP computers is essential to screen for timing hazards.

A problem resulting from this sequential sharing of resources is characterized by excessive idle time. This situation can be detected with tools such as top(1), sar(1M), and the optional HP products LaserRX and Glance/UX. These tools, however, will not reveal why the problem occurs; a thorough review of your application design will be necessary.