Purdue's Adapter for Parallel Execution and Rapid Synchronization: The TTL_PAPERS DesignPurdue's Adapter for Parallel Execution andRapid Synchronization: The TTL_PAPERS Design H. G. Dietz, T. M. Chung, T. Mattox, and T. MuhammadSchool of Electrical EngineeringPurdue UniversityWest Lafayette, IN 47907-1285hankd@ecn.purdue.eduJanuary 1995 Abstract Tightly-coupled parallel machines are being replaced withclusters of workstations connected by communication networksyielding relatively long latencies, but ever-higherbandwidth (e.g., Ethernet, FDDI, HiPPI, ATM). However, itis very difficult to make parallel programs based on fine-grain aggregate operations execute efficiently using anetwork that is optimized for point-to-point blocktransfers.TTL_PAPERS augments a cluster of PCs or workstations withthe minimum hardware needed to provide very low latencybarrier synchronization and aggregate communication. Forexample, UNIX user processes within a TTL_PAPERS cluster canperform a barrier synchronization in 2.5 microseconds,including all software overhead. This is four orders ofmagnitude faster than using UNIX socket connections over anEthernet.This paper presents the TTL_PAPERS principles of operation,implementation issues, low-level software interface, andmeasured performance of the basic operations.This work has been supported in part by ONR Grant No.N0001-91-J-4013 and NSF Grant No. CDA-9015696. Keywords Parallel architecture, Fine-grain parallelism, Workstationclustering, Barrier synchronization, Aggregatecommunication. Notice for HTML Users The HTML version of this paper differs from the standardpostscript version (click here for the 157K compressed postscript version) in severalways, including a variety of minor formatting changes. Thestandard postscript version was formatted by groff-mgs as a single file. In contrast, this HTMLdocument was generated using a script that replaces figureswith in-line gif images that are hypertextlinked to full-resolution postscript versions; the(nK .ps.Z) at the upperright of each in-line image provides both the size of thecompressed postscript file and the hypertext link. Thereare also a variety of hypertext links inserted fornavigation within the document, including a Hypertext Indexat the end of this document (click here to jump to the HypertextIndex). 1. Introduction TTL_PAPERS is the TTL implementation of Purdue's Adapter forParallel Execution and Rapid Synchronization. Although itprovides sufficient support for barrier synchronization andaggregate communication to allow a cluster to efficientlyexecute fine-grain MIMD, SIMD, or even VLIW code, a four-processor TTL_PAPERS unit uses just eight standard TTLchips. At first, it is very difficult to accept that suchinexpensive and simple hardware can implement very effectivesynchronization and communication operations for parallelprocessing. This is because the traditional views ofparallel and distributed processing rest on a set of basicassumptions that are incompatible with the concept:Conventional wisdom suggests that the operating systemshould manage synchronization and communication, but even asimple context switch to an interrupt handler takes moretime than TTL_PAPERS takes to complete a typicalsynchronization or communication. All interactions with theTTL_PAPERS hardware are I/O port accesses made directly fromthe user program; there are no OS modifications required andno OS call overhead is incurred.Communication operations are characterized primarily bylatency (total time to transmit one object) and bandwidth(the maximum number of objects transmitted per unit time).The hardware and software complexity of most interactionmethods results in high latency; high latency makes highbandwidth and large parallel grain size necessary. Incontrast, TTL_PAPERS is very simple and yields acorrespondingly low latency. Providing low latency allowsTTL_PAPERS to work well with relatively fine-grainparallelism, but it also means that relatively low bandwidthcan suffice.A typical parallel computer is constructed by giving eachprocessor a method of independently performingsynchronization and communication operations with otherprocessors; in contrast, TTL_PAPERS interactions betweenprocessors are performed as aggregate operations based onthe global state of the parallel computation, much as in aSIMD machine. This SIMD-like model for interactions resultsin much simpler hardware and a substantial reduction insoftware overhead for most parallel programs (as wasobserved in the PASM prototype [SiN87] ). For example,message headers and dynamically-allocated message buffersare not needed for typical TTL_PAPERS communications. It isalso remarkably inexpensive for the TTL_PAPERS hardware totest global state conditions such as any or all.Thus, TTL_PAPERS does not perform any magic; it simply isbased on a parallel computation model that naturally yieldssimpler hardware and lower latency. TTL_PAPERS is notreally a network at all, but a special-purpose parallelmachine designed specifically to provide low-latencysynchronization and communication, taking full advantage ofthe "loosely synchronous" execution models associated withfine-grain to moderate-grain parallelism. 1.1. Synchronization The only synchronization mechanism implemented by TTL_PAPERSis a special type of fine-grain barrier synchronization thatfacilitates compile-time scheduling of parallel operations [DiO92] [DiC94] .Hardware barrier synchronization was first proposed in apaper by Harry Jordon [Jor78] , and has since become apopular mechanism for coordination of MIMD ( MIMD refers toMultiple Instruction stream, Multiple Data stream; i.e.,each processor independently executes it own program,synchronizing and communicating with other processorswhenever the parallel algorithm requires. ) parallelprocesses. A barrier synchronization is accomplished byprocessors executing a wait operation that does notterminate until sometime after all PEs have signaled thatthey are waiting. However, while building the 16 processorPASM (PArtitionable Simd Mimd) prototype in 1987 [SiN87] , werealized that the hardware enabling a collection ofconventional processors to execute both MIMD andinstruction-level SIMD ( SIMD refers to Single Instructionstream, Multiple Data stream; i.e., a single processor withmultiple function units organized so that the same operationcan be performed simultaneously on multiple data values. )programs was actually an extended type of barriersynchronization mechanism. Generalizing this barriersynchronization mechanism resulted in several new classes ofbarrier synchronization architectures, as reported in [OKD90] [OKD90a] . Unlike other PAPERS implementations, theTTL_PAPERS barrier mechanism provides only a subset of thesefeatures; however, these features are sufficient to enableconventional processors to efficiently implement fine-grainMIMD, SIMD, and VLIW ( VLIW refers to Very Long InstructionWord; i.e., a generalization of SIMD that allows eachfunction unit to perform a potentially different operationon its own data. ) execution [CoD94a] .The TTL_PAPERS barrier mechanism also provides an effectivetarget for compile-time instruction-level code scheduling [DiO92] . 1.2. Communication In addition to high-performance barrier synchronization,TTL_PAPERS provides low-latency communication. Thismechanism is not equivalent to a shared memory nor aconventional message-passing system, but has severaladvantages for loosely synchronous communication.Basically, it trades asynchrony for lower latency and morepowerful static routing abilities.As a side-effect of a barrier synchronization, each PE canplace information into the TTL_PAPERS unit and getinformation from whichever processor, or group ofprocessors, is selected. Thus, the sender only outputsdata; it is up to the receiver to know where to look for thedata that the sender has made available to it. InTTL_PAPERS, we further simplify the hardware by extendingthis concept so that all participating processors must agreeon and contribute to the choice of what data each processorwill have available. Compared to conventional networks,this method allows less autonomy for each processor, butyields simpler hardware, better performance, and morepowerful access to global state.The most basic TTL_PAPERS communication operation is amulti-broadcast that sends to each processor a bit maskcontaining one bit from each processor. This is a verypowerful mechanism for examining the global state of theparallel machine. An obvious application is for processorsto "vote" on what they would like to do next. For example,it can be used to determine which processors are ready forthe next phase of a computation or which processors wouldlike to access a shared resource (e.g., to implement abalanced, conflict-free, schedule for access to anEthernet). However, because any communication pattern is asubset of multi-broadcast, it can also be used to implementgeneral communication "routing."In addition to one-bit multi-broadcast, some versions ofPAPERS provide the ability to get four bits of data from anyprocessor in a single operation. Using a unidirectionalCentronics printer port, four bits is the theoreticalmaximum number of data bits that can be obtained by one portinput operation. TTL_PAPERS implements four bit sends, andeven more functionality, using simple NAND logic to combinesignals from the processors. Operations directly supportedinclude:Since TTL_PAPERS data is literally a four-bit wide NAND ofdata sent across all processors, computing a four-bit globalNAND takes only one operation. Likewise, NAND can implementAND and OR functions by simply complementing the output orby complementing the inputs.To accomplish a four-bit broadcast from a single processor,all the other processors simply output 0xf -- four 1 bits.Again, the result will be the complement of the data sent.To accomplish a one-bit multi-broadcast, each processorsends four bits of data such that processor i's bit i is thedata it wishes to send, and the other bits are all 1 values.For example, bit 2 will be 1 for processors 0, 1, and 3;thus, NANDing these signals together will result in a signalthat is the complement of bit 2 from processor 2.In summary, TTL_PAPERS provides almost no facility forautonomous communications, but does provide a very richcollection of aggregate communication operations based onglobal state. 1.3. Interrupts To facilitate some level of asynchronous operation,TTL_PAPERS provides a separate interrupt broadcast facilityso that any processor can signal the others. Such aninterrupt does not really generate a hardware interrupt oneach processor, rather, it sets a flag that each processorcan read at an appropriate time. Although such a check canbe made at any time, the two most obvious times are:When a barrier wait has taken an unexpectedly long time.This would indicate that one or more processors might havegenerated an interrupt, freezing the barrier state so thatother processors would know to check for an interrupt.When the OS is invoked to perform a scheduling operation.Thus, gang scheduling and other OS functions can beimplemented by having the OS on each processor use theinterrupt facility to coordinate across processors.The TTL_PAPERS interrupt facility provides all the necessarylogic for generating an interrupt and providing a special"interrupt acknowledge barrier." 2. PC Hardware Although TTL_PAPERS provides very low latencysynchronization and communication, it is interfaced to PCsusing only a standard parallel printer port and isimplemented with a minimal amount of external hardware.This section details the PC hardware involved in use ofTTL_PAPERS.Throughout the following description, we will distinguishbetween stand-alone PCs and PCs used as processors within aparallel machine by referring to the later as "PEs" --processing elements. The design presented here directlysupports 4 PEs (PE0, PE1, PE2, and PE3), and we also detailhow the design can be scaled to 8, 16, or 32 PEs. In fact,no significant changes are needed to scale the design tothousands of processors (the hardware structure that needsto scale is significantly simpler than the barrier AND treeimplemented in the Cray T3D [Cra93] ). 2.1. PE Hardware Interface No changes are required to make standard PC hardware into aTTL_PAPERS PE. All that is needed is a standard parallelprinter port and an appropriate cable. Although some of thePCs on the market provide extended-functionality parallelports that allow 8-bit bidirectional data connections, manyPCs provide only an 8-bit data output connection. Like theoriginal PAPERS, TTL_PAPERS can be used with any PC; it usesonly the functions supported by a standard unidirectionalparallel port.But if there is no parallel input port, how does TTL_PAPERSget data into the PC? The answer lies in the fact that the8-bit data output port is accompanied by 5 bits of statusinput and 4 bits of open-collector input/output (which aresometimes implemented as output only) on two other portsassociated with the 8-bit data output port. The way we usethese lines, there are actually 12 bits of data output and 5bits of data input.All versions of both TTL_PAPERS and PAPERS use all 5available input lines. However, the various versions differin how many and which of output signals are used. Becausethe open-collector lines are generally not as well driven asthe 8-bit data output lines and require access to adifferent port address, we generally use the open-collectorlines only for signals that are modal, i.e., that changerelatively infrequently and can have large settling times.The version of TTL_PAPERS discussed here uses 11 of the 12available output lines, but only the 8-bit data output portis written in the process of normal interactions with theTTL_PAPERS hardware; the 3 other bits are used only forinterrupt handling. Table 1 summarizes the signalassignments for the PC parallel port as it is used for theTTL_PAPERS interface. ( 5K .ps.Z ) 2.2. PE Port Bit Assignments Although the parallel port hardware is not altered to workwith TTL_PAPERS, the parallel port lines are not used asthey would be for driving a Centronics-compatible printer.Thus, it is necessary to replace the standard parallel portdriver software with a driver designed to interact withTTL_PAPERS. Toward this end, it is critical to understandwhich port addresses, and bits within the port registers,correspond to each TTL_PAPERS signal.There are three port registers associated with a PC parallelport. These registers have I/O addresses corresponding tothe port base address (henceforth, called P_PORTBASE) plus0, 1, or 2. Typically, P_PORTBASE will be one of 0x378,0x278, or 0x3bc, corresponding to MS-DOS printer namesLPT1:, LPT2:, and LPT3:. As a general rule, most PCs use0x378 for the built-in port, however, IBM PCs generally use0x3bc. Workstations based on processors other than the 386,486, or Pentium, e.g., DEC Alphas, also generally use 0x3bc;however, most of these processors map port I/O registersinto memory addresses, so the port operations must bereplaced with accesses to the memory locations thatcorrespond to the specified port register I/O addresses.For example, the IBM PowerPC specification places I/Oaddress 0x3bc at physical memory location 0x800003bc.The bits within the register P_PORTBASE+0 are used to sendTTL_PAPERS both strobe and data values, as well ascontrolling a bi-color LED (red/green Light Emitting Diode)on the PAPERS front panel. If a PC wants to mark itself asnot participating in a group of barrier synchronizations, itshould simply output 0xcf; this corresponds to setting P_S1,P_S0, P_D3, P_D2, P_D1, and P_D0 all equal to 1. Noticethat, if a PC is not connected to one of the TTL_PAPERScables, the TTL inputs will all float high, causing themissing PC to be harmlessly ignored in operations performedby the PCs still connected. In contrast, setting both P_S1and P_S0 to 0 will ensure that all barrier operations halt.In normal operation, each TTL_PAPERS operation is triggeredby toggling the P_S1 and P_S0 lines between (P_S1=1, P_S0=0)and (P_S1=0, P_S0=1); this can be done by simply exclusive-oring the previous output byte with (P_S1 | P_S0).A PC sends data to TTL_PAPERS by setting the output nybblebits appropriately and toggling the strobe lines asdescribed above. Performing this operation as two steps,first changing data bits then changing the strobe bits, canbe done to increase the reliability of transmission. Datalines should be given time to settle before a new readysignal is derived from the new strobe signals. Changingstrobes and data simultaneously results in a race conditionin which the data bits have only about 20 nanoseconds "headstart" -- a small enough margin for many systems to performunreliably.The P_LR and P_LG lines are simply used to control a bi-color LED to indicate the status of the PC relative to thecurrently executing parallel program. When TTL_PAPERS isnot in use, both bits should be set to 0, yielding a darkLED. When a parallel program is running, the LED should belighted green, which is accomplished by making P_LG=1 andP_LR=0. When a PC is waiting for a barrier, it should makeits LED red by setting P_LG=0 and P_LR=1. It is alsopossible to generate an orange status light by setting bothP_LG and P_LR to 1, however, this setting is used onlyrarely (as a "special" status indication).The second port register, P_PORTBASE + 1, is used to receivedata from TTL_PAPERS. To enhance the portability ofTTL_PAPERS to somewhat non-standard parallel printer ports,only these five bits are used as input: four bits of dataand one bit to act as a ready line. Because 0x40 is theonly bit that can be enabled to generate a true interrupt tothe PC, earlier versions of PAPERS made P_RDY use 0x40 sothat the PAPERS unit could generate a true hardwareinterrupt when a barrier synchronization had completed.However, this led to an inconvenient order for the bits ofthe input nybble, and we never found a good use for the truehardware interrupt (because interrupt handlers have too muchlatency), so the current arrangement makes P_RDY use 0x80.The new arrangement is superior not only because it keepsthe bits of the input nybble contiguous, but also becausethe inversion of P_RDY is harmless, whereas the inversion ofan input data line would require extra hardware or softwareto flip the inverted data bit (e.g., exclusive or with0x80).Note also that P_RDY is a toggling ready signal. Theoriginal PAPERS unit used software to "reset" the readysignal after each barrier synchronization had been achieved,thus requiring four port operations for each PAPERSsynchronization. By simply toggling P_RDY when each newbarrier is achieved, TTL_PAPERS can perform barriersynchronizations using just two port operations.The third port register, P_PORTBASE + 2, serves only onepurpose for TTL_PAPERS: parallel interrupt support.Actually, for the reasons described earlier, TTL_PAPERSnever generates a "real" interrupt to a PC. However,parallel interrupts provide a mechanism for managing the useof the TTL_PAPERS unit in a more sophisticated way, forexample: providing a better TTL_PAPERS "check-in" procedure,facilitating abnormal termination of parallel programs,implementing a user-level parallel file system, and evengang scheduling and parallel timesharing of the TTL_PAPERSunit.To cause a parallel interrupt, a PC simply sets P_IRQ to 1.However, other processors will not notice that a parallelinterrupt is pending unless they explicitly check. This isdone by changing P_SEL to 1, which causes the normal P_RDY(described above) to be replaced by an interrupt readyflag... until P_SEL is again set to 0. Thus, any PC cancheck for an interrupt at any time without interfering withthe operation of other PCs; for example, while delayedwaiting for a barrier, it is essentially harmless to checkfor an interrupt. To encourage PCs to check for aninterrupt, the interrupting PC can set its P_S1 and P_S0bits to 0 (see above), forcing barriers to be delayed. Whenall PCs set their P_NAK to 0, this simply acts to perform aspecial interrupt barrier. The extended interruptfunctionality is implemented by sending an interrupt code asa side-effect of this special barrier. 3. TTL_PAPERS Hardware Thus far, this document has focused on the way in which PChardware interacts with TTL_PAPERS. In this section, webriefly describe the hardware that implements TTL_PAPERSitself. The TTL_PAPERS design has been carefully minimizedto use just 8 standard TTL 74LS-series parts and to fit on asingle-layer circuit board (shown in Figure 1). The resultis a remarkably simple design that is inexpensive to build,yet fast to operate. The logic design for TTL_PAPERS islogically (and physically) divided into three subsystems:the barrier and interrupt mechanism, the aggregatecommunication logic, and the LED display control. Thecomplete circuit diagram is given in Figure 2. This sectionbriefly explains how the required functionality isimplemented by the board's logic. ( 21K .ps.Z ) ( 85K .ps.Z ) 3.1. Barrier/Interrupt Hardware Although the DBM (Dynamic Barrier MIMD) architecturepresented in [CoD94] [CoD94b] is far superior to theoriginal DBM design as presented in [OKD90a] , neither one issimple enough to be efficiently implemented without usingprogrammable logic devices. Thus, TTL_PAPERS uses avariation on the SBM (Static Barrier MIMD) design of [OKD90] . The primary difference between the previouslypublished SBM and the TTL_PAPERS mechanism is that there aretwo barrier trees rather than one. The reason is simplythat the published SBM silently assumed that the barrierhardware would be reset between barriers, essentially by an"anti barrier." In contrast, the use of two trees allows thehardware for one tree to be reset as a side-effect of theother tree being used, halving the number of operationsneeded per barrier. Both of these two trees are triviallyimplemented using a 74LS20, a dual 4-input NAND. The resultis latched by setting or resetting a 1-bit registerimplemented using 1/2 of a 74LS74.The interrupt logic is remarkably similar to the barrierlogic, also using a 74LS20 and 1/2 of a 74LS74. However,there is a difference in the connection between these chips.Interrupt requests are generated by any PE, and acknowledgedby all PEs, while barrier synchronizations are alwaysimplemented by all PEs. Thus, to invert the sense of theinterrupt request output from the NAND, the interrupt logicuses the simple trick of clocking the 1-bit register ratherthan setting it asynchronously. This trick saves a chip andactually yields a slightly more robust interrupt mechanism,because the interrupt is triggered by an edge rather than bya level.Finally, because there are not enough input bits for eachPE, the above two latched values must be independentlyselectable for each PE. The obvious way to select betweenvalues is using a multiplexor; however, there is no standardTTL 74LS-series part that implements selection between twoindividual bits. Instead, we construct the multiplexorusing two quad driver chips which have outputs that can beindependently set to a high-impedance state. The 74LS125and 74LS126 differ only in the sense of their enable lines;thus, using the same select line to control one output oneach chip makes it possible to simply wire both outputstogether. Only the selected signal will be passed; theother line's driver will be in the high impedance state.Notice also that this logic trivially can be scaled tolarger machines. Every 4 PEs will require a 74LS125 and a74LS126. The 74LS74 remains unchanged, although a driverchip may be needed to increase fan-out of the outputs formore than 8 processors. The 74LS20 NAND gates simply getreplaced by larger NAND trees. For example, for 8processors, each 1/2 of a 74LS20 is replaced by a 74LS30 (8input NAND). For 16 processors, chip count is minimized byimplementing each NAND tree using a 74LS134 (12-input NAND)and 1/2 of a 74LS20, the outputs of which are combined using1/4 of a 74LS32 (quad 2-input OR). A somewhat neater 16processor board layout results from using two 74LS30 and 1/4of a 74LS32 for each of the NAND trees, and a 32 processorsystem can easily be constructed using four 74LS30 and 3/4of a 74LS32 for each NAND tree.Although the circuitry scales to very large configurations,for a cluster containing more than 32 machines, the circuitcomplexity is high enough to warrant using a moresophisticated design based on programmable logic components. 3.2. Aggregate Communication Hardware Although other versions of PAPERS provide internal datalatching and fancy communication primitives, TTL_PAPERSsimply offers NANDing of the 4 data bits across theprocessors.In general, there is nothing tricky about building theseNAND trees; they look exactly as described above. However,each NAND tree has an output that must drive one line toeach of the processors, and the use of relatively longcables (up to about 10 feet each) requires a pretty good TTLdriver. For the case of 4 processors, we haveexperimentally confirmed that the 74LS40 (dual 4-input NANDbuffer) provides sufficient drive to directly power all 4lines, although the signal transitions become somewhat slow(e.g., about 300 nanoseconds) as the cables get long. In alarger system, we recommend constructing the NAND tree asdescribed above, and then using 74LS244 or 74LS541 octaldrivers to increase the drive ability of the NAND outputs.With a typical cable, each driver within a 74LS244 or74LS541 can drive up to about 4 lines. 3.3. LED Display Hardware In the PAPERS prototypes, we have experimented with avariety of different status displays. However, by far theeasiest display to understand was one using just a singlebi-color LED for each processor. The color code is veryintuitive: green means running, red means waiting (for morethan two "clock" cycles), and black means not in use. Theproblem is that there is no trivial way to derive thesecolor choices directly from the barrier logic, thus, theLEDs are explicitly set under software control. There isalso a power light on the TTL_PAPERS unit, which is a blueLED to avoid confusion with the rest of the status display. 4. PAPERS Software Although TTL_PAPERS will be supported by a variety ofsoftware tools including public domain compilers forparallel dialects of both C and Fortran [DiO92] [CoD94a] , inthis document we restrict our discussion to the most basichardware-level interface. The code given is written in C(the ANSI C-based dialect accepted by GCC) and is intendedto be run under a UNIX-derived operating system. However,this interface software can be adapted to most existing(sequential) language compilers and interpreters undernearly any operating system.The following sections discuss the operating systeminterface, TTL_PAPERS port access, the basic barrierinterface, and how NANDing is used to implement datacommunication. 4.1. Operating System Interface Although it would certainly be possible to implement theTTL_PAPERS software interface as part of an operatingsystem's kernel, typical latency for a minimal system callis between 5 and 50 times the typical latency for theTTL_PAPERS hardware port operations -- layering overhead ofa system call for each TTL_PAPERS operation would destroythe low latency performance. Thus, the primary purpose ofthe OS interface is to obtain direct user-level access tothe ports that connect to TTL_PAPERS.On older architectures, such as the 8080 and Z80, directuser access to I/O was accomplished by simply using the portI/O instructions or by accessing the memory locations thatcorresponded to the desired memory-mapped I/O deviceregisters. Things are now a bit more complex. Using a PCbased on the 386, 486, or Pentium processor, port I/Oinstructions are now protected operations. Similarly, theDEC Alpha, IBM PowerPC, and Sun SPARC use memory mapped I/O,but physical addresses corresponding to I/O ports generallyare not mapped into the virtual address space of a userprocess. None of these problems is fatal, but it can takequite a while to figure out how to get things working right.Thus, although the parallel printer port can be directlyaccessed under most operating systems, here we focus on howto gain direct user process access to I/O ports on a 386,486, or Pentium-based personal computer running eithergeneric UNIX or Linux. 4.1.1. Generic UNIX In general, UNIX allows user processes to have direct accessto all I/O devices. However, only processes that have asufficiently high I/O priority level can make such accesses.Further, only a privileged process can increase its I/Opriority level -- by calling iopl(). The following C codesuffices:if (iopl(3)) { /* iopl failed, implying we were not priv */ exit(1);}However, this call grants the user program access to allI/O, including a multitude of unrelated ports.In fact, this call allows the process to executeinstructions enabling and disabling interrupts. Bydisabling interrupts, it is possible to ensure that allprocessors involved in a barrier synchronization actprecisely in unison; thus, the average number of portoperations (barrier synchronizations) needed to accomplishsome TTL_PAPERS operations can be reduced. However,background scheduling of DMA devices (e.g., disks) and otherinterference makes it hard to be sure that a UNIX systemwill provide precise timing constraints even when interruptsare disabled, so we do not advocate disabling interrupts.Even so, performance of the barrier hardware can be safelyimproved by causing UNIX to give priority to a process thatis waiting for a barrier synchronization. This improvesperformance because if any one PE is off running a processthat has nothing to do with the synchronization, then allPEs trying to synchronize with that PE will be delayed. Thepriority of a privileged UNIX process can increased by acall to nice() with a negative argument between -20 and -1. 4.1.2. Linux Although Linux supports the UNIX interface described in theprevious section, it also provides a more secure way toobtain access to the I/O devices. The ioperm() functionallows a privileged process to obtain access to only thespecified port or ports. The C code:if (ioperm(P_PORTBASE, 3, 1)) { /* like iopl, failure implies we were not priv */ exit(1);}Obtains access for 3 ports starting at a base port addressof P_PORTBASE. Better still, if the operating system ismanaging all parallel program interrupts, only the first twoports need to be accessible:if (ioperm(P_PORTBASE, 2, 1)) { /* like iopl, failure implies we were not priv */ exit(1);}Because the 386/486/Pentium hardware checks portpermissions, this security does not destroy port I/Operformance; however, checking the permission bits does addsome overhead. For a typical PC parallel printer port, theadditional overhead is just a few percent, and is probablyworthwhile for user programs. 4.2. Port Access Although Linux and most versions of UNIX provide routinesfor port access, these routines often provide a built-indelay loop to ensure that port states do not change fasterthan the external device can examine the state.Consequently, the TTL_PAPERS support code uses its owndirect assembly language I/O calls. The code is:inline unsigned intinb(unsigned short port){ unsigned char _v;__asm__ __volatile__ ("inb %w1,%b0" :"=a" (_v) :"d" (port), "0" (0)); return(_v);}inline voidoutb(unsigned char value,unsigned short port){__asm__ __volatile__ ("outb %b0,%w1" : /* no outputs */ :"a" (value), "d" (port));}The basic TTL_PAPERS interface is thus defined in terms ofthe above port operations on any of three parallel printerport interface registers. The following definitions assumethat P_PORTBASE has already been set to the base I/O addressfor the port, which is generally 0x378 on PC clones and0x3bc for IBM ValuePoint PCs./* To output to P_PORTBASE */#define P_OUT(x) \ outb(((unsigned char)(x)), \ ((unsigned short) P_PORTBASE))/* To input from P_PORTBASE+1 */#define P_IN() \ inb((unsigned short) (P_PORTBASE + 1))/* To output to P_PORTBASE+2 */#define P_MODE(x) \ outb(((unsigned char)(x)), \ ((unsigned short) (P_PORTBASE + 2)))It is important to note that, even though the TTL_PAPERShardware interrupt mechanism does not have to be used, it isnecessary that any processor connected to TTL_PAPERS set themode port so that the P_RDY bit, and not P_INT, is visible.This can be done by executing:P_MODE(P_NAK);As part of the TTL_PAPERS initialization code. 4.3. Barrier Interface Logically, each barrier synchronization consists of twooperations:signaling that we are at a barrier andwaiting to be signaled that the barrier synchronization hascompleted.Although the TTL_PAPERS library generally combines theseoperations, here we discuss them as two separate chunks ofcode. The code for signaling that we are at a barrier issimply:P_OUT(last_out ^= (P_S0 | P_S1));This code just flips the sense of both strobe bits. Becauselast_out is initialized to have only one of the strobe bitshigh, this has the effect of alternating between P_S0 andP_S1. Nothing else is changed, including the output databits.The code that actually waits for the barrier synchronizationto complete is larger than one might expect, even thoughtypically only a few instructions are executed. The reasonthat there is so much code has to do with three features ofthe TTL_PAPERS interface. The first complication is thatthe ready signal toggles, rather than always going to thesame value. The second complication is that the status LEDsare software controlled. The third complication is that wewant to check for an interrupt whenever a barrier icessively delayed. The resulting code is something like:/* Which condition am I waiting for? */if (last_out & P_S0) { /* Waiting for P_RDY */ if ((!(P_IN() & P_RDY)) && (!(P_IN() & P_RDY))) { /* Polled twice, make LED red */ P_OUT(last_out ^= (P_LG | P_LR)); /* Continue waiting */ while (!(P_IN() & P_RDY)) CHECKINT; /* Ok, LED green again */ P_OUT(last_out ^= (P_LG | P_LR)); } } else { /* Waiting for not P_RDY */ if ((P_IN() & P_RDY) && (P_IN() & P_RDY)) { /* Polled twice, make LED red */ P_OUT(last_out ^= (P_LG | P_LR)); /* Continue waiting */ while (P_IN() & P_RDY) CHECKINT; /* Ok, LED green again */ P_OUT(last_out ^= (P_LG | P_LR)); }}In the initial version of the support library, CHECKINT isnothing -- TTL_PAPERS interrupts are not used. However, wecan easily check for an interrupt by defining CHECKINT as:{ /* Make P_INT status visible */ P_MODE(P_SEL | P_NAK); /* Check for interrupt */ if (P_IN() & P_INT) { /* Process the interrupt.... */ } else { /* Restore P_RDY */ P_MODE(P_NAK); }} 4.4. NAND Data Communication Although the TTL_PAPERS library provides a rich array ofaggregate communication operations, all that the hardwarereally does is a simple 4-bit NAND as a side-effect of abarrier synchronization. However, this operation typicallyrequires 5 port accesses rather than just the 2 portaccesses used to implement a barrier synchronization withoutdata communication. The extra port operations are requiredbecause:When a PE changes one or more of its output data bits, theTTL_PAPERS hardware NANDs in the new data, immediatelychanging the input data bits for all PEs. Thus, if one PEgets far enough ahead of another PE, it could change thedata before the slower PE has been able to read the previousNAND result. If interrupts are disabled and an initialordinary barrier synchronization is performed, then a blockof data can be transmitted safely using static timinganalysis alone to ensure that this race condition does notoccur. Unfortunately, it isn't practical to disableinterrupts under UNIX, so the safe solution is to followeach data transmitting barrier with an ordinary barrier thatensures all PEs have read the data before any PE can changethe data. This adds 2 port operations.The TTL_PAPERS hardware is fed both data and strobe signalson the same port. Thus, data and strobe signals changesimultaneously. As they race through the cables to theTTL_PAPERS logic and back out the cables, the TTL_PAPERSlogic should give the NAND data at least a 20 nanosecondlead over the toggling of the P_RDY bit, but is 20nanoseconds enough to ensure that the NAND data doesn't losethe race? Well, it depends on the implementation of the PCport hardware and the electrical properties of the cables...which is another way of saying that the race isn'tacceptable. An additional port operation is used to ensurethat the data wins the race. There are two possible ways to insert the extraoperation. One way is to double the output operation, sothat we first change the data and then toggle the strobe.This has the advantage that only PEs that are actuallychanging their data would need to insert the extraoperation. If only early PEs change their data, we don'tsee any delay; however, the asymmetry of the PE operationscan actually result in more delay than if the extraoperation was always inserted. The other alternative is toresample the NAND data value after P_RDY has signaled it ispresent. This is somewhat more reliable than the doublingof the output operation, but the delay is always present forany PE that reads the data.In any case, the result is that a barrier synchronizationaccompanied by an aggregate communication will take 5 portoperations. For example, to perform a reliable NAND withour contribution being the 4-bit value x, we could first:last_out = ((last_out & 0xf0) | x);This sets the new data bits so that a barriersynchronization will transmit them along with the strobe.The next step is to perform a barrier synchronization,precisely as described in section 4.3. Having completed thebarrier, we know that the NAND data should now be valid, sowe resample it:nand_result = P_IN();Finally, a second barrier synchronization is performed toensure that no PE changes its output data until aftereveryone has read the current NAND data.Notice that nand_result is actually an 8-bit value with theNAND data embedded within it; thus, to extract the 4-bitresult we simply use:((nand_result >> 3) & 0x0f)All the TTL_PAPERS library communication operations arebuilt upon this communication mechanism. For example, ANY,ALL, and voting operations require just one such operation,or 5 port accesses. Larger operations, such as an 8-bitglobal OR, 8-bit global AND, or broadcast require 2 of theabove transmission sequences, or 10 port accesses. 5. Performance The performance of the basic TTL_PAPERS operations issummarized in the table below. These numbers were obtainedusing 486DX33-based IBM ValuePoint PCs running Linux version1.1.75. Machines with faster printer ports obtaincorrespondingly higher performance; in fact, performance canbe increased by nearly an order of magnitude without anyother changes. ( 4K .ps.Z )The primary observation is that the total UNIX-process toUNIX-process latency is remarkably low. For example, webenchmarked a 64-processor nCUBE2 at 206 milliseconds for asimple barrier synchronization -- 83,000 times slower thanthe TTL_PAPERS cluster! In fact, the minimum latency forsending a single message on the nCUBE2 was reported to bebetween 32 and 110 microseconds [BrG94] , which is between 13and 44 times as long as it takes for a TTL_PAPERS barriersynchronization. The same paper [BrG94] quotes that theIntel Paragon XP/S has a minimum message latency of 240microseconds, which is 97 times the TTL_PAPERS barrierlatency. Of course, any conventional workstation networkwill have a minimum message latency of at least hundreds ofmicroseconds, simply due to the overhead of UNIX contextswitching and the socket protocol.The second observation is that the bandwidth is not veryimpressive. However, bandwidth for TTL_PAPERS isn't reallymeasuring the same thing as bandwidth for a conventionalnetwork. If you want to perform asynchronous blocktransfers, TTL_PAPERS cannot compete with a conventionalnetwork. However, the bandwidth for the aggregateoperations (listed above) is much higher for TTL_PAPERS thanfor nearly any other network attempting to synthesize theseoperations. In summary, if a program needs to perform anaggregate operation, use TTL_PAPERS. If it needs to do someother type of communication, use the conventional network --having TTL_PAPERS on a cluster does not interfere with anyother network's operation. 6. Conclusion In this paper, we have detailed the design of the TTL_PAPERShardware. This public domain design represents the simplestpossible mechanism to efficiently support barriersynchronization, aggregate communication, and groupinterrupt capabilities -- using unmodified conventionalworkstations or personal computers as the processingelements of a fine-grain parallel machine. We do not viewTTL_PAPERS as the ultimate mechanism, but rather as anintroductory step toward the more general and higherperformance barrier and aggregate communication engines thathave been at the core of our research since 1987. The keything to remember about TTL_PAPERS is not what it is, butrather why it is.Why is TTL_PAPERS so much lower latency than other networks?Because it doesn't have a layered hardware and softwareinterface. Why is the hardware so simple? Because it isn'ta network; the fact that TTL_PAPERS communications are aside-effect of barrier synchronization eliminates the needfor buffering, routing, arbitration, etc. Why is it useful?Because, although shared memory and message passing hardwareis very common, the most popular high-level language andcompiler models for parallelism are all based on aggregateoperations -- exactly what TTL_PAPERS provides. Further,barrier synchronization is the key to efficientlyimplementing MIMD, SIMD, and VLIW mixed-mode execution. Whydidn't somebody do it earlier? We and others did. Theproblem is that tightly coupled design of hardware andcompiler is not the standard way to build systems, soearlier designs (e.g., PASM, TMC CM-5) tended to use toomuch hardware and interface software, cost too much, andperform too poorly.Higher-performance versions of PAPERS are on the way. Infact, this TTL_PAPERS design is the sixth PAPERS design wehave built and tested, and three of the other designs easilyoutperform it... but they use much more hardware. We arecurrently working on a smart parallel port card for the ISAbus that will roughly quadruple the performance ofTTL_PAPERS without any change to its hardware. We are alsopursuing versions of the higher-performance designs based onTexas Instruments FPGAs, and anticipate a PCI interface tofuture PAPERS units. Also watch for releases of variouscompilers targeting PAPERS. The WWW URLhttp://garage.ecn.purdue.edu/~papers/ will be the primaryplace for announcing and distributing future releases ofboth hardware and software. References [BrG94] U. Bruening, W. K. Giloi, and W. Schroeder-Preikschat,"Latency Hiding in Message-Passing Architectures," 8thInternational Parallel Processing Symposium, Cancun, Mexico,April, 1994, pp. 704-709. [CoD94] W. E. Cohen, H. G. Dietz, and J. B. Sponaugle, DynamicBarrier Architecture For Multi-Mode Fine-Grain ParallelismUsing Conventional Processors; Part I: BarrierArchitecture, Purdue University School of ElectricalEngineering, Technical Report TR-EE 94-9, March 1994. [CoD94a] W. E. Cohen, H. G. Dietz, and J. B. Sponaugle, DynamicBarrier Architecture For Multi-Mode Fine-Grain ParallelismUsing Conventional Processors; Part II: Mode Emulation,Purdue University School of Electrical Engineering,Technical Report TR-EE 94-10, March 1994. [CoD94b] W. E. Cohen, H. G. Dietz, and J. B. Sponaugle, "DynamicBarrier Architecture For Multi-Mode Fine-Grain ParallelismUsing Conventional Processors," Proc. of 1994 Int'l Conf. onParallel Processing, St. Charles, IL, pp. I 93-96, August1994. [Cra93] Cray T3D System Architecture Overview, Publication HR-04033,Cray Research, Inc., 2360 Pilot Knob Road, Mendota Heights,MN 55120, 1993. [DiC94] H. G. Dietz, W. E. Cohen, T. Muhammad, and T. I. Mattox,"Compiler Techniques For Fine-Grain Execution On WorkstationClusters Using PAPERS," 7th Annual Workshop on Languages andCompilers for Parallel Computing (also to appear as a bookchapter from Springer Verlag), Cornell University, August1994. [DiM94] H. G. Dietz, T. Muhammad, J. B. Sponaugle, and T. Mattox,PAPERS: Purdue's Adapter for Parallel Execution and RapidSynchronization, Purdue University School of ElectricalEngineering, Technical Report TR-EE 94-11, March 1994. [DiO92] H. G. Dietz, M.T. O'Keefe, and A. Zaafrani, "StaticScheduling for Barrier MIMD Architectures," The Journal ofSupercomputing, vol. 5, pp. 263-289, 1992. [Jor78] H. F. Jordon, "A Special Purpose Architecture for FiniteElement Analysis," Proc. Int'l Conf. on Parallel Processing,pp. 263-266, 1978. [OKD90] M. T. O'Keefe and H. G. Dietz, "Hardware barriersynchronization: static barrier MIMD (SBM)," Proc. of 1990Int'l Conf. on Parallel Processing, St. Charles, IL, pp. I35-42, August 1990. [OKD90a] M. T. O'Keefe and H. G. Dietz, "Hardware barriersynchronization: dynamic barrier MIMD (DBM)," Proc. of 1990Int'l Conf. on Parallel Processing, St. Charles, IL, pp. I43-46, August 1990. [SiN87] T. Schwederski, W. G. Nation, H. J. Siegel, and D. G. Meyer,"The Implementation of the PASM Prototype ControlHierarchy," Proc. of Second Int'l Conf. on Supercomputing,pp. I 418-427, 1987. Hypertext Index Abstract Keywords Notice for HTML Users 1. Introduction 1.1. Synchronization 1.2. Communication 1.3. Interrupts 2. PC Hardware 2.1. PE Hardware Interface 2.2. PE Port Bit Assignments 3. TTL_PAPERS Hardware 3.1. Barrier/Interrupt Hardware 3.2. Aggregate Communication Hardware 3.3. LED Display Hardware 4. PAPERS Software 4.1. Operating System Interface 4.1.1. Generic UNIX 4.1.2. Linux 4.2. Port Access 4.3. Barrier Interface 4.4. NAND Data Communication 5. Performance 6. Conclusion References Hypertext Index The only thing set in stone is our name. |
|