The only incremental hardware requirements for a true symmetric multiprocessing environment are the additional CPUs, as shown in Figure 3.
Because the SMP hardware transparently maintains a coherent view of the data distributed among the processors, software programs incur no additional overhead to preserve that coherency.
SMP is more than just a multiple-processor and memory-system design: the NOS can schedule pending code for execution on the first available CPU. This is accomplished by using a self-correcting feedback loop that ensures all CPUs have an equal opportunity to execute a specified amount of code; hence the term symmetric multiprocessing.
An SMP operating system runs only one queue: all work is dumped into a common funnel, and commands are distributed to multiple CPUs on the basis of availability, in a symmetric fashion. Although SMP can potentially use CPU resources more efficiently than asymmetric multiprocessing architectures, poor programming can nullify this potential.
Therefore, process scheduling for SMP requires a fairly complex set of algorithms, along with synchronization devices such as spin locks, semaphores, and handles. In using all of the available CPU resources, SMP pays a price in scheduling overhead and complexity, resulting in less-than-perfect scaling.
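A minimal sketch of the idea in C (using C11 atomics; the structure and function names are illustrative, not any particular NOS's implementation): a single ready queue, protected by one spin lock, from which whichever CPU becomes free takes the next pending task.

```c
#include <stdatomic.h>
#include <stddef.h>

struct task {
    struct task *next;
    void (*entry)(void);               /* the code this task will execute */
};

/* A single ready queue: the "common funnel" into which all work is dumped. */
static struct task *ready_head;
static atomic_flag queue_lock = ATOMIC_FLAG_INIT;   /* simple spin lock */

/* Called by whichever CPU becomes available; any CPU may take any task. */
struct task *pick_next_task(void) {
    while (atomic_flag_test_and_set(&queue_lock))
        ;                              /* spin until the queue lock is free */
    struct task *t = ready_head;
    if (t)
        ready_head = t->next;          /* pop the next pending task */
    atomic_flag_clear(&queue_lock);
    return t;                          /* NULL means nothing is pending */
}
```

Because any CPU may take any task, no processor is favored, which is the symmetry the term refers to.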
A system with two CPUs therefore cannot double the overall performance of a single processor. With the added overhead of each additional processor, efficiency goes down, more so when multiple nodes are involved, as shown in Figure 3.
Although the overall goal of both SMP derivatives is speed, each format solves a number of additional problems. Recall how clustering links individual computers, or nodes, to provide parallel processing within a robust, scalable, and fault-tolerant system. These clustered nodes communicate using high-speed connections to give a single application access to all CPUs, disks, and shared memory.
The clustering applications are distributed among multiple nodes, so that when one machine in the cluster fails, its applications are moved to one of the remaining nodes. The downside of pure clustering involves the overhead required to coordinate CPU and disk drive access from one machine to the next. This overhead prevents clusters from scaling as efficiently as pure SMP, which maintains closer ties between the processors, memory, and disks.
System management is greatly simplified with clustering, however, because another node can simply be added to the cluster for additional processing power. Massively parallel processing (MPP), where tightly coupled processors reside within one machine, provides scalability far beyond that of basic SMP. Basic SMP suffers from diminishing returns as more processors are added to the mix, whereas MPP performance increases almost uniformly as more processors are introduced.
In the MPP arrangement, the processors do not share a single logical memory image and, therefore, do not incur the overhead required to maintain memory coherency or to move processes between processors. Special applications are needed to tap the power of MPP systems, however: they must make heavy use of threads and be able to partition the workload among many independent processors.
The application must then reassemble the results from each processor. Consider Figure 3. Early MPP systems were built as single computers, with many processors. A recent trend in computer research, however, is to tap the power of individual machines to build large distributed MPP networks.
These projects centrally manage and distribute the processing load across multiple machines around the world, and collect the results to build a final answer. Because multiprocessing targets database and transactional systems, its typical deployment occurs on server boards carrying 4 to 8 processors. Often, high-end systems will feature up to 32 processors working together to scale the workload.
Using multiple threads with multiple transactions, multiprocessing systems manage large, data-intensive workloads that do not involve demanding mathematical calculations. The transaction comparison in Table 3 illustrates this; the advantage shown there is attributed to the level 3 caches utilized in the two MP processors.
The MP processor is designed for use with databases much larger than the one used in Table 3. Even so, with a larger database to handle, the two slower MP processors complete a greater number of transactions than the two DP processors running at a higher clock speed, and when four MP processors are used, that scalability is maintained. SMP technology is ideal for network operating systems that run many different processes in parallel, including multithreaded server systems used for storage area networking or online transaction processing.
Multiprocessing systems deal with four types of problems associated with control processes or with the transmission of the message packets used to synchronize events between processors. The ways in which these problems arise become easier to understand by considering how a simple message is interpreted within such an architecture.
If a message-passing protocol sends packets from one of the processors along the data chain linking the others, each individual processor must interpret the message header and then pass the packet along if it is not the intended recipient. Plainly, latency and skew increase with each pause for interpretation and retransmission. When additional processors are added to the system chain (scaling out), determinism is adversely affected.
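A sketch of that interpret-and-forward step, in C; the packet layout and helper functions are hypothetical stand-ins for the real interconnect layer.

```c
#include <stdint.h>
#include <stdio.h>

struct packet {
    uint8_t dest_cpu;           /* header field: the intended processor */
    uint8_t payload[64];
};

/* Stubs standing in for the real interconnect layer. */
static void deliver_locally(const struct packet *p)     { (void)p; puts("consumed"); }
static void forward_to_next_cpu(const struct packet *p) { (void)p; puts("forwarded"); }

/* Each processor in the chain runs this for every incoming packet: it
 * pauses to interpret the header, then passes the packet along if it is
 * not the intended recipient. Every such hop adds latency and skew. */
void on_packet(uint8_t my_cpu_id, const struct packet *p) {
    if (p->dest_cpu == my_cpu_id)
        deliver_locally(p);
    else
        forward_to_next_cpu(p);
}
```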
Circuit boards designed for multiprocessing systems therefore require a custom hardware implementation in which a dedicated bus, separate from the general-purpose data bus, is provided for command-and-control functions. In this way, determinism can be maintained regardless of the scaling size. In multiprocessing systems that use a simple locking kernel design, it is possible for two CPUs to enter a test-and-set loop on the same flag simultaneously.
These two processors can continue to spin forever, with the read of each causing the store of the other to fail.
To prevent this, a different latency must be purposely introduced for each processor. To provide good scaling, SMP operating systems are given separate locks for different kernel subsystems, a solution called fine-grained kernel locking. It is designed to allow individual kernel subsystems to run as separate threads on different processors simultaneously, permitting a greater degree of parallelism among application tasks. A fine-grained kernel-locking architecture must permit tasks running on different processors to execute kernel-mode operations simultaneously, producing a threaded kernel.
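The staggered-latency remedy can be sketched with C11 atomics; the back-off constant and the cpu_id parameter are illustrative rather than taken from any real kernel.

```c
#include <stdatomic.h>

static atomic_int lock_flag = 0;      /* 0 = free, 1 = held */

/* Crude calibrated busy-wait; a real kernel would use a tuned delay. */
static void short_delay(int units) {
    for (volatile int i = 0; i < units * 100; i++)
        ;
}

/* Test-and-set acquire loop. Giving each CPU a different back-off,
 * here proportional to its ID, keeps two CPUs from retrying in
 * lockstep against the same flag. */
void spin_lock_acquire(int cpu_id) {
    while (atomic_exchange(&lock_flag, 1) != 0)
        short_delay(cpu_id + 1);      /* purposely different latency per CPU */
}

void spin_lock_release(void) {
    atomic_store(&lock_flag, 0);
}
```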
If different kernel subsystems can be assigned separate spin locks, tasks trying to access these individual subsystems can run concurrently. The quality of the locking mechanism, called its granularity, determines the maximum number of kernel threads that can be run concurrently. If the NOS is designed with independent kernel services for the core scheduler and the file system, two different spin locks can be utilized to protect these two subsystems.
Having separate spin locks assigned for these two subsystems would result in a threaded kernel. Both tasks can execute in kernel mode at the same time, as shown in Figure 3.
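A minimal illustration, assuming POSIX threads in user space rather than an actual kernel: two separate spin locks, one for the scheduler subsystem and one for the file system, let the two tasks proceed at the same time because they never contend for the same lock.

```c
#include <pthread.h>
#include <stdio.h>

/* One lock per kernel subsystem: this granularity decides how many
 * kernel threads can run concurrently. */
static pthread_spinlock_t sched_lock;
static pthread_spinlock_t fs_lock;

static void *task2(void *arg) {          /* needs only the scheduler */
    (void)arg;
    pthread_spin_lock(&sched_lock);
    puts("Task 2: scheduler work");
    pthread_spin_unlock(&sched_lock);
    return NULL;
}

static void *task3(void *arg) {          /* needs only the file system */
    (void)arg;
    pthread_spin_lock(&fs_lock);
    puts("Task 3: file-system work");
    pthread_spin_unlock(&fs_lock);
    return NULL;
}

int main(void) {
    pthread_spin_init(&sched_lock, PTHREAD_PROCESS_PRIVATE);
    pthread_spin_init(&fs_lock, PTHREAD_PROCESS_PRIVATE);
    pthread_t a, b;
    pthread_create(&a, NULL, task2, NULL);   /* the two tasks never contend: */
    pthread_create(&b, NULL, task3, NULL);   /* different subsystems, different locks */
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}
```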
Separate locks permit Task 2 and Task 3 to conduct useful operations simultaneously, rather than spinning idly and wasting time. With a single coarse lock, any other task trying to acquire it would spin idly until the lock was released. Using a fine-grained kernel-locking mechanism thus enables a greater degree of parallelism in the execution of user tasks, boosting processor utilization and overall application throughput. Why use multiple processors together instead of simply using a single, more powerful processor to increase the capability of a single node?
The reason is that various roadblocks currently limit the speeds at which a single processor can run. Faster CPU clock speeds require wider buses for board-level signal paths, and supporting chips on the motherboard must be developed to accommodate the improved throughput.
Increasing clock speeds will not always be enough to handle growing network traffic. Normally, processor manufacturers issue minor revisions to their devices over their manufacturing life cycle.
These revisions are called steppings, and they are identified by special numbers printed on the body of the device. More recent steppings have higher version numbers and fewer bugs. Hardware steppings are not delivered in the same manner as software patches, which can update an installed product from one minor version to the next. When new steppings with fewer bugs than their predecessors come out, processor manufacturers do not normally volunteer to exchange the older processor for the newer one.
Clients are normally stuck with the original stepping unless the updated processor version is purchased separately. When using multiple processors, always ensure that the stepping numbers on each one match as closely as possible. A significant gap in stepping numbers between processors running on the same server board greatly increases the chances of a system failure.
The server network administrator should make certain that all processors on a server board are within three stepping numbers of one another. Within this safety range, the BIOS should be able to handle the small variances that occur between the processors. If the variances are more significant, however, the BIOS may not be capable of handling the resulting problems. Know that all of the processors utilized on one server board should be within three steppings of one another in order to avoid system problems.
The S-spec number is located on the top edge of the specified Intel processor, as shown in Figure 3. It identifies the specific characteristics of that processor, including the stepping.
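The stepping can also be read in software through the x86 CPUID instruction; the following is a minimal sketch using GCC/Clang's <cpuid.h> helper (the extended family and model bits are ignored for brevity). Comparing this output across sockets is one way to confirm that installed processors carry matching steppings.

```c
#include <stdio.h>
#include <cpuid.h>   /* GCC/Clang wrapper for the x86 CPUID instruction */

int main(void) {
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        fprintf(stderr, "CPUID leaf 1 not supported\n");
        return 1;
    }
    unsigned int stepping = eax & 0xF;          /* bits 3:0  */
    unsigned int model    = (eax >> 4) & 0xF;   /* bits 7:4  */
    unsigned int family   = (eax >> 8) & 0xF;   /* bits 11:8 */
    printf("family %u, model %u, stepping %u\n", family, model, stepping);
    return 0;
}
```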
Another common but avoidable problem occurs when processors designed to use different clock speeds are installed on a multiprocessor server board. For multiple processors to work properly together, they should have identical speed ratings. The safest approach with multiprocessor server boards is to install processors that are guaranteed by their manufacturer to work together.
These will generally be matched according to their family, model, and stepping versions.

Designing general benchmarks is difficult.
Even considering hardware performance alone can be challenging, because computer hardware consists of several different components (see Box 2). Computer systems are deployed and used in a broad variety of ways. As one might expect, different market segments have different use scenarios, and they stress the system in different ways. As a result, the appropriate benchmark to consider can vary considerably between market segments. For example, a machine that achieves a SPEC benchmark score that is, say, 30 percent higher than that of another machine should feel about 30 percent faster than the other machine on real workloads.
That is an example of performance as time to solution. Complicating matters a bit more, most computer-system deployments define some set of important physical constraints on the system. For example, in the case of a notebook-computer system, important energy-consumption and physical-size constraints must be met. Similarly, even in the largest supercomputer deployments, there are constraints on physical size, weight, power, heat, and cost.
Those constraints are several orders of magnitude larger than in the notebook example, but they still are fundamental in defining the resulting performance and utility of the system. As a result, for a given market opportunity, it often makes sense to gauge the value of a computer system according to a ratio of performance to constraints. Indeed, some of the metrics most frequently used today are such ratios as performance per watt, performance per dollar, and performance per area.
More generally, most computer-system customers are placing increasing emphasis on efficiency of computation rather than on gross performance metrics. By way of analogy, a car is not just an engine. One might still have a useful vehicle if the radio and cupholders were missing, but the other components must be present, because they all work in harmony to propel the vehicle controllably and safely.
Computer systems are similar. CPUs function by fetching their instructions from memory. How did the instructions get into memory, and where did they come from? The instructions came from a file on a hard disk and traversed several buses (communication pathways) to get to memory.
When we speak of the overall performance of a computer system, we are implicitly referring to the performance of all of those subsystems operating together. Although the amazing raw performance gains of the microprocessor over the last 20 years have garnered most of the attention, the overall performance and utility of computer systems are strong functions of both hardware and software.
In fact, as computer systems have deployed more hardware, they have depended more and more on software technologies to harness their computational capability. Software has exploited that capability directly and indirectly. Software has directly exploited increases in computing capability by adding new features to existing software, by solving larger problems more accurately, and by solving previously unsolvable problems.
It has indirectly exploited the capability through the use of abstractions in high-level programming languages, libraries, and virtual-machine execution environments.
By using high-level programming languages and exploiting layers of abstraction, programmers can develop correct, complex programs more quickly. Handling most real workloads relies on all three computer subsystems (processor, memory, and disk), and their performance metrics therefore reflect the combined speed of all three.
Speed up only the CPU by 10 percent, and the workload is liable to speed up, but not by 10 percent; it will probably speed up in a prorated way, because only the sections of the code that are CPU-bound will speed up. Likewise, speed up the memory alone, and the workload performance improves, but typically by much less than the memory speedup in isolation.
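That prorated behavior is what Amdahl's law describes; the text does not name the law, but the standard formulation can be sketched in C with illustrative numbers (the 60 percent CPU-bound fraction is an assumption).

```c
#include <stdio.h>

/* Amdahl's law: overall speedup when a fraction f of the work is
 * accelerated by a factor s and the rest is left unchanged. */
static double overall_speedup(double f, double s) {
    return 1.0 / ((1.0 - f) + f / s);
}

int main(void) {
    double f = 0.60;   /* assumed: 60% of the workload is CPU-bound */
    double s = 1.10;   /* CPU made 10 percent faster */
    printf("Overall speedup: %.3fx\n", overall_speedup(f, s));
    /* Prints about 1.058x: far less than the 10 percent CPU gain. */
    return 0;
}
```

With those assumed numbers, a 10 percent CPU speedup yields only about a 6 percent overall gain.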
Numerous other pieces of computer systems make up the hardware. The CPU architectures and microarchitectures encompass instruction sets, branch-prediction algorithms, and other techniques for higher performance. Storage (disks and memory) is a central component. Switching and interconnect components, from switches to routers to T1 lines, are part of every level of a computer system. There are also hardware interface devices (keyboards, displays, and mice).
Those high-level programming constructs make it easier for programmers to develop correct, complex programs faster. Abstraction tends to trade software performance for increased programmer productivity, but past increases in single-processor performance essentially hid much of that performance cost.
Thus, modern software systems now have and rely on multiple layers of system software to execute programs. The layers can include operating systems, runtime systems, virtual machines, and compilers.
They offer both an opportunity for introducing and managing parallelism and a challenge in that each layer must now also understand and exploit parallelism. The committee discusses those issues in more detail in Chapter 4 and summarizes the performance implications below.
The key performance driver to date has been software portability. Consider a jackhammer on a city street. Assume that using a jackhammer is not a pastime enjoyable in its own right—the goal is to get a job done as soon as possible.
To finish sooner, one could operate the jackhammer faster, use a more capable jackhammer that accomplishes more with each stroke, or bring in additional jackhammers and operators. All three possibilities have analogues in computer design, and all three have been and continue to be used. Modern computer systems are designed according to a synchronous, pipelined schema. Synchronous means occurring at the same time. Synchronous digital systems are based on a system clock, a specialized timer signal that coordinates all activities in the system.
Early computers had clock frequencies in the tens of kilohertz. Contemporary microprocessor designs routinely sport clock frequencies in the 3-GHz range and above.
To a first approximation, the higher the clock rate, the higher the system performance. System designers cannot pick arbitrarily high clock frequencies, however—there are limits to the speed at which the transistors and logic gates can reliably switch, limits to how quickly a signal can traverse a wire, and serious thermal power constraints that worsen in direct proportion to the clock frequency.
How much a computer system can accomplish per clock cycle varies widely from system to system, and even from workload to workload on a given system. More generally, once a large collection of programs has become available for a particular computing platform, the broader expectation is that they will all continue to work and to speed up in later machine generations. Substantial system-performance improvements are available to workloads that happen to fit the constraints of instruction-set extensions.
There is a special case of time-to-solution workloads: those that can be successfully sped up with dedicated hardware accelerators, most prominently graphics processing units (GPUs). These processors were designed originally to handle the demanding computational and memory-bandwidth requirements of 3D graphics but have more recently evolved to include more general programmability features.
With their intrinsically massive floating-point horsepower, 10 or more times that available in a general-purpose (GP) microprocessor, these chips have become the execution engine of choice for some important workloads. Although GPUs are just as constrained by the exponentially rising power dissipation of modern silicon as the GPs are, GPUs are orders of magnitude more energy-efficient for suitable workloads and can therefore accomplish much more processing within a similar power budget.
The new, higher-performing systems are then capable of executing software workloads that would previously have been infeasible; the attractiveness of the new software drives demand for the faster hardware, and the virtuous cycle continues. A few years ago, however, thermal-power dissipation grew to the limits of what air cooling can accomplish and began to constrain the attainable system performance directly.
When the power constraints threatened to diminish the generation-to-generation performance enhancements, chipmakers Intel and AMD turned away from making ever more complex microarchitectures on a single chip and began placing multiple processors on a chip instead. The new chips are called multicore chips.
Current chips have several processors on a single die, and future generations will have even more. As the microprocessor industry shifts to multicore processors, the rate of improvement of each individual processor is substantially diminished.
There is another useful performance metric besides time to solution, and the Internet has pushed it to center stage: system throughput. Consider a Web server, such as one of the machines at search giant Google. Those machines run continuously, and their work is never finished, in that new requests for service continue to arrive. For any given request for service, the user who made the request may care about time to solution, but the overall performance metric for the server is its throughput, which can be thought of informally as the number of jobs that the server can satisfy simultaneously.
When a given workload required the sequential execution of several million operations, a faster clock or a more capable microarchitecture would satisfy the requirement. But compilers are not generally capable of targeting multiple processors in pursuit of a single time-to-solution target; they know how to target one processor.
Multicore chips therefore tend to be used as throughput enhancers. Each available CPU core can pop the next runnable process off the ready list, thus increasing the throughput of the system by running multiple processes concurrently.
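A user-space sketch of that pattern, assuming POSIX threads; the ready list is reduced to a counter of pending jobs, and the job body is a placeholder.

```c
#include <pthread.h>
#include <stdio.h>

#define NUM_WORKERS 4     /* e.g., one per core; illustrative */
#define NUM_JOBS    16

static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static int next_job = 0;  /* index of the next runnable job on the "ready list" */

static void run_job(int id) {
    printf("job %d done\n", id);   /* placeholder for real work */
}

static void *worker(void *arg) {
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&queue_lock);
        int job = (next_job < NUM_JOBS) ? next_job++ : -1;  /* pop next runnable job */
        pthread_mutex_unlock(&queue_lock);
        if (job < 0)
            break;                 /* queue drained */
        run_job(job);
    }
    return NULL;
}

int main(void) {
    pthread_t workers[NUM_WORKERS];
    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_create(&workers[i], NULL, worker, NULL);
    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_join(workers[i], NULL);
    return 0;
}
```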
But that type of concurrency does not automatically improve the time to solution of any given process. Modern multithreading programming environments and their routine successful use in server applications hold out the promise that applying multiple threads to a single application may yet improve time to solution for multicore platforms.
It is reasonably clear that although time-to-solution performance is topping out, throughput can be increased indefinitely. The as yet unanswered question is whether the buying public will find throughput enhancements as irresistible as they have historically found time-to-solution improvements.
The net result is that the industry is ill prepared for the rather sudden shift from ever-increasing single-processor performance to the presence of increasing numbers of processors in computer systems. The reason that industry is ill prepared is that an enormous amount of existing software does not use thread-level or data-level parallelism—software did not need it to obtain performance improvements, because users simply needed to buy new hardware to get performance improvements.
However, only programs that have these types of parallelism will experience improved performance in the chip multiprocessor era. Although expert programmers in such application domains as graphics, information retrieval, and databases have successfully exploited those types of parallelism and attained performance improvements with increasing numbers of processors, these applications are the exception rather than the rule.
Writing software that expresses the type of parallelism that chip-multiprocessor hardware can accelerate is the main obstacle, because doing so requires new software-engineering processes and tools.
Indeed, the outlook for overcoming this obstacle and the ability of academics and industry to do it are primary subjects of this report. There should be little doubt that computers have become an indispensable tool in a broad array of businesses, industries, research endeavors, and educational institutions. They have enabled profound improvement in automation, data analysis, communication, entertainment, and personal productivity. In return, those advances have created a virtuous economic cycle in the development of new technologies and more advanced computing systems.
To understand the sustainability of continuing improvements in computer performance, it is important first to understand the health of this cycle, which is a critical economic underpinning of the computer industry. From a purely technological standpoint, the engineering community has proved to be remarkably innovative in finding ways to continue to reduce microelectronic feature sizes.
First, of course, industry has integrated more and more transistors into the chips that make up the computer systems. Fortunate side effects are improvements in speed and power efficiency of the individual transistors. Computer architects have learned to make use of the increasing numbers and improved characteristics of the transistors to design continually higher-performance computer systems. The demand for the increasingly powerful computer systems has generated sufficient revenue to fuel the development of the next round of technology while providing profits for the companies leading the charge.
Those relationships form the basis of the virtuous economic cycle. The history of computing hardware has been dominated by a few franchises. With the rise of personal computing in the 1980s, compatibility has come to mean the degree of compliance with the Intel architecture, also known as IA or x86. Intel and other semiconductor companies, such as AMD, have managed to find ways to remain compatible with code for earlier generations of x86 processors.
That compatibility comes at a price. For example, the floating-point registers in the x86 architecture are organized as a stack, not as a randomly accessible register set, as all integer registers are.
There was a time in the industry when much architecture research was premised on the notion that, because every new compatible generation of chips must carry the accumulated baggage of its past while adding ideas to keep the architecture current, the architecture would eventually fail of its own accord, a victim of its own success.
But that has not happened. There are many important applications of semiconductor technology beyond the desire to build faster and faster high-end computer systems; in particular, the broader electronics industry has leveraged those advances. The strength of the x86 architecture was most dramatically demonstrated when Intel, the original and major supplier of x86 processors, decided to introduce a new, non-x86 architecture during the transition from 32-bit to 64-bit addressing in the early 2000s.
Around the same time, however, AMD created a processor with 64-bit addressing that remained compatible with the x86 architecture, and customers, again driven by the existence of the large software base, preferred the 64-bit x86 processor from AMD over the new IA-64 processor from Intel.
In the end, Intel also developed a 64-bit x86-compatible processor that now far outsells its IA-64 Itanium processor. With the rise of the cell phone and other portable media and computing appliances, yet another dominant architectural approach has emerged: the ARM architecture. The rapidly growing software base for portable applications running on ARM processors has made the compatible series of processors licensed by ARM the dominant choice for embedded and portable applications.
Initial moves to chip-multiprocessor systems are being made with existing architectures, based primarily on the x86. The computing industry has accumulated a lot of history on this subject, and it appears safe to say that the era in which compatibility is an absolute requirement will probably end not because an incompatible but compellingly faster competitor has appeared, but only when one of several conditions takes hold. Although each of those portable devices tends to have more substantial constraints (size, power, and cost) than traditional computer systems, they often embody the computational and networking capabilities of previous generations of higher-end systems.
In light of the capabilities of the smaller form-factor devices, they will probably play an important role in unleashing the aggregate performance potential of larger-scale networked systems in the future. Those additional market opportunities have strong economic underpinnings of their own, and they have clearly reaped benefits from deploying technology advances driven into place by the computer-systems industry.
In many ways, the incredible utility of computing not only has provided direct improvement in productivity in many industries but also has set the stage for amazing growth in a wide array of codependent products and industries. In recent years, however, we have seen some potentially troublesome changes in the traditional return on investment embedded in this virtuous cycle.
As we approach more of the fundamental physical limits of technology, we continue to see dramatic increases in the costs associated with technology development and in the capital required to build fabrication facilities to the point where only a few companies have the wherewithal even to consider building these facilities.
At the same time, although we can pack more and more transistors into a given area of silicon, we are seeing diminishing improvements in transistor performance and power efficiency. As a result, computer architects can no longer rely on those sorts of improvements as means of building better computer systems and now must rely much more exclusively on making use of the increased transistor-integration capabilities. Our progress in identifying and meeting the broader value propositions has been somewhat mixed.
On the one hand, multiple processor cores and other system-level features are being integrated into monolithic pieces of silicon. On the other hand, to realize the benefits of the multiprocessor machines, the software that runs on them must be conceived and written in a different way from what most programmers are accustomed to. From an end-user perspective, the hardware and the software must combine seamlessly to offer increased value. It is increasingly clear that the computer-systems industry needs to address those software and programmability concerns or risk the ability to offer the next round of compelling customer value.
Without performance incentives to buy the next generation of hardware, the economic virtuous cycle is likely to break down, and this would have widespread negative consequences for many industries.
In summary, the sustained viability of the computer-systems industry is heavily influenced by an underlying virtuous cycle that connects continuing customer perception of value, financial investments, and new products getting to market quickly. Although one of the primary indicators of value has traditionally been the ever-increasing performance of each individual compute node, the next round of technology improvements cannot count on such single-node gains.
As a result, many computer systems under development are betting on the ability to exploit multiple processors and alternative forms of parallelism in place of the traditional increases in the performance of individual computing nodes. To make good on that bet, there need to be substantial breakthroughs in the software-engineering processes that enable the new types of computer systems.
Moreover, attention will probably be focused on high-level performance issues in large systems at the expense of time to market and the efficiency of the virtuous cycle.
The end of dramatic exponential growth in single-processor performance marks the end of the dominance of the single microprocessor in computing. Because multiprocessing OSs rely on parallel processing, each processor involved in a task must be able to inform the others about how its task is progressing.
This allows the work of the processors to be integrated when the calculations are done such that delays and other inefficiencies are minimized. For example, if a single-processor OS were running an application requiring three tasks to be performed, one taking five milliseconds, another taking eight milliseconds, and the last taking seven milliseconds, the processor would perform each task in order.
The entire application would thus require twenty milliseconds. If a multiprocessing OS were running the same application, the three tasks would be assigned to separate processors. The first would complete the first task in five milliseconds, the second would do the second task in eight milliseconds, and the third would finish its task in seven milliseconds.
Thus, the multiprocessing OS would complete the entire application in eight milliseconds. From this example, it is clear that multiprocessing OSs offer distinct advantages. One type of tightly coupled multiprocessing system has its processors share memory with one another; this is known as symmetric multiprocessing (SMP). Processor symmetry is present when the multiprocessing OS treats all processors equally, rather than prioritizing a particular one for certain operations. Multiprocessing OSs are designed with special features that support SMP, because the OS must be able to take advantage of the presence of more than one processor.
It must also be able to track the progress of each task so that the results of each operation can be combined once they conclude. In contrast, asymmetric multiprocessing occurs when a computer assigns system maintenance tasks to some types of processors and application tasks to others.
Because the type of task assigned to each processor is not the same, they are out of symmetry. SMP has become more commonplace because it is usually more efficient.