high performance computing
 
SCI as cluster interconnect
The Dolphin PCI-SCI family of adapter cards is a standard,
high performance solution for clustered systems.
Scali combines these with native software to provide the infrastructure
for building clusters with outstanding performance.
The following presentation gives a simplified introduction to SCI as
used by Scali.
The figure illustrates (upper part) a dual CPU node with an SCI
network adapter card (PCI/SCI).
The adapter card contains a PCI
interface (PSB) and three SCI link controllers (LC3). Each LC3 has two
unidirectional links, one carrying traffic to the node (SCI in) and one
carrying traffic from the node (SCI out).
Three link controllers enable the node to connect to three dimensions
in a grid, via three unidirectional rings.
Currently each link supports a bandwidth of 667 MByte/s.
The functionality of the link controller is detailed in the lower part
of the figure.
The incoming traffic that reaches the B-link is either destined for
the local node (via the PSB), or it is picked up by one of the link
controllers.
Traffic destined for another node in the same ring simply passes
from input link to output link via the bypass FIFO.
As the bypass FIFO is designed as a "fall through" FIFO, traffic
that passes through it is not delayed.
Since the switching functionality is present at all nodes, the
switching function for the cluster is effectively distributed
throughout the cluster. This way, SCI-based Scali clusters do not need
separate switches the way some other cluster interconnects do.
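The effect of the unidirectional rings can be illustrated with a small
sketch. It is not taken from Scali or Dolphin material: the node
coordinates, the 4 x 4 x 4 cluster size and the function names are
assumptions made for illustration, and the sketch assumes traffic
travels along one ring per dimension. On a unidirectional ring traffic
flows one way only, so the hop count is a modular difference.

```python
def ring_hops(src: int, dst: int, ring_size: int) -> int:
    """Hops from src to dst on a unidirectional ring (one-way traffic)."""
    return (dst - src) % ring_size


def route_hops(src, dst, dims):
    """Per-dimension hop counts, assuming one ring is used per dimension."""
    return [ring_hops(s, d, n) for s, d, n in zip(src, dst, dims)]


if __name__ == "__main__":
    dims = (4, 4, 4)  # hypothetical 4 x 4 x 4 cluster (64 nodes)
    hops = route_hops((0, 1, 3), (2, 1, 0), dims)
    print(hops, sum(hops))
```

Note that going "backwards" by one position on a ring of four nodes
costs three hops, since the traffic must travel the rest of the way
around the ring.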
Shared Address Space
Dolphin's SCI adapter cards are based on the ANSI/IEEE 1596-1992
Scalable Coherent Interface standard. This standard was developed to
carry the shared memory ideas from buses over to network-based
multiprocessors. Dolphin's adapter cards do not, however, exploit the
cache coherence features of SCI. Instead, the cards rely on a subset
of the protocol for reading and writing data without hardware
coherence.
This way commodity nodes can benefit from the high performance of SCI
interconnects without the need for specialized hardware interfaces to
the nodes' cache controllers.
The concept of shared memory is nevertheless maintained, with the
result that each node can issue both load and store instructions to
remote memories via SCI.
Scali's communication libraries for SCI rely heavily on the ability
of the adapter cards to read and write the memory of remote nodes.
The figure to the right illustrates how this is done by letting
parts of a process's virtual memory be mapped across SCI, via
translation tables, to the memories of remote nodes.
This way load and store instructions performed by a local processor
can be carried over the SCI interconnect and directly access remote
memory.
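The mapping idea can be sketched as a simple lookup table. This is
only a conceptual model, not Dolphin's actual table format: the page
size, the class names and the fields (node_id, remote_base) are
hypothetical, and the real translation is performed by the adapter
hardware.

```python
from dataclasses import dataclass

PAGE_SIZE = 4096  # assumed page size, for illustration only


@dataclass(frozen=True)
class RemoteTarget:
    node_id: int      # hypothetical identifier of the remote node
    remote_base: int  # base address of the page in the remote memory


class TranslationTable:
    """Maps local pages to (remote node, remote address) pairs."""

    def __init__(self):
        self._entries = {}

    def map_page(self, local_page, node_id, remote_base):
        self._entries[local_page] = RemoteTarget(node_id, remote_base)

    def translate(self, local_addr):
        """Return (node_id, remote_addr) for a mapped local address."""
        page, offset = divmod(local_addr, PAGE_SIZE)
        target = self._entries.get(page)
        if target is None:
            raise KeyError(f"page {page} is not mapped to a remote node")
        return target.node_id, target.remote_base + offset


if __name__ == "__main__":
    table = TranslationTable()
    table.map_page(local_page=2, node_id=7, remote_base=0x10000)
    print(table.translate(2 * PAGE_SIZE + 0x20))
```

In this model a store to a mapped local address becomes a write to
the translated address on the remote node, and an access to an
unmapped page is rejected.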
This ability to combine instructions executed by the local processor
with remote accesses means that the adapter cards need no separate
processor, a feature that distinguishes Dolphin's SCI adapter cards
from competing products.
Push based messaging
The state-of-the-art CPUs in the nodes perform the
performance-critical task of moving data.
This way a minimal hardware solution efficiently forwards generic
memory accesses to the network.
Although both reads and writes are available, Scali's MPI implementation
for SCI (Scali MPI Connect) uses only writes, i.e., the local node writes into
the memory of the remote node. This is also called push based messaging.
ScaMPI's performance is achieved by taking maximum advantage of fast
point-to-point links, and by bypassing the time-consuming operating
system calls and protocol software overhead found in traditional
networking approaches.
By combining a write-only architecture with aggressive use of key
features in modern processors, latency is hidden and performance
increased. The key features in question are the multimedia
extensions to traditional instruction sets, and the processors' write
buffers together with the merging of writes in these buffers.
The sending process transfers the data inline, i.e., by
writing data directly from its own cache to the receiving side.
When the processor pushes data onto the network, the data is also
consistent: there are no problems with the operating system's
manipulation of virtual memory page frames, and the sending
process knows when the data has left, so the buffer can be reused
without further synchronization.
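The push based pattern can be modelled in a few lines. This is a
simplified simulation, not Scali MPI Connect's actual protocol: the
RemoteBuffer class, the sequence-number flag and both function names
are assumptions, and the model takes for granted that the sender's
writes arrive in the order they were issued. The sender only writes,
and the receiver only reads its own local memory.

```python
class RemoteBuffer:
    """Stands in for a receive buffer located in the remote node's memory."""

    def __init__(self, capacity):
        self.data = bytearray(capacity)
        self.length = 0
        self.seq = 0  # written last by the sender, polled by the receiver


def push_message(buf, payload, seq):
    """Sender side: write the payload first, then publish it with seq."""
    if len(payload) > len(buf.data):
        raise ValueError("payload does not fit in the receive buffer")
    buf.data[:len(payload)] = payload  # models remote stores of the payload
    buf.length = len(payload)
    buf.seq = seq                      # final store signals "message ready"


def poll_message(buf, expected_seq):
    """Receiver side: check local memory, return the message if present."""
    if buf.seq != expected_seq:
        return None
    return bytes(buf.data[:buf.length])


if __name__ == "__main__":
    buf = RemoteBuffer(64)
    print(poll_message(buf, 1))   # nothing has arrived yet
    push_message(buf, b"hello", seq=1)
    print(poll_message(buf, 1))
```

Because no read ever crosses the network in this model, the receiver
never waits on a remote round trip; it only polls memory that is local
to it.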
The end result is that Scali MPI Connect achieves very low latency for
small messages, currently under 4µs.
At the same time, the bandwidth achievable is limited by the PCI bus,
as the SCI interconnect has plenty of room for higher bandwidth.
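The bottleneck can be checked with simple arithmetic. The text does
not say which PCI variant is used, so the two bus configurations below
are assumptions, and the figures are theoretical peak rates (bus width
times clock rate) rather than measured throughput. Only the
667 MByte/s link figure comes from the text above.

```python
SCI_LINK_MBYTES = 667.0  # per-link bandwidth quoted in the text


def pci_peak_mbytes(width_bits, clock_mhz):
    """Theoretical PCI peak in MByte/s: bytes per transfer times MHz."""
    return width_bits / 8 * clock_mhz


def bottleneck(pci_mbytes, sci_mbytes=SCI_LINK_MBYTES):
    """The slower of the two paths limits the achievable bandwidth."""
    return min(pci_mbytes, sci_mbytes)


if __name__ == "__main__":
    # Assumed bus variants: 32-bit/33 MHz and 64-bit/66 MHz PCI.
    for width, clock in ((32, 33.33), (64, 66.66)):
        pci = pci_peak_mbytes(width, clock)
        print(f"{width}-bit/{clock} MHz PCI: peak {pci:.0f} MByte/s, "
              f"limit {bottleneck(pci):.0f} MByte/s")
```

In both assumed configurations the PCI peak lies below the SCI link
bandwidth, which is consistent with the statement that the PCI bus is
the limiting factor.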
More Information about Scali Software:
Scali MPI Interconnect
Scali Manage - Cluster Management System
Learn more about the hpcLine