hpcLine

high performance computing

SCI as cluster interconnect

The Dolphin PCI-SCI family of adapter cards is a standard, high-performance solution for clustered systems. Scali combines these cards with native software to provide the infrastructure for building clusters with outstanding performance. The following presentation gives a simplified introduction to SCI as used by Scali.
    The upper part of the figure illustrates a dual-CPU node with an SCI network adapter card (PCI/SCI). The adapter card contains a PCI interface (PSB) and three SCI link controllers (LC3). Each LC3 has two unidirectional links, one carrying traffic to the node (SCI in) and one carrying traffic from the node (SCI out). With three link controllers, a node can connect to three dimensions of a grid via three unidirectional rings. Each link currently supports a bandwidth of 667 MByte/s.
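The three-ring arrangement described above can be sketched in a few lines of Python. This is only an illustration: the coordinate scheme, the function names and the choice of "+1" as the downstream direction are assumptions made for the example, not part of the Dolphin or Scali products.

```python
from typing import List, Tuple

Coord = Tuple[int, int, int]  # hypothetical (x, y, z) position of a node in the grid


def ring_neighbors(node: Coord, dims: Coord) -> List[Tuple[Coord, Coord]]:
    """For each of the three rings, return (upstream, downstream) neighbours.

    The upstream neighbour feeds this node's "SCI in" link; the downstream
    neighbour is fed by this node's "SCI out" link. Downstream is assumed
    to be the +1 direction, wrapping around at the end of the ring.
    """
    result = []
    for axis in range(3):
        size = dims[axis]
        up = list(node)
        down = list(node)
        up[axis] = (node[axis] - 1) % size
        down[axis] = (node[axis] + 1) % size
        result.append((tuple(up), tuple(down)))
    return result


def hops_on_ring(src: int, dst: int, size: int) -> int:
    """Hops from src to dst on a unidirectional ring of the given size."""
    return (dst - src) % size


# Node (0, 0, 0) in a 4 x 4 x 4 grid, ring along the first dimension:
print(ring_neighbors((0, 0, 0), (4, 4, 4))[0])  # ((3, 0, 0), (1, 0, 0))
# Traffic can only travel downstream, so going from position 3 to 1 wraps around:
print(hops_on_ring(3, 1, 4))  # 2
```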
    The functionality of the link controller is detailed in the lower part of the figure. Incoming traffic that reaches the B-link is either destined for the local node (via the PSB) or is picked up by one of the other link controllers. Traffic destined for another node on the same ring simply passes from the input link to the output link via the bypass FIFO. Because the bypass FIFO is designed as a "fall-through" FIFO, traffic that passes through it is not delayed. Since the switching functionality is present in every node, the switching function for the cluster is effectively distributed throughout the cluster. SCI-based Scali clusters therefore do not need separate switches, as some other cluster interconnects do.
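The decision a link controller makes for each incoming packet can be modelled roughly as follows. The sketch assumes dimension-order routing (finish one ring before moving to the next), which the text above does not state; the function names and the returned labels are likewise invented for illustration.

```python
from typing import List, Tuple

Coord = Tuple[int, int, int]  # hypothetical (x, y, z) node position


def link_controller_decision(here: Coord, dst: Coord, ring_axis: int) -> str:
    """Decide what happens to a packet arriving on the ring along ring_axis."""
    if dst[ring_axis] != here[ring_axis]:
        # Still travelling along this ring: input link -> bypass FIFO -> output link.
        return "bypass"
    # The packet leaves this ring and is placed on the B-link.
    for axis in range(3):
        if dst[axis] != here[axis]:
            # Another link controller picks it up and sends it along its ring.
            return f"switch-to-ring-{axis}"
    # All coordinates match: hand the packet to the PCI interface (PSB).
    return "deliver-local"


def trace(src: Coord, dst: Coord, dims: Coord) -> List[Coord]:
    """List the nodes a packet visits under the assumed dimension-order routing."""
    here = list(src)
    path = [tuple(here)]
    for axis in range(3):
        while here[axis] != dst[axis]:
            here[axis] = (here[axis] + 1) % dims[axis]  # rings are unidirectional
            path.append(tuple(here))
    return path


print(link_controller_decision((1, 0, 0), (2, 0, 0), 0))  # bypass
print(trace((0, 0, 0), (2, 1, 0), (4, 4, 4)))
# [(0, 0, 0), (1, 0, 0), (2, 0, 0), (2, 1, 0)]
```

Every node runs the same decision logic, which is what makes a separate central switch unnecessary in this model.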

Shared Address Space

Dolphin's SCI adapter cards are based on the ANSI/IEEE 1596-1992 Scalable Coherent Interface standard. This standard was developed to carry the shared-memory ideas of buses over to network-based multiprocessors. Dolphin's adapter cards do not, however, exploit the cache-coherence features of SCI; instead they rely on a subset of the protocol for reading and writing data without hardware coherence. This way commodity nodes can benefit from the high performance of SCI interconnects without needing specialized hardware interfaces to the nodes' cache controllers. The concept of shared memory is nevertheless maintained, so each node can issue both load and store instructions to remote memory via SCI.
    Scali's communication libraries for SCI rely heavily on the ability of the adapter cards to read and write the memory of remote nodes. The figure to the right illustrates how this is done: parts of a process's virtual memory are mapped across SCI, via translation tables, to the memory of remote nodes. Load and store instructions performed by a local processor can thus be carried over the SCI interconnect and access remote memory directly. Because remote accesses are performed by instructions executed on the local processor, the adapter cards need no separate processor of their own, a feature that distinguishes Dolphin's SCI adapter cards from competing products.
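The mapping through translation tables can be pictured with a small software model. This is a toy simulation, not the adapter's actual table format: the page size, the class `SciWindow` and its methods are assumptions introduced for the example, and remote memory is stood in for by ordinary byte arrays.

```python
from typing import Dict, Tuple

PAGE_SIZE = 4096  # assumed page granularity of the mapping


class SciWindow:
    """Toy model of a local address window mapped onto remote nodes' memory."""

    def __init__(self, remote_memories: Dict[int, bytearray]) -> None:
        self.remote = remote_memories                  # node id -> that node's memory
        self.table: Dict[int, Tuple[int, int]] = {}    # local page -> (node id, remote page)

    def map_page(self, local_page: int, node_id: int, remote_page: int) -> None:
        self.table[local_page] = (node_id, remote_page)

    def translate(self, addr: int) -> Tuple[int, int]:
        """Turn a local window address into (node id, address in remote memory)."""
        page, offset = divmod(addr, PAGE_SIZE)
        node_id, remote_page = self.table[page]
        return node_id, remote_page * PAGE_SIZE + offset

    def store(self, addr: int, value: int) -> None:
        node_id, remote_addr = self.translate(addr)
        self.remote[node_id][remote_addr] = value      # the "store" lands in remote memory

    def load(self, addr: int) -> int:
        node_id, remote_addr = self.translate(addr)
        return self.remote[node_id][remote_addr]


memories = {7: bytearray(2 * PAGE_SIZE)}   # hypothetical remote node with id 7
window = SciWindow(memories)
window.map_page(local_page=0, node_id=7, remote_page=1)
window.store(10, 0xAB)
print(window.translate(10))                # (7, 4106)
print(hex(memories[7][PAGE_SIZE + 10]))    # 0xab
```

In the real system the translation happens in hardware as a side effect of an ordinary load or store; the explicit `load` and `store` methods here only make that step visible.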

Push based messaging

The state-of-the-art CPUs in the nodes perform the performance-critical task of moving data, so a minimal hardware solution is enough to forward generic memory accesses to the network efficiently. Although both reads and writes are available, Scali's MPI implementation for SCI, Scali MPI Connect (ScaMPI), uses only writes, i.e., the local node writes into the memory of the remote node. This is also called push-based messaging. ScaMPI achieves its performance by taking maximum advantage of fast point-to-point links and by bypassing the time-consuming operating system calls and protocol software overhead found in traditional networking approaches.
    By combining a write-only architecture with aggressive use of key features in modern processors, latency is hidden and performance increased. The key features in question are the multimedia extensions to traditional instruction sets, and write buffers together with the merging of writes in those buffers. The sending process transfers the data inline, i.e., by writing data directly from its own cache to the receiving side. When the processor pushes data onto the network, the data is also consistent: there are no problems with the operating system's manipulation of virtual memory page frames, and the sending process knows when the data has left and the buffer can be reused without further synchronization.
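A minimal simulation of push-based messaging is sketched below. It is not Scali MPI Connect's actual protocol: the mailbox layout, the sequence-number header and the function names are assumptions chosen to show the principle that the sender only writes and the receiver only reads its own memory.

```python
import struct
from typing import Optional

HEADER = struct.Struct("<II")  # assumed layout: sequence number, payload length


class Mailbox:
    """A buffer that lives in the receiver's memory and is mapped by the sender."""

    def __init__(self, size: int = 256) -> None:
        self.mem = bytearray(size)


def push(mailbox: Mailbox, seq: int, payload: bytes) -> None:
    """Sender side: write the payload, then the header that publishes it."""
    if HEADER.size + len(payload) > len(mailbox.mem):
        raise ValueError("payload does not fit in the mailbox")
    start = HEADER.size
    mailbox.mem[start:start + len(payload)] = payload
    # The header is written last, so a matching sequence number implies
    # that the payload is already in place.
    HEADER.pack_into(mailbox.mem, 0, seq, len(payload))


def poll(mailbox: Mailbox, expected_seq: int) -> Optional[bytes]:
    """Receiver side: read only local memory; no request goes to the sender."""
    seq, length = HEADER.unpack_from(mailbox.mem, 0)
    if seq != expected_seq:
        return None
    return bytes(mailbox.mem[HEADER.size:HEADER.size + length])


box = Mailbox()
print(poll(box, 1))        # None, nothing has been pushed yet
push(box, 1, b"hello")
print(poll(box, 1))        # b'hello'
```

On real hardware the order in which the two writes become visible at the receiver depends on how write buffers are flushed, which this single-process model does not capture.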
    The end result is that Scali MPI Connect achieves very low latency for small messages, currently under 4 µs. The achievable bandwidth, on the other hand, is limited by the PCI bus rather than by the SCI interconnect, which has plenty of headroom for higher bandwidth.
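The interplay of latency and bottleneck bandwidth can be made concrete with a simple first-order model. The formula (transfer time = latency + size / slowest stage) is a common approximation and not a Scali specification. The 4 µs latency and the 667 MByte/s link figure are taken from the text; the PCI figure of 300 MByte/s is a purely hypothetical placeholder, since the text does not give one.

```python
def transfer_time(size_bytes: float, latency_s: float,
                  link_bw: float, host_bus_bw: float) -> float:
    """First-order model: fixed latency plus time on the slowest stage."""
    bottleneck = min(link_bw, host_bus_bw)
    return latency_s + size_bytes / bottleneck


def effective_bandwidth(size_bytes: float, latency_s: float,
                        link_bw: float, host_bus_bw: float) -> float:
    """Bytes per second actually delivered for one message of the given size."""
    return size_bytes / transfer_time(size_bytes, latency_s, link_bw, host_bus_bw)


LATENCY = 4e-6      # upper bound quoted in the text
SCI_LINK = 667e6    # bytes/s per link, from the text
PCI_BUS = 300e6     # HYPOTHETICAL value, for illustration only

for size in (64, 4096, 1 << 20):
    bw = effective_bandwidth(size, LATENCY, SCI_LINK, PCI_BUS)
    print(f"{size:>8} bytes: {bw / 1e6:7.1f} MByte/s")
```

In this model small messages are dominated by latency, while large messages approach the bandwidth of the slower stage, which is the PCI bus whenever its bandwidth is below that of the SCI link.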
