Exclusive Coverage of SGI 2003 New Product Line - Part 2 - Hardware Details

In Part 2, we have a detailed, module-by-module hardware analysis of the new products, written exclusively for fxguide by industry legend Jean-Francois Panisset. Some of his findings will surprise you, as he compares Octane and Onyx3 with Tezro and Onyx4. Now at one of the world’s leading visual effects houses, Panisset is widely known for his former work at Discreet, where for 8 years he led much of Discreet’s SGI hardware analysis team. Panisset (or JF as he is

On Monday, July 14th 2003, SGI is announcing significant additions to its workstation and high-end visualization systems product line which should be very interesting to users of high-end digital media applications. The SGI Tezro is a family of workstations which replaces the OCTANE2, whereas Onyx4 UltimateVision is a new graphics architecture which takes a different approach to performance than the current InfiniteReality4 hardware. These systems will be discussed mostly from a hardware perspective, with an emphasis on how changes from previous systems could affect performance and functionality of high-end editing and compositing applications, and how well these systems will integrate in a typical production or post-production environment. No claims are made regarding specific applications from specific vendors: only the potential of the systems is discussed.

Guest Columnist: Jean-Francois Panisset

The Tezro workstation

CPUs and Memory

SGI has always used the approach of migrating technology first developed in its high end systems into its desktop offerings. In the case of the Tezro workstation, it inherits the design of a quad CPU and memory module introduced in the high end SGI 3900 systems. These modules can have one to four R16K processors and up to 8GB of RAM. Whereas each Cx brick in an SGI 3900 (the basic CPU component) can contain four such modules (up to 16 CPUs in a Cx brick), Tezro holds a single module, and thus supports one, two or four R16K CPUs, each of which has 4MB of cache memory and runs at 600MHz or 700MHz. A nice side effect of using a common processor/memory module is that it allows the use of the same memory modules, and since third party memory is available from an SGI-certified manufacturer for the SGI 3000 family machines, third-party memory should be available for Tezro machines without significant delay.

This new CPU/memory module provides significantly increased memory bandwidth with respect to the OCTANE2, up to 3.2GB/sec of bidirectional bandwidth. This makes a four CPU system viable: there is not much point to adding more CPUs to a system if they will be starved for memory bandwidth. This is especially important for digital media applications, which tend to access very large datasets which do not fit in cache memory (a single 4K resolution image at more than 8 bits per component can take up 100MB of memory). Not only is memory bandwidth important to CPU performance, but it is also crucial to moving large amounts of data between a striped disk array, graphics, high speed networking interfaces, video I/O interfaces and the CPUs themselves. As we will see later, increased memory bandwidth on Tezro should enable applications to tackle classes of problems they would not have been able to address on OCTANE2. Parallelized image processing algorithms will greatly benefit from having 4 CPUs in a desktop system, especially for large image sizes.
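
To put the 100MB figure in perspective, here is a minimal sketch of the arithmetic. The frame size (4096×3112, a common full-aperture film scan size) and the choice of four components at 16 bits each are assumptions made for illustration, not figures from SGI.

```python
def image_bytes(width, height, components, bits_per_component):
    """Uncompressed size of one image, with each component padded to whole bytes."""
    bytes_per_component = (bits_per_component + 7) // 8
    return width * height * components * bytes_per_component


# Assumed example: a 4096x3112 RGBA frame stored at 16 bits per component.
frame = image_bytes(4096, 3112, components=4, bits_per_component=16)
cache = 4 * 2**20  # the 4MB of cache per R16K CPU mentioned above

print(f"{frame / 1e6:.0f} MB per frame")        # 102 MB per frame
print(f"{frame // cache} times the CPU cache")  # 24 times the CPU cache
```

Under these assumptions a single frame is roughly 24 times larger than one CPU's cache, which is why almost every pixel access ends up going to main memory.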

Chassis and I/O

Physically, Tezro is available as either a single desktop/tower chassis which is slightly larger than the current OCTANE/OCTANE2 chassis, or a rack mounted version which is composed of either one or two 2U units (two Rack Units, which is equivalent to 3.5 inches). The tower configuration has 8 PCI-X slots on three separate busses, and that is one of the most exciting aspects of this new machine. The OCTANE2 was always somewhat limited by the number of expansion slots. Although it had 3 PCI slots in its PCI expansion cardcage, those were on the same 64 bit/33 MHz PCI bus, with a theoretical limit of 266MB/sec. This means that a single 2Gbit/sec FibreChannel adapter could saturate this bus. Two XIO slots were also available, one of which would typically be used for a DM2 video I/O card, the other for a second PCI FibreChannel adapter. In Tezro, these PCI busses, the dedicated XIO slots for graphics and video, and the processors and memory are connected via a crossbar architecture which allows multiple concurrent data flows at high bandwidth. In the rack-mount configuration, the first module has 4 PCI slots on two busses, V12 graphics and one to four CPUs. The optional second module holds four additional PCI slots on 2 busses and a dedicated slot for a video I/O card. Additional CPUs or memory cannot be installed in the second rack-mount module.

The PCI-X busses in Tezro support 64 bit transfers at 100MHz, for a theoretical maximum of 800MB/sec. Of the 8 slots, one is permanently used by the built-in IO9 base I/O adapter. There are no XIO slots per se (with PCI-X, it no longer makes sense for SGI to develop its own proprietary I/O adapters), although dedicated slots are used for V12 graphics and optionally a DM3 video I/O adapter. With 7 usable PCI-X slots and lots of memory bandwidth to match, it should be possible to scale the I/O on a Tezro system to tackle very large data movement applications.
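
The 266MB/sec and 800MB/sec figures both follow from the same simple formula: bus width in bytes multiplied by the clock rate. A minimal sketch (the function name is made up for this example, and the 33.33MHz value is the usual nominal PCI clock):

```python
def bus_bandwidth_mb_per_sec(width_bits, clock_mhz):
    """Theoretical peak of a parallel bus: bytes per transfer times millions of transfers per second."""
    return (width_bits / 8) * clock_mhz


octane2_pci = bus_bandwidth_mb_per_sec(64, 33.33)  # one bus shared by all 3 OCTANE2 PCI slots
tezro_pci_x = bus_bandwidth_mb_per_sec(64, 100)    # each PCI-X bus in Tezro

print(int(octane2_pci), int(tezro_pci_x))  # 266 800
```

These are theoretical peaks: real transfers lose some of that to bus arbitration and protocol overhead, so sustained rates will be lower on both machines.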

The IO9 base I/O adapter includes built-in Gigabit Ethernet: with the price of Gigabit Ethernet switches having reached reasonable levels, anyone moving images around should be thinking about Gigabit Ethernet in their facility. Jumbo packets are supported, which helps improve the efficiency of GigE transfers. Hopefully SGI will add support for a 10 Gigabit Ethernet adapter to IRIX, especially once these boards and the corresponding switches come down in price to more accessible levels. Two Ultra 160 SCSI ports are provided: one is routed internally to a bay which can hold two 3.5″ drives, and one is available on a connector on the back of the machine. Considering that SGI systems tend to last for a fairly long time, it would have been preferable to go with more recent Ultra 320 SCSI ports, since a single Ultra 320 port can potentially sustain enough bandwidth for a stream of 1920×1080 HD material. It is more likely that the external SCSI port will be used to connect to SCSI tape drives for backup purposes, and support for LVD SCSI is helpful in supporting modern tape drives.
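
As a rough check on the Ultra 160 versus Ultra 320 point, the sketch below compares the data rate of an uncompressed 1920×1080 stream with the nominal rate of each SCSI port. The pixel sizes (3 or 4 bytes) and frame rates (24 or 30 frames per second) are assumptions chosen for illustration, and the port figures are nominal bus rates, not what a disk array will actually sustain.

```python
ULTRA160_MB_PER_SEC = 160  # nominal bus rates; sustained throughput is lower
ULTRA320_MB_PER_SEC = 320


def stream_mb_per_sec(width, height, bytes_per_pixel, frames_per_sec):
    """Data rate of an uncompressed image stream in MB/sec (1 MB = 10**6 bytes)."""
    return width * height * bytes_per_pixel * frames_per_sec / 1e6


def smallest_port(rate_mb_per_sec):
    """Name the slowest port whose nominal rate covers the stream."""
    if rate_mb_per_sec <= ULTRA160_MB_PER_SEC:
        return "Ultra 160"
    if rate_mb_per_sec <= ULTRA320_MB_PER_SEC:
        return "Ultra 320"
    return "neither"


# Assumed formats: 3 bytes/pixel and 4 bytes/pixel, at 24 and 30 frames per second.
for bytes_per_pixel in (3, 4):
    for fps in (24, 30):
        rate = stream_mb_per_sec(1920, 1080, bytes_per_pixel, fps)
        print(f"{bytes_per_pixel} bytes/pixel at {fps} fps: {rate:.1f} MB/sec -> {smallest_port(rate)}")
```

Under these assumptions only the smallest case (about 149MB/sec) fits under the nominal Ultra 160 rate, and then with almost no headroom, while every case fits under Ultra 320.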

A dedicated IDE interface is used to connect the IO9 to the optional internal DVD-ROM drive. The OCTANE2 did not have room for an internal CD-ROM drive, which made software installations more cumbersome (you either had to do the installation over the network, or connect an external CD-ROM drive to the machine). This built-in drive makes it easy to install software, load customer-supplied data and extract scratch music from CDs, and hopefully at some point SGI will supply all of IRIX and the corresponding overlays on a single DVD-ROM, which would make installing IRIX a lot less cumbersome. Although Onyx 3200 systems have had a DVD-ROM drive from the beginning, it was only used as a CD-ROM drive. Tezro will launch with IRIX 6.5.20, which supports the UDF file-system format used by DVD-ROMs. Support for recordable DVD drives would be a great addition to IRIX, allowing the use of DVD drives for archiving purposes and producing DVD dailies.

Two serial ports are built-in, which are typically used for VTR control and a graphics tablet. Two more can be added via a PCI board. The keyboard and mouse connect via PS/2 connectors, making it possible to use standard PC keyboard/mouse extenders if the system is located remotely from the console (a typical scenario when the system is installed in a machine room). The rack-mounted configuration uses an optional USB card to attach the mouse and keyboard, which unfortunately uses up a PCI slot.

Graphics

Tezro uses the same V12 graphics as OCTANE2, which has both positive and negative aspects. On the plus side, this means that existing graphical applications can quickly move to the Tezro platform without having to worry about certifying a completely new graphics adapter. Although OpenGL guarantees a fairly high degree of compatibility between different adapters, each adapter still has its own set of capabilities, performance characteristics and bugs. Thus supporting a new OpenGL adapter in a large graphical application can consume significant resources. V12 is a stable and mature product with all the features required for current editing and compositing applications. Although V12 is feature-wise identical in Tezro and in OCTANE2, the board itself has a different form factor, and installs in a dedicated slot in the Tezro chassis (whether the Tower or rack-mount configuration). A single DVI-I digital/analog output is standard, allowing flat panel displays to be connected digitally. Furthermore, V12 in Tezro takes advantage of an optimized interface to the crossbar switch core of the system, allowing images to be transferred between 15 and 25% faster into graphics than on OCTANE2. This should benefit editing and compositing applications, which typically spend more time sending images rather than polygons into the graphics adapter. Combined with the ability to scale I/O, this should allow Tezro to play back large resolution images at more than 8 bits per component, and possibly more than one stream of high resolution images.

On the negative side, V12 is a design which dates back a few years, and has thus been superseded in terms of performance and features. In particular, it does not benefit from the most exciting development in graphics hardware in the last few years, programmability: recent adapters allow developers to write “vertex shaders” to manipulate geometry in ways not possible with the fixed function geometry pipeline of OpenGL, and “pixel shaders” allowing arbitrary shading operations to be performed on pixel fragments. Combined with floating point pixel formats, this allows arbitrarily complex graphics algorithms to be executed on the graphics adapter, significantly increasing the range of effects that can be rendered quickly and efficiently. What makes V12 still viable are its very high bandwidth for transferring pixels in and out of the adapter, full support for 12 bits per color component, and digital media integration through genlock support and hardware graphics to video support.

Audio and Video

The desktop Tezro system has built-in analog audio I/O, which would typically not be used for professional applications (although it can come in handy to listen to your favorite CDs or MP3s). Instead, the RAD PCI audio card is the preferred solution. RAD implements the same functionality as the built-in audio on the OCTANE2 (minus the analog support, which is included in the built-in audio support): AES in/out, ADAT in/out (which can be used to implement 8 channels of AES I/O through an external ADAT to AES converter), and locking to a video clock (which OCTANE2 did not support). RAD supports sample rates up to 48kHz: hopefully in the future there will be an option for an audio interface supporting 96kHz and higher sample rates. Existing audio applications which support RAD audio (whether as a PCI card or in its built-in form on OCTANE2 or Onyx2) should be able to take advantage of RAD audio on Tezro without too much effort.

Two separate video I/O interfaces are supported. DM6 is a standard-definition (SD) only PCI board which can be installed in one of the PCI slots. DM3 is an XIO board which supports standard definition and high definition video standards. Although the same basic design as DM2/DM3 on the OCTANE2 or Onyx3200, the board has a different form factor and installs in a dedicated slot in either the Tezro Tower chassis, or in the second rack-mount unit of a rack-mount system, thus it does not take up a PCI slot. It also uses the same Video Break Out Box (VBOB) as DM2/DM3. The difference between DM2 on the OCTANE2 and DM3 on Onyx class machines, apart from a different mounting bracket, is that DM2 would not support 4:4:4 HD transfers at 10 bits per component, since these require 33% more bandwidth than 8 bit transfers. This is due to limitations in the memory bandwidth of the OCTANE2. With its increased memory bandwidth, 4:4:4 10 bit HD transfers are supported by DM3 in the Tezro chassis. Furthermore, there should be enough bandwidth to support more than simple capture or output at HD resolutions, possibly allowing updating a graphics display, or perhaps some simple operations on the images as they flow through memory.
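
One plausible reading of the 33% figure (an assumption on my part, not something SGI states) is that three 10-bit components are padded out to a 32-bit word in memory, so each pixel takes 4 bytes instead of the 3 bytes needed for three 8-bit components. A short sketch of that arithmetic:

```python
def packed_bytes_per_pixel(components, bits_per_component, word_bits=8):
    """Bytes per pixel when the pixel is padded up to a multiple of word_bits."""
    bits = components * bits_per_component
    padded_bits = -(-bits // word_bits) * word_bits  # round up to whole words
    return padded_bits // 8


rgb8 = packed_bytes_per_pixel(3, 8)                  # 24 bits -> 3 bytes
rgb10 = packed_bytes_per_pixel(3, 10, word_bits=32)  # 30 bits padded to 32 -> 4 bytes (assumed packing)

extra = (rgb10 / rgb8 - 1) * 100
print(f"{extra:.0f}% more bandwidth")  # 33% more bandwidth
```

If the 30 bits were packed tightly with no padding, the increase would be 25% rather than 33%, so the padded layout matches the figure quoted above more closely.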

A hardware graphics to video path allows a broadcast monitor to be connected to the graphics output of a workstation and generate a video signal from a portion of the display without imposing any overhead on the graphics adapter. This used to require specialized support in the display back end of the graphics adapter, or a dedicated connection between graphics and video. The generalization of the DVI standard for outputting digital signals from graphics adapters makes it possible to use an external device to extract the video signal from the DVI output. SD-only hardware graphics to video support is available through the use of an optional PCI card, the DM7. Combined with DM6, this can be used to put together a less expensive SD-only system. For SD and HD support, the DM5 daughter board can be installed in the VBOB shared with the DM3 video board. It accepts the DVI output from V12 (equipped with the optional DCD dual channel output option board) and sends it to VBOB, which can output it as a serial digital SD or HD video signal.

Tezro Conclusion

Tezro should prove a great successor to OCTANE2 for editing and compositing applications, allowing significant gains in performance and system flexibility and expandability. The reuse of existing graphics, audio and video components should make it easier for solution providers to support the platform in a timely and cost effective way.

Onyx4 UltimateVision

Large Pipes versus Scalability

Last fall, SGI released InfiniteReality4, the latest version of InfiniteReality graphics. IR provides features and capabilities which are still unmatched in other systems today, but as is the case for V12, it does not benefit from the “programmability revolution” which happened in the last few years in graphics hardware design, and thus implements a fixed function OpenGL pipeline (a very feature-rich one). Also, although IR systems can be scaled to support multiple pipes, each pipe can typically only drive a single display, so unless you have an application which can make use of multiple displays (such as a flight simulator), the scalability of IR graphics is limited by the fact that you can only have one Geometry Engine board and up to four Raster Manager boards per pipe.

Here again, the spread of the DVI standard for digital video output made it possible to consider taking the output of more than one graphics adapter and combining these as a post process to generate a single, larger image. Earlier approaches to combining the output of multiple graphics adapters relied on expensive, dedicated interfaces: DVI is inexpensive and can be implemented with off the shelf components. InfinitePerformance graphics on Onyx-class systems was the first implementation of this scalable graphics approach: the DVI output of one to four V12 graphics adapters is fed to a Compositor box, which then generates a single raster output to drive a display. For instance, each of the V12 pipes can generate one quarter of the complete image, and the Compositor will stitch these 4 quadrants together. In fact, the regions handled by each V12 pipe do not need to have the same size, so a pipe which is handling a busier region of the screen could be given a smaller portion of the image to handle, thus spreading the load more evenly across the pipes. This automatically takes care of scaling fill performance, that is the rate at which a graphics adapter can draw textured, shaded pixels. And if the application is sufficiently smart to send only the relevant geometry data to each of the pipes (either on its own, or through the use of a software toolkit which will take care of this partitioning), it will also scale geometry performance, as the geometry handling front end of each board will only need to process the geometry destined for that pipe.
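
The idea of giving a busier region a smaller share of the screen can be sketched in a few lines. This example is purely illustrative: it assumes the screen is cut into vertical strips and that the application has some per-column estimate of rendering cost. Neither assumption comes from SGI, and `balanced_strips` is a made-up name, not part of any SGI toolkit.

```python
from bisect import bisect_left
from itertools import accumulate


def balanced_strips(column_cost, pipes):
    """Split screen columns into contiguous strips of roughly equal total cost, one per pipe."""
    if pipes < 1 or len(column_cost) < pipes:
        raise ValueError("need at least one column per pipe")
    prefix = list(accumulate(column_cost))  # running total of cost, left to right
    total = prefix[-1]
    cuts = [0]
    for k in range(1, pipes):
        target = total * k / pipes
        cut = bisect_left(prefix, target) + 1           # first column boundary reaching the target
        cut = max(cut, cuts[-1] + 1)                    # every strip gets at least one column
        cut = min(cut, len(column_cost) - (pipes - k))  # leave room for the remaining strips
        cuts.append(cut)
    cuts.append(len(column_cost))
    return list(zip(cuts[:-1], cuts[1:]))


# Six quiet columns followed by two busy ones, shared between two pipes.
print(balanced_strips([1, 1, 1, 1, 1, 1, 3, 3], pipes=2))  # [(0, 6), (6, 8)]
```

The second pipe ends up with only a quarter of the columns, but both pipes carry the same estimated load.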

Commodity Graphics and Scalability

UltimateVision graphics represents the next logical step in this scalable graphics approach: instead of using V12 graphics as the basic building block, SGI uses commodity graphics adapters, thus taking advantage of the economies of scale and rapid turnaround of the commodity graphics industry, and in particular benefiting from the vertex and pixel-level programmability of recent designs. Software layers hide some of the tasks of decomposing a scene to be rendered onto multiple sub-pipes. As is the case with InfinitePerformance, a Compositor takes the DVI outputs of these commodity boards and stitches the sub-images together to create the final image to be displayed. And since the major vendors of commodity graphics adapters are on very short product development cycles, it should be possible for UltimateVision to benefit from fairly rapid product revision cycles, gaining additional features and performance as new generations of commodity adapters are incorporated into the system.

Taking Advantage of Scalable Graphics

Most 3D graphical applications maintain an internal representation of what needs to be displayed on the screen, often referred to as a database. For each frame to be drawn, the database is traversed in order to extract those objects which are in the field of view of the virtual camera: the graphical primitives which make up these objects are sent to the graphics pipe to be rendered. It doesn’t matter much if the determination of those objects in the field of view is fairly coarse, as long as it includes all of the objects you want to see: the graphics pipeline will take care of clipping out objects outside the field of view. Such a model can clearly take advantage of a scalable graphics architecture: if you think of each of the adapters making up the system as having its own virtual camera, you can determine which objects to send to which graphics pipe, and as long as the tests applied are not too coarse (the worst case being where you send the entire database to each graphics pipe), you will get scalable geometry performance as well as fill performance. And in a system with multiple CPUs, you can run different traversal processes on different CPUs to gain extra performance. SGI’s Performer graphics toolkit includes support for these techniques.
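
A minimal sketch of this kind of coarse, per-pipe test is shown below. It is not Performer code: it assumes each object already has a screen-space bounding box and each pipe owns a rectangular region of the screen, and all of the names are invented for the example.

```python
def overlaps(a, b):
    """True if two (x0, y0, x1, y1) rectangles share any area."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    return ax0 < bx1 and bx0 < ax1 and ay0 < by1 and by0 < ay1


def assign_to_pipes(objects, pipe_regions):
    """Coarse test: send each object to every pipe whose screen region its bounding box touches."""
    work = {pipe: [] for pipe in pipe_regions}
    for name, bbox in objects.items():
        for pipe, region in pipe_regions.items():
            if overlaps(bbox, region):
                work[pipe].append(name)
    return work


# A 100x100 screen split into four quadrants, one per pipe.
pipe_regions = {
    "top_left": (0, 0, 50, 50),
    "top_right": (50, 0, 100, 50),
    "bottom_left": (0, 50, 50, 100),
    "bottom_right": (50, 50, 100, 100),
}
objects = {
    "ship": (10, 10, 30, 30),          # entirely inside one quadrant
    "logo": (40, 40, 60, 60),          # straddles all four quadrants
    "moon": (70, 5, 120, 40),          # partly off screen; the pipe clips the rest
    "offscreen": (200, 200, 210, 210),
}
print(assign_to_pipes(objects, pipe_regions))
```

Objects that straddle a boundary are sent to more than one pipe, which is the duplication cost of a coarse test. The offscreen object is sent nowhere, and each pipe's own clipping handles whatever pokes outside its region.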

But not all graphical tasks fall into this model: for instance, the user of a 2D paintbox may typically only operate on a small part of the display at a time, making it difficult to gain extra performance from simple spatial partitioning of the screen between graphics pipes. Also, in editing and compositing applications, the textures used on surfaces typically change at every frame (since they come from image sequences), and although it is fairly simple to determine which section of geometry falls into which section of the screen, it is not quite as easy to determine which section of a texture image is relevant to a section of the screen, so there may be significant overhead in sending duplicate texture information to the different graphics pipes. Finally, commodity graphics adapters currently have lower pixel transfer rates than V12 graphics, which can make it challenging to exploit the performance gains of these scalable architectures, especially for applications which were developed assuming a single fast graphics pipe, and which need to read back the results of a rendered frame from graphics memory to system memory. On the other hand, the geometry and fill performance of each individual pipe is very high, close to that of a dual RM InfiniteReality pipe. And if an application can take advantage of cached geometry, the geometry performance can in fact be a lot higher than InfiniteReality.

Apart from performance, the most interesting challenge for application developers is taking advantage of the vertex and pixel shader capabilities of the graphics adapters. For one thing, it should make it possible for SGI to offer support for OpenGL 1.4 functionality as well as the upcoming OpenGL Shading Language, which offers a high-level approach to access programmable hardware functionality instead of having to write “assembly code” for a specific graphics architecture. A similar concept has already been successfully implemented by nVidia with its Cg language. Although pixel shaders are most often thought of as a way to implement different per-pixel shading algorithms, they can also be used to perform more or less arbitrary computations on large parallel data sets (especially when floating point pixel formats are used), making them very useful for accelerating image processing algorithms. When considering the implementation of such an optimization, you have to consider whether the performance gain of moving the algorithm to the graphics pipe might not be offset by the overhead of transferring the image data to and from graphics memory, and whether you can use the CPU time you just freed up to do something else.
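
That trade-off can be written down as a simple break-even test. The sketch below is only a model: the image size, timings and transfer rates are hypothetical numbers picked to show both outcomes, not measurements of any SGI system.

```python
def offload_seconds(image_bytes, gpu_seconds, upload_bytes_per_sec, readback_bytes_per_sec):
    """Total time to process one image on the graphics pipe, including both transfers."""
    upload = image_bytes / upload_bytes_per_sec
    readback = image_bytes / readback_bytes_per_sec
    return upload + gpu_seconds + readback


def worth_offloading(cpu_seconds, image_bytes, gpu_seconds,
                     upload_bytes_per_sec, readback_bytes_per_sec):
    """True if the graphics pipe finishes sooner than the CPU, transfers included."""
    total = offload_seconds(image_bytes, gpu_seconds,
                            upload_bytes_per_sec, readback_bytes_per_sec)
    return total < cpu_seconds


# Hypothetical case: an 8 MB image, 80 ms on the CPU versus 5 ms on the graphics pipe.
slow_readback = worth_offloading(0.08, 8_000_000, 0.005, 400e6, 100e6)
fast_readback = worth_offloading(0.08, 8_000_000, 0.005, 400e6, 400e6)
print(slow_readback, fast_readback)  # False True
```

In the first case the readback alone takes as long as the whole CPU computation, so the much faster shader does not help. This model also ignores the second consideration above: even a slower offload can be worthwhile if the freed CPU does useful work in the meantime.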

The Onyx4 System

The Onyx4 system is based on the same building block architecture as the current Onyx350 InfinitePerformance or InfiniteReality systems. The basic brick of the system can hold 2 or 4 CPUs (R14K at 600MHz or R16K at 700MHz) and 4 PCI-X slots on two 100MHz/64 bit PCI busses, and uses the same IO9 base I/O board as the Tezro system: in fact, this basic CPU brick is fairly similar to the one used by the rack-mount Tezro. Up to two graphics pipes can be installed in each module. Two such modules can be connected directly together. For larger configurations, a NUMAlink module is required, which allows up to 8 other modules to be connected together, for a possible total of 32 CPUs and “lots” of graphics pipes. An application has to be structured accordingly to use a large number of CPUs, and adding additional CPUs past a certain point typically does not offer linear performance gains, unless a problem can be decomposed to take advantage of these resources. Since the graphics pipes fit inside the CPU modules and do not require a separate graphics cardcage, a typical Onyx4 system will require a lot less rack space than an Onyx3200 system. This should also reduce the power and cooling requirements of these systems.
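
The point about non-linear gains is usually illustrated with Amdahl's law, which is brought in here as a standard textbook model rather than anything specific to Onyx4. The 90% parallel fraction is an arbitrary value chosen for the example.

```python
def amdahl_speedup(parallel_fraction, cpus):
    """Amdahl's law: overall speedup when only part of the work scales with CPU count."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cpus)


# Assumed workload: 90% of the run time parallelizes perfectly, 10% stays serial.
for cpus in (2, 4, 8, 16, 32):
    print(f"{cpus:2d} CPUs: {amdahl_speedup(0.9, cpus):.2f}x")
```

With these numbers 4 CPUs give about a 3.1x speedup but 32 CPUs only about 7.8x, which is why the structure of the application matters more than the CPU count in the larger configurations.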

Video I/O uses the DM3 video board already supported on Onyx 3200 and now on Tezro, which makes it easier to port existing applications (a specific version of the CPU module must be selected which includes a dedicated slot for the DM3). A hardware graphics to video path is provided by installing a DM5 board in the Video Break Out Box (VBOB) shared with DM3, as on OCTANE2 and Tezro. Audio uses the same PCI RAD audio board as discussed previously. All of this is good news for audio/video applications making the transition from previous platforms.

Onyx4 Conclusion

The Onyx4 system, excluding the graphics, represents a natural evolution of the high-end SGI visualization systems, and it should be fairly straightforward for existing editing and compositing applications to take advantage of the extra memory bandwidth, faster CPUs and better I/O capabilities, while relying on familiar interfaces for audio and video I/O. UltimateVision graphics, on the other hand, is a radically new approach, and represents a challenge: a non-trivial amount of work may be required to tap the performance increase of scalable graphics, and although programmable shaders hold great promise, applications have to be explicitly modified to take advantage of them. The greatly reduced cost of this approach, combined with the advantages of programmable shaders, may provide enough incentive to justify the effort.

Jean-Francois Panisset


Posted by Mike Seymour ON July 14, 2003
