Nvidia Releases CUDA 6.5 For 64-bit ARM
Nvidia has released CUDA – its parallel programming framework that lets developers run code on GPUs – to server vendors in order to get 64-bit ARM cores into the high performance computing (HPC) market.
The firm said today that ARM64 server processors, which are designed for microservers and web servers because of their energy efficiency, can now take on HPC workloads when paired with GPU accelerators through the Nvidia CUDA 6.5 parallel programming framework, which adds support for 64-bit ARM processors.
“Nvidia’s GPUs provide ARM64 server vendors with the muscle to tackle HPC workloads, enabling them to build high-performance systems that maximise the ARM architecture’s power efficiency and system configurability,” the firm said.
The first GPU-accelerated ARM64 software development servers will be available in July from Cirrascale and E4 Computer Engineering, with production systems expected to ship later this year. The Eurotech Group also plans to ship production systems later this year.
Cirrascale’s system will be the RM1905D, a high density two-in-one 1U server with two Tesla K20 GPU accelerators, which the firm claims provides high performance and low total cost of ownership for private cloud, public cloud, HPC and enterprise applications.
E4’s EK003 is a production-ready, low-power 3U dual-motherboard server appliance with two Tesla K20 GPU accelerators designed for seismic, signal and image processing, video analytics, track analysis, web applications and MapReduce processing.
Eurotech’s system is an “ultra-high density”, energy efficient and modular Aurora HPC server configuration, based on proprietary Brick Technology and featuring direct hot liquid cooling.
Featuring Applied Micro X-Gene ARM64 CPUs and Nvidia Tesla K20 GPU accelerators, the new ARM64 servers will provide customers with an expanded range of efficient, high-performance computing options to drive compute-intensive HPC and enterprise data centre workloads, Nvidia said.
Nvidia added, “Users will immediately be able to take advantage of hundreds of existing CUDA-accelerated scientific and engineering HPC applications by simply recompiling them for ARM64 systems.”
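To give a sense of what “simply recompiling” means in practice, here is a minimal, hypothetical CUDA vector-add program. This is not Nvidia’s code and the sizes are illustrative, but nothing in it is x86-specific, so in principle the same source only needs to be rebuilt with an ARM64-hosted nvcc under CUDA 6.5.

// vector_add.cu: a hypothetical, minimal CUDA program with nothing x86-specific in it,
// so the same source should only need to be rebuilt with an ARM64-hosted nvcc.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void vectorAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

int main()
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Host buffers.
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    // Device buffers.
    float *d_a, *d_b, *d_c;
    cudaMalloc((void **)&d_a, bytes);
    cudaMalloc((void **)&d_b, bytes);
    cudaMalloc((void **)&d_c, bytes);

    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Enough 256-thread blocks to cover all n elements.
    vectorAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    printf("c[0] = %f\n", h_c[0]);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}

On an x86 HPC box the same file would be built the same way; the point of Nvidia’s claim is that the CUDA source does not change, only the host toolchain does.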
ARM said that it is working with Nvidia to “explore how we can unite GPU acceleration with novel technologies” and drive “new levels of scientific discovery and innovation”.
ARM To Focus On 64-bit SoCs
ARM announced its first 64-bit cores a while ago and SoC makers have already rolled out several 64-bit designs. However, apart from Apple, nobody has consumer-oriented 64-bit ARM devices on the market just yet. They are slowly starting to show up, and ARM says the transition to 64-bit parts is accelerating. That said, the first wave of 64-bit ARM parts is not going after the high-end market.
Is 64-bit support on entry-level SoCs just a gimmick?
This trend raises a rather obvious question – are low-end ARMv8 parts just a marketing gimmick, or do they really offer a significant performance gain? There is no straight answer at this point. It will depend on Google and the chipmakers themselves, as well as the phone makers.
Qualcomm announced its first 64-bit part late last year. The Snapdragon 410 won’t turn many heads. It is going after $150 phones and it is based on Cortex A53 cores. It also has LTE, which makes it rather interesting.
MediaTek is taking a similar approach. Its quad-core MT6732 and octa-core MT6752 parts are Cortex A53 designs, too. Both sport LTE connectivity.
Qualcomm and MediaTek appear to be going after the same market – $100 to $150 phones with LTE and quad-core 64-bit stickers on the box. Marketers should like the idea, as they’re getting a few good buzzwords for entry-level gear.
However, we still don’t know much about their real-world performance, and nobody should expect anything spectacular. The Cortex A53 is basically the 64-bit successor to the frugal Cortex A7. It has a bit more cache and 40-bit physical addressing, and it ends up somewhat faster than the A7, but not by much. ARM says the A7 delivers 1.9 DMIPS/MHz per core, while the A53 churns out 2.3 DMIPS/MHz. That puts it in the ballpark of the good old Cortex A9. The first consumer-oriented quad-core Cortex A9 part was Nvidia’s Tegra 3, so in theory a Cortex A53 quad-core could be as fast as a Tegra 3 clock-for-clock, and at 28nm we should see somewhat higher clocks, along with better graphics.
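As a rough back-of-the-envelope illustration (the clock speeds here are ours, not ARM’s): 2.3 divided by 1.9 works out to roughly a 20 per cent per-clock advantage over the A7. A hypothetical quad-core A53 at 1.5GHz would deliver about 4 × 1,500MHz × 2.3 = 13,800 DMIPS, which is indeed in the same league as four Cortex A9 cores (around 2.5 DMIPS/MHz) at 1.4GHz, or roughly 14,000 DMIPS.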
That’s not bad for $100 to $150 devices. LTE support is just the icing on the cake. Keep in mind that the Cortex A7 is ARM’s most efficient 32-bit core, hence we expect nothing less from the Cortex A53.
The Cortex A57 conundrum
Speaking to CNET’s Brooke Crothers, ARM executive vice president of corporate strategy Tom Lantzsch said the company was surprised by strong demand for 64-bit designs.
“Certainly, we’ve had big uptick in demand for mobile 64-bit products. We’ve seen this with our [Cortex] A53, a high-performance 64-bit mobile processor,” Lantzsch told CNET.
He said ARM has been surprised by the pace of 64-bit adoption, with mobile parts coming from Qualcomm, MediaTek and Marvell. He added that he hopes to see 64-bit phones by Christmas, although we suspect the first entry-level products will appear much sooner.
Lantzsch points out that even 32-bit code will run more efficiently on 64-bit ARMv8 parts. As software support improves, the performance gains will become more evident.
But where does this leave the Cortex A57? It is supposed to replace the Cortex A15, which had a few teething problems. Like the A15, it is a relatively big core. The A15 was simply too big and impractical on the 32nm node; on 28nm it is better, but not perfect. It is still a huge core, and its market success has been limited.
As a result, it’s highly unlikely that we will see any 28nm Cortex A57 parts. Qualcomm’s upcoming Snapdragon 810 is the first consumer-oriented A57 SoC. It is a 20nm design and it is coming later this year, just in time for Christmas, as ARM puts it. However, although the Snapdragon 810 will be ready by the end of the year, the first phones based on the new chip are expected to ship in early 2015.
While we will be able to buy 64-bit Android (and possibly Windows Phone) devices before Christmas, most if not all of them will be based on the A53. That’s not necessarily a bad thing. Consumers won’t have to spend $500 to get a 64-bit ARM device, so the user base could start growing long before high-end parts start shipping, thus forcing developers and Google to speed up 64-bit development.
If rumours are to be believed, Google is doing just that, and it is not shying away from small 64-bit cores. The search giant is reportedly developing a $100 Nexus phone for emerging markets, said to be based on MediaTek’s MT6732 clocked at 1.5GHz. It sounds interesting, provided the rumour turns out to be true.
Nvidia Launching New Cards
We weren’t expecting this and it is just a rumour, but reports are emerging that Nvidia is readying two new cards for the winter season. AMD, of course, is launching new cards four weeks from now, so it is possible that Nvidia will try to counter them.
The big question is with what?
VideoCardz claims one of the cards is an Ultra, possibly the GTX Titan Ultra, while the second one is a dual-GPU job, the Geforce GTX 790. The Ultra is supposedly based on GK110, but with the full 2880 CUDA cores unlocked, up from the 2688 on the Titan.
The GTX 790 is said to feature two GK110 GPUs, but Nvidia will probably have to clip their wings to get a reasonable TDP.
We’re not entirely sure this is legit. It is plausible, but that doesn’t make it true. It would be good for Nvidia’s image, especially if the revamped GK110 products manage to steal the performance crown from AMD’s new Radeons. However, with such specs, they would end up quite pricey and Nvidia wouldn’t sell that many of them – most enthusiasts would probably be better off waiting for Maxwell.
Nvidia’s CUDA 5.5 Available
Nvidia has made its CUDA 5.5 release candidate, which supports ARM-based processors, available for download.
Nvidia has been aggressively pushing its CUDA programming language as a way for developers to exploit the floating point performance of its GPUs. Now the firm has announced the availability of a CUDA 5.5 release candidate, the first version of the language that supports ARM-based processors.
Aside from ARM support, Nvidia has improved Hyper-Q support and now allows developers to prioritise MPI workloads. The firm also touted improved performance analysis and better cross-compilation performance on x86 processors.
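Prioritising MPI workloads is about ranking work submitted from different processes; the closest developer-visible knob within a single process is stream priorities, which appeared in this CUDA generation for compute capability 3.5 parts. The sketch below is hypothetical – the kernels are placeholders rather than anything from Nvidia – but it shows the shape of the API.

// stream_priority.cu: a hypothetical sketch of prioritising one stream of work over
// another with CUDA stream priorities; the kernels below are placeholders.
#include <cuda_runtime.h>

__global__ void latencySensitiveKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;        // stand-in for urgent work
}

__global__ void backgroundKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;        // stand-in for bulk background work
}

int main()
{
    const int n = 1 << 20;
    float *d_a, *d_b;
    cudaMalloc((void **)&d_a, n * sizeof(float));
    cudaMalloc((void **)&d_b, n * sizeof(float));

    // Ask the device what priority range it supports
    // (numerically lower values mean higher priority).
    int leastPriority, greatestPriority;
    cudaDeviceGetStreamPriorityRange(&leastPriority, &greatestPriority);

    cudaStream_t highPrio, lowPrio;
    cudaStreamCreateWithPriority(&highPrio, cudaStreamNonBlocking, greatestPriority);
    cudaStreamCreateWithPriority(&lowPrio, cudaStreamNonBlocking, leastPriority);

    // When both streams have blocks ready, the scheduler favours the
    // high-priority stream.
    latencySensitiveKernel<<<(n + 255) / 256, 256, 0, highPrio>>>(d_a, n);
    backgroundKernel<<<(n + 255) / 256, 256, 0, lowPrio>>>(d_b, n);

    cudaDeviceSynchronize();
    cudaStreamDestroy(highPrio);
    cudaStreamDestroy(lowPrio);
    cudaFree(d_a);
    cudaFree(d_b);
    return 0;
}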
Ian Buck, GM of GPU Computing Software at Nvidia, said, “Since developers started using CUDA in 2006, successive generations of better, exponentially faster CUDA GPUs have dramatically boosted the performance of applications on x86-based systems. With support for ARM, the new CUDA release gives developers tremendous flexibility to quickly and easily add GPU acceleration to applications on the broadest range of next-generation HPC platforms.”
Nvidia’s support for ARM processors in CUDA 5.5 is an indication that it will release CUDA-enabled Tegra processors in the near future. However, outside of the firm’s own Tegra processors, CUDA support is largely useless, as almost all other chip designers have chosen OpenCL as the programming language for their GPUs.
Nvidia did not say when it will release CUDA 5.5, but in the meantime the firm’s release candidate supports Windows, Mac OS X and just about every major Linux distribution.
Are CUDA Applications Limited?
Acceleware said at Nvidia’s GPU Technology Conference (GTC) today that most algorithms that run on GPGPUs are bound by GPU memory size.
Acceleware is partly funded by Nvidia to provide developer training for CUDA, helping to sell the language to those who are used to traditional C and C++ programming. The firm said that most CUDA algorithms are now limited by GPU local memory size rather than by GPU computational performance.
Both AMD and Nvidia provide general purpose GPU (GPGPU) accelerator parts that deliver significantly faster computational processing than traditional CPUs; however, they have only between 6GB and 8GB of local memory, which constrains the size of the dataset the GPU can process. While developers can push more data from system main memory, the latency cost negates the raw performance benefit of the GPU.
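The usual workaround is to stream the dataset through the card in slices, overlapping transfers with computation so that at least part of that latency is hidden. The sketch below is hypothetical – the sizes and the computation are placeholders, not something Acceleware showed – but it illustrates the double-buffering pattern.

// chunked_stream.cu: a hypothetical sketch of pushing a dataset larger than GPU memory
// through the device in chunks, double-buffering so that the copy of one chunk
// overlaps the processing of the previous one.
#include <cuda_runtime.h>

__global__ void process(float *chunk, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) chunk[i] = chunk[i] * chunk[i];   // placeholder computation
}

int main()
{
    const size_t totalElems = 1ull << 28;   // full dataset, kept in host memory
    const size_t chunkElems = 1ull << 22;   // slice small enough to fit on the GPU
    float *h_data;
    float *d_chunk[2];
    cudaStream_t stream[2];

    // Pinned host memory is needed for genuinely asynchronous copies.
    cudaMallocHost((void **)&h_data, totalElems * sizeof(float));
    for (size_t i = 0; i < totalElems; ++i) h_data[i] = 1.0f;
    for (int i = 0; i < 2; ++i) {
        cudaMalloc((void **)&d_chunk[i], chunkElems * sizeof(float));
        cudaStreamCreate(&stream[i]);
    }

    // Alternate between the two buffers/streams: while one chunk is being processed,
    // the next is already being copied in on the other stream.
    int s = 0;
    for (size_t off = 0; off < totalElems; off += chunkElems, s ^= 1) {
        size_t n = (totalElems - off < chunkElems) ? totalElems - off : chunkElems;
        cudaMemcpyAsync(d_chunk[s], h_data + off, n * sizeof(float),
                        cudaMemcpyHostToDevice, stream[s]);
        process<<<((int)n + 255) / 256, 256, 0, stream[s]>>>(d_chunk[s], (int)n);
        cudaMemcpyAsync(h_data + off, d_chunk[s], n * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[s]);
    }
    cudaDeviceSynchronize();

    for (int i = 0; i < 2; ++i) { cudaFree(d_chunk[i]); cudaStreamDestroy(stream[i]); }
    cudaFreeHost(h_data);
    return 0;
}

The catch, and the point Acceleware is making, is that this only pays off if each chunk involves enough computation to hide the PCIe transfer time; for light kernels the copies dominate and the GPU’s raw advantage evaporates.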
Kelly Goss, training program manager at Acceleware, said that “most algorithms are memory bound rather than GPU bound” and “maximising memory usage is key” to optimising GPGPU performance.
She further said that developers need to understand and take advantage of the memory hierarchy of Nvidia’s Kepler GPUs, and look at ways of reducing the number of memory accesses per line of GPU code.
The point Goss was making is that GPU computation is cheap in terms of clock cycles compared with the time it takes to fetch data from the GPU’s local memory, let alone the time it takes to load GPU memory from system main memory.
Goss, talking to a room full of developers, proceeded to outline some of the performance characteristics of the memory hierarchy in Nvidia’s Kepler GPU architecture, showing the level of detail that CUDA programmers need to pay attention to if they want to extract the full performance potential from Nvidia’s GPGPU computing architecture.
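To make that concrete, here is the sort of textbook example a CUDA training course might use – a generic illustration, not one of Goss’s slides: a tiled matrix multiply that stages data in on-chip shared memory so each value is fetched from slow global memory once per tile rather than once per multiply-add.

// tiled_matmul.cu: a generic illustration of exploiting the memory hierarchy.
// Each block stages TILE x TILE sub-matrices in shared memory, cutting global
// memory traffic per output element by a factor of TILE.
// (Inputs are left uninitialised; this is a structural sketch, assuming n is a
// multiple of TILE.)
#include <cuda_runtime.h>

#define TILE 16

__global__ void matMulTiled(const float *A, const float *B, float *C, int n)
{
    __shared__ float As[TILE][TILE];   // on-chip, shared by the thread block
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < n / TILE; ++t) {
        // One global-memory load per element per tile...
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();

        // ...then TILE multiply-adds served entirely from shared memory.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * n + col] = acc;
}

int main()
{
    const int n = 512;
    const size_t bytes = (size_t)n * n * sizeof(float);
    float *A, *B, *C;
    cudaMalloc((void **)&A, bytes);
    cudaMalloc((void **)&B, bytes);
    cudaMalloc((void **)&C, bytes);

    dim3 block(TILE, TILE);
    dim3 grid(n / TILE, n / TILE);
    matMulTiled<<<grid, block>>>(A, B, C, n);
    cudaDeviceSynchronize();

    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}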
Given Goss’s observation that algorithms running on Nvidia’s GPGPUs are often constrained by local memory size rather than by the GPU itself, Nvidia might want to look at simplifying the tiers of memory involved and increasing the amount of GPU local memory, so that CUDA software developers can process larger datasets.
Nvidia Speaks On GPGPU Performance Claims
Nvidia has said that most of the outlandish performance increase figures touted by GPGPU vendors were down to poor original code rather than the sheer brute force computing power provided by GPUs.
Both AMD and Nvidia have been using real-world code examples and projects to promote the performance of their respective GPGPU accelerators for years, but now it seems some of the eye-popping figures, including speed-ups of 100x or 200x, were not down to just the computing power of GPGPUs. Sumit Gupta, GM of Nvidia’s Tesla business, said that such figures were generally down to starting with unoptimised CPU code.
During Intel’s Xeon Phi pre-launch press conference call, the firm cast doubt on some of the order-of-magnitude speed-up claims that had been bandied about for years. Now Gupta has told The INQUIRER that while those large speed-ups did happen, they were only possible because the original CPU code was poorly optimised to begin with, so the bar was set very low.
Gupta said, “Most of the time when you saw the 100x, 200x and larger numbers those came from universities. Nvidia may have taken university work and shown it and it has a 100x on it, but really most of those gains came from academic work. Typically we find when you investigate why someone got 100x [speed up] it is because they didn’t have good CPU code to begin with. When you investigate why they didn’t have good CPU code you find that typically they are domain scientists, not computer science guys – biologists, chemists, physicists – and they wrote some C code and it wasn’t good on the CPU. It turns out most of those people find it easier to code in CUDA C or CUDA Fortran than they do to use MPI or Pthreads to go to multi-core CPUs, so CUDA programming for a GPU is easier than multi-core CPU programming.”
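A hypothetical example of what Gupta is describing: for a simple loop such as SAXPY, the CUDA version is little more than the original C with the loop index replaced by a thread index, whereas the Pthreads or MPI route means writing explicit thread management or message passing around the same arithmetic.

// saxpy.cu: a hypothetical illustration, not Nvidia's code. The CUDA kernel keeps the
// loop body of the plain C version; only the loop index changes into a thread index.
#include <cuda_runtime.h>

// The plain C loop a domain scientist might write first.
void saxpy_cpu(int n, float a, const float *x, float *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

// The CUDA version of the same computation.
__global__ void saxpy_gpu(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

int main()
{
    const int n = 1 << 20;
    float *x, *y;
    cudaMalloc((void **)&x, n * sizeof(float));
    cudaMalloc((void **)&y, n * sizeof(float));

    // Launch enough 256-thread blocks to cover all n elements.
    saxpy_gpu<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);
    cudaDeviceSynchronize();

    cudaFree(x);
    cudaFree(y);
    return 0;
}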