Does Intel Need GPUs For HPCs
Nvidia might have scored a few wins by touting its GPUs in the HPC market, but it is starting to lose ground to the coprocessor, according to Intel’s Diane Bryant.
In an IDC interview Intel’s data center boss said that Nvidia gained an early lead in the market for accelerated HPC workloads when it positioned its GPUs for that task several years ago. However, there is a perception that the processors used for machine learning today are GPUs like those from Nvidia and AMD.
Bryant was a bit miffed when she was asked how Intel can compete in this market without a GPU. She said that the general purpose GPU, or GPGPU, was just another type of accelerator and not one that’s uniquely suited to machine learning.
It is better to look at Knights Landing, which is a coprocessor, but it’s an accelerator for floating point operations, and that’s what a GPGPU is too.
She said that since the release of the first Xeon Phi in 2014, Intel has clawed back 33 percent of the market for HPC workloads that use a floating point accelerator.
“So we’ve won share against Nvidia, and we’ll continue to win share,” she said.
She said that Intel’s share of the machine learning business may be much smaller, but the market is still young.
“Less than one percent of all the servers that shipped last year were applied to machine learning, so to hear Nvidia is beating us in a market that barely exists yet makes me a little crazy,” she said.
Intel will continue to evolve Xeon Phi to make it better at machine learning tasks. She said that there are two aspects to machine learning – training the algorithmic models, and applying those models to the real world in front-end applications. Intel’s FPGAs and its Xeon processors mean Intel has both sides of the equation covered.
But Nvidia’s GPUs are harder for programmers to work with, which could give Intel an edge as ordinary businesses start adopting machine learning. Knights Landing is “self-booting,” which means customers don’t need to pair it with a regular Xeon to boot an OS.
However, Intel’s newest Xeon Phi has a floating point performance of about 3 teraflops, which is a little slow compared to the 5 teraflops for Nvidia’s new GP100.
Courtesy-Fud
Will Arm/Atom CPUs Replace Xeon/Opteron?
Analysts are saying that smartphone chips could one day replace the Xeon and Opteron processors used in most of the world’s top supercomputers. In a paper titled “Are mobile processors ready for HPC?” researchers at the Barcelona Supercomputing Center wrote that less expensive chips have a history of bumping out faster but higher-priced processors in high-performance systems.
In 1993, the list of the world’s fastest supercomputers, known as the Top500, was dominated by systems based on vector processors. They were nudged out by less expensive RISC processors. RISC chips were eventually replaced by cheaper commodity processors like Intel’s Xeon and AMD’s Opteron, and now mobile chips are likely to take over.
The transitions had a common thread, the researchers wrote: Microprocessors killed the vector supercomputers because they were “significantly cheaper and greener,” the report said. At the moment, low-power chips based on designs from ARM fit the bill, but Intel is likely to catch up, so this is not likely to mean the death of x86.
The report compared Samsung’s 1.7GHz dual-core Exynos 5250, Nvidia’s 1.3GHz quad-core Tegra 3 and Intel’s 2.4GHz quad-core Core i7-2760QM – which is a desktop chip, rather than a server chip. The researchers said they found that ARM processors were more power-efficient on single-core performance than the Intel processor, and that ARM chips can scale effectively in HPC environments. On a multi-core basis, the ARM chips were as efficient as Intel x86 chips at the same clock frequency, but Intel was more efficient at the highest performance level, the researchers said.
Do Supercomputers Lead To Downtime?
As supercomputers grow more powerful, they’ll also become more susceptible to failure, thanks to the increased amount of built-in componentry. A few researchers at the recent SC12 conference, held last week in Salt Lake City, offered possible solutions to this growing problem.
Today’s high-performance computing (HPC) systems can have 100,000 nodes or more, with each node built from multiple components of memory, processors, buses and other circuitry. Statistically speaking, all these components will fail at some point, and they halt operations when they do so, said David Fiala, a Ph.D. student at North Carolina State University, during a talk at SC12.
The problem is not a new one, of course. When Lawrence Livermore National Laboratory’s 600-node ASCI (Accelerated Strategic Computing Initiative) White supercomputer went online in 2001, it had a mean time between failures (MTBF) of only five hours, thanks in part to component failures. Later tuning efforts improved ASCI White’s MTBF to 55 hours, Fiala said.
But as the number of supercomputer nodes grows, so will the problem. “Something has to be done about this. It will get worse as we move to exascale,” Fiala said, referring to how supercomputers of the next decade are expected to have 10 times the computational power that today’s models do.
Today’s techniques for dealing with system failure may not scale very well, Fiala said. He cited checkpointing, in which a running program is temporarily halted and its state is saved to disk. Should the program then crash, the system is able to restart the job from the last checkpoint.
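The checkpoint-and-restart idea described above can be sketched in a few lines of Python. This is only an illustration on a single machine, not how any particular supercomputer does it: the function names, the pickle file format, the toy workload (summing step numbers) and the `fail_at` switch used to simulate a crash are all assumptions made for the example.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write the state to a temp file, then rename it over the old checkpoint.

    The rename keeps a crash during the write from leaving a half-written file.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the last saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run_job(path, total_steps, checkpoint_every, fail_at=None):
    """Toy job: add up the step numbers, saving state every few steps.

    fail_at is a hypothetical knob that simulates a node failure at that step.
    """
    state = load_checkpoint(path) or {"step": 0, "total": 0}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated node failure")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["total"]
```

If a run started with `fail_at=7` and `checkpoint_every=5` crashes, calling `run_job` again with the same path resumes from step 5 rather than step 0, so only the two steps after the last checkpoint are repeated.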
The problem with checkpointing, according to Fiala, is that as the number of nodes grows, the amount of system overhead needed to do checkpointing grows as well — and grows at an exponential rate. On a 100,000-node supercomputer, for example, only about 35 percent of the activity will be involved in conducting work. The rest will be taken up by checkpointing and — should a system fail — recovery operations, Fiala estimated.
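To get a feel for why checkpointing eats into useful work as failures become more frequent, here is a rough sketch based on Young's well-known first-order approximation for the checkpoint interval. It is not the model behind Fiala's 35 percent estimate, which the talk summary does not spell out; it ignores restart time, and the function names and any numbers fed into it are assumptions for illustration only.

```python
import math


def young_interval(checkpoint_cost, system_mtbf):
    """Young's first-order estimate of the best time between checkpoints.

    Both arguments use the same time unit (for example, hours).
    """
    return math.sqrt(2.0 * checkpoint_cost * system_mtbf)


def useful_fraction(checkpoint_cost, system_mtbf):
    """Approximate share of machine time left for real work.

    Waste has two parts: time spent writing checkpoints, and work lost
    since the last checkpoint when a failure hits (half an interval on
    average). Restart time is ignored in this simplified model.
    """
    tau = young_interval(checkpoint_cost, system_mtbf)
    waste = checkpoint_cost / tau + tau / (2.0 * system_mtbf)
    return max(0.0, 1.0 - waste)
```

In this simplified model the useful fraction depends only on the ratio of checkpoint cost to system MTBF, so a machine that fails more often, or takes longer to save its state, loses a larger share of its time to overhead.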
Because of all the additional hardware needed for exascale systems, which could be built from a million or more components, system reliability will have to be improved by 100 times in order to keep to the same MTBF that today’s supercomputers enjoy, Fiala said.
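The arithmetic behind that claim can be sketched under a common textbook assumption: components fail independently at a constant rate, and any single failure halts the system. Under that assumption the system MTBF is the component MTBF divided by the number of components. The function names and the figures in the usage note are illustrative, not taken from Fiala's talk.

```python
def system_mtbf(component_mtbf_hours, n_components):
    """System MTBF when any one of n independent components can halt the job."""
    if n_components <= 0:
        raise ValueError("n_components must be positive")
    return component_mtbf_hours / n_components


def required_component_mtbf(target_system_mtbf_hours, n_components):
    """Component MTBF needed to hit a target system MTBF (same assumption)."""
    if n_components <= 0:
        raise ValueError("n_components must be positive")
    return target_system_mtbf_hours * n_components
```

For example, holding the system MTBF fixed while going from 10,000 to 1,000,000 components means `required_component_mtbf` rises by the same factor of 100, which is the kind of reliability improvement Fiala describes.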
Fiala presented technology that he and fellow researchers developed that may help improve reliability. The technology addresses the problem of silent data corruption, when systems make undetected errors writing data to disk.