Dev.SN Comments | Ask SN: When Is Assembly Worth It?

Ask SN: When Is Assembly Worth It?

posted by Cactus on Saturday March 08 2014, @03:30AM

from the don't-tell-me-upgrade-PCs dept.

Subsentient writes:

"I'm a C programmer and Linux enthusiast. For some time, I've had it on my agenda to build the new version of my i586/Pentium 1 compatible distro, since I have a lot of machines that aren't i686 that are still pretty useful.

Let me tell you, since I started working on this, I've been in hell these last few days! The Pentium Pro was the first chip to support CMOV (Conditional move), and although that was many years ago, lots of chips were still manufactured that didn't support this (or had it broken), including many semi-modern VIA chips, and the old AMD K6.

Just about every package that has to deal with multimedia has lots of inline assembler, and most of it contains CMOV. Most packages let you disable it, either with a switch like ./configure --disable-asm or by tricking the build into thinking your chip doesn't support it, but some of them (like MPlayer, libvpx/vp9) do NOT. This means that although my machines are otherwise full-blown, good, honest x86-32 chips, I cannot use that software at all, because it always builds in bad instructions, thanks to these huge amounts of inline assembly!

Of course, then there's the fact that these packages, that could otherwise possibly build and work on all types of chips, are now limited to what's usually the ARM/PPC/x86 triumvirate (sorry, no SPARC Linux!), and the small issue that inline assembly is not actually supported by the C standard.

Is assembly worth it for the handicaps and trouble that it brings? Personally I am a language lawyer/standard Nazi, so inline ASM doesn't sit well with me for additional reasons."
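For readers who want to see what a disableable assembly path can look like, here is a minimal sketch. It is not taken from MPlayer, libvpx, or any package named above. The macro USE_X86_CMOV_ASM and the function max_int are hypothetical names invented for this illustration, and the asm uses GCC's extended inline assembly syntax. Unless the builder opts in, the plain C path is compiled, and the compiler then emits CMOV only if the -march setting allows it.

```c
/* USE_X86_CMOV_ASM is a hypothetical opt-in switch for this sketch.
 * Leave it undefined (the default) to get the portable C path. */
int max_int(int a, int b)
{
#if defined(USE_X86_CMOV_ASM) && (defined(__i386__) || defined(__x86_64__))
    /* Hand-written path: requires a CPU with CMOV (Pentium Pro or later). */
    __asm__("cmpl %1, %0\n\t"   /* set flags from a - b          */
            "cmovl %1, %0"      /* if a < b, copy b into a       */
            : "+r"(a)
            : "r"(b)
            : "cc");
    return a;
#else
    /* Portable path: valid standard C on any architecture. */
    return a > b ? a : b;
#endif
}
```

Built without -DUSE_X86_CMOV_ASM, this is ordinary C and works on an i586, a SPARC, or anything else with a C compiler.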


Re:Premature optimisation is the root of all evil (Score: 5, Informative)

by hankwang (100) on Saturday March 08 2014, @07:53AM (#13161)

Knowing more than the compiler is *very* difficult.

Depends. Around 2001 I tried to implement a calculation algorithm in C/C++ that would use table lookups with interpolation to avoid spending too much time calculating some expensive function. It was compiled with gcc -O2 and, I think, ran on an Athlon. It was something like this:

int i = (int) floor((x - x0)*(1.0/step_size));
double y = table[i];
// do something with y

It was slow as a dog. It turned out that the combination of gcc and glibc turned this into something like

save FPU status
set FPU status to "floor" rounding
round
restore FPU status
save FPU status
set FPU status to "truncate" rounding
assign to int
restore FPU status

Each of these four FPU status changes would flush the entire CPU/FPU pipeline, and this happened in an inner loop that was called a hundred million times. Replacing this with a few lines of assembler sped up the program by a factor of 10 or so.
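As an illustration only, and not the assembler actually used back then, the same effect can often be had in plain C by skipping the floor() call and correcting a truncating cast by hand. The names floor_to_int and lookup are made up for this sketch, and it assumes the scaled value fits in an int and that the caller keeps the index inside the table. On x87 the cast itself may still switch the rounding mode once; C99's lrint(), which converts using the current rounding mode, is another option worth measuring.

```c
/* Floor of t as an int, using only a truncating cast and a compare.
 * Assumes t is finite and fits in the range of int. */
int floor_to_int(double t)
{
    int i = (int)t;        /* truncates toward zero                   */
    return i - (t < i);    /* step down for negative non-integers     */
}

/* Linear interpolation in a table sampled at x0, x0 + step, ...
 * inv_step is 1.0/step. The caller must ensure that both table[i]
 * and table[i + 1] exist for the given x. */
double lookup(const double *table, double x0, double inv_step, double x)
{
    double t = (x - x0) * inv_step;
    int i = floor_to_int(t);
    double frac = t - i;   /* position between the two samples, 0..1  */
    return table[i] + frac * (table[i + 1] - table[i]);
}
```

With a table of x*x sampled at x = 0, 1, 2, 3, calling lookup(table, 0.0, 1.0, 1.5) interpolates halfway between 1 and 4.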

I'm also not so convinced that gcc and g++ do a very good job at emitting vector instructions (e.g. SSE) all by themselves. If I write a loop such as

float x[8], y[8];
// ...
for (i = 0; i < 8; ++i)
    x[i] = x[i]*0.5 + y[i];

Just tried on gcc 4.5 (-O2 -S -march=core2 -mfpmath=sse). It will happily use the SSE registers but not actually use vector instructions. With -mavx I get vector instructions, but only if the compiler knows the size of the array at compile-time. If I do something like this with a variable array size n, and decide that n=10000 during program execution, it will not vectorize at all even with -mavx, and that is even if I ensure that the compiler can assume non-aliased data.

Now gcc of course has a zillion options to tweak the code generation, but I can imagine that at some point, someone prefers to simply write assembler code in order to make sure that vectorization is used in places where it makes sense.
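For comparison, here is a sketch of the variable-length version of that loop, written the way an auto-vectorizer usually wants it. The function name scale_add is invented for this example. C99 restrict is the promise that x and y do not overlap, and the constant is written as 0.5f so that the arithmetic stays in single precision instead of being promoted to double. Whether a given gcc release vectorizes it still depends on the flags: try -O3, or -O2 -ftree-vectorize, and ask the compiler to report what it did with -ftree-vectorizer-verbose on older releases or -fopt-info-vec on newer ones.

```c
#include <stddef.h>

/* x[i] = 0.5 * x[i] + y[i] for i in [0, n).
 * restrict tells the compiler that x and y never overlap. */
void scale_add(float *restrict x, const float *restrict y, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        x[i] = x[i] * 0.5f + y[i];
}
```

The code stays standard C either way; if the vectorizer declines, the result is merely slower, not wrong.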

--
Avantslash: Slashdot+SoylentNews for mobile [avantslash.org].


Re:Premature optimisation is the root of all evil (Score: 2, Interesting)

by cubancigar11 (330) on Saturday March 08 2014, @12:50PM (#13222)

That looks like quite a common form of code. Have you tried contacting the gcc guys? They would love this kind of info, and maybe we will learn about a way to generate optimized code.

- Re:Premature optimisation is the root of all evil (Score: 2, Interesting)
  
  by hankwang (100) on Saturday March 08 2014, @02:26PM (#13257)
  
  Yes, I contacted them. http://gcc.gnu.org/ml/gcc/2001-09/msg00356.html [gnu.org]
  
  --
  Avantslash: Slashdot+SoylentNews for mobile [avantslash.org].
  
  - Re:Premature optimisation is the root of all evil (Score: 1)
    
    by grub (3668) <soylentnews@grub.net> on Saturday March 08 2014, @05:53PM (#13316)
    
    heh, I thought your nick was a pseudonym ("Hank Wang") until I read your email to the GCC list. :)
    
    --
    ~ Trolling is a art ~
    
Re:Premature optimisation is the root of all evil (Score: 3, Informative)

by mojo chan (266) on Saturday March 08 2014, @01:58PM (#13248)

The problem with GCC not vectorizing code is due to you not telling it all the assumptions you made when you expected it to. GCC will only vectorize when it knows it is absolutely safe to do so, and you need to communicate that. When you wrote your own assembler version, you relied on these same assumptions.

For the specific example you cite, have a look at the FFT code in ffdshow, specifically the ARM assembler code that uses NEON. To get good performance it contains a hell of a lot of duplicated code, since it processes data in power-of-2 block sizes. Specifying -O3 is the signal to the compiler to go nuts and generate massive amounts of unrolled code like that. Even then it might not be worth it in all cases: if the array were only, say, 5 elements long, you might spend more time setting up the vector code than it would save. So what you need to do is write your own functions that break the array down into fixed-size units that can be heavily optimized, just like they did in the ffdshow assembler code. The compiler isn't psychic: unless you tell it, it can't know what kind of data your code will be processing or how big variable-length arrays are likely to be at run time.
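The "fixed size units" idea can be sketched in a few lines of C. This is not the ffdshow code, only an illustration of the pattern; BLOCK, scale_add_block, and scale_add_blocked are names invented here. The inner function has a trip count known at compile time, which is the easy case for the compiler to unroll and vectorize, and a plain scalar loop mops up whatever is left over.

```c
#include <stddef.h>

enum { BLOCK = 8 };  /* block size chosen for illustration only */

/* Fixed-size kernel: the trip count is a compile-time constant. */
static void scale_add_block(float *restrict x, const float *restrict y, float c)
{
    for (int i = 0; i < BLOCK; ++i)
        x[i] = c * x[i] + y[i];
}

/* Variable-length driver: whole blocks first, then a scalar tail. */
void scale_add_blocked(float *restrict x, const float *restrict y,
                       float c, size_t n)
{
    size_t i = 0;
    for (; i + BLOCK <= n; i += BLOCK)
        scale_add_block(x + i, y + i, c);
    for (; i < n; ++i)              /* fewer than BLOCK elements remain */
        x[i] = c * x[i] + y[i];
}
```

For a 5-element array only the tail loop runs, so the short case pays no vector setup cost at all.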

--
const int one = 65536; (Silvermoon, Texture.cs)

- Re:Premature optimisation is the root of all evil (Score: 3, Interesting)
  
  by hankwang (100) on Saturday March 08 2014, @02:38PM (#13262)
  
  The problem with GCC not vectorizing code is due to you not telling it all the assumptions you made when you expected it to. GCC will only vectorize when it knows it is absolutely safe to do so
  
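  For the record, this is the full test code:
  
  /* x must hold at least 8*veclen floats: the loop updates the first
     4*veclen elements using the next 4*veclen. */
  void calc(float x[], float c, int veclen)
  {
      int i, k;
      /* the outer loop just repeats the same kernel 10000 times */
      for (i = 0; i < 10000; ++i) {
          for (k = 0; k < veclen*4; ++k)
              x[k] = c*x[k] + x[k+veclen*4];
      }
  }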
  
  The compiler should know that there cannot be any aliasing issues in the array 'x', so it *is* safe. But I wasn't aware that -O2 and -O3 make such a big difference; with -O3 I do indeed get vector instructions. From now on, my number-crunching code will be compiled with -O3...
  
  --
  Avantslash: Slashdot+SoylentNews for mobile [avantslash.org].
  
  - Re:Premature optimisation is the root of all evil (Score: 3, Informative)
    
    by mojo chan (266) on Saturday March 08 2014, @08:14PM (#13370)
    
    -O2 doesn't make the compiler check whether x is safe from aliasing and so forth, because that analysis is expensive and the resulting code can be problematic to debug on some architectures. Moving to -O3 does turn the check on (in GCC releases of this era, -O3 is the level that enables the auto-vectorizer, -ftree-vectorize), so the compiler uses vector instructions. C can be somewhat expensive to optimize because there is a lot of stuff you can do legally that has to be checked for, and often that involves checking entire modules.
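A small example of the "stuff you can do legally" that the compiler has to rule out. The function add_one is made up for this illustration. Called with two separate arrays it is trivially parallel, but C also allows calling it with dst pointing one element past src, and then every iteration reads the value the previous iteration just wrote.

```c
#include <stddef.h>

/* dst[i] = src[i] + 1 for i in [0, n). No restrict here, so the
 * compiler must allow for dst and src overlapping. */
void add_one(float *dst, const float *src, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i] + 1.0f;
}

/* Legal overlapping call: starting from an all-zero array a[5],
 * add_one(a + 1, a, 4) must leave a holding 0, 1, 2, 3, 4.
 * A vectorizer that loaded a[0..3] in one go before storing would
 * produce 0, 1, 1, 1, 1 instead, which is why it has to prove there
 * is no overlap, or emit a run-time overlap test, before using
 * vector instructions. */
```

Declaring the parameters restrict removes the need for that proof, at the price of making the overlapping call undefined behaviour.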
    
    --
    const int one = 65536; (Silvermoon, Texture.cs)
    