Re: Pentium 4 and misaligned doubles

Subject: Re: Pentium 4 and misaligned doubles
From: Mike Pall <mikelu-0508@...>
Date: Tue, 16 Aug 2005 16:35:49 +0200

Hi,

Rici Lake wrote:
> [...] a Pentium 4 has a penalty for reading misaligned doubles 
> which cross a cache line, and a huge penalty for writing.

The results on a PIII are comparable, but with a periodicity
of 8 and two bumps ... something like: nnnnnXnXnnnnnXnX

I noticed that the results are a lot more pronounced in LuaJIT
because the dispatch overhead no longer hides the misalignment
penalties. So I looked into a proper solution:

Of course, compiling with -malign-double is the easiest way to
solve this with GCC. Alas, this breaks the x86 ABI. I think this
doesn't matter, since the Lua core never passes structures or
unions that contain doubles to C library functions or back.

But still ... the recommended way to solve this is to add
  __attribute__ ((aligned(8)))
to either lua_Number or the Value union. Strangely enough, this
doesn't work with GCC 3.3.5. I think this is a bug, but maybe I'm
just misreading the docs.

The only way I could make it work is with:

  typedef struct lua_TValue {
    TValuefields;
  } __attribute__ ((aligned(16))) TValue;

A bit awkward, but solves both the stack alignment and the array
alignment problem.

The net effect is that stack slots and array slots of tables grow
from 12 to 16 bytes and hash slots of tables grow from 28 to 32
bytes. Since multiplying/dividing by 16 or 32 is just a plain
shift, other parts of the code are faster, too.
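
To make the size change concrete, here is a minimal sketch with a stand-in for the real TValue. DemoValue, PlainTValue, AlignedTValue and the helper names are hypothetical and not taken from the Lua source; GCC or a compatible compiler is assumed. On 32-bit x86 the plain struct would be 12 bytes, while on targets that already align doubles to 8 bytes it is 16 either way.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for Lua's Value union (assumption, not the real one). */
typedef union { void *p; double n; int b; } DemoValue;

/* Without the attribute: 12 bytes on 32-bit x86, 16 where doubles are 8-aligned. */
typedef struct { DemoValue value; int tt; } PlainTValue;

/* With the attribute: padded to 16 bytes and aligned to 16 bytes. */
typedef struct { DemoValue value; int tt; }
  __attribute__ ((aligned(16))) AlignedTValue;

/* Byte offset of slot i; a plain shift because the slot size is 16. */
static size_t aligned_slot_offset(size_t i) { return i << 4; }

/* Nonzero if slot i of the array starts on a 16-byte boundary. */
static int slot_is_16_aligned(const AlignedTValue *base, size_t i) {
  return ((uintptr_t)&base[i] & 15u) == 0;
}

/* A small "stack" of slots; the type's alignment carries over to the array. */
static AlignedTValue demo_stack[8];

static void check_layout(void) {
  assert(sizeof(AlignedTValue) == 16);
  assert(sizeof(PlainTValue) <= sizeof(AlignedTValue));
  assert(aligned_slot_offset(3) == 3 * sizeof(AlignedTValue));
  assert(slot_is_16_aligned(demo_stack, 5));
}
```

Calling check_layout() from a test program should pass on any target where the attribute is honoured, since every slot then starts on a 16-byte boundary and the double inside it can never straddle a cache line.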

Maybe it would be a good idea to make this a user definable
setting in luaconf.h (near LUA_NUMBER, to make it clear this
is only relevant for doubles, not floats):

#if defined(__GNUC__) && defined(__i386)
#define LUA_TVALUE_ALIGN	__attribute__ ((aligned(16)))
#else
#define LUA_TVALUE_ALIGN
#endif
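
A sketch of how such a macro might be applied to the struct definition. DEMO_TVALUEFIELDS and demo_TValue are hypothetical stand-ins for TValuefields and TValue, so this only illustrates the placement of the macro, not the actual Lua headers.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__GNUC__) && defined(__i386)
#define LUA_TVALUE_ALIGN __attribute__ ((aligned(16)))
#else
#define LUA_TVALUE_ALIGN  /* expands to nothing on other targets */
#endif

/* Hypothetical stand-in for TValuefields (assumption). */
#define DEMO_TVALUEFIELDS \
  union { void *p; double n; int b; } value; int tt

/* The macro sits between the closing brace and the typedef name. */
typedef struct demo_TValue {
  DEMO_TVALUEFIELDS;
} LUA_TVALUE_ALIGN demo_TValue;

/* Size of one slot as the compiler laid it out. */
static size_t demo_tvalue_size(void) { return sizeof(demo_TValue); }

static void demo_check_size(void) {
  /* At least the 8-byte value plus the 4-byte tag, whatever the padding. */
  assert(demo_tvalue_size() >= 12);
}
```

When the macro expands to nothing, the struct keeps whatever layout the platform ABI gives it, so builds on other compilers or architectures are unaffected.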

On a related note: the lua_number2int() optimization should be
turned off if __SSE2__ is defined (which is the case with
-march=pentium4). SSE2 offers a direct/faster way to truncate
doubles to ints (cvttsd2si reg32, xmm64/mem64). And GCC knows
how to use it (unless you override it with inline assembler).
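
A sketch of what such a guard could look like. The function name demo_number2int is hypothetical, and the non-SSE2 branch is only an assumption about the kind of trick being replaced (adding 1.5 * 2^52 and reading the low 32 bits of the result); it is not copied from luaconf.h. Note that this trick rounds to nearest while a C cast truncates, so the two branches agree only on integral inputs.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Convert a double to a 32-bit int. */
static int32_t demo_number2int(double d) {
#if defined(__SSE2__)
  /* Plain cast: with SSE2 enabled the compiler can use cvttsd2si. */
  return (int32_t)d;
#else
  /* Assumed fallback: adding 2^52 + 2^51 leaves the rounded integer in
     the low mantissa bits (valid for |d| < 2^31). */
  double biased = d + 6755399441055744.0;
  uint64_t bits;
  uint32_t lo;
  int32_t result;
  memcpy(&bits, &biased, sizeof bits);   /* well-defined type pun */
  lo = (uint32_t)bits;                   /* low 32 bits of the mantissa */
  memcpy(&result, &lo, sizeof result);   /* reinterpret as signed */
  return result;
#endif
}

static void demo_check_number2int(void) {
  /* Integral inputs give the same answer on both branches. */
  assert(demo_number2int(42.0) == 42);
  assert(demo_number2int(-7.0) == -7);
}
```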

Bye,
     Mike
