4 Replies - 911 Views - Last Post: 21 April 2016 - 08:35 AM

#1 Skydiver

Why would 64-bit accesses be slower?

Posted 20 April 2016 - 07:32 AM

I'm very puzzled by this code that I wrote. I expected that processing data (XOR-ing and then storing the results) would be faster 64 bits at a time than 32 bits at a time. I got a significant speed boost, about 5-6 times faster, when I went from processing 8 bits at a time to 32 bits at a time. I was expecting at least another 25% from moving to 64 bits at a time, but my test runs and profiling results show it to be about 20% slower.

Does anybody have any ideas on why the code marked USE_6432_XOR is slower?
        static unsafe int XorBytes(int count,
                                   byte [] inputBuffer, int inputOffset,
                                   byte [] streamBuffer, int streamOffset,
                                   byte [] outputBuffer, int outputOffset)
        {
            Debug.Assert(count % sizeof(uint) == 0);

            fixed(byte * clear = &inputBuffer[inputOffset],
                         crypt = &streamBuffer[streamOffset],
                         cipher = &outputBuffer[outputOffset])
            {
                var end = &clear[count];

                // Using the vanilla 32 bit version is profiling to be
                // faster than the 64-32 bit version. Need to re-examine
                // as processors and JIT compilers get better over the years.
#if USE_6432_XOR
                var src64 = (UInt64 *) clear;
                var pad64 = (UInt64 *) crypt;
                var dst64 = (UInt64 *) cipher;
                var end64 = &src64[count / sizeof(UInt64)];

                while(src64 < end64)
                    *dst64++ = *src64++ ^ *pad64++;

                var src32 = (UInt32 *) src64;
                var pad32 = (UInt32 *) pad64;
                var dst32 = (UInt32 *) dst64;

                while (src32 < end)
                    *dst32++ = *src32++ ^ *pad32++;
#else
                var src = (uint *) clear;
                var pad = (uint *) crypt;
                var dst = (uint *) cipher;
                while (src < end)
                    *dst++ = *src++ ^ *pad++;
#endif
            }
            return count;
        }



Full project source is at github.com: TuringCipher.NET.

I'm running the code on a 64-bit OS and compiling for Any CPU. (As an aside, my Intel i5 2.5GHz laptop runs the same code about 15% faster than my AMD 3.4GHz desktop. Both are running 64-bit Windows 10 and have the same amount of RAM.)


#2 lordofduct

Posted 20 April 2016 - 11:31 AM

I'm not sure why that'd be happening for you.

I took your source and ran it on my office dev machine (i5-4590 @ 3.3 GHz, 8 GB RAM, Win7 64-bit). My results were about the same no matter which version I ran.

With USE_6432_XOR defined:
ReferenceTuring:
00:00:01.7000506   (5.882 MB/sec)
00:00:01.6460438   (6.075 MB/sec)
00:00:01.6499566   (6.061 MB/sec)
00:00:01.6644534   (6.008 MB/sec)
00:00:01.6658013   (6.003 MB/sec)
==> Average: 00:00:01.6650000   (6.006 MB/sec)
TableTuring:
00:00:00.3687474   (27.119 MB/sec)
00:00:00.3520658   (28.404 MB/sec)
00:00:00.3511172   (28.481 MB/sec)
00:00:00.3487048   (28.678 MB/sec)
00:00:00.3503092   (28.546 MB/sec)
==> Average: 00:00:00.3540000   (28.249 MB/sec)
FastTuring:
00:00:00.1733951   (57.672 MB/sec)
00:00:00.1711434   (58.431 MB/sec)
00:00:00.1726415   (57.924 MB/sec)
00:00:00.1711733   (58.42 MB/sec)
00:00:00.1730794   (57.777 MB/sec)
==> Average: 00:00:00.1720000   (58.14 MB/sec)



With USE_6432_XOR not defined (32-bit only):
ReferenceTuring:
00:00:01.7217522   (5.808 MB/sec)
00:00:01.6491707   (6.064 MB/sec)
00:00:01.6452909   (6.078 MB/sec)
00:00:01.6543601   (6.045 MB/sec)
00:00:01.6431826   (6.086 MB/sec)
==> Average: 00:00:01.6630000   (6.013 MB/sec)
TableTuring:
00:00:00.3507988   (28.506 MB/sec)
00:00:00.3472459   (28.798 MB/sec)
00:00:00.3504392   (28.536 MB/sec)
00:00:00.3504047   (28.538 MB/sec)
00:00:00.3524801   (28.37 MB/sec)
==> Average: 00:00:00.3500000   (28.571 MB/sec)
FastTuring:
00:00:00.1785027   (56.022 MB/sec)
00:00:00.1742183   (57.399 MB/sec)
00:00:00.1767645   (56.572 MB/sec)
00:00:00.1755749   (56.956 MB/sec)
00:00:00.1753647   (57.024 MB/sec)
==> Average: 00:00:00.1760000   (56.818 MB/sec)



Ran it multiple times, got comparable results.

And honestly... USE_6432_XOR is doing twice the work, is it not? And yet it performs about the same as the other...

Or am I missing something here?

#3 Skydiver

Posted 20 April 2016 - 03:14 PM

From your numbers, it looks like you're running the perf tests in Debug mode. You'll get better numbers in Release mode, but even there the 64-32 bit XOR is not significantly faster than the plain 32-bit XOR.
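
For anyone reproducing the numbers, here is a minimal sketch of how a perf runner could warn when it is not measuring optimized code. The class and method names (BuildCheck, IsJitOptimizerDisabled) are made up for illustration and are not part of TuringCipher.NET. It assumes a default Debug configuration, which marks the assembly with a DebuggableAttribute that disables JIT optimizations.

```csharp
using System;
using System.Diagnostics;
using System.Reflection;

// Hypothetical helper, not part of the TuringCipher.NET project.
static class BuildCheck
{
    // True if the assembly carries a DebuggableAttribute that disables
    // JIT optimizations, as a default Debug configuration does.
    public static bool IsJitOptimizerDisabled(Assembly assembly)
    {
        var attr = assembly.GetCustomAttribute<DebuggableAttribute>();
        return attr != null && attr.IsJITOptimizerDisabled;
    }

    static void Main()
    {
        var assembly = typeof(BuildCheck).Assembly;

        if (IsJitOptimizerDisabled(assembly))
            Console.WriteLine("Warning: JIT optimizations are disabled (Debug build?).");

        // An attached debugger can also change the code the JIT produces.
        if (Debugger.IsAttached)
            Console.WriteLine("Warning: debugger attached; timings may be skewed.");
    }
}
```

Running the perf tests from a Release build, started without the debugger, avoids both warnings.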

#4 Skydiver

Posted 20 April 2016 - 07:38 PM

Well, I found part of the problem... Apparently when I added a C# console project it defaulted to preferring 32-bit code generation.

I found this out the hard way when I was shocked to see this as the code generated by the JIT:
                while(src64 < end64)
00A15424  cmp         edx,eax  
00A15426  jae         00A1545A  
                    *dst64++ = *src64++ ^ *pad64++;
00A15428  mov         esi,dword ptr [ebp-2Ch]  
00A1542B  add         dword ptr [ebp-2Ch],8  
00A1542F  mov         edi,dword ptr [ebp-24h]  
00A15432  add         dword ptr [ebp-24h],8  
00A15436  mov         eax,dword ptr [ebp-28h]  
00A15439  mov         dword ptr [ebp-34h],eax  
00A1543C  add         dword ptr [ebp-28h],8  
00A15440  mov         eax,dword ptr [edi]  
00A15442  mov         edx,dword ptr [edi+4]  
00A15445  mov         ecx,dword ptr [ebp-34h]  
00A15448  xor         eax,dword ptr [ecx]  
00A1544A  xor         edx,dword ptr [ecx+4]  
00A1544D  mov         dword ptr [esi],eax  
00A1544F  mov         dword ptr [esi+4],edx  

                while(src64 < end64)
00A15452  mov         eax,dword ptr [ebp-24h]  

                while(src64 < end64)
00A15455  cmp         eax,dword ptr [ebp-30h]  
00A15458  jb          00A15428  



WTF?!?! Why did it generate 32-bit code for a 64-bit machine? It was because the perf runner shell program had this by default:
[Attached image: project build settings with the "Prefer 32-bit" checkbox checked]

Unchecking that checkbox now generates this code:
                while(src64 < end64)
00007FFA5A3276F7  cmp         r9,rax  
00007FFA5A3276FA  jae         00007FFA5A32776D  
                    *dst64++ = *src64++ ^ *pad64++;
00007FFA5A3276FC  lea         rdx,[r11+8]  
00007FFA5A327700  lea         rsi,[r9+8]  
00007FFA5A327704  lea         rdi,[r10+8]  
00007FFA5A327708  mov         r9,qword ptr [r9]  
00007FFA5A32770B  mov         r10,qword ptr [r10]  
00007FFA5A32770E  xor         r9,r10  
00007FFA5A327711  mov         qword ptr [r11],r9  

                while(src64 < end64)
00007FFA5A327714  cmp         rsi,rax  
00007FFA5A327717  jb          00007FFA5A327762  
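
A cheap guard against this kind of surprise is to have the perf runner report its own bitness before timing anything. The following is only a sketch; the BitnessCheck class and its Describe method are hypothetical names, not something from the project. It relies on Environment.Is64BitProcess, Environment.Is64BitOperatingSystem, and IntPtr.Size.

```csharp
using System;

// Hypothetical helper, not part of the TuringCipher.NET project.
static class BitnessCheck
{
    // Summarizes whether the current process and the OS are 32-bit or 64-bit.
    public static string Describe()
    {
        return string.Format(
            "process: {0}-bit (IntPtr.Size = {1}), OS: {2}-bit",
            Environment.Is64BitProcess ? 64 : 32,
            IntPtr.Size,
            Environment.Is64BitOperatingSystem ? 64 : 32);
    }

    static void Main()
    {
        Console.WriteLine(Describe());

        // A 32-bit process on a 64-bit OS is the situation described above.
        if (Environment.Is64BitOperatingSystem && !Environment.Is64BitProcess)
            Console.WriteLine("Warning: running as 32-bit on a 64-bit OS; check the platform settings.");
    }
}
```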



Unfortunately, I'm still not seeing the expected speed boost on either the AMD box or the Intel box. At least the 64-bit version is now about 5-8% faster.

Perhaps I just need to rewrite my perf tests to exercise only the XOR method rather than the entire encryption end-to-end, since the profiler now tells me that I'm spending only about 6% of the time doing the XOR, down from about 12% previously.

#5 Skydiver

Posted 21 April 2016 - 08:35 AM

Minor update. I updated my perf tests to isolate just the XOR operations from the rest of the encryption, and to test the 64-bit and 32-bit versions explicitly without having to recompile. The AMD machine is still showing close values between the two versions, but the Intel machine is now showing a consistent 25% better performance for the 64-bit version over the 32-bit version.

At this point, I'm pretty happy with things (except for the mystery of the odd performance of the AMD machine).
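
For readers who want to try the same kind of isolated test, here is a rough sketch of what it could look like. This is not the actual test code from the project. The names XorBench, Xor32, Xor64, and Time are invented for illustration, the 1 MB buffer and 1000 iterations are arbitrary, and it has to be compiled with unsafe code enabled (for example the /unsafe compiler switch). Both variants are ordinary methods picked at run time, so no #if and no recompile is needed.

```csharp
using System;
using System.Diagnostics;

// Hypothetical micro-benchmark, not the project's actual perf test.
static class XorBench
{
    // XOR 32 bits at a time. count must be a multiple of 4.
    public static unsafe void Xor32(byte[] input, byte[] pad, byte[] output, int count)
    {
        fixed (byte* pIn = input, pPad = pad, pOut = output)
        {
            var src = (uint*)pIn;
            var key = (uint*)pPad;
            var dst = (uint*)pOut;
            var end = src + count / sizeof(uint);
            while (src < end)
                *dst++ = *src++ ^ *key++;
        }
    }

    // XOR 64 bits at a time, then at most one trailing 32-bit word.
    // count must be a multiple of 4.
    public static unsafe void Xor64(byte[] input, byte[] pad, byte[] output, int count)
    {
        fixed (byte* pIn = input, pPad = pad, pOut = output)
        {
            var src = (ulong*)pIn;
            var key = (ulong*)pPad;
            var dst = (ulong*)pOut;
            var end = src + count / sizeof(ulong);
            while (src < end)
                *dst++ = *src++ ^ *key++;

            if (count % sizeof(ulong) != 0)
                *(uint*)dst = *(uint*)src ^ *(uint*)key;
        }
    }

    // Times 'iterations' calls of one XOR variant over 'size' bytes.
    public static TimeSpan Time(Action<byte[], byte[], byte[], int> xor, int size, int iterations)
    {
        var input = new byte[size];
        var pad = new byte[size];
        var output = new byte[size];
        new Random(1).NextBytes(input);
        new Random(2).NextBytes(pad);

        xor(input, pad, output, size);   // warm-up call so JIT time is not measured

        var watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            xor(input, pad, output, size);
        watch.Stop();
        return watch.Elapsed;
    }

    static void Main()
    {
        const int size = 1 << 20;        // 1 MB, arbitrary
        const int iterations = 1000;     // arbitrary

        Console.WriteLine("64-bit process: " + Environment.Is64BitProcess);
        Console.WriteLine("Xor32: " + Time(Xor32, size, iterations));
        Console.WriteLine("Xor64: " + Time(Xor64, size, iterations));
    }
}
```

Timing only the XOR loop this way keeps the rest of the cipher out of the measurement, which matters when the XOR is just a small slice of the end-to-end run time.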