The following library implements the legacy TQ cipher. What's new or different? It implements the cipher natively and exposes it through a .NET interface, as the DLL is written in C++/CLI. The library includes three implementations: one using standard arithmetic, one using vectorized arithmetic with SSE and SSE2 intrinsics, and one using vectorized arithmetic with AVX and AVX2 intrinsics. The best implementation is selected automatically based on your CPU (though you can also force a specific one).
Although it offers improved performance, don't expect the gain to be significant overall: encryption/decryption is not where the server spends most of its time.
Note: the DLL is built with Visual Studio 2013. It requires the MSVC 2013 redistributables and the .NET Framework 4.0. It won't work on Windows XP, and I don't plan to make a version for it at the moment.
Performance
Months ago, I ran some benchmarks in C++ only. They may not reflect reality perfectly (they weren't done very rigorously, but enough to get an idea). Also, I had to change the code (loading/unloading blocks into registers) because of the uncontrolled alignment of data coming from C#, so gains might be lower with my C++/CLI library.
On an Intel Core 2 Duo P8400 processor, running Mac OS X Yosemite Beta 6 and compiled with Clang 6.0 (at -O3), the standard algorithm had a throughput of ~555 MB/s, while the SSE/SSE2-optimized algorithm reached 1600-3300 MB/s (as high as 40000 MB/s for a 16-byte buffer). The AVX/AVX2-optimized algorithm should achieve even better performance, but, well, old computer.
The optimized algorithm's throughput depends on the buffer size. Why? If the buffer length is a multiple of 128 bits (16 bytes), the result is computed entirely with SSE2. Otherwise, the remaining bytes go through the same O(N) standard algorithm. So if your buffers are never a multiple of 16 bytes, the overall throughput drops; and if a buffer is smaller than 16 bytes, you get the plain old 555 MB/s. There is also a small overhead when the second key overlaps two values: the second key is indexed by the high 8 bits of the counter, so with the counter at e.g. 245, the second key index is 0 at first and becomes 1 partway through the block, which adds some complexity. This overhead rarely occurs with buffers that are multiples of 16 bytes. In general, the more whole 16-byte blocks a buffer contains, the faster the optimized algorithm will be.
Feel free to post better benchmarks.







