The fastest MOD algorithm in C++ for extremely large uint64_t numbers stored in an array

Written by wordpress November 9, 2024

I am working with extremely large numbers and would like to verify my Karatsuba multiplication result, ((2^136279841)-1)^2. Each operand occupies 532,344 __m256i values, so the product needs 2 × 532,344 __m256i values, i.e. 4,258,752 uint64_t, to store the result.
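
The sizes quoted above can be checked at compile time. The following is a minimal sketch; the helper names `blocks256` and `words64_padded` and the constant `kBits` are hypothetical and chosen only for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Number of 256-bit blocks needed to hold `bits` bits, rounded up.
constexpr std::size_t blocks256(std::size_t bits) { return (bits + 255) / 256; }

// Number of 64-bit words when storage is padded to whole 256-bit blocks.
constexpr std::size_t words64_padded(std::size_t bits) { return blocks256(bits) * 4; }

constexpr std::size_t kBits = 136279841;  // exponent of 2^136279841 - 1

static_assert(blocks256(kBits) == 532344, "one operand spans 532,344 blocks of 256 bits");
static_assert(words64_padded(kBits) == 2129376, "one operand spans 2,129,376 words");
static_assert(2 * words64_padded(kBits) == 4258752, "the product spans 4,258,752 words");
static_assert(kBits % 256 == 33, "the last 256-bit block holds only 33 significant bits");
```

The last assertion matters when the final block of the number is written: only its lowest 33 bits are set.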

I stored all required data arrays in a preallocated memory:

#include <windows.h>    // VirtualAlloc, SIZE_T
#include <immintrin.h>  // AVX2 intrinsics
#include <cstdint>

// Declared first because First_256_offset below depends on GB
constexpr size_t GB = 3;
static const SIZE_T giga = 1024 * 1024 * 1024;
static const SIZE_T size = GB * giga;

size_t num_bits = 136279841;
size_t num_uint64 = (num_bits + 255) / 256 * 4;  // 2,129,376 words, padded to whole 256-bit blocks
size_t n = num_uint64;

// Byte offset of the number: 3 * num_uint64 words below the end of the 3 GiB block
size_t First_256_offset = (GB * 0x40000000ULL) - ((2ULL + 1ULL) * num_uint64 * sizeof(uint64_t));

uint64_t* ARRAY = static_cast<uint64_t*>(VirtualAlloc(NULL, size, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE));

uint64_t* number = ARRAY + First_256_offset / sizeof(uint64_t);

// Store (2^136279841)-1: all-ones 256-bit blocks in a loop, then the final partial block
__m256i ones = _mm256_set1_epi64x(-1);
size_t i = 0;
for (; i < (num_uint64 - 4); i += 4) {
    _mm256_store_si256((__m256i*)&number[i], ones);
}

// 136279841 mod 256 = 33, so the last block holds 33 one-bits followed by zeros.
// The argument order is (address, mask, data); the all-ones mask writes all four lanes.
_mm256_maskstore_epi64((long long*)&number[i], _mm256_set1_epi64x(-1), _mm256_setr_epi64x(0x1FFFFFFFFLL, 0, 0, 0));
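
As a cross-check on the layout of the number, the same value can be produced with plain 64-bit stores. This is a portable scalar sketch, not the AVX2 path used above; `fill_mersenne` is a hypothetical helper name, and little-endian limb order (least significant word first) is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Writes 2^p - 1 into dst[0..words) as little-endian 64-bit limbs.
// Assumes the buffer is large enough, i.e. words * 64 >= p.
void fill_mersenne(std::uint64_t* dst, std::size_t words, std::size_t p) {
    assert(words * 64 >= p);
    const std::size_t full = p / 64;  // limbs that are entirely ones
    const std::size_t rem = p % 64;   // one-bits in the partial limb
    for (std::size_t i = 0; i < words; ++i) {
        if (i < full) {
            dst[i] = ~0ULL;
        } else if (i == full && rem != 0) {
            dst[i] = (1ULL << rem) - 1;
        } else {
            dst[i] = 0;  // padding up to the 256-bit boundary
        }
    }
}
```

For p = 136279841 this gives 2,129,372 all-ones limbs, one limb equal to 0x1FFFFFFFF (33 bits), and three zero limbs of padding, which can be compared word by word against the AVX2 result.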

I need to calculate MOD(A, B), where A is the multiplication result (which takes about 3 minutes to compute on my laptop), stored starting at ARRAY, and B is the number built in the code above. The memory above A and below First_256_offset is used as temporary space for the Karatsuba multiplication. For the MOD(A, B) result, I may use the space below First_256_offset.
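
Because B has the form 2^p - 1, one well-known way to reduce A without any division is to use 2^p ≡ 1 (mod 2^p - 1): split A into p-bit chunks and add them, folding any overflow bit back in. The sketch below is a plain scalar illustration of that idea, not a tuned or vectorized implementation. The names `limb_at` and `mod_mersenne` are hypothetical, limbs are assumed little-endian, p % 64 is assumed nonzero (it is 33 here), and the output buffer must not overlap A.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Limb i of (a >> shift_bits); a[0..na) is little-endian, zero beyond na.
static std::uint64_t limb_at(const std::uint64_t* a, std::size_t na,
                             std::size_t shift_bits, std::size_t i) {
    const std::size_t w = shift_bits / 64 + i;
    const unsigned s = static_cast<unsigned>(shift_bits % 64);
    const std::uint64_t lo = (w < na) ? a[w] : 0;
    if (s == 0) return lo;
    const std::uint64_t hi = (w + 1 < na) ? a[w + 1] : 0;
    return (lo >> s) | (hi << (64 - s));
}

// r[0..n) = a mod (2^p - 1), where n = p / 64 + 1. Requires p % 64 != 0.
void mod_mersenne(const std::uint64_t* a, std::size_t na, std::size_t p,
                  std::uint64_t* r) {
    assert(p % 64 != 0);
    const std::size_t n = p / 64 + 1;
    const unsigned top_bits = static_cast<unsigned>(p % 64);
    const std::uint64_t top_mask = (1ULL << top_bits) - 1;
    for (std::size_t i = 0; i < n; ++i) r[i] = 0;

    for (std::size_t start = 0; start < na * 64; start += p) {
        // r += next p-bit chunk of a (both are below 2^p, so the sum fits in n limbs)
        std::uint64_t carry = 0;
        for (std::size_t i = 0; i < n; ++i) {
            std::uint64_t c = limb_at(a, na, start, i);
            if (i == n - 1) c &= top_mask;
            std::uint64_t sum = r[i] + c;
            const std::uint64_t c1 = sum < c;
            sum += carry;
            const std::uint64_t c2 = sum < carry;
            r[i] = sum;
            carry = c1 | c2;
        }
        // Fold bit p back into bit 0, since 2^p is congruent to 1.
        std::uint64_t over = r[n - 1] >> top_bits;  // 0 or 1
        r[n - 1] &= top_mask;
        for (std::size_t i = 0; over != 0 && i < n; ++i) {
            r[i] += 1;
            over = (r[i] == 0) ? 1 : 0;
        }
    }

    // The value 2^p - 1 itself is congruent to 0.
    bool all_ones = (r[n - 1] == top_mask);
    for (std::size_t i = 0; all_ones && i + 1 < n; ++i) all_ones = (r[i] == ~0ULL);
    if (all_ones) {
        for (std::size_t i = 0; i < n; ++i) r[i] = 0;
    }
}
```

Since A is the square of B here, the remainder of a correct product is 0, so a nonzero result would point to an error in the multiplication. A call such as `mod_mersenne(ARRAY, 2 * num_uint64, num_bits, result)` would be one way to apply the sketch, with `result` pointing at free space of at least num_bits / 64 + 1 words.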

I need to avoid using any external libraries, std::vector, std::string, or memory-allocation functions.

P.S. Note that I use scalar uint64_t operations in my C++ Karatsuba program because __m256i can handle int64_t only and I need to work with uint64_t data.

