Nice programing

새로운 랜덤 라이브러리가 std :: rand ()보다 나은 이유는 무엇입니까?

nicepro 2020. 10. 5. 20:53
반응형

새로운 랜덤 라이브러리가 std :: rand ()보다 나은 이유는 무엇입니까?


그래서 저는 rand ()가 해로운 것으로 간주 되는 강연을 보았고 단순 std::rand()플러스 모듈러스 패러다임을 넘어 난수 생성의 엔진 배포 패러다임을 옹호했습니다 .

그러나 나는 std::rand()직접 실패를보고 싶었 기 때문에 빠른 실험을했다.

  1. 기본적으로, 2 개 개의 함수를 작성 getRandNum_Old()하고 getRandNum_New()그 이용하여 0과 5 사이에 포함 된 난수를 생성 std::rand()하고 std::mt19937+ std::uniform_int_distribution각각.
  2. 그런 다음 "이전"방식을 사용하여 960,000 (6으로 나눌 수있는) 난수를 생성하고 숫자 0-5의 빈도를 기록했습니다. 그런 다음 이러한 주파수의 표준 편차를 계산했습니다. 내가 찾고있는 것은 분포가 진정으로 균일 할 경우 일어날 일이기 때문에 가능한 한 낮은 표준 편차입니다.
  3. 이 시뮬레이션을 1000 번 실행하고 각 시뮬레이션에 대한 표준 편차를 기록했습니다. 또한 소요 시간을 밀리 초 단위로 기록했습니다.
  4. 그 후 다시 똑같이했지만 이번에는 "새로운"방식으로 난수를 생성했습니다.
  5. 마지막으로 구식과 신법 모두에 대한 표준 편차 목록의 평균과 표준 편차를 계산했고, 구식과 신법 모두에 대해 취한 시간 목록의 평균과 표준 편차를 계산했습니다.

결과는 다음과 같습니다.

[OLD WAY]
Spread
       mean:  346.554406
    std dev:  110.318361
Time Taken (ms)
       mean:  6.662910
    std dev:  0.366301

[NEW WAY]
Spread
       mean:  350.346792
    std dev:  110.449190
Time Taken (ms)
       mean:  28.053907
    std dev:  0.654964

놀랍게도 롤의 총 ​​스프레드는 두 방법 모두 동일했습니다. 즉, std::mt19937+ std::uniform_int_distribution는 단순한 std::rand()+ 보다 "더 균일"하지 않았습니다 %. 내가 만든 또 다른 관찰은 새로운 것이 예전 방식보다 약 4 배 느리다는 것입니다. 전반적으로 품질 향상은 거의없이 엄청난 속도 비용을 지불하는 것처럼 보였습니다.

내 실험에 어떤 식 으로든 결함이 있습니까? 아니면 std::rand()정말 그렇게 나쁘지 않고 어쩌면 더 좋을까요?

참고로 다음은 전체적으로 사용한 코드입니다.

#include <cstdio>
#include <random>
#include <algorithm>
#include <chrono>

int getRandNum_Old() {
    static bool init = false;
    if (!init) {
        std::srand(time(nullptr)); // Seed std::rand
        init = true;
    }

    return std::rand() % 6;
}

int getRandNum_New() {
    static bool init = false;
    static std::random_device rd;
    static std::mt19937 eng;
    static std::uniform_int_distribution<int> dist(0,5);
    if (!init) {
        eng.seed(rd()); // Seed random engine
        init = true;
    }

    return dist(eng);
}

template <typename T>
double mean(T* data, int n) {
    double m = 0;
    std::for_each(data, data+n, [&](T x){ m += x; });
    m /= n;
    return m;
}

template <typename T>
double stdDev(T* data, int n) {
    double m = mean(data, n);
    double sd = 0.0;
    std::for_each(data, data+n, [&](T x){ sd += ((x-m) * (x-m)); });
    sd /= n;
    sd = sqrt(sd);
    return sd;
}

int main() {
    const int N = 960000; // Number of trials
    const int M = 1000;   // Number of simulations
    const int D = 6;      // Num sides on die

    /* Do the things the "old" way (blech) */

    int freqList_Old[D];
    double stdDevList_Old[M];
    double timeTakenList_Old[M];

    for (int j = 0; j < M; j++) {
        auto start = std::chrono::high_resolution_clock::now();
        std::fill_n(freqList_Old, D, 0);
        for (int i = 0; i < N; i++) {
            int roll = getRandNum_Old();
            freqList_Old[roll] += 1;
        }
        stdDevList_Old[j] = stdDev(freqList_Old, D);
        auto end = std::chrono::high_resolution_clock::now();
        auto dur = std::chrono::duration_cast<std::chrono::microseconds>(end-start);
        double timeTaken = dur.count() / 1000.0;
        timeTakenList_Old[j] = timeTaken;
    }

    /* Do the things the cool new way! */

    int freqList_New[D];
    double stdDevList_New[M];
    double timeTakenList_New[M];

    for (int j = 0; j < M; j++) {
        auto start = std::chrono::high_resolution_clock::now();
        std::fill_n(freqList_New, D, 0);
        for (int i = 0; i < N; i++) {
            int roll = getRandNum_New();
            freqList_New[roll] += 1;
        }
        stdDevList_New[j] = stdDev(freqList_New, D);
        auto end = std::chrono::high_resolution_clock::now();
        auto dur = std::chrono::duration_cast<std::chrono::microseconds>(end-start);
        double timeTaken = dur.count() / 1000.0;
        timeTakenList_New[j] = timeTaken;
    }

    /* Display Results */

    printf("[OLD WAY]\n");
    printf("Spread\n");
    printf("       mean:  %.6f\n", mean(stdDevList_Old, M));
    printf("    std dev:  %.6f\n", stdDev(stdDevList_Old, M));
    printf("Time Taken (ms)\n");
    printf("       mean:  %.6f\n", mean(timeTakenList_Old, M));
    printf("    std dev:  %.6f\n", stdDev(timeTakenList_Old, M));
    printf("\n");
    printf("[NEW WAY]\n");
    printf("Spread\n");
    printf("       mean:  %.6f\n", mean(stdDevList_New, M));
    printf("    std dev:  %.6f\n", stdDev(stdDevList_New, M));
    printf("Time Taken (ms)\n");
    printf("       mean:  %.6f\n", mean(timeTakenList_New, M));
    printf("    std dev:  %.6f\n", stdDev(timeTakenList_New, M));
}

"old"의 거의 모든 구현은 LCG를rand() 사용합니다 . 그들은 일반적으로 최고의 생성기는 아니지만 일반적으로 이러한 기본 테스트에서 실패하는 것을 보지 않을 것입니다. 평균 및 표준 편차는 일반적으로 최악의 PRNG에서도 올바르게 얻습니다.

Common failings of "bad" - but common enough - rand() implementations are:

  • low randomness of low-order bits;
  • short period;
  • low RAND_MAX;
  • some correlation between successive extractions (in general, LCGs produce numbers that are on a limited number of hyperplanes, although this can be somehow mitigated).

Still, none of these are specific to the API of rand(). A particular implementation could place a xorshift-family generator behind srand/rand and, algoritmically speaking, obtain a state of the art PRNG with no changes of interface, so no test like the one you did would show any weakness in the output.

Edit: @R. correctly notes that the rand/srand interface is limited by the fact that srand takes an unsigned int, so any generator an implementation may put behind them is intrinsically limited to UINT_MAX possible starting seeds (and thus generated sequences). This is true indeed, although the API could be trivially extended to make srand take an unsigned long long, or adding a separate srand(unsigned char *, size_t) overload.


Indeed, the actual problem with rand() is not much of implementation in principle but:

  • backwards compatibility; many current implementations use suboptimal generators, typically with badly chosen parameters; a notorious example is Visual C++, which sports a RAND_MAX of just 32767. However, this cannot be changed easily, as it would break compatibility with the past - people using srand with a fixed seed for reproducible simulations wouldn't be too happy (indeed, IIRC the aforementioned implementation goes back to Microsoft C early versions - or even Lattice C - from the mid-eighties);
  • simplistic interface; rand() provides a single generator with the global state for the whole program. While this is perfectly fine (and actually quite handy) for many simple use cases, it poses problems:

    • with multithreaded code: to fix it you either need a global mutex - which would slow down everything for no reason and kill any chance of repeatability, as the sequence of calls becomes random itself -, or thread-local state; this last one has been adopted by several implementations (notably Visual C++);
    • if you want a "private", reproducible sequence into a specific module of your program that doesn't impact the global state.

Finally, the rand state of affairs:

  • doesn't specify an actual implementation (the C standard provides just a sample implementation), so any program that is intended to produce reproducible output (or expect a PRNG of some known quality) across different compilers must roll its own generator;
  • doesn't provide any cross-platform method to obtain a decent seed (time(NULL) is not, as it isn't granular enough, and often - think embedded devices with no RTC - not even random enough).

Hence the new <random> header, which tries to fix this mess providing algorithms that are:

  • fully specified (so you can have cross-compiler reproducible output and guaranteed characteristics - say, range of the generator);
  • generally of state-of-the-art quality (from when the library was designed; see below);
  • encapsulated in classes (so no global state is forced upon you, which avoids completely threading and nonlocality problems);

... and a default random_device as well to seed them.

Now, if you ask me I would have liked also a simple API built on top of this for the "easy", "guess a number" cases (similar to how Python does provide the "complicated" API, but also the trivial random.randint & Co. using a global, pre-seeded PRNG for us uncomplicated people who'd like not to drown in random devices/engines/adapters/whatever every time we want to extract a number for the bingo cards), but it's true that you can easily build it by yourself over the current facilities (while building the "full" API over a simplistic one wouldn't be possible).


Finally, to get back to your performance comparison: as others have specified, you are comparing a fast LCG with a slower (but generally considered better quality) Mersenne Twister; if you are ok with the quality of an LCG, you can use std::minstd_rand instead of std::mt19937.

Indeed, after tweaking your function to use std::minstd_rand and avoid useless static variables for initialization

int getRandNum_New() {
    static std::minstd_rand eng{std::random_device{}()};
    static std::uniform_int_distribution<int> dist{0, 5};
    return dist(eng);
}

I get 9 ms (old) vs 21 ms (new); finally, if I get rid of dist (which, compared to the classic modulo operator, handles the distribution skew for output range not multiple of the input range) and get back to what you are doing in getRandNum_Old()

int getRandNum_New() {
    static std::minstd_rand eng{std::random_device{}()};
    return eng() % 6;
}

I get it down to 6 ms (so, 30% faster), probably because, unlike the call to rand(), std::minstd_rand is easier to inline.


Incidentally, I did the same test using a hand-rolled (but pretty much conforming to the standard library interface) XorShift64*, and it's 2.3 times faster than rand() (3.68 ms vs 8.61 ms); given that, unlike the Mersenne Twister and the various provided LCGs, it passes the current randomness test suites with flying colors and it's blazingly fast, it makes you wonder why it isn't included in the standard library yet.


If you repeat your experiment with a range larger than 5 then you will probably see different results. When your range is significantly smaller than RAND_MAX there isn't an issue for most applications.

For example if we have a RAND_MAX of 25 then rand() % 5 will produce numbers with the following frequencies:

0: 6
1: 5
2: 5
3: 5
4: 5

As RAND_MAX is guaranteed to be more than 32767 and the difference in frequencies between the least likely and the most likely is only 1, for small numbers the distribution is near enough random for most use cases.


First, surprisingly, the answer changes depending on what you are using the random number for. If it is to drive, say, a random background color changer, using rand() is perfectly fine. If you are using a random number to create a random poker hand or a cryptographically secure key, then it is not fine.

Predictability: the sequence 012345012345012345012345... would provide an even distribution of each number in your sample, but obviously isn't random. For a sequence to be random, the value of n+1 cannot be easily predicted by the value of n (or even by the values of n, n-1, n-2, n-3, etc.) Clearly a repeating sequence of the same digits is a degenerate case, but a sequence generated with any linear congruential generator can be subjected to analysis; if you use default out-of-the-box settings of a common LCG from a common library, a malicious person could "break the sequence" without much effort at all. In the past, several on-line casinos (and some brick-and-mortar ones) were hit for losses by machines using poor random number generators. Even people who should know better have been caught up; TPM chips from several manufacturers have been demonstrated to be easier to break than the bit-length of the keys would otherwise predict because of poor choices made with key-generation parameters.

Distribution: As alluded to in the video, taking a modulo of 100 (or any value not evenly divisible into the length of the sequence) will guarantee that some outcomes will become at least slightly more likely than other outcomes. In the universe of 32767 possible starting values modulo 100, the numbers 0 through 66 will appear 328/327 (0.3%) more often than the values 67 through 99; a factor that may provide an attacker an advantage.


The correct answer is: it depends on what you mean by "better."

The "new" <random> engines were introduced to C++ over 13 years ago, so they're not really new. The C library rand() was introduced decades ago and has been very useful in that time for any number of things.

The C++ standard library provides three classes of random number generator engines: the Linear Congruential (of which rand() is an example), the Lagged Fibonacci, and the Mersenne Twister. There are tradeoffs of each class, and each class is "best" in certain ways. For example, the LCGs have very small state and if the right parameters are chosen, fairly fast on modern desktop processors. The LFGs have larger state and use only memory fetches and addition operation, so are very fast on embedded systems and microcontrollers that lack specialized math hardware. The MTG has huge state and is slow, but can have a very large non-repeating sequence with excellent spectral characteristics.

If none of the supplied generators are good enough for your specific use, the C++ standard library also provides an interface for either a hardware generator or your own custom engine. None of the generators are intended to be used standalone: their intended use is through a distribution object that provides a random sequence with a particular probability distribution function.

Another advantage of <random> over rand() is that rand() uses global state, is not reentrant or threadsafe, and allows a single instance per process. If you need fine-grained control or predictability (ie. able to reproduce a bug given the RNG seed state) then rand() is useless. The <random> generators are locally instanced and have serializable (and restorable) state.

참고URL : https://stackoverflow.com/questions/53040940/why-is-the-new-random-library-better-than-stdrand

반응형