All you have to do is allow the user to specify the seeds of the random number generators involved. First, you need a good random generator you have full control over. Better than numpy.random. See ours, with infinite period and one line of code, faster and better than what's in Python and elsewhere. Here is the link: https://mltblog.com/4fGDLu0
This post makes no sense.
What do you mean by "LLMs are not reproducible"?
How do GANs solve this, according to you (in practice they are deep neural networks too)? And why does that even matter, given that GANs don't do text generation?
Users can specify random seeds with LLMs, such as with huggingface transformers (see https://huggingface.co/blog/how-to-generate).
In what way is the random generator the bottleneck of reproducibility, and the first thing people need? I thought the problem was the deep neural networks themselves?
In what ways is your generator better than numpy's? That is a very strong claim, as numpy is extremely battle-tested. Is the randomness better? Then compare it to random, numpy.random and secrets with Dieharder (not some test you make up). You would also need to show that it makes a difference for AI applications.
If yours is faster (another strong claim!), then publish a rigorous, reproducible benchmark with open code.
Also, the default generator in numpy is not the Mersenne Twister but PCG64 (https://numpy.org/doc/stable/reference/random/generator.html#numpy.random.default_rng).
You say people have full control over your generator, but it is also just one line? That must be a very long line.
The period of the numpy generator is already extremely large; why would AI applications need the "infinite" one you provide?
Also, just link the GitHub repo directly, not something that links to something that links to the GitHub. Get to the point.
See my tests: Python badly fails the congruential equidistribution test, among others, whatever generator is implemented in Python 3.10.
The one-line formula is this:
(5^n >> n) % (2^n)
It can be executed very efficiently. Each new n gives you n new bits, independent from the previous ones. That's one of many sequences proposed in my paper. With n up to 10^6, you get a total of about 5 x 10^11 bits.
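The formula above is a direct one-liner in Python, since Python integers have arbitrary precision. This is a minimal sketch (the function name is mine, not from the paper):

```python
def random_bits(n):
    """Bits n..2n-1 of 5**n, i.e. (5^n >> n) mod 2^n.

    Returns an integer in [0, 2^n), interpreted as n random bits.
    Python's big integers make this exact for any n.
    """
    return (5**n >> n) % (2**n)
```

For example, with n = 3: 5^3 = 125 = 0b1111101, shifting right by 3 gives 0b1111 = 15, and 15 mod 8 = 7 = 0b111, i.e. three bits.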
As for non-reproducibility: in everything I tested, if you run the code twice, you get two different results. You have to set a seed for every source of randomness, not just some of them as set_seed does.
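The principle can be illustrated with a stdlib-only sketch: give each component its own explicitly seeded generator, rather than relying on a global state that other code may touch (function and variable names here are mine, for illustration):

```python
import random

def sample(seed, k=5):
    # A dedicated, seeded generator: the same seed always
    # yields the same sequence, independent of global state.
    rng = random.Random(seed)
    return [rng.random() for _ in range(k)]

# Two runs with the same seed are bit-for-bit identical.
assert sample(123) == sample(123)
```

In a real pipeline, every library with its own randomness (numpy, torch, CUDA kernels, tokenizers, etc.) needs the same treatment, which is exactly the gap being pointed out.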
Finally, I have 40 years of experience designing random generators of increasing quality, and a PhD in computational stats (postdoc at the Statslab, Cambridge University).
Your Dieharder battery of tests is a joke designed by amateurs who know basic stats and nothing about number theory. The fact that everyone uses it does not make my statement less true. It does not even test "strong randomness", a concept defined in one of my books.
Actually, you only need a single test to check strong randomness: the full multivariate Kolmogorov-Smirnov distance. As far as I know, I am the only one to have implemented it in any dimension: https://pypi.org/project/genai-evaluation/
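For readers unfamiliar with the statistic: the two-sample multivariate KS distance is the maximum absolute difference between the two empirical CDFs. Below is a naive pure-Python sketch of that definition, evaluated over the pooled sample points; this is my own illustration, not the genai-evaluation implementation, and at O(n^2 * d) it is far too slow for real use:

```python
def ecdf(sample, x):
    """Empirical CDF of `sample` at point x: the fraction of
    sample points that are componentwise <= x."""
    d = len(x)
    hits = sum(1 for p in sample if all(p[i] <= x[i] for i in range(d)))
    return hits / len(sample)

def mv_ks_distance(sample_a, sample_b):
    """Naive two-sample multivariate Kolmogorov-Smirnov distance:
    max over the pooled points of |F_a(x) - F_b(x)|."""
    pooled = list(sample_a) + list(sample_b)
    return max(abs(ecdf(sample_a, x) - ecdf(sample_b, x)) for x in pooled)
```

Identical samples give a distance of 0, while two well-separated point clouds give a distance near 1.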