PL equivalent of gsm8k dataset

#6
by TeeZee - opened

Hi,

I would like to perform a LASER operation on Bielik (https://github.com/cognitivecomputations/laserRMT). Do you have any suggestions on which dataset should be used? The original paper/script uses gsm8k (https://huggingface.co/datasets/gsm8k). Is there a Polish equivalent of this dataset? If so, could you share it or point me to it?

SpeakLeash a.k.a Spichlerz! org

I am not sure the original script uses gsm8k. If you are asking for a math dataset, there is none in Polish yet, but we are working on creating one. Does it need to be a math dataset? I haven't read about this method, but if they use the dataset only for measuring perplexity, I would go for a mix of Polish datasets.

def calculate_model_perplexity(self, datasets=['gsm8k'], seqlen=32, use_cuda_graph=False, use_flash_attn=False):
    model = self.model
    model_str = self.model_name
    acc_loss = 0.0
    total_samples = 0

They are going after perplexity. I was also thinking about some mix; do you have any pointers on what would be THE best? ;)
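For reference, here is a minimal sketch of how perplexity over a mix of datasets could be accumulated, in the spirit of the acc_loss / total_samples counters above. This is not the laserRMT implementation: the function names are hypothetical, and it assumes you already have per-token negative log-likelihoods (natural log) from the model for each dataset.

```python
import math


def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    nlls = list(token_nlls)
    if not nlls:
        raise ValueError("need at least one token to compute perplexity")
    return math.exp(sum(nlls) / len(nlls))


def mixed_perplexity(nlls_by_dataset):
    """Pool tokens from several datasets into one perplexity value.

    nlls_by_dataset maps a dataset name (hypothetical labels) to a list of
    per-token negative log-likelihoods. Pooling is token-weighted, so a
    larger dataset contributes more to the result (assumption: this is the
    weighting you want; per-dataset averaging is another option).
    """
    acc_loss = 0.0
    total_tokens = 0
    for nlls in nlls_by_dataset.values():
        acc_loss += sum(nlls)
        total_tokens += len(nlls)
    if total_tokens == 0:
        raise ValueError("need at least one token to compute perplexity")
    return math.exp(acc_loss / total_tokens)
```

With token-weighted pooling, a small corpus barely moves the number when mixed with a large one, so it may be worth also reporting perplexity per dataset.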

SpeakLeash a.k.a Spichlerz! org

The math dataset is used only in the rmt_laser_snr_math*.py files; there is another script that uses different datasets: https://github.com/cognitivecomputations/laserRMT/blob/main/rmt_laser_snr.py

You're right, I was focusing on math for my particular use case ;). So I'll start with the full wikitext and then with a PL subset. Thanks!

SpeakLeash a.k.a Spichlerz! org

There is the NKJP corpus (which is more balanced than wiki), but it is only 1M tokens.

OK, found it, thanks. It seems there is some experimentation to be done to adapt LASER for PL LLMs; I'll keep you posted.
