PL equivalent of gsm8k dataset

by TeeZee - opened May 8, 2024

May 8, 2024

Hi,

I would like to perform a LASER operation on Bielik (https://github.com/cognitivecomputations/laserRMT). Do you have any suggestions for which dataset to use? The original paper/script uses https://huggingface.co/datasets/gsm8k. Is there a Polish equivalent of this dataset? If so, could you share it or point me to it?

djstrong

SpeakLeash | Spichlerz org May 8, 2024

I am not sure the original script uses gsm8k. If you are asking for a math dataset, there is none in Polish yet, but we are working on creating one. Does it need to be a math dataset? I haven't read about this method, but if they use the dataset only for measuring perplexity, I would go for a mix of Polish datasets.
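As a rough illustration of what "a mix of Polish datasets" could look like in practice, here is a minimal sketch that draws evaluation texts from several corpora according to weights. The function name `mix_corpora`, the corpus names, and the weights are all hypothetical placeholders, not anything taken from laserRMT.

```python
import random

def mix_corpora(corpora, n_samples, weights=None, seed=0):
    """Draw n_samples texts from several corpora.

    corpora: dict mapping a corpus name to a list of text samples.
    weights: optional dict mapping corpus name to a sampling weight
             (uniform if omitted). All names here are placeholders.
    Returns a list of (corpus_name, text) pairs.
    """
    rng = random.Random(seed)  # fixed seed keeps the mix reproducible
    names = list(corpora)
    if weights is None:
        weights = {name: 1.0 for name in names}
    w = [weights[name] for name in names]

    mixed = []
    for _ in range(n_samples):
        # Pick the source corpus by weight, then a text uniformly within it.
        name = rng.choices(names, weights=w, k=1)[0]
        mixed.append((name, rng.choice(corpora[name])))
    return mixed

# Toy usage with made-up corpora.
toy = {
    "encyclopedic": ["tekst A", "tekst B"],
    "balanced": ["tekst C", "tekst D", "tekst E"],
}
sample = mix_corpora(toy, n_samples=10, weights={"encyclopedic": 1.0, "balanced": 3.0})
```

Fixing the seed matters if the same mix is reused to compare perplexity before and after an intervention; otherwise the two measurements would be taken on different texts.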

TeeZee

May 8, 2024

def calculate_model_perplexity(self, datasets=['gsm8k'], seqlen=32, use_cuda_graph=False, use_flash_attn=False):
    model = self.model
    model_str = self.model_name
    acc_loss = 0.0
    total_samples = 0

They are after perplexity. I was also thinking about a mix. Do you have any pointers on which one would be THE best? ;)
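For reference, perplexity is the exponential of the mean per-token negative log-likelihood. The sketch below shows only that arithmetic, assuming the per-token losses (natural log) have already been obtained from a model; it is not the laserRMT implementation, and the function names are my own.

```python
import math

def perplexity(token_nlls):
    """Perplexity of one sequence from its per-token negative log-likelihoods."""
    if not token_nlls:
        raise ValueError("need at least one token")
    return math.exp(sum(token_nlls) / len(token_nlls))

def corpus_perplexity(per_sequence_nlls):
    """Token-weighted perplexity over many sequences.

    Losses are summed over all tokens before averaging, so long
    sequences count proportionally more than short ones.
    """
    total = sum(sum(seq) for seq in per_sequence_nlls)
    count = sum(len(seq) for seq in per_sequence_nlls)
    if count == 0:
        raise ValueError("need at least one token")
    return math.exp(total / count)

# A model that assigns probability 1/4 to every token has perplexity 4.
uniform = perplexity([math.log(4)] * 3)
mixed = corpus_perplexity([[math.log(2)], [math.log(8)]])
```

Since the value depends on the tokenizer and on the evaluation text, perplexities are only comparable for the same model family on the same dataset, which is why the choice of dataset matters here.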

djstrong

SpeakLeash | Spichlerz org May 8, 2024

The math dataset is used only in the rmt_laser_snr_math*.py files; there is another script that uses different datasets: https://github.com/cognitivecomputations/laserRMT/blob/main/rmt_laser_snr.py

TeeZee

May 8, 2024

You're right, I was focusing on math for my particular use case ;). So I'll start with the full wikitext and then move to a PL subset. Thanks!

djstrong

SpeakLeash | Spichlerz org May 8, 2024

There is also the NKJP corpus (which is more balanced than wiki), but it is only 1M tokens.

TeeZee

May 8, 2024

OK, found it, thanks. It seems some experimentation is needed to adapt LASER for PL LLMs. I'll keep you posted.
