Maurice Weber (mauriceweber)
mauriceweber's activity
- RPV2 ccnet preprocessing · 1 · #29 opened about 1 month ago by bpwl0121
- sample split details · 3 · #4 opened 11 months ago by sujantkumarkv
- How can I download the sample-10B as fast as possible? · 1 · #28 opened 3 months ago by zgxiao
- defunct book subset · 4 · #28 opened 11 months ago by polinaeterna
- How much disk space would the whole HF dataset take? · 1 · #27 opened 6 months ago by protossw512
- rpv2-subsamples · 1 · #26 opened 9 months ago by mauriceweber
- What should the doc_id in duplicates contain? · 3 · #24 opened 9 months ago by newbietuan
- Deduplication steps · 23 · #15 opened 11 months ago by ilyayudkovich
- Here's a download script parallelized using Spark · 1 · #22 opened 10 months ago by srowen
- What is the meaning of snapshots in redpajama-data-v2? · 2 · #21 opened 10 months ago by choidonghun
- How to join documents and quality signals when downloading directly · 3 · #19 opened 10 months ago by tgshdyfuhuf
- Missing duplicates parquet files · 5 · #18 opened 10 months ago by bebensee
- Script to download all files of 1B sample data locally · 2 · #13 opened 11 months ago by ivanzhouyq
- What is the total size of this entire dataset in TB? · 1 · #10 opened 11 months ago by Bayaz
- What's the concept behind partitions? · 2 · #5 opened 11 months ago by SwatCat
- quality_signals, minhash and duplicates missing · 2 · #3 opened 11 months ago by sheshanshag
- Request to add retries into RedPajama-Data-V2.py script · 1 · #16 opened 10 months ago by yura38
- How to obtain duplicates from minhash? · 1 · #8 opened 11 months ago by cq
- Obtaining Filtered Samples · 4 · #12 opened 11 months ago by ssingh22
- How big is the en subset? · 5 · #6 opened 11 months ago by newbietuan
- Request to provide 1B/10B/100B/1T token subsample datasets separately · 2 · #4 opened 11 months ago by johnhew
- Missing file error · 3 · #9 opened 11 months ago by emrgnt-cmplxty
- RuntimeError: "addmm_impl_cpu_" not implemented for 'Half' · 4 · #29 opened about 1 year ago by shubhamagarwal92
- The model doesn't seem to stop · 15 · #1 opened about 1 year ago by LaferriereJC
- llama2 forward pass seemingly not working with padded inputs, unless one element in batch is not padded · 3 · #13 opened 12 months ago by joehakim
- Using the Accelerate API to train models on multiple GPUs · 8 · #28 opened about 1 year ago by ajash
- Input validation error: `max_new_tokens` must be <= 1. Given: 20 · 1 · #12 opened about 1 year ago by reubenlee3
- Keep getting error while loading tokenizer = AutoTokenizer.from_pretrained("togethercomputer/LLaMA-2-7B-32K") · 5 · #27 opened about 1 year ago by AIHero123
- Are the unsafe files from C4 also in RedPajama? · 2 · #26 opened about 1 year ago by cwallenwein
- Prompt format different in dataset compared to model card · 3 · #11 opened about 1 year ago by bhperry
- Model gives itself instructions and keeps going and going and going? · 5 · #8 opened about 1 year ago by michael-newsrx-com
- Great model. Plans for 13b version? · 1 · #9 opened about 1 year ago by nahuel89p
- Loading model without fast-attn · 1 · #10 opened about 1 year ago by TZ20
- Model on your API Playground · 7 · #3 opened about 1 year ago by 1littlecoder
- Can I continue pretraining this model for domain adaptation? · 4 · #6 opened about 1 year ago by sadahila
- Inconsistent data field in GitHub JSONL files · 3 · #24 opened about 1 year ago by Rita
- Unwanted repetitive response · 3 · #12 opened about 1 year ago by sdranju
- protofile.proto: A file with this name is already in the pool · 1 · #19 opened about 1 year ago by surya-narayanan
- Endpoint configuration on AWS SageMaker · 1 · #21 opened about 1 year ago by NABARKA
- Any plans for chat model? · 1 · #5 opened about 1 year ago by brekk
- When will there be a GGML version? · 8 · #3 opened about 1 year ago by CUIGuy
- Skip split generation. · 3 · #23 opened about 1 year ago by luosuu
- LocalAI Model Loading · 3 · #2 opened about 1 year ago by FIWisher
- Error when loading book/book.jsonl using load_dataset · 5 · #22 opened about 1 year ago by icycold
- Instead of flash_attn it should be flash_attn_2_cuda. This is causing a deployment issue in TGI/DJL · 1 · #14 opened about 1 year ago by monuminu
- !pip install flash-attn --no-build-isolation · 2 · #15 opened about 1 year ago by NivYO
- Getting strange tokens after finetuning with QLoRA · 2 · #11 opened about 1 year ago by monuminu
- RoPE scaling and max_position_embeddings · 2 · #12 opened about 1 year ago by ag0
- What is the VRAM requirement of this model? · 5 · #1 opened about 1 year ago by Said2k
- GGML Version · 8 · #4 opened about 1 year ago by s3nh
- Can try code as long text data. · 1 · #1 opened about 1 year ago by win10
- Training diverges when used with Llama 2 70B and 4-bit QLoRA · 3 · #10 opened about 1 year ago by alyssavance
- Specify RLHF data for the Instruct and Chat versions in model card · 3 · #9 opened about 1 year ago by markding
- What's the prompt template? · 11 · #4 opened over 1 year ago by qiz
- Is this model commercially usable? · 2 · #10 opened over 1 year ago by AayushShah