How do you convert the original data into datasets?

#28
by weilangchan - opened

Thanks for your fantastic work! I want to train a new model based on data downloaded from the Internet, such as from cellxgene. But a system memory problem arose when I tried to merge these .h5ad files into datasets. Could you please give me some advice?

Thank you for your interest in Geneformer. Could you clarify where you are encountering memory issues? Is this during some preprocessing while you are assembling the data prior to running the transcriptome tokenizer? Or is it while running the transcriptome tokenizer, and if so, during which step? The transcriptome tokenizer scans through the .loom input files without loading the whole file into memory, which we do to avoid memory issues. We did not encounter memory issues while tokenizing ~30 million cells for Genecorpus-30M, but we pretrained the model over 2 years ago, so there is much more data available now and I'm not sure how many cells you are working with.
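For readers hitting the same problem before tokenization, here is a minimal sketch of the same "scan without loading the whole file" idea applied to a single .h5ad file. It is not the Geneformer tokenizer's code. It assumes the `anndata` package is installed and uses its backed mode (`read_h5ad(..., backed="r")`); the helper names `chunk_bounds` and `iter_cell_chunks` and the default chunk size are hypothetical and only for illustration.

```python
from typing import Iterator, Tuple


def chunk_bounds(n_obs: int, chunk_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) row ranges that cover n_obs cells in order."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, n_obs, chunk_size):
        yield start, min(start + chunk_size, n_obs)


def iter_cell_chunks(h5ad_path: str, chunk_size: int = 1000):
    """Yield (start, end, X_chunk) from one .h5ad file, one slice at a time."""
    import anndata as ad  # imported lazily; assumed to be installed

    # backed="r" keeps the expression matrix on disk instead of loading it all.
    adata = ad.read_h5ad(h5ad_path, backed="r")
    try:
        for start, end in chunk_bounds(adata.n_obs, chunk_size):
            # Only this block of cells is read from disk into memory.
            yield start, end, adata.X[start:end]
    finally:
        # Release the underlying HDF5 file handle when iteration ends.
        adata.file.close()
```

Under these assumptions, each downloaded file could be processed on its own, with per-file results written to disk, rather than concatenating every .h5ad file in memory first. Whether that removes the memory pressure depends on where it actually occurs, which is what the questions above are trying to pin down.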

I am closing this issue for now as there have been no updates, but please feel free to open a new issue if needed.

ctheodoris changed discussion status to closed
