Problem about median value

#35
by weilangchan - opened

In its data preprocessing, Geneformer used the median expression value of the gene as the normalization factor, thereby reducing the level of housekeeping genes that are always highly expressed and elevating the level of transcription factor (TF) genes that are lowly expressed.
Following Geneformer's approach, we calculated the median expression value of each gene across all cells and found that the median for many genes is exactly 1.0, which is puzzling.


Afterwards, we looked at the median file provided by the Geneformer paper and discovered that the median expression values they calculated for different genes were floating point numbers, and they were not the same.


Theoretically, count values are integers, so a median calculated directly from counts should be an integer (or end in .5 when the two middle values are averaged). But the medians provided by Geneformer are arbitrary floating-point numbers. Could the Geneformer authors clarify exactly how they calculate the median for each gene?

Thank you for your question. In single cell data, the depth of sequencing affects the number of counts detected in each cell. For example, for the same exact cell, the counts for gene A and B may be 3 and 4, respectively, if sequenced at a certain depth, or 6 and 8 if sequenced at another depth. Therefore, as we describe in the Methods of the manuscript, it's important to normalize the raw counts by the total counts per cell before comparing their values across cells. This leads to the number being a non-integer value.
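To make the sequencing-depth example concrete, here is a minimal sketch in plain Python. The counts for genes A and B come from the example above; the "other" entry and the cell totals are made-up values added only so that each cell has a total to divide by.

```python
# Hypothetical counts for the same cell sequenced at two depths.
# "A" and "B" follow the example in the reply; "other" is an assumption
# standing in for all remaining genes in the cell.
shallow = {"A": 3, "B": 4, "other": 93}
deep = {"A": 6, "B": 8, "other": 186}


def normalize_by_total(counts):
    """Divide each gene's count by the total count of the cell."""
    total = sum(counts.values())
    return {gene: n / total for gene, n in counts.items()}


shallow_norm = normalize_by_total(shallow)
deep_norm = normalize_by_total(deep)

# The raw counts differ by a factor of two, but the normalized values agree,
# and they are fractions rather than integers.
print(shallow_norm["A"], deep_norm["A"])
print(shallow_norm["B"], deep_norm["B"])
```

Because every value is divided by a per-cell total, a median taken over such values will generally not be an integer.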

Additionally, the normalization factors take advantage of the vast array of observations of cell state in Genecorpus-30M to determine the genes that uniquely distinguish cell state. This normalization factor for each gene is calculated once from the pretraining corpus and is used for all future datasets presented to the model. I see you are using mouse genes; so to accomplish this goal, the normalization factor for the mouse genes should be calculated across your entire pretraining corpus (with tens of millions of cells and a vast array of cell states represented) to achieve the richness of gene expression variability expected within the organism. Please note also that, as discussed in the Methods, we opted to use the non-zero median value of expression rather than include zeros in the distribution so as not to weight the value by tissue representation within Genecorpus-30M, assuming that a representative range of transcript values would be observed within the cells in which each gene was detected. Finally, because calculating a nonzero median value of expression across tens of millions of cells can be memory intensive, we recommend using the approach we describe in the Methods to aggregate the transcript count distribution in a memory-efficient manner.
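The difference between a median that includes zeros and a non-zero median can be illustrated with a small sketch. The values below are invented for illustration and are not taken from Genecorpus-30M.

```python
from statistics import median

# Hypothetical normalized expression of one gene across ten cells.
# The gene is detected in only three of them, e.g. cells from one tissue.
values = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.002, 0.004, 0.006]

# Including zeros, the median is driven by how many cells lack the gene.
median_with_zeros = median(values)

# Restricting to cells where the gene was detected gives a value that
# reflects its typical expression level when present.
nonzero_median = median(v for v in values if v > 0)

# Adding more cells that do not express the gene (a change in tissue
# representation) leaves the non-zero median unchanged.
more_zeros = values + [0.0] * 100
nonzero_median_more_zeros = median(v for v in more_zeros if v > 0)

print(median_with_zeros, nonzero_median, nonzero_median_more_zeros)
```

In this toy case the zero-inclusive median is 0 and could not serve as a divisor at all, while the non-zero median stays the same however many non-expressing cells are added.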

All of the above information is included in the Methods, an excerpt of which is below for convenience:

"To accomplish this, we first calculated the non-zero median value of expression of each detected gene across all cells passing quality filtering from the entire Genecorpus-30M. We aggregated the transcript count distribution for each gene in a memory-efficient manner by scanning through chunks of .loom data using loompy, normalizing the gene transcript counts in each cell by the total transcript count of that cell to account for varying sequencing depth and updating the normalized count distribution of the gene within the t-digest data structure developed for accurate online accumulation of rank-based statistics. We then normalized the genes in each single-cell transcriptome by the non-zero median value of expression of that gene across Genecorpus-30M and ordered the genes by the rank of their normalized expression in that specific cell. Of note, we opted to use the non-zero median value of expression rather than include zeros in the distribution so as not to weight the value by tissue representation within Genecorpus-30M, assuming that a representative range of transcript values would be observed within the cells in which each gene was detected. This normalization factor for each gene is calculated once from the pretraining corpus and is used for all future datasets presented to the model. The provided tokenizer code includes this normalization procedure and should be used for tokenizing new datasets presented to Geneformer to ensure consistency of the normalization factor used for each gene."

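The accumulation step described in the excerpt can be sketched as follows. This is not the authors' code: it uses plain Python lists of dictionaries in place of .loom chunks read with loompy, and a hypothetical `ExactDigest` class that simply stores every value in place of a real t-digest. It only shows the order of operations: scan chunk by chunk, normalize each cell by its total count, and add the non-zero normalized values to a per-gene accumulator.

```python
from collections import defaultdict
from statistics import median


class ExactDigest:
    """Hypothetical stand-in for a t-digest.

    It keeps every value, so it is exact but NOT memory-efficient.
    A real t-digest would keep a bounded summary instead.
    """

    def __init__(self):
        self.values = []

    def update(self, x):
        self.values.append(x)

    def median(self):
        return median(self.values)


def accumulate(chunks):
    """Scan chunks of cells and accumulate non-zero normalized values per gene."""
    digests = defaultdict(ExactDigest)
    for chunk in chunks:
        for cell in chunk:
            total = sum(cell.values())
            if total == 0:
                continue  # skip empty cells
            for gene, count in cell.items():
                if count > 0:  # non-zero values only
                    digests[gene].update(count / total)
    return digests


# Toy data: two chunks, three cells, three genes (all values made up).
chunks = [
    [{"G1": 2, "G2": 0, "G3": 8}, {"G1": 1, "G2": 4, "G3": 5}],
    [{"G1": 0, "G2": 3, "G3": 7}],
]

digests = accumulate(chunks)
gene_medians = {gene: d.median() for gene, d in digests.items()}
print(gene_medians)
```

Only one chunk needs to be in memory at a time. Swapping the stand-in for a real t-digest would also bound the memory used by the accumulators themselves, at the cost of an approximate median.
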
ctheodoris changed discussion status to closed

Could you please open-source the code used to calculate the median values?

weilangchan changed discussion status to open

I would also like to know whether the t-digest is applied after normalization.
Thank you in advance!

Thank you for your question. The t-digest is not "applied". It is a method for accurate online accumulation of rank-based statistics. Once the data is added to the t-digest, it updates the distribution; there is no way to apply the total counts normalization after the data is accumulated in the t-digest. Please read about t-digests in the reference provided in the Methods of the Geneformer manuscript in order to understand this method:
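A small sketch of why the total-count normalization has to happen before values are accumulated: once values are pooled, the link between each value and the total count of its cell is lost. The numbers are invented, and `statistics.median` stands in for whatever rank statistic the accumulator would report.

```python
from statistics import median

# Hypothetical (count of one gene, total counts in that cell) for three cells.
cells = [(1, 100), (4, 1000), (2, 50)]

# Normalize each observation by its own cell total, then take the median.
median_of_normalized = median(count / total for count, total in cells)

# Pooling raw counts first discards the per-cell totals.
median_of_raw = median(count for count, _ in cells)

print(median_of_normalized)
print(median_of_raw)
```

Here the cell with the median raw count (2 of 50) is the one with the highest normalized value, so no single rescaling applied afterwards to the pooled raw counts would recover the median of the normalized values.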

Dunning, T. The t-digest: efficient estimates of distributions. Softw. Impacts 7, 100049 (2021).

Upon request, we are providing the code that we used for obtaining the non-zero median expression value of each gene across the broad range of cell types represented in Genecorpus-30M that we use as a normalization factor to prioritize genes that uniquely distinguish cell state.

However, we want to ensure the following is very clear to users:
If using Geneformer, to ensure consistency of the normalization factor used for each gene for all future datasets, users should use the Geneformer transcriptome tokenizer to tokenize their datasets and should not re-calculate this normalization factor for their individual dataset. This code for re-calculating the normalization factor should only be used by users who are pretraining a new model from scratch with a new pretraining corpus other than Genecorpus-30M.
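For intuition only, the sketch below shows what it means to reuse one fixed set of normalization factors for every dataset. It is not the Geneformer tokenizer and should not be used in its place. The gene names, the factor values in `GENE_MEDIANS`, and the helper `rank_genes` are all hypothetical.

```python
# Hypothetical fixed factors, standing in for the per-gene non-zero medians
# computed once from the pretraining corpus. They are never recomputed here.
GENE_MEDIANS = {"HOUSEKEEPING": 0.05, "TF": 0.001, "OTHER": 0.01}


def rank_genes(cell_counts, gene_medians):
    """Order the detected genes of one cell by normalized expression."""
    total = sum(cell_counts.values())
    scaled = {
        gene: (count / total) / gene_medians[gene]
        for gene, count in cell_counts.items()
        if count > 0 and gene in gene_medians
    }
    # Highest normalized expression first.
    return sorted(scaled, key=scaled.get, reverse=True)


# One cell from a new dataset (made-up counts).
cell = {"HOUSEKEEPING": 60, "TF": 3, "OTHER": 20}

raw_order = sorted(cell, key=cell.get, reverse=True)
normalized_order = rank_genes(cell, GENE_MEDIANS)

print(raw_order)
print(normalized_order)
```

With these toy numbers the highly expressed housekeeping gene moves from first place to last and the lowly expressed transcription factor moves to the front. Since the ranking depends on the factors, recomputing them per dataset would make the rankings of different datasets inconsistent with each other.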

Furthermore, it is critical that this calculation is performed on a large-scale pretraining corpus that has tens of millions of cells from a broad range of human tissues. The richness of variable cell states in the pretraining corpus is what allows this normalization factor to accomplish the goal of prioritizing genes that uniquely distinguish cell states. This normalization factor for each gene is calculated once from the large-scale pretraining corpus and is used for all future datasets presented to the model.

Also, as discussed in the Methods, we only included droplet-based sequencing platforms in the pretraining corpus to ensure that expression value units are comparable when calculating this normalization factor. Users wishing to pretrain a new model from scratch with a new pretraining corpus should choose either droplet-based or plate-based platforms for calculating this normalization factor, or should be aware that including both platforms may have unintended effects on the results. Once the normalization factor is calculated, however, data from any platform can be used with the model because the expression value units will be consistent within each individual cell.

ctheodoris changed discussion status to closed
