AssertionError in classifier_utils.py
Hi, thanks for the great project and code! I am running the cell classification notebook. You mention that you have already obtained the training IDs for splitting the data by setting attr_to_split="individual" and attr_to_balance=["disease","lvef","age","sex","length"] in the prepare_data function. I tried to do this myself, but I got an AssertionError at line 242 of classifier_utils.py (shown as line 243 in the traceback below because of a debugging print I added).
Here is the code I am running:
cc.prepare_data(
    input_data_file="../Genecorpus-30M/example_input_files/cell_classification/disease_classification/human_dcm_hcm_nf.dataset",
    output_directory=output_dir,
    output_prefix=output_prefix,
    attr_to_split="individual",
    attr_to_balance=["disease", "lvef", "age", "sex", "length"],
)
Here is the error (I added a line to print metadata_df in classifier_utils.py for debugging):
AssertionError
File ~/.conda/envs/geneformer_env/lib/python3.10/site-packages/geneformer/classifier_utils.py:243, in balance_attr_splits(data, attr_to_split, attr_to_balance, eval_size, max_trials, pval_threshold, state_key, nproc)
241 split_attr_ids = list(metadata_df["split_attr_ids"])
242 print(metadata_df) #################################################################
--> 243 assert len(split_attr_ids) == len(set(split_attr_ids))
244 eval_num = round(len(split_attr_ids) * eval_size)
245 colnames = (
246 ["trial_num", "train_ids", "eval_ids"]
247 + pu.flatten_list(
(...)
257 + ["mean_pval"]
258 )
AssertionError:
And here is the printed metadata_df:
        split_attr_ids  disease  lvef   age     sex  length
0                 1422        1  70.0  54.0    Male    1756
1                 1678        0  65.0  46.0    Male    2048
2                 1631        1  42.0  46.0    Male    2048
3                 1479        1  15.0  29.0    Male    2048
4                 1516        0  57.5  66.0  Female     862
...                ...      ...   ...   ...     ...     ...
144928            1558        0  55.0  58.0  Female    1241
144931            1722        1  38.0  51.0    Male     883
144937            1510        1  35.0  58.0    Male     647
144939            1617        2  15.0  64.0    Male     968
144949            1371        2  15.0  54.0    Male    1536

[40415 rows x 6 columns]
Thanks!
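Thank you for your interest in Geneformer! This error is likely because you are splitting by "individual": ["disease","lvef","age","sex"] are individual-level attributes, but "length" is a cell-level attribute. A single individual can then map to more than one combination of attribute values, so the split IDs are no longer unique, which is what the assertion checks. If you'd like to ensure that length is evenly distributed, you can add a column such as "avg_length" with the average length per individual and balance on that instead.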
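To illustrate the suggestion above, here is a minimal sketch in plain Python rather than Geneformer's own code. It assumes the metadata table is built from the distinct combinations of the split attribute and the balance attributes; the helper names `add_avg_length` and `unique_metadata_rows` and the toy records are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def add_avg_length(cells, id_key="individual", length_key="length",
                   out_key="avg_length"):
    """Return copies of the cell records with a per-individual mean length."""
    lengths = defaultdict(list)
    for cell in cells:
        lengths[cell[id_key]].append(cell[length_key])
    # One value per individual, so every cell of an individual shares it.
    avg = {ind: mean(vals) for ind, vals in lengths.items()}
    return [{**cell, out_key: avg[cell[id_key]]} for cell in cells]


def unique_metadata_rows(cells, id_key, attrs):
    """Distinct (id, attribute values) combinations across all cells."""
    return {(cell[id_key],) + tuple(cell[a] for a in attrs) for cell in cells}


# Toy records (hypothetical): two individuals, two cells each.
cells = [
    {"individual": 1422, "disease": 1, "length": 1756},
    {"individual": 1422, "disease": 1, "length": 2048},
    {"individual": 1678, "disease": 0, "length": 2048},
    {"individual": 1678, "disease": 0, "length": 862},
]
cells = add_avg_length(cells)

n_individuals = len({c["individual"] for c in cells})
rows_cell_level = unique_metadata_rows(cells, "individual", ["disease", "length"])
rows_individual_level = unique_metadata_rows(
    cells, "individual", ["disease", "avg_length"]
)

# Balancing on the cell-level "length" yields more rows than individuals,
# so a uniqueness check on the IDs would fail; "avg_length" does not.
print(len(rows_cell_level), len(rows_individual_level), n_individuals)
```

With a Hugging Face dataset, the same per-individual averages could presumably be attached with `Dataset.map` or `Dataset.add_column` before calling prepare_data, and "avg_length" would then replace "length" in attr_to_balance.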
Hi, thank you for the suggestion! Adding an average length column resolved the assertion error, but I am now getting a different one. In classifier_utils.py, the variable pval is used without being assigned: local variable 'pval' referenced before assignment. Should line 313 be changed to: pval = chisquare(f_obs=obs, f_exp=exp).pvalue?
File ~/.conda/envs/geneformer_env/lib/python3.10/site-packages/geneformer/classifier_utils.py:315, in balance_attr_splits(data, attr_to_split, attr_to_balance, eval_size, max_trials, pval_threshold, state_key, nproc)
313 chisquare(f_obs=obs, f_exp=exp).pvalue
314 train_attr_counts = str(obs_counts).strip("Counter(").strip(")")
--> 315 eval_attr_counts = str(exp_counts).strip("Counter(").strip(")")
316 df_vals += [train_attr_counts, eval_attr_counts, pval]
317 else:
UnboundLocalError: local variable 'pval' referenced before assignment
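For readers hitting the same message, the pattern behind it can be reproduced in isolation. This is only a sketch: `stand_in_test` is a hypothetical placeholder, not scipy's chisquare, and it assumes another branch of the same function assigns pval, which is what makes Python treat the name as a local.

```python
def stand_in_test(obs, exp):
    # Hypothetical placeholder for a statistical test returning a p-value.
    return 1.0 if obs == exp else 0.5


def summarize_buggy(obs, exp, categorical):
    if categorical:
        stand_in_test(obs, exp)  # result is computed but discarded
    else:
        pval = stand_in_test(obs, exp)
    return pval  # unbound when the categorical branch ran


def summarize_fixed(obs, exp, categorical):
    # Both branches now bind the result to pval.
    if categorical:
        pval = stand_in_test(obs, exp)
    else:
        pval = stand_in_test(obs, exp)
    return pval


try:
    summarize_buggy([1, 2], [1, 2], categorical=True)
    error_name = None
except UnboundLocalError as err:
    error_name = type(err).__name__

fixed_value = summarize_fixed([1, 2], [1, 2], categorical=True)
print(error_name, fixed_value)
```

The bare call on the first branch runs the test but never binds its result, so assigning it to pval, as proposed above, is enough to remove the error in this sketch.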
Thank you for catching this! A fix has been pushed. In the future, if you run into an error that is separate from the original one in a discussion, please start a new discussion with a title that describes the new error. That makes it much easier for others who encounter the same issue to find the prior discussion. Thank you!