AssertionError in classifier_utils.py

#343
by lhl1bit - opened

Hi, thanks for the great project and code! I am running the cell classification notebook. You mention that you have already obtained the training IDs for splitting the data by setting attr_to_split="individual" and attr_to_balance=["disease","lvef","age","sex","length"] in the prepare_data function. I tried to do this myself, but I got an AssertionError at line 242 of classifier_utils.py.

Here is the code I am running:
cc.prepare_data(
    input_data_file="../Genecorpus-30M/example_input_files/cell_classification/disease_classification/human_dcm_hcm_nf.dataset",
    output_directory=output_dir,
    output_prefix=output_prefix,
    attr_to_split="individual",
    attr_to_balance=["disease","lvef","age","sex","length"],
)

Here is the error (I added a line to print metadata_df in classifier_utils.py for debugging, which is why the assertion shows up at line 243 instead of 242):
AssertionError
File ~/.conda/envs/geneformer_env/lib/python3.10/site-packages/geneformer/classifier_utils.py:243, in balance_attr_splits(data, attr_to_split, attr_to_balance, eval_size, max_trials, pval_threshold, state_key, nproc)
241 split_attr_ids = list(metadata_df["split_attr_ids"])
242 print(metadata_df) #################################################################
--> 243 assert len(split_attr_ids) == len(set(split_attr_ids))
244 eval_num = round(len(split_attr_ids) * eval_size)
245 colnames = (
246 ["trial_num", "train_ids", "eval_ids"]
247 + pu.flatten_list(
(...)
257 + ["mean_pval"]
258 )

AssertionError:

And here is the printed metadata_df:
        split_attr_ids  disease  lvef   age     sex  length
0                 1422        1  70.0  54.0    Male    1756
1                 1678        0  65.0  46.0    Male    2048
2                 1631        1  42.0  46.0    Male    2048
3                 1479        1  15.0  29.0    Male    2048
4                 1516        0  57.5  66.0  Female     862
...                ...      ...   ...   ...     ...     ...
144928            1558        0  55.0  58.0  Female    1241
144931            1722        1  38.0  51.0    Male     883
144937            1510        1  35.0  58.0    Male     647
144939            1617        2  15.0  64.0    Male     968
144949            1371        2  15.0  54.0    Male    1536

[40415 rows x 6 columns]

Thanks!

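Thank you for your interest in Geneformer! This error is likely because you are splitting by "individual": ["disease","lvef","age","sex"] are individual-level attributes, but "length" is a cell-level attribute. Since cells from the same individual have different lengths, the same individual ID shows up in many rows of metadata_df (40415 rows in your printout), so the assertion that split_attr_ids are unique fails. If you'd like to ensure the length is evenly distributed, you can add a column such as "avg_length" with the average length per individual and balance on that instead of "length".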
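For illustration, here is a minimal pandas sketch of that suggestion. It uses a small made-up table rather than the real dataset. The helper names non_constant_attrs and add_group_mean are hypothetical and not part of Geneformer. The column names "individual", "disease" and "length" are assumed to match the ones used in this thread.

```python
import pandas as pd

# Toy stand-in for the per-cell metadata (made-up rows, not the real dataset).
cells = pd.DataFrame(
    {
        "individual": [1422, 1422, 1678, 1678, 1678],
        "disease": [1, 1, 0, 0, 0],
        "length": [1756, 2048, 2048, 862, 1241],
    }
)


def non_constant_attrs(df, split_col, attrs):
    """Return the attributes that take more than one value within a split group."""
    per_group = df.groupby(split_col)[attrs].nunique(dropna=False)
    return [a for a in attrs if (per_group[a] > 1).any()]


def add_group_mean(df, split_col, value_col, new_col):
    """Return a copy of df with the per-group mean of value_col in new_col."""
    out = df.copy()
    out[new_col] = out.groupby(split_col)[value_col].transform("mean")
    return out


# Which attributes vary within an individual (i.e. are cell-level)?
cell_level = non_constant_attrs(cells, "individual", ["disease", "length"])

# Replace the cell-level attribute with an individual-level summary.
cells_fixed = add_group_mean(cells, "individual", "length", "avg_length")

# With only individual-level attributes, each individual collapses to one row.
unique_rows = cells_fixed[["individual", "disease", "avg_length"]].drop_duplicates()
print(cell_level)
print(unique_rows)
```

In this toy example only "length" is reported as varying within an individual, and after switching to "avg_length" each individual has a single row. On the real data the new column would need to be added to the dataset itself before calling prepare_data (for a Hugging Face Dataset, something like Dataset.add_column should work, though this was not tested here), and attr_to_balance would then be ["disease","lvef","age","sex","avg_length"].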

ctheodoris changed discussion status to closed
