AssertionError in classifier_utils.py

#343
by lhl1bit - opened

Hi, thanks for the great project and code! I am running the cell classification notebook. You mention that you had already obtained the training IDs by splitting the data with attr_to_split="individual" and attr_to_balance=["disease","lvef","age","sex","length"] in the prepare_data function. I tried to do this myself, but I got an AssertionError at line 242 of classifier_utils.py.

Here is the code I am running:
cc.prepare_data(
    input_data_file="../Genecorpus-30M/example_input_files/cell_classification/disease_classification/human_dcm_hcm_nf.dataset",
    output_directory=output_dir,
    output_prefix=output_prefix,
    attr_to_split="individual",
    attr_to_balance=["disease", "lvef", "age", "sex", "length"],
)

Here is the error (I added a line to print metadata_df in classifier_utils.py for debugging, which is why the assertion appears at line 243 in the traceback):
AssertionError
File ~/.conda/envs/geneformer_env/lib/python3.10/site-packages/geneformer/classifier_utils.py:243, in balance_attr_splits(data, attr_to_split, attr_to_balance, eval_size, max_trials, pval_threshold, state_key, nproc)
241 split_attr_ids = list(metadata_df["split_attr_ids"])
242 print(metadata_df) #################################################################
--> 243 assert len(split_attr_ids) == len(set(split_attr_ids))
244 eval_num = round(len(split_attr_ids) * eval_size)
245 colnames = (
246 ["trial_num", "train_ids", "eval_ids"]
247 + pu.flatten_list(
(...)
257 + ["mean_pval"]
258 )

AssertionError:

And here is the printed metadata_df:
        split_attr_ids  disease  lvef   age     sex  length
0                 1422        1  70.0  54.0    Male    1756
1                 1678        0  65.0  46.0    Male    2048
2                 1631        1  42.0  46.0    Male    2048
3                 1479        1  15.0  29.0    Male    2048
4                 1516        0  57.5  66.0  Female     862
...                ...      ...   ...   ...     ...     ...
144928            1558        0  55.0  58.0  Female    1241
144931            1722        1  38.0  51.0    Male     883
144937            1510        1  35.0  58.0    Male     647
144939            1617        2  15.0  64.0    Male     968
144949            1371        2  15.0  54.0    Male    1536

[40415 rows x 6 columns]

Thanks!

Thank you for your interest in Geneformer! This error is likely because you are splitting by "individual": "disease", "lvef", "age", and "sex" are individual-level attributes, but "length" is a cell-level attribute, so the same individual can appear in several rows with different lengths and the split IDs are no longer unique. If you'd like to ensure that length is evenly distributed, you can add a column such as "avg_length" holding the average length per individual.
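
To illustrate the suggestion, here is a minimal sketch in plain Python rather than Geneformer's actual code. It assumes, as the printed metadata_df suggests, that the metadata is reduced to unique rows before the uniqueness check. The toy records, the unique_rows helper, and the "avg_length" values are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Toy cell-level records (hypothetical values): one dict per cell.
cells = [
    {"individual": 1422, "disease": 1, "length": 1756},
    {"individual": 1422, "disease": 1, "length": 2048},
    {"individual": 1678, "disease": 0, "length": 2048},
    {"individual": 1678, "disease": 0, "length": 862},
]


def unique_rows(records, columns):
    """Return the distinct tuples of the given columns, in first-seen order."""
    return list(dict.fromkeys(tuple(r[c] for c in columns) for r in records))


# With the cell-level "length", one individual yields several distinct rows,
# so the list of individual IDs contains duplicates.
rows = unique_rows(cells, ["individual", "disease", "length"])
ids = [row[0] for row in rows]
has_duplicate_ids = len(ids) != len(set(ids))

# Collapse "length" to one value per individual: the mean over that
# individual's cells, stored under the hypothetical name "avg_length".
lengths = defaultdict(list)
for cell in cells:
    lengths[cell["individual"]].append(cell["length"])
avg_length = {ind: mean(vals) for ind, vals in lengths.items()}
for cell in cells:
    cell["avg_length"] = avg_length[cell["individual"]]

# Every balanced attribute is now individual-level, so IDs are unique again.
rows_fixed = unique_rows(cells, ["individual", "disease", "avg_length"])
ids_fixed = [row[0] for row in rows_fixed]
ids_are_unique = len(ids_fixed) == len(set(ids_fixed))
```

With a real dataset, the same idea would mean adding the per-individual average as a new column and passing "avg_length" instead of "length" in attr_to_balance.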

ctheodoris changed discussion status to closed

Hi, thank you for the suggestion! Adding an average length column resolved the assertion error, but I am now getting a different one. In classifier_utils.py, the variable pval is used before it is assigned: local variable 'pval' referenced before assignment. Should line 313 be changed to: pval = chisquare(f_obs=obs, f_exp=exp).pvalue?

File ~/.conda/envs/geneformer_env/lib/python3.10/site-packages/geneformer/classifier_utils.py:315, in balance_attr_splits(data, attr_to_split, attr_to_balance, eval_size, max_trials, pval_threshold, state_key, nproc)
313 chisquare(f_obs=obs, f_exp=exp).pvalue
314 train_attr_counts = str(obs_counts).strip("Counter(").strip(")")
--> 315 eval_attr_counts = str(exp_counts).strip("Counter(").strip(")")
316 df_vals += [train_attr_counts, eval_attr_counts, pval]
317 else:

UnboundLocalError: local variable 'pval' referenced before assignment
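
For readers hitting the same message, the pattern can be reproduced in a few lines. This is a hedged sketch, not the library's code: fake_pvalue is a hypothetical stand-in for chisquare(f_obs=obs, f_exp=exp).pvalue, and the categorical flag and the else branch are assumptions about why pval is local to the function in the first place.

```python
def fake_pvalue(obs, exp):
    """Hypothetical stand-in for chisquare(f_obs=obs, f_exp=exp).pvalue."""
    return 1.0 if obs == exp else 0.5


def buggy_branch(obs, exp, categorical):
    if categorical:
        fake_pvalue(obs, exp)  # value is computed but never bound to a name
    else:
        pval = 0.0  # placeholder for a branch that does assign pval
    return pval  # fails when categorical is True


def fixed_branch(obs, exp, categorical):
    if categorical:
        pval = fake_pvalue(obs, exp)  # bind the result, as proposed above
    else:
        pval = 0.0
    return pval


# Record which exception the buggy version raises on the categorical path.
try:
    buggy_branch([5, 5], [5, 5], categorical=True)
    error_name = None
except UnboundLocalError as err:
    error_name = type(err).__name__

fixed_result = fixed_branch([5, 5], [5, 5], categorical=True)
```

Because pval is assigned somewhere in the function, Python treats it as a local variable everywhere in that function, so reading it on a path that never assigned it raises UnboundLocalError rather than falling back to a global.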

Thank you for catching this! A fix has been pushed. By the way, in the future, if you run into an error that is separate from the original one in a discussion, please start a new discussion with a title pertinent to the new error. That makes it much easier for others who encounter the same issue to find the prior discussion. Thank you!
