08/13/2023 00:40:06 - INFO - __main__ - Distributed environment: NO Num processes: 1 Process index: 0 Local process index: 0 Device: cuda Use FP16 precision: False
08/13/2023 00:40:06 - WARNING - __main__ - Namespace(dataset_name='s-nlp/paradetox', dataset_config_name=None, train_file=None, ignore_pad_token_for_loss=True, max_source_length=1024, source_prefix=None, preprocessing_num_workers=None, overwrite_cache=None, max_target_length=128, val_max_target_length=None, pad_to_max_length=False, model_name_or_path='s-nlp/bart-base-detox', config_name=None, tokenizer_name=None, text_column=None, summary_column=None, use_slow_tokenizer=False, per_device_train_batch_size=8, per_device_eval_batch_size=4, learning_rate=3e-05, weight_decay=0.0, num_train_epochs=10, max_train_steps=None, gradient_accumulation_steps=2, lr_scheduler_type=, warmup_ratio=0.05, output_dir='./output_s-nlp/paradetox_bart_base_detox/16_8_3_1_10_3e-05_fp16', seed=28, model_type=None, teacher_model='s-nlp/bart-base-detox', student_model='s-nlp/bart-base-detox', pred_distill=True, intermediate_distill=True, weight_bits=16, input_bits=8, clip_val=2.5, length_penalty=150, max_length=62, min_length=11, num_beams=6, do_train=True, do_test=True, test_teacher=False, distill_encoder=3, distill_decoder=1, log_steps=20, local_rank=0, weighted=False, new_distill_map=False, task_weight=1, logits_weight=1, hid_weight=1)
08/13/2023 00:40:24 - INFO - __main__ - ***** Running training *****
08/13/2023 00:40:24 - INFO - __main__ - Num examples = 19546
08/13/2023 00:40:24 - INFO - __main__ - Num Epochs = 10
08/13/2023 00:40:24 - INFO - __main__ - Instantaneous batch size per device = 8
08/13/2023 00:40:24 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 16
08/13/2023 00:40:24 - INFO - __main__ - Gradient Accumulation steps = 2
08/13/2023 00:40:24 - INFO - __main__ - Total optimization steps = 24440
08/13/2023 00:40:24 - INFO - __main__ - student encoder layers = 3
08/13/2023 00:40:24 - INFO - __main__ - student decoder layers = 1
08/13/2023 00:40:24 - INFO - __main__ - student encoder layers [0, 1, 2] are mapped to teacher encoder layers [0, 2, 5]
08/13/2023 00:40:24 - INFO - __main__ - student decoder layers [0] are mapped to teacher decoder layers [5]
08/13/2023 00:48:32 - INFO - __main__ - evaluation result: {'accuracy': 0.9501243829727173, 'similarity': 0.5612009167671204, 'fluency': 0.8357802033424377, 'joint': 0.4501223564147949, 'chrF': 22.59420505922646}
08/13/2023 00:57:03 - INFO - __main__ - evaluation result: {'accuracy': 0.9501243829727173, 'similarity': 0.5612009167671204, 'fluency': 0.8357802033424377, 'joint': 0.4501223564147949, 'chrF': 54.978123489133125}
08/13/2023 01:05:13 - INFO - __main__ - evaluation result: {'accuracy': 0.9501243829727173, 'similarity': 0.5612009167671204, 'fluency': 0.8357802033424377, 'joint': 0.4501223564147949, 'chrF': 62.023727727095206}
08/13/2023 01:13:26 - INFO - __main__ - evaluation result: {'accuracy': 0.9501243829727173, 'similarity': 0.5612009167671204, 'fluency': 0.8357802033424377, 'joint': 0.4501223564147949, 'chrF': 63.97172971159594}
08/13/2023 01:21:23 - INFO - __main__ - evaluation result: {'accuracy': 0.9501243829727173, 'similarity': 0.5612009167671204, 'fluency': 0.8357802033424377, 'joint': 0.4501223564147949, 'chrF': 64.5360970481176}
08/13/2023 01:29:39 - INFO - __main__ - evaluation result: {'accuracy': 0.9501243829727173, 'similarity': 0.5612009167671204, 'fluency': 0.8357802033424377, 'joint': 0.4501223564147949, 'chrF': 64.94717856121133}
08/13/2023 01:37:40 - INFO - __main__ - evaluation result: {'accuracy': 0.9501243829727173, 'similarity': 0.5612009167671204, 'fluency': 0.8357802033424377, 'joint': 0.4501223564147949, 'chrF': 64.59791205000518}
08/13/2023 01:45:57 - INFO - __main__ - evaluation result: {'accuracy': 0.9501243829727173, 'similarity': 0.5612009167671204, 'fluency': 0.8357802033424377, 'joint': 0.4501223564147949, 'chrF': 65.09188210629405}
08/13/2023 01:54:25 - INFO - __main__ - evaluation result: {'accuracy': 0.9501243829727173, 'similarity': 0.5612009167671204, 'fluency': 0.8357802033424377, 'joint': 0.4501223564147949, 'chrF': 65.43488256903395}
08/13/2023 02:02:48 - INFO - __main__ - evaluation result: {'accuracy': 0.9501243829727173, 'similarity': 0.5612009167671204, 'fluency': 0.8357802033424377, 'joint': 0.4501223564147949, 'chrF': 65.13437994345524}
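
Note on the layer mapping printed above: with the 6-layer bart-base teacher encoder, the logged mapping [0, 1, 2] -> [0, 2, 5] (and [0] -> [5] for the decoder) is consistent with an evenly spaced selection that always keeps the last teacher layer. A minimal Python sketch of such a rule follows; the script's actual mapping function is not shown in this log, so this is only an assumption that reproduces the printed indices.

def evenly_spaced_map(n_student: int, n_teacher: int) -> list[int]:
    # Hypothetical mapping rule, reconstructed from the log output alone:
    # spread student layers evenly over [0, n_teacher - 1], keeping the
    # first and last teacher layers. Python's round() uses banker's
    # rounding, so round(2.5) == 2, which yields [0, 2, 5] rather than [0, 3, 5].
    if n_student == 1:
        return [n_teacher - 1]                      # decoder case: [0] -> [5]
    step = (n_teacher - 1) / (n_student - 1)
    return [round(i * step) for i in range(n_student)]

print(evenly_spaced_map(3, 6))  # [0, 2, 5] -- matches the logged encoder mapping
print(evenly_spaced_map(1, 6))  # [5]       -- matches the logged decoder mapping
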
08/13/2023 13:33:25 - WARNING - __main__ - You're running a t5 model but didn't provide a source prefix, which is expected, e.g. with `--source_prefix 'summarize: '`
08/13/2023 13:33:25 - INFO - __main__ - Distributed environment: NO Num processes: 1 Process index: 0 Local process index: 0 Device: cuda Use FP16 precision: False
08/13/2023 13:33:25 - WARNING - __main__ - Namespace(dataset_name='s-nlp/paradetox', dataset_config_name=None, train_file=None, ignore_pad_token_for_loss=True, max_source_length=1024, source_prefix=None, preprocessing_num_workers=None, overwrite_cache=None, max_target_length=128, val_max_target_length=None, pad_to_max_length=False, model_name_or_path='t5-large', config_name=None, tokenizer_name=None, text_column=None, summary_column=None, use_slow_tokenizer=False, per_device_train_batch_size=8, per_device_eval_batch_size=4, learning_rate=3e-05, weight_decay=0.0, num_train_epochs=10, max_train_steps=None, gradient_accumulation_steps=2, lr_scheduler_type=, warmup_ratio=0.05, output_dir='./output_s-nlp/paradetox_bart_base_detox/16_8_3_1_10_3e-05_fp16', seed=28, model_type=None, teacher_model='t5-large', student_model='t5-large', pred_distill=True, intermediate_distill=True, weight_bits=16, input_bits=8, clip_val=2.5, length_penalty=150, max_length=62, min_length=11, num_beams=6, do_train=True, do_test=True, test_teacher=False, distill_encoder=3, distill_decoder=1, log_steps=20, local_rank=0, weighted=False, new_distill_map=False, task_weight=1, logits_weight=1, hid_weight=1)
08/13/2023 13:33:49 - INFO - __main__ - ***** Running training *****
08/13/2023 13:33:49 - INFO - __main__ - Num examples = 19546
08/13/2023 13:33:49 - INFO - __main__ - Num Epochs = 10
08/13/2023 13:33:49 - INFO - __main__ - Instantaneous batch size per device = 8
08/13/2023 13:33:49 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 16
08/13/2023 13:33:49 - INFO - __main__ - Gradient Accumulation steps = 2
08/13/2023 13:33:49 - INFO - __main__ - Total optimization steps = 24440
08/13/2023 13:33:49 - INFO - __main__ - student encoder layers = 3
08/13/2023 13:33:49 - INFO - __main__ - student decoder layers = 1
08/13/2023 13:33:49 - INFO - __main__ - student encoder layers [0, 1, 2] are mapped to teacher encoder layers [0, 2, 5]
08/13/2023 13:33:49 - INFO - __main__ - student decoder layers [0] are mapped to teacher decoder layers [5]
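
The 13:33:25 WARNING is the standard Hugging Face seq2seq-script notice: T5 checkpoints were pre-trained with task prefixes, so the example scripts expect the source text to be prepended with --source_prefix before tokenization. A minimal sketch of what that preprocessing typically looks like; the prefix string and column name here are illustrative placeholders, not values taken from this script.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-large")
source_prefix = "summarize: "  # would come from --source_prefix; it was None here, hence the warning

def preprocess(batch, max_source_length=1024):
    # Prepend the task prefix to every source sentence before tokenizing.
    # "source_text" is a placeholder column name, not the dataset's actual one.
    inputs = [source_prefix + text for text in batch["source_text"]]
    return tokenizer(inputs, max_length=max_source_length, truncation=True)
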
08/13/2023 18:13:36 - INFO - __main__ - Distributed environment: NO Num processes: 1 Process index: 0 Local process index: 0 Device: cuda Use FP16 precision: False
08/13/2023 18:13:36 - WARNING - __main__ - Namespace(dataset_name='s-nlp/paradetox', dataset_config_name=None, train_file=None, ignore_pad_token_for_loss=True, max_source_length=1024, source_prefix=None, preprocessing_num_workers=None, overwrite_cache=None, max_target_length=128, val_max_target_length=None, pad_to_max_length=False, model_name_or_path='facebook/bart-large', config_name=None, tokenizer_name=None, text_column=None, summary_column=None, use_slow_tokenizer=False, per_device_train_batch_size=8, per_device_eval_batch_size=4, learning_rate=3e-05, weight_decay=0.0, num_train_epochs=10, max_train_steps=None, gradient_accumulation_steps=2, lr_scheduler_type=, warmup_ratio=0.05, output_dir='./output_s-nlp/paradetox_bart_base_detox/16_8_3_1_10_3e-05_fp16', seed=28, model_type=None, teacher_model='facebook/bart-large', student_model='facebook/bart-large', pred_distill=True, intermediate_distill=True, weight_bits=16, input_bits=8, clip_val=2.5, length_penalty=150, max_length=62, min_length=11, num_beams=6, do_train=True, do_test=True, test_teacher=False, distill_encoder=3, distill_decoder=1, log_steps=20, local_rank=0, weighted=False, new_distill_map=False, task_weight=1, logits_weight=1, hid_weight=1)
08/13/2023 18:13:57 - INFO - __main__ - ***** Running training *****
08/13/2023 18:13:57 - INFO - __main__ - Num examples = 19546
08/13/2023 18:13:57 - INFO - __main__ - Num Epochs = 10
08/13/2023 18:13:57 - INFO - __main__ - Instantaneous batch size per device = 8
08/13/2023 18:13:57 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 16
08/13/2023 18:13:57 - INFO - __main__ - Gradient Accumulation steps = 2
08/13/2023 18:13:57 - INFO - __main__ - Total optimization steps = 24440
08/13/2023 18:13:57 - INFO - __main__ - student encoder layers = 3
08/13/2023 18:13:57 - INFO - __main__ - student decoder layers = 1
08/13/2023 18:13:57 - INFO - __main__ - student encoder layers [0, 1, 2] are mapped to teacher encoder layers [0, 2, 5]
08/13/2023 18:13:57 - INFO - __main__ - student decoder layers [0] are mapped to teacher decoder layers [5]
08/13/2023 18:23:59 - INFO - __main__ - evaluation result: {'accuracy': 0.9501243829727173, 'similarity': 0.5612009167671204, 'fluency': 0.8357802033424377, 'joint': 0.4501223564147949, 'chrF': 13.927599253697903}
08/13/2023 18:33:46 - INFO - __main__ - evaluation result: {'accuracy': 0.9501243829727173, 'similarity': 0.5612009167671204, 'fluency': 0.8357802033424377, 'joint': 0.4501223564147949, 'chrF': 14.54024753825692}
08/13/2023 18:43:47 - INFO - __main__ - evaluation result: {'accuracy': 0.9501243829727173, 'similarity': 0.5612009167671204, 'fluency': 0.8357802033424377, 'joint': 0.4501223564147949, 'chrF': 15.696559967829156}
08/13/2023 18:53:59 - INFO - __main__ - evaluation result: {'accuracy': 0.9501243829727173, 'similarity': 0.5612009167671204, 'fluency': 0.8357802033424377, 'joint': 0.4501223564147949, 'chrF': 17.028561047497664}
08/13/2023 19:04:19 - INFO - __main__ - evaluation result: {'accuracy': 0.9501243829727173, 'similarity': 0.5612009167671204, 'fluency': 0.8357802033424377, 'joint': 0.4501223564147949, 'chrF': 18.190088344180207}
08/13/2023 19:14:34 - INFO - __main__ - evaluation result: {'accuracy': 0.9501243829727173, 'similarity': 0.5612009167671204, 'fluency': 0.8357802033424377, 'joint': 0.4501223564147949, 'chrF': 19.35744020416755}
08/13/2023 19:24:59 - INFO - __main__ - evaluation result: {'accuracy': 0.9501243829727173, 'similarity': 0.5612009167671204, 'fluency': 0.8357802033424377, 'joint': 0.4501223564147949, 'chrF': 20.69450840196689}
08/13/2023 19:35:05 - INFO - __main__ - evaluation result: {'accuracy': 0.9501243829727173, 'similarity': 0.5612009167671204, 'fluency': 0.8357802033424377, 'joint': 0.4501223564147949, 'chrF': 21.69196456544141}
08/13/2023 19:45:14 - INFO - __main__ - evaluation result: {'accuracy': 0.9501243829727173, 'similarity': 0.5612009167671204, 'fluency': 0.8357802033424377, 'joint': 0.4501223564147949, 'chrF': 22.46896551649043}
08/13/2023 19:55:18 - INFO - __main__ - evaluation result: {'accuracy': 0.9501243829727173, 'similarity': 0.5612009167671204, 'fluency': 0.8357802033424377, 'joint': 0.4501223564147949, 'chrF': 22.655778155996654}
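
For reference, the fields in the evaluation dicts appear to follow the ParaDetox evaluation convention: 'accuracy' (style transfer accuracy), 'similarity' (content preservation), and 'fluency' are corpus means of per-sentence scores, 'joint' (J) is the mean of the per-sentence product of the three (so it need not equal the product of the three corpus means), and chrF is a character n-gram F-score. A minimal sketch of the J and chrF computations, assuming the per-sentence scores are already available as lists and using sacrebleu for chrF; the classifier and embedding models that produce those per-sentence scores are omitted.

import sacrebleu

def joint_score(sta, sim, fl):
    # J metric: average over sentences of the per-sentence product of
    # style accuracy, content similarity, and fluency (all scores in [0, 1]).
    per_sentence = [a * s * f for a, s, f in zip(sta, sim, fl)]
    return sum(per_sentence) / len(per_sentence)

def chrf_score(hypotheses, references):
    # Corpus-level chrF; sacrebleu expects a list of reference streams,
    # so a single reference list is wrapped in another list.
    return sacrebleu.corpus_chrf(hypotheses, [references]).score
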