/home/aiscuser/.local/lib/python3.8/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.24.4)
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}")
2023/07/19 14:48:47 WARNING mlflow.utils.autologging_utils: You are using an unsupported version of transformers. If you encounter errors during autologging, try upgrading / downgrading transformers to a supported version, or try upgrading MLflow.
2023/07/19 14:48:48 INFO mlflow.tracking.fluent: Autologging successfully enabled for sklearn.
2023/07/19 14:48:48 INFO mlflow.tracking.fluent: Autologging successfully enabled for transformers.
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging results (for instance --report_to none).
Downloading and preparing dataset glue/rte (697k) to /home/aiscuser/.cache/huggingface/datasets/glue/rte/1.0.0/a420f5e518f42454003587c47467370329f9fc0c6508d1ae0c45b58ea266a353...
Training Arguments
TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
bf16=False,
bf16_full_eval=False,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_steps=50,
evaluation_strategy=IntervalStrategy.STEPS,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_model_id=None,
hub_strategy=HubStrategy.EVERY_SAVE,
hub_token=,
ignore_data_skip=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=5e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=-1,
log_level=40,
log_level_replica=-1,
log_on_each_node=True,
logging_dir=/mnt/data/device-aware-bert/token_pruning/experiments/RTE/reproduce1/s0.59_lr5e-05_reglr0.01_alpha0.01_warmup50_bin100/runs/Jul19_14-48-49_node-0,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=25,
logging_strategy=IntervalStrategy.STEPS,
lr_scheduler_type=SchedulerType.LINEAR,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
no_cuda=False,
num_train_epochs=80.0,
optim=OptimizerNames.ADAMW_HF,
output_dir=/mnt/data/device-aware-bert/token_pruning/experiments/RTE/reproduce1/s0.59_lr5e-05_reglr0.01_alpha0.01_warmup50_bin100,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=32,
per_device_train_batch_size=32,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=,
remove_unused_columns=True,
report_to=['mlflow'],
resume_from_checkpoint=None,
run_name=/mnt/data/device-aware-bert/token_pruning/experiments/RTE/reproduce1/s0.59_lr5e-05_reglr0.01_alpha0.01_warmup50_bin100,
save_on_each_node=False,
save_steps=0,
save_strategy=IntervalStrategy.STEPS,
save_total_limit=None,
seed=57,
sharded_ddp=[],
skip_memory_metrics=True,
tf32=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_legacy_prediction_loop=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
xpu_backend=None,
)
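For readers who want to re-create this run's configuration, the non-default values above map onto Hugging Face TrainingArguments roughly as follows. This is a minimal sketch: the token-pruning options live in the project's own AdditionalArguments dataclass (next dump), not in transformers, and the output path is shortened here.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="s0.59_lr5e-05_reglr0.01_alpha0.01_warmup50_bin100",  # shortened path
    overwrite_output_dir=True,
    do_train=True,
    do_eval=True,
    evaluation_strategy="steps",   # IntervalStrategy.STEPS above
    eval_steps=50,
    logging_steps=25,
    learning_rate=5e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=80.0,
    seed=57,
    report_to=["mlflow"],
)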
Additional Arguments
AdditionalArguments(test=False, ex_name='s0.59_lr5e-05_reglr0.01_alpha0.01_warmup50_bin100', pruning_type='token+pruner', reg_learning_rate=0.01, scheduler_type='linear', freeze_embeddings=True, pretrained_pruned_model=None, droprate_init=0.01, temperature=0.6666666666666666, prepruning_finetune_epochs=1, lagrangian_warmup_epochs=50, target_sparsity=0.59, sparsity_epsilon=0, distillation_path='/mnt/data/device-aware-bert/token_pruning/teachers/RTE', do_distill=True, do_layer_distill=False, layer_distill_version=4, distill_loss_alpha=0.9, distill_ce_loss_alpha=0.01, distill_temp=2.0, use_mac_l0=True, prune_location=[3, 4, 5, 6, 7, 8, 9, 10, 11], bin_num=100, topk=20)
----------------------------------------------------------------------
time: 2023-07-19 14:49:32
Evaluating: accuracy: 0.6823, eval_loss: 1.9479, step: 0
lambda_1: 0.0000, lambda_2: 0.0000, lambda_3: 0.0000
Starting l0 regularization! temperature: 0.67, init drop rate: 0.01
token_loga shape: [9, 100]
prune location: [3, 4, 5, 6, 7, 8, 9, 10, 11]
NDCG TOPK = 20
loss: 0.029388, lagrangian_loss: 0.000382, attention_score_distillation_loss: 0.098598
----------------------------------------------------------------------
time: 2023-07-19 14:49:57
Evaluating: accuracy: 0.6715, eval_loss: 2.0791, token_prune_loc: [False, False, False, False, False, False, False, False, False], macs_sparsity: 0.0, expected_sparsity: 0.0, expected_sequence_sparsity: 0.7429, target_sparsity: 0.0073, step: 50
lambda_1: -0.4534, lambda_2: 0.5518, lambda_3: 0.0000
train remain: [1. 1. 1. 1. 1. 1. 1. 1. 1.]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
[token-bin masks: all 9 pruned layers (3-11) keep 100/100 bins]
loss: 0.058469, lagrangian_loss: 0.003047, attention_score_distillation_loss: 0.098007
loss: 0.069943, lagrangian_loss: 0.008408, attention_score_distillation_loss: 0.097426
ETA: 0:50:53 | Epoch 0 finished. Took 38.65 seconds.
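The "Starting l0 regularization!" banner, token_loga shape [9, 100], temperature 0.67, and init drop rate 0.01 are consistent with hard-concrete L0 gates over 100 token bins in each of the 9 prunable layers (Louizos et al., 2018). A minimal sketch of that parameterisation, assuming the paper's default stretch limits of -0.1/1.1; the repo's exact code is not shown in this log.

import math
import torch

temperature = 2.0 / 3.0          # logged as 0.67
droprate_init = 0.01
limit_l, limit_r = -0.1, 1.1     # assumed hard-concrete stretch limits

# log-alpha initialised so the expected keep-probability starts near 1 - droprate_init
token_loga = torch.full((9, 100), math.log(1 - droprate_init) - math.log(droprate_init))

def sample_gates(loga: torch.Tensor) -> torch.Tensor:
    """Reparameterised hard-concrete sample in [0, 1] (training-time gate)."""
    u = torch.rand_like(loga)
    s = torch.sigmoid((torch.log(u) - torch.log(1 - u) + loga) / temperature)
    return (s * (limit_r - limit_l) + limit_l).clamp(0.0, 1.0)

def expected_remain(loga: torch.Tensor) -> torch.Tensor:
    """P(gate > 0) per bin -- what the 'train remain' numbers average."""
    return torch.sigmoid(loga - temperature * math.log(-limit_l / limit_r))

print(expected_remain(token_loga).mean(dim=1))  # close to 1.0 per layer at init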
----------------------------------------------------------------------
time: 2023-07-19 14:50:22
Evaluating: accuracy: 0.6426, eval_loss: 2.4718, token_prune_loc: [False, False, False, False, False, False, False, False, False], macs_sparsity: 0.0, expected_sparsity: 0.0, expected_sequence_sparsity: 0.7429, target_sparsity: 0.0148, step: 100
lambda_1: -1.2022, lambda_2: 1.4034, lambda_3: 0.0000
train remain: [1. 1. 1. 1. 1. 1. 1. 1. 1.]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
[token-bin masks: all 9 pruned layers (3-11) keep 100/100 bins]
loss: 0.008282, lagrangian_loss: 0.016422, attention_score_distillation_loss: 0.096749
loss: 1.125452, lagrangian_loss: 0.026686, attention_score_distillation_loss: 0.096124
----------------------------------------------------------------------
time: 2023-07-19 14:50:48
Evaluating: accuracy: 0.6679, eval_loss: 2.1584, token_prune_loc: [False, False, False, False, False, False, False, False, False], macs_sparsity: 0.0, expected_sparsity: 0.0, expected_sequence_sparsity: 0.7429, target_sparsity: 0.0224, step: 150
lambda_1: -1.9793, lambda_2: 2.3241, lambda_3: 0.0000
train remain: [1. 1. 1. 1. 1. 1. 1. 1. 1.]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
[token-bin masks: all 9 pruned layers (3-11) keep 100/100 bins]
loss: 0.462131, lagrangian_loss: 0.038737, attention_score_distillation_loss: 0.095322
ETA: 0:51:20 | Epoch 1 finished. Took 40.33 seconds.
loss: 0.007580, lagrangian_loss: 0.051961, attention_score_distillation_loss: 0.094810
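The lambda_1/lambda_2 readouts and the lagrangian_loss column behave like a CoFi-style Lagrangian sparsity controller: a penalty lambda_1*(s - t) + lambda_2*(s - t)^2 whose multipliers are updated by gradient ascent at reg_learning_rate=0.01 while the model minimises the same term. A sketch under that assumption; variable names are illustrative.

import torch

lambda_1 = torch.zeros(1, requires_grad=True)  # trained by ascent, drifts negative here
lambda_2 = torch.zeros(1, requires_grad=True)

def lagrangian_loss(expected_sparsity: torch.Tensor, target_sparsity: float) -> torch.Tensor:
    gap = expected_sparsity - target_sparsity
    return lambda_1 * gap + lambda_2 * gap.pow(2)

# e.g. at step 50: expected sparsity ~0.0 vs target 0.0073 -> small positive penalty,
# and ascent pushes lambda_1 negative, exactly as logged.
loss = lagrangian_loss(torch.tensor(0.0), 0.0073)

This also explains the sign flip later in the log: once the expected sparsity overshoots the warmed-up target while lambda_1 is negative, the penalty itself turns negative (visible from roughly step 1000 on).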
----------------------------------------------------------------------
time: 2023-07-19 14:51:13
Evaluating: accuracy: 0.6787, eval_loss: 2.0548, token_prune_loc: [False, False, False, False, False, False, False, False, False], macs_sparsity: 0.0, expected_sparsity: 0.0, expected_sequence_sparsity: 0.7429, target_sparsity: 0.03, step: 200
lambda_1: -2.7483, lambda_2: 3.2444, lambda_3: 0.0000
train remain: [1. 1. 1. 1. 1. 1. 1. 1. 1.]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
[token-bin masks: all 9 pruned layers (3-11) keep 100/100 bins]
loss: 0.542208, lagrangian_loss: 0.068293, attention_score_distillation_loss: 0.093928
loss: 0.710257, lagrangian_loss: 0.085466, attention_score_distillation_loss: 0.093510
ETA: 0:50:30 | Epoch 2 finished. Took 39.1 seconds.
----------------------------------------------------------------------
time: 2023-07-19 14:51:39
Evaluating: accuracy: 0.6643, eval_loss: 1.8392, token_prune_loc: [False, False, False, False, False, False, False, False, False], macs_sparsity: 0.0, expected_sparsity: 0.0, expected_sequence_sparsity: 0.7429, target_sparsity: 0.0375, step: 250
lambda_1: -3.5145, lambda_2: 4.1713, lambda_3: 0.0000
train remain: [1. 1. 1. 1. 1. 1. 1. 1. 1.]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
[token-bin masks: all 9 pruned layers (3-11) keep 100/100 bins]
loss: 0.132700, lagrangian_loss: 0.102254, attention_score_distillation_loss: 0.092884
loss: 0.447086, lagrangian_loss: 0.122965, attention_score_distillation_loss: 0.092084
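target_sparsity climbs linearly (0.0073 at step 50, 0.0148 at 100, 0.0224 at 150, ...), i.e. the final target of 0.59 is warmed up over lagrangian_warmup_epochs=50. A sketch; the steps-per-epoch figure is an inference from RTE's ~2.5k training examples at batch size 32, not something the log prints.

def target_sparsity_at(step: int, final_target: float = 0.59,
                       warmup_steps: int = 50 * 78) -> float:  # ~78 steps/epoch assumed
    """Linear warm-up of the sparsity target, capped at the final value."""
    return final_target * min(step / warmup_steps, 1.0)

print(round(target_sparsity_at(100), 4))  # ~0.0151, close to the logged 0.0148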
----------------------------------------------------------------------
time: 2023-07-19 14:52:05
Evaluating: accuracy: 0.6968, eval_loss: 1.9262, token_prune_loc: [False, False, False, False, False, False, False, False, False], macs_sparsity: 0.0, expected_sparsity: 0.0, expected_sequence_sparsity: 0.7429, target_sparsity: 0.0451, step: 300
lambda_1: -4.2695, lambda_2: 5.0840, lambda_3: 0.0000
train remain: [1. 1. 1. 1. 1. 1. 1. 1. 1.]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
[token-bin masks: all 9 pruned layers (3-11) keep 100/100 bins]
loss: 0.139396, lagrangian_loss: 0.146520, attention_score_distillation_loss: 0.091695
ETA: 0:50:14 | Epoch 3 finished. Took 40.56 seconds.
loss: 0.013123, lagrangian_loss: 0.171388, attention_score_distillation_loss: 0.090984
----------------------------------------------------------------------
time: 2023-07-19 14:52:30
Evaluating: accuracy: 0.6498, eval_loss: 2.3036, token_prune_loc: [False, False, False, False, False, False, False, False, False], macs_sparsity: 0.0, expected_sparsity: 0.0, expected_sequence_sparsity: 0.7429, target_sparsity: 0.0526, step: 350
lambda_1: -5.0326, lambda_2: 6.0202, lambda_3: 0.0000
train remain: [1. 1. 1. 1. 1. 1. 1. 1. 0.98]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
[token-bin masks: all 9 pruned layers (3-11) keep 100/100 bins]
loss: 0.224630, lagrangian_loss: 0.195066, attention_score_distillation_loss: 0.090328
loss: 0.100729, lagrangian_loss: 0.217814, attention_score_distillation_loss: 0.089662
ETA: 0:49:16 | Epoch 4 finished. Took 38.46 seconds.
----------------------------------------------------------------------
time: 2023-07-19 14:52:56
Evaluating: accuracy: 0.6787, eval_loss: 1.8973, token_prune_loc: [False, False, False, False, False, False, False, False, True], macs_sparsity: 0.013, expected_sparsity: 0.0109, expected_sequence_sparsity: 0.7457, target_sparsity: 0.0602, step: 400
lambda_1: -5.7785, lambda_2: 6.9207, lambda_3: 0.0000
train remain: [1. 1. 1. 0.99 1. 1. 1. 1. 0.95]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92]
[token-bin masks: layers 3-10 keep 100/100 bins]
layer 11: 1101111111111111111111111111110111111111111111111111111111111111111101011111111111111111111100111010
loss: 0.006938, lagrangian_loss: 0.234816, attention_score_distillation_loss: 0.089078
loss: 0.279214, lagrangian_loss: 0.255123, attention_score_distillation_loss: 0.088445
----------------------------------------------------------------------
time: 2023-07-19 14:53:21
Evaluating: accuracy: 0.6823, eval_loss: 2.0859, token_prune_loc: [False, False, False, False, False, False, False, False, True], macs_sparsity: 0.0152, expected_sparsity: 0.015, expected_sequence_sparsity: 0.7467, target_sparsity: 0.0678, step: 450
lambda_1: -6.4900, lambda_2: 7.7537, lambda_3: 0.0000
train remain: [0.99 1. 1. 1. 1. 0.99 1. 0.99 0.92]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.89]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.89]
[token-bin masks: layers 3-10 keep 100/100 bins]
layer 11: 1101111111111111111111111111110111111111111111111101111111111111011101011111011111111111111100111010
loss: 0.003668, lagrangian_loss: 0.274670, attention_score_distillation_loss: 0.087722
ETA: 0:48:51 | Epoch 5 finished. Took 40.57 seconds.
loss: 0.004139, lagrangian_loss: 0.295678, attention_score_distillation_loss: 0.087193
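From step 400 on, the 0/1 rows stop being all ones: each row is the keep-mask over the 100 token bins of one pruned layer (rows follow prune_location, so the last row is layer 11). A purely illustrative helper that recovers the 'infer remain' fractions and 'token_prune_loc' flags from such rows:

mask_rows = {
    # layer 11 at step 400, copied from the block above
    11: "1101111111111111111111111111110111111111111111111111111111111111111101011111111111111111111100111010",
}

for layer in range(3, 12):
    row = mask_rows.get(layer, "1" * 100)   # layers not listed are fully retained
    remain = row.count("1") / len(row)      # fraction of bins kept -> 'infer remain' (~0.92 here)
    pruned = remain < 1.0                   # -> the layer's 'token_prune_loc' flag
    print(layer, round(remain, 2), pruned)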
----------------------------------------------------------------------
time: 2023-07-19 14:53:47
Evaluating: accuracy: 0.7076, eval_loss: 2.0329, token_prune_loc: [False, False, False, False, False, False, False, False, True], macs_sparsity: 0.0196, expected_sparsity: 0.0177, expected_sequence_sparsity: 0.7474, target_sparsity: 0.0753, step: 500
lambda_1: -7.1801, lambda_2: 8.5509, lambda_3: 0.0000
train remain: [0.99 1. 1. 1. 1. 0.99 0.99 0.99 0.89]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.87]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.87]
[token-bin masks: layers 3-10 keep 100/100 bins]
layer 11: 1100111111111111111111111111110111111111111111111101111111111111011101011111111111111101110100111010
loss: 0.009879, lagrangian_loss: 0.315272, attention_score_distillation_loss: 0.085547
loss: 0.503761, lagrangian_loss: 0.329236, attention_score_distillation_loss: 0.085897
ETA: 0:48:04 | Epoch 6 finished. Took 38.94 seconds.
----------------------------------------------------------------------
time: 2023-07-19 14:54:12
Evaluating: accuracy: 0.704, eval_loss: 1.9875, token_prune_loc: [False, False, False, False, False, False, False, False, True], macs_sparsity: 0.0217, expected_sparsity: 0.0205, expected_sequence_sparsity: 0.7481, target_sparsity: 0.0829, step: 550
lambda_1: -7.8482, lambda_2: 9.3105, lambda_3: 0.0000
train remain: [0.99 1. 1. 0.99 1. 0.99 0.99 0.99 0.86]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.85]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.85]
[token-bin masks: layers 3-10 keep 100/100 bins]
layer 11: 1100111111111111111111111111110111111111111111111101111111111111011101011011011111111101110100111010
loss: 0.144107, lagrangian_loss: 0.354074, attention_score_distillation_loss: 0.085122
loss: 0.331107, lagrangian_loss: 0.369864, attention_score_distillation_loss: 0.084503
----------------------------------------------------------------------
time: 2023-07-19 14:54:37
Evaluating: accuracy: 0.7076, eval_loss: 1.8793, token_prune_loc: [False, False, False, False, False, False, False, False, True], macs_sparsity: 0.0239, expected_sparsity: 0.0218, expected_sequence_sparsity: 0.7485, target_sparsity: 0.0905, step: 600
lambda_1: -8.5034, lambda_2: 10.0515, lambda_3: 0.0000
train remain: [0.99 1. 1. 0.99 1. 0.98 0.99 0.98 0.85]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.84]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.84]
[token-bin masks: layers 3-10 keep 100/100 bins]
layer 11: 1100111111111111111111111111110111111111111111111101111111111111011101011011011111111101110100101010
loss: 0.194772, lagrangian_loss: 0.387276, attention_score_distillation_loss: 0.083909
ETA: 0:47:30 | Epoch 7 finished. Took 40.15 seconds.
loss: 0.777043, lagrangian_loss: 0.396693, attention_score_distillation_loss: 0.083216
----------------------------------------------------------------------
time: 2023-07-19 14:55:02
Evaluating: accuracy: 0.7076, eval_loss: 2.1299, token_prune_loc: [False, False, False, False, False, False, False, False, True], macs_sparsity: 0.0261, expected_sparsity: 0.0245, expected_sequence_sparsity: 0.7492, target_sparsity: 0.098, step: 650
lambda_1: -9.1318, lambda_2: 10.7424, lambda_3: 0.0000
train remain: [0.99 1. 1. 0.99 1. 0.97 0.99 0.98 0.83]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.82]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.82]
[token-bin masks: layers 3-10 keep 100/100 bins]
layer 11: 1100111111111111111111111111110111111111111111111100111111101111011101011011011111111101110100101010
loss: 1.046288, lagrangian_loss: 0.379496, attention_score_distillation_loss: 0.082634
loss: 0.018136, lagrangian_loss: 0.376472, attention_score_distillation_loss: 0.082042
----------------------------------------------------------------------
time: 2023-07-19 14:55:28
Evaluating: accuracy: 0.6787, eval_loss: 2.1448, token_prune_loc: [False, False, False, False, False, False, False, False, True], macs_sparsity: 0.0282, expected_sparsity: 0.0273, expected_sequence_sparsity: 0.7499, target_sparsity: 0.1056, step: 700
lambda_1: -9.6937, lambda_2: 11.3030, lambda_3: 0.0000
train remain: [0.99 1. 1. 0.99 1. 0.97 0.98 0.96 0.81]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8]
[token-bin masks: layers 3-10 keep 100/100 bins]
layer 11: 1000111111111111111111111111110111111111111111111100111111101111011101011011011111111101110100001010
loss: 0.013614, lagrangian_loss: 0.369929, attention_score_distillation_loss: 0.081239
ETA: 0:46:54 | Epoch 8 finished. Took 40.03 seconds.
loss: 0.009624, lagrangian_loss: 0.335372, attention_score_distillation_loss: 0.080678
----------------------------------------------------------------------
time: 2023-07-19 14:55:53
Evaluating: accuracy: 0.6823, eval_loss: 2.034, token_prune_loc: [False, False, False, False, False, False, False, True, True], macs_sparsity: 0.0505, expected_sparsity: 0.0489, expected_sequence_sparsity: 0.7555, target_sparsity: 0.1132, step: 750
lambda_1: -10.1785, lambda_2: 11.7273, lambda_3: 0.0000
train remain: [0.98 1. 1. 0.99 1. 0.95 0.98 0.94 0.79]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.78]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.7]
[token-bin masks: layers 3-9 keep 100/100 bins]
layer 10: 1111111101111111011111111101111011111111111111111111101011111111111111110110111111111111111110111110
layer 11: 1000111111111111111111111111110111111111111111111100111101101111011101011011011111111101110000001010
loss: 0.338716, lagrangian_loss: 0.296148, attention_score_distillation_loss: 0.079852
loss: 0.412922, lagrangian_loss: 0.258776, attention_score_distillation_loss: 0.079081
ETA: 0:46:07 | Epoch 9 finished. Took 38.59 seconds.
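'layerwise remain' tracks the running product of the per-layer keep ratios: at step 750, with layer 10 keeping 0.9 and layer 11 keeping 0.78 of the bins, only 0.9 * 0.78 ~ 0.7 of the sequence survives to the last layer. A sketch of that bookkeeping:

infer_remain = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.78]  # layers 3-11 at step 750

layerwise = []
running = 1.0
for r in [1.0, 1.0, 1.0] + infer_remain:  # layers 0-2 are never token-pruned here
    running *= r
    layerwise.append(round(running, 2))
print(layerwise)  # [1.0, ..., 0.9, 0.7] -- matches the logged 12-entry list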
----------------------------------------------------------------------
time: 2023-07-19 14:56:19
Evaluating: accuracy: 0.6895, eval_loss: 2.0758, token_prune_loc: [False, False, False, False, False, True, False, True, True], macs_sparsity: 0.0959, expected_sparsity: 0.0929, expected_sequence_sparsity: 0.7669, target_sparsity: 0.1207, step: 800
lambda_1: -10.5439, lambda_2: 11.9697, lambda_3: 0.0000
train remain: [0.98 1. 0.99 0.99 1. 0.92 0.97 0.93 0.77]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.88, 1.0, 0.89, 0.76]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.88, 0.88, 0.78, 0.6]
[token-bin masks: layers 3-7 and 9 keep 100/100 bins]
layer 8: 1011111111111101111111111111111111111111111011111011111111111111101111111111011111110011111011101100
layer 10: 1111111101111111011111111101111011111111111111111111101011111111111111110110111111111111101110111110
layer 11: 0000111111111111111111111111110111111111111111111100111101101111011101011011011111111101010000001010
loss: 0.395373, lagrangian_loss: 0.190279, attention_score_distillation_loss: 0.078797
loss: 0.791593, lagrangian_loss: 0.186629, attention_score_distillation_loss: 0.077928
----------------------------------------------------------------------
time: 2023-07-19 14:56:43
Evaluating: accuracy: 0.6715, eval_loss: 2.2295, token_prune_loc: [False, False, False, False, False, True, True, True, True], macs_sparsity: 0.1238, expected_sparsity: 0.1211, expected_sequence_sparsity: 0.7742, target_sparsity: 0.1283, step: 850
lambda_1: -10.8044, lambda_2: 12.0899, lambda_3: 0.0000
train remain: [0.98 0.99 0.99 0.99 1. 0.91 0.95 0.93 0.75]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.87, 0.9, 0.88, 0.74]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.87, 0.78, 0.69, 0.51]
[token-bin masks: layers 3-7 keep 100/100 bins]
layer 8: 1011111111111101111111111111111111111111111011111011111111111111101101111111011111110011111011101100
layer 9: 1111111111111011111111111111111111111110111111111111111101100111111111110111111111101111111111011100
layer 10: 1111111101111111011111111101111011111111111111111111101011111111111101110110111111111111101110111110
layer 11: 0000111111111111111111111111110111111111111111111100111101101111011101011011011111110100010000001010
loss: 0.006347, lagrangian_loss: 0.152384, attention_score_distillation_loss: 0.077399
ETA: 0:45:31 | Epoch 10 finished. Took 40.05 seconds.
loss: 0.005691, lagrangian_loss: 0.130721, attention_score_distillation_loss: 0.076740
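The attention_score_distillation_loss column, together with the "NDCG TOPK = 20" banner at startup, suggests the token scorer is trained to rank tokens the way the teacher's attention does, judged on the top 20 tokens. The log does not show the objective itself; the following is one plausible listwise sketch (teacher top-k selection plus soft cross-entropy), not the repo's actual code.

import torch
import torch.nn.functional as F

def topk_rank_distill(student_scores: torch.Tensor, teacher_attn: torch.Tensor, k: int = 20) -> torch.Tensor:
    """student_scores, teacher_attn: [batch, seq_len] token-importance scores."""
    topk = teacher_attn.topk(k, dim=-1).indices
    t = F.softmax(teacher_attn.gather(-1, topk), dim=-1)        # teacher ranking over its top-k
    s = F.log_softmax(student_scores.gather(-1, topk), dim=-1)  # student scores on the same tokens
    return F.kl_div(s, t, reduction="batchmean")

loss = topk_rank_distill(torch.randn(4, 128), torch.rand(4, 128))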
----------------------------------------------------------------------
time: 2023-07-19 14:57:09
Evaluating: accuracy: 0.6751, eval_loss: 2.1946, token_prune_loc: [False, False, False, False, False, True, True, True, True], macs_sparsity: 0.1307, expected_sparsity: 0.1285, expected_sequence_sparsity: 0.7761, target_sparsity: 0.1359, step: 900
lambda_1: -10.9998, lambda_2: 12.1561, lambda_3: 0.0000
train remain: [0.98 0.99 0.99 0.99 1. 0.89 0.94 0.91 0.73]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.86, 0.89, 0.87, 0.73]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.86, 0.77, 0.67, 0.49]
[token-bin masks: layers 3-7 keep 100/100 bins]
layer 8: 1010111111111101111111111111111111111111111011111011111111111111101101111111011111110011111011101100
layer 9: 1111111111111011111111111111111111111110111111111111110101100111111111110111111111101111111111011100
layer 10: 1111111101111111011111111101111011111111111111111111101011111111111101110110111111111111101110111010
layer 11: 0000111111111111111111111111110111111111111111111100111101101111010101011011011111110100010000001010
loss: 0.767037, lagrangian_loss: 0.098042, attention_score_distillation_loss: 0.076120
loss: 0.021643, lagrangian_loss: 0.079077, attention_score_distillation_loss: 0.075473
ETA: 0:44:47 | Epoch 11 finished. Took 38.89 seconds.
----------------------------------------------------------------------
time: 2023-07-19 14:57:35
Evaluating: accuracy: 0.7004, eval_loss: 1.9245, token_prune_loc: [False, False, False, False, False, True, True, True, True], macs_sparsity: 0.1403, expected_sparsity: 0.1357, expected_sequence_sparsity: 0.778, target_sparsity: 0.1434, step: 950
lambda_1: -11.1225, lambda_2: 12.1820, lambda_3: 0.0000
train remain: [0.98 0.99 0.99 0.99 1. 0.88 0.92 0.9 0.72]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.85, 0.88, 0.86, 0.72]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.85, 0.75, 0.64, 0.46]
[token-bin masks: layers 3-7 keep 100/100 bins]
layer 8: 1010111111111101111111111111111111111111111011111011111111111111101101111111011111110011011011101100
layer 9: 1111111111111011111111111111111111111110111111111111110101100111111111110111111111101101111111011100
layer 10: 1111111101111111011111111111111011111111111111111111101011111111111001110110111111111111101110111000
layer 11: 0000111111111111111111111111110111111111111111111100111101101111010101011011010111110100010000001010
loss: 0.012502, lagrangian_loss: 0.050715, attention_score_distillation_loss: 0.074825
loss: 0.007984, lagrangian_loss: 0.016869, attention_score_distillation_loss: 0.074227
----------------------------------------------------------------------
time: 2023-07-19 14:58:00
Evaluating: accuracy: 0.6534, eval_loss: 2.4658, token_prune_loc: [False, False, False, False, False, True, True, True, True], macs_sparsity: 0.1477, expected_sparsity: 0.1427, expected_sequence_sparsity: 0.7798, target_sparsity: 0.151, step: 1000
lambda_1: -11.1668, lambda_2: 12.1867, lambda_3: 0.0000
train remain: [0.98 0.99 0.99 0.99 1. 0.87 0.91 0.88 0.71]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.84, 0.87, 0.85, 0.71]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.84, 0.73, 0.62, 0.44]
[token-bin masks: layers 3-7 keep 100/100 bins]
layer 8: 1010111111111101111111111111111111111111111011111011111111111111101101111101011111110011011011101100
layer 9: 1111111111111011111111111111111111111110111111111111110101100111111111110111111111101101110111011100
layer 10: 1111111101111111011111111111111011101111111111111111101011111111111001110110111111111111101110111000
layer 11: 0000111111111111111111111111110111111111111111111100111101101111010101011011010111110100010000000010
loss: 0.275271, lagrangian_loss: 0.002323, attention_score_distillation_loss: 0.073470
ETA: 0:44:13 | Epoch 12 finished. Took 40.56 seconds.
loss: 0.464858, lagrangian_loss: -0.017623, attention_score_distillation_loss: 0.072871
----------------------------------------------------------------------
time: 2023-07-19 14:58:26
Evaluating: accuracy: 0.6679, eval_loss: 2.22, token_prune_loc: [False, False, False, False, False, True, True, True, True], macs_sparsity: 0.1511, expected_sparsity: 0.1468, expected_sequence_sparsity: 0.7808, target_sparsity: 0.1585, step: 1050
lambda_1: -11.1517, lambda_2: 12.1875, lambda_3: 0.0000
train remain: [0.98 0.99 0.99 0.99 1. 0.86 0.89 0.88 0.7]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.84, 0.86, 0.84, 0.7]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.84, 0.72, 0.61, 0.42]
[token-bin masks: layers 3-7 keep 100/100 bins]
layer 8: 1010111111111101111111111111111111111111111011111011111111111111101100111111011111110011011011101100
layer 9: 1011111111111011111111111111111111111110111111111111110101100111111111110111111111101101110111011100
layer 10: 1111111101111111011111111101111011101111111111111111101011111111111001110110111111111111101110111000
layer 11: 0000111111111111111111111111110111101111111111111100111101101111010101011011010111110100010000000010
loss: 0.279965, lagrangian_loss: -0.024106, attention_score_distillation_loss: 0.072124
loss: 0.203552, lagrangian_loss: -0.046333, attention_score_distillation_loss: 0.071598
ETA: 0:43:30 | Epoch 13 finished. Took 38.9 seconds.
----------------------------------------------------------------------
time: 2023-07-19 14:58:51
Evaluating: accuracy: 0.6606, eval_loss: 2.2472, token_prune_loc: [False, False, False, False, False, True, True, True, True], macs_sparsity: 0.1559, expected_sparsity: 0.1535, expected_sequence_sparsity: 0.7826, target_sparsity: 0.1661, step: 1100
lambda_1: -11.0900, lambda_2: 12.1935, lambda_3: 0.0000
train remain: [0.97 0.99 0.99 0.98 1. 0.85 0.88 0.86 0.7]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.83, 0.85, 0.83, 0.69]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.83, 0.71, 0.59, 0.4]
[token-bin masks: layers 3-7 keep 100/100 bins]
layer 8: 1010111111111101111111111111111111111111111011111011111111111111101100111101011111110011011011101100
layer 9: 1011111111111011111111111111111111111110111111111111110101100111111111110111111111101001110111011100
layer 10: 1111111101111111011111111101111011101111111111111111101011111111111001110110111111111111101110101000
layer 11: 0000111111111111111111111110110111101111111111111000111101101111010101011011010111110100010000000010
loss: 0.163342, lagrangian_loss: -0.059517, attention_score_distillation_loss: 0.071035
loss: 0.452755, lagrangian_loss: -0.076871, attention_score_distillation_loss: 0.070374
----------------------------------------------------------------------
time: 2023-07-19 14:59:17
Evaluating: accuracy: 0.6462, eval_loss: 2.4025, token_prune_loc: [False, False, False, False, False, True, True, True, True], macs_sparsity: 0.1572, expected_sparsity: 0.1554, expected_sequence_sparsity: 0.7831, target_sparsity: 0.1737, step: 1150
lambda_1: -10.9755, lambda_2: 12.2126, lambda_3: 0.0000
train remain: [0.97 0.99 0.99 0.98 1. 0.84 0.87 0.85 0.69]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.83, 0.84, 0.83, 0.69]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.83, 0.7, 0.58, 0.4]
[token-bin masks: layers 3-7 keep 100/100 bins]
layer 8: 1010111111111101111111111111111111111111111011111011111111111111101100111101011111110011011011101100
layer 9: 1011111111111011111111111111111111111110111111111111110101100111111111110111011111101001110111011100
layer 10: 1111111101111111011111111101111011101111111111111111101011111111111001110110111111111111101110101000
layer 11: 0000111111111111111111111110110111101111111111111000111101101111010101011011010111110100010000000010
loss: 0.230774, lagrangian_loss: -0.083824, attention_score_distillation_loss: 0.069834
ETA: 0:42:57 | Epoch 14 finished. Took 40.97 seconds.
loss: 0.463707, lagrangian_loss: -0.088518, attention_score_distillation_loss: 0.069067
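With do_distill=True, the 'loss' column is presumably a mixture of teacher distillation and task cross-entropy using the logged coefficients (distill_loss_alpha=0.9, distill_ce_loss_alpha=0.01, distill_temp=2.0). A generic sketch of such a mixture; the codebase's exact weighting may differ.

import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels,
                 alpha=0.9, ce_alpha=0.01, temp=2.0):
    """Temperature-scaled KD term plus a small hard-label cross-entropy term."""
    kd = F.kl_div(
        F.log_softmax(student_logits / temp, dim=-1),
        F.softmax(teacher_logits / temp, dim=-1),
        reduction="batchmean",
    ) * temp ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + ce_alpha * ce

loss = distill_loss(torch.randn(4, 2), torch.randn(4, 2), torch.randint(0, 2, (4,)))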
----------------------------------------------------------------------
time: 2023-07-19 14:59:43
Evaluating: accuracy: 0.6534, eval_loss: 2.2773, token_prune_loc: [False, False, False, False, False, True, True, True, True], macs_sparsity: 0.1667, expected_sparsity: 0.162, expected_sequence_sparsity: 0.7847, target_sparsity: 0.1812, step: 1200
lambda_1: -10.8239, lambda_2: 12.2443, lambda_3: 0.0000
train remain: [0.97 0.99 0.99 0.98 1. 0.84 0.86 0.85 0.68]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.82, 0.83, 0.82, 0.68]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.82, 0.68, 0.56, 0.38]
[token-bin masks: layers 3-7 keep 100/100 bins]
layer 8: 1010111111111101111111111111111111111111101011111011111111111111101100111101011111110011011011101100
layer 9: 1011111111111011111111111111111111111110111111111111110101100111111110110111011111101001110111011100
layer 10: 1111111101111111011111111101111011101011111111111111101011111111111001110110111111111111101110101000
layer 11: 0000111111111111111111111110110111101111111111111000111101101111010101011011010111110100010000000000
loss: 0.358757, lagrangian_loss: -0.096495, attention_score_distillation_loss: 0.068505
loss: 0.433719, lagrangian_loss: -0.119376, attention_score_distillation_loss: 0.067795
ETA: 0:42:13 | Epoch 15 finished. Took 38.59 seconds.
----------------------------------------------------------------------
time: 2023-07-19 15:00:08
Evaluating: accuracy: 0.6823, eval_loss: 2.2446, token_prune_loc: [False, False, False, False, False, True, True, True, True], macs_sparsity: 0.168, expected_sparsity: 0.1646, expected_sequence_sparsity: 0.7854, target_sparsity: 0.1888, step: 1250
lambda_1: -10.6231, lambda_2: 12.2994, lambda_3: 0.0000
train remain: [0.96 0.99 0.99 0.97 1. 0.83 0.85 0.85 0.68]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.82, 0.82, 0.82, 0.67]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.82, 0.67, 0.55, 0.37]
[token-bin masks: layers 3-7 keep 100/100 bins]
layer 8: 1010111111111101111111111111111111111111101011111011111111111111101100111101011111110011011011101100
layer 9: 1011111111111011111111111111111111111110111111111111110101100111111110110111011111101000110111011100
layer 10: 1111111101111111011111111101111011101111111111111111100011111111111001110110111111111111101110101000
layer 11: 0000111111111111111111111111110111101111111111111000111100101111010101011011010111110100010000000000
loss: 0.315280, lagrangian_loss: -0.130742, attention_score_distillation_loss: 0.067091
loss: 0.726314, lagrangian_loss: -0.143399, attention_score_distillation_loss: 0.066571
----------------------------------------------------------------------
time: 2023-07-19 15:00:34
Evaluating: accuracy: 0.6968, eval_loss: 1.8997, token_prune_loc: [False, False, False, False, False, True, True, True, True], macs_sparsity: 0.1715, expected_sparsity: 0.1684, expected_sequence_sparsity: 0.7864, target_sparsity: 0.1964, step: 1300
lambda_1: -10.3442, lambda_2: 12.4021, lambda_3: 0.0000
train remain: [0.96 0.99 0.99 0.96 1. 0.83 0.84 0.84 0.67]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.81, 0.82, 0.81, 0.67]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.81, 0.66, 0.54, 0.36]
[token-bin masks: layers 3-7 keep 100/100 bins]
layer 8: 1010111111111101111111111111111111111111101011111011111111111111001100111101011111110011011011101100
layer 9: 1011111111111011111111111111111111111110111111111111110101100111111110110111011111101000110111011100
layer 10: 1111111101111111011111111101111011101111111111111111100011111111111001110110111101111111101110101000
layer 11: 0000111111111111111111111111110111101111111111111000111100101111010101011011010111110100010000000000
loss: 0.004618, lagrangian_loss: -0.156081, attention_score_distillation_loss: 0.065823
loss: 0.316054, lagrangian_loss: -0.168408, attention_score_distillation_loss: 0.065266
ETA: 0:41:37 | Epoch 16 finished. Took 40.61 seconds.
----------------------------------------------------------------------
time: 2023-07-19 15:00:59
Evaluating: accuracy: 0.657, eval_loss: 2.5078, token_prune_loc: [False, False, False, False, False, True, True, True, True], macs_sparsity: 0.1728, expected_sparsity: 0.1702, expected_sequence_sparsity: 0.7869, target_sparsity: 0.2039, step: 1350
lambda_1: -10.0052, lambda_2: 12.5523, lambda_3: 0.0000
train remain: [0.96 0.98 0.98 0.95 1. 0.83 0.83 0.84 0.67]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.81, 0.81, 0.81, 0.67]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.81, 0.66, 0.53, 0.36]
[token-bin masks: layers 3-7 keep 100/100 bins]
layer 8: 1010111111111101111011111111111111111111101011111011111111111111101100111101011111110011011011101100
layer 9: 1011111111111011111111111111111111111110111111111111110101100111111110110111011111101000100111011100
layer 10: 1111111101111111011111111101111011101011111111111111100011111111111001110110111111111111101110101000
layer 11: 0000111111111111111111111111110111101111111111111000111100101111010101011011010111110100010000000000
loss: 0.009137, lagrangian_loss: -0.162867, attention_score_distillation_loss: 0.064569
loss: 0.007877, lagrangian_loss: -0.177110, attention_score_distillation_loss: 0.063956
----------------------------------------------------------------------
time: 2023-07-19 15:01:24
Evaluating: accuracy: 0.639, eval_loss: 2.5564, token_prune_loc: [False, False, False, False, False, True, True, True, True], macs_sparsity: 0.1754, expected_sparsity: 0.1727, expected_sequence_sparsity: 0.7875, target_sparsity: 0.2115, step: 1400
lambda_1: -9.6294, lambda_2: 12.7362, lambda_3: 0.0000
train remain: [0.96 0.98 0.98 0.94 1. 0.82 0.82 0.84 0.67]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.81, 0.8, 0.81, 0.66]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.81, 0.65, 0.52, 0.35]
[token-bin masks: layers 3-7 keep 100/100 bins]
layer 8: 1010111111111101111111111111111111111111101011111011111111111111001100111101011111110011011011101100
layer 9: 1011111111111011111111111111111111111110111111111111110001100111111110110111011111101000100111011100
layer 10: 1111111101111111011111111101111011101011111111111111100011111111111001110110111111111111101110101000
layer 11: 0000111111111111111111111110110111101111111111111000111100101111010101011011010111110100010000000000
loss: 1.071051, lagrangian_loss: -0.177870, attention_score_distillation_loss: 0.063342
ETA: 0:40:59 | Epoch 17 finished. Took 39.99 seconds.
loss: 0.046190, lagrangian_loss: -0.192143, attention_score_distillation_loss: 0.062673
----------------------------------------------------------------------
time: 2023-07-19 15:01:49
Evaluating: accuracy: 0.6968, eval_loss: 2.1858, token_prune_loc: [False, False, False, True, False, True, True, True, True], macs_sparsity: 0.2352, expected_sparsity: 0.2299, expected_sequence_sparsity: 0.8023, target_sparsity: 0.2191, step: 1450
lambda_1: -9.1886, lambda_2: 12.9919, lambda_3: 0.0000
train remain: [0.95 0.98 0.98 0.92 0.99 0.82 0.82 0.83 0.66]
infer remain: [1.0, 1.0, 1.0, 0.85, 1.0, 0.81, 0.8, 0.8, 0.66]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.85, 0.85, 0.69, 0.55, 0.44, 0.29]
[token-bin masks: layers 3-5 and 7 keep 100/100 bins]
layer 6: 1111111111111111001111111111111111011101111101111101101011111111111110111111110111111111011110111000
layer 8: 1010111111111101111111111111111111111111101011111011111111111111001100111101011111110011011011101100
layer 9: 1011111111111011111111111111111111111110111111111111110001100111111110110111011111101000100111011100
layer 10: 1111111101111111011111111101111011101011111011111111100011111111111001110110111111111111101110101000
layer 11: 0000111111111111111111111110110111101111111111111000111100101111010101011011010111110100010000000000
loss: 0.035929, lagrangian_loss: -0.209627, attention_score_distillation_loss: 0.062113
loss: 0.551003, lagrangian_loss: -0.196629, attention_score_distillation_loss: 0.061320
ETA: 0:40:17 | Epoch 18 finished. Took 38.93 seconds.
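'infer remain' always sits slightly below 'train remain' (e.g. 0.85 vs 0.92 for layer 6 at step 1450) because evaluation replaces the stochastic gates with a deterministic mask. A sketch of the usual hard-concrete test-time rule; the stretch limits and the keep-if-positive threshold are assumptions carried over from the training-time sketch earlier.

import torch

def deterministic_mask(loga: torch.Tensor, limit_l: float = -0.1, limit_r: float = 1.1) -> torch.Tensor:
    """Test-time gate: squash log-alpha, stretch, clip, keep strictly positive bins."""
    z = torch.sigmoid(loga) * (limit_r - limit_l) + limit_l
    return (z.clamp(0.0, 1.0) > 0.0).float()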
----------------------------------------------------------------------
time: 2023-07-19 15:02:15
Evaluating: accuracy: 0.6606, eval_loss: 2.2157, token_prune_loc: [False, False, False, True, False, True, True, True, True], macs_sparsity: 0.2387, expected_sparsity: 0.2335, expected_sequence_sparsity: 0.8033, target_sparsity: 0.2266, step: 1500
lambda_1: -8.6754, lambda_2: 13.3431, lambda_3: 0.0000
train remain: [0.95 0.98 0.98 0.91 0.99 0.82 0.81 0.83 0.66]
infer remain: [1.0, 1.0, 1.0, 0.85, 1.0, 0.8, 0.79, 0.8, 0.66]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.85, 0.85, 0.68, 0.54, 0.43, 0.28]
[token-bin masks: layers 3-5 and 7 keep 100/100 bins]
layer 6: 1111111111111111001111111111111111011101111101111101101011111111111110111111110111111111011110111000
layer 8: 1010111111111101111011111111111111111111101011111011111111111111001100111101011111110011011011101100
layer 9: 1011111111111011111111111111111111111110111111111111110001100111111110110111011111100000100111011100
layer 10: 1111111101111111011111111101111011101011111011111111100011111111111001110110111111111111101110101000
layer 11: 0000111111111111111111111110110111101111111111111000111100101111010101011011010111110100010000000000
loss: 0.022867, lagrangian_loss: -0.213551, attention_score_distillation_loss: 0.060804
loss: 0.218195, lagrangian_loss: -0.210707, attention_score_distillation_loss: 0.060107
----------------------------------------------------------------------
time: 2023-07-19 15:02:41
Evaluating: accuracy: 0.657, eval_loss: 2.3756, token_prune_loc: [False, False, False, True, False, True, True, True, True], macs_sparsity: 0.2439, expected_sparsity: 0.2372, expected_sequence_sparsity: 0.8042, target_sparsity: 0.2342, step: 1550
lambda_1: -8.1008, lambda_2: 13.7943, lambda_3: 0.0000
train remain: [0.95 0.97 0.98 0.9 0.99 0.82 0.8 0.83 0.66]
infer remain: [1.0, 1.0, 1.0, 0.84, 1.0, 0.8, 0.79, 0.8, 0.66]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.84, 0.84, 0.67, 0.53, 0.42, 0.28]
[token-bin masks: layers 3-5 and 7 keep 100/100 bins]
layer 6: 1111111111111111001111111111111111011101111101111001101011111111111110111111110111111111011110111000
layer 8: 1010111111111101111011111111111111111111101011111011111111111111001100111101011111110011011011101100
layer 9: 1011111111111011111111111111111111111110111111111111110001100111111110110111011111100000100111011100
layer 10: 1111111101111111011111111101111011101011111011111111100011111111111001110110111111111111101110101000
layer 11: 0000111111111111111111111110110111101111111111111000111100101111010101011011010111110100010000000000
loss: 0.003218, lagrangian_loss: -0.223738, attention_score_distillation_loss: 0.058921
ETA: 0:39:40 | Epoch 19 finished. Took 40.6 seconds.
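The ETA lines are straightforward wall-clock extrapolation: 80 epochs are configured and epochs take roughly 39-41 s, so after epoch 19 about (80 - 20) * 39.7 s remains, matching the logged 0:39:40. A sketch:

from datetime import timedelta

def eta(epochs_done: int, total_epochs: int, avg_epoch_seconds: float) -> timedelta:
    """Remaining wall-clock time from the average per-epoch duration."""
    return timedelta(seconds=round((total_epochs - epochs_done) * avg_epoch_seconds))

print(eta(20, 80, 39.7))  # ~0:39:42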
loss: 0.273641, lagrangian_loss: -0.231929, attention_score_distillation_loss: 0.058897 ---------------------------------------------------------------------- time: 2023-07-19 15:03:06 Evaluating: accuracy: 0.6534, eval_loss: 2.3509, token_prune_loc: [False, False, False, True, False, True, True, True, True], macs_sparsity: 0.2478, expected_sparsity: 0.2429, expected_sequence_sparsity: 0.8057, target_sparsity: 0.2417, step: 1600 lambda_1: -7.4613, lambda_2: 14.3699 lambda_3: 0.0000 train remain: [0.94 0.97 0.97 0.88 0.99 0.81 0.8 0.82 0.66] infer remain: [1.0, 1.0, 1.0, 0.83, 1.0, 0.8, 0.78, 0.8, 0.65] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.83, 0.83, 0.66, 0.52, 0.41, 0.27] 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1111111111111111001111111111111111011101111101111001101011111111111110111111010111111111011110111000 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1010111111111101111011111111111111111111101011111011111111111111001100111101011111110011011011101100 1011111111111011111111111111111111111110111111111111110001100111111110110111011011100000100111011100 1111111101111111011111111101111011101011111011111111100011111111111001110110111111111111101110101000 0000111111111111111111111110110111101111111111111000111100101111010101011011010111010100010000000000 loss: 0.133612, lagrangian_loss: -0.224616, attention_score_distillation_loss: 0.058146 loss: 0.004877, lagrangian_loss: -0.233436, attention_score_distillation_loss: 0.057620 ETA: 0:38:58 | Epoch 20 finished. Took 38.89 seconds. 
---------------------------------------------------------------------- time: 2023-07-19 15:03:31 Evaluating: accuracy: 0.657, eval_loss: 2.3672, token_prune_loc: [False, False, False, True, False, True, True, True, True], macs_sparsity: 0.2478, expected_sparsity: 0.2438, expected_sequence_sparsity: 0.8059, target_sparsity: 0.2493, step: 1650 lambda_1: -6.7702, lambda_2: 15.0619 lambda_3: 0.0000 train remain: [0.94 0.97 0.97 0.87 0.99 0.81 0.8 0.82 0.66] infer remain: [1.0, 1.0, 1.0, 0.83, 1.0, 0.8, 0.78, 0.79, 0.65] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.83, 0.83, 0.66, 0.52, 0.41, 0.27] 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1111111111111111001111111111111111011101111101111101101011111111111110111111010111111110011110111000 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1010111111111101111011111111111111111111101011111011111111111111001100111101011111110011011011101100 1011111111111011111111111111111111111110111111111111110001100111111110110111011011100000100111011100 1111111101111111011111111101111011101011111011111111100011111111111001110110111101111111101110101000 0000111111111111111111111110110111101111111111111000111100101111010101011011010111010100010000000000 loss: 0.908298, lagrangian_loss: -0.215904, attention_score_distillation_loss: 0.056991 loss: 0.573679, lagrangian_loss: -0.190294, attention_score_distillation_loss: 0.056323 ---------------------------------------------------------------------- time: 2023-07-19 15:03:56 Evaluating: accuracy: 0.6679, eval_loss: 2.2519, token_prune_loc: [False, False, False, True, False, True, True, True, True], macs_sparsity: 0.2526, expected_sparsity: 0.2475, expected_sequence_sparsity: 0.8069, target_sparsity: 0.2569, step: 1700 lambda_1: -6.1248, lambda_2: 15.6818 lambda_3: 0.0000 train remain: [0.94 0.97 0.97 0.87 0.99 0.81 0.8 0.82 0.65] infer remain: [1.0, 1.0, 1.0, 0.82, 1.0, 0.8, 0.78, 0.79, 0.65] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.82, 0.82, 0.66, 0.51, 0.4, 0.26] 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1111111111111111001111111111111111011101111101111101101011111111111110111111010101111110011110111000 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1010111111111101111011111111111111111111101011111011111111111111001100111101011111110011011011101100 1011111111111011111111111111111111111110111111111111110001100111111110110111011011100000100111011100 1111111101111111011111111101111011101011111011111111100011111111111001110110111101111111101110101000 0000111111111111111111111110110111101111111111111000111100101111010101011011010111010100010000000000 loss: 0.405243, lagrangian_loss: -0.167420, attention_score_distillation_loss: 0.055674 ETA: 0:38:20 | Epoch 21 finished. Took 40.24 seconds. 
loss: 0.009080, lagrangian_loss: -0.138851, attention_score_distillation_loss: 0.055099 ---------------------------------------------------------------------- time: 2023-07-19 15:04:22 Evaluating: accuracy: 0.6968, eval_loss: 2.0523, token_prune_loc: [False, False, False, True, False, True, True, True, True], macs_sparsity: 0.2526, expected_sparsity: 0.2475, expected_sequence_sparsity: 0.8069, target_sparsity: 0.2644, step: 1750 lambda_1: -5.5905, lambda_2: 16.1181 lambda_3: 0.0000 train remain: [0.94 0.97 0.97 0.86 0.99 0.81 0.8 0.81 0.65] infer remain: [1.0, 1.0, 1.0, 0.82, 1.0, 0.8, 0.78, 0.79, 0.65] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.82, 0.82, 0.66, 0.51, 0.4, 0.26] 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1111111111111111001111111111111111011101111101111101101011111111111110111111010101111110011110111000 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1010111111111101111011111111111111111111101011111011111111111111001100111101011111110011011011101100 1011111111111011111111111111011111111110111111111111110101100111111110110111011011100000100111011100 1111111101111111011111111101111011101011111011111111100011111111111001110110111101111111101110101000 0000111111111111111111111110110111101111111111111000111100101111010101011011010111010100010000000000 loss: 0.010331, lagrangian_loss: -0.117505, attention_score_distillation_loss: 0.054452 loss: 0.416026, lagrangian_loss: -0.111376, attention_score_distillation_loss: 0.053778 ETA: 0:37:38 | Epoch 22 finished. Took 38.68 seconds. 
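The masks themselves come from per-bin gates. The standard machinery for this kind of token-level L0 pruning is the hard-concrete distribution (Louizos et al., 2018); the sketch below assumes that form, with the commonly used stretch limits (-0.1, 1.1) and temperature 2/3, in which case the printed rows would be the binarized deterministic eval-time gates:

    import torch

    def sample_gate(loga, temperature=2/3, limit_l=-0.1, limit_r=1.1):
        # training: stretched, rectified concrete sample per bin
        u = torch.rand_like(loga)
        s = torch.sigmoid((torch.log(u) - torch.log(1 - u) + loga) / temperature)
        return (s * (limit_r - limit_l) + limit_l).clamp(0.0, 1.0)

    def eval_gate(loga, limit_l=-0.1, limit_r=1.1):
        # evaluation: deterministic gate, thresholded for the printed mask
        s = torch.sigmoid(loga) * (limit_r - limit_l) + limit_l
        return (s.clamp(0.0, 1.0) > 0.5).float()

    loga = torch.zeros(9, 100)      # one logit per (prune location, token bin)
    print(eval_gate(loga).shape)    # torch.Size([9, 100]), matching the 9 rows above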
---------------------------------------------------------------------- time: 2023-07-19 15:04:48 Evaluating: accuracy: 0.6643, eval_loss: 2.3321, token_prune_loc: [False, False, False, True, False, True, True, True, True], macs_sparsity: 0.2539, expected_sparsity: 0.2495, expected_sequence_sparsity: 0.8074, target_sparsity: 0.272, step: 1800 lambda_1: -5.1439, lambda_2: 16.4235 lambda_3: 0.0000 train remain: [0.94 0.97 0.96 0.85 0.99 0.81 0.79 0.81 0.65] infer remain: [1.0, 1.0, 1.0, 0.82, 1.0, 0.79, 0.78, 0.79, 0.65] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.82, 0.82, 0.65, 0.51, 0.4, 0.26] 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1111111111111111001111111111111111011101111101111101101011111111111110111111010101111110011110111000 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1010111111111101111011111111111111111111101011111011111111111111001100011101011111110011011011101100 1011111111111011111111111111011111111110111111111111110101100111111110110111011011100000100111011100 1111111101111111011111111101111011101011111011111111100011111111111001110110111101111111101110101000 0000111111111111111111111110110111101111111111111000111100101111010101011011010111010100010000000000 loss: 0.003822, lagrangian_loss: -0.098229, attention_score_distillation_loss: 0.053143 loss: 0.405344, lagrangian_loss: -0.080831, attention_score_distillation_loss: 0.052330 ---------------------------------------------------------------------- time: 2023-07-19 15:05:13 Evaluating: accuracy: 0.6498, eval_loss: 2.3126, token_prune_loc: [False, False, False, True, False, True, True, True, True], macs_sparsity: 0.2565, expected_sparsity: 0.2531, expected_sequence_sparsity: 0.8083, target_sparsity: 0.2796, step: 1850 lambda_1: -4.7614, lambda_2: 16.6499 lambda_3: 0.0000 train remain: [0.95 0.96 0.96 0.85 0.99 0.81 0.79 0.81 0.65] infer remain: [1.0, 1.0, 1.0, 0.81, 1.0, 0.79, 0.78, 0.79, 0.65] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.81, 0.81, 0.64, 0.5, 0.39, 0.26] 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1111111111111111001111111111111111011101111101111001101011111111111110111111010101111110011110111000 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1010111111111101111011111111111111111111101011111011111111111111001100011101011111110011011011101100 1011111111111011111111111111011111111110111111111111110101100111111110110111011011100000100111011100 1111111101111111011111111101111011101011111011111111100011111111111001110110111101111111101110101000 0000111111111111111111111110110111101111111111111000111100101111010101011011010111010100010000000000 loss: 0.005114, lagrangian_loss: -0.065466, attention_score_distillation_loss: 0.051820 ETA: 0:37:00 | Epoch 23 finished. Took 40.23 seconds. 
loss: 0.321013, lagrangian_loss: -0.063140, attention_score_distillation_loss: 0.051282 ---------------------------------------------------------------------- time: 2023-07-19 15:05:38 Evaluating: accuracy: 0.6787, eval_loss: 2.1906, token_prune_loc: [False, False, False, True, False, True, True, True, True], macs_sparsity: 0.2599, expected_sparsity: 0.2545, expected_sequence_sparsity: 0.8087, target_sparsity: 0.2871, step: 1900 lambda_1: -4.4577, lambda_2: 16.7908 lambda_3: 0.0000 train remain: [0.94 0.96 0.96 0.84 0.99 0.81 0.79 0.81 0.65] infer remain: [1.0, 1.0, 1.0, 0.81, 1.0, 0.79, 0.77, 0.79, 0.65] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.81, 0.81, 0.64, 0.49, 0.39, 0.25] 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1111111111111111001111111111111111011101111101111001101011111111111110111111010101111110011110111000 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1010111111111101111011111111111111111111101011111011111111111111001100011101011111110011011011101100 1011111111111011111111111111011111111110111111111111110001100111111110110111011011100000100111011100 1111111101111111011111111101111011101011111011111111100011111111111001110110111101111111101110101000 0000111111111111111111111110110111101111111111111000111100101111010101011011010111010100010000000000 loss: 0.244342, lagrangian_loss: -0.056144, attention_score_distillation_loss: 0.050648 loss: 0.084355, lagrangian_loss: -0.048615, attention_score_distillation_loss: 0.050072 ---------------------------------------------------------------------- time: 2023-07-19 15:06:04 Evaluating: accuracy: 0.6787, eval_loss: 2.2249, token_prune_loc: [False, False, False, True, False, True, True, True, True], macs_sparsity: 0.2599, expected_sparsity: 0.2545, expected_sequence_sparsity: 0.8087, target_sparsity: 0.2947, step: 1950 lambda_1: -4.2078, lambda_2: 16.8851 lambda_3: 0.0000 train remain: [0.94 0.96 0.96 0.84 0.99 0.81 0.79 0.81 0.65] infer remain: [1.0, 1.0, 1.0, 0.81, 1.0, 0.79, 0.77, 0.79, 0.65] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.81, 0.81, 0.64, 0.49, 0.39, 0.25] 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1111111111111111001111111111111111011101111101111001101011111111111110111111010101111110011110111000 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1010111111111101111011111111111111111111101011111011111111111111001100011101011111110011011011101100 1011111111111011111111111111011111111110111111111111110001100111111110110111011011100000100111011100 1111111101111111011111111101111011101011111011111111100011111111111001110110111101111111101110101000 0000111111111111111111111110110111101111111111111000111100101111010101011011010111010100010000000000 ETA: 0:36:22 | Epoch 24 finished. Took 40.36 seconds. 
loss: 0.003573, lagrangian_loss: -0.043804, attention_score_distillation_loss: 0.049396 loss: 0.402710, lagrangian_loss: -0.028737, attention_score_distillation_loss: 0.048592 ---------------------------------------------------------------------- time: 2023-07-19 15:06:29 Evaluating: accuracy: 0.6462, eval_loss: 2.4393, token_prune_loc: [False, False, False, True, False, True, True, True, True], macs_sparsity: 0.2599, expected_sparsity: 0.2545, expected_sequence_sparsity: 0.8087, target_sparsity: 0.3023, step: 2000 lambda_1: -4.0187, lambda_2: 16.9394 lambda_3: 0.0000 train remain: [0.94 0.96 0.95 0.84 0.99 0.81 0.79 0.81 0.65] infer remain: [1.0, 1.0, 1.0, 0.81, 1.0, 0.79, 0.77, 0.79, 0.65] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.81, 0.81, 0.64, 0.49, 0.39, 0.25] 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1111111111111111001111111111111111011101111101111001101011111111111110111111010101111110011110111000 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1010111111111101111011111111111111111111101011111011111111111111001100011101011111110011011011101100 1011111111111011111111111111011111111110111111111111110101100011111110110111011011100000100111011100 1111111101111111011111111101111011101011111011111111100011111111111001110110111101111111101110101000 0000111111111111111111111110110111101111111111111000111100101111010101011011010111010100010000000000 loss: 0.002686, lagrangian_loss: -0.025169, attention_score_distillation_loss: 0.048074 loss: 0.416361, lagrangian_loss: -0.021635, attention_score_distillation_loss: 0.047521 ETA: 0:35:41 | Epoch 25 finished. Took 39.16 seconds. 
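target_sparsity climbs by roughly 0.0075 every 50 steps, i.e. a linear warmup toward the run's final target of 0.59. A sketch that reproduces the logged schedule, assuming about 78 optimizer steps per epoch (consistent with the ~39-40 s epochs at ~0.5 s per step) and the 50-epoch Lagrangian warmup:

    def target_sparsity(step, final=0.59, warmup_steps=50 * 78):
        # linear ramp from 0 to the final target over the warmup window
        return min(final, final * step / warmup_steps)

    print(round(target_sparsity(2000), 4))  # 0.3026 vs. 0.3023 logged at step 2000

The small residual offset suggests the ramp is anchored slightly differently (for instance, warmup counted from a nonzero starting step), but the slope matches.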
---------------------------------------------------------------------- time: 2023-07-19 15:06:55 Evaluating: accuracy: 0.6787, eval_loss: 2.2932, token_prune_loc: [False, False, False, True, False, True, True, True, True], macs_sparsity: 0.2652, expected_sparsity: 0.2581, expected_sequence_sparsity: 0.8096, target_sparsity: 0.3098, step: 2050 lambda_1: -3.9092, lambda_2: 16.9577 lambda_3: 0.0000 train remain: [0.93 0.95 0.95 0.83 0.99 0.81 0.79 0.81 0.65] infer remain: [1.0, 1.0, 1.0, 0.8, 1.0, 0.79, 0.77, 0.79, 0.65] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.8, 0.63, 0.49, 0.38, 0.25] 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1111111111111111001111111111111111011101111101111001101011111111111110111111010101101110011110111000 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1010111111111101111011111111111111111111101011111011111111111111001100011101011111110011011011101100 1011111111111011111111111111011111111110111111111111110001100111111110110111011011100000100111011100 1111111101111111011111111101111011101011111011111111100011111111111001110110111101111111101110101000 0000111111111111111111111110110111101111111111111000111100101111010101011011010111010100010000000000 loss: 0.014857, lagrangian_loss: -0.010472, attention_score_distillation_loss: 0.046770 loss: 0.250807, lagrangian_loss: -0.007671, attention_score_distillation_loss: 0.046138 ---------------------------------------------------------------------- time: 2023-07-19 15:07:21 Evaluating: accuracy: 0.6282, eval_loss: 2.5998, token_prune_loc: [False, False, False, True, False, True, True, True, True], macs_sparsity: 0.2652, expected_sparsity: 0.2581, expected_sequence_sparsity: 0.8096, target_sparsity: 0.3174, step: 2100 lambda_1: -3.8559, lambda_2: 16.9626 lambda_3: 0.0000 train remain: [0.93 0.95 0.95 0.83 0.99 0.81 0.79 0.81 0.65] infer remain: [1.0, 1.0, 1.0, 0.8, 1.0, 0.79, 0.77, 0.79, 0.65] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.8, 0.63, 0.49, 0.38, 0.25] 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1111111111111111001111111111111111011101111101111001101011111111111110111111010101101110011110111000 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1010111111111101111011111111111111111111101011111011111111111111001100011101011111110011011011101100 1011111111111011111111111111011111111110111111111111110001100111111110110111011011100000100111011100 1111111101111111011111111101111011101011111011111111100011111111111001110110111101111111101110101000 0000111111111111111111111110110111101111111111111000111100101111010101011011010111010100010000000000 loss: 0.008185, lagrangian_loss: 0.002283, attention_score_distillation_loss: 0.045498 ETA: 0:35:03 | Epoch 26 finished. Took 40.73 seconds. 
loss: 0.004606, lagrangian_loss: 0.005938, attention_score_distillation_loss: 0.044921 ---------------------------------------------------------------------- time: 2023-07-19 15:07:47 Evaluating: accuracy: 0.6318, eval_loss: 2.5448, token_prune_loc: [False, False, False, True, False, True, True, True, True], macs_sparsity: 0.2652, expected_sparsity: 0.2586, expected_sequence_sparsity: 0.8097, target_sparsity: 0.325, step: 2150 lambda_1: -3.8766, lambda_2: 16.9644 lambda_3: 0.0000 train remain: [0.93 0.95 0.95 0.83 0.99 0.81 0.78 0.81 0.65] infer remain: [1.0, 1.0, 1.0, 0.8, 1.0, 0.79, 0.77, 0.79, 0.64] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.8, 0.63, 0.49, 0.38, 0.25] 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1111111111111111001111111111111111011101111101111001101011111111111110111111010101101110011110111000 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1010111111111101111011111111111111111111101011111011111111111111001100011101011111110011011011101100 1011111111111011111111111111011111111110111111111111110101100011111110110111011011100000100111011100 1111111101111111011111111101111011101011111011111111100011111111111001110110111101111111101110101000 0000111111111111011111111110110111101111111111111000111100101111010101011011010111010100010000000000 loss: 0.680501, lagrangian_loss: 0.011555, attention_score_distillation_loss: 0.044267 loss: 0.003173, lagrangian_loss: 0.012474, attention_score_distillation_loss: 0.043623 ETA: 0:34:23 | Epoch 27 finished. Took 39.11 seconds. 
---------------------------------------------------------------------- time: 2023-07-19 15:08:12 Evaluating: accuracy: 0.639, eval_loss: 2.5054, token_prune_loc: [False, True, False, True, False, True, True, True, True], macs_sparsity: 0.3219, expected_sparsity: 0.3174, expected_sequence_sparsity: 0.825, target_sparsity: 0.3325, step: 2200 lambda_1: -3.9486, lambda_2: 16.9719 lambda_3: 0.0000 train remain: [0.93 0.94 0.95 0.82 0.99 0.8 0.78 0.8 0.64] infer remain: [1.0, 0.88, 1.0, 0.79, 1.0, 0.79, 0.76, 0.79, 0.64] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.88, 0.88, 0.7, 0.7, 0.55, 0.42, 0.33, 0.21] 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1110111011111111111110111111111111111111110111111111011111111011110111111111111111111101111011011100 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1111111111111111001111111111111111011101111101111001101011111111111110111111010101100110011110111000 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1010111111111101111011111111111111111111101011111011111111111111001100011101011111110011011011101100 1011111111111011111111111111011111111110111111111111110001100011111110110111011011100000100111011100 1111111101111111011111111101111011101011111011111111100011111111111001110110111101111111101110101000 0000111111111111011111111110110111101111111111111000111100101111010101011011010111010100010000000000 loss: 0.461044, lagrangian_loss: 0.016034, attention_score_distillation_loss: 0.043001 loss: 0.302222, lagrangian_loss: 0.027008, attention_score_distillation_loss: 0.042296 ---------------------------------------------------------------------- time: 2023-07-19 15:08:37 Evaluating: accuracy: 0.6282, eval_loss: 2.5437, token_prune_loc: [False, True, False, True, False, True, True, True, True], macs_sparsity: 0.3219, expected_sparsity: 0.3181, expected_sequence_sparsity: 0.8251, target_sparsity: 0.3401, step: 2250 lambda_1: -4.0827, lambda_2: 16.9966 lambda_3: 0.0000 train remain: [0.93 0.94 0.95 0.82 0.99 0.8 0.78 0.8 0.64] infer remain: [1.0, 0.88, 1.0, 0.79, 1.0, 0.79, 0.76, 0.78, 0.64] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.88, 0.88, 0.7, 0.7, 0.55, 0.42, 0.33, 0.21] 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1110111011111111111110111111111111111111110111111111011111111011110111111111111111111101111011011100 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1111111111111111001111111111111111011101111101111001101011111111111110111111010101100110011110111000 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1010111111111101111011111111111111111111101011111011111111111111001100011101011111110011011011101100 1011111111111011111111111111011111111110111111111111110001100011111110110111011011100000100111011100 1111111101111111011111111101111011101011111011111111100011111111111001100110111101111111101110101000 0000111111111111011111111110110111101111111111111000111100101111010101011011010111010100010000000000 loss: 0.002845, lagrangian_loss: 0.033180, attention_score_distillation_loss: 0.041716 ETA: 0:33:44 | Epoch 28 finished. Took 40.15 seconds. 
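The "layerwise remain" vector is consistent with a running product of the per-location "infer remain" ratios, prefixed with 1.0 for the layers before the first prune location, i.e. the fraction of the original tokens still alive entering each of the 12 layers. A sketch checked against the step-2200 record above:

    from itertools import accumulate
    from operator import mul

    # step 2200: keep ratios at prune locations 3..11
    infer_remain = [1.0, 0.88, 1.0, 0.79, 1.0, 0.79, 0.76, 0.79, 0.64]
    layerwise = [1.0] * 3 + [round(x, 2) for x in accumulate(infer_remain, mul)]
    print(layerwise)
    # [1.0, 1.0, 1.0, 1.0, 0.88, 0.88, 0.7, 0.7, 0.55, 0.42, 0.33, 0.21]  (matches the log)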
loss: 0.322900, lagrangian_loss: 0.041308, attention_score_distillation_loss: 0.041073 ---------------------------------------------------------------------- time: 2023-07-19 15:09:02 Evaluating: accuracy: 0.6354, eval_loss: 2.4492, token_prune_loc: [False, True, False, True, False, True, True, True, True], macs_sparsity: 0.3258, expected_sparsity: 0.3212, expected_sequence_sparsity: 0.8259, target_sparsity: 0.3476, step: 2300 lambda_1: -4.3109, lambda_2: 17.0635 lambda_3: 0.0000 train remain: [0.93 0.94 0.95 0.81 0.99 0.8 0.78 0.8 0.64] infer remain: [1.0, 0.88, 1.0, 0.78, 1.0, 0.79, 0.76, 0.78, 0.64] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.88, 0.88, 0.69, 0.69, 0.54, 0.41, 0.32, 0.21] 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1110111011111111111110111111111111111111110111111111011111111011110111111111111111111101111011011100 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1111111111111111001111111111111111011101111101111001101011111111111110111111010101100110011010111000 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1010111111111101111011111111111111111111101011111011111111111111001100011101011111110011011011101100 1011111111111011111111111111011111111110111111111110110001100111111110110111011011100000100111011100 1111111101111111011111111101111011101011111011111111100011111111111001100110111101111111101110101000 0000111111111111011111111110110111101111111111111000111100101111010101011011010111010100010000000000 loss: 0.006759, lagrangian_loss: 0.055445, attention_score_distillation_loss: 0.040399 loss: 0.370006, lagrangian_loss: 0.059518, attention_score_distillation_loss: 0.039905 ETA: 0:33:03 | Epoch 29 finished. Took 39.34 seconds. 
---------------------------------------------------------------------- time: 2023-07-19 15:09:28 Evaluating: accuracy: 0.6065, eval_loss: 2.6959, token_prune_loc: [False, True, False, True, False, True, True, True, True], macs_sparsity: 0.3258, expected_sparsity: 0.3212, expected_sequence_sparsity: 0.8259, target_sparsity: 0.3552, step: 2350 lambda_1: -4.6406, lambda_2: 17.2022 lambda_3: 0.0000 train remain: [0.92 0.94 0.94 0.81 0.99 0.8 0.78 0.8 0.64] infer remain: [1.0, 0.88, 1.0, 0.78, 1.0, 0.79, 0.76, 0.78, 0.64] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.88, 0.88, 0.69, 0.69, 0.54, 0.41, 0.32, 0.21] 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1110111011111111111110111111111111111111110111111111011111111011110111111111111111111101111011011100 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1111111111111111001111111111111111011101111101111001101011111111111110111111010101100110011110110000 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1010111111111101111011111111111111111111101011111011111111111111001100011101011111110011011011101100 1011111111111011111111111111011111111110111111111110110001100111111110110111011011100000100111011100 1111111101111111011111111101111011101011111011111111100011111111111001100110111101111111101110101000 0000111111111111011111111110110111101111111111111000111100101111010101011011010111010100010000000000 loss: 1.083088, lagrangian_loss: 0.075279, attention_score_distillation_loss: 0.039230 loss: 0.421221, lagrangian_loss: 0.087199, attention_score_distillation_loss: 0.038668 ---------------------------------------------------------------------- time: 2023-07-19 15:09:54 Evaluating: accuracy: 0.6065, eval_loss: 2.6834, token_prune_loc: [False, True, False, True, False, True, True, True, True], macs_sparsity: 0.328, expected_sparsity: 0.3229, expected_sequence_sparsity: 0.8264, target_sparsity: 0.3628, step: 2400 lambda_1: -5.0757, lambda_2: 17.4417 lambda_3: 0.0000 train remain: [0.92 0.93 0.94 0.81 0.99 0.8 0.77 0.8 0.64] infer remain: [1.0, 0.88, 1.0, 0.78, 1.0, 0.78, 0.76, 0.78, 0.64] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.88, 0.88, 0.69, 0.69, 0.54, 0.41, 0.32, 0.2] 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1110111011111111111110111111111111111111110111111111011111111011110111111111111111111101111011011100 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1111111111111111001111111111111111011101111101111001101011111111111110111111010101100110011100111000 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1010111111101101111011111111111111111111101011111011111111111111001100011101011111110011011011101100 1011111111111011111111111111011111111110111111111110110001100111111110110111011011100000100111011100 1111111101111111011111111101111011101011111011111111100011111111111001100110111101111111101110101000 0000111111111111011111111110110111101111111111111000111100101111010101011011010111010100010000000000 loss: 0.154039, lagrangian_loss: 0.091723, attention_score_distillation_loss: 0.037923 ETA: 0:32:24 | Epoch 30 finished. Took 39.57 seconds. 
loss: 0.508581, lagrangian_loss: 0.101501, attention_score_distillation_loss: 0.037309 ---------------------------------------------------------------------- time: 2023-07-19 15:10:19 Evaluating: accuracy: 0.5993, eval_loss: 2.736, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.4416, expected_sparsity: 0.4334, expected_sequence_sparsity: 0.8549, target_sparsity: 0.3703, step: 2450 lambda_1: -5.5404, lambda_2: 17.7165 lambda_3: 0.0000 train remain: [0.91 0.93 0.94 0.81 0.99 0.8 0.77 0.8 0.64] infer remain: [0.84, 0.88, 0.87, 0.78, 1.0, 0.78, 0.75, 0.78, 0.64] layerwise remain: [1.0, 1.0, 1.0, 0.84, 0.74, 0.64, 0.5, 0.5, 0.39, 0.29, 0.23, 0.15] 1111111111111011111111111111101011011011110111100110110110111011111111111101011011111111111111111110 1110111011111111111110111111111111111111110111111111011111111011110111111111111111111101111011011100 1111111111111111111111111110111111111110111111111111110111101011111111101111111011101111101011100110 1111111111111111001111111111111111011101111101111001101010111111111110111111010101101110011100111000 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1010111111111101111011111111111111111111101011111011111011111111001100011101011111110011011011101100 1011111111111011111111111111011111111110111111111110110001100111111110100111011011100000100111011100 1111111101111111011111111101111011101011111011111111100011111111111001100110111101111111101110101000 0000111111111111011111111110110111101111111111111000111100101111010101011011010111010100010000000000 loss: 0.715809, lagrangian_loss: 0.106383, attention_score_distillation_loss: 0.036697 loss: 0.003022, lagrangian_loss: 0.120150, attention_score_distillation_loss: 0.036098 ETA: 0:31:43 | Epoch 31 finished. Took 39.32 seconds. 
---------------------------------------------------------------------- time: 2023-07-19 15:10:45 Evaluating: accuracy: 0.6426, eval_loss: 2.3733, token_prune_loc: [True, True, False, True, False, True, True, True, True], macs_sparsity: 0.416, expected_sparsity: 0.41, expected_sequence_sparsity: 0.8489, target_sparsity: 0.3779, step: 2500 lambda_1: -6.0449, lambda_2: 18.0453 lambda_3: 0.0000 train remain: [0.91 0.92 0.94 0.81 0.99 0.79 0.77 0.8 0.64] infer remain: [0.83, 0.87, 1.0, 0.77, 1.0, 0.78, 0.75, 0.78, 0.64] layerwise remain: [1.0, 1.0, 1.0, 0.83, 0.72, 0.72, 0.56, 0.56, 0.43, 0.33, 0.25, 0.16] 1111111111111011111111101111101011011011110111100110110110111011111111111101011011111111111111111110 1110111011111111111110101111111111111111110111111111011111111011110111111111111111111101111011011100 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1111111111111111001111111111111111011101111101111001101010111111111110111111010101101110011000111000 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1010111111111101111011111111111111111111101011111011111011111111001100011101011111110011011011101100 1011111111111011111111111111011111111110111111111110110001100111111110100111011011100000100111011100 1111111101111111011111111101111011101011111011111111100011111111111001100110111101111111101110101000 0000111111111111011111111110110111101111111111111000111100101111010101011011010111010100010000000000 loss: 0.011106, lagrangian_loss: 0.126886, attention_score_distillation_loss: 0.035442 loss: 0.010029, lagrangian_loss: 0.135964, attention_score_distillation_loss: 0.034778 ---------------------------------------------------------------------- time: 2023-07-19 15:11:10 Evaluating: accuracy: 0.6173, eval_loss: 2.5042, token_prune_loc: [True, True, False, True, False, True, True, True, True], macs_sparsity: 0.416, expected_sparsity: 0.41, expected_sequence_sparsity: 0.8489, target_sparsity: 0.3855, step: 2550 lambda_1: -6.5762, lambda_2: 18.4166 lambda_3: 0.0000 train remain: [0.9 0.92 0.94 0.8 0.99 0.79 0.76 0.8 0.64] infer remain: [0.83, 0.87, 1.0, 0.77, 1.0, 0.78, 0.75, 0.78, 0.64] layerwise remain: [1.0, 1.0, 1.0, 0.83, 0.72, 0.72, 0.56, 0.56, 0.43, 0.33, 0.25, 0.16] 1111111111111011111111111111101011011011110111100110110110111011111111111101011011111111111111111010 1110111011111111111110101111111111111111110111111111011111111011110111111111111111111101111011011100 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1111111111111111001111111111111111011101111101111001101010111111111110111111010101101110011000111000 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1010111111111101111011111111111111111111101011111011111011111111001100011101011111110011011011101100 1011111111111011111111111111011111111110111111111110110001100111111110100111011011100000100111011100 1111111101111111011111111101111011101011111011111111100011111111111001100110111101111111101110101000 0000011111111111111111111110110111101111111111111000111100101111010101011011010111010100010000000000 loss: 1.012969, lagrangian_loss: 0.137797, attention_score_distillation_loss: 0.034087 ETA: 0:31:05 | Epoch 32 finished. Took 40.25 seconds. 
loss: 0.011051, lagrangian_loss: 0.130916, attention_score_distillation_loss: 0.033475 ---------------------------------------------------------------------- time: 2023-07-19 15:11:36 Evaluating: accuracy: 0.6426, eval_loss: 2.4858, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.4528, expected_sparsity: 0.4453, expected_sequence_sparsity: 0.858, target_sparsity: 0.393, step: 2600 lambda_1: -7.0526, lambda_2: 18.7208 lambda_3: 0.0000 train remain: [0.9 0.91 0.93 0.79 0.99 0.79 0.76 0.79 0.64] infer remain: [0.83, 0.87, 0.87, 0.76, 1.0, 0.78, 0.75, 0.78, 0.64] layerwise remain: [1.0, 1.0, 1.0, 0.83, 0.72, 0.63, 0.48, 0.48, 0.37, 0.28, 0.22, 0.14] 1111111111111011111111111111101011011011110111100110110110111011111111111101011011111111111111111010 1110111011111111111110101111111111111111110111111111011111111011110111111111111111111101111011011100 1111111111111111111111111110111111111110111111011111110111101111111111101111111011101111101011100110 1111111111111111001111111111111111011101111101111001101010111111111110111111010101100110011000111000 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1010111111111101111011111111111111111111101011111011111011111111001100011101011111110011011011101100 1011111111111011111111111111011111111110111111111110110001100111111110100111011011100000100111011100 1111111101111111011111111101111011101011111011111111100011111111111001100110111101111111101110101000 0000011111111111111111111110110111101111111111111000111100101111010101011011010111010100010000000000 loss: 0.407927, lagrangian_loss: 0.115509, attention_score_distillation_loss: 0.032848 loss: 0.010131, lagrangian_loss: 0.108897, attention_score_distillation_loss: 0.032199 ---------------------------------------------------------------------- time: 2023-07-19 15:12:01 Evaluating: accuracy: 0.6101, eval_loss: 2.6151, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.4567, expected_sparsity: 0.4515, expected_sequence_sparsity: 0.8596, target_sparsity: 0.4006, step: 2650 lambda_1: -7.4446, lambda_2: 18.9283 lambda_3: 0.0000 train remain: [0.89 0.91 0.92 0.78 0.99 0.79 0.75 0.79 0.64] infer remain: [0.83, 0.87, 0.86, 0.75, 1.0, 0.78, 0.74, 0.77, 0.63] layerwise remain: [1.0, 1.0, 1.0, 0.83, 0.72, 0.62, 0.47, 0.47, 0.36, 0.27, 0.21, 0.13] 1111111111111011111111111111101011011011110111100110110110111011111111111101011011111111111111111010 1110111011111111111110101111111111111111110111111111011111111011110111111111111111111101111011011100 1110111111111111111111111110111111111110111111011111110111101111111111101111111011101111101011100110 1111111111111111001111111111111111011101111101111001101010111111111110111111010101100110011000110000 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1010111111111101111011111111111111111111101011111011111011111111001100011101011111110011011011101100 1011111111111011111111111111011101111110111111111110110001100111111110100111011011100000100111011100 1111111101111111011111111101111011101011111011111111100011111111111001100110111101101111101110101000 0000011111111111011111111110110111101111111111111000111100101111010101011011010111010100010000000000 loss: 0.003942, lagrangian_loss: 0.101805, attention_score_distillation_loss: 0.031580 ETA: 0:30:27 | Epoch 33 finished. Took 40.92 seconds. 
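The multipliers move the way adversarial training of the Lagrangian predicts: lambda_1 relaxed from -9.19 (step 1450) toward -3.86 (step 2100) while expected sparsity led the target, and has been pushed back down (to -7.44 by step 2650) since the target overtook it, while lambda_2, whose gradient is a square, only ever grows. A sketch of plain gradient ascent on the multipliers (the actual optimizer and learning-rate schedule for the reg parameters are assumptions):

    def update_lambdas(l1, l2, expected, target, reg_lr=0.01):
        # ascent on  l1*gap + l2*gap^2  with gap = expected - target:
        # gap > 0 raises l1, gap < 0 lowers it; l2 never decreases
        gap = expected - target
        return l1 + reg_lr * gap, l2 + reg_lr * gap * gap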
loss: 0.329970, lagrangian_loss: 0.106134, attention_score_distillation_loss: 0.030961 ---------------------------------------------------------------------- time: 2023-07-19 15:12:27 Evaluating: accuracy: 0.6101, eval_loss: 2.5916, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.4645, expected_sparsity: 0.4557, expected_sequence_sparsity: 0.8607, target_sparsity: 0.4082, step: 2700 lambda_1: -7.8081, lambda_2: 19.1057 lambda_3: 0.0000 train remain: [0.89 0.9 0.91 0.77 0.99 0.78 0.75 0.79 0.63] infer remain: [0.83, 0.86, 0.86, 0.75, 1.0, 0.77, 0.74, 0.77, 0.63] layerwise remain: [1.0, 1.0, 1.0, 0.83, 0.71, 0.61, 0.46, 0.46, 0.35, 0.26, 0.2, 0.13] 1111111111111011111111111111101011011011110111100110110110111011111111111101011011111111111111111010 1110111011111111111110101111111111111111110111111111011111111011110111111111111111111101111011001100 1110111111111111111111111110111111111110111111011111110111101111111111101111111011101111101011100110 1111111111111111001111111111111111011101111101111001101010111101111110111111010101100110011000111000 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1010111111101101111011111111111111111111101011111011111011111111001100011101011111110011011011101100 1011111111111011111111111111011111111110111111111110110001100011111110100111011011100000100111011100 1111111101111111011111111101111011101011111011111111100011111111111001100110111101101111101110101000 0000011111111111011111111110110111101111111111111000111100101111010101011011010111010100010000000000 loss: 0.007095, lagrangian_loss: 0.094590, attention_score_distillation_loss: 0.030400 loss: 0.004053, lagrangian_loss: 0.105643, attention_score_distillation_loss: 0.029657 ETA: 0:29:46 | Epoch 34 finished. Took 39.21 seconds. 
---------------------------------------------------------------------- time: 2023-07-19 15:12:52 Evaluating: accuracy: 0.6426, eval_loss: 2.4546, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.4758, expected_sparsity: 0.4642, expected_sequence_sparsity: 0.8629, target_sparsity: 0.4157, step: 2750 lambda_1: -8.1299, lambda_2: 19.2449 lambda_3: 0.0000 train remain: [0.88 0.89 0.91 0.77 0.99 0.78 0.75 0.79 0.63] infer remain: [0.82, 0.86, 0.85, 0.74, 1.0, 0.77, 0.74, 0.77, 0.63] layerwise remain: [1.0, 1.0, 1.0, 0.82, 0.71, 0.6, 0.44, 0.44, 0.34, 0.25, 0.19, 0.12] 1111111111111011111111101111101011011011110111100110110110111011111111111101011011111111111111111010 1110111011111111111110101111111111111111110111111111011111111011110111111111111111111101111011001100 1111111111111111111111111110111111111110111111011011110111101011111111101111111011101111101011100110 1111111111111111001111111111111111011101111101111001101010111101111110111111010101100110011000101000 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1010111111101101111011111111111111111111101011111011111011111111001100011101011111110011011011101100 1011111111111011111111111111011111111110111111111110110001100011111110100111011011100000100111011100 1111111101111111011111111101111011101011111011111111100011111111111001100110111101111111101110100000 0000011111111111011111111110110111101111111111111000111100101111010101011011010111010100010000000000 loss: 0.056042, lagrangian_loss: 0.097039, attention_score_distillation_loss: 0.029012 loss: 0.007306, lagrangian_loss: 0.076601, attention_score_distillation_loss: 0.028449 ---------------------------------------------------------------------- time: 2023-07-19 15:13:18 Evaluating: accuracy: 0.6426, eval_loss: 2.4185, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.4758, expected_sparsity: 0.4649, expected_sequence_sparsity: 0.8631, target_sparsity: 0.4233, step: 2800 lambda_1: -8.3977, lambda_2: 19.3408 lambda_3: 0.0000 train remain: [0.87 0.89 0.9 0.76 0.99 0.78 0.74 0.78 0.63] infer remain: [0.82, 0.86, 0.85, 0.74, 1.0, 0.77, 0.73, 0.77, 0.63] layerwise remain: [1.0, 1.0, 1.0, 0.82, 0.71, 0.6, 0.44, 0.44, 0.34, 0.25, 0.19, 0.12] 1111111111111011111111111111101011011011110111100110110110111011111111111101011011111111011111111010 1110111011111111111110101111111111111111110111111111011111111011110111111111111111111101111011001100 1111111111111111111111111110111111111110111111011011110111101011111111101111111011101111101011100110 1111111111111111001101111111111111011101111101111001101010111111111110111111010101100110011000101000 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1010111111101101111011111111111111111111101011111011111011111111001100011101011111110011011011101100 1011111111111011111111111111011111111110111111111110110001100011111010100111011011100000100111011100 1111111101111111011111111101111011101011111011111111100011111111111001100110111101111111101110100000 0000011111111111011111111110110111101111111111111000111100101111010101011011010111010100010000000000 loss: 0.310369, lagrangian_loss: 0.078774, attention_score_distillation_loss: 0.027789 ETA: 0:29:07 | Epoch 35 finished. Took 39.96 seconds. 
loss: 0.070664, lagrangian_loss: 0.093996, attention_score_distillation_loss: 0.027186 ---------------------------------------------------------------------- time: 2023-07-19 15:13:43 Evaluating: accuracy: 0.6173, eval_loss: 2.5996, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.4771, expected_sparsity: 0.47, expected_sequence_sparsity: 0.8644, target_sparsity: 0.4309, step: 2850 lambda_1: -8.6763, lambda_2: 19.4434 lambda_3: 0.0000 train remain: [0.87 0.89 0.9 0.75 0.99 0.78 0.74 0.78 0.63] infer remain: [0.82, 0.85, 0.85, 0.73, 1.0, 0.77, 0.73, 0.77, 0.63] layerwise remain: [1.0, 1.0, 1.0, 0.82, 0.7, 0.59, 0.43, 0.43, 0.33, 0.24, 0.19, 0.12] 1111111111111011111111111111101011011011110111100110110110111011111111111101011011111111011111111010 1110111011111111111110101111111111111111110111111111011111111011110111111111111101111101111011001100 1111111111111111111111111110111111111110111111011011110111101111111111101011111011101111101011100110 1111111111111111001101111111111111011101111101111001101010111101111110111111010101100110011000101000 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1010111111111101111011111111111111111111101001111011111011111111001100011101011111110011011011101100 1011111111111011111111111111011111111110111111111110110001100011111010100111011011100000100111011100 1111111101111111011111111101111011101011111011111111100011111111111001100110111101111111101110100000 0000011111111111011111111110110111101111111111111000111100101111010101011011010111010100010000000000 loss: 0.007409, lagrangian_loss: 0.097927, attention_score_distillation_loss: 0.026529 loss: 0.004345, lagrangian_loss: 0.080982, attention_score_distillation_loss: 0.025926 ETA: 0:28:26 | Epoch 36 finished. Took 38.75 seconds. 
---------------------------------------------------------------------- time: 2023-07-19 15:14:08 Evaluating: accuracy: 0.6209, eval_loss: 2.5957, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.4831, expected_sparsity: 0.4754, expected_sequence_sparsity: 0.8658, target_sparsity: 0.4384, step: 2900 lambda_1: -8.9536, lambda_2: 19.5435 lambda_3: 0.0000 train remain: [0.86 0.88 0.9 0.74 0.99 0.78 0.73 0.78 0.63] infer remain: [0.82, 0.85, 0.84, 0.72, 1.0, 0.77, 0.72, 0.76, 0.63] layerwise remain: [1.0, 1.0, 1.0, 0.82, 0.7, 0.59, 0.42, 0.42, 0.32, 0.23, 0.18, 0.11] 1111111111111011111111111111101011011011110111100110110110111011111111111101011011111111011111111010 1110111011111111111110101111111111111111110111111111011111111011110111111111111101111101111011001100 1111111111111111111111111110111111111110111111011011110111101011111111111011101011101111101011100110 1111111111111111001101111111111111011101111101111001101010111101111110111111010101100110011000100000 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1010111111101101111011111111111111111111101011111011111011111111001100011101011111110011011011101100 1011111111111011111111111111011101111110111111111110110001100011111010100111011011100000100111011100 1111111101111111011111111101111011101011111011111111100011111111111001100110111101101111101110100000 0000011111111111011111111110110111101111111111111000111100101111010101011011010111010100010000000000 loss: 0.002487, lagrangian_loss: 0.076532, attention_score_distillation_loss: 0.025307 loss: 0.041345, lagrangian_loss: 0.081311, attention_score_distillation_loss: 0.024667 ---------------------------------------------------------------------- time: 2023-07-19 15:14:34 Evaluating: accuracy: 0.6173, eval_loss: 2.6695, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.487, expected_sparsity: 0.4802, expected_sequence_sparsity: 0.867, target_sparsity: 0.446, step: 2950 lambda_1: -9.2007, lambda_2: 19.6217 lambda_3: 0.0000 train remain: [0.85 0.88 0.89 0.74 0.99 0.78 0.73 0.78 0.63] infer remain: [0.81, 0.85, 0.84, 0.72, 1.0, 0.76, 0.72, 0.76, 0.63] layerwise remain: [1.0, 1.0, 1.0, 0.81, 0.69, 0.58, 0.42, 0.42, 0.32, 0.23, 0.17, 0.11] 1111111111111011111111111111101011011011110111100110110110111011111111111101011011111111011110111010 1110111011111111111110101111111111111111110111111111011111111011110111111111111101111101111011001100 1111111111111111111111111110111111111110111111011011110111101111111111101011101011101111101011100110 1111111111111111001101111111111111011101111101111001101010111101111110111111010101100110011000001000 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1010111111101101111011111111111111111111101001111011111011111111001100011101011111110011011011101100 1011111111111011111111111111011101111110111111111110110001100011111010100111011011100000100111011100 1111111101111111011111111101111011101011111011111111100011111111111001100110111101101111101110100000 0000011111111111011111111110110111101111111111111000111100101111010101011011010111010100010000000000 loss: 0.008462, lagrangian_loss: 0.074117, attention_score_distillation_loss: 0.024051 ETA: 0:27:47 | Epoch 37 finished. Took 40.15 seconds. 
loss: 0.043701, lagrangian_loss: 0.069809, attention_score_distillation_loss: 0.023345 ---------------------------------------------------------------------- time: 2023-07-19 15:14:59 Evaluating: accuracy: 0.6173, eval_loss: 2.699, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.4961, expected_sparsity: 0.4872, expected_sequence_sparsity: 0.8689, target_sparsity: 0.4535, step: 3000 lambda_1: -9.4446, lambda_2: 19.6970 lambda_3: 0.0000 train remain: [0.85 0.87 0.88 0.73 0.99 0.77 0.73 0.77 0.63] infer remain: [0.81, 0.84, 0.83, 0.71, 1.0, 0.76, 0.72, 0.76, 0.63] layerwise remain: [1.0, 1.0, 1.0, 0.81, 0.68, 0.56, 0.4, 0.4, 0.3, 0.22, 0.17, 0.11] 1111111111111011111111111111101011011011110111100110110110111011111111111101011011111111011110111010 1110111011111111110110101111111111111111110111111111011111111011110111111111111101111101111011001100 1111111111111111111111111110111111111110111111011011110111101011111111101011101011101111101011100110 1111111111111111001101111111111111011101111101111001101010111101111110111111010101100110011000000000 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1010111111101101111011111111111111111111101001111011111011111111001100011101011111110011011011101100 1011111111111011111111111111011101111110111111111110110001100011111010100111011011100000100111011100 1111111101011111011111111101111011101011111011111111100011111111111001100110111101111111101110100000 0000011111111111011111111110110111101111111111111000111100101111010101011011010111010100010000000000 loss: 0.114981, lagrangian_loss: 0.100112, attention_score_distillation_loss: 0.022715 loss: 0.519294, lagrangian_loss: 0.097500, attention_score_distillation_loss: 0.022118 ETA: 0:27:06 | Epoch 38 finished. Took 39.11 seconds. 
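attention_score_distillation_loss decays smoothly across the run, but the log only exposes the scalar, not the objective. The sketch below is therefore a generic stand-in, a soft cross-entropy between teacher and student attention distributions; this run's top-k ranking settings suggest the actual loss is ranking-based over teacher attention scores rather than this exact form:

    import torch.nn.functional as F

    def attn_distill_loss(student_scores, teacher_scores):
        # scores: [batch, heads, q_len, k_len] raw attention logits
        t = F.softmax(teacher_scores, dim=-1)
        log_s = F.log_softmax(student_scores, dim=-1)
        return F.kl_div(log_s, t, reduction="batchmean")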
---------------------------------------------------------------------- time: 2023-07-19 15:15:25 Evaluating: accuracy: 0.6245, eval_loss: 2.6877, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.4961, expected_sparsity: 0.49, expected_sequence_sparsity: 0.8696, target_sparsity: 0.4611, step: 3050 lambda_1: -9.7092, lambda_2: 19.7842 lambda_3: 0.0000 train remain: [0.84 0.87 0.88 0.73 0.99 0.77 0.72 0.77 0.63] infer remain: [0.81, 0.84, 0.82, 0.71, 1.0, 0.76, 0.71, 0.76, 0.63] layerwise remain: [1.0, 1.0, 1.0, 0.81, 0.68, 0.56, 0.4, 0.4, 0.3, 0.21, 0.16, 0.1] 1111111111111011111111111111101011011011110111100110110110111011111111111101011011111111011110111010 1110111011111111111110101111111111111111110111111111011111111011110111111110111101111101111011001100 1111111111111111111111111110111111111110111111011011110111101011111111101010101011101111101011100110 1111111111111111001101111111111111011101111101111001101010111101111110111111010101100110011000000000 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1010111111101101111011111111111111111111101001111011111011111111001100011101011111110011011011101100 1011111111111011111111111111011101111110111111111110110001000011111010100111011011100000100111011100 1111111101011111011111111101111011101011111011111111100011111111111001100110111101111111101110100000 0000011111111111011111111110110111101111111111111000111100101111010101011011010111010100010000000000 loss: 0.341494, lagrangian_loss: 0.085704, attention_score_distillation_loss: 0.021495 loss: 0.255516, lagrangian_loss: 0.068332, attention_score_distillation_loss: 0.020899 ---------------------------------------------------------------------- time: 2023-07-19 15:15:51 Evaluating: accuracy: 0.6029, eval_loss: 2.6045, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.5078, expected_sparsity: 0.4994, expected_sequence_sparsity: 0.872, target_sparsity: 0.4687, step: 3100 lambda_1: -9.9610, lambda_2: 19.8629 lambda_3: 0.0000 train remain: [0.83 0.86 0.87 0.72 0.99 0.77 0.72 0.77 0.63] infer remain: [0.8, 0.83, 0.82, 0.7, 1.0, 0.75, 0.71, 0.76, 0.62] layerwise remain: [1.0, 1.0, 1.0, 0.8, 0.66, 0.54, 0.38, 0.38, 0.29, 0.2, 0.15, 0.1] 1111111111111011111111111111101011011011100111100110110110111011111111111101011011111111011110111010 1110111011111111110110101111111111111111110111111111011111111011110111111110111101111101111011001100 1111111111111111111111111110111111111110111111011011111111101011111111101010101011101011101011100110 1111111111111111001101111111111111011101111101111001101000111101111110111111010101100110011000000000 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1010111111101101111011110111111111111111101001111011111011111111001100011101011111110011011011101100 1011111111111011111111111111011101111110111111111110110001000011111010100111011011100000100111011100 1111111101011111011111111101111011101011111011111111100011111111111001100110111101111111101110100000 0000010111111111011111111110110111101111111111111000111100101111010101011011010111010100010000000000 loss: 0.032344, lagrangian_loss: 0.065753, attention_score_distillation_loss: 0.020272 ETA: 0:26:28 | Epoch 39 finished. Took 40.91 seconds. 
loss: 0.015267, lagrangian_loss: 0.076968, attention_score_distillation_loss: 0.019599 ---------------------------------------------------------------------- time: 2023-07-19 15:16:16 Evaluating: accuracy: 0.5921, eval_loss: 2.7903, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.5112, expected_sparsity: 0.5024, expected_sequence_sparsity: 0.8728, target_sparsity: 0.4762, step: 3150 lambda_1: -10.1977, lambda_2: 19.9312 lambda_3: 0.0000 train remain: [0.83 0.86 0.86 0.72 0.99 0.76 0.71 0.77 0.62] infer remain: [0.8, 0.83, 0.81, 0.7, 1.0, 0.75, 0.7, 0.75, 0.62] layerwise remain: [1.0, 1.0, 1.0, 0.8, 0.66, 0.54, 0.38, 0.38, 0.28, 0.2, 0.15, 0.09] 1111111111111011111111111111101011011011100111100110110110111011111111111101011011111111011110111010 1110111011111111111110101111111111111111110111111111011111111011110111111110111101111001111011001100 1111111111111111111111111111111111111110111111011011111111101011111111101010101011101000101011100110 1111111111111111001101111111111111011101111101111001101000111101111110111111010101100110011000000000 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1010111111101101111011111111111111111111101001111011111011111111001100011101011111110011011011101000 1011111111111011111111111111011101101110111111111110110001000011111010100111011011100000100111011100 1111111101011111011111111101111011101011111011111111100011111111111001100110111101101111101110100000 0000010111111111011111111110110111101111111111111000111100101111010101011011010111010100010000000000 loss: 0.002551, lagrangian_loss: 0.078639, attention_score_distillation_loss: 0.018954 loss: 0.004602, lagrangian_loss: 0.075491, attention_score_distillation_loss: 0.018321 ETA: 0:25:48 | Epoch 40 finished. Took 39.24 seconds. 
---------------------------------------------------------------------- time: 2023-07-19 15:16:42 Evaluating: accuracy: 0.6282, eval_loss: 2.5548, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.5151, expected_sparsity: 0.5062, expected_sequence_sparsity: 0.8738, target_sparsity: 0.4838, step: 3200 lambda_1: -10.4200, lambda_2: 19.9915 lambda_3: 0.0000 train remain: [0.83 0.85 0.84 0.71 0.99 0.76 0.71 0.77 0.62] infer remain: [0.8, 0.83, 0.8, 0.69, 1.0, 0.75, 0.7, 0.75, 0.62] layerwise remain: [1.0, 1.0, 1.0, 0.8, 0.66, 0.53, 0.37, 0.37, 0.27, 0.19, 0.14, 0.09] 1111111111111011111111111111101011011011100111100110110110111011111111111101011011111111011110111010 1110111011111111111110101111111111111111110111111111011111111011110111111110111101111001111011001100 1111111111111111111111111111111111111110111111110011111111101011111111101010101010101000101011100110 1111111111111111001101111111111111011101111101111001101000111101111110111111010101100110010000000000 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1010111111101101111011111111111111111111101001111011111011111111001100011101011111110011011011101000 1011111111111011111111111111011101101110111111111110110001000011111010100111011011100000100111011100 1111111101011111011111111101111011101011111011111111100011111111111001100110111101101111101110100000 0000010111111111011111111110110111101111111111111000111100101111010101011011010111010100010000000000 loss: 0.003761, lagrangian_loss: 0.053516, attention_score_distillation_loss: 0.017741 loss: 1.196991, lagrangian_loss: 0.075100, attention_score_distillation_loss: 0.017049 ---------------------------------------------------------------------- time: 2023-07-19 15:17:08 Evaluating: accuracy: 0.6209, eval_loss: 2.5255, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.5229, expected_sparsity: 0.5169, expected_sequence_sparsity: 0.8765, target_sparsity: 0.4914, step: 3250 lambda_1: -10.6231, lambda_2: 20.0405 lambda_3: 0.0000 train remain: [0.82 0.84 0.83 0.7 0.99 0.76 0.71 0.76 0.62] infer remain: [0.79, 0.82, 0.79, 0.68, 1.0, 0.74, 0.7, 0.75, 0.62] layerwise remain: [1.0, 1.0, 1.0, 0.79, 0.65, 0.51, 0.35, 0.35, 0.26, 0.18, 0.14, 0.08] 1111111111111011111111111111101011011011100111100110110110111011111111110101011011111111011110111010 1110111011111111110110101111111111111111110111111111011111111011110111111110111101111001111011001100 1111111111111111111111111111111111111110111111010011111111101011111111101010101010101000101011100110 1111111111111111001101111111111111011101111101111001101000111101111110111111010101100100010000000000 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1010111111101101111011111111111111111111001001111011111011111111001100011101011111110011011011101000 1011111111111011111111111111011101101110111111111110110001000011111010100111011011100000100111011100 1111111101011111011111111101111011101011111011111111100011111111111001100110111101101111101110100000 0000010111111111011111111110110111101111111111111000111100101111010101011011010111010100010000000000 loss: 0.015737, lagrangian_loss: 0.071713, attention_score_distillation_loss: 0.016435 loss: 0.056410, lagrangian_loss: 0.081059, attention_score_distillation_loss: 0.015840 ETA: 0:25:09 | Epoch 41 finished. Took 40.19 seconds. 
----------------------------------------------------------------------
time: 2023-07-19 15:17:33
Evaluating: accuracy: 0.6173, eval_loss: 2.5088, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.5229, expected_sparsity: 0.5169, expected_sequence_sparsity: 0.8765, target_sparsity: 0.4989, step: 3300
lambda_1: -10.8754, lambda_2: 20.1152 lambda_3: 0.0000
train remain: [0.82 0.84 0.83 0.7 0.99 0.75 0.71 0.76 0.62]
infer remain: [0.79, 0.82, 0.79, 0.68, 1.0, 0.74, 0.7, 0.75, 0.62]
layerwise remain: [1.0, 1.0, 1.0, 0.79, 0.65, 0.51, 0.35, 0.35, 0.26, 0.18, 0.14, 0.08]
1111111111111011111111111111101011011011100111100110110110111011111111110101011011111111011110111010
1110111011111111110110101111111111111111110111111111011111111011110111111110111101111001111011001100
1111111111111111111111111111111111111110111111010011111111101011111111101010101010101000101011100110
1111111111111111001101111111111111011101111101111001101000111101111110111111010101100100010000000000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011111111111111111111001001111011111011111111001100011101011111110011011011101000
1011111111111011111111111111011101101110111111111110110001000011111010100111011011100000100111011100
1111111101011111011111111101111011101011111011111111100011111111111001100110111101101111101110100000
0000010111111111011111111110110111101111111111111000111100101111010101011011010111010100010000000000
loss: 0.076251, lagrangian_loss: 0.100739, attention_score_distillation_loss: 0.015197
loss: 0.350246, lagrangian_loss: 0.128922, attention_score_distillation_loss: 0.014505
----------------------------------------------------------------------
time: 2023-07-19 15:17:59
Evaluating: accuracy: 0.5993, eval_loss: 2.6515, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.5346, expected_sparsity: 0.5269, expected_sequence_sparsity: 0.8791, target_sparsity: 0.5065, step: 3350
lambda_1: -11.2008, lambda_2: 20.2370 lambda_3: 0.0000
train remain: [0.82 0.83 0.82 0.69 0.99 0.75 0.7 0.76 0.62]
infer remain: [0.78, 0.81, 0.78, 0.67, 1.0, 0.74, 0.69, 0.75, 0.62]
layerwise remain: [1.0, 1.0, 1.0, 0.78, 0.63, 0.49, 0.33, 0.33, 0.24, 0.17, 0.13, 0.08]
1111111111111011111111111111101011011011100111100110110110111011111111110101011011110111011110111010
1110111011111111110110101111111111111111110111111111011111111011110110111110111101111001111011001100
1111111111111111111111111111111111111110111111010011111111101011101111101010101010101000101011100110
1111111111111111001101111111111111011001111101111001101000111101111110111111010101100100010000000000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011111111111111111111001001111011111011111111001100011101011111110011011011101000
1011111111111011111111111111011101101110111111111110110001000011111010100101011011100000100111011100
1111111101011111011111111101111011101011111011111111100011111111111001100110111101101111101110100000
0000010111111111011111111110110111101111111111111000111100101111010101011011010111010100010000000000
loss: 0.006544, lagrangian_loss: 0.143903, attention_score_distillation_loss: 0.013892
ETA: 0:24:30 | Epoch 42 finished. Took 40.78 seconds.
loss: 0.301119, lagrangian_loss: 0.108145, attention_score_distillation_loss: 0.013329
----------------------------------------------------------------------
time: 2023-07-19 15:18:24
Evaluating: accuracy: 0.639, eval_loss: 2.4096, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.5359, expected_sparsity: 0.5298, expected_sequence_sparsity: 0.8799, target_sparsity: 0.5141, step: 3400
lambda_1: -11.5582, lambda_2: 20.3831 lambda_3: 0.0000
train remain: [0.81 0.82 0.81 0.69 0.99 0.75 0.7 0.76 0.62]
infer remain: [0.78, 0.81, 0.77, 0.67, 1.0, 0.73, 0.69, 0.74, 0.62]
layerwise remain: [1.0, 1.0, 1.0, 0.78, 0.63, 0.49, 0.33, 0.33, 0.24, 0.16, 0.12, 0.08]
1111111111111011111111111111101011011011100111100110110110111011111111110101011011110111011110111010
1110111011111111111110101111111111111111110111111111011111111011110100111110111101111001111011001100
1111111111111111111111111111111111111110111111010011111111101011101111101010101010101000001011100110
1111111111111111001101111111111111011001111101111001101000111101111110111111010101100100010000000000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011110111111111111111001001111011111011111111001100011101011111110011011011101000
1011111111111011111111111111011101101110111111111110110001000011111010100101011011100000100111011100
1111111101011111011111111101111011101011111011111111100011111011111001100110111101101111101110100000
0000010111111111011111111110110111101111111111111000111100101111010101011011010111010100010000000000
loss: 0.014456, lagrangian_loss: 0.141687, attention_score_distillation_loss: 0.012656
loss: 0.005534, lagrangian_loss: 0.112436, attention_score_distillation_loss: 0.012044
ETA: 0:23:50 | Epoch 43 finished. Took 39.39 seconds.
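Each 100-character row in these records is one prune location's keep-mask over 100 token bins, and that location's infer remain is the fraction of 1s in its row. At eval time a stochastic hard-concrete gate is typically collapsed into such a deterministic 0/1 mask; the sketch below shows the generic recipe from the L0-regularization literature, with the stretch constants and the 0.5 threshold as assumptions rather than values read out of this run.

```python
import torch

GAMMA, ZETA = -0.1, 1.1  # standard hard-concrete stretch interval (assumed)

def deterministic_keep_mask(log_alpha: torch.Tensor) -> torch.Tensor:
    # log_alpha: [num_prune_locations, num_bins] learned gate parameters.
    # Collapse the stochastic gate to its expected stretched value, clamp to
    # [0, 1], and threshold -- yielding rows like the 0/1 strings in the log.
    s = torch.sigmoid(log_alpha)
    z = torch.clamp(s * (ZETA - GAMMA) + GAMMA, 0.0, 1.0)
    return (z > 0.5).float()

# infer remain would then be the per-row mean of the binary mask:
# infer_remain = deterministic_keep_mask(log_alpha).mean(dim=1)
```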
----------------------------------------------------------------------
time: 2023-07-19 15:18:50
Evaluating: accuracy: 0.6173, eval_loss: 2.5429, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.5445, expected_sparsity: 0.5361, expected_sequence_sparsity: 0.8815, target_sparsity: 0.5216, step: 3450
lambda_1: -11.9156, lambda_2: 20.5289 lambda_3: 0.0000
train remain: [0.8 0.82 0.8 0.68 0.99 0.74 0.69 0.76 0.62]
infer remain: [0.78, 0.8, 0.76, 0.66, 1.0, 0.73, 0.68, 0.74, 0.62]
layerwise remain: [1.0, 1.0, 1.0, 0.78, 0.62, 0.47, 0.31, 0.31, 0.23, 0.16, 0.11, 0.07]
1111111111111011111111111111101011011011100111100110110110111011111111110101011011110111011110111010
1110111011111111110110101111111111111111110111111111011111111011110100111110111101111001111011001100
1111111111111111111111111111111111111110111111010011111111101011101111101010101010101000001010100110
1111111111111111001101111111111111011001111101111000101000111101111110111111010101100100010000000000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011110111111111111111001001111011111011111111001100011101011111110011011011101000
1011111111111011111111111111011101101110111111111110110001000011111010100101011011100000000111011100
1111111101011111011111111101111011101011111011111111100011111011111001100110111101101111101110100000
0000010111111111011111111110110111101111111111111000111100101111010101011011010111010100010000000000
loss: 0.003931, lagrangian_loss: 0.097588, attention_score_distillation_loss: 0.011439
loss: 0.007178, lagrangian_loss: 0.115671, attention_score_distillation_loss: 0.010769
----------------------------------------------------------------------
time: 2023-07-19 15:19:16
Evaluating: accuracy: 0.6029, eval_loss: 2.6406, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.5523, expected_sparsity: 0.5413, expected_sequence_sparsity: 0.8828, target_sparsity: 0.5292, step: 3500
lambda_1: -12.2549, lambda_2: 20.6603 lambda_3: 0.0000
train remain: [0.79 0.81 0.79 0.67 0.99 0.74 0.69 0.75 0.62]
infer remain: [0.77, 0.8, 0.75, 0.66, 1.0, 0.73, 0.68, 0.74, 0.61]
layerwise remain: [1.0, 1.0, 1.0, 0.77, 0.62, 0.46, 0.3, 0.3, 0.22, 0.15, 0.11, 0.07]
1111111111111011111111111111101011011011100111100110110110111011111111110101011011010111011110111010
1110111011111111110110101111111111111111110111111111011111111011110100111110111101111001111011001100
1111111111111111111111111110111111111110111101110011111111101011101111101010101010101000001010100110
1111111111111111001101111111111111011001111101111000101000111101111110111111010101100100010000000000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011111111111111111111001001111011111011111111001100011101011111110011011011100000
1011111111111011111111111111011101101110111111111110110001000011111010100101011001100000100111011100
1111111101011111011111111101111011101011111011111111100011111011111001100110111101101111101110100000
0000010111111111011111111110110111101111111111111000111100101111010101011011010011010100010000000000
loss: 0.005086, lagrangian_loss: 0.122999, attention_score_distillation_loss: 0.010151
ETA: 0:23:11 | Epoch 44 finished. Took 40.49 seconds.
loss: 0.011251, lagrangian_loss: 0.114003, attention_score_distillation_loss: 0.009537
----------------------------------------------------------------------
time: 2023-07-19 15:19:41
Evaluating: accuracy: 0.6282, eval_loss: 2.494, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.5609, expected_sparsity: 0.5507, expected_sequence_sparsity: 0.8852, target_sparsity: 0.5367, step: 3550
lambda_1: -12.5875, lambda_2: 20.7863 lambda_3: 0.0000
train remain: [0.79 0.81 0.78 0.67 0.98 0.73 0.69 0.75 0.61]
infer remain: [0.76, 0.79, 0.74, 0.65, 1.0, 0.72, 0.68, 0.74, 0.61]
layerwise remain: [1.0, 1.0, 1.0, 0.76, 0.6, 0.44, 0.29, 0.29, 0.21, 0.14, 0.1, 0.06]
1111111111111011111111111111101011011011100111100010110110111011111111111101011011010111011100111010
1110111011111111110110101111111111111111110111111111011110111011110100111110111101111001111011001100
1111111111111111111111111110111111111110111101010011111111101011101111101010101010101000001010100110
1111111111111111001101111111111011011001111101111000101000111101111110111111010101100100010000000000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011110111111111111111001001111011111011111111001100011101011111110011011011100000
1011111111111011111111111111011101101110111111111110110001000011111010100101011001100000100111011100
1111111101011111011111111101111011101011111011111111100011111011111001100110111101101111101110100000
0000010111111111011111111110110111101111111111111000111100101111010101011011010011010100010000000000
loss: 0.068712, lagrangian_loss: 0.109909, attention_score_distillation_loss: 0.008896
loss: 0.011009, lagrangian_loss: 0.136202, attention_score_distillation_loss: 0.008250
ETA: 0:22:30 | Epoch 45 finished. Took 39.16 seconds.
----------------------------------------------------------------------
time: 2023-07-19 15:20:07
Evaluating: accuracy: 0.639, eval_loss: 2.3263, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.5648, expected_sparsity: 0.5543, expected_sequence_sparsity: 0.8861, target_sparsity: 0.5443, step: 3600
lambda_1: -12.9416, lambda_2: 20.9284 lambda_3: 0.0000
train remain: [0.78 0.81 0.77 0.66 0.98 0.73 0.68 0.75 0.61]
infer remain: [0.76, 0.79, 0.73, 0.64, 1.0, 0.72, 0.67, 0.74, 0.61]
layerwise remain: [1.0, 1.0, 1.0, 0.76, 0.6, 0.44, 0.28, 0.28, 0.2, 0.14, 0.1, 0.06]
1111111111111011111111111111101011011011100111100010110110111011111111111101011011010111011100111010
1110111011111111110110101111111111111111110111111111011110111011110100111110111101111001111011001100
1111111111111111111111111110111111111110111101010011111111001011101111101010101010101000001010100110
1111111111111111001101111111111011011001111101111000101000111101011110111111010101100100010000000000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011110111111111111111001001111011111011111111001100011101011111110011011011100000
1011111111111011111111111111011101101110111111111110110001000011111010100101011001100000000111011100
1111111101011111011111111101111011101011111011111111100011111011111001100110111101101111101110100000
0000010111111111011111111110110111101111111111111000111100101111010101011011010011010100010000000000
loss: 0.744691, lagrangian_loss: 0.128177, attention_score_distillation_loss: 0.007626
loss: 0.006925, lagrangian_loss: 0.141114, attention_score_distillation_loss: 0.007001
----------------------------------------------------------------------
time: 2023-07-19 15:20:33
Evaluating: accuracy: 0.6643, eval_loss: 2.3173, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.5661, expected_sparsity: 0.5595, expected_sequence_sparsity: 0.8875, target_sparsity: 0.5519, step: 3650
lambda_1: -13.3363, lambda_2: 21.1052 lambda_3: 0.0000
train remain: [0.77 0.81 0.76 0.65 0.98 0.73 0.68 0.75 0.61]
infer remain: [0.75, 0.78, 0.73, 0.64, 1.0, 0.72, 0.67, 0.74, 0.61]
layerwise remain: [1.0, 1.0, 1.0, 0.75, 0.58, 0.43, 0.27, 0.27, 0.2, 0.13, 0.1, 0.06]
1111111111111011111111111111101011011011100111100010110110111011111111110101011011010111011100111010
1110111011111111110110101111111111111111110111111111001110111011110100111110111101111001111011001100
1111111111111111111111111110111111111110111101010011111111001011101111101010101010101000001010100110
1111111111111111001101111111111011011001111101111000101000111101011110111111010101100100010000000000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011110111111111111111001001111011111011111111001100011101011111110011011011100000
1011111111111011111111111111011101101110111111111110110001000011111010100101011001100000000111011100
1111111101011111011111111101111011101011111011111111100011111011111001100110111101101111101110100000
0000010111111111011111111110110111101111111111111000111100101111010101011011010011010100010000000000
loss: 0.009152, lagrangian_loss: 0.186133, attention_score_distillation_loss: 0.006354
ETA: 0:21:51 | Epoch 46 finished. Took 40.65 seconds.
loss: 0.004601, lagrangian_loss: 0.197659, attention_score_distillation_loss: 0.005716
----------------------------------------------------------------------
time: 2023-07-19 15:20:58
Evaluating: accuracy: 0.6065, eval_loss: 2.5811, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.5752, expected_sparsity: 0.5664, expected_sequence_sparsity: 0.8893, target_sparsity: 0.5594, step: 3700
lambda_1: -13.8023, lambda_2: 21.3534 lambda_3: 0.0000
train remain: [0.77 0.8 0.75 0.65 0.98 0.72 0.68 0.75 0.61]
infer remain: [0.74, 0.78, 0.72, 0.63, 1.0, 0.71, 0.67, 0.73, 0.61]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.58, 0.42, 0.26, 0.26, 0.19, 0.12, 0.09, 0.06]
1111111111111011111111111111101011011011100111100010110110111011111111110101010011010111011100111010
1110111011111111110110101111111111111111110111111111001110111011110100111110111101111001111011001100
1111111111111111111111111110111111111110111101010011111111001011101111101010101000101000001010100110
1111111111111111001001111111111011011001111101111000101000111101011110111111010101100100010000000000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011110111111111111111001001111011111011111111001100011101011111110011011011000000
1011111111111011111111111111011101101110111111111110110001000011111010100101011001100000000111011100
1111111101011111011111111101111011101011111011111111100011111011111001100110110101101111101110100000
0000010111111111011111111110110111101111111111111000111100101111010101011011010011010100010000000000
loss: 0.413352, lagrangian_loss: 0.187738, attention_score_distillation_loss: 0.005107
loss: 0.004332, lagrangian_loss: 0.207729, attention_score_distillation_loss: 0.004463
ETA: 0:21:11 | Epoch 47 finished. Took 39.42 seconds.
----------------------------------------------------------------------
time: 2023-07-19 15:21:24
Evaluating: accuracy: 0.6065, eval_loss: 2.67, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.5804, expected_sparsity: 0.5701, expected_sequence_sparsity: 0.8902, target_sparsity: 0.567, step: 3750
lambda_1: -14.2933, lambda_2: 21.6325 lambda_3: 0.0000
train remain: [0.76 0.79 0.74 0.64 0.98 0.72 0.67 0.74 0.61]
infer remain: [0.74, 0.77, 0.71, 0.63, 1.0, 0.71, 0.67, 0.73, 0.61]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.57, 0.4, 0.25, 0.25, 0.18, 0.12, 0.09, 0.05]
1111111111111011111111111111101011011011100111100010110110111011111111110101010011010111011100111010
1110111011111111110110101111111111111111110111111111001110011011110100111110111101111001111011001100
1111111111111111111111111110111111111110111101010011111111001011101111100010101000101000001010100110
1111111111111111001101111111111011011001111101111000101000111101011100111111010101100100010000000000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011110111111111111111001001111011111011111111001100011101011111110011011011000000
1011111111111011111111111111011101101110111111111110110001000011111010100101011001100000000111011100
1111111101011111011111111101111011101011111011111111100011111011111001100110110101101111101110100000
0000010111111111011111111110110111101111111111111000111100101111010101011011010011010100010000000000
loss: 0.003369, lagrangian_loss: 0.189280, attention_score_distillation_loss: 0.003839
loss: 0.049288, lagrangian_loss: 0.191961, attention_score_distillation_loss: 0.003212
----------------------------------------------------------------------
time: 2023-07-19 15:21:50
Evaluating: accuracy: 0.6318, eval_loss: 2.4441, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.5856, expected_sparsity: 0.5762, expected_sequence_sparsity: 0.8918, target_sparsity: 0.5746, step: 3800
lambda_1: -14.8084, lambda_2: 21.9450 lambda_3: 0.0000
train remain: [0.75 0.79 0.73 0.64 0.98 0.72 0.67 0.74 0.61]
infer remain: [0.73, 0.77, 0.7, 0.62, 1.0, 0.71, 0.66, 0.73, 0.61]
layerwise remain: [1.0, 1.0, 1.0, 0.73, 0.56, 0.39, 0.24, 0.24, 0.17, 0.11, 0.08, 0.05]
1111111111111011111111111111100011011011100111100010110110111011111111110101010011010111011100111010
1110111011111111110110101111111111111111110111111111001110011011110100111110111101111001111011001100
1111111111111111111111111110111111111110111101010011111111001011101111100010101000100000001010100110
1111111111111111001001111111111011011001111101111000101000111101011100111111010101100100010000000000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011110111111111111111001001111011111011111111001100011101011111110011011011000000
1011111111111011111111111111011101101010111111111110110001000011111010100101011001100000000111011100
1111011101011111011111111101111011101011111011111111100011111011111001100110111101101111101110100000
0000010111111111011111111110110111101111111111111000111100101111010101011011010011010100010000000000
loss: 0.003779, lagrangian_loss: 0.208335, attention_score_distillation_loss: 0.002579
ETA: 0:20:32 | Epoch 48 finished. Took 40.81 seconds.
loss: 0.149992, lagrangian_loss: 0.221737, attention_score_distillation_loss: 0.001945
----------------------------------------------------------------------
time: 2023-07-19 15:22:15
Evaluating: accuracy: 0.6318, eval_loss: 2.5037, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.5882, expected_sparsity: 0.5802, expected_sequence_sparsity: 0.8929, target_sparsity: 0.5821, step: 3850
lambda_1: -15.3436, lambda_2: 22.2901 lambda_3: 0.0000
train remain: [0.74 0.78 0.72 0.63 0.98 0.71 0.66 0.74 0.61]
infer remain: [0.73, 0.76, 0.69, 0.62, 1.0, 0.7, 0.66, 0.73, 0.61]
layerwise remain: [1.0, 1.0, 1.0, 0.73, 0.55, 0.38, 0.24, 0.24, 0.17, 0.11, 0.08, 0.05]
1111111111111011111111111111100011011011100111100010110110111011111111110101010011010111011100111010
1110111011111111110110101111111111111111110111111111001110011011110100111110111101111001011011001100
1111111111111111111111111110111111111110111101010011110111001011101111100010101000100000001010100110
1111111111111111001001111111111011011001111101111000101000111101011100111111010101100100010000000000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011110111111111111111001001111010111011111111001100011101011111110011011011000000
1011111111111011111111111111011101101010111111111110110001000011111010100101011001100000000111011100
1111011101011111011111111101111011101011111011111111100011111011111001100110111101101111101110100000
0000010111111111011111111110110111101111111111111000111100101111010101011011010011010100010000000000
loss: 0.004148, lagrangian_loss: 0.271993, attention_score_distillation_loss: 0.001310
loss: 0.276304, lagrangian_loss: 0.256949, attention_score_distillation_loss: 0.000985
----------------------------------------------------------------------
time: 2023-07-19 15:22:41
Evaluating: accuracy: 0.6318, eval_loss: 2.5379, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.5946, expected_sparsity: 0.588, expected_sequence_sparsity: 0.8949, target_sparsity: 0.5897, step: 3900
lambda_1: -15.9210, lambda_2: 22.7007 lambda_3: 0.0000
train remain: [0.74 0.78 0.71 0.62 0.98 0.71 0.66 0.74 0.61]
infer remain: [0.72, 0.75, 0.68, 0.61, 1.0, 0.7, 0.65, 0.73, 0.61]
layerwise remain: [1.0, 1.0, 1.0, 0.72, 0.54, 0.37, 0.22, 0.22, 0.16, 0.1, 0.07, 0.05]
1111111111111011111111111111100011011011100111100010110110111011111111110101010011010111011100101010
1110111011111111110110101111111111111111110111111111001110011011110100111110101101111001011011001100
1111111111111111111111111110111111111110111101010010110111001011101111100010101000100000001010100110
1111111111111111001001111111111011011001111101111000101000110101011100111111010101100100010000000000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011110111111111111111001001111010111011111111001100011101011111110011011011000000
1011111111111011111111111111011101001010111111111110110001000011111010100101011001100000000111011100
1111011101011111011111111101111011101011111011111111100011111011111001100110111101101111101110100000
0000010111111111011111111110110111101111111111111000111100101111010101011011010011010100010000000000
ETA: 0:19:53 | Epoch 49 finished. Took 40.6 seconds.
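target_sparsity climbs by roughly 0.0076 per 50-step evaluation interval in the records above and caps at 0.59 from step 3950 on, consistent with a linear warmup of the sparsity target. A sketch of such a schedule, with the warmup length inferred from the logged slope rather than read from a config:

```python
def target_sparsity_at(step: int,
                       final_target: float = 0.59,
                       warmup_steps: int = 3900) -> float:
    # Linear ramp to the final target, then flat; warmup_steps is an
    # estimate (~0.59 / 0.000151 per step) inferred from this log.
    return final_target * min(1.0, step / warmup_steps)

# Matches the records to within a small step offset:
# target_sparsity_at(3150) -> 0.4765   (logged 0.4762)
# target_sparsity_at(3900) -> 0.59     (logged 0.5897, flat afterwards)
```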
loss: 0.038694, lagrangian_loss: 0.288122, attention_score_distillation_loss: 0.000984
loss: 0.004382, lagrangian_loss: 0.260619, attention_score_distillation_loss: 0.000984
Starting saving the best from epoch 50 and step 3950
----------------------------------------------------------------------
time: 2023-07-19 15:23:07
Evaluating: accuracy: 0.6643, eval_loss: 2.2208, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.602, expected_sparsity: 0.5936, expected_sequence_sparsity: 0.8963, target_sparsity: 0.59, step: 3950
lambda_1: -16.4693, lambda_2: 23.0853 lambda_3: 0.0000
train remain: [0.73 0.77 0.7 0.62 0.97 0.71 0.66 0.74 0.61]
infer remain: [0.71, 0.75, 0.67, 0.6, 1.0, 0.7, 0.65, 0.72, 0.6]
layerwise remain: [1.0, 1.0, 1.0, 0.71, 0.53, 0.36, 0.21, 0.21, 0.15, 0.1, 0.07, 0.04]
1111111111111011111111101111100011011011100111100010110110111011111111110101010011010111011100101010
1110111011111111110110101111111111111111110111111111001110011011110100111110101101111001011011001100
1111111111111111111111111110111111111100111101010010110111001011101111100010101000100000001010100110
1111111111111111001001111111101011011001111101111000101000110101011100111111010101100100010000000000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011110111111111111111001001111010111011111111001100011101011111110011011011000000
1011111111111011111111111111011101001010111111111110110001000011111010100101011001100000000111011100
1111011101011111011111111101111011101011111011111111100011111011101001100110111101101111101110100000
0000010111111111011111111110110111101011111111111000111100101111010101011011010011010100010000000000
Saving the best model so far: [Epoch 50 | Step: 3950 | MACs sparsity: 0.602 | Score: 0.6643 | Loss: 2.2208]
loss: 0.422135, lagrangian_loss: 0.207163, attention_score_distillation_loss: 0.000983
loss: 0.951814, lagrangian_loss: 0.147679, attention_score_distillation_loss: 0.000985
ETA: 0:19:45 | Epoch 50 finished. Took 95.03 seconds.
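Note that checkpoint selection only begins once the sparsity target is fully warmed up ("Starting saving the best from epoch 50 and step 3950"), so earlier models with good accuracy but insufficient sparsity are deliberately ignored. A minimal sketch of that gating; the names and the helper are hypothetical:

```python
best_score = float("-inf")

def maybe_save_best(step: int, eval_score: float,
                    save_start_step: int = 3950) -> None:
    # Only track the best model after the warmup boundary, so every saved
    # checkpoint already satisfies the final sparsity target.
    global best_score
    if step < save_start_step:
        return
    if eval_score > best_score:
        best_score = eval_score
        save_checkpoint(step)  # hypothetical helper

def save_checkpoint(step: int) -> None:
    print(f"Saving the best model so far at step {step}")
```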
----------------------------------------------------------------------
time: 2023-07-19 15:24:29
Evaluating: accuracy: 0.6065, eval_loss: 2.5588, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.6046, expected_sparsity: 0.5973, expected_sequence_sparsity: 0.8972, target_sparsity: 0.59, step: 4000
lambda_1: -16.8111, lambda_2: 23.2405 lambda_3: 0.0000
train remain: [0.73 0.76 0.69 0.61 0.97 0.7 0.66 0.73 0.6 ]
infer remain: [0.71, 0.74, 0.66, 0.6, 1.0, 0.69, 0.65, 0.72, 0.6]
layerwise remain: [1.0, 1.0, 1.0, 0.71, 0.53, 0.35, 0.21, 0.21, 0.14, 0.09, 0.07, 0.04]
1111111111111011111111111111100011011011100111100010110110111011111111110101010011010111011100001010
1110111011111111110110101111111111111111110111111111001110011011110100111110101101011001011011001100
1011111111111111111111111110111111111110111101010010110111001011101101100010101000101000001010100010
1111111111111111001001111111101011011001111101111000101000110101011100111111010101100100010000000000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011110111111111111111001001111010101011111111001100011101011111110011011011000000
1011111111111011111111111111011101101010111111111110110001000011111010100101011001100000000111010100
1111011101011111011111111101111011101011111011111111100011111011101001100110111101101111101110100000
0000010111111111011111111110110111101011111111111000111100101111010101011011010011010100010000000000
Best eval score so far: 0.6643 @ step 3950 epoch 50.64
loss: 0.336843, lagrangian_loss: 0.106470, attention_score_distillation_loss: 0.000985
loss: 0.005895, lagrangian_loss: 0.057680, attention_score_distillation_loss: 0.000986
----------------------------------------------------------------------
time: 2023-07-19 15:24:54
Evaluating: accuracy: 0.6498, eval_loss: 2.2217, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.6123, expected_sparsity: 0.6042, expected_sequence_sparsity: 0.899, target_sparsity: 0.59, step: 4050
lambda_1: -16.9701, lambda_2: 23.2784 lambda_3: 0.0000
train remain: [0.72 0.76 0.69 0.6 0.97 0.7 0.65 0.73 0.6 ]
infer remain: [0.7, 0.73, 0.65, 0.59, 1.0, 0.69, 0.65, 0.72, 0.6]
layerwise remain: [1.0, 1.0, 1.0, 0.7, 0.51, 0.33, 0.2, 0.2, 0.14, 0.09, 0.06, 0.04]
1111111111111011111111101111100011011011100111100010110110111011111111110101010011010111011100001010
1110111011111111110110101111111111111111110111111111001110011011110100111110100101011001011011001100
1011111111111111111111111110111111111110111101010010110111001011101101100010101000101000001010000010
1111111111111111001001111111101011011001111101111000101000110101011100111111010101100100000000000000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011110111111111111111001001111010101011111111001100011101011111110011011011000000
1011111111111011111111111111011101101010111111111110110001000011111010100101011001100000000111010100
1111011101011111011111111101111011101011111011111111100011111011101001100110111101101111101110100000
0000010111111111011111111110110111101011111111111000111100101111010101011011010011010100010000000000
Best eval score so far: 0.6643 @ step 3950 epoch 50.64
loss: 0.431041, lagrangian_loss: 0.012984, attention_score_distillation_loss: 0.000986
ETA: 0:19:04 | Epoch 51 finished. Took 40.73 seconds.
loss: 0.450005, lagrangian_loss: -0.025127, attention_score_distillation_loss: 0.000985
----------------------------------------------------------------------
time: 2023-07-19 15:25:20
Evaluating: accuracy: 0.6282, eval_loss: 2.4345, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.6149, expected_sparsity: 0.6068, expected_sequence_sparsity: 0.8997, target_sparsity: 0.59, step: 4100
lambda_1: -16.9393, lambda_2: 23.2879 lambda_3: 0.0000
train remain: [0.71 0.75 0.67 0.6 0.97 0.7 0.65 0.73 0.6 ]
infer remain: [0.7, 0.73, 0.64, 0.58, 1.0, 0.69, 0.64, 0.72, 0.6]
layerwise remain: [1.0, 1.0, 1.0, 0.7, 0.51, 0.33, 0.19, 0.19, 0.13, 0.08, 0.06, 0.04]
1111111111111011111111101111100011011011100111100010110110111011111111110101010011010111011100001010
1110111011111111110110101111111111111111110111111111001110011011110100111110100101011001011011001100
1011111111111111111111111110111111111100111101010010110111001011101101100010101000101000001010000010
1101111111111111001001111111101011011001111101111000101000110101011100111111010101100100000000000000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011110111111111111111001001111010101011111111001100011101011111110011011011000000
1011111111111011111111111111011101001010111111111110110001000011111010100101011001100000000111010100
1111011101011111011111111101111011101011111011111111100011111011101001100110111101101111101110100000
0000010111111111011111111110110111101011111111111000111100101111010101011011010011010100010000000000
Best eval score so far: 0.6643 @ step 3950 epoch 50.64
loss: 0.005654, lagrangian_loss: -0.076509, attention_score_distillation_loss: 0.000987
loss: 0.266467, lagrangian_loss: -0.118968, attention_score_distillation_loss: 0.000986
ETA: 0:18:22 | Epoch 52 finished. Took 39.02 seconds.
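From here on the logged lagrangian_loss turns negative: expected sparsity (0.6068 at step 4100) now overshoots the fixed 0.59 target, and with lambda_1 < 0 the linear term outweighs the quadratic one. Plugging the eval-time numbers from the step-4100 record into the penalty sketched earlier illustrates the sign flip; the logged values differ in magnitude because training uses the stochastic train-time sparsity:

```python
gap = 0.6068 - 0.59                              # overshoot past the target
penalty = -16.9393 * gap + 23.2879 * gap * gap   # lambda_1 * gap + lambda_2 * gap^2
print(round(penalty, 3))                         # -> -0.278: pruning pressure eases off
```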
----------------------------------------------------------------------
time: 2023-07-19 15:25:45
Evaluating: accuracy: 0.6065, eval_loss: 2.6424, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.6197, expected_sparsity: 0.6115, expected_sequence_sparsity: 0.9009, target_sparsity: 0.59, step: 4150
lambda_1: -16.6850, lambda_2: 23.3727 lambda_3: 0.0000
train remain: [0.71 0.74 0.67 0.59 0.96 0.69 0.65 0.73 0.6 ]
infer remain: [0.69, 0.72, 0.64, 0.58, 1.0, 0.68, 0.64, 0.72, 0.6]
layerwise remain: [1.0, 1.0, 1.0, 0.69, 0.5, 0.32, 0.18, 0.18, 0.13, 0.08, 0.06, 0.03]
1111111111111011111111101111100011011011100111100010110110111011111111110101010011010101011100001010
1110111011111111110110100111111111111111110111111111001110011011110100111110100101011001011011001100
1011111111111111111111111110111111111100111101010010111111001011101101100010001000101000001010000010
1101111111111111001001111111101011011001111101111000101000110101011100111111010101100100000000000000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011110111111111111111001001111010101011111111001000011101011111110011011011000000
1011111111111011111111111111011101001010111111111110110001000011111010100101011001100000000111010100
1111011101011111011111111101111011101011111011111111100011111011101001100110111101101111101110100000
0000010111111111011111111110110111101011111111111000111100101111010101011011010011010100010000000000
Best eval score so far: 0.6643 @ step 3950 epoch 50.64
loss: 0.012612, lagrangian_loss: -0.160934, attention_score_distillation_loss: 0.000975
loss: 0.478241, lagrangian_loss: -0.205715, attention_score_distillation_loss: 0.000985
----------------------------------------------------------------------
time: 2023-07-19 15:26:10
Evaluating: accuracy: 0.6318, eval_loss: 2.3875, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.6287, expected_sparsity: 0.6179, expected_sequence_sparsity: 0.9026, target_sparsity: 0.59, step: 4200
lambda_1: -16.2349, lambda_2: 23.6216 lambda_3: 0.0000
train remain: [0.7 0.74 0.66 0.58 0.96 0.69 0.64 0.73 0.6 ]
infer remain: [0.68, 0.71, 0.63, 0.57, 1.0, 0.68, 0.64, 0.72, 0.6]
layerwise remain: [1.0, 1.0, 1.0, 0.68, 0.48, 0.3, 0.17, 0.17, 0.12, 0.08, 0.05, 0.03]
1111111111111011111111101111100011011011100111100010110110111011111111110101010001010101011100001010
1110111011111111110110100111110111111111110111111111001110011011110100111110100101011001011011001100
1011111111111111111111111110111111111100111101010010110111001011101101100010001000101000001010000010
1101111111111111001001111111101011011001111101111000100000110101011100111111010101100100000000000000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011110111111111111111001001111010101011111111001000011101011111110011011011000000
1011111111111011111111111111011101001010111111111110110001000011111010100101011001100000000111010100
1111011101011111011111111101111011101011111011111111100011111011101001100110111101101111101110100000
0000010111111111011111111110110111101011111111111000111100101111010101011011010011010100010000000000
Best eval score so far: 0.6643 @ step 3950 epoch 50.64
loss: 0.005949, lagrangian_loss: -0.250325, attention_score_distillation_loss: 0.000984
ETA: 0:17:41 | Epoch 53 finished. Took 40.55 seconds.
loss: 0.025485, lagrangian_loss: -0.296913, attention_score_distillation_loss: 0.000985
----------------------------------------------------------------------
time: 2023-07-19 15:26:36
Evaluating: accuracy: 0.6245, eval_loss: 2.4973, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.6313, expected_sparsity: 0.6204, expected_sequence_sparsity: 0.9032, target_sparsity: 0.59, step: 4250
lambda_1: -15.5835, lambda_2: 24.1509 lambda_3: 0.0000
train remain: [0.69 0.73 0.65 0.58 0.96 0.69 0.64 0.72 0.6 ]
infer remain: [0.68, 0.71, 0.62, 0.56, 1.0, 0.68, 0.63, 0.71, 0.6]
layerwise remain: [1.0, 1.0, 1.0, 0.68, 0.48, 0.3, 0.17, 0.17, 0.11, 0.07, 0.05, 0.03]
1111111111111011111111101111100011011011100111100010110110111011111111110101010001010101011100001010
1110111011111111110110100111110111111111110111111111001110011011110100111110100101011001011011001100
1011111111111111111111111110111111111100111101010010110111001011101101100010001000100000001010000010
1101111111111111001001111111101011011001111101111000100000110101011100111111010101100000000000000000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011110111111111111111001001111010101011111111001000011101011111110011011011000000
1011111111111011111111111111011101001010111111111110110001000011111010100101011001100000000110010100
1111011101011111011111111101111011101011111011111111100011111011101001100110110101101111101110100000
0000010111111111011111111110110111101011111111111000111100101111010101011011010011010100010000000000
Best eval score so far: 0.6643 @ step 3950 epoch 50.64
loss: 0.003920, lagrangian_loss: -0.309019, attention_score_distillation_loss: 0.000987
loss: 0.159576, lagrangian_loss: -0.334167, attention_score_distillation_loss: 0.000987
ETA: 0:17:00 | Epoch 54 finished. Took 39.46 seconds.
----------------------------------------------------------------------
time: 2023-07-19 15:27:02
Evaluating: accuracy: 0.6606, eval_loss: 2.3501, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.6326, expected_sparsity: 0.6245, expected_sequence_sparsity: 0.9043, target_sparsity: 0.59, step: 4300
lambda_1: -14.8043, lambda_2: 24.9288 lambda_3: 0.0000
train remain: [0.69 0.73 0.65 0.57 0.96 0.69 0.64 0.72 0.6 ]
infer remain: [0.67, 0.7, 0.62, 0.56, 1.0, 0.68, 0.63, 0.71, 0.6]
layerwise remain: [1.0, 1.0, 1.0, 0.67, 0.47, 0.29, 0.16, 0.16, 0.11, 0.07, 0.05, 0.03]
1111111111111011111111101111100011011011100111100010110110111011111111110101010001010101011000001010
1110111011111111110110100111110111111111110111111111001110011011110100111110100001011001011011001100
1011111111111111111111111110111111111100111101010010110111001011101101100010001000100000001010000010
1101111111111111001001111111101011011001111101111000100000110101011100111111010101100000000000000000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011110111111111111111001001111010101011111111001000011101011111110011011011000000
1011111111111011111111111111011101001010111111111110110001000011111010100101011001100000000110010100
1111011101011111011111111101111011101011111011111111100011111011101001100110110101101111101110100000
0000010111111111011111111110110111101011111111111000111100101111010101011011010011010100010000000000
Best eval score so far: 0.6643 @ step 3950 epoch 50.64
loss: 0.003980, lagrangian_loss: -0.337197, attention_score_distillation_loss: 0.000985
loss: 0.914364, lagrangian_loss: -0.357105, attention_score_distillation_loss: 0.000986
----------------------------------------------------------------------
time: 2023-07-19 15:27:28
Evaluating: accuracy: 0.6318, eval_loss: 2.4607, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.6326, expected_sparsity: 0.6257, expected_sequence_sparsity: 0.9046, target_sparsity: 0.59, step: 4350
lambda_1: -13.9215, lambda_2: 25.9556 lambda_3: 0.0000
train remain: [0.68 0.72 0.64 0.57 0.95 0.68 0.64 0.72 0.6 ]
infer remain: [0.67, 0.7, 0.61, 0.56, 1.0, 0.68, 0.63, 0.71, 0.6]
layerwise remain: [1.0, 1.0, 1.0, 0.67, 0.47, 0.29, 0.16, 0.16, 0.11, 0.07, 0.05, 0.03]
1111111111111011111111101111100011011011100111100010110110111011111111110101010001010101011000001010
1110111011111111110110100111110111111111110111111111001110011011110100111110100001011001011011001100
1011111111111111111111111110111111111100111101010010110111001001101101100010001000100000001010000010
1101111111111111001001111111101011011001111101111000100000110101011100111111010101100000000000000000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011110111111111111111001001111010101011111111001000011101011111110011011011000000
1011111111111011111111111111011101001010111111111110110001000011111010100101011001100000000110010100
1111011101011111011111111101111011101011111011111111100011111011101001100110110101101111101110100000
0000010111111111011111111110110111101011111111111000111100101111010101011011010011010100010000000000
Best eval score so far: 0.6643 @ step 3950 epoch 50.64
loss: 0.004179, lagrangian_loss: -0.348407, attention_score_distillation_loss: 0.000986
ETA: 0:16:19 | Epoch 55 finished. Took 40.31 seconds.
loss: 0.003455, lagrangian_loss: -0.360717, attention_score_distillation_loss: 0.000986
----------------------------------------------------------------------
time: 2023-07-19 15:27:53
Evaluating: accuracy: 0.6498, eval_loss: 2.3405, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.6404, expected_sparsity: 0.6308, expected_sequence_sparsity: 0.9059, target_sparsity: 0.59, step: 4400
lambda_1: -12.9876, lambda_2: 27.1190 lambda_3: 0.0000
train remain: [0.68 0.72 0.64 0.56 0.95 0.68 0.64 0.72 0.6 ]
infer remain: [0.66, 0.69, 0.61, 0.55, 1.0, 0.67, 0.63, 0.71, 0.6]
layerwise remain: [1.0, 1.0, 1.0, 0.66, 0.46, 0.28, 0.15, 0.15, 0.1, 0.06, 0.05, 0.03]
1111111111111011111111101111100011011011100111100010110110111011111111110101010001010101010000001010
1110111011111111110110100111110111111111110111111111001110011011110100111110100001001001011011001100
1011111111111111111111111110111111111100111101010010110111001001101101100010001000100000001010000010
1101111111111111001001111111101011011001111101111000100000110101010100111111010101100000000000000000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011110111111111111111001001111010101011111111001000011101010111110011011011000000
1011111111111011111111111111011101001010111111111110110001000011111010100101011001100000000110010100
1111011101011111011111111101111011101011111011111111100011111011101001100110110101101111101110100000
0000010111111111011111111110110111101011111111111000111100101111010101011011010011010100010000000000
Best eval score so far: 0.6643 @ step 3950 epoch 50.64
loss: 0.004634, lagrangian_loss: -0.363920, attention_score_distillation_loss: 0.000985
loss: 0.540920, lagrangian_loss: -0.363997, attention_score_distillation_loss: 0.000985
ETA: 0:15:37 | Epoch 56 finished. Took 39.62 seconds.
----------------------------------------------------------------------
time: 2023-07-19 15:28:19
Evaluating: accuracy: 0.6137, eval_loss: 2.4352, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.6404, expected_sparsity: 0.6308, expected_sequence_sparsity: 0.9059, target_sparsity: 0.59, step: 4450
lambda_1: -12.0277, lambda_2: 28.3537 lambda_3: 0.0000
train remain: [0.67 0.71 0.63 0.56 0.95 0.68 0.64 0.72 0.6 ]
infer remain: [0.66, 0.69, 0.61, 0.55, 1.0, 0.67, 0.63, 0.71, 0.6]
layerwise remain: [1.0, 1.0, 1.0, 0.66, 0.46, 0.28, 0.15, 0.15, 0.1, 0.06, 0.05, 0.03]
1111111111111011111111101111100011011011100111100010110110111011111111110101010001010101010000001010
1110111011111111110110100111110111111111110111111111001110011011110100111110100001001001011011001100
1011111111111111111111111110111111111100111101010010110111001001101101100010001000100000001010000010
1101111111111111001001111111101011011001111101111000100000110101010100111111010101100000000000000000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011110111111111111111001001111010101011111111001000011101010111110011011011000000
1011111111111011111111111111011101001010111111111110110001000011111010100101011001100000000110010100
1111011101011111011111111101111011101011111011111111100011111011101001100110110101101111101110100000
0000010111111111011111111110110111101011111111111000111100101111010101011011010011010100010000000000
Best eval score so far: 0.6643 @ step 3950 epoch 50.64
loss: 0.128388, lagrangian_loss: -0.345102, attention_score_distillation_loss: 0.000982
loss: 0.037993, lagrangian_loss: -0.356329, attention_score_distillation_loss: 0.000985
----------------------------------------------------------------------
time: 2023-07-19 15:28:45
Evaluating: accuracy: 0.6173, eval_loss: 2.4553, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.643, expected_sparsity: 0.6344, expected_sequence_sparsity: 0.9068, target_sparsity: 0.59, step: 4500
lambda_1: -11.0517, lambda_2: 29.6301 lambda_3: 0.0000
train remain: [0.67 0.71 0.63 0.56 0.95 0.68 0.63 0.72 0.6 ]
infer remain: [0.65, 0.69, 0.6, 0.55, 1.0, 0.67, 0.63, 0.71, 0.59]
layerwise remain: [1.0, 1.0, 1.0, 0.65, 0.45, 0.27, 0.15, 0.15, 0.1, 0.06, 0.04, 0.03]
1111111111111011111111101111000011011011100111100010110110111011111111110101010001010101010000001010
1110111011111111110110100111110111111111110111111111001110011011110100111110100001001001011011001100
1010111111111111111111111110111111111100111101010010110111001001101101100010001000100000001010000010
1101111111111111001001111111101011011001101101111000100000110101011100111111010101100000000000000000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011110111111111111111001001111010101011111111001000011101010111110011011011000000
1011111111111011111111111111011101001010111111111110110001000011111010100101011001100000000110010100
1111011101011111011111111101111011101011111011111111100011111011101001100110110101101111101110100000
0000010111111111011111111110110111101011111011111000111100101111010101011011010011010100010000000000
Best eval score so far: 0.6643 @ step 3950 epoch 50.64
loss: 0.003925, lagrangian_loss: -0.350756, attention_score_distillation_loss: 0.000985
ETA: 0:14:56 | Epoch 57 finished. Took 40.8 seconds.
loss: 0.009974, lagrangian_loss: -0.340278, attention_score_distillation_loss: 0.000983
----------------------------------------------------------------------
time: 2023-07-19 15:29:11
Evaluating: accuracy: 0.6318, eval_loss: 2.4038, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.6468, expected_sparsity: 0.6367, expected_sequence_sparsity: 0.9074, target_sparsity: 0.59, step: 4550
lambda_1: -10.0790, lambda_2: 30.8959 lambda_3: 0.0000
train remain: [0.66 0.71 0.62 0.55 0.94 0.68 0.63 0.72 0.6 ]
infer remain: [0.65, 0.68, 0.6, 0.54, 1.0, 0.67, 0.63, 0.71, 0.59]
layerwise remain: [1.0, 1.0, 1.0, 0.65, 0.44, 0.27, 0.14, 0.14, 0.1, 0.06, 0.04, 0.03]
1111111111111011111111101111000011011011100111100010110110111011111111110101010001010101010000001010
1110111011111111110110100011110111111111110111111111001110011011110100111110100001001001011011001100
1010111111111111111111111110111111111100111101010010110111001001101101100010001000100000001010000010
1101111111111111001001111111101011011001101101011000100000110101011100111111010101100000000000000000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011110111111111111111001001111010101011111111001000011101010111110011011011000000
1011111111111011111111111111011101001010111111111110110001000011111010100101011001100000000110010100
1111011101011111011111111101111011101011111011111111100011111011101001100110110101101111101110100000
0000010111111111011111111110110111101011111011111000111100101111010101011011010011010100010000000000
Best eval score so far: 0.6643 @ step 3950 epoch 50.64
loss: 0.071053, lagrangian_loss: -0.338397, attention_score_distillation_loss: 0.000986
loss: 0.004020, lagrangian_loss: -0.309773, attention_score_distillation_loss: 0.000985
----------------------------------------------------------------------
time: 2023-07-19 15:29:37
Evaluating: accuracy: 0.6245, eval_loss: 2.4036, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.6468, expected_sparsity: 0.6369, expected_sequence_sparsity: 0.9074, target_sparsity: 0.59, step: 4600
lambda_1: -9.1241, lambda_2: 32.1191 lambda_3: 0.0000
train remain: [0.66 0.7 0.62 0.55 0.94 0.68 0.63 0.72 0.59]
infer remain: [0.65, 0.68, 0.6, 0.54, 1.0, 0.67, 0.62, 0.71, 0.59]
layerwise remain: [1.0, 1.0, 1.0, 0.65, 0.44, 0.27, 0.14, 0.14, 0.1, 0.06, 0.04, 0.02]
1111111111111011111111101111000011011011100111100010110110111011111111110101010001010101010000001010
1110111011111111110110100011110111111111110111111111001110011011110100111110100001001001011011001100
1010111111111111111111111110111111111100111101010010110111001001001101100010001000101000001010000010
1101111111111111001001111111101011011001101101011000100000110101011100111111010101100000000000000000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011110111111111111111001001111010101011111111001000011101010111110011011011000000
1011111111111011111111111111011100001010111111111110110001000011111010100101011001100000000110010100
1111011101011111011111111101111011101011111011111111100011111011101001100110110101101111101110100000
0000010111111111011111111110110111101011111011111000111100101111010101011011010011010100010000000000
Best eval score so far: 0.6643 @ step 3950 epoch 50.64
loss: 0.011813, lagrangian_loss: -0.303010, attention_score_distillation_loss: 0.000985
ETA: 0:14:16 | Epoch 58 finished. Took 40.74 seconds.
loss: 0.389025, lagrangian_loss: -0.289140, attention_score_distillation_loss: 0.000986
----------------------------------------------------------------------
time: 2023-07-19 15:30:02
Evaluating: accuracy: 0.6173, eval_loss: 2.4259, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.6494, expected_sparsity: 0.6379, expected_sequence_sparsity: 0.9077, target_sparsity: 0.59, step: 4650
lambda_1: -8.1995, lambda_2: 33.2752 lambda_3: 0.0000
train remain: [0.66 0.7 0.62 0.55 0.94 0.68 0.63 0.72 0.59]
infer remain: [0.65, 0.68, 0.59, 0.54, 1.0, 0.67, 0.62, 0.71, 0.59]
layerwise remain: [1.0, 1.0, 1.0, 0.65, 0.44, 0.26, 0.14, 0.14, 0.09, 0.06, 0.04, 0.02]
1111111111111011111111101111000011011011100111100010110110111011111111110101010001010101010000001010
1110111011111111110110100011110111111111110111111111001110011011110100111110100001001001011011001100
1010111111111111111111111110111111111100111101010010110111001001001101100010001000100000001010000010
1101111111111111001001111111101011011001101101011000100000110101011100111111010101100000000000000000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011110111111111111111001001111010101011111111001000011101010111110011011011000000
1011111111111011111111111111011100001010111111111110110001000011111010100101011001100000000110010100
1111011101011111011111111101111011101011111011111111100011111011101001100110110101101111101110100000
0000010111111111011111111110110111101011111011111000111100101111010101011011010011010100010000000000
Best eval score so far: 0.6643 @ step 3950 epoch 50.64
loss: 0.005033, lagrangian_loss: -0.275261, attention_score_distillation_loss: 0.000986
loss: 0.024751, lagrangian_loss: -0.249146, attention_score_distillation_loss: 0.000983
ETA: 0:13:34 | Epoch 59 finished. Took 39.4 seconds.
----------------------------------------------------------------------
time: 2023-07-19 15:30:28
Evaluating: accuracy: 0.6173, eval_loss: 2.492, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.6494, expected_sparsity: 0.6379, expected_sequence_sparsity: 0.9077, target_sparsity: 0.59, step: 4700
lambda_1: -7.3059, lambda_2: 34.3695 lambda_3: 0.0000
train remain: [0.66 0.7 0.62 0.55 0.93 0.68 0.63 0.72 0.59]
infer remain: [0.65, 0.68, 0.59, 0.54, 1.0, 0.67, 0.62, 0.71, 0.59]
layerwise remain: [1.0, 1.0, 1.0, 0.65, 0.44, 0.26, 0.14, 0.14, 0.09, 0.06, 0.04, 0.02]
1111111111111011111111101111000011011011100111100010110110111011111111110101010001010101010000001010
1110111011111111110110100011110111111111110111111111001110011011110100111110100001001001011011001100
1010111111111111111111111110111111111100111101010010110111001001001101100010001000100000001010000010
1101111111111111001001111111101011011001111101011000100000110101011100101111010101100000000000000000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011110111111111111111001001111010101011111111001000011101010111110011011011000000
1011111111111011111111111111011100001010111111111110110001000011111010100101011001100000000110010100
1111011101011111011111111101111011101011111011111111100011111011101001100110110101101111101110100000
0000010111111111011111111110110111101011111011111000111100101111010101011011010011010100010000000000
Best eval score so far: 0.6643 @ step 3950 epoch 50.64
loss: 0.006469, lagrangian_loss: -0.239501, attention_score_distillation_loss: 0.000986
loss: 0.007700, lagrangian_loss: -0.224145, attention_score_distillation_loss: 0.000985
----------------------------------------------------------------------
time: 2023-07-19 15:30:54
Evaluating: accuracy: 0.6137, eval_loss: 2.4349, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6507, expected_sparsity: 0.6442, expected_sequence_sparsity: 0.9093, target_sparsity: 0.59, step: 4750
lambda_1: -6.4449, lambda_2: 35.4011 lambda_3: 0.0000
train remain: [0.66 0.7 0.62 0.55 0.93 0.67 0.63 0.71 0.59]
infer remain: [0.64, 0.68, 0.59, 0.54, 0.87, 0.67, 0.62, 0.71, 0.59]
layerwise remain: [1.0, 1.0, 1.0, 0.64, 0.44, 0.26, 0.14, 0.12, 0.08, 0.05, 0.04, 0.02]
1111111111111011111111101111000011011011100111100010110110111011111110110101010001010101010000001010
1110111011111111110110100011110111111111110111111111001110011011110100111110100001001001011011001100
1010111111111111111111111110111111111100111101010010110111001001001101100010001000100000001010000010
1101111111111111001001111111101011011001111101011000100000110101011100101111010101100000000000000000
1111111111111111111111111101111111111111111101111001111111111111011111111111111101111111111010001000
1010111111101101111011110111111111111111001001111010101011111111001000011101010111110011011011000000
1011111111111011111111111111011100001010111111111110110001000011111010100101011001100000000110010100
1111011101011111011111111101111011101011111011111111100011111011101001100110110101101111101110100000
0000010111111111011111111110110111101011111011111000111100101111010101011011010011010100010000000000
Best eval score so far: 0.6643 @ step 3950 epoch 50.64
loss: 0.483426, lagrangian_loss: -0.201096, attention_score_distillation_loss: 0.000986
ETA: 0:12:54 | Epoch 60 finished. Took 40.94 seconds.
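At step 4750 the fifth prune location finally flips from False to True in token_prune_loc, exactly when its infer remain drops from 1.0 to 0.87 and its mask row stops being all 1s. That suggests the flag simply marks locations whose deterministic mask actually drops at least one bin; a sketch of that reading, inferred from the records rather than quoted from the code:

```python
def token_prune_loc(mask_rows: list) -> list:
    # mask_rows: per-location '0'/'1' strings as printed in the log.
    # A location counts as actively pruning once any bin is dropped.
    return ["0" in row for row in mask_rows]
```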
loss: 0.333813, lagrangian_loss: -0.191403, attention_score_distillation_loss: 0.000986
----------------------------------------------------------------------
time: 2023-07-19 15:31:20
Evaluating: accuracy: 0.6173, eval_loss: 2.5087, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6507, expected_sparsity: 0.6442, expected_sequence_sparsity: 0.9093, target_sparsity: 0.59, step: 4800
lambda_1: -5.6111, lambda_2: 36.3844 lambda_3: 0.0000
train remain: [0.66 0.7 0.62 0.55 0.93 0.67 0.63 0.71 0.59]
infer remain: [0.64, 0.68, 0.59, 0.54, 0.87, 0.67, 0.62, 0.71, 0.59]
layerwise remain: [1.0, 1.0, 1.0, 0.64, 0.44, 0.26, 0.14, 0.12, 0.08, 0.05, 0.04, 0.02]
1111111111111011111111101111000011011011100111100010110110111011111110110101010001010101010000001010
1110111011111111110110100011110111111111110111111111001110011011110100111110100001001001011011001100
1010111111111111111111111110111111111100111101010010110111001001001101100010001000100000001010000010
1101111111111111001001111111101011011001111101011000100000110101011100101111010101100000000000000000
1111111111111111111111111101111111111111111101111001111111111111011111111111111101111111111010001000
1010111111101101111011110111111111111111001001111010101011111111001000011101010111110011011011000000
1011111111111011111111111111011100001010111111111110110001000011111010100101011001100000000110010100
1111011101011111011111111101111011101011111011111111100011111011101001100110110101101111101110100000
0000010111111111011111111110110111101011111011111000111100101111010101011011010011010100010000000000
Best eval score so far: 0.6643 @ step 3950 epoch 50.64
loss: 0.002886, lagrangian_loss: -0.173184, attention_score_distillation_loss: 0.000986
loss: 0.002963, lagrangian_loss: -0.152122, attention_score_distillation_loss: 0.000986
ETA: 0:12:13 | Epoch 61 finished. Took 38.96 seconds.
----------------------------------------------------------------------
time: 2023-07-19 15:31:45
Evaluating: accuracy: 0.6498, eval_loss: 2.2829, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6507, expected_sparsity: 0.6442, expected_sequence_sparsity: 0.9093, target_sparsity: 0.59, step: 4850
lambda_1: -4.8085, lambda_2: 37.3120 lambda_3: 0.0000
train remain: [0.66 0.7 0.62 0.54 0.93 0.67 0.63 0.71 0.59]
infer remain: [0.64, 0.68, 0.59, 0.54, 0.87, 0.67, 0.62, 0.71, 0.59]
layerwise remain: [1.0, 1.0, 1.0, 0.64, 0.44, 0.26, 0.14, 0.12, 0.08, 0.05, 0.04, 0.02]
1111111111111011111111101111000011011011100111100010110110111011111110110101010001010101010000001010
1110111011111111110110100011110111111111110111111111001110011011110100111110100001001001011011001100
1010111111111111111111111110111111111100111101010010110111001001001101100010001000100000001010000010
1101111111111111001001111111101011011001111101011000100000110101011100101111010101100000000000000000
1111111111111111111111111101111111111111111101111001111111111111011111111111111101111111111010001000
1010111111101101111011111111111111111111001001111010100011111111001000011101010111110011011011000000
1011111111111011111111111111011101001010111111111110100001000011111010100101011001100000000110010100
1111011101011111011111111101111011101011111011111111100011111011101001100110110101101111101110100000
0000010111111111011111111110110111101011111011111000111100101111010101011011010011010100010000000000
Best eval score so far: 0.6643 @ step 3950 epoch 50.64
loss: 0.003775, lagrangian_loss: -0.137805, attention_score_distillation_loss: 0.000985
loss: 0.453928, lagrangian_loss: -0.117485, attention_score_distillation_loss: 0.000986
----------------------------------------------------------------------
time: 2023-07-19 15:32:11
Evaluating: accuracy: 0.6245, eval_loss: 2.5208, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.652, expected_sparsity: 0.6448, expected_sequence_sparsity: 0.9095, target_sparsity: 0.59, step: 4900
lambda_1: -4.0345, lambda_2: 38.1898 lambda_3: 0.0000
train remain: [0.66 0.7 0.62 0.54 0.93 0.67 0.63 0.71 0.59]
infer remain: [0.64, 0.68, 0.59, 0.53, 0.87, 0.67, 0.62, 0.71, 0.59]
layerwise remain: [1.0, 1.0, 1.0, 0.64, 0.44, 0.26, 0.14, 0.12, 0.08, 0.05, 0.03, 0.02]
1111111111111011111111101111000011011011100111100010110110111011111110110101010001010101010000001010
1110111011111111110110100011110111111111110111111111001110011011110100111110100001001001011011001100
1010111111111111111111111110111111111100111101010010110111001001001101100010001000100000001010000010
1101111111111111001001111111101011011001101101011000100000110101011100101111010101100000000000000000
1111111111111111111111111101111111111111111101111001111111111111011111111111111101111111111010001000
1010111111101101111011111111111111111111001001111010100011111111001000011101010111110011011011000000
1011111111111011111111111111011101001010111111111110100001000011111010100101011001100000000110010100
1111011101011111011111111101111011101011111011111111100011111011101001100110110101101111101110100000
0000010111111111011111111110110111101011111011111000111100101111010101011011010011010100010000000000
Best eval score so far: 0.6643 @ step 3950 epoch 50.64
loss: 0.013513, lagrangian_loss: -0.100817, attention_score_distillation_loss: 0.000985
ETA: 0:11:32 | Epoch 62 finished. Took 41.15 seconds.
loss: 0.003451, lagrangian_loss: -0.085410, attention_score_distillation_loss: 0.000985
----------------------------------------------------------------------
time: 2023-07-19 15:32:37
Evaluating: accuracy: 0.5957, eval_loss: 2.5711, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.652, expected_sparsity: 0.6448, expected_sequence_sparsity: 0.9095, target_sparsity: 0.59, step: 4950
lambda_1: -3.2855, lambda_2: 39.0262
lambda_3: 0.0000
train remain: [0.66 0.7 0.62 0.54 0.93 0.68 0.63 0.71 0.59]
infer remain: [0.64, 0.68, 0.59, 0.53, 0.87, 0.67, 0.62, 0.71, 0.59]
layerwise remain: [1.0, 1.0, 1.0, 0.64, 0.44, 0.26, 0.14, 0.12, 0.08, 0.05, 0.03, 0.02]
1111111111111011111111101111000011011011100111100010110110111011111110110101010001010101010000001010
1110111011111111110110100011110111111111110111111111001110011011110100111110100001001001011011001100
1010111111111111111111111110111111111100111101010010110111001001001101100010001000100000001010000010
1101111111111111001001111111101011011001101101011000100000110101011100101111010101100000000000000000
1111111111111111111111111101111111111111111101111001111111111111011111111111111101111111111010001000
1010111111101101111011111111111111111111001001111010100011111111001000011101010111110011011011000000
1011111111111011111111111111011101001010111111111110100001000011111010100101011001100000000110010100
1111011101011111011111111101111011101011111011111111100011111011101001100110110101101111101110100000
0000010111111111011111111110110111101011111011111000111100101111010101011011010011010100010000000000
Best eval score so far: 0.6643 @ step 3950 epoch 50.64
loss: 0.320035, lagrangian_loss: -0.069144, attention_score_distillation_loss: 0.000986
loss: 0.002735, lagrangian_loss: -0.052947, attention_score_distillation_loss: 0.000985
ETA: 0:10:51 | Epoch 63 finished. Took 39.15 seconds.
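Each evaluation block prints nine rows of 100 binary flags: one row per pruning location, one column per token bin, with '1' marking a bin that survives at inference, so each row's mean is the corresponding infer remain entry. A hedged sketch of how such a deterministic mask can be derived from per-bin log-alpha parameters, assuming standard hard-concrete gates (Louizos et al., 2018); the repository's exact constants and thresholds may differ:

import torch

GAMMA, ZETA = -0.1, 1.1  # usual hard-concrete stretch limits (assumption)

def deterministic_mask(token_loga: torch.Tensor) -> torch.Tensor:
    # token_loga: [num_pruned_layers, num_bins]. Stretch the sigmoid gate,
    # clamp to [0, 1], then threshold for the test-time (inference) mask.
    s = torch.sigmoid(token_loga) * (ZETA - GAMMA) + GAMMA
    return (s.clamp(0.0, 1.0) > 0.5).int()

# Per-layer `infer remain` would then be deterministic_mask(...).float().mean(dim=1).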
----------------------------------------------------------------------
time: 2023-07-19 15:33:02
Evaluating: accuracy: 0.6173, eval_loss: 2.3944, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.652, expected_sparsity: 0.6448, expected_sequence_sparsity: 0.9095, target_sparsity: 0.59, step: 5000
lambda_1: -2.5592, lambda_2: 39.8266
lambda_3: 0.0000
train remain: [0.66 0.7 0.62 0.54 0.93 0.68 0.63 0.71 0.59]
infer remain: [0.64, 0.68, 0.59, 0.53, 0.87, 0.67, 0.62, 0.71, 0.59]
layerwise remain: [1.0, 1.0, 1.0, 0.64, 0.44, 0.26, 0.14, 0.12, 0.08, 0.05, 0.03, 0.02]
1111111111111011111111101111000011011011100111100010110110111011111110110101010001010101010000001010
1110111011111111110110100011110111111111110111111111001110011011110100111110100001001001011011001100
1010111111111111111111111110111111111100111101010010110111001001001101100010001000100000001010000010
1101111111111111001001111111101011011001101101011000100000110101011100101111010101100000000000000000
1111111111111111111111111101111111111111111101111001111111111111011111111111111101111111111010001000
1010111111101101111011111111111111111111001001111010100011111111001000011101010111110011011011000000
1011111111111011111111111111011101001010111111111110100001000011111010100101011001100000000110010100
1111011101011111011111111101111011101011111011111111100011111011101001100110110101101111101110100000
0000010111111111011111111110110111101011111011111000111100101111010101011011010011010100010000000000
Best eval score so far: 0.6643 @ step 3950 epoch 50.64
loss: 0.003159, lagrangian_loss: -0.037125, attention_score_distillation_loss: 0.000986
loss: 0.006786, lagrangian_loss: -0.021828, attention_score_distillation_loss: 0.000987
----------------------------------------------------------------------
time: 2023-07-19 15:33:28
Evaluating: accuracy: 0.6137, eval_loss: 2.45, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6507, expected_sparsity: 0.6442, expected_sequence_sparsity: 0.9093, target_sparsity: 0.59, step: 5050
lambda_1: -1.8595, lambda_2: 40.5815
lambda_3: 0.0000
train remain: [0.66 0.7 0.62 0.55 0.93 0.68 0.63 0.71 0.59]
infer remain: [0.64, 0.68, 0.59, 0.54, 0.87, 0.67, 0.62, 0.71, 0.59]
layerwise remain: [1.0, 1.0, 1.0, 0.64, 0.44, 0.26, 0.14, 0.12, 0.08, 0.05, 0.04, 0.02]
1111111111111011111111101111000011011011100111100010110110111011111110110101010001010101010000001010
1110111011111111110110100011110111111111110111111111001110011011110100111110100001001001011011001100
1010111111111111111111111110111111111100111101010010110111001001001101100010001000100000001010000010
1101111111111111001001111111101011011001111101011000100000110101011100101111010101100000000000000000
1111111111111111111111111101111111111111111101111001111111111111011111111111111101111111111010001000
1010111111101101111011111111111111111111001001111010100011111111001000011101010111110011011011000000
1011111111111011111111111111011101001010111111111110100001000011111010100101011001100000000110010100
1111011101011111011111111101111011101011111011111111100011111011101001100110110101101111101110100000
0000010111111111011111111110110111101011111011111000111100101111010101011011010011010100010000000000
Best eval score so far: 0.6643 @ step 3950 epoch 50.64
loss: 0.008101, lagrangian_loss: -0.011321, attention_score_distillation_loss: 0.000981
ETA: 0:10:10 | Epoch 64 finished. Took 40.78 seconds.
loss: 0.443700, lagrangian_loss: 0.007804, attention_score_distillation_loss: 0.000987
----------------------------------------------------------------------
time: 2023-07-19 15:33:53
Evaluating: accuracy: 0.6209, eval_loss: 2.407, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6507, expected_sparsity: 0.6442, expected_sequence_sparsity: 0.9093, target_sparsity: 0.59, step: 5100
lambda_1: -1.1800, lambda_2: 41.3047
lambda_3: 0.0000
train remain: [0.66 0.7 0.62 0.55 0.93 0.68 0.63 0.71 0.59]
infer remain: [0.64, 0.68, 0.59, 0.54, 0.87, 0.67, 0.62, 0.71, 0.59]
layerwise remain: [1.0, 1.0, 1.0, 0.64, 0.44, 0.26, 0.14, 0.12, 0.08, 0.05, 0.04, 0.02]
1111111111111011111111101111000011011011100111100010110110111011111110110101010001010101010000001010
1110111011111111110110100011110111111111110111111111001110011011110100111110100001001001011011001100
1010111111111111111111111110111111111100111101010010110111001001001101100010001000001000001010000010
1101111111111111001001111111101011011001101101011000100000110101011100111111010101100000000000000000
1111111111111111111111111101111111111111111101111001111111111111011111111111111101111111111010001000
1010111111101101111011111111111111111111001001111010100011111111001000011101010111110011011011000000
1011111111111011111111111111011101001010111111111110100001000011111010100101011001100000000110010100
1111011101011111011111111101111011101011111011111111100011111011101001100110110101101111101110100000
0000010111111111011111111110110111101011111011111000111100101111010101011011010011010100010000000000
Best eval score so far: 0.6643 @ step 3950 epoch 50.64
loss: 0.287923, lagrangian_loss: 0.021347, attention_score_distillation_loss: 0.000987
loss: 0.003075, lagrangian_loss: 0.028340, attention_score_distillation_loss: 0.000984
ETA: 0:09:29 | Epoch 65 finished. Took 39.07 seconds.
----------------------------------------------------------------------
time: 2023-07-19 15:34:19
Evaluating: accuracy: 0.6282, eval_loss: 2.315, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6507, expected_sparsity: 0.6419, expected_sequence_sparsity: 0.9087, target_sparsity: 0.59, step: 5150
lambda_1: -0.5346, lambda_2: 41.9669
lambda_3: 0.0000
train remain: [0.66 0.7 0.62 0.55 0.93 0.68 0.63 0.71 0.59]
infer remain: [0.65, 0.68, 0.59, 0.54, 0.87, 0.67, 0.62, 0.71, 0.59]
layerwise remain: [1.0, 1.0, 1.0, 0.65, 0.44, 0.26, 0.14, 0.12, 0.08, 0.05, 0.04, 0.02]
1111111111111011111111101111000011011011100111100010110110111011111111110101010001010101010000001010
1110111011111111110110100011110111111111110111111111001110011011110100111110100001001001011011001100
1010111111111111111111111110111111111100111101010010110111001001001101100010001000001000001010000010
1101111111111111001001111111101011011001101101011000100000110101011100111111010101100000000000000000
1111111111111111111111111101111111111111111101111001111111111111011111111111111101111111111010001000
1010111111101101111011111111111111111111001001111010100011111111001000011101010111110011011011000000
1011111111111011111111111111011101001010111111111110100001000011111010100101011001100000000110010100
1111011101011111011111111101111011101011111011111111100011111011101001100110110101101111101110100000
0000010111111111011111111110110111101011111011111000111100101111010101011011010011010100010000000000
Best eval score so far: 0.6643 @ step 3950 epoch 50.64
loss: 0.280213, lagrangian_loss: 0.041546, attention_score_distillation_loss: 0.000984
loss: 0.003004, lagrangian_loss: 0.051726, attention_score_distillation_loss: 0.000986
----------------------------------------------------------------------
time: 2023-07-19 15:34:45
Evaluating: accuracy: 0.6318, eval_loss: 2.277, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6507, expected_sparsity: 0.6419, expected_sequence_sparsity: 0.9087, target_sparsity: 0.59, step: 5200
lambda_1: 0.0847, lambda_2: 42.5851
lambda_3: 0.0000
train remain: [0.67 0.71 0.62 0.55 0.93 0.68 0.63 0.71 0.59]
infer remain: [0.65, 0.68, 0.59, 0.54, 0.87, 0.67, 0.62, 0.71, 0.59]
layerwise remain: [1.0, 1.0, 1.0, 0.65, 0.44, 0.26, 0.14, 0.12, 0.08, 0.05, 0.04, 0.02]
1111111111111011111111101111000011011011100111100010110110111011111111110101010001010101010000001010
1110111011111111110110100011110111111111110111111111001110011011110100111110100001001001011011001100
1010111111111111111111111110111111111100111101010010110111001001001101100010001000001000001010000010
1101111111111111001001111111101011011001101101011000100000110101011100111111010101100000000000000000
1111111111111111111111111101111111111111111101111001111111111111011111111111111101111111111010001000
1010111111101101111011111111111111111111001001111010100011111111001000011101010111110011011011000000
1011111111111011111111111111011101001010111111111110100001000011111010100101011001100000000110010100
1111011101011111011111111101111011101011111011111111100011111011101001100110110101101111101110100000
0000010111111111011111111110110111101011111011111000111100101111010101011011010011010100010000000000
Best eval score so far: 0.6643 @ step 3950 epoch 50.64
loss: 0.003429, lagrangian_loss: 0.058079, attention_score_distillation_loss: 0.000978
loss: 0.003739, lagrangian_loss: 0.072819, attention_score_distillation_loss: 0.000984
ETA: 0:08:48 | Epoch 66 finished. Took 40.95 seconds.
----------------------------------------------------------------------
time: 2023-07-19 15:35:11
Evaluating: accuracy: 0.6823, eval_loss: 2.1643, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6494, expected_sparsity: 0.6409, expected_sequence_sparsity: 0.9085, target_sparsity: 0.59, step: 5250
lambda_1: 0.6754, lambda_2: 43.1551
lambda_3: 0.0000
train remain: [0.67 0.71 0.63 0.55 0.93 0.68 0.63 0.72 0.6 ]
infer remain: [0.65, 0.68, 0.6, 0.54, 0.87, 0.67, 0.62, 0.71, 0.59]
layerwise remain: [1.0, 1.0, 1.0, 0.65, 0.44, 0.27, 0.14, 0.12, 0.08, 0.05, 0.04, 0.02]
1111111111111011111111101111000011011011100111100010110110111011111111110101010001010101010000001010
1110111011111111110110100011110111111111110111111111001110011011110100111110100001001001011011001100
1010111111111111111111111110111111111100111101010010110111001001001101100010001000101000001010000010
1101111111111111001001111111101011011001101101011000100000110101011100111111010101100000000000000000
1111111111111111111111111101111111111111111101111001111111111111011111111111111101111111111010001000
1010111111101101111011111111111111111111001001111010100011111111001000011101010111110011011011000000
1011111111111011111111111111011101001010111111111110100001000011111010100101011001100000000110010100
1111011101011111011111111101111011101011111011111111100011111011101001100110110101101111101110100000
0000010111111111011111111110110111101011111011111000111100101111010101011011010011010100010000000000
Best eval score so far: 0.6643 @ step 3950 epoch 50.64
Saving the best model so far: [Epoch 67 | Step: 5250 | MACs sparsity: 0.6494 | Score: 0.6823 | Loss: 2.1643]
loss: 0.004336, lagrangian_loss: 0.076222, attention_score_distillation_loss: 0.000984
loss: 0.003434, lagrangian_loss: 0.093380, attention_score_distillation_loss: 0.000985
----------------------------------------------------------------------
time: 2023-07-19 15:36:10
Evaluating: accuracy: 0.6137, eval_loss: 2.5237, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6494, expected_sparsity: 0.6408, expected_sequence_sparsity: 0.9085, target_sparsity: 0.59, step: 5300
lambda_1: 1.2383, lambda_2: 43.6787
lambda_3: 0.0000
train remain: [0.67 0.71 0.63 0.55 0.94 0.68 0.63 0.72 0.6 ]
infer remain: [0.65, 0.68, 0.6, 0.54, 0.87, 0.67, 0.62, 0.71, 0.6]
layerwise remain: [1.0, 1.0, 1.0, 0.65, 0.44, 0.27, 0.14, 0.12, 0.08, 0.05, 0.04, 0.02]
1111111111111011111111101111000011011011100111100010110110111011111111110101010001010101010000001010
1110111011111111110110100011110111111111110111111111001110011011110100111110100001001001011011001100
1010111111111111111111111110111111111100111101010010110111001001001101100010001000101000001010000010
1101111111111111001001111111101011011001101101011000100000110101011100111111010101100000000000000000
1111111111111111111111111101111111111111111101111001111111111111011111111111111101111111111010001000
1010111111101101111011111111111111111111001001111010100011111111001000011101010111110011011011000000
1011111111111011111111111111011101001010111111111110100001000011111010100101011001100000000110010100
1111011101011111011111111101111011101011111011111111100011111011101001100110110101101111101110100000
0000010111111111011111111110110111101011111111111000111100101111010101011011010011010100010000000000
Best eval score so far: 0.6823 @ step 5250 epoch 67.31
loss: 0.003758, lagrangian_loss: 0.100397, attention_score_distillation_loss: 0.000987
ETA: 0:08:14 | Epoch 67 finished. Took 74.39 seconds.
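The save at step 5250 fires only after the new accuracy beats the previous best (0.6643 at step 3950) while the sparsity budget is met. A sketch of that bookkeeping with hypothetical names; the real script may gate on additional criteria:

best_score = float("-inf")

def maybe_save_best(model, score, macs_sparsity, target_sparsity,
                    epoch, step, loss, output_dir):
    # Assumption: only checkpoints that already satisfy the sparsity target
    # compete on score; otherwise early, barely-pruned models would win.
    global best_score
    if macs_sparsity >= target_sparsity and score > best_score:
        best_score = score
        print(f"Saving the best model so far: [Epoch {epoch} | Step: {step} | "
              f"MACs sparsity: {macs_sparsity} | Score: {score} | Loss: {loss}]")
        model.save_pretrained(output_dir)  # HuggingFace-style checkpointing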
loss: 0.004088, lagrangian_loss: 0.090482, attention_score_distillation_loss: 0.000982
----------------------------------------------------------------------
time: 2023-07-19 15:36:36
Evaluating: accuracy: 0.6101, eval_loss: 2.4932, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.643, expected_sparsity: 0.6343, expected_sequence_sparsity: 0.9068, target_sparsity: 0.59, step: 5350
lambda_1: 1.7714, lambda_2: 44.1531
lambda_3: 0.0000
train remain: [0.67 0.71 0.63 0.56 0.94 0.68 0.63 0.72 0.6 ]
infer remain: [0.65, 0.69, 0.6, 0.55, 1.0, 0.67, 0.63, 0.71, 0.6]
layerwise remain: [1.0, 1.0, 1.0, 0.65, 0.45, 0.27, 0.15, 0.15, 0.1, 0.06, 0.04, 0.03]
1111111111111011111111101111000011011011100111100010110110111011111111110101010001010101010000001010
1110111011111111110110100011110111111111110111111111001110011011110100111110100101001001011011001100
1010111111111111111111111110111111111100111101010010110111001001001101100010001000101000001010000010
1101111111111111001001111111101011011001111101011000100000110101011100111111010101100000000000000000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011111111111111111111001001111010100011111111001000011101010111110011011011000000
1011111111111011111111111111011101001010111111111110110001000011111010100101011001100000000110010100
1111011101011111011111111101111011101011111011111111100011111011101001100110110101101111101110100000
0000010111111111011111111110110111101011111111111000111100101111010101011011010011010100010000000000
Best eval score so far: 0.6823 @ step 5250 epoch 67.31
loss: 0.134332, lagrangian_loss: 0.109142, attention_score_distillation_loss: 0.000986
loss: 0.003198, lagrangian_loss: 0.106465, attention_score_distillation_loss: 0.000986
ETA: 0:07:32 | Epoch 68 finished. Took 39.27 seconds.
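From step 5350 on, the fifth entry of token_prune_loc reads False and the matching mask row is all ones: once every bin of a layer gates open, that layer effectively stops pruning tokens. The flag appears to be derivable from the remain ratio alone (an assumption consistent with every evaluation block in this log):

# `token_prune_loc` as implied by `infer remain` at step 5350 (hypothetical
# reconstruction; a location counts as pruning only if some bins are off).
infer_remain = [0.65, 0.69, 0.6, 0.55, 1.0, 0.67, 0.63, 0.71, 0.6]
token_prune_loc = [ratio < 1.0 for ratio in infer_remain]
print(token_prune_loc)
# [True, True, True, True, False, True, True, True, True]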
----------------------------------------------------------------------
time: 2023-07-19 15:37:02
Evaluating: accuracy: 0.6137, eval_loss: 2.5805, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.6404, expected_sparsity: 0.6308, expected_sequence_sparsity: 0.9059, target_sparsity: 0.59, step: 5400
lambda_1: 2.2698, lambda_2: 44.5713
lambda_3: 0.0000
train remain: [0.68 0.72 0.64 0.56 0.94 0.68 0.64 0.72 0.6 ]
infer remain: [0.66, 0.69, 0.61, 0.55, 1.0, 0.67, 0.63, 0.71, 0.6]
layerwise remain: [1.0, 1.0, 1.0, 0.66, 0.46, 0.28, 0.15, 0.15, 0.1, 0.06, 0.05, 0.03]
1111111111111011111111111111000011011011100111100010110110111011111111110101010001010101010000001010
1110111011111111110110100011110111111111110111111111001110011011110100111110100101001001011011001100
1010111111111111111111111110111111111100111101010010110111001001101101100010001000101000001010000010
1101111111111111001001111111101011011001111101011000100000110101011100111111010101100000000000000000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011111111111111111111001001111010100011111111001000011101010111110011011011000000
1011111111111011111111111111011101001010111111111110110001000011111010100101011001100000000110010100
1111011101011111011111111101111011101011111011111111100011111011101001100110110101101111101110100000
0000010111111111011111111110110111101011111111111000111100101111010101011011010011010100010000000000
Best eval score so far: 0.6823 @ step 5250 epoch 67.31
loss: 0.004231, lagrangian_loss: 0.106258, attention_score_distillation_loss: 0.000986
loss: 0.003570, lagrangian_loss: 0.106881, attention_score_distillation_loss: 0.000984
----------------------------------------------------------------------
time: 2023-07-19 15:37:27
Evaluating: accuracy: 0.6101, eval_loss: 2.5148, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.6404, expected_sparsity: 0.6308, expected_sequence_sparsity: 0.9059, target_sparsity: 0.59, step: 5450
lambda_1: 2.7270, lambda_2: 44.9251
lambda_3: 0.0000
train remain: [0.68 0.72 0.64 0.56 0.94 0.68 0.64 0.72 0.6 ]
infer remain: [0.66, 0.69, 0.61, 0.55, 1.0, 0.67, 0.63, 0.71, 0.6]
layerwise remain: [1.0, 1.0, 1.0, 0.66, 0.46, 0.28, 0.15, 0.15, 0.1, 0.06, 0.05, 0.03]
1111111111111011111111111111000011011011100111100010110110111011111111110101010001010101010000001010
1110111011111111110110100011110111111111110111111111001110011011110100111110100101001001011011001100
1010111111111111111111111110111111111100111101010010110111001001101101100010001000101000001010000010
1101111111111111001001111111101011011001111101011000100000110101011100111111010101100000000000000000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011111111111111111111001001111010100011111111001000011101010111110011011011000000
1011111111111011111111111111011101001010111111111110110001000011111010100101011001100000000110010100
1111011101011111011111111101111011101011111011111111100011111011101001100110110101101111101110100000
0000010111111111011111111110110111101011111111111000111100101111010101011011010011010100010000000000
Best eval score so far: 0.6823 @ step 5250 epoch 67.31
loss: 0.158236, lagrangian_loss: 0.109893, attention_score_distillation_loss: 0.000985
ETA: 0:06:51 | Epoch 69 finished. Took 40.79 seconds.
loss: 0.001853, lagrangian_loss: 0.098316, attention_score_distillation_loss: 0.000985
----------------------------------------------------------------------
time: 2023-07-19 15:37:53
Evaluating: accuracy: 0.6426, eval_loss: 2.2234, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.6352, expected_sparsity: 0.627, expected_sequence_sparsity: 0.9049, target_sparsity: 0.59, step: 5500
lambda_1: 3.1372, lambda_2: 45.2110
lambda_3: 0.0000
train remain: [0.68 0.73 0.65 0.57 0.95 0.69 0.64 0.72 0.6 ]
infer remain: [0.66, 0.7, 0.62, 0.56, 1.0, 0.68, 0.63, 0.71, 0.6]
layerwise remain: [1.0, 1.0, 1.0, 0.66, 0.46, 0.29, 0.16, 0.16, 0.11, 0.07, 0.05, 0.03]
1111111111111011111111111111000011011011100111100010110110111011111111110101010001010101010000001010
1110111011111111110110100011110111111111110111111111001110011011110100111110100101001001011011001101
1010111111111111111111111110111111111100111101010010110111001001101101100010001000101000001010000011
1101111111111111001001111111101011011001111101111000100000110101011100111111010101100000000000000000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011111111111111111111001001111010101011111111001000011101010111110011011011000000
1011111111111011111111111111011101001010111111111110110001000011111010100101011001100000000110010100
1111011101011111011111111101111011101011111011111111100011111011101001100110110101101111101110100000
0000010111111111011111111110110111101011111111111000111100101111010101011011010011010100010000000000
Best eval score so far: 0.6823 @ step 5250 epoch 67.31
loss: 0.003346, lagrangian_loss: 0.099365, attention_score_distillation_loss: 0.000986
loss: 0.002418, lagrangian_loss: 0.093740, attention_score_distillation_loss: 0.000984
ETA: 0:06:10 | Epoch 70 finished. Took 39.22 seconds.
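The nearly constant attention_score_distillation_loss (about 1e-3 throughout) measures how closely the student's token-importance estimates track the teacher's attention. One plausible form, sketched as an MSE between normalized per-token attention mass; the actual objective here may instead be ranking-based (e.g. an NDCG-style top-k criterion):

import torch
import torch.nn.functional as F

def attention_score_distillation(student_attn: torch.Tensor,
                                 teacher_attn: torch.Tensor) -> torch.Tensor:
    # attn tensors: [batch, heads, seq, seq]. Token importance is taken as
    # the attention mass each key position receives, averaged over heads
    # and query positions (an illustrative choice, not the repo's exact one).
    student_scores = student_attn.mean(dim=1).mean(dim=1)  # [batch, seq]
    teacher_scores = teacher_attn.mean(dim=1).mean(dim=1)
    return F.mse_loss(F.normalize(student_scores, dim=-1),
                      F.normalize(teacher_scores, dim=-1))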
----------------------------------------------------------------------
time: 2023-07-19 15:38:19
Evaluating: accuracy: 0.6498, eval_loss: 2.2467, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.6326, expected_sparsity: 0.6243, expected_sequence_sparsity: 0.9042, target_sparsity: 0.59, step: 5550
lambda_1: 3.4941, lambda_2: 45.4272
lambda_3: 0.0000
train remain: [0.69 0.74 0.66 0.57 0.95 0.69 0.64 0.72 0.61]
infer remain: [0.67, 0.7, 0.62, 0.56, 1.0, 0.68, 0.64, 0.71, 0.6]
layerwise remain: [1.0, 1.0, 1.0, 0.67, 0.47, 0.29, 0.16, 0.16, 0.11, 0.07, 0.05, 0.03]
1111111111111011111111111111000011011011100111100010110110111011111111110101010001010101011000001010
1110111011111111110110100011111111111111110111111111001110011011110100111110100001001001011011001101
1011111111111111111111111110111111111100111101010010110111001001101101100010001000101000001010000011
1101111111111111001001111111101011011001111101111000100000110101010100111111010101100000000000000001
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011111111111111111111001001111010101011111111001000011101010111110011011011000000
1011111111111011111111111111011101001010111111111110110001000011111010100101011001100000000111010100
1111011101011111011111111101111011101011111011111111100011111011101001100110110101101111101110100000
0000010111111111011111111110110111101111111111111000111100101111010101011011010011010100010000000000
Best eval score so far: 0.6823 @ step 5250 epoch 67.31
loss: 0.001547, lagrangian_loss: 0.092779, attention_score_distillation_loss: 0.000986
loss: 0.002488, lagrangian_loss: 0.083619, attention_score_distillation_loss: 0.000986
----------------------------------------------------------------------
time: 2023-07-19 15:38:45
Evaluating: accuracy: 0.657, eval_loss: 2.304, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.63, expected_sparsity: 0.6204, expected_sequence_sparsity: 0.9032, target_sparsity: 0.59, step: 5600
lambda_1: 3.8006, lambda_2: 45.5856
lambda_3: 0.0000
train remain: [0.69 0.74 0.66 0.58 0.95 0.69 0.65 0.72 0.61]
infer remain: [0.67, 0.71, 0.63, 0.57, 1.0, 0.68, 0.64, 0.72, 0.61]
layerwise remain: [1.0, 1.0, 1.0, 0.67, 0.48, 0.3, 0.17, 0.17, 0.12, 0.07, 0.05, 0.03]
1111111111111011111111111111000011011011100111100010110110111011111111110101010001010101011000001010
1110111011111111110110100011111111111111110111111111011110011011110100111110100001001001011011001101
1011111111111111111111111110111111111100111101010010110111001001101101100010001000101000001010000011
1101111111111111001001111111111011011001111101111000100000110101011100111111010101100000000000000001
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011111111111111111111001001111010100011111111001000011101010111110011011011000001
1011111111111011111111111111011101001010111111111110110001000011111010100101011001100000000110010101
1111011101011111011111111101111011101011111011111111100011111011111001100110111101101111101110100000
0000010111111111011111111110110111101111111111111000111100101111010101011011010011010100010000000000
Best eval score so far: 0.6823 @ step 5250 epoch 67.31
loss: 0.004084, lagrangian_loss: 0.082704, attention_score_distillation_loss: 0.000987
ETA: 0:05:28 | Epoch 71 finished. Took 40.69 seconds.
loss: 0.007104, lagrangian_loss: 0.071440, attention_score_distillation_loss: 0.000985
----------------------------------------------------------------------
time: 2023-07-19 15:39:09
Evaluating: accuracy: 0.6426, eval_loss: 2.3328, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.6287, expected_sparsity: 0.6173, expected_sequence_sparsity: 0.9024, target_sparsity: 0.59, step: 5650
lambda_1: 4.0550, lambda_2: 45.6941
lambda_3: 0.0000
train remain: [0.7 0.74 0.67 0.58 0.96 0.7 0.65 0.73 0.61]
infer remain: [0.68, 0.71, 0.63, 0.57, 1.0, 0.69, 0.65, 0.72, 0.61]
layerwise remain: [1.0, 1.0, 1.0, 0.68, 0.48, 0.3, 0.17, 0.17, 0.12, 0.08, 0.06, 0.03]
1111111111111011111111111111100011011011100111100010110110111011111111110101010001010101011000001010
1110111011111111110110100011111111111111110111111111001110011011110100111110100001011001011011001101
1011111111111111111111111110111111111100111101010010110111001001101101100010001000101000001010000011
1101111111111111001001111111101011011001111101111000100000110101011100111111010101100000000000000001
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011111111111111111111001001111010101011111111001000011101010111110011011011000001
1011111111111011111111111111011101001010111111111110110001000011111010100101011001100000000110010111
1111011101011111011111111101111011101011111011111111100011111011101001100110111101101111101110100000
0000010111111111011111111110110111101111111111111000111100101111010101011011010011010100010000000000
Best eval score so far: 0.6823 @ step 5250 epoch 67.31
loss: 0.003086, lagrangian_loss: 0.056156, attention_score_distillation_loss: 0.000984
loss: 0.004400, lagrangian_loss: 0.049035, attention_score_distillation_loss: 0.000985
ETA: 0:04:47 | Epoch 72 finished. Took 38.71 seconds.
----------------------------------------------------------------------
time: 2023-07-19 15:39:35
Evaluating: accuracy: 0.6462, eval_loss: 2.3567, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.624, expected_sparsity: 0.6163, expected_sequence_sparsity: 0.9021, target_sparsity: 0.59, step: 5700
lambda_1: 4.2374, lambda_2: 45.7497
lambda_3: 0.0000
train remain: [0.71 0.75 0.67 0.59 0.96 0.7 0.66 0.73 0.62]
infer remain: [0.68, 0.71, 0.63, 0.58, 1.0, 0.69, 0.65, 0.72, 0.62]
layerwise remain: [1.0, 1.0, 1.0, 0.68, 0.48, 0.3, 0.18, 0.18, 0.12, 0.08, 0.06, 0.04]
1111111111111011111111111111000011011011100111100010110110111011111111110101010001010101011100001010
1110111011111111110110100011111111111111110111111111001110011011110100111110100001011001011011001101
1011111111111111111111111110111111111100111101010010110111001001101101100010001000101000001010000011
1101111111111111001001111111101011011001111101111000101000110101011100111111010101100000000000000001
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011111111111111111111001001111010101011111111001000011101010111110011011011000001
1011111111111011111111111111011101001010111111111110110001000011111010100101011001100000000110010111
1111011101011111011111111101111011101011111011111111100011111011101001100110110101101111101110100001
0000010111111111011111111110110111101011111111111000111100101111010101011011010011010100010000000101
Best eval score so far: 0.6823 @ step 5250 epoch 67.31
loss: 0.334937, lagrangian_loss: 0.035039, attention_score_distillation_loss: 0.000984
loss: 0.005062, lagrangian_loss: 0.031952, attention_score_distillation_loss: 0.000986
----------------------------------------------------------------------
time: 2023-07-19 15:40:01
Evaluating: accuracy: 0.6354, eval_loss: 2.3583, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.6175, expected_sparsity: 0.6105, expected_sequence_sparsity: 0.9006, target_sparsity: 0.59, step: 5750
lambda_1: 4.3475, lambda_2: 45.7704
lambda_3: 0.0000
train remain: [0.71 0.75 0.68 0.59 0.96 0.7 0.66 0.73 0.63]
infer remain: [0.69, 0.72, 0.64, 0.58, 1.0, 0.69, 0.65, 0.73, 0.63]
layerwise remain: [1.0, 1.0, 1.0, 0.69, 0.5, 0.32, 0.18, 0.18, 0.13, 0.08, 0.06, 0.04]
1111111111111011111111111111100011011011100111100010110110111011111111110101010001010101011000001011
1110111011111111110110100011111111111111110111111111011110011011110100111110100001011001011011001101
1011111111111111111111111110111111111100111101010010110111001011101101100010001000101000001010000011
1101111111111111001001111111101011011001111101111000101000110101011100111111010101100000000000000001
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011110111111111111111001001111010101011111111001000011101011111110011011011000001
1011111111111011111111111111011101001010111111111110110001000011111010100101011001100000000110010111
1111011101011111011111111101111011101011111011111111100011111011101001100110111101101111101110100001
0000010111111111011111111110110111101111111111111000111100101111010101011011010011010100010000000101
Best eval score so far: 0.6823 @ step 5250 epoch 67.31
loss: 0.003359, lagrangian_loss: 0.018613, attention_score_distillation_loss: 0.000986
ETA: 0:04:06 | Epoch 73 finished. Took 40.27 seconds.
loss: 0.006600, lagrangian_loss: 0.009017, attention_score_distillation_loss: 0.000986
----------------------------------------------------------------------
time: 2023-07-19 15:40:27
Evaluating: accuracy: 0.6209, eval_loss: 2.4687, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.6175, expected_sparsity: 0.6098, expected_sequence_sparsity: 0.9005, target_sparsity: 0.59, step: 5800
lambda_1: 4.3975, lambda_2: 45.7758
lambda_3: 0.0000
train remain: [0.72 0.76 0.68 0.59 0.96 0.71 0.66 0.74 0.64]
infer remain: [0.69, 0.72, 0.64, 0.58, 1.0, 0.7, 0.66, 0.73, 0.64]
layerwise remain: [1.0, 1.0, 1.0, 0.69, 0.5, 0.32, 0.18, 0.18, 0.13, 0.09, 0.06, 0.04]
1111111111111011111111111111100011011011100111100010110110111011111111110101010001010101011000001011
1110111011111111110110100111110111111111110111111111011110011011110100111110100001011001011011001101
1011111111111111111111111110111111111100111101010010110111001011101101100010001000101000001010000011
1101111111111111001001111111101011011001111101111000101000110101011100111111010101100000000000000001
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011110111111111111111001001111010111011111111001000011101011111110011011011000001
1011111111111011111111111111011101001010111111111110110001000011111010100101011001100000000111010111
1111011101011111011111111101111011101011111011111111100011111011101001100110111101101111101110100001
0000010111111111011111111110110111101111111111111000111100101111010101011011010011010100010001000101
Best eval score so far: 0.6823 @ step 5250 epoch 67.31
loss: 0.004027, lagrangian_loss: -0.008580, attention_score_distillation_loss: 0.000984
loss: 0.002055, lagrangian_loss: -0.008641, attention_score_distillation_loss: 0.000986
----------------------------------------------------------------------
time: 2023-07-19 15:40:52
Evaluating: accuracy: 0.6498, eval_loss: 2.3617, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.6162, expected_sparsity: 0.606, expected_sequence_sparsity: 0.8995, target_sparsity: 0.59, step: 5850
lambda_1: 4.3736, lambda_2: 45.7781
lambda_3: 0.0000
train remain: [0.72 0.77 0.69 0.6 0.97 0.71 0.67 0.74 0.65]
infer remain: [0.7, 0.72, 0.64, 0.59, 1.0, 0.7, 0.66, 0.73, 0.65]
layerwise remain: [1.0, 1.0, 1.0, 0.7, 0.5, 0.32, 0.19, 0.19, 0.13, 0.09, 0.06, 0.04]
1111111111111011111111111111100011011011100111100010110110111011111111110101010001010101011100001011
1110111011111111110110100011110111111111110111111111011110111011110100111110100001011001011011001101
1011111111111111111111111110111111111100111101010010110111001011101101100010001000101000001010000011
1101111111111111001001111111111011011001111101111000101000110101011100111111010101100000000000000001
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011110111111111111111001001111010111011111111001000011101011111110011011011000001
1011111111111011111111111111011101001010111111111110110001000011111010100101011001100000000111010111
1111011101011111011111111101111011101011111011111111100011111011101001100110111101101111101110100001
0001010111111111011111111110110111101011111111111000111100101111010101111011010011010100010001000101
Best eval score so far: 0.6823 @ step 5250 epoch 67.31
ETA: 0:03:25 | Epoch 74 finished. Took 41.12 seconds.
loss: 0.002758, lagrangian_loss: -0.012068, attention_score_distillation_loss: 0.000988
loss: 0.002605, lagrangian_loss: -0.029704, attention_score_distillation_loss: 0.000984
----------------------------------------------------------------------
time: 2023-07-19 15:41:18
Evaluating: accuracy: 0.6173, eval_loss: 2.4781, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.611, expected_sparsity: 0.6028, expected_sequence_sparsity: 0.8987, target_sparsity: 0.59, step: 5900
lambda_1: 4.2653, lambda_2: 45.7971
lambda_3: 0.0000
train remain: [0.73 0.77 0.69 0.6 0.97 0.71 0.67 0.74 0.66]
infer remain: [0.7, 0.73, 0.65, 0.59, 1.0, 0.7, 0.66, 0.73, 0.66]
layerwise remain: [1.0, 1.0, 1.0, 0.7, 0.51, 0.33, 0.2, 0.2, 0.14, 0.09, 0.07, 0.04]
1111111111111011111111111111100011011011100111100010110110111011111111110101010011010101011000001011
1110111011111111110110100011110111111111110111111111011110111011110100111110100001011001011011001111
1011111111111111111111111110111111111100111101010010111111001011101101100010001000101000001010000011
1101111111111111001001111111111011011001111101111000101000110101011100111111010101100000000000000001
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011110111111111111111001001111010101011111111001000011101011111110011011011000011
1011111111111011111111111111011101001010111111111110110001000011111010100101011001100000000111010111
1111011101011111011111111101111011101011111011111111100011111011101001100110111101101111101110100001
0101010111111111011111111110110111101011111111111000111100101111010101111011010011010100010001000101
Best eval score so far: 0.6823 @ step 5250 epoch 67.31
loss: 0.002687, lagrangian_loss: -0.030338, attention_score_distillation_loss: 0.000987
loss: 0.002762, lagrangian_loss: -0.041094, attention_score_distillation_loss: 0.000986
ETA: 0:02:44 | Epoch 75 finished. Took 39.1 seconds.
----------------------------------------------------------------------
time: 2023-07-19 15:41:43
Evaluating: accuracy: 0.6318, eval_loss: 2.5083, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.6076, expected_sparsity: 0.5995, expected_sequence_sparsity: 0.8978, target_sparsity: 0.59, step: 5950
lambda_1: 4.0637, lambda_2: 45.8584
lambda_3: 0.0000
train remain: [0.73 0.78 0.7 0.61 0.97 0.72 0.67 0.75 0.68]
infer remain: [0.71, 0.73, 0.65, 0.59, 1.0, 0.71, 0.66, 0.74, 0.67]
layerwise remain: [1.0, 1.0, 1.0, 0.71, 0.52, 0.34, 0.2, 0.2, 0.14, 0.09, 0.07, 0.05]
1111111111111011111111111111100011011011100111100010110110111011111111110101010011010101011100001011
1110111011111111110110100011110111111111110111111111011110111011110100111110100001011001011011001111
1011111111111111111111111110111111111100111101010010111111001011101101100010001000101000001010000011
1101111111111111001001111111111011011001111101111000101000110101011100111111010101100000000000000001
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011110111111111111111001001111010111011111111001000011101011111110011011011000011
1011111111111011111111111111011101001010111111111110110001000011111010100101011001100000000111010111
1111011101011111011111111101111011101011111011111111100011111011111001100110111101101111101110100001
0101010111111111011111111110110111101011111111111010111100101111010101111011010011010100010001000101
Best eval score so far: 0.6823 @ step 5250 epoch 67.31
loss: 0.003304, lagrangian_loss: -0.052091, attention_score_distillation_loss: 0.000984
loss: 0.003466, lagrangian_loss: -0.051544, attention_score_distillation_loss: 0.000985
----------------------------------------------------------------------
time: 2023-07-19 15:42:09
Evaluating: accuracy: 0.6318, eval_loss: 2.4144, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.6063, expected_sparsity: 0.598, expected_sequence_sparsity: 0.8974, target_sparsity: 0.59, step: 6000
lambda_1: 3.7838, lambda_2: 45.9723
lambda_3: 0.0000
train remain: [0.74 0.78 0.7 0.61 0.97 0.72 0.67 0.75 0.69]
infer remain: [0.71, 0.73, 0.65, 0.6, 1.0, 0.71, 0.67, 0.74, 0.68]
layerwise remain: [1.0, 1.0, 1.0, 0.71, 0.52, 0.34, 0.2, 0.2, 0.14, 0.1, 0.07, 0.05]
1111111111111011111111111111100011011011100111100010110110111011111111110101010011010111011000001011
1110111011111111110110100011110111111111110111111111011110111011110100111110100001011001011011001111
1011111111111111111111111110111111111100111101010010111111001011101101100010001000101000001010000011
1111111111111111001001111111111011011001111101111000101000110101011100111111010101100000000000000001
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011110111111111111111001001111010110011111111001100011101011111110011011011000011
1011111111111011111111111111011101101010111111111110110001000011111010100101011001100000000111010111
1111011101011111011111111101111011101011111011111111100011111011101001100110111101101111101110100011
0111010111111111011111111110110111101011111111111010111100101111010101111011010011010100010001000101
Best eval score so far: 0.6823 @ step 5250 epoch 67.31
loss: 0.001548, lagrangian_loss: -0.052469, attention_score_distillation_loss: 0.000984
ETA: 0:02:03 | Epoch 76 finished. Took 40.54 seconds.
loss: 0.002003, lagrangian_loss: -0.051637, attention_score_distillation_loss: 0.000985
----------------------------------------------------------------------
time: 2023-07-19 15:42:35
Evaluating: accuracy: 0.6101, eval_loss: 2.5264, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.5998, expected_sparsity: 0.5947, expected_sequence_sparsity: 0.8966, target_sparsity: 0.59, step: 6050
lambda_1: 3.4407, lambda_2: 46.1415
lambda_3: 0.0000
train remain: [0.74 0.78 0.7 0.61 0.97 0.72 0.67 0.75 0.69]
infer remain: [0.71, 0.74, 0.66, 0.6, 1.0, 0.71, 0.67, 0.74, 0.69]
layerwise remain: [1.0, 1.0, 1.0, 0.71, 0.53, 0.35, 0.21, 0.21, 0.15, 0.1, 0.07, 0.05]
1111111111111011111111111111100011111011100111100010110110111011111111110101010011010101011000001011
1110111011111111110110101011110111111111110111111111011110111011110100111110100001011001011011001111
1011111111111111111111111110111111111110111101010010111111001011101101100010001000101000001010000011
1111111111111111001001111111111011011001111101111000101000110101011100111111010101100000000000000001
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011110111111111111111001001111010110011111111001100011101011111110011011011000011
1011111111111011111111111111011101101010111111111110110001000011111010100101011001100000000111010111
1111011101011111011111111101111011101011111011111111100011111011101001100110111101101111101110100011
0111010111111111011111111110110111101111111111111010111100101111010101111011010011010100010001000101
Best eval score so far: 0.6823 @ step 5250 epoch 67.31
loss: 0.002690, lagrangian_loss: -0.049374, attention_score_distillation_loss: 0.000984
loss: 0.002767, lagrangian_loss: -0.045989, attention_score_distillation_loss: 0.000984
ETA: 0:01:22 | Epoch 77 finished. Took 39.42 seconds.
----------------------------------------------------------------------
time: 2023-07-19 15:43:01
Evaluating: accuracy: 0.6173, eval_loss: 2.453, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.5998, expected_sparsity: 0.5946, expected_sequence_sparsity: 0.8965, target_sparsity: 0.59, step: 6100
lambda_1: 3.0483, lambda_2: 46.3619
lambda_3: 0.0000
train remain: [0.75 0.79 0.7 0.61 0.97 0.72 0.68 0.75 0.7 ]
infer remain: [0.71, 0.74, 0.66, 0.6, 1.0, 0.71, 0.67, 0.74, 0.7]
layerwise remain: [1.0, 1.0, 1.0, 0.71, 0.53, 0.35, 0.21, 0.21, 0.15, 0.1, 0.07, 0.05]
1111111111111011111111111111100011111011100111100010110110111011111111110101010011010101011000001011
1110111011111111110110101011110111111111110111111111011110111011110100111110100001011001011011001111
1011111111111111111111111110111111111110111101010010111111001011101101100010001000101000001010000011
1111111111111111001001111111101011011001111101111000101000110101011100111111010101100000000000000011
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011110111111111111111001001111010110011111111001100011101011111110011011011000011
1011111111111011111111111111011101101010111111111110110001000011111010100101011001100000000111010111
1111011101011111011111111101111011101011111011111111100011111011101001100110111101101111101110100011
0111010111111111011111111110110111101111111111111010111100101111110101111011010011010100010001000101
Best eval score so far: 0.6823 @ step 5250 epoch 67.31
loss: 0.002130, lagrangian_loss: -0.043038, attention_score_distillation_loss: 0.000987
loss: 0.004331, lagrangian_loss: -0.039912, attention_score_distillation_loss: 0.000986
----------------------------------------------------------------------
time: 2023-07-19 15:43:27
Evaluating: accuracy: 0.6245, eval_loss: 2.5277, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.5985, expected_sparsity: 0.5918, expected_sequence_sparsity: 0.8958, target_sparsity: 0.59, step: 6150
lambda_1: 2.5994, lambda_2: 46.6516
lambda_3: 0.0000
train remain: [0.75 0.79 0.7 0.61 0.97 0.72 0.68 0.75 0.7 ]
infer remain: [0.72, 0.74, 0.66, 0.6, 1.0, 0.71, 0.67, 0.74, 0.7]
layerwise remain: [1.0, 1.0, 1.0, 0.72, 0.53, 0.35, 0.21, 0.21, 0.15, 0.1, 0.07, 0.05]
1111111111111011111111111111100011111011100111100010110110111011111111110101010011010111011000001011
1110111011111111110110101011110111111111110111111111011110111011110100111110100001011001011011001111
1011111111111111111111111110111111111110111101010010111111001011101101100010001000101000001010000011
1111111111111111001001111111101011011001111101111000101000110101011100111111010101100000000000000011
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011110111111111111111001001111010110011111111001100011101011111110011011011000011
1011111111111011111111111111011101101010111111111110110001000011111010100101011001100000000111010111
1111011101011111011111111101111011101011111011111111100011111011101001100110111101101111101110100011
0111010111111111011111111110110111101111111111111010111100101111110101111011010011010100010001000101
Best eval score so far: 0.6823 @ step 5250 epoch 67.31
loss: 0.002988, lagrangian_loss: -0.034397, attention_score_distillation_loss: 0.000986
ETA: 0:00:41 | Epoch 78 finished. Took 40.82 seconds.
loss: 0.003530, lagrangian_loss: -0.029778, attention_score_distillation_loss: 0.000985
----------------------------------------------------------------------
time: 2023-07-19 15:43:52
Evaluating: accuracy: 0.6282, eval_loss: 2.495, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.5972, expected_sparsity: 0.5913, expected_sequence_sparsity: 0.8957, target_sparsity: 0.59, step: 6200
lambda_1: 2.1262, lambda_2: 46.9763
lambda_3: 0.0000
train remain: [0.75 0.79 0.7 0.61 0.97 0.72 0.68 0.75 0.7 ]
infer remain: [0.72, 0.74, 0.66, 0.6, 1.0, 0.72, 0.67, 0.74, 0.7]
layerwise remain: [1.0, 1.0, 1.0, 0.72, 0.53, 0.35, 0.21, 0.21, 0.15, 0.1, 0.08, 0.05]
1111111111111011111111111111100011111011100111100010110110111011111111110101010011010111011000001011
1110111011111111110110101011110111111111110111111111011110111011110100111110100001011001011011001111
1011111111111111111111111110111111111110111101010010111111001011101101100010001000101000001010000011
1111111111111111001001111111101011011001111101111000101000110101011100111111010101100000000000000011
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011110111111111111111001001111010111011111111001100011101011111110011011011000011
1011111111111011111111111111011101101010111111111110110001000011111010100101011001100000000111010111
1111011101011111011111111101111011101011111011111111100011111011101001100110111101101111101110100011
0111010111111111011111111110110111101111111111111010111100101111110101111011010011010100010001000101
Best eval score so far: 0.6823 @ step 5250 epoch 67.31
loss: 0.173288, lagrangian_loss: -0.023798, attention_score_distillation_loss: 0.000984
loss: 0.003138, lagrangian_loss: -0.018495, attention_score_distillation_loss: 0.000986
ETA: 0:00:00 | Epoch 79 finished. Took 39.2 seconds.
07/19/2023 15:46:17 - WARNING - urllib3.connectionpool - Retrying (Retry(total=4, connect=5, read=4, redirect=5, status=5)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='southcentralus.api.azureml.ms', port=443): Read timed out. (read timeout=120)")': /mlflow/v2.0/subscriptions/d4404794-ab5b-48de-b7c7-ec1fefb0a04e/resourceGroups/gcr-singularity-octo/providers/Microsoft.MachineLearningServices/workspaces/msroctows/api/2.0/mlflow/runs/get?run_uuid=c90f704e-2048-427a-a825-62713710c8b9&run_id=c90f704e-2048-427a-a825-62713710c8b9