cardcounter committed
Commit 65cf31f • Parent(s): 4aa81f2
unifying the input shape of the text-only branch and the text-image branch
The **text-only** and **text-image** forward passes take inputs of different shapes in modeling.py. We should keep all of the data from the dataloader and remove the text-only branches that take only the first batch from the inner batch.
This way, the input to `InternLMXComposer2ForCausalLM.forward` will universally have shape (1, bs).
Inside `InternLMXComposer2ForCausalLM.forward`:
In image-text mode, `interleav_wrap` encodes `['text_input']` of shape (1, bs).
In text-only mode, `['text_input']` is first squeezed into a list of shape (bs,), and the `tokenizer` encodes the reshaped text inputs (see the sketch below).
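A minimal, self-contained sketch of that squeeze, assuming a dataloader batch shaped like `{'text_input': [[...bs strings...]]}`; the helper name `unify_text_input` is hypothetical and only illustrates why the text-only branch indexes `text[0]`:

```python
# Hypothetical helper (not in the repo): the (1, bs) -> (bs,) squeeze that the
# text-only branch performs via text[0].
def unify_text_input(samples: dict) -> list:
    text = samples['text_input']  # shape (1, bs): a 1-element list of bs strings
    assert len(text) == 1, 'outer dimension must be 1 under the unified layout'
    return text[0]                # shape (bs,): a flat list the tokenizer can encode

# Example dataloader batch with inner batch size bs = 3.
samples = {'text_input': [['a cat', 'a dog', 'a bird']]}
print(unify_text_input(samples))  # ['a cat', 'a dog', 'a bird']
```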
modeling_internlm_xcomposer2.py
CHANGED
```diff
@@ -423,6 +423,7 @@ class InternLMXComposer2ForCausalLM(InternLM2PreTrainedModel):
                 Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
                 config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
                 (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
+                kwargs['samples']['text_input'] should have dimension 1 x bs
         Returns:
         """
 
@@ -458,7 +459,7 @@ class InternLMXComposer2ForCausalLM(InternLM2PreTrainedModel):
                 image, text, image_nums)
         else:
             to_regress_tokens, targets = self.text2emb(
-                text, add_special_tokens=True)
+                text[0], add_special_tokens=True)
             to_regress_embeds = self.model.tok_embeddings(
                 to_regress_tokens.input_ids)
             attention_mask = to_regress_tokens.attention_mask
```
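For context, a hedged sketch of what the tokenizer inside `text2emb` sees after this change; the checkpoint name is illustrative, and any tokenizer with the standard `transformers` interface behaves the same way:

```python
from transformers import AutoTokenizer

# Illustrative checkpoint; trust_remote_code is required for InternLM tokenizers.
tokenizer = AutoTokenizer.from_pretrained(
    'internlm/internlm-xcomposer2-7b', trust_remote_code=True)

text = [['a photo of a cat', 'a photo of a dog']]  # (1, bs) straight from the dataloader
encodings = tokenizer(text[0], add_special_tokens=True)  # text[0] is flat: shape (bs,)
print(len(encodings.input_ids))  # bs -- one token-id sequence per sample
```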