Target_module of this phi-3-small model

#3
by hackint0sh - opened

After loading the model, use

for name, module in model.named_modules():
    print(name)

to list the names of the model's modules.

For this model, the target modules are [up_proj, down_proj].
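As a rough sketch of how that list could be pulled out automatically: named_modules() yields fully qualified dotted names, so the short names that a LoRA-style target_modules setting expects are the last path component of each leaf module. The helper below works on plain name strings, and the example names are hypothetical placeholders, not output captured from the real model.

```python
def leaf_module_names(qualified_names):
    """Return the sorted, de-duplicated last path components of leaf modules."""
    # named_modules() yields "" for the root module, so drop empty names.
    names = [n for n in qualified_names if n]
    # Every proper prefix of a dotted name is a parent, i.e. not a leaf.
    parents = set()
    for n in names:
        parts = n.split(".")
        for i in range(1, len(parts)):
            parents.add(".".join(parts[:i]))
    return sorted({n.rsplit(".", 1)[-1] for n in names if n not in parents})


# Hypothetical names, for illustration only. With a loaded model you would use:
#   [name for name, _ in model.named_modules()]
example = [
    "",
    "model",
    "model.layers",
    "model.layers.0",
    "model.layers.0.mlp",
    "model.layers.0.mlp.up_proj",
    "model.layers.0.mlp.down_proj",
    "model.layers.1",
    "model.layers.1.mlp",
    "model.layers.1.mlp.up_proj",
    "model.layers.1.mlp.down_proj",
]
print(leaf_module_names(example))
```

On a real model the result will also contain leaves that are not linear layers (norms, activations, and so on), so the list still needs to be filtered by hand or by module type before it is used as target modules.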

Microsoft org

I'm sorry, but could you clarify the question a bit more or provide some additional context? I'm not sure I understand the issue.

import transformers

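model_name = "microsoft/Phi-3-small-128k-instruct"  # Replace with your desired Phi-3-Small variant
# Phi-3-Small is a decoder-only (causal) LM that ships custom modeling code,
# so it loads through AutoModelForCausalLM with trust_remote_code=True.
model = transformers.AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)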

for name, module in model.named_modules():
    print(name)

Running this code prints the name of every module in the model. The names are fully qualified dotted paths; for Phi-3-Small, the entries of interest end in:

up_proj
down_proj
... # Other modules in the model

This shows that the projection modules inside each layer's MLP are named up_proj and down_proj. Consult the Phi-3 documentation for a detailed explanation of their role in the model's architecture.

Microsoft org

That is accurate.
up_proj and down_proj are a part of the MLP layer with GEGLU activation (https://arxiv.org/pdf/2002.05202)
See this line.
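For reference, the GEGLU feed-forward layer as defined in the linked paper (biases omitted) is:

```latex
\mathrm{FFN}_{\mathrm{GEGLU}}(x, W, V, W_2) = \bigl(\mathrm{GELU}(xW) \otimes xV\bigr)\, W_2
```

Here $\otimes$ is the element-wise product. Reading up_proj as the fused input projections $W$ and $V$ and down_proj as $W_2$ is an assumption made for illustration; the exact activation variant used by Phi-3-Small should be checked against the model code.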

I got a runtime error when running inference on the model with device_map="auto". Does it only work with a single GPU for inference?
This problem only happens with small; medium and mini work just fine. :shrug:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument tensors in method wrapper_CUDA_cat)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

torch.random.manual_seed(0)
model_id = "microsoft/Phi-3-small-8k-instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id, 
    torch_dtype="auto", 
    trust_remote_code=True,
    device_map="auto",
)
assert torch.cuda.is_available(), "This model needs a GPU to run ..."
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "user", "content": "Can you provide ways to eat combinations of bananas and dragonfruits?"},
    {"role": "assistant", "content": "Sure! Here are some ways to eat bananas and dragonfruits together: 1. Banana and dragonfruit smoothie: Blend bananas and dragonfruits together with some milk and honey. 2. Banana and dragonfruit salad: Mix sliced bananas and dragonfruits together with some lemon juice and honey."},
    {"role": "user", "content": "What about solving an 2x + 3 = 7 equation?"},
]

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    device_map="auto",
)

generation_args = {
    "max_new_tokens": 500,
    "return_full_text": False,
    "temperature": 0.0,
    "do_sample": False,
}

output = pipe(messages, **generation_args)
print(output[0]['generated_text'])
Microsoft org • edited May 24, 2024

Huh, interesting.
For some reason, it seems the pipeline placed the model on one GPU and the input tensors on another (one on "cuda:0", the other on "cuda:1").
I'd suggest controlling the device placement explicitly, just to avoid any confusion. Copying from the README below:

model_id = "microsoft/Phi-3-small-8k-instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id, 
    torch_dtype="auto", 
    trust_remote_code=True, 
)
assert torch.cuda.is_available(), "This model needs a GPU to run ..."
device = torch.cuda.current_device()  # <----- Explicitly specifying the device to send the model to
model = model.to(device)  # <----- Send the model to the particular device
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "user", "content": "Can you provide ways to eat combinations of bananas and dragonfruits?"},
    {"role": "assistant", "content": "Sure! Here are some ways to eat bananas and dragonfruits together: 1. Banana and dragonfruit smoothie: Blend bananas and dragonfruits together with some milk and honey. 2. Banana and dragonfruit salad: Mix sliced bananas and dragonfruits together with some lemon juice and honey."},
    {"role": "user", "content": "What about solving an 2x + 3 = 7 equation?"},
]

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    device=device  # <----- Also tell the pipeline to use the same device while creating the input tensors
)

Let me know if this fixes the issue.

By multi-GPU inference, do you mean data-parallel inference or tensor slicing?
Data parallelism can be done by running the script with any launcher of your choice (torchrun/deepspeed/mpi); just set the current device correctly based on the local rank, and that should work, I think.

Tensor slicing is a separate problem: hard to give more info without knowing how you want to do the tensor-slicing.
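A minimal sketch of the data-parallel setup described above, assuming a torchrun-style launcher that exports the RANK, LOCAL_RANK and WORLD_SIZE environment variables (other launchers may use different names). The helper names and the prompt list are hypothetical.

```python
import os


def env_int(name, default, env=None):
    """Read an integer environment variable, falling back to a default."""
    env = os.environ if env is None else env
    return int(env.get(name, default))


def shard_for_rank(items, rank, world_size):
    """Give each process every world_size-th item, starting at its own rank."""
    return items[rank::world_size]


# Defaults make a plain `python script.py` run behave like a single process.
local_rank = env_int("LOCAL_RANK", 0)   # GPU index on this node
global_rank = env_int("RANK", 0)        # index of this process across all nodes
world_size = env_int("WORLD_SIZE", 1)   # total number of processes

prompts = ["prompt 0", "prompt 1", "prompt 2", "prompt 3"]  # placeholder inputs
my_prompts = shard_for_rank(prompts, global_rank, world_size)

# In the real script, local_rank would then pick the GPU, for example:
#   torch.cuda.set_device(local_rank)
#   model = model.to(local_rank)
#   pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, device=local_rank)
print(local_rank, my_prompts)
```

Each process loads a full copy of the model on its own GPU and handles only its slice of the inputs, so this only helps when the whole model fits on a single device.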

Thanks. Assigning both the pipeline and model to the same device works.
I'm still not sure why setting device_map="auto" fails only for small, but not for medium or mini.
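One way to investigate would be to look at where the weights actually ended up after loading. The sketch below is only a diagnostic and does not explain the failure: it relies on the model exposing parameters() and, when a device_map was used, the hf_device_map attribute that transformers sets. The stand-in class is hypothetical and exists only so the snippet runs without loading any weights.

```python
from types import SimpleNamespace


def devices_in_use(model):
    """Collect the distinct devices that hold the model's parameters, as strings."""
    return sorted({str(p.device) for p in model.parameters()})


def describe_placement(model):
    """Summarise placement; hf_device_map is None if no device_map was used."""
    return {
        "param_devices": devices_in_use(model),
        "hf_device_map": getattr(model, "hf_device_map", None),
    }


class FakeModel:
    """Hypothetical stand-in that mimics a model split across two GPUs."""

    hf_device_map = {"model.layers.0": 0, "model.layers.1": 1}

    def parameters(self):
        yield SimpleNamespace(device="cuda:0")
        yield SimpleNamespace(device="cuda:1")


print(describe_placement(FakeModel()))
```

With the real model, calling describe_placement(model) right after from_pretrained would show whether device_map="auto" spread the layers over several GPUs, which is the situation in which the cross-device error above can appear.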

I tried on an A10G with the following code:

model_id = "microsoft/Phi-3-small-128k-instruct"
model_kwargs = dict(
    use_cache=False,
    trust_remote_code=True,
    attn_implementation="flash_attention_2",  # load the model with flash-attention support
    torch_dtype=torch.bfloat16,
    device_map=None
)
model = AutoModelForCausalLM.from_pretrained(model_id, **model_kwargs)
assert torch.cuda.is_available(), "This model needs a GPU to run ..."
device = torch.cuda.current_device()  
model = model.to(device)  
tokenizer = AutoTokenizer.from_pretrained(model_id)

but the code still throws this error:

AssertionError: Flash Attention is not available, but is needed for dense attention

@hackint0sh Hi there! The inference code (here) assumes that flash-attn is installed.

Run pip install flash-attn to fix the error.

$ pip install flash-attn

Cheers!

nguyenbh changed discussion status to closed

Doesn't work for me:
Traceback (most recent call last):
  File "/home/ubuntu/Multimodal-Uncertainty-Quantification/playground/construct_graph2.py", line 24, in <module>
    model = AutoModelForCausalLM.from_pretrained("numind/NuExtract-large", torch_dtype=torch.bfloat16, trust_remote_code=True)
  File "/home/ubuntu/miniconda3/envs/llava/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 559, in from_pretrained
    return model_class.from_pretrained(
  File "/home/ubuntu/miniconda3/envs/llava/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3788, in from_pretrained
    model = cls(config, *model_args, **model_kwargs)
  File "/home/ubuntu/.cache/huggingface/modules/transformers_modules/numind/NuExtract-large/fc8e001871f4a6be8e6079093b33de334a2316c9/modeling_phi3_small.py", line 903, in __init__
    self.model = Phi3SmallModel(config)
  File "/home/ubuntu/.cache/huggingface/modules/transformers_modules/numind/NuExtract-large/fc8e001871f4a6be8e6079093b33de334a2316c9/modeling_phi3_small.py", line 745, in __init__
    self.layers = nn.ModuleList([Phi3SmallDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)])
  File "/home/ubuntu/.cache/huggingface/modules/transformers_modules/numind/NuExtract-large/fc8e001871f4a6be8e6079093b33de334a2316c9/modeling_phi3_small.py", line 745, in <listcomp>
    self.layers = nn.ModuleList([Phi3SmallDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)])
  File "/home/ubuntu/.cache/huggingface/modules/transformers_modules/numind/NuExtract-large/fc8e001871f4a6be8e6079093b33de334a2316c9/modeling_phi3_small.py", line 651, in __init__
    self.self_attn = Phi3SmallSelfAttention(config, layer_idx)
  File "/home/ubuntu/.cache/huggingface/modules/transformers_modules/numind/NuExtract-large/fc8e001871f4a6be8e6079093b33de334a2316c9/modeling_phi3_small.py", line 218, in __init__
    assert is_flash_attention_available, "Flash Attention is not available, but is needed for dense attention"
AssertionError: Flash Attention is not available, but is needed for dense attention

Hi @tpadhi1! 🤗 The error message is shown when this code block fails, which implies that the following code snippet will raise ImportError in your environment:

import flash_attn
if int(flash_attn.__version__.split('.')[0]) < 2:
    from flash_attn.flash_attn_interface import (
        flash_attn_func,
        flash_attn_unpadded_kvpacked_func as flash_attn_varlen_kvpacked_func,
    )

    # rename `max_seqlen`
    def flash_attn_varlen_qkvpacked_func(qkv, cu_seqlens, max_seqlen, dropout_p=0.0, **kwargs):
        return flash_attn_func(qkv, cu_seqlens, dropout_p=dropout_p, max_s=max_seqlen, **kwargs)

else:
    from flash_attn.flash_attn_interface import (
        flash_attn_varlen_kvpacked_func,
    )
    from flash_attn.bert_padding import index_first_axis, pad_input, unpad_input
is_flash_attention_available = True

Can you run the above code? It should raise an exception, which will help you narrow down the root cause.
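If it helps, here is a small sketch that runs the same imports one at a time and prints the full traceback for whichever one fails. probe_import is a hypothetical helper name; the module names are the ones used in the snippet above.

```python
import importlib
import traceback


def probe_import(module_name):
    """Try to import a module; return (ok, detail) instead of raising."""
    try:
        module = importlib.import_module(module_name)
    except Exception:
        # Usually ImportError, but a broken binary build can raise other errors too.
        return False, traceback.format_exc()
    return True, getattr(module, "__version__", "unknown version")


for name in (
    "flash_attn",
    "flash_attn.flash_attn_interface",
    "flash_attn.bert_padding",
):
    ok, detail = probe_import(name)
    print(f"{name}: {'OK' if ok else 'FAILED'} ({detail if ok else 'see traceback'})")
    if not ok:
        print(detail)
```

A failure on the first line suggests flash-attn is not installed in the active environment; a failure only on a submodule more likely points to a build that does not match the installed torch or CUDA version, though that is a guess until the traceback is seen.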
