longforemr-kobart-summary-v1 / README.md

cocoirun


	---
	license: cc-by-nc-nd-4.0
	---

	A KoBART model with a Longformer encoder, trained on summaries that ChatGPT generated from AIHUB finance and call-center consultation dialogue data.


	```
	input = """고객: 안녕하세요, 제가 여기서 사용하는 신용카드에 대해 궁금한 게 있어요.

	상담원: 안녕하세요! 네, 어떤 문의가 있으신가요?

	고객: 제가 이번 달에 카드를 사용하면서 리워드 포인트를 얼마나 쌓았는지 확인하고 싶어요.

	상담원: 네, 당신의 리워드 포인트 잔액을 확인해 드릴 수 있습니다. 제가 당신의 카드 번호를 입력하고 확인해볼게요. 번호를 알려주실 수 있을까요?

	고객: 네, 제 카드 번호는 1234-5678-9012-3456입니다.

	상담원: 감사합니다. 잠시만 기다려주세요. 확인 중이에요... 네, 현재 당신의 리워드 포인트 잔액은 3,250 포인트입니다.

	고객: 알겠어요, 감사합니다! 그럼 추가적인 이용 혜택이나 할인에 관한 정보도 얻을 수 있을까요?

	상담원: 물론이죠! 저희 카드사는 다양한 이용 혜택을 제공하고 있습니다. 예를 들어, 여행, 쇼핑, 식사 등 다양한 분야에서 할인 혜택을 받을 수 있거나, 리워드 포인트를 사용하여 상품이나 기프트 카드로 교환할 수 있습니다. 어떤 혜택에 관심이 있으신가요?

	고객: 저는 여행 할인이나 마일리지 적립에 관심이 있어요.

	상담원: 그런 경우에는 당신에게 적합한 여행 카드 혜택을 제공하는 카드를 추천해 드릴 수 있습니다. 여행 카드는 항공사 마일리지를 쌓을 수 있고, 호텔 할인 혜택을 받을 수도 있습니다. 제가 몇 가지 옵션을 제안해 볼까요?

	고객: 네, 그러면 좋을 것 같아요. 감사합니다!
	상담원: 말씀해 주셔서 감사합니다. 이제 제가 몇 가지 추천을 드리도록 하겠습니다. 어떤 항공사를 주로 이용하시나요?"""
	```


	```
	output ="""
	- 고객이 신용카드에 대해 궁금한 사항 상담
	- 리워드 포인트 확인 요청
	- 상담원이 카드 번호와 잔액 확인 후 추가 이용 혜택 안내
	- 고객이 여행 할인, 마일리지, 호텔 할인 등 다양한 혜택에 관심 표현
	"""
	```


	Using this model requires the following classes.
	```
	from typing import Optional, Tuple

	import torch
	from torch import nn
	from transformers.models.longformer.modeling_longformer import LongformerSelfAttention


	class LongformerSelfAttentionForBart(nn.Module):
	    """Wraps Longformer self-attention so it can replace BART's encoder self-attention."""

	    def __init__(self, config, layer_id):
	        super().__init__()
	        self.embed_dim = config.d_model
	        self.longformer_self_attn = LongformerSelfAttention(config, layer_id=layer_id)
	        self.output = nn.Linear(self.embed_dim, self.embed_dim)

	    def forward(
	        self,
	        hidden_states: torch.Tensor,
	        key_value_states: Optional[torch.Tensor] = None,
	        past_key_value: Optional[Tuple[torch.Tensor]] = None,
	        attention_mask: Optional[torch.Tensor] = None,
	        layer_head_mask: Optional[torch.Tensor] = None,
	        output_attentions: bool = False,
	    ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
	        # Reduce the mask from bs x 1 x seq_len x seq_len to bs x seq_len.
	        attention_mask = attention_mask.squeeze(dim=1)
	        attention_mask = attention_mask[:, 0]

	        # Negative values mark masked positions; positive values mark global attention.
	        is_index_masked = attention_mask < 0
	        is_index_global_attn = attention_mask > 0
	        is_global_attn = is_index_global_attn.flatten().any().item()

	        outputs = self.longformer_self_attn(
	            hidden_states,
	            attention_mask=attention_mask,
	            layer_head_mask=None,
	            is_index_masked=is_index_masked,
	            is_index_global_attn=is_index_global_attn,
	            is_global_attn=is_global_attn,
	            output_attentions=output_attentions,
	        )

	        attn_output = self.output(outputs[0])

	        # Always return a 3-tuple so the result matches the annotated return type,
	        # including when output_attentions=True.
	        attn_weights = outputs[1] if len(outputs) > 1 else None
	        return attn_output, attn_weights, None
	```
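The three flags computed in `forward` can be illustrated without PyTorch. The sketch below assumes the mask is additive (0 for tokens to attend to, a large negative value for padding), which is how BART's expanded masks are usually built; `classify_mask_row` and `NEG` are hypothetical names used only for this illustration.

```python
from typing import List, Tuple

NEG = -1e9  # assumed stand-in for the large negative value at padded positions


def classify_mask_row(mask_row: List[float]) -> Tuple[List[bool], List[bool], bool]:
    """Mirror the three flags computed in forward() for a single sequence."""
    is_index_masked = [v < 0 for v in mask_row]
    is_index_global_attn = [v > 0 for v in mask_row]
    is_global_attn = any(is_index_global_attn)
    return is_index_masked, is_index_global_attn, is_global_attn


# Four real tokens followed by two padding positions.
masked, global_idx, has_global = classify_mask_row([0.0, 0.0, 0.0, 0.0, NEG, NEG])
print(masked)      # [False, False, False, False, True, True]
print(has_global)  # False
```

Under that assumption no position ever has a positive mask value, so the wrapper would run with local (windowed) attention only.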

	```
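	from transformers import BartForConditionalGeneration
	from transformers.models.bart.modeling_bart import BartLearnedPositionalEmbedding


	class LongformerEncoderDecoderForConditionalGeneration(BartForConditionalGeneration):
	    def __init__(self, config):
	        super().__init__(config)

	        if config.attention_mode == 'n2':
	            pass  # do nothing, keep BART's regular n^2 self-attention
	        else:
	            # Replace the learned position embeddings with ones sized for the
	            # configured maximum encoder and decoder lengths.
	            self.model.encoder.embed_positions = BartLearnedPositionalEmbedding(
	                config.max_encoder_position_embeddings,
	                config.d_model)

	            self.model.decoder.embed_positions = BartLearnedPositionalEmbedding(
	                config.max_decoder_position_embeddings,
	                config.d_model)

	            # Swap every encoder self-attention layer for the Longformer variant.
	            for i, layer in enumerate(self.model.encoder.layers):
	                layer.self_attn = LongformerSelfAttentionForBart(config, layer_id=i)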
	```

	```
	from typing import List, Optional

	from transformers import BartConfig


	class LongformerEncoderDecoderConfig(BartConfig):
	    def __init__(self, attention_window: Optional[List[int]] = None,
	                 attention_dilation: Optional[List[int]] = None,
	                 autoregressive: bool = False, attention_mode: str = 'sliding_chunks',
	                 gradient_checkpointing: bool = False, **kwargs):
	        """
	        Args:
	            attention_window: list of attention window sizes of length = number of layers.
	                window size = number of attention locations on each side.
	                For an effective window size of 512, use `attention_window=[256]*num_layers`
	                which is 256 on each side.
	            attention_dilation: list of attention dilation of length = number of layers.
	                attention dilation of `1` means no dilation.
	            autoregressive: do autoregressive attention or have attention of both sides
	            attention_mode: 'n2' for regular n^2 self-attention, 'tvm' for TVM implementation of Longformer
	                selfattention, 'sliding_chunks' for another implementation of Longformer selfattention
	        """
	        super().__init__(**kwargs)
	        self.attention_window = attention_window
	        self.attention_dilation = attention_dilation
	        self.autoregressive = autoregressive
	        self.attention_mode = attention_mode
	        self.gradient_checkpointing = gradient_checkpointing
	        assert self.attention_mode in ['tvm', 'sliding_chunks', 'n2']
	```
	After creating the model object, download the weight file separately
	and load the weights with load_state_dict.
	```
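	import torch
	from transformers import AutoTokenizer

	tokenizer = AutoTokenizer.from_pretrained("cocoirun/longforemr-kobart-summary-v1")
	model = LongformerEncoderDecoderForConditionalGeneration.from_pretrained("cocoirun/longforemr-kobart-summary-v1")
	# Fall back to CPU when no GPU is available.
	device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
	# "summary weight.ckpt" is the separately downloaded weight file.
	model.load_state_dict(torch.load("summary weight.ckpt", map_location=device))
	model.to(device)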
	```

	Summarization function
	```
	def summarize(text, max_len):
	    """Summarize `text`, generating at most `max_len` tokens."""
	    max_seq_len = 4096
	    context_tokens = ['<s>'] + tokenizer.tokenize(text) + ['</s>']
	    input_ids = tokenizer.convert_tokens_to_ids(context_tokens)

	    if len(input_ids) < max_seq_len:
	        # Pad short inputs up to the fixed encoder length.
	        input_ids += [tokenizer.pad_token_id] * (max_seq_len - len(input_ids))
	    else:
	        # Truncate long inputs and keep the end-of-sequence token last.
	        input_ids = input_ids[:max_seq_len - 1] + [tokenizer.eos_token_id]

	    res_ids = model.generate(torch.tensor([input_ids]).to(device),
	                             max_length=max_len,
	                             num_beams=5,
	                             no_repeat_ngram_size=3,
	                             eos_token_id=tokenizer.eos_token_id,
	                             bad_words_ids=[[tokenizer.unk_token_id]])

	    res = tokenizer.batch_decode(res_ids.tolist(), skip_special_tokens=True)[0]
	    # Collapse blank lines between summary bullets.
	    res = res.replace("\n\n", "\n")
	    return res
	```
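The padding and truncation step in `summarize` can be checked on its own, without loading the model. The following is a minimal sketch: `pad_or_truncate` is a hypothetical helper, and the token ids `PAD = 3` and `EOS = 1` and the length 8 are made-up values for the example, not the tokenizer's real ids or the model's 4096-token limit.

```python
from typing import List


def pad_or_truncate(input_ids: List[int], max_seq_len: int,
                    pad_token_id: int, eos_token_id: int) -> List[int]:
    """Return a new list that is exactly max_seq_len long."""
    if len(input_ids) < max_seq_len:
        # Short input: append padding ids up to the target length.
        return input_ids + [pad_token_id] * (max_seq_len - len(input_ids))
    # Long input: cut to max_seq_len - 1 and end with the end-of-sequence id.
    return input_ids[:max_seq_len - 1] + [eos_token_id]


PAD, EOS = 3, 1  # hypothetical ids for illustration only
short = pad_or_truncate([0, 10, 11, EOS], 8, PAD, EOS)
long = pad_or_truncate([0] + list(range(10, 30)) + [EOS], 8, PAD, EOS)
print(short)  # [0, 10, 11, 1, 3, 3, 3, 3]
print(long)   # [0, 10, 11, 12, 13, 14, 15, 1]
```

With the model and tokenizer loaded, a call such as `summarize(text, 128)` would return the summary as a string; the value 128 is an arbitrary choice here.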