---
library_name: transformers
tags:
- lyrics
- text
- text-to-lyrics
- artist-to-lyrics
- text-generation
datasets:
- smgriffin/modern-pop-lyrics
language:
- en
base_model:
- openai-community/gpt2
pipeline_tag: text-generation
---

# Model Card for pop-lyrics-generator-v1

Finetuned from openai-community/gpt2 on smgriffin/modern-pop-lyrics - generates lyrics in the style of specific pop artists.

### Model Description

It's pretty good at generating a song structure and stylized lyrics by artist, but bad at rhyming. It sometimes repeats the same thing over and over, but so do pop artists. It might be useful for inspiration while writing lyrics. Some of the generated content can be really silly and potentially offensive - especially if you prompt it with Lil Wayne.

- **Developed by:** Scott Griffin
- **Model type:** Generative Language
- **Language(s) (NLP):** English, Spanish
- **Finetuned from model:** openai-community/gpt2

Check out the W&B run here: [https://wandb.ai/scottgriffinm-scott-griffin-industrial-complex/pop-lyrics-generator-v1?nw=nwuserscottgriffinm](https://wandb.ai/scottgriffinm-scott-griffin-industrial-complex/pop-lyrics-generator-v1?nw=nwuserscottgriffinm) & my blog post on making it here: [https://scottsblog.glitch.me#pop-lyrics-generator-v1](https://scottsblog.glitch.me#pop-lyrics-generator-v1)

## Uses

This model is not for commercial use. The content is the property of the individual artists whose lyrics the model was finetuned on. This is for research purposes only.

## How to Use

Use the code below to generate lyrics:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer, pipeline

# load the finetuned model and tokenizer
model_name = "smgriffin/pop-lyrics-generator-v1"
model = GPT2LMHeadModel.from_pretrained(model_name)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)

# create text generation pipeline
text_generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

# prompt for Justin Bieber lyrics
artist_name = "Justin Bieber"
prompt = f"Artist: {artist_name}\nLyrics:"

# generate and print
generated_texts = text_generator(
    prompt,
    max_length=150,
    num_return_sequences=1,
    temperature=0.9,  # less than 0.9 results in a lot of repeated lyrics
    top_k=50,
    top_p=0.95,
    do_sample=True,
)

print("Generated Lyrics:")
print(generated_texts[0]["generated_text"])
```
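The prompt format mirrors the fine-tuning data: `Artist: <name>\nLyrics:`. Below is a minimal sketch that loops over a few artists to compare styles; the artist names are just examples and only produce well-stylized output if they appear in the smgriffin/modern-pop-lyrics training set.

```python
from transformers import pipeline

# sketch: compare generations across a few artists
# (example names - results are only stylized for artists present in smgriffin/modern-pop-lyrics)
text_generator = pipeline("text-generation", model="smgriffin/pop-lyrics-generator-v1")

for artist_name in ["Justin Bieber", "Taylor Swift", "Ariana Grande"]:
    prompt = f"Artist: {artist_name}\nLyrics:"
    out = text_generator(
        prompt,
        max_length=150,
        temperature=0.9,
        top_k=50,
        top_p=0.95,
        do_sample=True,
    )
    print(f"--- {artist_name} ---")
    print(out[0]["generated_text"])
```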
## How to Fine-Tune Your Own Lyric Generation Model

Use the code below to fine-tune your own GPT-2 model (for example, on the smgriffin/modern-pop-lyrics dataset):

```python
import os

from datasets import load_dataset
from transformers import (
    GPT2Tokenizer,
    GPT2LMHeadModel,
    Trainer,
    TrainingArguments,
    DataCollatorForLanguageModeling,
)

# output directory
output_dir = "/your/output/directory"
os.makedirs(output_dir, exist_ok=True)

# load dataset
dataset = load_dataset("smgriffin/modern-pop-lyrics")

# preprocess dataset
def preprocess_function(example):
    # combine artist name with lyrics so the model is conditioned on the artist
    combined = [
        f"Artist: {artist}\nLyrics: {lyrics}\n\n"
        for artist, lyrics in zip(example["artist"], example["lyrics"])
    ]
    return {"text": combined}

processed_dataset = dataset.map(preprocess_function, batched=True)

# split into train and validation sets
train_test_split = processed_dataset["train"].train_test_split(test_size=0.1, seed=42)
train_dataset = train_test_split["train"]
val_dataset = train_test_split["test"]

# load tokenizer
model_name = "gpt2"  # base GPT-2 model for fine-tuning
tokenizer = GPT2Tokenizer.from_pretrained(model_name)

# use eos_token as pad_token (GPT-2 doesn't have a padding token)
tokenizer.pad_token = tokenizer.eos_token

# tokenize dataset
def tokenize_function(example):
    tokenized = tokenizer(
        example["text"],
        truncation=True,
        padding="max_length",
        max_length=512,
    )
    return {
        "input_ids": tokenized["input_ids"],
        "attention_mask": tokenized["attention_mask"],
        "labels": tokenized["input_ids"],
    }

train_dataset = train_dataset.map(tokenize_function, batched=True, remove_columns=["artist", "lyrics", "text"])
val_dataset = val_dataset.map(tokenize_function, batched=True, remove_columns=["artist", "lyrics", "text"])

# data collator (mlm=False -> causal language modeling)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# load GPT-2
model = GPT2LMHeadModel.from_pretrained(model_name)

# training arguments
training_args = TrainingArguments(
    output_dir=output_dir,
    overwrite_output_dir=True,
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=10,
    save_steps=1000,
    save_total_limit=1,
    logging_dir=f"{output_dir}/logs",
    logging_steps=50,
    gradient_accumulation_steps=2,
    fp16=True,  # requires a GPU; set to False on CPU
    max_grad_norm=1.0,
    push_to_hub=False,
)

# init trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

# start fine-tuning
trainer.train()

# save model and tokenizer
trainer.save_model(output_dir)
tokenizer.save_pretrained(output_dir)
```
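After training finishes, a quick way to sanity-check the saved checkpoint is to load it from `output_dir` with the same pipeline used in the generation example above. This is a minimal sketch; it assumes training completed and that `output_dir` points at the directory saved by `trainer.save_model` above.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer, pipeline

# sketch: load the freshly fine-tuned checkpoint and generate a sample
# (assumes output_dir is the same directory used by trainer.save_model above)
output_dir = "/your/output/directory"
model = GPT2LMHeadModel.from_pretrained(output_dir)
tokenizer = GPT2Tokenizer.from_pretrained(output_dir)

text_generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

prompt = "Artist: Justin Bieber\nLyrics:"
sample = text_generator(prompt, max_length=150, do_sample=True, temperature=0.9, top_k=50, top_p=0.95)
print(sample[0]["generated_text"])
```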