Dataset?

#1
by jlinux - opened

Do you have plans to release the dataset? I can see how it could be helpful for finetuning further on additional Mermaid diagrams. It might also be helpful for pikchr (pikchr.org) diagrams, since you can streamline the format a bit more and the diagrams are compatible with Visio.

I do not plan to release my hand-curated dataset of ~500 examples. It now serves as my evaluation set during training, because I believe these examples are of higher quality than ones generated by GPT-4. I do not want it public, since it is my benchmark for how the model is improving, and it is incredibly hard to evaluate a model's performance on subjective things like story flow.

In my personal research I created a dataset augmentation toolkit for Mermaid diagrams that uses my MermaidMistral 7B model (really, any of my Mermaid models would work with the toolkit).

  • The toolkit uses my Alpaca prompt format with an example instruction. It created 17k dataset entries (from my original 500 or so, hand-crafted with the Mermaid editor) that are generated over a temperature variance. (My idea) :D
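For readers unfamiliar with the Alpaca format, here is a minimal sketch of how such a prompt could be assembled. It assumes the widely used Stanford Alpaca template; the exact wording in the toolkit may differ, and `build_alpaca_prompt` and the sample instruction are hypothetical names made up for illustration.

```python
# Assumption: the standard Stanford Alpaca template. The toolkit's own
# prompt text may be worded differently.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)


def build_alpaca_prompt(instruction: str, input_text: str) -> str:
    """Fill the Alpaca template with an instruction and its context."""
    return ALPACA_TEMPLATE.format(instruction=instruction, input=input_text)


# Hypothetical example instruction and input, not taken from the real dataset.
prompt = build_alpaca_prompt(
    instruction="Create a Mermaid flowchart describing the process below.",
    input_text="A user logs in, the server checks the password, then grants access.",
)
print(prompt)
```

The model's completion after `### Response:` would then be the Mermaid diagram that gets stored as a new dataset entry.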

I have one dataset for creativity, with examples generated at temperatures increasing from 0.5 to 1.0, and another factual dataset generated at temperatures going from 0.7 down to 0.1. Each seems to really push the model's performance in one direction or the other. I tend to train on the factual dataset first and then on the creative dataset for less than an epoch, as that seems to give the best results for creativity while still remaining factual and accurate to the context provided to the model.
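The temperature sweep can be sketched roughly as follows. This is only an illustration under assumptions: the linear spacing, the number of steps, and the names `temperature_schedule` and `plan_generations` are all hypothetical, and the actual toolkit may vary the temperature differently.

```python
def temperature_schedule(start: float, end: float, steps: int) -> list[float]:
    """Return `steps` evenly spaced temperatures from `start` to `end`, inclusive."""
    if steps < 2:
        return [start]
    step = (end - start) / (steps - 1)
    # Round to two decimals to avoid floating-point noise such as 0.7999999.
    return [round(start + i * step, 2) for i in range(steps)]


def plan_generations(seeds: list[str], schedule: list[float]) -> list[tuple[str, float]]:
    """Pair every seed example with every temperature in the schedule."""
    return [(seed, temp) for seed in seeds for temp in schedule]


# Ranges taken from the post; the step counts are assumptions.
creative_temps = temperature_schedule(0.5, 1.0, 6)  # increasing temperature
factual_temps = temperature_schedule(0.7, 0.1, 7)   # decreasing temperature

# Hypothetical seed prompts standing in for the hand-crafted examples.
jobs = plan_generations(["seed prompt A", "seed prompt B"], creative_temps)
```

Each `(seed, temperature)` pair would then be sent to the model once, so a few hundred seeds multiplied by a handful of temperatures grows into thousands of entries.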

I am always looking to improve my models/dataset.
If you create a new dataset for Mermaid diagrams, please ping me, as I am always willing to work with others on nudging the model to do what people want.
Some people are using my model with prompt engineering to generate full system diagrams of their codebase.
Some people are using my model for story flow, asking it to plan out or complete a story for them given a long input.

Toolkit: https://github.com/Troys-Code/AI_Research/tree/6ef11bd8a3e61539e53ba28b5d420e41b06a154c


I am currently working on a new dataset for further downstream tasks such as a personal assistant, where the AI is one person and the user is another, and the AI keeps track of their conversations and stores and updates knowledge graphs based on them.

I hope to find more people to help me create this dataset in the future, but currently I am on the job hunt and have not had the time to dedicate to my hobbies.
