How to detect words?

#17
by GoominDev - opened

When the input contains profanity or an injection attempt, the score is displayed, but is there a way to find out which words caused that score?

Meta Llama org

@GoominDev We don't have token-level classification yet, though I think a binary search over the input (e.g. chunking it into halves and scanning each half recursively) to find the sections that might be triggering the model would work pretty well.
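A minimal sketch of that recursive-halving idea, assuming you wrap your existing scoring call in a function that returns the malicious-class probability; the `classify` callable, the threshold, and the minimum span size are all assumptions to tune, not part of the model's API:

```python
from typing import Callable, List


def find_triggering_spans(
    text: str,
    classify: Callable[[str], float],  # assumed: returns a "malicious" score in [0, 1]
    threshold: float = 0.5,
    min_words: int = 3,
) -> List[str]:
    """Return the smallest word spans whose score still exceeds the threshold."""
    words = text.split()

    def recurse(span: List[str]) -> List[str]:
        chunk = " ".join(span)
        if classify(chunk) < threshold:
            return []              # this half no longer triggers; prune it
        if len(span) <= min_words:
            return [chunk]         # small enough to report as a likely culprit
        mid = len(span) // 2
        hits = recurse(span[:mid]) + recurse(span[mid:])
        # If neither half triggers on its own, the trigger likely spans the
        # split point, so report the whole chunk rather than losing it.
        return hits if hits else [chunk]

    return recurse(words)
```

In practice you would pass in a thin wrapper around whatever classifier you already call to get the displayed score, and expect roughly O(k log n) extra classifier calls for k triggering spans in an n-word input.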

GoominDev changed discussion status to closed
