This model appears to have been trained with sDPO rather than standard DPO. How is the reward for an assistant response to a question computed with this model at inference time?
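For context, my understanding is that DPO does not train a separate reward model: the reward is implicit, r(x, y) = β · log(π_θ(y|x) / π_ref(y|x)), and sDPO (as I understand the paper) applies the same objective in stages, using the previously aligned model as the reference for the next stage. Here is a minimal sketch of how that implicit reward could be scored for a single response — the checkpoint names and the β value are placeholders, not this model's actual settings:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoints -- substitute the actual policy and reference models.
policy = AutoModelForCausalLM.from_pretrained("policy-model")
reference = AutoModelForCausalLM.from_pretrained("reference-model")
tokenizer = AutoTokenizer.from_pretrained("policy-model")

beta = 0.1  # DPO temperature; assumed value, not taken from this model's config


def response_logprob(model, prompt_ids, response_ids):
    """Sum of token log-probs the model assigns to the response given the prompt."""
    input_ids = torch.cat([prompt_ids, response_ids], dim=-1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Logits at position t predict token t+1, so slice out the response region.
    resp_logits = logits[:, prompt_ids.shape[-1] - 1 : -1, :]
    logps = torch.log_softmax(resp_logits, dim=-1)
    token_logps = logps.gather(-1, response_ids.unsqueeze(-1)).squeeze(-1)
    return token_logps.sum(-1)


prompt_ids = tokenizer("Question: ...", return_tensors="pt").input_ids
response_ids = tokenizer(
    "Answer: ...", return_tensors="pt", add_special_tokens=False
).input_ids

# Implicit DPO reward: beta * (log pi_theta(y|x) - log pi_ref(y|x))
reward = beta * (
    response_logprob(policy, prompt_ids, response_ids)
    - response_logprob(reference, prompt_ids, response_ids)
)
print(reward.item())
```

In particular, I'm unsure whether sDPO's staged reference models change anything about this computation, or which reference model would be the right one to use here.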