Action Decoding Errors
Hi, I performed training on the LIBERO spatial dataset. When I performed inference, action decoding errors occurred. Below is an example.
```
Error decoding tokens: cannot reshape array of size 33 into shape (7)
Tokens: [271, 326, 340, 372, 271, 326, 1512, 1683, 297, 803, 333, 258]
```
I looked into the code and found that this happens because the coefficient matrix decoded by the BPE tokenizer does not match the shape (time_horizon, action_dim). I was using time_horizon=5 and action_dim=7, so I would expect an array of size 35.
Did you encounter such problem? Since there is no guarantee on generating tokens that would result in a coefficient matrix which matches the desired shape, how would such problem be handled in practice? What was the time horizon used for LIBERO training in Fig. 6 of the paper?
Thanks!
Hi!
Indeed, when you train models to predict the action tokens, it's likely that early on they will predict incorrect output shapes. That's why the FAST tokenizer handles decoding errors gracefully, i.e., it just warns you that an error occurred and returns a default all-zero array.
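A minimal sketch of what that graceful fallback looks like (hypothetical function name; the actual FAST implementation may differ in details):

```python
import warnings
import numpy as np

def decode_actions(flat_coeffs, time_horizon, action_dim):
    """Reshape decoded coefficients into (time_horizon, action_dim).

    On a shape mismatch, warn and return an all-zero action chunk
    instead of raising, so training/inference can continue.
    """
    flat_coeffs = np.asarray(flat_coeffs, dtype=float).ravel()
    expected = time_horizon * action_dim
    if flat_coeffs.size != expected:
        warnings.warn(
            f"Error decoding tokens: cannot reshape array of size "
            f"{flat_coeffs.size} into shape ({time_horizon}, {action_dim})"
        )
        return np.zeros((time_horizon, action_dim))
    return flat_coeffs.reshape(time_horizon, action_dim)
```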
In practice, models typically learn within a few thousand gradient steps to produce only outputs of the correct length, so the number of decoding errors should drop quickly.
The time horizon for LIBERO was 10.
Thanks for your response. I have trained the model for 40k steps and believe it has converged, but I still occasionally get this error. Does that sound reasonable?
I see the same issue.
I've trained the tokenizer on a 2k-sample dataset, and it seems extremely sensitive: for a sequence of 20 tokens, if the model is wrong by even 1 token, the tokenizer can't decode.
That seems a bit impractical.
@kiddyna occasional errors are fine, we do see them too, especially if you push the generalization bounds of your policy, but most predictions of a trained model should be valid.
@WonderYear1905 not sure whether you have re-trained the tokenizer or trained a model using our default tokenizer. In either case, the current default for tokenizer decoding is that it only decodes actions if the shape after BPE de-compression exactly matches [action_horizon x action_dim]. This strict check is a safety mechanism to prevent bad predictions from being decoded into actions and then executed on the robot.
You can, however, be more lenient during decoding. Intuitively, the first predicted tokens in FAST, the lowest-frequency tokens, dominate the overall shape of the output, and those are often good even if the model messes up later tokens (errors in an autoregressive model accumulate, so earlier tokens tend to be more accurate than later ones). So you can always take the predicted action sequence and zero-pad or truncate it to the expected [action_horizon x action_dim] length before reshaping during decode. That way you ensure by design that the model always decodes a valid action.

The drawback is that you may occasionally decode a bad model prediction, which can send your robot flying. So we err on the side of caution and skip such "fishy" action predictions during decoding. In practice, the robot usually just moves a tiny bit in such cases, and the next model prediction is often valid again, so this solution seems to strike the best balance of safety and performance.
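The zero-pad/truncate workaround described above could look something like this (a sketch under the assumptions stated here, not the library's actual decoding code):

```python
import numpy as np

def lenient_decode(flat_coeffs, time_horizon, action_dim):
    """Zero-pad or truncate the flat coefficient array so it always
    reshapes into (time_horizon, action_dim).

    Early (low-frequency) coefficients are kept, since they dominate
    the output shape; missing trailing coefficients are filled with
    zeros, and extra ones are dropped.
    """
    flat = np.asarray(flat_coeffs, dtype=float).ravel()
    expected = time_horizon * action_dim
    if flat.size < expected:
        flat = np.concatenate([flat, np.zeros(expected - flat.size)])
    else:
        flat = flat[:expected]
    return flat.reshape(time_horizon, action_dim)
```

Note that this trades safety for availability: every prediction decodes, including bad ones, which is exactly the trade-off discussed above.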