The Iteration Process

#1
by Arsenever - opened

Dear authors, I wonder whether the candidates for LVDR are generated the same way as those for HDR during training.
When you extract components with spaCy, what principles guide the word substitution? Is there a fixed vocabulary, or is an LLM involved? Do the candidates need to be filtered again to rule out unreasonable cases?
Will the code for producing the candidates (LVDR/DDR/HDR) be released in the future? Thanks for your attention, that would help me a lot!

Alibaba-PAI org

Sorry for the late reply.
The candidates for LVDR and for HDR during training are generated in similar ways. We use multiple word sets; the words in each set are basically of the same type (for example, all colors) but have completely different semantics, and we randomly replace words based on these sets. We also tried using an LLM, but the results were not ideal because the generated substitutions are uncontrollable. For LVDR, we also manually check the replacements so that, as far as possible, the replacement words are "basically of the same type but with completely different semantics". We currently have no plans to release the candidate-generation code.
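
The set-based substitution described above can be sketched roughly as follows. This is a minimal illustration and not the authors' code: the word sets, the whitespace tokenisation, and the function name `replace_one_word` are all assumptions made for the example. The spaCy component extraction and the manual checking mentioned in the thread are not reproduced here.

```python
import random

# Hypothetical word sets: each set holds words of the same type
# whose meanings differ. The real sets were not released.
WORD_SETS = [
    {"red", "green", "blue", "black", "white"},   # colors
    {"man", "woman", "boy", "girl"},              # people
]

def replace_one_word(caption, word_sets, rng):
    """Replace one randomly chosen word with a different word from its set.

    Returns the caption unchanged if no word belongs to any set.
    """
    tokens = caption.split()
    # Collect (position, set) pairs for tokens covered by some word set.
    candidates = [
        (i, ws) for i, tok in enumerate(tokens)
        for ws in word_sets if tok.lower() in ws
    ]
    if not candidates:
        return caption
    i, ws = rng.choice(candidates)
    # Sort for determinism under a seeded RNG, then exclude the original word.
    options = sorted(ws - {tokens[i].lower()})
    tokens[i] = rng.choice(options)
    return " ".join(tokens)

rng = random.Random(0)
negative = replace_one_word("A woman in a green dress", WORD_SETS, rng)
```

Because the replacement is drawn from the same set as the original word, the negative caption keeps its grammatical shape while its meaning changes, which is the property the reply describes.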

Thank you very much for your answer. By the way, I'm looking for the meta json for Shot2Story Retrieval, is this accurate? https://huggingface.co/datasets/mhan/shot2story/blob/main/20k_test.json

In LVDR, only one word is replaced in each iteration. Is there a fixed number during HDR training? I noticed that multiple words are replaced in each iteration.

Alibaba-PAI org

Thank you very much for your answer. By the way, I'm looking for the meta json for Shot2Story Retrieval, is this accurate? https://huggingface.co/datasets/mhan/shot2story/blob/main/20k_test.json

Yes. We use the early 20K version.

In LVDR, only one word is replaced in each iteration. Is there a fixed number during HDR training? I noticed that multiple words are replaced in each iteration.

In LVDR, different subsets have different numbers of word replacements in every iteration. During HDR training, the number is randomly sampled from 1 to 5.
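
As a rough sketch of the HDR sampling rule just described, the number of replacements per iteration could be drawn like this. The uniform distribution, the cap at the number of replaceable positions, and the name `sample_positions` are assumptions for illustration; the thread only states that the number is randomly sampled from 1 to 5.

```python
import random

def sample_positions(replaceable, rng, max_replacements=5):
    """Pick which token positions to replace in one iteration.

    The count is drawn uniformly from 1..max_replacements (assumed),
    then capped by how many positions can actually be replaced (assumed).
    """
    if not replaceable:
        return []
    k = rng.randint(1, max_replacements)   # inclusive on both ends
    k = min(k, len(replaceable))
    return sorted(rng.sample(replaceable, k))

rng = random.Random(0)
# Hypothetical token indices whose words appear in some word set.
positions = sample_positions([1, 4, 7, 9, 12, 15, 20], rng)
```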

Thank you for answering the incessant questions of a beginner. I would like to ask if you could share the multiple word sets so that we can accurately reproduce the results.

I used the "video" and "whole_caption" fields in https://huggingface.co/datasets/mhan/shot2story/blob/main/20k_test.json for Shot2Story-W retrieval (about 4025 pairs), but only got V2T 89.37 and T2V 90.21. Did I miss something?
And Shot2Story-S retrieval is between "video" and "captions", right?

An example record from the JSON file:
{"video": "OSfMU69X3C4.7.mp4", "image_id": "OSfMU69X3C4.7.mp4", "id": 0, "whole_caption": "The video begins with a woman dressed in black standing in a bustling office space, engaging in a conversation. As she speaks, the background reveals a group of diligent employees at their workstations. She shares her thoughts on how individuals often lack the capacity to fulfill their desires. The scene transitions to another woman, this time clad in a green dress, standing in the same office environment. She addresses the camera directly, sharing that the Canners Charity Day has successfully raised over a hundred million dollars over the years, indicating the office's involvement in philanthropic activities. The focus then shifts to a man and a woman, both standing in the office, each holding a phone. They end their respective calls almost simultaneously, exchanging smiles as they converse. The man is dressed in a black suit, while the woman is in a white blouse, adding a formal tone to the scene. The video then cuts to a man standing in front of a desk, phone in hand. He is also dressed in a black suit and is seen smiling. Behind him, another man is capturing the moment on a camera, suggesting that they might be documenting their work or creating promotional content. In the final scene, the man in the black suit is seen standing beside a computer desk, hunched over slightly. He is interacting with the person seated in front of the computer, sharing a smile. This scene further emphasizes the friendly and collaborative atmosphere within the office.", "whole_ASR": "I think we all want to do something and we dont always have the ability to do that. \nOver the years, canners charity day has raised over a hundred million dollars and while bringing in rock stars like Jon, Bon Jovi for a day can be a thrill.", "nvid": "OSfMU69X3C4.7.mp4", "video_names": ["OSfMU69X3C4.7_0_120.mp4", "OSfMU69X3C4.7_121_266.mp4", "OSfMU69X3C4.7_267_319.mp4", "OSfMU69X3C4.7_320_388.mp4", "OSfMU69X3C4.7_389_421.mp4"], "audio_captions": ["According to the woman in the video, she says that people don't always have the ability to do what they want to do.", "According to the woman in the video, over the years, canners Charity Day has raised over a hundred million dollars.", "", "", ""], "captions": ["A lady in black stands in an office space and speaks to someone. In the background is a group of people at workstations working.", "A lady in a green dress stands in an office space and speaks to the camera. In the background is a group of people sitting at workstations working.", "A man and a woman stand in an office each holding a phone and then putting it down almost at the same time. The man is wearing a black suit and the woman is wearing a white blouse. The man smiles and the woman smiles while talking.", "A man is standing in front of a desk with a telephone in his hand. The man is wearing a black suit and smiling. A man behind him is filming him with a camera.", "A man in a black suit stands hunched over beside a computer desk. He looks at the person sitting in front of the computer and smiles."], "ASR": [" want to do something and we don't always have the ability to do that.", " Over the years, Cantor's Charity Day has raised over $100 million.", " and while bringing in rock stars.", " like Jon Bon Jovi for a day can be a thrill.", " Lettnick takes a look."]}

I used the "video" and "whole_caption" fields in https://huggingface.co/datasets/mhan/shot2story/blob/main/20k_test.json for Shot2Story-W retrieval (about 4025 pairs), but only got V2T 89.37 and T2V 90.21. Did I miss something?

In text-vision retrieval tasks, dual softmax (Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss) has been widely used. You can refer to the following code:

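import numpy as np
import torch
import torch.nn.functional as F


def compute_metrics(x):
    """Compute retrieval metrics from a square similarity matrix.

    Row i holds the scores of query i; its ground-truth match is column i.
    """
    # Accept either a torch tensor or a NumPy array.
    if isinstance(x, torch.Tensor):
        x = x.detach().cpu().numpy()
    sx = np.sort(-x, axis=1)       # each row sorted from best to worst score
    d = np.diag(-x)                # ground-truth (diagonal) scores
    d = d[:, np.newaxis]
    ind = sx - d
    ind = np.where(ind == 0)       # where the sorted score equals the ground truth
    ind = ind[1]                   # 0-based rank of the ground truth per query
    metrics = {}
    metrics['R1'] = float(np.sum(ind == 0)) * 100 / len(ind)
    metrics['R5'] = float(np.sum(ind < 5)) * 100 / len(ind)
    metrics['R10'] = float(np.sum(ind < 10)) * 100 / len(ind)
    metrics['MR'] = np.median(ind) + 1
    metrics["MedianR"] = metrics['MR']
    metrics["MeanR"] = np.mean(ind) + 1
    return metrics

# Assuming that text_feat and img_feat have already been calculated
temp = 100.
sim_matrix = (text_feat @ img_feat.T) * temp   # rows: texts, columns: videos
# Dual softmax: reweight each score by a softmax taken along axis 0
sim_matrix_dsl = sim_matrix * F.softmax(sim_matrix, 0)
sim_matrix_dsl_T = sim_matrix.T * F.softmax(sim_matrix.T, 0)

dsl_tv_metrics = compute_metrics(sim_matrix_dsl)      # text-to-video
dsl_vt_metrics = compute_metrics(sim_matrix_dsl_T)    # video-to-text
print('dsl_tv_metrics: ', dsl_tv_metrics)
print('dsl_vt_metrics: ', dsl_vt_metrics)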
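
To see why the reweighting in the snippet above can change the ranking, here is a small toy example in plain Python. The 2x2 similarity values are invented for illustration, and the helper names `softmax_axis0` and `top1` are hypothetical; only the formula `sim * softmax(sim, axis 0)` comes from the code above.

```python
import math

def softmax_axis0(matrix):
    """Softmax along axis 0: normalise each column across all rows."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for j in range(n_cols):
        col = [matrix[i][j] for i in range(n_rows)]
        m = max(col)                          # subtract max for stability
        exps = [math.exp(v - m) for v in col]
        total = sum(exps)
        for i in range(n_rows):
            out[i][j] = exps[i] / total
    return out

def top1(matrix):
    """Index of the best-scoring column for each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in matrix]

temp = 100.0
# Rows are texts, columns are videos; the correct match is the diagonal.
# These cosine similarities are made up for the example.
cosine = [[0.9, 0.5],
          [0.8, 0.7]]
sim = [[v * temp for v in row] for row in cosine]
prior = softmax_axis0(sim)
sim_dsl = [[s * p for s, p in zip(s_row, p_row)]
           for s_row, p_row in zip(sim, prior)]

raw_top1 = top1(sim)      # text 1 wrongly prefers video 0
dsl_top1 = top1(sim_dsl)  # each text now retrieves its own video
```

In this toy case video 0 scores highest for both texts, so raw ranking gets text 1 wrong. The axis-0 softmax assigns video 0 almost entirely to text 0, which suppresses it for text 1. Note that the softmax runs across all queries, so the full test-set similarity matrix must be available at evaluation time.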

And Shot2Story-S retrieval is between "video" and "captions", right?

Yes.
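
The field choices confirmed in this thread can be sketched as a small loader. This is an illustration under assumptions: the file is assumed to be a JSON list of records shaped like the example above, the function names are hypothetical, and how the per-shot captions are combined for Shot2Story-S is not stated in the thread, so the sketch simply keeps them as a list.

```python
import json

def load_records(path):
    """Read the annotation file, assumed to be a JSON list of records."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

def build_whole_pairs(records):
    """Shot2Story-W: one (video, whole_caption) pair per record."""
    return [(r["video"], r["whole_caption"]) for r in records]

def build_shot_pairs(records):
    """Shot2Story-S: each video with its list of per-shot captions."""
    return [(r["video"], r["captions"]) for r in records]

# Tiny stand-in record with the same keys as the example above;
# the caption texts are abbreviated placeholders.
records = [{
    "video": "OSfMU69X3C4.7.mp4",
    "whole_caption": "The video begins with a woman dressed in black.",
    "captions": ["A lady in black stands in an office space.",
                 "A lady in a green dress stands in an office space."],
}]
w_pairs = build_whole_pairs(records)
s_pairs = build_shot_pairs(records)
```

With the real file, `records = load_records("20k_test.json")` would replace the stand-in list, assuming the file has been downloaded locally under that name.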

So all retrieval results in the paper are based on DSL?

Alibaba-PAI org

Yes.

Thank you~