I am using video clips as input for a video face classification task (palsy or not).
Here is my model:
class MAE_only_x(nn.Module):
    def __init__(self, args):
        super().__init__()
        self.mae = Marlin.from_online("marlin_vit_base_ytf")
        self.mae.eval()
        # self.norm = nn.BatchNorm1d(768 * 2)
        self.decoder = nn.Sequential(
            nn.Linear(768, 512),
            # nn.BatchNorm1d(512),
            nn.ReLU(),
            # nn.Dropout(0.3),
            nn.Linear(512, 2)
        )

    def forward(self, x, phase='train'):
        """
        Input x has shape (B, T, C, H, W) = (B, 16, 3, 224, 224).
        """
        x = x.permute(0, 2, 1, 3, 4).contiguous()         # -> (B, C, T, H, W)
        x = self.mae.extract_features(x, keep_seq=False)  # (B, 768)
        pred_logit = self.decoder(x)                      # (B, 768) -> (B, 2)
        return pred_logit
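One thing I noticed while debugging: calling self.mae.eval() in __init__ does not survive the model.train() call that a training loop makes at the start of each epoch, and eval() by itself does not stop gradients. Below is a minimal sketch of this generic PyTorch behavior, using a hypothetical stand-in backbone (DummyBackbone) instead of the real MARLIN weights, plus one way to genuinely freeze the backbone:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the MARLIN ViT backbone (the real one comes from
# Marlin.from_online); the eval()/train() interaction shown is generic PyTorch.
class DummyBackbone(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(768, 768)
        self.drop = nn.Dropout(0.5)

    def forward(self, x):
        return self.drop(self.proj(x))

class Classifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.mae = DummyBackbone()
        self.mae.eval()                 # same pattern as in the model above
        self.decoder = nn.Linear(768, 2)

    def forward(self, x):
        return self.decoder(self.mae(x))

model = Classifier()
assert not model.mae.training   # eval() in __init__ did take effect...

model.train()                   # ...but the training loop calls this every epoch,
assert model.mae.training       # which flips the backbone back to train mode
                                # (Dropout/BatchNorm active, stats updating)

# To genuinely freeze the backbone: stop its gradients AND re-apply eval mode
# after every model.train() call.
for p in model.mae.parameters():
    p.requires_grad_(False)
model.train()
model.mae.eval()

trainable = [p for p in model.parameters() if p.requires_grad]
assert len(trainable) == 2      # only the decoder's weight and bias remain
```

If the intent is linear probing, it may also help to pass only the trainable parameters to the optimizer, e.g. torch.optim.Adam(model.decoder.parameters(), lr=1e-4).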
x is my input: a batch of videos, each with 16 frames of 3x224x224 RGB.
But for this 2-label classification task, the loss of the network never goes down.
Is there anything I can do to fine-tune better using MARLIN?
Any advice would be appreciated.
Desperately, my model's loss never goes down, like the following: