I have a model with a forward
function that receives optional parameters, like this:
class MyModel(nn.Module):
    ...
    def forward(self, interactions: torch.Tensor, user_features: Optional[torch.Tensor] = None):
        """
        Where _N_ is the number of items:
            interactions (Tensor): N x 2
            user_features (Tensor): N x (number of features)
        """
        ...
For this reason, my Dataset also returns a dict, built from a DataFrame:
class StackOverflowDataset(torch.utils.data.Dataset):
    def __init__(self, data, user_features=None):
        self._data = data
        self._user_features = user_features

    def __getitem__(self, idx):
        if self._user_features is None:
            return {'interactions': self._data[idx]}
        else:
            return {
                'interactions': self._data[idx],
                'user_features': self._user_features[self._data[idx]['user']],
            }

    def __len__(self):
        return len(self._data)
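For reference, the reason for the dict is that its keys line up with forward's keyword arguments, so a batch from the default collate can be unpacked straight into the model. A minimal runnable sketch with made-up shapes (plain tensors instead of a DataFrame, user id assumed to be column 0):

```python
import torch
from torch.utils.data import DataLoader

# Toy stand-ins: 10 interactions of shape (2,), user id in column 0.
interactions = torch.randint(0, 5, (10, 2))
user_features = torch.randn(5, 3)  # 5 users x 3 features

class DictDataset(torch.utils.data.Dataset):
    def __init__(self, data, user_features=None):
        self._data = data
        self._user_features = user_features

    def __getitem__(self, idx):
        sample = {'interactions': self._data[idx]}
        if self._user_features is not None:
            # Look up this row's user features by user id (column 0 here).
            sample['user_features'] = self._user_features[self._data[idx][0]]
        return sample

    def __len__(self):
        return len(self._data)

loader = DataLoader(DictDataset(interactions, user_features), batch_size=4)
batch = next(iter(loader))
# The default collate stacks each key, so the batch dict can be splatted
# into the model: model(**batch)
print(batch['interactions'].shape)  # torch.Size([4, 2])
```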
Most of the training time is spent in __getitem__, which makes me wonder: is having optional arguments on MyModel.forward bad practice? I guess most of the time goes into handling numpy item by item, then converting to PyTorch tensors and moving them to the GPU. Is there any way I can "pre-process" all of this beforehand but still use a DataLoader that returns a dict?
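To make "pre-process beforehand" concrete, here is a sketch of what I mean: do all the numpy-to-tensor conversion once in __init__, so __getitem__ is reduced to plain tensor indexing. The class name and the assumption that the user id sits in column 0 are illustrative, not from my real code:

```python
import numpy as np
import torch

class PreTensorizedDataset(torch.utils.data.Dataset):
    def __init__(self, data, user_features=None):
        # Convert everything to tensors once, up front, instead of
        # converting row by row inside __getitem__.
        self._interactions = torch.as_tensor(np.asarray(data), dtype=torch.long)
        self._user_features = None
        if user_features is not None:
            feats = torch.as_tensor(np.asarray(user_features), dtype=torch.float32)
            # Gather each row's user features once; column 0 holds the user id.
            self._user_features = feats[self._interactions[:, 0]]

    def __getitem__(self, idx):
        # Pure tensor indexing: no numpy work, no per-item conversion.
        sample = {'interactions': self._interactions[idx]}
        if self._user_features is not None:
            sample['user_features'] = self._user_features[idx]
        return sample

    def __len__(self):
        return len(self._interactions)
```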
It seems that DataPipe also goes row by row. Would it be possible to directly return a dict with the whole 'interactions' tensor and have the DataLoader slice it up?
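From what I can tell, one way to get whole-batch slicing with a map-style dataset is to pass the DataLoader a BatchSampler together with batch_size=None: automatic batching is disabled, the sampler yields lists of indices, and __getitem__ can slice the pre-built tensor in one shot. A sketch of the idea (not tested at scale):

```python
import torch
from torch.utils.data import BatchSampler, DataLoader, SequentialSampler

class SlicingDataset(torch.utils.data.Dataset):
    def __init__(self, interactions):
        self._interactions = interactions  # whole tensor kept in memory

    def __getitem__(self, indices):
        # `indices` is a list of ints because the loader below is driven
        # by a BatchSampler; slice the full tensor in a single operation.
        return {'interactions': self._interactions[indices]}

    def __len__(self):
        return len(self._interactions)

data = torch.arange(20).reshape(10, 2)
loader = DataLoader(
    SlicingDataset(data),
    sampler=BatchSampler(SequentialSampler(range(10)), batch_size=4, drop_last=False),
    batch_size=None,  # disable automatic batching; sampler yields index lists
)
batch = next(iter(loader))
print(batch['interactions'].shape)  # torch.Size([4, 2])
```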