Model Components¶
The main components of a WideDeep (i.e. multimodal) model are tabular data, text and images, which are fed into the model via the so-called wide, deeptabular, deeptext and deepimage model components.
1. wide¶
The wide component is a Linear layer "plugged" into the output neuron(s). Here, the non-linearities are captured via crossed columns. Crossed columns are, quoting the paper directly: "For binary features, a cross-product transformation (e.g., “AND(gender=female, language=en)”) is 1 if and only if the constituent features (“gender=female” and “language=en”) are all 1, and 0 otherwise".
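To make this concrete, here is a tiny, hypothetical example (the column names are made up and this is not the library's API) of building such a crossed column by hand:

import pandas as pd

# two binary features; the crossed column is their product, so it is 1
# only when both constituent features are 1
d = pd.DataFrame({"gender=female": [1, 0, 1], "language=en": [1, 1, 0]})
d["AND(gender=female, language=en)"] = d["gender=female"] * d["language=en"]
print(d)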
The only particularity of our implementation is that we have implemented the linear layer via an Embedding layer plus a bias. While the implementations are equivalent, the latter is faster and far more memory efficient, since we do not need to one hot encode the categorical features.
Let's assume we have the following dataset:
import torch
import pandas as pd
import numpy as np
from torch import nn
df = pd.DataFrame({"color": ["r", "b", "g"], "size": ["s", "n", "l"]})
df.head()
|   | color | size |
|---|-------|------|
| 0 | r | s |
| 1 | b | n |
| 2 | g | l |
One-hot encoded, the first observation would be
obs_0_oh = (np.array([1.0, 0.0, 0.0, 1.0, 0.0, 0.0])).astype("float32")
If we simply numerically encode (label encode or le) the values:
obs_0_le = (np.array([0, 3])).astype("int64")
Note that in the actual implementation of the package we start from 1, saving 0 for padding, i.e. for unseen values.
Now, let's see if the two implementations are equivalent
# we have 6 different values. Let's assume we are performing a regression, so pred_dim = 1
lin = nn.Linear(6, 1)
emb = nn.Embedding(6, 1)
emb.weight = nn.Parameter(lin.weight.reshape_as(emb.weight))
lin(torch.tensor(obs_0_oh))
tensor([-0.5181], grad_fn=<ViewBackward0>)
emb(torch.tensor(obs_0_le)).sum() + lin.bias
tensor([-0.5181], grad_fn=<AddBackward0>)
And this is precisely how the linear model Wide is implemented:
from pytorch_widedeep.models import Wide
# ?Wide
wide = Wide(input_dim=10, pred_dim=1)
wide
Wide( (wide_linear): Embedding(11, 1, padding_idx=0) )
Note that even though the input dim is 10, the Embedding layer has 11 weights. Again, this is because we save 0 for padding, which is used for unseen values during the encoding process.
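As a quick check (a sketch), we can inspect the weight reserved for the padding index: with padding_idx=0 it is initialised to zero and excluded from gradient updates, so unseen categories contribute nothing to the output.

# the embedding row at padding_idx=0 is all zeros and is never updated
# during training
print(wide.wide_linear.weight[0])
# expected: tensor([0.], grad_fn=<SelectBackward0>)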
As I mentioned, the deeptabular component has enough complexity on its own that it will be described in detail in a separate notebook; here we just give a brief overview.
2. deeptabular¶
The deeptabular model alone is what would normally be referred to as Deep Learning for tabular data. As mentioned a number of times throughout the library, each component can be used independently. Therefore, if you wanted to use any of the models below on its own, that is perfectly possible. There are just a couple of simple requirements that will be covered in a later notebook.
At the time of writing, there are a number of models available in pytorch-widedeep to do DL for tabular data. These are:
TabMlp
ContextAttentionMLP
SelfAttentionMLP
TabResnet
Tabnet
TabTransformer
FT-Transformer
SAINT
TabFastFormer
TabPerceiver
Let's have a look at one of them. For more information on each of these models, please have a look at the documentation.
from pytorch_widedeep.preprocessing import TabPreprocessor
from pytorch_widedeep.models import TabMlp
data = {
"cat1": np.random.choice(["A", "B", "C"], size=20),
"cat2": np.random.choice(["X", "Y"], size=20),
"cont1": np.random.rand(20),
"cont2": np.random.rand(20),
}
df = pd.DataFrame(data)
df.head()
|   | cat1 | cat2 | cont1 | cont2 |
|---|------|------|-------|-------|
| 0 | A | Y | 0.789347 | 0.561789 |
| 1 | C | X | 0.050822 | 0.061538 |
| 2 | A | Y | 0.863784 | 0.241967 |
| 3 | C | X | 0.917848 | 0.644658 |
| 4 | C | Y | 0.042328 | 0.417303 |
# see the docs for details on all params/options
tab_preprocessor = TabPreprocessor(
cat_embed_cols=["cat1", "cat2"],
continuous_cols=["cont1", "cont2"],
embedding_rule="fastai",
)
X_tab = tab_preprocessor.fit_transform(df)
/Users/javierrodriguezzaurin/Projects/pytorch-widedeep/pytorch_widedeep/preprocessing/tab_preprocessor.py:358: UserWarning: Continuous columns will not be normalised warnings.warn("Continuous columns will not be normalised")
# toy example just to build a model.
tabmlp = TabMlp(
column_idx=tab_preprocessor.column_idx,
cat_embed_input=tab_preprocessor.cat_embed_input,
continuous_cols=tab_preprocessor.continuous_cols,
embed_continuous_method="standard",
cont_embed_dim=4,
mlp_hidden_dims=[8, 4],
mlp_linear_first=True,
)
tabmlp
TabMlp( (cat_embed): DiffSizeCatEmbeddings( (embed_layers): ModuleDict( (emb_layer_cat1): Embedding(4, 3, padding_idx=0) (emb_layer_cat2): Embedding(3, 2, padding_idx=0) ) (embedding_dropout): Dropout(p=0.0, inplace=False) ) (cont_norm): Identity() (cont_embed): ContEmbeddings( INFO: [ContLinear = weight(n_cont_cols, embed_dim) + bias(n_cont_cols, embed_dim)] (linear): ContLinear(n_cont_cols=2, embed_dim=4, embed_dropout=0.0) (dropout): Dropout(p=0.0, inplace=False) ) (encoder): MLP( (mlp): Sequential( (dense_layer_0): Sequential( (0): Linear(in_features=13, out_features=8, bias=True) (1): ReLU(inplace=True) (2): Dropout(p=0.1, inplace=False) ) (dense_layer_1): Sequential( (0): Linear(in_features=8, out_features=4, bias=True) (1): ReLU(inplace=True) (2): Dropout(p=0.1, inplace=False) ) ) ) )
Let's describe the model a bit. First we have what we call DiffSizeCatEmbeddings, where categorical columns with different numbers of unique categories are encoded with embeddings of different dimensions. Then the continuous columns are not normalised (the normalisation layer is just the identity) and are embedded via the "standard" method, using a so-called ContLinear layer. This layer displays some INFO that tells us what it is (ContLinear = weight(n_cont_cols, embed_dim) + bias(n_cont_cols, embed_dim)). There are two other options available to embed the continuous cols, based on the paper On Embeddings for Numerical Features in Tabular Deep Learning: PieceWise and Periodic, both available via the embed_continuous_method param, which can take the values "standard", "piecewise" and "periodic". The embedded categorical and continuous columns are then concatenated ($3 + 2 + (4 * 2) = 13$ input dims) and passed to an MLP.
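As a quick sanity check (a sketch; recall that model components return last-layer activations, not predictions), we can pass the preprocessed array through the model. The output should have mlp_hidden_dims[-1] = 4 features:

# forward pass with the preprocessed data from above
X = torch.tensor(X_tab, dtype=torch.float)
out = tabmlp(X)
print(out.shape)  # expected: torch.Size([20, 4])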
3. deeptext¶
At the time of writing, pytorch-widedeep offers three models that can be passed to WideDeep as the deeptext component. These are:
- BasicRNN
- AttentiveRNN
- StackedAttentiveRNN
For details on each of these models, please have a look at the documentation of the package.
We will soon integrate with Hugging Face, but let me insist: it is perfectly possible to use custom models for each component; please have a look at the corresponding notebook. In general, simply build them and pass them as the corresponding parameters. Note that custom models MUST return the activations of their last layer (i.e. not the final prediction), so that these activations can be collected by WideDeep and combined accordingly. In addition, the models MUST also contain an attribute output_dim with the size of this last layer of activations.
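For illustration, here is a minimal, hypothetical custom deeptext component (the class name and sizes are made up) that satisfies both requirements: it returns last-layer activations and exposes output_dim:

import torch
from torch import nn


class MyCustomText(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 16, hidden_dim: int = 8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        # required attribute: size of the last layer of activations
        self.output_dim = hidden_dim

    def forward(self, X: torch.Tensor) -> torch.Tensor:
        h, _ = self.rnn(self.embed(X.long()))
        # return the last-step activations, NOT a final prediction
        return h[:, -1, :]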
Let's have a look at the BasicRNN model:
from pytorch_widedeep.models import BasicRNN
basic_rnn = BasicRNN(vocab_size=4, hidden_dim=4, n_layers=1, padding_idx=0, embed_dim=4)
/Users/javierrodriguezzaurin/.pyenv/versions/3.10.13/envs/widedeep310/lib/python3.10/site-packages/torch/nn/modules/rnn.py:82: UserWarning: dropout option adds dropout after all but last recurrent layer, so non-zero dropout expects num_layers greater than 1, but got dropout=0.1 and num_layers=1 warnings.warn("dropout option adds dropout after all but last "
basic_rnn
BasicRNN( (word_embed): Embedding(4, 4, padding_idx=0) (rnn): LSTM(4, 4, batch_first=True, dropout=0.1) (rnn_mlp): Identity() )
You could, if you wanted, add a Fully Connected Head (FC-Head) on top of it, as sketched below.
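A sketch, assuming BasicRNN accepts the head_hidden_dims parameter used by other model components (please check the docs for the exact signature):

basic_rnn_w_head = BasicRNN(
    vocab_size=4,
    embed_dim=4,
    hidden_dim=4,
    n_layers=1,
    padding_idx=0,
    head_hidden_dims=[4, 2],  # assumed param: adds an MLP head on top of the RNN
)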
4. deepimage¶
At the time of writing, pytorch-widedeep is integrated with torchvision via the Vision class. This means that it is possible to use a variant of any of the following architectures:
- resnet
- shufflenet
- resnext
- wide_resnet
- regnet
- densenet
- mobilenet
- mnasnet
- efficientnet
- squeezenet
The user can choose which layers will be trainable. Alternatively, if none of these architectures is useful, one could use a simple, fully trained CNN (please see the package documentation) or pass a custom model.
Let's have a look:
from pytorch_widedeep.models import Vision
resnet = Vision(pretrained_model_setup="resnet18", n_trainable=0)
resnet
Vision( (features): Sequential( (0): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False) (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU(inplace=True) (3): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False) (4): Sequential( (0): BasicBlock( (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) ) (1): BasicBlock( (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) ) ) (5): Sequential( (0): BasicBlock( (conv1): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False) (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (downsample): Sequential( (0): Conv2d(64, 128, kernel_size=(1, 1), stride=(2, 2), bias=False) (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) ) ) (1): BasicBlock( (conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) ) ) (6): Sequential( (0): BasicBlock( (conv1): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False) (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (downsample): Sequential( (0): Conv2d(128, 256, kernel_size=(1, 1), stride=(2, 2), bias=False) (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) ) ) (1): BasicBlock( (conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) ) ) (7): Sequential( (0): BasicBlock( (conv1): Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False) (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (downsample): Sequential( 
(0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False) (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) ) ) (1): BasicBlock( (conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) ) ) (8): AdaptiveAvgPool2d(output_size=(1, 1)) ) )
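As a final sanity check (a sketch), we can pass a batch of dummy images through the model; for resnet18, the adaptive average pooling at the end should leave 512 features per image:

X_img = torch.rand(2, 3, 224, 224)  # (batch, channels, height, width)
print(resnet(X_img).shape)  # expected: torch.Size([2, 512])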