Model Components¶
The main components of a WideDeep (i.e. multimodal) model are tabular data, text and images, which are fed into the model via the so-called wide, deeptabular, deeptext and deepimage model components.
1. wide¶
The wide component is a Linear layer "plugged" into the output neuron(s). Here, the non-linearities are captured via crossed columns. Crossed columns are, quoting the paper directly: "For binary features, a cross-product transformation (e.g., “AND(gender=female, language=en)”) is 1 if and only if the constituent features (“gender=female” and “language=en”) are all 1, and 0 otherwise".
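To make this concrete, here is a tiny, hypothetical example (the column names are made up and this is not the library's API) of building such a crossed column by hand:

import pandas as pd

# two binary features; the crossed column is their product, so it is 1
# only when both constituent features are 1
d = pd.DataFrame({"gender=female": [1, 0, 1], "language=en": [1, 1, 0]})
d["AND(gender=female, language=en)"] = d["gender=female"] * d["language=en"]
print(d)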
The only particularity of our implementation is that we have implemented the linear layer via an Embedding layer plus a bias. While the implementations are equivalent, the latter is faster and far more memory efficient, since we do not need to one hot encode the categorical features.
Let's assume we have the following dataset:
import torch
import pandas as pd
import numpy as np
from torch import nn
df = pd.DataFrame({"color": ["r", "b", "g"], "size": ["s", "n", "l"]})
df.head()
|   | color | size |
|---|-------|------|
| 0 | r | s |
| 1 | b | n |
| 2 | g | l |
One-hot encoded, the first observation would be
obs_0_oh = (np.array([1.0, 0.0, 0.0, 1.0, 0.0, 0.0])).astype("float32")
If we simply numerically encode (label encode or le) the values:
obs_0_le = (np.array([0, 3])).astype("int64")
Note that in the actual implementation of the package we start from 1, saving 0 for padding, i.e. for unseen values.
Now, let's see if the two implementations are equivalent
# we have 6 different values. Let's assume we are performing a regression, so pred_dim = 1
lin = nn.Linear(6, 1)
emb = nn.Embedding(6, 1)
emb.weight = nn.Parameter(lin.weight.reshape_as(emb.weight))
lin(torch.tensor(obs_0_oh))
tensor([-0.5181], grad_fn=<ViewBackward0>)
emb(torch.tensor(obs_0_le)).sum() + lin.bias
tensor([-0.5181], grad_fn=<AddBackward0>)
And this is precisely how the linear model Wide is implemented:
from pytorch_widedeep.models import Wide
# ?Wide
wide = Wide(input_dim=10, pred_dim=1)
wide
Wide( (wide_linear): Embedding(11, 1, padding_idx=0) )
Note that even though the input dim is 10, the Embedding layer has 11 weights. Again, this is because we save 0 for padding, which is used for unseen values during the encoding process.
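As a quick check (a sketch), we can inspect the weight reserved for the padding index: with padding_idx=0 it is initialised to zero and excluded from gradient updates, so unseen categories contribute nothing to the output.

# the embedding row at padding_idx=0 is all zeros and is never updated
# during training
print(wide.wide_linear.weight[0])
# expected: tensor([0.], grad_fn=<SelectBackward0>)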
As I mentioned, the deeptabular component has enough complexity on its own that it will be described in detail in a separate notebook; here we just give a brief overview.
2. deeptabular¶
The deeptabular model alone is what would normally be referred to as Deep Learning for tabular data. As mentioned a number of times throughout the library, each component can be used independently. Therefore, if you wanted to use any of the models below on its own, that is perfectly possible. There are just a couple of simple requirements that will be covered in a later notebook.
At the time of writing, there are a number of models available in pytorch-widedeep to do DL for tabular data. These are:
TabMlp
ContextAttentionMLP
SelfAttentionMLP
TabResnet
Tabnet
TabTransformer
FT-Transformer
SAINT
TabFastFormer
TabPerceiver
Let's have a look at one of them. For more information on each of these models, please have a look at the documentation.
from pytorch_widedeep.preprocessing import TabPreprocessor
from pytorch_widedeep.models import TabMlp
data = {
"cat1": np.random.choice(["A", "B", "C"], size=20),
"cat2": np.random.choice(["X", "Y"], size=20),
"cont1": np.random.rand(20),
"cont2": np.random.rand(20),
}
df = pd.DataFrame(data)
df.head()
|   | cat1 | cat2 | cont1 | cont2 |
|---|------|------|-------|-------|
| 0 | A | Y | 0.789347 | 0.561789 |
| 1 | C | X | 0.050822 | 0.061538 |
| 2 | A | Y | 0.863784 | 0.241967 |
| 3 | C | X | 0.917848 | 0.644658 |
| 4 | C | Y | 0.042328 | 0.417303 |
# see the docs for details on all params/options
tab_preprocessor = TabPreprocessor(
cat_embed_cols=["cat1", "cat2"],
continuous_cols=["cont1", "cont2"],
embedding_rule="fastai",
)
X_tab = tab_preprocessor.fit_transform(df)
/Users/javierrodriguezzaurin/Projects/pytorch-widedeep/pytorch_widedeep/preprocessing/tab_preprocessor.py:358: UserWarning: Continuous columns will not be normalised warnings.warn("Continuous columns will not be normalised")
# toy example just to build a model.
tabmlp = TabMlp(
column_idx=tab_preprocessor.column_idx,
cat_embed_input=tab_preprocessor.cat_embed_input,
continuous_cols=tab_preprocessor.continuous_cols,
embed_continuous_method="standard",
cont_embed_dim=4,
mlp_hidden_dims=[8, 4],
mlp_linear_first=True,
)
tabmlp
TabMlp( (cat_embed): DiffSizeCatEmbeddings( (embed_layers): ModuleDict( (emb_layer_cat1): Embedding(4, 3, padding_idx=0) (emb_layer_cat2): Embedding(3, 2, padding_idx=0) ) (embedding_dropout): Dropout(p=0.0, inplace=False) ) (cont_norm): Identity() (cont_embed): ContEmbeddings( INFO: [ContLinear = weight(n_cont_cols, embed_dim) + bias(n_cont_cols, embed_dim)] (linear): ContLinear(n_cont_cols=2, embed_dim=4, embed_dropout=0.0) (dropout): Dropout(p=0.0, inplace=False) ) (encoder): MLP( (mlp): Sequential( (dense_layer_0): Sequential( (0): Linear(in_features=13, out_features=8, bias=True) (1): ReLU(inplace=True) (2): Dropout(p=0.1, inplace=False) ) (dense_layer_1): Sequential( (0): Linear(in_features=8, out_features=4, bias=True) (1): ReLU(inplace=True) (2): Dropout(p=0.1, inplace=False) ) ) ) )
Let's describe the model a bit. First we have what we call DiffSizeCatEmbeddings, where categorical columns with different numbers of unique categories are encoded with embeddings of different dimensions. Then the continuous columns are not normalised (the normalisation layer is just the identity) and are embedded via the "standard" method, using a so-called ContLinear layer. This layer displays some INFO that tells us what it is (ContLinear = weight(n_cont_cols, embed_dim) + bias(n_cont_cols, embed_dim)). There are two other options available to embed the continuous cols, based on the paper On Embeddings for Numerical Features in Tabular Deep Learning: PieceWise and Periodic, both available via the embed_continuous_method param, which can take the values "standard", "piecewise" and "periodic". The embedded categorical and continuous columns are then concatenated ($3 + 2 + (4 * 2) = 13$ input dims) and passed to an MLP.
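As a quick sanity check (a sketch; recall that model components return last-layer activations, not predictions), we can pass the preprocessed array through the model. The output should have mlp_hidden_dims[-1] = 4 features:

# forward pass with the preprocessed data from above
X = torch.tensor(X_tab, dtype=torch.float)
out = tabmlp(X)
print(out.shape)  # expected: torch.Size([20, 4])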
3. deeptext¶
At the time of writing, pytorch-widedeep offers three models that can be passed to WideDeep as the deeptext component. These are:
- BasicRNN
- AttentiveRNN
- StackedAttentiveRNN
For details on each of these models, please have a look at the documentation of the package.
We will soon integrate with Hugging Face, but let me insist: it is perfectly possible to use custom models for each component; please have a look at the corresponding notebook. In general, simply build them and pass them as the corresponding parameters. Note that custom models MUST return the activations of their last layer (i.e. not the final prediction), so that these activations can be collected by WideDeep and combined accordingly. In addition, the models MUST also contain an attribute output_dim with the size of this last layer of activations.
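For illustration, here is a minimal, hypothetical custom deeptext component (the class name and sizes are made up) that satisfies both requirements: it returns last-layer activations and exposes output_dim:

import torch
from torch import nn


class MyCustomText(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 16, hidden_dim: int = 8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        # required attribute: size of the last layer of activations
        self.output_dim = hidden_dim

    def forward(self, X: torch.Tensor) -> torch.Tensor:
        h, _ = self.rnn(self.embed(X.long()))
        # return the last-step activations, NOT a final prediction
        return h[:, -1, :]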
Let's have a look at the BasicRNN model:
from pytorch_widedeep.models import BasicRNN
basic_rnn = BasicRNN(vocab_size=4, hidden_dim=4, n_layers=1, padding_idx=0, embed_dim=4)
/Users/javierrodriguezzaurin/.pyenv/versions/3.10.13/envs/widedeep310/lib/python3.10/site-packages/torch/nn/modules/rnn.py:82: UserWarning: dropout option adds dropout after all but last recurrent layer, so non-zero dropout expects num_layers greater than 1, but got dropout=0.1 and num_layers=1 warnings.warn("dropout option adds dropout after all but last "
basic_rnn
BasicRNN( (word_embed): Embedding(4, 4, padding_idx=0) (rnn): LSTM(4, 4, batch_first=True, dropout=0.1) (rnn_mlp): Identity() )
You could, if you wanted, add a Fully Connected Head (FC-Head) on top of it, as sketched below.
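A sketch, assuming BasicRNN accepts the head_hidden_dims parameter used by other model components (please check the docs for the exact signature):

basic_rnn_w_head = BasicRNN(
    vocab_size=4,
    embed_dim=4,
    hidden_dim=4,
    n_layers=1,
    padding_idx=0,
    head_hidden_dims=[4, 2],  # assumed param: adds an MLP head on top of the RNN
)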
4. deepimage¶
At the time of writing, pytorch-widedeep is integrated with torchvision via the Vision class. This means that it is possible to use a variant of any of the following architectures:
- resnet
- shufflenet
- resnext
- wide_resnet
- regnet
- densenet
- mobilenet
- mnasnet
- efficientnet
- squeezenet
The user can choose which layers will be trainable. Alternatively, if none of these architectures is useful, one could use a simple, fully trained CNN (please see the package documentation) or pass a custom model.
Let's have a look:
from pytorch_widedeep.models import Vision
resnet = Vision(pretrained_model_setup="resnet18", n_trainable=0)
resnet
Vision( (features): Sequential( (0): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False) (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU(inplace=True) (3): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False) (4): Sequential( (0): BasicBlock( (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) ) (1): BasicBlock( (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) ) ) (5): Sequential( (0): BasicBlock( (conv1): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False) (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (downsample): Sequential( (0): Conv2d(64, 128, kernel_size=(1, 1), stride=(2, 2), bias=False) (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) ) ) (1): BasicBlock( (conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) ) ) (6): Sequential( (0): BasicBlock( (conv1): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False) (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (downsample): Sequential( (0): Conv2d(128, 256, kernel_size=(1, 1), stride=(2, 2), bias=False) (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) ) ) (1): BasicBlock( (conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) ) ) (7): Sequential( (0): BasicBlock( (conv1): Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False) (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (downsample): Sequential( 
(0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False) (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) ) ) (1): BasicBlock( (conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) ) ) (8): AdaptiveAvgPool2d(output_size=(1, 1)) ) )
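As a final sanity check (a sketch), we can pass a batch of dummy images through the model; for resnet18, the adaptive average pooling at the end should leave 512 features per image:

X_img = torch.rand(2, 3, 224, 224)  # (batch, channels, height, width)
print(resnet(X_img).shape)  # expected: torch.Size([2, 512])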