
The models module

This module contains the models that can be used as the four main components of a Wide and Deep model (wide, deeptabular, deeptext, deepimage), as well as the WideDeep "constructor" class. Note that each of the four components can be used independently. It also contains the documentation for the models that can be used for self-supervised pre-training with tabular data.

Wide

Wide(input_dim, pred_dim=1)

Bases: Module

Defines a Wide (linear) model where the non-linearities are captured via the so-called crossed-columns. This can be used as the wide component of a Wide & Deep model.

Parameters:

  • input_dim (int) –

    size of the Linear layer (implemented via an Embedding layer). input_dim is the total number of individual values across all the features that go through the wide model. For example, if the wide model receives 2 features with 5 individual values each, input_dim = 10

  • pred_dim (int, default: 1 ) –

    size of the output tensor containing the predictions. Note that, unlike all the other models, the wide model is connected directly to the output neuron(s) when used to build a Wide and Deep model. Therefore, it requires the pred_dim parameter.

Attributes:

  • wide_linear (Module) –

    the linear layer that comprises the wide branch of the model

Examples:

>>> import torch
>>> from pytorch_widedeep.models import Wide
>>> X = torch.empty(4, 4).random_(4)
>>> wide = Wide(input_dim=X.unique().size(0), pred_dim=1)
>>> out = wide(X)
Source code in pytorch_widedeep/models/tabular/linear/wide.py
@alias("pred_dim", ["pred_size", "num_class"])
def __init__(self, input_dim: int, pred_dim: int = 1):
    super(Wide, self).__init__()

    self.input_dim = input_dim
    self.pred_dim = pred_dim

    # Embeddings: val + 1 because 0 is reserved for padding/unseen categories.
    self.wide_linear = nn.Embedding(input_dim + 1, pred_dim, padding_idx=0)
    # (Sum(Embedding) + bias) is equivalent to (OneHotVector + Linear)
    self.bias = nn.Parameter(torch.zeros(pred_dim))
    self._reset_parameters()

forward

forward(X)

Forward pass. Simply connecting the Embedding layer with the output neuron(s)

Source code in pytorch_widedeep/models/tabular/linear/wide.py
def forward(self, X: Tensor) -> Tensor:
    r"""Forward pass. Simply connecting the Embedding layer with the ouput
    neuron(s)"""
    out = self.wide_linear(X.long()).sum(dim=1) + self.bias
    return out
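
The comment in __init__ above notes that summing the selected embedding rows and adding a bias is equivalent to a one-hot encoding followed by a linear layer. A minimal sketch of that equivalence in plain PyTorch (illustrative names and sizes, not the library's internals):

>>> import torch
>>> import torch.nn as nn
>>> n_values, pred_dim = 6, 1
>>> emb = nn.Embedding(n_values, pred_dim)
>>> lin = nn.Linear(n_values, pred_dim, bias=False)
>>> lin.weight.data = emb.weight.data.T.clone()  # give both layers the same weights
>>> X = torch.tensor([[1, 3, 5]])  # three "active" feature values for one sample
>>> one_hot = torch.zeros(1, n_values).scatter_(1, X, 1.0)
>>> torch.allclose(emb(X).sum(dim=1), lin(one_hot))
True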

TabMlp

TabMlp(column_idx, *, cat_embed_input=None, cat_embed_dropout=None, use_cat_bias=None, cat_embed_activation=None, continuous_cols=None, cont_norm_layer=None, embed_continuous=None, embed_continuous_method=None, cont_embed_dim=None, cont_embed_dropout=None, cont_embed_activation=None, quantization_setup=None, n_frequencies=None, sigma=None, share_last_layer=None, full_embed_dropout=None, mlp_hidden_dims=[200, 100], mlp_activation='relu', mlp_dropout=0.1, mlp_batchnorm=False, mlp_batchnorm_last=False, mlp_linear_first=True)

Bases: BaseTabularModelWithoutAttention

Defines a TabMlp model that can be used as the deeptabular component of a Wide & Deep model or independently by itself.

This class combines embedding representations of the categorical features with numerical (aka continuous) features, embedded or not. These are then passed through a series of dense layers (i.e. an MLP).

Most of the parameters for this class are Optional since the use of categorical or continuous features is itself optional (i.e. one can use categorical features only, continuous features only, or both).

Parameters:

  • column_idx (Dict[str, int]) –

    Dict containing the index of the columns that will be passed through the TabMlp model. Required to slice the tensors. e.g. {'education': 0, 'relationship': 1, 'workclass': 2, ...}.

  • cat_embed_input (Optional[List[Tuple[str, int, int]]], default: None ) –

    List of Tuples with the column name, number of unique values and embedding dimension. e.g. [(education, 11, 32), ...]

  • cat_embed_dropout (Optional[float], default: None ) –

    Categorical embeddings dropout. If None, it will default to 0.

  • use_cat_bias (Optional[bool], default: None ) –

    Boolean indicating if bias will be used for the categorical embeddings. If None, it will default to 'False'.

  • cat_embed_activation (Optional[str], default: None ) –

    Activation function for the categorical embeddings, if any. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported

  • continuous_cols (Optional[List[str]], default: None ) –

    List with the name of the numeric (aka continuous) columns

  • cont_norm_layer (Optional[Literal[batchnorm, layernorm]], default: None ) –

    Type of normalization layer applied to the continuous features. Options are: 'layernorm' and 'batchnorm'. If None, no normalization layer will be used.

  • embed_continuous (Optional[bool], default: None ) –

    Boolean indicating if the continuous columns will be embedded using one of the available methods: 'standard', 'periodic' or 'piecewise'. If None, it will default to 'False'.
    ℹ️ NOTE: This parameter is deprecated and it will be removed in future releases. Please, use the embed_continuous_method parameter instead.

  • embed_continuous_method (Optional[Literal[standard, piecewise, periodic]], default: None ) –

    Method to use to embed the continuous features. Options are: 'standard', 'periodic' or 'piecewise'. The 'standard' embedding method is based on the FT-Transformer implementation presented in the paper: Revisiting Deep Learning Models for Tabular Data. The 'periodic' and 'piecewise' methods were presented in the paper: On Embeddings for Numerical Features in Tabular Deep Learning. Please read the papers for details.

  • cont_embed_dim (Optional[int], default: None ) –

    Size of the continuous embeddings. If the continuous columns are embedded, cont_embed_dim must be passed.

  • cont_embed_dropout (Optional[float], default: None ) –

    Dropout for the continuous embeddings. If None, it will default to 0.0

  • cont_embed_activation (Optional[str], default: None ) –

    Activation function for the continuous embeddings if any. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported. If None, no activation function will be applied.

  • quantization_setup (Optional[Dict[str, List[float]]], default: None ) –

    This parameter is used when the 'piecewise' method is used to embed the continuous cols. It is a dict where the keys are the names of the continuous columns and the values are lists with the boundaries for their quantization. See the examples for details. If the 'piecewise' method is used, this parameter is required (a hedged usage sketch follows the Examples section below).

  • n_frequencies (Optional[int], default: None ) –

    This is the so-called 'k' in the paper On Embeddings for Numerical Features in Tabular Deep Learning: the number of 'frequencies' that will be used to represent each continuous column. See Eq 2 in the paper for details. If the 'periodic' method is used, this parameter is required.

  • sigma (Optional[float], default: None ) –

    This is the sigma parameter in the same paper, used to initialise the 'frequency weights'. See Eq 2 in the paper for details. If the 'periodic' method is used, this parameter is required.

  • share_last_layer (Optional[bool], default: None ) –

    This parameter is not present in the aforementioned paper but is implemented in the official repo. If True, the linear layer that turns the frequencies into embeddings will be shared across the continuous columns. If False, a different linear layer will be used for each continuous column. If the 'periodic' method is used, this parameter is required.

  • full_embed_dropout (Optional[bool], default: None ) –

    If True, the full embedding corresponding to a column will be masked out/dropout. If None, it will default to False.

  • mlp_hidden_dims (List[int], default: [200, 100] ) –

    List with the number of neurons per dense layer in the mlp.

  • mlp_activation (str, default: 'relu' ) –

    Activation function for the dense layers of the MLP. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported

  • mlp_dropout (Union[float, List[float]], default: 0.1 ) –

    float or List of floats with the dropout between the dense layers. e.g: [0.5,0.5]

  • mlp_batchnorm (bool, default: False ) –

    Boolean indicating whether or not batch normalization will be applied to the dense layers

  • mlp_batchnorm_last (bool, default: False ) –

    Boolean indicating whether or not batch normalization will be applied to the last of the dense layers

  • mlp_linear_first (bool, default: True ) –

    Boolean indicating the order of the operations in the dense layer. If True: [LIN -> ACT -> BN -> DP]. If False: [BN -> DP -> LIN -> ACT]

Attributes:

  • encoder (Module) –

    mlp model that will receive the concatenation of the embeddings and the continuous columns

Examples:

>>> import torch
>>> from pytorch_widedeep.models import TabMlp
>>> X_tab = torch.cat((torch.empty(5, 4).random_(4), torch.rand(5, 1)), axis=1)
>>> colnames = ["a", "b", "c", "d", "e"]
>>> cat_embed_input = [(u, i, j) for u, i, j in zip(colnames[:4], [4] * 4, [8] * 4)]
>>> column_idx = {k: v for v, k in enumerate(colnames)}
>>> model = TabMlp(mlp_hidden_dims=[8, 4], column_idx=column_idx, cat_embed_input=cat_embed_input,
... continuous_cols=["e"])
>>> out = model(X_tab)
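
As referenced in the quantization_setup description above, a hedged sketch of embedding the continuous column with the 'piecewise' method (assuming that passing embed_continuous_method is enough to activate the continuous embeddings, and with made-up bin boundaries):

>>> model_pw = TabMlp(
...     column_idx=column_idx,
...     cat_embed_input=cat_embed_input,
...     continuous_cols=["e"],
...     embed_continuous_method="piecewise",
...     quantization_setup={"e": [0.0, 0.25, 0.5, 0.75, 1.0]},
...     cont_embed_dim=8,
...     mlp_hidden_dims=[8, 4],
... )
>>> out_pw = model_pw(X_tab)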
Source code in pytorch_widedeep/models/tabular/mlp/tab_mlp.py
def __init__(
    self,
    column_idx: Dict[str, int],
    *,
    cat_embed_input: Optional[List[Tuple[str, int, int]]] = None,
    cat_embed_dropout: Optional[float] = None,
    use_cat_bias: Optional[bool] = None,
    cat_embed_activation: Optional[str] = None,
    continuous_cols: Optional[List[str]] = None,
    cont_norm_layer: Optional[Literal["batchnorm", "layernorm"]] = None,
    embed_continuous: Optional[bool] = None,
    embed_continuous_method: Optional[
        Literal["standard", "piecewise", "periodic"]
    ] = None,
    cont_embed_dim: Optional[int] = None,
    cont_embed_dropout: Optional[float] = None,
    cont_embed_activation: Optional[str] = None,
    quantization_setup: Optional[Dict[str, List[float]]] = None,
    n_frequencies: Optional[int] = None,
    sigma: Optional[float] = None,
    share_last_layer: Optional[bool] = None,
    full_embed_dropout: Optional[bool] = None,
    mlp_hidden_dims: List[int] = [200, 100],
    mlp_activation: str = "relu",
    mlp_dropout: Union[float, List[float]] = 0.1,
    mlp_batchnorm: bool = False,
    mlp_batchnorm_last: bool = False,
    mlp_linear_first: bool = True,
):
    super(TabMlp, self).__init__(
        column_idx=column_idx,
        cat_embed_input=cat_embed_input,
        cat_embed_dropout=cat_embed_dropout,
        use_cat_bias=use_cat_bias,
        cat_embed_activation=cat_embed_activation,
        continuous_cols=continuous_cols,
        cont_norm_layer=cont_norm_layer,
        embed_continuous=embed_continuous,
        embed_continuous_method=embed_continuous_method,
        cont_embed_dim=cont_embed_dim,
        cont_embed_dropout=cont_embed_dropout,
        cont_embed_activation=cont_embed_activation,
        quantization_setup=quantization_setup,
        n_frequencies=n_frequencies,
        sigma=sigma,
        share_last_layer=share_last_layer,
        full_embed_dropout=full_embed_dropout,
    )

    self.mlp_hidden_dims = mlp_hidden_dims
    self.mlp_activation = mlp_activation
    self.mlp_dropout = mlp_dropout
    self.mlp_batchnorm = mlp_batchnorm
    self.mlp_batchnorm_last = mlp_batchnorm_last
    self.mlp_linear_first = mlp_linear_first

    # Embeddings are instantiated at the base model
    # Mlp
    mlp_input_dim = self.cat_out_dim + self.cont_out_dim
    mlp_hidden_dims = [mlp_input_dim] + mlp_hidden_dims
    self.encoder = MLP(
        mlp_hidden_dims,
        mlp_activation,
        mlp_dropout,
        mlp_batchnorm,
        mlp_batchnorm_last,
        mlp_linear_first,
    )

output_dim property

output_dim

The output dimension of the model. This property is required to build the WideDeep class
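
As an illustration of where this property is used, a minimal, hedged sketch of wrapping the TabMlp instance from the example above, together with a Wide component, into the WideDeep constructor (the Wide input_dim below is illustrative):

>>> from pytorch_widedeep.models import Wide, WideDeep
>>> wide = Wide(input_dim=16, pred_dim=1)  # illustrative input_dim
>>> wide_deep_model = WideDeep(wide=wide, deeptabular=model)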

TabMlpDecoder

TabMlpDecoder(embed_dim, mlp_hidden_dims=[100, 200], mlp_activation='relu', mlp_dropout=0.1, mlp_batchnorm=False, mlp_batchnorm_last=False, mlp_linear_first=True)

Bases: Module

Companion decoder model for the TabMlp model (which can be considered an encoder itself).

This class is designed to be used with the EncoderDecoderTrainer when using self-supervised pre-training (see the corresponding section in the docs). The TabMlpDecoder will receive the output from the MLP and 'reconstruct' the embeddings.

Parameters:

  • embed_dim (int) –

    Size of the embeddings tensor that needs to be reconstructed.

  • mlp_hidden_dims (List[int], default: [100, 200] ) –

    List with the number of neurons per dense layer in the mlp.

  • mlp_activation (str, default: 'relu' ) –

    Activation function for the dense layers of the MLP. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported

  • mlp_dropout (Union[float, List[float]], default: 0.1 ) –

    float or List of floats with the dropout between the dense layers. e.g: [0.5,0.5]

  • mlp_batchnorm (bool, default: False ) –

    Boolean indicating whether or not batch normalization will be applied to the dense layers

  • mlp_batchnorm_last (bool, default: False ) –

    Boolean indicating whether or not batch normalization will be applied to the last of the dense layers

  • mlp_linear_first (bool, default: True ) –

    Boolean indicating the order of the operations in the dense layer. If True: [LIN -> ACT -> BN -> DP]. If False: [BN -> DP -> LIN -> ACT]

Attributes:

  • decoder (Module) –

    mlp model that will receive the output of the encoder

Examples:

>>> import torch
>>> from pytorch_widedeep.models import TabMlpDecoder
>>> x_inp = torch.rand(3, 8)
>>> decoder = TabMlpDecoder(embed_dim=32, mlp_hidden_dims=[8,16])
>>> res = decoder(x_inp)
>>> res.shape
torch.Size([3, 32])
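
Since the decoder's MLP ends in embed_dim, its mlp_hidden_dims are typically the mirror image of the encoder's, so the encoder output is expanded back to the embedding size. A hedged sketch with illustrative sizes:

>>> encoder_out = torch.rand(3, 4)  # e.g. the output of a TabMlp with mlp_hidden_dims=[8, 4]
>>> decoder = TabMlpDecoder(embed_dim=32, mlp_hidden_dims=[4, 8])
>>> decoder(encoder_out).shape
torch.Size([3, 32])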
Source code in pytorch_widedeep/models/tabular/mlp/tab_mlp.py
def __init__(
    self,
    embed_dim: int,
    mlp_hidden_dims: List[int] = [100, 200],
    mlp_activation: str = "relu",
    mlp_dropout: Union[float, List[float]] = 0.1,
    mlp_batchnorm: bool = False,
    mlp_batchnorm_last: bool = False,
    mlp_linear_first: bool = True,
):
    super(TabMlpDecoder, self).__init__()

    self.embed_dim = embed_dim

    self.mlp_hidden_dims = mlp_hidden_dims
    self.mlp_activation = mlp_activation
    self.mlp_dropout = mlp_dropout
    self.mlp_batchnorm = mlp_batchnorm
    self.mlp_batchnorm_last = mlp_batchnorm_last
    self.mlp_linear_first = mlp_linear_first

    self.decoder = MLP(
        mlp_hidden_dims + [self.embed_dim],
        mlp_activation,
        mlp_dropout,
        mlp_batchnorm,
        mlp_batchnorm_last,
        mlp_linear_first,
    )

TabResnet

TabResnet(column_idx, *, cat_embed_input=None, cat_embed_dropout=None, use_cat_bias=None, cat_embed_activation=None, continuous_cols=None, cont_norm_layer=None, embed_continuous=None, embed_continuous_method=None, cont_embed_dim=None, cont_embed_dropout=None, cont_embed_activation=None, quantization_setup=None, n_frequencies=None, sigma=None, share_last_layer=None, full_embed_dropout=None, blocks_dims=[200, 100, 100], blocks_dropout=0.1, simplify_blocks=False, mlp_hidden_dims=None, mlp_activation=None, mlp_dropout=None, mlp_batchnorm=None, mlp_batchnorm_last=None, mlp_linear_first=None)

Bases: BaseTabularModelWithoutAttention

Defines a TabResnet model that can be used as the deeptabular component of a Wide & Deep model or independently by itself.

This class combines embedding representations of the categorical features with numerical (aka continuous) features, embedded or not. These are then passed through a series of Resnet blocks. See pytorch_widedeep.models.tab_resnet._layers for details on the structure of each block.

Most of the parameters for this class are Optional since the use of categorical or continuous features is itself optional (i.e. one can use categorical features only, continuous features only, or both).

Parameters:

  • column_idx (Dict[str, int]) –

    Dict containing the index of the columns that will be passed through the model. Required to slice the tensors. e.g. {'education': 0, 'relationship': 1, 'workclass': 2, ...}.

  • cat_embed_input (Optional[List[Tuple[str, int, int]]], default: None ) –

    List of Tuples with the column name, number of unique values and embedding dimension. e.g. [(education, 11, 32), ...]

  • cat_embed_dropout (Optional[float], default: None ) –

    Categorical embeddings dropout. If None, it will default to 0.

  • use_cat_bias (Optional[bool], default: None ) –

    Boolean indicating if bias will be used for the categorical embeddings. If None, it will default to 'False'.

  • cat_embed_activation (Optional[str], default: None ) –

    Activation function for the categorical embeddings, if any. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported

  • continuous_cols (Optional[List[str]], default: None ) –

    List with the name of the numeric (aka continuous) columns

  • cont_norm_layer (Optional[Literal[batchnorm, layernorm]], default: None ) –

    Type of normalization layer applied to the continuous features. Options are: 'layernorm' and 'batchnorm'. If None, no normalization layer will be used.

  • embed_continuous (Optional[bool], default: None ) –

    Boolean indicating if the continuous columns will be embedded using one of the available methods: 'standard', 'periodic' or 'piecewise'. If None, it will default to 'False'.
    ℹ️ NOTE: This parameter is deprecated and it will be removed in future releases. Please, use the embed_continuous_method parameter instead.

  • embed_continuous_method (Optional[Literal[standard, piecewise, periodic]], default: None ) –

    Method to use to embed the continuous features. Options are: 'standard', 'periodic' or 'piecewise'. The 'standard' embedding method is based on the FT-Transformer implementation presented in the paper: Revisiting Deep Learning Models for Tabular Data. The 'periodic' and 'piecewise' methods were presented in the paper: On Embeddings for Numerical Features in Tabular Deep Learning. Please read the papers for details.

  • cont_embed_dim (Optional[int], default: None ) –

    Size of the continuous embeddings. If the continuous columns are embedded, cont_embed_dim must be passed.

  • cont_embed_dropout (Optional[float], default: None ) –

    Dropout for the continuous embeddings. If None, it will default to 0.0

  • cont_embed_activation (Optional[str], default: None ) –

    Activation function for the continuous embeddings if any. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported. If None, no activation function will be applied.

  • quantization_setup (Optional[Dict[str, List[float]]], default: None ) –

    This parameter is used when the 'piecewise' method is used to embed the continuous cols. It is a dict where the keys are the names of the continuous columns and the values are lists with the boundaries for their quantization. See the examples for details. If the 'piecewise' method is used, this parameter is required.

  • n_frequencies (Optional[int], default: None ) –

    This is the so-called 'k' in the paper On Embeddings for Numerical Features in Tabular Deep Learning: the number of 'frequencies' that will be used to represent each continuous column. See Eq 2 in the paper for details. If the 'periodic' method is used, this parameter is required.

  • sigma (Optional[float], default: None ) –

    This is the sigma parameter in the same paper, used to initialise the 'frequency weights'. See Eq 2 in the paper for details. If the 'periodic' method is used, this parameter is required.

  • share_last_layer (Optional[bool], default: None ) –

    This parameter is not present in the aforementioned paper but is implemented in the official repo. If True, the linear layer that turns the frequencies into embeddings will be shared across the continuous columns. If False, a different linear layer will be used for each continuous column. If the 'periodic' method is used, this parameter is required (a hedged sketch using the 'periodic' method follows the Examples section below).

  • full_embed_dropout (Optional[bool], default: None ) –

    If True, the full embedding corresponding to a column will be masked out/dropout. If None, it will default to False.

  • blocks_dims (List[int], default: [200, 100, 100] ) –

    List of integers that define the input and output units of each block. For example: [200, 100, 100] will generate 2 blocks. The first will receive a tensor of size 200 and output a tensor of size 100, and the second will receive a tensor of size 100 and output a tensor of size 100. See pytorch_widedeep.models.tab_resnet._layers for details on the structure of each block.

  • blocks_dropout (float, default: 0.1 ) –

    Block's internal dropout.

  • simplify_blocks (bool, default: False ) –

    Boolean indicating if the simplest possible residual blocks (X -> [ [LIN, BN, ACT] + X ]) will be used instead of a standard one (X -> [ [LIN1, BN1, ACT1] -> [LIN2, BN2] + X ]).

  • mlp_hidden_dims (Optional[List[int]], default: None ) –

    List with the number of neurons per dense layer in the MLP. e.g: [64, 32]. If None the output of the Resnet Blocks will be connected directly to the output neuron(s).

  • mlp_activation (Optional[str], default: None ) –

    Activation function for the dense layers of the MLP. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported. If 'mlp_hidden_dims' is not None and this parameter is None, it will default to 'relu'.

  • mlp_dropout (Optional[float], default: None ) –

    float with the dropout between the dense layers of the MLP. If 'mlp_hidden_dims' is not None and this parameter is None, it will default to 0.0.

  • mlp_batchnorm (Optional[bool], default: None ) –

    Boolean indicating whether or not batch normalization will be applied to the dense layers. If 'mlp_hidden_dims' is not None and this parameter is None, it will default to False.

  • mlp_batchnorm_last (Optional[bool], default: None ) –

    Boolean indicating whether or not batch normalization will be applied to the last of the dense layers. If 'mlp_hidden_dims' is not None and this parameter is None, it will default to False.

  • mlp_linear_first (Optional[bool], default: None ) –

    Boolean indicating the order of the operations in the dense layer. If True: [LIN -> ACT -> BN -> DP]. If False: [BN -> DP -> LIN -> ACT]. If 'mlp_hidden_dims' is not None and this parameter is None, it will default to True.

Attributes:

  • encoder (Module) –

    deep dense Resnet model that will receive the concatenation of the embeddings and the continuous columns

  • mlp (Module) –

    if mlp_hidden_dims is not None, this attribute will be an MLP model that will receive the output of the Resnet blocks.

Examples:

>>> import torch
>>> from pytorch_widedeep.models import TabResnet
>>> X_deep = torch.cat((torch.empty(5, 4).random_(4), torch.rand(5, 1)), axis=1)
>>> colnames = ['a', 'b', 'c', 'd', 'e']
>>> cat_embed_input = [(u,i,j) for u,i,j in zip(colnames[:4], [4]*4, [8]*4)]
>>> column_idx = {k:v for v,k in enumerate(colnames)}
>>> model = TabResnet(blocks_dims=[16,4], column_idx=column_idx, cat_embed_input=cat_embed_input,
... continuous_cols = ['e'])
>>> out = model(X_deep)
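
As referenced in the share_last_layer description above, a hedged sketch of embedding the continuous column with the 'periodic' method (assuming, as with TabMlp, that passing embed_continuous_method is enough to activate the continuous embeddings; the n_frequencies, sigma and cont_embed_dim values are made up for illustration):

>>> model_per = TabResnet(
...     blocks_dims=[16, 4],
...     column_idx=column_idx,
...     cat_embed_input=cat_embed_input,
...     continuous_cols=["e"],
...     embed_continuous_method="periodic",
...     n_frequencies=4,
...     sigma=0.1,
...     share_last_layer=False,
...     cont_embed_dim=8,
... )
>>> out_per = model_per(X_deep)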
Source code in pytorch_widedeep/models/tabular/resnet/tab_resnet.py
def __init__(
    self,
    column_idx: Dict[str, int],
    *,
    cat_embed_input: Optional[List[Tuple[str, int, int]]] = None,
    cat_embed_dropout: Optional[float] = None,
    use_cat_bias: Optional[bool] = None,
    cat_embed_activation: Optional[str] = None,
    continuous_cols: Optional[List[str]] = None,
    cont_norm_layer: Optional[Literal["batchnorm", "layernorm"]] = None,
    embed_continuous: Optional[bool] = None,
    embed_continuous_method: Optional[
        Literal["standard", "piecewise", "periodic"]
    ] = None,
    cont_embed_dim: Optional[int] = None,
    cont_embed_dropout: Optional[float] = None,
    cont_embed_activation: Optional[str] = None,
    quantization_setup: Optional[Dict[str, List[float]]] = None,
    n_frequencies: Optional[int] = None,
    sigma: Optional[float] = None,
    share_last_layer: Optional[bool] = None,
    full_embed_dropout: Optional[bool] = None,
    blocks_dims: List[int] = [200, 100, 100],
    blocks_dropout: float = 0.1,
    simplify_blocks: bool = False,
    mlp_hidden_dims: Optional[List[int]] = None,
    mlp_activation: Optional[str] = None,
    mlp_dropout: Optional[float] = None,
    mlp_batchnorm: Optional[bool] = None,
    mlp_batchnorm_last: Optional[bool] = None,
    mlp_linear_first: Optional[bool] = None,
):
    super(TabResnet, self).__init__(
        column_idx=column_idx,
        cat_embed_input=cat_embed_input,
        cat_embed_dropout=cat_embed_dropout,
        use_cat_bias=use_cat_bias,
        cat_embed_activation=cat_embed_activation,
        continuous_cols=continuous_cols,
        cont_norm_layer=cont_norm_layer,
        embed_continuous=embed_continuous,
        embed_continuous_method=embed_continuous_method,
        cont_embed_dim=cont_embed_dim,
        cont_embed_dropout=cont_embed_dropout,
        cont_embed_activation=cont_embed_activation,
        quantization_setup=quantization_setup,
        n_frequencies=n_frequencies,
        sigma=sigma,
        share_last_layer=share_last_layer,
        full_embed_dropout=full_embed_dropout,
    )

    if len(blocks_dims) < 2:
        raise ValueError(
            "'blocks' must contain at least two elements, e.g. [256, 128]"
        )

    self.blocks_dims = blocks_dims
    self.blocks_dropout = blocks_dropout
    self.simplify_blocks = simplify_blocks

    self.mlp_hidden_dims = mlp_hidden_dims
    self.mlp_activation = mlp_activation
    self.mlp_dropout = mlp_dropout
    self.mlp_batchnorm = mlp_batchnorm
    self.mlp_batchnorm_last = mlp_batchnorm_last
    self.mlp_linear_first = mlp_linear_first

    # Embeddings are instantiated at the base model

    # Resnet
    dense_resnet_input_dim = self.cat_out_dim + self.cont_out_dim
    self.encoder = DenseResnet(
        dense_resnet_input_dim, blocks_dims, blocks_dropout, self.simplify_blocks
    )

    # Mlp: adding an MLP on top of the Resnet blocks is optional and
    # therefore all related params are optional
    if self.mlp_hidden_dims is not None:
        self.mlp = MLP(
            d_hidden=[self.blocks_dims[-1]] + self.mlp_hidden_dims,
            activation=(
                "relu" if self.mlp_activation is None else self.mlp_activation
            ),
            dropout=0.0 if self.mlp_dropout is None else self.mlp_dropout,
            batchnorm=False if self.mlp_batchnorm is None else self.mlp_batchnorm,
            batchnorm_last=(
                False
                if self.mlp_batchnorm_last is None
                else self.mlp_batchnorm_last
            ),
            linear_first=(
                True if self.mlp_linear_first is None else self.mlp_linear_first
            ),
        )
    else:
        self.mlp = None

output_dim property

output_dim

The output dimension of the model. This property is required to build the WideDeep class

TabResnetDecoder

TabResnetDecoder(embed_dim, blocks_dims=[100, 100, 200], blocks_dropout=0.1, simplify_blocks=False, mlp_hidden_dims=None, mlp_activation=None, mlp_dropout=None, mlp_batchnorm=None, mlp_batchnorm_last=None, mlp_linear_first=None)

Bases: Module

Companion decoder model for the TabResnet model (which can be considered an encoder itself)

This class is designed to be used with the EncoderDecoderTrainer when using self-supervised pre-training (see the corresponding section in the docs). It will receive the output from the ResNet blocks or the MLP (if present) and 'reconstruct' the embeddings.

Parameters:

  • embed_dim (int) –

    Size of the embeddings tensor to be reconstructed.

  • blocks_dims (List[int], default: [100, 100, 200] ) –

    List of integers that define the input and output units of each block. For example: [200, 100, 100] will generate 2 blocks. The first will receive a tensor of size 200 and output a tensor of size 100, and the second will receive a tensor of size 100 and output a tensor of size 100. See pytorch_widedeep.models.tab_resnet._layers for details on the structure of each block.

  • blocks_dropout (float, default: 0.1 ) –

    Block's internal dropout.

  • simplify_blocks (bool, default: False ) –

    Boolean indicating if the simplest possible residual blocks (X -> [ [LIN, BN, ACT] + X ]) will be used instead of a standard one (X -> [ [LIN1, BN1, ACT1] -> [LIN2, BN2] + X ]).

  • mlp_hidden_dims (Optional[List[int]], default: None ) –

    List with the number of neurons per dense layer in the MLP. e.g: [64, 32]. If None the output of the Resnet Blocks will be connected directly to the output neuron(s).

  • mlp_activation (Optional[str], default: None ) –

    Activation function for the dense layers of the MLP. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported. If 'mlp_hidden_dims' is not None and this parameter is None, it will default to 'relu'.

  • mlp_dropout (Optional[float], default: None ) –

    float with the dropout between the dense layers of the MLP. If 'mlp_hidden_dims' is not None and this parameter is None, it will default to 0.0.

  • mlp_batchnorm (Optional[bool], default: None ) –

    Boolean indicating whether or not batch normalization will be applied to the dense layers. If 'mlp_hidden_dims' is not None and this parameter is None, it will default to False.

  • mlp_batchnorm_last (Optional[bool], default: None ) –

    Boolean indicating whether or not batch normalization will be applied to the last of the dense layers. If 'mlp_hidden_dims' is not None and this parameter is None, it will default to False.

  • mlp_linear_first (Optional[bool], default: None ) –

    Boolean indicating the order of the operations in the dense layer. If True: [LIN -> ACT -> BN -> DP]. If False: [BN -> DP -> LIN -> ACT]. If 'mlp_hidden_dims' is not None and this parameter is None, it will default to True.

Attributes:

  • decoder (Module) –

    deep dense Resnet model that will receive the output of the encoder IF mlp_hidden_dims is None

  • mlp (Module) –

    if mlp_hidden_dims is not None, the overall decoder will consist of an MLP that receives the output of the encoder, followed by the deep dense Resnet.

Examples:

>>> import torch
>>> from pytorch_widedeep.models import TabResnetDecoder
>>> x_inp = torch.rand(3, 8)
>>> decoder = TabResnetDecoder(embed_dim=32, blocks_dims=[8, 16, 16])
>>> res = decoder(x_inp)
>>> res.shape
torch.Size([3, 32])
Source code in pytorch_widedeep/models/tabular/resnet/tab_resnet.py
def __init__(
    self,
    embed_dim: int,
    blocks_dims: List[int] = [100, 100, 200],
    blocks_dropout: float = 0.1,
    simplify_blocks: bool = False,
    mlp_hidden_dims: Optional[List[int]] = None,
    mlp_activation: Optional[str] = None,
    mlp_dropout: Optional[float] = None,
    mlp_batchnorm: Optional[bool] = None,
    mlp_batchnorm_last: Optional[bool] = None,
    mlp_linear_first: Optional[bool] = None,
):
    super(TabResnetDecoder, self).__init__()

    if len(blocks_dims) < 2:
        raise ValueError(
            "'blocks' must contain at least two elements, e.g. [256, 128]"
        )

    self.embed_dim = embed_dim

    self.blocks_dims = blocks_dims
    self.blocks_dropout = blocks_dropout
    self.simplify_blocks = simplify_blocks

    self.mlp_hidden_dims = mlp_hidden_dims
    self.mlp_activation = mlp_activation
    self.mlp_dropout = mlp_dropout
    self.mlp_batchnorm = mlp_batchnorm
    self.mlp_batchnorm_last = mlp_batchnorm_last
    self.mlp_linear_first = mlp_linear_first

    if self.mlp_hidden_dims is not None:
        self.mlp = MLP(
            d_hidden=[self.mlp_first_hidden_dim] + self.mlp_hidden_dims,
            activation=(
                "relu" if self.mlp_activation is None else self.mlp_activation
            ),
            dropout=0.0 if self.mlp_dropout is None else self.mlp_dropout,
            batchnorm=False if self.mlp_batchnorm is None else self.mlp_batchnorm,
            batchnorm_last=(
                False
                if self.mlp_batchnorm_last is None
                else self.mlp_batchnorm_last
            ),
            linear_first=(
                True if self.mlp_linear_first is None else self.mlp_linear_first
            ),
        )
        self.decoder = DenseResnet(
            self.mlp_hidden_dims[-1],
            blocks_dims,
            blocks_dropout,
            self.simplify_blocks,
        )
    else:
        self.mlp = None
        self.decoder = DenseResnet(
            blocks_dims[0], blocks_dims, blocks_dropout, self.simplify_blocks
        )

    self.reconstruction_layer = nn.Linear(blocks_dims[-1], embed_dim, bias=False)

TabNet

TabNet(column_idx, *, cat_embed_input=None, cat_embed_dropout=None, use_cat_bias=None, cat_embed_activation=None, continuous_cols=None, cont_norm_layer=None, embed_continuous=None, embed_continuous_method=None, cont_embed_dim=None, cont_embed_dropout=None, cont_embed_activation=None, quantization_setup=None, n_frequencies=None, sigma=None, share_last_layer=None, full_embed_dropout=None, n_steps=3, step_dim=8, attn_dim=8, dropout=0.0, n_glu_step_dependent=2, n_glu_shared=2, ghost_bn=True, virtual_batch_size=128, momentum=0.02, gamma=1.3, epsilon=1e-15, mask_type='sparsemax')

Bases: BaseTabularModelWithoutAttention

Defines a TabNet model that can be used as the deeptabular component of a Wide & Deep model or independently by itself.

The implementation in this library is fully based on the one by the dreamquark-ai team, simply adapted so that it can work within the WideDeep framework. Therefore, ALL CREDIT TO THE DREAMQUARK-AI TEAM.

Parameters:

  • column_idx (Dict[str, int]) –

    Dict containing the index of the columns that will be passed through the model. Required to slice the tensors. e.g. {'education': 0, 'relationship': 1, 'workclass': 2, ...}.

  • cat_embed_input (Optional[List[Tuple[str, int, int]]], default: None ) –

    List of Tuples with the column name, number of unique values and embedding dimension. e.g. [(education, 11, 32), ...]

  • cat_embed_dropout (Optional[float], default: None ) –

    Categorical embeddings dropout. If None, it will default to 0.

  • use_cat_bias (Optional[bool], default: None ) –

    Boolean indicating if bias will be used for the categorical embeddings. If None, it will default to 'False'.

  • cat_embed_activation (Optional[str], default: None ) –

    Activation function for the categorical embeddings, if any. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported

  • continuous_cols (Optional[List[str]], default: None ) –

    List with the name of the numeric (aka continuous) columns

  • cont_norm_layer (Optional[Literal[batchnorm, layernorm]], default: None ) –

    Type of normalization layer applied to the continuous features. Options are: 'layernorm' and 'batchnorm'. If None, no normalization layer will be used.

  • embed_continuous (Optional[bool], default: None ) –

    Boolean indicating if the continuous columns will be embedded using one of the available methods: 'standard', 'periodic' or 'piecewise'. If None, it will default to 'False'.
    ℹ️ NOTE: This parameter is deprecated and it will be removed in future releases. Please, use the embed_continuous_method parameter instead.

  • embed_continuous_method (Optional[Literal[standard, piecewise, periodic]], default: None ) –

    Method to use to embed the continuous features. Options are: 'standard', 'periodic' or 'piecewise'. The 'standard' embedding method is based on the FT-Transformer implementation presented in the paper: Revisiting Deep Learning Models for Tabular Data. The 'periodic' and 'piecewise' methods were presented in the paper: On Embeddings for Numerical Features in Tabular Deep Learning. Please read the papers for details.

  • cont_embed_dim (Optional[int], default: None ) –

    Size of the continuous embeddings. If the continuous columns are embedded, cont_embed_dim must be passed.

  • cont_embed_dropout (Optional[float], default: None ) –

    Dropout for the continuous embeddings. If None, it will default to 0.0

  • cont_embed_activation (Optional[str], default: None ) –

    Activation function for the continuous embeddings if any. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported. If None, no activation function will be applied.

  • quantization_setup (Optional[Dict[str, List[float]]], default: None ) –

    This parameter is used when the 'piecewise' method is used to embed the continuous cols. It is a dict where the keys are the names of the continuous columns and the values are lists with the boundaries for their quantization. See the examples for details. If the 'piecewise' method is used, this parameter is required.

  • n_frequencies (Optional[int], default: None ) –

    This is the so-called 'k' in the paper On Embeddings for Numerical Features in Tabular Deep Learning: the number of 'frequencies' that will be used to represent each continuous column. See Eq 2 in the paper for details. If the 'periodic' method is used, this parameter is required.

  • sigma (Optional[float], default: None ) –

    This is the sigma parameter in the same paper, used to initialise the 'frequency weights'. See Eq 2 in the paper for details. If the 'periodic' method is used, this parameter is required.

  • share_last_layer (Optional[bool], default: None ) –

    This parameter is not present in the aforementioned paper but is implemented in the official repo. If True, the linear layer that turns the frequencies into embeddings will be shared across the continuous columns. If False, a different linear layer will be used for each continuous column. If the 'periodic' method is used, this parameter is required.

  • full_embed_dropout (Optional[bool], default: None ) –

    If True, the full embedding corresponding to a column will be masked out/dropout. If None, it will default to False.

  • n_steps (int, default: 3 ) –

    number of decision steps. For a better understanding of the function of n_steps and the upcoming parameters, please see the paper.

  • step_dim (int, default: 8 ) –

    Step's output dimension. This is the output dimension that WideDeep will collect and connect to the output neuron(s).

  • attn_dim (int, default: 8 ) –

    Attention dimension

  • dropout (float, default: 0.0 ) –

    GLU block's internal dropout

  • n_glu_step_dependent (int, default: 2 ) –

    number of GLU Blocks ([FC -> BN -> GLU]) that are step dependent

  • n_glu_shared (int, default: 2 ) –

    number of GLU Blocks ([FC -> BN -> GLU]) that will be shared across decision steps

  • ghost_bn (bool, default: True ) –

    Boolean indicating if Ghost Batch Normalization will be used.

  • virtual_batch_size (int, default: 128 ) –

    Batch size when using Ghost Batch Normalization

  • momentum (float, default: 0.02 ) –

    Ghost Batch Normalization's momentum. The dreamquark-ai team advises very low values; however, high values are used in the original publication. In our tests, higher values led to better results

  • gamma (float, default: 1.3 ) –

    Relaxation parameter in the paper. When gamma = 1, a feature is enforced to be used only at one decision step. As gamma increases, more flexibility is provided to use a feature at multiple decision steps

  • epsilon (float, default: 1e-15 ) –

    Float to avoid log(0). Always keep low

  • mask_type (str, default: 'sparsemax' ) –

    Mask function to use. Either 'sparsemax' or 'entmax'

Attributes:

  • encoder (Module) –

    the TabNet encoder that will receive the embeddings and run the sequence of decision steps

Examples:

>>> import torch
>>> from pytorch_widedeep.models import TabNet
>>> X_tab = torch.cat((torch.empty(5, 4).random_(4), torch.rand(5, 1)), axis=1)
>>> colnames = ["a", "b", "c", "d", "e"]
>>> cat_embed_input = [(u, i, j) for u, i, j in zip(colnames[:4], [4] * 4, [8] * 4)]
>>> column_idx = {k: v for v, k in enumerate(colnames)}
>>> model = TabNet(column_idx=column_idx, cat_embed_input=cat_embed_input, continuous_cols=["e"])
>>> out = model(X_tab)
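
A hedged variant of the example above that disables Ghost Batch Normalization (which otherwise normalises over virtual_batch_size samples), e.g. when experimenting with very small batches such as the toy tensors here:

>>> model_no_gbn = TabNet(column_idx=column_idx, cat_embed_input=cat_embed_input,
...     continuous_cols=["e"], ghost_bn=False)
>>> out_no_gbn = model_no_gbn(X_tab)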
Source code in pytorch_widedeep/models/tabular/tabnet/tab_net.py
def __init__(
    self,
    column_idx: Dict[str, int],
    *,
    cat_embed_input: Optional[List[Tuple[str, int, int]]] = None,
    cat_embed_dropout: Optional[float] = None,
    use_cat_bias: Optional[bool] = None,
    cat_embed_activation: Optional[str] = None,
    continuous_cols: Optional[List[str]] = None,
    cont_norm_layer: Optional[Literal["batchnorm", "layernorm"]] = None,
    embed_continuous: Optional[bool] = None,
    embed_continuous_method: Optional[
        Literal["standard", "piecewise", "periodic"]
    ] = None,
    cont_embed_dim: Optional[int] = None,
    cont_embed_dropout: Optional[float] = None,
    cont_embed_activation: Optional[str] = None,
    quantization_setup: Optional[Dict[str, List[float]]] = None,
    n_frequencies: Optional[int] = None,
    sigma: Optional[float] = None,
    share_last_layer: Optional[bool] = None,
    full_embed_dropout: Optional[bool] = None,
    n_steps: int = 3,
    step_dim: int = 8,
    attn_dim: int = 8,
    dropout: float = 0.0,
    n_glu_step_dependent: int = 2,
    n_glu_shared: int = 2,
    ghost_bn: bool = True,
    virtual_batch_size: int = 128,
    momentum: float = 0.02,
    gamma: float = 1.3,
    epsilon: float = 1e-15,
    mask_type: str = "sparsemax",
):
    super(TabNet, self).__init__(
        column_idx=column_idx,
        cat_embed_input=cat_embed_input,
        cat_embed_dropout=cat_embed_dropout,
        use_cat_bias=use_cat_bias,
        cat_embed_activation=cat_embed_activation,
        continuous_cols=continuous_cols,
        cont_norm_layer=cont_norm_layer,
        embed_continuous=embed_continuous,
        embed_continuous_method=embed_continuous_method,
        cont_embed_dim=cont_embed_dim,
        cont_embed_dropout=cont_embed_dropout,
        cont_embed_activation=cont_embed_activation,
        quantization_setup=quantization_setup,
        n_frequencies=n_frequencies,
        sigma=sigma,
        share_last_layer=share_last_layer,
        full_embed_dropout=full_embed_dropout,
    )

    self.n_steps = n_steps
    self.step_dim = step_dim
    self.attn_dim = attn_dim
    self.dropout = dropout
    self.n_glu_step_dependent = n_glu_step_dependent
    self.n_glu_shared = n_glu_shared
    self.ghost_bn = ghost_bn
    self.virtual_batch_size = virtual_batch_size
    self.momentum = momentum
    self.gamma = gamma
    self.epsilon = epsilon
    self.mask_type = mask_type

    # Embeddings are instantiated at the base model
    self.embed_out_dim = self.cat_out_dim + self.cont_out_dim

    # TabNet
    self.encoder = TabNetEncoder(
        self.embed_out_dim,
        n_steps,
        step_dim,
        attn_dim,
        dropout,
        n_glu_step_dependent,
        n_glu_shared,
        ghost_bn,
        virtual_batch_size,
        momentum,
        gamma,
        epsilon,
        mask_type,
    )

output_dim property

output_dim

The output dimension of the model. This property is required to build the WideDeep class

TabNetDecoder

TabNetDecoder(embed_dim, n_steps=3, step_dim=8, dropout=0.0, n_glu_step_dependent=2, n_glu_shared=2, ghost_bn=True, virtual_batch_size=128, momentum=0.02)

Bases: Module

Companion decoder model for the TabNet model (which can be considered an encoder itself)

This class is designed to be used with the EncoderDecoderTrainer when using self-supervised pre-training (see the corresponding section in the docs). It will receive the output from the TabNet encoder (i.e. the output from the so-called 'steps') and 'reconstruct' the embeddings.

Parameters:

  • embed_dim (int) –

    Size of the embeddings tensor to be reconstructed.

  • n_steps (int, default: 3 ) –

    number of decision steps. For a better understanding of the function of n_steps and the upcoming parameters, please see the paper.

  • step_dim (int, default: 8 ) –

    Step's output dimension. This is the output dimension that WideDeep will collect and connect to the output neuron(s).

  • dropout (float, default: 0.0 ) –

    GLU block's internal dropout

  • n_glu_step_dependent (int, default: 2 ) –

    number of GLU Blocks ([FC -> BN -> GLU]) that are step dependent

  • n_glu_shared (int, default: 2 ) –

    number of GLU Blocks ([FC -> BN -> GLU]) that will be shared across decision steps

  • ghost_bn (bool, default: True ) –

    Boolean indicating if Ghost Batch Normalization will be used.

  • virtual_batch_size (int, default: 128 ) –

    Batch size when using Ghost Batch Normalization

  • momentum (float, default: 0.02 ) –

    Ghost Batch Normalization's momentum. The dreamquark-ai team advises very low values; however, high values are used in the original publication. In our tests, higher values led to better results

Attributes:

  • decoder (Module) –

    decoder that will receive the output from the encoder's steps and will reconstruct the embeddings

Examples:

>>> import torch
>>> from pytorch_widedeep.models import TabNetDecoder
>>> x_inp = [torch.rand(3, 8), torch.rand(3, 8), torch.rand(3, 8)]
>>> decoder = TabNetDecoder(embed_dim=32, ghost_bn=False)
>>> res = decoder(x_inp)
>>> res.shape
torch.Size([3, 32])
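
A hedged variant with non-default step settings, assuming (as in the example above) that the decoder expects a list with one tensor of shape (batch_size, step_dim) per decision step:

>>> x_inp = [torch.rand(3, 16) for _ in range(5)]
>>> decoder = TabNetDecoder(embed_dim=32, n_steps=5, step_dim=16, ghost_bn=False)
>>> decoder(x_inp).shape
torch.Size([3, 32])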
Source code in pytorch_widedeep/models/tabular/tabnet/tab_net.py
def __init__(
    self,
    embed_dim: int,
    n_steps: int = 3,
    step_dim: int = 8,
    dropout: float = 0.0,
    n_glu_step_dependent: int = 2,
    n_glu_shared: int = 2,
    ghost_bn: bool = True,
    virtual_batch_size: int = 128,
    momentum: float = 0.02,
):
    super(TabNetDecoder, self).__init__()

    self.n_steps = n_steps
    self.step_dim = step_dim
    self.dropout = dropout
    self.n_glu_step_dependent = n_glu_step_dependent
    self.n_glu_shared = n_glu_shared
    self.ghost_bn = ghost_bn
    self.virtual_batch_size = virtual_batch_size
    self.momentum = momentum

    shared_layers = nn.ModuleList()
    for i in range(n_glu_shared):
        if i == 0:
            shared_layers.append(nn.Linear(step_dim, 2 * step_dim, bias=False))
        else:
            shared_layers.append(nn.Linear(step_dim, 2 * step_dim, bias=False))

    self.decoder = nn.ModuleList()
    for step in range(n_steps):
        transformer = FeatTransformer(
            step_dim,
            step_dim,
            dropout,
            shared_layers,
            n_glu_step_dependent,
            ghost_bn,
            virtual_batch_size,
            momentum=momentum,
        )
        self.decoder.append(transformer)

    self.reconstruction_layer = nn.Linear(step_dim, embed_dim, bias=False)
    initialize_non_glu(self.reconstruction_layer, step_dim, embed_dim)

ContextAttentionMLP

ContextAttentionMLP(column_idx, *, cat_embed_input=None, cat_embed_dropout=None, use_cat_bias=None, cat_embed_activation=None, shared_embed=None, add_shared_embed=None, frac_shared_embed=None, continuous_cols=None, cont_norm_layer=None, embed_continuous_method='standard', cont_embed_dropout=None, cont_embed_activation=None, quantization_setup=None, n_frequencies=None, sigma=None, share_last_layer=None, full_embed_dropout=None, input_dim=32, attn_dropout=0.2, with_addnorm=False, attn_activation='leaky_relu', n_blocks=3)

Bases: BaseTabularModelWithAttention

Defines a ContextAttentionMLP model that can be used as the deeptabular component of a Wide & Deep model or independently by itself.

This class combines embedding representations of the categorical features with numerical (aka continuous) features that are also embedded. These are then passed through a series of attention blocks. Each attention block is composed of a ContextAttentionEncoder, which is in part inspired by the attention mechanism described in Hierarchical Attention Networks for Document Classification. See pytorch_widedeep.models.tabular.mlp._attention_layers for details.

Most of the parameters for this class are Optional since the use of categorical or continuous features is itself optional (i.e. one can use categorical features only, continuous features only, or both).

Parameters:

  • column_idx (Dict[str, int]) –

    Dict containing the index of the columns that will be passed through the model. Required to slice the tensors. e.g. {'education': 0, 'relationship': 1, 'workclass': 2, ...}.

  • cat_embed_input (Optional[List[Tuple[str, int]]], default: None ) –

    List of Tuples with the column name and number of unique values. e.g. [(education, 11), ...]

  • cat_embed_dropout (Optional[float], default: None ) –

    Categorical embeddings dropout. If None, it will default to 0.

  • use_cat_bias (Optional[bool], default: None ) –

    Boolean indicating if bias will be used for the categorical embeddings. If None, it will default to 'False'.

  • cat_embed_activation (Optional[str], default: None ) –

    Activation function for the categorical embeddings, if any. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported

  • shared_embed (Optional[bool], default: None ) –

    Boolean indicating if the embeddings will be "shared". The idea behind shared_embed is described in Appendix A of the TabTransformer paper: 'The goal of having column embedding is to enable the model to distinguish the classes in one column from those in the other columns'. In other words, the idea is to let the model learn which column is being embedded at a given time. See: pytorch_widedeep.models.transformers._layers.SharedEmbeddings.

  • add_shared_embed (Optional[bool], default: None ) –

    The two embedding sharing strategies are: 1) add the shared embeddings to the column embeddings, or 2) replace the first frac_shared_embed fraction of the embedding with the shared embeddings. See pytorch_widedeep.models.embeddings_layers.SharedEmbeddings. If None is passed, it will default to False.

  • frac_shared_embed (Optional[float], default: None ) –

    The fraction of embeddings that will be shared (if add_shared_embed = False) by all the different categories for one particular column. If 'None' is passed, it will default to 0.0.

  • continuous_cols (Optional[List[str]], default: None ) –

    List with the name of the numeric (aka continuous) columns

  • cont_norm_layer (Optional[Literal[batchnorm, layernorm]], default: None ) –

    Type of normalization layer applied to the continuous features. Options are: 'layernorm' and 'batchnorm'. If None, no normalization layer will be used.

  • embed_continuous_method (Optional[Literal[standard, piecewise, periodic]], default: 'standard' ) –

    Method to use to embed the continuous features. Options are: 'standard', 'periodic' or 'piecewise'. The 'standard' embedding method is based on the FT-Transformer implementation presented in the paper: Revisiting Deep Learning Models for Tabular Data. The 'periodic' and 'piecewise' methods were presented in the paper: On Embeddings for Numerical Features in Tabular Deep Learning. Please read the papers for details.

  • cont_embed_dropout (Optional[float], default: None ) –

    Dropout for the continuous embeddings. If None, it will default to 0.0

  • cont_embed_activation (Optional[str], default: None ) –

    Activation function for the continuous embeddings if any. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported. If None, no activation function will be applied.

  • quantization_setup (Optional[Dict[str, List[float]]], default: None ) –

    This parameter is used when the 'piecewise' method is used to embed the continuous cols. It is a dict where the keys are the names of the continuous columns and the values are lists with the boundaries for their quantization. See the examples for details. If the 'piecewise' method is used, this parameter is required.

  • n_frequencies (Optional[int], default: None ) –

    This is the so-called 'k' in the paper On Embeddings for Numerical Features in Tabular Deep Learning: the number of 'frequencies' that will be used to represent each continuous column. See Eq 2 in the paper for details. If the 'periodic' method is used, this parameter is required.

  • sigma (Optional[float], default: None ) –

    This is the sigma parameter in the same paper, used to initialise the 'frequency weights'. See Eq 2 in the paper for details. If the 'periodic' method is used, this parameter is required.

  • share_last_layer (Optional[bool], default: None ) –

    This parameter is not present in the aforementioned paper but is implemented in the official repo. If True, the linear layer that turns the frequencies into embeddings will be shared across the continuous columns. If False, a different linear layer will be used for each continuous column. If the 'periodic' method is used, this parameter is required.

  • full_embed_dropout (Optional[bool], default: None ) –

    If True, the full embedding corresponding to a column will be masked out/dropout. If None, it will default to False.

  • input_dim (int, default: 32 ) –

    The so-called dimension of the model. It is the size of the embeddings used to encode the categorical and/or continuous columns.

  • attn_dropout (float, default: 0.2 ) –

    Dropout for each attention block

  • with_addnorm (bool, default: False ) –

    Boolean indicating if residual connections will be used in the attention blocks

  • attn_activation (str, default: 'leaky_relu' ) –

    String indicating the activation function to be applied to the dense layer in each attention encoder. 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported.

  • n_blocks (int, default: 3 ) –

    Number of attention blocks

Attributes:

  • encoder (Module) –

    Sequence of attention encoders.

Examples:

>>> import torch
>>> from pytorch_widedeep.models import ContextAttentionMLP
>>> X_tab = torch.cat((torch.empty(5, 4).random_(4), torch.rand(5, 1)), axis=1)
>>> colnames = ['a', 'b', 'c', 'd', 'e']
>>> cat_embed_input = [(u,i) for u,i in zip(colnames[:4], [4]*4)]
>>> column_idx = {k:v for v,k in enumerate(colnames)}
>>> model = ContextAttentionMLP(column_idx=column_idx, cat_embed_input=cat_embed_input, continuous_cols = ['e'])
>>> out = model(X_tab)
Source code in pytorch_widedeep/models/tabular/mlp/context_attention_mlp.py
def __init__(
    self,
    column_idx: Dict[str, int],
    *,
    cat_embed_input: Optional[List[Tuple[str, int]]] = None,
    cat_embed_dropout: Optional[float] = None,
    use_cat_bias: Optional[bool] = None,
    cat_embed_activation: Optional[str] = None,
    shared_embed: Optional[bool] = None,
    add_shared_embed: Optional[bool] = None,
    frac_shared_embed: Optional[float] = None,
    continuous_cols: Optional[List[str]] = None,
    cont_norm_layer: Optional[Literal["batchnorm", "layernorm"]] = None,
    embed_continuous_method: Optional[
        Literal["standard", "piecewise", "periodic"]
    ] = "standard",
    cont_embed_dropout: Optional[float] = None,
    cont_embed_activation: Optional[str] = None,
    quantization_setup: Optional[Dict[str, List[float]]] = None,
    n_frequencies: Optional[int] = None,
    sigma: Optional[float] = None,
    share_last_layer: Optional[bool] = None,
    full_embed_dropout: Optional[bool] = None,
    input_dim: int = 32,
    attn_dropout: float = 0.2,
    with_addnorm: bool = False,
    attn_activation: str = "leaky_relu",
    n_blocks: int = 3,
):
    super(ContextAttentionMLP, self).__init__(
        column_idx=column_idx,
        cat_embed_input=cat_embed_input,
        cat_embed_dropout=cat_embed_dropout,
        use_cat_bias=use_cat_bias,
        cat_embed_activation=cat_embed_activation,
        shared_embed=shared_embed,
        add_shared_embed=add_shared_embed,
        frac_shared_embed=frac_shared_embed,
        continuous_cols=continuous_cols,
        cont_norm_layer=cont_norm_layer,
        embed_continuous=None,
        embed_continuous_method=embed_continuous_method,
        cont_embed_dropout=cont_embed_dropout,
        cont_embed_activation=cont_embed_activation,
        input_dim=input_dim,
        quantization_setup=quantization_setup,
        n_frequencies=n_frequencies,
        sigma=sigma,
        share_last_layer=share_last_layer,
        full_embed_dropout=full_embed_dropout,
    )

    self.attn_dropout = attn_dropout
    self.with_addnorm = with_addnorm
    self.attn_activation = attn_activation
    self.n_blocks = n_blocks

    self.with_cls_token = "cls_token" in column_idx
    self.n_cat = len(cat_embed_input) if cat_embed_input is not None else 0
    self.n_cont = len(continuous_cols) if continuous_cols is not None else 0

    # Embeddings are instantiated at the base model
    # Attention Blocks
    self.encoder = nn.Sequential()
    for i in range(n_blocks):
        self.encoder.add_module(
            "attention_block" + str(i),
            ContextAttentionEncoder(
                input_dim,
                attn_dropout,
                with_addnorm,
                attn_activation,
            ),
        )

output_dim property

output_dim

The output dimension of the model. This is a required property, necessary to build the WideDeep class

attention_weights property

attention_weights

List with the attention weights per block

The shape of the attention weights is \((N, F)\), where \(N\) is the batch size and \(F\) is the number of features/columns in the dataset
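
For instance, reusing the model and X_tab from the example above, the weights can be inspected after a forward pass (a minimal sketch):

>>> out = model(X_tab)                      # the forward pass stores the weights
>>> attn = model.attention_weights          # list with one tensor per block (n_blocks=3 by default)
>>> first_block_weights = attn[0]           # shape (N, F) = (5, 5) for the example above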

SelfAttentionMLP

SelfAttentionMLP(column_idx, *, cat_embed_input=None, cat_embed_dropout=None, use_cat_bias=None, cat_embed_activation=None, shared_embed=None, add_shared_embed=None, frac_shared_embed=None, continuous_cols=None, cont_norm_layer=None, embed_continuous_method='standard', cont_embed_dropout=None, cont_embed_activation=None, quantization_setup=None, n_frequencies=None, sigma=None, share_last_layer=None, full_embed_dropout=None, input_dim=32, attn_dropout=0.2, n_heads=8, use_bias=False, with_addnorm=False, attn_activation='leaky_relu', n_blocks=3)

Bases: BaseTabularModelWithAttention

Defines a SelfAttentionMLP model that can be used as the deeptabular component of a Wide & Deep model or independently by itself.

This class combines embedding representations of the categorical features with numerical (aka continuous) features that are also embedded. These are then passed through a series of attention blocks. Each attention block is comprised by what we would refer as a simplified SelfAttentionEncoder. See pytorch_widedeep.models.tabular.mlp._attention_layers for details. The reason to use a simplified version of self attention is because we observed that the 'standard' attention mechanism used in the TabTransformer has a notable tendency to overfit.

In more detail, this model only uses Q and K (and not V). If we think about it in terms of text (and intuitively), the Softmax(QK^T) is the attention mechanism that tells us how much, at each position in the input sentence, each word is represented or 'expressed'. We refer to that as 'attention weights'. These attention weights are normally multiplied by a Value matrix to further strengthen the focus on the words that each word should be attending to (again, intuitively).

In this implementation we skip this last multiplication and instead we multiply the attention weights directly by the input tensor. This is a simplification that we expect is beneficial in terms of avoiding overfitting for tabular data.
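
In code, and ignoring the batch/feature bookkeeping of the actual implementation, the simplification amounts to something like the following single-head sketch (the names W_q and W_k are illustrative; the real layers live in pytorch_widedeep.models.tabular.mlp._attention_layers):

>>> import torch
>>> import torch.nn.functional as F
>>> N, S, D = 8, 5, 32                              # batch size, number of features, input_dim
>>> X = torch.rand(N, S, D)                         # embedded categorical + continuous columns
>>> W_q = torch.nn.Linear(D, D, bias=False)         # query projection
>>> W_k = torch.nn.Linear(D, D, bias=False)         # key projection; note there is no value projection
>>> attn_weights = F.softmax(W_q(X) @ W_k(X).transpose(1, 2) / D ** 0.5, dim=-1)
>>> out = attn_weights @ X                          # the weights are applied directly to the input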

Most of the parameters for this class are Optional since the use of categorical or continuous is in fact optional (i.e. one can use categorical features only, continuous features only or both).

Parameters:

  • column_idx (Dict[str, int]) –

    Dict containing the index of the columns that will be passed through the model. Required to slice the tensors. e.g. {'education': 0, 'relationship': 1, 'workclass': 2, ...}.

  • cat_embed_input (Optional[List[Tuple[str, int]]], default: None ) –

    List of Tuples with the column name and number of unique values for each categorical column. e.g. [(education, 11), ...]

  • cat_embed_dropout (Optional[float], default: None ) –

    Categorical embeddings dropout. If None, it will default to 0.

  • use_cat_bias (Optional[bool], default: None ) –

    Boolean indicating if bias will be used for the categorical embeddings. If None, it will default to 'False'.

  • cat_embed_activation (Optional[str], default: None ) –

    Activation function for the categorical embeddings, if any. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported

  • shared_embed (Optional[bool], default: None ) –

    Boolean indicating if the embeddings will be "shared". The idea behind shared_embed is described in the Appendix A in the TabTransformer paper: 'The goal of having column embedding is to enable the model to distinguish the classes in one column from those in the other columns'. In other words, the idea is to let the model learn which column is embedded at the time. See: pytorch_widedeep.models.transformers._layers.SharedEmbeddings.

  • add_shared_embed (Optional[bool], default: None ) –

    The two embedding sharing strategies are: 1) add the shared embeddings to the column embeddings or 2) replace the first frac_shared_embed with the shared embeddings. See pytorch_widedeep.models.embeddings_layers.SharedEmbeddings. If 'None' is passed, it will default to 'False'.

  • frac_shared_embed (Optional[float], default: None ) –

    The fraction of embeddings that will be shared (if add_shared_embed = False) by all the different categories for one particular column. If 'None' is passed, it will default to 0.0.

  • continuous_cols (Optional[List[str]], default: None ) –

    List with the name of the numeric (aka continuous) columns

  • cont_norm_layer (Optional[Literal[batchnorm, layernorm]], default: None ) –

    Type of normalization layer applied to the continuous features. Options are: 'layernorm' and 'batchnorm'. If None, no normalization layer will be used.

  • embed_continuous_method (Optional[Literal[standard, piecewise, periodic]], default: 'standard' ) –

    Method to use to embed the continuous features. Options are: 'standard', 'periodic' or 'piecewise'. The 'standard' embedding method is based on the FT-Transformer implementation presented in the paper: Revisiting Deep Learning Models for Tabular Data. The 'periodic' and 'piecewise' methods were presented in the paper: On Embeddings for Numerical Features in Tabular Deep Learning. Please, read the papers for details.

  • cont_embed_dropout (Optional[float], default: None ) –

    Dropout for the continuous embeddings. If None, it will default to 0.0

  • cont_embed_activation (Optional[str], default: None ) –

    Activation function for the continuous embeddings if any. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported. If None, no activation function will be applied.

  • quantization_setup (Optional[Dict[str, List[float]]], default: None ) –

    This parameter is used when the 'piecewise' method is used to embed the continuous cols. It is a dict where the keys are the names of the continuous columns and the values are lists with the boundaries for the quantization of the continuous_cols. See the examples for details. If the 'piecewise' method is used, this parameter is required.

  • n_frequencies (Optional[int], default: None ) –

    This is the so-called 'k' in the paper On Embeddings for Numerical Features in Tabular Deep Learning, and is the number of 'frequencies' that will be used to represent each continuous column. See Eq 2 in the paper for details. If the 'periodic' method is used, this parameter is required.

  • sigma (Optional[float], default: None ) –

    This is the sigma parameter in the paper mentioned above, used to initialise the 'frequency weights'. See Eq 2 in the paper for details. If the 'periodic' method is used, this parameter is required.

  • share_last_layer (Optional[bool], default: None ) –

    This parameter is not present in the aforementioned paper but is implemented in the official repo. If True, the linear layer that turns the frequencies into embeddings will be shared across the continuous columns. If False, a different linear layer will be used for each continuous column. If the 'periodic' method is used, this parameter is required.

  • full_embed_dropout (Optional[bool], default: None ) –

    If True, the full embedding corresponding to a column will be masked out (dropped out). If None, it will default to False.

  • input_dim (int, default: 32 ) –

    The so-called dimension of the model. It is the size of the embeddings used to encode the categorical and/or continuous columns

  • attn_dropout (float, default: 0.2 ) –

    Dropout for each attention block

  • n_heads (int, default: 8 ) –

    Number of attention heads per attention block.

  • use_bias (bool, default: False ) –

    Boolean indicating whether or not to use bias in the Q, K projection layers.

  • with_addnorm (bool, default: False ) –

    Boolean indicating if residual connections will be used in the attention blocks

  • attn_activation (str, default: 'leaky_relu' ) –

    String indicating the activation function to be applied to the dense layer in each attention encoder. 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported.

  • n_blocks (int, default: 3 ) –

    Number of attention blocks

Attributes:

  • cat_and_cont_embed (Module) –

    This is the module that processes the categorical and continuous columns

  • encoder (Module) –

    Sequence of attention encoders.

Examples:

>>> import torch
>>> from pytorch_widedeep.models import SelfAttentionMLP
>>> X_tab = torch.cat((torch.empty(5, 4).random_(4), torch.rand(5, 1)), axis=1)
>>> colnames = ['a', 'b', 'c', 'd', 'e']
>>> cat_embed_input = [(u,i,j) for u,i,j in zip(colnames[:4], [4]*4, [8]*4)]
>>> column_idx = {k:v for v,k in enumerate(colnames)}
>>> model = SelfAttentionMLP(column_idx=column_idx, cat_embed_input=cat_embed_input, continuous_cols = ['e'])
>>> out = model(X_tab)
Source code in pytorch_widedeep/models/tabular/mlp/self_attention_mlp.py (lines 170-249)
def __init__(
    self,
    column_idx: Dict[str, int],
    *,
    cat_embed_input: Optional[List[Tuple[str, int]]] = None,
    cat_embed_dropout: Optional[float] = None,
    use_cat_bias: Optional[bool] = None,
    cat_embed_activation: Optional[str] = None,
    shared_embed: Optional[bool] = None,
    add_shared_embed: Optional[bool] = None,
    frac_shared_embed: Optional[float] = None,
    continuous_cols: Optional[List[str]] = None,
    cont_norm_layer: Optional[Literal["batchnorm", "layernorm"]] = None,
    embed_continuous_method: Optional[
        Literal["standard", "piecewise", "periodic"]
    ] = "standard",
    cont_embed_dropout: Optional[float] = None,
    cont_embed_activation: Optional[str] = None,
    quantization_setup: Optional[Dict[str, List[float]]] = None,
    n_frequencies: Optional[int] = None,
    sigma: Optional[float] = None,
    share_last_layer: Optional[bool] = None,
    full_embed_dropout: Optional[bool] = None,
    input_dim: int = 32,
    attn_dropout: float = 0.2,
    n_heads: int = 8,
    use_bias: bool = False,
    with_addnorm: bool = False,
    attn_activation: str = "leaky_relu",
    n_blocks: int = 3,
):
    super(SelfAttentionMLP, self).__init__(
        column_idx=column_idx,
        input_dim=input_dim,
        cat_embed_input=cat_embed_input,
        cat_embed_dropout=cat_embed_dropout,
        use_cat_bias=use_cat_bias,
        cat_embed_activation=cat_embed_activation,
        shared_embed=shared_embed,
        add_shared_embed=add_shared_embed,
        frac_shared_embed=frac_shared_embed,
        continuous_cols=continuous_cols,
        cont_norm_layer=cont_norm_layer,
        embed_continuous=None,
        embed_continuous_method=embed_continuous_method,
        cont_embed_dropout=cont_embed_dropout,
        cont_embed_activation=cont_embed_activation,
        quantization_setup=quantization_setup,
        n_frequencies=n_frequencies,
        sigma=sigma,
        share_last_layer=share_last_layer,
        full_embed_dropout=full_embed_dropout,
    )

    self.attn_dropout = attn_dropout
    self.n_heads = n_heads
    self.use_bias = use_bias
    self.with_addnorm = with_addnorm
    self.attn_activation = attn_activation
    self.n_blocks = n_blocks

    self.with_cls_token = "cls_token" in column_idx
    self.n_cat = len(cat_embed_input) if cat_embed_input is not None else 0
    self.n_cont = len(continuous_cols) if continuous_cols is not None else 0

    # Embeddings are instantiated at the base model
    # Attention Blocks
    self.encoder = nn.Sequential()
    for i in range(n_blocks):
        self.encoder.add_module(
            "attention_block" + str(i),
            SelfAttentionEncoder(
                input_dim,
                attn_dropout,
                use_bias,
                n_heads,
                with_addnorm,
                attn_activation,
            ),
        )

output_dim property

output_dim

The output dimension of the model. This is a required property, necessary to build the WideDeep class

attention_weights property

attention_weights

List with the attention weights per block

The shape of the attention weights is \((N, H, F, F)\), where \(N\) is the batch size, \(H\) is the number of attention heads and \(F\) is the number of features/columns in the dataset

TabTransformer

TabTransformer(column_idx, *, cat_embed_input=None, cat_embed_dropout=None, use_cat_bias=None, cat_embed_activation=None, shared_embed=None, add_shared_embed=None, frac_shared_embed=None, continuous_cols=None, cont_norm_layer=None, embed_continuous=None, embed_continuous_method=None, cont_embed_dropout=None, cont_embed_activation=None, quantization_setup=None, n_frequencies=None, sigma=None, share_last_layer=None, full_embed_dropout=None, input_dim=32, n_heads=8, use_qkv_bias=False, n_blocks=4, attn_dropout=0.2, ff_dropout=0.1, ff_factor=4, transformer_activation='gelu', use_linear_attention=False, use_flash_attention=False, mlp_hidden_dims=None, mlp_activation='relu', mlp_dropout=0.1, mlp_batchnorm=False, mlp_batchnorm_last=False, mlp_linear_first=True)

Bases: BaseTabularModelWithAttention

Defines our adaptation of the TabTransformer model that can be used as the deeptabular component of a Wide & Deep model or independently by itself.

Most of the parameters for this class are Optional since the use of categorical or continuous is in fact optional (i.e. one can use categorical features only, continuous features only or both).

ℹ️ NOTE: This is an enhanced adaptation of the model described in the paper. It can be considered the flagship of our transformer family of models for tabular data, and it offers multiple additional features relative to the original publication (and to some other models in the library).

Parameters:

  • column_idx (Dict[str, int]) –

    Dict containing the index of the columns that will be passed through the model. Required to slice the tensors. e.g. {'education': 0, 'relationship': 1, 'workclass': 2, ...}.

  • cat_embed_input (Optional[List[Tuple[str, int]]], default: None ) –

    List of Tuples with the column name and number of unique values for each categorical column. e.g. [(education, 11), ...]

  • cat_embed_dropout (Optional[float], default: None ) –

    Categorical embeddings dropout. If None, it will default to 0.

  • use_cat_bias (Optional[bool], default: None ) –

    Boolean indicating if bias will be used for the categorical embeddings. If None, it will default to 'False'.

  • cat_embed_activation (Optional[str], default: None ) –

    Activation function for the categorical embeddings, if any. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported

  • shared_embed (Optional[bool], default: None ) –

    Boolean indicating if the embeddings will be "shared". The idea behind shared_embed is described in the Appendix A in the TabTransformer paper: 'The goal of having column embedding is to enable the model to distinguish the classes in one column from those in the other columns'. In other words, the idea is to let the model learn which column is embedded at the time. See: pytorch_widedeep.models.transformers._layers.SharedEmbeddings.

  • add_shared_embed (Optional[bool], default: None ) –

    The two embedding sharing strategies are: 1) add the shared embeddings to the column embeddings or 2) replace the first frac_shared_embed with the shared embeddings. See pytorch_widedeep.models.embeddings_layers.SharedEmbeddings. If 'None' is passed, it will default to 'False'.

  • frac_shared_embed (Optional[float], default: None ) –

    The fraction of embeddings that will be shared (if add_shared_embed = False) by all the different categories for one particular column. If 'None' is passed, it will default to 0.0.

  • continuous_cols (Optional[List[str]], default: None ) –

    List with the name of the numeric (aka continuous) columns

  • cont_norm_layer (Optional[Literal[batchnorm, layernorm]], default: None ) –

    Type of normalization layer applied to the continuous features. Options are: 'layernorm' and 'batchnorm'. If None, no normalization layer will be used.

  • embed_continuous_method (Optional[Literal[standard, piecewise, periodic]], default: None ) –

    Method to use to embed the continuous features. Options are: 'standard', 'periodic' or 'piecewise'. The 'standard' embedding method is based on the FT-Transformer implementation presented in the paper: Revisiting Deep Learning Models for Tabular Data. The 'periodic' and 'piecewise' methods were presented in the paper: On Embeddings for Numerical Features in Tabular Deep Learning. Please, read the papers for details.

  • cont_embed_dropout (Optional[float], default: None ) –

    Dropout for the continuous embeddings. If None, it will default to 0.0

  • cont_embed_activation (Optional[str], default: None ) –

    Activation function for the continuous embeddings if any. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported. If None, no activation function will be applied.

  • quantization_setup (Optional[Dict[str, List[float]]], default: None ) –

    This parameter is used when the 'piecewise' method is used to embed the continuous cols. It is a dict where the keys are the names of the continuous columns and the values are lists with the boundaries for the quantization of the continuous_cols. See the examples for details. If the 'piecewise' method is used, this parameter is required.

  • n_frequencies (Optional[int], default: None ) –

    This is the so-called 'k' in the paper On Embeddings for Numerical Features in Tabular Deep Learning, and is the number of 'frequencies' that will be used to represent each continuous column. See Eq 2 in the paper for details. If the 'periodic' method is used, this parameter is required.

  • sigma (Optional[float], default: None ) –

    This is the sigma parameter in the paper mentioned above, used to initialise the 'frequency weights'. See Eq 2 in the paper for details. If the 'periodic' method is used, this parameter is required.

  • share_last_layer (Optional[bool], default: None ) –

    This parameter is not present in the aforementioned paper but is implemented in the official repo. If True, the linear layer that turns the frequencies into embeddings will be shared across the continuous columns. If False, a different linear layer will be used for each continuous column. If the 'periodic' method is used, this parameter is required.

  • full_embed_dropout (Optional[bool], default: None ) –

    If True, the full embedding corresponding to a column will be masked out (dropped out). If None, it will default to False.

  • input_dim (int, default: 32 ) –

    The so-called dimension of the model. It is the size of the embeddings used to encode the categorical and/or continuous columns

  • n_heads (int, default: 8 ) –

    Number of attention heads per Transformer block

  • use_qkv_bias (bool, default: False ) –

    Boolean indicating whether or not to use bias in the Q, K, and V projection layers.

  • n_blocks (int, default: 4 ) –

    Number of Transformer blocks

  • attn_dropout (float, default: 0.2 ) –

    Dropout that will be applied to the Multi-Head Attention layers

  • ff_dropout (float, default: 0.1 ) –

    Dropout that will be applied to the FeedForward network

  • ff_factor (int, default: 4 ) –

    Multiplicative factor applied to the first layer of the FF network in each Transformer block. This is normally set to 4.

  • transformer_activation (str, default: 'gelu' ) –

    Transformer Encoder activation function. 'tanh', 'relu', 'leaky_relu', 'gelu', 'geglu' and 'reglu' are supported

  • use_linear_attention (bool, default: False ) –

    Boolean indicating if Linear Attention (from Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention) will be used. The inclusion of this mode of attention is inspired by this post, where the Uber team finds that this attention mechanism leads to the best results for their tabular data.

  • use_flash_attention (bool, default: False ) –

    Boolean indicating if Flash Attention will be used.

  • mlp_hidden_dims (Optional[List[int]], default: None ) –

    List with the number of neurons per dense layer in the MLP. e.g: [64, 32]. If not provided no MLP on top of the final Transformer block will be used.

  • mlp_activation (str, default: 'relu' ) –

    Activation function for the dense layers of the MLP. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported. If 'mlp_hidden_dims' is not None and this parameter is None, it will default to 'relu'.

  • mlp_dropout (float, default: 0.1 ) –

    float with the dropout between the dense layers of the MLP. If 'mlp_hidden_dims' is not None and this parameter is None, it will default to 0.0.

  • mlp_batchnorm (bool, default: False ) –

    Boolean indicating whether or not batch normalization will be applied to the dense layers. If 'mlp_hidden_dims' is not None and this parameter is None, it will default to False.

  • mlp_batchnorm_last (bool, default: False ) –

    Boolean indicating whether or not batch normalization will be applied to the last of the dense layers. If 'mlp_hidden_dims' is not None and this parameter is None, it will default to False.

  • mlp_linear_first (bool, default: True ) –

    Boolean indicating the order of the operations in the dense layer. If True: [LIN -> ACT -> BN -> DP]. If False: [BN -> DP -> LIN -> ACT]. If 'mlp_hidden_dims' is not None and this parameter is None, it will default to True.

Attributes:

  • encoder (Module) –

    Sequence of Transformer blocks

  • mlp (Module) –

    MLP component in the model

Examples:

>>> import torch
>>> from pytorch_widedeep.models import TabTransformer
>>> X_tab = torch.cat((torch.empty(5, 4).random_(4), torch.rand(5, 1)), axis=1)
>>> colnames = ['a', 'b', 'c', 'd', 'e']
>>> cat_embed_input = [(u,i) for u,i in zip(colnames[:4], [4]*4)]
>>> continuous_cols = ['e']
>>> column_idx = {k:v for v,k in enumerate(colnames)}
>>> model = TabTransformer(column_idx=column_idx, cat_embed_input=cat_embed_input, continuous_cols=continuous_cols)
>>> out = model(X_tab)
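
As with any deeptabular component, the model above can then be passed to the WideDeep constructor, which builds the prediction head on top of the component's output_dim (a minimal sketch reusing the objects from the example):

>>> from pytorch_widedeep.models import WideDeep
>>> wd_model = WideDeep(deeptabular=model)   # ready to be trained with pytorch_widedeep's Trainer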
Source code in pytorch_widedeep/models/tabular/transformers/tab_transformer.py (lines 201-330)
def __init__(
    self,
    column_idx: Dict[str, int],
    *,
    cat_embed_input: Optional[List[Tuple[str, int]]] = None,
    cat_embed_dropout: Optional[float] = None,
    use_cat_bias: Optional[bool] = None,
    cat_embed_activation: Optional[str] = None,
    shared_embed: Optional[bool] = None,
    add_shared_embed: Optional[bool] = None,
    frac_shared_embed: Optional[float] = None,
    continuous_cols: Optional[List[str]] = None,
    cont_norm_layer: Optional[Literal["batchnorm", "layernorm"]] = None,
    embed_continuous: Optional[bool] = None,
    embed_continuous_method: Optional[
        Literal["standard", "piecewise", "periodic"]
    ] = None,
    cont_embed_dropout: Optional[float] = None,
    cont_embed_activation: Optional[str] = None,
    quantization_setup: Optional[Dict[str, List[float]]] = None,
    n_frequencies: Optional[int] = None,
    sigma: Optional[float] = None,
    share_last_layer: Optional[bool] = None,
    full_embed_dropout: Optional[bool] = None,
    input_dim: int = 32,
    n_heads: int = 8,
    use_qkv_bias: bool = False,
    n_blocks: int = 4,
    attn_dropout: float = 0.2,
    ff_dropout: float = 0.1,
    ff_factor: int = 4,
    transformer_activation: str = "gelu",
    use_linear_attention: bool = False,
    use_flash_attention: bool = False,
    mlp_hidden_dims: Optional[List[int]] = None,
    mlp_activation: str = "relu",
    mlp_dropout: float = 0.1,
    mlp_batchnorm: bool = False,
    mlp_batchnorm_last: bool = False,
    mlp_linear_first: bool = True,
):
    super(TabTransformer, self).__init__(
        column_idx=column_idx,
        cat_embed_input=cat_embed_input,
        cat_embed_dropout=cat_embed_dropout,
        use_cat_bias=use_cat_bias,
        cat_embed_activation=cat_embed_activation,
        shared_embed=shared_embed,
        add_shared_embed=add_shared_embed,
        frac_shared_embed=frac_shared_embed,
        continuous_cols=continuous_cols,
        cont_norm_layer=cont_norm_layer,
        embed_continuous=embed_continuous,
        embed_continuous_method=embed_continuous_method,
        cont_embed_dropout=cont_embed_dropout,
        cont_embed_activation=cont_embed_activation,
        quantization_setup=quantization_setup,
        n_frequencies=n_frequencies,
        sigma=sigma,
        share_last_layer=share_last_layer,
        input_dim=input_dim,
        full_embed_dropout=full_embed_dropout,
    )

    self.n_heads = n_heads
    self.use_qkv_bias = use_qkv_bias
    self.n_blocks = n_blocks
    self.attn_dropout = attn_dropout
    self.ff_dropout = ff_dropout
    self.transformer_activation = transformer_activation
    self.use_linear_attention = use_linear_attention
    self.use_flash_attention = use_flash_attention
    self.ff_factor = ff_factor

    self.mlp_hidden_dims = mlp_hidden_dims
    self.mlp_activation = mlp_activation
    self.mlp_dropout = mlp_dropout
    self.mlp_batchnorm = mlp_batchnorm
    self.mlp_batchnorm_last = mlp_batchnorm_last
    self.mlp_linear_first = mlp_linear_first

    self.with_cls_token = "cls_token" in column_idx
    self.n_cat = len(cat_embed_input) if cat_embed_input is not None else 0
    self.n_cont = len(continuous_cols) if continuous_cols is not None else 0

    if self.n_cont and not self.n_cat and not self.embed_continuous:
        raise ValueError(
            "If only continuous features are used 'embed_continuous' must be set to 'True'"
        )

    # Embeddings are instantiated at the base model
    # Transformer blocks
    self.encoder = nn.Sequential()
    for i in range(n_blocks):
        self.encoder.add_module(
            "transformer_block" + str(i),
            TransformerEncoder(
                input_dim,
                n_heads,
                use_qkv_bias,
                attn_dropout,
                ff_dropout,
                ff_factor,
                transformer_activation,
                use_linear_attention,
                use_flash_attention,
            ),
        )

    self.mlp_first_hidden_dim = self._mlp_first_hidden_dim()

    if self.mlp_hidden_dims is not None:
        self.mlp = MLP(
            d_hidden=[self.mlp_first_hidden_dim] + self.mlp_hidden_dims,
            activation=(
                "relu" if self.mlp_activation is None else self.mlp_activation
            ),
            dropout=0.0 if self.mlp_dropout is None else self.mlp_dropout,
            batchnorm=False if self.mlp_batchnorm is None else self.mlp_batchnorm,
            batchnorm_last=(
                False
                if self.mlp_batchnorm_last is None
                else self.mlp_batchnorm_last
            ),
            linear_first=(
                False if self.mlp_linear_first is None else self.mlp_linear_first
            ),
        )
    else:
        self.mlp = None

output_dim property

output_dim

The output dimension of the model. This is a required property, necessary to build the WideDeep class

attention_weights property

attention_weights

List with the attention weights per block

The shape of the attention weights is \((N, H, F, F)\), where \(N\) is the batch size, \(H\) is the number of attention heads and \(F\) is the number of features/columns in the dataset

ℹ️ NOTE: if flash attention or linear attention is used, no attention weights are saved during the training process and calling this property will throw a ValueError
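
For instance, reusing the objects from the TabTransformer example above, a model built with linear attention still runs as usual, but its attention weights are not tracked (a minimal, illustrative sketch):

>>> fast_model = TabTransformer(
...     column_idx=column_idx,
...     cat_embed_input=cat_embed_input,
...     continuous_cols=continuous_cols,
...     use_linear_attention=True,
... )
>>> out = fast_model(X_tab)
>>> # accessing fast_model.attention_weights would raise a ValueError, as per the note above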

SAINT

SAINT(column_idx, *, cat_embed_input=None, cat_embed_dropout=None, use_cat_bias=None, cat_embed_activation=None, shared_embed=None, add_shared_embed=None, frac_shared_embed=None, continuous_cols=None, cont_norm_layer=None, embed_continuous_method='standard', cont_embed_dropout=None, cont_embed_activation=None, quantization_setup=None, n_frequencies=None, sigma=None, share_last_layer=None, full_embed_dropout=None, input_dim=32, use_qkv_bias=False, n_heads=8, n_blocks=2, attn_dropout=0.1, ff_dropout=0.2, ff_factor=4, transformer_activation='gelu', mlp_hidden_dims=None, mlp_activation=None, mlp_dropout=None, mlp_batchnorm=None, mlp_batchnorm_last=None, mlp_linear_first=None)

Bases: BaseTabularModelWithAttention

Defines a SAINT model that can be used as the deeptabular component of a Wide & Deep model or independently by itself.

Most of the parameters for this class are Optional since the use of categorical or continuous is in fact optional (i.e. one can use categorical features only, continuous features only or both).

ℹ️ NOTE: This is a slightly modified and enhanced version of the model described in the paper.

Parameters:

  • column_idx (Dict[str, int]) –

    Dict containing the index of the columns that will be passed through the model. Required to slice the tensors. e.g. {'education': 0, 'relationship': 1, 'workclass': 2, ...}.

  • cat_embed_input (Optional[List[Tuple[str, int]]], default: None ) –

    List of Tuples with the column name and number of unique values for each categorical column. e.g. [(education, 11), ...]

  • cat_embed_dropout (Optional[float], default: None ) –

    Categorical embeddings dropout. If None, it will default to 0.

  • use_cat_bias (Optional[bool], default: None ) –

    Boolean indicating if bias will be used for the categorical embeddings. If None, it will default to 'False'.

  • cat_embed_activation (Optional[str], default: None ) –

    Activation function for the categorical embeddings, if any. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported

  • shared_embed (Optional[bool], default: None ) –

    Boolean indicating if the embeddings will be "shared". The idea behind shared_embed is described in the Appendix A in the TabTransformer paper: 'The goal of having column embedding is to enable the model to distinguish the classes in one column from those in the other columns'. In other words, the idea is to let the model learn which column is embedded at the time. See: pytorch_widedeep.models.transformers._layers.SharedEmbeddings.

  • add_shared_embed (Optional[bool], default: None ) –

    The two embedding sharing strategies are: 1) add the shared embeddings to the column embeddings or 2) replace the first frac_shared_embed with the shared embeddings. See pytorch_widedeep.models.embeddings_layers.SharedEmbeddings. If 'None' is passed, it will default to 'False'.

  • frac_shared_embed (Optional[float], default: None ) –

    The fraction of embeddings that will be shared (if add_shared_embed = False) by all the different categories for one particular column. If 'None' is passed, it will default to 0.0.

  • continuous_cols (Optional[List[str]], default: None ) –

    List with the name of the numeric (aka continuous) columns

  • cont_norm_layer (Optional[Literal[batchnorm, layernorm]], default: None ) –

    Type of normalization layer applied to the continuous features. Options are: 'layernorm' and 'batchnorm'. If None, no normalization layer will be used.

  • embed_continuous_method (Optional[Literal[standard, piecewise, periodic]], default: 'standard' ) –

    Method to use to embed the continuous features. Options are: 'standard', 'periodic' or 'piecewise'. The 'standard' embedding method is based on the FT-Transformer implementation presented in the paper: Revisiting Deep Learning Models for Tabular Data. The 'periodic' and 'piecewise' methods were presented in the paper: On Embeddings for Numerical Features in Tabular Deep Learning. Please, read the papers for details.

  • cont_embed_dropout (Optional[float], default: None ) –

    Dropout for the continuous embeddings. If None, it will default to 0.0

  • cont_embed_activation (Optional[str], default: None ) –

    Activation function for the continuous embeddings if any. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported. If None, no activation function will be applied.

  • quantization_setup (Optional[Dict[str, List[float]]], default: None ) –

    This parameter is used when the 'piecewise' method is used to embed the continuous cols. It is a dict where the keys are the names of the continuous columns and the values are lists with the boundaries for the quantization of the continuous_cols. See the examples for details. If the 'piecewise' method is used, this parameter is required.

  • n_frequencies (Optional[int], default: None ) –

    This is the so-called 'k' in the paper On Embeddings for Numerical Features in Tabular Deep Learning, and is the number of 'frequencies' that will be used to represent each continuous column. See Eq 2 in the paper for details. If the 'periodic' method is used, this parameter is required (together with sigma and share_last_layer; see the sketch after the example below).

  • sigma (Optional[float], default: None ) –

    This is the sigma parameter in the paper mentioned above, used to initialise the 'frequency weights'. See Eq 2 in the paper for details. If the 'periodic' method is used, this parameter is required.

  • share_last_layer (Optional[bool], default: None ) –

    This parameter is not present in the aforementioned paper but is implemented in the official repo. If True, the linear layer that turns the frequencies into embeddings will be shared across the continuous columns. If False, a different linear layer will be used for each continuous column. If the 'periodic' method is used, this parameter is required.

  • full_embed_dropout (Optional[bool], default: None ) –

    If True, the full embedding corresponding to a column will be masked out (dropped out). If None, it will default to False.

  • input_dim (int, default: 32 ) –

    The so-called dimension of the model. It is the size of the embeddings used to encode the categorical and/or continuous columns

  • n_heads (int, default: 8 ) –

    Number of attention heads per Transformer block

  • use_qkv_bias (bool, default: False ) –

    Boolean indicating whether or not to use bias in the Q, K, and V projection layers

  • n_blocks (int, default: 2 ) –

    Number of SAINT-Transformer blocks.

  • attn_dropout (float, default: 0.1 ) –

    Dropout that will be applied to the Multi-Head Attention column and row layers

  • ff_dropout (float, default: 0.2 ) –

    Dropout that will be applied to the FeedForward network

  • ff_factor (int, default: 4 ) –

    Multiplicative factor applied to the first layer of the FF network in each Transformer block. This is normally set to 4.

  • transformer_activation (str, default: 'gelu' ) –

    Transformer Encoder activation function. 'tanh', 'relu', 'leaky_relu', 'gelu', 'geglu' and 'reglu' are supported

  • mlp_hidden_dims (Optional[List[int]], default: None ) –

    List with the number of neurons per dense layer in the MLP. e.g: [64, 32]. If not provided no MLP on top of the final Transformer block will be used.

  • mlp_activation (Optional[str], default: None ) –

    Activation function for the dense layers of the MLP. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported. If 'mlp_hidden_dims' is not None and this parameter is None, it will default to 'relu'.

  • mlp_dropout (Optional[float], default: None ) –

    float with the dropout between the dense layers of the MLP. If 'mlp_hidden_dims' is not None and this parameter is None, it will default to 0.0.

  • mlp_batchnorm (Optional[bool], default: None ) –

    Boolean indicating whether or not batch normalization will be applied to the dense layers. If 'mlp_hidden_dims' is not None and this parameter is None, it will default to False.

  • mlp_batchnorm_last (Optional[bool], default: None ) –

    Boolean indicating whether or not batch normalization will be applied to the last of the dense layers. If 'mlp_hidden_dims' is not None and this parameter is None, it will default to False.

  • mlp_linear_first (Optional[bool], default: None ) –

    Boolean indicating the order of the operations in the dense layer. If True: [LIN -> ACT -> BN -> DP]. If False: [BN -> DP -> LIN -> ACT]. If 'mlp_hidden_dims' is not None and this parameter is None, it will default to True.

Attributes:

  • encoder (Module) –

    Sequence of SAINT-Transformer blocks

  • mlp (Module) –

    MLP component in the model

Examples:

>>> import torch
>>> from pytorch_widedeep.models import SAINT
>>> X_tab = torch.cat((torch.empty(5, 4).random_(4), torch.rand(5, 1)), axis=1)
>>> colnames = ['a', 'b', 'c', 'd', 'e']
>>> cat_embed_input = [(u,i) for u,i in zip(colnames[:4], [4]*4)]
>>> continuous_cols = ['e']
>>> column_idx = {k:v for v,k in enumerate(colnames)}
>>> model = SAINT(column_idx=column_idx, cat_embed_input=cat_embed_input, continuous_cols=continuous_cols)
>>> out = model(X_tab)
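
If the 'periodic' embedding method is used instead of the default 'standard' one, n_frequencies, sigma and share_last_layer must all be passed, as noted in the parameter list. A minimal sketch reusing the objects above (the values are illustrative, not recommendations):

>>> periodic_model = SAINT(
...     column_idx=column_idx,
...     cat_embed_input=cat_embed_input,
...     continuous_cols=continuous_cols,
...     embed_continuous_method='periodic',
...     n_frequencies=16,
...     sigma=0.1,
...     share_last_layer=False,
... )
>>> out = periodic_model(X_tab)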
Source code in pytorch_widedeep/models/tabular/transformers/saint.py (lines 184-307)
def __init__(
    self,
    column_idx: Dict[str, int],
    *,
    cat_embed_input: Optional[List[Tuple[str, int]]] = None,
    cat_embed_dropout: Optional[float] = None,
    use_cat_bias: Optional[bool] = None,
    cat_embed_activation: Optional[str] = None,
    shared_embed: Optional[bool] = None,
    add_shared_embed: Optional[bool] = None,
    frac_shared_embed: Optional[float] = None,
    continuous_cols: Optional[List[str]] = None,
    cont_norm_layer: Optional[Literal["batchnorm", "layernorm"]] = None,
    embed_continuous_method: Optional[
        Literal["standard", "piecewise", "periodic"]
    ] = "standard",
    cont_embed_dropout: Optional[float] = None,
    cont_embed_activation: Optional[str] = None,
    quantization_setup: Optional[Dict[str, List[float]]] = None,
    n_frequencies: Optional[int] = None,
    sigma: Optional[float] = None,
    share_last_layer: Optional[bool] = None,
    full_embed_dropout: Optional[bool] = None,
    input_dim: int = 32,
    use_qkv_bias: bool = False,
    n_heads: int = 8,
    n_blocks: int = 2,
    attn_dropout: float = 0.1,
    ff_dropout: float = 0.2,
    ff_factor: int = 4,
    transformer_activation: str = "gelu",
    mlp_hidden_dims: Optional[List[int]] = None,
    mlp_activation: Optional[str] = None,
    mlp_dropout: Optional[float] = None,
    mlp_batchnorm: Optional[bool] = None,
    mlp_batchnorm_last: Optional[bool] = None,
    mlp_linear_first: Optional[bool] = None,
):
    super(SAINT, self).__init__(
        column_idx=column_idx,
        cat_embed_input=cat_embed_input,
        cat_embed_dropout=cat_embed_dropout,
        use_cat_bias=use_cat_bias,
        cat_embed_activation=cat_embed_activation,
        shared_embed=shared_embed,
        add_shared_embed=add_shared_embed,
        frac_shared_embed=frac_shared_embed,
        continuous_cols=continuous_cols,
        cont_norm_layer=cont_norm_layer,
        embed_continuous=None,
        embed_continuous_method=embed_continuous_method,
        cont_embed_dropout=cont_embed_dropout,
        cont_embed_activation=cont_embed_activation,
        input_dim=input_dim,
        quantization_setup=quantization_setup,
        n_frequencies=n_frequencies,
        sigma=sigma,
        share_last_layer=share_last_layer,
        full_embed_dropout=full_embed_dropout,
    )

    self.use_qkv_bias = use_qkv_bias
    self.n_heads = n_heads
    self.n_blocks = n_blocks
    self.attn_dropout = attn_dropout
    self.ff_dropout = ff_dropout
    self.ff_factor = ff_factor
    self.transformer_activation = transformer_activation

    self.mlp_hidden_dims = mlp_hidden_dims
    self.mlp_activation = mlp_activation
    self.mlp_dropout = mlp_dropout
    self.mlp_batchnorm = mlp_batchnorm
    self.mlp_batchnorm_last = mlp_batchnorm_last
    self.mlp_linear_first = mlp_linear_first

    self.with_cls_token = "cls_token" in column_idx
    self.n_cat = len(cat_embed_input) if cat_embed_input is not None else 0
    self.n_cont = len(continuous_cols) if continuous_cols is not None else 0
    self.n_feats = self.n_cat + self.n_cont

    # Embeddings are instantiated at the base model
    # Transformer blocks
    self.encoder = nn.Sequential()
    for i in range(n_blocks):
        self.encoder.add_module(
            "saint_block" + str(i),
            SaintEncoder(
                input_dim,
                n_heads,
                use_qkv_bias,
                attn_dropout,
                ff_dropout,
                ff_factor,
                transformer_activation,
                self.n_feats,
            ),
        )

    self.mlp_first_hidden_dim = (
        self.input_dim if self.with_cls_token else (self.n_feats * self.input_dim)
    )

    # MLP: adding an MLP on top of the SAINT-Transformer blocks is optional and
    # therefore all related params are optional
    if self.mlp_hidden_dims is not None:
        self.mlp = MLP(
            d_hidden=[self.mlp_first_hidden_dim] + self.mlp_hidden_dims,
            activation=(
                "relu" if self.mlp_activation is None else self.mlp_activation
            ),
            dropout=0.0 if self.mlp_dropout is None else self.mlp_dropout,
            batchnorm=False if self.mlp_batchnorm is None else self.mlp_batchnorm,
            batchnorm_last=(
                False
                if self.mlp_batchnorm_last is None
                else self.mlp_batchnorm_last
            ),
            linear_first=(
                False if self.mlp_linear_first is None else self.mlp_linear_first
            ),
        )
    else:
        self.mlp = None

output_dim property

output_dim

The output dimension of the model. This is a required property, necessary to build the WideDeep class

attention_weights property

attention_weights

List with the attention weights. Each element of the list is a tuple where the first and the second elements are the column and row attention weights respectively

The shape of the attention weights is:

  • column attention: \((N, H, F, F)\)

  • row attention: \((1, H, N, N)\)

where \(N\) is the batch size, \(H\) is the number of heads and \(F\) is the number of features/columns in the dataset
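
For instance, for the SAINT model in the example above (a minimal sketch):

>>> out = model(X_tab)                                 # the forward pass stores the weights
>>> col_attn, row_attn = model.attention_weights[0]    # first SAINT block
>>> # col_attn has shape (N, H, F, F) = (5, 8, 5, 5) and row_attn has shape (1, H, N, N) = (1, 8, 5, 5)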

FTTransformer

FTTransformer(column_idx, *, cat_embed_input=None, cat_embed_dropout=None, use_cat_bias=None, cat_embed_activation=None, shared_embed=None, add_shared_embed=None, frac_shared_embed=None, continuous_cols=None, cont_norm_layer=None, embed_continuous_method='standard', cont_embed_dropout=None, cont_embed_activation=None, quantization_setup=None, n_frequencies=None, sigma=None, share_last_layer=None, full_embed_dropout=None, input_dim=64, kv_compression_factor=0.5, kv_sharing=False, use_qkv_bias=False, n_heads=8, n_blocks=4, attn_dropout=0.2, ff_dropout=0.1, ff_factor=1.33, transformer_activation='reglu', mlp_hidden_dims=None, mlp_activation=None, mlp_dropout=None, mlp_batchnorm=None, mlp_batchnorm_last=None, mlp_linear_first=None)

Bases: BaseTabularModelWithAttention

Defines a FTTransformer model that can be used as the deeptabular component of a Wide & Deep model or independently by itself.

Most of the parameters for this class are Optional since the use of categorical or continuous is in fact optional (i.e. one can use categorical features only, continuous features only or both).

Parameters:

  • column_idx (Dict[str, int]) –

    Dict containing the index of the columns that will be passed through the model. Required to slice the tensors. e.g. {'education': 0, 'relationship': 1, 'workclass': 2, ...}.

  • cat_embed_input (Optional[List[Tuple[str, int]]], default: None ) –

    List of Tuples with the column name and number of unique values for each categorical column. e.g. [(education, 11), ...]

  • cat_embed_dropout (Optional[float], default: None ) –

    Categorical embeddings dropout. If None, it will default to 0.

  • use_cat_bias (Optional[bool], default: None ) –

    Boolean indicating if bias will be used for the categorical embeddings. If None, it will default to 'False'.

  • cat_embed_activation (Optional[str], default: None ) –

    Activation function for the categorical embeddings, if any. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported

  • shared_embed (Optional[bool], default: None ) –

    Boolean indicating if the embeddings will be "shared". The idea behind shared_embed is described in the Appendix A in the TabTransformer paper: 'The goal of having column embedding is to enable the model to distinguish the classes in one column from those in the other columns'. In other words, the idea is to let the model learn which column is embedded at the time. See: pytorch_widedeep.models.transformers._layers.SharedEmbeddings.

  • add_shared_embed (Optional[bool], default: None ) –

    The two embedding sharing strategies are: 1) add the shared embeddings to the column embeddings or 2) replace the first frac_shared_embed with the shared embeddings. See pytorch_widedeep.models.embeddings_layers.SharedEmbeddings. If 'None' is passed, it will default to 'False'.

  • frac_shared_embed (Optional[float], default: None ) –

    The fraction of embeddings that will be shared (if add_shared_embed = False) by all the different categories for one particular column. If 'None' is passed, it will default to 0.0.

  • continuous_cols (Optional[List[str]], default: None ) –

    List with the name of the numeric (aka continuous) columns

  • cont_norm_layer (Optional[Literal[batchnorm, layernorm]], default: None ) –

    Type of normalization layer applied to the continuous features. Options are: 'layernorm' and 'batchnorm'. If None, no normalization layer will be used.

  • embed_continuous_method (Optional[Literal[standard, piecewise, periodic]], default: 'standard' ) –

    Method to use to embed the continuous features. Options are: 'standard', 'periodic' or 'piecewise'. The 'standard' embedding method is based on the FT-Transformer implementation presented in the paper: Revisiting Deep Learning Models for Tabular Data. The 'periodic' and 'piecewise' methods were presented in the paper: On Embeddings for Numerical Features in Tabular Deep Learning. Please, read the papers for details.

  • cont_embed_dropout (Optional[float], default: None ) –

    Dropout for the continuous embeddings. If None, it will default to 0.0

  • cont_embed_activation (Optional[str], default: None ) –

    Activation function for the continuous embeddings if any. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported. If None, no activation function will be applied.

  • quantization_setup (Optional[Dict[str, List[float]]], default: None ) –

    This parameter is used when the 'piecewise' method is used to embed the continuous cols. It is a dict where the keys are the names of the continuous columns and the values are lists with the boundaries for the quantization of the continuous_cols. See the examples for details. If the 'piecewise' method is used, this parameter is required.

  • n_frequencies (Optional[int], default: None ) –

    This is the so-called 'k' in the paper On Embeddings for Numerical Features in Tabular Deep Learning, and is the number of 'frequencies' that will be used to represent each continuous column. See Eq 2 in the paper for details. If the 'periodic' method is used, this parameter is required.

  • sigma (Optional[float], default: None ) –

    This is the sigma parameter in the paper mentioned above, used to initialise the 'frequency weights'. See Eq 2 in the paper for details. If the 'periodic' method is used, this parameter is required.

  • share_last_layer (Optional[bool], default: None ) –

    This parameter is not present in the aforementioned paper but is implemented in the official repo. If True, the linear layer that turns the frequencies into embeddings will be shared across the continuous columns. If False, a different linear layer will be used for each continuous column. If the 'periodic' method is used, this parameter is required.

  • full_embed_dropout (Optional[bool], default: None ) –

    If True, the full embedding corresponding to a column will be masked out (dropped out). If None, it will default to False.

  • input_dim (int, default: 64 ) –

    The so-called dimension of the model. It is the size of the embeddings used to encode the categorical and/or continuous columns.

  • kv_compression_factor (float, default: 0.5 ) –

    By default, the FTTransformer uses Linear Attention (see Linformer: Self-Attention with Linear Complexity). This is the compression factor used to reduce the input sequence length: the resulting sequence length is \(k = int(kv_{compression \space factor} \times s)\), where \(s\) is the input sequence length (see also the sketch after the example below).

  • kv_sharing (bool, default: False ) –

    Boolean indicating if the \(E\) and \(F\) projection matrices will share weights. See Linformer: Self-Attention with Linear Complexity for details

  • n_heads (int, default: 8 ) –

    Number of attention heads per FTTransformer block

  • use_qkv_bias (bool, default: False ) –

    Boolean indicating whether or not to use bias in the Q, K, and V projection layers

  • n_blocks (int, default: 4 ) –

    Number of FTTransformer blocks

  • attn_dropout (float, default: 0.2 ) –

    Dropout that will be applied to the Linear-Attention layers

  • ff_dropout (float, default: 0.1 ) –

    Dropout that will be applied to the FeedForward network

  • ff_factor (float, default: 1.33 ) –

    Multiplicative factor applied to the first layer of the FF network in each Transformer block. This is normally set to 4, but the paper uses 4/3.

  • transformer_activation (str, default: 'reglu' ) –

    Transformer Encoder activation function. 'tanh', 'relu', 'leaky_relu', 'gelu', 'geglu' and 'reglu' are supported

  • mlp_hidden_dims (Optional[List[int]], default: None ) –

    List with the number of neurons per dense layer in the MLP. e.g: [64, 32]. If not provided no MLP on top of the final FTTransformer block will be used.

  • mlp_activation (Optional[str], default: None ) –

    Activation function for the dense layers of the MLP. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported. If 'mlp_hidden_dims' is not None and this parameter is None, it will default to 'relu'.

  • mlp_dropout (Optional[float], default: None ) –

    float with the dropout between the dense layers of the MLP. If 'mlp_hidden_dims' is not None and this parameter is None, it will default to 0.0.

  • mlp_batchnorm (Optional[bool], default: None ) –

    Boolean indicating whether or not batch normalization will be applied to the dense layers. If 'mlp_hidden_dims' is not None and this parameter is None, it will default to False.

  • mlp_batchnorm_last (Optional[bool], default: None ) –

    Boolean indicating whether or not batch normalization will be applied to the last of the dense layers. If 'mlp_hidden_dims' is not None and this parameter is None, it will default to False.

  • mlp_linear_first (Optional[bool], default: None ) –

    Boolean indicating the order of the operations in the dense layer. If True: [LIN -> ACT -> BN -> DP]. If False: [BN -> DP -> LIN -> ACT]. If 'mlp_hidden_dims' is not None and this parameter is None, it will default to True.

Attributes:

  • encoder (Module) –

    Sequence of FTTransformer blocks

  • mlp (Module) –

    MLP component in the model

Examples:

>>> import torch
>>> from pytorch_widedeep.models import FTTransformer
>>> X_tab = torch.cat((torch.empty(5, 4).random_(4), torch.rand(5, 1)), axis=1)
>>> colnames = ['a', 'b', 'c', 'd', 'e']
>>> cat_embed_input = [(u,i) for u,i in zip(colnames[:4], [4]*4)]
>>> continuous_cols = ['e']
>>> column_idx = {k:v for v,k in enumerate(colnames)}
>>> model = FTTransformer(column_idx=column_idx, cat_embed_input=cat_embed_input, continuous_cols=continuous_cols)
>>> out = model(X_tab)
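
With the default kv_compression_factor=0.5, the keys/values of the 5 features above are compressed to a sequence of length \(k = int(0.5 \times 5) = 2\). An optional MLP head can also be added on top of the encoder via mlp_hidden_dims. A minimal sketch reusing the objects from the example (the hidden sizes are illustrative):

>>> compressed_model = FTTransformer(
...     column_idx=column_idx,
...     cat_embed_input=cat_embed_input,
...     continuous_cols=continuous_cols,
...     kv_compression_factor=0.5,
...     mlp_hidden_dims=[64, 32],
... )
>>> out = compressed_model(X_tab)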
Source code in pytorch_widedeep/models/tabular/transformers/ft_transformer.py (lines 194-326)
def __init__(
    self,
    column_idx: Dict[str, int],
    *,
    cat_embed_input: Optional[List[Tuple[str, int]]] = None,
    cat_embed_dropout: Optional[float] = None,
    use_cat_bias: Optional[bool] = None,
    cat_embed_activation: Optional[str] = None,
    shared_embed: Optional[bool] = None,
    add_shared_embed: Optional[bool] = None,
    frac_shared_embed: Optional[float] = None,
    continuous_cols: Optional[List[str]] = None,
    cont_norm_layer: Optional[Literal["batchnorm", "layernorm"]] = None,
    embed_continuous_method: Optional[
        Literal["standard", "piecewise", "periodic"]
    ] = "standard",
    cont_embed_dropout: Optional[float] = None,
    cont_embed_activation: Optional[str] = None,
    quantization_setup: Optional[Dict[str, List[float]]] = None,
    n_frequencies: Optional[int] = None,
    sigma: Optional[float] = None,
    share_last_layer: Optional[bool] = None,
    full_embed_dropout: Optional[bool] = None,
    input_dim: int = 64,
    kv_compression_factor: float = 0.5,
    kv_sharing: bool = False,
    use_qkv_bias: bool = False,
    n_heads: int = 8,
    n_blocks: int = 4,
    attn_dropout: float = 0.2,
    ff_dropout: float = 0.1,
    ff_factor: float = 1.33,
    transformer_activation: str = "reglu",
    mlp_hidden_dims: Optional[List[int]] = None,
    mlp_activation: Optional[str] = None,
    mlp_dropout: Optional[float] = None,
    mlp_batchnorm: Optional[bool] = None,
    mlp_batchnorm_last: Optional[bool] = None,
    mlp_linear_first: Optional[bool] = None,
):
    super(FTTransformer, self).__init__(
        column_idx=column_idx,
        cat_embed_input=cat_embed_input,
        cat_embed_dropout=cat_embed_dropout,
        use_cat_bias=use_cat_bias,
        cat_embed_activation=cat_embed_activation,
        shared_embed=shared_embed,
        add_shared_embed=add_shared_embed,
        frac_shared_embed=frac_shared_embed,
        continuous_cols=continuous_cols,
        cont_norm_layer=cont_norm_layer,
        embed_continuous=None,
        embed_continuous_method=embed_continuous_method,
        cont_embed_dropout=cont_embed_dropout,
        cont_embed_activation=cont_embed_activation,
        input_dim=input_dim,
        quantization_setup=quantization_setup,
        n_frequencies=n_frequencies,
        sigma=sigma,
        share_last_layer=share_last_layer,
        full_embed_dropout=full_embed_dropout,
    )

    self.kv_compression_factor = kv_compression_factor
    self.kv_sharing = kv_sharing
    self.use_qkv_bias = use_qkv_bias
    self.n_heads = n_heads
    self.n_blocks = n_blocks
    self.attn_dropout = attn_dropout
    self.ff_dropout = ff_dropout
    self.ff_factor = ff_factor
    self.transformer_activation = transformer_activation

    self.mlp_hidden_dims = mlp_hidden_dims
    self.mlp_activation = mlp_activation
    self.mlp_dropout = mlp_dropout
    self.mlp_batchnorm = mlp_batchnorm
    self.mlp_batchnorm_last = mlp_batchnorm_last
    self.mlp_linear_first = mlp_linear_first

    self.with_cls_token = "cls_token" in column_idx
    self.n_cat = len(cat_embed_input) if cat_embed_input is not None else 0
    self.n_cont = len(continuous_cols) if continuous_cols is not None else 0
    self.n_feats = self.n_cat + self.n_cont

    # Embeddings are instantiated at the base model
    # Transformer blocks
    is_first = True
    self.encoder = nn.Sequential()
    for i in range(n_blocks):
        self.encoder.add_module(
            "fttransformer_block" + str(i),
            FTTransformerEncoder(
                input_dim,
                self.n_feats,
                n_heads,
                use_qkv_bias,
                attn_dropout,
                ff_dropout,
                ff_factor,
                kv_compression_factor,
                kv_sharing,
                transformer_activation,
                is_first,
            ),
        )
        is_first = False

    self.mlp_first_hidden_dim = (
        self.input_dim if self.with_cls_token else (self.n_feats * self.input_dim)
    )

    # Mlp: adding an MLP on top of the Resnet blocks is optional and
    # therefore all related params are optional
    if self.mlp_hidden_dims is not None:
        self.mlp = MLP(
            d_hidden=[self.mlp_first_hidden_dim] + self.mlp_hidden_dims,
            activation=(
                "relu" if self.mlp_activation is None else self.mlp_activation
            ),
            dropout=0.0 if self.mlp_dropout is None else self.mlp_dropout,
            batchnorm=False if self.mlp_batchnorm is None else self.mlp_batchnorm,
            batchnorm_last=(
                False
                if self.mlp_batchnorm_last is None
                else self.mlp_batchnorm_last
            ),
            linear_first=(
                False if self.mlp_linear_first is None else self.mlp_linear_first
            ),
        )
    else:
        self.mlp = None

output_dim property

output_dim

The output dimension of the model. This is a required property necessary to build the WideDeep class

attention_weights property

attention_weights

List with the attention weights per block

The shape of the attention weights is \((N, H, F, k)\), where \(N\) is the batch size, \(H\) is the number of attention heads, \(F\) is the number of features/columns and \(k\) is the reduced (compressed) sequence length, i.e. \(k = \mathrm{int}(\mathrm{kv\_compression\_factor} \times s)\), with \(s\) being the original sequence length (the number of features/columns being attended to)
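For illustration, the following sketch (a non-authoritative example that reuses the setup from the Examples above) runs a forward pass and collects the per-block attention weight shapes. With 5 features, the default 8 heads and the default kv_compression_factor=0.5, each tensor should have shape \((N, H, F, k) = (5, 8, 5, 2)\), assuming \(s\) here equals the number of features/columns (there is no cls_token in this example).

>>> import torch
>>> from pytorch_widedeep.models import FTTransformer
>>> X_tab = torch.cat((torch.empty(5, 4).random_(4), torch.rand(5, 1)), axis=1)
>>> colnames = ['a', 'b', 'c', 'd', 'e']
>>> cat_embed_input = [(u, i) for u, i in zip(colnames[:4], [4] * 4)]
>>> column_idx = {k: v for v, k in enumerate(colnames)}
>>> model = FTTransformer(column_idx=column_idx, cat_embed_input=cat_embed_input, continuous_cols=['e'])
>>> out = model(X_tab)
>>> # one tensor per FTTransformer block, each of shape (N, H, F, k)
>>> attn_shapes = [w.shape for w in model.attention_weights]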

TabPerceiver

TabPerceiver(column_idx, *, cat_embed_input=None, cat_embed_dropout=None, use_cat_bias=None, cat_embed_activation=None, shared_embed=None, add_shared_embed=None, frac_shared_embed=None, continuous_cols=None, cont_norm_layer=None, embed_continuous_method='standard', cont_embed_dropout=None, cont_embed_activation=None, quantization_setup=None, n_frequencies=None, sigma=None, share_last_layer=None, full_embed_dropout=None, input_dim=32, n_cross_attns=1, n_cross_attn_heads=4, n_latents=16, latent_dim=128, n_latent_heads=4, n_latent_blocks=4, n_perceiver_blocks=4, share_weights=False, attn_dropout=0.1, ff_dropout=0.1, ff_factor=4, transformer_activation='geglu', mlp_hidden_dims=None, mlp_activation=None, mlp_dropout=None, mlp_batchnorm=None, mlp_batchnorm_last=None, mlp_linear_first=None)

Bases: BaseTabularModelWithAttention

Defines an adaptation of a Perceiver that can be used as the deeptabular component of a Wide & Deep model or independently by itself.

Most of the parameters for this class are Optional since the use of categorical or continuous is in fact optional (i.e. one can use categorical features only, continuous features only or both).

ℹ️ NOTE: while there are scientific publications for the TabTransformer, SAINT and FTTransformer, the TabPerceiver and the TabFastFormer are our own adaptations of the Perceiver and the FastFormer for tabular data.

Parameters:

  • column_idx (Dict[str, int]) –

    Dict containing the index of the columns that will be passed through the TabPerceiver model. Required to slice the tensors. e.g. {'education': 0, 'relationship': 1, 'workclass': 2, ...}.

  • cat_embed_input (Optional[List[Tuple[str, int]]], default: None ) –

    List of Tuples with the column name and number of unique values for each categorical column, e.g. [(education, 11), ...]

  • cat_embed_dropout (Optional[float], default: None ) –

    Categorical embeddings dropout. If None, it will default to 0.

  • use_cat_bias (Optional[bool], default: None ) –

    Boolean indicating if bias will be used for the categorical embeddings. If None, it will default to 'False'.

  • cat_embed_activation (Optional[str], default: None ) –

    Activation function for the categorical embeddings, if any. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported

  • shared_embed (Optional[bool], default: None ) –

    Boolean indicating if the embeddings will be "shared". The idea behind shared_embed is described in Appendix A of the TabTransformer paper: 'The goal of having column embedding is to enable the model to distinguish the classes in one column from those in the other columns'. In other words, the idea is to let the model learn which column is being embedded at any given time. See: pytorch_widedeep.models.transformers._layers.SharedEmbeddings.

  • add_shared_embed (Optional[bool], default: None ) –

    The two embedding sharing strategies are: 1) add the shared embeddings to the column embeddings or 2) replace the first frac_shared_embed fraction of the column embeddings with the shared embeddings. See pytorch_widedeep.models.embeddings_layers.SharedEmbeddings. If 'None' is passed, it will default to 'False'.

  • frac_shared_embed (Optional[float], default: None ) –

    The fraction of embeddings that will be shared (if add_shared_embed = False) by all the different categories for one particular column. If 'None' is passed, it will default to 0.0.

  • continuous_cols (Optional[List[str]], default: None ) –

    List with the name of the numeric (aka continuous) columns

  • cont_norm_layer (Optional[Literal[batchnorm, layernorm]], default: None ) –

    Type of normalization layer applied to the continuous features. Options are: 'layernorm' and 'batchnorm'. If None, no normalization layer will be used.

  • embed_continuous_method (Optional[Literal[standard, piecewise, periodic]], default: 'standard' ) –

    Method to use to embed the continuous features. Options are: 'standard', 'periodic' or 'piecewise'. The 'standard' embedding method is based on the FT-Transformer implementation presented in the paper: Revisiting Deep Learning Models for Tabular Data. The 'periodic' and 'piecewise' methods were presented in the paper: On Embeddings for Numerical Features in Tabular Deep Learning. Please, read the papers for details.

  • cont_embed_dropout (Optional[float], default: None ) –

    Dropout for the continuous embeddings. If None, it will default to 0.0

  • cont_embed_activation (Optional[str], default: None ) –

    Activation function for the continuous embeddings if any. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported. If None, no activation function will be applied.

  • quantization_setup (Optional[Dict[str, List[float]]], default: None ) –

    This parameter is used when the 'piecewise' method is used to embed the continuous cols. It is a dict where keys are the names of the continuous columns and values are lists with the boundaries for the quantization of the continuous_cols. See the examples for details. If the 'piecewise' method is used, this parameter is required (a minimal sketch is shown after the Examples below).

  • n_frequencies (Optional[int], default: None ) –

    This is the so called 'k' in their paper On Embeddings for Numerical Features in Tabular Deep Learning, and is the number of 'frequencies' that will be used to represent each continuous column. See their Eq 2 in the paper for details. If the 'periodic' method is used, this parameter is required.

  • sigma (Optional[float], default: None ) –

    This is the sigma parameter in the paper mentioned when describing the previous parameters and it is used to initialise the 'frequency weights'. See their Eq 2 in the paper for details. If the 'periodic' method is used, this parameter is required.

  • share_last_layer (Optional[bool], default: None ) –

    This parameter is not present in the before mentioned paper but it is implemented in the official repo. If True the linear layer that turns the frequencies into embeddings will be shared across the continuous columns. If False a different linear layer will be used for each continuous column. If the 'periodic' method is used, this parameter is required.

  • full_embed_dropout (Optional[bool], default: None ) –

    If True, the full embedding corresponding to a column will be masked out/dropout. If None, it will default to False.

  • input_dim (int, default: 32 ) –

    The so-called dimension of the model. This is the dimension of the embeddings used to encode the categorical and/or continuous columns.

  • n_cross_attns (int, default: 1 ) –

    Number of times each perceiver block will cross attend to the input data (i.e. number of cross attention components per perceiver block). This should normally be 1. However, the paper describes some architectures (normally computer vision-related problems) where the Perceiver attends multiple times to the input array. Therefore, multiple cross attentions to the input array might also be useful in some cases for tabular data 🤷.

  • n_cross_attn_heads (int, default: 4 ) –

    Number of attention heads for the cross attention component

  • n_latents (int, default: 16 ) –

    Number of latents. This is the \(N\) parameter in the paper. As indicated in the paper, this number should be significantly lower than \(M\) (the number of columns in the dataset). Setting \(N\) closer to \(M\) defies the main purpose of the Perceiver, which is to overcome the transformer quadratic bottleneck

  • latent_dim (int, default: 128 ) –

    Latent dimension.

  • n_latent_heads (int, default: 4 ) –

    Number of attention heads per Latent Transformer

  • n_latent_blocks (int, default: 4 ) –

    Number of transformer encoder blocks (normalised MHA + normalised FF) per Latent Transformer

  • n_perceiver_blocks (int, default: 4 ) –

    Number of Perceiver blocks defined as [Cross Attention + Latent Transformer]

  • share_weights (bool, default: False ) –

    Boolean indicating if the weights will be shared between Perceiver blocks

  • attn_dropout (float, default: 0.1 ) –

    Dropout that will be applied to the Multi-Head Attention layers

  • ff_dropout (float, default: 0.1 ) –

    Dropout that will be applied to the FeedForward network

  • ff_factor (int, default: 4 ) –

    Multiplicative factor applied to the first layer of the FF network in each Transformer block. This is normally set to 4.

  • transformer_activation (str, default: 'geglu' ) –

    Transformer Encoder activation function. 'tanh', 'relu', 'leaky_relu', 'gelu', 'geglu' and 'reglu' are supported

  • mlp_hidden_dims (Optional[List[int]], default: None ) –

    List with the number of neurons per dense layer in the MLP. e.g: [64, 32]. If not provided no MLP on top of the final Transformer block will be used.

  • mlp_activation (Optional[str], default: None ) –

    Activation function for the dense layers of the MLP. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported. If 'mlp_hidden_dims' is not None and this parameter is None, it will default to 'relu'.

  • mlp_dropout (Optional[float], default: None ) –

    float with the dropout between the dense layers of the MLP. If 'mlp_hidden_dims' is not None and this parameter is None, it will default to 0.0.

  • mlp_batchnorm (Optional[bool], default: None ) –

    Boolean indicating whether or not batch normalization will be applied to the dense layers. If 'mlp_hidden_dims' is not None and this parameter is None, it will default to False.

  • mlp_batchnorm_last (Optional[bool], default: None ) –

    Boolean indicating whether or not batch normalization will be applied to the last of the dense layers. If 'mlp_hidden_dims' is not None and this parameter is None, it will default to False.

  • mlp_linear_first (Optional[bool], default: None ) –

    Boolean indicating the order of the operations in the dense layer. If True: [LIN -> ACT -> BN -> DP]. If False: [BN -> DP -> LIN -> ACT]. If 'mlp_hidden_dims' is not None and this parameter is None, it will default to True.

Attributes:

  • encoder (ModuleDict) –

    ModuleDict with the Perceiver blocks

  • latents (Parameter) –

    Latents that will be used for prediction

  • mlp (Module) –

    MLP component in the model

Examples:

>>> import torch
>>> from pytorch_widedeep.models import TabPerceiver
>>> X_tab = torch.cat((torch.empty(5, 4).random_(4), torch.rand(5, 1)), axis=1)
>>> colnames = ['a', 'b', 'c', 'd', 'e']
>>> cat_embed_input = [(u,i) for u,i in zip(colnames[:4], [4]*4)]
>>> continuous_cols = ['e']
>>> column_idx = {k:v for v,k in enumerate(colnames)}
>>> model = TabPerceiver(column_idx=column_idx, cat_embed_input=cat_embed_input,
... continuous_cols=continuous_cols, n_latents=2, latent_dim=16,
... n_perceiver_blocks=2)
>>> out = model(X_tab)
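
To make the quantization_setup format concrete, here is a minimal, hedged sketch that embeds the continuous column with the 'piecewise' method; the bin boundaries below are arbitrary values chosen only for illustration.

>>> import torch
>>> from pytorch_widedeep.models import TabPerceiver
>>> X_tab = torch.cat((torch.empty(5, 4).random_(4), torch.rand(5, 1)), axis=1)
>>> colnames = ['a', 'b', 'c', 'd', 'e']
>>> cat_embed_input = [(u, i) for u, i in zip(colnames[:4], [4] * 4)]
>>> column_idx = {k: v for v, k in enumerate(colnames)}
>>> # illustrative bin boundaries for the continuous column 'e'
>>> quantization_setup = {'e': [0.0, 0.25, 0.5, 0.75, 1.0]}
>>> model = TabPerceiver(column_idx=column_idx, cat_embed_input=cat_embed_input,
...     continuous_cols=['e'], embed_continuous_method='piecewise',
...     quantization_setup=quantization_setup, n_latents=2, latent_dim=16,
...     n_perceiver_blocks=2)
>>> out = model(X_tab)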
Source code in pytorch_widedeep/models/tabular/transformers/tab_perceiver.py
def __init__(
    self,
    column_idx: Dict[str, int],
    *,
    cat_embed_input: Optional[List[Tuple[str, int]]] = None,
    cat_embed_dropout: Optional[float] = None,
    use_cat_bias: Optional[bool] = None,
    cat_embed_activation: Optional[str] = None,
    shared_embed: Optional[bool] = None,
    add_shared_embed: Optional[bool] = None,
    frac_shared_embed: Optional[float] = None,
    continuous_cols: Optional[List[str]] = None,
    cont_norm_layer: Optional[Literal["batchnorm", "layernorm"]] = None,
    embed_continuous_method: Optional[
        Literal["standard", "piecewise", "periodic"]
    ] = "standard",
    cont_embed_dropout: Optional[float] = None,
    cont_embed_activation: Optional[str] = None,
    quantization_setup: Optional[Dict[str, List[float]]] = None,
    n_frequencies: Optional[int] = None,
    sigma: Optional[float] = None,
    share_last_layer: Optional[bool] = None,
    full_embed_dropout: Optional[bool] = None,
    input_dim: int = 32,
    n_cross_attns: int = 1,
    n_cross_attn_heads: int = 4,
    n_latents: int = 16,
    latent_dim: int = 128,
    n_latent_heads: int = 4,
    n_latent_blocks: int = 4,
    n_perceiver_blocks: int = 4,
    share_weights: bool = False,
    attn_dropout: float = 0.1,
    ff_dropout: float = 0.1,
    ff_factor: int = 4,
    transformer_activation: str = "geglu",
    mlp_hidden_dims: Optional[List[int]] = None,
    mlp_activation: Optional[str] = None,
    mlp_dropout: Optional[float] = None,
    mlp_batchnorm: Optional[bool] = None,
    mlp_batchnorm_last: Optional[bool] = None,
    mlp_linear_first: Optional[bool] = None,
):
    super(TabPerceiver, self).__init__(
        column_idx=column_idx,
        cat_embed_input=cat_embed_input,
        cat_embed_dropout=cat_embed_dropout,
        use_cat_bias=use_cat_bias,
        cat_embed_activation=cat_embed_activation,
        shared_embed=shared_embed,
        add_shared_embed=add_shared_embed,
        frac_shared_embed=frac_shared_embed,
        continuous_cols=continuous_cols,
        cont_norm_layer=cont_norm_layer,
        embed_continuous=None,
        embed_continuous_method=embed_continuous_method,
        cont_embed_dropout=cont_embed_dropout,
        cont_embed_activation=cont_embed_activation,
        input_dim=input_dim,
        quantization_setup=quantization_setup,
        n_frequencies=n_frequencies,
        sigma=sigma,
        share_last_layer=share_last_layer,
        full_embed_dropout=full_embed_dropout,
    )

    self.n_cross_attns = n_cross_attns
    self.n_cross_attn_heads = n_cross_attn_heads
    self.n_latents = n_latents
    self.latent_dim = latent_dim
    self.n_latent_heads = n_latent_heads
    self.n_latent_blocks = n_latent_blocks
    self.n_perceiver_blocks = n_perceiver_blocks
    self.share_weights = share_weights
    self.attn_dropout = attn_dropout
    self.ff_dropout = ff_dropout
    self.ff_factor = ff_factor
    self.transformer_activation = transformer_activation

    self.mlp_hidden_dims = mlp_hidden_dims
    self.mlp_activation = mlp_activation
    self.mlp_dropout = mlp_dropout
    self.mlp_batchnorm = mlp_batchnorm
    self.mlp_batchnorm_last = mlp_batchnorm_last
    self.mlp_linear_first = mlp_linear_first

    # Embeddings are instantiated at the base model
    # Transformer blocks
    self.latents = nn.init.trunc_normal_(
        nn.Parameter(torch.empty(n_latents, latent_dim))
    )

    self.encoder = nn.ModuleDict()
    first_perceiver_block = self._build_perceiver_block()
    self.encoder["perceiver_block0"] = first_perceiver_block

    if share_weights:
        for n in range(1, n_perceiver_blocks):
            self.encoder["perceiver_block" + str(n)] = first_perceiver_block
    else:
        for n in range(1, n_perceiver_blocks):
            self.encoder["perceiver_block" + str(n)] = self._build_perceiver_block()

    self.mlp_first_hidden_dim = self.latent_dim

    # Mlp
    if self.mlp_hidden_dims is not None:
        self.mlp = MLP(
            d_hidden=[self.mlp_first_hidden_dim] + self.mlp_hidden_dims,
            activation=(
                "relu" if self.mlp_activation is None else self.mlp_activation
            ),
            dropout=0.0 if self.mlp_dropout is None else self.mlp_dropout,
            batchnorm=False if self.mlp_batchnorm is None else self.mlp_batchnorm,
            batchnorm_last=(
                False
                if self.mlp_batchnorm_last is None
                else self.mlp_batchnorm_last
            ),
            linear_first=(
                False if self.mlp_linear_first is None else self.mlp_linear_first
            ),
        )
    else:
        self.mlp = None

output_dim property

output_dim

The output dimension of the model. This is a required property necessary to build the WideDeep class

attention_weights property

attention_weights

List with the attention weights. If the weights are not shared between perceiver blocks each element of the list will be a list itself containing the Cross Attention and Latent Transformer attention weights respectively

The shape of the attention weights is:

  • Cross Attention: \((N, C, L, F)\)

  • Latent Attention: \((N, T, L, L)\)

Where \(N\) is the batch size, \(C\) is the number of Cross Attention heads, \(L\) is the number of Latents, \(F\) is the number of features/columns in the dataset and \(T\) is the number of Latent Attention heads.
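
Continuing the TabPerceiver example above (so \(N=5\), \(C=4\), \(L=2\), \(F=5\) and \(T=4\)), a hedged sketch of how these weights could be inspected after a forward pass; the shapes in the comments are the ones expected from the description above, not verified library output.

>>> # after running the TabPerceiver example above
>>> attn = model.attention_weights  # one entry per Perceiver block (weights not shared here)
>>> # each entry holds the Cross Attention weights, expected shape (5, 4, 2, 5) -> (N, C, L, F),
>>> # and the Latent Transformer attention weights, expected shape (5, 4, 2, 2) -> (N, T, L, L)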

TabFastFormer

TabFastFormer(column_idx, *, cat_embed_input=None, cat_embed_dropout=None, use_cat_bias=None, cat_embed_activation=None, shared_embed=None, add_shared_embed=None, frac_shared_embed=None, continuous_cols=None, cont_norm_layer=None, embed_continuous_method='standard', cont_embed_dropout=None, cont_embed_activation=None, quantization_setup=None, n_frequencies=None, sigma=None, share_last_layer=None, full_embed_dropout=None, input_dim=32, n_heads=8, use_bias=False, n_blocks=4, attn_dropout=0.1, ff_dropout=0.2, ff_factor=4, share_qv_weights=False, share_weights=False, transformer_activation='relu', mlp_hidden_dims=None, mlp_activation=None, mlp_dropout=None, mlp_batchnorm=None, mlp_batchnorm_last=None, mlp_linear_first=None)

Bases: BaseTabularModelWithAttention

Defines an adaptation of a FastFormer that can be used as the deeptabular component of a Wide & Deep model or independently by itself.

Most of the parameters for this class are Optional since the use of categorical or continuous is in fact optional (i.e. one can use categorical features only, continuous features only or both).

ℹ️ NOTE: while there are scientific publications for the TabTransformer, SAINT and FTTransformer, the TabPerceiver and the TabFastFormer are our own adaptations of the Perceiver and the FastFormer for tabular data.

Parameters:

  • column_idx (Dict[str, int]) –

    Dict containing the index of the columns that will be passed through the TabFastFormer model. Required to slice the tensors. e.g. {'education': 0, 'relationship': 1, 'workclass': 2, ...}.

  • cat_embed_input (Optional[List[Tuple[str, int]]], default: None ) –

    List of Tuples with the column name and number of unique values for each categorical column, e.g. [(education, 11), ...]

  • cat_embed_dropout (Optional[float], default: None ) –

    Categorical embeddings dropout. If None, it will default to 0.

  • use_cat_bias (Optional[bool], default: None ) –

    Boolean indicating if bias will be used for the categorical embeddings. If None, it will default to 'False'.

  • cat_embed_activation (Optional[str], default: None ) –

    Activation function for the categorical embeddings, if any. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported

  • shared_embed (Optional[bool], default: None ) –

    Boolean indicating if the embeddings will be "shared". The idea behind shared_embed is described in Appendix A of the TabTransformer paper: 'The goal of having column embedding is to enable the model to distinguish the classes in one column from those in the other columns'. In other words, the idea is to let the model learn which column is being embedded at any given time. See: pytorch_widedeep.models.transformers._layers.SharedEmbeddings.

  • add_shared_embed (Optional[bool], default: None ) –

    The two embedding sharing strategies are: 1) add the shared embeddings to the column embeddings or 2) replace the first frac_shared_embed fraction of the column embeddings with the shared embeddings. See pytorch_widedeep.models.embeddings_layers.SharedEmbeddings. If 'None' is passed, it will default to 'False'.

  • frac_shared_embed (Optional[float], default: None ) –

    The fraction of embeddings that will be shared (if add_shared_embed = False) by all the different categories for one particular column. If 'None' is passed, it will default to 0.0.

  • continuous_cols (Optional[List[str]], default: None ) –

    List with the name of the numeric (aka continuous) columns

  • cont_norm_layer (Optional[Literal[batchnorm, layernorm]], default: None ) –

    Type of normalization layer applied to the continuous features. Options are: 'layernorm' and 'batchnorm'. If None, no normalization layer will be used.

  • embed_continuous_method (Optional[Literal[standard, piecewise, periodic]], default: 'standard' ) –

    Method to use to embed the continuous features. Options are: 'standard', 'periodic' or 'piecewise'. The 'standard' embedding method is based on the FT-Transformer implementation presented in the paper: Revisiting Deep Learning Models for Tabular Data. The 'periodic' and 'piecewise' methods were presented in the paper: On Embeddings for Numerical Features in Tabular Deep Learning. Please, read the papers for details.

  • cont_embed_dropout (Optional[float], default: None ) –

    Dropout for the continuous embeddings. If None, it will default to 0.0

  • cont_embed_activation (Optional[str], default: None ) –

    Activation function for the continuous embeddings if any. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported. If None, no activation function will be applied.

  • quantization_setup (Optional[Dict[str, List[float]]], default: None ) –

    This parameter is used when the 'piecewise' method is used to embed the continuous cols. It is a dict where keys are the names of the continuous columns and values are lists with the boundaries for the quantization of the continuous_cols. See the examples for details. If the 'piecewise' method is used, this parameter is required.

  • n_frequencies (Optional[int], default: None ) –

    This is the so called 'k' in their paper On Embeddings for Numerical Features in Tabular Deep Learning, and is the number of 'frequencies' that will be used to represent each continuous column. See their Eq 2 in the paper for details. If the 'periodic' method is used, this parameter is required.

  • sigma (Optional[float], default: None ) –

    This is the sigma parameter in the paper mentioned when describing the previous parameters and it is used to initialise the 'frequency weights'. See their Eq 2 in the paper for details. If the 'periodic' method is used, this parameter is required.

  • share_last_layer (Optional[bool], default: None ) –

    This parameter is not present in the before mentioned paper but it is implemented in the official repo. If True the linear layer that turns the frequencies into embeddings will be shared across the continuous columns. If False a different linear layer will be used for each continuous column. If the 'periodic' method is used, this parameter is required.

  • full_embed_dropout (Optional[bool], default: None ) –

    If True, the full embedding corresponding to a column will be masked out/dropout. If None, it will default to False.

  • input_dim (int, default: 32 ) –

    The so-called dimension of the model. This is the dimension of the embeddings used to encode the categorical and/or continuous columns.

  • n_heads (int, default: 8 ) –

    Number of attention heads per FastFormer block

  • use_bias (bool, default: False ) –

    Boolean indicating whether or not to use bias in the Q, K, and V projection layers

  • n_blocks (int, default: 4 ) –

    Number of FastFormer blocks

  • attn_dropout (float, default: 0.1 ) –

    Dropout that will be applied to the Additive Attention layers

  • ff_dropout (float, default: 0.2 ) –

    Dropout that will be applied to the FeedForward network

  • ff_factor (int, default: 4 ) –

    Multiplicative factor applied to the first layer of the FF network in each Transformer block. This is normally set to 4.

  • share_qv_weights (bool, default: False ) –

    Following the paper, this is a boolean indicating if the Value (\(V\)) and the Query (\(Q\)) transformation parameters will be shared.

  • share_weights (bool, default: False ) –

    In addition to sharing the \(V\) and \(Q\) transformation parameters, the parameters across different Fastformer layers can also be shared. Please, see pytorch_widedeep/models/tabular/transformers/tab_fastformer.py for details

  • transformer_activation (str, default: 'relu' ) –

    Transformer Encoder activation function. 'tanh', 'relu', 'leaky_relu', 'gelu', 'geglu' and 'reglu' are supported

  • mlp_hidden_dims (Optional[List[int]], default: None ) –

    List with the number of neurons per dense layer in the MLP. e.g: [64, 32]. If not provided no MLP on top of the final FastFormer block will be used.

  • mlp_activation (Optional[str], default: None ) –

    Activation function for the dense layers of the MLP. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported. If 'mlp_hidden_dims' is not None and this parameter is None, it will default to 'relu'.

  • mlp_dropout (Optional[float], default: None ) –

    float with the dropout between the dense layers of the MLP. If 'mlp_hidden_dims' is not None and this parameter is None, it will default to 0.0.

  • mlp_batchnorm (Optional[bool], default: None ) –

    Boolean indicating whether or not batch normalization will be applied to the dense layers. If 'mlp_hidden_dims' is not None and this parameter is None, it will default to False.

  • mlp_batchnorm_last (Optional[bool], default: None ) –

    Boolean indicating whether or not batch normalization will be applied to the last of the dense layers. If 'mlp_hidden_dims' is not None and this parameter is None, it will default to False.

  • mlp_linear_first (Optional[bool], default: None ) –

    Boolean indicating the order of the operations in the dense layer. If True: [LIN -> ACT -> BN -> DP]. If False: [BN -> DP -> LIN -> ACT]. If 'mlp_hidden_dims' is not None and this parameter is None, it will default to True.

Attributes:

  • encoder (Module) –

    Sequence of FastFormer blocks.

  • mlp (Module) –

    MLP component in the model

Examples:

>>> import torch
>>> from pytorch_widedeep.models import TabFastFormer
>>> X_tab = torch.cat((torch.empty(5, 4).random_(4), torch.rand(5, 1)), axis=1)
>>> colnames = ['a', 'b', 'c', 'd', 'e']
>>> cat_embed_input = [(u,i) for u,i in zip(colnames[:4], [4]*4)]
>>> continuous_cols = ['e']
>>> column_idx = {k:v for v,k in enumerate(colnames)}
>>> model = TabFastFormer(column_idx=column_idx, cat_embed_input=cat_embed_input, continuous_cols=continuous_cols)
>>> out = model(X_tab)
Source code in pytorch_widedeep/models/tabular/transformers/tab_fastformer.py
def __init__(
    self,
    column_idx: Dict[str, int],
    *,
    cat_embed_input: Optional[List[Tuple[str, int]]] = None,
    cat_embed_dropout: Optional[float] = None,
    use_cat_bias: Optional[bool] = None,
    cat_embed_activation: Optional[str] = None,
    shared_embed: Optional[bool] = None,
    add_shared_embed: Optional[bool] = None,
    frac_shared_embed: Optional[float] = None,
    continuous_cols: Optional[List[str]] = None,
    cont_norm_layer: Optional[Literal["batchnorm", "layernorm"]] = None,
    embed_continuous_method: Optional[
        Literal["standard", "piecewise", "periodic"]
    ] = "standard",
    cont_embed_dropout: Optional[float] = None,
    cont_embed_activation: Optional[str] = None,
    quantization_setup: Optional[Dict[str, List[float]]] = None,
    n_frequencies: Optional[int] = None,
    sigma: Optional[float] = None,
    share_last_layer: Optional[bool] = None,
    full_embed_dropout: Optional[bool] = None,
    input_dim: int = 32,
    n_heads: int = 8,
    use_bias: bool = False,
    n_blocks: int = 4,
    attn_dropout: float = 0.1,
    ff_dropout: float = 0.2,
    ff_factor: int = 4,
    share_qv_weights: bool = False,
    share_weights: bool = False,
    transformer_activation: str = "relu",
    mlp_hidden_dims: Optional[List[int]] = None,
    mlp_activation: Optional[str] = None,
    mlp_dropout: Optional[float] = None,
    mlp_batchnorm: Optional[bool] = None,
    mlp_batchnorm_last: Optional[bool] = None,
    mlp_linear_first: Optional[bool] = None,
):
    super(TabFastFormer, self).__init__(
        column_idx=column_idx,
        cat_embed_input=cat_embed_input,
        cat_embed_dropout=cat_embed_dropout,
        use_cat_bias=use_cat_bias,
        cat_embed_activation=cat_embed_activation,
        shared_embed=shared_embed,
        add_shared_embed=add_shared_embed,
        frac_shared_embed=frac_shared_embed,
        continuous_cols=continuous_cols,
        cont_norm_layer=cont_norm_layer,
        embed_continuous=None,
        embed_continuous_method=embed_continuous_method,
        cont_embed_dropout=cont_embed_dropout,
        cont_embed_activation=cont_embed_activation,
        input_dim=input_dim,
        quantization_setup=quantization_setup,
        n_frequencies=n_frequencies,
        sigma=sigma,
        share_last_layer=share_last_layer,
        full_embed_dropout=full_embed_dropout,
    )

    self.n_heads = n_heads
    self.use_bias = use_bias
    self.n_blocks = n_blocks
    self.attn_dropout = attn_dropout
    self.ff_dropout = ff_dropout
    self.ff_factor = ff_factor
    self.share_qv_weights = share_qv_weights
    self.share_weights = share_weights
    self.transformer_activation = transformer_activation

    self.mlp_hidden_dims = mlp_hidden_dims
    self.mlp_activation = mlp_activation
    self.mlp_dropout = mlp_dropout
    self.mlp_batchnorm = mlp_batchnorm
    self.mlp_batchnorm_last = mlp_batchnorm_last
    self.mlp_linear_first = mlp_linear_first

    self.with_cls_token = "cls_token" in column_idx
    self.n_cat = len(cat_embed_input) if cat_embed_input is not None else 0
    self.n_cont = len(continuous_cols) if continuous_cols is not None else 0
    self.n_feats = self.n_cat + self.n_cont

    # Embeddings are instantiated at the base model
    # Transformer blocks
    self.encoder = nn.Sequential()
    first_fastformer_block = FastFormerEncoder(
        input_dim,
        n_heads,
        use_bias,
        attn_dropout,
        ff_dropout,
        ff_factor,
        share_qv_weights,
        transformer_activation,
    )
    self.encoder.add_module("fastformer_block0", first_fastformer_block)
    for i in range(1, n_blocks):
        if share_weights:
            self.encoder.add_module(
                "fastformer_block" + str(i), first_fastformer_block
            )
        else:
            self.encoder.add_module(
                "fastformer_block" + str(i),
                FastFormerEncoder(
                    input_dim,
                    n_heads,
                    use_bias,
                    attn_dropout,
                    ff_dropout,
                    ff_factor,
                    share_qv_weights,
                    transformer_activation,
                ),
            )

    self.mlp_first_hidden_dim = (
        self.input_dim if self.with_cls_token else (self.n_feats * self.input_dim)
    )

    # Mlp: adding an MLP on top of the Resnet blocks is optional and
    # therefore all related params are optional
    if self.mlp_hidden_dims is not None:
        self.mlp = MLP(
            d_hidden=[self.mlp_first_hidden_dim] + self.mlp_hidden_dims,
            activation=(
                "relu" if self.mlp_activation is None else self.mlp_activation
            ),
            dropout=0.0 if self.mlp_dropout is None else self.mlp_dropout,
            batchnorm=False if self.mlp_batchnorm is None else self.mlp_batchnorm,
            batchnorm_last=(
                False
                if self.mlp_batchnorm_last is None
                else self.mlp_batchnorm_last
            ),
            linear_first=(
                False if self.mlp_linear_first is None else self.mlp_linear_first
            ),
        )
    else:
        self.mlp = None

output_dim property

output_dim

The output dimension of the model. This is a required property necessary to build the WideDeep class

attention_weights property

attention_weights

List with the attention weights. Each element of the list is a tuple where the first and second elements are the \(\alpha\) and \(\beta\) attention weights in the paper.

The shape of the attention weights is \((N, H, F)\) where \(N\) is the batch size, \(H\) is the number of attention heads and \(F\) is the number of features/columns in the dataset


ℹ️ NOTE: when we started developing the library we thought that combining Deep Learning architectures for tabular data with CNN-based architectures (pretrained or not) for images and Transformer-based architectures for text would be overkill (also, pretrained transformer-based models were not as readily available as they are today). Therefore, at that time we made the decision of including in the library simple RNN-based architectures for the text dataset. A lot has changed since then and it is our intention to integrate this library with Hugging Face's Transformers library in the near future. Nonetheless, note that it is still possible to use any custom model as the deeptext component with this library. Please, see the example section in this documentation for details

BasicRNN

BasicRNN(vocab_size, embed_dim=None, embed_matrix=None, embed_trainable=True, rnn_type='lstm', hidden_dim=64, n_layers=3, rnn_dropout=0.1, bidirectional=False, use_hidden_state=True, padding_idx=1, head_hidden_dims=None, head_activation='relu', head_dropout=None, head_batchnorm=False, head_batchnorm_last=False, head_linear_first=False)

Bases: BaseWDModelComponent

Standard text classifier/regressor comprising a stack of RNNs (LSTMs or GRUs) that can be used as the deeptext component of a Wide & Deep model or independently by itself.

In addition, there is the option to add a Fully Connected (FC) set of dense layers on top of the stack of RNNs

Parameters:

  • vocab_size (int) –

    Number of words in the vocabulary

  • embed_dim (Optional[int], default: None ) –

    Dimension of the word embeddings if non-pretrained word vectors are used

  • embed_matrix (Optional[ndarray], default: None ) –

    Pretrained word embeddings (a minimal sketch is shown after the Examples below)

  • embed_trainable (bool, default: True ) –

    Boolean indicating if the pretrained embeddings are trainable

  • rnn_type (str, default: 'lstm' ) –

    String indicating the type of RNN to use. One of 'lstm' or 'gru'

  • hidden_dim (int, default: 64 ) –

    Hidden dim of the RNN

  • n_layers (int, default: 3 ) –

    Number of recurrent layers

  • rnn_dropout (float, default: 0.1 ) –

    Dropout for each RNN layer except the last layer

  • bidirectional (bool, default: False ) –

    Boolean indicating whether the stacked RNNs are bidirectional

  • use_hidden_state (bool, default: True ) –

    Boolean indicating whether to use the final hidden state or the RNN's output as predicting features. Typically the former is used.

  • padding_idx (int, default: 1 ) –

    index of the padding token in the padded-tokenised sequences. The TextPreprocessor class within this library uses fastai's tokenizer where the token index 0 is reserved for the 'unknown' word token. Therefore, the default value is set to 1.

  • head_hidden_dims (Optional[List[int]], default: None ) –

    List with the sizes of the dense layers in the head e.g: [128, 64]

  • head_activation (str, default: 'relu' ) –

    Activation function for the dense layers in the head. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported

  • head_dropout (Optional[float], default: None ) –

    Dropout of the dense layers in the head

  • head_batchnorm (bool, default: False ) –

    Boolean indicating whether or not to include batch normalization in the dense layers that form the 'rnn_mlp'

  • head_batchnorm_last (bool, default: False ) –

    Boolean indicating whether or not to apply batch normalization to the last of the dense layers in the head

  • head_linear_first (bool, default: False ) –

    Boolean indicating the order of the operations in the dense layer. If True: [LIN -> ACT -> BN -> DP]. If False: [BN -> DP -> LIN -> ACT]

Attributes:

  • word_embed (Module) –

    word embedding matrix

  • rnn (Module) –

    Stack of RNNs

  • rnn_mlp (Module) –

    Stack of dense layers on top of the RNN. This will only exist if head_hidden_dims is not None

Examples:

>>> import torch
>>> from pytorch_widedeep.models import BasicRNN
>>> X_text = torch.cat((torch.zeros([5,1]), torch.empty(5, 4).random_(1,4)), axis=1)
>>> model = BasicRNN(vocab_size=4, hidden_dim=4, n_layers=2, padding_idx=0, embed_dim=4)
>>> out = model(X_text)
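
The embed_matrix path can be sketched as follows; this is a hedged example where the random matrix simply stands in for real pretrained word vectors (one row per word in the vocabulary), and the embedding dimension is taken from the matrix itself.

>>> import numpy as np
>>> import torch
>>> from pytorch_widedeep.models import BasicRNN
>>> X_text = torch.cat((torch.zeros([5,1]), torch.empty(5, 4).random_(1,4)), axis=1)
>>> # stand-in for pretrained word vectors: vocab_size rows, 8-dimensional vectors
>>> embed_matrix = np.random.rand(4, 8).astype('float32')
>>> model = BasicRNN(vocab_size=4, embed_matrix=embed_matrix, hidden_dim=4, n_layers=2, padding_idx=0)
>>> out = model(X_text)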
Source code in pytorch_widedeep/models/text/basic_rnn.py
def __init__(
    self,
    vocab_size: int,
    embed_dim: Optional[int] = None,
    embed_matrix: Optional[np.ndarray] = None,
    embed_trainable: bool = True,
    rnn_type: str = "lstm",
    hidden_dim: int = 64,
    n_layers: int = 3,
    rnn_dropout: float = 0.1,
    bidirectional: bool = False,
    use_hidden_state: bool = True,
    padding_idx: int = 1,
    head_hidden_dims: Optional[List[int]] = None,
    head_activation: str = "relu",
    head_dropout: Optional[float] = None,
    head_batchnorm: bool = False,
    head_batchnorm_last: bool = False,
    head_linear_first: bool = False,
):
    super(BasicRNN, self).__init__()

    if embed_dim is None and embed_matrix is None:
        raise ValueError(
            "If no 'embed_matrix' is passed, the embedding dimension must"
            "be specified with 'embed_dim'"
        )

    if rnn_type.lower() not in ["lstm", "gru"]:
        raise ValueError(
            f"'rnn_type' must be 'lstm' or 'gru', got {rnn_type} instead"
        )

    if (
        embed_dim is not None
        and embed_matrix is not None
        and not embed_dim == embed_matrix.shape[1]
    ):
        warnings.warn(
            "the input embedding dimension {} and the dimension of the "
            "pretrained embeddings {} do not match. The pretrained embeddings "
            "dimension ({}) will be used".format(
                embed_dim, embed_matrix.shape[1], embed_matrix.shape[1]
            ),
            UserWarning,
        )

    self.vocab_size = vocab_size
    self.embed_trainable = embed_trainable
    self.embed_dim = embed_dim

    self.rnn_type = rnn_type
    self.hidden_dim = hidden_dim
    self.n_layers = n_layers
    self.rnn_dropout = rnn_dropout
    self.bidirectional = bidirectional
    self.use_hidden_state = use_hidden_state
    self.padding_idx = padding_idx

    self.head_hidden_dims = head_hidden_dims
    self.head_activation = head_activation
    self.head_dropout = head_dropout
    self.head_batchnorm = head_batchnorm
    self.head_batchnorm_last = head_batchnorm_last
    self.head_linear_first = head_linear_first

    # Embeddings
    if embed_matrix is not None:
        self.word_embed, self.embed_dim = self._set_embeddings(embed_matrix)
    else:
        self.word_embed = nn.Embedding(
            self.vocab_size, self.embed_dim, padding_idx=self.padding_idx
        )

    # RNN
    rnn_params = {
        "input_size": self.embed_dim,
        "hidden_size": hidden_dim,
        "num_layers": n_layers,
        "bidirectional": bidirectional,
        "dropout": rnn_dropout,
        "batch_first": True,
    }
    if self.rnn_type.lower() == "lstm":
        self.rnn: Union[nn.LSTM, nn.GRU] = nn.LSTM(**rnn_params)
    elif self.rnn_type.lower() == "gru":
        self.rnn = nn.GRU(**rnn_params)

    self.rnn_output_dim = hidden_dim * 2 if bidirectional else hidden_dim

    # FC-Head (Mlp)
    if self.head_hidden_dims is not None:
        head_hidden_dims = [self.rnn_output_dim] + self.head_hidden_dims
        self.rnn_mlp: Union[MLP, nn.Identity] = MLP(
            head_hidden_dims,
            head_activation,
            head_dropout,
            head_batchnorm,
            head_batchnorm_last,
            head_linear_first,
        )
    else:
        # simple hack to add readability in the forward pass
        self.rnn_mlp = nn.Identity()

output_dim property

output_dim

The output dimension of the model. This is a required property necessary to build the WideDeep class

AttentiveRNN

AttentiveRNN(vocab_size, embed_dim=None, embed_matrix=None, embed_trainable=True, rnn_type='lstm', hidden_dim=64, n_layers=3, rnn_dropout=0.1, bidirectional=False, use_hidden_state=True, padding_idx=1, attn_concatenate=True, attn_dropout=0.1, head_hidden_dims=None, head_activation='relu', head_dropout=None, head_batchnorm=False, head_batchnorm_last=False, head_linear_first=False)

Bases: BasicRNN

Text classifier/regressor comprising a stack of RNNs (LSTMs or GRUs) plus an attention layer. This model can be used as the deeptext component of a Wide & Deep model or independently by itself.

In addition, there is the option to add a Fully Connected (FC) set of dense layers on top of attention layer

Parameters:

  • vocab_size (int) –

    Number of words in the vocabulary

  • embed_dim (Optional[int], default: None ) –

    Dimension of the word embeddings if non-pretrained word vectors are used

  • embed_matrix (Optional[ndarray], default: None ) –

    Pretrained word embeddings

  • embed_trainable (bool, default: True ) –

    Boolean indicating if the pretrained embeddings are trainable

  • rnn_type (str, default: 'lstm' ) –

    String indicating the type of RNN to use. One of 'lstm' or 'gru'

  • hidden_dim (int, default: 64 ) –

    Hidden dim of the RNN

  • n_layers (int, default: 3 ) –

    Number of recurrent layers

  • rnn_dropout (float, default: 0.1 ) –

    Dropout for each RNN layer except the last layer

  • bidirectional (bool, default: False ) –

    Boolean indicating whether the stacked RNNs are bidirectional

  • use_hidden_state (bool, default: True ) –

    Boolean indicating whether to use the final hidden state or the RNN's output as predicting features. Typically the former is used.

  • padding_idx (int, default: 1 ) –

    index of the padding token in the padded-tokenised sequences. The TextPreprocessor class within this library uses fastai's tokenizer where the token index 0 is reserved for the 'unknown' word token. Therefore, the default value is set to 1.

  • attn_concatenate (bool, default: True ) –

    Boolean indicating if the input to the attention mechanism will be the output of the RNN or the output of the RNN concatenated with the last hidden state.

  • attn_dropout (float, default: 0.1 ) –

    Internal dropout for the attention mechanism

  • head_hidden_dims (Optional[List[int]], default: None ) –

    List with the sizes of the dense layers in the head e.g: [128, 64]

  • head_activation (str, default: 'relu' ) –

    Activation function for the dense layers in the head. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported

  • head_dropout (Optional[float], default: None ) –

    Dropout of the dense layers in the head

  • head_batchnorm (bool, default: False ) –

    Boolean indicating whether or not to include batch normalization in the dense layers that form the 'rnn_mlp'

  • head_batchnorm_last (bool, default: False ) –

    Boolean indicating whether or not to apply batch normalization to the last of the dense layers in the head

  • head_linear_first (bool, default: False ) –

    Boolean indicating the order of the operations in the dense layer. If True: [LIN -> ACT -> BN -> DP]. If False: [BN -> DP -> LIN -> ACT]

Attributes:

  • word_embed (Module) –

    word embedding matrix

  • rnn (Module) –

    Stack of RNNs

  • rnn_mlp (Module) –

    Stack of dense layers on top of the RNN. This will only exist if head_hidden_dims is not None

Examples:

>>> import torch
>>> from pytorch_widedeep.models import AttentiveRNN
>>> X_text = torch.cat((torch.zeros([5,1]), torch.empty(5, 4).random_(1,4)), axis=1)
>>> model = AttentiveRNN(vocab_size=4, hidden_dim=4, n_layers=2, padding_idx=0, embed_dim=4)
>>> out = model(X_text)
Source code in pytorch_widedeep/models/text/attentive_rnn.py
def __init__(
    self,
    vocab_size: int,
    embed_dim: Optional[int] = None,
    embed_matrix: Optional[np.ndarray] = None,
    embed_trainable: bool = True,
    rnn_type: str = "lstm",
    hidden_dim: int = 64,
    n_layers: int = 3,
    rnn_dropout: float = 0.1,
    bidirectional: bool = False,
    use_hidden_state: bool = True,
    padding_idx: int = 1,
    attn_concatenate: bool = True,
    attn_dropout: float = 0.1,
    head_hidden_dims: Optional[List[int]] = None,
    head_activation: str = "relu",
    head_dropout: Optional[float] = None,
    head_batchnorm: bool = False,
    head_batchnorm_last: bool = False,
    head_linear_first: bool = False,
):
    super(AttentiveRNN, self).__init__(
        vocab_size=vocab_size,
        embed_dim=embed_dim,
        embed_matrix=embed_matrix,
        embed_trainable=embed_trainable,
        rnn_type=rnn_type,
        hidden_dim=hidden_dim,
        n_layers=n_layers,
        rnn_dropout=rnn_dropout,
        bidirectional=bidirectional,
        use_hidden_state=use_hidden_state,
        padding_idx=padding_idx,
        head_hidden_dims=head_hidden_dims,
        head_activation=head_activation,
        head_dropout=head_dropout,
        head_batchnorm=head_batchnorm,
        head_batchnorm_last=head_batchnorm_last,
        head_linear_first=head_linear_first,
    )

    # Embeddings and RNN defined in the BasicRNN inherited class

    # Attention
    self.attn_concatenate = attn_concatenate
    self.attn_dropout = attn_dropout

    if bidirectional and attn_concatenate:
        self.rnn_output_dim = hidden_dim * 4
    elif bidirectional or attn_concatenate:
        self.rnn_output_dim = hidden_dim * 2
    else:
        self.rnn_output_dim = hidden_dim
    self.attn = ContextAttention(
        self.rnn_output_dim, attn_dropout, sum_along_seq=True
    )

    # FC-Head (Mlp)
    if self.head_hidden_dims is not None:
        head_hidden_dims = [self.rnn_output_dim] + self.head_hidden_dims
        self.rnn_mlp = MLP(
            head_hidden_dims,
            head_activation,
            head_dropout,
            head_batchnorm,
            head_batchnorm_last,
            head_linear_first,
        )

attention_weights property

attention_weights

List with the attention weights

The shape of the attention weights is \((N, S)\), where \(N\) is the batch size and \(S\) is the length of the sequence
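
For instance, continuing the AttentiveRNN example above (batch size 5 and sequence length 5), a hedged sketch of how the weights could be inspected:

>>> # after running the AttentiveRNN example above
>>> attn = model.attention_weights  # list with the attention weights
>>> # each element is expected to have shape (N, S), here (5, 5)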

StackedAttentiveRNN

StackedAttentiveRNN(vocab_size, embed_dim=None, embed_matrix=None, embed_trainable=True, rnn_type='lstm', hidden_dim=64, bidirectional=False, padding_idx=1, n_blocks=3, attn_concatenate=False, attn_dropout=0.1, with_addnorm=False, head_hidden_dims=None, head_activation='relu', head_dropout=None, head_batchnorm=False, head_batchnorm_last=False, head_linear_first=False)

Bases: BaseWDModelComponent

Text classifier/regressor comprising a stack of blocks: [RNN + Attention]. This can be used as the deeptext component of a Wide & Deep model or independently by itself.

In addition, there is the option to add a Fully Connected (FC) set of dense layers on top of the attention blocks

Parameters:

  • vocab_size (int) –

    Number of words in the vocabulary

  • embed_dim (Optional[int], default: None ) –

    Dimension of the word embeddings if non-pretrained word vectors are used

  • embed_matrix (Optional[ndarray], default: None ) –

    Pretrained word embeddings

  • embed_trainable (bool, default: True ) –

    Boolean indicating if the pretrained embeddings are trainable

  • rnn_type (str, default: 'lstm' ) –

    String indicating the type of RNN to use. One of 'lstm' or 'gru'

  • hidden_dim (int, default: 64 ) –

    Hidden dim of the RNN

  • bidirectional (bool, default: False ) –

    Boolean indicating whether the stacked RNNs are bidirectional

  • padding_idx (int, default: 1 ) –

    index of the padding token in the padded-tokenised sequences. The TextPreprocessor class within this library uses fastai's tokenizer where the token index 0 is reserved for the 'unknown' word token. Therefore, the default value is set to 1.

  • n_blocks (int, default: 3 ) –

    Number of attention blocks. Each block is comprised by an RNN and a Context Attention Encoder

  • attn_concatenate (bool, default: False ) –

    Boolean indicating if the input to the attention mechanism will be the output of the RNN or the output of the RNN concatenated with the last hidden state.

  • attn_dropout (float, default: 0.1 ) –

    Internal dropout for the attention mechanism

  • with_addnorm (bool, default: False ) –

    Boolean indicating if the output of each block will be added to the input and normalised

  • head_hidden_dims (Optional[List[int]], default: None ) –

    List with the sizes of the dense layers in the head e.g: [128, 64]

  • head_activation (str, default: 'relu' ) –

    Activation function for the dense layers in the head. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported

  • head_dropout (Optional[float], default: None ) –

    Dropout of the dense layers in the head

  • head_batchnorm (bool, default: False ) –

    Boolean indicating whether or not to include batch normalization in the dense layers that form the 'rnn_mlp'

  • head_batchnorm_last (bool, default: False ) –

    Boolean indicating whether or not to apply batch normalization to the last of the dense layers in the head

  • head_linear_first (bool, default: False ) –

    Boolean indicating the order of the operations in the dense layer. If True: [LIN -> ACT -> BN -> DP]. If False: [BN -> DP -> LIN -> ACT]

Attributes:

  • word_embed (Module) –

    word embedding matrix

  • rnn (Module) –

    Stack of RNNs

  • rnn_mlp (Module) –

    Stack of dense layers on top of the RNN. This will only exist if head_hidden_dims is not None

Examples:

>>> import torch
>>> from pytorch_widedeep.models import StackedAttentiveRNN
>>> X_text = torch.cat((torch.zeros([5,1]), torch.empty(5, 4).random_(1,4)), axis=1)
>>> model = StackedAttentiveRNN(vocab_size=4, hidden_dim=4, padding_idx=0, embed_dim=4)
>>> out = model(X_text)
Source code in pytorch_widedeep/models/text/stacked_attentive_rnn.py
def __init__(
    self,
    vocab_size: int,
    embed_dim: Optional[int] = None,
    embed_matrix: Optional[np.ndarray] = None,
    embed_trainable: bool = True,
    rnn_type: str = "lstm",
    hidden_dim: int = 64,
    bidirectional: bool = False,
    padding_idx: int = 1,
    n_blocks: int = 3,
    attn_concatenate: bool = False,
    attn_dropout: float = 0.1,
    with_addnorm: bool = False,
    head_hidden_dims: Optional[List[int]] = None,
    head_activation: str = "relu",
    head_dropout: Optional[float] = None,
    head_batchnorm: bool = False,
    head_batchnorm_last: bool = False,
    head_linear_first: bool = False,
):
    super(StackedAttentiveRNN, self).__init__()

    if (
        embed_dim is not None
        and embed_matrix is not None
        and not embed_dim == embed_matrix.shape[1]
    ):
        warnings.warn(
            "the input embedding dimension {} and the dimension of the "
            "pretrained embeddings {} do not match. The pretrained embeddings "
            "dimension ({}) will be used".format(
                embed_dim, embed_matrix.shape[1], embed_matrix.shape[1]
            ),
            UserWarning,
        )

    if rnn_type.lower() not in ["lstm", "gru"]:
        raise ValueError(
            f"'rnn_type' must be 'lstm' or 'gru', got {rnn_type} instead"
        )

    self.vocab_size = vocab_size
    self.embed_trainable = embed_trainable
    self.embed_dim = embed_dim

    self.rnn_type = rnn_type
    self.hidden_dim = hidden_dim
    self.bidirectional = bidirectional
    self.padding_idx = padding_idx

    self.n_blocks = n_blocks
    self.attn_concatenate = attn_concatenate
    self.attn_dropout = attn_dropout
    self.with_addnorm = with_addnorm

    self.head_hidden_dims = head_hidden_dims
    self.head_activation = head_activation
    self.head_dropout = head_dropout
    self.head_batchnorm = head_batchnorm
    self.head_batchnorm_last = head_batchnorm_last
    self.head_linear_first = head_linear_first

    # Embeddings
    self.word_embed, self.embed_dim = self._set_embeddings(embed_matrix)

    # Linear Projection: if embed_dim is different from the input of the
    # attention blocks we add a linear projection
    if bidirectional and attn_concatenate:
        self.rnn_output_dim = hidden_dim * 4
    elif bidirectional or attn_concatenate:
        self.rnn_output_dim = hidden_dim * 2
    else:
        self.rnn_output_dim = hidden_dim

    if self.rnn_output_dim != self.embed_dim:
        self.embed_proj: Union[nn.Linear, nn.Identity] = nn.Linear(
            self.embed_dim, self.rnn_output_dim
        )
    else:
        self.embed_proj = nn.Identity()

    # RNN
    rnn_params = {
        "input_size": self.rnn_output_dim,
        "hidden_size": hidden_dim,
        "bidirectional": bidirectional,
        "batch_first": True,
    }
    if self.rnn_type.lower() == "lstm":
        self.rnn: Union[nn.LSTM, nn.GRU] = nn.LSTM(**rnn_params)
    elif self.rnn_type.lower() == "gru":
        self.rnn = nn.GRU(**rnn_params)

    # Attention blocks
    self.attention_blks = nn.ModuleList()
    for i in range(n_blocks):
        self.attention_blks.append(
            ContextAttentionEncoder(
                self.rnn,
                self.rnn_output_dim,
                attn_dropout,
                attn_concatenate,
                with_addnorm=with_addnorm if i != n_blocks - 1 else False,
                sum_along_seq=i == n_blocks - 1,
            )
        )

    # Mlp
    if self.head_hidden_dims is not None:
        head_hidden_dims = [self.rnn_output_dim] + self.head_hidden_dims
        self.rnn_mlp: Union[MLP, nn.Identity] = MLP(
            head_hidden_dims,
            head_activation,
            head_dropout,
            head_batchnorm,
            head_batchnorm_last,
            head_linear_first,
        )
    else:
        # simple hack to add readability in the forward pass
        self.rnn_mlp = nn.Identity()

output_dim property

output_dim

The output dimension of the model. This is a required property necessary to build the WideDeep class

attention_weights property

attention_weights

List with the attention weights per block

The shape of the attention weights is \((N, S)\), where \(N\) is the batch size and \(S\) is the length of the sequence
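
As a minimal, unofficial sketch (reusing the toy model from the example above), the attention weights are populated after a forward pass and can then be inspected per block:

>>> import torch
>>> from pytorch_widedeep.models import StackedAttentiveRNN
>>> X_text = torch.cat((torch.zeros([5, 1]), torch.empty(5, 4).random_(1, 4)), dim=1)
>>> model = StackedAttentiveRNN(vocab_size=4, hidden_dim=4, padding_idx=0, embed_dim=4, n_blocks=2)
>>> out = model(X_text)
>>> attn_per_block = model.attention_weights  # expected: a list of length n_blocks with tensors of shape (N, S)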

Transformer

Transformer(vocab_size, seq_length, input_dim, n_heads, n_blocks, attn_dropout=0.1, ff_dropout=0.1, ff_factor=4, activation='gelu', use_linear_attention=False, use_flash_attention=False, padding_idx=0, with_cls_token=False, *, with_pos_encoding=True, pos_encoding_dropout=0.1, pos_encoder=None)

Bases: Module

Basic Encoder-Only Transformer model for text classification/regression. As with all other models in the library, this model can be used as the deeptext component of a Wide & Deep model or independently by itself.

ℹ️ NOTE: This model is introduced in the context of recommendation systems and is intended for sequences of any nature (e.g. items). It can, of course, still be used for text. However, at this stage, we have decided not to include the possibility of loading pretrained word vectors, since we aim to integrate the library with Huggingface in the (hopefully) near future

Parameters:

  • vocab_size (int) –

    Number of words in the vocabulary

  • input_dim (int) –

    Dimension of the token embeddings

    Param aliases: embed_dim, d_model.

  • seq_length (int) –

    Input sequence length

  • n_heads (int) –

    Number of attention heads per Transformer block

  • n_blocks (int) –

    Number of Transformer blocks

  • attn_dropout (float, default: 0.1 ) –

    Dropout that will be applied to the Multi-Head Attention layers

  • ff_dropout (float, default: 0.1 ) –

    Dropout that will be applied to the FeedForward network

  • ff_factor (int, default: 4 ) –

    Multiplicative factor applied to the first layer of the FF network in each Transformer block. This is normally set to 4.

  • activation (str, default: 'gelu' ) –

    Transformer Encoder activation function. 'tanh', 'relu', 'leaky_relu', 'gelu', 'geglu' and 'reglu' are supported

  • padding_idx (int, default: 0 ) –

    index of the padding token in the padded-tokenised sequences.

  • with_cls_token (bool, default: False ) –

    Boolean indicating if a '[CLS]' token is included in the tokenized sequences. If present, the final hidden state corresponding to this token is used as the aggregated representation for classification and regression tasks. NOTE: if included in the tokenized sequences it must be inserted as the first token in the sequences.

  • with_pos_encoding (bool, default: True ) –

    Boolean indicating if positional encoding will be used

  • pos_encoding_dropout (float, default: 0.1 ) –

    Positional encoding dropout

  • pos_encoder (Optional[Module], default: None ) –

    This model uses by default a standard positional encoding approach. However, any custom positional encoder can also be used and passed to the Transformer model via the 'pos_encoder' parameter

Attributes:

  • embedding (Module) –

    Standard token embedding layer

  • pos_encoder (Module) –

    Positional Encoder

  • encoder (Module) –

    Sequence of Transformer blocks

Examples:

>>> import torch
>>> from pytorch_widedeep.models import Transformer
>>> X_text = torch.cat((torch.zeros([5,1]), torch.empty(5, 4).random_(1,4)), axis=1)
>>> model = Transformer(vocab_size=4, seq_length=5, input_dim=8, n_heads=1, n_blocks=1)
>>> out = model(X_text)
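
As an additional, illustrative sketch (not one of the official examples), the with_cls_token option assumes the '[CLS]' token is the first element of every sequence; the vocabulary index used for it here (1) is an arbitrary assumption for this sketch:

>>> import torch
>>> from pytorch_widedeep.models import Transformer
>>> cls_token = torch.ones(5, 1)  # hypothetical: '[CLS]' mapped to index 1 (0 is the padding index)
>>> X_text = torch.cat((cls_token, torch.empty(5, 4).random_(2, 6)), dim=1)
>>> model = Transformer(vocab_size=6, seq_length=5, input_dim=8, n_heads=1, n_blocks=1, with_cls_token=True)
>>> out = model(X_text)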
Source code in pytorch_widedeep/models/text/basic_transformer.py
@alias("input_dim", ["embed_dim", "d_model"])
@alias("seq_length", ["max_length", "maxlen"])
def __init__(
    self,
    vocab_size: int,
    seq_length: int,
    input_dim: int,
    n_heads: int,
    n_blocks: int,
    attn_dropout: float = 0.1,
    ff_dropout: float = 0.1,
    ff_factor: int = 4,
    activation: str = "gelu",
    use_linear_attention: bool = False,
    use_flash_attention: bool = False,
    padding_idx: int = 0,
    with_cls_token: bool = False,
    *,  # from here on pos encoding args
    with_pos_encoding: bool = True,
    pos_encoding_dropout: float = 0.1,
    pos_encoder: Optional[nn.Module] = None,
):
    super().__init__()

    self.input_dim = input_dim
    self.seq_length = seq_length
    self.n_heads = n_heads
    self.n_blocks = n_blocks
    self.attn_dropout = attn_dropout
    self.ff_dropout = ff_dropout
    self.ff_factor = ff_factor
    self.activation = activation
    self.use_linear_attention = use_linear_attention
    self.use_flash_attention = use_flash_attention
    self.padding_idx = padding_idx
    self.with_cls_token = with_cls_token
    self.with_pos_encoding = with_pos_encoding
    self.pos_encoding_dropout = pos_encoding_dropout

    self.embedding = nn.Embedding(
        vocab_size, input_dim, padding_idx=self.padding_idx
    )

    if with_pos_encoding:
        if pos_encoder is not None:
            self.pos_encoder: Union[nn.Module, nn.Identity, PositionalEncoding] = (
                pos_encoder
            )
        else:
            self.pos_encoder = PositionalEncoding(
                input_dim, pos_encoding_dropout, seq_length
            )
    else:
        self.pos_encoder = nn.Identity()

    self.encoder = nn.Sequential()
    for i in range(n_blocks):
        self.encoder.add_module(
            "transformer_block" + str(i),
            TransformerEncoder(
                input_dim,
                n_heads,
                False,  # use_qkv_bias
                attn_dropout,
                ff_dropout,
                ff_factor,
                activation,
                use_linear_attention,
                use_flash_attention,
            ),
        )

Vision

Vision(pretrained_model_setup=None, n_trainable=None, trainable_params=None, channel_sizes=[64, 128, 256, 512], kernel_sizes=[7, 3, 3, 3], strides=[2, 1, 1, 1], head_hidden_dims=None, head_activation='relu', head_dropout=0.1, head_batchnorm=False, head_batchnorm_last=False, head_linear_first=False)

Bases: BaseWDModelComponent

Defines a standard image classifier/regressor using a pretrained network or a sequence of convolution layers that can be used as the deepimage component of a Wide & Deep model or independently by itself.

ℹ️ NOTE: this class represents the integration between pytorch-widedeep and torchvision. New architectures will be available as they are added to torchvision. In the more distant future we also aim to bring in transformer-based architectures. However, simple CNN-based (and even MLP-based) architectures seem to produce SoTA results. For the time being, we describe below the options available through this class

Parameters:

  • pretrained_model_setup (Union[str, Dict[str, Union[str, WeightsEnum]]], default: None ) –

    Name of the pretrained model. Should be a variant of the following architectures: 'resnet', 'shufflenet', 'resnext', 'wide_resnet', 'regnet', 'densenet', 'mobilenetv3', 'mobilenetv2', 'mnasnet', 'efficientnet' and 'squeezenet'. If pretrained_model_setup = None a basic, fully trainable CNN will be used. Alternatively, since Torchvision 0.13 one can use pretrained models with different weights. Therefore, pretrained_model_setup can also be a dictionary with the name of the model and the weights (e.g. {'resnet50': ResNet50_Weights.DEFAULT} or {'resnet50': "IMAGENET1K_V2"}).
    Aliased as pretrained_model_name.

  • n_trainable (Optional[int], default: None ) –

    Number of trainable layers starting from the layer closest to the output neuron(s). Note that this number DOES NOT take into account the so-called 'head' which is ALWAYS trainable. If trainable_params is not None this parameter will be ignored

  • trainable_params (Optional[List[str]], default: None ) –

    List of strings containing the names (or substring within the name) of the parameters that will be trained. For example, if we use a 'resnet18' pretrained model and we set trainable_params = ['layer4'] only the parameters of 'layer4' of the network (and the head, as mentioned before) will be trained. Note that setting this or the previous parameter involves some knowledge of the architecture used.

  • channel_sizes (List[int], default: [64, 128, 256, 512] ) –

    List of integers with the channel sizes of a CNN in case we choose not to use a pretrained model

  • kernel_sizes (Union[int, List[int]], default: [7, 3, 3, 3] ) –

    List of integers with the kernel sizes of a CNN in case we choose not to use a pretrained model. Must be of the same length as channel_sizes.

  • strides (Union[int, List[int]], default: [2, 1, 1, 1] ) –

    List of integers with the stride sizes of a CNN in case we choose not to use a pretrained model. Must be of the same length as channel_sizes.

  • head_hidden_dims (Optional[List[int]], default: None ) –

    List with the number of neurons per dense layer in the head. e.g: [64,32]

  • head_activation (str, default: 'relu' ) –

    Activation function for the dense layers in the head. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported

  • head_dropout (Union[float, List[float]], default: 0.1 ) –

    float indicating the dropout between the dense layers.

  • head_batchnorm (bool, default: False ) –

    Boolean indicating whether or not batch normalization will be applied to the dense layers

  • head_batchnorm_last (bool, default: False ) –

    Boolean indicating whether or not batch normalization will be applied to the last of the dense layers

  • head_linear_first (bool, default: False ) –

    Boolean indicating the order of the operations in the dense layer. If True: [LIN -> ACT -> BN -> DP]. If False: [BN -> DP -> LIN -> ACT]

Attributes:

  • features (Module) –

    The pretrained model or Standard CNN plus the optional head

Examples:

>>> import torch
>>> from pytorch_widedeep.models import Vision
>>> X_img = torch.rand((2,3,224,224))
>>> model = Vision(channel_sizes=[64, 128], kernel_sizes = [3, 3], strides=[1, 1], head_hidden_dims=[32, 8])
>>> out = model(X_img)
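
A second, illustrative sketch using a pretrained backbone, following the description of trainable_params above ('resnet18' with only 'layer4' and the head trainable). Note that torchvision will download the pretrained weights the first time this is run:

>>> import torch
>>> from pytorch_widedeep.models import Vision
>>> X_img = torch.rand((2, 3, 224, 224))
>>> pretrained_model = Vision(pretrained_model_setup="resnet18", trainable_params=["layer4"], head_hidden_dims=[32, 8])
>>> out = pretrained_model(X_img)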
Source code in pytorch_widedeep/models/image/vision.py
@alias("pretrained_model_setup", ["pretrained_model_name"])
def __init__(
    self,
    pretrained_model_setup: Union[str, Dict[str, Union[str, WeightsEnum]]] = None,
    n_trainable: Optional[int] = None,
    trainable_params: Optional[List[str]] = None,
    channel_sizes: List[int] = [64, 128, 256, 512],
    kernel_sizes: Union[int, List[int]] = [7, 3, 3, 3],
    strides: Union[int, List[int]] = [2, 1, 1, 1],
    head_hidden_dims: Optional[List[int]] = None,
    head_activation: str = "relu",
    head_dropout: Union[float, List[float]] = 0.1,
    head_batchnorm: bool = False,
    head_batchnorm_last: bool = False,
    head_linear_first: bool = False,
):
    super(Vision, self).__init__()

    self._check_pretrained_model_setup(
        pretrained_model_setup, n_trainable, trainable_params
    )

    self.pretrained_model_setup = pretrained_model_setup
    self.n_trainable = n_trainable
    self.trainable_params = trainable_params
    self.channel_sizes = channel_sizes
    self.kernel_sizes = kernel_sizes
    self.strides = strides
    self.head_hidden_dims = head_hidden_dims
    self.head_activation = head_activation
    self.head_dropout = head_dropout
    self.head_batchnorm = head_batchnorm
    self.head_batchnorm_last = head_batchnorm_last
    self.head_linear_first = head_linear_first

    self.features, self.backbone_output_dim = self._get_features()

    if pretrained_model_setup is not None:
        self._freeze(self.features)

    if self.head_hidden_dims is not None:
        head_hidden_dims = [self.backbone_output_dim] + self.head_hidden_dims
        self.vision_mlp = MLP(
            head_hidden_dims,
            self.head_activation,
            self.head_dropout,
            self.head_batchnorm,
            self.head_batchnorm_last,
            self.head_linear_first,
        )

output_dim property

output_dim

The output dimension of the model. This is a required property necessary to build the WideDeep class

WideDeep

WideDeep(wide=None, deeptabular=None, deeptext=None, deepimage=None, deephead=None, head_hidden_dims=None, head_activation='relu', head_dropout=0.1, head_batchnorm=False, head_batchnorm_last=False, head_linear_first=True, enforce_positive=False, enforce_positive_activation='softplus', pred_dim=1, with_fds=False, **fds_config)

Bases: Module

Main collector class that combines the wide, deeptabular, deeptext and deepimage models.

Note that all models described so far in this library must be passed to the WideDeep class once constructed. This is because the models output the last layer before the prediction layer. Such a prediction layer is added by the WideDeep class as it collects the components for every data mode.

There are two options to combine these models that correspond to the two main architectures that pytorch-widedeep can build.

  • Directly connecting the output of the model components to the output neuron(s).

  • Adding a Fully-Connected Head (FC-Head) on top of the deep models. This FC-Head will combine the output from the deeptabular, deeptext and deepimage components and will then be connected to the output neuron(s) (a minimal sketch of this option follows the Examples below).

Parameters:

  • wide (Optional[Module], default: None ) –

    Wide model. This is a linear model where the non-linearities are captured via crossed-columns.

  • deeptabular (Optional[BaseWDModelComponent], default: None ) –

    Currently this library implements a number of possible architectures for the deeptabular component. See the documentation of the package.

  • deeptext (Optional[BaseWDModelComponent], default: None ) –

    Currently this library implements a number of possible architectures for the deeptext component. See the documentation of the package.

  • deepimage (Optional[BaseWDModelComponent], default: None ) –

    Currently this library uses torchvision and implements a number of possible architectures for the deepimage component. See the documentation of the package.

  • deephead (Optional[BaseWDModelComponent], default: None ) –

    Alternatively, the user can pass a custom model that will receive the output of the deep component. If deephead is not None all the previous fc-head parameters will be ignored

  • head_hidden_dims (Optional[List[int]], default: None ) –

    List with the sizes of the dense layers in the head e.g: [128, 64]

  • head_activation (str, default: 'relu' ) –

    Activation function for the dense layers in the head. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported

  • head_dropout (float, default: 0.1 ) –

    Dropout of the dense layers in the head

  • head_batchnorm (bool, default: False ) –

    Boolean indicating whether or not to include batch normalization in the dense layers that form the head

  • head_batchnorm_last (bool, default: False ) –

    Boolean indicating whether or not to apply batch normalization to the last of the dense layers in the head

  • head_linear_first (bool, default: True ) –

    Boolean indicating the order of the operations in the dense layer. If True: [LIN -> ACT -> BN -> DP]. If False: [BN -> DP -> LIN -> ACT]

  • enforce_positive (bool, default: False ) –

    Boolean indicating if the output from the final layer must be positive. This is important if you are using loss functions with non-negative input restrictions, e.g. RMSLE, or if you know your predictions are bounded between 0 and +inf

  • enforce_positive_activation (str, default: 'softplus' ) –

    Activation function to enforce that the final layer has a positive output. 'softplus' or 'relu' are supported.

  • pred_dim (int, default: 1 ) –

    Size of the final wide and deep output layer containing the predictions. 1 for regression and binary classification or number of classes for multiclass classification.

  • with_fds (bool, default: False ) –

    Boolean indicating if Feature Distribution Smoothing (FDS) will be applied before the final prediction layer. Only available for regression problems. See Delving into Deep Imbalanced Regression for details.

Other Parameters:

  • **fds_config

    Dictionary with the parameters to be used when using Feature Distribution Smoothing. Please, see the docs for the FDSLayer.
    ℹ️ NOTE: Feature Distribution Smoothing is available when using ONLY a deeptabular component
    ℹ️ NOTE: We consider this feature absolutely experimental and we recommend the user to not use it unless the corresponding publication is well understood

Examples:

>>> from pytorch_widedeep.models import TabResnet, Vision, BasicRNN, Wide, WideDeep
>>> embed_input = [(u, i, j) for u, i, j in zip(["a", "b", "c"][:4], [4] * 3, [8] * 3)]
>>> column_idx = {k: v for v, k in enumerate(["a", "b", "c"])}
>>> wide = Wide(10, 1)
>>> deeptabular = TabResnet(blocks_dims=[8, 4], column_idx=column_idx, cat_embed_input=embed_input)
>>> deeptext = BasicRNN(vocab_size=10, embed_dim=4, padding_idx=0)
>>> deepimage = Vision()
>>> model = WideDeep(wide=wide, deeptabular=deeptabular, deeptext=deeptext, deepimage=deepimage)
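
For the second architecture described above (an FC-head combining the deep components before the output neuron(s)), a minimal sketch reusing the components just defined only needs head_hidden_dims; the layer sizes below are arbitrary:

>>> model_with_head = WideDeep(wide=wide, deeptabular=deeptabular, deeptext=deeptext, deepimage=deepimage, head_hidden_dims=[64, 32])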

ℹ️ NOTE: It is possible to use custom components to build Wide & Deep models. Simply build them and pass them as the corresponding parameters. Note that the custom models MUST return a last layer of activations (i.e. not the final prediction) so that these activations are collected by WideDeep and combined accordingly. In addition, the models MUST also contain an attribute output_dim with the size of these last layers of activations. See for example pytorch_widedeep.models.tab_mlp.TabMlp
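
A minimal sketch of such a custom component: a hypothetical bag-of-embeddings deeptext model (not part of the library) that returns its last layer of activations and exposes an output_dim attribute. Depending on the library version the custom model may need to subclass BaseWDModelComponent rather than a plain nn.Module; this sketch assumes the latter is accepted:

>>> import torch
>>> from torch import nn
>>> from pytorch_widedeep.models import WideDeep
>>> class BagOfEmbeddings(nn.Module):
...     def __init__(self, vocab_size: int, embed_dim: int):
...         super().__init__()
...         self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
...         self.output_dim = embed_dim  # required so WideDeep can add the prediction layer
...     def forward(self, X):
...         # return the last layer of activations, NOT the final prediction
...         return self.embed(X.long()).mean(dim=1)
>>> custom_deeptext = BagOfEmbeddings(vocab_size=10, embed_dim=4)
>>> model = WideDeep(deeptext=custom_deeptext)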

Source code in pytorch_widedeep/models/wide_deep.py
@alias(  # noqa: C901
    "pred_dim",
    ["num_class", "pred_size"],
)
def __init__(
    self,
    wide: Optional[nn.Module] = None,
    deeptabular: Optional[BaseWDModelComponent] = None,
    deeptext: Optional[BaseWDModelComponent] = None,
    deepimage: Optional[BaseWDModelComponent] = None,
    deephead: Optional[BaseWDModelComponent] = None,
    head_hidden_dims: Optional[List[int]] = None,
    head_activation: str = "relu",
    head_dropout: float = 0.1,
    head_batchnorm: bool = False,
    head_batchnorm_last: bool = False,
    head_linear_first: bool = True,
    enforce_positive: bool = False,
    enforce_positive_activation: str = "softplus",
    pred_dim: int = 1,
    with_fds: bool = False,
    **fds_config,
):
    super(WideDeep, self).__init__()

    self._check_inputs(
        wide,
        deeptabular,
        deeptext,
        deepimage,
        deephead,
        head_hidden_dims,
        pred_dim,
        with_fds,
    )

    # this attribute will be eventually over-written by the Trainer's
    # device. Acts here as a 'placeholder'.
    self.wd_device: Optional[str] = None

    # required as attribute just in case we pass a deephead
    self.pred_dim = pred_dim

    self.with_fds = with_fds
    self.enforce_positive = enforce_positive

    # The main 5 components of the wide and deep assembly: wide,
    # deeptabular, deeptext, deepimage and deephead
    self.with_deephead = deephead is not None or head_hidden_dims is not None
    if deephead is None and head_hidden_dims is not None:
        self.deephead = self._build_deephead(
            deeptabular,
            deeptext,
            deepimage,
            head_hidden_dims,
            head_activation,
            head_dropout,
            head_batchnorm,
            head_batchnorm_last,
            head_linear_first,
        )
    elif deephead is not None:
        self.deephead = nn.Sequential(
            deephead, nn.Linear(deephead.output_dim, self.pred_dim)
        )
    else:
        # for consistency with other components we default to None
        self.deephead = None

    self.wide = wide
    self.deeptabular, self.deeptext, self.deepimage = self._set_model_components(
        deeptabular, deeptext, deepimage, self.with_deephead
    )

    if self.with_fds:
        self.fds_layer = FDSLayer(feature_dim=self.deeptabular.output_dim, **fds_config)  # type: ignore[arg-type]

    if self.enforce_positive:
        self.enf_pos = get_activation_fn(enforce_positive_activation)

FDSLayer

FDSLayer(feature_dim, granularity=100, y_max=None, y_min=None, start_update=0, start_smooth=2, kernel='gaussian', ks=5, sigma=2, momentum=0.9, clip_min=None, clip_max=None)

Bases: Module

Feature Distribution Smoothing layer. Please, see Delving into Deep Imbalanced Regression for details.

ℹ️ NOTE: this is NOT an available model per se, but rather a utility that is used as we run a WideDeep model. The parameters of this extra layer can be set when the WideDeep class is instantiated, via the keyword arguments fds_config.

ℹ️ NOTE: Feature Distribution Smoothing is available when using ONLY a deeptabular component

ℹ️ NOTE: We consider this feature absolutely experimental and we recommend the user to not use it unless the corresponding publication is well understood

The code here is based on the code at the official repo

Parameters:

  • feature_dim (int) –

    input dimension size, i.e. output size of previous layer. This will be the dimension of the output from the deeptabular component

  • granularity (int, default: 100 ) –

    number of bins that the target \(y\) is divided into and that will be used to compute the features' statistics (mean and variance)

  • y_max (Optional[float], default: None ) –

    \(y\) upper limit to be considered when binning

  • y_min (Optional[float], default: None ) –

    \(y\) lower limit to be considered when binning

  • start_update (int, default: 0 ) –

    number of 'waiting epochs' after which the FDS layer will start to update its statistics

  • start_smooth (int, default: 2 ) –

    number of 'waiting epochs' after which the FDS layer will start smoothing the feature distributions

  • kernel (Literal[gaussian, triang, laplace], default: 'gaussian' ) –

    choice of smoothing kernel

  • ks (int, default: 5 ) –

    kernel window size

  • sigma (float, default: 2 ) –

    if a 'gaussian' or 'laplace' kernel is used, this is the corresponding standard deviation

  • momentum (Optional[float], default: 0.9 ) –

    to train the layer the authors used a momentum update of the running statistics across each epoch. Set to 0.9 in the paper.

  • clip_min (Optional[float], default: None ) –

    this parameter is used to clip the ratio between the so-called running variance and the smoothed variance, and is introduced for numerical stability. We leave it as optional as we did not find a notable improvement in our experiments. The authors used a value of 0.1

  • clip_max (Optional[float], default: None ) –

    same as clip_min but for the upper limit. We leave it as optional as we did not find a notable improvement in our experiments. The authors used a value of 10.
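
FDSLayer is not meant to be instantiated directly. Below is a minimal, illustrative sketch of how these parameters reach it through WideDeep's fds_config keyword arguments; the TabMlp setup and the specific FDS values are arbitrary choices for the sketch:

>>> from pytorch_widedeep.models import TabMlp, WideDeep
>>> column_idx = {k: v for v, k in enumerate(["a", "b", "c"])}
>>> cat_embed_input = [(u, i, j) for u, i, j in zip(["a", "b", "c"], [4] * 3, [8] * 3)]
>>> deeptabular = TabMlp(column_idx=column_idx, cat_embed_input=cat_embed_input, mlp_hidden_dims=[16, 8])
>>> model = WideDeep(deeptabular=deeptabular, with_fds=True, granularity=50, kernel="triang", ks=7)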

Source code in pytorch_widedeep/models/fds_layer.py
def __init__(
    self,
    feature_dim: int,
    granularity: int = 100,
    y_max: Optional[float] = None,
    y_min: Optional[float] = None,
    start_update: int = 0,
    start_smooth: int = 2,
    kernel: Literal["gaussian", "triang", "laplace"] = "gaussian",
    ks: int = 5,
    sigma: float = 2,
    momentum: Optional[float] = 0.9,
    clip_min: Optional[float] = None,
    clip_max: Optional[float] = None,
):
    """
    Feature Distribution Smoothing layer. Please, see
    [Delving into Deep Imbalanced Regression](https://arxiv.org/abs/2102.09554)
    for details.

    :information_source: **NOTE**: this is NOT an available model per se,
     but more a utility that can be used as we run a `WideDeep` model.
     The parameters of this extra layers can be set as the class
     `WideDeep` is instantiated via the keyword arguments `fds_config`.

    :information_source: **NOTE**: Feature Distribution Smoothing is
     available when using ONLY a `deeptabular` component

    :information_source: **NOTE**: We consider this feature absolutely
    experimental and we recommend the user to not use it unless the
    corresponding [publication](https://arxiv.org/abs/2102.09554) is
    well understood

    The code here is based on the code at the
    [official repo](https://github.com/YyzHarry/imbalanced-regression)

    Parameters
    ----------
    feature_dim: int,
        input dimension size, i.e. output size of previous layer. This
        will be the dimension of the output from the `deeptabular`
        component
    granularity: int = 100,
        number of bins that the target $y$ is divided into and that will
        be used to compute the features' statistics (mean and variance)
    y_max: Optional[float] = None,
        $y$ upper limit to be considered when binning
    y_min: Optional[float] = None,
        $y$ lower limit to be considered when binning
    start_update: int = 0,
        number of _'waiting epochs' after which the FDS layer will start
        to update its statistics
    start_smooth: int = 1,
        number of _'waiting epochs' after which the FDS layer will start
        smoothing the feature distributions
    kernel: Literal["gaussian", "triang", "laplace", None] = "gaussian",
        choice of smoothing kernel
    ks: int = 5,
        kernel window size
    sigma: Union[int, float] = 2,
        if a _'gaussian'_ or _'laplace'_ kernels are used, this is the
        corresponding standard deviation
    momentum: float = 0.9,
        to train the layer the authors used a momentum update of the running
        statistics across each epoch. Set to 0.9 in the paper.
    clip_min: Optional[float] = None,
        this parameter is used to clip the ratio between the so called
        running variance and the smoothed variance, and is introduced for
        numerical stability. We leave it as optional as we did not find a
        notable improvement in our experiments. The authors used a value
        of 0.1
    clip_max: Optional[float] = None,
        same as `clip_min` but for the upper limit.We leave it as optional
        as we did not find a notable improvement in our experiments. The
        authors used a value of 10.
    """
    super(FDSLayer, self).__init__()
    assert (
        start_update + 1 < start_smooth
    ), "initial update must start at least 2 epoch before smoothing"

    self.feature_dim = feature_dim
    self.granularity = granularity
    self.y_max = y_max
    self.y_min = y_min
    self.kernel_window = torch.tensor(
        get_kernel_window(kernel, ks, sigma), dtype=torch.float32
    )
    self.half_ks = (ks - 1) // 2
    self.momentum = momentum
    self.start_update = start_update
    self.start_smooth = start_smooth
    self.clip_min = clip_min
    self.clip_max = clip_max

    self.pred_layer = nn.Linear(feature_dim, 1)

    self._register_buffers()