# The models module¶

This module contains the four main components that will comprise a Wide and Deep model, and the WideDeep “constructor” class. These four components are: wide, deeptabular, deeptext, deepimage.

Note

TabMlp, TabResnet, TabNet, TabTransformer, SAINT, FTTransformer, TabPerceiver and TabFastFormer can all be used as the deeptabular component of the model and simply represent different alternatives

class pytorch_widedeep.models.wide.Wide(wide_dim, pred_dim=1)[source]

wide (linear) component

Linear model implemented via an Embedding layer connected to the output neuron(s).

Parameters
• wide_dim (int) – size of the Embedding layer. wide_dim is the summation of all the individual values for all the features that go through the wide component. For example, if the wide component receives 2 features with 5 individual values each, wide_dim = 10

• pred_dim (int, default = 1) – size of the output tensor containing the predictions

Attributes

wide_linear (nn.Module) – the linear layer that comprises the wide branch of the model

Examples

>>> import torch
>>> from pytorch_widedeep.models import Wide
>>> X = torch.empty(4, 4).random_(6)
>>> wide = Wide(wide_dim=X.unique().size(0), pred_dim=1)
>>> out = wide(X)
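As noted above, wide_dim is the sum of the number of unique values across all the features that enter the wide component. A minimal plain-Python sketch (the feature values below are hypothetical, purely for illustration):

```python
# wide_dim is the sum of the number of unique values across the wide features.
age_bucket = ["18-25", "26-50", "51+", "18-25"]      # 3 unique values
workclass = ["private", "public", "private", "gov"]  # 3 unique values

wide_dim = len(set(age_bucket)) + len(set(workclass))
print(wide_dim)  # 6
```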

class pytorch_widedeep.models.tab_mlp.TabMlp(column_idx, embed_input=None, embed_dropout=0.1, continuous_cols=None, cont_norm_layer='batchnorm', mlp_hidden_dims=[200, 100], mlp_activation='relu', mlp_dropout=0.1, mlp_batchnorm=False, mlp_batchnorm_last=False, mlp_linear_first=False)[source]

Defines a TabMlp model that can be used as the deeptabular component of a Wide & Deep model.

This class combines embedding representations of the categorical features with numerical (aka continuous) features. These are then passed through a series of dense layers (i.e. an MLP).

Parameters
• column_idx (Dict) – Dict containing the index of the columns that will be passed through the TabMlp model. Required to slice the tensors. e.g. {‘education’: 0, ‘relationship’: 1, ‘workclass’: 2, …}

• embed_input (List, Optional, default = None) – List of Tuples with the column name, number of unique values and embedding dimension. e.g. [(education, 11, 32), …]

• embed_dropout (float, default = 0.1) – embeddings dropout

• continuous_cols (List, Optional, default = None) – List with the name of the numeric (aka continuous) columns

• cont_norm_layer (str, default = "batchnorm") – Type of normalization layer applied to the continuous features. Options are: ‘layernorm’, ‘batchnorm’ or None.

• mlp_hidden_dims (List, default = [200, 100]) – List with the number of neurons per dense layer in the MLP.

• mlp_activation (str, default = "relu") – Activation function for the dense layers of the MLP. Currently tanh, relu, leaky_relu and gelu are supported

• mlp_dropout (float or List, default = 0.1) – float or List of floats with the dropout between the dense layers. e.g: [0.5,0.5]

• mlp_batchnorm (bool, default = False) – Boolean indicating whether or not batch normalization will be applied to the dense layers

• mlp_batchnorm_last (bool, default = False) – Boolean indicating whether or not batch normalization will be applied to the last of the dense layers

• mlp_linear_first (bool, default = False) – Boolean indicating the order of the operations in the dense layer. If True: [LIN -> ACT -> BN -> DP]. If False: [BN -> DP -> LIN -> ACT]
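The two orderings described in mlp_linear_first can be sketched with plain nn.Sequential stacks (the 16/8 units and 0.1 dropout below are arbitrary illustration values, not library defaults):

```python
import torch
import torch.nn as nn

# Sketch of the two dense-layer orderings controlled by mlp_linear_first.
inp, out, p = 16, 8, 0.1

# mlp_linear_first=True: [LIN -> ACT -> BN -> DP]
linear_first = nn.Sequential(nn.Linear(inp, out), nn.ReLU(), nn.BatchNorm1d(out), nn.Dropout(p))
# mlp_linear_first=False: [BN -> DP -> LIN -> ACT]
bn_first = nn.Sequential(nn.BatchNorm1d(inp), nn.Dropout(p), nn.Linear(inp, out), nn.ReLU())

x = torch.rand(4, inp)
assert linear_first(x).shape == bn_first(x).shape == (4, out)
```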

Attributes
• cat_embed_and_cont (nn.Module) – This is the module that processes the categorical and continuous columns

• tab_mlp (nn.Sequential) – mlp model that will receive the concatenation of the embeddings and the continuous columns

• output_dim (int) – The output dimension of the model. This is a required attribute necessary to build the WideDeep class

Example

>>> import torch
>>> from pytorch_widedeep.models import TabMlp
>>> X_tab = torch.cat((torch.empty(5, 4).random_(4), torch.rand(5, 1)), axis=1)
>>> colnames = ['a', 'b', 'c', 'd', 'e']
>>> embed_input = [(u,i,j) for u,i,j in zip(colnames[:4], [4]*4, [8]*4)]
>>> column_idx = {k:v for v,k in enumerate(colnames)}
>>> model = TabMlp(mlp_hidden_dims=[8,4], column_idx=column_idx, embed_input=embed_input,
... continuous_cols = ['e'])
>>> out = model(X_tab)

class pytorch_widedeep.models.tab_resnet.TabResnet(column_idx, embed_input=None, embed_dropout=0.1, continuous_cols=None, cont_norm_layer='batchnorm', concat_cont_first=True, blocks_dims=[200, 100, 100], blocks_dropout=0.1, mlp_hidden_dims=None, mlp_activation='relu', mlp_dropout=0.1, mlp_batchnorm=False, mlp_batchnorm_last=False, mlp_linear_first=False)[source]

Defines a so-called TabResnet model that can be used as the deeptabular component of a Wide & Deep model.

This class combines embedding representations of the categorical features with numerical (aka continuous) features. These are then passed through a series of Resnet blocks. See pytorch_widedeep.models.tab_resnet.BasicBlock for details on the structure of each block.

Parameters
• column_idx (Dict) – Dict containing the index of the columns that will be passed through the Resnet model. Required to slice the tensors. e.g. {‘education’: 0, ‘relationship’: 1, ‘workclass’: 2, …}

• embed_input (List) – List of Tuples with the column name, number of unique values and embedding dimension. e.g. [(education, 11, 32), …].

• embed_dropout (float, default = 0.1) – embeddings dropout

• continuous_cols (List, Optional, default = None) – List with the name of the numeric (aka continuous) columns

• cont_norm_layer (str, default = "batchnorm") – Type of normalization layer applied to the continuous features. Options are: ‘layernorm’, ‘batchnorm’ or None.

• concat_cont_first (bool, default = True) – If True the continuous columns will be concatenated with the categorical embeddings and then passed through the Resnet blocks. If False, the categorical embeddings will be passed through the Resnet blocks and the output of the Resnet blocks will then be concatenated with the continuous features.

• blocks_dims (List, default = [200, 100, 100]) – List of integers that define the input and output units of each block. For example: [200, 100, 100] will generate 2 blocks. The first will receive a tensor of size 200 and output a tensor of size 100, and the second will receive a tensor of size 100 and output a tensor of size 100. See pytorch_widedeep.models.tab_resnet.BasicBlock for details on the structure of each block.

• blocks_dropout (float, default = 0.1) – Block’s “internal” dropout. This dropout is applied to the first of the two dense layers that comprise each BasicBlock.

• mlp_hidden_dims (List, Optional, default = None) – List with the number of neurons per dense layer in the MLP. e.g: [64, 32]. If None the output of the Resnet Blocks will be connected directly to the output neuron(s), i.e. using a MLP is optional.

• mlp_activation (str, default = "relu") – Activation function for the dense layers of the MLP. Currently tanh, relu, leaky_relu and gelu are supported

• mlp_dropout (float, default = 0.1) – float with the dropout between the dense layers of the MLP.

• mlp_batchnorm (bool, default = False) – Boolean indicating whether or not batch normalization will be applied to the dense layers

• mlp_batchnorm_last (bool, default = False) – Boolean indicating whether or not batch normalization will be applied to the last of the dense layers

• mlp_linear_first (bool, default = False) – Boolean indicating the order of the operations in the dense layer. If True: [LIN -> ACT -> BN -> DP]. If False: [BN -> DP -> LIN -> ACT]

Attributes
• cat_embed_and_cont (nn.Module) – This is the module that processes the categorical and continuous columns

• dense_resnet (nn.Sequential) – deep dense Resnet model that will receive the concatenation of the embeddings and the continuous columns

• tab_resnet_mlp (nn.Sequential) – if mlp_hidden_dims is not None, this attribute will be an MLP that receives:

• the output of passing the concatenation of the embeddings and the continuous columns – if present – through the dense_resnet (concat_cont_first = True), or

• the output of passing the embeddings through the dense_resnet, concatenated with the continuous columns – if present – (concat_cont_first = False)

• output_dim (int) – The output dimension of the model. This is a required attribute necessary to build the WideDeep class

Example

>>> import torch
>>> from pytorch_widedeep.models import TabResnet
>>> X_deep = torch.cat((torch.empty(5, 4).random_(4), torch.rand(5, 1)), axis=1)
>>> colnames = ['a', 'b', 'c', 'd', 'e']
>>> embed_input = [(u,i,j) for u,i,j in zip(colnames[:4], [4]*4, [8]*4)]
>>> column_idx = {k:v for v,k in enumerate(colnames)}
>>> model = TabResnet(blocks_dims=[16,4], column_idx=column_idx, embed_input=embed_input,
... continuous_cols = ['e'])
>>> out = model(X_deep)
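The relationship between blocks_dims and the resulting blocks can be sketched in plain Python:

```python
# Each consecutive pair in blocks_dims defines one block's (input, output)
# units, so a list of length L produces L - 1 blocks.
blocks_dims = [200, 100, 100]
blocks = list(zip(blocks_dims[:-1], blocks_dims[1:]))
print(blocks)  # [(200, 100), (100, 100)] -> 2 blocks
```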

class pytorch_widedeep.models.tabnet.tab_net.TabNet(column_idx, embed_input, embed_dropout=0.0, continuous_cols=None, cont_norm_layer=None, n_steps=3, step_dim=8, attn_dim=8, dropout=0.0, n_glu_step_dependent=2, n_glu_shared=2, ghost_bn=True, virtual_batch_size=128, momentum=0.02, gamma=1.3, epsilon=1e-15, mask_type='sparsemax')[source]

Defines a TabNet model (https://arxiv.org/abs/1908.07442) that can be used as the deeptabular component of a Wide & Deep model.

The implementation in this library is based on the one here: https://github.com/dreamquark-ai/tabnet, simply adapted so that it can work within the WideDeep framework. Therefore, all credit to the dreamquark-ai team

Parameters
• column_idx (Dict) – Dict containing the index of the columns that will be passed through the model. Required to slice the tensors. e.g. {‘education’: 0, ‘relationship’: 1, ‘workclass’: 2, …}

• embed_input (List) – List of Tuples with the column name, number of unique values and embedding dimension. e.g. [(education, 11, 32), …]

• embed_dropout (float, default = 0.) – embeddings dropout

• continuous_cols (List, Optional, default = None) – List with the name of the numeric (aka continuous) columns

• cont_norm_layer (str, default = None) – Type of normalization layer applied to the continuous features. Options are: ‘layernorm’, ‘batchnorm’ or None.

• n_steps (int, default = 3) – number of decision steps

• step_dim (int, default = 8) – Step’s output dimension. This is the output dimension that WideDeep will collect and connect to the output neuron(s). For a better understanding of the function of this and the upcoming parameters, please see the paper.

• attn_dim (int, default = 8) – Attention dimension

• dropout (float, default = 0.0) – GLU block’s internal dropout

• n_glu_step_dependent (int, default = 2) – number of GLU Blocks [FC -> BN -> GLU] that are step dependent

• n_glu_shared (int, default = 2) – number of GLU Blocks [FC -> BN -> GLU] that will be shared across decision steps

• ghost_bn (bool, default=True) – Boolean indicating if Ghost Batch Normalization will be used.

• virtual_batch_size (int, default = 128) – Batch size when using Ghost Batch Normalization

• momentum (float, default = 0.02) – Ghost Batch Normalization’s momentum. The dreamquark-ai team advises using very low values. However, higher values are used in the original publication. During our tests, higher values led to better results

• gamma (float, default = 1.3) – Relaxation parameter in the paper. When gamma = 1, a feature is enforced to be used only at one decision step. As gamma increases, more flexibility is provided to use a feature at multiple decision steps

• epsilon (float, default = 1e-15) – Float to avoid log(0). Always keep low

• mask_type (str, default = "sparsemax") – Mask function to use. Either “sparsemax” or “entmax”

Attributes
• cat_embed_and_cont (nn.Module) – This is the module that processes the categorical and continuous columns

• tabnet_encoder (nn.Module) – Module containing the TabNet encoder. See the paper.

• output_dim (int) – The output dimension of the model. This is a required attribute necessary to build the WideDeep class

Example

>>> import torch
>>> from pytorch_widedeep.models import TabNet
>>> X_tab = torch.cat((torch.empty(5, 4).random_(4), torch.rand(5, 1)), axis=1)
>>> colnames = ['a', 'b', 'c', 'd', 'e']
>>> embed_input = [(u,i,j) for u,i,j in zip(colnames[:4], [4]*4, [8]*4)]
>>> column_idx = {k:v for v,k in enumerate(colnames)}
>>> model = TabNet(column_idx=column_idx, embed_input=embed_input, continuous_cols = ['e'])
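The role of gamma can be illustrated with a toy prior update: after each decision step, the prior for each feature is scaled by (gamma - mask), so gamma = 1 removes a fully-used feature from later steps while gamma > 1 leaves it some probability mass for reuse (the numbers below are hypothetical):

```python
# Toy prior-scale update across one decision step; values are illustrative only.
gamma = 1.3
prior = [1.0, 1.0, 1.0]  # all three features initially available
mask = [1.0, 0.0, 0.0]   # step 1 attended only to the first feature
prior = [round(p * (gamma - m), 2) for p, m in zip(prior, mask)]
print(prior)  # [0.3, 1.3, 1.3]: the first feature can still be (partially) reused
```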

class pytorch_widedeep.models.transformers.tab_transformer.TabTransformer(column_idx, embed_input=None, embed_dropout=0.1, full_embed_dropout=False, shared_embed=False, add_shared_embed=False, frac_shared_embed=0.25, continuous_cols=None, embed_continuous=False, embed_continuous_activation=None, cont_norm_layer=None, input_dim=32, n_heads=8, use_bias=False, n_blocks=4, attn_dropout=0.2, ff_dropout=0.1, transformer_activation='gelu', mlp_hidden_dims=None, mlp_activation='relu', mlp_dropout=0.1, mlp_batchnorm=False, mlp_batchnorm_last=False, mlp_linear_first=True)[source]

Defines a TabTransformer model (arXiv:2012.06678) that can be used as the deeptabular component of a Wide & Deep model.

Note that this is an enhanced adaptation of the model described in the original publication, containing a series of additional features.

Parameters
• column_idx (Dict) – Dict containing the index of the columns that will be passed through the model. Required to slice the tensors. e.g. {‘education’: 0, ‘relationship’: 1, ‘workclass’: 2, …}

• embed_input (List) – List of Tuples with the column name and number of unique values e.g. [(‘education’, 11), …]

• embed_dropout (float, default = 0.1) – Dropout to be applied to the embeddings matrix

• full_embed_dropout (bool, default = False) – Boolean indicating if an entire embedding (i.e. the representation of one column) will be dropped in the batch. See: pytorch_widedeep.models.transformers._layers.FullEmbeddingDropout. If full_embed_dropout = True, embed_dropout is ignored.

• shared_embed (bool, default = False) – The idea behind shared_embed is described in Appendix A of the paper: ‘The goal of having column embedding is to enable the model to distinguish the classes in one column from those in the other columns’. In other words, the idea is to let the model learn which column is being embedded at a given time.

• add_shared_embed (bool, default = False) – The two embedding sharing strategies are: 1) add the shared embeddings to the column embeddings or 2) to replace the first frac_shared_embed with the shared embeddings. See pytorch_widedeep.models.transformers._layers.SharedEmbeddings

• frac_shared_embed (float, default = 0.25) – The fraction of embeddings that will be shared (if add_shared_embed = False) by all the different categories for one particular column.

• continuous_cols (List, Optional, default = None) – List with the name of the numeric (aka continuous) columns

• embed_continuous (bool, default = False) – Boolean indicating if the continuous features will be “embedded”. See pytorch_widedeep.models.transformers._layers.ContinuousEmbeddings Note that setting this to True is similar (but not identical) to the so called FT-Transformer (Feature Tokenizer + Transformer). See pytorch_widedeep.models.transformers.ft_transformer.FTTransformer for details on the dedicated implementation available in this library

• embed_continuous_activation (str, default = None) – String indicating the activation function to be applied to the continuous embeddings, if any. tanh, relu, leaky_relu and gelu are supported.

• cont_norm_layer (str, default = None) – Type of normalization layer applied to the continuous features before they are passed to the network. Options are: layernorm, batchnorm or None.

• input_dim (int, default = 32) – The so-called dimension of the model. In general, this is the dimension of the embeddings used to encode the categorical and/or continuous columns

• n_heads (int, default = 8) – Number of attention heads per Transformer block

• use_bias (bool, default = False) – Boolean indicating whether or not to use bias in the Q, K, and V projection layers.

• n_blocks (int, default = 4) – Number of Transformer blocks

• attn_dropout (float, default = 0.2) – Dropout that will be applied to the Multi-Head Attention layers

• ff_dropout (float, default = 0.1) – Dropout that will be applied to the FeedForward network

• transformer_activation (str, default = "gelu") – Transformer Encoder activation function. tanh, relu, leaky_relu, gelu, geglu and reglu are supported

• mlp_hidden_dims (List, Optional, default = None) – MLP hidden dimensions. If not provided it will default to [l, 4*l, 2*l] where l is the MLP input dimension

• mlp_activation (str, default = "relu") – MLP activation function. tanh, relu, leaky_relu and gelu are supported

• mlp_dropout (float, default = 0.1) – Dropout that will be applied to the final MLP

• mlp_batchnorm (bool, default = False) – Boolean indicating whether or not to apply batch normalization to the dense layers

• mlp_batchnorm_last (bool, default = False) – Boolean indicating whether or not to apply batch normalization to the last of the dense layers

• mlp_linear_first (bool, default = True) – Boolean indicating the order of the operations in the dense layer. If True: [LIN -> ACT -> BN -> DP]. If False: [BN -> DP -> LIN -> ACT]

Attributes
• cat_and_cont_embed (nn.Module) – This is the module that processes the categorical and continuous columns

• transformer_blks (nn.Sequential) – Sequence of Transformer blocks

• transformer_mlp (nn.Module) – MLP component in the model

• output_dim (int) – The output dimension of the model. This is a required attribute necessary to build the WideDeep class

Example

>>> import torch
>>> from pytorch_widedeep.models import TabTransformer
>>> X_tab = torch.cat((torch.empty(5, 4).random_(4), torch.rand(5, 1)), axis=1)
>>> colnames = ['a', 'b', 'c', 'd', 'e']
>>> embed_input = [(u,i) for u,i in zip(colnames[:4], [4]*4)]
>>> continuous_cols = ['e']
>>> column_idx = {k:v for v,k in enumerate(colnames)}
>>> model = TabTransformer(column_idx=column_idx, embed_input=embed_input, continuous_cols=continuous_cols)
>>> out = model(X_tab)

property attention_weights: List

List with the attention weights

The shape of the attention weights is:

$$(N, H, F, F)$$

Where N is the batch size, H is the number of attention heads and F is the number of features/columns in the dataset

Return type

List
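The two embedding-sharing strategies controlled by shared_embed and add_shared_embed can be sketched as follows. This is a loose interpretation with toy dimensions, not the exact implementation in pytorch_widedeep.models.transformers._layers.SharedEmbeddings:

```python
import torch

# Toy dims: 5 categories in one column, embeddings of size 8, frac_shared_embed=0.25.
n_cat, embed_dim, frac_shared_embed = 5, 8, 0.25
col_embed = torch.rand(n_cat, embed_dim)  # per-category embeddings for one column
shared = torch.rand(1, embed_dim)         # the column's shared embedding

# Strategy 1 (add_shared_embed=True): add the shared embedding to every category.
added = col_embed + shared

# Strategy 2 (add_shared_embed=False): replace the first frac_shared_embed
# fraction of each embedding with the shared values.
n_shared = int(frac_shared_embed * embed_dim)  # 2 of the 8 dimensions
replaced = col_embed.clone()
replaced[:, :n_shared] = shared[:, :n_shared]
```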

class pytorch_widedeep.models.transformers.saint.SAINT(column_idx, embed_input=None, embed_dropout=0.1, full_embed_dropout=False, shared_embed=False, add_shared_embed=False, frac_shared_embed=0.25, continuous_cols=None, embed_continuous_activation=None, cont_norm_layer=None, input_dim=32, use_bias=False, n_heads=8, n_blocks=2, attn_dropout=0.1, ff_dropout=0.2, transformer_activation='gelu', mlp_hidden_dims=None, mlp_activation='relu', mlp_dropout=0.1, mlp_batchnorm=False, mlp_batchnorm_last=False, mlp_linear_first=True)[source]

Defines a SAINT model (arXiv:2106.01342) that can be used as the deeptabular component of a Wide & Deep model.

Parameters
• column_idx (Dict) – Dict containing the index of the columns that will be passed through the model. Required to slice the tensors. e.g. {‘education’: 0, ‘relationship’: 1, ‘workclass’: 2, …}

• embed_input (List) – List of Tuples with the column name and number of unique values e.g. [(‘education’, 11), …]

• embed_dropout (float, default = 0.1) – Dropout to be applied to the embeddings matrix

• full_embed_dropout (bool, default = False) – Boolean indicating if an entire embedding (i.e. the representation of one column) will be dropped in the batch. See: pytorch_widedeep.models.transformers._layers.FullEmbeddingDropout. If full_embed_dropout = True, embed_dropout is ignored.

• shared_embed (bool, default = False) – The idea behind shared_embed is described in Appendix A of the TabTransformer paper: ‘The goal of having column embedding is to enable the model to distinguish the classes in one column from those in the other columns’. In other words, the idea is to let the model learn which column is being embedded at a given time.

• add_shared_embed (bool, default = False) – The two embedding sharing strategies are: 1) add the shared embeddings to the column embeddings or 2) to replace the first frac_shared_embed with the shared embeddings. See pytorch_widedeep.models.transformers._layers.SharedEmbeddings

• frac_shared_embed (float, default = 0.25) – The fraction of embeddings that will be shared (if add_shared_embed = False) by all the different categories for one particular column.

• continuous_cols (List, Optional, default = None) – List with the name of the numeric (aka continuous) columns

• embed_continuous_activation (str, default = None) – String indicating the activation function to be applied to the continuous embeddings, if any. tanh, relu, leaky_relu and gelu are supported.

• cont_norm_layer (str, default = None,) – Type of normalization layer applied to the continuous features before they are embedded. Options are: layernorm, batchnorm or None.

• input_dim (int, default = 32) – The so-called dimension of the model. In general, this is the dimension of the embeddings used to encode the categorical and/or continuous columns

• n_heads (int, default = 8) – Number of attention heads per Transformer block

• use_bias (bool, default = False) – Boolean indicating whether or not to use bias in the Q, K, and V projection layers

• n_blocks (int, default = 2) – Number of SAINT-Transformer blocks. 1 in the paper.

• attn_dropout (float, default = 0.1) – Dropout that will be applied to the Multi-Head Attention column and row layers

• ff_dropout (float, default = 0.2) – Dropout that will be applied to the FeedForward network

• transformer_activation (str, default = "gelu") – Transformer Encoder activation function. tanh, relu, leaky_relu, gelu, geglu and reglu are supported

• mlp_hidden_dims (List, Optional, default = None) – MLP hidden dimensions. If not provided it will default to [l, 4*l, 2*l] where l is the MLP input dimension

• mlp_activation (str, default = "relu") – MLP activation function. tanh, relu, leaky_relu and gelu are supported

• mlp_dropout (float, default = 0.1) – Dropout that will be applied to the final MLP

• mlp_batchnorm (bool, default = False) – Boolean indicating whether or not to apply batch normalization to the dense layers

• mlp_batchnorm_last (bool, default = False) – Boolean indicating whether or not to apply batch normalization to the last of the dense layers

• mlp_linear_first (bool, default = True) – Boolean indicating the order of the operations in the dense layer. If True: [LIN -> ACT -> BN -> DP]. If False: [BN -> DP -> LIN -> ACT]

Attributes
• cat_and_cont_embed (nn.Module) – This is the module that processes the categorical and continuous columns

• transformer_blks (nn.Sequential) – Sequence of SAINT-Transformer blocks

• transformer_mlp (nn.Module) – MLP component in the model

• output_dim (int) – The output dimension of the model. This is a required attribute necessary to build the WideDeep class

Example

>>> import torch
>>> from pytorch_widedeep.models import SAINT
>>> X_tab = torch.cat((torch.empty(5, 4).random_(4), torch.rand(5, 1)), axis=1)
>>> colnames = ['a', 'b', 'c', 'd', 'e']
>>> embed_input = [(u,i) for u,i in zip(colnames[:4], [4]*4)]
>>> continuous_cols = ['e']
>>> column_idx = {k:v for v,k in enumerate(colnames)}
>>> model = SAINT(column_idx=column_idx, embed_input=embed_input, continuous_cols=continuous_cols)
>>> out = model(X_tab)

property attention_weights: List

List with the attention weights. Each element of the list is a tuple where the first and the second elements are the column and row attention weights respectively

The shape of the attention weights is:

• column attention: $$(N, H, F, F)$$

• row attention: $$(1, H, N, N)$$

where N is the batch size, H is the number of heads and F is the number of features/columns in the dataset

Return type

List
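A plain-Python shape sketch (toy numbers) of the two attention patterns: column attention mixes the F features within each sample, while row attention mixes the N samples of the batch with one another:

```python
# Toy dims: batch size, attention heads, features/columns.
N, H, F = 5, 8, 4
col_attn_shape = (N, H, F, F)  # one F x F attention map per sample and head
row_attn_shape = (1, H, N, N)  # one N x N attention map per head, over the batch
print(col_attn_shape, row_attn_shape)  # (5, 8, 4, 4) (1, 8, 5, 5)
```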

class pytorch_widedeep.models.transformers.ft_transformer.FTTransformer(column_idx, embed_input=None, embed_dropout=0.1, full_embed_dropout=False, shared_embed=False, add_shared_embed=False, frac_shared_embed=0.25, continuous_cols=None, embed_continuous_activation=None, cont_norm_layer=None, input_dim=64, kv_compression_factor=0.5, kv_sharing=False, use_bias=False, n_heads=8, n_blocks=4, attn_dropout=0.2, ff_dropout=0.1, transformer_activation='reglu', ff_factor=1.33, mlp_hidden_dims=None, mlp_activation='relu', mlp_dropout=0.1, mlp_batchnorm=False, mlp_batchnorm_last=False, mlp_linear_first=True)[source]

Defines an FTTransformer model (arXiv:2106.11959) that can be used as the deeptabular component of a Wide & Deep model.

Parameters
• column_idx (Dict) – Dict containing the index of the columns that will be passed through the model. Required to slice the tensors. e.g. {‘education’: 0, ‘relationship’: 1, ‘workclass’: 2, …}

• embed_input (List) – List of Tuples with the column name and number of unique values e.g. [(‘education’, 11), …]

• embed_dropout (float, default = 0.1) – Dropout to be applied to the embeddings matrix

• full_embed_dropout (bool, default = False) – Boolean indicating if an entire embedding (i.e. the representation of one column) will be dropped in the batch. See: pytorch_widedeep.models.transformers._layers.FullEmbeddingDropout. If full_embed_dropout = True, embed_dropout is ignored.

• shared_embed (bool, default = False) – The idea behind shared_embed is described in Appendix A of the TabTransformer paper: ‘The goal of having column embedding is to enable the model to distinguish the classes in one column from those in the other columns’. In other words, the idea is to let the model learn which column is being embedded at a given time.

• add_shared_embed (bool, default = False,) – The two embedding sharing strategies are: 1) add the shared embeddings to the column embeddings or 2) to replace the first frac_shared_embed with the shared embeddings. See pytorch_widedeep.models.transformers._layers.SharedEmbeddings

• frac_shared_embed (float, default = 0.25) – The fraction of embeddings that will be shared (if add_shared_embed = False) by all the different categories for one particular column.

• continuous_cols (List, Optional, default = None) – List with the name of the numeric (aka continuous) columns

• embed_continuous_activation (str, default = None) – String indicating the activation function to be applied to the continuous embeddings, if any. tanh, relu, leaky_relu and gelu are supported.

• cont_norm_layer (str, default = None,) – Type of normalization layer applied to the continuous features before they are embedded. Options are: layernorm, batchnorm or None.

• input_dim (int, default = 64) – The so-called dimension of the model. This is the dimension of the embeddings used to encode the categorical and/or continuous columns.

• kv_compression_factor (float, default = 0.5) – By default, the FTTransformer uses Linear Attention (see Linformer: Self-Attention with Linear Complexity). This is the compression factor used to reduce the input sequence length. If we denote the resulting sequence length by $$k$$, then $$k = \mathrm{int}(kv\_compression\_factor \times s)$$, where $$s$$ is the input sequence length.

• kv_sharing (bool, default = False) – Boolean indicating if the $$E$$ and $$F$$ projection matrices will share weights. See Linformer: Self-Attention with Linear Complexity for details

• n_heads (int, default = 8) – Number of attention heads per FTTransformer block

• use_bias (bool, default = False) – Boolean indicating whether or not to use bias in the Q, K, and V projection layers

• n_blocks (int, default = 4) – Number of FTTransformer blocks

• attn_dropout (float, default = 0.2) – Dropout that will be applied to the Linear-Attention layers

• ff_dropout (float, default = 0.1) – Dropout that will be applied to the FeedForward network

• transformer_activation (str, default = "reglu") – Transformer Encoder activation function. tanh, relu, leaky_relu, gelu, geglu and reglu are supported

• ff_factor (float, default = 4 / 3) – Multiplicative factor applied to the first layer of the FF network in each Transformer block. This is normally set to 4, but 4/3 is used in the paper.

• mlp_hidden_dims (List, Optional, default = None) – MLP hidden dimensions. If not provided no MLP on top of the final FTTransformer block will be used

• mlp_activation (str, default = "relu") – MLP activation function. tanh, relu, leaky_relu and gelu are supported

• mlp_dropout (float, default = 0.1) – Dropout that will be applied to the final MLP

• mlp_batchnorm (bool, default = False) – Boolean indicating whether or not to apply batch normalization to the dense layers

• mlp_batchnorm_last (bool, default = False) – Boolean indicating whether or not to apply batch normalization to the last of the dense layers

• mlp_linear_first (bool, default = True) – Boolean indicating the order of the operations in the dense layer. If True: [LIN -> ACT -> BN -> DP]. If False: [BN -> DP -> LIN -> ACT]

Attributes
• cat_and_cont_embed (nn.Module) – This is the module that processes the categorical and continuous columns

• transformer_blks (nn.Sequential) – Sequence of FTTransformer blocks

• transformer_mlp (nn.Module) – MLP component in the model

• output_dim (int) – The output dimension of the model. This is a required attribute necessary to build the WideDeep class

Example

>>> import torch
>>> from pytorch_widedeep.models import FTTransformer
>>> X_tab = torch.cat((torch.empty(5, 4).random_(4), torch.rand(5, 1)), axis=1)
>>> colnames = ['a', 'b', 'c', 'd', 'e']
>>> embed_input = [(u,i) for u,i in zip(colnames[:4], [4]*4)]
>>> continuous_cols = ['e']
>>> column_idx = {k:v for v,k in enumerate(colnames)}
>>> model = FTTransformer(column_idx=column_idx, embed_input=embed_input, continuous_cols=continuous_cols)
>>> out = model(X_tab)

property attention_weights: List

List with the attention weights

The shape of the attention weights is:

$$(N, H, F, k)$$

where N is the batch size, H is the number of attention heads, F is the number of features/columns and k is the reduced sequence length or dimension, i.e. $$k = \mathrm{int}(kv\_compression\_factor \times s)$$

Return type

List
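The reduced sequence length k used in the attention weights follows directly from kv_compression_factor (toy numbers below):

```python
# With 5 features/columns and the default compression factor of 0.5:
kv_compression_factor = 0.5
s = 5                               # input sequence length (features/columns)
k = int(kv_compression_factor * s)
print(k)  # 2
```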

class pytorch_widedeep.models.transformers.tab_perceiver.TabPerceiver(column_idx, embed_input=None, embed_dropout=0.1, full_embed_dropout=False, shared_embed=False, add_shared_embed=False, frac_shared_embed=0.25, continuous_cols=None, embed_continuous_activation=None, cont_norm_layer=None, input_dim=32, n_cross_attns=1, n_cross_attn_heads=4, n_latents=16, latent_dim=128, n_latent_heads=4, n_latent_blocks=4, n_perceiver_blocks=4, share_weights=False, attn_dropout=0.1, ff_dropout=0.1, transformer_activation='geglu', mlp_hidden_dims=None, mlp_activation='relu', mlp_dropout=0.1, mlp_batchnorm=False, mlp_batchnorm_last=False, mlp_linear_first=True)[source]

Defines an adaptation of a Perceiver model (arXiv:2103.03206) that can be used as the deeptabular component of a Wide & Deep model.

Parameters
• column_idx (Dict) – Dict containing the index of the columns that will be passed through the model. Required to slice the tensors. e.g. {‘education’: 0, ‘relationship’: 1, ‘workclass’: 2, …}

• embed_input (List) – List of Tuples with the column name and number of unique values e.g. [(‘education’, 11), …]

• embed_dropout (float, default = 0.1) – Dropout to be applied to the embeddings matrix

• full_embed_dropout (bool, default = False) – Boolean indicating if an entire embedding (i.e. the representation of one column) will be dropped in the batch. See: pytorch_widedeep.models.transformers._layers.FullEmbeddingDropout. If full_embed_dropout = True, embed_dropout is ignored.

• shared_embed (bool, default = False) – The idea behind shared_embed is described in Appendix A of the TabTransformer paper: ‘The goal of having column embedding is to enable the model to distinguish the classes in one column from those in the other columns’. In other words, the idea is to let the model learn which column is being embedded at a given time.

• add_shared_embed (bool, default = False,) – The two embedding sharing strategies are: 1) add the shared embeddings to the column embeddings or 2) to replace the first frac_shared_embed with the shared embeddings. See pytorch_widedeep.models.transformers._layers.SharedEmbeddings

• frac_shared_embed (float, default = 0.25) – The fraction of embeddings that will be shared (if add_shared_embed = False) by all the different categories for one particular column.

• continuous_cols (List, Optional, default = None) – List with the name of the numeric (aka continuous) columns

• embed_continuous_activation (str, default = None) – String indicating the activation function to be applied to the continuous embeddings, if any. tanh, relu, leaky_relu and gelu are supported.

• cont_norm_layer (str, default = None,) – Type of normalization layer applied to the continuous features before they are embedded. Options are: layernorm, batchnorm or None.

• input_dim (int, default = 32) – The so-called dimension of the model. In general, is the number of embeddings used to encode the categorical and/or continuous columns.

• n_cross_attns (int, default = 1) – Number of times each perceiver block will cross attend to the input data (i.e. number of cross attention components per perceiver block). This should normally be 1. However, in the paper they describe some architectures (normally computer vision-related problems) where the Perceiver attends multiple times to the input array. Therefore, maybe multiple cross attention to the input array is also useful in some cases for tabular data

• n_cross_attn_heads (int, default = 4) – Number of attention heads for the cross attention component

• n_latents (int, default = 16) – Number of latents. This is the N parameter in the paper. As indicated in the paper, this number should be significantly lower than M (the number of columns in the dataset). Setting N closer to M defies the main purpose of the Perceiver, which is to overcome the transformer quadratic bottleneck

• latent_dim (int, default = 128) – Latent dimension.

• n_latent_heads (int, default = 4) – Number of attention heads per Latent Transformer

• n_latent_blocks (int, default = 4) – Number of transformer encoder blocks (normalised MHA + normalised FF) per Latent Transformer

• n_perceiver_blocks (int, default = 4) – Number of Perceiver blocks defined as [Cross Attention + Latent Transformer]

• share_weights (Boolean, default = False) – Boolean indicating if the weights will be shared between Perceiver blocks

• attn_dropout (float, default = 0.1) – Dropout that will be applied to the Multi-Head Attention layers

• ff_dropout (float, default = 0.1) – Dropout that will be applied to the FeedForward network

• transformer_activation (str, default = "geglu") – Transformer Encoder activation function. tanh, relu, leaky_relu, gelu, geglu and reglu are supported

• mlp_hidden_dims (List, Optional, default = None) – MLP hidden dimensions. If not provided it will default to [l, 4*l, 2*l] where l is the MLP input dimension

• mlp_activation (str, default = "relu") – MLP activation function. tanh, relu, leaky_relu and gelu are supported

• mlp_dropout (float, default = 0.1) – Dropout that will be applied to the final MLP

• mlp_batchnorm (bool, default = False) – Boolean indicating whether or not to apply batch normalization to the dense layers

• mlp_batchnorm_last (bool, default = False) – Boolean indicating whether or not to apply batch normalization to the last of the dense layers

• mlp_linear_first (bool, default = True) – Boolean indicating the order of the operations in the dense layer. If True: [LIN -> ACT -> BN -> DP]. If False: [BN -> DP -> LIN -> ACT]

Attributes
• cat_and_cont_embed (nn.Module) – This is the module that processes the categorical and continuous columns

• perceiver_blks (nn.ModuleDict) – ModuleDict with the Perceiver blocks

• latents (nn.Parameter) – Latents that will be used for prediction

• perceiver_mlp (nn.Module) – MLP component in the model

• output_dim (int) – The output dimension of the model. This is a required attribute, necessary to build the WideDeep class

Example

>>> import torch
>>> from pytorch_widedeep.models import TabPerceiver
>>> X_tab = torch.cat((torch.empty(5, 4).random_(4), torch.rand(5, 1)), axis=1)
>>> colnames = ['a', 'b', 'c', 'd', 'e']
>>> embed_input = [(u,i) for u,i in zip(colnames[:4], [4]*4)]
>>> continuous_cols = ['e']
>>> column_idx = {k:v for v,k in enumerate(colnames)}
>>> model = TabPerceiver(column_idx=column_idx, embed_input=embed_input,
... continuous_cols=continuous_cols, n_latents=2, latent_dim=16,
... n_perceiver_blocks=2)
>>> out = model(X_tab)

property attention_weights: List

List with the attention weights. If the weights are not shared between perceiver blocks each element of the list will be a list itself containing the Cross Attention and Latent Transformer attention weights respectively

The shape of the attention weights is:

• Cross Attention: $$(N, C, L, F)$$

• Latent Attention: $$(N, T, L, L)$$

Where N is the batch size, C is the number of Cross Attention heads, L is the number of Latents, F is the number of features/columns in the dataset and T is the number of Latent Attention heads

Return type

List

class pytorch_widedeep.models.transformers.tab_fastformer.TabFastFormer(column_idx, embed_input=None, embed_dropout=0.1, full_embed_dropout=False, shared_embed=False, add_shared_embed=False, frac_shared_embed=0.25, continuous_cols=None, embed_continuous_activation=None, cont_norm_layer=None, input_dim=32, n_heads=8, use_bias=False, n_blocks=4, attn_dropout=0.1, ff_dropout=0.2, share_qv_weights=False, share_weights=False, transformer_activation='relu', mlp_hidden_dims=None, mlp_activation='relu', mlp_dropout=0.1, mlp_batchnorm=False, mlp_batchnorm_last=False, mlp_linear_first=True)[source]

Defines an adaptation of a FastFormer model (arXiv:2108.09084) that can be used as the deeptabular component of a Wide & Deep model.

Parameters
• column_idx (Dict) – Dict containing the index of the columns that will be passed through the model. Required to slice the tensors. e.g. {‘education’: 0, ‘relationship’: 1, ‘workclass’: 2, …}

• embed_input (List) – List of Tuples with the column name and number of unique values e.g. [(‘education’, 11), …]

• embed_dropout (float, default = 0.1) – Dropout to be applied to the embeddings matrix

• full_embed_dropout (bool, default = False) – Boolean indicating if an entire embedding (i.e. the representation of one column) will be dropped in the batch. See: pytorch_widedeep.models.transformers._layers.FullEmbeddingDropout. If full_embed_dropout = True, embed_dropout is ignored.

• shared_embed (bool, default = False) –

The idea behind shared_embed is described in Appendix A of the TabTransformer paper: ‘The goal of having column embedding is to enable the model to distinguish the classes in one column from those in the other columns’. In other words, the idea is to let the model learn which column is being embedded at any given time.

• add_shared_embed (bool, default = False,) – The two embedding sharing strategies are: 1) add the shared embeddings to the column embeddings or 2) to replace the first frac_shared_embed with the shared embeddings. See pytorch_widedeep.models.transformers._layers.SharedEmbeddings

• frac_shared_embed (float, default = 0.25) – The fraction of embeddings that will be shared (if add_shared_embed = False) by all the different categories for one particular column.

• continuous_cols (List, Optional, default = None) – List with the name of the numeric (aka continuous) columns

• embed_continuous_activation (str, default = None) – String indicating the activation function to be applied to the continuous embeddings, if any. tanh, relu, leaky_relu and gelu are supported.

• cont_norm_layer (str, default = None,) – Type of normalization layer applied to the continuous features before they are embedded. Options are: layernorm, batchnorm or None.

• input_dim (int, default = 32) – The so-called dimension of the model. In general is the number of embeddings used to encode the categorical and/or continuous columns

• n_heads (int, default = 8) – Number of attention heads per FastFormer block

• use_bias (bool, default = False) – Boolean indicating whether or not to use bias in the Q, K, and V projection layers

• n_blocks (int, default = 4) – Number of FastFormer blocks

• attn_dropout (float, default = 0.1) – Dropout that will be applied to the Additive Attention layers

• ff_dropout (float, default = 0.2) – Dropout that will be applied to the FeedForward network

• share_qv_weights (bool, default = False) – Following the paper, boolean indicating if the value and the query transformation parameters will be shared

• share_weights (bool, default = False) – In addition to sharing the value and query transformation parameters, the parameters across different Fastformer layers can also be shared

• transformer_activation (str, default = "relu") – Transformer Encoder activation function. tanh, relu, leaky_relu, gelu, geglu and reglu are supported

• mlp_hidden_dims (List, Optional, default = None) – MLP hidden dimensions. If not provided it will default to [l, 4*l, 2*l] where l is the MLP input dimension

• mlp_activation (str, default = "relu") – MLP activation function. tanh, relu, leaky_relu and gelu are supported

• mlp_dropout (float, default = 0.1) – Dropout that will be applied to the final MLP

• mlp_batchnorm (bool, default = False) – Boolean indicating whether or not to apply batch normalization to the dense layers

• mlp_batchnorm_last (bool, default = False) – Boolean indicating whether or not to apply batch normalization to the last of the dense layers

• mlp_linear_first (bool, default = True) – Boolean indicating the order of the operations in the dense layer. If True: [LIN -> ACT -> BN -> DP]. If False: [BN -> DP -> LIN -> ACT]

Attributes
• cat_and_cont_embed (nn.Module) – This is the module that processes the categorical and continuous columns

• transformer_blks (nn.Sequential) – Sequence of FastFormer blocks

• transformer_mlp (nn.Module) – MLP component in the model

• output_dim (int) – The output dimension of the model. This is a required attribute, necessary to build the WideDeep class

Example

>>> import torch
>>> from pytorch_widedeep.models import TabFastFormer
>>> X_tab = torch.cat((torch.empty(5, 4).random_(4), torch.rand(5, 1)), axis=1)
>>> colnames = ['a', 'b', 'c', 'd', 'e']
>>> embed_input = [(u,i) for u,i in zip(colnames[:4], [4]*4)]
>>> continuous_cols = ['e']
>>> column_idx = {k:v for v,k in enumerate(colnames)}
>>> model = TabFastFormer(column_idx=column_idx, embed_input=embed_input, continuous_cols=continuous_cols)
>>> out = model(X_tab)

property attention_weights: List

List with the attention weights. Each element of the list is a tuple whose first and second elements are the $$\alpha$$ and $$\beta$$ attention weights described in the paper.

The shape of the attention weights is:

$$(N, H, F)$$

where N is the batch size, H is the number of attention heads and F is the number of features/columns in the dataset

Return type

List

class pytorch_widedeep.models.deep_text.DeepText(vocab_size, rnn_type='lstm', hidden_dim=64, n_layers=3, rnn_dropout=0.1, bidirectional=True, use_hidden_state=True, padding_idx=1, embed_dim=None, embed_matrix=None, embed_trainable=True, head_hidden_dims=None, head_activation='relu', head_dropout=None, head_batchnorm=False, head_batchnorm_last=False, head_linear_first=False)[source]

Standard text classifier/regressor comprised of a stack of RNNs (LSTMs or GRUs).

In addition, there is the option to add a Fully Connected (FC) set of dense layers (referred to as the texthead) on top of the stack of RNNs

Parameters
• vocab_size (int) – number of words in the vocabulary

• rnn_type (str, default = 'lstm') – String indicating the type of RNN to use. One of lstm or gru

• hidden_dim (int, default = 64) – Hidden dim of the RNN

• n_layers (int, default = 3) – number of recurrent layers

• rnn_dropout (float, default = 0.1) – dropout for the dropout layer on the outputs of each RNN layer except the last layer

• bidirectional (bool, default = True) – indicates whether the stacked RNNs are bidirectional

• use_hidden_state (bool, default = True) – Boolean indicating whether to use the final hidden state or the RNN output as predicting features

• padding_idx (int, default = 1) – index of the padding token in the padded-tokenised sequences. I use the fastai tokenizer where the token index 0 is reserved for the ‘unknown’ word token

• embed_dim (int, Optional, default = None) – Dimension of the word embedding matrix if non-pretrained word vectors are used

• embed_matrix (np.ndarray, Optional, default = None) – Pretrained word embeddings

• embed_trainable (bool, default = True) – Boolean indicating if the pretrained embeddings are trainable

• head_hidden_dims (List, Optional, default = None) – List with the sizes of the stacked dense layers in the head e.g: [128, 64]

• head_activation (str, default = "relu") – Activation function for the dense layers in the head. Currently tanh, relu, leaky_relu and gelu are supported

• head_dropout (float, Optional, default = None) – dropout between the dense layers in the head

• head_batchnorm (bool, default = False) – Whether or not to include batch normalization in the dense layers that form the ‘texthead’

• head_batchnorm_last (bool, default = False) – Boolean indicating whether or not to apply batch normalization to the last of the dense layers in the head

• head_linear_first (bool, default = False) – Boolean indicating whether the order of the operations in the dense layer. If True: [LIN -> ACT -> BN -> DP]. If False: [BN -> DP -> LIN -> ACT]

Attributes
• word_embed (nn.Module) – word embedding matrix

• rnn (nn.Module) – Stack of RNNs

• texthead (nn.Sequential) – Stack of dense layers on top of the RNN. This will only exist if head_hidden_dims is not None

• output_dim (int) – The output dimension of the model. This is a required attribute, necessary to build the WideDeep class

Example

>>> import torch
>>> from pytorch_widedeep.models import DeepText
>>> X_text = torch.cat((torch.zeros([5,1]), torch.empty(5, 4).random_(1,4)), axis=1)
>>> model = DeepText(vocab_size=4, hidden_dim=4, n_layers=1, padding_idx=0, embed_dim=4)
>>> out = model(X_text)


class pytorch_widedeep.models.deep_image.DeepImage(pretrained=True, resnet_architecture=18, freeze_n=6, head_hidden_dims=None, head_activation='relu', head_dropout=0.1, head_batchnorm=False, head_batchnorm_last=False, head_linear_first=False)[source]

Standard image classifier/regressor using a pretrained network (in particular ResNets) or a sequence of 4 convolution layers.

If pretrained=False the ‘backbone’ of DeepImage will be a sequence of 4 convolutional layers, each comprising: Conv2d -> BatchNorm2d -> LeakyReLU. The 4th one also adds a final AdaptiveAvgPool2d operation.

If pretrained=True the ‘backbone’ will be a ResNet. ResNets have 9 ‘components’ before the last dense layers. The first 4 are: Conv2d -> BatchNorm2d -> ReLU -> MaxPool2d. Then there are 4 additional resnet blocks comprised of a series of convolutions, followed by the final AdaptiveAvgPool2d. Overall, 4+4+1=9. The parameter freeze_n sets the number of layers to be frozen. For example, freeze_n=6 will freeze all but the last 3 layers.

In addition to all of the above, there is the option to add a fully connected set of dense layers (referred to as the imagehead) on top of the stack of CNNs

Parameters
• pretrained (bool, default = True) – Indicates whether or not we use a pretrained Resnet network or a series of conv layers (see conv_layer function)

• resnet_architecture (int, default = 18) – The resnet architecture. One of 18, 34 or 50

• freeze_n (int, default = 6) – number of layers to freeze. Must be less than or equal to 8. If 8 the entire ‘backbone’ of the network will be frozen

• head_hidden_dims (List, Optional, default = None) – List with the number of neurons per dense layer in the head. e.g: [64,32]

• head_activation (str, default = "relu") – Activation function for the dense layers in the head. Currently tanh, relu, leaky_relu and gelu are supported

• head_dropout (float, default = 0.1) – float indicating the dropout between the dense layers.

• head_batchnorm (bool, default = False) – Boolean indicating whether or not batch normalization will be applied to the dense layers

• head_batchnorm_last (bool, default = False) – Boolean indicating whether or not batch normalization will be applied to the last of the dense layers

• head_linear_first (bool, default = False) – Boolean indicating the order of the operations in the dense layer. If True: [LIN -> ACT -> BN -> DP]. If False: [BN -> DP -> LIN -> ACT]

Attributes
• backbone (nn.Sequential) – Sequential stack of CNNs comprising the ‘backbone’ of the network

• imagehead (nn.Sequential) – Sequential stack of dense layers comprising the FC-Head (aka imagehead)

• output_dim (int) – The output dimension of the model. This is a required attribute, necessary to build the WideDeep class

Example

>>> import torch
>>> from pytorch_widedeep.models import DeepImage
>>> X_img = torch.rand((2,3,224,224))
>>> model = DeepImage(head_hidden_dims=[512, 64, 8])
>>> out = model(X_img)


class pytorch_widedeep.models.wide_deep.WideDeep(wide=None, deeptabular=None, deeptext=None, deepimage=None, deephead=None, head_hidden_dims=None, head_dropout=0.1, head_activation='relu', head_batchnorm=False, head_batchnorm_last=False, head_linear_first=False, pred_dim=1)[source]

Main collector class that combines all wide, deeptabular (which can be a number of architectures), deeptext and deepimage models.

There are two options to combine these models that correspond to the two main architectures that pytorch-widedeep can build.

• Directly connecting the output of the model components to an output neuron(s).

• Adding a Fully-Connected Head (FC-Head) on top of the deep models. This FC-Head will combine the output from the deeptabular, deeptext and deepimage components and will then be connected to the output neuron(s).

Parameters
• wide (nn.Module, Optional, default = None) – Wide model. I recommend using the Wide class in this package. However, it is possible to use a custom model as long as it is consistent with the required architecture, see pytorch_widedeep.models.wide.Wide

• deeptabular (nn.Module, Optional, default = None) – currently pytorch-widedeep implements a number of possible architectures for the deeptabular component. See the documentation of the package. I recommend using the deeptabular components in this package. However, it is possible to use a custom model as long as it is consistent with the required architecture.

• deeptext (nn.Module, Optional, default = None) – Model for the text input. Must be an object of class DeepText or a custom model, as long as it is consistent with the required architecture. See pytorch_widedeep.models.deep_text.DeepText

• deepimage (nn.Module, Optional, default = None) – Model for the images input. Must be an object of class DeepImage or a custom model, as long as it is consistent with the required architecture. See pytorch_widedeep.models.deep_image.DeepImage

• deephead (nn.Module, Optional, default = None) – Custom model by the user that will receive the output of the deep component. Typically a FC-Head (MLP)

• head_hidden_dims (List, Optional, default = None) – Alternatively, the head_hidden_dims param can be used to specify the sizes of the stacked dense layers in the fc-head e.g: [128, 64]. Use deephead or head_hidden_dims, but not both.

• head_dropout (float, default = 0.1) – If head_hidden_dims is not None, dropout between the layers in head_hidden_dims

• head_activation (str, default = "relu") – If head_hidden_dims is not None, activation function of the head layers. One of tanh, relu, gelu or leaky_relu

• head_batchnorm (bool, default = False) – If head_hidden_dims is not None, specifies if batch normalization should be included in the head layers

• head_batchnorm_last (bool, default = False) – If head_hidden_dims is not None, boolean indicating whether or not to apply batch normalization to the last of the dense layers

• head_linear_first (bool, default = False) – If head_hidden_dims is not None, boolean indicating the order of the operations in the dense layer. If True: [LIN -> ACT -> BN -> DP]. If False: [BN -> DP -> LIN -> ACT]

• pred_dim (int, default = 1) – Size of the final wide and deep output layer containing the predictions. 1 for regression and binary classification or number of classes for multiclass classification.

Examples

>>> from pytorch_widedeep.models import TabResnet, DeepImage, DeepText, Wide, WideDeep
>>> embed_input = [(u, i, j) for u, i, j in zip(["a", "b", "c"], [4] * 3, [8] * 3)]
>>> column_idx = {k: v for v, k in enumerate(["a", "b", "c"])}
>>> wide = Wide(10, 1)
>>> deeptabular = TabResnet(blocks_dims=[8, 4], column_idx=column_idx, embed_input=embed_input)
>>> deeptext = DeepText(vocab_size=10, embed_dim=4, padding_idx=0)
>>> deepimage = DeepImage(pretrained=False)
>>> model = WideDeep(wide=wide, deeptabular=deeptabular, deeptext=deeptext, deepimage=deepimage)


Note

While I recommend using the wide and deeptabular components within this package when building the corresponding model components, it is very likely that the user will want to use custom text and image models. That is perfectly possible. Simply build them and pass them as the corresponding parameters. Note that the custom models MUST return a last layer of activations (i.e. not the final prediction) so that these activations are collected by WideDeep and combined accordingly. In addition, the models MUST also contain an attribute output_dim with the size of these last layers of activations. See for example pytorch_widedeep.models.tab_mlp.TabMlp
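A custom component satisfying the two requirements above might look like this. BagOfEmbeddings is a hypothetical module written for illustration only, not part of the package:

```python
import torch
from torch import nn

class BagOfEmbeddings(nn.Module):
    """Hypothetical custom deeptext component: returns last-layer
    activations (not predictions) and exposes `output_dim`."""

    def __init__(self, vocab_size: int, embed_dim: int = 16):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, embed_dim)
        # required attribute: size of the activations returned by forward
        self.output_dim = embed_dim

    def forward(self, X: torch.Tensor) -> torch.Tensor:
        # X: (batch, seq_len) of token indices -> (batch, embed_dim)
        return self.embed(X.long())

custom_text = BagOfEmbeddings(vocab_size=100)
out = custom_text(torch.empty(5, 10).random_(0, 100))  # (5, 16) activations
```

Any module built this way can be passed as the deeptext (or deepimage) parameter of WideDeep, which will append the prediction layer(s) on top of the returned activations.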