The rec module

This module contains models that are specifically designed for recommendation systems. While the rest of the models can be accessed from the pytorch_widedeep.models module, models in this module need to be specifically imported from the rec module, e.g.:

from pytorch_widedeep.models.rec import DeepFactorizationMachine

The list of models here is not meant to be exhaustive, but it includes some common architectures such as factorization machines, field-aware factorization machines and extreme deep factorization machines. More models will be added in the future.

DeepFactorizationMachine

Bases: BaseTabularModelWithAttention

Deep Factorization Machine (DeepFM) for recommendation systems, which is an adaptation of 'Factorization Machines' by Steffen Rendle. Presented in 'DeepFM: A Factorization-Machine based Neural Network for CTR Prediction' by Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li and Xiuqiang He, 2017.

The implementation in this library takes advantage of all the functionalities available to encode categorical and continuous features. The model can also be used with only the factorization machine component (i.e. without the MLP) by leaving mlp_hidden_dims as None.

Note that this class implements only the 'Deep' component of the model described in the paper. The linear component is not implemented 'internally' and, if one wants to include it, it can be easily added using the 'wide' (aka linear) component available in this library. See the examples in the examples folder.
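The snippet below is a minimal, illustrative sketch (it is not one of the examples in the examples folder) of how the FM-based component can be combined with the library's Wide component through the WideDeep collector class. The column names, vocabulary sizes and input_dim are made up; in practice input_dim would come from a WidePreprocessor.

from pytorch_widedeep.models import Wide, WideDeep
from pytorch_widedeep.models.rec import DeepFactorizationMachine

column_idx = {"col1": 0, "col2": 1}
cat_embed_input = [("col1", 10), ("col2", 10)]

# 'deep' part: the DeepFM component implemented in this module
deepfm = DeepFactorizationMachine(
    column_idx=column_idx,
    num_factors=8,
    cat_embed_input=cat_embed_input,
    mlp_hidden_dims=[16, 8],
)

# 'wide'/linear part: a linear model over the one-hot encoded features
# (20 = total number of unique categorical values in this toy setup)
wide = Wide(input_dim=20, pred_dim=1)

# WideDeep combines the linear and the FM-based components into one model
model = WideDeep(wide=wide, deeptabular=deepfm)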

Parameters:

column_idx : Dict[str, int], required
    Dictionary mapping column names to their corresponding index.
num_factors : int, required
    Number of factors for the factorization machine.
reduce_sum : bool, default=True
    Whether to reduce the sum in the factorization machine output.
cat_embed_input : Optional[List[Tuple[str, int]]], default=None
    List of tuples with categorical column names and number of unique values.
cat_embed_dropout : Optional[float], default=None
    Categorical embeddings dropout. If None, it will default to 0.
use_cat_bias : Optional[bool], default=None
    Boolean indicating if bias will be used for the categorical embeddings. If None, it will default to False.
cat_embed_activation : Optional[str], default=None
    Activation function for the categorical embeddings, if any. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported.
continuous_cols : Optional[List[str]], default=None
    List with the names of the numeric (aka continuous) columns.
cont_norm_layer : Optional[Literal['batchnorm', 'layernorm']], default=None
    Type of normalization layer applied to the continuous features. Options are 'layernorm' and 'batchnorm'. If None, no normalization layer will be used.
embed_continuous_method : Optional[Literal['piecewise', 'periodic', 'standard']], default='standard'
    Method used to embed the continuous features. Options are 'standard', 'periodic' or 'piecewise'. The 'standard' embedding method is based on the FT-Transformer implementation presented in the paper Revisiting Deep Learning Models for Tabular Data. The 'periodic' and 'piecewise' methods were presented in the paper On Embeddings for Numerical Features in Tabular Deep Learning. Please read the papers for details.
cont_embed_dropout : Optional[float], default=None
    Dropout for the continuous embeddings. If None, it will default to 0.0.
cont_embed_activation : Optional[str], default=None
    Activation function for the continuous embeddings, if any. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported. If None, no activation function will be applied.
quantization_setup : Optional[Dict[str, List[float]]], default=None
    Used when the 'piecewise' method is chosen to embed the continuous columns. It is a dict where keys are the names of the continuous columns and values are lists with the boundaries for the quantization of those columns. See the examples and the sketch after this parameter list for details. If the 'piecewise' method is used, this parameter is required.
n_frequencies : Optional[int], default=None
    The so-called 'k' in the paper On Embeddings for Numerical Features in Tabular Deep Learning, i.e. the number of 'frequencies' used to represent each continuous column. See Eq. 2 in the paper for details. If the 'periodic' method is used, this parameter is required.
sigma : Optional[float], default=None
    The sigma parameter in the paper mentioned above, used to initialise the 'frequency weights'. See Eq. 2 in the paper for details. If the 'periodic' method is used, this parameter is required.
share_last_layer : Optional[bool], default=None
    Not present in the aforementioned paper, but implemented in the official repo. If True, the linear layer that turns the frequencies into embeddings is shared across the continuous columns. If False, a different linear layer is used for each continuous column. If the 'periodic' method is used, this parameter is required.
full_embed_dropout : Optional[bool], default=None
    If True, the full embedding corresponding to a column will be masked out/dropped out. If None, it will default to False.
mlp_hidden_dims : Optional[List[int]], default=None
    List with the number of neurons per dense layer in the MLP.
mlp_activation : Optional[str], default=None
    Activation function for the dense layers of the MLP. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported.
mlp_dropout : Optional[float], default=None
    Float or list of floats with the dropout between the dense layers, e.g. [0.5, 0.5].
mlp_batchnorm : Optional[bool], default=None
    Boolean indicating whether batch normalization will be applied to the dense layers.
mlp_batchnorm_last : Optional[bool], default=None
    Boolean indicating whether batch normalization will be applied to the last of the dense layers.
mlp_linear_first : Optional[bool], default=None
    Boolean indicating the order of the operations in the dense layer. If True: [LIN -> ACT -> BN -> DP]. If False: [BN -> DP -> LIN -> ACT].
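To make the continuous-embedding parameters a bit more concrete, here is a hedged, illustrative sketch of the 'piecewise' setup. The 'price' column and its bin boundaries are invented for illustration and are not part of the library's own examples.

from pytorch_widedeep.models.rec import DeepFactorizationMachine

column_idx = {"col1": 0, "col2": 1, "price": 2}
cat_embed_input = [("col1", 10), ("col2", 10)]

fm = DeepFactorizationMachine(
    column_idx=column_idx,
    num_factors=8,
    cat_embed_input=cat_embed_input,
    continuous_cols=["price"],
    embed_continuous_method="piecewise",
    # one list of bin boundaries per continuous column
    quantization_setup={"price": [0.0, 10.0, 20.0, 50.0]},
    mlp_hidden_dims=[16, 8],
)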

Attributes:

mlp : MLP
    MLP component of the model if the mlp_hidden_dims parameter is not None. If None, the model will only return the output of the factorization machine.

Examples:

>>> from typing import Dict, List, Tuple
>>> import torch
>>> from torch import Tensor
>>> from pytorch_widedeep.models.rec import DeepFactorizationMachine
>>> X = torch.randint(0, 10, (16, 2))
>>> column_idx: Dict[str, int] = {"col1": 0, "col2": 1}
>>> cat_embed_input: List[Tuple[str, int]] = [("col1", 10), ("col2", 10)]
>>> fm = DeepFactorizationMachine(
...     column_idx=column_idx,
...     num_factors=8,
...     cat_embed_input=cat_embed_input,
...     mlp_hidden_dims=[16, 8]
... )
>>> out = fm(X)
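A brief note on the example above (not part of the original docstring): with reduce_sum=True, which is the default, both the factorization machine term and the summed MLP term have shape (batch_size, 1), so the expected output shape here is:

out.shape  # expected: torch.Size([16, 1])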
Source code in pytorch_widedeep/models/rec/deepfm.py
class DeepFactorizationMachine(BaseTabularModelWithAttention):
    """
    Deep Factorization Machine (DeepFM) for recommendation systems, which is
    an adaptation of 'Factorization Machines' by Steffen Rendle. Presented
    in 'DeepFM: A Factorization-Machine based Neural Network for CTR
    Prediction' by Huifeng Guo, Ruiming Tang, Yunming Yey, Zhenguo Li,
    Xiuqiang He. 2017.

    The implementation in this library takes advantage of all the
    functionalities available to encode categorical and continuous features.
    The model can be used with only the factorization machine

    Note that this class implements only the 'Deep' component of the model
    described in the paper. The linear component is not
    implemented 'internally' and, if one wants to include it, it can be
    easily added using the 'wide' (aka linear) component available in this
    library. See the examples in the examples folder.

    Parameters
    ----------
    column_idx : Dict[str, int]
        Dictionary mapping column names to their corresponding index.
    num_factors : int
        Number of factors for the factorization machine.
    reduce_sum : bool, default=True
        Whether to reduce the sum in the factorization machine output.
    cat_embed_input : Optional[List[Tuple[str, int]]], default=None
        List of tuples with categorical column names and number of unique values.
    cat_embed_dropout : Optional[float], default=None
        Categorical embeddings dropout. If `None`, it will default
        to 0.
    use_cat_bias : Optional[bool], default=None
        Boolean indicating if bias will be used for the categorical embeddings.
        If `None`, it will default to 'False'.
    cat_embed_activation : Optional[str], default=None
        Activation function for the categorical embeddings, if any. Currently
        _'tanh'_, _'relu'_, _'leaky_relu'_ and _'gelu'_ are supported
    continuous_cols : Optional[List[str]], default=None
        List with the name of the numeric (aka continuous) columns
    cont_norm_layer : Optional[Literal["batchnorm", "layernorm"]], default=None
        Type of normalization layer applied to the continuous features.
        Options are: _'layernorm'_ and _'batchnorm'_. if `None`, no
        normalization layer will be used.
    embed_continuous_method: Optional[Literal["piecewise", "periodic", "standard"]], default="standard"
        Method to use to embed the continuous features. Options are:
        _'standard'_, _'periodic'_ or _'piecewise'_. The _'standard'_
        embedding method is based on the FT-Transformer implementation
        presented in the paper: [Revisiting Deep Learning Models for
        Tabular Data](https://arxiv.org/abs/2106.11959v5). The _'periodic'_
        and_'piecewise'_ methods were presented in the paper: [On Embeddings for
        Numerical Features in Tabular Deep Learning](https://arxiv.org/abs/2203.05556).
        Please, read the papers for details.
    cont_embed_dropout : Optional[float], default=None
        Dropout for the continuous embeddings. If `None`, it will default to 0.0
    cont_embed_activation : Optional[str], default=None
        Activation function for the continuous embeddings if any. Currently
        _'tanh'_, _'relu'_, _'leaky_relu'_ and _'gelu'_ are supported.
        If `None`, no activation function will be applied.
    quantization_setup : Optional[Dict[str, List[float]]], default=None
        This parameter is used when the _'piecewise'_ method is used to embed
        the continuous cols. It is a dict where keys are the name of the continuous
        columns and values are lists with the boundaries for the quantization
        of the continuous_cols. See the examples for details. If
        If the _'piecewise'_ method is used, this parameter is required.
    n_frequencies : Optional[int], default=None
        This is the so called _'k'_ in their paper [On Embeddings for
        Numerical Features in Tabular Deep Learning](https://arxiv.org/abs/2203.05556),
        and is the number of 'frequencies' that will be used to represent each
        continuous column. See their Eq 2 in the paper for details. If
        the _'periodic'_ method is used, this parameter is required.
    sigma : Optional[float], default=None
        This is the sigma parameter in the paper mentioned when describing the
        previous parameters and it is used to initialise the 'frequency
        weights'. See their Eq 2 in the paper for details. If
        the _'periodic'_ method is used, this parameter is required.
    share_last_layer : Optional[bool], default=None
        This parameter is not present in the before mentioned paper but it is implemented in
        the [official repo](https://github.com/yandex-research/rtdl-num-embeddings/tree/main).
        If `True` the linear layer that turns the frequencies into embeddings
        will be shared across the continuous columns. If `False` a different
        linear layer will be used for each continuous column.
        If the _'periodic'_ method is used, this parameter is required.
    full_embed_dropout: bool, Optional, default = None,
        If `True`, the full embedding corresponding to a column will be masked
        out/dropout. If `None`, it will default to `False`.
    mlp_hidden_dims: List, default = [200, 100]
        List with the number of neurons per dense layer in the mlp.
    mlp_activation: str, default = "relu"
        Activation function for the dense layers of the MLP. Currently
        _'tanh'_, _'relu'_, _'leaky_relu'_ and _'gelu'_ are supported
    mlp_dropout: float or List, default = 0.1
        float or List of floats with the dropout between the dense layers.
        e.g: _[0.5,0.5]_
    mlp_batchnorm: bool, default = False
        Boolean indicating whether or not batch normalization will be applied
        to the dense layers
    mlp_batchnorm_last: bool, default = False
        Boolean indicating whether or not batch normalization will be applied
        to the last of the dense layers
    mlp_linear_first: bool, default = False
        Boolean indicating the order of the operations in the dense
        layer. If `True: [LIN -> ACT -> BN -> DP]`. If `False: [BN -> DP ->
        LIN -> ACT]`

    Attributes
    ----------
    mlp: MLP
        MLP component of the model if the mlp_hidden_dims parameter is not
        None. If None, the model will only return the output of the
        factorization machine.

    Examples
    --------
    >>> from typing import Dict, List, Tuple
    >>> import torch
    >>> from torch import Tensor
    >>> from pytorch_widedeep.models.rec import DeepFactorizationMachine
    >>> X = torch.randint(0, 10, (16, 2))
    >>> column_idx: Dict[str, int] = {"col1": 0, "col2": 1}
    >>> cat_embed_input: List[Tuple[str, int]] = [("col1", 10), ("col2", 10)]
    >>> fm = DeepFactorizationMachine(
    ...     column_idx=column_idx,
    ...     num_factors=8,
    ...     cat_embed_input=cat_embed_input,
    ...     mlp_hidden_dims=[16, 8]
    ... )
    >>> out = fm(X)
    """

    def __init__(
        self,
        *,
        column_idx: Dict[str, int],
        num_factors: int,
        reduce_sum: bool = True,
        cat_embed_input: Optional[List[Tuple[str, int]]] = None,
        cat_embed_dropout: Optional[float] = None,
        use_cat_bias: Optional[bool] = None,
        cat_embed_activation: Optional[str] = None,
        continuous_cols: Optional[List[str]] = None,
        cont_norm_layer: Optional[Literal["batchnorm", "layernorm"]] = None,
        embed_continuous_method: Optional[
            Literal["piecewise", "periodic", "standard"]
        ] = "standard",
        cont_embed_dropout: Optional[float] = None,
        cont_embed_activation: Optional[str] = None,
        quantization_setup: Optional[Dict[str, List[float]]] = None,
        n_frequencies: Optional[int] = None,
        sigma: Optional[float] = None,
        share_last_layer: Optional[bool] = None,
        full_embed_dropout: Optional[bool] = None,
        mlp_hidden_dims: Optional[List[int]] = None,
        mlp_activation: Optional[str] = None,
        mlp_dropout: Optional[float] = None,
        mlp_batchnorm: Optional[bool] = None,
        mlp_batchnorm_last: Optional[bool] = None,
        mlp_linear_first: Optional[bool] = None,
    ):
        super(DeepFactorizationMachine, self).__init__(
            column_idx=column_idx,
            input_dim=num_factors,
            cat_embed_input=cat_embed_input,
            cat_embed_dropout=cat_embed_dropout,
            use_cat_bias=use_cat_bias,
            cat_embed_activation=cat_embed_activation,
            shared_embed=False,
            add_shared_embed=None,
            frac_shared_embed=None,
            continuous_cols=continuous_cols,
            cont_norm_layer=cont_norm_layer,
            embed_continuous_method=embed_continuous_method,
            cont_embed_dropout=cont_embed_dropout,
            cont_embed_activation=cont_embed_activation,
            quantization_setup=quantization_setup,
            n_frequencies=n_frequencies,
            sigma=sigma,
            share_last_layer=share_last_layer,
            full_embed_dropout=full_embed_dropout,
        )

        self.reduce_sum = reduce_sum

        self.mlp_hidden_dims = mlp_hidden_dims
        self.mlp_activation = mlp_activation
        self.mlp_dropout = mlp_dropout
        self.mlp_batchnorm = mlp_batchnorm
        self.mlp_batchnorm_last = mlp_batchnorm_last
        self.mlp_linear_first = mlp_linear_first

        if self.mlp_hidden_dims is not None:

            if self.mlp_hidden_dims[-1] != self.input_dim:
                d_hidden = (
                    [self.input_dim * len(self.column_idx)]
                    + self.mlp_hidden_dims
                    + [self.input_dim]
                )
            else:
                d_hidden = [
                    self.input_dim * len(self.column_idx)
                ] + self.mlp_hidden_dims

            self.mlp = MLP(
                d_hidden=d_hidden,
                activation=(
                    "relu" if self.mlp_activation is None else self.mlp_activation
                ),
                dropout=0.0 if self.mlp_dropout is None else self.mlp_dropout,
                batchnorm=False if self.mlp_batchnorm is None else self.mlp_batchnorm,
                batchnorm_last=(
                    False
                    if self.mlp_batchnorm_last is None
                    else self.mlp_batchnorm_last
                ),
                linear_first=(
                    False if self.mlp_linear_first is None else self.mlp_linear_first
                ),
            )
        else:
            self.mlp = None

    def forward(self, X: Tensor) -> Tensor:
        embed = self._get_embeddings(X)
        fm_output = factorization_machine(embed, self.reduce_sum)

        if self.mlp is None:
            return fm_output

        mlp_input = embed.view(embed.size(0), -1)
        mlp_output = self.mlp(mlp_input)

        if self.reduce_sum:
            mlp_output = mlp_output.sum(1, keepdim=True)

        return fm_output + mlp_output

    @property
    def output_dim(self) -> int:
        if self.reduce_sum:
            return 1
        else:
            return self.input_dim

DeepFieldAwareFactorizationMachine

Bases: BaseTabularModelWithAttention

Deep Field Aware Factorization Machine (DeepFFM) for recommendation systems. Adaptation of the paper 'Field-aware Factorization Machines in a Real-world Online Advertising System', Juan et al. 2017.

This class implements only the 'Deep' component of the model described in the paper. The linear component is not implemented 'internally' and, if one wants to include it, it can be easily added using the 'wide'/linear component in this library. See the examples in the examples folder.

Note that in this case only categorical features are accepted. This is because the embeddings of each feature are learned using all other features, so all embeddings have to be of the same nature, which would no longer hold if categorical and continuous features were mixed.
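As the source code further down shows, the model builds one element-wise interaction per pair of columns/fields, so the flattened interaction size (and hence the MLP input size when mlp_hidden_dims is not None) grows quadratically with the number of columns. A quick back-of-the-envelope sketch with made-up numbers:

# pairwise (field i, field j) interactions and the resulting flattened
# dimension fed to the MLP (when an MLP is used)
n_features, num_factors = 4, 8
n_interactions = n_features * (n_features - 1) // 2  # 6 field pairs
mlp_input_dim = n_interactions * num_factors          # 6 * 8 = 48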

Parameters:

column_idx : Dict[str, int], required
    Dictionary mapping column names to their corresponding index.
num_factors : int, required
    Number of factors for the factorization machine.
reduce_sum : bool, default=True
    Whether to reduce the sum in the factorization machine output.
cat_embed_input : List[Tuple[str, int]], required
    List of tuples with categorical column names and number of unique values.
cat_embed_dropout : Optional[float], default=None
    Categorical embeddings dropout. If None, it will default to 0.
use_cat_bias : Optional[bool], default=None
    Boolean indicating if bias will be used for the categorical embeddings. If None, it will default to False.
cat_embed_activation : Optional[str], default=None
    Activation function for the categorical embeddings, if any. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported.
mlp_hidden_dims : Optional[List[int]], default=None
    List with the number of neurons per dense layer in the MLP.
mlp_activation : Optional[str], default=None
    Activation function for the dense layers of the MLP. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported.
mlp_dropout : Optional[float], default=None
    Float or list of floats with the dropout between the dense layers, e.g. [0.5, 0.5].
mlp_batchnorm : Optional[bool], default=None
    Boolean indicating whether batch normalization will be applied to the dense layers.
mlp_batchnorm_last : Optional[bool], default=None
    Boolean indicating whether batch normalization will be applied to the last of the dense layers.
mlp_linear_first : Optional[bool], default=None
    Boolean indicating the order of the operations in the dense layer. If True: [LIN -> ACT -> BN -> DP]. If False: [BN -> DP -> LIN -> ACT].

Attributes:

n_features : int
    Number of unique features/columns.
n_tokens : int
    Number of unique values (tokens) in the full dataset (corpus).
encoders : nn.ModuleList
    List of BaseTabularModelWithAttention instances, one per categorical column.
mlp : nn.Module
    Multi-layer perceptron. If None, the output will be the output of the factorization machine (i.e. the sum of the interactions).

Examples:

>>> import torch
>>> from torch import Tensor
>>> from typing import Dict, List, Tuple
>>> from pytorch_widedeep.models.rec import DeepFieldAwareFactorizationMachine
>>> X = torch.randint(0, 10, (16, 2))
>>> column_idx: Dict[str, int] = {"col1": 0, "col2": 1}
>>> cat_embed_input: List[Tuple[str, int]] = [("col1", 10), ("col2", 10)]
>>> ffm = DeepFieldAwareFactorizationMachine(
...     column_idx=column_idx,
...     num_factors=4,
...     cat_embed_input=cat_embed_input,
...     mlp_hidden_dims=[16, 8]
... )
>>> output = ffm(X)
Source code in pytorch_widedeep/models/rec/deepffm.py
class DeepFieldAwareFactorizationMachine(BaseTabularModelWithAttention):
    """
    Deep Field Aware Factorization Machine (DeepFFM) for recommendation
    systems. Adaptation of the paper 'Field-aware Factorization Machines in a
    Real-world Online Advertising System', Juan et al. 2017.

    This class implements only the 'Deep' component of the model described in
    the paper. The linear component is not implemented 'internally' and, if
    one wants to include it, it can be easily added using the 'wide'/linear
    component in this library. See the examples in the examples folder.

    Note that in this case, only categorical features are accepted. This is
    because the embeddings of each feature will be learned using all other
    features. Therefore these embeddings have to be all of the same nature.
    This does not occur if we mix categorical and continuous features.

    Parameters
    ----------
    column_idx : Dict[str, int]
        Dictionary mapping column names to their corresponding index.
    num_factors : int
        Number of factors for the factorization machine.
    reduce_sum : bool, default=True
        Whether to reduce the sum in the factorization machine output.
    cat_embed_input : Optional[List[Tuple[str, int]]], default=None
        List of tuples with categorical column names and number of unique values.
    cat_embed_dropout : Optional[float], default=None
        Categorical embeddings dropout. If `None`, it will default
        to 0.
    use_cat_bias : Optional[bool], default=None
        Boolean indicating if bias will be used for the categorical embeddings.
        If `None`, it will default to 'False'.
    cat_embed_activation : Optional[str], default=None
        Activation function for the categorical embeddings, if any. Currently
        _'tanh'_, _'relu'_, _'leaky_relu'_ and _'gelu'_ are supported
    mlp_hidden_dims: List, default = [200, 100]
        List with the number of neurons per dense layer in the mlp.
    mlp_activation: str, default = "relu"
        Activation function for the dense layers of the MLP. Currently
        _'tanh'_, _'relu'_, _'leaky_relu'_ and _'gelu'_ are supported
    mlp_dropout: float or List, default = 0.1
        float or List of floats with the dropout between the dense layers.
        e.g: _[0.5,0.5]_
    mlp_batchnorm: bool, default = False
        Boolean indicating whether or not batch normalization will be applied
        to the dense layers
    mlp_batchnorm_last: bool, default = False
        Boolean indicating whether or not batch normalization will be applied
        to the last of the dense layers
    mlp_linear_first: bool, default = False
        Boolean indicating the order of the operations in the dense
        layer. If `True: [LIN -> ACT -> BN -> DP]`. If `False: [BN -> DP ->
        LIN -> ACT]`

    Attributes
    ----------
    n_features: int
        Number of unique features/columns
    n_tokens: int
        Number of unique values (tokens) in the full dataset (corpus)
    encoders: nn.ModuleList
        List of `BaseTabularModelWithAttention` instances. One per categorical
        column
    mlp: nn.Module
        Multi-layer perceptron. If `None` the output will be the output of the
        factorization machine (i.e. the sum of the interactions)

    Examples
    --------
    >>> import torch
    >>> from torch import Tensor
    >>> from typing import Dict, List, Tuple
    >>> from pytorch_widedeep.models.rec import DeepFieldAwareFactorizationMachine
    >>> X = torch.randint(0, 10, (16, 2))
    >>> column_idx: Dict[str, int] = {"col1": 0, "col2": 1}
    >>> cat_embed_input: List[Tuple[str, int]] = [("col1", 10), ("col2", 10)]
    >>> ffm = DeepFieldAwareFactorizationMachine(
    ...     column_idx=column_idx,
    ...     num_factors=4,
    ...     cat_embed_input=cat_embed_input,
    ...     mlp_hidden_dims=[16, 8]
    ... )
    >>> output = ffm(X)
    """

    def __init__(
        self,
        *,
        column_idx: Dict[str, int],
        num_factors: int,
        cat_embed_input: List[Tuple[str, int]],
        reduce_sum: bool = True,
        cat_embed_dropout: Optional[float] = None,
        use_cat_bias: Optional[bool] = None,
        cat_embed_activation: Optional[str] = None,
        mlp_hidden_dims: Optional[List[int]] = None,
        mlp_activation: Optional[str] = None,
        mlp_dropout: Optional[float] = None,
        mlp_batchnorm: Optional[bool] = None,
        mlp_batchnorm_last: Optional[bool] = None,
        mlp_linear_first: Optional[bool] = None,
    ):
        super(DeepFieldAwareFactorizationMachine, self).__init__(
            column_idx=column_idx,
            input_dim=num_factors,
            cat_embed_input=cat_embed_input,
            cat_embed_dropout=cat_embed_dropout,
            use_cat_bias=use_cat_bias,
            cat_embed_activation=cat_embed_activation,
            shared_embed=False,
            add_shared_embed=None,
            frac_shared_embed=None,
            continuous_cols=None,
            cont_norm_layer=None,
            embed_continuous_method=None,
            cont_embed_dropout=None,
            cont_embed_activation=None,
            quantization_setup=None,
            n_frequencies=None,
            sigma=None,
            share_last_layer=None,
            full_embed_dropout=None,
        )

        self.reduce_sum = reduce_sum

        self.mlp_hidden_dims = mlp_hidden_dims
        self.mlp_activation = mlp_activation
        self.mlp_dropout = mlp_dropout
        self.mlp_batchnorm = mlp_batchnorm
        self.mlp_batchnorm_last = mlp_batchnorm_last
        self.mlp_linear_first = mlp_linear_first

        self.n_features = len(self.column_idx)
        self.n_tokens = sum([ei[1] for ei in cat_embed_input])

        self.encoders = nn.ModuleList(
            [
                BaseTabularModelWithAttention(**config)
                for config in self._get_encoder_configs()
            ]
        )

        if self.mlp_hidden_dims is not None:
            d_hidden = [
                self.n_features * (self.n_features - 1) // 2 * num_factors
            ] + self.mlp_hidden_dims
            self.mlp = MLP(
                d_hidden=d_hidden,
                activation=(
                    "relu" if self.mlp_activation is None else self.mlp_activation
                ),
                dropout=0.0 if self.mlp_dropout is None else self.mlp_dropout,
                batchnorm=False if self.mlp_batchnorm is None else self.mlp_batchnorm,
                batchnorm_last=(
                    False
                    if self.mlp_batchnorm_last is None
                    else self.mlp_batchnorm_last
                ),
                linear_first=(
                    False if self.mlp_linear_first is None else self.mlp_linear_first
                ),
            )
        else:
            self.mlp = None

    def forward(self, X: Tensor) -> Tensor:

        interactions_l: List[Tensor] = []
        for i in range(len(self.column_idx)):
            for j in range(i + 1, len(self.column_idx)):
                # the syntax [i] and [j] is to keep the shape of the tensors
                # as they are sliced within '_get_embeddings'. This will
                # return a tensor of shape (b, 1, embed_dim). Then it has to
                # be squeezed to (b, embed_dim)  before multiplied
                embed_i = self.encoders[i]._get_embeddings(X[:, [i]]).squeeze(1)
                embed_j = self.encoders[j]._get_embeddings(X[:, [j]]).squeeze(1)
                interactions_l.append(embed_i * embed_j)

        interactions = torch.cat(interactions_l, dim=1)

        if self.mlp is not None:
            interactions = interactions.view(X.size(0), -1)
            deep_out = self.mlp(interactions)
        else:
            deep_out = interactions

        if self.reduce_sum:
            deep_out = deep_out.sum(dim=1, keepdim=True)

        return deep_out

    def _get_encoder_configs(self) -> List[Dict[str, Any]]:
        config: List[Dict[str, Any]] = []
        for col, _ in self.column_idx.items():
            cat_embed_input = [(col, self.n_tokens)]
            _config = {
                "column_idx": {col: 0},
                "input_dim": self.input_dim,
                "cat_embed_input": cat_embed_input,
                "cat_embed_dropout": self.cat_embed_dropout,
                "use_cat_bias": self.use_cat_bias,
                "cat_embed_activation": self.cat_embed_activation,
                "shared_embed": None,
                "add_shared_embed": None,
                "frac_shared_embed": None,
                "continuous_cols": None,
                "cont_norm_layer": None,
                "embed_continuous_method": None,
                "cont_embed_dropout": None,
                "cont_embed_activation": None,
                "quantization_setup": None,
                "n_frequencies": None,
                "sigma": None,
                "share_last_layer": None,
                "full_embed_dropout": None,
            }

            config.append(_config)

        return config

    @property
    def output_dim(self) -> int:
        if self.reduce_sum:
            return 1
        elif self.mlp_hidden_dims is not None:
            return self.mlp_hidden_dims[-1]
        else:
            return self.n_features * (self.n_features - 1) // 2 * self.input_dim

DeepInterestNetwork

Bases: BaseWDModelComponent

Adaptation of the Deep Interest Network (DIN) for recommendation systems as described in the paper: 'Deep Interest Network for Click-Through Rate Prediction' by Guorui Zhou et al. 2018.

Note that all the categorical- and continuous-related parameters refer to the categorical and continuous columns that are not part of the sequential columns and will be treated as standard tabular data.

This model requires some specific data preparation that allows for quite a lot of flexibility. Therefore, I have included a preprocessor (DINPreprocessor) in the preprocessing module that will take care of the data preparation.

Parameters:

column_idx : Dict[str, int], required
    Dictionary mapping column names to their corresponding index.
target_item_col : str, default='target_item'
    Name of the target item column. Note that this is not the target column. This algorithm relies on a sequence representation of interactions. The target item would be the next item in the sequence of interactions (e.g. the 6th item in a sequence of 5 items), and the goal is to predict a given action on it.
user_behavior_confiq : Tuple[List[str], int, int], required
    Configuration for the user behavior sequence columns. Tuple containing:
    - List of column names that correspond to the user behavior sequence
    - Number of unique feature values (n_tokens)
    - Embedding dimension
    Example: (["item_1", "item_2", "item_3"], 5, 8)
action_seq_config : Optional[Tuple[List[str], int]], default=None
    Configuration for the so-called action sequence columns (for example a rating, or purchased/not-purchased, etc.). Tuple containing:
    - List of column names
    - Number of unique feature values (n_tokens)
    This action will always be learned as a 1d embedding and will be combined with the user behaviour. For example, if the action is purchased/not-purchased, then per item in the user behaviour sequence there will be a binary action to learn (0/1). Such action will be represented by a float number that multiplies the corresponding item embedding in the user behaviour sequence.
    Example: (["rating_1", "rating_2", "rating_3"], 5). See also the sketch after this parameter list.
    Internally, the embedding dimension will be set to 1.
other_seq_cols_confiq : Optional[List[Tuple[List[str], int, int]]], default=None
    Configuration for other sequential columns. List of tuples containing:
    - List of column names that correspond to the sequential column
    - Number of unique feature values (n_tokens)
    - Embedding dimension
    Example: [(["seq1_col1", "seq1_col2"], 5, 8), (["seq2_col1", "seq2_col2"], 5, 8)]
attention_unit_activation : Literal['prelu', 'dice'], default='prelu'
    Activation function to use in the attention unit.
cat_embed_input : Optional[List[Tuple[str, int, int]]], default=None
    Configuration for other (non-sequential) columns. List of tuples containing:
    - Column name
    - Number of unique feature values (n_tokens)
    - Embedding dimension
    Note: from here onwards, the remaining parameters refer to the categorical and continuous columns that are not part of the sequential columns and will be treated as standard tabular data.
cat_embed_dropout : Optional[float], default=None
    Categorical embeddings dropout. If None, it will default to 0.
use_cat_bias : Optional[bool], default=None
    Boolean indicating if bias will be used for the categorical embeddings. If None, it will default to False.
cat_embed_activation : Optional[str], default=None
    Activation function for the categorical embeddings, if any. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported.
continuous_cols : Optional[List[str]], default=None
    List with the names of the numeric (aka continuous) columns.
cont_norm_layer : Optional[Literal['batchnorm', 'layernorm']], default=None
    Type of normalization layer applied to the continuous features. Options are 'layernorm' and 'batchnorm'. If None, no normalization layer will be used.
embed_continuous : Optional[bool], default=None
    Boolean indicating if the continuous columns will be embedded using one of the available methods: 'standard', 'periodic' or 'piecewise'. If None, it will default to False.
    NOTE: this parameter is deprecated and will be removed in future releases. Please use the embed_continuous_method parameter instead.
embed_continuous_method : Optional[Literal['piecewise', 'periodic', 'standard']], default=None
    Method used to embed the continuous features. Options are 'standard', 'periodic' or 'piecewise'. The 'standard' embedding method is based on the FT-Transformer implementation presented in the paper Revisiting Deep Learning Models for Tabular Data. The 'periodic' and 'piecewise' methods were presented in the paper On Embeddings for Numerical Features in Tabular Deep Learning. Please read the papers for details.
cont_embed_dim : Optional[int], default=None
    Size of the continuous embeddings. If the continuous columns are embedded, cont_embed_dim must be passed.
cont_embed_dropout : Optional[float], default=None
    Dropout for the continuous embeddings. If None, it will default to 0.0.
cont_embed_activation : Optional[str], default=None
    Activation function for the continuous embeddings, if any. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported. If None, no activation function will be applied.
quantization_setup : Optional[Dict[str, List[float]]], default=None
    Used when the 'piecewise' method is chosen to embed the continuous columns. It is a dict where keys are the names of the continuous columns and values are lists with the boundaries for the quantization of those columns. See the examples for details. If the 'piecewise' method is used, this parameter is required.
n_frequencies : Optional[int], default=None
    The so-called 'k' in the paper On Embeddings for Numerical Features in Tabular Deep Learning, i.e. the number of 'frequencies' used to represent each continuous column. See Eq. 2 in the paper for details. If the 'periodic' method is used, this parameter is required.
sigma : Optional[float], default=None
    The sigma parameter in the paper mentioned above, used to initialise the 'frequency weights'. See Eq. 2 in the paper for details. If the 'periodic' method is used, this parameter is required.
share_last_layer : Optional[bool], default=None
    Not present in the aforementioned paper, but implemented in the official repo. If True, the linear layer that turns the frequencies into embeddings is shared across the continuous columns. If False, a different linear layer is used for each continuous column. If the 'periodic' method is used, this parameter is required.
full_embed_dropout : Optional[bool], default=None
    If True, the full embedding corresponding to a column will be masked out/dropped out. If None, it will default to False.
mlp_hidden_dims : Optional[List[int]], default=None
    List with the number of neurons per dense layer in the MLP.
mlp_activation : Optional[str], default=None
    Activation function for the dense layers of the MLP. Currently 'tanh', 'relu', 'leaky_relu', 'gelu' and 'preglu' are supported.
mlp_dropout : Optional[float], default=None
    Float or list of floats with the dropout between the dense layers, e.g. [0.5, 0.5].
mlp_batchnorm : Optional[bool], default=None
    Boolean indicating whether batch normalization will be applied to the dense layers.
mlp_batchnorm_last : Optional[bool], default=None
    Boolean indicating whether batch normalization will be applied to the last of the dense layers.
mlp_linear_first : Optional[bool], default=None
    Boolean indicating the order of the operations in the dense layer. If True: [LIN -> ACT -> BN -> DP]. If False: [BN -> DP -> LIN -> ACT].
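As a hedged, illustrative sketch of the action_seq_config parameter (column names, vocabulary sizes and embedding dimensions below are made up and not taken from the library's examples), an action/rating sequence aligned with the user behavior sequence could be configured like this:

from pytorch_widedeep.models.rec import DeepInterestNetwork

column_idx = {
    "user_id": 0, "target_item": 1, "context": 2,
    "item_1": 3, "item_2": 4, "item_3": 5,
    "rating_1": 6, "rating_2": 7, "rating_3": 8,
}
user_behavior_confiq = (["item_1", "item_2", "item_3"], 5, 8)
# one action column per item in the behavior sequence,
# e.g. purchased / not purchased (2 possible values)
action_seq_config = (["rating_1", "rating_2", "rating_3"], 2)

model = DeepInterestNetwork(
    column_idx=column_idx,
    target_item_col="target_item",
    user_behavior_confiq=user_behavior_confiq,
    action_seq_config=action_seq_config,
    cat_embed_input=[("user_id", 10, 8), ("context", 3, 4)],
    mlp_hidden_dims=[16, 8],
)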

Attributes:

user_behavior_indexes : List[int]
    List with the indexes of the user behavior columns.
user_behavior_embed : BaseTabularModelWithAttention
    Embedding layer for the user behavior columns.
action_seq_indexes : List[int]
    List with the indexes of the rating sequence columns if the action_seq_config parameter is not None.
action_embed : BaseTabularModelWithAttention
    Embedding layer for the rating sequence columns if the action_seq_config parameter is not None.
other_seq_cols_indexes : Dict[str, List[int]]
    Dictionary with the indexes of the other sequential columns if the other_seq_cols_confiq parameter is not None.
other_seq_cols_embed : nn.ModuleDict
    Dictionary with the embedding layers for the other sequential columns if the other_seq_cols_confiq parameter is not None.
other_cols_idx : List[int]
    List with the indexes of the other columns if the other_cols_config parameter is not None.
other_col_embed : BaseTabularModel
    Embedding layer for the other columns if the other_cols_config parameter is not None.
mlp : Optional[MLP]
    MLP component of the model. If None, no MLP will be used. This should almost always be not None.

Examples:

>>> import torch
>>> import numpy as np
>>> from torch import Tensor
>>> from typing import Dict, List, Tuple
>>> from pytorch_widedeep.models.rec import DeepInterestNetwork
>>> np_seed = np.random.seed(42)
>>> torch_seed = torch.manual_seed(42)
>>> num_users = 10
>>> num_items = 5
>>> num_contexts = 3
>>> seq_length = 3
>>> num_samples = 10
>>> user_ids = np.random.randint(0, num_users, num_samples)
>>> target_item_ids = np.random.randint(0, num_items, num_samples)
>>> context_ids = np.random.randint(0, num_contexts, num_samples)
>>> user_behavior = np.array(
...     [
...         np.random.choice(num_items, seq_length, replace=False)
...         for _ in range(num_samples)
...     ]
... )
>>> X_arr = np.column_stack((user_ids, target_item_ids, context_ids, user_behavior))
>>> X = torch.tensor(X_arr, dtype=torch.long)
>>> column_idx: Dict[str, int] = {
...     "user_id": 0,
...     "target_item": 1,
...     "context": 2,
...     "item_1": 3,
...     "item_2": 4,
...     "item_3": 5,
... }
>>> user_behavior_config: Tuple[List[str], int, int] = (
...     ["item_1", "item_2", "item_3"],
...     num_items,
...     8,
... )
>>> cat_embed_input: List[Tuple[str, int, int]] = [
...     ("user_id", num_users, 8),
...     ("context", num_contexts, 4),
... ]
>>> model = DeepInterestNetwork(
...     column_idx=column_idx,
...     target_item_col="target_item",
...     user_behavior_confiq=user_behavior_config,
...     cat_embed_input=cat_embed_input,
...     mlp_hidden_dims=[16, 8],
... )
>>> output = model(X)
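Building on the example above, the following is a hedged sketch of how such a model would typically be trained, assuming the library's standard WideDeep + Trainer workflow applies to this component as it does to the other tabular models. The binary target is random and purely illustrative.

import numpy as np
from pytorch_widedeep import Trainer
from pytorch_widedeep.models import WideDeep

y = np.random.randint(0, 2, num_samples)  # toy binary target
wd_model = WideDeep(deeptabular=model)     # the DIN acts as the 'deeptabular' component
trainer = Trainer(wd_model, objective="binary")
trainer.fit(X_tab=X_arr, target=y, n_epochs=1, batch_size=5)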
Source code in pytorch_widedeep/models/rec/din.py
class DeepInterestNetwork(BaseWDModelComponent):
    """
    Adaptation of the Deep Interest Network (DIN) for recommendation systems
    as described in the paper: 'Deep Interest Network for Click-Through Rate
    Prediction' by Guorui Zhou et al. 2018.

    Note that all the categorical- and continuous-related parameters refer to
    the categorical and continuous columns that are not part of the
    sequential columns and will be treated as standard tabular data.

    This model requires some specific data preparation that allows for quite a
    lot of flexibility. Therefore, I have included a preprocessor
    (`DINPreprocessor`) in the preprocessing module that will take care of
    the data preparation.

    Parameters
    ----------
    column_idx : Dict[str, int]
        Dictionary mapping column names to their corresponding index.
    target_item_col : str
        Name of the target item column. Note that this is not the target
        column. This algorithm relies on a sequence representation of
        interactions. The target item would be the next item in the sequence
        of interactions (e.g. item 6th in a sequence of 5 items), and our
        goal is to predict a given action on it.
    user_behavior_confiq : Tuple[List[str], int, int]
        Configuration for user behavior sequence columns. Tuple containing:
        - List of column names that correspond to the user behavior sequence<br/>
        - Number of unique feature values (n_tokens)<br/>
        - Embedding dimension<br/>
        Example: `(["item_1", "item_2", "item_3"], 5, 8)`
    action_seq_config : Optional[Tuple[List[str], int]], default=None
        Configuration for a so-called action sequence columns (for example a
        rating, or purchased/not-purchased, etc). Tuple containing:<br/>
        - List of column names<br/>
        - Number of unique feature values (n_tokens)<br/>
        This action will **always** be learned as a 1d embedding and will be
        combined with the user behaviour. For example, imagine that the
        action is purchased/not-purchased. then per item in the user
        behaviour sequence there will be a binary action to learn 0/1. Such
        action will be represented by a float number that will multiply the
        corresponding item embedding in the user behaviour sequence.<br/>
        Example: `(["rating_1", "rating_2", "rating_3"], 5)`<br/>
        Internally, the embedding dimension will be set to 1
    other_seq_cols_confiq : Optional[List[Tuple[List[str], int, int]]], default=None
        Configuration for other sequential columns. List of tuples containing:<br/>
        - List of column names that correspond to the sequential column<br/>
        - Number of unique feature values (n_tokens)<br/>
        - Embedding dimension<br/>
        Example: `[(["seq1_col1", "seq1_col2"], 5, 8), (["seq2_col1", "seq2_col2"], 5, 8)]`
    attention_unit_activation : Literal["prelu", "dice"], default="prelu"
        Activation function to use in the attention unit.
    cat_embed_input : Optional[List[Tuple[str, int, int]]], default=None
        Configuration for other columns. List of tuples containing:<br/>
        - Column name<br/>
        - Number of unique feature values (n_tokens)<br/>
        - Embedding dimension<br/>

        **Note**: From here in advance the remaining parameters are related to
        the categorical and continuous columns that are not part of the
        sequential columns and will be treated as standard tabular data.

    cat_embed_dropout : Optional[float], default=None
        Categorical embeddings dropout. If `None`, it will default
        to 0.
    use_cat_bias : Optional[bool], default=None
        Boolean indicating if bias will be used for the categorical embeddings.
        If `None`, it will default to 'False'.
    cat_embed_activation : Optional[str], default=None
        Activation function for the categorical embeddings, if any. Currently
        _'tanh'_, _'relu'_, _'leaky_relu'_ and _'gelu'_ are supported
    continuous_cols : Optional[List[str]], default=None
        List with the name of the numeric (aka continuous) columns
    cont_norm_layer : Optional[Literal["batchnorm", "layernorm"]], default=None
        Type of normalization layer applied to the continuous features.
        Options are: _'layernorm'_ and _'batchnorm'_. if `None`, no
        normalization layer will be used.
    embed_continuous : Optional[bool], default=None
        Boolean indicating if the continuous columns will be embedded using
        one of the available methods: _'standard'_, _'periodic'_
        or _'piecewise'_. If `None`, it will default to 'False'.<br/>
        :information_source: **NOTE**: This parameter is deprecated and it
         will be removed in future releases. Please, use the
         `embed_continuous_method` parameter instead.
    embed_continuous_method : Optional[Literal["piecewise", "periodic", "standard"]], default=None
        Method to use to embed the continuous features. Options are:
        _'standard'_, _'periodic'_ or _'piecewise'_. The _'standard'_
        embedding method is based on the FT-Transformer implementation
        presented in the paper: [Revisiting Deep Learning Models for
        Tabular Data](https://arxiv.org/abs/2106.11959v5). The _'periodic'_
        and_'piecewise'_ methods were presented in the paper: [On Embeddings for
        Numerical Features in Tabular Deep Learning](https://arxiv.org/abs/2203.05556).
        Please, read the papers for details.
    cont_embed_dim : Optional[int], default=None
        Size of the continuous embeddings. If the continuous columns are
        embedded, `cont_embed_dim` must be passed.
    cont_embed_dropout : Optional[float], default=None
        Dropout for the continuous embeddings. If `None`, it will default to 0.0
    cont_embed_activation : Optional[str], default=None
        Activation function for the continuous embeddings if any. Currently
        _'tanh'_, _'relu'_, _'leaky_relu'_ and _'gelu'_ are supported.
        If `None`, no activation function will be applied.
    quantization_setup : Optional[Dict[str, List[float]]], default=None
        This parameter is used when the _'piecewise'_ method is used to embed
        the continuous cols. It is a dict where keys are the name of the continuous
        columns and values are lists with the boundaries for the quantization
        of the continuous_cols. See the examples for details. If
        If the _'piecewise'_ method is used, this parameter is required.
    n_frequencies : Optional[int], default=None
        This is the so called _'k'_ in their paper [On Embeddings for
        Numerical Features in Tabular Deep Learning](https://arxiv.org/abs/2203.05556),
        and is the number of 'frequencies' that will be used to represent each
        continuous column. See their Eq 2 in the paper for details. If
        the _'periodic'_ method is used, this parameter is required.
    sigma : Optional[float], default=None
        This is the sigma parameter in the paper mentioned when describing the
        previous parameters and it is used to initialise the 'frequency
        weights'. See their Eq 2 in the paper for details. If
        the _'periodic'_ method is used, this parameter is required.
    share_last_layer : Optional[bool], default=None
        This parameter is not present in the before mentioned paper but it is implemented in
        the [official repo](https://github.com/yandex-research/rtdl-num-embeddings/tree/main).
        If `True` the linear layer that turns the frequencies into embeddings
        will be shared across the continuous columns. If `False` a different
        linear layer will be used for each continuous column.
        If the _'periodic'_ method is used, this parameter is required.
    full_embed_dropout: bool, Optional, default = None,
        If `True`, the full embedding corresponding to a column will be masked
        out/dropout. If `None`, it will default to `False`.
    mlp_hidden_dims: List, default = [200, 100]
        List with the number of neurons per dense layer in the mlp.
    mlp_activation: str, default = "relu"
        Activation function for the dense layers of the MLP. Currently
        _'tanh'_, _'relu'_, _'leaky_relu'_, _'gelu'_ and _'preglu'_ are
          supported
    mlp_dropout: float or List, default = 0.1
        float or List of floats with the dropout between the dense layers.
        e.g: _[0.5,0.5]_
    mlp_batchnorm: bool, default = False
        Boolean indicating whether or not batch normalization will be applied
        to the dense layers
    mlp_batchnorm_last: bool, default = False
        Boolean indicating whether or not batch normalization will be applied
        to the last of the dense layers
    mlp_linear_first: bool, default = False
        Boolean indicating the order of the operations in the dense
        layer. If `True: [LIN -> ACT -> BN -> DP]`. If `False: [BN -> DP ->
        LIN -> ACT]`

    Attributes
    ----------
    user_behavior_indexes: List[int]
        List with the indexes of the user behavior columns
    user_behavior_embed: BaseTabularModelWithAttention
        Embedding layer for the user
    action_seq_indexes: List[int]
        List with the indexes of the rating sequence columns if the
        action_seq_config parameter is not None
    action_embed: BaseTabularModelWithAttention
        Embedding layer for the rating sequence columns if the
        action_seq_config parameter is not None
    other_seq_cols_indexes: Dict[str, List[int]]
        Dictionary with the indexes of the other sequential columns if the
        other_seq_cols_confiq parameter is not None
    other_seq_cols_embed: nn.ModuleDict
        Dictionary with the embedding layers for the other sequential columns
        if the other_seq_cols_confiq parameter is not None
    other_cols_idx: List[int]
        List with the indexes of the other columns if the other_cols_config
        parameter is not None
    other_col_embed: BaseTabularModel
        Embedding layer for the other columns if the other_cols_config
        parameter is not None
    mlp: Optional[MLP]
        MLP component of the model. If None, no MLP will be used. This should
        almost always be not None.

    Examples
    --------
    >>> import torch
    >>> import numpy as np
    >>> from torch import Tensor
    >>> from typing import Dict, List, Tuple
    >>> from pytorch_widedeep.models.rec import DeepInterestNetwork
    >>> np_seed = np.random.seed(42)
    >>> torch_seed = torch.manual_seed(42)
    >>> num_users = 10
    >>> num_items = 5
    >>> num_contexts = 3
    >>> seq_length = 3
    >>> num_samples = 10
    >>> user_ids = np.random.randint(0, num_users, num_samples)
    >>> target_item_ids = np.random.randint(0, num_items, num_samples)
    >>> context_ids = np.random.randint(0, num_contexts, num_samples)
    >>> user_behavior = np.array(
    ...     [
    ...         np.random.choice(num_items, seq_length, replace=False)
    ...         for _ in range(num_samples)
    ...     ]
    ... )
    >>> X_arr = np.column_stack((user_ids, target_item_ids, context_ids, user_behavior))
    >>> X = torch.tensor(X_arr, dtype=torch.long)
    >>> column_idx: Dict[str, int] = {
    ...     "user_id": 0,
    ...     "target_item": 1,
    ...     "context": 2,
    ...     "item_1": 3,
    ...     "item_2": 4,
    ...     "item_3": 5,
    ... }
    >>> user_behavior_config: Tuple[List[str], int, int] = (
    ...     ["item_1", "item_2", "item_3"],
    ...     num_items,
    ...     8,
    ... )
    >>> cat_embed_input: List[Tuple[str, int, int]] = [
    ...     ("user_id", num_users, 8),
    ...     ("context", num_contexts, 4),
    ... ]
    >>> model = DeepInterestNetwork(
    ...     column_idx=column_idx,
    ...     target_item_col="target_item",
    ...     user_behavior_confiq=user_behavior_config,
    ...     cat_embed_input=cat_embed_input,
    ...     mlp_hidden_dims=[16, 8],
    ... )
    >>> output = model(X)
    """

    def __init__(
        self,
        *,
        column_idx: Dict[str, int],
        user_behavior_confiq: Tuple[List[str], int, int],
        target_item_col: str = "target_item",
        action_seq_config: Optional[Tuple[List[str], int]] = None,
        other_seq_cols_confiq: Optional[List[Tuple[List[str], int, int]]] = None,
        attention_unit_activation: Literal["prelu", "dice"] = "prelu",
        cat_embed_input: Optional[List[Tuple[str, int, int]]] = None,
        cat_embed_dropout: Optional[float] = None,
        use_cat_bias: Optional[bool] = None,
        cat_embed_activation: Optional[str] = None,
        continuous_cols: Optional[List[str]] = None,
        cont_norm_layer: Optional[Literal["batchnorm", "layernorm"]] = None,
        embed_continuous: Optional[bool] = None,
        embed_continuous_method: Optional[
            Literal["standard", "piecewise", "periodic"]
        ] = None,
        cont_embed_dim: Optional[int] = None,
        cont_embed_dropout: Optional[float] = None,
        cont_embed_activation: Optional[str] = None,
        quantization_setup: Optional[Dict[str, List[float]]] = None,
        n_frequencies: Optional[int] = None,
        sigma: Optional[float] = None,
        share_last_layer: Optional[bool] = None,
        full_embed_dropout: Optional[bool] = None,
        mlp_hidden_dims: Optional[List[int]] = None,
        mlp_activation: Optional[str] = None,
        mlp_dropout: Optional[float] = None,
        mlp_batchnorm: Optional[bool] = None,
        mlp_batchnorm_last: Optional[bool] = None,
        mlp_linear_first: Optional[bool] = None,
    ):
        super(DeepInterestNetwork, self).__init__()

        self.column_idx = {
            k: v for k, v in sorted(column_idx.items(), key=lambda x: x[1])
        }

        self.target_item_col = target_item_col
        self.user_behavior_confiq = user_behavior_confiq
        self.action_seq_config = (
            (action_seq_config[0], action_seq_config[1], 1)
            if action_seq_config is not None
            else None
        )
        self.other_seq_cols_confiq = other_seq_cols_confiq
        self.cat_embed_input = cat_embed_input
        self.attention_unit_activation = attention_unit_activation

        self.cat_embed_dropout = cat_embed_dropout
        self.use_cat_bias = use_cat_bias
        self.cat_embed_activation = cat_embed_activation
        self.continuous_cols = continuous_cols
        self.cont_norm_layer = cont_norm_layer
        self.embed_continuous = embed_continuous
        self.embed_continuous_method = embed_continuous_method
        self.cont_embed_dim = cont_embed_dim
        self.cont_embed_dropout = cont_embed_dropout
        self.cont_embed_activation = cont_embed_activation
        self.quantization_setup = quantization_setup
        self.n_frequencies = n_frequencies
        self.sigma = sigma
        self.share_last_layer = share_last_layer
        self.full_embed_dropout = full_embed_dropout

        self.mlp_hidden_dims = mlp_hidden_dims
        self.mlp_activation = mlp_activation
        self.mlp_dropout = mlp_dropout
        self.mlp_batchnorm = mlp_batchnorm
        self.mlp_batchnorm_last = mlp_batchnorm_last
        self.mlp_linear_first = mlp_linear_first

        self.target_item_idx = self.column_idx[target_item_col]

        self.user_behavior_indexes, self.user_behavior_embed = (
            self.set_user_behavior_indexes_and_embed(user_behavior_confiq)
        )
        self.user_behavior_dim = user_behavior_confiq[2]

        if self.action_seq_config is not None:
            self.action_seq_indexes, self.action_embed = (
                self._set_rating_indexes_and_embed(self.action_seq_config)
            )
        else:
            self.action_embed = None

        if self.other_seq_cols_confiq is not None:
            (
                self.other_seq_cols_indexes,
                self.other_seq_cols_embed,
                self.other_seq_dim,
            ) = self._set_other_seq_cols_indexes_embed_and_dim(
                self.other_seq_cols_confiq
            )
        else:
            self.other_seq_cols_embed = None
            self.other_seq_dim = 0

        if self.cat_embed_input is not None or self.continuous_cols is not None:
            self.other_cols_idx, self.other_col_embed, self.other_cols_dim = (
                self._set_other_cols_idx_embed_and_dim(
                    self.cat_embed_input, self.continuous_cols
                )
            )
        else:
            self.other_col_embed = None
            self.other_cols_dim = 0

        self.attention = ActivationUnit(
            user_behavior_confiq[2], attention_unit_activation
        )

        if self.mlp_hidden_dims is not None:
            mlp_input_dim = (
                self.user_behavior_dim * 2 + self.other_seq_dim + self.other_cols_dim
            )
            self.mlp = MLP(
                d_hidden=[mlp_input_dim] + self.mlp_hidden_dims,
                activation=(
                    "relu" if self.mlp_activation is None else self.mlp_activation
                ),
                dropout=0.0 if self.mlp_dropout is None else self.mlp_dropout,
                batchnorm=False if self.mlp_batchnorm is None else self.mlp_batchnorm,
                batchnorm_last=(
                    False
                    if self.mlp_batchnorm_last is None
                    else self.mlp_batchnorm_last
                ),
                linear_first=(
                    False if self.mlp_linear_first is None else self.mlp_linear_first
                ),
            )
        else:
            self.mlp = None

    def forward(self, X: Tensor) -> Tensor:

        X_target_item = X[:, [self.target_item_idx]]
        item_embed = self.user_behavior_embed.cat_embed.embed(X_target_item.long())

        X_user_behavior = X[:, self.user_behavior_indexes]
        user_behavior_embed = self.user_behavior_embed._get_embeddings(X_user_behavior)
        # 0 is the padding index
        mask = (X_user_behavior != 0).float().to(X.device)

        if self.action_embed is not None:
            X_rating = X[:, self.action_seq_indexes]
            action_embed = self.action_embed._get_embeddings(X_rating)
            user_behavior_embed = user_behavior_embed * action_embed

        attention_scores = self.attention(item_embed, user_behavior_embed)
        attention_scores = attention_scores * mask
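        # weighted sum of the (masked) behavior embeddings: the resulting user
        # interest representation has shape (batch_size, user_behavior_dim)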
        user_interest = (attention_scores.unsqueeze(-1) * user_behavior_embed).sum(1)

        deep_out = torch.cat([item_embed.squeeze(1), user_interest], dim=1)

        if self.other_seq_cols_embed is not None:
            X_other_seq: Dict[str, Tensor] = {
                col: X[:, idx] for col, idx in self.other_seq_cols_indexes.items()
            }
            other_seq_embed = torch.cat(
                [
                    self.other_seq_cols_embed[col]._get_embeddings(X_other_seq[col])
                    for col in self.other_seq_cols_indexes.keys()
                ],
                dim=-1,
            ).sum(1)
            deep_out = torch.cat([deep_out, other_seq_embed], dim=1)

        if self.other_col_embed is not None:
            X_other_cols = X[:, self.other_cols_idx]
            other_cols_embed = self.other_col_embed._get_embeddings(X_other_cols)
            deep_out = torch.cat([deep_out, other_cols_embed], dim=1)

        if self.mlp is not None:
            deep_out = self.mlp(deep_out)

        return deep_out

    @staticmethod
    def _get_seq_cols_embed_confiq(tup: Tuple[List[str], int, int]) -> Dict[str, Any]:
        # tup[0] is the list of columns
        # tup[1] is the number of unique feature values, or "n_tokens"
        # tup[2] is the embedding dimension

        # Once sliced, the indexes will go from 0 to len(tup[0]). It is
        # assumed that the columns in tup[0] are ordered by their order of
        # appearance in the input data
        column_idx = {col: i for i, col in enumerate(tup[0])}

        # This is a hack so that I can use any BaseTabularModelWithAttention.
        # For this model to work, 'cat_embed_input' is normally a List of
        # Tuples where the first element is the column name and the second is
        # the number of unique values for that column. That second element is
        # added up internally to compute what one could call "n_tokens". Here
        # I'm passing that value as the second element of the first tuple
        # and then adding 0s for the rest of the columns
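        # e.g. for tup = (["item_1", "item_2", "item_3"], 5, 8) this results in
        # cat_embed_input = [("item_1", 5), ("item_2", 0), ("item_3", 0)]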
        cat_embed_input = [(tup[0][0], tup[1])] + [(col, 0) for col in tup[0][1:]]

        input_dim = tup[2]

        col_config = {
            "column_idx": column_idx,
            "input_dim": input_dim,
            "cat_embed_input": cat_embed_input,
            "cat_embed_dropout": None,
            "use_cat_bias": None,
            "cat_embed_activation": None,
            "shared_embed": None,
            "add_shared_embed": None,
            "frac_shared_embed": None,
            "continuous_cols": None,
            "cont_norm_layer": None,
            "embed_continuous_method": None,
            "cont_embed_dropout": None,
            "cont_embed_activation": None,
            "quantization_setup": None,
            "n_frequencies": None,
            "sigma": None,
            "share_last_layer": None,
            "full_embed_dropout": None,
        }

        return col_config

    def _get_other_cols_embed_config(
        self,
        cat_embed_input: Optional[List[Tuple[str, int, int]]],
        continuous_cols: Optional[List[str]],
        column_idx: Dict[str, int],
    ) -> Dict[str, Any]:

        cols_config = {
            "column_idx": {col: i for i, col in enumerate(column_idx.keys())},
            "cat_embed_input": cat_embed_input,
            "cat_embed_dropout": self.cat_embed_dropout,
            "use_cat_bias": self.use_cat_bias,
            "cat_embed_activation": self.cat_embed_activation,
            "continuous_cols": continuous_cols,
            "cont_norm_layer": self.cont_norm_layer,
            "embed_continuous": self.embed_continuous,
            "embed_continuous_method": self.embed_continuous_method,
            "cont_embed_dim": self.cont_embed_dim,
            "cont_embed_dropout": self.cont_embed_dropout,
            "cont_embed_activation": self.cont_embed_activation,
            "quantization_setup": self.quantization_setup,
            "n_frequencies": self.n_frequencies,
            "sigma": self.sigma,
            "share_last_layer": self.share_last_layer,
            "full_embed_dropout": self.full_embed_dropout,
        }

        return cols_config

    def set_user_behavior_indexes_and_embed(
        self, user_behavior_confiq: Tuple[List[str], int, int]
    ) -> Tuple[List[int], BaseTabularModelWithAttention]:
        user_behavior_indexes = [
            self.column_idx[col] for col in user_behavior_confiq[0]
        ]
        user_behavior_embed = BaseTabularModelWithAttention(
            **self._get_seq_cols_embed_confiq(user_behavior_confiq)
        )
        return user_behavior_indexes, user_behavior_embed

    def _set_rating_indexes_and_embed(
        self, action_seq_config: Tuple[List[str], int, int]
    ) -> Tuple[List[int], BaseTabularModelWithAttention]:
        action_seq_indexes = [self.column_idx[col] for col in action_seq_config[0]]
        action_embed = BaseTabularModelWithAttention(
            **self._get_seq_cols_embed_confiq(action_seq_config)
        )
        return action_seq_indexes, action_embed

    def _set_other_seq_cols_indexes_embed_and_dim(
        self, other_seq_cols_confiq: List[Tuple[List[str], int, int]]
    ) -> Tuple[Dict[str, List[int]], nn.ModuleDict, int]:
        other_seq_cols_indexes: Dict[str, List[int]] = {}
        for i, el in enumerate(other_seq_cols_confiq):
            key = f"seq_{i}"
            idxs = [self.column_idx[col] for col in el[0]]
            other_seq_cols_indexes[key] = idxs
        other_seq_cols_config = {
            f"seq_{i}": self._get_seq_cols_embed_confiq(el)
            for i, el in enumerate(other_seq_cols_confiq)
        }
        other_seq_cols_embed = nn.ModuleDict(
            {
                key: BaseTabularModelWithAttention(**config)
                for key, config in other_seq_cols_config.items()
            }
        )
        other_seq_dim = sum([el[2] for el in other_seq_cols_confiq])

        return other_seq_cols_indexes, other_seq_cols_embed, other_seq_dim

    def _set_other_cols_idx_embed_and_dim(
        self,
        cat_embed_input: Optional[List[Tuple[str, int, int]]],
        continuous_cols: Optional[List[str]],
    ) -> Tuple[List[int], BaseTabularModelWithoutAttention, int]:
        other_cols_idx: Dict[str, int] = {}
        if cat_embed_input is not None:
            other_cols_idx = {
                col: self.column_idx[col] for col in [el[0] for el in cat_embed_input]
            }
        if continuous_cols is not None:
            other_cols_idx.update(
                {col: self.column_idx[col] for col in continuous_cols}
            )

        sorted_other_cols_idx = {
            k: v for k, v in sorted(other_cols_idx.items(), key=lambda x: x[1])
        }

        other_col_embed = BaseTabularModelWithoutAttention(
            **self._get_other_cols_embed_config(
                cat_embed_input, continuous_cols, sorted_other_cols_idx
            )
        )

        other_cols_dim = other_col_embed.output_dim

        return list(other_cols_idx.values()), other_col_embed, other_cols_dim

    @property
    def output_dim(self) -> int:
        if self.mlp_hidden_dims is not None:
            return self.mlp_hidden_dims[-1]
        else:
            return self.user_behavior_dim * 2 + self.other_seq_dim + self.other_cols_dim

ExtremeDeepFactorizationMachine

Bases: BaseTabularModelWithAttention

Adaptation of the xDeepFM model presented in 'xDeepFM: Combining Explicit and Implicit Feature Interactions for Recommender Systems' by Jianxun Lian, Xiaohuan Zhou, Fuzheng Zhang, Zhongxia Chen, Xing Xie, Guangzhong Sun and Enhong Chen, 2018

The implementation in this library takes advantage of all the functionalities available to encode categorical and continuous features. The model can be used with only the compressed interaction network (CIN) component, i.e. without the MLP.

Note that this class implements only the 'Deep' component of the model described in the paper. The linear component is not implemented 'internally' and, if one wants to include it, it can be easily added using the 'wide'/linear component in this library. See the examples in the examples folder.
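
For instance, a minimal sketch of combining this model with the library's Wide (linear) component via the WideDeep wrapper (sizes below are purely illustrative; in practice X_wide and X_tab would come from the corresponding preprocessors):

from pytorch_widedeep.models import Wide, WideDeep
from pytorch_widedeep.models.rec import ExtremeDeepFactorizationMachine

# illustrative sizes: a 10-feature wide/linear component plus the xDeepFM
# model used in the Examples section below as the deep tabular component
wide = Wide(input_dim=10, pred_dim=1)
xdeepfm = ExtremeDeepFactorizationMachine(
    column_idx={"col1": 0, "col2": 1},
    input_dim=4,
    cin_layer_dims=[8, 16],
    cat_embed_input=[("col1", 10), ("col2", 10)],
)
model = WideDeep(wide=wide, deeptabular=xdeepfm)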

Parameters:

Name Type Description Default
column_idx Dict[str, int]

Dictionary mapping column names to their corresponding index.

required
input_dim int

Embedding input dimensions

required
reduce_sum bool

Whether to reduce the sum in the factorization machine output.

True
cin_layer_dims List[int]

List with the number of units per CIN layer. e.g: [128, 64]

required
cat_embed_input Optional[List[Tuple[str, int]]]

List of tuples with categorical column names and number of unique values.

None
cat_embed_dropout Optional[float]

Categorical embeddings dropout. If None, it will default to 0.

None
use_cat_bias Optional[bool]

Boolean indicating if bias will be used for the categorical embeddings. If None, it will default to 'False'.

None
cat_embed_activation Optional[str]

Activation function for the categorical embeddings, if any. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported

None
continuous_cols Optional[List[str]]

List with the name of the numeric (aka continuous) columns

None
cont_norm_layer Optional[Literal['batchnorm', 'layernorm']]

Type of normalization layer applied to the continuous features. Options are: 'layernorm' and 'batchnorm'. if None, no normalization layer will be used.

None
embed_continuous_method Optional[Literal['piecewise', 'periodic', 'standard']]

Method to use to embed the continuous features. Options are: 'standard', 'periodic' or 'piecewise'. The 'standard' embedding method is based on the FT-Transformer implementation presented in the paper: Revisiting Deep Learning Models for Tabular Data. The 'periodic' and 'piecewise' methods were presented in the paper: On Embeddings for Numerical Features in Tabular Deep Learning. Please, read the papers for details.

'standard'
cont_embed_dropout Optional[float]

Dropout for the continuous embeddings. If None, it will default to 0.0

None
cont_embed_activation Optional[str]

Activation function for the continuous embeddings if any. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported. If None, no activation function will be applied.

None
quantization_setup Optional[Dict[str, List[float]]]

This parameter is used when the 'piecewise' method is used to embed the continuous cols. It is a dict where keys are the name of the continuous columns and values are lists with the boundaries for the quantization of the continuous_cols. See the examples for details. If the 'piecewise' method is used, this parameter is required.

None
n_frequencies Optional[int]

This is the so called 'k' in their paper On Embeddings for Numerical Features in Tabular Deep Learning, and is the number of 'frequencies' that will be used to represent each continuous column. See their Eq 2 in the paper for details. If the 'periodic' method is used, this parameter is required.

None
sigma Optional[float]

This is the sigma parameter in the paper mentioned when describing the previous parameters and it is used to initialise the 'frequency weights'. See their Eq 2 in the paper for details. If the 'periodic' method is used, this parameter is required.

None
share_last_layer Optional[bool]

This parameter is not present in the before mentioned paper but it is implemented in the official repo. If True the linear layer that turns the frequencies into embeddings will be shared across the continuous columns. If False a different linear layer will be used for each continuous column. If the 'periodic' method is used, this parameter is required.

None
full_embed_dropout Optional[bool]

If True, the full embedding corresponding to a column will be masked out/dropout. If None, it will default to False.

None
mlp_hidden_dims Optional[List[int]]

List with the number of neurons per dense layer in the mlp.

None
mlp_activation Optional[str]

Activation function for the dense layers of the MLP. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported

None
mlp_dropout Optional[float]

float or List of floats with the dropout between the dense layers. e.g: [0.5,0.5]

None
mlp_batchnorm Optional[bool]

Boolean indicating whether or not batch normalization will be applied to the dense layers

None
mlp_batchnorm_last Optional[bool]

Boolean indicating whether or not batch normalization will be applied to the last of the dense layers

None
mlp_linear_first Optional[bool]

Boolean indicating the order of the operations in the dense layer. If True: [LIN -> ACT -> BN -> DP]. If False: [BN -> DP -> LIN -> ACT]

None

Attributes:

Name Type Description
n_features int

Number of unique features/columns

cin CompressedInteractionNetwork

Instance of the CompressedInteractionNetwork class

mlp MLP

Instance of the MLP class if mlp_hidden_dims is not None. If None, the model will return directly the output of the CIN

Examples:

>>> import torch
>>> from pytorch_widedeep.models.rec import ExtremeDeepFactorizationMachine
>>> X_tab = torch.randint(0, 10, (16, 2))
>>> column_idx = {"col1": 0, "col2": 1}
>>> cat_embed_input = [("col1", 10), ("col2", 10)]
>>> xdeepfm = ExtremeDeepFactorizationMachine(
...     column_idx=column_idx,
...     input_dim=4,
...     cin_layer_dims=[8, 16],
...     cat_embed_input=cat_embed_input,
...     mlp_hidden_dims=[16, 8]
... )
>>> output = xdeepfm(X_tab)
Source code in pytorch_widedeep/models/rec/xdeepfm.py
class ExtremeDeepFactorizationMachine(BaseTabularModelWithAttention):
    """
    Adaptation of the xDeepFM model presented in 'xDeepFM: Combining Explicit
    and Implicit Feature Interactions for Recommender Systems' by Jianxun
    Lian, Xiaohuan Zhou, Fuzheng Zhang, Zhongxia Chen, Xing Xie, Guangzhong
    Sun and Enhong Chen, 2018

    The implementation in this library takes advantage of all the
    functionalities available to encode categorical and continuous features.
    The model can be used with only the compressed interaction network (CIN)
    component, i.e. without the MLP.

    Note that this class implements only the 'Deep' component of the model
    described in the paper. The linear component is not
    implemented 'internally' and, if one wants to include it, it can be
    easily added using the 'wide'/linear component in this library. See the
    examples in the examples folder.

    Parameters
    ----------
    column_idx : Dict[str, int]
        Dictionary mapping column names to their corresponding index.
    input_dim : int
        Embedding input dimensions
    reduce_sum : bool, default=True
        Whether to reduce the sum in the factorization machine output.
    cin_layer_dims : List[int]
        List with the number of units per CIN layer. e.g: _[128, 64]_
    cat_embed_input : Optional[List[Tuple[str, int]]], default=None
        List of tuples with categorical column names and number of unique values.
    cat_embed_dropout : Optional[float], default=None
        Categorical embeddings dropout. If `None`, it will default
        to 0.
    use_cat_bias : Optional[bool], default=None
        Boolean indicating if bias will be used for the categorical embeddings.
        If `None`, it will default to 'False'.
    cat_embed_activation : Optional[str], default=None
        Activation function for the categorical embeddings, if any. Currently
        _'tanh'_, _'relu'_, _'leaky_relu'_ and _'gelu'_ are supported
    continuous_cols : Optional[List[str]], default=None
        List with the name of the numeric (aka continuous) columns
    cont_norm_layer : Optional[Literal["batchnorm", "layernorm"]], default=None
        Type of normalization layer applied to the continuous features.
        Options are: _'layernorm'_ and _'batchnorm'_. if `None`, no
        normalization layer will be used.
    embed_continuous_method: Optional[Literal["piecewise", "periodic", "standard"]], default="standard"
        Method to use to embed the continuous features. Options are:
        _'standard'_, _'periodic'_ or _'piecewise'_. The _'standard'_
        embedding method is based on the FT-Transformer implementation
        presented in the paper: [Revisiting Deep Learning Models for
        Tabular Data](https://arxiv.org/abs/2106.11959v5). The _'periodic'_
        and _'piecewise'_ methods were presented in the paper: [On Embeddings for
        Numerical Features in Tabular Deep Learning](https://arxiv.org/abs/2203.05556).
        Please, read the papers for details.
    cont_embed_dropout : Optional[float], default=None
        Dropout for the continuous embeddings. If `None`, it will default to 0.0
    cont_embed_activation : Optional[str], default=None
        Activation function for the continuous embeddings if any. Currently
        _'tanh'_, _'relu'_, _'leaky_relu'_ and _'gelu'_ are supported.
        If `None`, no activation function will be applied.
    quantization_setup : Optional[Dict[str, List[float]]], default=None
        This parameter is used when the _'piecewise'_ method is used to embed
        the continuous cols. It is a dict where keys are the name of the continuous
        columns and values are lists with the boundaries for the quantization
        of the continuous_cols. See the examples for details.
        If the _'piecewise'_ method is used, this parameter is required.
    n_frequencies : Optional[int], default=None
        This is the so called _'k'_ in their paper [On Embeddings for
        Numerical Features in Tabular Deep Learning](https://arxiv.org/abs/2203.05556),
        and is the number of 'frequencies' that will be used to represent each
        continuous column. See their Eq 2 in the paper for details. If
        the _'periodic'_ method is used, this parameter is required.
    sigma : Optional[float], default=None
        This is the sigma parameter in the paper mentioned when describing the
        previous parameters and it is used to initialise the 'frequency
        weights'. See their Eq 2 in the paper for details. If
        the _'periodic'_ method is used, this parameter is required.
    share_last_layer : Optional[bool], default=None
        This parameter is not present in the before mentioned paper but it is implemented in
        the [official repo](https://github.com/yandex-research/rtdl-num-embeddings/tree/main).
        If `True` the linear layer that turns the frequencies into embeddings
        will be shared across the continuous columns. If `False` a different
        linear layer will be used for each continuous column.
        If the _'periodic'_ method is used, this parameter is required.
    full_embed_dropout: bool, Optional, default = None,
        If `True`, the full embedding corresponding to a column will be masked
        out/dropout. If `None`, it will default to `False`.
    mlp_hidden_dims: Optional[List[int]], default = None
        List with the number of neurons per dense layer in the mlp. If `None`,
        no MLP will be used and the model will return the output of the CIN
        directly.
    mlp_activation: str, default = "relu"
        Activation function for the dense layers of the MLP. Currently
        _'tanh'_, _'relu'_, _'leaky_relu'_ and _'gelu'_ are supported
    mlp_dropout: float or List, default = None
        float or List of floats with the dropout between the dense layers.
        e.g: _[0.5,0.5]_. If `None`, it will default to 0.0
    mlp_batchnorm: bool, default = False
        Boolean indicating whether or not batch normalization will be applied
        to the dense layers
    mlp_batchnorm_last: bool, default = False
        Boolean indicating whether or not batch normalization will be applied
        to the last of the dense layers
    mlp_linear_first: bool, default = False
        Boolean indicating the order of the operations in the dense
        layer. If `True: [LIN -> ACT -> BN -> DP]`. If `False: [BN -> DP ->
        LIN -> ACT]`

    Attributes
    ----------
    n_features: int
        Number of unique features/columns
    cin: CompressedInteractionNetwork
        Instance of the `CompressedInteractionNetwork` class
    mlp: MLP
        Instance of the `MLP` class if `mlp_hidden_dims` is not None. If None,
        the model will return directly the output of the `CIN`

    Examples
    --------
    >>> import torch
    >>> from pytorch_widedeep.models.rec import ExtremeDeepFactorizationMachine
    >>> X_tab = torch.randint(0, 10, (16, 2))
    >>> column_idx = {"col1": 0, "col2": 1}
    >>> cat_embed_input = [("col1", 10), ("col2", 10)]
    >>> xdeepfm = ExtremeDeepFactorizationMachine(
    ...     column_idx=column_idx,
    ...     input_dim=4,
    ...     cin_layer_dims=[8, 16],
    ...     cat_embed_input=cat_embed_input,
    ...     mlp_hidden_dims=[16, 8]
    ... )
    >>> output = xdeepfm(X_tab)
    """

    def __init__(
        self,
        *,
        column_idx: Dict[str, int],
        input_dim: int,
        reduce_sum: bool = True,
        cin_layer_dims: List[int],
        cat_embed_input: List[Tuple[str, int]],
        cat_embed_dropout: Optional[float] = None,
        use_cat_bias: Optional[bool] = None,
        cat_embed_activation: Optional[str] = None,
        continuous_cols: Optional[List[str]] = None,
        cont_norm_layer: Optional[Literal["batchnorm", "layernorm"]] = None,
        embed_continuous_method: Optional[
            Literal["piecewise", "periodic", "standard"]
        ] = "standard",
        cont_embed_dropout: Optional[float] = None,
        cont_embed_activation: Optional[str] = None,
        quantization_setup: Optional[Dict[str, List[float]]] = None,
        n_frequencies: Optional[int] = None,
        sigma: Optional[float] = None,
        share_last_layer: Optional[bool] = None,
        full_embed_dropout: Optional[bool] = None,
        mlp_hidden_dims: Optional[List[int]] = None,
        mlp_activation: Optional[str] = None,
        mlp_dropout: Optional[float] = None,
        mlp_batchnorm: Optional[bool] = None,
        mlp_batchnorm_last: Optional[bool] = None,
        mlp_linear_first: Optional[bool] = None,
    ):
        super(ExtremeDeepFactorizationMachine, self).__init__(
            column_idx=column_idx,
            input_dim=input_dim,
            cat_embed_input=cat_embed_input,
            cat_embed_dropout=cat_embed_dropout,
            use_cat_bias=use_cat_bias,
            cat_embed_activation=cat_embed_activation,
            shared_embed=False,
            add_shared_embed=None,
            frac_shared_embed=None,
            continuous_cols=continuous_cols,
            cont_norm_layer=cont_norm_layer,
            embed_continuous_method=embed_continuous_method,
            cont_embed_dropout=cont_embed_dropout,
            cont_embed_activation=cont_embed_activation,
            quantization_setup=quantization_setup,
            n_frequencies=n_frequencies,
            sigma=sigma,
            share_last_layer=share_last_layer,
            full_embed_dropout=full_embed_dropout,
        )

        self.mlp_hidden_dims = mlp_hidden_dims
        self.mlp_activation = mlp_activation
        self.mlp_dropout = mlp_dropout
        self.mlp_batchnorm = mlp_batchnorm
        self.mlp_batchnorm_last = mlp_batchnorm_last
        self.mlp_linear_first = mlp_linear_first

        self.reduce_sum = reduce_sum
        self.cin_layer_dims = cin_layer_dims

        self.n_features = len(self.column_idx)

        self.cin = CompressedInteractionNetwork(
            num_cols=self.n_features, cin_layer_dims=self.cin_layer_dims
        )

        if self.mlp_hidden_dims is not None:
            if (
                self.mlp_hidden_dims[-1] != sum(self.cin_layer_dims)
                and not self.reduce_sum
            ):
                d_hidden = (
                    [sum(self.cin_layer_dims)]
                    + self.mlp_hidden_dims
                    + [sum(self.cin_layer_dims)]
                )
            else:
                d_hidden = [sum(self.cin_layer_dims)] + self.mlp_hidden_dims

            self.mlp = MLP(
                d_hidden=d_hidden,
                activation=(
                    "relu" if self.mlp_activation is None else self.mlp_activation
                ),
                dropout=0.0 if self.mlp_dropout is None else self.mlp_dropout,
                batchnorm=False if self.mlp_batchnorm is None else self.mlp_batchnorm,
                batchnorm_last=(
                    False
                    if self.mlp_batchnorm_last is None
                    else self.mlp_batchnorm_last
                ),
                linear_first=(
                    False if self.mlp_linear_first is None else self.mlp_linear_first
                ),
            )
        else:
            self.mlp = None

    def forward(self, X: Tensor) -> Tensor:

        embeddings = self._get_embeddings(X)
        cin_out = self.cin(embeddings)

        if self.mlp is None:
            if self.reduce_sum:
                return cin_out.sum(dim=1, keepdim=True)
            return cin_out

        mlp_out = self.mlp(cin_out)

        if self.reduce_sum:
            cin_out = cin_out.sum(dim=1, keepdim=True)
            mlp_out = mlp_out.sum(dim=1, keepdim=True)

        return mlp_out + cin_out

    @property
    def output_dim(self):
        if self.reduce_sum:
            return 1
        else:
            return sum(self.cin_layer_dims)
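
As the forward pass and the output_dim property above show, with the default reduce_sum=True the model returns a single value per sample, whereas with reduce_sum=False it returns sum(cin_layer_dims) features per sample that can be fed into further layers. A short, purely illustrative check using the same configuration as the Examples section:

import torch
from pytorch_widedeep.models.rec import ExtremeDeepFactorizationMachine

xdeepfm_feats = ExtremeDeepFactorizationMachine(
    column_idx={"col1": 0, "col2": 1},
    input_dim=4,
    cin_layer_dims=[8, 16],
    cat_embed_input=[("col1", 10), ("col2", 10)],
    reduce_sum=False,
)
out = xdeepfm_feats(torch.randint(0, 10, (16, 2)))
assert out.shape == (16, sum([8, 16]))  # (16, 24); with reduce_sum=True it would be (16, 1)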

AutoInt

Bases: BaseTabularModelWithAttention

Defines an AutoInt model that can be used as the deeptabular component of a Wide & Deep model or independently by itself.

This class implements the AutoInt: Automatic Feature Interaction Learning via Self-Attentive Neural Networks architecture, which learns feature interactions through multi-head self-attention networks.

Parameters:

Name Type Description Default
column_idx Dict[str, int]

Dict containing the index of the columns that will be passed through the AutoInt model. Required to slice the tensors. e.g. {'education': 0, 'relationship': 1, 'workclass': 2, ...}.

required
input_dim int

Dimension of the input embeddings

required
num_heads int

Number of attention heads

4
num_layers int

Number of interacting layers (attention + residual)

2
reduction Literal['mean', 'cat']

How to reduce the output of the attention layers. Options are: 'mean' (mean of attention outputs) and 'cat' (concatenation of attention outputs)

'mean'
cat_embed_input Optional[List[Tuple[str, int]]]

List of Tuples with the column name, number of unique values and embedding dimension. e.g. [(education, 11, 32), ...]

None
cat_embed_dropout Optional[float]

Categorical embeddings dropout. If None, it will default to 0.

None
use_cat_bias Optional[bool]

Boolean indicating if bias will be used for the categorical embeddings

None
cat_embed_activation Optional[str]

Activation function for the categorical embeddings

None
continuous_cols Optional[List[str]]

List with the name of the numeric (aka continuous) columns

None
cont_norm_layer Optional[Literal['batchnorm', 'layernorm']]

Type of normalization layer applied to the continuous features. Options are: 'layernorm' and 'batchnorm'. if None, no normalization layer will be used.

None
embed_continuous_method Optional[Literal['standard', 'piecewise', 'periodic']]

Method to use to embed the continuous features. Options are: 'standard', 'periodic' or 'piecewise'. The 'standard' embedding method is based on the FT-Transformer implementation presented in the paper: Revisiting Deep Learning Models for Tabular Data. The 'periodic' and 'piecewise' methods were presented in the paper: On Embeddings for Numerical Features in Tabular Deep Learning. Please, read the papers for details.

None
cont_embed_dropout Optional[float]

Dropout for the continuous embeddings. If None, it will default to 0.0

None
cont_embed_activation Optional[str]

Activation function for the continuous embeddings if any. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported. If None, no activation function will be applied.

None
quantization_setup Optional[Dict[str, List[float]]]

This parameter is used when the 'piecewise' method is used to embed the continuous cols. It is a dict where keys are the name of the continuous columns and values are lists with the boundaries for the quantization of the continuous_cols. See the examples for details. If the 'piecewise' method is used, this parameter is required.

None
n_frequencies Optional[int]

This is the so called 'k' in their paper On Embeddings for Numerical Features in Tabular Deep Learning, and is the number of 'frequencies' that will be used to represent each continuous column. See their Eq 2 in the paper for details. If the 'periodic' method is used, this parameter is required.

None
sigma Optional[float]

This is the sigma parameter in the paper mentioned when describing the previous parameters and it is used to initialise the 'frequency weights'. See their Eq 2 in the paper for details. If the 'periodic' method is used, this parameter is required.

None
share_last_layer Optional[bool]

This parameter is not present in the before mentioned paper but it is implemented in the official repo. If True the linear layer that turns the frequencies into embeddings will be shared across the continuous columns. If False a different linear layer will be used for each continuous column. If the 'periodic' method is used, this parameter is required.

None
full_embed_dropout Optional[bool]

If True, the full embedding corresponding to a column will be masked out/dropout. If None, it will default to False.

None

Attributes:

Name Type Description
attention_layers ModuleList

List of multi-head attention layers

Examples:

>>> import torch
>>> from pytorch_widedeep.models.rec import AutoInt
>>> X_tab = torch.cat((torch.empty(5, 4).random_(4), torch.rand(5, 1)), axis=1)
>>> colnames = ["a", "b", "c", "d", "e"]
>>> cat_embed_input = [(u, i, j) for u, i, j in zip(colnames[:4], [4] * 4, [8] * 4)]
>>> column_idx = {k: v for v, k in enumerate(colnames)}
>>> model = AutoInt(
...     column_idx=column_idx,
...     input_dim=32,
...     cat_embed_input=cat_embed_input,
...     continuous_cols=["e"],
...     embed_continuous_method="standard",
...     num_heads=4,
...     num_layers=2
... )
>>> out = model(X_tab)
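
A purely illustrative follow-up (reusing column_idx, cat_embed_input and X_tab from the example above): building the same model with reduction="cat" concatenates the per-column attention outputs instead of averaging them, so the output width grows from input_dim to input_dim times the number of columns.

model_cat = AutoInt(
    column_idx=column_idx,
    input_dim=32,
    cat_embed_input=cat_embed_input,
    continuous_cols=["e"],
    embed_continuous_method="standard",
    reduction="cat",
)
out_cat = model_cat(X_tab)
assert out.shape[1] == 32                        # reduction="mean" -> input_dim
assert out_cat.shape[1] == 32 * len(column_idx)  # reduction="cat"  -> input_dim * n_cols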
Source code in pytorch_widedeep/models/rec/autoint.py
class AutoInt(BaseTabularModelWithAttention):
    r"""Defines an `AutoInt` model that can be used as the `deeptabular` component
    of a Wide & Deep model or independently by itself.

    This class implements the [AutoInt: Automatic Feature Interaction Learning via Self-Attentive
    Neural Networks](https://arxiv.org/abs/1810.11921) architecture, which learns feature
    interactions through multi-head self-attention networks.

    Parameters
    ----------
    column_idx: Dict
        Dict containing the index of the columns that will be passed through
        the `AutoInt` model. Required to slice the tensors. e.g.
        _{'education': 0, 'relationship': 1, 'workclass': 2, ...}_.
    input_dim: int
        Dimension of the input embeddings
    num_heads: int, default = 4
        Number of attention heads
    num_layers: int, default = 2
        Number of interacting layers (attention + residual)
    reduction: str, default = "mean"
        How to reduce the output of the attention layers. Options are:
        _'mean'_: mean of attention outputs
        _'cat'_: concatenation of attention outputs
    cat_embed_input: List, Optional, default = None
        List of Tuples with the column name, number of unique values and
        embedding dimension. e.g. _[(education, 11, 32), ...]_
    cat_embed_dropout: float, Optional, default = None
        Categorical embeddings dropout. If `None`, it will default to 0.
    use_cat_bias: bool, Optional, default = None,
        Boolean indicating if bias will be used for the categorical embeddings
    cat_embed_activation: Optional, str, default = None,
        Activation function for the categorical embeddings
    continuous_cols : Optional[List[str]], default=None
        List with the name of the numeric (aka continuous) columns
    cont_norm_layer : Optional[Literal["batchnorm", "layernorm"]], default=None
        Type of normalization layer applied to the continuous features.
        Options are: _'layernorm'_ and _'batchnorm'_. if `None`, no
        normalization layer will be used.
    embed_continuous_method: Optional[Literal["piecewise", "periodic", "standard"]], default="standard"
        Method to use to embed the continuous features. Options are:
        _'standard'_, _'periodic'_ or _'piecewise'_. The _'standard'_
        embedding method is based on the FT-Transformer implementation
        presented in the paper: [Revisiting Deep Learning Models for
        Tabular Data](https://arxiv.org/abs/2106.11959v5). The _'periodic'_
        and _'piecewise'_ methods were presented in the paper: [On Embeddings for
        Numerical Features in Tabular Deep Learning](https://arxiv.org/abs/2203.05556).
        Please, read the papers for details.
    cont_embed_dropout : Optional[float], default=None
        Dropout for the continuous embeddings. If `None`, it will default to 0.0
    cont_embed_activation : Optional[str], default=None
        Activation function for the continuous embeddings if any. Currently
        _'tanh'_, _'relu'_, _'leaky_relu'_ and _'gelu'_ are supported.
        If `None`, no activation function will be applied.
    quantization_setup : Optional[Dict[str, List[float]]], default=None
        This parameter is used when the _'piecewise'_ method is used to embed
        the continuous cols. It is a dict where keys are the name of the continuous
        columns and values are lists with the boundaries for the quantization
        of the continuous_cols. See the examples for details.
        If the _'piecewise'_ method is used, this parameter is required.
    n_frequencies : Optional[int], default=None
        This is the so called _'k'_ in their paper [On Embeddings for
        Numerical Features in Tabular Deep Learning](https://arxiv.org/abs/2203.05556),
        and is the number of 'frequencies' that will be used to represent each
        continuous column. See their Eq 2 in the paper for details. If
        the _'periodic'_ method is used, this parameter is required.
    sigma : Optional[float], default=None
        This is the sigma parameter in the paper mentioned when describing the
        previous parameters and it is used to initialise the 'frequency
        weights'. See their Eq 2 in the paper for details. If
        the _'periodic'_ method is used, this parameter is required.
    share_last_layer : Optional[bool], default=None
        This parameter is not present in the before mentioned paper but it is implemented in
        the [official repo](https://github.com/yandex-research/rtdl-num-embeddings/tree/main).
        If `True` the linear layer that turns the frequencies into embeddings
        will be shared across the continuous columns. If `False` a different
        linear layer will be used for each continuous column.
        If the _'periodic'_ method is used, this parameter is required.
    full_embed_dropout: bool, Optional, default = None,
        If `True`, the full embedding corresponding to a column will be masked
        out/dropout. If `None`, it will default to `False`.

    Attributes
    ----------
    attention_layers: nn.ModuleList
        List of multi-head attention layers

    Examples
    --------
    >>> import torch
    >>> from pytorch_widedeep.models.rec import AutoInt
    >>> X_tab = torch.cat((torch.empty(5, 4).random_(4), torch.rand(5, 1)), axis=1)
    >>> colnames = ["a", "b", "c", "d", "e"]
    >>> cat_embed_input = [(u, i, j) for u, i, j in zip(colnames[:4], [4] * 4, [8] * 4)]
    >>> column_idx = {k: v for v, k in enumerate(colnames)}
    >>> model = AutoInt(
    ...     column_idx=column_idx,
    ...     input_dim=32,
    ...     cat_embed_input=cat_embed_input,
    ...     continuous_cols=["e"],
    ...     embed_continuous_method="standard",
    ...     num_heads=4,
    ...     num_layers=2
    ... )
    >>> out = model(X_tab)
    """

    def __init__(
        self,
        *,
        column_idx: Dict[str, int],
        input_dim: int,
        num_heads: int = 4,
        num_layers: int = 2,
        reduction: Literal["mean", "cat"] = "mean",
        cat_embed_input: Optional[List[Tuple[str, int]]] = None,
        cat_embed_dropout: Optional[float] = None,
        use_cat_bias: Optional[bool] = None,
        cat_embed_activation: Optional[str] = None,
        continuous_cols: Optional[List[str]] = None,
        cont_norm_layer: Optional[Literal["batchnorm", "layernorm"]] = None,
        embed_continuous_method: Optional[
            Literal["standard", "piecewise", "periodic"]
        ] = None,
        cont_embed_dropout: Optional[float] = None,
        cont_embed_activation: Optional[str] = None,
        quantization_setup: Optional[Dict[str, List[float]]] = None,
        n_frequencies: Optional[int] = None,
        sigma: Optional[float] = None,
        share_last_layer: Optional[bool] = None,
        full_embed_dropout: Optional[bool] = None,
    ):
        super(AutoInt, self).__init__(
            column_idx=column_idx,
            input_dim=input_dim,
            cat_embed_input=cat_embed_input,
            cat_embed_dropout=cat_embed_dropout,
            use_cat_bias=use_cat_bias,
            cat_embed_activation=cat_embed_activation,
            shared_embed=None,
            add_shared_embed=None,
            frac_shared_embed=None,
            continuous_cols=continuous_cols,
            cont_norm_layer=cont_norm_layer,
            embed_continuous_method=embed_continuous_method,
            cont_embed_dropout=cont_embed_dropout,
            cont_embed_activation=cont_embed_activation,
            quantization_setup=quantization_setup,
            n_frequencies=n_frequencies,
            sigma=sigma,
            share_last_layer=share_last_layer,
            full_embed_dropout=full_embed_dropout,
        )

        self.num_heads = num_heads
        self.num_layers = num_layers
        self.reduction = reduction

        self.attention_layers = nn.ModuleList(
            [
                nn.MultiheadAttention(
                    embed_dim=input_dim,
                    num_heads=num_heads,
                    batch_first=True,
                )
                for _ in range(num_layers)
            ]
        )

    def forward(self, X: torch.Tensor) -> torch.Tensor:
        x = self._get_embeddings(X)

        for layer in self.attention_layers:
            attn_output, _ = layer(x, x, x)
            x = attn_output + x

        if self.reduction == "mean":
            out = x.mean(dim=1)
        else:
            out = x.reshape(x.size(0), -1)

        return out

    @property
    def output_dim(self) -> int:
        if self.reduction == "mean":
            return self.input_dim
        else:
            return self.input_dim * len(self.column_idx)

AutoIntPlus

Bases: BaseTabularModelWithAttention

Defines an AutoIntPlus model that can be used as the deeptabular component of a Wide & Deep model or independently by itself.

This class implements an enhanced version of the AutoInt architecture, adding a parallel or stacked deep network and an optional gating mechanism to control the contribution of the attention-based and MLP branches.

Parameters:

Name Type Description Default
column_idx Dict[str, int]

Dict containing the index of the columns that will be passed through the model. Required to slice the tensors. e.g. {'education': 0, 'relationship': 1, 'workclass': 2, ...}.

required
input_dim int

Dimension of the input embeddings

required
num_heads int

Number of attention heads

4
num_layers int

Number of interacting layers (attention + residual)

2
reduction Literal['mean', 'cat']

How to reduce the output of the attention layers. Options are: 'mean' (mean of attention outputs) and 'cat' (concatenation of attention outputs)

'mean'
structure Literal['stacked', 'parallel']

Structure of the model. Either 'parallel' or 'stacked'. If 'parallel', the output will be the concatenation of the attention and deep networks. If 'stacked', the attention output will be fed into the deep network.

'parallel'
gated bool

If True and structure is 'parallel', uses a gating mechanism to combine the attention and deep networks. Note: requires reduction='mean'.

True
cat_embed_input Optional[List[Tuple[str, int]]]

List of Tuples with the column name, number of unique values and embedding dimension. e.g. [(education, 11, 32), ...]

None
cat_embed_dropout Optional[float]

Categorical embeddings dropout. If None, it will default to 0.

None
use_cat_bias Optional[bool]

Boolean indicating if bias will be used for the categorical embeddings

None
cat_embed_activation Optional[str]

Activation function for the categorical embeddings

None
continuous_cols Optional[List[str]]

List with the name of the numeric (aka continuous) columns

None
cont_norm_layer Optional[Literal['batchnorm', 'layernorm']]

Type of normalization layer applied to the continuous features. Options are: 'layernorm' and 'batchnorm'. if None, no normalization layer will be used.

None
embed_continuous_method Optional[Literal['standard', 'piecewise', 'periodic']]

Method to use to embed the continuous features. Options are: 'standard', 'periodic' or 'piecewise'. The 'standard' embedding method is based on the FT-Transformer implementation presented in the paper: Revisiting Deep Learning Models for Tabular Data. The 'periodic' and 'piecewise' methods were presented in the paper: On Embeddings for Numerical Features in Tabular Deep Learning. Please, read the papers for details.

None
cont_embed_dropout Optional[float]

Dropout for the continuous embeddings. If None, it will default to 0.0

None
cont_embed_activation Optional[str]

Activation function for the continuous embeddings if any. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported. If None, no activation function will be applied.

None
quantization_setup Optional[Dict[str, List[float]]]

This parameter is used when the 'piecewise' method is used to embed the continuous cols. It is a dict where keys are the name of the continuous columns and values are lists with the boundaries for the quantization of the continuous_cols. See the examples for details. If the 'piecewise' method is used, this parameter is required.

None
n_frequencies Optional[int]

This is the so called 'k' in their paper On Embeddings for Numerical Features in Tabular Deep Learning, and is the number of 'frequencies' that will be used to represent each continuous column. See their Eq 2 in the paper for details. If the 'periodic' method is used, this parameter is required.

None
sigma Optional[float]

This is the sigma parameter in the paper mentioned when describing the previous parameters and it is used to initialise the 'frequency weights'. See their Eq 2 in the paper for details. If the 'periodic' method is used, this parameter is required.

None
share_last_layer Optional[bool]

This parameter is not present in the before mentioned paper but it is implemented in the official repo. If True the linear layer that turns the frequencies into embeddings will be shared across the continuous columns. If False a different linear layer will be used for each continuous column. If the 'periodic' method is used, this parameter is required.

None
full_embed_dropout Optional[bool]

If True, the full embedding corresponding to a column will be masked out/dropout. If None, it will default to False.

None
mlp_hidden_dims List[int]

List with the number of neurons per dense layer in the mlp.

[100, 100]
mlp_activation str

Activation function for the dense layers of the MLP. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported

'relu'
mlp_dropout Union[float, List[float]]

float or List of floats with the dropout between the dense layers. e.g: [0.5,0.5]

0.1
mlp_batchnorm bool

Boolean indicating whether or not batch normalization will be applied to the dense layers

False
mlp_batchnorm_last bool

Boolean indicating whether or not batch normalization will be applied to the last of the dense layers

False
mlp_linear_first bool

Boolean indicating the order of the operations in the dense layer. If True: [LIN -> ACT -> BN -> DP]. If False: [BN -> DP -> LIN -> ACT]

True

Attributes:

Name Type Description
attention_layers ModuleList

List of multi-head attention layers

deep_network Module

The deep network component (MLP)

gate (Module, optional)

The gating network (if gated=True)

Examples:

>>> import torch
>>> from pytorch_widedeep.models.rec import AutoIntPlus
>>> X_tab = torch.cat((torch.empty(5, 4).random_(4), torch.rand(5, 1)), axis=1)
>>> colnames = ["a", "b", "c", "d", "e"]
>>> cat_embed_input = [(u, i, j) for u, i, j in zip(colnames[:4], [4] * 4, [8] * 4)]
>>> column_idx = {k: v for v, k in enumerate(colnames)}
>>> model = AutoIntPlus(
...     column_idx=column_idx,
...     input_dim=32,
...     cat_embed_input=cat_embed_input,
...     continuous_cols=["e"],
...     embed_continuous_method="standard",
...     num_heads=4,
...     num_layers=2,
...     structure="parallel",
...     gated=True,
...     mlp_hidden_dims=[64, 32]
... )
>>> out = model(X_tab)
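
The example above uses the parallel, gated configuration. As a further, purely illustrative sketch (reusing the objects defined above), the stacked variant feeds the attention output into the deep network instead of combining both branches; the gating mechanism only applies to the parallel structure, so it is disabled here:

model_stacked = AutoIntPlus(
    column_idx=column_idx,
    input_dim=32,
    cat_embed_input=cat_embed_input,
    continuous_cols=["e"],
    embed_continuous_method="standard",
    num_heads=4,
    num_layers=2,
    structure="stacked",
    gated=False,
    mlp_hidden_dims=[64, 32],
)
out_stacked = model_stacked(X_tab)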
Source code in pytorch_widedeep/models/rec/autoint_plus.py
class AutoIntPlus(BaseTabularModelWithAttention):
    r"""Defines an `AutoIntPlus` model that can be used as the `deeptabular` component
    of a Wide & Deep model or independently by itself.

    This class implements an enhanced version of the [AutoInt](https://arxiv.org/abs/1810.11921)
    architecture, adding a parallel or stacked deep network and an optional gating mechanism
    to control the contribution of the attention-based and MLP branches.

    Parameters
    ----------
    column_idx: Dict
        Dict containing the index of the columns that will be passed through
        the model. Required to slice the tensors. e.g.
        _{'education': 0, 'relationship': 1, 'workclass': 2, ...}_.
    input_dim: int
        Dimension of the input embeddings
    num_heads: int, default = 4
        Number of attention heads
    num_layers: int, default = 2
        Number of interacting layers (attention + residual)
    reduction: str, default = "mean"
        How to reduce the output of the attention layers. Options are:
        _'mean'_: mean of attention outputs
        _'cat'_: concatenation of attention outputs
    structure: str, default = "parallel"
        Structure of the model. Either _'parallel'_ or _'stacked'_. If _'parallel'_,
        the output will be the concatenation of the attention and deep networks.
        If _'stacked'_, the attention output will be fed into the deep network.
    gated: bool, default = True
        If True and structure is 'parallel', uses a gating mechanism to combine
        the attention and deep networks. Note: requires reduction='mean'.
    cat_embed_input: List, Optional, default = None
        List of Tuples with the column name, number of unique values and
        embedding dimension. e.g. _[(education, 11, 32), ...]_
    cat_embed_dropout: float, Optional, default = None
        Categorical embeddings dropout. If `None`, it will default to 0.
    use_cat_bias: bool, Optional, default = None,
        Boolean indicating if bias will be used for the categorical embeddings
    cat_embed_activation: Optional, str, default = None,
        Activation function for the categorical embeddings
    continuous_cols : Optional[List[str]], default=None
        List with the name of the numeric (aka continuous) columns
    cont_norm_layer : Optional[Literal["batchnorm", "layernorm"]], default=None
        Type of normalization layer applied to the continuous features.
        Options are: _'layernorm'_ and _'batchnorm'_. if `None`, no
        normalization layer will be used.
    embed_continuous_method: Optional[Literal["piecewise", "periodic", "standard"]], default="standard"
        Method to use to embed the continuous features. Options are:
        _'standard'_, _'periodic'_ or _'piecewise'_. The _'standard'_
        embedding method is based on the FT-Transformer implementation
        presented in the paper: [Revisiting Deep Learning Models for
        Tabular Data](https://arxiv.org/abs/2106.11959v5). The _'periodic'_
        and _'piecewise'_ methods were presented in the paper: [On Embeddings for
        Numerical Features in Tabular Deep Learning](https://arxiv.org/abs/2203.05556).
        Please, read the papers for details.
    cont_embed_dropout : Optional[float], default=None
        Dropout for the continuous embeddings. If `None`, it will default to 0.0
    cont_embed_activation : Optional[str], default=None
        Activation function for the continuous embeddings if any. Currently
        _'tanh'_, _'relu'_, _'leaky_relu'_ and _'gelu'_ are supported.
        If `None`, no activation function will be applied.
    quantization_setup : Optional[Dict[str, List[float]]], default=None
        This parameter is used when the _'piecewise'_ method is used to embed
        the continuous cols. It is a dict where keys are the name of the continuous
        columns and values are lists with the boundaries for the quantization
        of the continuous_cols. See the examples for details. If the
        _'piecewise'_ method is used, this parameter is required.
    n_frequencies : Optional[int], default=None
        This is the so called _'k'_ in their paper [On Embeddings for
        Numerical Features in Tabular Deep Learning](https://arxiv.org/abs/2203.05556),
        and is the number of 'frequencies' that will be used to represent each
        continuous column. See their Eq 2 in the paper for details. If
        the _'periodic'_ method is used, this parameter is required.
    sigma : Optional[float], default=None
        This is the sigma parameter in the paper mentioned when describing the
        previous parameters and it is used to initialise the 'frequency
        weights'. See their Eq 2 in the paper for details. If
        the _'periodic'_ method is used, this parameter is required.
    share_last_layer : Optional[bool], default=None
        This parameter is not present in the before mentioned paper but it is implemented in
        the [official repo](https://github.com/yandex-research/rtdl-num-embeddings/tree/main).
        If `True` the linear layer that turns the frequencies into embeddings
        will be shared across the continuous columns. If `False` a different
        linear layer will be used for each continuous column.
        If the _'periodic'_ method is used, this parameter is required.
    full_embed_dropout: bool, Optional, default = None,
        If `True`, the full embedding corresponding to a column will be masked
        out/dropout. If `None`, it will default to `False`.
    mlp_hidden_dims: List, default = [100, 100]
        List with the number of neurons per dense layer in the mlp.
    mlp_activation: str, default = "relu"
        Activation function for the dense layers of the MLP. Currently
        _'tanh'_, _'relu'_, _'leaky_relu'_ and _'gelu'_ are supported
    mlp_dropout: float or List, default = 0.1
        float or List of floats with the dropout between the dense layers.
        e.g: _[0.5,0.5]_
    mlp_batchnorm: bool, default = False
        Boolean indicating whether or not batch normalization will be applied
        to the dense layers
    mlp_batchnorm_last: bool, default = False
        Boolean indicating whether or not batch normalization will be applied
        to the last of the dense layers
    mlp_linear_first: bool, default = True
        Boolean indicating the order of the operations in the dense
        layer. If `True: [LIN -> ACT -> BN -> DP]`. If `False: [BN -> DP ->
        LIN -> ACT]`

    Attributes
    ----------
    attention_layers: nn.ModuleList
        List of multi-head attention layers
    deep_network: nn.Module
        The deep network component (MLP)
    gate: nn.Module, optional
        The gating network (if gated=True)

    Examples
    --------
    >>> import torch
    >>> from pytorch_widedeep.models.rec import AutoIntPlus
    >>> X_tab = torch.cat((torch.empty(5, 4).random_(4), torch.rand(5, 1)), axis=1)
    >>> colnames = ["a", "b", "c", "d", "e"]
    >>> cat_embed_input = [(u, i, j) for u, i, j in zip(colnames[:4], [4] * 4, [8] * 4)]
    >>> column_idx = {k: v for v, k in enumerate(colnames)}
    >>> model = AutoIntPlus(
    ...     column_idx=column_idx,
    ...     input_dim=32,
    ...     cat_embed_input=cat_embed_input,
    ...     continuous_cols=["e"],
    ...     embed_continuous_method="standard",
    ...     num_heads=4,
    ...     num_layers=2,
    ...     structure="parallel",
    ...     gated=True,
    ...     mlp_hidden_dims=[64, 32]
    ... )
    >>> out = model(X_tab)
    """

    def __init__(
        self,
        *,
        column_idx: Dict[str, int],
        input_dim: int,
        num_heads: int = 4,
        num_layers: int = 2,
        reduction: Literal["mean", "cat"] = "mean",
        structure: Literal["stacked", "parallel"] = "parallel",
        gated: bool = True,
        cat_embed_input: Optional[List[Tuple[str, int]]] = None,
        cat_embed_dropout: Optional[float] = None,
        use_cat_bias: Optional[bool] = None,
        cat_embed_activation: Optional[str] = None,
        continuous_cols: Optional[List[str]] = None,
        cont_norm_layer: Optional[Literal["batchnorm", "layernorm"]] = None,
        embed_continuous_method: Optional[
            Literal["standard", "piecewise", "periodic"]
        ] = None,
        cont_embed_dropout: Optional[float] = None,
        cont_embed_activation: Optional[str] = None,
        quantization_setup: Optional[Dict[str, List[float]]] = None,
        n_frequencies: Optional[int] = None,
        sigma: Optional[float] = None,
        share_last_layer: Optional[bool] = None,
        full_embed_dropout: Optional[bool] = None,
        mlp_hidden_dims: List[int] = [100, 100],
        mlp_activation: str = "relu",
        mlp_dropout: Union[float, List[float]] = 0.1,
        mlp_batchnorm: bool = False,
        mlp_batchnorm_last: bool = False,
        mlp_linear_first: bool = True,
    ):
        super(AutoIntPlus, self).__init__(
            column_idx=column_idx,
            input_dim=input_dim,
            cat_embed_input=cat_embed_input,
            cat_embed_dropout=cat_embed_dropout,
            use_cat_bias=use_cat_bias,
            cat_embed_activation=cat_embed_activation,
            shared_embed=None,
            add_shared_embed=None,
            frac_shared_embed=None,
            continuous_cols=continuous_cols,
            cont_norm_layer=cont_norm_layer,
            embed_continuous_method=embed_continuous_method,
            cont_embed_dropout=cont_embed_dropout,
            cont_embed_activation=cont_embed_activation,
            quantization_setup=quantization_setup,
            n_frequencies=n_frequencies,
            sigma=sigma,
            share_last_layer=share_last_layer,
            full_embed_dropout=full_embed_dropout,
        )

        self.input_dim = input_dim
        self.num_heads = num_heads
        self.num_layers = num_layers
        self.reduction = reduction
        self.structure = structure
        self.gated = gated

        self.mlp_hidden_dims = mlp_hidden_dims
        self.mlp_activation = mlp_activation
        self.mlp_dropout = mlp_dropout
        self.mlp_batchnorm = mlp_batchnorm
        self.mlp_batchnorm_last = mlp_batchnorm_last
        self.mlp_linear_first = mlp_linear_first

        self.attention_layers = nn.ModuleList(
            [
                nn.MultiheadAttention(
                    embed_dim=self.input_dim,
                    num_heads=self.num_heads,
                    batch_first=True,
                )
                for _ in range(num_layers)
            ]
        )

        if self.gated:
            self.gate = self._build_gate()

        deep_network_inp_dim = self._get_deep_network_input_dim()
        self.deep_network = MLP(
            [deep_network_inp_dim] + self.mlp_hidden_dims,
            self.mlp_activation,
            self.mlp_dropout,
            self.mlp_batchnorm,
            self.mlp_batchnorm_last,
            self.mlp_linear_first,
        )

    def forward(self, X: torch.Tensor) -> torch.Tensor:
        x = self._get_embeddings(X)
        attn_output = self._apply_attention_layers(x)
        reduced_attn_output = self._reduce_attention_output(attn_output)
        if self.structure == "parallel":
            return self._parallel_output(reduced_attn_output, x)
        else:  # structure == "stacked"
            return self.deep_network(reduced_attn_output)

    def _get_deep_network_input_dim(self) -> int:
        if self.structure == "parallel":
            return self.input_dim * len(self.column_idx)
        elif self.reduction == "mean":  # structure == "stacked"
            return self.input_dim
        else:
            return self.input_dim * len(self.column_idx)

    def _build_gate(self) -> nn.Linear:
        self._setup_gating()
        return nn.Linear(self.input_dim * 2, self.input_dim)

    def _setup_gating(self):
        if self.structure == "stacked":
            raise ValueError(
                "Gating is not supported for stacked structure. Set `gated=False`."
            )
        if self.reduction != "mean":
            raise ValueError(
                "When using a gated structure, the reduction must be 'mean'."
            )
        if self.mlp_hidden_dims[-1] != self.input_dim:
            self.mlp_hidden_dims = self.mlp_hidden_dims + [self.input_dim]
            # emit an actual warning (assumes `import warnings` at module level)
            warnings.warn(
                "When using a gated structure, the last hidden layer of "
                "the MLP must have the same dimension as the input. "
                "The last hidden layer has been set to the input dimension.",
                UserWarning,
            )

    def _apply_attention_layers(self, x: torch.Tensor) -> torch.Tensor:
        attn_output = x
        for layer in self.attention_layers:
            layer_out, _ = layer(attn_output, attn_output, attn_output)
            attn_output = layer_out + attn_output
        return attn_output

    def _reduce_attention_output(self, attn_output: torch.Tensor) -> torch.Tensor:
        return (
            attn_output.mean(dim=1)
            if self.reduction == "mean"
            else attn_output.flatten(1)
        )

    def _parallel_output(
        self, attn_output: torch.Tensor, x: torch.Tensor
    ) -> torch.Tensor:
        deep_output = self.deep_network(x.reshape(x.size(0), -1))

        if not self.gated:
            return torch.cat([attn_output, deep_output], dim=1)

        combined_features = torch.cat([attn_output, deep_output], dim=-1)
        gate_weights = torch.sigmoid(self.gate(combined_features))
        return gate_weights * attn_output + (1 - gate_weights) * deep_output

    @property
    def output_dim(self) -> int:
        if self.structure == "parallel":
            if self.gated:
                return self.input_dim
            else:
                if self.reduction == "mean":
                    return self.mlp_hidden_dims[-1] + self.input_dim
                else:
                    return self.mlp_hidden_dims[-1] + self.input_dim * len(
                        self.column_idx
                    )
        else:
            return self.mlp_hidden_dims[-1]

Transformer

Bases: Module

Basic Encoder-Only Transformer Model for sequence classification/regression. Like all other models in the library, this model can be used as the deeptext component of a Wide & Deep model or independently by itself.

ℹ️ NOTE: This model is introduced in the context of recommendation systems and is intended for sequences of any nature (e.g. items). It can, of course, still be used for text.

Parameters:

Name Type Description Default
vocab_size int

Number of words in the vocabulary

required
input_dim int

Dimension of the token embeddings

Param aliases: embed_dim, d_model.

required
seq_length int

Input sequence length

required
n_heads int

Number of attention heads per Transformer block

required
n_blocks int

Number of Transformer blocks

required
attn_dropout float

Dropout that will be applied to the Multi-Head Attention layers

0.1
ff_dropout float

Dropout that will be applied to the FeedForward network

0.1
ff_factor int

Multiplicative factor applied to the first layer of the FF network in each Transformer block. This is normally set to 4.

4
activation str

Transformer Encoder activation function. 'tanh', 'relu', 'leaky_relu', 'gelu', 'geglu' and 'reglu' are supported

'gelu'
padding_idx int

index of the padding token in the padded-tokenised sequences.

0
with_cls_token bool

Boolean indicating if a '[CLS]' token is included in the tokenized sequences. If present, the final hidden state corresponding to this token is used as the aggregated representation for classification and regression tasks. NOTE: if included in the tokenized sequences it must be inserted as the first token in the sequences.

False
with_pos_encoding bool

Boolean indicating if positional encoding will be used

True
pos_encoding_dropout float

Positional encoding dropout

0.1
pos_encoder Optional[Module]

By default, this model uses a standard positional encoding approach. However, any custom positional encoder can also be used and passed to the Transformer model via the 'pos_encoder' parameter

None

Attributes:

Name Type Description
embedding Module

Standard token embedding layer

pos_encoder Module

Positional Encoder

encoder Module

Sequence of Transformer blocks

Examples:

>>> import torch
>>> from pytorch_widedeep.models.rec import Transformer
>>> X = torch.cat((torch.zeros([5,1]), torch.empty(5, 4).random_(1,4)), axis=1)
>>> model = Transformer(vocab_size=4, seq_length=5, input_dim=8, n_heads=1, n_blocks=1)
>>> out = model(X)
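Building on the example above, the following sketch shows how a custom positional encoder might be passed via the pos_encoder parameter. The LearnedPositionalEncoding class is a hypothetical example written for this illustration, not part of the library; the only assumption about the interface is that the encoder receives and returns a tensor of shape (batch_size, seq_length, input_dim), as the source code below indicates.

import torch
import torch.nn as nn
from pytorch_widedeep.models.rec import Transformer

class LearnedPositionalEncoding(nn.Module):
    # Hypothetical encoder that adds a learned embedding per position
    def __init__(self, input_dim: int, seq_length: int, dropout: float = 0.1):
        super().__init__()
        self.pos_emb = nn.Parameter(torch.zeros(1, seq_length, input_dim))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch_size, seq_length, input_dim)
        return self.dropout(x + self.pos_emb[:, : x.size(1), :])

X = torch.cat((torch.zeros([5, 1]), torch.empty(5, 4).random_(1, 4)), axis=1)
model = Transformer(
    vocab_size=4,
    seq_length=5,
    input_dim=8,
    n_heads=1,
    n_blocks=1,
    pos_encoder=LearnedPositionalEncoding(input_dim=8, seq_length=5),
)
out = model(X)  # shape (5, 40): input_dim * seq_length, since with_cls_token=False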
Source code in pytorch_widedeep/models/rec/basic_transformer.py
class Transformer(nn.Module):
    r"""
    Basic Encoder-Only Transformer Model for sequence
    classification/regression. Like all other models in the library, this model
    can be used as the `deeptext` component of a Wide & Deep model or
    independently by itself.

    :information_source: **NOTE**:
    This model is introduced in the context of recommendation systems and is
    intended for sequences of any nature (e.g. items). It can, of course,
    still be used for text.

    Parameters
    ----------
    vocab_size: int
        Number of words in the vocabulary
    input_dim: int
        Dimension of the token embeddings

        Param aliases: `embed_dim`, `d_model`. <br/>
    seq_length: int
        Input sequence length
    n_heads: int
        Number of attention heads per Transformer block
    n_blocks: int
        Number of Transformer blocks
    attn_dropout: float, default = 0.1
        Dropout that will be applied to the Multi-Head Attention layers
    ff_dropout: float, default = 0.1
        Dropout that will be applied to the FeedForward network
    ff_factor: int, default = 4
        Multiplicative factor applied to the first layer of the FF network in
        each Transformer block. This is normally set to 4.
    activation: str, default = "gelu"
        Transformer Encoder activation function. _'tanh'_, _'relu'_,
        _'leaky_relu'_, _'gelu'_, _'geglu'_ and _'reglu'_ are supported
    padding_idx: int, default = 0
        index of the padding token in the padded-tokenised sequences.
    with_cls_token: bool, default = False
        Boolean indicating if a `'[CLS]'` token is included in the tokenized
        sequences. If present, the final hidden state corresponding to this
        token is used as the aggregated representation for classification and
        regression tasks. **NOTE**: if included in the tokenized sequences it
        must be inserted as the first token in the sequences.
    with_pos_encoding: bool, default = True
        Boolean indicating if positional encoding will be used
    pos_encoding_dropout: float, default = 0.1
        Positional encoding dropout
    pos_encoder: nn.Module, Optional, default = None
        This model uses by default a standard positional encoding approach.
        However, any custom positional encoder can also be used and passed to
        the Transformer model via the 'pos_encoder' parameter

    Attributes
    ----------
    embedding: nn.Module
        Standard token embedding layer
    pos_encoder: nn.Module
        Positional Encoder
    encoder: nn.Module
        Sequence of Transformer blocks

    Examples
    --------
    >>> import torch
    >>> from pytorch_widedeep.models.rec import Transformer
    >>> X = torch.cat((torch.zeros([5,1]), torch.empty(5, 4).random_(1,4)), axis=1)
    >>> model = Transformer(vocab_size=4, seq_length=5, input_dim=8, n_heads=1, n_blocks=1)
    >>> out = model(X)
    """

    @alias("input_dim", ["embed_dim", "d_model"])
    @alias("seq_length", ["max_length", "maxlen"])
    def __init__(
        self,
        *,
        vocab_size: int,
        seq_length: int,
        input_dim: int,
        n_heads: int,
        n_blocks: int,
        attn_dropout: float = 0.1,
        ff_dropout: float = 0.1,
        ff_factor: int = 4,
        activation: str = "gelu",
        use_linear_attention: bool = False,
        use_flash_attention: bool = False,
        padding_idx: int = 0,
        with_cls_token: bool = False,
        with_pos_encoding: bool = True,
        pos_encoding_dropout: float = 0.1,
        pos_encoder: Optional[nn.Module] = None,
    ):
        super().__init__()

        self.input_dim = input_dim
        self.seq_length = seq_length
        self.n_heads = n_heads
        self.n_blocks = n_blocks
        self.attn_dropout = attn_dropout
        self.ff_dropout = ff_dropout
        self.ff_factor = ff_factor
        self.activation = activation
        self.use_linear_attention = use_linear_attention
        self.use_flash_attention = use_flash_attention
        self.padding_idx = padding_idx
        self.with_cls_token = with_cls_token
        self.with_pos_encoding = with_pos_encoding
        self.pos_encoding_dropout = pos_encoding_dropout

        self.embedding = nn.Embedding(
            vocab_size, input_dim, padding_idx=self.padding_idx
        )

        if with_pos_encoding:
            if pos_encoder is not None:
                self.pos_encoder: Union[nn.Module, nn.Identity, PositionalEncoding] = (
                    pos_encoder
                )
            else:
                self.pos_encoder = PositionalEncoding(
                    input_dim, pos_encoding_dropout, seq_length
                )
        else:
            self.pos_encoder = nn.Identity()

        self.encoder = nn.Sequential()
        for i in range(n_blocks):
            self.encoder.add_module(
                "transformer_block" + str(i),
                TransformerEncoder(
                    input_dim,
                    n_heads,
                    False,  # use_qkv_bias
                    attn_dropout,
                    ff_dropout,
                    ff_factor,
                    activation,
                    use_linear_attention,
                    use_flash_attention,
                ),
            )

    def forward(self, X: Tensor) -> Tensor:
        x = self.embedding(X.long())
        x = self.pos_encoder(x)
        x = self.encoder(x)
        if self.with_cls_token:
            x = x[:, 0, :]
        else:
            x = x.flatten(1)
        return x

    @property
    def output_dim(self) -> int:
        if self.with_cls_token:
            output_dim = self.input_dim
        else:
            output_dim = self.input_dim * self.seq_length
        return output_dim

DeepCrossNetwork

Bases: BaseTabularModelWithoutAttention

Defines a DeepCrossNetwork model that can be used as the deeptabular component of a Wide & Deep model or independently by itself.

This class implements the Deep & Cross Network for Ad Click Predictions architecture, which automatically combines features to generate explicit feature interactions at each layer.

The cross layer implements the following equation:

\[x_{l+1} = x_0 \odot (W_l x_l + b_l) + x_l\]

where:

  • \(\odot\) represents element-wise multiplication
  • \(x_l\), \(x_{l+1}\) are the outputs from the \(l^{th}\) and \((l+1)^{th}\) cross layers
  • \(W_l\), \(b_l\) are the weight and bias parameters
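To make the equation above concrete, here is a minimal, self-contained sketch of a single cross layer in PyTorch. It is not the library's internal CrossNetwork module; the class name CrossLayer and the tensor sizes are illustrative assumptions.

import torch
import torch.nn as nn

class CrossLayer(nn.Module):
    # One cross layer: x_{l+1} = x_0 * (W_l x_l + b_l) + x_l
    def __init__(self, input_dim: int):
        super().__init__()
        self.linear = nn.Linear(input_dim, input_dim)  # W_l and b_l

    def forward(self, x0: torch.Tensor, xl: torch.Tensor) -> torch.Tensor:
        return x0 * self.linear(xl) + xl

x0 = torch.rand(5, 16)            # flattened embeddings for a batch of 5
layer = CrossLayer(input_dim=16)
x1 = layer(x0, x0)                # the first layer uses x_l = x_0; output keeps shape (5, 16)

Because the element-wise product and the residual term preserve the input dimension, cross layers can be stacked to any depth without changing the feature size.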

Parameters:

Name Type Description Default
column_idx Dict[str, int]

Dict containing the index of the columns that will be passed through the DeepCrossNetwork model. Required to slice the tensors. e.g. {'education': 0, 'relationship': 1, 'workclass': 2, ...}.

required
n_cross_layers int

Number of cross layers in the cross network

3
cat_embed_input Optional[List[Tuple[str, int, int]]]

List of Tuples with the column name, number of unique values and embedding dimension. e.g. [(education, 11, 32), ...]

None
cat_embed_dropout Optional[float]

Categorical embeddings dropout. If None, it will default to 0.

None
use_cat_bias Optional[bool]

Boolean indicating if bias will be used for the categorical embeddings

None
cat_embed_activation Optional[str]

Activation function for the categorical embeddings

None
continuous_cols Optional[List[str]]

List with the name of the numeric (aka continuous) columns

None
cont_norm_layer Optional[Literal['batchnorm', 'layernorm']]

Type of normalization layer applied to the continuous features. Options are: 'layernorm' and 'batchnorm'. if None, no normalization layer will be used.

None
embed_continuous_method Optional[Literal['standard', 'piecewise', 'periodic']]

Method to use to embed the continuous features. Options are: 'standard', 'periodic' or 'piecewise'. The 'standard' embedding method is based on the FT-Transformer implementation presented in the paper: Revisiting Deep Learning Models for Tabular Data. The 'periodic' and 'piecewise' methods were presented in the paper: On Embeddings for Numerical Features in Tabular Deep Learning. Please read the papers for details.

None
cont_embed_dropout Optional[float]

Dropout for the continuous embeddings. If None, it will default to 0.0

None
cont_embed_activation Optional[str]

Activation function for the continuous embeddings if any. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported. If None, no activation function will be applied.

None
quantization_setup Optional[Dict[str, List[float]]]

This parameter is used when the 'piecewise' method is used to embed the continuous cols. It is a dict where keys are the name of the continuous columns and values are lists with the boundaries for the quantization of the continuous_cols. See the examples for details. If the 'piecewise' method is used, this parameter is required.

None
n_frequencies Optional[int]

This is the so called 'k' in their paper On Embeddings for Numerical Features in Tabular Deep Learning, and is the number of 'frequencies' that will be used to represent each continuous column. See their Eq 2 in the paper for details. If the 'periodic' method is used, this parameter is required.

None
sigma Optional[float]

This is the sigma parameter in the paper mentioned when describing the previous parameters and it is used to initialise the 'frequency weights'. See their Eq 2 in the paper for details. If the 'periodic' method is used, this parameter is required.

None
share_last_layer Optional[bool]

This parameter is not present in the before mentioned paper but it is implemented in the official repo. If True the linear layer that turns the frequencies into embeddings will be shared across the continuous columns. If False a different linear layer will be used for each continuous column. If the 'periodic' method is used, this parameter is required.

None
full_embed_dropout Optional[bool]

If True, the full embedding corresponding to a column will be masked out/dropout. If None, it will default to False.

None
mlp_hidden_dims List[int]

List with the number of neurons per dense layer in the mlp.

[200, 100]
mlp_activation str

Activation function for the dense layers of the MLP. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported

'relu'
mlp_dropout Union[float, List[float]]

float or List of floats with the dropout between the dense layers. e.g: [0.5,0.5]

0.1
mlp_batchnorm bool

Boolean indicating whether or not batch normalization will be applied to the dense layers

False
mlp_batchnorm_last bool

Boolean indicating whether or not batch normalization will be applied to the last of the dense layers

False
mlp_linear_first bool

Boolean indicating the order of the operations in the dense layer. If True: [LIN -> ACT -> BN -> DP]. If False: [BN -> DP -> LIN -> ACT]

True

Attributes:

Name Type Description
cross_network Module

The cross network component

deep_network Module

The deep network component (MLP)

Examples:

>>> import torch
>>> from pytorch_widedeep.models.rec import DeepCrossNetwork
>>> X_tab = torch.cat((torch.empty(5, 4).random_(4), torch.rand(5, 1)), axis=1)
>>> colnames = ["a", "b", "c", "d", "e"]
>>> cat_embed_input = [(u, i, j) for u, i, j in zip(colnames[:4], [4] * 4, [8] * 4)]
>>> column_idx = {k: v for v, k in enumerate(colnames)}
>>> model = DeepCrossNetwork(
...     column_idx=column_idx,
...     cat_embed_input=cat_embed_input,
...     continuous_cols=["e"],
...     n_cross_layers=2,
...     mlp_hidden_dims=[16, 8]
... )
>>> out = model(X_tab)
Source code in pytorch_widedeep/models/rec/dcn.py
class DeepCrossNetwork(BaseTabularModelWithoutAttention):
    r"""Defines a `DeepCrossNetwork` model that can be used as the `deeptabular`
    component of a Wide & Deep model or independently by itself.

    This class implements the [Deep & Cross Network for Ad Click Predictions](https://arxiv.org/abs/1708.05123)
    architecture, which automatically combines features to generate explicit
    feature interactions at each layer.

    The cross layer implements the following equation:

    $$x_{l+1} = x_0 \odot (W_l x_l + b_l) + x_l$$

    where:

    * $\odot$ represents element-wise multiplication
    * $x_l$, $x_{l+1}$ are the outputs from the $l^{th}$ and $(l+1)^{th}$ cross layers
    * $W_l$, $b_l$ are the weight and bias parameters

    Parameters
    ----------
    column_idx: Dict
        Dict containing the index of the columns that will be passed through
        the `DeepCrossNetwork` model. Required to slice the tensors. e.g.
        _{'education': 0, 'relationship': 1, 'workclass': 2, ...}_.
    n_cross_layers: int, default = 3
        Number of cross layers in the cross network
    cat_embed_input: List, Optional, default = None
        List of Tuples with the column name, number of unique values and
        embedding dimension. e.g. _[(education, 11, 32), ...]_
    cat_embed_dropout: float, Optional, default = None
        Categorical embeddings dropout. If `None`, it will default to 0.
    use_cat_bias: bool, Optional, default = None,
        Boolean indicating if bias will be used for the categorical embeddings
    cat_embed_activation: Optional, str, default = None,
        Activation function for the categorical embeddings
    continuous_cols : Optional[List[str]], default=None
        List with the name of the numeric (aka continuous) columns
    cont_norm_layer : Optional[Literal["batchnorm", "layernorm"]], default=None
        Type of normalization layer applied to the continuous features.
        Options are: _'layernorm'_ and _'batchnorm'_. if `None`, no
        normalization layer will be used.
    embed_continuous_method: Optional[Literal["piecewise", "periodic", "standard"]], default="standard"
        Method to use to embed the continuous features. Options are:
        _'standard'_, _'periodic'_ or _'piecewise'_. The _'standard'_
        embedding method is based on the FT-Transformer implementation
        presented in the paper: [Revisiting Deep Learning Models for
        Tabular Data](https://arxiv.org/abs/2106.11959v5). The _'periodic'_
        and _'piecewise'_ methods were presented in the paper: [On Embeddings for
        Numerical Features in Tabular Deep Learning](https://arxiv.org/abs/2203.05556).
        Please, read the papers for details.
    cont_embed_dropout : Optional[float], default=None
        Dropout for the continuous embeddings. If `None`, it will default to 0.0
    cont_embed_activation : Optional[str], default=None
        Activation function for the continuous embeddings if any. Currently
        _'tanh'_, _'relu'_, _'leaky_relu'_ and _'gelu'_ are supported.
        If `None`, no activation function will be applied.
    quantization_setup : Optional[Dict[str, List[float]]], default=None
        This parameter is used when the _'piecewise'_ method is used to embed
        the continuous cols. It is a dict where keys are the name of the continuous
        columns and values are lists with the boundaries for the quantization
        of the continuous_cols. See the examples for details. If the
        _'piecewise'_ method is used, this parameter is required.
    n_frequencies : Optional[int], default=None
        This is the so called _'k'_ in their paper [On Embeddings for
        Numerical Features in Tabular Deep Learning](https://arxiv.org/abs/2203.05556),
        and is the number of 'frequencies' that will be used to represent each
        continuous column. See their Eq 2 in the paper for details. If
        the _'periodic'_ method is used, this parameter is required.
    sigma : Optional[float], default=None
        This is the sigma parameter in the paper mentioned when describing the
        previous parameters and it is used to initialise the 'frequency
        weights'. See their Eq 2 in the paper for details. If
        the _'periodic'_ method is used, this parameter is required.
    share_last_layer : Optional[bool], default=None
        This parameter is not present in the before mentioned paper but it is implemented in
        the [official repo](https://github.com/yandex-research/rtdl-num-embeddings/tree/main).
        If `True` the linear layer that turns the frequencies into embeddings
        will be shared across the continuous columns. If `False` a different
        linear layer will be used for each continuous column.
        If the _'periodic'_ method is used, this parameter is required.
    full_embed_dropout: bool, Optional, default = None,
        If `True`, the full embedding corresponding to a column will be masked
        out/dropout. If `None`, it will default to `False`.
    mlp_hidden_dims: List, default = [200, 100]
        List with the number of neurons per dense layer in the mlp.
    mlp_activation: str, default = "relu"
        Activation function for the dense layers of the MLP. Currently
        _'tanh'_, _'relu'_, _'leaky_relu'_ and _'gelu'_ are supported
    mlp_dropout: float or List, default = 0.1
        float or List of floats with the dropout between the dense layers.
        e.g: _[0.5,0.5]_
    mlp_batchnorm: bool, default = False
        Boolean indicating whether or not batch normalization will be applied
        to the dense layers
    mlp_batchnorm_last: bool, default = False
        Boolean indicating whether or not batch normalization will be applied
        to the last of the dense layers
    mlp_linear_first: bool, default = True
        Boolean indicating the order of the operations in the dense
        layer. If `True: [LIN -> ACT -> BN -> DP]`. If `False: [BN -> DP ->
        LIN -> ACT]`

    Attributes
    ----------
    cross_network: nn.Module
        The cross network component
    deep_network: nn.Module
        The deep network component (MLP)

    Examples
    --------
    >>> import torch
    >>> from pytorch_widedeep.models.rec import DeepCrossNetwork
    >>> X_tab = torch.cat((torch.empty(5, 4).random_(4), torch.rand(5, 1)), axis=1)
    >>> colnames = ["a", "b", "c", "d", "e"]
    >>> cat_embed_input = [(u, i, j) for u, i, j in zip(colnames[:4], [4] * 4, [8] * 4)]
    >>> column_idx = {k: v for v, k in enumerate(colnames)}
    >>> model = DeepCrossNetwork(
    ...     column_idx=column_idx,
    ...     cat_embed_input=cat_embed_input,
    ...     continuous_cols=["e"],
    ...     n_cross_layers=2,
    ...     mlp_hidden_dims=[16, 8]
    ... )
    >>> out = model(X_tab)
    """

    def __init__(
        self,
        *,
        column_idx: Dict[str, int],
        n_cross_layers: int = 3,
        cat_embed_input: Optional[List[Tuple[str, int, int]]] = None,
        cat_embed_dropout: Optional[float] = None,
        use_cat_bias: Optional[bool] = None,
        cat_embed_activation: Optional[str] = None,
        continuous_cols: Optional[List[str]] = None,
        cont_norm_layer: Optional[Literal["batchnorm", "layernorm"]] = None,
        embed_continuous: Optional[bool] = None,
        embed_continuous_method: Optional[
            Literal["standard", "piecewise", "periodic"]
        ] = None,
        cont_embed_dim: Optional[int] = None,
        cont_embed_dropout: Optional[float] = None,
        cont_embed_activation: Optional[str] = None,
        quantization_setup: Optional[Dict[str, List[float]]] = None,
        n_frequencies: Optional[int] = None,
        sigma: Optional[float] = None,
        share_last_layer: Optional[bool] = None,
        full_embed_dropout: Optional[bool] = None,
        mlp_hidden_dims: List[int] = [200, 100],
        mlp_activation: str = "relu",
        mlp_dropout: Union[float, List[float]] = 0.1,
        mlp_batchnorm: bool = False,
        mlp_batchnorm_last: bool = False,
        mlp_linear_first: bool = True,
    ):
        super(DeepCrossNetwork, self).__init__(
            column_idx=column_idx,
            cat_embed_input=cat_embed_input,
            cat_embed_dropout=cat_embed_dropout,
            use_cat_bias=use_cat_bias,
            cat_embed_activation=cat_embed_activation,
            continuous_cols=continuous_cols,
            cont_norm_layer=cont_norm_layer,
            embed_continuous=embed_continuous,
            embed_continuous_method=embed_continuous_method,
            cont_embed_dim=cont_embed_dim,
            cont_embed_dropout=cont_embed_dropout,
            cont_embed_activation=cont_embed_activation,
            quantization_setup=quantization_setup,
            n_frequencies=n_frequencies,
            sigma=sigma,
            share_last_layer=share_last_layer,
            full_embed_dropout=full_embed_dropout,
        )

        self.mlp_hidden_dims = mlp_hidden_dims
        self.mlp_activation = mlp_activation
        self.mlp_dropout = mlp_dropout
        self.mlp_batchnorm = mlp_batchnorm
        self.mlp_batchnorm_last = mlp_batchnorm_last
        self.mlp_linear_first = mlp_linear_first

        embeddings_output_dim = self.cat_out_dim + self.cont_out_dim
        self.deep_network = MLP(
            [embeddings_output_dim] + self.mlp_hidden_dims,
            self.mlp_activation,
            self.mlp_dropout,
            self.mlp_batchnorm,
            self.mlp_batchnorm_last,
            self.mlp_linear_first,
        )
        self.cross_network = CrossNetwork(embeddings_output_dim, n_cross_layers)

    def forward(self, X: torch.Tensor) -> torch.Tensor:
        x = self._get_embeddings(X)
        cross_output = self.cross_network(x)
        deep_output = self.deep_network(x)
        return torch.cat([cross_output, deep_output], dim=1)

    @property
    def output_dim(self) -> int:
        return self.mlp_hidden_dims[-1] + self.cat_out_dim + self.cont_out_dim

DeepCrossNetworkV2

Bases: BaseTabularModelWithoutAttention

Defines a DeepCrossNetworkV2 model that can be used as the deeptabular component of a Wide & Deep model or independently by itself.

This class implements the DCN V2: Improved Deep & Cross Network and Practical Lessons for Web-scale Learning to Rank Systems architecture, which enhances the original DCN by introducing a more expressive cross network that uses multiple experts and matrix decomposition techniques to improve model capacity while maintaining computational efficiency.

The cross layer implements the following equation:

\[E_i(x_l) = x_0 \odot (U_l^i \cdot g(C_l^i \cdot g((V_l^i)^T x_l)) + b_l)\]

where:

  • \(\odot\) represents element-wise multiplication
  • \(U_l^i\), \(C_l^i\), \(V_l^i\) are the decomposed weight matrices for expert \(i\) at layer \(l\)
  • \(g\) is the activation function (ReLU)
  • \(b_l\) is the bias term
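As a rough illustration of the expert equation above, here is a hedged sketch of one low-rank cross expert. It is not the library's CrossNetworkV2 implementation; the class name LowRankCrossExpert and the dimensions are assumptions made for the example.

import torch
import torch.nn as nn

class LowRankCrossExpert(nn.Module):
    # One expert: E_i(x_l) = x_0 * (U_l^i g(C_l^i g((V_l^i)^T x_l)) + b_l)
    def __init__(self, input_dim: int, low_rank: int):
        super().__init__()
        self.V = nn.Linear(input_dim, low_rank, bias=False)  # (V_l^i)^T projection down to the low rank
        self.C = nn.Linear(low_rank, low_rank, bias=False)   # C_l^i
        self.U = nn.Linear(low_rank, input_dim, bias=True)   # U_l^i plus the bias b_l
        self.g = nn.ReLU()

    def forward(self, x0: torch.Tensor, xl: torch.Tensor) -> torch.Tensor:
        return x0 * self.U(self.g(self.C(self.g(self.V(xl)))))

x0 = torch.rand(5, 16)
expert = LowRankCrossExpert(input_dim=16, low_rank=4)
out = expert(x0, x0)  # (5, 16)

In the full model, num_experts of these experts are combined per layer by a learned gate and a residual term (plus x_l) is added, mirroring the original cross layer.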

Parameters:

Name Type Description Default
column_idx Dict[str, int]

Dict containing the index of the columns that will be passed through the DeepCrossNetworkV2 model. Required to slice the tensors. e.g. {'education': 0, 'relationship': 1, 'workclass': 2, ...}.

required
num_cross_layers int

Number of cross layers in the cross network

2
low_rank Optional[int]

Rank of the weight matrix decomposition. If None, full-rank weights are used

None
num_experts int

Number of expert networks in mixture of experts

2
expert_dropout float

Dropout rate for expert outputs

0.0
structure Literal['stacked', 'parallel']

Structure of the model. Either 'parallel' or 'stacked'. If 'parallel', the output will be the concatenation of the cross network and deep network outputs. If 'stacked', the cross network output will be fed into the deep network.

'parallel'
cat_embed_input Optional[List[Tuple[str, int, int]]]

List of Tuples with the column name, number of unique values and embedding dimension. e.g. [(education, 11, 32), ...]

None
cat_embed_dropout Optional[float]

Categorical embeddings dropout. If None, it will default to 0.

None
use_cat_bias Optional[bool]

Boolean indicating if bias will be used for the categorical embeddings

None
cat_embed_activation Optional[str]

Activation function for the categorical embeddings

None
continuous_cols Optional[List[str]]

List with the name of the numeric (aka continuous) columns

None
cont_norm_layer Optional[Literal['batchnorm', 'layernorm']]

Type of normalization layer applied to the continuous features. Options are: 'layernorm' and 'batchnorm'. if None, no normalization layer will be used.

None
embed_continuous_method Optional[Literal['standard', 'piecewise', 'periodic']]

Method to use to embed the continuous features. Options are: 'standard', 'periodic' or 'piecewise'. The 'standard' embedding method is based on the FT-Transformer implementation presented in the paper: Revisiting Deep Learning Models for Tabular Data. The 'periodic' and 'piecewise' methods were presented in the paper: On Embeddings for Numerical Features in Tabular Deep Learning. Please read the papers for details.

None
cont_embed_dropout Optional[float]

Dropout for the continuous embeddings. If None, it will default to 0.0

None
cont_embed_activation Optional[str]

Activation function for the continuous embeddings if any. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported. If None, no activation function will be applied.

None
quantization_setup Optional[Dict[str, List[float]]]

This parameter is used when the 'piecewise' method is used to embed the continuous cols. It is a dict where keys are the name of the continuous columns and values are lists with the boundaries for the quantization of the continuous_cols. See the examples for details. If the 'piecewise' method is used, this parameter is required.

None
n_frequencies Optional[int]

This is the so called 'k' in their paper On Embeddings for Numerical Features in Tabular Deep Learning, and is the number of 'frequencies' that will be used to represent each continuous column. See their Eq 2 in the paper for details. If the 'periodic' method is used, this parameter is required.

None
sigma Optional[float]

This is the sigma parameter in the paper mentioned when describing the previous parameters and it is used to initialise the 'frequency weights'. See their Eq 2 in the paper for details. If the 'periodic' method is used, this parameter is required.

None
share_last_layer Optional[bool]

This parameter is not present in the before mentioned paper but it is implemented in the official repo. If True the linear layer that turns the frequencies into embeddings will be shared across the continuous columns. If False a different linear layer will be used for each continuous column. If the 'periodic' method is used, this parameter is required.

None
full_embed_dropout Optional[bool]

If True, the full embedding corresponding to a column will be masked out/dropout. If None, it will default to False.

None
mlp_hidden_dims List[int]

List with the number of neurons per dense layer in the mlp.

[200, 100]
mlp_activation str

Activation function for the dense layers of the MLP. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported

'relu'
mlp_dropout Union[float, List[float]]

float or List of floats with the dropout between the dense layers. e.g: [0.5,0.5]

0.1
mlp_batchnorm bool

Boolean indicating whether or not batch normalization will be applied to the dense layers

False
mlp_batchnorm_last bool

Boolean indicating whether or not batch normalization will be applied to the last of the dense layers

False
mlp_linear_first bool

Boolean indicating the order of the operations in the dense layer. If True: [LIN -> ACT -> BN -> DP]. If False: [BN -> DP -> LIN -> ACT]

True

Attributes:

Name Type Description
cross_network Module

The cross network component with mixture of experts

deep_network Module

The deep network component (MLP)

Examples:

>>> import torch
>>> from pytorch_widedeep.models.rec import DeepCrossNetworkV2
>>> X_tab = torch.cat((torch.empty(5, 4).random_(4), torch.rand(5, 1)), axis=1)
>>> colnames = ["a", "b", "c", "d", "e"]
>>> cat_embed_input = [(u, i, j) for u, i, j in zip(colnames[:4], [4] * 4, [8] * 4)]
>>> column_idx = {k: v for v, k in enumerate(colnames)}
>>> model = DeepCrossNetworkV2(
...     column_idx=column_idx,
...     cat_embed_input=cat_embed_input,
...     continuous_cols=["e"],
...     num_cross_layers=2,
...     low_rank=32,
...     num_experts=4,
...     mlp_hidden_dims=[16, 8]
... )
>>> out = model(X_tab)
Source code in pytorch_widedeep/models/rec/dcnv2.py
class DeepCrossNetworkV2(BaseTabularModelWithoutAttention):
    r"""Defines a `DeepCrossNetworkV2` model that can be used as the `deeptabular`
    component of a Wide & Deep model or independently by itself.

    This class implements the
    [DCN V2: Improved Deep & Cross Network and Practical Lessons for Web-scale Learning to Rank Systems](https://arxiv.org/pdf/2008.13535)
    architecture, which enhances the original DCN by introducing a more expressive cross
    network that uses multiple experts and matrix decomposition techniques to improve
    model capacity while maintaining computational efficiency.

    The cross layer implements the following equation:

    $$E_i(x_l) = x_0 \odot (U_l^i \cdot g(C_l^i \cdot g((V_l^i)^T x_l)) + b_l)$$

    where:

    * $\odot$ represents element-wise multiplication
    * $U_l^i$, $C_l^i$, $V_l^i$ are the decomposed weight matrices for expert $i$ at layer $l$
    * $g$ is the activation function (ReLU)
    * $b_l$ is the bias term

    Parameters
    ----------
    column_idx: Dict
        Dict containing the index of the columns that will be passed through
        the `DeepCrossNetworkV2` model. Required to slice the tensors. e.g.
        _{'education': 0, 'relationship': 1, 'workclass': 2, ...}_.
    num_cross_layers: int, default = 2
        Number of cross layers in the cross network
    low_rank: int, Optional, default = None
        Rank of the weight matrix decomposition. If None, full-rank weights are used
    num_experts: int, default = 2
        Number of expert networks in mixture of experts
    expert_dropout: float, default = 0.0
        Dropout rate for expert outputs
    structure: str, default = "parallel"
        Structure of the model. Either _'parallel'_ or _'stacked'_. If _'parallel'_,
        the output will be the concatenation of the cross network and deep network
        outputs. If _'stacked'_, the cross network output will be fed into the deep
        network.
    cat_embed_input: List, Optional, default = None
        List of Tuples with the column name, number of unique values and
        embedding dimension. e.g. _[(education, 11, 32), ...]_
    cat_embed_dropout: float, Optional, default = None
        Categorical embeddings dropout. If `None`, it will default to 0.
    use_cat_bias: bool, Optional, default = None,
        Boolean indicating if bias will be used for the categorical embeddings
    cat_embed_activation: Optional, str, default = None,
        Activation function for the categorical embeddings
    continuous_cols : Optional[List[str]], default=None
        List with the name of the numeric (aka continuous) columns
    cont_norm_layer : Optional[Literal["batchnorm", "layernorm"]], default=None
        Type of normalization layer applied to the continuous features.
        Options are: _'layernorm'_ and _'batchnorm'_. if `None`, no
        normalization layer will be used.
    embed_continuous_method: Optional[Literal["piecewise", "periodic", "standard"]], default="standard"
        Method to use to embed the continuous features. Options are:
        _'standard'_, _'periodic'_ or _'piecewise'_. The _'standard'_
        embedding method is based on the FT-Transformer implementation
        presented in the paper: [Revisiting Deep Learning Models for
        Tabular Data](https://arxiv.org/abs/2106.11959v5). The _'periodic'_
        and _'piecewise'_ methods were presented in the paper: [On Embeddings for
        Numerical Features in Tabular Deep Learning](https://arxiv.org/abs/2203.05556).
        Please, read the papers for details.
    cont_embed_dropout : Optional[float], default=None
        Dropout for the continuous embeddings. If `None`, it will default to 0.0
    cont_embed_activation : Optional[str], default=None
        Activation function for the continuous embeddings if any. Currently
        _'tanh'_, _'relu'_, _'leaky_relu'_ and _'gelu'_ are supported.
        If `None`, no activation function will be applied.
    quantization_setup : Optional[Dict[str, List[float]]], default=None
        This parameter is used when the _'piecewise'_ method is used to embed
        the continuous cols. It is a dict where keys are the name of the continuous
        columns and values are lists with the boundaries for the quantization
        of the continuous_cols. See the examples for details. If the
        _'piecewise'_ method is used, this parameter is required.
    n_frequencies : Optional[int], default=None
        This is the so called _'k'_ in their paper [On Embeddings for
        Numerical Features in Tabular Deep Learning](https://arxiv.org/abs/2203.05556),
        and is the number of 'frequencies' that will be used to represent each
        continuous column. See their Eq 2 in the paper for details. If
        the _'periodic'_ method is used, this parameter is required.
    sigma : Optional[float], default=None
        This is the sigma parameter in the paper mentioned when describing the
        previous parameters and it is used to initialise the 'frequency
        weights'. See their Eq 2 in the paper for details. If
        the _'periodic'_ method is used, this parameter is required.
    share_last_layer : Optional[bool], default=None
        This parameter is not present in the before mentioned paper but it is implemented in
        the [official repo](https://github.com/yandex-research/rtdl-num-embeddings/tree/main).
        If `True` the linear layer that turns the frequencies into embeddings
        will be shared across the continuous columns. If `False` a different
        linear layer will be used for each continuous column.
        If the _'periodic'_ method is used, this parameter is required.
    full_embed_dropout: bool, Optional, default = None,
        If `True`, the full embedding corresponding to a column will be masked
        out/dropout. If `None`, it will default to `False`.
    mlp_hidden_dims: List, default = [200, 100]
        List with the number of neurons per dense layer in the mlp.
    mlp_activation: str, default = "relu"
        Activation function for the dense layers of the MLP. Currently
        _'tanh'_, _'relu'_, _'leaky_relu'_ and _'gelu'_ are supported
    mlp_dropout: float or List, default = 0.1
        float or List of floats with the dropout between the dense layers.
        e.g: _[0.5,0.5]_
    mlp_batchnorm: bool, default = False
        Boolean indicating whether or not batch normalization will be applied
        to the dense layers
    mlp_batchnorm_last: bool, default = False
        Boolean indicating whether or not batch normalization will be applied
        to the last of the dense layers
    mlp_linear_first: bool, default = True
        Boolean indicating the order of the operations in the dense
        layer. If `True: [LIN -> ACT -> BN -> DP]`. If `False: [BN -> DP ->
        LIN -> ACT]`

    Attributes
    ----------
    cross_network: nn.Module
        The cross network component with mixture of experts
    deep_network: nn.Module
        The deep network component (MLP)

    Examples
    --------
    >>> import torch
    >>> from pytorch_widedeep.models.rec import DeepCrossNetworkV2
    >>> X_tab = torch.cat((torch.empty(5, 4).random_(4), torch.rand(5, 1)), axis=1)
    >>> colnames = ["a", "b", "c", "d", "e"]
    >>> cat_embed_input = [(u, i, j) for u, i, j in zip(colnames[:4], [4] * 4, [8] * 4)]
    >>> column_idx = {k: v for v, k in enumerate(colnames)}
    >>> model = DeepCrossNetworkV2(
    ...     column_idx=column_idx,
    ...     cat_embed_input=cat_embed_input,
    ...     continuous_cols=["e"],
    ...     num_cross_layers=2,
    ...     low_rank=32,
    ...     num_experts=4,
    ...     mlp_hidden_dims=[16, 8]
    ... )
    >>> out = model(X_tab)
    """

    def __init__(
        self,
        *,
        column_idx: Dict[str, int],
        num_cross_layers: int = 2,
        low_rank: Optional[int] = None,
        num_experts: int = 2,
        expert_dropout: float = 0.0,
        structure: Literal["stacked", "parallel"] = "parallel",
        cat_embed_input: Optional[List[Tuple[str, int, int]]] = None,
        cat_embed_dropout: Optional[float] = None,
        use_cat_bias: Optional[bool] = None,
        cat_embed_activation: Optional[str] = None,
        continuous_cols: Optional[List[str]] = None,
        cont_norm_layer: Optional[Literal["batchnorm", "layernorm"]] = None,
        embed_continuous: Optional[bool] = None,
        embed_continuous_method: Optional[
            Literal["standard", "piecewise", "periodic"]
        ] = None,
        cont_embed_dim: Optional[int] = None,
        cont_embed_dropout: Optional[float] = None,
        cont_embed_activation: Optional[str] = None,
        quantization_setup: Optional[Dict[str, List[float]]] = None,
        n_frequencies: Optional[int] = None,
        sigma: Optional[float] = None,
        share_last_layer: Optional[bool] = None,
        full_embed_dropout: Optional[bool] = None,
        mlp_hidden_dims: List[int] = [200, 100],
        mlp_activation: str = "relu",
        mlp_dropout: Union[float, List[float]] = 0.1,
        mlp_batchnorm: bool = False,
        mlp_batchnorm_last: bool = False,
        mlp_linear_first: bool = True,
    ):
        super(DeepCrossNetworkV2, self).__init__(
            column_idx=column_idx,
            cat_embed_input=cat_embed_input,
            cat_embed_dropout=cat_embed_dropout,
            use_cat_bias=use_cat_bias,
            cat_embed_activation=cat_embed_activation,
            continuous_cols=continuous_cols,
            cont_norm_layer=cont_norm_layer,
            embed_continuous=embed_continuous,
            embed_continuous_method=embed_continuous_method,
            cont_embed_dim=cont_embed_dim,
            cont_embed_dropout=cont_embed_dropout,
            cont_embed_activation=cont_embed_activation,
            quantization_setup=quantization_setup,
            n_frequencies=n_frequencies,
            sigma=sigma,
            share_last_layer=share_last_layer,
            full_embed_dropout=full_embed_dropout,
        )

        self.num_cross_layers = num_cross_layers
        self.num_experts = num_experts
        self.low_rank = low_rank
        self.expert_dropout = expert_dropout
        self.structure = structure

        self.mlp_hidden_dims = mlp_hidden_dims
        self.mlp_activation = mlp_activation
        self.mlp_dropout = mlp_dropout
        self.mlp_batchnorm = mlp_batchnorm
        self.mlp_batchnorm_last = mlp_batchnorm_last
        self.mlp_linear_first = mlp_linear_first

        embeddings_output_dim = self.cat_out_dim + self.cont_out_dim
        self.deep_network = MLP(
            [embeddings_output_dim] + self.mlp_hidden_dims,
            self.mlp_activation,
            self.mlp_dropout,
            self.mlp_batchnorm,
            self.mlp_batchnorm_last,
            self.mlp_linear_first,
        )

        self.cross_network = CrossNetworkV2(
            input_dim=embeddings_output_dim,
            num_layers=self.num_cross_layers,
            low_rank=self.low_rank,
            num_experts=self.num_experts,
            expert_dropout=self.expert_dropout,
        )

    def forward(self, X: torch.Tensor) -> torch.Tensor:
        x = self._get_embeddings(X)
        cross_output = self.cross_network(x)

        if self.structure == "stacked":
            deep_output = self.deep_network(cross_output)
            return deep_output
        else:  # parallel
            deep_output = self.deep_network(x)
            return torch.cat([cross_output, deep_output], dim=1)

    @property
    def output_dim(self) -> int:
        if self.structure == "stacked":
            return self.mlp_hidden_dims[-1]
        else:  # parallel
            return self.mlp_hidden_dims[-1] + self.cat_out_dim + self.cont_out_dim

GatedDeepCrossNetwork

Bases: BaseTabularModelWithoutAttention

Defines a GatedDeepCrossNetwork model that can be used as the deeptabular component of a Wide & Deep model or independently by itself.

This class implements the Gated Deep & Cross Network (GDCN) architecture as described in the paper Towards Deeper, Lighter and Interpretable Cross Network for CTR Prediction. The GDCN enhances the original DCN by introducing a gating mechanism in the cross network. The gating mechanism controls feature interactions by learning which interactions are more important.

The cross layer implements the following equation:

\[c_{i+1} = c_0 \odot (W^c \times c_i + b) \odot \sigma(W^g \times c_i) + c_i\]

where:

  • \(\odot\) represents element-wise multiplication
  • \(W^c\) and \(W^g\) are the cross and gate weight matrices respectively
  • \(\sigma\) is the sigmoid activation function
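
To make the equation concrete, here is a minimal PyTorch sketch of a single gated cross layer (illustrative only: the class and variable names are made up for this example and this is not the library's internal GatedCrossNetwork implementation):

import torch
import torch.nn as nn

class GatedCrossLayerSketch(nn.Module):
    # one layer of c_{i+1} = c_0 * (W^c c_i + b) * sigmoid(W^g c_i) + c_i
    def __init__(self, input_dim: int):
        super().__init__()
        self.cross = nn.Linear(input_dim, input_dim, bias=True)   # W^c and b
        self.gate = nn.Linear(input_dim, input_dim, bias=False)   # W^g

    def forward(self, c0: torch.Tensor, ci: torch.Tensor) -> torch.Tensor:
        return c0 * self.cross(ci) * torch.sigmoid(self.gate(ci)) + ci

# a cross network stacks several such layers, always gating against the
# original embedded input c0
layers = nn.ModuleList([GatedCrossLayerSketch(32) for _ in range(3)])
c0 = torch.randn(8, 32)
ci = c0
for layer in layers:
    ci = layer(c0, ci)
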

Parameters:

Name Type Description Default
column_idx Dict[str, int]

Dict containing the index of the columns that will be passed through the GatedDeepCrossNetwork model. Required to slice the tensors. e.g. {'education': 0, 'relationship': 1, 'workclass': 2, ...}.

required
num_cross_layers int

Number of cross layers in the cross network

3
structure Literal['stacked', 'parallel']

Structure of the model. Either 'parallel' or 'stacked'. If 'parallel', the output will be the concatenation of the cross network and deep network outputs. If 'stacked', the cross network output will be fed into the deep network.

'parallel'
cat_embed_input Optional[List[Tuple[str, int, int]]]

List of Tuples with the column name, number of unique values and embedding dimension. e.g. [(education, 11, 32), ...]

None
cat_embed_dropout Optional[float]

Categorical embeddings dropout. If None, it will default to 0.

None
use_cat_bias Optional[bool]

Boolean indicating if bias will be used for the categorical embeddings. If None, it will default to 'False'.

None
cat_embed_activation Optional[str]

Activation function for the categorical embeddings, if any. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported

None
continuous_cols Optional[List[str]]

List with the name of the numeric (aka continuous) columns

None
cont_norm_layer Optional[Literal['batchnorm', 'layernorm']]

Type of normalization layer applied to the continuous features. Options are: 'layernorm' and 'batchnorm'. If None, no normalization layer will be used.

None
embed_continuous_method Optional[Literal['standard', 'piecewise', 'periodic']]

Method to use to embed the continuous features. Options are: 'standard', 'periodic' or 'piecewise'. The 'standard' embedding method is based on the FT-Transformer implementation presented in the paper: Revisiting Deep Learning Models for Tabular Data. The 'periodic' and 'piecewise' methods were presented in the paper: On Embeddings for Numerical Features in Tabular Deep Learning. Please read the papers for details.

None
cont_embed_dropout Optional[float]

Dropout for the continuous embeddings. If None, it will default to 0.0

None
cont_embed_activation Optional[str]

Activation function for the continuous embeddings if any. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported. If None, no activation function will be applied.

None
quantization_setup Optional[Dict[str, List[float]]]

This parameter is used when the 'piecewise' method is used to embed the continuous cols. It is a dict where keys are the name of the continuous columns and values are lists with the boundaries for the quantization of the continuous_cols. See the examples for details. If the 'piecewise' method is used, this parameter is required.

None
n_frequencies Optional[int]

This is the so called 'k' in their paper On Embeddings for Numerical Features in Tabular Deep Learning, and is the number of 'frequencies' that will be used to represent each continuous column. See their Eq 2 in the paper for details. If the 'periodic' method is used, this parameter is required.

None
sigma Optional[float]

This is the sigma parameter in the paper mentioned when describing the previous parameters and it is used to initialise the 'frequency weights'. See their Eq 2 in the paper for details. If the 'periodic' method is used, this parameter is required.

None
share_last_layer Optional[bool]

This parameter is not present in the aforementioned paper, but it is implemented in the official repo. If True the linear layer that turns the frequencies into embeddings will be shared across the continuous columns. If False a different linear layer will be used for each continuous column. If the 'periodic' method is used, this parameter is required.

None
full_embed_dropout Optional[bool]

If True, the full embedding corresponding to a column will be masked out/dropout. If None, it will default to False.

None
mlp_hidden_dims List[int]

List with the number of neurons per dense layer in the mlp.

[200, 100]
mlp_activation str

Activation function for the dense layers of the MLP. Currently 'tanh', 'relu', 'leaky_relu' and 'gelu' are supported

'relu'
mlp_dropout Union[float, List[float]]

float or List of floats with the dropout between the dense layers. e.g: [0.5,0.5]

0.1
mlp_batchnorm bool

Boolean indicating whether or not batch normalization will be applied to the dense layers

False
mlp_batchnorm_last bool

Boolean indicating whether or not batch normalization will be applied to the last of the dense layers

False
mlp_linear_first bool

Boolean indicating the order of the operations in the dense layer. If True: [LIN -> ACT -> BN -> DP]. If False: [BN -> DP -> LIN -> ACT]

True

Attributes:

Name Type Description
cross_network Module

The gated cross network component

deep_network Module

The deep network component (MLP)

Examples:

>>> import torch
>>> from pytorch_widedeep.models.rec import GatedDeepCrossNetwork
>>> X_tab = torch.cat((torch.empty(5, 4).random_(4), torch.rand(5, 1)), axis=1)
>>> colnames = ["a", "b", "c", "d", "e"]
>>> cat_embed_input = [(u, i, j) for u, i, j in zip(colnames[:4], [4] * 4, [8] * 4)]
>>> column_idx = {k: v for v, k in enumerate(colnames)}
>>> model = GatedDeepCrossNetwork(
...     column_idx=column_idx,
...     cat_embed_input=cat_embed_input,
...     continuous_cols=["e"],
...     num_cross_layers=2,
...     mlp_hidden_dims=[16, 8]
... )
>>> out = model(X_tab)
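
With the default 'parallel' structure the forward output is the concatenation of the cross-network and deep-network outputs; with 'stacked' it is just the MLP output. In both cases the width matches the output_dim property, as the source below shows. A short follow-up sketch to the example above (the exact widths depend on the embedding setup):

assert out.shape[1] == model.output_dim  # parallel: mlp_hidden_dims[-1] + embedding dims

stacked_model = GatedDeepCrossNetwork(
    column_idx=column_idx,
    cat_embed_input=cat_embed_input,
    continuous_cols=["e"],
    num_cross_layers=2,
    structure="stacked",
    mlp_hidden_dims=[16, 8],
)
assert stacked_model(X_tab).shape[1] == stacked_model.output_dim  # stacked: mlp_hidden_dims[-1], i.e. 8
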
Source code in pytorch_widedeep/models/rec/gdcn.py
class GatedDeepCrossNetwork(BaseTabularModelWithoutAttention):
    r"""Defines a `GatedDeepCrossNetwork` model that can be used as the `deeptabular`
    component of a Wide & Deep model or independently by itself.

    This class implements the Gated Deep & Cross Network (GDCN) architecture
    as described in the paper
    [Towards Deeper, Lighter and Interpretable Cross Network for CTR Prediction](https://arxiv.org/pdf/2311.04635).
    The GDCN enhances the original DCN by introducing a gating mechanism in
    the cross network. The gating mechanism controls feature interactions by
    learning which interactions are more important.

    The cross layer implements the following equation:

    $$c_{i+1} = c_0 \odot (W^c \times c_i + b) \odot \sigma(W^g \times c_i) + c_i$$

    where:

    * $\odot$ represents element-wise multiplication
    * $W^c$ and $W^g$ are the cross and gate weight matrices respectively
    * $\sigma$ is the sigmoid activation function

    Parameters
    ----------
    column_idx: Dict
        Dict containing the index of the columns that will be passed through
        the `GatedDeepCrossNetwork` model. Required to slice the tensors. e.g.
        _{'education': 0, 'relationship': 1, 'workclass': 2, ...}_.
    num_cross_layers: int, default = 3
        Number of cross layers in the cross network
    structure: str, default = "parallel"
        Structure of the model. Either _'parallel'_ or _'stacked'_. If _'parallel'_,
        the output will be the concatenation of the cross network and deep network
        outputs. If _'stacked'_, the cross network output will be fed into the deep
        network.
    cat_embed_input: List, Optional, default = None
        List of Tuples with the column name, number of unique values and
        embedding dimension. e.g. _[(education, 11, 32), ...]_
    cat_embed_dropout: float, Optional, default = None
        Categorical embeddings dropout. If `None`, it will default to 0.
    use_cat_bias: bool, Optional, default = None,
        Boolean indicating if bias will be used for the categorical embeddings.
        If `None`, it will default to 'False'.
    cat_embed_activation: Optional, str, default = None,
        Activation function for the categorical embeddings, if any. Currently
        _'tanh'_, _'relu'_, _'leaky_relu'_ and _'gelu'_ are supported
    continuous_cols : Optional[List[str]], default=None
        List with the name of the numeric (aka continuous) columns
    cont_norm_layer : Optional[Literal["batchnorm", "layernorm"]], default=None
        Type of normalization layer applied to the continuous features.
        Options are: _'layernorm'_ and _'batchnorm'_. If `None`, no
        normalization layer will be used.
    embed_continuous_method: Optional[Literal["piecewise", "periodic", "standard"]], default=None
        Method to use to embed the continuous features. Options are:
        _'standard'_, _'periodic'_ or _'piecewise'_. The _'standard'_
        embedding method is based on the FT-Transformer implementation
        presented in the paper: [Revisiting Deep Learning Models for
        Tabular Data](https://arxiv.org/abs/2106.11959v5). The _'periodic'_
        and _'piecewise'_ methods were presented in the paper: [On Embeddings for
        Numerical Features in Tabular Deep Learning](https://arxiv.org/abs/2203.05556).
        Please read the papers for details.
    cont_embed_dropout : Optional[float], default=None
        Dropout for the continuous embeddings. If `None`, it will default to 0.0
    cont_embed_activation : Optional[str], default=None
        Activation function for the continuous embeddings if any. Currently
        _'tanh'_, _'relu'_, _'leaky_relu'_ and _'gelu'_ are supported.
        If `None`, no activation function will be applied.
    quantization_setup : Optional[Dict[str, List[float]]], default=None
        This parameter is used when the _'piecewise'_ method is used to embed
        the continuous cols. It is a dict where keys are the name of the continuous
        columns and values are lists with the boundaries for the quantization
        of the continuous_cols. See the examples for details. If the
        _'piecewise'_ method is used, this parameter is required.
    n_frequencies : Optional[int], default=None
        This is the so called _'k'_ in their paper [On Embeddings for
        Numerical Features in Tabular Deep Learning](https://arxiv.org/abs/2203.05556),
        and is the number of 'frequencies' that will be used to represent each
        continuous column. See their Eq 2 in the paper for details. If
        the _'periodic'_ method is used, this parameter is required.
    sigma : Optional[float], default=None
        This is the sigma parameter in the paper mentioned when describing the
        previous parameters and it is used to initialise the 'frequency
        weights'. See their Eq 2 in the paper for details. If
        the _'periodic'_ method is used, this parameter is required.
    share_last_layer : Optional[bool], default=None
        This parameter is not present in the aforementioned paper, but it is implemented in
        the [official repo](https://github.com/yandex-research/rtdl-num-embeddings/tree/main).
        If `True` the linear layer that turns the frequencies into embeddings
        will be shared across the continuous columns. If `False` a different
        linear layer will be used for each continuous column.
        If the _'periodic'_ method is used, this parameter is required.
    full_embed_dropout: bool, Optional, default = None,
        If `True`, the full embedding corresponding to a column will be masked
        out/dropout. If `None`, it will default to `False`.
    mlp_hidden_dims: List, default = [200, 100]
        List with the number of neurons per dense layer in the mlp.
    mlp_activation: str, default = "relu"
        Activation function for the dense layers of the MLP. Currently
        _'tanh'_, _'relu'_, _'leaky_relu'_ and _'gelu'_ are supported
    mlp_dropout: float or List, default = 0.1
        float or List of floats with the dropout between the dense layers.
        e.g: _[0.5,0.5]_
    mlp_batchnorm: bool, default = False
        Boolean indicating whether or not batch normalization will be applied
        to the dense layers
    mlp_batchnorm_last: bool, default = False
        Boolean indicating whether or not batch normalization will be applied
        to the last of the dense layers
    mlp_linear_first: bool, default = True
        Boolean indicating the order of the operations in the dense
        layer. If `True: [LIN -> ACT -> BN -> DP]`. If `False: [BN -> DP ->
        LIN -> ACT]`

    Attributes
    ----------
    cross_network: nn.Module
        The gated cross network component
    deep_network: nn.Module
        The deep network component (MLP)

    Examples
    --------
    >>> import torch
    >>> from pytorch_widedeep.models.rec import GatedDeepCrossNetwork
    >>> X_tab = torch.cat((torch.empty(5, 4).random_(4), torch.rand(5, 1)), axis=1)
    >>> colnames = ["a", "b", "c", "d", "e"]
    >>> cat_embed_input = [(u, i, j) for u, i, j in zip(colnames[:4], [4] * 4, [8] * 4)]
    >>> column_idx = {k: v for v, k in enumerate(colnames)}
    >>> model = GatedDeepCrossNetwork(
    ...     column_idx=column_idx,
    ...     cat_embed_input=cat_embed_input,
    ...     continuous_cols=["e"],
    ...     num_cross_layers=2,
    ...     mlp_hidden_dims=[16, 8]
    ... )
    >>> out = model(X_tab)
    """

    def __init__(
        self,
        *,
        column_idx: Dict[str, int],
        num_cross_layers: int = 3,
        structure: Literal["stacked", "parallel"] = "parallel",
        cat_embed_input: Optional[List[Tuple[str, int, int]]] = None,
        cat_embed_dropout: Optional[float] = None,
        use_cat_bias: Optional[bool] = None,
        cat_embed_activation: Optional[str] = None,
        continuous_cols: Optional[List[str]] = None,
        cont_norm_layer: Optional[Literal["batchnorm", "layernorm"]] = None,
        embed_continuous: Optional[bool] = None,
        embed_continuous_method: Optional[
            Literal["standard", "piecewise", "periodic"]
        ] = None,
        cont_embed_dim: Optional[int] = None,
        cont_embed_dropout: Optional[float] = None,
        cont_embed_activation: Optional[str] = None,
        quantization_setup: Optional[Dict[str, List[float]]] = None,
        n_frequencies: Optional[int] = None,
        sigma: Optional[float] = None,
        share_last_layer: Optional[bool] = None,
        full_embed_dropout: Optional[bool] = None,
        mlp_hidden_dims: List[int] = [200, 100],
        mlp_activation: str = "relu",
        mlp_dropout: Union[float, List[float]] = 0.1,
        mlp_batchnorm: bool = False,
        mlp_batchnorm_last: bool = False,
        mlp_linear_first: bool = True,
    ):
        super(GatedDeepCrossNetwork, self).__init__(
            column_idx=column_idx,
            cat_embed_input=cat_embed_input,
            cat_embed_dropout=cat_embed_dropout,
            use_cat_bias=use_cat_bias,
            cat_embed_activation=cat_embed_activation,
            continuous_cols=continuous_cols,
            cont_norm_layer=cont_norm_layer,
            embed_continuous=embed_continuous,
            embed_continuous_method=embed_continuous_method,
            cont_embed_dim=cont_embed_dim,
            cont_embed_dropout=cont_embed_dropout,
            cont_embed_activation=cont_embed_activation,
            quantization_setup=quantization_setup,
            n_frequencies=n_frequencies,
            sigma=sigma,
            share_last_layer=share_last_layer,
            full_embed_dropout=full_embed_dropout,
        )

        self.num_cross_layers = num_cross_layers
        self.structure = structure

        self.mlp_hidden_dims = mlp_hidden_dims
        self.mlp_activation = mlp_activation
        self.mlp_dropout = mlp_dropout
        self.mlp_batchnorm = mlp_batchnorm
        self.mlp_batchnorm_last = mlp_batchnorm_last
        self.mlp_linear_first = mlp_linear_first

        embeddings_output_dim = self.cat_out_dim + self.cont_out_dim
        self.deep_network = MLP(
            [embeddings_output_dim] + self.mlp_hidden_dims,
            self.mlp_activation,
            self.mlp_dropout,
            self.mlp_batchnorm,
            self.mlp_batchnorm_last,
            self.mlp_linear_first,
        )

        self.cross_network = GatedCrossNetwork(embeddings_output_dim, num_cross_layers)

    def forward(self, X: torch.Tensor) -> torch.Tensor:
        x = self._get_embeddings(X)
        cross_output = self.cross_network(x)

        if self.structure == "stacked":
            deep_output = self.deep_network(cross_output)
            return deep_output
        else:  # parallel
            deep_output = self.deep_network(x)
            return torch.cat([cross_output, deep_output], dim=1)

    @property
    def output_dim(self) -> int:
        if self.structure == "stacked":
            return self.mlp_hidden_dims[-1]
        else:  # parallel
            return self.mlp_hidden_dims[-1] + self.cat_out_dim + self.cont_out_dim
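
As noted above, the model can be used on its own or as the deeptabular component of a Wide & Deep model. A minimal sketch of the latter, continuing the example in this section and assuming the WideDeep wrapper and Trainer from the core library behave as in the library's standard tabular examples (the toy target below is made up):

import numpy as np
from pytorch_widedeep import Trainer
from pytorch_widedeep.models import WideDeep

# wrap the tabular model; WideDeep adds the final prediction head
wd_model = WideDeep(deeptabular=model)

# toy binary target for the 5 rows used in the example above
target = np.array([0, 1, 0, 1, 0])

trainer = Trainer(wd_model, objective="binary")
trainer.fit(X_tab=X_tab.numpy(), target=target, n_epochs=1, batch_size=5)
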