The preprocessing module

This module contains the classes that are used to prepare the data before being passed to the models. There is one Preprocessor per data mode or model component: wide, deeptabular, deepimage and deeptext.

WidePreprocessor

WidePreprocessor(wide_cols, crossed_cols=None)

Bases: BasePreprocessor

Preprocessor to prepare the wide input dataset

This Preprocessor prepares the data for the wide, linear component. This linear model is implemented via an Embedding layer that is connected to the output neuron. WidePreprocessor numerically encodes all the unique values of all categorical columns wide_cols + crossed_cols. See the Example below.

Parameters:

  • wide_cols (List[str]) –

    List of strings with the names of the columns that will be label encoded and passed through the wide component

  • crossed_cols (Optional[List[Tuple[str, str]]], default: None ) –

    List of Tuples with the name of the columns that will be 'crossed' and then label encoded. e.g. [('education', 'occupation'), ...]. For binary features, a cross-product transformation is 1 if and only if the constituent features are all 1, and 0 otherwise.

Attributes:

  • wide_crossed_cols (List) –

    List with the names of all columns that will be label encoded

  • encoding_dict (Dict) –

    Dictionary where the keys are the result of pasting colname + '_' + column value and the values are the corresponding mapped integer.

  • inverse_encoding_dict (Dict) –

    the inverse encoding dictionary

  • wide_dim (int) –

    Dimension of the wide model (i.e. dim of the linear layer)

Examples:

>>> import pandas as pd
>>> from pytorch_widedeep.preprocessing import WidePreprocessor
>>> df = pd.DataFrame({'color': ['r', 'b', 'g'], 'size': ['s', 'n', 'l']})
>>> wide_cols = ['color']
>>> crossed_cols = [('color', 'size')]
>>> wide_preprocessor = WidePreprocessor(wide_cols=wide_cols, crossed_cols=crossed_cols)
>>> X_wide = wide_preprocessor.fit_transform(df)
>>> X_wide
array([[1, 4],
       [2, 5],
       [3, 6]])
>>> wide_preprocessor.encoding_dict
{'color_r': 1, 'color_b': 2, 'color_g': 3, 'color_size_r-s': 4, 'color_size_b-n': 5, 'color_size_g-l': 6}
>>> wide_preprocessor.inverse_transform(X_wide)
  color color_size
0     r        r-s
1     b        b-n
2     g        g-l
Source code in pytorch_widedeep/preprocessing/wide_preprocessor.py
def __init__(
    self, wide_cols: List[str], crossed_cols: Optional[List[Tuple[str, str]]] = None
):
    super(WidePreprocessor, self).__init__()

    self.wide_cols = wide_cols
    self.crossed_cols = crossed_cols

    self.is_fitted = False

fit

fit(df)

Fits the Preprocessor and creates required attributes

Parameters:

  • df (DataFrame) –

    Input pandas dataframe

Returns:

  • WidePreprocessor

    WidePreprocessor fitted object

Source code in pytorch_widedeep/preprocessing/wide_preprocessor.py
def fit(self, df: pd.DataFrame) -> "WidePreprocessor":
    r"""Fits the Preprocessor and creates required attributes

    Parameters
    ----------
    df: pd.DataFrame
        Input pandas dataframe

    Returns
    -------
    WidePreprocessor
        `WidePreprocessor` fitted object
    """
    df_wide = self._prepare_wide(df)
    self.wide_crossed_cols = df_wide.columns.tolist()
    glob_feature_list = self._make_global_feature_list(
        df_wide[self.wide_crossed_cols]
    )
    # leave 0 for padding/"unseen" categories
    self.encoding_dict = {v: i + 1 for i, v in enumerate(glob_feature_list)}
    self.wide_dim = len(self.encoding_dict)
    self.inverse_encoding_dict = {k: v for v, k in self.encoding_dict.items()}
    self.inverse_encoding_dict[0] = "unseen"

    self.is_fitted = True

    return self
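
The private helpers `_prepare_wide` and `_make_global_feature_list` are not shown on this page. As a rough, pandas-free sketch of what `fit` ends up building, and assuming (from the example above) that crossed column names are joined with `_` and crossed values with `-`, the encoding dictionary could be assembled as follows. The function name `build_encoding_dict` and the list-of-dicts input are hypothetical and only serve the illustration:

```python
from typing import Dict, List, Optional, Tuple


def build_encoding_dict(
    rows: List[Dict[str, str]],
    wide_cols: List[str],
    crossed_cols: Optional[List[Tuple[str, str]]] = None,
) -> Dict[str, int]:
    """Illustrative re-implementation, not the library's own code."""
    features: List[str] = []
    seen = set()

    def add(feature: str) -> None:
        # Keep the first occurrence of every "<column>_<value>" string.
        if feature not in seen:
            seen.add(feature)
            features.append(feature)

    for col in wide_cols:
        for row in rows:
            add(f"{col}_{row[col]}")
    for cols in crossed_cols or []:
        crossed_name = "_".join(cols)  # assumed naming, e.g. "color_size"
        for row in rows:
            crossed_value = "-".join(str(row[c]) for c in cols)  # e.g. "r-s"
            add(f"{crossed_name}_{crossed_value}")

    # Index 0 is left free for padding / "unseen" categories.
    return {feature: i + 1 for i, feature in enumerate(features)}


rows = [
    {"color": "r", "size": "s"},
    {"color": "b", "size": "n"},
    {"color": "g", "size": "l"},
]
encoding_dict = build_encoding_dict(rows, ["color"], [("color", "size")])
wide_dim = len(encoding_dict)
print(encoding_dict)
print(wide_dim)  # 6
```

With the toy data from the example above, this reproduces the `encoding_dict` shown there.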

transform

transform(df)

Parameters:

  • df (DataFrame) –

    Input pandas dataframe

Returns:

  • ndarray

    transformed input dataframe

Source code in pytorch_widedeep/preprocessing/wide_preprocessor.py
def transform(self, df: pd.DataFrame) -> np.ndarray:
    r"""
    Parameters
    ----------
    df: pd.DataFrame
        Input pandas dataframe

    Returns
    -------
    np.ndarray
        transformed input dataframe
    """
    check_is_fitted(self, attributes=["encoding_dict"])
    df_wide = self._prepare_wide(df)
    encoded = np.zeros([len(df_wide), len(self.wide_crossed_cols)])
    for col_i, col in enumerate(self.wide_crossed_cols):
        encoded[:, col_i] = df_wide[col].apply(
            lambda x: (
                self.encoding_dict[col + "_" + str(x)]
                if col + "_" + str(x) in self.encoding_dict
                else 0
            )
        )
    return encoded.astype("int64")
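
The lookup inside `transform` means that any value not seen during `fit` is encoded as 0. A minimal sketch of that lookup on its own, using a hand-written `encoding_dict` and a hypothetical helper named `encode_value`:

```python
from typing import Dict

encoding_dict: Dict[str, int] = {"color_r": 1, "color_b": 2, "color_g": 3}


def encode_value(col: str, value: object, mapping: Dict[str, int]) -> int:
    # Same key format as in `transform`: column name + "_" + str(value).
    # Values that were not present at fit time fall back to 0.
    return mapping.get(col + "_" + str(value), 0)


encoded = [encode_value("color", v, encoding_dict) for v in ["r", "g", "purple"]]
print(encoded)  # [1, 3, 0]
```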

inverse_transform

inverse_transform(encoded)

Takes the output of the transform method and returns the original values.

Parameters:

  • encoded (ndarray) –

    numpy array with the encoded values that are the output from the transform method

Returns:

  • DataFrame

    Pandas dataframe with the original values

Source code in pytorch_widedeep/preprocessing/wide_preprocessor.py
def inverse_transform(self, encoded: np.ndarray) -> pd.DataFrame:
    r"""Takes as input the output from the `transform` method and it will
    return the original values.

    Parameters
    ----------
    encoded: np.ndarray
        numpy array with the encoded values that are the output from the
        `transform` method

    Returns
    -------
    pd.DataFrame
        Pandas dataframe with the original values
    """
    decoded = pd.DataFrame(encoded, columns=self.wide_crossed_cols)

    if pd.__version__ >= "2.1.0":
        decoded = decoded.map(lambda x: self.inverse_encoding_dict[x])
    else:
        decoded = decoded.applymap(lambda x: self.inverse_encoding_dict[x])

    for col in decoded.columns:
        rm_str = "".join([col, "_"])
        decoded[col] = decoded[col].apply(lambda x: x.replace(rm_str, ""))
    return decoded
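
One caveat about the version check in the listing above: `pd.__version__ >= "2.1.0"` compares strings character by character, so it can give the wrong answer once a version component has more than one digit. A possible, more defensive variant (a sketch, not what the library does) compares integer tuples instead. The helper `version_tuple` is hypothetical:

```python
import re
from typing import Tuple


def version_tuple(version: str) -> Tuple[int, ...]:
    """Leading numeric components of a version string, e.g. '2.10.0rc1' -> (2, 10, 0)."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


# Lexicographic string comparison mis-orders these two versions...
print("10.0.0" >= "2.1.0")  # False
# ...while comparing integer tuples orders them as expected.
print(version_tuple("10.0.0") >= version_tuple("2.1.0"))  # True
```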

fit_transform

fit_transform(df)

Combines fit and transform

Parameters:

  • df (DataFrame) –

    Input pandas dataframe

Returns:

  • ndarray

    transformed input dataframe

Source code in pytorch_widedeep/preprocessing/wide_preprocessor.py
def fit_transform(self, df: pd.DataFrame) -> np.ndarray:
    """Combines `fit` and `transform`

    Parameters
    ----------
    df: pd.DataFrame
        Input pandas dataframe

    Returns
    -------
    np.ndarray
        transformed input dataframe
    """
    return self.fit(df).transform(df)

TabPreprocessor

TabPreprocessor(cat_embed_cols=None, continuous_cols=None, quantization_setup=None, cols_to_scale=None, auto_embed_dim=True, embedding_rule='fastai_new', default_embed_dim=16, with_attention=False, with_cls_token=False, shared_embed=False, verbose=1, *, scale=False, already_standard=None, **kwargs)

Bases: BasePreprocessor

Preprocessor to prepare the deeptabular component input dataset

Parameters:

  • cat_embed_cols (Optional[Union[List[str], List[Tuple[str, int]]]], default: None ) –

    List containing the names of the categorical columns that will be represented by embeddings (e.g. ['education', 'relationship', ...]) or a list of Tuples with the name and the embedding dimension (e.g. [('education', 32), ('relationship', 16), ...])

  • continuous_cols (Optional[List[str]], default: None ) –

    List with the name of the continuous cols

  • quantization_setup (Optional[Union[int, Dict[str, Union[int, List[float]]]]], default: None ) –

    Continuous columns can be turned into categorical via pd.cut. If quantization_setup is an int, all continuous columns will be quantized using this value as the number of bins. Alternatively, a dictionary can be used where the keys are the names of the columns to quantize and the values are either integers indicating the number of bins or lists of scalars indicating the bin edges.

  • cols_to_scale (Optional[Union[List[str], str]], default: None ) –

    List with the names of the columns that will be standardised via sklearn's StandardScaler. It can also be the string 'all', in which case all the continuous cols will be scaled.

  • auto_embed_dim (bool, default: True ) –

    Boolean indicating whether the embedding dimensions will be automatically defined via rule of thumb. See embedding_rule below.

  • embedding_rule (Literal[google, fastai_old, fastai_new], default: 'fastai_new' ) –

    If auto_embed_dim=True, this is the choice of embedding rule of thumb. Choices are:

    • fastai_new: \(min(600, round(1.6 \times n_{cat}^{0.56}))\)

    • fastai_old: \(min(50, (n_{cat}//{2})+1)\)

    • google: \(min(600, round(n_{cat}^{0.24}))\)

  • default_embed_dim (int, default: 16 ) –

    Dimension for the embeddings if the embedding dimension is not provided in the cat_embed_cols parameter and auto_embed_dim is set to False.

  • with_attention (bool, default: False ) –

    Boolean indicating whether the preprocessed data will be passed to an attention-based model (more precisely, a model where all embeddings must have the same dimensions). If True, the param cat_embed_cols must be a list containing only the categorical column names, e.g. ['education', 'relationship', ...]. This is because they will all be encoded using embeddings of the same dim, which will be specified later when the model is defined.
    Param alias: for_transformer

  • with_cls_token (bool, default: False ) –

    Boolean indicating if a '[CLS]' token will be added to the dataset when using attention-based models. The final hidden state corresponding to this token is used as the aggregated representation for classification and regression tasks. If not, the categorical and/or continuous embeddings will be concatenated before being passed to the final MLP (if present).

  • shared_embed (bool, default: False ) –

    Boolean indicating if the embeddings will be "shared" when using attention-based models. The idea behind shared_embed is described in Appendix A of the TabTransformer paper: 'The goal of having column embedding is to enable the model to distinguish the classes in one column from those in the other columns'. In other words, the idea is to let the model learn which column is being embedded at any given time. See: pytorch_widedeep.models.transformers._layers.SharedEmbeddings.

  • verbose (int, default: 1 ) –
  • scale (bool, default: False ) –

    ℹ️ note: this arg will be removed in upcoming releases. Please use cols_to_scale instead.
    Bool indicating whether or not to scale/standardise continuous cols. It is important to emphasize that all the DL models for tabular data in the library also include the possibility of normalising the input continuous features via a BatchNorm or a LayerNorm.
    Param alias: scale_cont_cols.

  • already_standard (Optional[List[str]], default: None ) –

    ℹ️ note: this arg will be removed in upcoming releases. Please use cols_to_scale instead.
    List with the names of the continuous cols that do not need to be scaled/standardised.

Other Parameters:

  • **kwargs

    pd.cut and StandardScaler related args
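
The rules of thumb listed under embedding_rule can be written out directly. The sketch below simply transcribes the three formulas into plain Python; the function name `embed_dim_rule` is hypothetical and the library's own helper may differ in details such as rounding or return type:

```python
def embed_dim_rule(n_cat: int, rule: str = "fastai_new") -> int:
    """Embedding dimension for a column with `n_cat` distinct categories."""
    if rule == "fastai_new":
        return min(600, round(1.6 * n_cat**0.56))
    if rule == "fastai_old":
        return min(50, n_cat // 2 + 1)
    if rule == "google":
        return min(600, round(n_cat**0.24))
    raise ValueError(f"unknown embedding rule: {rule!r}")


# Compare the three rules for a few cardinalities.
for rule in ("fastai_new", "fastai_old", "google"):
    print(rule, [embed_dim_rule(n, rule) for n in (10, 100, 1000)])
```

All three rules grow sub-linearly with the number of categories and are capped, so very high-cardinality columns do not receive excessively wide embeddings.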

Attributes:

  • embed_dim (Dict) –

    Dictionary where keys are the embed cols and values are the embedding dimensions. If with_attention is set to True, this attribute is not generated during the fit process.

  • label_encoder (LabelEncoder) –

    see pytorch_widedeep.utils.dense_utils.LabelEncoder

  • cat_embed_input (List) –

    List of Tuples with the column name, the number of individual values for that column and, if with_attention is set to False, the corresponding embedding dim, e.g. [('education', 16, 10), ('relationship', 6, 8), ...].

  • standardize_cols (List) –

    List of the columns that will be standardized

  • scaler (StandardScaler) –

    an instance of sklearn.preprocessing.StandardScaler

  • column_idx (Dict) –

    Dictionary where keys are column names and values are column indexes. This is necessary to slice tensors

  • quantizer (Quantizer) –

    an instance of Quantizer

Examples:

>>> import pandas as pd
>>> import numpy as np
>>> from pytorch_widedeep.preprocessing import TabPreprocessor
>>> df = pd.DataFrame({'color': ['r', 'b', 'g'], 'size': ['s', 'n', 'l'], 'age': [25, 40, 55]})
>>> cat_embed_cols = [('color',5), ('size',5)]
>>> cont_cols = ['age']
>>> deep_preprocessor = TabPreprocessor(cat_embed_cols=cat_embed_cols, continuous_cols=cont_cols)
>>> X_tab = deep_preprocessor.fit_transform(df)
>>> deep_preprocessor.cat_embed_cols
[('color', 5), ('size', 5)]
>>> deep_preprocessor.column_idx
{'color': 0, 'size': 1, 'age': 2}
>>> cont_df = pd.DataFrame({"col1": np.random.rand(10), "col2": np.random.rand(10) + 1})
>>> cont_cols = ["col1", "col2"]
>>> tab_preprocessor = TabPreprocessor(continuous_cols=cont_cols, quantization_setup=3)
>>> ft_cont_df = tab_preprocessor.fit_transform(cont_df)
>>> # or...
>>> quantization_setup = {'col1': [0., 0.4, 1.], 'col2': [1., 1.4, 2.]}
>>> tab_preprocessor2 = TabPreprocessor(continuous_cols=cont_cols, quantization_setup=quantization_setup)
>>> ft_cont_df2 = tab_preprocessor2.fit_transform(cont_df)
Source code in pytorch_widedeep/preprocessing/tab_preprocessor.py
@alias("with_attention", ["for_transformer"])
@alias("cat_embed_cols", ["embed_cols"])
@alias("scale", ["scale_cont_cols"])
@alias("quantization_setup", ["cols_and_bins"])
def __init__(
    self,
    cat_embed_cols: Optional[Union[List[str], List[Tuple[str, int]]]] = None,
    continuous_cols: Optional[List[str]] = None,
    quantization_setup: Optional[
        Union[int, Dict[str, Union[int, List[float]]]]
    ] = None,
    cols_to_scale: Optional[Union[List[str], str]] = None,
    auto_embed_dim: bool = True,
    embedding_rule: Literal["google", "fastai_old", "fastai_new"] = "fastai_new",
    default_embed_dim: int = 16,
    with_attention: bool = False,
    with_cls_token: bool = False,
    shared_embed: bool = False,
    verbose: int = 1,
    *,
    scale: bool = False,
    already_standard: Optional[List[str]] = None,
    **kwargs,
):
    super(TabPreprocessor, self).__init__()

    self.continuous_cols = continuous_cols
    self.quantization_setup = quantization_setup
    self.cols_to_scale = cols_to_scale
    self.scale = scale
    self.already_standard = already_standard
    self.auto_embed_dim = auto_embed_dim
    self.embedding_rule = embedding_rule
    self.default_embed_dim = default_embed_dim
    self.with_attention = with_attention
    self.with_cls_token = with_cls_token
    self.shared_embed = shared_embed
    self.verbose = verbose

    self.quant_args = {
        k: v for k, v in kwargs.items() if k in pd.cut.__code__.co_varnames
    }
    self.scale_args = {
        k: v for k, v in kwargs.items() if k in StandardScaler().get_params()
    }

    self._check_inputs(cat_embed_cols)

    if with_cls_token:
        self.cat_embed_cols = (
            ["cls_token"] + cat_embed_cols  # type: ignore[operator]
            if cat_embed_cols is not None
            else ["cls_token"]
        )
    else:
        self.cat_embed_cols = cat_embed_cols  # type: ignore[assignment]

    self.is_fitted = False

fit

fit(df)

Fits the Preprocessor and creates required attributes

Parameters:

  • df (DataFrame) –

    Input pandas dataframe

Returns:

  • TabPreprocessor

    TabPreprocessor fitted object

Source code in pytorch_widedeep/preprocessing/tab_preprocessor.py
def fit(self, df: pd.DataFrame) -> BasePreprocessor:  # noqa: C901
    """Fits the Preprocessor and creates required attributes

    Parameters
    ----------
    df: pd.DataFrame
        Input pandas dataframe

    Returns
    -------
    TabPreprocessor
        `TabPreprocessor` fitted object
    """

    df_adj = self._insert_cls_token(df) if self.with_cls_token else df.copy()

    self.column_idx: Dict[str, int] = {}

    # Categorical embeddings logic
    if self.cat_embed_cols is not None or self.quantization_setup is not None:
        self.cat_embed_input: List[Union[Tuple[str, int], Tuple[str, int, int]]] = (
            []
        )

    if self.cat_embed_cols is not None:
        df_cat, cat_embed_dim = self._prepare_categorical(df_adj)

        self.label_encoder = LabelEncoder(
            columns_to_encode=df_cat.columns.tolist(),
            shared_embed=self.shared_embed,
            with_attention=self.with_attention,
        )
        self.label_encoder.fit(df_cat)

        for k, v in self.label_encoder.encoding_dict.items():
            if self.with_attention:
                self.cat_embed_input.append((k, len(v)))
            else:
                self.cat_embed_input.append((k, len(v), cat_embed_dim[k]))

        self.column_idx.update({k: v for v, k in enumerate(df_cat.columns)})

    # Continuous columns logic
    if self.continuous_cols is not None:
        df_cont, cont_embed_dim = self._prepare_continuous(df_adj)

        # Standardization logic
        if self.standardize_cols is not None:
            self.scaler = StandardScaler(**self.scale_args).fit(
                df_cont[self.standardize_cols].values
            )
        elif self.verbose:
            warnings.warn("Continuous columns will not be normalised")

        # Quantization logic
        if self.cols_and_bins is not None:
            # we do not run 'Quantizer.fit' here since in the wild case
            # someone wants standardization and quantization for the same
            # columns, the Quantizer will run on the scaled data
            self.quantizer = Quantizer(self.cols_and_bins, **self.quant_args)

            if self.with_attention:
                for col, n_cat, _ in cont_embed_dim:
                    self.cat_embed_input.append((col, n_cat))
            else:
                self.cat_embed_input.extend(cont_embed_dim)

        self.column_idx.update(
            {k: v + len(self.column_idx) for v, k in enumerate(df_cont)}
        )

    self.is_fitted = True

    return self
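
The two `column_idx.update` calls in `fit` place the categorical columns first and then append the continuous columns, offset by the number of entries already registered. A standalone sketch of that bookkeeping, using the column names from the example above (plain lists stand in for the dataframes):

```python
from typing import Dict, List

cat_cols: List[str] = ["color", "size"]
cont_cols: List[str] = ["age"]

column_idx: Dict[str, int] = {}
# Categorical (embedding) columns come first...
column_idx.update({col: i for i, col in enumerate(cat_cols)})
# ...and continuous columns follow, shifted by the columns already registered.
offset = len(column_idx)
column_idx.update({col: i + offset for i, col in enumerate(cont_cols)})

print(column_idx)  # {'color': 0, 'size': 1, 'age': 2}
```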

transform

transform(df)

Returns the processed dataframe as a np.ndarray

Parameters:

  • df (DataFrame) –

    Input pandas dataframe

Returns:

  • ndarray

    transformed input dataframe

Source code in pytorch_widedeep/preprocessing/tab_preprocessor.py
def transform(self, df: pd.DataFrame) -> np.ndarray:  # noqa: C901
    """Returns the processed `dataframe` as a np.ndarray

    Parameters
    ----------
    df: pd.DataFrame
        Input pandas dataframe

    Returns
    -------
    np.ndarray
        transformed input dataframe
    """
    check_is_fitted(self, condition=self.is_fitted)

    df_adj = self._insert_cls_token(df) if self.with_cls_token else df.copy()

    if self.cat_embed_cols is not None:
        df_cat = df_adj[self.cat_cols]
        df_cat = self.label_encoder.transform(df_cat)
    if self.continuous_cols is not None:
        df_cont = df_adj[self.continuous_cols]
        # Standardization logic
        if self.standardize_cols:
            df_cont[self.standardize_cols] = self.scaler.transform(
                df_cont[self.standardize_cols].values
            )
        # Quantization logic
        if self.cols_and_bins is not None:
            # Adjustment so I don't have to override the method
            # in 'ChunkTabPreprocessor'
            if self.quantizer.is_fitted:
                df_cont = self.quantizer.transform(df_cont)
            else:
                df_cont = self.quantizer.fit_transform(df_cont)
    try:
        df_deep = pd.concat([df_cat, df_cont], axis=1)
    except NameError:
        try:
            df_deep = df_cat.copy()
        except NameError:
            df_deep = df_cont.copy()

    return df_deep.values

inverse_transform

inverse_transform(encoded)

Takes the output of the transform method and returns the original values.

Parameters:

  • encoded (ndarray) –

    array with the output of the transform method

Returns:

  • DataFrame

    Pandas dataframe with the original values

Source code in pytorch_widedeep/preprocessing/tab_preprocessor.py
def inverse_transform(self, encoded: np.ndarray) -> pd.DataFrame:  # noqa: C901
    r"""Takes as input the output from the `transform` method and it will
    return the original values.

    Parameters
    ----------
    encoded: np.ndarray
        array with the output of the `transform` method

    Returns
    -------
    pd.DataFrame
        Pandas dataframe with the original values
    """
    decoded = pd.DataFrame(encoded, columns=list(self.column_idx.keys()))
    # embeddings back to original category
    if self.cat_embed_cols is not None:
        decoded = self.label_encoder.inverse_transform(decoded)
    if self.continuous_cols is not None:
        # quantized cols to the mid point
        if self.cols_and_bins is not None:
            if self.verbose:
                print(
                    "Note that quantized cols will be turned into the mid point of "
                    "the corresponding bin"
                )
            for k, v in self.quantizer.inversed_bins.items():
                decoded[k] = decoded[k].map(v)
        # continuous_cols back to non-standarised
        try:
            decoded[self.standardize_cols] = self.scaler.inverse_transform(
                decoded[self.standardize_cols]
            )
        except Exception:  # KeyError:
            pass

    if "cls_token" in decoded.columns:
        decoded.drop("cls_token", axis=1, inplace=True)

    return decoded
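
As the message printed by `inverse_transform` says, quantized columns cannot be recovered exactly: each bin index is mapped back to the mid point of its bin. The sketch below illustrates the idea for explicit bin edges in plain Python. It assumes right-closed bins `(a, b]`, which is the `pd.cut` default, and zero-based bin indexes. The helpers `bin_index` and `bin_midpoints` are hypothetical, and the index convention used internally by `Quantizer` may differ:

```python
from bisect import bisect_left
from typing import Dict, List, Optional


def bin_index(value: float, edges: List[float]) -> Optional[int]:
    """Zero-based index of the right-closed bin (a, b] that contains `value`."""
    idx = bisect_left(edges, value) - 1
    # Values outside (edges[0], edges[-1]] do not fall into any bin.
    return idx if 0 <= idx < len(edges) - 1 else None


def bin_midpoints(edges: List[float]) -> Dict[int, float]:
    """Map each bin index to the mid point of its bin."""
    return {i: (lo + hi) / 2 for i, (lo, hi) in enumerate(zip(edges, edges[1:]))}


edges = [0.0, 0.4, 1.0]
values = [0.1, 0.4, 0.75]
indexes = [bin_index(v, edges) for v in values]
midpoints = bin_midpoints(edges)
print(indexes)                          # [0, 0, 1]
print([midpoints[i] for i in indexes])  # [0.2, 0.2, 0.7]
```

The round trip therefore returns 0.2, 0.2 and 0.7 rather than the original 0.1, 0.4 and 0.75.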

fit_transform

fit_transform(df)

Combines fit and transform

Parameters:

  • df (DataFrame) –

    Input pandas dataframe

Returns:

  • ndarray

    transformed input dataframe

Source code in pytorch_widedeep/preprocessing/tab_preprocessor.py
def fit_transform(self, df: pd.DataFrame) -> np.ndarray:
    """Combines `fit` and `transform`

    Parameters
    ----------
    df: pd.DataFrame
        Input pandas dataframe

    Returns
    -------
    np.ndarray
        transformed input dataframe
    """
    return self.fit(df).transform(df)

Quantizer

Quantizer(quantization_setup, **kwargs)

Helper class to perform the quantization of continuous columns. It is included in these docs for completeness since, depending on the value of the parameter 'quantization_setup' of the TabPreprocessor class, that class might have an attribute of type Quantizer. However, this class is designed to always run internally within the TabPreprocessor class.

Parameters:

  • quantization_setup (Dict[str, Union[int, List[float]]]) –

    Dictionary where the keys are the names of the columns to quantize and the values are either integers indicating the number of bins or lists of scalars indicating the bin edges.

Source code in pytorch_widedeep/preprocessing/tab_preprocessor.py
def __init__(
    self,
    quantization_setup: Dict[str, Union[int, List[float]]],
    **kwargs,
):
    self.quantization_setup = quantization_setup
    self.quant_args = kwargs

    self.is_fitted = False

TextPreprocessor

TextPreprocessor(text_col, max_vocab=30000, min_freq=5, maxlen=80, pad_first=True, pad_idx=1, already_processed=False, word_vectors_path=None, n_cpus=None, verbose=1)

Bases: BasePreprocessor

Preprocessor to prepare the deeptext input dataset

Parameters:

  • text_col (str) –

    column in the input dataframe containing the texts

  • max_vocab (int, default: 30000 ) –

    Maximum number of tokens in the vocabulary

  • min_freq (int, default: 5 ) –

    Minimum frequency for a token to be part of the vocabulary

  • maxlen (int, default: 80 ) –

    Maximum length of the tokenized sequences

  • pad_first (bool, default: True ) –

    Indicates whether the padding index will be added at the beginning or the end of the sequences

  • pad_idx (int, default: 1 ) –

    padding index. Fastai's Tokenizer leaves 0 for the 'unknown' token.

  • already_processed (Optional[bool], default: False ) –

    Boolean indicating if the sequence of elements is already processed or prepared. If this is the case, this Preprocessor will simply tokenize and pad the sequence.
    Param alias: not_text
    This parameter is intended for cases where the input sequences are already fully processed or are not text at all (e.g. IDs).

  • word_vectors_path (Optional[str], default: None ) –

    Path to the pretrained word vectors

  • n_cpus (Optional[int], default: None ) –

    number of CPUs to use during the tokenization process

  • verbose (int, default: 1 ) –

    Enable verbose output.

Attributes:

  • vocab (Vocab) –

    an instance of pytorch_widedeep.utils.fastai_transforms.Vocab

  • embedding_matrix (ndarray) –

    Array with the pretrained embeddings

Examples:

>>> import pandas as pd
>>> from pytorch_widedeep.preprocessing import TextPreprocessor
>>> df_train = pd.DataFrame({'text_column': ["life is like a box of chocolates",
... "You never know what you're gonna get"]})
>>> text_preprocessor = TextPreprocessor(text_col='text_column', max_vocab=25, min_freq=1, maxlen=10)
>>> text_preprocessor.fit_transform(df_train)
The vocabulary contains 24 tokens
array([[ 1,  1,  1,  1, 10, 11, 12, 13, 14, 15],
       [ 5,  9, 16, 17, 18,  9, 19, 20, 21, 22]], dtype=int32)
>>> df_te = pd.DataFrame({'text_column': ['you never know what is in the box']})
>>> text_preprocessor.transform(df_te)
array([[ 1,  1,  9, 16, 17, 18, 11,  0,  0, 13]], dtype=int32)
Source code in pytorch_widedeep/preprocessing/text_preprocessor.py
@alias("already_processed", ["not_text"])
def __init__(
    self,
    text_col: str,
    max_vocab: int = 30000,
    min_freq: int = 5,
    maxlen: int = 80,
    pad_first: bool = True,
    pad_idx: int = 1,
    already_processed: Optional[bool] = False,
    word_vectors_path: Optional[str] = None,
    n_cpus: Optional[int] = None,
    verbose: int = 1,
):
    super(TextPreprocessor, self).__init__()

    self.text_col = text_col
    self.max_vocab = max_vocab
    self.min_freq = min_freq
    self.maxlen = maxlen
    self.pad_first = pad_first
    self.pad_idx = pad_idx
    self.already_processed = already_processed
    self.word_vectors_path = word_vectors_path
    self.verbose = verbose
    self.n_cpus = n_cpus if n_cpus is not None else os.cpu_count()

    self.is_fitted = False

fit

fit(df)

Builds the vocabulary

Parameters:

  • df (DataFrame) –

    Input pandas dataframe

Returns:

  • TextPreprocessor

    TextPreprocessor fitted object

Source code in pytorch_widedeep/preprocessing/text_preprocessor.py
def fit(self, df: pd.DataFrame) -> BasePreprocessor:
    """Builds the vocabulary

    Parameters
    ----------
    df: pd.DataFrame
        Input pandas dataframe

    Returns
    -------
    TextPreprocessor
        `TextPreprocessor` fitted object
    """
    texts = self._read_texts(df)

    tokens = get_texts(texts, self.already_processed, self.n_cpus)

    self.vocab: TVocab = Vocab(
        max_vocab=self.max_vocab,
        min_freq=self.min_freq,
        pad_idx=self.pad_idx,
    ).fit(
        tokens,
    )

    if self.verbose:
        print("The vocabulary contains {} tokens".format(len(self.vocab.stoi)))
    if self.word_vectors_path is not None:
        self.embedding_matrix = build_embeddings_matrix(
            self.vocab, self.word_vectors_path, self.min_freq
        )

    self.is_fitted = True

    return self
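
To make the roles of `max_vocab` and `min_freq` concrete, here is a simplified vocabulary builder. It is only a sketch: the real `Vocab` class works on fastai-style tokens and uses additional special tokens. The token names `"xxunk"` and `"xxpad"`, the choice to count special tokens towards `max_vocab`, and the helper `build_vocab` are all assumptions made for illustration:

```python
from collections import Counter
from typing import Dict, List


def build_vocab(
    tokenized_texts: List[List[str]], max_vocab: int = 30000, min_freq: int = 5
) -> Dict[str, int]:
    """Map tokens to integer ids, keeping only sufficiently frequent tokens."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    # Index 0 is reserved for unknown tokens and index 1 for padding.
    specials = ["xxunk", "xxpad"]
    kept = [tok for tok, freq in counts.most_common(max_vocab) if freq >= min_freq]
    itos = specials + [tok for tok in kept if tok not in specials]
    return {tok: idx for idx, tok in enumerate(itos[:max_vocab])}


texts = [["life", "is", "like", "a", "box"], ["a", "box", "is", "a", "box"]]
stoi = build_vocab(texts, max_vocab=10, min_freq=2)
print(stoi)  # {'xxunk': 0, 'xxpad': 1, 'a': 2, 'box': 3, 'is': 4}

# "life" and "like" appear only once, so they map to the unknown index.
print([stoi.get(tok, 0) for tok in ["life", "is", "a", "box"]])  # [0, 4, 2, 3]
```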

transform

transform(df)

Returns the padded, 'numericalised' sequences

Parameters:

  • df (DataFrame) –

    Input pandas dataframe

Returns:

  • ndarray

    Padded, 'numericalised' sequences

Source code in pytorch_widedeep/preprocessing/text_preprocessor.py
def transform(self, df: pd.DataFrame) -> np.ndarray:
    """Returns the padded, _'numericalised'_ sequences

    Parameters
    ----------
    df: pd.DataFrame
        Input pandas dataframe

    Returns
    -------
    np.ndarray
        Padded, _'numericalised'_ sequences
    """
    check_is_fitted(self, attributes=["vocab"])
    texts = self._read_texts(df)
    tokens = get_texts(texts, self.already_processed, self.n_cpus)
    return self._pad_sequences(tokens)
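
The private `_pad_sequences` method is not listed here. As a rough sketch of what padding with `pad_first` and `pad_idx` amounts to, consider the helper below. The name `pad_sequence` is hypothetical, and the truncation behaviour for sequences longer than `maxlen` (keeping the last `maxlen` tokens) is an assumption, not something this page confirms:

```python
from typing import List


def pad_sequence(
    seq: List[int], maxlen: int, pad_first: bool = True, pad_idx: int = 1
) -> List[int]:
    """Pad (or truncate) a list of token ids to exactly `maxlen` items."""
    if len(seq) >= maxlen:
        # Assumption: long sequences keep their last `maxlen` tokens.
        return seq[-maxlen:]
    padding = [pad_idx] * (maxlen - len(seq))
    return padding + seq if pad_first else seq + padding


tokens = [10, 11, 12, 13, 14, 15]
print(pad_sequence(tokens, maxlen=10))
# [1, 1, 1, 1, 10, 11, 12, 13, 14, 15]
print(pad_sequence(tokens, maxlen=10, pad_first=False))
# [10, 11, 12, 13, 14, 15, 1, 1, 1, 1]
```

The first call matches the first row of the `fit_transform` output in the example above, where four padding indexes precede the six token ids.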

transform_sample

transform_sample(text)

Returns the padded, 'numericalised' sequence

Parameters:

  • text (str) –

    text to be tokenized and padded

Returns:

  • ndarray

    Padded, 'numericalised' sequence

Source code in pytorch_widedeep/preprocessing/text_preprocessor.py
def transform_sample(self, text: str) -> np.ndarray:
    """Returns the padded, _'numericalised'_ sequence

    Parameters
    ----------
    text: str
        text to be tokenized and padded

    Returns
    -------
    np.ndarray
        Padded, _'numericalised'_ sequence
    """
    check_is_fitted(self, attributes=["vocab"])
    tokens = get_texts([text], self.already_processed, self.n_cpus)
    return self._pad_sequences(tokens)[0]

fit_transform

fit_transform(df)

Combines fit and transform

Parameters:

  • df (DataFrame) –

    Input pandas dataframe

Returns:

  • ndarray

    Padded, 'numericalised' sequences

Source code in pytorch_widedeep/preprocessing/text_preprocessor.py
def fit_transform(self, df: pd.DataFrame) -> np.ndarray:
    """Combines `fit` and `transform`

    Parameters
    ----------
    df: pd.DataFrame
        Input pandas dataframe

    Returns
    -------
    np.ndarray
        Padded, _'numericalised'_ sequences
    """
    return self.fit(df).transform(df)

inverse_transform

inverse_transform(padded_seq)

Returns the original text plus the added 'special' tokens

Parameters:

  • padded_seq (ndarray) –

    array with the output of the transform method

Returns:

  • DataFrame

    Pandas dataframe with the original text plus the added 'special' tokens

Source code in pytorch_widedeep/preprocessing/text_preprocessor.py
def inverse_transform(self, padded_seq: np.ndarray) -> pd.DataFrame:
    """Returns the original text plus the added 'special' tokens

    Parameters
    ----------
    padded_seq: np.ndarray
        array with the output of the `transform` method

    Returns
    -------
    pd.DataFrame
        Pandas dataframe with the original text plus the added 'special' tokens
    """
    texts = [self.vocab.inverse_transform(num) for num in padded_seq]
    return pd.DataFrame({self.text_col: texts})

ImagePreprocessor

ImagePreprocessor(img_col, img_path, width=224, height=224, verbose=1)

Bases: BasePreprocessor

Preprocessor to prepare the deepimage input dataset.

The preprocessing consists simply of resizing the images according to their aspect ratio.

Parameters:

  • img_col (str) –

    name of the column with the image filenames

  • img_path (str) –

    Path to the directory where the images are stored

  • width (int, default: 224 ) –

    width of the resulting processed image.

  • height (int, default: 224 ) –

    Height of the resulting processed image.

  • verbose (int, default: 1 ) –

    Enable verbose output.

Attributes:

  • aap (AspectAwarePreprocessor) –

    an instance of pytorch_widedeep.utils.image_utils.AspectAwarePreprocessor

  • spp (SimplePreprocessor) –

    an instance of pytorch_widedeep.utils.image_utils.SimplePreprocessor

  • normalise_metrics (Dict) –

    Dict containing the normalisation metrics of the image dataset, i.e. mean and std for the R, G and B channels

Examples:

>>> import pandas as pd
>>>
>>> from pytorch_widedeep.preprocessing import ImagePreprocessor
>>>
>>> path_to_image1 = 'tests/test_data_utils/images/galaxy1.png'
>>> path_to_image2 = 'tests/test_data_utils/images/galaxy2.png'
>>>
>>> df_train = pd.DataFrame({'images_column': [path_to_image1]})
>>> df_test = pd.DataFrame({'images_column': [path_to_image2]})
>>> img_preprocessor = ImagePreprocessor(img_col='images_column', img_path='.', verbose=0)
>>> resized_images = img_preprocessor.fit_transform(df_train)
>>> new_resized_images = img_preprocessor.transform(df_test)

ℹ️ NOTE: Normalising metrics are computed only once, the first time the images are resized, which is normally when the fit_transform method is run. Subsequent calls to transform will not change the computed metrics, and running fit on its own simply instantiates the resizing functions.
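
The compute-once behaviour described in the note can be sketched with a small stand-in class. `ToyResizer` is hypothetical and works on a flat list of pixel values rather than real images; it only illustrates that the statistics are stored on the first call and left untouched afterwards.

```python
from statistics import fmean
from typing import Dict, List, Optional


class ToyResizer:
    """Hypothetical stand-in that mimics the 'compute metrics once' behaviour."""

    def __init__(self) -> None:
        self.metrics: Optional[Dict[str, float]] = None

    def transform(self, pixels: List[float]) -> List[float]:
        if self.metrics is None:
            # First call only: store the statistics, scaled to [0, 1]
            self.metrics = {"mean": fmean(pixels) / 255.0}
        return pixels


resizer = ToyResizer()
resizer.transform([0.0, 255.0])    # first call computes the metrics
resizer.transform([255.0, 255.0])  # later calls leave them untouched
print(resizer.metrics)             # {'mean': 0.5}
```

In practice this means the metrics describe the training images, and new data is resized without altering them.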

Source code in pytorch_widedeep/preprocessing/image_preprocessor.py
def __init__(
    self,
    img_col: str,
    img_path: str,
    width: int = 224,
    height: int = 224,
    verbose: int = 1,
):
    super(ImagePreprocessor, self).__init__()

    self.img_col = img_col
    self.img_path = img_path
    self.width = width
    self.height = height
    self.verbose = verbose

    self.aap = AspectAwarePreprocessor(self.width, self.height)
    self.spp = SimplePreprocessor(self.width, self.height)

    self.compute_normalising_computed = False

transform

transform(df)

Resizes the images to the input height and width.

Parameters:

  • df (DataFrame) –

    Input pandas dataframe with the img_col

Returns:

  • ndarray

    Resized images to the input height and width

Source code in pytorch_widedeep/preprocessing/image_preprocessor.py
def transform(self, df: pd.DataFrame) -> np.ndarray:
    """Resizes the images to the input height and width.


    Parameters
    ----------
    df: pd.DataFrame
        Input pandas dataframe with the `img_col`

    Returns
    -------
    np.ndarray
        Resized images to the input height and width
    """
    image_list = df[self.img_col].tolist()
    if self.verbose:
        print("Reading Images from {}".format(self.img_path))
    imgs = [cv2.imread("/".join([self.img_path, img])) for img in image_list]

    # finding images with different height and width
    aspect = [(im.shape[0], im.shape[1]) for im in imgs]
    aspect_r = [a[0] / a[1] for a in aspect]
    diff_idx = [i for i, r in enumerate(aspect_r) if r != 1.0]

    if self.verbose:
        print("Resizing")
    resized_imgs = []
    for i, img in tqdm(enumerate(imgs), total=len(imgs), disable=self.verbose != 1):
        if i in diff_idx:
            resized_imgs.append(self.aap.preprocess(img))
        else:
            # if aspect ratio is 1:1, no need for AspectAwarePreprocessor
            resized_imgs.append(self.spp.preprocess(img))

    if not self.compute_normalising_computed:
        if self.verbose:
            print("Computing normalisation metrics")
        # mean and std deviation will only be computed when the fit method
        # is called
        mean_R, mean_G, mean_B = [], [], []
        std_R, std_G, std_B = [], [], []
        for rsz_img in resized_imgs:
            (mean_b, mean_g, mean_r), (std_b, std_g, std_r) = cv2.meanStdDev(
                rsz_img
            )
            mean_R.append(mean_r)
            mean_G.append(mean_g)
            mean_B.append(mean_b)
            std_R.append(std_r)
            std_G.append(std_g)
            std_B.append(std_b)
        self.normalise_metrics = dict(
            mean={
                "R": np.mean(mean_R) / 255.0,
                "G": np.mean(mean_G) / 255.0,
                "B": np.mean(mean_B) / 255.0,
            },
            std={
                "R": np.mean(std_R) / 255.0,
                "G": np.mean(std_G) / 255.0,
                "B": np.mean(std_B) / 255.0,
            },
        )
        self.compute_normalising_computed = True
    return np.asarray(resized_imgs)
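
The normalisation step in the listing above can be reproduced without OpenCV on tiny hand-made "images". This is a sketch under stated assumptions: `channel_stats` and `normalise_metrics` are hypothetical helpers, each image is a flat list of (B, G, R) pixels, and the population standard deviation is used on the assumption that it matches what cv2.meanStdDev returns.

```python
from statistics import fmean, pstdev
from typing import Dict, List, Tuple

Pixel = Tuple[int, int, int]  # (B, G, R), the channel order OpenCV uses


def channel_stats(img: List[Pixel]) -> Tuple[List[float], List[float]]:
    """Per-channel mean and population std of one flattened image."""
    channels = list(zip(*img))  # three tuples: all B, all G, all R values
    return [fmean(c) for c in channels], [pstdev(c) for c in channels]


def normalise_metrics(imgs: List[List[Pixel]]) -> Dict[str, Dict[str, float]]:
    """Average the per-image statistics and rescale them to [0, 1]."""
    stats = [channel_stats(img) for img in imgs]
    names = ("B", "G", "R")
    mean = {n: fmean(s[0][i] for s in stats) / 255.0 for i, n in enumerate(names)}
    std = {n: fmean(s[1][i] for s in stats) / 255.0 for i, n in enumerate(names)}
    return {"mean": mean, "std": std}


# Two tiny 2-pixel "images"
img_a = [(0, 0, 255), (0, 0, 255)]    # pure red
img_b = [(0, 255, 255), (0, 0, 255)]  # one yellow and one red pixel
metrics = normalise_metrics([img_a, img_b])
print(metrics["mean"])  # {'B': 0.0, 'G': 0.25, 'R': 1.0}
```

Note that the statistics are averaged per image rather than computed over all pixels at once, which is what the loop over resized_imgs in the listing does.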

fit_transform

fit_transform(df)

Combines fit and transform

Parameters:

  • df (DataFrame) –

    Input pandas dataframe

Returns:

  • ndarray

    Resized images to the input height and width

Source code in pytorch_widedeep/preprocessing/image_preprocessor.py
def fit_transform(self, df: pd.DataFrame) -> np.ndarray:
    """Combines `fit` and `transform`

    Parameters
    ----------
    df: pd.DataFrame
        Input pandas dataframe

    Returns
    -------
    np.ndarray
        Resized images to the input height and width
    """
    return self.fit(df).transform(df)

Chunked versions

Chunked versions of the preprocessors are also available. These are useful when the data is too big to fit in memory. See also the load_from_folder module in the library and the corresponding section here in the documentation.

Note that there is no ChunkImagePreprocessor. This is because the images are processed inside the ImageFromFolder class in the load_from_folder module.

ChunkWidePreprocessor

ChunkWidePreprocessor(wide_cols, n_chunks, crossed_cols=None)

Bases: WidePreprocessor

Preprocessor to prepare the wide input dataset

This Preprocessor prepares the data for the wide, linear component. This linear model is implemented via an Embedding layer that is connected to the output neuron. ChunkWidePreprocessor numerically encodes all the unique values of all categorical columns wide_cols + crossed_cols. See the Example below.

Parameters:

  • wide_cols (List[str]) –

    List of strings with the names of the columns that will be label encoded and passed through the wide component

  • crossed_cols (Optional[List[Tuple[str, str]]], default: None ) –

    List of Tuples with the name of the columns that will be 'crossed' and then label encoded. e.g. [('education', 'occupation'), ...]. For binary features, a cross-product transformation is 1 if and only if the constituent features are all 1, and 0 otherwise.

Attributes:

  • wide_crossed_cols (List) –

    List with the names of all columns that will be label encoded

  • encoding_dict (Dict) –

    Dictionary where the keys are the result of pasting colname + '_' + column value and the values are the corresponding mapped integer.

  • inverse_encoding_dict (Dict) –

    the inverse encoding dictionary

  • wide_dim (int) –

    Dimension of the wide model (i.e. dim of the linear layer)
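
As a rough illustration of how crossed columns and the encoding dictionary relate, the sketch below rebuilds the idea in plain Python on a three-row dataset. It is not the library's implementation: the rows are plain dicts, and the '_' and '-' separators used to name and fill the crossed column are assumptions.

```python
from typing import Dict, List, Tuple

rows = [
    {"color": "r", "size": "s"},
    {"color": "b", "size": "n"},
    {"color": "g", "size": "l"},
]
wide_cols: List[str] = ["color"]
crossed_cols: List[Tuple[str, str]] = [("color", "size")]

# 1. Add one crossed column per tuple (separators are assumptions)
for row in rows:
    for a, b in crossed_cols:
        row[f"{a}_{b}"] = f"{row[a]}-{row[b]}"

wide_crossed_cols = wide_cols + [f"{a}_{b}" for a, b in crossed_cols]

# 2. Build keys as colname + '_' + value and map each one to an integer >= 1
encoding_dict: Dict[str, int] = {}
for col in wide_crossed_cols:
    for row in rows:
        encoding_dict.setdefault(f"{col}_{row[col]}", len(encoding_dict) + 1)

# 3. Encode every row; 0 is reserved for values not seen during fit
X_wide = [[encoding_dict.get(f"{c}_{r[c]}", 0) for c in wide_crossed_cols] for r in rows]

print(encoding_dict)
print(X_wide)  # [[1, 4], [2, 5], [3, 6]]
```

Because every column shares one dictionary, each integer identifies a (column, value) pair uniquely, which is what lets a single Embedding layer act as the linear model.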

Examples:

>>> import pandas as pd
>>> from pytorch_widedeep.preprocessing import ChunkWidePreprocessor
>>> chunk = pd.DataFrame({'color': ['r', 'b', 'g'], 'size': ['s', 'n', 'l']})
>>> wide_cols = ['color']
>>> crossed_cols = [('color', 'size')]
>>> chunk_wide_preprocessor = ChunkWidePreprocessor(wide_cols=wide_cols, crossed_cols=crossed_cols,
... n_chunks=1)
>>> X_wide = chunk_wide_preprocessor.fit_transform(chunk)
Source code in pytorch_widedeep/preprocessing/wide_preprocessor.py
def __init__(
    self,
    wide_cols: List[str],
    n_chunks: int,
    crossed_cols: Optional[List[Tuple[str, str]]] = None,
):
    super(ChunkWidePreprocessor, self).__init__(wide_cols, crossed_cols)

    self.n_chunks = n_chunks

    self.chunk_counter = 0

    self.is_fitted = False

partial_fit

partial_fit(chunk)

Fits the Preprocessor and creates required attributes

Parameters:

  • chunk (DataFrame) –

    Input pandas dataframe

Returns:

  • ChunkWidePreprocessor

    ChunkWidePreprocessor fitted object

Source code in pytorch_widedeep/preprocessing/wide_preprocessor.py
def partial_fit(self, chunk: pd.DataFrame) -> "ChunkWidePreprocessor":
    r"""Fits the Preprocessor and creates required attributes

    Parameters
    ----------
    chunk: pd.DataFrame
        Input pandas dataframe

    Returns
    -------
    ChunkWidePreprocessor
        `ChunkWidePreprocessor` fitted object
    """
    df_wide = self._prepare_wide(chunk)
    self.wide_crossed_cols = df_wide.columns.tolist()

    if self.chunk_counter == 0:
        self.glob_feature_set = set(
            self._make_global_feature_list(df_wide[self.wide_crossed_cols])
        )
    else:
        self.glob_feature_set.update(
            self._make_global_feature_list(df_wide[self.wide_crossed_cols])
        )

    self.chunk_counter += 1

    if self.chunk_counter == self.n_chunks:
        self.encoding_dict = {v: i + 1 for i, v in enumerate(self.glob_feature_set)}
        self.wide_dim = len(self.encoding_dict)
        self.inverse_encoding_dict = {k: v for v, k in self.encoding_dict.items()}
        self.inverse_encoding_dict[0] = "unseen"

        self.is_fitted = True

    return self
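
The chunk-counting logic above can be sketched with a simplified stand-in. `ToyChunkEncoder` is hypothetical and takes lists of already-built feature strings instead of dataframes; it also sorts the accumulated set to make the mapping reproducible, whereas the listing above enumerates the set directly.

```python
from typing import Dict, Iterable, Set


class ToyChunkEncoder:
    """Hypothetical, simplified stand-in for the partial_fit logic above."""

    def __init__(self, n_chunks: int) -> None:
        self.n_chunks = n_chunks
        self.chunk_counter = 0
        self.features: Set[str] = set()
        self.is_fitted = False

    def partial_fit(self, chunk: Iterable[str]) -> "ToyChunkEncoder":
        self.features.update(chunk)  # accumulate unique values across chunks
        self.chunk_counter += 1
        if self.chunk_counter == self.n_chunks:
            # Only the last chunk triggers the creation of the encoding
            self.encoding_dict: Dict[str, int] = {
                v: i + 1 for i, v in enumerate(sorted(self.features))
            }
            self.is_fitted = True
        return self


chunks = [["color_r", "color_b"], ["color_b", "color_g"]]
encoder = ToyChunkEncoder(n_chunks=len(chunks))
for chunk in chunks:
    encoder.partial_fit(chunk)
    print(encoder.is_fitted)  # False after the first chunk, True after the second

print(encoder.encoding_dict)  # {'color_b': 1, 'color_g': 2, 'color_r': 3}
```

The key design point is that n_chunks must match the number of partial_fit calls: with fewer calls the encoding dictionary is never created.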

fit

fit(df)

Runs partial_fit. This method exists only to override the fit method in the base class. This class is not designed or intended to be used via fit

Source code in pytorch_widedeep/preprocessing/wide_preprocessor.py
def fit(self, df: pd.DataFrame) -> "ChunkWidePreprocessor":
    """
    Runs `partial_fit`. This is just to override the fit method in the base
    class. This class is not designed or thought to run fit
    """
    return self.partial_fit(df)

ChunkTabPreprocessor

ChunkTabPreprocessor(n_chunks, cat_embed_cols=None, continuous_cols=None, cols_and_bins=None, cols_to_scale=None, default_embed_dim=16, with_attention=False, with_cls_token=False, shared_embed=False, verbose=1, *, scale=False, already_standard=None, **kwargs)

Bases: TabPreprocessor

Preprocessor to prepare the deeptabular component input dataset

Parameters:

  • n_chunks (int) –

    Number of chunks that the tabular dataset is divided into.

  • cat_embed_cols (Optional[Union[List[str], List[Tuple[str, int]]]], default: None ) –

    List containing the name of the categorical columns that will be represented by embeddings (e.g. ['education', 'relationship', ...]) or a Tuple with the name and the embedding dimension (e.g.: [ ('education',32), ('relationship',16), ...])

  • continuous_cols (Optional[List[str]], default: None ) –

    List with the name of the continuous cols

  • cols_and_bins (Optional[Dict[str, List[float]]], default: None ) –

    Continuous columns can be turned into categorical via pd.cut. 'cols_and_bins' is a dictionary where the keys are the names of the columns to quantize and the values are lists of scalars indicating the bin edges.

  • cols_to_scale (Optional[Union[List[str], str]], default: None ) –

    List with the names of the columns that will be standardised via sklearn's StandardScaler

  • default_embed_dim (int, default: 16 ) –

    Dimension for the embeddings if the embed_dim is not provided in the cat_embed_cols parameter and auto_embed_dim is set to False.

  • with_attention (bool, default: False ) –

    Boolean indicating whether the preprocessed data will be passed to an attention-based model (more precisely a model where all embeddings must have the same dimensions). If True, the param cat_embed_cols must be a list containing only the categorical column names: e.g. ['education', 'relationship', ...]. This is because they will all be encoded using embeddings of the same dim, which will be specified later when the model is defined.
    Param alias: for_transformer

  • with_cls_token (bool, default: False ) –

    Boolean indicating if a '[CLS]' token will be added to the dataset when using attention-based models. The final hidden state corresponding to this token is used as the aggregated representation for classification and regression tasks. If not, the categorical (and continuous embeddings if present) will be concatenated before being passed to the final MLP (if present).

  • shared_embed (bool, default: False ) –

    Boolean indicating if the embeddings will be "shared" when using attention-based models. The idea behind shared_embed is described in the Appendix A in the TabTransformer paper: 'The goal of having column embedding is to enable the model to distinguish the classes in one column from those in the other columns'. In other words, the idea is to let the model learn which column is embedded at the time. See: pytorch_widedeep.models.transformers._layers.SharedEmbeddings.

  • verbose (int, default: 1 ) –

    Enable verbose output.

  • scale (bool, default: False ) –

    ℹ️ note: this arg will be removed in upcoming releases. Please use cols_to_scale instead.
    Bool indicating whether or not to scale/standardise continuous cols. It is important to emphasize that all the DL models for tabular data in the library also include the possibility of normalising the input continuous features via a BatchNorm or a LayerNorm.
    Param alias: scale_cont_cols.

  • already_standard (Optional[List[str]], default: None ) –

    ℹ️ note: this arg will be removed in upcoming releases. Please use cols_to_scale instead.
    List with the names of the continuous cols that do not need to be scaled/standardised.

Other Parameters:

  • **kwargs

    pd.cut and StandardScaler related args

Attributes:

  • embed_dim (Dict) –

    Dictionary where keys are the embed cols and values are the embedding dimensions. If with_attention is set to True this attribute is not generated during the fit process

  • label_encoder (LabelEncoder) –

    see pytorch_widedeep.utils.dense_utils.LabelEncoder

  • cat_embed_input (List) –

    List of Tuples with the column name, number of individual values for that column and, if with_attention is set to False, the corresponding embeddings dim, e.g. [('education', 16, 10), ('relationship', 6, 8), ...].

  • standardize_cols (List) –

    List of the columns that will be standardized

  • scaler (StandardScaler) –

    an instance of sklearn.preprocessing.StandardScaler if 'cols_to_scale' is not None or 'scale' is 'True'

  • column_idx (Dict) –

    Dictionary where keys are column names and values are column indexes. This is necessary to slice tensors

  • quantizer (Quantizer) –

    an instance of Quantizer
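
To clarify what the bin edges in cols_and_bins mean, the sketch below assigns values to right-closed bins, mimicking the default behaviour of pd.cut. It is a hedged illustration, not the library's Quantizer: `quantize` is a hypothetical helper, the 'age' column is made up, and the 0-based bin index may differ from the labels the library produces.

```python
from bisect import bisect_left
from typing import Dict, List, Optional


def quantize(value: float, edges: List[float]) -> Optional[int]:
    """Return the 0-based index of the right-closed bin (a, b] containing `value`.

    Values outside (edges[0], edges[-1]] return None, similar to the NaN
    that pd.cut produces by default.
    """
    idx = bisect_left(edges, value) - 1
    return idx if 0 <= idx < len(edges) - 1 else None


# Hypothetical column name and bin edges
cols_and_bins: Dict[str, List[float]] = {"age": [0, 18, 40, 65, 120]}
ages = [5, 18, 19, 70, 130]
print([quantize(a, cols_and_bins["age"]) for a in ages])  # [0, 0, 1, 3, None]
```

Since the edges are supplied up front rather than learned from the data, every chunk is binned in exactly the same way.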

Examples:

>>> import pandas as pd
>>> import numpy as np
>>> from pytorch_widedeep.preprocessing import ChunkTabPreprocessor
>>> np.random.seed(42)
>>> chunk_df = pd.DataFrame({'cat_col': np.random.choice(['A', 'B', 'C'], size=8),
... 'cont_col': np.random.uniform(1, 100, size=8)})
>>> cat_embed_cols = [('cat_col',4)]
>>> cont_cols = ['cont_col']
>>> tab_preprocessor = ChunkTabPreprocessor(
... n_chunks=1, cat_embed_cols=cat_embed_cols, continuous_cols=cont_cols
... )
>>> X_tab = tab_preprocessor.fit_transform(chunk_df)
>>> tab_preprocessor.cat_embed_cols
[('cat_col', 4)]
>>> tab_preprocessor.column_idx
{'cat_col': 0, 'cont_col': 1}
Source code in pytorch_widedeep/preprocessing/tab_preprocessor.py
@alias("with_attention", ["for_transformer"])
@alias("cat_embed_cols", ["embed_cols"])
@alias("scale", ["scale_cont_cols"])
@alias("cols_and_bins", ["quantization_setup"])
def __init__(
    self,
    n_chunks: int,
    cat_embed_cols: Optional[Union[List[str], List[Tuple[str, int]]]] = None,
    continuous_cols: Optional[List[str]] = None,
    cols_and_bins: Optional[Dict[str, List[float]]] = None,
    cols_to_scale: Optional[Union[List[str], str]] = None,
    default_embed_dim: int = 16,
    with_attention: bool = False,
    with_cls_token: bool = False,
    shared_embed: bool = False,
    verbose: int = 1,
    *,
    scale: bool = False,
    already_standard: Optional[List[str]] = None,
    **kwargs,
):
    super(ChunkTabPreprocessor, self).__init__(
        cat_embed_cols=cat_embed_cols,
        continuous_cols=continuous_cols,
        quantization_setup=None,
        cols_to_scale=cols_to_scale,
        auto_embed_dim=False,
        embedding_rule="google",  # does not matter, irrelevant
        default_embed_dim=default_embed_dim,
        with_attention=with_attention,
        with_cls_token=with_cls_token,
        shared_embed=shared_embed,
        verbose=verbose,
        scale=scale,
        already_standard=already_standard,
        **kwargs,
    )

    self.n_chunks = n_chunks
    self.chunk_counter = 0

    self.cols_and_bins = cols_and_bins  # type: ignore[assignment]
    if self.cols_and_bins is not None:
        self.quantizer = Quantizer(self.cols_and_bins, **self.quant_args)

    self.embed_prepared = False
    self.continuous_prepared = False

ChunkTextPreprocessor

ChunkTextPreprocessor(text_col, n_chunks, root_dir=None, max_vocab=30000, min_freq=5, maxlen=80, pad_first=True, pad_idx=1, already_processed=False, word_vectors_path=None, n_cpus=None, verbose=1)

Bases: TextPreprocessor

Preprocessor to prepare the deeptext input dataset

Parameters:

  • text_col (str) –

    column in the input dataframe containing either the texts or the filenames where the text documents are stored

  • n_chunks (int) –

    Number of chunks that the text dataset is divided into.

  • root_dir (Optional[str], default: None ) –

    If 'text_col' contains the filenames with the text documents, this is the path to the directory where those documents are stored.

  • max_vocab (int, default: 30000 ) –

    Maximum number of tokens in the vocabulary

  • min_freq (int, default: 5 ) –

    Minimum frequency for a token to be part of the vocabulary

  • maxlen (int, default: 80 ) –

    Maximum length of the tokenized sequences

  • pad_first (bool, default: True ) –

    Indicates whether the padding index will be added at the beginning or the end of the sequences

  • pad_idx (int, default: 1 ) –

    padding index. Fastai's Tokenizer leaves 0 for the 'unknown' token.

  • word_vectors_path (Optional[str], default: None ) –

    Path to the pretrained word vectors

  • n_cpus (Optional[int], default: None ) –

    Number of CPUs to use during the tokenization process

  • verbose (int, default: 1 ) –

    Enable verbose output.

Attributes:

  • vocab (Vocab) –

    an instance of pytorch_widedeep.utils.fastai_transforms.ChunkVocab

  • embedding_matrix (ndarray) –

    Array with the pretrained embeddings if word_vectors_path is not None
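
The interplay between n_chunks, max_vocab and min_freq can be sketched by accumulating token counts chunk by chunk and building the vocabulary once at the end. This is a simplified illustration, not the library's ChunkVocab: `build_vocab` is a hypothetical helper, the tokeniser is a plain whitespace split, the special tokens follow the fastai naming convention (assumed), and whether max_vocab counts the special tokens is left aside.

```python
from collections import Counter
from typing import Dict, List

SPECIAL_TOKENS = ["xxunk", "xxpad"]  # assumed fastai-style special tokens


def build_vocab(chunks: List[List[str]], max_vocab: int, min_freq: int) -> List[str]:
    """Accumulate token counts chunk by chunk, then build the vocabulary once."""
    counts: Counter = Counter()
    for chunk in chunks:
        for text in chunk:
            counts.update(text.lower().split())  # naive whitespace tokeniser
    # Keep the most frequent tokens that reach min_freq, capped at max_vocab
    frequent = [tok for tok, n in counts.most_common(max_vocab) if n >= min_freq]
    return SPECIAL_TOKENS + frequent


chunks = [
    ["life is like a box", "a box is a box"],
    ["you never know what is in a box"],
]
itos = build_vocab(chunks, max_vocab=25, min_freq=2)
stoi: Dict[str, int] = {tok: i for i, tok in enumerate(itos)}
print(itos)  # ['xxunk', 'xxpad', 'a', 'box', 'is']
```

With this layout the padding token sits at index 1 and the unknown token at index 0, consistent with the pad_idx default described above.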

Examples:

>>> import pandas as pd
>>> from pytorch_widedeep.preprocessing import ChunkTextPreprocessor
>>> chunk_df = pd.DataFrame({'text_column': ["life is like a box of chocolates",
... "You never know what you're gonna get"]})
>>> chunk_text_preprocessor = ChunkTextPreprocessor(text_col='text_column', n_chunks=1,
... max_vocab=25, min_freq=1, maxlen=10, verbose=0, n_cpus=1)
>>> processed_chunk = chunk_text_preprocessor.fit_transform(chunk_df)
Source code in pytorch_widedeep/preprocessing/text_preprocessor.py
def __init__(
    self,
    text_col: str,
    n_chunks: int,
    root_dir: Optional[str] = None,
    max_vocab: int = 30000,
    min_freq: int = 5,
    maxlen: int = 80,
    pad_first: bool = True,
    pad_idx: int = 1,
    already_processed: Optional[bool] = False,
    word_vectors_path: Optional[str] = None,
    n_cpus: Optional[int] = None,
    verbose: int = 1,
):
    super(ChunkTextPreprocessor, self).__init__(
        text_col=text_col,
        max_vocab=max_vocab,
        min_freq=min_freq,
        maxlen=maxlen,
        pad_first=pad_first,
        pad_idx=pad_idx,
        already_processed=already_processed,
        word_vectors_path=word_vectors_path,
        n_cpus=n_cpus,
        verbose=verbose,
    )

    self.n_chunks = n_chunks
    self.root_dir = root_dir

    self.chunk_counter = 0

    self.is_fitted = False