The `preprocessing` module¶

This module contains the classes that are used to prepare the data before being passed to the models. There is one Preprocessor per data mode or model component: wide, deeptabular, deepimage and deeptext.

WidePreprocessor ¶

WidePreprocessor(wide_cols, crossed_cols=None)

Bases: BasePreprocessor

Preprocessor to prepare the wide input dataset

This Preprocessor prepares the data for the wide, linear component. This linear model is implemented via an Embedding layer that is connected to the output neuron. WidePreprocessor numerically encodes all the unique values of all categorical columns wide_cols + crossed_cols. See the Example below.

Parameters:

wide_cols (List[str]) –

List of strings with the names of the columns that will be label encoded and passed through the wide component
crossed_cols (Optional[List[Tuple[str, str]]], default: None ) –

List of Tuples with the name of the columns that will be 'crossed' and then label encoded. e.g. [('education', 'occupation'), ...]. For binary features, a cross-product transformation is 1 if and only if the constituent features are all 1, and 0 otherwise.

Attributes:

wide_crossed_cols (List) –

List with the names of all columns that will be label encoded
encoding_dict (Dict) –

Dictionary where the keys are the result of concatenating colname + '_' + column value and the values are the corresponding mapped integers.
inverse_encoding_dict (Dict) –

the inverse encoding dictionary
wide_dim (int) –

Dimension of the wide model (i.e. dim of the linear layer)

Examples:

>>> import pandas as pd
>>> from pytorch_widedeep.preprocessing import WidePreprocessor
>>> df = pd.DataFrame({'color': ['r', 'b', 'g'], 'size': ['s', 'n', 'l']})
>>> wide_cols = ['color']
>>> crossed_cols = [('color', 'size')]
>>> wide_preprocessor = WidePreprocessor(wide_cols=wide_cols, crossed_cols=crossed_cols)
>>> X_wide = wide_preprocessor.fit_transform(df)
>>> X_wide
array([[1, 4],
       [2, 5],
       [3, 6]])
>>> wide_preprocessor.encoding_dict
{'color_r': 1, 'color_b': 2, 'color_g': 3, 'color_size_r-s': 4, 'color_size_b-n': 5, 'color_size_g-l': 6}
>>> wide_preprocessor.inverse_transform(X_wide)
  color color_size
0     r        r-s
1     b        b-n
2     g        g-l
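
The encoding in the example above can be reproduced with a few lines of plain Python. This is a minimal sketch of the idea rather than the library's implementation: the helper name `build_encoding` is hypothetical, and the `'-'` separator for crossed values is inferred from the example output.

```python
from typing import Dict, List, Tuple


def build_encoding(
    rows: List[Dict[str, str]],
    wide_cols: List[str],
    crossed_cols: List[Tuple[str, str]],
) -> Dict[str, int]:
    """Map every 'column_value' string to an integer, starting at 1."""
    features: List[str] = []
    # plain wide columns first, then the crossed ones, as in the example
    for col in wide_cols:
        for row in rows:
            features.append(f"{col}_{row[col]}")
    for a, b in crossed_cols:
        for row in rows:
            features.append(f"{a}_{b}_{row[a]}-{row[b]}")
    # drop duplicates while keeping first-seen order
    unique = list(dict.fromkeys(features))
    # index 0 is left free for padding / unseen categories
    return {name: i + 1 for i, name in enumerate(unique)}


rows = [
    {"color": "r", "size": "s"},
    {"color": "b", "size": "n"},
    {"color": "g", "size": "l"},
]
encoding = build_encoding(rows, ["color"], [("color", "size")])
print(encoding)
```

With the three rows above, the printed dictionary matches the `encoding_dict` shown in the example.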

Source code in pytorch_widedeep/preprocessing/wide_preprocessor.py

def __init__(
    self, wide_cols: List[str], crossed_cols: Optional[List[Tuple[str, str]]] = None
):
    super(WidePreprocessor, self).__init__()

    self.wide_cols = wide_cols
    self.crossed_cols = crossed_cols

    self.is_fitted = False

fit ¶

fit(df)

Fits the Preprocessor and creates required attributes

Parameters:

df (DataFrame) –

Input pandas dataframe

Returns:

WidePreprocessor –

WidePreprocessor fitted object

Source code in pytorch_widedeep/preprocessing/wide_preprocessor.py

def fit(self, df: pd.DataFrame) -> "WidePreprocessor":
    r"""Fits the Preprocessor and creates required attributes

    Parameters
    ----------
    df: pd.DataFrame
        Input pandas dataframe

    Returns
    -------
    WidePreprocessor
        `WidePreprocessor` fitted object
    """
    df_wide = self._prepare_wide(df)
    self.wide_crossed_cols = df_wide.columns.tolist()
    glob_feature_list = self._make_global_feature_list(
        df_wide[self.wide_crossed_cols]
    )
    # leave 0 for padding/"unseen" categories
    self.encoding_dict = {v: i + 1 for i, v in enumerate(glob_feature_list)}
    self.wide_dim = len(self.encoding_dict)
    self.inverse_encoding_dict = {k: v for v, k in self.encoding_dict.items()}
    self.inverse_encoding_dict[0] = "unseen"

    self.is_fitted = True

    return self

transform ¶

transform(df)

Parameters:

df (DataFrame) –

Input pandas dataframe

Returns:

ndarray –

transformed input dataframe

Source code in pytorch_widedeep/preprocessing/wide_preprocessor.py

def transform(self, df: pd.DataFrame) -> np.ndarray:
    r"""
    Parameters
    ----------
    df: pd.DataFrame
        Input pandas dataframe

    Returns
    -------
    np.ndarray
        transformed input dataframe
    """
    check_is_fitted(self, attributes=["encoding_dict"])
    df_wide = self._prepare_wide(df)
    encoded = np.zeros([len(df_wide), len(self.wide_crossed_cols)])
    for col_i, col in enumerate(self.wide_crossed_cols):
        encoded[:, col_i] = df_wide[col].apply(
            lambda x: (
                self.encoding_dict[col + "_" + str(x)]
                if col + "_" + str(x) in self.encoding_dict
                else 0
            )
        )
    return encoded.astype("int64")
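
The lookup in `transform` falls back to 0 for any value that was not present when the preprocessor was fitted. A minimal standalone sketch of that behaviour follows; the dictionary and the helper `encode_column` are illustrative and not part of the library.

```python
from typing import Dict, List

# hypothetical fitted dictionary, copied from the WidePreprocessor example
encoding_dict: Dict[str, int] = {"color_r": 1, "color_b": 2, "color_g": 3}


def encode_column(col: str, values: List[str]) -> List[int]:
    """Look up 'col_value' keys; anything not seen during fit becomes 0."""
    return [encoding_dict.get(f"{col}_{v}", 0) for v in values]


encoded = encode_column("color", ["g", "r", "purple"])
print(encoded)  # [3, 1, 0]
```

Because index 0 is reserved, the embedding layer of the wide component can treat it as a padding or "unseen" slot.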

inverse_transform ¶

inverse_transform(encoded)

Takes as input the output from the transform method and it will return the original values.

Parameters:

encoded (ndarray) –

numpy array with the encoded values that are the output from the transform method

Returns:

DataFrame –

Pandas dataframe with the original values

Source code in pytorch_widedeep/preprocessing/wide_preprocessor.py

def inverse_transform(self, encoded: np.ndarray) -> pd.DataFrame:
    r"""Takes as input the output from the `transform` method and it will
    return the original values.

    Parameters
    ----------
    encoded: np.ndarray
        numpy array with the encoded values that are the output from the
        `transform` method

    Returns
    -------
    pd.DataFrame
        Pandas dataframe with the original values
    """
    decoded = pd.DataFrame(encoded, columns=self.wide_crossed_cols)

    if pd.__version__ >= "2.1.0":
        decoded = decoded.map(lambda x: self.inverse_encoding_dict[x])
    else:
        decoded = decoded.applymap(lambda x: self.inverse_encoding_dict[x])

    for col in decoded.columns:
        rm_str = "".join([col, "_"])
        decoded[col] = decoded[col].apply(lambda x: x.replace(rm_str, ""))
    return decoded
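
One caveat about the version check in the source above: `pd.__version__ >= "2.1.0"` compares strings character by character, so a hypothetical version such as `"10.0.0"` would sort before `"2.1.0"`. The sketch below shows the difference using a numeric tuple comparison. The helper `version_tuple` is an illustrative assumption, not part of pandas or pytorch_widedeep, and it only handles plain dotted numeric prefixes.

```python
import re
from typing import Tuple


def version_tuple(version: str) -> Tuple[int, ...]:
    """Turn '2.1.0' (or '2.1.0rc1') into (2, 1, 0) for numeric comparison."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    return tuple(parts)


# lexicographic string comparison gets multi-digit components wrong
print("10.0.0" >= "2.1.0")                                # False
print(version_tuple("10.0.0") >= version_tuple("2.1.0"))  # True
```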

fit_transform ¶

fit_transform(df)

Combines fit and transform

Parameters:

df (DataFrame) –

Input pandas dataframe

Returns:

ndarray –

transformed input dataframe

Source code in pytorch_widedeep/preprocessing/wide_preprocessor.py

def fit_transform(self, df: pd.DataFrame) -> np.ndarray:
    """Combines `fit` and `transform`

    Parameters
    ----------
    df: pd.DataFrame
        Input pandas dataframe

    Returns
    -------
    np.ndarray
        transformed input dataframe
    """
    return self.fit(df).transform(df)

TabPreprocessor ¶

TabPreprocessor(cat_embed_cols=None, continuous_cols=None, quantization_setup=None, cols_to_scale=None, auto_embed_dim=True, embedding_rule='fastai_new', default_embed_dim=16, with_attention=False, with_cls_token=False, shared_embed=False, verbose=1, *, scale=False, already_standard=None, **kwargs)

Bases: BasePreprocessor

Preprocessor to prepare the deeptabular component input dataset

Parameters:

cat_embed_cols (Optional[Union[List[str], List[Tuple[str, int]]]], default: None ) –

List containing the name of the categorical columns that will be represented by embeddings (e.g. ['education', 'relationship', ...]) or a Tuple with the name and the embedding dimension (e.g.: [ ('education',32), ('relationship',16), ...])
continuous_cols (Optional[List[str]], default: None ) –

List with the name of the continuous cols
quantization_setup (Optional[Union[int, Dict[str, Union[int, List[float]]]]], default: None ) –

Continuous columns can be turned into categorical via pd.cut. If quantization_setup is an int, all continuous columns will be quantized using this value as the number of bins. Alternatively, it can be a dictionary where the keys are the names of the columns to quantize and the values are either integers indicating the number of bins or lists of scalars indicating the bin edges.
cols_to_scale (Optional[Union[List[str], str]], default: None ) –

List with the names of the columns that will be standardised via sklearn's StandardScaler. It can also be the string 'all', in which case all the continuous cols will be scaled.
auto_embed_dim (bool, default: True ) –

Boolean indicating whether the embedding dimensions will be automatically defined via rule of thumb. See embedding_rule below.
embedding_rule (Literal[google, fastai_old, fastai_new], default: 'fastai_new' ) –

If auto_embed_dim=True, this is the choice of embedding rule of thumb. Choices are:
- fastai_new: \(min(600, round(1.6 \times n_{cat}^{0.56}))\)
- fastai_old: \(min(50, (n_{cat}//{2})+1)\)
- google: \(min(600, round(n_{cat}^{0.24}))\)
default_embed_dim (int, default: 16 ) –

Dimension for the embeddings if the embedding dimension is not provided in the cat_embed_cols parameter and auto_embed_dim is set to False.
with_attention (bool, default: False ) –

Boolean indicating whether the preprocessed data will be passed to an attention-based model (more precisely, a model where all embeddings must have the same dimensions). If True, the param cat_embed_cols must be a list containing only the categorical column names, e.g. ['education', 'relationship', ...]. This is because they will all be encoded using embeddings of the same dim, which will be specified later when the model is defined.
Param alias: for_transformer
with_cls_token (bool, default: False ) –

Boolean indicating if a '[CLS]' token will be added to the dataset when using attention-based models. The final hidden state corresponding to this token is used as the aggregated representation for classification and regression tasks. If not, the categorical and/or continuous embeddings will be concatenated before being passed to the final MLP (if present).
shared_embed (bool, default: False ) –

Boolean indicating if the embeddings will be "shared" when using attention-based models. The idea behind shared_embed is described in the Appendix A in the TabTransformer paper: 'The goal of having column embedding is to enable the model to distinguish the classes in one column from those in the other columns'. In other words, the idea is to let the model learn which column is embedded at the time. See: pytorch_widedeep.models.transformers._layers.SharedEmbeddings.
verbose (int, default: 1 ) –
scale (bool, default: False ) –

note: this arg will be removed in upcoming releases. Please use cols_to_scale instead.
Bool indicating whether or not to scale/standardise continuous cols. It is important to emphasize that all the DL models for tabular data in the library also include the possibility of normalising the input continuous features via a BatchNorm or a LayerNorm.
Param alias: scale_cont_cols.
already_standard (Optional[List[str]], default: None ) –

note: this arg will be removed in upcoming releases. Please use cols_to_scale instead.
List with the names of the continuous cols that do not need to be scaled/standardised.

Other Parameters:

**kwargs –

pd.cut and StandardScaler related args
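
As a rough illustration of the `embedding_rule` choices listed above, the formulas can be evaluated directly. This is a sketch that transcribes the formulas as written on this page; the function name `rule_of_thumb_dim` is hypothetical and the library's own implementation may differ in detail.

```python
def rule_of_thumb_dim(n_cat: int, rule: str = "fastai_new") -> int:
    """Embedding dimension for a column with n_cat distinct categories."""
    if rule == "fastai_new":
        return int(min(600, round(1.6 * n_cat**0.56)))
    if rule == "fastai_old":
        return int(min(50, n_cat // 2 + 1))
    if rule == "google":
        return int(min(600, round(n_cat**0.24)))
    raise ValueError(f"unknown embedding rule: {rule}")


# compare the three rules for a few cardinalities
for n_cat in (10, 100, 10_000):
    dims = {r: rule_of_thumb_dim(n_cat, r) for r in ("fastai_new", "fastai_old", "google")}
    print(n_cat, dims)
```

All three rules grow sub-linearly with the number of categories and are capped, so very high-cardinality columns do not produce arbitrarily wide embeddings.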

Attributes:

embed_dim (Dict) –

Dictionary where keys are the embed cols and values are the embedding dimensions. If with_attention is set to True this attribute is not generated during the fit process
label_encoder (LabelEncoder) –

see pytorch_widedeep.utils.dense_utils.LabelEncoder
cat_embed_input (List) –

List of Tuples with the column name, the number of individual values for that column and, if with_attention is set to False, the corresponding embedding dim, e.g. [('education', 16, 10), ('relationship', 6, 8), ...].
standardize_cols (List) –

List of the columns that will be standardized
scaler (StandardScaler) –

an instance of sklearn.preprocessing.StandardScaler
column_idx (Dict) –

Dictionary where keys are column names and values are column indexes. This is necessary to slice tensors
quantizer (Quantizer) –

an instance of Quantizer

Examples:

>>> import pandas as pd
>>> import numpy as np
>>> from pytorch_widedeep.preprocessing import TabPreprocessor
>>> df = pd.DataFrame({'color': ['r', 'b', 'g'], 'size': ['s', 'n', 'l'], 'age': [25, 40, 55]})
>>> cat_embed_cols = [('color',5), ('size',5)]
>>> cont_cols = ['age']
>>> deep_preprocessor = TabPreprocessor(cat_embed_cols=cat_embed_cols, continuous_cols=cont_cols)
>>> X_tab = deep_preprocessor.fit_transform(df)
>>> deep_preprocessor.cat_embed_cols
[('color', 5), ('size', 5)]
>>> deep_preprocessor.column_idx
{'color': 0, 'size': 1, 'age': 2}
>>> cont_df = pd.DataFrame({"col1": np.random.rand(10), "col2": np.random.rand(10) + 1})
>>> cont_cols = ["col1", "col2"]
>>> tab_preprocessor = TabPreprocessor(continuous_cols=cont_cols, quantization_setup=3)
>>> ft_cont_df = tab_preprocessor.fit_transform(cont_df)
>>> # or...
>>> quantization_setup = {'col1': [0., 0.4, 1.], 'col2': [1., 1.4, 2.]}
>>> tab_preprocessor2 = TabPreprocessor(continuous_cols=cont_cols, quantization_setup=quantization_setup)
>>> ft_cont_df2 = tab_preprocessor2.fit_transform(cont_df)
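
To make the `quantization_setup` bin edges more concrete, the sketch below assigns values to bins the way `pd.cut` does with its default settings, i.e. right-closed intervals `(edge_i, edge_{i+1}]` with the lowest edge excluded. It is written in plain Python for illustration; `bin_index` is a hypothetical helper, and the integer codes that `TabPreprocessor` actually emits may be offset differently.

```python
from bisect import bisect_left
from typing import List, Optional


def bin_index(x: float, edges: List[float]) -> Optional[int]:
    """Return the 0-based bin of x for right-closed bins, or None if outside."""
    i = bisect_left(edges, x) - 1
    if i < 0 or i >= len(edges) - 1:
        return None
    return i


edges = [0.0, 0.4, 1.0]  # same edges as 'col1' in the example above
values = [0.1, 0.4, 0.41, 1.0, 1.2, 0.0]
bins = [bin_index(v, edges) for v in values]
print(bins)  # [0, 0, 1, 1, None, None]
```

Values that fall outside the edges have no bin, which is why explicit edges should cover the full range of the column.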

Source code in pytorch_widedeep/preprocessing/tab_preprocessor.py

@alias("with_attention", ["for_transformer"])
@alias("cat_embed_cols", ["embed_cols"])
@alias("scale", ["scale_cont_cols"])
@alias("quantization_setup", ["cols_and_bins"])
def __init__(
    self,
    cat_embed_cols: Optional[Union[List[str], List[Tuple[str, int]]]] = None,
    continuous_cols: Optional[List[str]] = None,
    quantization_setup: Optional[
        Union[int, Dict[str, Union[int, List[float]]]]
    ] = None,
    cols_to_scale: Optional[Union[List[str], str]] = None,
    auto_embed_dim: bool = True,
    embedding_rule: Literal["google", "fastai_old", "fastai_new"] = "fastai_new",
    default_embed_dim: int = 16,
    with_attention: bool = False,
    with_cls_token: bool = False,
    shared_embed: bool = False,
    verbose: int = 1,
    *,
    scale: bool = False,
    already_standard: Optional[List[str]] = None,
    **kwargs,
):
    super(TabPreprocessor, self).__init__()

    self.continuous_cols = continuous_cols
    self.quantization_setup = quantization_setup
    self.cols_to_scale = cols_to_scale
    self.scale = scale
    self.already_standard = already_standard
    self.auto_embed_dim = auto_embed_dim
    self.embedding_rule = embedding_rule
    self.default_embed_dim = default_embed_dim
    self.with_attention = with_attention
    self.with_cls_token = with_cls_token
    self.shared_embed = shared_embed
    self.verbose = verbose

    self.quant_args = {
        k: v for k, v in kwargs.items() if k in pd.cut.__code__.co_varnames
    }
    self.scale_args = {
        k: v for k, v in kwargs.items() if k in StandardScaler().get_params()
    }

    self._check_inputs(cat_embed_cols)

    if with_cls_token:
        self.cat_embed_cols = (
            ["cls_token"] + cat_embed_cols  # type: ignore[operator]
            if cat_embed_cols is not None
            else ["cls_token"]
        )
    else:
        self.cat_embed_cols = cat_embed_cols  # type: ignore[assignment]

    self.is_fitted = False

fit ¶

fit(df)

Fits the Preprocessor and creates required attributes

Parameters:

df (DataFrame) –

Input pandas dataframe

Returns:

TabPreprocessor –

TabPreprocessor fitted object

Source code in pytorch_widedeep/preprocessing/tab_preprocessor.py

def fit(self, df: pd.DataFrame) -> BasePreprocessor:  # noqa: C901
    """Fits the Preprocessor and creates required attributes

    Parameters
    ----------
    df: pd.DataFrame
        Input pandas dataframe

    Returns
    -------
    TabPreprocessor
        `TabPreprocessor` fitted object
    """

    df_adj = self._insert_cls_token(df) if self.with_cls_token else df.copy()

    self.column_idx: Dict[str, int] = {}

    # Categorical embeddings logic
    if self.cat_embed_cols is not None or self.quantization_setup is not None:
        self.cat_embed_input: List[Union[Tuple[str, int], Tuple[str, int, int]]] = (
            []
        )

    if self.cat_embed_cols is not None:
        df_cat, cat_embed_dim = self._prepare_categorical(df_adj)

        self.label_encoder = LabelEncoder(
            columns_to_encode=df_cat.columns.tolist(),
            shared_embed=self.shared_embed,
            with_attention=self.with_attention,
        )
        self.label_encoder.fit(df_cat)

        for k, v in self.label_encoder.encoding_dict.items():
            if self.with_attention:
                self.cat_embed_input.append((k, len(v)))
            else:
                self.cat_embed_input.append((k, len(v), cat_embed_dim[k]))

        self.column_idx.update({k: v for v, k in enumerate(df_cat.columns)})

    # Continuous columns logic
    if self.continuous_cols is not None:
        df_cont, cont_embed_dim = self._prepare_continuous(df_adj)

        # Standardization logic
        if self.standardize_cols is not None:
            self.scaler = StandardScaler(**self.scale_args).fit(
                df_cont[self.standardize_cols].values
            )
        elif self.verbose:
            warnings.warn("Continuous columns will not be normalised")

        # Quantization logic
        if self.cols_and_bins is not None:
            # we do not run 'Quantizer.fit' here since in the wild case
            # someone wants standardization and quantization for the same
            # columns, the Quantizer will run on the scaled data
            self.quantizer = Quantizer(self.cols_and_bins, **self.quant_args)

            if self.with_attention:
                for col, n_cat, _ in cont_embed_dim:
                    self.cat_embed_input.append((col, n_cat))
            else:
                self.cat_embed_input.extend(cont_embed_dim)

        self.column_idx.update(
            {k: v + len(self.column_idx) for v, k in enumerate(df_cont)}
        )

    self.is_fitted = True

    return self
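
The `column_idx` bookkeeping in `fit` places the categorical columns first and then appends the continuous ones, offsetting their positions by the number of columns already registered. A small sketch of that logic follows; the column names are taken from the class example and `build_column_idx` is a hypothetical helper.

```python
from typing import Dict, List


def build_column_idx(cat_cols: List[str], cont_cols: List[str]) -> Dict[str, int]:
    """Assign positions to categorical columns first, then continuous ones."""
    column_idx: Dict[str, int] = {}
    column_idx.update({col: i for i, col in enumerate(cat_cols)})
    # continuous columns start where the categorical ones end
    offset = len(column_idx)
    column_idx.update({col: i + offset for i, col in enumerate(cont_cols)})
    return column_idx


column_idx = build_column_idx(["color", "size"], ["age"])
print(column_idx)  # {'color': 0, 'size': 1, 'age': 2}
```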

transform ¶

transform(df)

Returns the processed dataframe as a np.ndarray

Parameters:

df (DataFrame) –

Input pandas dataframe

Returns:

ndarray –

transformed input dataframe

Source code in pytorch_widedeep/preprocessing/tab_preprocessor.py

def transform(self, df: pd.DataFrame) -> np.ndarray:  # noqa: C901
    """Returns the processed `dataframe` as a np.ndarray

    Parameters
    ----------
    df: pd.DataFrame
        Input pandas dataframe

    Returns
    -------
    np.ndarray
        transformed input dataframe
    """
    check_is_fitted(self, condition=self.is_fitted)

    df_adj = self._insert_cls_token(df) if self.with_cls_token else df.copy()

    if self.cat_embed_cols is not None:
        df_cat = df_adj[self.cat_cols]
        df_cat = self.label_encoder.transform(df_cat)
    if self.continuous_cols is not None:
        df_cont = df_adj[self.continuous_cols]
        # Standardization logic
        if self.standardize_cols:
            df_cont[self.standardize_cols] = self.scaler.transform(
                df_cont[self.standardize_cols].values
            )
        # Quantization logic
        if self.cols_and_bins is not None:
            # Adjustment so I don't have to override the method
            # in 'ChunkTabPreprocessor'
            if self.quantizer.is_fitted:
                df_cont = self.quantizer.transform(df_cont)
            else:
                df_cont = self.quantizer.fit_transform(df_cont)
    try:
        df_deep = pd.concat([df_cat, df_cont], axis=1)
    except NameError:
        try:
            df_deep = df_cat.copy()
        except NameError:
            df_deep = df_cont.copy()

    return df_deep.values
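
The standardization step in `transform` delegates to scikit-learn's `StandardScaler`, which subtracts the column mean and divides by the population standard deviation. A dependency-free sketch of the same arithmetic, including the inverse operation, is shown below using the `age` column from the class example. The helper names are illustrative.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def fit_scaler(values: List[float]) -> Tuple[float, float]:
    """Return the mean and population standard deviation of a column."""
    return mean(values), pstdev(values)


def scale(values: List[float], mu: float, sigma: float) -> List[float]:
    return [(v - mu) / sigma for v in values]


def unscale(values: List[float], mu: float, sigma: float) -> List[float]:
    return [v * sigma + mu for v in values]


ages = [25.0, 40.0, 55.0]
mu, sigma = fit_scaler(ages)
scaled = scale(ages, mu, sigma)
restored = unscale(scaled, mu, sigma)
print(scaled)
print(restored)
```

The statistics are computed once on the data passed to `fit` and reused for every later `transform` call, so new data is scaled with the training mean and standard deviation.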

inverse_transform ¶

inverse_transform(encoded)

Takes as input the output from the transform method and it will return the original values.

Parameters:

encoded (ndarray) –

array with the output of the transform method

Returns:

DataFrame –

Pandas dataframe with the original values

Source code in pytorch_widedeep/preprocessing/tab_preprocessor.py

def inverse_transform(self, encoded: np.ndarray) -> pd.DataFrame:  # noqa: C901
    r"""Takes as input the output from the `transform` method and it will
    return the original values.

    Parameters
    ----------
    encoded: np.ndarray
        array with the output of the `transform` method

    Returns
    -------
    pd.DataFrame
        Pandas dataframe with the original values
    """
    decoded = pd.DataFrame(encoded, columns=list(self.column_idx.keys()))
    # embeddings back to original category
    if self.cat_embed_cols is not None:
        decoded = self.label_encoder.inverse_transform(decoded)
    if self.continuous_cols is not None:
        # quantized cols to the mid point
        if self.cols_and_bins is not None:
            if self.verbose:
                print(
                    "Note that quantized cols will be turned into the mid point of "
                    "the corresponding bin"
                )
            for k, v in self.quantizer.inversed_bins.items():
                decoded[k] = decoded[k].map(v)
        # continuous_cols back to non-standarised
        try:
            decoded[self.standardize_cols] = self.scaler.inverse_transform(
                decoded[self.standardize_cols]
            )
        except Exception:  # KeyError:
            pass

    if "cls_token" in decoded.columns:
        decoded.drop("cls_token", axis=1, inplace=True)

    return decoded
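
As the message printed by `inverse_transform` says, a quantized column cannot be recovered exactly: each bin code is mapped back to the mid point of its bin. A minimal sketch of such a mapping follows, assuming bins are numbered from 1 in edge order. The numbering and the helper `bin_midpoints` are assumptions for illustration; the library stores its own mapping in `quantizer.inversed_bins`.

```python
from typing import Dict, List


def bin_midpoints(edges: List[float]) -> Dict[int, float]:
    """Map bin code (assumed to start at 1) to the mid point of that bin."""
    return {
        i + 1: (lo + hi) / 2
        for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:]))
    }


midpoints = bin_midpoints([0.0, 0.4, 1.0])
codes = [1, 2, 2]
decoded_values = [midpoints[c] for c in codes]
print(decoded_values)
```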

fit_transform ¶

fit_transform(df)

Combines fit and transform

Parameters:

df (DataFrame) –

Input pandas dataframe

Returns:

ndarray –

transformed input dataframe

Source code in pytorch_widedeep/preprocessing/tab_preprocessor.py

def fit_transform(self, df: pd.DataFrame) -> np.ndarray:
    """Combines `fit` and `transform`

    Parameters
    ----------
    df: pd.DataFrame
        Input pandas dataframe

    Returns
    -------
    np.ndarray
        transformed input dataframe
    """
    return self.fit(df).transform(df)

Quantizer ¶

Quantizer(quantization_setup, **kwargs)

Helper class to perform the quantization of continuous columns. It is included in these docs for completeness, since, depending on the value of the parameter 'quantization_setup' of the TabPreprocessor class, that class might have an attribute of type Quantizer. However, this class is designed to always run internally within the TabPreprocessor class.

Parameters:

quantization_setup (Dict[str, Union[int, List[float]]]) –

Dictionary where the keys are the names of the columns to quantize and the values are either integers indicating the number of bins or lists of scalars indicating the bin edges.

Source code in pytorch_widedeep/preprocessing/tab_preprocessor.py

def __init__(
    self,
    quantization_setup: Dict[str, Union[int, List[float]]],
    **kwargs,
):
    self.quantization_setup = quantization_setup
    self.quant_args = kwargs

    self.is_fitted = False

TextPreprocessor ¶

TextPreprocessor(text_col, max_vocab=30000, min_freq=5, maxlen=80, pad_first=True, pad_idx=1, already_processed=False, word_vectors_path=None, n_cpus=None, verbose=1)

Bases: BasePreprocessor

Preprocessor to prepare the deeptext input dataset

Parameters:

text_col (str) –

column in the input dataframe containing the texts
max_vocab (int, default: 30000 ) –

Maximum number of tokens in the vocabulary
min_freq (int, default: 5 ) –

Minimum frequency for a token to be part of the vocabulary
maxlen (int, default: 80 ) –

Maximum length of the tokenized sequences
pad_first (bool, default: True ) –

Indicates whether the padding index will be added at the beginning or the end of the sequences
pad_idx (int, default: 1 ) –

padding index. Fastai's Tokenizer leaves 0 for the 'unknown' token.
already_processed (Optional[bool], default: False ) –

Boolean indicating if the sequence of elements is already processed or prepared. If this is the case, this Preprocessor will simply tokenize and pad the sequence.
Param alias: not_text
This parameter is intended for those cases where the input sequences are already fully processed or are not text at all (e.g. IDs).
word_vectors_path (Optional[str], default: None ) –

Path to the pretrained word vectors
n_cpus (Optional[int], default: None ) –

number of CPUs to use during the tokenization process
verbose (int, default: 1 ) –

Enable verbose output.

Attributes:

vocab (Vocab) –

an instance of pytorch_widedeep.utils.fastai_transforms.Vocab
embedding_matrix (ndarray) –

Array with the pretrained embeddings

Examples:

>>> import pandas as pd
>>> from pytorch_widedeep.preprocessing import TextPreprocessor
>>> df_train = pd.DataFrame({'text_column': ["life is like a box of chocolates",
... "You never know what you're gonna get"]})
>>> text_preprocessor = TextPreprocessor(text_col='text_column', max_vocab=25, min_freq=1, maxlen=10)
>>> text_preprocessor.fit_transform(df_train)
The vocabulary contains 24 tokens
array([[ 1,  1,  1,  1, 10, 11, 12, 13, 14, 15],
       [ 5,  9, 16, 17, 18,  9, 19, 20, 21, 22]], dtype=int32)
>>> df_te = pd.DataFrame({'text_column': ['you never know what is in the box']})
>>> text_preprocessor.transform(df_te)
array([[ 1,  1,  9, 16, 17, 18, 11,  0,  0, 13]], dtype=int32)

Source code in pytorch_widedeep/preprocessing/text_preprocessor.py

@alias("already_processed", ["not_text"])
def __init__(
    self,
    text_col: str,
    max_vocab: int = 30000,
    min_freq: int = 5,
    maxlen: int = 80,
    pad_first: bool = True,
    pad_idx: int = 1,
    already_processed: Optional[bool] = False,
    word_vectors_path: Optional[str] = None,
    n_cpus: Optional[int] = None,
    verbose: int = 1,
):
    super(TextPreprocessor, self).__init__()

    self.text_col = text_col
    self.max_vocab = max_vocab
    self.min_freq = min_freq
    self.maxlen = maxlen
    self.pad_first = pad_first
    self.pad_idx = pad_idx
    self.already_processed = already_processed
    self.word_vectors_path = word_vectors_path
    self.verbose = verbose
    self.n_cpus = n_cpus if n_cpus is not None else os.cpu_count()

    self.is_fitted = False

fit ¶

fit(df)

Builds the vocabulary

Parameters:

df (DataFrame) –

Input pandas dataframe

Returns:

TextPreprocessor –

TextPreprocessor fitted object

Source code in pytorch_widedeep/preprocessing/text_preprocessor.py

def fit(self, df: pd.DataFrame) -> BasePreprocessor:
    """Builds the vocabulary

    Parameters
    ----------
    df: pd.DataFrame
        Input pandas dataframe

    Returns
    -------
    TextPreprocessor
        `TextPreprocessor` fitted object
    """
    texts = self._read_texts(df)

    tokens = get_texts(texts, self.already_processed, self.n_cpus)

    self.vocab: TVocab = Vocab(
        max_vocab=self.max_vocab,
        min_freq=self.min_freq,
        pad_idx=self.pad_idx,
    ).fit(
        tokens,
    )

    if self.verbose:
        print("The vocabulary contains {} tokens".format(len(self.vocab.stoi)))
    if self.word_vectors_path is not None:
        self.embedding_matrix = build_embeddings_matrix(
            self.vocab, self.word_vectors_path, self.min_freq
        )

    self.is_fitted = True

    return self
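
The roles of `max_vocab` and `min_freq` in `fit` can be illustrated with a much simplified vocabulary builder. This is not the fastai-style `Vocab` used by the library: the whitespace tokenizer, the two special tokens and the helper name `build_vocab` are assumptions made for the sketch.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def build_vocab(
    texts: Sequence[str],
    max_vocab: int = 30000,
    min_freq: int = 5,
    specials: Sequence[str] = ("xxunk", "xxpad"),
) -> Tuple[List[str], Dict[str, int]]:
    """Keep the most frequent tokens that appear at least min_freq times."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    itos = list(specials)  # index 0: unknown token, index 1: padding token
    for tok, freq in counts.most_common():
        if len(itos) >= max_vocab:
            break
        if freq >= min_freq:
            itos.append(tok)
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi


texts = [
    "life is like a box of chocolates",
    "you never know what you are gonna get",
]
itos, stoi = build_vocab(texts, max_vocab=25, min_freq=2)
print(itos)  # ['xxunk', 'xxpad', 'you']
```

Tokens below `min_freq` are left out of the vocabulary and would be mapped to the unknown index at transform time.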

transform ¶

transform(df)

Returns the padded, 'numericalised' sequences

Parameters:

df (DataFrame) –

Input pandas dataframe

Returns:

ndarray –

Padded, 'numericalised' sequences

Source code in pytorch_widedeep/preprocessing/text_preprocessor.py

def transform(self, df: pd.DataFrame) -> np.ndarray:
    """Returns the padded, _'numericalised'_ sequences

    Parameters
    ----------
    df: pd.DataFrame
        Input pandas dataframe

    Returns
    -------
    np.ndarray
        Padded, _'numericalised'_ sequences
    """
    check_is_fitted(self, attributes=["vocab"])
    texts = self._read_texts(df)
    tokens = get_texts(texts, self.already_processed, self.n_cpus)
    return self._pad_sequences(tokens)
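
The padding step that `transform` applies can be sketched as follows. The left/right padding mirrors the `pad_first` and `pad_idx` parameters described above. Keeping the last `maxlen` tokens of an over-long sequence is an assumption of this sketch, and `pad_sequence` is a hypothetical stand-in for the private `_pad_sequences` method.

```python
from typing import List, Sequence


def pad_sequence(
    seq: Sequence[int], maxlen: int, pad_first: bool = True, pad_idx: int = 1
) -> List[int]:
    """Pad (or truncate) a numericalised sequence to exactly maxlen items."""
    if len(seq) >= maxlen:
        # assumption: over-long sequences keep their last maxlen tokens
        return list(seq[-maxlen:])
    padding = [pad_idx] * (maxlen - len(seq))
    return padding + list(seq) if pad_first else list(seq) + padding


print(pad_sequence([10, 11, 12], 6))                   # [1, 1, 1, 10, 11, 12]
print(pad_sequence([10, 11, 12], 6, pad_first=False))  # [10, 11, 12, 1, 1, 1]
print(pad_sequence(list(range(8)), 6))                 # [2, 3, 4, 5, 6, 7]
```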

transform_sample ¶

transform_sample(text)

Returns the padded, 'numericalised' sequence

Parameters:

text (str) –

text to be tokenized and padded

Returns:

ndarray –

Padded, 'numericalised' sequence

Source code in pytorch_widedeep/preprocessing/text_preprocessor.py

def transform_sample(self, text: str) -> np.ndarray:
    """Returns the padded, _'numericalised'_ sequence

    Parameters
    ----------
    text: str
        text to be tokenized and padded

    Returns
    -------
    np.ndarray
        Padded, _'numericalised'_ sequence
    """
    check_is_fitted(self, attributes=["vocab"])
    tokens = get_texts([text], self.already_processed, self.n_cpus)
    return self._pad_sequences(tokens)[0]

fit_transform ¶

fit_transform(df)

Combines fit and transform

Parameters:

df (DataFrame) –

Input pandas dataframe

Returns:

ndarray –

Padded, 'numericalised' sequences

Source code in pytorch_widedeep/preprocessing/text_preprocessor.py

def fit_transform(self, df: pd.DataFrame) -> np.ndarray:
    """Combines `fit` and `transform`

    Parameters
    ----------
    df: pd.DataFrame
        Input pandas dataframe

    Returns
    -------
    np.ndarray
        Padded, _'numericalised'_ sequences
    """
    return self.fit(df).transform(df)

inverse_transform ¶

inverse_transform(padded_seq)

Returns the original text plus the added 'special' tokens

Parameters:

padded_seq (ndarray) –

array with the output of the transform method

Returns:

DataFrame –

Pandas dataframe with the original text plus the added 'special' tokens

Source code in pytorch_widedeep/preprocessing/text_preprocessor.py

def inverse_transform(self, padded_seq: np.ndarray) -> pd.DataFrame:
    """Returns the original text plus the added 'special' tokens

    Parameters
    ----------
    padded_seq: np.ndarray
        array with the output of the `transform` method

    Returns
    -------
    pd.DataFrame
        Pandas dataframe with the original text plus the added 'special' tokens
    """
    texts = [self.vocab.inverse_transform(num) for num in padded_seq]
    return pd.DataFrame({self.text_col: texts})

ImagePreprocessor ¶

ImagePreprocessor(img_col, img_path, width=224, height=224, verbose=1)

Bases: BasePreprocessor

Preprocessor to prepare the deepimage input dataset.

The preprocessing consists simply of resizing the images according to their aspect ratio.

Parameters:

img_col (str) –

name of the column with the image filenames
img_path (str) –

path to the directory where the images are stored
width (int, default: 224 ) –

width of the resulting processed image.
height (int, default: 224 ) –

height of the resulting processed image.
verbose (int, default: 1 ) –

Enable verbose output.

Attributes:

aap (AspectAwarePreprocessor) –

an instance of pytorch_widedeep.utils.image_utils.AspectAwarePreprocessor
spp (SimplePreprocessor) –

an instance of pytorch_widedeep.utils.image_utils.SimplePreprocessor
normalise_metrics (Dict) –

Dict containing the normalisation metrics of the image dataset, i.e. mean and std for the R, G and B channels

Examples:

>>> import pandas as pd
>>>
>>> from pytorch_widedeep.preprocessing import ImagePreprocessor
>>>
>>> path_to_image1 = 'tests/test_data_utils/images/galaxy1.png'
>>> path_to_image2 = 'tests/test_data_utils/images/galaxy2.png'
>>>
>>> df_train = pd.DataFrame({'images_column': [path_to_image1]})
>>> df_test = pd.DataFrame({'images_column': [path_to_image2]})
>>> img_preprocessor = ImagePreprocessor(img_col='images_column', img_path='.', verbose=0)
>>> resized_images = img_preprocessor.fit_transform(df_train)
>>> new_resized_images = img_preprocessor.transform(df_test)

NOTE: Normalising metrics will only be computed when the fit_transform method is run. Running transform only will not change the computed metrics and running fit only simply instantiates the resizing functions.
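
The source of `transform` shows that square images go through `SimplePreprocessor` while every other image goes through `AspectAwarePreprocessor`. One common way to resize without distorting the aspect ratio is to scale the image until it covers the target size and then center-crop the longer side. The sketch below only computes the sizes involved. It assumes that strategy and does not reproduce the library's classes; `aspect_aware_plan` is a hypothetical helper.

```python
from typing import Dict, Tuple


def aspect_aware_plan(
    width: int, height: int, target_w: int = 224, target_h: int = 224
) -> Dict[str, Tuple[int, int]]:
    """Return the intermediate resize size and the center-crop offsets."""
    # scale so that the resized image fully covers the target rectangle
    scale = max(target_w / width, target_h / height)
    new_w = max(target_w, round(width * scale))
    new_h = max(target_h, round(height * scale))
    # center crop: drop the same amount on both sides of the longer axis
    offset_x = (new_w - target_w) // 2
    offset_y = (new_h - target_h) // 2
    return {"resize": (new_w, new_h), "offset": (offset_x, offset_y)}


print(aspect_aware_plan(640, 480))  # landscape image
print(aspect_aware_plan(480, 640))  # portrait image
print(aspect_aware_plan(500, 500))  # square image: no crop needed
```

For a square input and a square target both offsets are zero, which is consistent with the simpler preprocessor being sufficient in that case.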

Source code in pytorch_widedeep/preprocessing/image_preprocessor.py

def __init__(
    self,
    img_col: str,
    img_path: str,
    width: int = 224,
    height: int = 224,
    verbose: int = 1,
):
    super(ImagePreprocessor, self).__init__()

    self.img_col = img_col
    self.img_path = img_path
    self.width = width
    self.height = height
    self.verbose = verbose

    self.aap = AspectAwarePreprocessor(self.width, self.height)
    self.spp = SimplePreprocessor(self.width, self.height)

    self.compute_normalising_computed = False

transform ¶

transform(df)

Resizes the images to the input height and width.

Parameters:

df (DataFrame) –

Input pandas dataframe with the img_col

Returns:

ndarray –

Resized images to the input height and width

Source code in pytorch_widedeep/preprocessing/image_preprocessor.py

def transform(self, df: pd.DataFrame) -> np.ndarray:
    """Resizes the images to the input height and width.


    Parameters
    ----------
    df: pd.DataFrame
        Input pandas dataframe with the `img_col`

    Returns
    -------
    np.ndarray
        Resized images to the input height and width
    """
    image_list = df[self.img_col].tolist()
    if self.verbose:
        print("Reading Images from {}".format(self.img_path))
    imgs = [cv2.imread("/".join([self.img_path, img])) for img in image_list]

    # finding images with different height and width
    aspect = [(im.shape[0], im.shape[1]) for im in imgs]
    aspect_r = [a[0] / a[1] for a in aspect]
    diff_idx = [i for i, r in enumerate(aspect_r) if r != 1.0]

    if self.verbose:
        print("Resizing")
    resized_imgs = []
    for i, img in tqdm(enumerate(imgs), total=len(imgs), disable=self.verbose != 1):
        if i in diff_idx:
            resized_imgs.append(self.aap.preprocess(img))
        else:
            # if aspect ratio is 1:1, no need for AspectAwarePreprocessor
            resized_imgs.append(self.spp.preprocess(img))

    if not self.compute_normalising_computed:
        if self.verbose:
            print("Computing normalisation metrics")
        # mean and std deviation will only be computed when the fit method
        # is called
        mean_R, mean_G, mean_B = [], [], []
        std_R, std_G, std_B = [], [], []
        for rsz_img in resized_imgs:
            (mean_b, mean_g, mean_r), (std_b, std_g, std_r) = cv2.meanStdDev(
                rsz_img
            )
            mean_R.append(mean_r)
            mean_G.append(mean_g)
            mean_B.append(mean_b)
            std_R.append(std_r)
            std_G.append(std_g)
            std_B.append(std_b)
        self.normalise_metrics = dict(
            mean={
                "R": np.mean(mean_R) / 255.0,
                "G": np.mean(mean_G) / 255.0,
                "B": np.mean(mean_B) / 255.0,
            },
            std={
                "R": np.mean(std_R) / 255.0,
                "G": np.mean(std_G) / 255.0,
                "B": np.mean(std_B) / 255.0,
            },
        )
        self.compute_normalising_computed = True
    return np.asarray(resized_imgs)
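
The normalisation step in transform can be read as: compute a mean and a standard deviation per channel for every resized image, average those per-image values over the dataset, and divide by 255 so the result lies in [0, 1]. The following is a minimal standard-library sketch of that arithmetic. It assumes each image is given as a flat list of (B, G, R) pixel tuples and that the per-image standard deviation is the population standard deviation; the helper name channel_stats and the toy pixel values are hypothetical and not part of the library.

```python
from statistics import mean, pstdev


def channel_stats(images):
    """Average per-image channel means/stds and rescale them to [0, 1].

    `images` is a list of images; each image is a list of (B, G, R) pixels
    with values in 0..255 (OpenCV channel order).
    """
    per_image_mean = {"R": [], "G": [], "B": []}
    per_image_std = {"R": [], "G": [], "B": []}
    for img in images:
        # Transpose the pixel list into three channel sequences.
        b, g, r = zip(*img)
        for name, channel in (("R", r), ("G", g), ("B", b)):
            per_image_mean[name].append(mean(channel))
            # Population standard deviation (assumption of this sketch).
            per_image_std[name].append(pstdev(channel))
    return {
        "mean": {k: mean(v) / 255.0 for k, v in per_image_mean.items()},
        "std": {k: mean(v) / 255.0 for k, v in per_image_std.items()},
    }


toy_images = [
    [(0, 0, 255), (0, 0, 255)],  # a pure red image
    [(255, 0, 0), (255, 0, 0)],  # a pure blue image
]
normalise_metrics = channel_stats(toy_images)
```

Note that the std entries are averages of per-image standard deviations, not the standard deviation over all pixels of the dataset. The toy data makes this visible: each image is constant, so every std entry is 0 even though the two images differ.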

fit_transform ¶

fit_transform(df)

Combines fit and transform

Parameters:

df (DataFrame) –

Input pandas dataframe

Returns:

ndarray –

Resized images to the input height and width

Source code in pytorch_widedeep/preprocessing/image_preprocessor.py

def fit_transform(self, df: pd.DataFrame) -> np.ndarray:
    """Combines `fit` and `transform`

    Parameters
    ----------
    df: pd.DataFrame
        Input pandas dataframe

    Returns
    -------
    np.ndarray
        Resized images to the input height and width
    """
    return self.fit(df).transform(df)

Chunked versions¶

Chunked versions of the preprocessors are also available. These are useful when the data is too big to fit in memory. See also the load_from_folder module in the library and the corresponding section here in the documentation.

Note that there is no ChunkImagePreprocessor. This is because the processing of the images will occur inside the ImageFromFolder class in the load_from_folder module.

ChunkWidePreprocessor ¶

ChunkWidePreprocessor(wide_cols, n_chunks, crossed_cols=None)

Bases: WidePreprocessor

Preprocessor to prepare the wide input dataset

This Preprocessor prepares the data for the wide, linear component. This linear model is implemented via an Embedding layer that is connected to the output neuron. ChunkWidePreprocessor numerically encodes all the unique values of all categorical columns wide_cols + crossed_cols. See the Example below.

Parameters:

wide_cols (List[str]) –

List of strings with the names of the columns that will be label encoded and passed through the wide component
crossed_cols (Optional[List[Tuple[str, str]]], default: None ) –

List of Tuples with the name of the columns that will be 'crossed' and then label encoded. e.g. [('education', 'occupation'), ...]. For binary features, a cross-product transformation is 1 if and only if the constituent features are all 1, and 0 otherwise.

Attributes:

wide_crossed_cols (List) –

List with the names of all columns that will be label encoded
encoding_dict (Dict) –

Dictionary where the keys are the result of pasting colname + '_' + column value and the values are the corresponding mapped integer.
inverse_encoding_dict (Dict) –

the inverse encoding dictionary
wide_dim (int) –

Dimension of the wide model (i.e. dim of the linear layer)

Examples:

>>> import pandas as pd
>>> from pytorch_widedeep.preprocessing import ChunkWidePreprocessor
>>> chunk = pd.DataFrame({'color': ['r', 'b', 'g'], 'size': ['s', 'n', 'l']})
>>> wide_cols = ['color']
>>> crossed_cols = [('color', 'size')]
>>> chunk_wide_preprocessor = ChunkWidePreprocessor(wide_cols=wide_cols, crossed_cols=crossed_cols,
... n_chunks=1)
>>> X_wide = chunk_wide_preprocessor.fit_transform(chunk)
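
To make the encoding concrete, the sketch below rebuilds in plain Python the kind of keys described for encoding_dict (colname + '_' + column value). It assumes that a crossed column is named by joining the column names with '_' and that its values are the constituent values joined with '-'. Both conventions, and the helper build_feature_keys, are assumptions made for illustration rather than the library's code, and the integer codes it produces need not match the library's ordering.

```python
def build_feature_keys(rows, wide_cols, crossed_cols):
    """Return feature keys such as 'color_r' for every row and column."""
    keys = []
    for row in rows:
        for col in wide_cols:
            keys.append(f"{col}_{row[col]}")
        for cols in crossed_cols:
            crossed_name = "_".join(cols)  # assumed naming convention
            crossed_value = "-".join(row[c] for c in cols)  # assumed joining
            keys.append(f"{crossed_name}_{crossed_value}")
    return keys


rows = [
    {"color": "r", "size": "s"},
    {"color": "b", "size": "n"},
    {"color": "g", "size": "l"},
]
keys = build_feature_keys(rows, ["color"], [("color", "size")])

# Integer codes start at 1, leaving 0 free for unseen values.
encoding_dict = {}
for key in keys:
    if key not in encoding_dict:
        encoding_dict[key] = len(encoding_dict) + 1
```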

Source code in pytorch_widedeep/preprocessing/wide_preprocessor.py

def __init__(
    self,
    wide_cols: List[str],
    n_chunks: int,
    crossed_cols: Optional[List[Tuple[str, str]]] = None,
):
    super(ChunkWidePreprocessor, self).__init__(wide_cols, crossed_cols)

    self.n_chunks = n_chunks

    self.chunk_counter = 0

    self.is_fitted = False

partial_fit ¶

partial_fit(chunk)

Fits the Preprocessor and creates required attributes

Parameters:

chunk (DataFrame) –

Input pandas dataframe

Returns:

ChunkWidePreprocessor –

ChunkWidePreprocessor fitted object

Source code in pytorch_widedeep/preprocessing/wide_preprocessor.py

def partial_fit(self, chunk: pd.DataFrame) -> "ChunkWidePreprocessor":
    r"""Fits the Preprocessor and creates required attributes

    Parameters
    ----------
    chunk: pd.DataFrame
        Input pandas dataframe

    Returns
    -------
    ChunkWidePreprocessor
        `ChunkWidePreprocessor` fitted object
    """
    df_wide = self._prepare_wide(chunk)
    self.wide_crossed_cols = df_wide.columns.tolist()

    if self.chunk_counter == 0:
        self.glob_feature_set = set(
            self._make_global_feature_list(df_wide[self.wide_crossed_cols])
        )
    else:
        self.glob_feature_set.update(
            self._make_global_feature_list(df_wide[self.wide_crossed_cols])
        )

    self.chunk_counter += 1

    if self.chunk_counter == self.n_chunks:
        self.encoding_dict = {v: i + 1 for i, v in enumerate(self.glob_feature_set)}
        self.wide_dim = len(self.encoding_dict)
        self.inverse_encoding_dict = {k: v for v, k in self.encoding_dict.items()}
        self.inverse_encoding_dict[0] = "unseen"

        self.is_fitted = True

    return self
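
The chunk-counting logic above can be illustrated without the library. The sketch below uses a hypothetical ToyChunkEncoder that accumulates feature keys across chunks and builds its mapping only once n_chunks chunks have been seen. Unlike the listing above, it sorts the keys before enumerating them so that the integer codes are reproducible across runs; that is a choice of this sketch, not a description of the library.

```python
class ToyChunkEncoder:
    """Toy stand-in for a chunked label encoder (not the library class)."""

    def __init__(self, n_chunks):
        self.n_chunks = n_chunks
        self.chunk_counter = 0
        self.seen = set()
        self.is_fitted = False

    def partial_fit(self, chunk):
        # `chunk` is any iterable of already-built feature keys.
        self.seen.update(chunk)
        self.chunk_counter += 1
        if self.chunk_counter == self.n_chunks:
            # sorted() makes the integer codes deterministic.
            self.encoding_dict = {
                key: i + 1 for i, key in enumerate(sorted(self.seen))
            }
            self.inverse_encoding_dict = {
                i: key for key, i in self.encoding_dict.items()
            }
            self.inverse_encoding_dict[0] = "unseen"
            self.is_fitted = True
        return self


encoder = ToyChunkEncoder(n_chunks=2)
encoder.partial_fit(["color_r", "color_b"])
fitted_after_first = encoder.is_fitted
encoder.partial_fit(["color_b", "color_g"])
```

In this sketch the encoder becomes fitted only on the call that brings the counter up to n_chunks, so n_chunks has to match the number of partial_fit calls actually made.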

fit ¶

fit(df)

Runs partial_fit. This method exists only to override the fit method in the base class; this class is not designed to be used through fit.

Source code in pytorch_widedeep/preprocessing/wide_preprocessor.py

def fit(self, df: pd.DataFrame) -> "ChunkWidePreprocessor":
    """
    Runs `partial_fit`. This is just to override the fit method in the base
    class. This class is not designed or thought to run fit
    """
    return self.partial_fit(df)

ChunkTabPreprocessor ¶

ChunkTabPreprocessor(n_chunks, cat_embed_cols=None, continuous_cols=None, cols_and_bins=None, cols_to_scale=None, default_embed_dim=16, with_attention=False, with_cls_token=False, shared_embed=False, verbose=1, *, scale=False, already_standard=None, **kwargs)

Bases: TabPreprocessor

Preprocessor to prepare the deeptabular component input dataset

Parameters:

n_chunks (int) –

Number of chunks that the tabular dataset is divided into.
cat_embed_cols (Optional[Union[List[str], List[Tuple[str, int]]]], default: None ) –

List containing the name of the categorical columns that will be represented by embeddings (e.g. ['education', 'relationship', ...]) or a Tuple with the name and the embedding dimension (e.g.: [ ('education',32), ('relationship',16), ...])
continuous_cols (Optional[List[str]], default: None ) –

List with the name of the continuous cols
cols_and_bins (Optional[Dict[str, List[float]]], default: None ) –

Continuous columns can be turned into categorical via pd.cut. 'cols_and_bins' is a dictionary where the keys are the column names to quantize and the values are a list of scalars indicating the bin edges.
cols_to_scale (Optional[Union[List[str], str]], default: None ) –

List with the names of the columns that will be standardised via sklearn's StandardScaler
default_embed_dim (int, default: 16 ) –

Dimension for the embeddings if the embed_dim is not provided in the cat_embed_cols parameter and auto_embed_dim is set to False.
with_attention (bool, default: False ) –

Boolean indicating whether the preprocessed data will be passed to an attention-based model (more precisely a model where all embeddings must have the same dimensions). If True, the param cat_embed_cols must be a list containing only the categorical column names: e.g. ['education', 'relationship', ...]. This is because they will all be encoded using embeddings of the same dim, which will be specified later when the model is defined.
Param alias: for_transformer
with_cls_token (bool, default: False ) –

Boolean indicating if a '[CLS]' token will be added to the dataset when using attention-based models. The final hidden state corresponding to this token is used as the aggregated representation for classification and regression tasks. If not, the categorical (and continuous embeddings if present) will be concatenated before being passed to the final MLP (if present).
shared_embed (bool, default: False ) –

Boolean indicating if the embeddings will be "shared" when using attention-based models. The idea behind shared_embed is described in the Appendix A in the TabTransformer paper: 'The goal of having column embedding is to enable the model to distinguish the classes in one column from those in the other columns'. In other words, the idea is to let the model learn which column is embedded at the time. See: pytorch_widedeep.models.transformers._layers.SharedEmbeddings.
verbose (int, default: 1 ) –
scale (bool, default: False ) –

note: this arg will be removed in upcoming releases. Please use cols_to_scale instead.
Bool indicating whether or not to scale/standardise continuous cols. It is important to emphasize that all the DL models for tabular data in the library also include the possibility of normalising the input continuous features via a BatchNorm or a LayerNorm.
Param alias: scale_cont_cols.
already_standard (Optional[List[str]], default: None ) –

note: this arg will be removed in upcoming releases. Please use cols_to_scale instead.
List with the names of the continuous cols that do not need to be scaled/standardised.

Other Parameters:

**kwargs –

pd.cut and StandardScaler related args

Attributes:

embed_dim (Dict) –

Dictionary where keys are the embed cols and values are the embedding dimensions. If with_attention is set to True this attribute is not generated during the fit process
label_encoder (LabelEncoder) –

see pytorch_widedeep.utils.dense_utils.LabelEncoder
cat_embed_input (List) –

List of Tuples with the column name, number of individual values for that column and, if with_attention is set to False, the corresponding embeddings dim, e.g. [('education', 16, 10), ('relationship', 6, 8), ...].
standardize_cols (List) –

List of the columns that will be standardized
scaler (StandardScaler) –

an instance of sklearn.preprocessing.StandardScaler if 'cols_to_scale' is not None or 'scale' is 'True'
column_idx (Dict) –

Dictionary where keys are column names and values are column indexes. This is necessary to slice tensors
quantizer (Quantizer) –

an instance of Quantizer

Examples:

>>> import pandas as pd
>>> import numpy as np
>>> from pytorch_widedeep.preprocessing import ChunkTabPreprocessor
>>> np.random.seed(42)
>>> chunk_df = pd.DataFrame({'cat_col': np.random.choice(['A', 'B', 'C'], size=8),
... 'cont_col': np.random.uniform(1, 100, size=8)})
>>> cat_embed_cols = [('cat_col',4)]
>>> cont_cols = ['cont_col']
>>> tab_preprocessor = ChunkTabPreprocessor(
... n_chunks=1, cat_embed_cols=cat_embed_cols, continuous_cols=cont_cols
... )
>>> X_tab = tab_preprocessor.fit_transform(chunk_df)
>>> tab_preprocessor.cat_embed_cols
[('cat_col', 4)]
>>> tab_preprocessor.column_idx
{'cat_col': 0, 'cont_col': 1}
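
cols_and_bins maps a column name to a list of bin edges that is handed to pd.cut. As a rough illustration of what binning by edges means, the sketch below assigns values to right-closed intervals (a, b], which is pd.cut's default behaviour, using only the standard library. The function assign_bin and the example edges are hypothetical, and the library's Quantizer may post-process the result differently.

```python
from bisect import bisect_left


def assign_bin(value, edges):
    """Return the 0-based index of the interval (edges[i], edges[i + 1]]
    that contains `value`, or None when the value falls outside all bins."""
    idx = bisect_left(edges, value) - 1
    if idx < 0 or idx >= len(edges) - 1:
        return None
    return idx


cols_and_bins = {"cont_col": [0.0, 25.0, 50.0, 100.0]}  # hypothetical edges
values = [10.0, 25.0, 60.0, 0.0, 120.0]
binned = [assign_bin(v, cols_and_bins["cont_col"]) for v in values]
```

With these edges, 25.0 lands in the first bin because intervals are closed on the right, while 0.0 and 120.0 fall outside every bin and map to None in this sketch.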

Source code in pytorch_widedeep/preprocessing/tab_preprocessor.py

@alias("with_attention", ["for_transformer"])
@alias("cat_embed_cols", ["embed_cols"])
@alias("scale", ["scale_cont_cols"])
@alias("cols_and_bins", ["quantization_setup"])
def __init__(
    self,
    n_chunks: int,
    cat_embed_cols: Optional[Union[List[str], List[Tuple[str, int]]]] = None,
    continuous_cols: Optional[List[str]] = None,
    cols_and_bins: Optional[Dict[str, List[float]]] = None,
    cols_to_scale: Optional[Union[List[str], str]] = None,
    default_embed_dim: int = 16,
    with_attention: bool = False,
    with_cls_token: bool = False,
    shared_embed: bool = False,
    verbose: int = 1,
    *,
    scale: bool = False,
    already_standard: Optional[List[str]] = None,
    **kwargs,
):
    super(ChunkTabPreprocessor, self).__init__(
        cat_embed_cols=cat_embed_cols,
        continuous_cols=continuous_cols,
        quantization_setup=None,
        cols_to_scale=cols_to_scale,
        auto_embed_dim=False,
        embedding_rule="google",  # does not matter, irrelevant
        default_embed_dim=default_embed_dim,
        with_attention=with_attention,
        with_cls_token=with_cls_token,
        shared_embed=shared_embed,
        verbose=verbose,
        scale=scale,
        already_standard=already_standard,
        **kwargs,
    )

    self.n_chunks = n_chunks
    self.chunk_counter = 0

    self.cols_and_bins = cols_and_bins  # type: ignore[assignment]
    if self.cols_and_bins is not None:
        self.quantizer = Quantizer(self.cols_and_bins, **self.quant_args)

    self.embed_prepared = False
    self.continuous_prepared = False

ChunkTextPreprocessor ¶

ChunkTextPreprocessor(text_col, n_chunks, root_dir=None, max_vocab=30000, min_freq=5, maxlen=80, pad_first=True, pad_idx=1, already_processed=False, word_vectors_path=None, n_cpus=None, verbose=1)

Bases: TextPreprocessor

Preprocessor to prepare the deeptext input dataset

Parameters:

text_col (str) –

column in the input dataframe containing either the texts or the filenames where the text documents are stored
n_chunks (int) –

Number of chunks that the text dataset is divided into.
root_dir (Optional[str], default: None ) –

If 'text_col' contains the filenames with the text documents, this is the path to the directory where those documents are stored.
max_vocab (int, default: 30000 ) –

Maximum number of tokens in the vocabulary
min_freq (int, default: 5 ) –

Minimum frequency for a token to be part of the vocabulary
maxlen (int, default: 80 ) –

Maximum length of the tokenized sequences
pad_first (bool, default: True ) –

Indicates whether the padding index will be added at the beginning or the end of the sequences
pad_idx (int, default: 1 ) –

padding index. Fastai's Tokenizer leaves 0 for the 'unknown' token.
word_vectors_path (Optional[str], default: None ) –

Path to the pretrained word vectors
n_cpus (Optional[int], default: None ) –

number of CPUs to use during the tokenization process
verbose (int, default: 1 ) –

Enable verbose output.

Attributes:

vocab (Vocab) –

an instance of pytorch_widedeep.utils.fastai_transforms.ChunkVocab
embedding_matrix (ndarray) –

Array with the pretrained embeddings if word_vectors_path is not None

Examples:

>>> import pandas as pd
>>> from pytorch_widedeep.preprocessing import ChunkTextPreprocessor
>>> chunk_df = pd.DataFrame({'text_column': ["life is like a box of chocolates",
... "You never know what you're gonna get"]})
>>> chunk_text_preprocessor = ChunkTextPreprocessor(text_col='text_column', n_chunks=1,
... max_vocab=25, min_freq=1, maxlen=10, verbose=0, n_cpus=1)
>>> processed_chunk = chunk_text_preprocessor.fit_transform(chunk_df)
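
The maxlen, pad_first and pad_idx parameters control how tokenized sequences are brought to a common length. The sketch below shows one way such padding can work. It assumes that sequences longer than maxlen are truncated by keeping their last maxlen tokens, which is an assumption of this sketch and not something stated in this section; pad_sequence is a hypothetical helper, not the library's function.

```python
def pad_sequence(seq, maxlen, pad_first=True, pad_idx=1):
    """Pad or truncate a list of token ids to exactly `maxlen` entries."""
    if len(seq) >= maxlen:
        # Assumed truncation policy: keep the last `maxlen` tokens.
        return list(seq[-maxlen:])
    padding = [pad_idx] * (maxlen - len(seq))
    # pad_first=True places the padding before the tokens.
    return padding + list(seq) if pad_first else list(seq) + padding


tokens = [12, 7, 9]
front_padded = pad_sequence(tokens, maxlen=5)
back_padded = pad_sequence(tokens, maxlen=5, pad_first=False)
truncated = pad_sequence([1, 2, 3, 4, 5, 6], maxlen=4)
```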

Source code in pytorch_widedeep/preprocessing/text_preprocessor.py

def __init__(
    self,
    text_col: str,
    n_chunks: int,
    root_dir: Optional[str] = None,
    max_vocab: int = 30000,
    min_freq: int = 5,
    maxlen: int = 80,
    pad_first: bool = True,
    pad_idx: int = 1,
    already_processed: Optional[bool] = False,
    word_vectors_path: Optional[str] = None,
    n_cpus: Optional[int] = None,
    verbose: int = 1,
):
    super(ChunkTextPreprocessor, self).__init__(
        text_col=text_col,
        max_vocab=max_vocab,
        min_freq=min_freq,
        maxlen=maxlen,
        pad_first=pad_first,
        pad_idx=pad_idx,
        already_processed=already_processed,
        word_vectors_path=word_vectors_path,
        n_cpus=n_cpus,
        verbose=verbose,
    )

    self.n_chunks = n_chunks
    self.root_dir = root_dir

    self.chunk_counter = 0

    self.is_fitted = False
