Processors and Utils¶
Description of the main tools and utilities needed to prepare the data for the WideDeep
model constructor.
The preprocessing
module¶
There are 4 preprocessors, corresponding to 4 main components of the WideDeep
model. These are
WidePreprocessor
TabPreprocessor
TextPreprocessor
ImagePreprocessor
Behind the scenes, these preprocessors use a series of helper functions and classes that live in the utils
module. If you are interested, please go and have a look at the documentation
1. WidePreprocessor¶
The wide
component of the model is a linear model that, in principle, could be implemented as a linear layer receiving the result of one-hot encoding the categorical columns. However, this is not memory efficient. Therefore, we implement a linear layer as an Embedding layer plus a bias, as explained in a bit more detail later.
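To see why the two are equivalent, here is a small numpy sketch (not the library's actual implementation) showing that a one-hot encoding followed by a linear layer produces the same output as a direct row lookup into the weight table plus a bias:

```python
import numpy as np

rng = np.random.default_rng(0)
n_categories = 5

# weight table: one row per category (the wide component outputs a single value)
weights = rng.normal(size=(n_categories, 1))
bias = 0.5

# three observations, each a category index
idx = np.array([0, 3, 1])

# one-hot encoding followed by a linear layer ...
one_hot = np.eye(n_categories)[idx]
linear_out = one_hot @ weights + bias

# ... is the same as an "Embedding" lookup of the rows, plus the bias
lookup_out = weights[idx] + bias

assert np.allclose(linear_out, lookup_out)
```

The lookup avoids materialising the (potentially huge and mostly zero) one-hot matrix, which is the memory saving mentioned above.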
With that in mind, WidePreprocessor
simply encodes the categories numerically so that they are the indexes of the lookup table that is an Embedding layer.
For example
import numpy as np
import pandas as pd
import pytorch_widedeep as wd
from pytorch_widedeep.datasets import load_adult
from pytorch_widedeep.preprocessing import WidePreprocessor
df = load_adult(as_frame=True)
df.head()
age | workclass | fnlwgt | education | educational-num | marital-status | occupation | relationship | race | gender | capital-gain | capital-loss | hours-per-week | native-country | income | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 25 | Private | 226802 | 11th | 7 | Never-married | Machine-op-inspct | Own-child | Black | Male | 0 | 0 | 40 | United-States | <=50K |
1 | 38 | Private | 89814 | HS-grad | 9 | Married-civ-spouse | Farming-fishing | Husband | White | Male | 0 | 0 | 50 | United-States | <=50K |
2 | 28 | Local-gov | 336951 | Assoc-acdm | 12 | Married-civ-spouse | Protective-serv | Husband | White | Male | 0 | 0 | 40 | United-States | >50K |
3 | 44 | Private | 160323 | Some-college | 10 | Married-civ-spouse | Machine-op-inspct | Husband | Black | Male | 7688 | 0 | 40 | United-States | >50K |
4 | 18 | ? | 103497 | Some-college | 10 | Never-married | ? | Own-child | White | Female | 0 | 0 | 30 | United-States | <=50K |
wide_cols = [
"education",
"relationship",
"workclass",
"occupation",
"native-country",
"gender",
]
crossed_cols = [("education", "occupation"), ("native-country", "occupation")]
wide_preprocessor = WidePreprocessor(wide_cols=wide_cols, crossed_cols=crossed_cols)
X_wide = wide_preprocessor.fit_transform(df)
# From here on, any new observation can be prepared by simply running `.transform`
# new_X_wide = wide_preprocessor.transform(new_df)
X_wide
array([[ 1, 17, 23, ..., 89, 91, 316], [ 2, 18, 23, ..., 89, 92, 317], [ 3, 18, 24, ..., 89, 93, 318], ..., [ 2, 20, 23, ..., 90, 103, 323], [ 2, 17, 23, ..., 89, 103, 323], [ 2, 21, 29, ..., 90, 115, 324]])
Note that the label encoding starts from 1
. This is because it is convenient to leave 0
for padding, i.e. unknown categories. Let's take for example the first entry
X_wide[0]
array([ 1, 17, 23, 32, 47, 89, 91, 316])
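The encoding convention (seen categories map to 1..n at fit time, anything unseen maps to 0 at transform time) can be sketched with a hypothetical, minimal version of the logic, not the library's actual code:

```python
# categories seen during fit are mapped to 1..n, in order of first appearance
def fit_encoding(values):
    return {v: i for i, v in enumerate(dict.fromkeys(values), start=1)}

# anything not seen during fit falls back to the reserved index 0
def transform(values, encoding):
    return [encoding.get(v, 0) for v in values]

encoding = fit_encoding(["11th", "HS-grad", "Assoc-acdm"])
transform(["HS-grad", "Doctorate"], encoding)  # → [2, 0]
```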
wide_preprocessor.inverse_transform(X_wide[:1])
education | relationship | workclass | occupation | native-country | gender | education_occupation | native-country_occupation | |
---|---|---|---|---|---|---|---|---|
0 | 11th | Own-child | Private | Machine-op-inspct | United-States | Male | 11th-Machine-op-inspct | United-States-Machine-op-inspct |
As we can see, wide_preprocessor
numerically encodes the wide_cols
and the crossed_cols
, which can be recovered using the method inverse_transform
.
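As the `education_occupation` column in the output above suggests, a crossed column is simply the hyphen-joined combination of the two parent columns, encoded like any other categorical column. A hypothetical pandas sketch of that construction (not the library's actual code):

```python
import pandas as pd

df_ = pd.DataFrame(
    {
        "education": ["11th", "HS-grad"],
        "occupation": ["Machine-op-inspct", "Farming-fishing"],
    }
)

# build the crossed column by concatenating the two values with a hyphen
df_["education_occupation"] = (
    df_["education"].astype(str) + "-" + df_["occupation"].astype(str)
)
# → ["11th-Machine-op-inspct", "HS-grad-Farming-fishing"]
```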
2. TabPreprocessor¶
The TabPreprocessor
has a lot of different functionalities. Let's explore some of them in detail. In its most basic use, the TabPreprocessor
simply label-encodes the categorical columns and normalises the numerical ones (unless otherwise specified).
from pytorch_widedeep.preprocessing import TabPreprocessor
# cat_embed_cols = [(column_name, embed_dim), ...]
cat_embed_cols = [
("education", 10),
("relationship", 8),
("workclass", 10),
("occupation", 10),
("native-country", 10),
]
continuous_cols = ["age", "hours-per-week"]
tab_preprocessor = TabPreprocessor(
cat_embed_cols=cat_embed_cols,
continuous_cols=continuous_cols,
cols_to_scale=["age"], # or scale=True or cols_to_scale=continuous_cols
)
X_tab = tab_preprocessor.fit_transform(df)
# From here on, any new observation can be prepared by simply running `.transform`
# new_X_tab = tab_preprocessor.transform(new_df)
X_tab
array([[ 1.00000000e+00, 1.00000000e+00, 1.00000000e+00, ..., 1.00000000e+00, -9.95128932e-01, 4.00000000e+01], [ 2.00000000e+00, 2.00000000e+00, 1.00000000e+00, ..., 1.00000000e+00, -4.69415091e-02, 5.00000000e+01], [ 3.00000000e+00, 2.00000000e+00, 2.00000000e+00, ..., 1.00000000e+00, -7.76316450e-01, 4.00000000e+01], ..., [ 2.00000000e+00, 4.00000000e+00, 1.00000000e+00, ..., 1.00000000e+00, 1.41180837e+00, 4.00000000e+01], [ 2.00000000e+00, 1.00000000e+00, 1.00000000e+00, ..., 1.00000000e+00, -1.21394141e+00, 2.00000000e+01], [ 2.00000000e+00, 5.00000000e+00, 7.00000000e+00, ..., 1.00000000e+00, 9.74183408e-01, 4.00000000e+01]])
Note that the label encoding starts from 1
. This is because it is convenient to leave 0
for padding, i.e. unknown categories. Let's take for example the first entry
X_tab[0]
array([ 1. , 1. , 1. , 1. , 1. , -0.99512893, 40. ])
tab_preprocessor.inverse_transform(X_tab[:1])
education | relationship | workclass | occupation | native-country | age | hours-per-week | |
---|---|---|---|---|---|---|---|
0 | 11th | Own-child | Private | Machine-op-inspct | United-States | 25.0 | 40.0 |
The TabPreprocessor
will have a series of useful attributes that can later be used when instantiating the different tabular models, such as, for example, the column indexes (used internally in the models to slice the tensors) or the categorical embeddings set-up
tab_preprocessor.column_idx
{'education': 0, 'relationship': 1, 'workclass': 2, 'occupation': 3, 'native-country': 4, 'age': 5, 'hours-per-week': 6}
# column name, num unique, embedding dim
tab_preprocessor.cat_embed_input
[('education', 16, 10), ('relationship', 6, 8), ('workclass', 9, 10), ('occupation', 15, 10), ('native-country', 42, 10)]
As mentioned, there is more one can do, such as, for example, quantizing (or bucketizing) the continuous cols. For this we can use the quantization_setup
param. This parameter accepts a number of different inputs and uses pd.cut
under the hood to quantize the continuous cols. For more info, please read the docs. Let's use it here to quantize "age" and "hours-per-week" in 4 and 5 "buckets" respectively
quantization_setup = {
"age": 4,
"hours-per-week": 5,
} # you can also pass a list of floats with the boundaries if you wanted
quant_tab_preprocessor = TabPreprocessor(
cat_embed_cols=cat_embed_cols,
continuous_cols=continuous_cols,
quantization_setup=quantization_setup,
)
qX_tab = quant_tab_preprocessor.fit_transform(df)
UserWarning: Continuous columns will not be normalised
qX_tab
array([[1, 1, 1, ..., 1, 1, 2], [2, 2, 1, ..., 1, 2, 3], [3, 2, 2, ..., 1, 1, 2], ..., [2, 4, 1, ..., 1, 3, 2], [2, 1, 1, ..., 1, 1, 1], [2, 5, 7, ..., 1, 2, 2]])
Note that the continuous columns that have been bucketised into quantiles are treated as any other categorical column
quant_tab_preprocessor.cat_embed_input
[('education', 16, 10), ('relationship', 6, 8), ('workclass', 9, 10), ('occupation', 15, 10), ('native-country', 42, 10), ('age', 4, 4), ('hours-per-week', 5, 4)]
Where the column 'age' now has 4 categories, which will be encoded using embeddings of 4 dims. Note that, as with any other categorical column, the category "counter" starts at 1. This is because all incoming values that are lower/higher than the lowest/highest value in the train (or already seen) dataset will be encoded as 0.
np.unique(qX_tab[:, quant_tab_preprocessor.column_idx["age"]])
array([1, 2, 3, 4])
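To illustrate what happens under the hood, here is a simplified pandas sketch (an assumption about the mechanics, not the library's exact code): `pd.cut` with an integer number of bins splits the observed range into equal-width intervals, and shifting the resulting codes by 1 keeps 0 free for out-of-range values:

```python
import pandas as pd

ages = pd.Series([17, 25, 44, 60, 90])

# equal-width bins over the observed range (pd.cut slightly extends the
# lower edge so the minimum value is included)
binned = pd.cut(ages, bins=4)

# category codes are 0-based; add 1 so that 0 stays reserved for
# out-of-range values, mirroring the preprocessor's convention
codes = binned.cat.codes + 1  # → [1, 1, 2, 3, 4]
```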
Finally, if we now wanted to inverse_transform
the transformed array back into the original dataframe, we could still do it, but the continuous, bucketised columns will be transformed back to the middle of their quantile/bucket range
df_decoded = quant_tab_preprocessor.inverse_transform(qX_tab)
Note that quantized cols will be turned into the mid point of the corresponding bin
df.head(2)
age | workclass | fnlwgt | education | educational-num | marital-status | occupation | relationship | race | gender | capital-gain | capital-loss | hours-per-week | native-country | income | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 25 | Private | 226802 | 11th | 7 | Never-married | Machine-op-inspct | Own-child | Black | Male | 0 | 0 | 40 | United-States | <=50K |
1 | 38 | Private | 89814 | HS-grad | 9 | Married-civ-spouse | Farming-fishing | Husband | White | Male | 0 | 0 | 50 | United-States | <=50K |
df_decoded.head(2)
education | relationship | workclass | occupation | native-country | age | hours-per-week | |
---|---|---|---|---|---|---|---|
0 | 11th | Own-child | Private | Machine-op-inspct | United-States | 26.0885 | 30.4 |
1 | HS-grad | Husband | Private | Farming-fishing | United-States | 44.3750 | 50.0 |
There is one final comment to make regarding the inverse_transform
functionality. As mentioned before, the encoding 0
is reserved for values that fall outside the range covered by the data used to run the fit
method. For example
df.age.min(), df.age.max()
(17, 90)
All future age values outside that range will be encoded as 0 and decoded as NaN
tmp_df = df.head(1).copy()
tmp_df.loc[:, "age"] = 5
tmp_df
age | workclass | fnlwgt | education | educational-num | marital-status | occupation | relationship | race | gender | capital-gain | capital-loss | hours-per-week | native-country | income | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 5 | Private | 226802 | 11th | 7 | Never-married | Machine-op-inspct | Own-child | Black | Male | 0 | 0 | 40 | United-States | <=50K |
# quant_tab_preprocessor has already been fitted with data whose age range is between 17 and 90
tmp_qX_tab = quant_tab_preprocessor.transform(tmp_df)
tmp_qX_tab
array([[1, 1, 1, 1, 1, 0, 2]])
quant_tab_preprocessor.inverse_transform(tmp_qX_tab)
Note that quantized cols will be turned into the mid point of the corresponding bin
education | relationship | workclass | occupation | native-country | age | hours-per-week | |
---|---|---|---|---|---|---|---|
0 | 11th | Own-child | Private | Machine-op-inspct | United-States | NaN | 30.4 |
3. TextPreprocessor¶
This preprocessor returns the tokenised, padded sequences that will be directly fed to the stack of LSTMs.
from pytorch_widedeep.preprocessing import TextPreprocessor
# The airbnb dataset, which you could get from here:
# http://insideairbnb.com/get-the-data.html, is too big to be included in
# our datasets module (when including images). Therefore, go there,
# download it, and use the download_images.py script to get the images
# and the airbnb_data_processing.py to process the data. We'll find
# better datasets in the future ;). Note that here we are only using a
# small sample to illustrate the use, so PLEASE ignore the results, just
# focus on usage
df = pd.read_csv("../tmp_data/airbnb/airbnb_sample.csv")
texts = df.description.tolist()
texts[:2]
["My bright double bedroom with a large window has a relaxed feeling! It comfortably fits one or two and is centrally located just two blocks from Finsbury Park. Enjoy great restaurants in the area and easy access to easy transport tubes, trains and buses. Babies and children of all ages are welcome. Hello Everyone, I'm offering my lovely double bedroom in Finsbury Park area (zone 2) for let in a shared apartment. You will share the apartment with me and it is fully furnished with a self catering kitchen. Two people can easily sleep well as the room has a queen size bed. I also have a travel cot for a baby for guest with small children. I will require a deposit up front as a security gesture on both our parts and will be given back to you when you return the keys. I trust anyone who will be responding to this add would treat my home with care and respect . Best Wishes Alina Guest will have access to the self catering kitchen and bathroom. There is the flat is equipped wifi internet,", "Lots of windows and light. St Luke's Gardens are at the end of the block, and the river not too far the other way. Ten minutes walk if you go slowly. Buses to everywhere round the corner and shops, restaurants, pubs, the cinema and Waitrose . Bright Chelsea Apartment This is a bright one bedroom ground floor apartment in an interesting listed building. There is one double bedroom and a living room/kitchen The apartment has a full bathroom and the kitchen is fully equipped. Two wardrobes are available exclusively for guests and bedside tables and two long drawers. This sunny convenient compact flat is just around the corner from the Waitrose supermarket and all sorts of shops, cinemas, restaurants and pubs. This is a lovely part of London. There is a fun farmers market in the King's Road at the weekend. Buses to everywhere are just round the corner, and two underground stations are within ten minutes walk. There is a very nice pub round by St. Luke's gardens, 4 mins slow walk, the "]
text_preprocessor = TextPreprocessor(text_col="description")
X_text = text_preprocessor.fit_transform(df)
# From here on, any new observation can be prepared by simply running `.transform`
# new_X_text = text_preprocessor.transform(new_df)
The vocabulary contains 2192 tokens
print(X_text[0])
[ 29 48 37 367 818 17 910 17 177 15 122 349 53 879 1174 126 393 40 911 0 23 228 71 819 9 53 55 1380 225 11 18 308 18 1564 10 755 0 942 239 53 55 0 11 36 1013 277 1974 70 62 15 1475 9 943 5 251 5 0 5 0 5 177 53 37 75 11 10 294 726 32 9 42 5 25 12 10 22 12 136 100 145]
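The tokenise-then-pad step can be sketched with a hypothetical, much-simplified version of the logic (the real TextPreprocessor also handles vocabulary size limits, minimum frequencies and padding details that this sketch ignores): a vocabulary is built from the training texts, index 0 is reserved for padding/unknown tokens, and each text becomes a fixed-length sequence of indices:

```python
# build a vocabulary from the training texts; 0 is reserved for
# padding and unknown tokens, so indexing starts at 1
def fit_vocab(texts):
    tokens = dict.fromkeys(t for text in texts for t in text.lower().split())
    return {tok: i for i, tok in enumerate(tokens, start=1)}

# map tokens to indices (unknown -> 0), truncate, then pad to maxlen
def pad_sequence(text, vocab, maxlen):
    seq = [vocab.get(t, 0) for t in text.lower().split()][:maxlen]
    return seq + [0] * (maxlen - len(seq))

vocab = fit_vocab(["bright double bedroom", "bright one bedroom flat"])
pad_sequence("bright sunny bedroom", vocab, maxlen=5)  # → [1, 0, 3, 0, 0]
```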
4. ImagePreprocessor¶
ImagePreprocessor
simply resizes the images while preserving their aspect ratio.
from pytorch_widedeep.preprocessing import ImagePreprocessor
image_preprocessor = wd.preprocessing.ImagePreprocessor(
img_col="id", img_path="../tmp_data/airbnb/property_picture/"
)
X_images = image_preprocessor.fit_transform(df)
# From here on, any new observation can be prepared by simply running `.transform`
# new_X_images = image_preprocessor.transform(new_df)
Reading Images from ../tmp_data/airbnb/property_picture/ Resizing
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 1001/1001 [00:01<00:00, 667.89it/s]
Computing normalisation metrics
X_images[0].shape
(224, 224, 3)
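The dimension arithmetic behind an aspect-ratio-aware resize can be sketched as follows (a hypothetical illustration of the idea, not the library's actual implementation): scale the image so its shorter side reaches the target size, after which a crop or pad would produce the final square shape such as the (224, 224, 3) above:

```python
# scale so the shorter side matches the target; the aspect ratio is preserved
def resize_dims(width, height, target=224):
    scale = target / min(width, height)
    return round(width * scale), round(height * scale)

resize_dims(640, 480)  # → (299, 224)
resize_dims(480, 640)  # → (224, 299)
```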