Processors and Utils¶
Description of the main tools and utilities needed to prepare the data for the WideDeep
model constructor.
The preprocessing
module¶
There are 4 preprocessors, corresponding to 4 main components of the WideDeep
model. These are
WidePreprocessor
TabPreprocessor
TextPreprocessor
ImagePreprocessor
Behind the scenes, these preprocessors use a series of helper functions and classes that live in the utils
module. If you are interested, please go and have a look at the documentation
1. WidePreprocessor¶
The wide
component of the model is a linear model that, in principle, could be implemented as a linear layer receiving the result of one-hot encoding the categorical columns. However, this is not memory efficient. Therefore, we implement a linear layer as an Embedding layer plus a bias, as explained in a bit more detail later.
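To see why the two are equivalent, here is a small numpy sketch (not the library's actual implementation) showing that a one-hot encoding followed by a linear layer produces the same output as a direct row lookup into the weight table plus a bias:

```python
import numpy as np

rng = np.random.default_rng(0)
n_categories = 5

# weight table: one row per category (the wide component outputs a single value)
weights = rng.normal(size=(n_categories, 1))
bias = 0.5

# three observations, each a category index
idx = np.array([0, 3, 1])

# one-hot encoding followed by a linear layer ...
one_hot = np.eye(n_categories)[idx]
linear_out = one_hot @ weights + bias

# ... is the same as an "Embedding" lookup of the rows, plus the bias
lookup_out = weights[idx] + bias

assert np.allclose(linear_out, lookup_out)
```

The lookup avoids materialising the (potentially huge and mostly zero) one-hot matrix, which is the memory saving mentioned above.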
With that in mind, WidePreprocessor
simply encodes the categories numerically so that they are the indexes of the lookup table that is an Embedding layer.
For example
import numpy as np
import pandas as pd
import pytorch_widedeep as wd
from pytorch_widedeep.datasets import load_adult
from pytorch_widedeep.preprocessing import WidePreprocessor
df = load_adult(as_frame=True)
df.head()
age | workclass | fnlwgt | education | educational-num | marital-status | occupation | relationship | race | gender | capital-gain | capital-loss | hours-per-week | native-country | income | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 25 | Private | 226802 | 11th | 7 | Never-married | Machine-op-inspct | Own-child | Black | Male | 0 | 0 | 40 | United-States | <=50K |
1 | 38 | Private | 89814 | HS-grad | 9 | Married-civ-spouse | Farming-fishing | Husband | White | Male | 0 | 0 | 50 | United-States | <=50K |
2 | 28 | Local-gov | 336951 | Assoc-acdm | 12 | Married-civ-spouse | Protective-serv | Husband | White | Male | 0 | 0 | 40 | United-States | >50K |
3 | 44 | Private | 160323 | Some-college | 10 | Married-civ-spouse | Machine-op-inspct | Husband | Black | Male | 7688 | 0 | 40 | United-States | >50K |
4 | 18 | ? | 103497 | Some-college | 10 | Never-married | ? | Own-child | White | Female | 0 | 0 | 30 | United-States | <=50K |
wide_cols = [
"education",
"relationship",
"workclass",
"occupation",
"native-country",
"gender",
]
crossed_cols = [("education", "occupation"), ("native-country", "occupation")]
wide_preprocessor = WidePreprocessor(wide_cols=wide_cols, crossed_cols=crossed_cols)
X_wide = wide_preprocessor.fit_transform(df)
# From here on, any new observation can be prepared by simply running `.transform`
# new_X_wide = wide_preprocessor.transform(new_df)
X_wide
array([[ 1, 17, 23, ..., 89, 91, 316], [ 2, 18, 23, ..., 89, 92, 317], [ 3, 18, 24, ..., 89, 93, 318], ..., [ 2, 20, 23, ..., 90, 103, 323], [ 2, 17, 23, ..., 89, 103, 323], [ 2, 21, 29, ..., 90, 115, 324]])
Note that the label encoding starts from 1
. This is because it is convenient to leave 0
for padding, i.e. unknown categories. Let's take for example the first entry
X_wide[0]
array([ 1, 17, 23, 32, 47, 89, 91, 316])
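The encoding convention (seen categories map to 1..n at fit time, anything unseen maps to 0 at transform time) can be sketched with a hypothetical, minimal version of the logic, not the library's actual code:

```python
# categories seen during fit are mapped to 1..n, in order of first appearance
def fit_encoding(values):
    return {v: i for i, v in enumerate(dict.fromkeys(values), start=1)}

# anything not seen during fit falls back to the reserved index 0
def transform(values, encoding):
    return [encoding.get(v, 0) for v in values]

encoding = fit_encoding(["11th", "HS-grad", "Assoc-acdm"])
transform(["HS-grad", "Doctorate"], encoding)  # → [2, 0]
```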
wide_preprocessor.inverse_transform(X_wide[:1])
education | relationship | workclass | occupation | native-country | gender | education_occupation | native-country_occupation | |
---|---|---|---|---|---|---|---|---|
0 | 11th | Own-child | Private | Machine-op-inspct | United-States | Male | 11th-Machine-op-inspct | United-States-Machine-op-inspct |
As we can see, wide_preprocessor
numerically encodes the wide_cols
and the crossed_cols
, which can be recovered using the method inverse_transform
.
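As the `education_occupation` column in the output above suggests, a crossed column is simply the hyphen-joined combination of the two parent columns, encoded like any other categorical column. A hypothetical pandas sketch of that construction (not the library's actual code):

```python
import pandas as pd

df_ = pd.DataFrame(
    {
        "education": ["11th", "HS-grad"],
        "occupation": ["Machine-op-inspct", "Farming-fishing"],
    }
)

# build the crossed column by concatenating the two values with a hyphen
df_["education_occupation"] = (
    df_["education"].astype(str) + "-" + df_["occupation"].astype(str)
)
# → ["11th-Machine-op-inspct", "HS-grad-Farming-fishing"]
```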
2. TabPreprocessor¶
The TabPreprocessor
has a lot of different functionalities. Let's explore some of them in detail. In its most basic use, the TabPreprocessor
simply label-encodes the categorical columns and normalises the numerical ones (unless otherwise specified).
from pytorch_widedeep.preprocessing import TabPreprocessor
# cat_embed_cols = [(column_name, embed_dim), ...]
cat_embed_cols = [
("education", 10),
("relationship", 8),
("workclass", 10),
("occupation", 10),
("native-country", 10),
]
continuous_cols = ["age", "hours-per-week"]
tab_preprocessor = TabPreprocessor(
cat_embed_cols=cat_embed_cols,
continuous_cols=continuous_cols,
cols_to_scale=["age"], # or scale=True or cols_to_scale=continuous_cols
)
X_tab = tab_preprocessor.fit_transform(df)
# From here on, any new observation can be prepared by simply running `.transform`
# new_X_tab = tab_preprocessor.transform(new_df)
X_tab
array([[ 1.00000000e+00, 1.00000000e+00, 1.00000000e+00, ..., 1.00000000e+00, -9.95128932e-01, 4.00000000e+01], [ 2.00000000e+00, 2.00000000e+00, 1.00000000e+00, ..., 1.00000000e+00, -4.69415091e-02, 5.00000000e+01], [ 3.00000000e+00, 2.00000000e+00, 2.00000000e+00, ..., 1.00000000e+00, -7.76316450e-01, 4.00000000e+01], ..., [ 2.00000000e+00, 4.00000000e+00, 1.00000000e+00, ..., 1.00000000e+00, 1.41180837e+00, 4.00000000e+01], [ 2.00000000e+00, 1.00000000e+00, 1.00000000e+00, ..., 1.00000000e+00, -1.21394141e+00, 2.00000000e+01], [ 2.00000000e+00, 5.00000000e+00, 7.00000000e+00, ..., 1.00000000e+00, 9.74183408e-01, 4.00000000e+01]])
Note that the label encoding starts from 1
. This is because it is convenient to leave 0
for padding, i.e. unknown categories. Let's take for example the first entry
X_tab[0]
array([ 1. , 1. , 1. , 1. , 1. , -0.99512893, 40. ])
tab_preprocessor.inverse_transform(X_tab[:1])
education | relationship | workclass | occupation | native-country | age | hours-per-week | |
---|---|---|---|---|---|---|---|
0 | 11th | Own-child | Private | Machine-op-inspct | United-States | 25.0 | 40.0 |
The TabPreprocessor
will have a series of useful attributes that can later be used when instantiating the different tabular models, such as, for example, the column indexes (used internally in the models to slice the tensors) or the categorical embeddings set-up
tab_preprocessor.column_idx
{'education': 0, 'relationship': 1, 'workclass': 2, 'occupation': 3, 'native-country': 4, 'age': 5, 'hours-per-week': 6}
# column name, num unique, embedding dim
tab_preprocessor.cat_embed_input
[('education', 16, 10), ('relationship', 6, 8), ('workclass', 9, 10), ('occupation', 15, 10), ('native-country', 42, 10)]
As mentioned, there is more one can do, such as, for example, quantizing (or bucketizing) the continuous cols. For this we can use the quantization_setup
param. This parameter accepts a number of different inputs and uses pd.cut
under the hood to quantize the continuous cols. For more info, please read the docs. Let's use it here to quantize "age" and "hours-per-week" in 4 and 5 "buckets" respectively
quantization_setup = {
"age": 4,
"hours-per-week": 5,
} # you can also pass a list of floats with the boundaries if you wanted
quant_tab_preprocessor = TabPreprocessor(
cat_embed_cols=cat_embed_cols,
continuous_cols=continuous_cols,
quantization_setup=quantization_setup,
)
qX_tab = quant_tab_preprocessor.fit_transform(df)
UserWarning: Continuous columns will not be normalised
qX_tab
array([[1, 1, 1, ..., 1, 1, 2], [2, 2, 1, ..., 1, 2, 3], [3, 2, 2, ..., 1, 1, 2], ..., [2, 4, 1, ..., 1, 3, 2], [2, 1, 1, ..., 1, 1, 1], [2, 5, 7, ..., 1, 2, 2]])
Note that the continuous columns that have been bucketised into quantiles are treated as any other categorical column
quant_tab_preprocessor.cat_embed_input
[('education', 16, 10), ('relationship', 6, 8), ('workclass', 9, 10), ('occupation', 15, 10), ('native-country', 42, 10), ('age', 4, 4), ('hours-per-week', 5, 4)]
Where the column 'age' now has 4 categories, which will be encoded using embeddings of 4 dims. Note that, as with any other categorical column, the category "counter" starts at 1. This is because all incoming values that are lower/higher than the lowest/highest value in the train (or already seen) dataset will be encoded as 0.
np.unique(qX_tab[:, quant_tab_preprocessor.column_idx["age"]])
array([1, 2, 3, 4])
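To illustrate what happens under the hood, here is a simplified pandas sketch (an assumption about the mechanics, not the library's exact code): `pd.cut` with an integer number of bins splits the observed range into equal-width intervals, and shifting the resulting codes by 1 keeps 0 free for out-of-range values:

```python
import pandas as pd

ages = pd.Series([17, 25, 44, 60, 90])

# equal-width bins over the observed range (pd.cut slightly extends the
# lower edge so the minimum value is included)
binned = pd.cut(ages, bins=4)

# category codes are 0-based; add 1 so that 0 stays reserved for
# out-of-range values, mirroring the preprocessor's convention
codes = binned.cat.codes + 1  # → [1, 1, 2, 3, 4]
```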
Finally, if we now wanted to inverse_transform
the transformed array back into the original dataframe, we could still do it, but the continuous, bucketised columns will be transformed back to the middle of their quantile/bucket range
df_decoded = quant_tab_preprocessor.inverse_transform(qX_tab)
Note that quantized cols will be turned into the mid point of the corresponding bin
df.head(2)
age | workclass | fnlwgt | education | educational-num | marital-status | occupation | relationship | race | gender | capital-gain | capital-loss | hours-per-week | native-country | income | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 25 | Private | 226802 | 11th | 7 | Never-married | Machine-op-inspct | Own-child | Black | Male | 0 | 0 | 40 | United-States | <=50K |
1 | 38 | Private | 89814 | HS-grad | 9 | Married-civ-spouse | Farming-fishing | Husband | White | Male | 0 | 0 | 50 | United-States | <=50K |
df_decoded.head(2)
education | relationship | workclass | occupation | native-country | age | hours-per-week | |
---|---|---|---|---|---|---|---|
0 | 11th | Own-child | Private | Machine-op-inspct | United-States | 26.0885 | 30.4 |
1 | HS-grad | Husband | Private | Farming-fishing | United-States | 44.3750 | 50.0 |
There is one final comment to make regarding the inverse_transform
functionality. As mentioned before, the encoding 0
is reserved for values that fall outside the range covered by the data used to run the fit
method. For example
df.age.min(), df.age.max()
(17, 90)
All future age values outside that range will be encoded as 0 and decoded as NaN
tmp_df = df.head(1).copy()
tmp_df.loc[:, "age"] = 5
tmp_df
age | workclass | fnlwgt | education | educational-num | marital-status | occupation | relationship | race | gender | capital-gain | capital-loss | hours-per-week | native-country | income | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 5 | Private | 226802 | 11th | 7 | Never-married | Machine-op-inspct | Own-child | Black | Male | 0 | 0 | 40 | United-States | <=50K |
# quant_tab_preprocessor has already been fitted with data whose age range is between 17 and 90
tmp_qX_tab = quant_tab_preprocessor.transform(tmp_df)
tmp_qX_tab
array([[1, 1, 1, 1, 1, 0, 2]])
quant_tab_preprocessor.inverse_transform(tmp_qX_tab)
Note that quantized cols will be turned into the mid point of the corresponding bin
education | relationship | workclass | occupation | native-country | age | hours-per-week | |
---|---|---|---|---|---|---|---|
0 | 11th | Own-child | Private | Machine-op-inspct | United-States | NaN | 30.4 |
3. TextPreprocessor¶
This preprocessor returns the tokenised, padded sequences that will be directly fed to the stack of LSTMs.
from pytorch_widedeep.preprocessing import TextPreprocessor
# The airbnb dataset, which you could get from here:
# http://insideairbnb.com/get-the-data.html, is too big to be included in
# our datasets module (when including images). Therefore, go there,
# download it, and use the download_images.py script to get the images
# and the airbnb_data_processing.py to process the data. We'll find
# better datasets in the future ;). Note that here we are only using a
# small sample to illustrate the use, so PLEASE ignore the results, just
# focus on usage
df = pd.read_csv("../tmp_data/airbnb/airbnb_sample.csv")
texts = df.description.tolist()
texts[:2]
["My bright double bedroom with a large window has a relaxed feeling! It comfortably fits one or two and is centrally located just two blocks from Finsbury Park. Enjoy great restaurants in the area and easy access to easy transport tubes, trains and buses. Babies and children of all ages are welcome. Hello Everyone, I'm offering my lovely double bedroom in Finsbury Park area (zone 2) for let in a shared apartment. You will share the apartment with me and it is fully furnished with a self catering kitchen. Two people can easily sleep well as the room has a queen size bed. I also have a travel cot for a baby for guest with small children. I will require a deposit up front as a security gesture on both our parts and will be given back to you when you return the keys. I trust anyone who will be responding to this add would treat my home with care and respect . Best Wishes Alina Guest will have access to the self catering kitchen and bathroom. There is the flat is equipped wifi internet,", "Lots of windows and light. St Luke's Gardens are at the end of the block, and the river not too far the other way. Ten minutes walk if you go slowly. Buses to everywhere round the corner and shops, restaurants, pubs, the cinema and Waitrose . Bright Chelsea Apartment This is a bright one bedroom ground floor apartment in an interesting listed building. There is one double bedroom and a living room/kitchen The apartment has a full bathroom and the kitchen is fully equipped. Two wardrobes are available exclusively for guests and bedside tables and two long drawers. This sunny convenient compact flat is just around the corner from the Waitrose supermarket and all sorts of shops, cinemas, restaurants and pubs. This is a lovely part of London. There is a fun farmers market in the King's Road at the weekend. Buses to everywhere are just round the corner, and two underground stations are within ten minutes walk. There is a very nice pub round by St. Luke's gardens, 4 mins slow walk, the "]
text_preprocessor = TextPreprocessor(text_col="description")
X_text = text_preprocessor.fit_transform(df)
# From here on, any new observation can be prepared by simply running `.transform`
# new_X_text = text_preprocessor.transform(new_df)
The vocabulary contains 2192 tokens
print(X_text[0])
[ 29 48 37 367 818 17 910 17 177 15 122 349 53 879 1174 126 393 40 911 0 23 228 71 819 9 53 55 1380 225 11 18 308 18 1564 10 755 0 942 239 53 55 0 11 36 1013 277 1974 70 62 15 1475 9 943 5 251 5 0 5 0 5 177 53 37 75 11 10 294 726 32 9 42 5 25 12 10 22 12 136 100 145]
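The tokenise-then-pad step can be sketched with a hypothetical, much-simplified version of the logic (the real TextPreprocessor also handles vocabulary size limits, minimum frequencies and padding details that this sketch ignores): a vocabulary is built from the training texts, index 0 is reserved for padding/unknown tokens, and each text becomes a fixed-length sequence of indices:

```python
# build a vocabulary from the training texts; 0 is reserved for
# padding and unknown tokens, so indexing starts at 1
def fit_vocab(texts):
    tokens = dict.fromkeys(t for text in texts for t in text.lower().split())
    return {tok: i for i, tok in enumerate(tokens, start=1)}

# map tokens to indices (unknown -> 0), truncate, then pad to maxlen
def pad_sequence(text, vocab, maxlen):
    seq = [vocab.get(t, 0) for t in text.lower().split()][:maxlen]
    return seq + [0] * (maxlen - len(seq))

vocab = fit_vocab(["bright double bedroom", "bright one bedroom flat"])
pad_sequence("bright sunny bedroom", vocab, maxlen=5)  # → [1, 0, 3, 0, 0]
```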
4. ImagePreprocessor¶
ImagePreprocessor
simply resizes the images while preserving their aspect ratio.
from pytorch_widedeep.preprocessing import ImagePreprocessor
image_preprocessor = wd.preprocessing.ImagePreprocessor(
img_col="id", img_path="../tmp_data/airbnb/property_picture/"
)
X_images = image_preprocessor.fit_transform(df)
# From here on, any new observation can be prepared by simply running `.transform`
# new_X_images = image_preprocessor.transform(new_df)
Reading Images from ../tmp_data/airbnb/property_picture/ Resizing
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 1001/1001 [00:01<00:00, 667.89it/s]
Computing normalisation metrics
X_images[0].shape
(224, 224, 3)
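The dimension arithmetic behind an aspect-ratio-aware resize can be sketched as follows (a hypothetical illustration of the idea, not the library's actual implementation): scale the image so its shorter side reaches the target size, after which a crop or pad would produce the final square shape such as the (224, 224, 3) above:

```python
# scale so the shorter side matches the target; the aspect ratio is preserved
def resize_dims(width, height, target=224):
    scale = target / min(width, height)
    return round(width * scale), round(height * scale)

resize_dims(640, 480)  # → (299, 224)
resize_dims(480, 640)  # → (224, 299)
```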