20-Using-huggingface-within-widedeep
In this notebook we will show how to use Hugging Face's tokenizers and models as they are integrated within the library. In notebook number 17 you can find examples of how to code your own, custom Hugging Face (hereafter HF) model and use it in combination with any other model in the library.
In [1]:
import numpy as np
import pandas as pd
from sklearn.metrics import f1_score, accuracy_score
from sklearn.model_selection import train_test_split
from pytorch_widedeep import Trainer
from pytorch_widedeep.models import HFModel, WideDeep
from pytorch_widedeep.metrics import F1Score, Accuracy
from pytorch_widedeep.datasets import load_womens_ecommerce
from pytorch_widedeep.preprocessing import HFPreprocessor
In [2]:
df: pd.DataFrame = load_womens_ecommerce(as_frame=True) # type: ignore
In [3]:
df.shape
Out[3]:
(23486, 10)
In [4]:
df.sample(3)
Out[4]:
|  | Clothing ID | Age | Title | Review Text | Rating | Recommended IND | Positive Feedback Count | Division Name | Department Name | Class Name |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 7004 | 862 | 43 | Cute and feminine | Loved this sweater wrap and bought it in both ... | 5 | 1 | 2 | General | Tops | Knits |
| 12508 | 975 | 66 | Love it | The linen fabric is elegantly thin feels and l... | 5 | 1 | 3 | General | Jackets | Jackets |
| 10288 | 950 | 41 | Perfect for fall | This sweater is just as pictured. the fit is t... | 5 | 1 | 0 | General | Tops | Sweaters |
In [5]:
# Let's do some mild preprocessing
df.columns = [c.replace(" ", "_").lower() for c in df.columns]
# shift classes to [0, num_class)
df["rating"] = (df["rating"] - 1).astype("int64")
# group reviews with 1 and 2 scores into one class
df.loc[df.rating == 0, "rating"] = 1
# and shift back to [0, num_class)
df["rating"] = (df["rating"] - 1).astype("int64")
In [6]:
# drop short reviews
df = df[~df.review_text.isna()]
df["review_length"] = df.review_text.apply(lambda x: len(x.split(" ")))
df = df[df.review_length >= 5]
df = df.drop("review_length", axis=1).reset_index(drop=True)
In [7]:
df.shape
Out[7]:
(22608, 10)
In [8]:
# If you run this on a CPU, you might want to subsample the dataset. With that in mind, I am simply
# going to stratify-sample down to the minimum class occurrence and then sample at random.
# If you run this on a GPU you can comment out the following two cells
df.rating.value_counts()
Out[8]:
rating
3    12515
2     4904
1     2820
0     2369
Name: count, dtype: int64
In [9]:
df = (
df.groupby("rating", group_keys=False)
.apply(lambda x: x.sample(min(len(x), 2369)))
.sample(1000)
)
/var/folders/_2/lrjn1qn54c758tdtktr1bvkc0000gn/T/ipykernel_5886/895673206.py:3: DeprecationWarning: DataFrameGroupBy.apply operated on the grouping columns. This behavior is deprecated, and in a future version of pandas the grouping columns will be excluded from the operation. Either pass `include_groups=False` to exclude the groupings or explicitly select the grouping columns after groupby to silence this warning. .apply(lambda x: x.sample(min(len(x), 2369)))
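As the deprecation warning suggests, having `DataFrameGroupBy.apply` operate on the grouping column is on its way out. A warning-free alternative, sketched here assuming pandas >= 1.1 (where `DataFrameGroupBy.sample` is available):

```python
# Every rating class has at least 2369 rows in this dataset, so a fixed n per
# group is safe here; with smaller groups you would need the min() clipping of
# the apply-based version above.
df = df.groupby("rating", group_keys=False).sample(n=2369).sample(1000)
```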
In [10]:
train, test = train_test_split(df, train_size=0.8, random_state=1, stratify=df.rating)
# possible model names currently supported in the library
model_names = [
"distilbert-base-uncased",
"bert-base-uncased",
"FacebookAI/roberta-base",
"albert-base-v2",
"google/electra-base-discriminator",
]
# Let's choose one. The syntax is the same for all the models
model_name = "distilbert-base-uncased"
Now we can use the `HFPreprocessor` class. As with most things in this library, the integration with HF has been coded aiming for flexible use. With this in mind, there are two ways one can use an `HFPreprocessor`:

1. Passing a `text_col` and `encode_params` as the class is instantiated, and then using `fit` and `transform` as with any other preprocessor in the library
2. Without passing `text_col` and `encode_params` at instantiation, and instead using the `encode` method of the `HFPreprocessor`, which is simply a wrapper around the `encode` method of HF's tokenizers

Let's have a look.
In [11]:
tokenizer1 = HFPreprocessor(
model_name=model_name,
text_col="review_text",
num_workers=1,
encode_params={
"max_length": 90,
"padding": "max_length",
"truncation": True,
"add_special_tokens": True,
},
)
X_text_tr1 = tokenizer1.fit_transform(train)
X_text_te1 = tokenizer1.transform(test)
In [12]:
tokenizer2 = HFPreprocessor(
model_name=model_name,
num_workers=1,
)
X_text_tr2 = tokenizer2.encode(
train.review_text.tolist(),
max_length=90,
padding="max_length",
truncation=True,
add_special_tokens=True,
)
X_text_te2 = tokenizer2.encode(
test.review_text.tolist(),
max_length=90,
padding="max_length",
truncation=True,
add_special_tokens=True,
)
In [13]:
all(X_text_tr1[0] == X_text_tr2[0])
Out[13]:
True
In [14]:
# Now we define a model, which is as easy as:
# Note that this instantiation will leave NO trainable parameters in the HF model.
# If you want to fine-tune the HF model, you can set the trainable parameters via the 'trainable_parameters' argument.
# Alternatively, you can use a head (MLP) via the 'head'-related arguments (see the docs for more details)
hf_model = HFModel(model_name=model_name)
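As the comments note, fine-tuning is enabled through `trainable_parameters`, and an MLP head through the head-related arguments. A minimal sketch of what that could look like; the values below (the parameter-name pattern and the `head_hidden_dims` name and sizes) are illustrative assumptions, so check the `HFModel` docs for the exact signature:

```python
# Sketch only, not used in this notebook: unfreeze part of the HF model and add
# an MLP head. The "layer.5" pattern and 'head_hidden_dims' are assumptions.
hf_model_ft = HFModel(
    model_name=model_name,
    trainable_parameters=["layer.5"],
    head_hidden_dims=[256, 64],
)
```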
In [15]:
# And from here on it is the same as with any other WideDeep model
model = WideDeep(
deeptext=hf_model,
pred_dim=4,
)
trainer = Trainer(
model,
objective="multiclass",
metrics=[Accuracy(), F1Score(average=True)],
)
trainer.fit(
X_text=X_text_tr2,
target=train.rating.values,
n_epochs=1,
batch_size=64,
)
# If you run this on a CPU and you sampled the data, the metrics will not be better than a random guess. Remember, this is just a demo
epoch 1: 100%|██████████| 13/13 [02:06<00:00, 9.75s/it, loss=3.2, metrics={'acc': 0.235, 'f1': 0.2336}]
In [17]:
preds_text = trainer.predict_proba(X_text=X_text_te2)
pred_text_class = np.argmax(preds_text, 1)
acc_text = accuracy_score(test.rating, pred_text_class)
f1_text = f1_score(test.rating, pred_text_class, average="weighted")
print(f"Accuracy: {acc_text:.4f}")
print(f"F1: {f1_text:.4f}")
predict: 100%|██████████| 4/4 [00:05<00:00, 1.43s/it]
Accuracy: 0.2500
F1: 0.1000
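Finally, as mentioned at the beginning, the HF model can be combined with any other model in the library. A minimal sketch of a tabular + text setup, assuming the library's standard `TabPreprocessor`/`TabMlp` workflow (the column choices below are just illustrative):

```python
from pytorch_widedeep.models import TabMlp
from pytorch_widedeep.preprocessing import TabPreprocessor

# Illustrative tabular component alongside the HF text model. 'age' and
# 'division_name' are columns in this dataset, picked only as an example.
tab_preprocessor = TabPreprocessor(
    cat_embed_cols=["division_name"], continuous_cols=["age"]
)
X_tab_tr = tab_preprocessor.fit_transform(train)

tab_mlp = TabMlp(
    column_idx=tab_preprocessor.column_idx,
    cat_embed_input=tab_preprocessor.cat_embed_input,
    continuous_cols=["age"],
    mlp_hidden_dims=[64, 32],
)

model = WideDeep(deeptabular=tab_mlp, deeptext=hf_model, pred_dim=4)
# ...and then train as before, passing both X_tab=X_tab_tr and X_text=X_text_tr2
# to trainer.fit
```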