Fastai transforms¶
I directly copied and pasted part of the transforms.py module from the fastai library (from an old version). I did this because pytorch_widedeep only needs the Tokenizer and the Vocab classes from that module, and copying them avoids extra dependencies. Credit for all the code in the fastai_transforms module in this pytorch-widedeep package goes to Jeremy Howard and the fastai team. I include the documentation here only for completeness, but I strongly advise the user to read the fastai documentation.
Tokenizer ¶
Tokenizer(tok_func=SpacyTokenizer, lang='en', pre_rules=None, post_rules=None, special_cases=None, n_cpus=None)
Class to combine a series of rules and a tokenizer function to tokenize text with multiprocessing.
Setting some of the parameters of this class may require some familiarity with the source code.
Parameters:
- tok_func (Callable, default: SpacyTokenizer) – Tokenizer object. See pytorch_widedeep.utils.fastai_transforms.SpacyTokenizer
- lang (str, default: 'en') – Text's language
- pre_rules (Optional[ListRules], default: None) – Custom type: Collection[Callable[[str], str]]. These are Callable objects that will be applied to the text (str) directly, as rule(tok), before it is tokenized.
- post_rules (Optional[ListRules], default: None) – Custom type: Collection[Callable[[str], str]]. These are Callable objects that will be applied to the tokens, as rule(tokens), after the text has been tokenized.
- special_cases (Optional[Collection[str]], default: None) – Special cases to be added to the tokenizer via spaCy's add_special_case method
- n_cpus (Optional[int], default: None) – Number of CPUs to use during the tokenization process
Source code in pytorch_widedeep/utils/fastai_transforms.py
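Conceptually, pre_rules and post_rules are just chained callables. The sketch below mimics how such rules compose in plain Python; it illustrates the mechanism and is not the library's code:

```python
from typing import Callable, Collection, List

def apply_pre_rules(text: str, rules: Collection[Callable[[str], str]]) -> str:
    # Pre-rules map str -> str and run on the raw text before tokenization
    for rule in rules:
        text = rule(text)
    return text

def apply_post_rules(tokens: List[str], rules) -> List[str]:
    # Post-rules map a token list to a token list and run after tokenization
    for rule in rules:
        tokens = rule(tokens)
    return tokens

clean = apply_pre_rules("  Machine LEARNING ", [str.strip, str.lower])
tokens = apply_post_rules(clean.split(), [lambda ts: [t for t in ts if t]])
print(tokens)  # ['machine', 'learning']
```

Any callable with the matching signature can be passed as a rule, which is why setting these parameters benefits from a look at the source code.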
process_text ¶
process_text(t, tok)
Process and tokenize one text t
with tokenizer tok
.
Parameters:
- t (str) – Text to be processed and tokenized
- tok (BaseTokenizer) – Instance of BaseTokenizer. See pytorch_widedeep.utils.fastai_transforms.BaseTokenizer

Returns:
- List[str] – List of tokens
Source code in pytorch_widedeep/utils/fastai_transforms.py
process_all ¶
process_all(texts)
Process a list of texts. Parallel execution of process_text.
Examples:
>>> from pytorch_widedeep.utils import Tokenizer
>>> texts = ['Machine learning is great', 'but building stuff is even better']
>>> tok = Tokenizer()
>>> tok.process_all(texts)
[['xxmaj', 'machine', 'learning', 'is', 'great'], ['but', 'building', 'stuff', 'is', 'even', 'better']]
NOTE:
Note the token TK_MAJ (xxmaj), used to indicate that the next word begins with a capital letter in the original text. For more details on special tokens please see the fastai docs.
Returns:
- List[List[str]] – List containing lists of tokens. One list per "document"
Source code in pytorch_widedeep/utils/fastai_transforms.py
Vocab ¶
Vocab(max_vocab, min_freq, pad_idx=None)
Contains the correspondence between numbers and tokens.
Parameters:
- max_vocab (int) – Maximum vocabulary size
- min_freq (int) – Minimum frequency for a token to be considered
- pad_idx (Optional[int], default: None) – Padding index. If None, fastai's Tokenizer leaves the 0 index for the unknown token ('xxunk') and defaults to 1 for the padding token ('xxpad').
Attributes:
- itos (Collection) – index to str. Collection of strings that are the tokens of the vocabulary
- stoi (defaultdict) – str to index. Dictionary containing the tokens of the vocabulary and their corresponding index
Source code in pytorch_widedeep/utils/fastai_transforms.py
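The itos/stoi correspondence, and the special 'xxunk'/'xxpad' indices mentioned above, can be sketched in plain Python. The token names and ordering below follow fastai's convention, but this is an illustration, not the library's internals:

```python
from collections import defaultdict

# index -> token; by convention 0 is the unknown token and 1 the padding token
itos = ["xxunk", "xxpad", "machine", "learning", "is", "great"]

# token -> index; tokens not in the vocabulary fall back to index 0 ('xxunk')
stoi = defaultdict(int, {tok: i for i, tok in enumerate(itos)})

print(stoi["machine"])      # 2
print(stoi["banana"])       # 0  (unknown token)
print(itos[stoi["xxpad"]])  # 'xxpad'
```

Using a defaultdict for stoi is what makes out-of-vocabulary tokens silently resolve to 'xxunk' instead of raising a KeyError.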
create ¶
create(tokens)
Create a vocabulary object from a set of tokens.
Parameters:
- tokens (Tokens) – Custom type: Collection[Collection[str]]; see pytorch_widedeep.wdtypes. Collection of collections of strings (e.g. a list of tokenized sentences)
Examples:
>>> from pytorch_widedeep.utils import Tokenizer, Vocab
>>> texts = ['Machine learning is great', 'but building stuff is even better']
>>> tokens = Tokenizer().process_all(texts)
>>> vocab = Vocab(max_vocab=18, min_freq=1).create(tokens)
>>> vocab.numericalize(['machine', 'learning', 'is', 'great'])
[10, 11, 9, 12]
>>> vocab.textify([10, 11, 9, 12])
'machine learning is great'
NOTE:
Note the many special tokens that fastai's tokenizer adds. These are particularly useful when building language models and/or in classification/regression tasks. Please see the fastai docs.
Returns:
- Vocab – An instance of a Vocab object
Source code in pytorch_widedeep/utils/fastai_transforms.py
fit ¶
fit(tokens)
Calls the create method. I simply want to honor fastai naming, but for consistency with the rest of the library I am including a fit method.
Source code in pytorch_widedeep/utils/fastai_transforms.py
numericalize ¶
numericalize(t)
Convert a list of tokens t
to their ids.
Returns:
- List[int] – List of 'numericalized' tokens
Source code in pytorch_widedeep/utils/fastai_transforms.py
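Under the hood, numericalizing is just a per-token dictionary lookup. A standalone plain-Python sketch (a hypothetical vocabulary, not the library implementation):

```python
from collections import defaultdict

itos = ["xxunk", "xxpad", "is", "machine", "learning", "great"]
stoi = defaultdict(int, {tok: i for i, tok in enumerate(itos)})

def numericalize(tokens):
    # Map each token to its index; unseen tokens collapse to 0 ('xxunk')
    return [stoi[t] for t in tokens]

print(numericalize(["machine", "learning", "is", "great"]))  # [3, 4, 2, 5]
```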
transform ¶
transform(t)
Calls the numericalize method. I simply want to honor fastai naming, but for consistency with the rest of the library I am including a transform method.
Source code in pytorch_widedeep/utils/fastai_transforms.py
textify ¶
textify(nums, sep=' ')
Convert a list of nums
(or indexes) to their tokens.
Returns:
- List[str] – List of tokens
Source code in pytorch_widedeep/utils/fastai_transforms.py
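textify is the inverse lookup: indices back to tokens, joined with sep. A plain-Python sketch under the same hypothetical vocabulary as above (an illustration of the idea, not the library code):

```python
itos = ["xxunk", "xxpad", "is", "machine", "learning", "great"]

def textify(nums, sep=" "):
    # Look each index up in itos; join into a single string when sep is given
    toks = [itos[i] for i in nums]
    return sep.join(toks) if sep is not None else toks

print(textify([3, 4, 2, 5]))      # 'machine learning is great'
print(textify([3, 4], sep=None))  # ['machine', 'learning']
```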
inverse_transform ¶
inverse_transform(nums, sep=' ')
Calls the textify method. I simply want to honor fastai naming, but for consistency with the rest of the library I am including an inverse_transform method.
Source code in pytorch_widedeep/utils/fastai_transforms.py