
Fastai transforms

I directly copied and pasted part of the transforms.py module from the fastai library (from an old version). The reason for doing so is that pytorch_widedeep only needs the Tokenizer and Vocab classes from that module, and copying them avoids extra dependencies. Credit for all the code in the fastai_transforms module in this pytorch-widedeep package goes to Jeremy Howard and the fastai team. I include the documentation here only for completeness, but I strongly advise the user to read the fastai documentation.

Tokenizer

Tokenizer(tok_func=SpacyTokenizer, lang='en', pre_rules=None, post_rules=None, special_cases=None, n_cpus=None)

Class to combine a series of rules and a tokenizer function to tokenize text with multiprocessing.

Setting some of the parameters of this class may require some familiarity with the source code.

Parameters:

  • tok_func (Callable, default: SpacyTokenizer ) –

    Tokenizer class (or callable returning a tokenizer). See pytorch_widedeep.utils.fastai_transforms.SpacyTokenizer

  • lang (str, default: 'en' ) –

    Language of the text to be tokenized

  • pre_rules (Optional[ListRules], default: None ) –

    Custom type: Collection[Callable[[str], str]]. These are Callable objects that will be applied to the text (str) directly, as rule(t), before it is tokenized (see the example below).

  • post_rules (Optional[ListRules], default: None ) –

    Custom type: Collection[Callable[[str], str]]. These are Callable objects that will be applied to the tokens as rule(tokens) after the text has been tokenized.

  • special_cases (Optional[Collection[str]], default: None ) –

    special cases to be added to the tokenizer via spaCy's add_special_case method

  • n_cpus (Optional[int], default: None ) –

    number of CPUs to use during the tokenization process
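
For illustration, a minimal sketch of passing custom rules. The two helper functions here are hypothetical, not part of the library, and note that (as the source code below shows) custom pre_rules/post_rules replace the default fastai rules rather than extending them.

>>> from pytorch_widedeep.utils import Tokenizer
>>> def strip_html_breaks(t):
...     return t.replace("<br />", " ")  # pre-rule: runs on the raw string
>>> def drop_empty_tokens(tokens):
...     return [tok for tok in tokens if tok.strip()]  # post-rule: runs on the token list
>>> tok = Tokenizer(pre_rules=[strip_html_breaks], post_rules=[drop_empty_tokens], n_cpus=1)
>>> tokens = tok.process_all(['Machine learning is great<br />but building stuff is even better'])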

Source code in pytorch_widedeep/utils/fastai_transforms.py
def __init__(
    self,
    tok_func: Callable = SpacyTokenizer,
    lang: str = "en",
    pre_rules: Optional[ListRules] = None,
    post_rules: Optional[ListRules] = None,
    special_cases: Optional[Collection[str]] = None,
    n_cpus: Optional[int] = None,
):
    self.tok_func, self.lang, self.special_cases = tok_func, lang, special_cases
    self.pre_rules = ifnone(pre_rules, defaults.text_pre_rules)
    self.post_rules = ifnone(post_rules, defaults.text_post_rules)
    self.special_cases = (
        special_cases if special_cases is not None else defaults.text_spec_tok
    )
    self.n_cpus = ifnone(n_cpus, defaults.cpus)

process_text

process_text(t, tok)

Process and tokenize one text t with tokenizer tok.

Parameters:

  • t (str) –

    text to be processed and tokenized

  • tok (BaseTokenizer) –

    Instance of BaseTokenizer. See pytorch_widedeep.utils.fastai_transforms.BaseTokenizer

Returns:

  • List[str]

    List of tokens
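
A minimal sketch of calling process_text directly, assuming (as in the fastai source) that SpacyTokenizer is instantiated with the language code; the result should match the first document of the process_all example further below.

>>> from pytorch_widedeep.utils import Tokenizer
>>> from pytorch_widedeep.utils.fastai_transforms import SpacyTokenizer
>>> tok = Tokenizer()
>>> tok.process_text('Machine learning is great', SpacyTokenizer('en'))
['xxmaj', 'machine', 'learning', 'is', 'great']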

Source code in pytorch_widedeep/utils/fastai_transforms.py
def process_text(self, t: str, tok: BaseTokenizer) -> List[str]:
    r"""Process and tokenize one text ``t`` with tokenizer ``tok``.

    Parameters
    ----------
    t: str
        text to be processed and tokenized
    tok: ``BaseTokenizer``
        Instance of `BaseTokenizer`. See
        `pytorch_widedeep.utils.fastai_transforms.BaseTokenizer`

    Returns
    -------
    List[str]
        List of tokens
    """
    for rule in self.pre_rules:
        t = rule(t)
    toks = tok.tokenizer(t)
    for rule in self.post_rules:
        toks = rule(toks)
    return toks

process_all

process_all(texts)

Process a list of texts. Parallel execution of process_text.

Examples:

>>> from pytorch_widedeep.utils import Tokenizer
>>> texts = ['Machine learning is great', 'but building stuff is even better']
>>> tok = Tokenizer()
>>> tok.process_all(texts)
[['xxmaj', 'machine', 'learning', 'is', 'great'], ['but', 'building', 'stuff', 'is', 'even', 'better']]

ℹ️ NOTE: Note the token TK_MAJ (xxmaj), used to indicate that the next word begins with a capital letter in the original text. For more details on special tokens, please see the fastai docs.

Returns:

  • List[List[str]]

    List containing lists of tokens. One list per "document"

Source code in pytorch_widedeep/utils/fastai_transforms.py
def process_all(self, texts: Collection[str]) -> List[List[str]]:
    r"""Process a list of texts. Parallel execution of ``process_text``.

    Examples
    --------
    >>> from pytorch_widedeep.utils import Tokenizer
    >>> texts = ['Machine learning is great', 'but building stuff is even better']
    >>> tok = Tokenizer()
    >>> tok.process_all(texts)
    [['xxmaj', 'machine', 'learning', 'is', 'great'], ['but', 'building', 'stuff', 'is', 'even', 'better']]

    :information_source: **NOTE**:
    Note the token ``TK_MAJ`` (`xxmaj`), used to indicate the
    next word begins with a capital in the original text. For more
    details of special tokens please see the [``fastai`` docs](https://docs.fast.ai/text.core.html#Tokenizing).

    Returns
    -------
    List[List[str]]
        List containing lists of tokens. One list per "_document_"

    """

    if self.n_cpus <= 1:
        return self._process_all_1(texts)
    with ProcessPoolExecutor(self.n_cpus) as e:
        return sum(
            e.map(self._process_all_1, partition_by_cores(texts, self.n_cpus)), []
        )

Vocab

Vocab(max_vocab, min_freq, pad_idx=None)

Contains the correspondence between numbers and tokens.

Parameters:

  • max_vocab (int) –

    maximum vocabulary size

  • min_freq (int) –

    minimum frequency for a token to be considered

  • pad_idx (Optional[int], default: None ) –

    padding index. If None, fastai's convention is followed: index 0 is left for the unknown token ('xxunk') and the padding token ('xxpad') defaults to index 1 (see the example below).

Attributes:

  • itos (Collection) –

    index to str. Collection of strings that are the tokens of the vocabulary

  • stoi (defaultdict) –

    str to index. Dictionary containing the tokens of the vocabulary and their corresponding index
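
A small sketch of how pad_idx changes where the padding token is placed (based on the create method shown further below); the indices of the remaining tokens depend on the corpus.

>>> from pytorch_widedeep.utils import Tokenizer, Vocab
>>> tokens = Tokenizer().process_all(['Machine learning is great'])
>>> vocab = Vocab(max_vocab=100, min_freq=1).create(tokens)
>>> vocab.itos[0], vocab.itos[1]  # default: 'xxunk' at index 0, 'xxpad' at index 1
('xxunk', 'xxpad')
>>> vocab0 = Vocab(max_vocab=100, min_freq=1, pad_idx=0).create(tokens)
>>> vocab0.itos[0]  # the padding token is now at index 0
'xxpad'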

Source code in pytorch_widedeep/utils/fastai_transforms.py
def __init__(
    self,
    max_vocab: int,
    min_freq: int,
    pad_idx: Optional[int] = None,
):
    self.max_vocab = max_vocab
    self.min_freq = min_freq
    self.pad_idx = pad_idx

create

create(tokens)

Create a vocabulary object from a set of tokens.

Parameters:

  • tokens (Tokens) –

    Custom type: Collection[Collection[str]], see pytorch_widedeep.wdtypes. Collection of collections of strings (e.g. a list of tokenized sentences)

Examples:

>>> from pytorch_widedeep.utils import Tokenizer, Vocab
>>> texts = ['Machine learning is great', 'but building stuff is even better']
>>> tokens = Tokenizer().process_all(texts)
>>> vocab = Vocab(max_vocab=18, min_freq=1).create(tokens)
>>> vocab.numericalize(['machine', 'learning', 'is', 'great'])
[10, 11, 9, 12]
>>> vocab.textify([10, 11, 9, 12])
'machine learning is great'

ℹ️ NOTE: Note the many special tokens that fastai's tokenizer adds. These are particularly useful when building language models and/or for classification/regression tasks. Please see the fastai docs.

Returns:

  • Vocab

    An instance of a Vocab object

Source code in pytorch_widedeep/utils/fastai_transforms.py
def create(
    self,
    tokens: Tokens,
) -> "Vocab":
    r"""Create a vocabulary object from a set of tokens.

    Parameters
    ----------
    tokens: Tokens
        Custom type: ``Collection[Collection[str]]``  see
        `pytorch_widedeep.wdtypes`. Collection of collection of
        strings (e.g. list of tokenized sentences)

    Examples
    --------
    >>> from pytorch_widedeep.utils import Tokenizer, Vocab
    >>> texts = ['Machine learning is great', 'but building stuff is even better']
    >>> tokens = Tokenizer().process_all(texts)
    >>> vocab = Vocab(max_vocab=18, min_freq=1).create(tokens)
    >>> vocab.numericalize(['machine', 'learning', 'is', 'great'])
    [10, 11, 9, 12]
    >>> vocab.textify([10, 11, 9, 12])
    'machine learning is great'

    :information_source: **NOTE**:
    Note the many special tokens that ``fastai``'s' tokenizer adds. These
    are particularly useful when building Language models and/or in
    classification/Regression tasks. Please see the [``fastai`` docs](https://docs.fast.ai/text.core.html#Tokenizing).

    Returns
    -------
    Vocab
        An instance of a `Vocab` object
    """

    freq = Counter(p for o in tokens for p in o)
    itos = [o for o, c in freq.most_common(self.max_vocab) if c >= self.min_freq]
    for o in reversed(defaults.text_spec_tok):
        if o in itos:
            itos.remove(o)
        itos.insert(0, o)

    if self.pad_idx is not None and self.pad_idx != 1:
        itos.remove(PAD)
        itos.insert(self.pad_idx, PAD)
        # get the new 'xxunk' index
        xxunk_idx = np.where([el == "xxunk" for el in itos])[0][0]
    else:
        xxunk_idx = 0

    itos = itos[: self.max_vocab]
    if (
        len(itos) < self.max_vocab
    ):  # Make sure vocab size is a multiple of 8 for fast mixed precision training
        while len(itos) % 8 != 0:
            itos.append("xxfake")

    self.itos = itos
    self.stoi = defaultdict(
        lambda: xxunk_idx, {v: k for k, v in enumerate(self.itos)}
    )

    return self

fit

fit(tokens)

Calls the create method. I keep create to honor the fastai naming, but for consistency with the rest of the library I also include a fit method.

Source code in pytorch_widedeep/utils/fastai_transforms.py
def fit(
    self,
    tokens: Tokens,
) -> "Vocab":
    """
    Calls the `create` method. I simply want to honor fast ai naming, but
    for consistency with the rest of the library I am including a fit method
    """
    return self.create(tokens)

numericalize

numericalize(t)

Convert a list of tokens t to their ids.

Returns:

  • List[int]

    List of 'numericalised' tokens
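
Since stoi is built as a defaultdict (see the create source above), out-of-vocabulary tokens are mapped to the 'xxunk' index. A quick sketch:

>>> from pytorch_widedeep.utils import Tokenizer, Vocab
>>> tokens = Tokenizer().process_all(['Machine learning is great'])
>>> vocab = Vocab(max_vocab=100, min_freq=1).create(tokens)
>>> vocab.numericalize(['rocks']) == [vocab.stoi['xxunk']]  # 'rocks' is out of vocabulary
True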

Source code in pytorch_widedeep/utils/fastai_transforms.py
def numericalize(self, t: Collection[str]) -> List[int]:
    """Convert a list of tokens ``t`` to their ids.

    Returns
    -------
    List[int]
        List of '_numericalsed_' tokens
    """
    return [self.stoi[w] for w in t]

transform

transform(t)

Calls the numericalize method. I keep numericalize to honor the fastai naming, but for consistency with the rest of the library I also include a transform method.

Source code in pytorch_widedeep/utils/fastai_transforms.py
def transform(self, t: Collection[str]) -> List[int]:
    """
    Calls the `numericalize` method. I simply want to honor fast ai naming,
    but for consistency with the rest of the library I am including a
    transform method
    """
    return self.numericalize(t)

textify

textify(nums, sep=' ')

Convert a list of nums (i.e. indexes) back to their tokens.

Returns:

  • Union[str, List[str]]

    The tokens joined by sep into a single string or, if sep is None, the list of tokens
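
Continuing the create example above, a short sketch of the two return modes:

>>> vocab.textify([10, 11, 9, 12])
'machine learning is great'
>>> vocab.textify([10, 11, 9, 12], sep=None)
['machine', 'learning', 'is', 'great']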

Source code in pytorch_widedeep/utils/fastai_transforms.py
def textify(self, nums: Collection[int], sep=" ") -> Union[str, List[str]]:
    """Convert a list of ``nums`` (or indexes) to their tokens.

    Returns
    -------
    List[str]
        List of tokens
    """
    return (
        sep.join([self.itos[i] for i in nums])
        if sep is not None
        else [self.itos[i] for i in nums]
    )

inverse_transform

inverse_transform(nums, sep=' ')

Calls the textify method. I keep textify to honor the fastai naming, but for consistency with the rest of the library I also include an inverse_transform method.

Source code in pytorch_widedeep/utils/fastai_transforms.py
def inverse_transform(
    self, nums: Collection[int], sep=" "
) -> Union[str, List[str]]:
    """
    Calls the `textify` method. I simply want to honor fast ai naming, but
    for consistency with the rest of the library I am including an
    inverse_transform method
    """
    # I simply want to honor fast ai naming, but for consistency with the
    # rest of the library I am including an inverse_transform method
    return self.textify(nums, sep)
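
Taken together, fit, transform and inverse_transform let Vocab be used with the scikit-learn style naming adopted throughout the library. A minimal sketch reusing the data from the create example above:

>>> from pytorch_widedeep.utils import Tokenizer, Vocab
>>> texts = ['Machine learning is great', 'but building stuff is even better']
>>> tokens = Tokenizer().process_all(texts)
>>> vocab = Vocab(max_vocab=18, min_freq=1).fit(tokens)
>>> nums = vocab.transform(['machine', 'learning', 'is', 'great'])
>>> vocab.inverse_transform(nums)
'machine learning is great'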