Fastai transforms

We directly copied and pasted part of the transforms.py module from an old version of the fastai library, because pytorch_widedeep only needs the Tokenizer and Vocab classes defined there. This way we avoid an extra dependency. Credit for all the code in the fastai_transforms module of this pytorch-widedeep package goes to Jeremy Howard and the fastai team. I include the documentation here only for completeness, but I strongly advise the user to read the fastai documentation.

Tokenizer

Class to combine a series of rules and a tokenizer function to tokenize text with multiprocessing.

Setting some of the parameters of this class may require some familiarity with the source code.

Parameters:

tok_func: Callable, default = SpacyTokenizer
    Tokenizer object. See pytorch_widedeep.utils.fastai_transforms.SpacyTokenizer
lang: str, default = 'en'
    Text's language
pre_rules: Optional[ListRules], default = None
    Custom type: Collection[Callable[[str], str]]. These are Callable objects that will be applied to the text (str) directly, as rule(t), before it is tokenized.
post_rules: Optional[ListRules], default = None
    Custom type: Collection[Callable[[str], str]]. These are Callable objects that will be applied to the tokens, as rule(tokens), after the text has been tokenized.
special_cases: Optional[Collection[str]], default = None
    Special cases to be added to the tokenizer via Spacy's add_special_case method
n_cpus: Optional[int], default = None
    Number of CPUs to use during the tokenization process
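For example, a custom pre-processing rule can be passed to the constructor. The snippet below is a minimal sketch: remove_urls is a hypothetical rule written for illustration, and note that passing pre_rules replaces the default rules rather than extending them.

import re

from pytorch_widedeep.utils import Tokenizer

def remove_urls(t: str) -> str:
    # hypothetical pre-rule: strip URLs from the raw text before it is tokenized
    return re.sub(r"https?://\S+", "", t)

# pre_rules replaces the default pre-processing rules; n_cpus=1 keeps the
# tokenization in a single process
tok = Tokenizer(pre_rules=[remove_urls], n_cpus=1)
tokens = tok.process_all(["Check https://example.com for details"])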
Source code in pytorch_widedeep/utils/fastai_transforms.py
class Tokenizer:
    r"""Class to combine a series of rules and a tokenizer function to tokenize
    text with multiprocessing.

    Setting some of the parameters of this class require perhaps some
    familiarity with the source code.

    Parameters
    ----------
    tok_func: Callable, default = ``SpacyTokenizer``
        Tokenizer Object. See `pytorch_widedeep.utils.fastai_transforms.SpacyTokenizer`
    lang: str, default = "en"
        Text's Language
    pre_rules: ListRules, Optional, default = None
        Custom type: ``Collection[Callable[[str], str]]``. These are
        `Callable` objects that will be applied to the text (str) directly as
        `rule(tok)` before being tokenized.
    post_rules: ListRules, Optional, default = None
        Custom type: ``Collection[Callable[[str], str]]``. These are
        `Callable` objects that will be applied to the tokens as
        `rule(tokens)` after the text has been tokenized.
    special_cases: Collection, Optional, default= None
        special cases to be added to the tokenizer via ``Spacy``'s
        ``add_special_case`` method
    n_cpus: int, Optional, default = None
        number of CPUs to use during the tokenization process
    """

    def __init__(
        self,
        tok_func: Callable = SpacyTokenizer,
        lang: str = "en",
        pre_rules: Optional[ListRules] = None,
        post_rules: Optional[ListRules] = None,
        special_cases: Optional[Collection[str]] = None,
        n_cpus: Optional[int] = None,
    ):
        self.tok_func, self.lang, self.special_cases = tok_func, lang, special_cases
        self.pre_rules = ifnone(pre_rules, defaults.text_pre_rules)
        self.post_rules = ifnone(post_rules, defaults.text_post_rules)
        self.special_cases = (
            special_cases if special_cases is not None else defaults.text_spec_tok
        )
        self.n_cpus = ifnone(n_cpus, defaults.cpus)

    def __repr__(self) -> str:
        res = f"Tokenizer {self.tok_func.__name__} in {self.lang} with the following rules:\n"
        for rule in self.pre_rules:
            res += f" - {rule.__name__}\n"
        for rule in self.post_rules:
            res += f" - {rule.__name__}\n"
        return res

    def process_text(self, t: str, tok: BaseTokenizer) -> List[str]:
        r"""Process and tokenize one text ``t`` with tokenizer ``tok``.

        Parameters
        ----------
        t: str
            text to be processed and tokenized
        tok: ``BaseTokenizer``
            Instance of `BaseTokenizer`. See
            `pytorch_widedeep.utils.fastai_transforms.BaseTokenizer`

        Returns
        -------
        List[str]
            List of tokens
        """
        for rule in self.pre_rules:
            t = rule(t)
        toks = tok.tokenizer(t)
        for rule in self.post_rules:
            toks = rule(toks)
        return toks

    def _process_all_1(self, texts: Collection[str]) -> List[List[str]]:
        """Process a list of ``texts`` in one process."""

        tok = self.tok_func(self.lang)
        if self.special_cases:
            tok.add_special_cases(self.special_cases)
        return [self.process_text(str(t), tok) for t in texts]

    def process_all(self, texts: Collection[str]) -> List[List[str]]:
        r"""Process a list of texts. Parallel execution of ``process_text``.

        Examples
        --------
        >>> from pytorch_widedeep.utils import Tokenizer
        >>> texts = ['Machine learning is great', 'but building stuff is even better']
        >>> tok = Tokenizer()
        >>> tok.process_all(texts)
        [['xxmaj', 'machine', 'learning', 'is', 'great'], ['but', 'building', 'stuff', 'is', 'even', 'better']]

        :information_source: **NOTE**:
        Note the token ``TK_MAJ`` (`xxmaj`), used to indicate the
        next word begins with a capital in the original text. For more
        details of special tokens please see the [``fastai`` docs](https://docs.fast.ai/text.core.html#Tokenizing).

        Returns
        -------
        List[List[str]]
            List containing lists of tokens. One list per "_document_"

        """

        if self.n_cpus <= 1:
            return self._process_all_1(texts)

        with Pool(self.n_cpus) as p:
            partitioned_texts = partition_by_cores(texts, self.n_cpus)
            results = p.map(self._process_all_1, partitioned_texts)
            res = sum(results, [])
        return res

process_text

process_text(t, tok)

Process and tokenize one text t with tokenizer tok.

Parameters:

t: str
    text to be processed and tokenized
tok: BaseTokenizer
    Instance of BaseTokenizer. See pytorch_widedeep.utils.fastai_transforms.BaseTokenizer

Returns:

List[str]
    List of tokens
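process_text can also be called directly on a single document by passing a tokenizer instance explicitly. A minimal sketch, assuming SpacyTokenizer can be instantiated with just a language code (as process_all does internally):

from pytorch_widedeep.utils import Tokenizer
from pytorch_widedeep.utils.fastai_transforms import SpacyTokenizer

tok = Tokenizer()
spacy_tok = SpacyTokenizer("en")  # a BaseTokenizer instance; process_all builds one internally per process
tokens = tok.process_text("Machine learning is great", spacy_tok)
# tokens is a list of str, e.g. ['xxmaj', 'machine', 'learning', 'is', 'great']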

Source code in pytorch_widedeep/utils/fastai_transforms.py
def process_text(self, t: str, tok: BaseTokenizer) -> List[str]:
    r"""Process and tokenize one text ``t`` with tokenizer ``tok``.

    Parameters
    ----------
    t: str
        text to be processed and tokenized
    tok: ``BaseTokenizer``
        Instance of `BaseTokenizer`. See
        `pytorch_widedeep.utils.fastai_transforms.BaseTokenizer`

    Returns
    -------
    List[str]
        List of tokens
    """
    for rule in self.pre_rules:
        t = rule(t)
    toks = tok.tokenizer(t)
    for rule in self.post_rules:
        toks = rule(toks)
    return toks

process_all

process_all(texts)

Process a list of texts. Parallel execution of process_text.

Examples:

>>> from pytorch_widedeep.utils import Tokenizer
>>> texts = ['Machine learning is great', 'but building stuff is even better']
>>> tok = Tokenizer()
>>> tok.process_all(texts)
[['xxmaj', 'machine', 'learning', 'is', 'great'], ['but', 'building', 'stuff', 'is', 'even', 'better']]

ℹ️ NOTE: note the token TK_MAJ (xxmaj), used to indicate that the next word begins with a capital in the original text. For more details on special tokens, please see the fastai docs: https://docs.fast.ai/text.core.html#Tokenizing

Returns:

List[List[str]]
    List containing lists of tokens. One list per "document"
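When n_cpus is 1 (or lower) the texts are processed in a single process, which can be convenient where multiprocessing is costly or problematic (e.g. inside some notebooks). A minimal sketch:

from pytorch_widedeep.utils import Tokenizer

texts = ['Machine learning is great', 'but building stuff is even better']

# n_cpus=1 skips the multiprocessing Pool and tokenizes everything in-process
tokens = Tokenizer(n_cpus=1).process_all(texts)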

Source code in pytorch_widedeep/utils/fastai_transforms.py
def process_all(self, texts: Collection[str]) -> List[List[str]]:
    r"""Process a list of texts. Parallel execution of ``process_text``.

    Examples
    --------
    >>> from pytorch_widedeep.utils import Tokenizer
    >>> texts = ['Machine learning is great', 'but building stuff is even better']
    >>> tok = Tokenizer()
    >>> tok.process_all(texts)
    [['xxmaj', 'machine', 'learning', 'is', 'great'], ['but', 'building', 'stuff', 'is', 'even', 'better']]

    :information_source: **NOTE**:
    Note the token ``TK_MAJ`` (`xxmaj`), used to indicate the
    next word begins with a capital in the original text. For more
    details of special tokens please see the [``fastai`` docs](https://docs.fast.ai/text.core.html#Tokenizing).

    Returns
    -------
    List[List[str]]
        List containing lists of tokens. One list per "_document_"

    """

    if self.n_cpus <= 1:
        return self._process_all_1(texts)

    with Pool(self.n_cpus) as p:
        partitioned_texts = partition_by_cores(texts, self.n_cpus)
        results = p.map(self._process_all_1, partitioned_texts)
        res = sum(results, [])
    return res

Vocab

Contains the correspondence between numbers and tokens.

Parameters:

max_vocab: int
    maximum vocabulary size
min_freq: int
    minimum frequency for a token to be considered
pad_idx: Optional[int], default = None
    padding index. If None, Fastai's Tokenizer leaves the 0 index for the unknown token ('xxunk') and defaults to 1 for the padding token ('xxpad').
special_cases: Optional[Collection[str]], default = None
    special tokens added at the front of the vocabulary (defaults to fastai's special tokens)

Attributes:

itos: Collection
    index to str. Collection of strings that are the tokens of the vocabulary
stoi: defaultdict
    str to index. Dictionary containing the tokens of the vocabulary and their corresponding index
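Because Vocab also exposes fit, transform and inverse_transform aliases (documented below), it can be used in an sklearn-like fashion. A minimal sketch:

from pytorch_widedeep.utils import Tokenizer, Vocab

texts = ['Machine learning is great', 'but building stuff is even better']
tokens = Tokenizer().process_all(texts)

vocab = Vocab(max_vocab=100, min_freq=1).fit(tokens)  # alias of create
ids = vocab.transform(tokens[0])                      # alias of numericalize
text = vocab.inverse_transform(ids)                   # alias of textify

# itos maps indices to tokens and stoi maps tokens to indices
assert vocab.itos[vocab.stoi['xxunk']] == 'xxunk'
assert len(vocab.itos) % 8 == 0  # padded with 'xxfake' up to a multiple of 8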

Source code in pytorch_widedeep/utils/fastai_transforms.py
class Vocab:
    r"""Contains the correspondence between numbers and tokens.

    Parameters
    ----------
    max_vocab: int
        maximum vocabulary size
    min_freq: int
        minimum frequency for a token to be considered
    pad_idx: int, Optional, default = None
        padding index. If `None`, Fastai's Tokenizer leaves the 0 index
        for the unknown token (_'xxunk'_) and defaults to 1 for the padding
        token (_'xxpad'_).

    Attributes
    ----------
    itos: Collection
        `index to str`. Collection of strings that are the tokens of the
        vocabulary
    stoi: defaultdict
        `str to index`. Dictionary containing the tokens of the vocabulary and
        their corresponding index
    """

    def __init__(
        self,
        max_vocab: int,
        min_freq: int,
        pad_idx: Optional[int] = None,
        special_cases: Optional[Collection[str]] = None,
    ):
        self.max_vocab = max_vocab
        self.min_freq = min_freq
        self.pad_idx = pad_idx
        self.special_cases = (
            special_cases if special_cases is not None else defaults.text_spec_tok
        )

    def create(
        self,
        tokens: Tokens,
    ) -> "Vocab":
        r"""Create a vocabulary object from a set of tokens.

        Parameters
        ----------
        tokens: Tokens
            Custom type: ``Collection[Collection[str]]``  see
            `pytorch_widedeep.wdtypes`. Collection of collection of
            strings (e.g. list of tokenized sentences)

        Examples
        --------
        >>> from pytorch_widedeep.utils import Tokenizer, Vocab
        >>> texts = ['Machine learning is great', 'but building stuff is even better']
        >>> tokens = Tokenizer().process_all(texts)
        >>> vocab = Vocab(max_vocab=18, min_freq=1).create(tokens)
        >>> vocab.numericalize(['machine', 'learning', 'is', 'great'])
        [10, 11, 9, 12]
        >>> vocab.textify([10, 11, 9, 12])
        'machine learning is great'

        :information_source: **NOTE**:
        Note the many special tokens that ``fastai``'s' tokenizer adds. These
        are particularly useful when building Language models and/or in
        classification/Regression tasks. Please see the [``fastai`` docs](https://docs.fast.ai/text.core.html#Tokenizing).

        Returns
        -------
        Vocab
            An instance of a `Vocab` object
        """

        freq = Counter(p for o in tokens for p in o)
        itos = [o for o, c in freq.most_common(self.max_vocab) if c >= self.min_freq]
        for o in reversed(self.special_cases):  # type: ignore[arg-type]
            if o in itos:
                itos.remove(o)
            itos.insert(0, o)

        if self.pad_idx is not None and self.pad_idx != 1:
            itos.remove(PAD)
            itos.insert(self.pad_idx, PAD)
            # get the new 'xxunk' index
            xxunk_idx = np.where([el == "xxunk" for el in itos])[0][0]
        else:
            xxunk_idx = 0

        itos = itos[: self.max_vocab]
        if (
            len(itos) < self.max_vocab
        ):  # Make sure vocab size is a multiple of 8 for fast mixed precision training
            while len(itos) % 8 != 0:
                itos.append("xxfake")

        self.itos = itos
        self.stoi = defaultdict(
            lambda: xxunk_idx, {v: k for k, v in enumerate(self.itos)}
        )

        return self

    def fit(
        self,
        tokens: Tokens,
    ) -> "Vocab":
        """
        Calls the `create` method. I simply want to honor fast ai naming, but
        for consistency with the rest of the library I am including a fit method
        """
        return self.create(tokens)

    def numericalize(self, t: Collection[str]) -> List[int]:
        """Convert a list of tokens ``t`` to their ids.

        Returns
        -------
        List[int]
            List of '_numericalised_' tokens
        """
        return [self.stoi[w] for w in t]

    def transform(self, t: Collection[str]) -> List[int]:
        """
        Calls the `numericalize` method. I simply want to honor fast ai naming,
        but for consistency with the rest of the library I am including a
        transform method
        """
        return self.numericalize(t)

    def textify(self, nums: Collection[int], sep=" ") -> Union[str, List[str]]:
        """Convert a list of ``nums`` (or indexes) to their tokens.

        Returns
        -------
        List[str]
            List of tokens
        """
        return (
            sep.join([self.itos[i] for i in nums])
            if sep is not None
            else [self.itos[i] for i in nums]
        )

    def inverse_transform(
        self, nums: Collection[int], sep=" "
    ) -> Union[str, List[str]]:
        """
        Calls the `textify` method. I simply want to honor fast ai naming, but
        for consistency with the rest of the library I am including an
        inverse_transform method
        """
        # I simply want to honor fast ai naming, but for consistency with the
        # rest of the library I am including an inverse_transform method
        return self.textify(nums, sep)

    def __getstate__(self):
        return {"itos": self.itos}

    def __setstate__(self, state: dict):
        self.itos = state["itos"]
        self.stoi = defaultdict(int, {v: k for k, v in enumerate(self.itos)})

create

create(tokens)

Create a vocabulary object from a set of tokens.

Parameters:

tokens: Tokens
    Custom type: Collection[Collection[str]], see pytorch_widedeep.wdtypes. Collection of collections of strings (e.g. a list of tokenized sentences)

Examples:

>>> from pytorch_widedeep.utils import Tokenizer, Vocab
>>> texts = ['Machine learning is great', 'but building stuff is even better']
>>> tokens = Tokenizer().process_all(texts)
>>> vocab = Vocab(max_vocab=18, min_freq=1).create(tokens)
>>> vocab.numericalize(['machine', 'learning', 'is', 'great'])
[10, 11, 9, 12]
>>> vocab.textify([10, 11, 9, 12])
'machine learning is great'

ℹ️ NOTE: note the many special tokens that fastai's tokenizer adds. These are particularly useful when building language models and/or in classification/regression tasks. For more details, please see the fastai docs: https://docs.fast.ai/text.core.html#Tokenizing

Returns:

Vocab
    An instance of a Vocab object
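The pad_idx argument controls where the padding token ends up in the resulting vocabulary. A minimal sketch (it relies on fastai's default special tokens, where 'xxpad' sits at index 1):

from pytorch_widedeep.utils import Tokenizer, Vocab

tokens = Tokenizer().process_all(['Machine learning is great'])

default_vocab = Vocab(max_vocab=100, min_freq=1).create(tokens)
moved_vocab = Vocab(max_vocab=100, min_freq=1, pad_idx=0).create(tokens)

print(default_vocab.stoi['xxpad'])  # 1, the default position
print(moved_vocab.stoi['xxpad'])    # 0, 'xxpad' was moved to the requested index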

Source code in pytorch_widedeep/utils/fastai_transforms.py
def create(
    self,
    tokens: Tokens,
) -> "Vocab":
    r"""Create a vocabulary object from a set of tokens.

    Parameters
    ----------
    tokens: Tokens
        Custom type: ``Collection[Collection[str]]``  see
        `pytorch_widedeep.wdtypes`. Collection of collection of
        strings (e.g. list of tokenized sentences)

    Examples
    --------
    >>> from pytorch_widedeep.utils import Tokenizer, Vocab
    >>> texts = ['Machine learning is great', 'but building stuff is even better']
    >>> tokens = Tokenizer().process_all(texts)
    >>> vocab = Vocab(max_vocab=18, min_freq=1).create(tokens)
    >>> vocab.numericalize(['machine', 'learning', 'is', 'great'])
    [10, 11, 9, 12]
    >>> vocab.textify([10, 11, 9, 12])
    'machine learning is great'

    :information_source: **NOTE**:
    Note the many special tokens that ``fastai``'s' tokenizer adds. These
    are particularly useful when building Language models and/or in
    classification/Regression tasks. Please see the [``fastai`` docs](https://docs.fast.ai/text.core.html#Tokenizing).

    Returns
    -------
    Vocab
        An instance of a `Vocab` object
    """

    freq = Counter(p for o in tokens for p in o)
    itos = [o for o, c in freq.most_common(self.max_vocab) if c >= self.min_freq]
    for o in reversed(self.special_cases):  # type: ignore[arg-type]
        if o in itos:
            itos.remove(o)
        itos.insert(0, o)

    if self.pad_idx is not None and self.pad_idx != 1:
        itos.remove(PAD)
        itos.insert(self.pad_idx, PAD)
        # get the new 'xxunk' index
        xxunk_idx = np.where([el == "xxunk" for el in itos])[0][0]
    else:
        xxunk_idx = 0

    itos = itos[: self.max_vocab]
    if (
        len(itos) < self.max_vocab
    ):  # Make sure vocab size is a multiple of 8 for fast mixed precision training
        while len(itos) % 8 != 0:
            itos.append("xxfake")

    self.itos = itos
    self.stoi = defaultdict(
        lambda: xxunk_idx, {v: k for k, v in enumerate(self.itos)}
    )

    return self

fit

fit(tokens)

Calls the create method. I simply want to honor the fastai naming but, for consistency with the rest of the library, I also include a fit method.

Source code in pytorch_widedeep/utils/fastai_transforms.py
def fit(
    self,
    tokens: Tokens,
) -> "Vocab":
    """
    Calls the `create` method. I simply want to honor fast ai naming, but
    for consistency with the rest of the library I am including a fit method
    """
    return self.create(tokens)

numericalize

numericalize(t)

Convert a list of tokens t to their ids.

Returns:

List[int]
    List of 'numericalised' tokens

Source code in pytorch_widedeep/utils/fastai_transforms.py
def numericalize(self, t: Collection[str]) -> List[int]:
    """Convert a list of tokens ``t`` to their ids.

    Returns
    -------
    List[int]
        List of '_numericalised_' tokens
    """
    return [self.stoi[w] for w in t]

transform

transform(t)

Calls the numericalize method. I simply want to honor the fastai naming but, for consistency with the rest of the library, I also include a transform method.

Source code in pytorch_widedeep/utils/fastai_transforms.py
def transform(self, t: Collection[str]) -> List[int]:
    """
    Calls the `numericalize` method. I simply want to honor fast ai naming,
    but for consistency with the rest of the library I am including a
    transform method
    """
    return self.numericalize(t)

textify

textify(nums, sep=' ')

Convert a list of nums (or indexes) to their tokens.

Returns:

Union[str, List[str]]
    The tokens joined by sep into a single string, or a list of tokens if sep is None
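The sep argument therefore controls the return type. A minimal, self-contained sketch (the vocabulary matches the one built in the create example above):

from pytorch_widedeep.utils import Tokenizer, Vocab

texts = ['Machine learning is great', 'but building stuff is even better']
vocab = Vocab(max_vocab=18, min_freq=1).create(Tokenizer().process_all(texts))

ids = vocab.numericalize(['machine', 'learning', 'is', 'great'])
vocab.textify(ids)            # 'machine learning is great' (a single string)
vocab.textify(ids, sep=None)  # ['machine', 'learning', 'is', 'great']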

Source code in pytorch_widedeep/utils/fastai_transforms.py
def textify(self, nums: Collection[int], sep=" ") -> Union[str, List[str]]:
    """Convert a list of ``nums`` (or indexes) to their tokens.

    Returns
    -------
    List[str]
        List of tokens
    """
    return (
        sep.join([self.itos[i] for i in nums])
        if sep is not None
        else [self.itos[i] for i in nums]
    )

inverse_transform

inverse_transform(nums, sep=' ')

Calls the textify method. I simply want to honor the fastai naming but, for consistency with the rest of the library, I also include an inverse_transform method.

Source code in pytorch_widedeep/utils/fastai_transforms.py
def inverse_transform(
    self, nums: Collection[int], sep=" "
) -> Union[str, List[str]]:
    """
    Calls the `textify` method. I simply want to honor fast ai naming, but
    for consistency with the rest of the library I am including an
    inverse_transform method
    """
    # I simply want to honor fast ai naming, but for consistency with the
    # rest of the library I am including an inverse_transform method
    return self.textify(nums, sep)