Text utils¶
Collection of helper functions that facilitate processing text.
simple_preprocess ¶
simple_preprocess(doc, lower=False, deacc=False, min_len=2, max_len=15)
This is Gensim's `simple_preprocess` with a `lower` parameter to indicate whether or not to lower case all the tokens in the doc.
For more information see: Gensim utils module.
Parameters:
- doc (str) – Input document.
- lower (bool, default: False) – Lower case tokens in the input doc.
- deacc (bool, default: False) – Remove accent marks from tokens using Gensim's deaccent.
- min_len (int, default: 2) – Minimum length of token (inclusive). Shorter tokens are discarded.
- max_len (int, default: 15) – Maximum length of token in result (inclusive). Longer tokens are discarded.
Examples:
>>> from pytorch_widedeep.utils import simple_preprocess
>>> simple_preprocess('Machine learning is great')
['Machine', 'learning', 'is', 'great']
Returns:
- List[str] – List with the processed tokens.
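To illustrate what the `lower`, `min_len` and `max_len` parameters do, here is a minimal pure-Python sketch of the filtering logic. This is an illustrative approximation, not Gensim's actual implementation (which tokenizes Unicode text and handles de-accenting); the function name `simple_preprocess_sketch` and the ASCII-only regex are assumptions for the example.

```python
import re

def simple_preprocess_sketch(doc, lower=False, min_len=2, max_len=15):
    # Illustrative sketch: split into alphabetic tokens, optionally lower
    # case them, then keep only tokens whose length is within bounds.
    tokens = re.findall(r"[A-Za-z]+", doc)
    if lower:
        tokens = [t.lower() for t in tokens]
    return [t for t in tokens if min_len <= len(t) <= max_len]

print(simple_preprocess_sketch("Machine learning is great"))
# ['Machine', 'learning', 'is', 'great']
print(simple_preprocess_sketch("Machine learning is great", lower=True))
# ['machine', 'learning', 'is', 'great']
```

Note that with the default `lower=False` the original casing is preserved, matching the example above.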
Source code in pytorch_widedeep/utils/text_utils.py
get_texts ¶
get_texts(texts, already_processed=False, n_cpus=None)
Tokenization using Fastai's Tokenizer because it does a series of very convenient things during the tokenization process.
See pytorch_widedeep.utils.fastai_transforms.Tokenizer
Parameters:
- texts (List[str]) – List of str with the texts (or documents). One str per document.
- already_processed (Optional[bool], default: False) – Boolean indicating if the text is already processed and we simply want to tokenize it. This parameter is intended for those cases where the input sequences might not be text (but IDs, or anything else) and we just want to tokenize them.
- n_cpus (Optional[int], default: None) – number of CPUs used during the tokenization process.
Examples:
>>> from pytorch_widedeep.utils import get_texts
>>> texts = ['Machine learning is great', 'but building stuff is even better']
>>> get_texts(texts)
[['xxmaj', 'machine', 'learning', 'is', 'great'], ['but', 'building', 'stuff', 'is', 'even', 'better']]
Returns:
- List[List[str]] – List of lists, one list per 'document' containing its corresponding tokens.

**NOTE**: `get_texts` uses `pytorch_widedeep.utils.fastai_transforms.Tokenizer`. Such tokenizer uses a series of convenient processing steps, including the addition of some special tokens, such as `TK_MAJ` (`xxmaj`), used to indicate the next word begins with a capital in the original text. For more details on special tokens please see the [`fastai` docs](https://docs.fast.ai/text.core.html#Tokenizing).
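To make the `xxmaj` behaviour in the example above concrete, here is a toy sketch of the `TK_MAJ` rule. This is not fastai's actual code (which also handles all-caps tokens and other special cases); the function name `add_maj_tokens` is an assumption for this illustration.

```python
def add_maj_tokens(tokens):
    # Illustrative sketch of fastai's TK_MAJ rule: a token starting with a
    # capital letter is lower cased and preceded by the special token 'xxmaj'.
    out = []
    for tok in tokens:
        if tok[:1].isupper():
            out.extend(["xxmaj", tok.lower()])
        else:
            out.append(tok)
    return out

print(add_maj_tokens(["Machine", "learning", "is", "great"]))
# ['xxmaj', 'machine', 'learning', 'is', 'great']
```

This reproduces the first tokenized document in the example output above.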
Source code in pytorch_widedeep/utils/text_utils.py
pad_sequences ¶
pad_sequences(seq, maxlen, pad_first=True, pad_idx=1)
Given a List of tokenized and numericalised sequences it will return padded sequences according to the input parameters.
Parameters:
- seq (List[int]) – List of int with the numericalised tokens.
- maxlen (int) – Maximum length of the padded sequences.
- pad_first (bool, default: True) – Indicates whether the padding index will be added at the beginning or the end of the sequences.
- pad_idx (int, default: 1) – padding index. Fastai's Tokenizer leaves 0 for the 'unknown' token.
Examples:
>>> from pytorch_widedeep.utils import pad_sequences
>>> seq = [1,2,3]
>>> pad_sequences(seq, maxlen=5, pad_idx=0)
array([0, 0, 1, 2, 3], dtype=int32)
Returns:
- ndarray – numpy array with the padded sequences.
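The padding logic can be sketched in a few lines of numpy. This is an illustrative re-implementation, not the library's source; in particular, the truncation rule for sequences longer than `maxlen` (keeping the last `maxlen` tokens) is an assumption here, and the function name `pad_sequences_sketch` is made up for the example.

```python
import numpy as np

def pad_sequences_sketch(seq, maxlen, pad_first=True, pad_idx=1):
    # Illustrative sketch: fill an array with the padding index, then copy
    # the sequence into its end (pad_first=True) or start (pad_first=False).
    if len(seq) >= maxlen:
        # Assumption: keep the last `maxlen` tokens when the sequence is too long.
        return np.array(seq[-maxlen:], dtype="int32")
    res = np.full(maxlen, pad_idx, dtype="int32")
    if pad_first:
        res[-len(seq):] = seq
    else:
        res[: len(seq)] = seq
    return res

print(pad_sequences_sketch([1, 2, 3], maxlen=5, pad_idx=0))
# [0 0 1 2 3]
```

With `pad_first=False` the same call yields `[1 2 3 0 0]`, i.e. the padding moves to the end.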
Source code in pytorch_widedeep/utils/text_utils.py
build_embeddings_matrix ¶
build_embeddings_matrix(vocab, word_vectors_path, min_freq, verbose=1)
Build the embedding matrix using pretrained word vectors.
Returns pretrained word embeddings. If a word in our vocabulary is not among the pretrained embeddings, it will be assigned the mean pretrained word-embedding vector.
Parameters:
- vocab (Union[Vocab, ChunkVocab]) – see pytorch_widedeep.utils.fastai_transforms.Vocab.
- word_vectors_path (str) – path to the pretrained word embeddings.
- min_freq (int) – minimum frequency required for a word to be in the vocabulary.
- verbose (int, default: 1) – level of verbosity. Set to 0 for no verbosity.
Returns:
- ndarray – Pretrained word embeddings.
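The mean-vector fallback described above can be sketched with a toy vocabulary and an in-memory dict of "pretrained" vectors. This is an illustrative re-implementation, not the library's source: the function name `build_embeddings_matrix_sketch`, the `itos` (index-to-string) argument, and the toy vectors are all assumptions for the example.

```python
import numpy as np

def build_embeddings_matrix_sketch(itos, pretrained):
    # Illustrative sketch of the fallback rule: a word missing from the
    # pretrained vectors is assigned the mean of all pretrained vectors.
    vectors = np.stack(list(pretrained.values()))
    mean_vec = vectors.mean(axis=0)
    return np.stack([pretrained.get(w, mean_vec) for w in itos])

# Toy "pretrained" 2-dim word vectors
pretrained = {"cat": np.array([1.0, 0.0]), "dog": np.array([0.0, 1.0])}
emb = build_embeddings_matrix_sketch(["cat", "dog", "unicorn"], pretrained)
print(emb[2])  # 'unicorn' is out-of-vocabulary -> mean vector [0.5 0.5]
```

The real function reads the vectors from `word_vectors_path` and builds the vocabulary from the `vocab` object, but the fallback logic is the same idea.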
Source code in pytorch_widedeep/utils/text_utils.py