torchtext.data.utils

get_tokenizer

torchtext.data.utils.get_tokenizer(tokenizer, language='en')

Generate a tokenizer function for a string sentence.

Parameters:

tokenizer – the name of the tokenizer function. If None, returns the split() function, which splits the string sentence on whitespace. If basic_english, returns the _basic_english_normalize() function, which normalizes the string first and then splits it on whitespace. If a callable, returns that callable unchanged. If the name of a tokenizer library (e.g. spacy, moses, toktok, revtok, subword), returns the tokenizer from the corresponding library.
language – the language of the input text. Default: en

Examples

>>> import torchtext
>>> from torchtext.data import get_tokenizer
>>> tokenizer = get_tokenizer("basic_english")
>>> tokens = tokenizer("You can now install TorchText using pip!")
>>> tokens
['you', 'can', 'now', 'install', 'torchtext', 'using', 'pip', '!']
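
The dispatch rules described under Parameters can be sketched in plain Python without installing torchtext. This is only an illustration: pick_tokenizer and simple_english_normalize are hypothetical names, and the normalization shown (lowercasing and padding a few punctuation marks with spaces) is an assumed simplification of _basic_english_normalize(), not a copy of it. The library-backed cases (spacy, moses, and so on) are left out.

```python
import re

# Hypothetical helpers for illustration; not part of torchtext.
_PUNCT = re.compile(r"([.,!?()'])")


def simple_english_normalize(line):
    """Lowercase, pad common punctuation with spaces, then split on whitespace."""
    return _PUNCT.sub(r" \1 ", line.lower()).split()


def pick_tokenizer(tokenizer):
    """Mirror three of the documented cases: None, 'basic_english', callable."""
    if tokenizer is None:
        return str.split  # plain whitespace split
    if tokenizer == "basic_english":
        return simple_english_normalize
    if callable(tokenizer):
        return tokenizer  # user-supplied callables pass through unchanged
    raise ValueError(f"unsupported tokenizer in this sketch: {tokenizer!r}")


tokenize = pick_tokenizer("basic_english")
print(tokenize("You can now install TorchText using pip!"))
```

Passing a callable is the simplest way to plug in a custom rule, since it is returned as-is and no library lookup takes place.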

ngrams_iterator

torchtext.data.utils.ngrams_iterator(token_list, ngrams)

Return an iterator that yields the given tokens and their ngrams.

Parameters:

token_list – a list of tokens
ngrams – the maximum n-gram length; the iterator yields the tokens themselves plus every n-gram up to this length.

Examples

>>> from torchtext.data.utils import ngrams_iterator
>>> token_list = ['here', 'we', 'are']
>>> list(ngrams_iterator(token_list, 2))
['here', 'here we', 'we', 'we are', 'are']
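
To make the idea of "tokens and their ngrams" concrete, here is a minimal pure-Python sketch of one way such an iterator could be written. ngrams_sketch is a hypothetical stand-in, not the library function, and the order in which it emits items (all single tokens first, then longer n-grams) is an assumption of this sketch; compare results as sets or sorted lists if order matters.

```python
from typing import Iterator, List


def ngrams_sketch(token_list: List[str], ngrams: int) -> Iterator[str]:
    """Yield every token, then every space-joined n-gram for n = 2..ngrams.

    Hypothetical illustration; the emission order is an assumption.
    """
    yield from token_list
    for n in range(2, ngrams + 1):
        # Zipping n shifted views of the list produces each window of n tokens.
        for window in zip(*(token_list[i:] for i in range(n))):
            yield " ".join(window)


tokens = ["here", "we", "are"]
print(sorted(ngrams_sketch(tokens, 2)))
```

A list of k tokens contributes k unigrams, k - 1 bigrams, k - 2 trigrams, and so on, so the output grows roughly linearly with the ngrams argument for long inputs.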
