Shortcuts

torchtext.data.utils

get_tokenizer

torchtext.data.utils.get_tokenizer(tokenizer, language='en')[source]

Generate tokenizer function for a string sentence.

Parameters:
  • tokenizer – the name of tokenizer function. If None, it returns split() function, which splits the string sentence by space. If basic_english, it returns _basic_english_normalize() function, which normalize the string first and split by space. If a callable function, it will return the function. If a tokenizer library (e.g. spacy, moses, toktok, revtok, subword), it returns the corresponding library.
  • language – Default en

Examples

>>> import torchtext
>>> from torchtext.data import get_tokenizer
>>> tokenizer = get_tokenizer("basic_english")
>>> tokens = tokenizer("You can now install TorchText using pip!")
>>> tokens
>>> ['you', 'can', 'now', 'install', 'torchtext', 'using', 'pip', '!']

ngrams_iterator

torchtext.data.utils.ngrams_iterator(token_list, ngrams)[source]

Return an iterator that yields the given tokens and their ngrams.

Parameters:
  • token_list – A list of tokens
  • ngrams – the number of ngrams.

Examples

>>> token_list = ['here', 'we', 'are']
>>> list(ngrams_iterator(token_list, 2))
>>> ['here', 'here we', 'we', 'we are', 'are']
Read the Docs v: latest
Versions
latest
stable
Downloads
On Read the Docs
Project Home
Builds

Free document hosting provided by Read the Docs.

Docs

Access comprehensive developer documentation for PyTorch

View Docs

Tutorials

Get in-depth tutorials for beginners and advanced developers

View Tutorials

Resources

Find development resources and get your questions answered

View Resources