
torchtext.data.functional

generate_sp_model

torchtext.data.functional.generate_sp_model(filename, vocab_size=20000, model_type='unigram', model_prefix='m_user')[source]

Train a SentencePiece tokenizer.

Parameters:
  • filename – the data file for training the SentencePiece model.
  • vocab_size – the size of the vocabulary (Default: 20,000).
  • model_type – the type of SentencePiece model; one of unigram, bpe, char, word.
  • model_prefix – the prefix of the files saving the model and vocab.
Outputs:
The model and vocab are saved in two separate files named with
model_prefix (model_prefix.model and model_prefix.vocab).

Examples

>>> from torchtext.data.functional import generate_sp_model
>>> generate_sp_model('test.csv', vocab_size=23456, model_prefix='spm_user')
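
After the call above completes (assuming test.csv exists and training succeeds), the two output files can be checked for and the trained model reloaded with load_sp_model, documented below:

>>> import os
>>> os.path.exists('spm_user.model'), os.path.exists('spm_user.vocab')
(True, True)
>>> from torchtext.data.functional import load_sp_model
>>> sp_model = load_sp_model('spm_user.model')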

load_sp_model

torchtext.data.functional.load_sp_model(spm_path)[source]

Load a SentencePiece model from a file.

Parameters: spm_path – the file path of the saved SentencePiece model.
Outputs:
output: a SentencePiece model.

Examples

>>> from torchtext.data.functional import load_sp_model
>>> sp_model = load_sp_model("m_user.model")
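
The loaded model is the object expected by the sentencepiece_numericalizer and sentencepiece_tokenizer helpers below, e.g. (assuming m_user.model was trained with generate_sp_model):

>>> from torchtext.data.functional import sentencepiece_tokenizer
>>> sp_tokens_generator = sentencepiece_tokenizer(sp_model)
>>> next(sp_tokens_generator(["sentencepiece encode as pieces"]))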

sentencepiece_numericalizer

torchtext.data.functional.sentencepiece_numericalizer(sp_model)[source]
A sentencepiece model to numericalize text sentences into
a generator over the ids.
Parameters: sp_model – a SentencePiece model.
Outputs:
output: a generator that takes an iterable of text sentences and yields the
corresponding ids based on the SentencePiece model.

Examples

>>> from torchtext.data.functional import sentencepiece_numericalizer
>>> sp_id_generator = sentencepiece_numericalizer(sp_model)
>>> list_a = ["sentencepiece encode as pieces", "examples to   try!"]
>>> list(sp_id_generator(list_a))
    [[9858, 9249, 1629, 1305, 1809, 53, 842],
     [2347, 13, 9, 150, 37]]
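
For reference, the returned generator behaves roughly like the following minimal sketch (an assumption about the implementation, relying only on SentencePiece's standard EncodeAsIds method):

>>> def sp_id_generator_sketch(txt_iter):
...     # numericalize each sentence with the SentencePiece model loaded above
...     for line in txt_iter:
...         yield sp_model.EncodeAsIds(line)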

sentencepiece_tokenizer

torchtext.data.functional.sentencepiece_tokenizer(sp_model)[source]
A sentencepiece model to tokenize text sentences into
a generator over the tokens.
Parameters: sp_model – a SentencePiece model.
Outputs:
output: a generator that takes an iterable of text sentences and yields the
corresponding tokens based on the SentencePiece model.

Examples

>>> from torchtext.data.functional import sentencepiece_tokenizer
>>> sp_tokens_generator = sentencepiece_tokenizer(sp_model)
>>> list_a = ["sentencepiece encode as pieces", "examples to   try!"]
>>> list(sp_tokens_generator(list_a))
    [['▁sentence', 'piece', '▁en', 'co', 'de', '▁as', '▁pieces'],
     ['▁example', 's', '▁to', '▁try', '!']]
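
As with the numericalizer, the generator behaves roughly like this minimal sketch (an assumption about the implementation, using SentencePiece's standard EncodeAsPieces method):

>>> def sp_tokens_generator_sketch(txt_iter):
...     # tokenize each sentence into subword pieces with the SentencePiece model
...     for line in txt_iter:
...         yield sp_model.EncodeAsPieces(line)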

custom_replace

torchtext.data.functional.custom_replace(replace_pattern)[source]

A transform to convert text strings by applying a list of (pattern, replacement) regex substitutions.

Parameters: replace_pattern – a list of (pattern, replacement) pairs applied in order to each string.

Examples

>>> from torchtext.data.functional import custom_replace
>>> custom_replace_transform = custom_replace([(r'S', 's'), (r'\s+', ' ')])
>>> list_a = ["Sentencepiece encode  aS  pieces", "exampleS to   try!"]
>>> list(custom_replace_transform(list_a))
    ['sentencepiece encode as pieces', 'examples to try!']
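
A minimal sketch of the behaviour shown above, assuming each (pattern, replacement) pair is applied to every string in order with re.sub:

>>> import re
>>> def custom_replace_sketch(replace_pattern, txt_iter):
...     # pre-compile the patterns, then apply them to each string in order
...     compiled = [(re.compile(p), r) for p, r in replace_pattern]
...     for line in txt_iter:
...         for pattern_re, replacement in compiled:
...             line = pattern_re.sub(replacement, line)
...         yield line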

simple_space_split

torchtext.data.functional.simple_space_split(iterator)[source]

A transform to split text strings by spaces.

Examples

>>> from torchtext.data.functional import simple_space_split
>>> list_a = ["Sentencepiece encode as pieces", "example to try!"]
>>> list(simple_space_split(list_a))
    [['Sentencepiece', 'encode', 'as', 'pieces'], ['example', 'to', 'try!']]
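
Equivalently, a minimal sketch of what the transform does, assuming each string is split on whitespace with str.split():

>>> def simple_space_split_sketch(iterator):
...     # split each string on runs of whitespace
...     for line in iterator:
...         yield line.split()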

numericalize_tokens_from_iterator

torchtext.data.functional.numericalize_tokens_from_iterator(vocab, iterator, removed_tokens=None)[source]

Yield lists of ids from a token iterator using a vocab.

Parameters:
  • vocab – the vocabulary used to convert tokens into ids.
  • iterator – an iterator yielding lists of tokens.
  • removed_tokens – tokens to remove from the output (Default: None).

Examples

>>> from torchtext.data.functional import simple_space_split
>>> from torchtext.data.functional import numericalize_tokens_from_iterator
>>> vocab = {'Sentencepiece' : 0, 'encode' : 1, 'as' : 2, 'pieces' : 3}
>>> ids_iter = numericalize_tokens_from_iterator(vocab,
...                               simple_space_split(["Sentencepiece as pieces",
...                                                   "as pieces"]))
>>> for ids in ids_iter:
...     print([num for num in ids])
[0, 2, 3]
[2, 3]
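
For reference, a minimal sketch of the yielded values, assuming any removed_tokens are filtered out before the vocab lookup:

>>> def numericalize_sketch(vocab, iterator, removed_tokens=None):
...     # yield an iterator of ids for each list of tokens
...     for tokens in iterator:
...         if removed_tokens is None:
...             yield iter(vocab[token] for token in tokens)
...         else:
...             yield iter(vocab[token] for token in tokens
...                        if token not in removed_tokens)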