Nlp_tools subpackage melusine.nlp_tools
This subpackage gathers the NLP tools used by Melusine, exposed through the tokenizer, phraser and embedding submodules.
List of submodules
Tokenizer melusine.nlp_tools.tokenizer
- class melusine.nlp_tools.tokenizer.RegexTokenizer(tokenizer_regex: str = '\\w+(?:[\\?\\-\\"_]\\w+)*', stopwords: Optional[List[str]] = None)
Bases: object

Tokenize text using a regex split pattern.

- FILENAME = 'tokenizer.json'
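A minimal usage sketch (illustrative, not part of the original documentation); it assumes the class exposes a tokenize(text) method, and the stopword list below is a placeholder:

>>> from melusine.nlp_tools.tokenizer import RegexTokenizer
>>> tokenizer = RegexTokenizer(
...     tokenizer_regex='\\w+(?:[\\?\\-\\"_]\\w+)*',  # default split pattern
...     stopwords=["le", "la", "les"],  # illustrative stopword list
... )
>>> tokens = tokenizer.tokenize("bonjour, merci pour votre retour")  # noqa  (assumed method name)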
Phraser melusine.nlp_tools.phraser
Embedding melusine.nlp_tools.embedding
- class melusine.nlp_tools.embedding.Embedding(tokens_column=None, workers=40, random_seed=42, iter=15, size=300, method='word2vec_cbow', min_count=100)
Bases: object

Class to train word embeddings, either with the Word2Vec algorithm or with LSA (see the method attribute below).

Attributes

- word2id : dict,
  Word vocabulary (key: word, value: word_index).
- embedding : Gensim KeyedVectors instance,
  Gensim KeyedVectors instance for the trained or imported embedding.
- method : str,
  One of the following:
  - "word2vec_sg": trains a Word2Vec embedding using the Skip-Gram method (usually takes a long time).
  - "word2vec_cbow": trains a Word2Vec embedding using the Continuous Bag-Of-Words method.
  - "lsa_docterm": trains an embedding by applying an SVD to a Document-Term matrix.
  - "lsa_tfidf": trains an embedding by applying an SVD to a TF-IDF-weighted Document-Term matrix.
- train_params : dict,
  Additional parameters for the embedding training. Refer to the following documentation:
  - gensim.models.Word2Vec for Word2Vec embeddings
  - sklearn.decomposition.TruncatedSVD for LSA embeddings
  If left untouched, the default training values from the aforementioned packages are kept.
>>> from melusine.nlp_tools.embedding import Embedding
>>> embedding = Embedding()
>>> embedding.train(X)  # noqa
>>> embedding.save(filepath)  # noqa
>>> embedding = Embedding().load(filepath)  # noqa
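The sketch below is illustrative rather than part of the original documentation: it selects a non-default training method through the documented constructor parameters and looks up a word vector on the embedding attribute, which the docstring describes as a Gensim KeyedVectors instance. The column name "tokens", the corpus X and the word "client" are placeholders:

>>> from melusine.nlp_tools.embedding import Embedding
>>> embedding = Embedding(
...     tokens_column="tokens",  # column holding the token lists (placeholder name)
...     method="lsa_tfidf",      # SVD on a TF-IDF-weighted Document-Term matrix
...     size=200,                # embedding dimension
...     min_count=50,            # ignore words with fewer occurrences
... )
>>> embedding.train(X)  # noqa
>>> vector = embedding.embedding["client"]  # noqa  (KeyedVectors word lookup, illustrative)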