Nlp_tools subpackage melusine.nlp_tools

The nlp_tools subpackage gathers Melusine's natural language processing building blocks: tokenization, phrase detection, word embeddings, stemming, lemmatization and emoji flagging.

List of submodules

Tokenizer melusine.nlp_tools.tokenizer

class melusine.nlp_tools.tokenizer.RegexTokenizer(tokenizer_regex: str = '\\w+(?:[\\?\\-\\"_]\\w+)*', stopwords: Optional[List[str]] = None)[source]

Bases: object

FILENAME = 'tokenizer.json'

Tokenize text using a regex split pattern.

tokenize(text: str) Sequence[str][source]

Apply the full tokenization pipeline on a text.

Parameters
text: str

Input text to be tokenized

Returns
tokens: Sequence[str]

List of tokens
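To illustrate the behaviour documented above, here is a minimal, self-contained sketch (a hypothetical re-implementation, not the actual Melusine code) of a regex-based tokenizer using the default split pattern from the signature above:

```python
import re
from typing import List, Optional


def regex_tokenize(text: str,
                   tokenizer_regex: str = r'\w+(?:[\?\-"_]\w+)*',
                   stopwords: Optional[List[str]] = None) -> List[str]:
    """Split text into tokens matching the regex, then drop stopwords."""
    stopwords_set = set(stopwords or [])
    # re.findall returns every substring matching the token pattern
    tokens = re.findall(tokenizer_regex, text)
    return [token for token in tokens if token not in stopwords_set]


print(regex_tokenize("Hello, well-known world!", stopwords=["Hello"]))
# → ['well-known', 'world']
```

Note that the default pattern keeps hyphenated or quoted compounds ("well-known") as a single token.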

class melusine.nlp_tools.tokenizer.Tokenizer(input_column='text', output_column='tokens', stop_removal=True)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Tokenizer class to split text into tokens.

fit(X, y=None)[source]

Unused method, defined only for compatibility with the scikit-learn API.

transform(data)[source]

Applies the tokenize method to a pandas.DataFrame.

Parameters
data: pandas.DataFrame

Data on which transformations are applied.

Returns
pandas.DataFrame
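The fit/transform shape of this class can be sketched with the stdlib only (a hypothetical re-implementation: a list of row dicts stands in for the pandas DataFrame, and the stopwords parameter is an illustrative assumption not present in the signature above):

```python
import re


class TokenizerSketch:
    """Minimal fit/transform tokenizer sketch mirroring the documented API."""

    def __init__(self, input_column="text", output_column="tokens",
                 stop_removal=True, stopwords=("the", "a")):
        self.input_column = input_column
        self.output_column = output_column
        self.stop_removal = stop_removal
        self.stopwords = set(stopwords)

    def fit(self, X, y=None):
        # Stateless: defined only for scikit-learn pipeline compatibility
        return self

    def transform(self, data):
        # data stands in for a pandas DataFrame: a list of row dicts
        for row in data:
            tokens = re.findall(r"\w+", row[self.input_column].lower())
            if self.stop_removal:
                tokens = [t for t in tokens if t not in self.stopwords]
            row[self.output_column] = tokens
        return data


rows = [{"text": "The quick fox"}]
print(TokenizerSketch().transform(rows)[0]["tokens"])  # → ['quick', 'fox']
```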

Phraser melusine.nlp_tools.phraser

class melusine.nlp_tools.phraser.Phraser(input_column='tokens', output_column='tokens', **phraser_args)[source]

Bases: object

Train and use a Gensim Phraser.

FILENAME = 'gensim_phraser_meta.pkl'
PHRASER_FILENAME = 'gensim_phraser'
fit(df, y=None)[source]
transform(df)[source]
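A phraser merges frequent token bigrams into single tokens. Gensim's Phrases model scores collocations with a statistical threshold; the following stdlib-only sketch uses a simplified raw-count criterion instead, just to illustrate the idea:

```python
from collections import Counter


def fit_phraser(token_lists, min_count=2):
    """Count bigrams and keep those seen at least min_count times."""
    bigrams = Counter()
    for tokens in token_lists:
        bigrams.update(zip(tokens, tokens[1:]))
    return {bg for bg, n in bigrams.items() if n >= min_count}


def apply_phraser(tokens, phrases):
    """Merge detected bigrams into single 'a_b' tokens, left to right."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


docs = [["new", "york", "city"], ["new", "york", "taxi"], ["old", "york"]]
phrases = fit_phraser(docs, min_count=2)
print(apply_phraser(["new", "york", "trip"], phrases))  # → ['new_york', 'trip']
```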

Embedding melusine.nlp_tools.embedding

class melusine.nlp_tools.embedding.Embedding(tokens_column=None, workers=40, random_seed=42, iter=15, size=300, method='word2vec_cbow', min_count=100)[source]

Bases: object

Class to train embeddings with the Word2Vec algorithm.

Attributes
word2id: dict

Word vocabulary (key: word, value: word_index).

embedding: gensim KeyedVectors instance

Gensim KeyedVectors instance for the trained or imported embedding.

method: str
One of the following :
  • “word2vec_sg” : Trains a Word2Vec Embedding using the Skip-Gram method (usually takes a long time).

  • “word2vec_cbow” : Trains a Word2Vec Embedding using the Continuous Bag-Of-Words method.

  • “lsa_docterm” : Trains an Embedding by using an SVD on a Document-Term Matrix.

  • “lsa_tfidf” : Trains an Embedding by using an SVD on a TF-IDFized Document-Term Matrix.

train_params: dict
Additional parameters for the embedding training. Check the following documentation:
  • gensim.models.Word2Vec for Word2Vec Embeddings

  • sklearn.decomposition.TruncatedSVD for LSA Embeddings

If left untouched, the default training values will be kept from the aforementioned packages.

>>> from melusine.nlp_tools.embedding import Embedding
>>> embedding = Embedding()
>>> embedding.train(X)  # noqa
>>> embedding.save(filepath)  # noqa
>>> embedding = Embedding().load(filepath)  # noqa
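The "lsa_docterm" method applies an SVD to a document-term matrix. As a sketch of the first step only, a count-based document-term matrix and the word2id vocabulary documented above can be built with the stdlib (the SVD itself would then be delegated to e.g. sklearn.decomposition.TruncatedSVD); the function name is hypothetical:

```python
from collections import Counter


def docterm_matrix(token_lists):
    """Build a dense document-term count matrix and a word2id vocabulary."""
    # word2id mirrors the attribute documented above: word -> column index
    vocab = sorted({tok for tokens in token_lists for tok in tokens})
    word2id = {word: idx for idx, word in enumerate(vocab)}
    matrix = []
    for tokens in token_lists:
        counts = Counter(tokens)
        matrix.append([counts[word] for word in vocab])
    return matrix, word2id


docs = [["spam", "offer", "offer"], ["meeting", "agenda"]]
matrix, word2id = docterm_matrix(docs)
print(word2id)  # → {'agenda': 0, 'meeting': 1, 'offer': 2, 'spam': 3}
print(matrix)   # → [[0, 0, 2, 1], [1, 1, 0, 0]]
```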
load(filepath)[source]

Method to load Embedding object.

save(filepath)[source]

Method to save Embedding object.

train(X)[source]

Train embeddings with the desired word embedding algorithm (default is Word2Vec).

Parameters
X: pd.DataFrame

Containing a clean body column.

train_word2vec()[source]

Fits a Word2Vec Embedding on the given documents, and updates the embedding attribute.

Stemmer melusine.nlp_tools.stemmer

Lemmatizer melusine.nlp_tools.lemmatizer

Emoji Flagger melusine.nlp_tools.emoji_flagger

class melusine.nlp_tools.emoji_flagger.DeterministicEmojiFlagger(input_column: str = 'last_body', output_column: str = 'last_body', flag_emoji: str = ' flag_emoji_ ')[source]

Bases: object

FILENAME = 'emoji_flagger.json'
fit(df, y=None)[source]
transform(df)[source]
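A deterministic emoji flagger replaces emoji characters with a fixed flag token (the default ' flag_emoji_ ' from the signature above). The following stdlib-only sketch illustrates the idea; the Unicode ranges are an illustrative assumption and the real flagger may cover more blocks:

```python
import re

# Partial emoji ranges (hypothetical; the actual flagger may cover more blocks)
EMOJI_PATTERN = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]+")


def flag_emojis(text, flag_emoji=" flag_emoji_ "):
    """Replace every emoji run with the flag token."""
    return EMOJI_PATTERN.sub(flag_emoji, text)


print(flag_emojis("Merci \U0001F600 !"))  # → 'Merci  flag_emoji_  !'
```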