Prepare_email subpackage melusine.prepare_email

_images/schema_2.png

List of submodules

Transfer & Reply melusine.prepare_email.manage_transfer_reply

melusine.prepare_email.manage_transfer_reply.add_boolean_answer(row)[source]

Compute a boolean Series that is True if the “header” starts with the given regex ‘answer_subject’, False if not.

To be used with methods such as: apply(func, axis=1) or apply_by_multiprocessing(func, axis=1, **kwargs).

Parameters
row : row of pd.DataFrame, columns [‘header’]
Returns
pd.Series

Examples

>>> import pandas as pd
>>> data = pd.read_pickle('./tutorial/data/emails_anonymized.pickle')
>>> # data contains a 'header' column
>>> from melusine.prepare_email.manage_transfer_reply import add_boolean_answer
>>> add_boolean_answer(data.iloc[0])  # apply for 1 sample
>>> data.apply(add_boolean_answer, axis=1)  # apply to all samples
melusine.prepare_email.manage_transfer_reply.add_boolean_transfer(row)[source]

Compute a boolean Series that is True if the “header” starts with the given regex ‘transfer_subject’, False if not.

To be used with methods such as: apply(func, axis=1) or apply_by_multiprocessing(func, axis=1, **kwargs).

Parameters
row : row of pd.DataFrame, columns [‘header’]
Returns
pd.Series

Examples

>>> import pandas as pd
>>> data = pd.read_pickle('./tutorial/data/emails_anonymized.pickle')
>>> # data contains a 'header' column
>>> from melusine.prepare_email.manage_transfer_reply import add_boolean_transfer
>>> add_boolean_transfer(data.iloc[0])  # apply for 1 sample
>>> data.apply(add_boolean_transfer, axis=1)  # apply to all samples
melusine.prepare_email.manage_transfer_reply.check_mail_begin_by_transfer(row)[source]

Compute a boolean Series that is True if the “body” starts with the given regex ‘begin_transfer’, False if not.

To be used with methods such as: apply(func, axis=1) or apply_by_multiprocessing(func, axis=1, **kwargs).

Parameters
row : row of pd.DataFrame, columns [‘body’]
Returns
pd.Series

Examples

>>> import pandas as pd
>>> data = pd.read_pickle('./tutorial/data/emails_anonymized.pickle')
>>> # data contains a 'body' column
>>> from melusine.prepare_email.manage_transfer_reply import check_mail_begin_by_transfer
>>> check_mail_begin_by_transfer(data.iloc[0])  # apply for 1 sample
>>> data.apply(check_mail_begin_by_transfer, axis=1)  # apply to all samples
melusine.prepare_email.manage_transfer_reply.update_info_for_transfer_mail(row)[source]

Extracts and updates information from forwarded emails, such as: body, from, to, header, date.
  • It replaces the header with the initial subject (extracted from the forwarded email).

  • It removes the header from the email’s body.

To be used with methods such as: apply(func, axis=1) or apply_by_multiprocessing(func, axis=1, **kwargs).

Parameters
row : row of pd.DataFrame, columns [‘body’, ‘header’, ‘from’, ‘to’, ‘date’, ‘is_begin_by_transfer’]
Returns
pd.DataFrame

Examples

>>> import pandas as pd
>>> from melusine.prepare_email.manage_transfer_reply import check_mail_begin_by_transfer
>>> data = pd.read_pickle('./tutorial/data/emails_anonymized.pickle')
>>> data['is_begin_by_transfer'] = data.apply(check_mail_begin_by_transfer, axis=1)
>>> # data contains columns ['from', 'to', 'date', 'header', 'body', 'is_begin_by_transfer']
>>> from melusine.prepare_email.manage_transfer_reply import update_info_for_transfer_mail
>>> update_info_for_transfer_mail(data.iloc[0])  # apply for 1 sample
>>> data.apply(update_info_for_transfer_mail, axis=1)  # apply to all samples

Cleaning melusine.prepare_email.cleaning

Cleaning of the body and the header

melusine.prepare_email.cleaning.clean_body(row, flags=True)[source]
Clean body column. The cleaning involves the following operations:
  • Cleaning the text

  • Removing the multiple spaces

  • Flagging specific items (postal code, phone number, date…)

Parameters
row : row of pandas.DataFrame object

Data contains ‘last_body’ column.

flags : boolean, optional

True if you want to flag relevant info, False if not. Default value: True.

Returns
Row of a pandas.DataFrame object, or pandas.Series if applied to the whole DataFrame.
melusine.prepare_email.cleaning.clean_header(row, flags=True)[source]
Clean the header column. The cleaning involves the following operations:
  • Removing the transfers and answers indicators

  • Cleaning the text

  • Flagging specific items (postal code, phone number, date…)

Parameters
row : row of pandas.DataFrame object

Data contains ‘header’ column.

flags : boolean, optional

True if you want to flag relevant info, False if not. Default value: True.

Returns
Row of a pandas.DataFrame object, or pandas.Series if applied to the whole DataFrame.
melusine.prepare_email.cleaning.clean_text(text)[source]
Clean a string. The cleaning involves the following operations:
  • Putting all letters to lowercase

  • Removing all the accents

  • Removing all line breaks

  • Removing all symbols and punctuations

  • Removing the multiple spaces

Parameters
text : str
Returns
str
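
The five operations listed for clean_text can be sketched with the standard library alone. This is a minimal illustration under assumptions, not the melusine implementation: the function name clean_text_sketch and the exact regular expressions are hypothetical, and the real function may treat some characters differently.

```python
import re
import unicodedata


def clean_text_sketch(text: str) -> str:
    """Hypothetical stand-in for clean_text, built from the listed steps."""
    # 1. Put all letters to lowercase.
    text = text.lower()
    # 2. Remove accents: decompose characters, then drop non-ASCII marks.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # 3. Remove line breaks.
    text = re.sub(r"[\r\n]+", " ", text)
    # 4. Remove symbols and punctuation (anything that is not a word
    #    character or whitespace).
    text = re.sub(r"[^\w\s]", " ", text)
    # 5. Collapse multiple spaces and strip the result.
    return re.sub(r"\s+", " ", text).strip()


print(clean_text_sketch("Bonjour,\nJ'ai  été Facturé !"))
```

The order matters in this sketch: punctuation is replaced by a space rather than deleted, so the final whitespace pass is what keeps words such as “j” and “ai” separated by a single space.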
melusine.prepare_email.cleaning.flag_items(text, flags=True)[source]
Flag relevant information

E.g. amount, phone number, email address, postal code (5 digits), etc.

Parameters
text : str

Body content.

flags : boolean, optional

True if you want to flag relevant info, False if not. Default value, True.

Returns
str
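
Flagging can be pictured as replacing each matched item with a placeholder token. The sketch below is only an illustration under assumptions: the patterns and the token names (flag_mail_, flag_phone_, flag_cp_) are hypothetical and are not taken from the melusine configuration.

```python
import re

# Hypothetical (pattern, token) pairs; the library's own regexes and flag
# names live in its configuration and may differ.
HYPOTHETICAL_FLAGS = [
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"), " flag_mail_ "),
    (re.compile(r"\b(?:\d{2}[ .]?){4}\d{2}\b"), " flag_phone_ "),
    (re.compile(r"\b\d{5}\b"), " flag_cp_ "),  # 5-digit postal code
]


def flag_items_sketch(text: str, flags: bool = True) -> str:
    """Replace relevant items with placeholder tokens when flags is True."""
    if not flags:
        return text
    # Phone numbers are handled before postal codes so that digit groups
    # belonging to a phone number are consumed first.
    for pattern, token in HYPOTHETICAL_FLAGS:
        text = pattern.sub(token, text)
    return text


print(flag_items_sketch("Appelez le 06 12 34 56 78, code postal 75011"))
```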
melusine.prepare_email.cleaning.remove_accents(text, use_unidecode=False)[source]

Remove accents from text. Using unidecode is more powerful but much more time-consuming. Example: the joined ‘ae’ character (æ) is converted to ‘a’ + ‘e’ by unidecode, while it is removed by unicodedata.
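
The unicodedata behaviour described above can be reproduced with the standard library. This is a sketch assuming the common NFKD-then-ASCII approach; the function name is hypothetical, the library's own code may differ, and the unidecode branch is not shown because unidecode is a third-party package.

```python
import unicodedata


def remove_accents_sketch(text: str) -> str:
    """Strip accents using unicodedata only (no unidecode)."""
    # NFKD splits 'é' into 'e' plus a combining accent; encoding to ASCII
    # with errors="ignore" then drops the combining accent.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(remove_accents_sketch("élève"))
# 'æ' has no Unicode decomposition, so it is dropped rather than
# transliterated to 'ae'.
print(repr(remove_accents_sketch("æ")))
```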

melusine.prepare_email.cleaning.remove_apostrophe(text)[source]

Remove apostrophes from text

melusine.prepare_email.cleaning.remove_line_break(text)[source]

Remove line breaks from text

melusine.prepare_email.cleaning.remove_multiple_spaces_and_strip_text(text)[source]

Remove multiple spaces, strip text, and remove ‘-’, ‘*’ characters.

Parameters
text : str

Header content.

Returns
str
melusine.prepare_email.cleaning.remove_superior_symbol(text)[source]

Remove greater-than (‘>’) and less-than (‘<’) symbols from text

melusine.prepare_email.cleaning.remove_transfer_answer_header(text)[source]

Remove reply and transfer indicators from the header. Ex: “Tr:”, “Re:”, “Fwd”, etc.

Parameters
text : str

Header content.

Returns
str
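
A header prefix such as “Re:” or “Tr:” can be stripped with a single anchored regular expression. The pattern below is a hypothetical stand-in for illustration; the indicators actually recognized by melusine are defined in its configuration and may be broader (for example, the colon may be optional).

```python
import re

# Hypothetical pattern: one or more reply/transfer markers followed by a
# colon at the start of the header, case-insensitive.
PREFIX_PATTERN = re.compile(r"^\s*(?:(?:re|tr|fw|fwd)\s*:\s*)+", re.IGNORECASE)


def remove_transfer_answer_header_sketch(text: str) -> str:
    """Drop leading 'Re:', 'Tr:', 'Fwd:' style indicators from a header."""
    return PREFIX_PATTERN.sub("", text)


print(remove_transfer_answer_header_sketch("Re: TR: Demande de devis"))
```

Requiring the colon in this sketch avoids truncating ordinary words that merely begin with the same letters, such as “Retour”.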
melusine.prepare_email.cleaning.text_to_lowercase(text)[source]

Set all letters to lowercase

Build Email Historic melusine.prepare_email.build_historic

melusine.prepare_email.build_historic.build_historic(row)[source]

Rebuilds and structures the historic (history) of emails from the whole content. The function has to be applied with the apply method of a DataFrame with axis=1. For each email of the historic, it segments the body into 2 different parts (2 keys of a dict):

{‘text’: raw text extracted without metadata,

‘meta’: transition matched from the ‘transition_list’ defined in conf.json}.

Parameters
row : row

A pandas.DataFrame row object with ‘body’ column.

Returns
list

Examples

>>> import pandas as pd
>>> data = pd.read_pickle('./tutorial/data/emails_anonymized.pickle')
>>> # data contains a 'body' column
>>> from melusine.prepare_email.build_historic import build_historic
>>> build_historic(data.iloc[0])  # apply for 1 sample
>>> data.apply(build_historic, axis=1)  # apply to all samples
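
The segmentation performed by build_historic can be approximated by splitting the body on transition lines. The sketch below rests on assumptions: a single hypothetical transition pattern replaces the ‘transition_list’ from conf.json, and the empty ‘meta’ used for the most recent message is a choice made for this illustration.

```python
import re

# Hypothetical transition pattern (French "On ..., X wrote:" line). The real
# transitions come from 'transition_list' in conf.json.
TRANSITION = re.compile(r"^(Le .+ a écrit[ \t]*:)[ \t]*$", re.MULTILINE)


def build_historic_sketch(body: str) -> list:
    """Split a raw body into a list of {'text', 'meta'} dicts."""
    # With one capturing group, re.split returns
    # [text0, meta1, text1, meta2, text2, ...].
    parts = TRANSITION.split(body)
    historic = [{"text": parts[0].strip(), "meta": ""}]
    for meta, text in zip(parts[1::2], parts[2::2]):
        historic.append({"text": text.strip(), "meta": meta.strip()})
    return historic


body = (
    "Merci pour votre retour.\n"
    "Le 12 mars 2019 à 10:00, Jean a écrit :\n"
    "Bonjour, voici le devis."
)
for message in build_historic_sketch(body):
    print(message)
```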
melusine.prepare_email.build_historic.is_only_typo(text)[source]

Check if the string contains any word character

Email Segmenting melusine.prepare_email.mail_segmenting

melusine.prepare_email.mail_segmenting.split_message_to_sentences(text, sep_='(.*?[;.,?!])')[source]

Split a text into sentences
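
The default sep_ pattern shown in the signature captures the shortest run of characters ending in one of ; . , ? !. A minimal sketch of how such a pattern can drive the split is given below; the filtering and stripping of the pieces are assumptions, and the function name is hypothetical.

```python
import re


def split_message_to_sentences_sketch(text: str, sep_: str = r"(.*?[;.,?!])") -> list:
    """Split text on the capturing pattern and keep the non-empty pieces."""
    # Because the pattern is a capturing group, re.split keeps each matched
    # sentence in the result; the gaps between matches are empty strings.
    pieces = re.split(sep_, text)
    return [piece.strip() for piece in pieces if piece.strip()]


print(split_message_to_sentences_sketch("Bonjour, merci. A bientot"))
```

Text after the last punctuation mark is not matched by the pattern and is returned as a trailing piece.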

melusine.prepare_email.mail_segmenting.structure_email(row)[source]

1. Splits each message in the historic into parts and tags them. For example, a tag can be hello, body, greetings, etc.

2. Extracts the meta information of each message.

To be used with methods such as: apply(func, axis=1) or apply_by_multiprocessing(func, axis=1, **kwargs).

Parameters
row : row of pd.DataFrame, applied on column [‘structured_historic’]
Returns
list of dicts : one dict per message

Examples

>>> import pandas as pd
>>> from melusine.prepare_email.build_historic import build_historic
>>> data = pd.read_pickle('./tutorial/data/emails_anonymized.pickle')
>>> data['structured_historic'] = data.apply(build_historic, axis=1)
>>> # data contains column ['structured_historic']
>>> from melusine.prepare_email.mail_segmenting import structure_email
>>> structure_email(data.iloc[0])  # apply for 1 sample
>>> data.apply(structure_email, axis=1)  # apply to all samples
melusine.prepare_email.mail_segmenting.structure_message(message)[source]

Splits a message into parts and tags them. For example, a tag can be hello, body, greetings, etc. Extracts the meta information of the message.

Parameters
message : dict
Returns
dict
melusine.prepare_email.mail_segmenting.structure_meta(meta)[source]

Extract meta information (date, from, to, header) from the meta string

Parameters
meta : str
Returns
tuple(dict, string)
melusine.prepare_email.mail_segmenting.tag(string)[source]

Tags a string.

Parameters
string : str
Returns
tuples : list of tuples and boolean
melusine.prepare_email.mail_segmenting.tag_parts_message(text)[source]

Splits message into sentences, tags them and merges two sentences in a row having the same tag.

Parameters
text : str
Returns
list of tuples
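
The merging step of tag_parts_message (joining consecutive sentences that share a tag) can be sketched with itertools.groupby. The (sentence, tag) tuple layout and the tag names are assumptions made for this illustration, and the function name is hypothetical.

```python
from itertools import groupby


def merge_consecutive_tags_sketch(tagged_sentences: list) -> list:
    """Merge neighbouring (sentence, tag) pairs that carry the same tag."""
    merged = []
    # groupby only groups adjacent items, which is exactly the
    # "two sentences in a row" behaviour described above.
    for tag, group in groupby(tagged_sentences, key=lambda pair: pair[1]):
        merged.append((" ".join(sentence for sentence, _ in group), tag))
    return merged


tagged = [
    ("Bonjour,", "HELLO"),
    ("Je vous écris", "BODY"),
    ("au sujet du devis.", "BODY"),
    ("Cordialement", "GREETINGS"),
]
print(merge_consecutive_tags_sketch(tagged))
```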
melusine.prepare_email.mail_segmenting.tag_sentence(sentence, default='BODY')[source]

Tag a sentence. If the sentence cannot be tagged, its subsentences are tagged instead.

Parameters
sentence : str
Returns
list of tuples : sentence, tag
melusine.prepare_email.mail_segmenting.tag_signature(row, token_threshold=5)[source]

Function to be called after the mail_segmenting function, as it requires a “structured_body” column. This function detects parts of a message that qualify as “signature”. Examples of parts qualifying as signature are sender name, company name, phone number, etc.

The methodology to detect a signature is the following:
  • Look for a THANKS or GREETINGS part indicating that the message is approaching the end

  • Check the length of the following message parts currently tagged as “BODY” (the maximum number of words is specified through the variable “signature_token_threshold”)

  • If ALL the “ending parts” contain few words, tag them as “SIGNATURE” parts

  • Otherwise, cancel the signature tagging

Parameters
row : pd.Series

Row of an email DataFrame

Returns
structured_body : Updated structured body
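
The signature methodology described for tag_signature can be sketched on a plain list of (text, tag) tuples. Several points are assumptions: the real function works on the “structured_body” column of a row, the tuple layout is hypothetical, and this sketch starts from the first THANKS or GREETINGS part it finds.

```python
END_TAGS = {"THANKS", "GREETINGS"}


def tag_signature_sketch(parts: list, token_threshold: int = 5) -> list:
    """Retag short trailing BODY parts as SIGNATURE, or leave parts unchanged."""
    # Look for a THANKS or GREETINGS part (first occurrence, by assumption).
    end_index = next((i for i, (_, tag) in enumerate(parts) if tag in END_TAGS), None)
    if end_index is None:
        return list(parts)
    ending = parts[end_index + 1:]
    body_texts = [text for text, tag in ending if tag == "BODY"]
    # Cancel the tagging if any ending BODY part has too many words.
    if not body_texts or any(len(text.split()) > token_threshold for text in body_texts):
        return list(parts)
    retagged = [(text, "SIGNATURE" if tag == "BODY" else tag) for text, tag in ending]
    return list(parts[: end_index + 1]) + retagged


parts = [
    ("Bonjour,", "HELLO"),
    ("Veuillez trouver ci-joint le devis demandé pour notre nouveau contrat.", "BODY"),
    ("Cordialement,", "GREETINGS"),
    ("Jean Dupont", "BODY"),
    ("06 12 34 56 78", "BODY"),
]
print(tag_signature_sketch(parts))
```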

Process Email Metadata melusine.prepare_email.metadata_engineering

class melusine.prepare_email.metadata_engineering.Dummifier(columns_to_dummify=['extension', 'dayofweek', 'hour', 'min', 'attachment_type'], copy=True)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Transformer that dummifies categorical features and list of . Compatible with scikit-learn API.

fit(X, y=None)[source]

Store dummified features to avoid inconsistencies with new data, which could contain new labels (unknown from train data).

transform(X, y=None)[source]

Dummify features and keep only common labels with pretrained data.
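
The fit/transform contract described above (remember the dummy columns seen at fit time, then keep only those at transform time) can be illustrated without pandas or scikit-learn. This is a simplified sketch on lists of dicts; the class name and the "column__label" naming scheme are hypothetical, and the real Dummifier operates on DataFrames.

```python
class DummifierSketch:
    """Toy one-hot encoder that ignores labels unseen during fit."""

    def __init__(self, columns_to_dummify):
        self.columns_to_dummify = columns_to_dummify

    def fit(self, rows, y=None):
        # Store the dummy column names observed in the training data.
        self.dummy_columns_ = sorted(
            {f"{col}__{row[col]}" for row in rows for col in self.columns_to_dummify}
        )
        return self

    def transform(self, rows, y=None):
        # Emit exactly the stored columns; unknown labels yield all zeros.
        encoded = []
        for row in rows:
            active = {f"{col}__{row[col]}" for col in self.columns_to_dummify}
            encoded.append({name: int(name in active) for name in self.dummy_columns_})
        return encoded


train = [{"extension": "gmail"}, {"extension": "yahoo"}]
dummifier = DummifierSketch(["extension"]).fit(train)
print(dummifier.transform([{"extension": "gmail"}, {"extension": "hotmail"}]))
```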

class melusine.prepare_email.metadata_engineering.MetaAttachmentType(topn_extension=100)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Transformer which creates ‘type’ feature extracted from regex in metadata. It extracts types of attached files.

Compatible with scikit-learn API.

static encode_type(row, top_ext)[source]
fit(X, y=None)[source]
static get_attachment_type(row)[source]

Gets type from attachment.

static get_top_attachment_type(X, n=100)[source]

Returns list of most common types of attachment.

transform(X)[source]

Encode extensions

class melusine.prepare_email.metadata_engineering.MetaDate(regex_date_format='\\w+ (\\d+) (\\w+) (\\d{4}) (\\d{2}) h (\\d{2})', date_format='%d/%m/%Y %H:%M')[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Transformer which creates new features from dates such as:
  • hour

  • minute

  • dayofweek

Compatible with scikit-learn API.

Parameters
regex_date_format : str, optional

Regex to extract date from text.

date_format : str, optional

A date format.

date_formatting(row, regex_format)[source]

Set a date in the right format
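
The two default parameters of MetaDate suggest a two-step process: extract the date pieces with regex_date_format, then parse them with date_format. The sketch below follows that reading under assumptions: the French month table and the function name are hypothetical, and the library's own conversion may differ.

```python
import re
from datetime import datetime

# Default values copied from the MetaDate signature.
REGEX_DATE_FORMAT = r"\w+ (\d+) (\w+) (\d{4}) (\d{2}) h (\d{2})"
DATE_FORMAT = "%d/%m/%Y %H:%M"

# Hypothetical month lookup used only for this illustration.
MONTHS = {
    "janvier": "01", "février": "02", "mars": "03", "avril": "04",
    "mai": "05", "juin": "06", "juillet": "07", "août": "08",
    "septembre": "09", "octobre": "10", "novembre": "11", "décembre": "12",
}


def date_formatting_sketch(text, regex_format=REGEX_DATE_FORMAT, date_format=DATE_FORMAT):
    """Return a datetime parsed from a date string, or None if no match."""
    match = re.search(regex_format, text)
    if match is None:
        return None
    day, month, year, hour, minute = match.groups()
    normalized = f"{day}/{MONTHS[month.lower()]}/{year} {hour}:{minute}"
    return datetime.strptime(normalized, date_format)


parsed = date_formatting_sketch("jeudi 24 mai 2018 11 h 27")
# Features comparable to hour, minute and dayofweek (Monday is 0).
print(parsed.hour, parsed.minute, parsed.weekday())
```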

fit(X, y=None)[source]

Unused method. Defined only for compatibility with scikit-learn API.

static get_dayofweek(row)[source]

Get day of the week from date

static get_hour(row)[source]

Get hour from date

static get_min(row)[source]
transform(X)[source]
class melusine.prepare_email.metadata_engineering.MetaExtension(topn_extension=100)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Transformer which creates ‘extension’ feature extracted from regex in metadata. It extracts extension of mail addresses.

Compatible with scikit-learn API.

static encode_extension(row, top_ext)[source]
fit(X, y=None)[source]
static get_extension(row)[source]

Gets extension from email address.

static get_top_extension(X, n=100)[source]

Returns list of most common extensions.

transform(X)[source]

Encode extensions
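
The three static helpers of MetaExtension can be pictured as extract, rank and encode steps. The sketch below is an assumption-laden illustration: the regex (taking the part between “@” and the first dot as the extension), the "other" fallback label and the function names are all hypothetical.

```python
import re
from collections import Counter


def get_extension_sketch(address: str) -> str:
    """Return the part between '@' and the first following dot, lowercased."""
    match = re.search(r"@([^.@\s]+)\.", address)
    return match.group(1).lower() if match else ""


def get_top_extension_sketch(addresses, n=100):
    """Return the n most common extensions."""
    counts = Counter(get_extension_sketch(address) for address in addresses)
    return [extension for extension, _ in counts.most_common(n)]


def encode_extension_sketch(address: str, top_ext) -> str:
    """Keep the extension if it is among the top ones, else use a fallback."""
    extension = get_extension_sketch(address)
    return extension if extension in top_ext else "other"


addresses = ["a@gmail.com", "b@gmail.com", "c@yahoo.fr"]
top = get_top_extension_sketch(addresses, n=1)
print(top, encode_extension_sketch("c@yahoo.fr", top))
```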

Extract Email Body & Header melusine.prepare_email.body_header_extraction

melusine.prepare_email.body_header_extraction.extract_body(message_dict)[source]

Extracts the body from a message dictionary.

Parameters
message_dict : dict
Returns
str
melusine.prepare_email.body_header_extraction.extract_header(message_dict)[source]

Extracts the header from a message dictionary.

Parameters
message_dict : dict
Returns
str
melusine.prepare_email.body_header_extraction.extract_last_body(row)[source]

Extracts the body from the last message of the conversation. The conversation is structured as a dictionary.

To be used with methods such as: apply(func, axis=1) or apply_by_multiprocessing(func, axis=1, **kwargs).

Parameters
row : row of pd.DataFrame
Returns
str