Prepare_email subpackage `melusine.prepare_email`

List of submodules

Transfer & Reply melusine.prepare_email.manage_transfer_reply
Cleaning melusine.prepare_email.cleaning
Build Email Historic melusine.prepare_email.build_historic
Email Segmenting melusine.prepare_email.mail_segmenting
Process Email Metadata melusine.prepare_email.metadata_engineering
Extract Email Body & Header melusine.prepare_email.body_header_extraction

Transfer & Reply `melusine.prepare_email.manage_transfer_reply`

melusine.prepare_email.manage_transfer_reply.add_boolean_answer(row)

Returns True if the “header” starts with the ‘answer_subject’ regex, False otherwise. Applied row by row, it yields a boolean Series.

To be used with methods such as: apply(func, axis=1) or apply_by_multiprocessing(func, axis=1, **kwargs).

Parameters

row : row of pd.DataFrame, columns [‘header’]

Returns

pd.Series

Examples

>>> import pandas as pd
>>> data = pd.read_pickle('./tutorial/data/emails_anonymized.pickle')
>>> # data contains a 'header' column

>>> from melusine.prepare_email.manage_transfer_reply import add_boolean_answer
>>> add_boolean_answer(data.iloc[0])  # apply for 1 sample
>>> data.apply(add_boolean_answer, axis=1)  # apply to all samples
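The check described above amounts to anchoring a regex at the start of the header. The following is a minimal sketch, not melusine's implementation: `PREFIX_PATTERN` is a hypothetical pattern (melusine reads its own regexes from its configuration), and plain dicts stand in for DataFrame rows.

```python
import re

# Hypothetical pattern; the real subject regexes come from melusine's configuration.
PREFIX_PATTERN = re.compile(r"re\s*:", re.IGNORECASE)

def header_starts_with_prefix(row):
    """Return True if row['header'] starts with the prefix pattern."""
    # re.match only matches at the beginning of the string.
    return PREFIX_PATTERN.match(row["header"]) is not None

rows = [{"header": "RE: contract renewal"}, {"header": "Contract renewal"}]
flags = [header_starts_with_prefix(row) for row in rows]
```

With a real DataFrame, the same function could be passed to `data.apply(..., axis=1)` as in the examples above.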

melusine.prepare_email.manage_transfer_reply.add_boolean_transfer(row)

Returns True if the “header” starts with the ‘transfer_subject’ regex, False otherwise. Applied row by row, it yields a boolean Series.

To be used with methods such as: apply(func, axis=1) or apply_by_multiprocessing(func, axis=1, **kwargs).

Parameters

row : row of pd.DataFrame, columns [‘header’]

Returns

pd.Series

Examples

>>> import pandas as pd
>>> data = pd.read_pickle('./tutorial/data/emails_anonymized.pickle')
>>> # data contains a 'header' column

>>> from melusine.prepare_email.manage_transfer_reply import add_boolean_transfer
>>> add_boolean_transfer(data.iloc[0])  # apply for 1 sample
>>> data.apply(add_boolean_transfer, axis=1)  # apply to all samples

melusine.prepare_email.manage_transfer_reply.check_mail_begin_by_transfer(row)

Returns True if the “body” starts with the ‘begin_transfer’ regex, False otherwise. Applied row by row, it yields a boolean Series.

To be used with methods such as: apply(func, axis=1) or apply_by_multiprocessing(func, axis=1, **kwargs).

Parameters

row : row of pd.DataFrame, columns [‘body’]

Returns

pd.Series

Examples

>>> import pandas as pd
>>> data = pd.read_pickle('./tutorial/data/emails_anonymized.pickle')
>>> # data contains a 'body' column

>>> from melusine.prepare_email.manage_transfer_reply import check_mail_begin_by_transfer
>>> check_mail_begin_by_transfer(data.iloc[0])  # apply for 1 sample
>>> data.apply(check_mail_begin_by_transfer, axis=1)  # apply to all samples

melusine.prepare_email.manage_transfer_reply.update_info_for_transfer_mail(row)

Extracts and updates information from forwarded emails, such as: body, from, to, header, date.

It replaces the header with the initial subject (extracted from the forwarded email).
It removes the header from the email’s body.

To be used with methods such as: apply(func, axis=1) or apply_by_multiprocessing(func, axis=1, **kwargs).

Parameters

row : row of pd.DataFrame, columns [‘body’, ‘header’, ‘from’, ‘to’, ‘date’, ‘is_begin_by_transfer’]

Returns

pd.DataFrame

Examples

>>> import pandas as pd
>>> from melusine.prepare_email.manage_transfer_reply import check_mail_begin_by_transfer
>>> data = pd.read_pickle('./tutorial/data/emails_anonymized.pickle')
>>> data['is_begin_by_transfer'] = data.apply(check_mail_begin_by_transfer, axis=1)
>>> # data contains columns ['from', 'to', 'date', 'header', 'body', 'is_begin_by_transfer']

>>> from melusine.prepare_email.manage_transfer_reply import update_info_for_transfer_mail
>>> update_info_for_transfer_mail(data.iloc[0])  # apply for 1 sample
>>> data.apply(update_info_for_transfer_mail, axis=1)  # apply to all samples

Cleaning `melusine.prepare_email.cleaning`

Cleaning of the body and the header

melusine.prepare_email.cleaning.clean_body(row, flags=True)

Clean body column. The cleaning involves the following operations:

Cleaning the text
Removing the multiple spaces
Flagging specific items (postal code, phone number, date…)

Parameters

row : row of pandas.DataFrame object. Data contains ‘last_body’ column.
flags : boolean, optional. True if you want to flag relevant info, False if not. Default: True.

Returns

Row of pandas.DataFrame object, or pandas.Series if applied to the whole DataFrame.

melusine.prepare_email.cleaning.clean_header(row, flags=True)

Clean the header column. The cleaning involves the following operations:

Removing the transfers and answers indicators
Cleaning the text
Flagging specific items (postal code, phone number, date…)

Parameters

row : row of pandas.DataFrame object. Data contains ‘header’ column.
flags : boolean, optional. True if you want to flag relevant info, False if not. Default: True.

Returns

Row of pandas.DataFrame object, or pandas.Series if applied to the whole DataFrame.

melusine.prepare_email.cleaning.clean_text(text)

Clean a string. The cleaning involves the following operations:

Putting all letters to lowercase
Removing all the accents
Removing all line breaks
Removing all symbols and punctuations
Removing the multiple spaces

Parameters

text : str

Returns

str
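The five operations listed for clean_text can be chained with the standard library alone. This is an illustrative sketch under that assumption, not melusine's code: the function name `clean_text_sketch` and the exact regexes are hypothetical, and the real function may treat some characters (such as apostrophes) differently.

```python
import re
import unicodedata

def clean_text_sketch(text):
    """Apply the five listed cleaning steps in order (illustrative only)."""
    text = text.lower()
    # Decompose accented characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[\r\n]+", " ", text)      # line breaks
    text = re.sub(r"[^\w\s]", " ", text)      # symbols and punctuation
    return re.sub(r"\s+", " ", text).strip()  # multiple spaces

cleaned = clean_text_sketch("Bonjour,\nJ'ai été  débité !")
```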

melusine.prepare_email.cleaning.flag_items(text, flags=True)

Flags relevant information, e.g. amount, phone number, email address, postal code (5 digits).

Parameters

text : str. Body content.
flags : boolean, optional. True if you want to flag relevant info, False if not. Default: True.

Returns

str
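Flagging of this kind is usually a sequence of regex substitutions that replace a matched item with a placeholder token. A minimal sketch follows; the patterns, the token names (`flag_mail_`, `flag_cp_`) and the function name are hypothetical choices for illustration, not values taken from melusine's configuration.

```python
import re

# Hypothetical patterns and flag tokens, for illustration only.
FLAG_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), " flag_mail_ "),  # email address
    (re.compile(r"\b\d{5}\b"), " flag_cp_ "),                   # 5-digit postal code
]

def flag_items_sketch(text, flags=True):
    """Replace each matched item with its flag token; no-op if flags is False."""
    if not flags:
        return text
    for pattern, token in FLAG_PATTERNS:
        text = pattern.sub(token, text)
    # Collapse the extra spaces introduced around the tokens.
    return re.sub(r"\s+", " ", text).strip()

flagged = flag_items_sketch("write to jane.doe@example.com or visit 75001 paris")
```

Order matters in such a list: more specific patterns should run before more general ones so that, for instance, digits inside an email address are not flagged separately.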

melusine.prepare_email.cleaning.remove_accents(text, use_unidecode=False)

Removes accents from text. Using unidecode is more powerful but much more time-consuming. Example: the joined ‘ae’ character is converted to ‘a’ + ‘e’ by unidecode, while it is suppressed by unicodedata.
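The unicodedata behaviour mentioned above can be reproduced with one common standard-library recipe. This sketch assumes that recipe (NFKD normalization followed by an ASCII encode that ignores errors); whether melusine uses exactly these calls is not confirmed here, and the function name is hypothetical.

```python
import unicodedata

def strip_accents_unicodedata(text):
    # NFKD splits "é" into "e" plus a combining accent; encoding to ASCII
    # with errors="ignore" then drops every non-ASCII code point,
    # including characters such as "æ" that have no decomposition.
    normalized = unicodedata.normalize("NFKD", text)
    return normalized.encode("ascii", "ignore").decode("ascii")

result = strip_accents_unicodedata("élève ex æquo")
```

The joined ‘ae’ character disappears entirely in this approach, which is the trade-off against the slower transliteration described above.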

melusine.prepare_email.cleaning.remove_apostrophe(text)

Removes apostrophes from text.

melusine.prepare_email.cleaning.remove_line_break(text)

Removes line breaks from text.

melusine.prepare_email.cleaning.remove_multiple_spaces_and_strip_text(text)

Removes multiple spaces, strips text, and removes ‘-’, ‘*’ characters.

Parameters

text : str. Header content.

Returns

str

melusine.prepare_email.cleaning.remove_superior_symbol(text)

Removes superior (‘>’) and inferior (‘<’) symbols from text.

melusine.prepare_email.cleaning.remove_transfer_answer_header(text)

Removes historic and transfer indicators in the header, e.g. “Tr:”, “Re:”, “Fwd”.

Parameters

text : str. Header content.

Returns

str
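Stripping such indicators can be sketched as a single anchored regex that consumes any run of prefixes at the start of the header. The pattern below is a hypothetical example covering only colon-terminated prefixes; melusine's actual pattern may differ.

```python
import re

# Hypothetical pattern covering a few common reply/forward prefixes.
PREFIX = re.compile(r"^\s*((re|tr|fwd?)\s*:\s*)+", re.IGNORECASE)

def strip_reply_prefixes(header):
    """Remove leading 'Re:', 'Tr:', 'Fw:' or 'Fwd:' markers, possibly chained."""
    return PREFIX.sub("", header)

stripped = strip_reply_prefixes("TR: RE: Fwd: claim update")
```

Requiring the colon keeps ordinary words that merely begin with the same letters (such as "Request") untouched.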

melusine.prepare_email.cleaning.text_to_lowercase(text)

Sets all letters to lowercase.

Build Email Historic `melusine.prepare_email.build_historic`

melusine.prepare_email.build_historic.build_historic(row)

Rebuilds and structures the historic of emails from the whole contents. The function has to be applied with the apply method of a DataFrame along axis=1. For each email of the historic, it segments the body into 2 different parts (2 keys of a dict):

{‘text’: raw text extracted without metadata,
‘meta’: transition taken from the ‘transition_list’ defined in conf.json}

Parameters

row : row. A pandas.DataFrame row object with ‘body’ column.

Returns

list

Examples

>>> import pandas as pd
>>> data = pd.read_pickle('./tutorial/data/emails_anonymized.pickle')
>>> # data contains a 'body' column

>>> from melusine.prepare_email.build_historic import build_historic
>>> build_historic(data.iloc[0])  # apply for 1 sample
>>> data.apply(build_historic, axis=1)  # apply to all samples
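The text/meta segmentation can be pictured as splitting the body on transition markers. The sketch below is an assumption-laden illustration: `TRANSITION` is a hypothetical pattern standing in for the ‘transition_list’ of conf.json, the function takes a plain string rather than a row, and leaving the first message's meta empty is a choice made here, not confirmed behaviour.

```python
import re

# Hypothetical transition pattern; melusine takes its own list from conf.json.
TRANSITION = re.compile(r"(-+ ?Original Message ?-+|On .+? wrote:)")

def build_historic_sketch(body):
    """Split a body into a list of {'text', 'meta'} dicts, newest message first."""
    # With a capturing group, re.split alternates text and transition markers:
    # [text0, meta1, text1, meta2, text2, ...]
    parts = TRANSITION.split(body)
    historic = [{"text": parts[0].strip(), "meta": ""}]
    for meta, text in zip(parts[1::2], parts[2::2]):
        historic.append({"text": text.strip(), "meta": meta.strip()})
    return historic

historic = build_historic_sketch(
    "Thanks, received. On Monday Bob wrote: Please find the form attached."
)
```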

melusine.prepare_email.build_historic.is_only_typo(text)

Checks if the string contains any word character.

Email Segmenting `melusine.prepare_email.mail_segmenting`

melusine.prepare_email.mail_segmenting.split_message_to_sentences(text, sep_='(.*?[;.,?!])')

Splits a text into sentences.
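The default `sep_` pattern lazily captures everything up to the next punctuation mark, so splitting on it yields one chunk per sentence. A minimal sketch of that idea, assuming plain `re.split` with empty chunks filtered out (the function name `split_sketch` is hypothetical and any extra pre-processing done by melusine is omitted):

```python
import re

def split_sketch(text, sep_=r"(.*?[;.,?!])"):
    # The capturing group makes re.split return the matched chunks themselves;
    # empty strings between consecutive matches are filtered out.
    return [part.strip() for part in re.split(sep_, text) if part.strip()]

sentences = split_sketch("Hello, how are you? Thanks")
```

Note that `.` does not match line breaks by default, so text containing newlines would need them removed first for this pattern to behave as shown.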

melusine.prepare_email.mail_segmenting.structure_email(row)

1. Splits each message of the historic into parts and tags them. For example, a tag can be hello, body, greetings, etc.
2. Extracts the meta information of each message.

To be used with methods such as: apply(func, axis=1) or apply_by_multiprocessing(func, axis=1, **kwargs).

Parameters

row : row of pd.DataFrame, applied on column [‘structured_historic’]

Returns

list of dicts : one dict per message

Examples

>>> import pandas as pd
>>> from melusine.prepare_email.build_historic import build_historic
>>> data = pd.read_pickle('./tutorial/data/emails_anonymized.pickle')
>>> data['structured_historic'] = data.apply(build_historic, axis=1)
>>> # data contains column ['structured_historic']

>>> from melusine.prepare_email.mail_segmenting import structure_email
>>> structure_email(data.iloc[0])  # apply for 1 sample
>>> data.apply(structure_email, axis=1)  # apply to all samples

melusine.prepare_email.mail_segmenting.structure_message(message)

Splits a message into parts and tags them. For example, a tag can be hello, body, greetings, etc. Extracts the meta information of the message.

Parameters

message : dict

Returns

dict

melusine.prepare_email.mail_segmenting.structure_meta(meta)

Extracts meta information (date, from, to, header) from the meta string.

Parameters

meta : str

Returns

tuple(dict, string)

melusine.prepare_email.mail_segmenting.tag(string)

Tags a string.

Parameters

string : str

Returns

tuples : list of tuples and boolean

melusine.prepare_email.mail_segmenting.tag_parts_message(text)

Splits message into sentences, tags them and merges two sentences in a row having the same tag.

Parameters

text : str

Returns

list of tuples
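The merging step can be sketched with `itertools.groupby`, which groups consecutive items sharing a key. This is an illustration only: the (sentence, tag) tuple order and the space used to join merged sentences are assumptions, and the tagging itself is taken as given.

```python
from itertools import groupby

def merge_consecutive(tagged):
    """Merge neighbouring (text, tag) pairs that share the same tag."""
    merged = []
    for tag, group in groupby(tagged, key=lambda pair: pair[1]):
        merged.append((" ".join(text for text, _ in group), tag))
    return merged

merged = merge_consecutive([
    ("Hello,", "HELLO"),
    ("I moved last week.", "BODY"),
    ("Please update my address.", "BODY"),
    ("Regards,", "GREETINGS"),
])
```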

melusine.prepare_email.mail_segmenting.tag_sentence(sentence, default='BODY')

Tags a sentence. If the sentence cannot be tagged, it tags the subsentences instead.

Parameters

sentence : str

Returns

list of tuples : sentence, tag

melusine.prepare_email.mail_segmenting.tag_signature(row, token_threshold=5)

Function to be called after the mail_segmenting function as it requires a “structured_body” column. This function detects parts of a message that qualify as “signature”. Examples of parts qualifying as signature are sender name, company name, phone number, etc.

The methodology to detect a signature is the following:

Look for a THANKS or GREETINGS part indicating that the message is approaching the end
Check the length of the following message parts currently tagged as “BODY” (the maximum number of words is specified through the variable “signature_token_threshold”)
If ALL the “ending parts” contain few words, tag them as “SIGNATURE” parts
Otherwise, cancel the signature tagging

Parameters

row : pd.Series. Row of an email DataFrame.

Returns

structured_body : Updated structured body
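The signature methodology can be sketched on a flat list of (text, tag) tuples. Several points are assumptions made for this illustration: the function works on such a list rather than on a “structured_body” column, it anchors on the last THANKS or GREETINGS part, and "few words" is read as at most `token_threshold` whitespace-separated tokens.

```python
def tag_signature_sketch(parts, token_threshold=5):
    """parts: list of (text, tag) tuples; returns a new list."""
    # Find the THANKS or GREETINGS parts; the last one marks the closing.
    closing = [i for i, (_, tag) in enumerate(parts) if tag in ("THANKS", "GREETINGS")]
    if not closing:
        return list(parts)
    tail = range(closing[-1] + 1, len(parts))
    body_tail = [i for i in tail if parts[i][1] == "BODY"]
    # Retag only if every trailing BODY part is short; otherwise cancel.
    if body_tail and all(len(parts[i][0].split()) <= token_threshold for i in body_tail):
        return [(text, "SIGNATURE") if i in body_tail else (text, tag)
                for i, (text, tag) in enumerate(parts)]
    return list(parts)

parts = [
    ("Hello", "HELLO"),
    ("Please cancel my contract.", "BODY"),
    ("Regards,", "GREETINGS"),
    ("John Doe", "BODY"),
    ("Acme Insurance", "BODY"),
]
tagged = tag_signature_sketch(parts)
```

The all-or-nothing rule means a single long trailing part leaves every tag unchanged, which guards against mislabelling a postscript as a signature.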

Process Email Metadata `melusine.prepare_email.metadata_engineering`

class melusine.prepare_email.metadata_engineering.Dummifier(columns_to_dummify=['extension', 'dayofweek', 'hour', 'min', 'attachment_type'], copy=True)

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Transformer to dummify categorical features and list of . Compatible with the scikit-learn API.

fit(X, y=None)

Stores dummified features to avoid inconsistency with new data, which could contain new labels (unknown from the training data).

transform(X, y=None)

Dummifies features and keeps only the labels in common with the fitted data.
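The fit/transform contract described above (remember the labels seen at fit time, ignore unseen ones later) can be sketched in plain Python. `DummifierSketch` is a hypothetical stand-in operating on lists of dicts, with a made-up `column__label` naming scheme; it is not the scikit-learn-compatible class itself.

```python
class DummifierSketch:
    """One-hot encode dict rows, keeping only labels seen during fit."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, rows):
        # Remember every (column, label) pair seen in the training data.
        self.dummies_ = sorted({(c, row[c]) for row in rows for c in self.columns})
        return self

    def transform(self, rows):
        # Labels unseen during fit produce all-zero indicators.
        return [
            {f"{c}__{label}": int(row[c] == label) for c, label in self.dummies_}
            for row in rows
        ]

dummifier = DummifierSketch(["extension"]).fit(
    [{"extension": "gmail"}, {"extension": "yahoo"}]
)
encoded = dummifier.transform([{"extension": "gmail"}, {"extension": "other"}])
```

Fixing the output columns at fit time keeps the feature matrix the same width for training and new data.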

class melusine.prepare_email.metadata_engineering.MetaAttachmentType(topn_extension=100)

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Transformer which creates the ‘type’ feature extracted by regex from the metadata. It extracts the types of attached files.

Compatible with the scikit-learn API.

static encode_type(row, top_ext)

fit(X, y=None)

static get_attachment_type(row)

Gets type from attachment.

static get_top_attachment_type(X, n=100)

Returns list of most common types of attachment.

transform(X)

Encodes extensions.

class melusine.prepare_email.metadata_engineering.MetaDate(regex_date_format='\\w+ (\\d+) (\\w+) (\\d{4}) (\\d{2}) h (\\d{2})', date_format='%d/%m/%Y %H:%M')

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Transformer which creates new features from dates such as:

hour
minute
dayofweek

Compatible with the scikit-learn API.

Parameters

regex_date_format : str, optional. Regex to extract date from text.
date_format : str, optional. A date format.

date_formatting(row, regex_format)

Sets a date in the right format.

fit(X, y=None)

Unused method. Defined only for compatibility with the scikit-learn API.

static get_dayofweek(row)

Gets day of the week from date.

static get_hour(row)

Gets hour from date.

static get_min(row)

transform(X)
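The two default parameters suggest a two-step flow: the regex pulls day, month name, year, hour and minute out of free text, and the result is rewritten to match `date_format` before the features are derived. The sketch below follows that reading; the `MONTHS` lookup (a partial list of French month names) and the function name are hypothetical additions, and Monday is counted as day 0.

```python
import re
from datetime import datetime

REGEX_DATE_FORMAT = r"\w+ (\d+) (\w+) (\d{4}) (\d{2}) h (\d{2})"
DATE_FORMAT = "%d/%m/%Y %H:%M"
# Hypothetical, partial month lookup used only for this illustration.
MONTHS = {"janvier": "01", "mai": "05", "juin": "06"}

def date_features(raw):
    """Extract hour, minute and day of week from a textual date."""
    day, month, year, hour, minute = re.search(REGEX_DATE_FORMAT, raw).groups()
    formatted = f"{int(day):02d}/{MONTHS[month]}/{year} {hour}:{minute}"
    parsed = datetime.strptime(formatted, DATE_FORMAT)
    return {"hour": parsed.hour, "min": parsed.minute, "dayofweek": parsed.weekday()}

features = date_features("jeudi 24 mai 2018 11 h 27")
```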

class melusine.prepare_email.metadata_engineering.MetaExtension(topn_extension=100)

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Transformer which creates the ‘extension’ feature extracted by regex from the metadata. It extracts the extension of email addresses.

Compatible with the scikit-learn API.

static encode_extension(row, top_ext)

fit(X, y=None)

static get_extension(row)

Gets extension from email address.

static get_top_extension(X, n=100)

Returns list of most common extensions.

transform(X)

Encodes extensions.
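The extract, rank, encode sequence suggested by these methods can be sketched with `collections.Counter`. Everything specific here is an assumption: the "extension" is taken to be the domain label right after "@", rare extensions are mapped to a made-up "other" label, and the helper names are hypothetical.

```python
from collections import Counter

def get_extension_sketch(address):
    # Assumption: the "extension" is the domain label right after "@".
    domain = address.rsplit("@", 1)[-1]
    return domain.split(".", 1)[0].lower()

def top_extensions(addresses, n=100):
    """Return the n most common extensions, most frequent first."""
    counts = Counter(get_extension_sketch(a) for a in addresses)
    return [ext for ext, _ in counts.most_common(n)]

def encode_extension_sketch(address, top_ext):
    """Keep the extension if it is among the top ones, else use 'other'."""
    ext = get_extension_sketch(address)
    return ext if ext in top_ext else "other"

train = ["a@gmail.com", "b@gmail.com", "c@yahoo.fr", "d@corp.example"]
top = top_extensions(train, n=1)
encoded = [encode_extension_sketch(a, top) for a in train]
```

Capping the vocabulary at the top n values keeps the number of dummy columns bounded when the feature is later one-hot encoded.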

Extract Email Body & Header `melusine.prepare_email.body_header_extraction`

melusine.prepare_email.body_header_extraction.extract_body(message_dict)

Extracts the body from a message dictionary.

Parameters

message_dict : dict

Returns

str

melusine.prepare_email.body_header_extraction.extract_header(message_dict)

Extracts the header from a message dictionary.

Parameters

message_dict : dict

Returns

str

melusine.prepare_email.body_header_extraction.extract_last_body(row)

Extracts the body from the last message of the conversation. The conversation is structured as a dictionary.

To be used with methods such as: apply(func, axis=1) or apply_by_multiprocessing(func, axis=1, **kwargs).

Parameters

row : row of pd.DataFrame

Returns

str
