Prepare_email subpackage melusine.prepare_email

List of submodules
Transfer & Reply melusine.prepare_email.manage_transfer_reply
- melusine.prepare_email.manage_transfer_reply.add_boolean_answer(row)
Computes a boolean Series that is True if the “header” starts with the ‘answer_subject’ regex, and False otherwise.
To be used with methods such as: apply(func, axis=1) or apply_by_multiprocessing(func, axis=1, **kwargs).
- Parameters
- row: row of pd.DataFrame, columns [‘header’]
- Returns
- pd.Series
Examples
>>> import pandas as pd
>>> data = pd.read_pickle('./tutorial/data/emails_anonymized.pickle')
>>> # data contains a 'header' column
>>> from melusine.prepare_email.manage_transfer_reply import add_boolean_answer
>>> add_boolean_answer(data.iloc[0])  # apply for 1 sample
>>> data.apply(add_boolean_answer, axis=1)  # apply to all samples
- melusine.prepare_email.manage_transfer_reply.add_boolean_transfer(row)
Computes a boolean Series that is True if the “header” starts with the ‘transfer_subject’ regex, and False otherwise.
To be used with methods such as: apply(func, axis=1) or apply_by_multiprocessing(func, axis=1, **kwargs).
- Parameters
- row: row of pd.DataFrame, columns [‘header’]
- Returns
- pd.Series
Examples
>>> import pandas as pd
>>> data = pd.read_pickle('./tutorial/data/emails_anonymized.pickle')
>>> # data contains a 'header' column
>>> from melusine.prepare_email.manage_transfer_reply import add_boolean_transfer
>>> add_boolean_transfer(data.iloc[0])  # apply for 1 sample
>>> data.apply(add_boolean_transfer, axis=1)  # apply to all samples
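As a rough illustration of the two header checks above, the sketch below flags reply and forward subjects with `re.match`, which only matches at the start of the string. The two patterns, the helper names and the plain-dict rows are assumptions made for this example; melusine reads its own regexes from its configuration, and this is not its implementation.

```python
import re

# Hypothetical prefixes for illustration only; not the patterns shipped with melusine.
ANSWER_SUBJECT = re.compile(r"re\s*:", re.IGNORECASE)
TRANSFER_SUBJECT = re.compile(r"(tr|fwd?)\s*:", re.IGNORECASE)


def header_is_answer(row):
    """Return True if row['header'] starts with a reply prefix."""
    return bool(ANSWER_SUBJECT.match(row["header"]))


def header_is_transfer(row):
    """Return True if row['header'] starts with a forward prefix."""
    return bool(TRANSFER_SUBJECT.match(row["header"]))


# Plain dicts stand in for DataFrame rows in this sketch.
rows = [
    {"header": "RE: contract renewal"},
    {"header": "Tr : invoice"},
    {"header": "Request for information"},
]
flags = [(header_is_answer(r), header_is_transfer(r)) for r in rows]
```

Because both helpers take a single row, they could be passed to `apply(func, axis=1)` in the same way as the documented functions.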
- melusine.prepare_email.manage_transfer_reply.check_mail_begin_by_transfer(row)
Computes a boolean Series that is True if the “body” starts with the ‘begin_transfer’ regex, and False otherwise.
To be used with methods such as: apply(func, axis=1) or apply_by_multiprocessing(func, axis=1, **kwargs).
- Parameters
- row: row of pd.DataFrame, columns [‘body’]
- Returns
- pd.Series
Examples
>>> import pandas as pd
>>> data = pd.read_pickle('./tutorial/data/emails_anonymized.pickle')
>>> # data contains a 'body' column
>>> from melusine.prepare_email.manage_transfer_reply import check_mail_begin_by_transfer
>>> check_mail_begin_by_transfer(data.iloc[0])  # apply for 1 sample
>>> data.apply(check_mail_begin_by_transfer, axis=1)  # apply to all samples
- melusine.prepare_email.manage_transfer_reply.update_info_for_transfer_mail(row)
Extracts and updates information from forwarded emails, such as: body, from, to, header, date.
- It replaces the header with the initial subject (extracted from the forwarded email).
- It removes the header from the email body.
To be used with methods such as: apply(func, axis=1) or apply_by_multiprocessing(func, axis=1, **kwargs).
- Parameters
- row: row of pd.DataFrame, columns [‘body’, ‘header’, ‘from’, ‘to’, ‘date’, ‘is_begin_by_transfer’]
- Returns
- pd.DataFrame
Examples
>>> import pandas as pd
>>> from melusine.prepare_email.manage_transfer_reply import check_mail_begin_by_transfer
>>> data = pd.read_pickle('./tutorial/data/emails_anonymized.pickle')
>>> data['is_begin_by_transfer'] = data.apply(check_mail_begin_by_transfer, axis=1)
>>> # data contains columns ['from', 'to', 'date', 'header', 'body', 'is_begin_by_transfer']
>>> from melusine.prepare_email.manage_transfer_reply import update_info_for_transfer_mail
>>> update_info_for_transfer_mail(data.iloc[0])  # apply for 1 sample
>>> data.apply(update_info_for_transfer_mail, axis=1)  # apply to all samples
Cleaning melusine.prepare_email.cleaning
Cleaning of the body and the header
- melusine.prepare_email.cleaning.clean_body(row, flags=True)
Clean the body column. The cleaning involves the following operations:
- Cleaning the text
- Removing multiple spaces
- Flagging specific items (postal code, phone number, date…)
- Parameters
- row: row of pandas.DataFrame object
Data contains ‘last_body’ column.
- flags: boolean, optional
True to flag relevant information, False otherwise. Default: True.
- Returns
- row of pandas.DataFrame object, or pandas.Series if applied to the whole DataFrame.
- melusine.prepare_email.cleaning.clean_header(row, flags=True)
Clean the header column. The cleaning involves the following operations:
- Removing the transfer and answer indicators
- Cleaning the text
- Flagging specific items (postal code, phone number, date…)
- Parameters
- row: row of pandas.DataFrame object
Data contains ‘header’ column.
- flags: boolean, optional
True to flag relevant information, False otherwise. Default: True.
- Returns
- row of pandas.DataFrame object, or pandas.Series if applied to the whole DataFrame.
- melusine.prepare_email.cleaning.clean_text(text)
Clean a string. The cleaning involves the following operations:
- Converting all letters to lowercase
- Removing all accents
- Removing all line breaks
- Removing all symbols and punctuation
- Removing multiple spaces
- Parameters
- text: str
- Returns
- str
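The five operations listed for clean_text can be sketched with the standard library alone. This is only an approximation under stated assumptions: the function name `clean_text_sketch`, the order of the steps and the exact character classes are choices made for this example, not melusine's implementation.

```python
import re
import unicodedata


def clean_text_sketch(text):
    """Apply the five documented cleaning steps, in an assumed order."""
    # 1. Lowercase.
    text = text.lower()
    # 2. Remove accents: decompose characters, then drop non-ASCII code points.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # 3. Replace line breaks with spaces.
    text = re.sub(r"[\r\n]+", " ", text)
    # 4. Replace symbols and punctuation (anything but word characters and whitespace).
    text = re.sub(r"[^\w\s]", " ", text)
    # 5. Collapse runs of whitespace and strip both ends.
    return re.sub(r"\s+", " ", text).strip()


cleaned = clean_text_sketch("Bonjour,\nJ'ai  déménagé à   Paris !")
# cleaned == "bonjour j ai demenage a paris"
```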
- melusine.prepare_email.cleaning.flag_items(text, flags=True)
Flag relevant information, e.g. amount, phone number, email address, postal code (5 digits).
- Parameters
- text: str
Body content.
- flags: boolean, optional
True to flag relevant information, False otherwise. Default: True.
- Returns
- str
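One way to picture flagging is as a sequence of regex substitutions that replace each matched item with a placeholder token. In the sketch below, the three patterns and the token names (`flag_mail_`, `flag_phone_`, `flag_cp_`) are hypothetical and chosen for this example; the patterns and tokens that melusine actually uses come from its configuration and may differ.

```python
import re

# Hypothetical patterns and tokens. Order matters: phone numbers are replaced
# before the 5-digit postal code pattern runs.
FLAG_PATTERNS = [
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"), " flag_mail_ "),
    (re.compile(r"\b0\d(?:[ .]?\d{2}){4}\b"), " flag_phone_ "),
    (re.compile(r"\b\d{5}\b"), " flag_cp_ "),
]


def flag_items_sketch(text, flags=True):
    """Replace emails, phone numbers and postal codes with placeholder tokens."""
    if not flags:
        return text
    for pattern, token in FLAG_PATTERNS:
        text = pattern.sub(token, text)
    # Tidy up the extra spaces introduced by the tokens.
    return re.sub(r"\s+", " ", text).strip()


flagged = flag_items_sketch(
    "call 01 23 45 67 89 or write to jean@example.com, 75001 paris"
)
# flagged == "call flag_phone_ or write to flag_mail_ , flag_cp_ paris"
```

With `flags=False` the text is returned unchanged, mirroring the documented optional parameter.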
- melusine.prepare_email.cleaning.remove_accents(text, use_unidecode=False)
Remove accents from text. Using unidecode is more powerful but much slower. Example: the ligature ‘æ’ is converted to ‘a’ + ‘e’ by unidecode, while it is dropped by unicodedata.
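The unicodedata behaviour described above can be reproduced with a short standard-library sketch. It assumes the usual "NFKD-normalize, then drop non-ASCII" approach; the function name is hypothetical and the library's own code may differ. The unidecode branch is not shown because it needs a third-party package.

```python
import unicodedata


def remove_accents_sketch(text):
    """Strip accents by decomposing characters and discarding non-ASCII code points."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


result = remove_accents_sketch("déjà vu, cæsar")
# 'é' and 'à' lose their accents; 'æ' has no decomposition and is dropped entirely.
# result == "deja vu, csar"
```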
- melusine.prepare_email.cleaning.remove_multiple_spaces_and_strip_text(text)
Remove multiple spaces, strip text, and remove ‘-’, ‘*’ characters.
- Parameters
- text: str
Header content.
- Returns
- str
- melusine.prepare_email.cleaning.remove_superior_symbol(text)
Remove greater-than (‘>’) and less-than (‘<’) symbols from text.
Build Email Historic melusine.prepare_email.build_historic
- melusine.prepare_email.build_historic.build_historic(row)
Rebuilds and structures the email history from the whole body content. The function must be applied with the apply method of a DataFrame along axis=1. For each email in the history, it segments the body into two parts (two dict keys):
- ‘text’: the raw text, without metadata
- ‘meta’: the transition matched from the ‘transition_list’ defined in conf.json
- Parameters
- row: row
A pandas.DataFrame row object with ‘body’ column.
- Returns
- list
Examples
>>> import pandas as pd
>>> data = pd.read_pickle('./tutorial/data/emails_anonymized.pickle')
>>> # data contains a 'body' column
>>> from melusine.prepare_email.build_historic import build_historic
>>> build_historic(data.iloc[0])  # apply for 1 sample
>>> data.apply(build_historic, axis=1)  # apply to all samples
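To make the ‘text’ / ‘meta’ segmentation concrete, here is a minimal sketch that splits a body on transition lines with `re.split`. The transition pattern is a hypothetical stand-in for the ‘transition_list’ from conf.json, and using `None` as the meta of the most recent message is an assumption of this example rather than documented behaviour.

```python
import re

# Hypothetical transition pattern; melusine reads its own list from conf.json.
TRANSITION = re.compile(r"^(On .+ wrote:|From: .+)$", re.MULTILINE)


def build_historic_sketch(body):
    """Split a body into a list of {'text', 'meta'} dicts, most recent first."""
    # With one capturing group, re.split returns [text0, meta1, text1, meta2, text2, ...].
    parts = TRANSITION.split(body)
    history = [{"text": parts[0].strip(), "meta": None}]
    for meta, text in zip(parts[1::2], parts[2::2]):
        history.append({"text": text.strip(), "meta": meta})
    return history


body = (
    "Thanks, that works.\n"
    "On Monday, Alice wrote:\n"
    "Please send the form.\n"
    "From: Bob\n"
    "Here is the form."
)
history = build_historic_sketch(body)
```

Each transition line becomes the ‘meta’ of the message that follows it, so `history` holds three entries for the body above.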
Email Segmenting melusine.prepare_email.mail_segmenting
- melusine.prepare_email.mail_segmenting.split_message_to_sentences(text, sep_='(.*?[;.,?!])')
Split a text into sentences.
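The default `sep_` pattern captures the shortest run of characters ending in one of `; . , ? !`. A sketch of how such a pattern can drive the split with `re.split` follows; the function name and the stripping of empty pieces are assumptions of this example, not necessarily what the library does.

```python
import re


def split_sentences_sketch(text, sep_=r"(.*?[;.,?!])"):
    """Split text into chunks that each end at a punctuation mark."""
    # The capturing group makes re.split keep the matched chunks in the output.
    pieces = re.split(sep_, text)
    # Drop the empty strings left between consecutive matches.
    return [piece.strip() for piece in pieces if piece.strip()]


sentences = split_sentences_sketch("Hello. How are you? Fine")
# sentences == ["Hello.", "How are you?", "Fine"]
```

Text after the last punctuation mark ("Fine" here) is kept as a final chunk.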
- melusine.prepare_email.mail_segmenting.structure_email(row)
1. Splits each message of the history into parts and tags them. For example, a tag can be hello, body, greetings, etc.
2. Extracts the meta information of each message.
To be used with methods such as: apply(func, axis=1) or apply_by_multiprocessing(func, axis=1, **kwargs).
- Parameters
- row: row of pd.DataFrame, apply on column [‘structured_historic’]
- Returns
- list of dicts: one dict per message
Examples
>>> import pandas as pd
>>> from melusine.prepare_email.build_historic import build_historic
>>> data = pd.read_pickle('./tutorial/data/emails_anonymized.pickle')
>>> data['structured_historic'] = data.apply(build_historic, axis=1)
>>> # data contains column ['structured_historic']
>>> from melusine.prepare_email.mail_segmenting import structure_email
>>> structure_email(data.iloc[0])  # apply for 1 sample
>>> data.apply(structure_email, axis=1)  # apply to all samples
- melusine.prepare_email.mail_segmenting.structure_message(message)
Splits a message into parts and tags them. For example, a tag can be hello, body, greetings, etc. Also extracts the meta information of the message.
- Parameters
- message: dict
- Returns
- dict
- melusine.prepare_email.mail_segmenting.structure_meta(meta)
Extracts meta information (date, from, to, header) from the meta string.
- Parameters
- meta: str
- Returns
- tuple(dict, string)
- melusine.prepare_email.mail_segmenting.tag(string)
Tags a string.
- Parameters
- string: str
- Returns
- tuples: list of tuples and boolean
- melusine.prepare_email.mail_segmenting.tag_parts_message(text)
Splits a message into sentences, tags them, and merges consecutive sentences that share the same tag.
- Parameters
- text: str
- Returns
- list of tuples
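The merging step of tag_parts_message can be illustrated with `itertools.groupby`, which groups adjacent items sharing a key. The sketch assumes the input is already a list of (sentence, tag) tuples and joins merged sentences with a single space; both the helper name and the joining rule are assumptions of this example.

```python
from itertools import groupby


def merge_same_tag(tagged):
    """Merge consecutive (sentence, tag) tuples that carry the same tag."""
    merged = []
    # groupby only groups adjacent items, which is exactly the "in a row" rule.
    for tag, group in groupby(tagged, key=lambda pair: pair[1]):
        merged.append((" ".join(sentence for sentence, _ in group), tag))
    return merged


tagged = [
    ("Hello,", "HELLO"),
    ("I moved.", "BODY"),
    ("Please update my address.", "BODY"),
    ("Regards,", "GREETINGS"),
]
merged = merge_same_tag(tagged)
```

Here the two adjacent BODY sentences collapse into one part, while HELLO and GREETINGS stay as they are.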
- melusine.prepare_email.mail_segmenting.tag_sentence(sentence, default='BODY')
Tags a sentence. If the sentence cannot be tagged, its sub-sentences are tagged instead.
- Parameters
- sentence: str
- Returns
- list of tuples: (sentence, tag)
- melusine.prepare_email.mail_segmenting.tag_signature(row, token_threshold=5)
Function to be called after the mail_segmenting function, as it requires a “structured_body” column. This function detects parts of a message that qualify as “signature”. Examples of parts qualifying as signature are sender name, company name, phone number, etc.
The methodology to detect a signature is the following:
- Look for a THANKS or GREETINGS part indicating that the message is approaching the end
- Check the length of the following message parts currently tagged as “BODY” (the maximum number of words is specified through the variable “signature_token_threshold”)
- If ALL the “ending parts” contain few words, tag them as “SIGNATURE” parts
- Otherwise, cancel the signature tagging
- Parameters
- row: pd.Series
Row of an email DataFrame
- Returns
- structured_body: Updated structured body
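The signature methodology can be sketched on a simplified input. The example below works on a plain list of (text, tag) tuples rather than the real “structured_body” column, uses the first THANKS or GREETINGS part as the marker, and treats "few words" as "at most `token_threshold` words"; all three are assumptions of this sketch.

```python
def tag_signature_sketch(parts, token_threshold=5):
    """Retag short BODY parts that follow a THANKS/GREETINGS part as SIGNATURE."""
    closing = [i for i, (_, tag) in enumerate(parts) if tag in ("THANKS", "GREETINGS")]
    if not closing:
        return parts  # no closing formula: nothing to tag
    # BODY parts located after the first closing formula are the "ending parts".
    ending = [i for i in range(closing[0] + 1, len(parts)) if parts[i][1] == "BODY"]
    # Cancel the tagging unless ALL ending parts are short.
    if not ending or any(len(parts[i][0].split()) > token_threshold for i in ending):
        return parts
    updated = list(parts)
    for i in ending:
        updated[i] = (parts[i][0], "SIGNATURE")
    return updated


parts = [
    ("Hello", "HELLO"),
    ("Please update my address to the one given in the attached form.", "BODY"),
    ("Best regards", "GREETINGS"),
    ("Jane Doe", "BODY"),
    ("Acme Insurance", "BODY"),
]
tagged = tag_signature_sketch(parts)
```

The long BODY part before the greeting is untouched; only the two short parts after it become SIGNATURE.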
Process Email Metadata melusine.prepare_email.metadata_engineering
- class melusine.prepare_email.metadata_engineering.Dummifier(columns_to_dummify=['extension', 'dayofweek', 'hour', 'min', 'attachment_type'], copy=True)
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
Transformer that dummifies categorical features and list of . Compatible with the scikit-learn API.
- class melusine.prepare_email.metadata_engineering.MetaAttachmentType(topn_extension=100)
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
Transformer which creates ‘type’ feature extracted from regex in metadata. It extracts types of attached files.
Compatible with scikit-learn API.
- class melusine.prepare_email.metadata_engineering.MetaDate(regex_date_format='\\w+ (\\d+) (\\w+) (\\d{4}) (\\d{2}) h (\\d{2})', date_format='%d/%m/%Y %H:%M')
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
Transformer which creates new features from dates, such as:
- hour
- minute
- dayofweek
Compatible with scikit-learn API.
- Parameters
- regex_date_format: str, optional
Regex to extract date from text.
- date_format: str, optional
A date format.
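To show how the two default parameters could fit together, the sketch below extracts the date fields with `regex_date_format`, rebuilds a string in `date_format`, and derives the hour, minute and day-of-week features. The French month mapping, the sample date string and the function name are assumptions made for this example; this is not the transformer's actual code.

```python
import re
from datetime import datetime

REGEX_DATE_FORMAT = r"\w+ (\d+) (\w+) (\d{4}) (\d{2}) h (\d{2})"  # default from the signature
DATE_FORMAT = "%d/%m/%Y %H:%M"  # default from the signature

# Assumed month-name mapping (French), needed to turn the captured name into a number.
MONTHS = {
    "janvier": 1, "février": 2, "mars": 3, "avril": 4, "mai": 5, "juin": 6,
    "juillet": 7, "août": 8, "septembre": 9, "octobre": 10, "novembre": 11, "décembre": 12,
}


def date_features(raw_date):
    """Return hour, minute and day-of-week features, or None if no date is found."""
    match = re.search(REGEX_DATE_FORMAT, raw_date)
    if match is None:
        return None
    day, month_name, year, hour, minute = match.groups()
    normalized = f"{int(day):02d}/{MONTHS[month_name.lower()]:02d}/{year} {hour}:{minute}"
    parsed = datetime.strptime(normalized, DATE_FORMAT)
    # weekday() counts from Monday = 0.
    return {"hour": parsed.hour, "min": parsed.minute, "dayofweek": parsed.weekday()}


features = date_features("jeudi 24 mai 2018 11 h 27")
# features == {"hour": 11, "min": 27, "dayofweek": 3}
```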
- class melusine.prepare_email.metadata_engineering.MetaExtension(topn_extension=100)
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
Transformer which creates ‘extension’ feature extracted from regex in metadata. It extracts the extension of email addresses.
Compatible with scikit-learn API.
Extract Email Body & Header melusine.prepare_email.body_header_extraction
- melusine.prepare_email.body_header_extraction.extract_body(message_dict)
Extracts the body from a message dictionary.
- Parameters
- message_dict: dict
- Returns
- str
- melusine.prepare_email.body_header_extraction.extract_header(message_dict)
Extracts the header from a message dictionary.
- Parameters
- message_dict: dict
- Returns
- str
- melusine.prepare_email.body_header_extraction.extract_last_body(row)
Extracts the body from the last message of the conversation. The conversation is structured as a dictionary.
To be used with methods such as: apply(func, axis=1) or apply_by_multiprocessing(func, axis=1, **kwargs).
- Parameters
- row: row of pd.DataFrame containing the structured conversation
- Returns
- str
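The three extraction helpers can be pictured on a toy conversation. The layout used below (a ‘structured_body’ list with the most recent message first, and each message holding a ‘structured_text’ dict with a ‘header’ and a list of tagged parts) is an assumption made for this sketch, as are the function names; check the actual output of structure_email before relying on it.

```python
def extract_body_sketch(message_dict):
    """Join the parts of one message that are tagged BODY."""
    parts = message_dict["structured_text"]["text"]
    return " ".join(part["part"] for part in parts if part["tags"] == "BODY")


def extract_header_sketch(message_dict):
    """Return the header stored with one message."""
    return message_dict["structured_text"]["header"]


def extract_last_body_sketch(row):
    """Return the body of the most recent message (assumed to be at index 0)."""
    return extract_body_sketch(row["structured_body"][0])


# A plain dict stands in for a DataFrame row in this sketch.
row = {
    "structured_body": [
        {
            "structured_text": {
                "header": "Address change",
                "text": [
                    {"part": "Hello,", "tags": "HELLO"},
                    {"part": "I moved last week.", "tags": "BODY"},
                    {"part": "Please update my file.", "tags": "BODY"},
                    {"part": "Regards,", "tags": "GREETINGS"},
                ],
            }
        },
        {
            "structured_text": {
                "header": "Your contract",
                "text": [{"part": "Here is your contract.", "tags": "BODY"}],
            }
        },
    ]
}
last_body = extract_last_body_sketch(row)
header = extract_header_sketch(row["structured_body"][0])
```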