This project is an attempt at parsing the Enron email dataset, available at https://www.cs.cmu.edu/~enron/
My approach differs by:
- Converting each file into a canonical format (timezones are a potential source of differences) to de-duplicate and allow for efficient parsing.
- Keeping a separate cache specifically for hashed email messages.
- Separating parent and child emails, and parsing both.
- Matching users logically by name, email address, and any other alias format, followed by a postprocessing matching step.
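The canonicalize-then-hash idea above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names (`canonical_form`, `email_hash`, `deduplicate`) and the specific normalization rules (lowercased sender, collapsed whitespace, dates converted to UTC) are assumptions made for the example. Only the use of an MD5 hash over canonicalized content comes from the schema described in this document.

```python
import hashlib
from datetime import timezone
from email import message_from_string
from email.utils import parsedate_to_datetime


def canonical_form(raw: str) -> str:
    """Reduce a raw message to a normalized string (hypothetical rules)."""
    msg = message_from_string(raw)
    sender = (msg.get("From") or "").strip().lower()
    subject = " ".join((msg.get("Subject") or "").split())
    try:
        sent = parsedate_to_datetime(msg.get("Date"))
        if sent.tzinfo is None:
            # Assumption: treat dates without an offset as UTC.
            sent = sent.replace(tzinfo=timezone.utc)
        date = sent.astimezone(timezone.utc).isoformat()
    except (TypeError, ValueError):
        date = ""  # missing or unparseable Date header
    payload = msg.get_payload()
    # Assumption: only plain (non-multipart) bodies are considered here.
    body = " ".join(payload.split()) if isinstance(payload, str) else ""
    return "\n".join([sender, date, subject, body])


def email_hash(raw: str) -> str:
    """MD5 hex digest of the canonical form, usable as a cache key."""
    return hashlib.md5(canonical_form(raw).encode("utf-8")).hexdigest()


def deduplicate(raw_messages):
    """Keep the first raw message seen for each canonical hash."""
    unique = {}
    for raw in raw_messages:
        unique.setdefault(email_hash(raw), raw)
    return unique
```

With rules like these, two copies of a message that differ only in whitespace or in how the timezone offset is written collapse to one entry.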
By default, the "maildir" folder is expected inside an "input" folder that is a sibling of the "src" folder.
All output-related files are generated in an "output" folder, which is also a sibling of the "src" folder by default.
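As a rough sketch of that default layout, the paths can be derived from the location of the "src" folder. The helper name `default_paths` and its dictionary keys are hypothetical; the actual variables in main.py may be named and structured differently.

```python
from pathlib import Path


def default_paths(src_dir: Path) -> dict:
    """Derive the default input/output folders from the "src" folder.

    Hypothetical helper: it only mirrors the sibling-folder layout
    described in the text above.
    """
    root = src_dir.parent
    return {
        "maildir": root / "input" / "maildir",  # raw Enron maildir
        "output": root / "output",              # generated files
    }


# Example: resolve the folders relative to a "src" directory.
paths = default_paths(Path("project") / "src")
print(paths["maildir"])
print(paths["output"])
```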
- Configure the input and output location variables in main.py at your leisure.
- Run main.py (NOTE: A number of errors are expected, as not every file can be parsed. These make up a tiny portion of all files.)
- If applicable, configure the location variables within postprocessing_pipeline.py, including the desired output locations.
- Run postprocessing_pipeline.py.
- To my knowledge, approximately 61 users have an alias of only one character; this is a tiny percentage.
- A little over 4,000 email items (approx. 2.1% of the data) have a -1 value for the "sender_id" field. I may address this in the future; otherwise, all are welcome to contribute by way of a pull request.
- The timezone of a child email is assumed to follow its parent's timezone. If time data is important, you can map the timezone of each email to each user in the group, impute each user's timezone with their mode timezone, and associate it with the "sender_id" user.
- I opted not to distinguish CC and To, as users are generally inconsistent in how they use these fields. Moreover, groups also contain the sender's id.
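The timezone imputation suggested in the notes above could look roughly like this. It is a sketch under stated assumptions, not part of the pipeline: the function names, the `(user_id, utc_offset)` input shape, and the use of offset strings such as "-0700" are all hypothetical choices for illustration.

```python
from collections import Counter, defaultdict


def mode_timezone_per_user(observations):
    """Return the most frequent timezone seen for each user.

    `observations` is an iterable of (user_id, utc_offset) pairs,
    where utc_offset is a string such as "-0700" or None if unknown.
    """
    counts = defaultdict(Counter)
    for user_id, utc_offset in observations:
        if utc_offset is not None:
            counts[user_id][utc_offset] += 1
    return {uid: c.most_common(1)[0][0] for uid, c in counts.items()}


def impute_timezone(sender_id, known_offset, user_modes):
    """Prefer the email's own offset; otherwise use the sender's mode."""
    if known_offset is not None:
        return known_offset
    return user_modes.get(sender_id)  # None if the user was never observed
```

A child email with no reliable offset would then take the mode offset of its "sender_id" user, instead of inheriting the parent's timezone.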
This section details the schema of the Parquet files generated by the pipeline, located in the output/ directory.
This table contains the deduplicated and merged user profiles.
| Column Name | Data Type | Description |
|---|---|---|
| user_id | int | A unique integer identifier for each user profile. |
| first_name | string | The user's parsed first name (can be empty if not found). |
| last_name | string | The user's parsed last name (can be empty if not found). |
| generated_aliases | list[string] | Aliases automatically generated based on the user's first and last name. Used primarily for matching. |
| aliases | list[string] | All known email aliases associated with the user, including generated ones and those extracted from emails. |
This table contains the consolidated groups of users who communicated together.
| Column Name | Data Type | Description |
|---|---|---|
| group_id | int | A unique integer identifier for each communication group. |
| user_ids | list[int] | A list of user_ids belonging to this group. |
This is the main table containing processed email metadata, including sender, subject, and parent relationships.
| Column Name | Data Type | Description |
|---|---|---|
| email_hash | string | The MD5 hash of the canonicalized email content, serving as a unique ID. |
| group_id | int | The group_id associated with this email's communication context. |
| subject | string | The subject line of the email. |
| date | datetime | The original timestamp of the email. |
| norm_date | datetime | The date adjusted to 12:00 UTC. |
| sender_id | int | The user_id of the sender of this email. |
| parent_hash | string (nullable) | The email_hash of the parent email in a conversation thread, if applicable. Null if no parent. |
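Because each email row points to its parent through parent_hash, a conversation thread can be reconstructed by following those links. The sketch below assumes the email_hash and parent_hash columns have already been loaded into a plain dictionary; the function name `thread_root` and that in-memory representation are hypothetical.

```python
def thread_root(email_hash, parent_of):
    """Follow parent_hash links upward and return the top-most hash.

    `parent_of` maps email_hash -> parent_hash (None when there is no
    parent). If a parent hash has no row of its own, that hash is
    returned as the root. A visited set guards against accidental cycles.
    """
    seen = {email_hash}
    current = email_hash
    while True:
        parent = parent_of.get(current)
        if parent is None or parent in seen:
            return current
        seen.add(parent)
        current = parent


# Example with placeholder hashes: "c" replies to "b", which replies to "a".
parent_of = {"a": None, "b": "a", "c": "b"}
print(thread_root("c", parent_of))
```

Grouping emails by their root hash gives one bucket per conversation thread.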
This is a junction table linking emails to the users involved in their communication (sender and recipients within a group).
| Column Name | Data Type | Description |
|---|---|---|
| email_hash | string | The email_hash of the email. |
| user_id | int | The user_id of a user associated with this email. |
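A junction table like this one is easiest to query once it is indexed in both directions. The following sketch assumes the rows have been read into an iterable of (email_hash, user_id) pairs; the function name `index_junction` and the sample values are made up for illustration.

```python
from collections import defaultdict


def index_junction(rows):
    """Build lookups in both directions from (email_hash, user_id) rows."""
    users_by_email = defaultdict(set)
    emails_by_user = defaultdict(set)
    for email_hash, user_id in rows:
        users_by_email[email_hash].add(user_id)
        emails_by_user[user_id].add(email_hash)
    return users_by_email, emails_by_user


# Placeholder rows: two emails, three users.
rows = [("h1", 10), ("h1", 11), ("h2", 10), ("h2", 12)]
users_by_email, emails_by_user = index_junction(rows)
print(sorted(users_by_email["h1"]))  # users involved in email "h1"
print(sorted(emails_by_user[10]))    # emails involving user 10
```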
This is a junction table linking emails to the communication groups they belong to.
| Column Name | Data Type | Description |
|---|---|---|
| email_hash | string | The email_hash of the email. |
| group_id | int | The group_id of the communication group this email belongs to. |