HotProtato/EnronEmailParser


This project parses the Enron email dataset, available at https://www.cs.cmu.edu/~enron/

My approach differs by:

  1. Converting each file into a canonical format (timezones may still differ between copies) to de-duplicate and allow for efficient parsing.
  2. Keeping a separate cache specifically for hashed email messages.
  3. Separating parent and child emails, and parsing both.
  4. Matching users logically by name, email address, and any other alias format, followed by a postprocessing matching pass.
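As a rough illustration of the first two points, de-duplication by hashing a canonical form might look like the sketch below. The canonicalization rules here (normalized line endings, stripped trailing whitespace) and the function names are assumptions for illustration only; the project's actual canonical format may differ. MD5 is used because the schema below describes email_hash as an MD5 hash.

```python
import hashlib


def canonicalize(raw: str) -> str:
    """Hypothetical canonical form: unify newlines and drop trailing whitespace."""
    lines = raw.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    return "\n".join(line.rstrip() for line in lines).strip()


def email_hash(raw: str) -> str:
    """MD5 hex digest of the canonicalized text, used as a de-duplication key."""
    return hashlib.md5(canonicalize(raw).encode("utf-8")).hexdigest()


def deduplicate(messages):
    """Return a dict mapping hash -> canonical text, keeping the first copy seen."""
    cache = {}
    for raw in messages:
        cache.setdefault(email_hash(raw), canonicalize(raw))
    return cache
```

Two copies of a message that differ only in line endings or trailing spaces then share one cache entry.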

Instructions

By default, the "maildir" folder is expected inside an "input" folder that is a sibling of the "src" folder.

All output files are generated in an "output" folder, which by default is also a sibling of the "src" folder.

  1. Configure the input and output location variables in main.py if needed.
  2. Run main.py (NOTE: A number of errors are expected, as not every file can be parsed. These make up a tiny portion of all files.)
  3. If applicable, configure the output location variables in postprocessing_pipeline.py, along with the desired output locations.
  4. Run postprocessing_pipeline.py.
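The default layout described above can be sketched with pathlib. The function and key names are hypothetical and are not the variables used in main.py; only the folder names "src", "input", "maildir", and "output" come from this README.

```python
from pathlib import Path


def default_locations(src_dir: Path) -> dict:
    """Derive the default input/output folders as siblings of the src folder."""
    project_root = src_dir.parent  # "input" and "output" sit next to "src"
    return {
        "maildir": project_root / "input" / "maildir",
        "output": project_root / "output",
    }


if __name__ == "__main__":
    # Example with a hypothetical checkout location.
    for name, path in default_locations(Path("EnronEmailParser/src")).items():
        print(name, path)
```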

Limitations & Contributing

  1. To my knowledge, there are approximately 61 users with an alias of only one character; these form a tiny percentage of all users.
  2. A little over 4,000 emails (approx. 2.1% of the data) have a value of -1 in the "sender_id" field. I may address this in the future; otherwise, contributions by pull request are welcome.
  3. The timezone of a child email is assumed to follow its parent's timezone. If time data is important, you can map the timezone of each email to each user in the group, impute each user's timezone as their mode (most frequent) timezone, and associate it with the "sender_id" user.
  4. I opted not to distinguish CC and To, as users are generally inconsistent in how they use these fields. Moreover, each group also contains the sender's id.
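The timezone imputation suggested in item 3 could be sketched as follows. This is only one possible reading of that suggestion: the input shapes (pairs of group member ids and a UTC-offset string per email) and all function names are assumptions, not part of the pipeline.

```python
from collections import Counter, defaultdict


def mode_timezone_per_user(emails):
    """emails: iterable of (user_ids_in_group, utc_offset) pairs.

    Every user in an email's group gets one vote for that email's offset;
    each user's imputed timezone is the offset with the most votes.
    """
    votes = defaultdict(Counter)
    for user_ids, utc_offset in emails:
        for user_id in user_ids:
            votes[user_id][utc_offset] += 1
    return {uid: counts.most_common(1)[0][0] for uid, counts in votes.items()}


def sender_timezone(sender_id, user_timezones, fallback=None):
    """Look up the imputed timezone for an email's sender_id (-1 means unknown)."""
    return user_timezones.get(sender_id, fallback)
```

A child email could then take the imputed timezone of its sender instead of inheriting the parent's, falling back to the parent's timezone when the sender is unknown.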

Output Data Schema

This section details the schema of the Parquet files generated by the pipeline, located in the output/ directory.

user_table_updated.parquet

This table contains the deduplicated and merged user profiles.

| Column Name | Data Type | Description |
| --- | --- | --- |
| user_id | int | A unique integer identifier for each user profile. |
| first_name | string | The user's parsed first name (can be empty if not found). |
| last_name | string | The user's parsed last name (can be empty if not found). |
| generated_aliases | list[string] | Aliases automatically generated based on the user's first and last name. Used primarily for matching. |
| aliases | list[string] | All known email aliases associated with the user, including generated ones and those extracted from emails. |

groups_updated.parquet

This table contains the consolidated groups of users who communicated together.

| Column Name | Data Type | Description |
| --- | --- | --- |
| group_id | int | A unique integer identifier for each communication group. |
| user_ids | list[int] | A list of user_ids belonging to this group. |

final_email_table.parquet

This is the main table containing processed email metadata, including sender, subject, and parent relationships.

| Column Name | Data Type | Description |
| --- | --- | --- |
| email_hash | string | The MD5 hash of the canonicalized email content, serving as a unique ID. |
| group_id | int | The group_id associated with this email's communication context. |
| subject | string | The subject line of the email. |
| date | datetime | The original timestamp of the email. |
| norm_date | datetime | The date adjusted to 12:00 UTC. |
| sender_id | int | The user_id of the sender of this email. |
| parent_hash | string (nullable) | The email_hash of the parent email in a conversation thread, if applicable. Null if no parent. |
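Because parent_hash points at another row's email_hash, a conversation thread can be reconstructed by following those links upward. The sketch below assumes the two columns have already been loaded into a plain dict (email_hash to parent_hash, with None for a null parent); the function name is hypothetical.

```python
def thread_chain(email_hash, parent_of):
    """Return the chain of hashes from the given email up to its root.

    parent_of maps email_hash -> parent_hash (None when there is no parent).
    A visited set guards against accidental cycles in the data.
    """
    chain = []
    seen = set()
    current = email_hash
    while current is not None and current not in seen:
        chain.append(current)
        seen.add(current)
        current = parent_of.get(current)
    return chain
```

The last element of the returned list is the root of the thread, or the last hash reachable before a missing parent row.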

email_user_junction.parquet

This is a junction table linking emails to the users involved in their communication (sender and recipients within a group).

| Column Name | Data Type | Description |
| --- | --- | --- |
| email_hash | string | The email_hash of the email. |
| user_id | int | The user_id of a user associated with this email. |
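A junction table like this holds one row per (email, user) pair, so a common first step is to group it into a lookup. The sketch below assumes the rows have already been read into (email_hash, user_id) tuples by whatever Parquet reader you prefer; the function name is hypothetical.

```python
from collections import defaultdict


def users_by_email(junction_rows):
    """Group (email_hash, user_id) rows into a dict of email_hash -> set of user_ids."""
    grouped = defaultdict(set)
    for email_hash, user_id in junction_rows:
        grouped[email_hash].add(user_id)
    return dict(grouped)
```

The same pattern, with the tuple order swapped, gives every email a given user was involved in.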

email_group_junction.parquet

This is a junction table linking emails to the communication groups they belong to.

| Column Name | Data Type | Description |
| --- | --- | --- |
| email_hash | string | The email_hash of the email. |
| group_id | int | The group_id of the communication group this email belongs to. |

About

A parser that converts the messy Enron email dataset into structured parquet files.
