GitHub - HotProtato/EnronEmailParser: A parser that converts the messy Enron email dataset into structured parquet files.

This project parses the Enron email dataset, available here: https://www.cs.cmu.edu/~enron/

My approach differs by:

Converting each file into a canonical format (timezones are a potential source of difference) to de-duplicate emails and allow for efficient parsing.
Keeping a separate cache specifically for hashed email messages.
Separating parent and child emails, and parsing both.
Matching users logically by name, email address, and any other alias format, with further matching during postprocessing.
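
The de-duplication idea above can be sketched as follows. This is a minimal illustration, not the code in this repository: the canonicalization rules (normalizing line endings, stripping `>` quote markers, collapsing whitespace, lowercasing) and the names `canonicalize`, `email_hash`, `seen`, and `is_duplicate` are assumptions chosen for the example. Only the use of an MD5 hash over canonicalized content comes from the schema described below.

```python
import hashlib
import re


def canonicalize(raw: str) -> str:
    """Reduce an email body to a canonical form (illustrative rules only)."""
    # Normalize Windows and old Mac line endings.
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    # Drop leading quote markers such as "> " added by replies and forwards.
    lines = [re.sub(r"^\s*(>\s*)+", "", line) for line in text.split("\n")]
    # Collapse every run of whitespace and lowercase the result.
    return " ".join(" ".join(lines).split()).lower()


def email_hash(raw: str) -> str:
    """MD5 hex digest of the canonicalized content."""
    return hashlib.md5(canonicalize(raw).encode("utf-8")).hexdigest()


# Hypothetical in-memory cache: hash -> first file path that produced it.
seen: dict[str, str] = {}


def is_duplicate(path: str, raw: str) -> bool:
    """Return True if an equivalent message was already seen; otherwise record it."""
    digest = email_hash(raw)
    if digest in seen:
        return True
    seen[digest] = path
    return False
```

With rules like these, a quoted copy of a message inside a reply hashes to the same value as the standalone original, so each message only needs to be parsed once.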

Instructions

By default, the "maildir" folder is expected inside an "input" folder that is a sibling of the "src" folder.

All output files are generated in an "output" folder, which by default is also a sibling of the "src" folder.

Configure the input and output location variables in main.py as needed.
Run main.py (NOTE: A number of errors are expected, as not every file can be parsed. These files make up a tiny portion of the dataset.)
If applicable, configure the location variables in postprocessing_pipeline.py, including the desired output locations.
Run postprocessing_pipeline.py.
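
As a rough illustration of the default layout described above, the paths can be derived from the location of the "src" folder. The function name `default_locations` is hypothetical; the actual variable names in main.py and postprocessing_pipeline.py may differ.

```python
from pathlib import Path


def default_locations(src_dir: Path) -> tuple[Path, Path]:
    """Return (input_dir, output_dir) for the default layout.

    Assumes "input" and "output" are siblings of the "src" folder and that
    the dataset's "maildir" folder sits inside "input".
    """
    project_root = src_dir.parent
    return project_root / "input" / "maildir", project_root / "output"


# Example: resolve the defaults relative to a script that lives in src/.
input_dir, output_dir = default_locations(Path("project") / "src")
```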

Limitations & Contributing

To my knowledge, approximately 61 users have an alias consisting of only one character; these make up a tiny percentage.
A little over 4,000 emails (approx. 2.1% of the data) have a value of -1 in the "sender_id" field. I may address this in the future; otherwise, contributions by way of a pull request are welcome.
The timezone of a child email is assumed to match that of its parent. If time data is important, you can map the timezone of each email to every user in its group, impute each user's timezone as the mode of those values, and associate it with the "sender_id" user.
I opted not to distinguish CC from To, as users are inherently inconsistent in how they use these fields. Moreover, each group also contains the sender's id.
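
The timezone imputation suggested above could look roughly like this. It is a sketch under assumptions: it presumes you have a timezone offset for each parent email (for example, taken from the original Date header, since the output tables do not include a separate timezone column), and the function name and input shapes are made up for the example.

```python
from collections import Counter, defaultdict


def mode_timezone_per_user(emails, groups):
    """Impute one timezone per user as the most common offset they appear with.

    emails: iterable of (group_id, tz_offset) pairs, one per email.
    groups: mapping of group_id -> list of user_ids in that group.
    """
    counts = defaultdict(Counter)
    for group_id, tz_offset in emails:
        # Credit the email's timezone to every user in its group.
        for user_id in groups.get(group_id, []):
            counts[user_id][tz_offset] += 1
    # The mode is the single most common offset seen for each user.
    return {user_id: c.most_common(1)[0][0] for user_id, c in counts.items()}


groups = {1: [10, 11], 2: [10, 12]}
emails = [(1, "-0800"), (1, "-0800"), (2, "-0600")]
user_tz = mode_timezone_per_user(emails, groups)
```

A child email's timestamp could then be interpreted using `user_tz[sender_id]` instead of the parent's timezone. Emails with a "sender_id" of -1 have no user to look up and would keep the parent's timezone.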

Output Data Schema

This section details the schema of the Parquet files generated by the pipeline, located in the output/ directory.

`user_table_updated.parquet`

This table contains the deduplicated and merged user profiles.

| Column Name | Data Type | Description |
| --- | --- | --- |
| `user_id` | `int` | A unique integer identifier for each user profile. |
| `first_name` | `string` | The user's parsed first name (can be empty if not found). |
| `last_name` | `string` | The user's parsed last name (can be empty if not found). |
| `generated_aliases` | `list[string]` | Aliases automatically generated based on the user's first and last name. Used primarily for matching. |
| `aliases` | `list[string]` | All known email aliases associated with the user, including generated ones and those extracted from emails. |
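
To give a feel for what `generated_aliases` might contain, here is a hedged sketch of name-based alias generation. The specific patterns and the function name `generate_aliases` are assumptions for illustration; the patterns the parser actually generates may differ.

```python
def generate_aliases(first_name: str, last_name: str) -> list[str]:
    """Build candidate aliases from a first and last name (illustrative patterns)."""
    first, last = first_name.strip().lower(), last_name.strip().lower()
    if not first or not last:
        # Without both parts there is nothing reliable to generate.
        return []
    candidates = [
        f"{first}.{last}",     # john.smith
        f"{first} {last}",     # john smith
        f"{last}, {first}",    # smith, john
        f"{last}-{first[0]}",  # smith-j
        f"{first[0]}{last}",   # jsmith
    ]
    # Remove duplicates while keeping the original order.
    return list(dict.fromkeys(candidates))
```

Aliases like these can be compared against the names and addresses extracted from email headers to decide whether two records refer to the same person.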

`groups_updated.parquet`

This table contains the consolidated groups of users who communicated together.

| Column Name | Data Type | Description |
| --- | --- | --- |
| `group_id` | `int` | A unique integer identifier for each communication group. |
| `user_ids` | `list[int]` | A list of `user_id`s belonging to this group. |
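
Because groups do not distinguish CC from To and also include the sender, a group can be thought of as an unordered set of user ids. The sketch below shows one way such a set could be mapped to a stable `group_id`; the registry `group_ids` and the function `get_group_id` are hypothetical and not taken from the repository.

```python
# Hypothetical registry: sorted tuple of user ids -> group id.
group_ids: dict[tuple[int, ...], int] = {}


def get_group_id(sender_id: int, recipient_ids: list[int]) -> int:
    """Return the id for the set of participants, creating one if needed."""
    # Sender and recipients are pooled, de-duplicated, and sorted so that
    # the same participants always produce the same key.
    key = tuple(sorted({sender_id, *recipient_ids}))
    return group_ids.setdefault(key, len(group_ids))
```

Under this scheme, an email from user 1 to users 2 and 3 lands in the same group as a reply from user 3 to users 1 and 2.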

`final_email_table.parquet`

This is the main table containing processed email metadata, including sender, subject, and parent relationships.

| Column Name | Data Type | Description |
| --- | --- | --- |
| `email_hash` | `string` | The MD5 hash of the canonicalized email content, serving as a unique ID. |
| `group_id` | `int` | The `group_id` associated with this email's communication context. |
| `subject` | `string` | The subject line of the email. |
| `date` | `datetime` | The original timestamp of the email. |
| `norm_date` | `datetime` | The date adjusted to 12:00 UTC. |
| `sender_id` | `int` | The `user_id` of the sender of this email. |
| `parent_hash` | `string` (nullable) | The `email_hash` of the parent email in a conversation thread, if applicable. Null if no parent. |
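
The `parent_hash` column makes it possible to walk a conversation thread from any email back to its root. A minimal sketch, assuming the `email_hash` and `parent_hash` columns have been loaded into a plain dictionary (the function name `thread_chain` is illustrative):

```python
def thread_chain(email_hash, parent_of):
    """Follow parent_hash links from an email up to the root of its thread.

    parent_of: mapping of email_hash -> parent_hash (None when there is no parent).
    Returns the hashes from the starting email to the root, in that order.
    """
    chain = [email_hash]
    visited = {email_hash}
    # Stop at the root, at a parent missing from the table, or on a cycle.
    while (parent := parent_of.get(chain[-1])) is not None and parent not in visited:
        chain.append(parent)
        visited.add(parent)
    return chain


parent_of = {"c": "b", "b": "a", "a": None}
chain = thread_chain("c", parent_of)
```

The `visited` set is a defensive guard so that malformed data cannot cause an infinite loop.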

`email_user_junction.parquet`

This is a junction table linking emails to the users involved in their communication (sender and recipients within a group).

| Column Name | Data Type | Description |
| --- | --- | --- |
| `email_hash` | `string` | The `email_hash` of the email. |
| `user_id` | `int` | The `user_id` of a user associated with this email. |
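
As an example of how this junction table might be used, the sketch below builds a per-user index of emails and finds the emails two users have in common. It assumes the rows have already been read into `(email_hash, user_id)` pairs with the Parquet reader of your choice; the function names are made up for the example.

```python
from collections import defaultdict


def emails_by_user(junction_rows):
    """Index junction rows as user_id -> set of email hashes."""
    index = defaultdict(set)
    for email_hash, user_id in junction_rows:
        index[user_id].add(email_hash)
    return dict(index)


def shared_emails(index, user_a, user_b):
    """Return the hashes of emails that involve both users."""
    return index.get(user_a, set()) & index.get(user_b, set())


rows = [("h1", 1), ("h1", 2), ("h2", 1), ("h2", 3)]
index = emails_by_user(rows)
common = shared_emails(index, 1, 2)
```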

`email_group_junction.parquet`

This is a junction table linking emails to the communication groups they belong to.

| Column Name | Data Type | Description |
| --- | --- | --- |
| `email_hash` | `string` | The `email_hash` of the email. |
| `group_id` | `int` | The `group_id` of the communication group this email belongs to. |

