Mining email data for fun and profit

TL;DR: EmailEngine can mirror email accounts to ElasticSearch in near real time, which makes it possible to run data analytics queries against all the mirrored emails.

One of the coolest features of EmailEngine (v2.21.0 and newer) is the ability to mirror all emails from an IMAP account to ElasticSearch. That is, whenever a new email arrives in the email account, it gets copied to ElasticSearch, and once that email is deleted from the server, it also disappears from ElasticSearch. Flag changes (seen/unseen etc.) get synced as well.

Emails are mirrored using something you could jokingly call eventual consistency. By eventual, I mean anything from 1 second to 15 minutes. By consistency, I mean best effort. In practice, you can expect the mirrored data to reflect the original quite accurately.

Install the latest version of EmailEngine and, once it is running, open its web interface in a browser to access the dashboard.

By default, support for ElasticSearch mirroring is not enabled, so you have to activate it manually. Navigate to Configuration -> Service and then scroll down to the Labs section. Mark the Allow to use Document Store checkbox and save the settings.

Now you should have a new configuration section available called Document Store. Click on the link to get to the ElasticSearch settings.

Next, you can provide the ElasticSearch API URL and, if needed, username and password for basic authentication. Also, mark the Enable syncing to the Document Store checkbox.

Before continuing, you should check whether EmailEngine can connect to your ElasticSearch instance. There's an action link for verifying the configuration settings: click on the hamburger menu and select Test connection.

There is no point continuing until you get a Connection successful response from that test.
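
If the test fails, it can help to check the same URL and credentials from outside EmailEngine. The following is a minimal sketch of such a check, not EmailEngine's own implementation. It assumes the ElasticSearch root endpoint is reachable at the URL you entered and that it answers with a JSON document containing a version key; the function names are made up for illustration.

```python
import base64
import json
import urllib.request


def build_ping_request(es_url, username=None, password=None):
    """Build a GET request for the ElasticSearch root endpoint.

    Basic authentication is added only when a username is given.
    """
    headers = {}
    if username is not None:
        credentials = f"{username}:{password or ''}".encode("utf-8")
        headers["Authorization"] = "Basic " + base64.b64encode(credentials).decode("ascii")
    return urllib.request.Request(es_url.rstrip("/") + "/", headers=headers)


def can_connect(es_url, username=None, password=None, timeout=5):
    """Return True if the server answers with what looks like cluster info."""
    request = build_ping_request(es_url, username, password)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            # Assumption: the root endpoint returns JSON with a "version" key.
            return "version" in json.load(response)
    except (OSError, ValueError):
        # Network errors, HTTP errors and invalid JSON all count as failure.
        return False
```

Calling can_connect with the API URL, username and password from the settings page should return True when the instance is reachable with those credentials.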

Once everything checks out, click on the Update settings button, and that's it. EmailEngine is now mirroring emails to ElasticSearch!

Add an email account to EmailEngine, and EmailEngine should start copying all emails from that account to ElasticSearch. If the account holds a lot of emails, the initial sync will take a long time because EmailEngine needs to process every email separately.

There are two options for accessing message data from ElasticSearch.

  1. One is to fetch messages using EmailEngine's regular API with the documentStore=true query argument set. For example: GET /v1/account/account_id/messages?path=INBOX&documentStore=true
  2. The other is to run queries against ElasticSearch directly.
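
As a rough illustration of the first option, here is a small Python sketch that builds and sends the request shown above. The base URL and the Bearer token header are assumptions about a typical local setup, not something this post prescribes, and the function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumption: EmailEngine is reachable at this address. Adjust to your setup.
BASE_URL = "http://127.0.0.1:3000"


def build_messages_request(account_id, path, token, base_url=BASE_URL):
    """Build a GET request that lists messages served from the Document Store."""
    query = urllib.parse.urlencode({"path": path, "documentStore": "true"})
    account = urllib.parse.quote(account_id, safe="")
    url = f"{base_url}/v1/account/{account}/messages?{query}"
    # Assumption: the access token is passed as a Bearer token.
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def list_messages(account_id, path, token, base_url=BASE_URL):
    """Send the request and return the decoded JSON response."""
    request = build_messages_request(account_id, path, token, base_url)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

The only difference from a regular message listing is the documentStore=true argument, which tells EmailEngine to answer from the mirrored data rather than from the mail server.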

Both options are valid. Next, I'll show some examples of running ElasticSearch queries against stored emails using Kibana.

This is what the stored emails look like in ElasticSearch.

One neat thing you can do is generate a donut chart about folder sizes of an email account.

  1. Open the visualizer in Kibana
  2. Search for account:account_id
  3. Select Donut chart
  4. Drag the path field to the "Slice by" area
  5. Drag the size field to the "Size by" area
  6. Click on the size field, select sum as the function, and select Bytes as the value format
  7. Click on the label settings button and select Show value as the values option

That's it. You get a donut chart showing which folder takes up the most space.
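
The same numbers can be fetched without Kibana by running an aggregation against ElasticSearch directly. The sketch below groups messages by the path field and sums the size field, which is what the donut chart does. The ElasticSearch URL and the index name are assumptions, and the query also assumes that account and path are mapped as keyword fields; check your own mapping before relying on it.

```python
import json
import urllib.request

ES_URL = "http://127.0.0.1:9200"  # assumption: local ElasticSearch instance
INDEX = "emailengine"             # assumption: index that holds the mirrored emails


def folder_size_query(account_id, max_folders=50):
    """Build a search body that sums message sizes per folder."""
    return {
        "size": 0,  # only the aggregation is needed, not the matching documents
        "query": {"term": {"account": account_id}},
        "aggs": {
            "folders": {
                "terms": {"field": "path", "size": max_folders},
                "aggs": {"total_bytes": {"sum": {"field": "size"}}},
            }
        },
    }


def folder_sizes(response):
    """Turn an aggregation response into a {folder path: total bytes} mapping."""
    buckets = response["aggregations"]["folders"]["buckets"]
    return {bucket["key"]: int(bucket["total_bytes"]["value"]) for bucket in buckets}


def run_search(body, es_url=ES_URL, index=INDEX):
    """POST a search body to the index and return the decoded JSON response."""
    request = urllib.request.Request(
        f"{es_url}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Passing the result of run_search(folder_size_query("account_id")) to folder_sizes gives one entry per folder, ready for printing or plotting.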

Well, that was easy

Do you want to find all emails that include PDF attachments? Say no more; just run the following query:

account:account_id and attachments:{ filename: "*.pdf" }

And you have your answer.
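
Outside Kibana, the same search can be written in ElasticSearch's query DSL. This is a sketch of one possible translation, assuming that attachments is mapped as a nested field (which the curly-brace syntax in the Kibana query suggests) and that account and attachments.filename support term and wildcard matching. The function name is made up for illustration.

```python
import json


def pdf_attachments_query(account_id):
    """Build a search body matching emails that have a *.pdf attachment."""
    return {
        "query": {
            "bool": {
                "filter": [
                    # Restrict the search to a single account.
                    {"term": {"account": account_id}},
                    # Assumption: attachments is a nested field, so each
                    # attachment object is matched on its own.
                    {
                        "nested": {
                            "path": "attachments",
                            "query": {
                                "wildcard": {
                                    "attachments.filename": {"value": "*.pdf"}
                                }
                            },
                        }
                    },
                ]
            }
        }
    }


# Serialize the body so it can be sent to the index's _search endpoint.
print(json.dumps(pdf_attachments_query("account_id"), indent=2))
```

Both clauses sit in a filter context because relevance scoring is not needed here; the question is only whether a message matches.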

The potential to use email accounts to dig for data like this is limitless.

Currently, the mirroring feature is in testing, and there is no firm ETA for when it will land in a public release, though probably sometime in June. Future updates might break the functionality, so use it for testing only and do not build anything for production on top of it.

Any thoughts? Join our Discord server.