Temporal Evolution of Communities in the Enron Email Data Set

The collapse of Enron, a U.S. company named "America's Most Innovative Company" by "Fortune" for six consecutive years, led to one of the largest bankruptcy cases in U.S. history. In the course of the investigation, the Federal Energy Regulatory Commission (FERC) published a data set of approximately 1.5 million emails sent or received by Enron employees.

We analyzed the interaction behavior of Enron employees based on their email data, which consists of roughly 245,000 messages sent from January 2000 to March 2002. As expected, the interaction graph, which represents the email exchange between individuals, shows a low density, a right-skewed degree distribution, and a short average distance between vertices (small-world effect). These measures indicate that the graph has a clustered structure. Furthermore, since the data set encompasses email interactions over a period of approximately three years, it is particularly suitable for analyzing the evolution of subgraphs.
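The three graph measures named above can be illustrated on a toy graph. The following is a minimal sketch, not the analysis code used for the Enron data: the edge list, the vertex names, and the helper functions `build_graph`, `density`, and `average_distance` are hypothetical, and the graph is assumed to be undirected and unweighted.

```python
from collections import deque
from statistics import mean, median


def build_graph(edges):
    """Build an undirected adjacency map from (sender, recipient) pairs."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore messages a person sends to themselves
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def density(adj):
    """Fraction of possible undirected edges that are present."""
    n = len(adj)
    m = sum(len(nb) for nb in adj.values()) // 2
    return 2 * m / (n * (n - 1)) if n > 1 else 0.0


def average_distance(adj):
    """Mean shortest-path length over all reachable ordered vertex pairs."""
    total = pairs = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:  # breadth-first search from src
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


# Hypothetical toy data: one highly connected "hub" and five other employees.
edges = [("hub", "a"), ("hub", "b"), ("hub", "c"),
         ("hub", "d"), ("hub", "e"), ("a", "b")]
adj = build_graph(edges)
degrees = sorted(len(nb) for nb in adj.values())

print(density(adj))                    # 0.4
print(mean(degrees), median(degrees))  # mean 2 exceeds median 1.5
print(average_distance(adj))           # 1.6
```

A mean degree above the median degree is used here only as a rough indicator of right skew; on real data a histogram or a skewness statistic would be more informative.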

To get a first impression, we applied DenGraph to the email interaction graph. As expected, the parameters ε and η strongly influence the outcome of the DenGraph clustering. We chose the parameter combination that yielded the best cluster performance, the optimal modularity, or both. However, the noise ratio and the number of clusters are also important indicators that should not be neglected.
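To make the role of the two parameters concrete, the following is a simplified, DBSCAN-style sketch of density-based graph clustering. It is not the DenGraph implementation used in this work. It assumes that ε is a distance threshold on edges, that η is the minimum number of ε-neighbors a vertex needs to count as a core vertex, and that vertices reached by no core vertex are noise. The function name `density_cluster` and the toy distances are hypothetical.

```python
from collections import deque


def density_cluster(dist, eps, eta):
    """Cluster vertices of a graph given as {(u, v): distance}.

    Returns (labels, noise): labels maps clustered vertices to a cluster id,
    noise is the set of vertices that belong to no cluster.
    """
    # eps-neighborhood: neighbors connected by an edge of distance <= eps
    nbrs = {}
    for (u, v), d in dist.items():
        nbrs.setdefault(u, set())
        nbrs.setdefault(v, set())
        if d <= eps:
            nbrs[u].add(v)
            nbrs[v].add(u)

    # core vertices have at least eta vertices in their eps-neighborhood
    core = {u for u, nb in nbrs.items() if len(nb) >= eta}

    labels = {}
    cluster_id = 0
    for seed in sorted(core):
        if seed in labels:
            continue
        labels[seed] = cluster_id
        queue = deque([seed])
        while queue:  # grow the cluster outward from core vertices only
            u = queue.popleft()
            for v in nbrs[u]:
                if v not in labels:
                    labels[v] = cluster_id
                    if v in core:
                        queue.append(v)
        cluster_id += 1

    noise = set(nbrs) - set(labels)
    return labels, noise


# Hypothetical distances: small values stand for frequent email exchange.
dist = {("a", "b"): 0.2, ("a", "c"): 0.3, ("b", "c"): 0.2,
        ("c", "d"): 0.9,
        ("d", "e"): 0.2, ("d", "f"): 0.3, ("e", "f"): 0.1,
        ("f", "g"): 0.8}

labels, noise = density_cluster(dist, eps=0.5, eta=2)
strict_labels, strict_noise = density_cluster(dist, eps=0.5, eta=3)

print(len(set(labels.values())), len(noise) / 7)  # two clusters, one noise vertex
print(len(strict_labels), len(strict_noise))      # 0 7
```

In this toy example, raising η from 2 to 3 turns every vertex into noise, which illustrates why the noise ratio and the number of clusters are worth inspecting alongside modularity when choosing a parameter combination.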

Afterwards, we analyzed the temporal evolution of the detected communities in the Enron graph. For this, we generated interaction graphs over specific time periods and applied DenGraph to observe the temporal subgraph evolution based on graph and cluster statistics. The number of discovered clusters varies from graph to graph. We observe that the values for weighted and unweighted modularity are in general comparable. As expected, the unweighted modularity is in the majority of cases lower than the weighted modularity. The clustering coefficient fluctuates slightly around an average value of 0.4. Furthermore, a correlation between the number of edges and the number of updates can be observed: when the number of edges increases, the number of positive updates increases as well, usually followed by a period with a higher number of negative updates. Therefore, fluctuations in the number of edges result in fluctuations in the number of updates. In some intervals, a correlation between the number of positive updates and the number of splits can be seen. The same holds for the number of negative updates and the number of splits: the number of splits increases when many edges are lost.
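The construction of per-period interaction graphs and the comparison of weighted and unweighted modularity can be sketched as follows. This is an illustration under stated assumptions rather than the procedure used for the Enron data: the period length (one calendar month), the use of message counts as edge weights, the toy messages, the fixed community assignment, and the function names `monthly_graphs` and `modularity` are all hypothetical. Modularity is computed as the sum over communities of the internal edge weight fraction minus the squared degree fraction.

```python
from collections import Counter
from datetime import date


def monthly_graphs(messages):
    """Group (date, sender, recipient) records into per-month edge counts."""
    graphs = {}
    for day, sender, recipient in messages:
        if sender == recipient:
            continue
        key = (day.year, day.month)
        edge = tuple(sorted((sender, recipient)))  # undirected edge
        graphs.setdefault(key, Counter())[edge] += 1
    return graphs


def modularity(edge_weights, community, weighted=True):
    """Modularity of a partition; weighted=False counts every edge once."""
    total = 0.0
    inside = Counter()  # edge weight inside each community
    degree = Counter()  # summed (weighted) degree of each community
    for (u, v), w in edge_weights.items():
        w = w if weighted else 1
        total += w
        degree[community[u]] += w
        degree[community[v]] += w
        if community[u] == community[v]:
            inside[community[u]] += w
    if total == 0:
        return 0.0
    return sum(inside[c] / total - (degree[c] / (2 * total)) ** 2
               for c in degree)


# Hypothetical messages: two tight groups linked by a single message.
pairs = [("a", "b", 3), ("a", "c", 3), ("b", "c", 3),
         ("d", "e", 3), ("d", "f", 3), ("e", "f", 3),
         ("c", "d", 1)]
messages = [(date(2001, 1, 5), u, v) for u, v, n in pairs for _ in range(n)]
messages.append((date(2001, 2, 1), "a", "b"))

graphs = monthly_graphs(messages)
community = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}

january = graphs[(2001, 1)]
q_unweighted = modularity(january, community, weighted=False)
q_weighted = modularity(january, community, weighted=True)
print(sorted(graphs))                                 # [(2001, 1), (2001, 2)]
print(round(q_unweighted, 4), round(q_weighted, 4))   # 0.3571 0.4474
```

In this toy example the unweighted value is the lower one because the single message between the two groups counts as much as each heavily used edge inside a group once weights are ignored.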