Entropy of Complex Networks
This is the repository for the Entropy in Complex Networks project funded by the grant agreement 2016/23/B/ST6/03962 of the National Science Center in Poland. The research project started in September 2017 and will finish in August 2020.
Research project objectives / Research hypothesis
Entropy is a fundamental concept in information theory and it is commonly used across computer science. In machine learning, entropy is used for feature selection (mutual information), for text analysis and mining (maximal marginal relevance), for decision tree induction (information gain), and for many other tasks. Despite the common usage of entropy in computer science, our research suggests that there are serious deficiencies when it comes to the entropy of networked data. A literature review reveals multiple, oftentimes conflicting definitions of entropy proposed over the years for graphs and networks. These definitions can be broadly classified as thermodynamic entropies, statistical entropies, information-theoretic entropies, topological entropies, or entropies defined by information functionals. Each of these definitions tries to capture a certain aspect of network structure and to evaluate the information value of the network. One should also add that most of the previous theoretical work has been done within the scope of graph theory, without taking into consideration the volumes of data available in networks today. Consequently, entropy definitions present in the literature tend to be computationally expensive, up to the point of not being applicable to large networks.
The main research objective of this proposal is, on the one hand, the development of computationally feasible and intuitive entropy definitions for networks, with special emphasis on network characteristics. We plan to develop operational entropy definitions for signed networks and multiplex multi-modal networks. On the other hand, the research plan contains work packages related to the design and development of algorithms that utilize the proposed entropies to solve typical problems arising in network analysis and mining: partitioning the network into modules (akin to the classical problem of data clustering) and classifying generative and empirical networks. The third problem we aim to solve within the scope of this project is network sampling and statistical inference on samples, along with the estimation of its confidence. A problem closely related to network sampling is network compression, for which we plan to investigate the efficacy of algorithmic entropy (also known as Kolmogorov complexity).
Research project methodology
The project has been divided into two disjoint, chronologically ordered sets of work packages. During the first project phase we plan to investigate the theoretical foundations of network science and to adapt and extend classical entropy definitions to complex networks, with special emphasis on signed networks and multiplex multi-modal networks. During the second project phase we will concentrate on the operationalization of entropy for complex networks: the development of entropy-based algorithms for classifying generative and empirical networks, for detecting network modules, and for network sampling, as well as the evaluation of the usability of algorithmic entropy for network compression. The main research methodology will consist of theoretical investigation and experimental evaluation. Experiments will be conducted both on generative networks originating from popular theoretical network models (random networks, scale-free networks, small-world networks) and on empirical networks downloaded from public repositories (the Koblenz Network Collection, the Stanford Large Network Dataset Collection).
Expected impact of the research project on the development of science, civilization and society
The proposed research aims at the development of techniques for network analysis and mining. Taking into consideration the widespread use of network datasets, the growth of their volume, and the ubiquity of network computing (social networks, semantic networks, technological networks), the project will provide scientific tools for the development of new algorithms and network processing methods, such as network visualization, network modularization, network compression, or network classification. The scientific results of the project will inform further investigations in bioinformatics (protein structure processing), sociology (social network analysis), and other disciplines which require large-scale network processing.
Popular description
Entropy is commonly used in science to measure the degree of disorder. The higher the entropy of a given system, the more disorganized the system is. In other words, high entropy represents a high degree of uncertainty about the state of the system. Take the example of a box full of white and black balls. If there are an equal number of white and black balls, the uncertainty about the color of the randomly chosen ball is the highest, which means that the entropy of the box is maximal. On the other hand, if the box contains balls of only one color, there is no uncertainty involved and the entropy of the box equals 0. This simple analogy can be applied to any complex system, within which we want to apply a learning process.
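The box-of-balls analogy maps directly onto Shannon entropy. A minimal sketch in Python (the helper name `box_entropy` is ours, introduced only for illustration):

```python
from collections import Counter
from math import log2

def box_entropy(balls):
    """Shannon entropy (in bits) of the colour distribution in the box."""
    counts = Counter(balls)
    total = len(balls)
    # H = -sum(p * log2(p)) over the observed colour proportions
    return sum(-(c / total) * log2(c / total) for c in counts.values())

# Equal numbers of white and black balls: maximal uncertainty, 1 bit.
print(box_entropy(["white"] * 5 + ["black"] * 5))  # 1.0

# Only one colour in the box: no uncertainty at all.
print(box_entropy(["white"] * 10))  # 0.0
```

With a 3:1 mix the entropy falls strictly between these extremes (about 0.81 bits), matching the intuition that partial imbalance means partial certainty.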
In the world of complex networks, algorithms are often faced with the need to estimate various properties of vertices. Strongly regular networks have low entropy, which does not necessarily imply a uniform distribution of vertex properties, such as degree, betweenness, or closeness. It is possible that the network contains a few groups of vertices such that within each group the variance of properties is low, whereas there are significant inter-group differences. Knowing the entropy of the centrality measures of vertices can be very useful for network processing algorithms, because it allows directing an algorithm along the gradients of increasing or decreasing entropy, depending on the task at hand. It is worth mentioning, however, that many conflicting definitions of network entropy have been proposed in the scientific literature; these definitions are often incompatible, as they try to grasp different aspects of network topology. In addition, many of these definitions require computations which are simply infeasible given the sizes of contemporary complex networks (e.g., social networks), rendering these definitions practically unusable.
The goal of this project is to examine the properties of entropies in the world of complex networks, with special emphasis on signed networks, multiplex networks, and multimodal networks. In signed networks each edge has a valence (most often positive vs. negative), which represents the semantics of the relationship depicted by the edge, such as preference vs. dislike. Multiplex networks contain edges that belong to different classes: for instance, one type of edge represents the flow of e-mail messages while another type represents organizational hierarchy relations. Finally, multimodal networks contain vertices that belong to disjoint classes, such as companies and persons with their relationships. The main goal of the project is to develop intuitive, computationally feasible entropy definitions, to extend these definitions to new classes of complex networks, and to prove the utility of the proposed definitions by incorporating them into selected network processing algorithms.
In the first project phase we will adapt different entropy definitions to the world of complex networks and simplify these definitions to make them computationally feasible with respect to the sizes of contemporary complex networks. In the second project phase we will operationalize the definitions introduced in the first phase by introducing entropy-aware network processing algorithms. We have selected three classes of algorithms: network classification (assigning a given network to a general network topology class), network modularization (finding groups of vertices which form coherent and dense structures within the network), and network compression (by merging vertices and edges, or by sampling vertices and edges). This way we hope to prove the universal usefulness of the proposed entropy definitions.
Research team
- Mikołaj Morzy, Ph.D., D.Sc., Poznan University of Technology, Mikolaj.Morzy@put.poznan.pl
- Tomasz Kajdanowicz, Ph.D., Wrocław University of Science and Technology, Tomasz.Kajdanowicz@pwr.wroc.pl
Publications
- On Measuring the Complexity of Networks: Kolmogorov Complexity vs. Entropy. Mikołaj Morzy, Tomasz Kajdanowicz, and Przemysław Kazienko. Complexity, Volume 2017, Article ID 3250301. pdf bibtex
One of the most popular methods of estimating the complexity of networks is to measure the entropy of network invariants, such as adjacency matrices or degree sequences. Unfortunately, entropy and all entropy-based information-theoretic measures have several vulnerabilities when applied to the evaluation of network complexity. These measures are neither independent of a particular representation of the network, nor can they capture the properties of the generative process which produces the network. As a result, every definition of network complexity based on the entropy of some network invariant suffers from similar flaws. We argue that entropy is not sufficient to adequately represent the computational description of the network. Instead, we advocate the use of algorithmic entropy as the basis for a complexity definition for networks. Algorithmic entropy evaluates the complexity of the description required for a lossless recreation of the network. In particular, a well-known implementation of algorithmic entropy, Kolmogorov complexity (K-complexity for short), is considered in the paper. This measure is not affected by a particular choice of network features and does not depend on the method of network representation. We present new examples of entropy-deceiving networks which are algorithmically simple and recursive, yet produce high entropy estimations for various network invariants. We perform experiments on Shannon entropy and K-complexity for gradually evolving networks. The results of these experiments point to K-complexity as the more robust and reliable measure of network complexity. The original contribution of the paper includes the introduction of several new entropy-deceiving networks and the empirical comparison of entropy and K-complexity as fundamental quantities for constructing complexity measures for networks.
- Using Graph and Vertex Entropy to Compare Empirical Graphs with Theoretical Graph Models. Tomasz Kajdanowicz and Mikołaj Morzy. Entropy 18.9 (2016): 320. pdf bibtex
Over the years, several theoretical graph generation models have been proposed. Among the most prominent are: the Erdős–Rényi random graph model, the Watts–Strogatz small world model, the Albert–Barabási preferential attachment model, the Price citation model, and many more. Often, researchers working with real-world data are interested in understanding the generative phenomena underlying their empirical graphs. They want to know which of the theoretical graph generation models would most probably generate a particular empirical graph. In other words, they expect some similarity assessment between the empirical graph and graphs artificially created from theoretical graph generation models. Usually, in order to assess the similarity of two graphs, centrality measure distributions are compared. For a theoretical graph model this means comparing the empirical graph to a single realization of a theoretical graph model, where the realization is generated from the given model using an arbitrary set of parameters. The similarity between centrality measure distributions can be measured using standard statistical tests, e.g., the Kolmogorov–Smirnov test of distances between cumulative distributions. However, this approach is both error-prone and leads to incorrect conclusions, as we show in our experiments. Therefore, we propose a new method for graph comparison and type classification by comparing the entropies of centrality measure distributions (degree centrality, betweenness centrality, closeness centrality). We demonstrate that our approach can help assign the empirical graph to the most similar theoretical model using a simple unsupervised learning method.
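The key quantity in the comparison above is the Shannon entropy of a centrality distribution. A hedged pure-Python sketch for degree centrality (the function names `degree_sequence` and `distribution_entropy` are ours, not taken from the paper):

```python
from collections import Counter
from math import log2

def degree_sequence(edges, n):
    """Degrees of vertices 0..n-1 of an undirected graph given as an edge list."""
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg

def distribution_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    counts = Counter(values)
    total = len(values)
    return sum(-(c / total) * log2(c / total) for c in counts.values())

# A 4-cycle is perfectly regular: every vertex has degree 2, so entropy is 0.
cycle = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(distribution_entropy(degree_sequence(cycle, 4)))  # 0.0

# A star on 4 vertices is heterogeneous: one hub of degree 3, three leaves.
star = [(0, 1), (0, 2), (0, 3)]
print(distribution_entropy(degree_sequence(star, 4)))  # ~0.81
```

The same `distribution_entropy` helper applies unchanged to betweenness or closeness values, provided continuous centralities are binned first.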
- Entropy of Various Matrix Energies for Complex Networks. Mikołaj Morzy and Tomasz Kajdanowicz. Entropy 2018: From Physics to Information Sciences and Geometry, 14/05/2018–16/05/2018, Barcelona, Spain, poster presentation. pdf
The energy of a matrix is the sum of the absolute values of its eigenvalues. In the context of networks, several different energies have been proposed, depending on the particular form of the network matrix. Examples include the regular energy defined over the adjacency matrix, the Laplacian energy defined over the Laplacian of the network, or the Randić energy based on the so-called Randić matrix, in which each entry is the inverse of the square root of the product of node degrees, if the two nodes are adjacent.
The main problem with the above definition of network energy is the fact that it is hardly operational. Defined over the entire network matrix, the energy is a single number that attempts to describe the whole network. In large complex networks, there can exist heterogeneous parts of the network with highly diversified topologies and properties, resulting in significant variation of local energy.
In this work, we diverge from the traditional approach of computer science, where entropy is understood in the light of Shannon’s information theory. Instead, we use entropy in the physical sense, as the measure of the dispersion of energy, where energy is computed from various network-invariant matrix representations. We look at how the entropy of various network energies changes with the gradual change of network topology and we compare energies of node egocentric networks with other well-known node descriptors, such as in-degree, out-degree, betweenness, local clustering coefficient, eccentricity, Bonacich power, Katz centrality, and other indexes. We experiment with various models of complex networks (random network model of Erdős–Rényi, small-world network model of Watts and Strogatz, preferential attachment model of Albert–Barabási, random network model of Waxman) and we observe the behavior of energies and their entropies as the hyper-parameters of network models change.
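For small networks the definitions above are straightforward to compute. An illustrative sketch, assuming NumPy is available (the helper names `graph_energy` and `randic_matrix` are ours; for symmetric matrices the eigenvalues are real, so `eigvalsh` applies):

```python
import numpy as np

def graph_energy(adjacency):
    """Energy of a graph: sum of absolute values of the adjacency eigenvalues."""
    eigenvalues = np.linalg.eigvalsh(np.asarray(adjacency, dtype=float))
    return float(np.sum(np.abs(eigenvalues)))

def randic_matrix(adjacency):
    """Randic matrix: entry (i, j) is 1 / sqrt(d_i * d_j) for adjacent i, j."""
    a = np.asarray(adjacency, dtype=float)
    degrees = a.sum(axis=1)
    with np.errstate(divide="ignore"):
        scale = np.outer(degrees, degrees) ** -0.5  # inf for isolated vertices
    return np.where(a > 0, scale, 0.0)              # inf entries are discarded

# A single edge (K2) has eigenvalues +1 and -1, so energy 2.
k2 = [[0, 1], [1, 0]]
print(graph_energy(k2))  # ~2.0

# A triangle (K3) has eigenvalues 2, -1, -1, so energy 4.
k3 = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
print(graph_energy(k3))  # ~4.0
```

The Randić energy is then simply `graph_energy` applied to `randic_matrix(adjacency)`; the Laplacian energy follows the same pattern with the Laplacian matrix.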
- Graph Energies of Egocentric Networks and Their Correlation with Vertex Centrality Measures. Mikołaj Morzy and Tomasz Kajdanowicz. Entropy 2018, 20(12), 916. pdf bibtex
Graph energy is the energy of the matrix representation of the graph, where the energy of a matrix is the sum of singular values of the matrix. Depending on the definition of a matrix, one can contemplate graph energy, Randić energy, Laplacian energy, distance energy, and many others. Although theoretical properties of various graph energies have been investigated in the past in the areas of mathematics, chemistry, physics, or graph theory, these explorations have been limited to relatively small graphs representing chemical compounds or theoretical graph classes with strictly defined properties. In this paper we investigate the usefulness of the concept of graph energy in the context of large, complex networks. We show that when graph energies are applied to local egocentric networks, the values of these energies correlate strongly with vertex centrality measures. In particular, for some generative network models graph energies tend to correlate strongly with the betweenness and the eigencentrality of vertices. As the exact computation of these centrality measures is expensive and requires global processing of a network, our research opens the possibility of devising efficient algorithms for the estimation of these centrality measures based only on local information.
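The egocentric computation can be sketched by restricting the adjacency matrix to a vertex's closed neighbourhood and taking the energy of the induced submatrix. A minimal sketch, again assuming NumPy (the name `ego_energy` is ours, introduced only for illustration):

```python
import numpy as np

def ego_energy(adjacency, vertex):
    """Energy of the egocentric network of `vertex`: the subgraph induced
    by the vertex and its immediate neighbours."""
    a = np.asarray(adjacency, dtype=float)
    ego = np.flatnonzero(a[vertex])   # neighbours of the vertex
    ego = np.append(ego, vertex)      # closed neighbourhood
    sub = a[np.ix_(ego, ego)]         # induced adjacency submatrix
    return float(np.abs(np.linalg.eigvalsh(sub)).sum())

# In a star with hub 0, a leaf's ego network is a single edge (energy ~2),
# while the hub's ego network is the whole star K_{1,3} (energy 2*sqrt(3)).
star = [[0, 1, 1, 1],
        [1, 0, 0, 0],
        [1, 0, 0, 0],
        [1, 0, 0, 0]]
print(ego_energy(star, 1))  # ~2.0
print(ego_energy(star, 0))  # ~3.46
```

Because only the local submatrix is diagonalized, the cost depends on the ego network size rather than on the whole graph, which is what makes local energies attractive as cheap proxies for global centralities.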