A Tool for Large Network Analyses
___________________________________________________________________
GraphCrunch runs under Linux, MacOS, and Windows Cygwin.
The versions for Linux and MacOS are statically compiled and thus are ready to use after downloading and unpacking the compressed archive. Due to the licensing issues, the Windows Cygwin version requires the LEDA 5.0.1 Cygwin license and the gcc 3.4 compiler. Note that if you intend to make modifications to the Linux or MacOS versions of GraphCrunch, you will need the LEDA licenses for these operating systems in order to compile your changes.
The system needs to have up to 20MB of disk space available (depending on the operating system) for installing GraphCrunch.
GraphCrunch is primarily intended for use on graphs with fewer than 10000 nodes and 50000 edges. The memory and CPU requirements increase considerably when processing larger graphs, or particularly dense graphs.
Note that processing a large number of model networks may put a demand on the available disk space in the system (storing a single network takes about 600 KB of disk space). We recommend processing of up to 30 networks per network model.
___________________________________________________________________
Input data graphs should be in either LEDA graph format (.gw) or "edge list" format (.txt). Output model networks are in LEDA graph format. A description of the LEDA graph format is available here.
The "edge list" text file format is simply a graph adjacency list, i.e., a list of node pairs (edges of the network) separated by tabs or spaces, with one node pair per line. An example of an "edge list" text file representing a "triangle" graph is:
node1 node2
node1 node3
node2 node3
The current implementation of GraphCrunch deals with undirected, simple (i.e., no loops or multiple edges) graphs. Thus, for either of the above two formats, GraphCrunch will remove self-loops, duplicate edges, and edge directions. Also, the current version of GraphCrunch does not account for labels on nodes or edges. Thus, even though input graphs in LEDA format may have labels on nodes and edges, LEDA parameter types of node and edge labels will be modified by GraphCrunch as necessary for program execution. Similarly, even though input graphs in "edge list" formats may have a third "edge weight" (or "edge label") column, all edge weights (and node labels) are currently ignored.
___________________________________________________________________
You can download GraphCrunch here.
After downloading compressed archive, unpack it by running the following command:
zcat graphcrunch-DD-MMM-YYYY.tar.gz
| tar vxf –
where “DD-MMM-YYYY’’ corresponds to the date of the latest release. A directory called “graphcrunch” will be created containing subdirectories described bellow.
Detailed instructions for Cygwin are available here.
After unpacking and installing GraphCrunch, you will be presented with the following directory structure.
Directory: |
Content: |
./run-dialog |
A text User Interface wizard for running GraphCrunch. |
./crunch |
A command-line tool for running GraphCrunch. |
README.txt |
The "readme" file with the explanations on how to use GraphCrunch. |
batch/ |
Scripts and temporary data for remote batch processing. |
contrib/ |
Scripts and utilities that use additional software (e.g., gnuplot or R); users can contribute with their own add-ons. |
data/ |
Folder containing sets of intermediate files. |
doc/ |
Advanced documentation for extending the software package and troubleshooting. |
plots/ |
Visualized output files for the user-friendly graphical interpretations of the results. |
sample_graphs/ |
Some sample graphs to try out. |
scripts/ |
Scripts that use the results of network model generators and programs computing network properties to create the statistics that summarize the results. |
src/ |
Source code (for advanced users/developers). |
tmp/ |
Temporary data. |
GraphCrunch is implemented in C++, Perl, and Bourne Shell scripts. The programs for generating model networks and calculating network properties are implemented using C++ and the LEDA library for combinatorial and geometric computing. Shell scripts collect the output of these programs and create the results of the analyses and graphical and tabular representation of the statistics summarizing the results. The programs and the shell scripts are organized in the directories as presented in the table above. This organization allows for easy extendibility of the software to include additional network models and measures. For example, to add a new model, the user only needs to put a model generator program in the src/ directory and to place the script that runs the new program in the scripts/ directory. The same flexibility applies to the addition of new network properties. In addition to the easy extendibility, GraphCrunch has built-in parallel computing capabilities: it distributes its processes over a user specified cluster of machines.
There are three ways of running GraphCrunch: via the command-line interface, the run-dialog interface, and the on-line web interface. Run-dialog interface is available for the Linux and MacOS versions of GraphCrunch (but not for Cygwin). Both command-line and run-dialog interfaces allow for large-scale scientific computing network analysis and modeling projects. On-line web user interface is provided for non-expert users and those with less intensive processing needs; we recommend that first-time users start with on-line GraphCrunch.
The command-line interface allows for specifying all of the following in a single command:
· the real-world network (input graph) to be processed,
· random graph models against which the data is to be compared,
· the number of networks to be generated per random graph model,
· network properties and comparisons between the data and the model networks,
· the name of the output data file.
Network properties and models currently supported by GraphCrunch are presented in tables below.
Global Properties: |
Degree distribution |
Clustering coefficient |
Clustering spectrum |
Average diameter |
Spectrum of shortest path lengths |
Local Properties: |
Relative graphlet frequency distance (RGF-distance) [6] |
Graphlet degree distribution agreement (GDD-agreement) [7] |
Models: |
Erdös–Rényi random graphs [1] |
Random graphs with the same degree distribution as the data [2] |
Scale-free Barabasi-Albert model graphs [3] |
N-dimensional geometric random graphs [4] |
Stickiness model graphs [5] |
Tables: Network properties and models currently supported by GraphCrunch.
The syntax for the
command is:
./crunch -f graph_filename [-m model] [-p parameter] [-c comparison]
[-o output filename] [-n num random graphs] [-r machine]
Example of the command (explanation of the command fields is provided below) is:
./crunch -f Net1.gw -p
"diameter-avg clustcoef-avg graphlet-count"
-c "degree-distrib diameter-spectrum clust-spectrum
gdd-agreement:amean gdd-agreement:gmean graphlet-dist" -m "er er_dd geo sf
sticky" -n 3 -o output_example.tsv
Note: Before running any commands, make sure you "cd" to the graphcrunch directory.
After you run “./crunch” command on a graph file (e.g., Net1.gw) for the first time, it is stored in the data set “collection” located in the graphcrunch/data directory. Each input graph file becomes a “data set” named after the graph file (without the “.gw” or “.txt” extension).
The second time you want to run calculations on Net1.gw, use “-d” instead of “–f” (i.e., “./crunch -d Net1.gw”) in the command example above. This will overwrite the previous analysis.
Alternately, you can use “-F” option instead of “–f” (or “–d”), which will either add a graph file (if one with the same name does not already exist) or look for a graph in the collection with the same name (but will not overwrite the existing graph).
Network models that can be specified in part “-m model” of the command are specified in the “Models” table above, and are denoted by “er”, “er_dd”, “sf”, “geo”, and “sticky” in the command, respectively (without the quotes).
Network properties in GraphCrunch are subdivided into parameters (denoted by “-p” in the above command) and comparisons (denoted by “-c” in the command).
The “parameters” denote network properties computed on both the data and model networks.
Parameter notation in GraphCrunch |
Parameter description |
clustcoef-avg |
Average clustering coefficient of the network |
diameter-avg |
Average diameter of the network |
graphlet-count |
Total number of graphlets in the network |
Additionally, average diameters and average clustering coefficients of model networks can be compared to those of a real-world network (i.e., can be used as “comparisons”, with “–c” in the command) in two ways, as presented below.
Notation |
Explanation |
diameter-avg:difference |
Absolute values of difference between the average diameters of two networks |
clustcoef-avg:difference |
Absolute values of difference between the average clustering coefficients of two networks |
diameter-avg:percenterr |
Percentage difference between the average diameters of two networks |
clustcoef-avg:percenterr |
Percentage difference between the average clustering coefficients of two networks |
“Comparisons” are properties of a real-world network that are compared against those of model networks.
Comparison notation in GraphCrunch |
Comparison description |
degree-distrib |
Degree distribution |
clust-spectrum |
Clustering spectrum |
diameter-spectrum |
Spectrum
of shortest path lengths |
gdd-agreement |
Graphlet degree distribution agreement (GDD-agreement) |
graphlet-dist |
Relative graphlet frequency distance (RGF-distance) |
The degree distributions, the clustering spectra, and the shortest path length spectra of two networks are compared by the Pearson’s correlation.
GDD-agreements and RGF-distances are calculated as in [7] and [6], respectively. GDD-agreement can be computed as either arithmetic
or a geometric mean. To distinguish between these two, use the notation
presented below.
Notation |
Explanation |
gdd-agreement gdd-agreement:amean |
GDD-agreement, arithmetic mean |
gdd-agreement:gmean |
GDD-agreement, geometric mean |
GraphCrunch supports parallel processing. A user can distribute the GraphCrunch processing over a cluster of machines, by including –r “machine1 machine2 machineN” in the above command, where “machine1”, “machine2” and “machineN” are machine names.
The run-dialog interface provides the same functionality as the command-line interface, but in a more user-friendly manner: it guides a user through the sequence of screens, step by step. It is more “error-forgiving” than the command line-interface.
If you have Perl 5 and dialog installed (we recommend Perl 5.6+ as well as dialog 0.3+ or Xdialog), you may start run-dialog interface with the following command:
./run-dialog
<input_graph_name>
Note: The first time
you run run-dialog interface, you need to add a data set. If don't have any
graph files, you can try one of the samples in the sample_graphs directory. To add a
data set, you can either provide the input graph on the command line by typing
“./run-dialog <input_graph_name>” or run just “./run-dialog”
command and then choose “[Add dataset]”.
After this command is issued, you need to perform (un)selecting of checkboxes in order to specify the following:
· the random graph models,
· the number of networks per random graph model,
· network parameters and comparisons,
· the name of the output file.
The processing with the specified selections can either start immediately, (by choosing “Done - Run crunch” option), or “advanced options” can be configured to change network models, parameters, comparisons, or the machines over which to distribute the processing.
To navigate between menu options, use the up/down arrow keys. To navigate between buttons, use the left/right or tab keys. Press space to toggle a checkbox.
The example screenshots for using the run-dialog interface are available here.
The GraphCrunch
on-line web user interface is intended for less experienced users and those
with less intensive processing needs. It is a user-friendly interface, and we
recommend it for first-time GraphCrunch users. It is available here.
___________________________________________________________________
GraphCrunch creates three types of output: the tabular output file, the set of intermediate files, and the visualized output.
The tabular output file is a spreadsheet of tab-separated
values (.tsv) that contains summarized output statistics.
An example of the GraphCrunch output file in which network “Net1.gw” is compared against five
network models with three random networks per network model and with respect to
all network properties currently supported by GraphCrunch can be found here.
The command that produces these results is:
./crunch -f
Net1.gw -p “diameter-avg clustcoef-avg
graphlet-count” -c “degree-distrib
diameter-spectrum clust-spectrum
gdd-agreement:amean gdd-agreement:gmean graphlet-dist” -m “er er_dd geo sf
sticky” -n 3 -o output_example.tsv
The set of intermediate files is stored in the
subdirectories of the data/<input-file-name> directory. These files
include generated model networks corresponding to the input network, in LEDA
graph (.gw) format (they are saved in the subdirectory data/<input-file-
name>/models), and the files containing the network properties (e.g., distance
spectra, graphlet counts, graphlet degree distributions etc.). The intermediate
files allow for additional analyses of the results without performing any
additional compute-intensive processing. Also, the tabular output file contains
only the statistics of network parameter similarities between the data and the
model networks, but not the results from which the statistics were computed
that are contained in the intermediate files. For example, the GDD-agreement
presented in the output file is a single number between 0 and 1 while the
intermediate files contain the actual GDDs for each of the 73 orbits. Thus,
intermediate files are needed if one needs to perform an analysis of a
particular GDD. Similar holds for other network properties.
The visualized
output is a set of files (in .ps format) that contain
user-friendly graphical interpretations of the results presented in the output
file. One graphical file (plot) is created per network property. Each of the
plots illustrates the fit of the network models to one or more real-world
networks with respect to the given property. Thus, it is possible to
simultaneously compare the fit of network models to many real-world networks
with respect to one property in a single plot. Before executing the plot
function of GraphCrunch, the following must be satisfied: (1) the data sets
must be processed with “./crunch” command and the corresponding
intermediate files must exist in the data/ directory; and (2) “gnuplot” command-line plotting
utility must be available on a user’s system. Visualized output files are
created by running the plot.sh
script (found in the contrib/ directory); the
syntax for running this script is:
./contrib/plot.sh -d real-world-network -m model -c
comparison -o output-file
where multiple real-world networks or models may be specified,
as illustrated in the example below. Some examples of plots are presented here: four network properties for three data sets and
five network models per data set have been illustrated. The command used to
produce the first example is:
./contrib/plot.sh -d Net1 -d Net2 -d Net3 -m er -m er_dd -m geo -m sf -m sticky -c gdd-agreement:amean
-o plot_ex_gdd
___________________________________________________________________
GraphCrunch allows considerable extensibility with a little bit of scripting knowledge. See doc/CUSTOMIZING.txt for more details.
___________________________________________________________________
___________________________________________________________________
GraphCrunch is created by T. Milenković, J. Lai, and N. Pržulj at UC Irvine. Thanks to David Hubin for testing and advice.
Primary contact: nprzulj [at] uci [dot] edu
GraphCrunch is freely available for academic use only. The shell scripts in this distribution may be used for any purpose with proper attribution.
Disclaimer: THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
To simplify installation, we include a copy of the UI::Dialog 1.08 module for Perl, written by Kevin C. Krinke <kckrinke@opendoorsoftware.com> With contributions by Jeroen Bulten <jeroenb@tunix.nl>, Julian Gilbey <jdg@polya.uklinux.net>, and Alfonso E.M. <alfonso@el-magnifico.org>. More details in the scripts/perl/UI-Dialog-1.08 directory. Dialog was created by Thomas E. Dickey et. al.
UI-Dialog is under the GNU Lesser General Public License (GNU LGPL), and is considered separate from the GraphCrunch licensing.
Dialog is under the GNU General Public License (GNU GPL) and is considered separate from the GraphCrunch licensing. As per Section 3b of the GNU GPL we offer to provide a copy of the Dialog 1.1 source code if requested, within the first three years of release.
___________________________________________________________________