Cytoscape Network Inference Toolbox (Cyni)
Cytoscape Network Inference Toolbox (Cyni) is a new Cytoscape App that brings together several tools for inferring networks from biological data. The tools can be used independently or in combination to perform several tasks.
The goal of Cyni is to make network inference more accessible to biologists by providing a user-friendly solution, and to give bioinformaticians a framework in which to develop and apply their new techniques. Data imputation and discretization techniques are provided along with several well-known inference algorithms, so that the tool is fully operational for any kind of inference requirement. Data imputation and discretization techniques modify Cytoscape tables, whereas network inference algorithms produce a new network after applying the chosen technique. Cyni requires Cytoscape 3.0 or newer and is available in the Cy3 App Store.
Typographic Color Conventions
To help readers match the user interface elements of the Cyni App with the elements described in this document, the following color-based typography is used to refer to any element displayed in the user interface.
Menu Item
Dialog Button
Dialog Label
1. Installation
Cyni is available through the Cy3 App Store. It can also be installed from a file by using the App Manager found in the Apps menu. Once the App Manager is open, click Install from File. A dialog will appear to help you find the Cyni app jar file that you previously saved on your computer. After the jar file is installed, Cyni is available immediately, without restarting Cy3, and it remains available each time Cytoscape is opened. If Cyni is no longer needed, it can be uninstalled easily by using the App Manager.
The Cyni App Jar file can be found in the Downloads section.
2. Using Cyni Toolbox
Once Cyni is installed, a new Cyni Tools menu hierarchy appears under the Tools main menu. The Cyni Tools menu contains four options, each of which opens a new dialog when chosen. In these dialogs, several parameters can be set; otherwise the default values are applied. The functionality that these four options provide is explained in the sections below.
3. App Development through Cyni API
Cyni Toolbox is not only an application that produces results; it also provides a framework that helps other app developers implement their own algorithms. The goal of the Cyni framework is to provide standardized, extensible and configurable default solutions for all components other than the core algorithms, such as the GUI, parameter handling, and configuration of distance/similarity measures. Developers can thereby focus on their core expertise instead of spending significant effort constructing software components foreign to it. All documentation for producing new apps with the Cyni framework can be found in the Cyni App Development documentation.
4. Loading Data Tables
Cyni Toolbox can load tables that are independent of any network into Cytoscape. This feature should become available in future versions of Cytoscape 3, but it is already provided by Cyni Toolbox. The option is the first item in the Cyni Tools menu, called Add Table. A table can be loaded from a File or from a URL. The user interface is the same as the one Cytoscape provides for loading data into existing network tables, so users should be familiar with it. Once the table is loaded, a new tab called Unassigned Tables is shown in the table panel. This tab lists every table whose data is not assigned to a network.
5. Data Imputation Algorithms
Biological data produced by experiments can contain artifacts, such as noise and fluctuations that occur during the experiments. Values that are likely to be affected are treated as missing values, because using them in analyses such as network inference could produce incorrect results. There are several methods to deal with this kind of data. Cyni Toolbox provides several data imputation techniques for users whose data contains missing values. This tool is the second item in the Cyni Tools menu, called Impute Missing Data. Once this option is clicked, a new dialog is shown, like the one in the figure.
In this dialog, two drop-down lists allow users to select the two main elements of this feature. The first is the technique to use and the second is the table data that may contain missing values. After both are chosen, the dialog expands and displays the parameters related to the chosen options. The following techniques are available in Cyni Toolbox.
5.1. Zero Imputation Algorithm
One of the simplest ways to deal with missing values is to set them to zero. In Cyni Toolbox, this basic technique, like the more sophisticated ones, requires a definition of which values in the data are to be considered missing. Cyni provides four ways to define these values, either as a single value or by using thresholds. Some of these options can be seen in the Cyni dialog shown in the figure.
If the option By a double Threshold is selected for the parameter How to define a missing value, the parameters Missing Value if lower than and Missing Value if larger than are shown, and the values entered define the interval used to find the missing values in the chosen table data. Alternatively, a single threshold or just a single value can be chosen. If the option By a single value is chosen, only the parameter Missing Value is shown, and any value in Table Data equal to that value is considered a missing value.
In every case, an empty cell is also considered a missing value.
Once the input parameters have been selected, clicking Impute Missing Data replaces the missing values found in the chosen Table Data with the estimated ones, and a message with the number of missing values estimated is displayed.
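The double-threshold rule and zero imputation described above can be sketched in a few lines of Python. This is only an illustration, not Cyni's implementation: the function names, the use of None for an empty cell, and the treatment of values exactly on a threshold as valid are all assumptions.

```python
def is_missing(value, lower, upper):
    # Assumption: an empty cell is represented by None; values outside
    # the interval [lower, upper] are treated as missing.
    return value is None or value < lower or value > upper


def zero_impute(table, lower, upper):
    """Return a copy of the table with missing cells set to 0.0,
    plus the number of cells that were replaced."""
    imputed, replaced = [], 0
    for row in table:
        new_row = []
        for value in row:
            if is_missing(value, lower, upper):
                new_row.append(0.0)
                replaced += 1
            else:
                new_row.append(value)
        imputed.append(new_row)
    return imputed, replaced


# Hypothetical table: one empty cell and two out-of-range values.
table = [[1.0, None, 3.0], [-999.0, 2.0, 1e6]]
imputed, replaced = zero_impute(table, lower=-100.0, upper=100.0)
print(imputed, replaced)
```

The returned count plays the role of the number reported in the message that Cyni displays after imputation.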
5.2. Row Average (RAV)
Another simple way to impute data in a table is to replace each missing value with the average of the non-missing values in the same row. Cyni also provides this technique, and its input parameters are the same as for Zero Imputation.
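A minimal Python sketch of row-average imputation follows. It assumes that missing cells have already been identified and are marked with None, and it falls back to 0.0 for a row with no usable values; both choices are assumptions made for this example and are not taken from Cyni.

```python
def row_average_impute(table):
    """Replace each missing cell (None) with the mean of the
    non-missing cells in the same row."""
    result = []
    for row in table:
        present = [v for v in row if v is not None]
        # Assumption: a row with no usable values falls back to 0.0.
        mean = sum(present) / len(present) if present else 0.0
        result.append([mean if v is None else v for v in row])
    return result


filled = row_average_impute([[1.0, None, 3.0], [None, 4.0, None]])
print(filled)  # [[1.0, 2.0, 3.0], [4.0, 4.0, 4.0]]
```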
5.3. Bayesian Principal Component Analysis (BPCA)
Bayesian Principal Component Analysis is an advanced technique that combines Bayesian estimation with the iterative expectation-maximization algorithm. A complete description and implementations in several programming languages can be found at this link. The author of BPCA has granted Cyni Toolbox permission to use the BPCA implementation, which makes this technique available to all Cytoscape users. The dialog for setting this technique's parameters contains the same input parameters as the previous techniques and is shown in the figure below. The meaning of each input parameter is also the same, so clicking Impute Missing Data replaces the missing values found in the chosen Table Data with the ones estimated by the BPCA algorithm.
6. Discretization Algorithms
Many machine learning techniques can be applied only to data sets composed of categorical attributes, but many data sets include continuous variables. Cyni Toolbox provides a discretization tool that can be used as a preliminary step before network inference or as an independent functionality. This tool is the third item in the Cyni Tools menu, called Discretize Data. Once this option is clicked, a new dialog is shown, like the one in the figure.
In this dialog, two drop-down lists allow users to select the two main elements of this feature. The first is the technique to use and the second is the table data containing the continuous values that need to be discretized. After both are chosen, the dialog expands and displays the parameters related to the chosen options. The following techniques are available in Cyni Toolbox.
6.1. Equal Frequency/Width Algorithm
The equal-width discretization algorithm determines the minimum and maximum values of the attribute to be discretized and then divides the range into a user-defined number of intervals of equal width. The equal-frequency algorithm determines the minimum and maximum values of the attribute, sorts all values in ascending order, and divides the range into a user-defined number of intervals so that every interval contains the same number of sorted values. Cyni provides both possibilities in one algorithm. The figure shows the dialog where the parameters for this technique can be chosen.
Once the input parameters have been selected, clicking Discretize Data produces a new column for each column chosen for discretization. These new columns have the same names as the original columns, with the prefix “nominal.”. The input parameters for this technique are:
Intervals: The number of intervals.
Use Equal Frequency: Lets users choose between the equal-width algorithm and the equal-frequency algorithm. If not selected, the equal-width technique is used.
Apply same discretization thresholds for all selected attributes: The discretization technique can be applied independently to each selected column/attribute, or all selected columns/attributes can be treated as one long column so that the generated intervals are common to all of them. If not selected, different intervals are generated for each selected attribute.
Numerical Attributes: The list of columns/attributes that need to be discretized. This is a multiple-selection box showing the names of the columns of the selected table that contain numerical values. Each column whose name is selected in this list will be discretized.
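The two binning strategies described in this section can be illustrated with a short Python sketch. It is a simplified reading of the text rather than Cyni's code: the function names are hypothetical, interval labels are plain integers starting at 0, and a value equal to a cut point is assigned to the upper interval.

```python
from bisect import bisect_right


def equal_width_cuts(values, intervals):
    """Cut points that split [min, max] into intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / intervals
    return [lo + width * i for i in range(1, intervals)]


def equal_frequency_cuts(values, intervals):
    """Cut points chosen so each interval holds about the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(n * i) // intervals] for i in range(1, intervals)]


def discretize(values, cuts):
    """Map each value to the index of the interval it falls into."""
    return [bisect_right(cuts, v) for v in values]


expression = [1.0, 2.0, 3.0, 4.0, 10.0, 20.0]
by_width = discretize(expression, equal_width_cuts(expression, 3))
by_frequency = discretize(expression, equal_frequency_cuts(expression, 3))
# The new column name mirrors the "nominal." prefix mentioned in the text.
columns = {"nominal.expression": by_frequency}
print(by_width)      # [0, 0, 0, 0, 1, 2]
print(by_frequency)  # [0, 0, 1, 1, 2, 2]
```

On skewed data such as this, equal width leaves most values in the first interval, while equal frequency spreads them evenly. To mimic the option that applies the same thresholds to all selected attributes, the cut points could be computed once on the concatenation of the selected columns.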
6.2. Manual Discretization
Cyni Toolbox also lets users define their own thresholds to create the desired discrete data. This option is available for a limited number of intervals, and for each selected interval the corresponding thresholds used to discretize the data must be specified.
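As a rough illustration of manual discretization, the following Python sketch assigns each value to an interval delimited by user-supplied thresholds. The function name and the boundary rule (a value equal to a threshold goes to the upper interval) are assumptions for this example.

```python
from bisect import bisect_right


def manual_discretize(values, thresholds):
    """Label each value with the index of the interval defined by the thresholds."""
    cuts = sorted(thresholds)
    return [bisect_right(cuts, v) for v in values]


# Two thresholds define three intervals: below 0.3, 0.3 to 1.0, and 1.0 or above.
labels = manual_discretize([0.1, 0.5, 0.9, 1.5], thresholds=[0.3, 1.0])
print(labels)  # [0, 1, 1, 2]
```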
7. Inference Algorithms
The goal of network inference algorithms in biology is to identify, represent and simulate the dependencies between variables such as gene expression, protein and metabolite levels, but also environmental conditions or phenotypes.
Network inference algorithms are the fourth item in the Cyni Tools menu, called Infer Network. Once this option is clicked, a new dialog is shown, like the one in the figure.
In this dialog, two drop-down lists allow users to select the two main elements of this feature. The first is the algorithm to use and the second is the table data, which corresponds to the data related to the nodes of the future network. After both are chosen, the dialog expands and displays the parameters related to the chosen options. The following algorithms are available in Cyni Toolbox.
7.1. Basic Correlation Algorithm
The objective of this kind of algorithm is to explain observed correlations between biological elements such as genes by the presence of other biological elements. Networks inferred by this algorithm are constructed by computing a similarity measure for each pair of elements. If the similarity value is above a certain threshold, the two nodes representing the biological elements are connected in the network; otherwise, they remain unconnected.
The figure above shows the dialog where the parameters for this algorithm can be chosen. Once the input parameters have been selected, clicking Infer Network starts generating the new network, and a status dialog appears to inform users of the current state. The input parameters for this algorithm are:
Threshold to add new edge: Sets the threshold used to decide whether the similarity between two elements is high enough to display an edge between them.
Use only absolute values for correlation: Lets users choose between working with the signed values of the calculated correlations and taking only their absolute values.
Metric: Displays a list of the similarity measures available for this kind of algorithm.
Data Attributes: A multiple-selection list is displayed. This list contains all the column names of the chosen table that contain the right type of values for the selected metric.
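The thresholding procedure described in this section can be sketched as follows, using Pearson correlation as the similarity measure. The function names, the gene names, and the data are hypothetical, and the sketch makes no claim about how Cyni organizes its own computation.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Covariance divided by the product of the standard deviations.
    Note: a constant row would cause a division by zero here."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def infer_correlation_network(rows, threshold, use_absolute=True):
    """Connect every pair of rows whose similarity exceeds the threshold."""
    edges = []
    for (a, xa), (b, xb) in combinations(rows.items(), 2):
        r = pearson(xa, xb)
        score = abs(r) if use_absolute else r
        if score > threshold:
            edges.append((a, b, r))
    return edges


# Hypothetical expression profiles, one row per gene.
rows = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.5],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [1.0, -1.0, 1.0, -1.0],
}
absolute_edges = infer_correlation_network(rows, threshold=0.9)
signed_edges = infer_correlation_network(rows, threshold=0.9, use_absolute=False)
print([(a, b) for a, b, _ in absolute_edges])
print([(a, b) for a, b, _ in signed_edges])
```

With absolute values, the strongly anti-correlated pairs involving geneC are also connected; with signed values, only geneA and geneB pass the threshold.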
7.2. Mutual Information Algorithm
Instead of calculating correlation coefficients, this algorithm (Butte and Kohane, 2000) computes the entropy of gene expression patterns and the mutual information between biological elements such as genes. Networks inferred by this algorithm are constructed by computing the mutual information for each pair of elements. If the mutual information value is above a certain threshold, the two nodes representing the biological elements are connected in the network; otherwise, they remain unconnected. The mutual information is calculated from discrete data, so if the available data is continuous, it must be discretized before using this algorithm.
The figure above shows the dialog where the parameters for this algorithm can be chosen. Once the input parameters have been selected, clicking Infer Network starts generating the new network, and a status dialog appears to inform users of the current state. The input parameters for this algorithm are:
Threshold to add new edge: Sets the threshold used to decide whether the mutual information is high enough to draw an edge.
Use selected nodes only: This parameter is only useful if the chosen table belongs to an open network in Cytoscape. In this case, if the user selects only some nodes in the network view, the algorithm is applied only to those nodes and the resulting network contains only those nodes.
Data Attributes: A multiple-selection list is displayed. This list contains all the column names of the chosen table that might contain the discretized values.
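For discrete data, mutual information can be computed from entropies as I(X;Y) = H(X) + H(Y) - H(X,Y). The Python sketch below illustrates this identity. The function names are hypothetical, and the choice of base 10 for the logarithm is an assumption carried over from the Entropy Metric described later in this document; Cyni's mutual information algorithm may use a different base.

```python
import math
from collections import Counter


def entropy(values, base=10):
    """Shannon entropy of a sequence of discrete symbols."""
    n = len(values)
    return -sum((c / n) * math.log(c / n, base) for c in Counter(values).values())


def mutual_information(x, y, base=10):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for two equally long discrete sequences."""
    joint = list(zip(x, y))
    return entropy(x, base) + entropy(y, base) - entropy(joint, base)


# Hypothetical discretized profiles.
x = [0, 0, 1, 1]
y = [0, 0, 1, 1]   # identical to x: shares all of its information
z = [0, 1, 0, 1]   # independent of x in this sample

mi_xy = mutual_information(x, y)
mi_xz = mutual_information(x, z)
print(mi_xy, mi_xz)
```

In this example mi_xy equals the entropy of x (log10 of 2, about 0.301) and mi_xz is zero, so an edge threshold between those two values would connect x and y but not x and z.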
7.3. Bayesian K2 Algorithm
This algorithm, developed by Cooper and Herskovits, is a Bayesian method for constructing a probabilistic network from a database. It begins by assuming that a node has no parents, and then adds to that node, one at a time, the parent whose addition most improves the probability of the resulting network. When no remaining parent improves that probability, the work for that node is finished and the next node is considered. The ordering of the nodes is of high importance, because the technique only considers as possible parents the nodes that have already been processed. Therefore, the first node considered never has parents, and the second one can only have the first node as a parent. The figure below shows the dialog where the parameters for this algorithm can be chosen.
Once the input parameters have been selected, clicking Infer Network starts generating the new network, and a status dialog appears to inform users of the current state. The input parameters for this algorithm are:
Maximum number of parents: Tells the algorithm when to stop adding parents to a node, even if the probability of the resulting network would still improve.
Generate random ordering: Changes the order of the nodes randomly, so each time the algorithm is applied with this parameter selected, a different network is created.
Use selected nodes only: This parameter is only useful if the chosen table data belongs to a current network. In this case, if the user selects only some nodes in the network view, the algorithm is applied only to those nodes and the resulting network contains only those nodes.
Metric: Displays a list of the similarity measures available for this kind of algorithm.
Data Attributes: A multiple-selection list is displayed. This list contains all the column names of the chosen table that contain the right type of values for the selected metric.
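The greedy parent search described above can be sketched in Python as follows. The sketch assumes the K2 score of Cooper and Herskovits (computed in log space) as the metric, discrete data given as one dictionary per observation, and hypothetical function and variable names; it is an illustration of the technique, not Cyni's implementation.

```python
import math
from collections import defaultdict


def k2_log_score(data, child, parents, arity):
    """Log of the Cooper-Herskovits K2 term for one node and a given parent set."""
    counts = defaultdict(lambda: defaultdict(int))
    for row in data:
        config = tuple(row[p] for p in parents)
        counts[config][row[child]] += 1
    r = arity[child]
    score = 0.0
    for state_counts in counts.values():
        n_ij = sum(state_counts.values())
        # log[(r - 1)! / (N_ij + r - 1)!] + sum_k log(N_ijk!)
        score += math.lgamma(r) - math.lgamma(n_ij + r)
        score += sum(math.lgamma(n + 1) for n in state_counts.values())
    return score


def k2(data, order, arity, max_parents):
    """Greedy K2 search: each node may only take parents that precede it in the order."""
    parents = {node: [] for node in order}
    for position, node in enumerate(order):
        candidates = list(order[:position])
        best = k2_log_score(data, node, parents[node], arity)
        while candidates and len(parents[node]) < max_parents:
            scored = [(k2_log_score(data, node, parents[node] + [c], arity), c)
                      for c in candidates]
            top_score, top = max(scored)
            if top_score <= best:
                break  # no candidate improves the score
            parents[node].append(top)
            candidates.remove(top)
            best = top_score
    return parents


# Hypothetical binary data: B copies A, C is unrelated to both.
a_values = [0, 0, 0, 0, 1, 1, 1, 1]
c_values = [0, 1, 0, 1, 0, 1, 0, 1]
data = [{"A": a, "B": a, "C": c} for a, c in zip(a_values, c_values)]
arity = {"A": 2, "B": 2, "C": 2}

learned = k2(data, order=["A", "B", "C"], arity=arity, max_parents=2)
print(learned)  # {'A': [], 'B': ['A'], 'C': []}
```

Shuffling the order list (for example with random.shuffle) before calling k2 would correspond to the Generate random ordering parameter: the same data can then yield a different network.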
7.4. Hill Climbing
This algorithm applies a heuristic search to learn new Bayesian networks. It performs a local search that does not guarantee a globally optimal solution, but it produces a local solution that turns out to be a very good one. The process starts from an initial solution, usually a network with nodes and no edges; then, at each step, the algorithm analyzes all possible operations and chooses the one with the highest probability. There are three possible operations: Add, Remove or Reverse an edge. All three operations are considered for all nodes, and only the operation with the highest probability is performed. The process is then repeated until there is no further improvement. The figure below shows the dialog where the parameters for this algorithm can be chosen.
Once the input parameters have been selected, clicking Infer Network starts generating the new network, and a status dialog appears to inform users of the current state. The input parameters for this algorithm are:
Maximum number of parents: Tells the algorithm when to stop adding parents to a node, even if the probability of the resulting network would still improve.
Check reverse edges: Lets users decide whether reversing edges should be considered when searching for a better solution.
Parameters if a network associated to table data: This is a group of parameters that only makes sense if the chosen table data belongs to an open network in Cytoscape.
Use network associated as initial search: If this option is chosen, the network with its edges is used as the initial solution for the algorithm, so any new operation is applied to those edges and nodes.
Use selected nodes only: If the user selects only some nodes in the network view, the algorithm is applied only to those nodes and the resulting network contains only those nodes.
Keep selected edges: Defines the edges selected in the network view as locked edges, which means that the solution provided by the algorithm always contains those edges. This parameter only makes sense if Use network associated as initial search is also selected.
Metric: Displays a list of the similarity measures available for this kind of algorithm.
Data Attributes: A multiple-selection list is displayed. This list contains all the column names of the chosen table that contain the right type of values for the selected metric.
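A compact Python sketch of the hill-climbing search described above is given here. It assumes the K2 log score as the metric, starts from an empty network, and rejects any operation that would create a cycle; locked edges and an initial network are not modeled. All names are hypothetical and the code is an illustration rather than Cyni's implementation.

```python
import math
from collections import defaultdict
from itertools import permutations


def local_score(data, child, parents, arity):
    """K2 log score of one node given its parent set (assumed metric)."""
    counts = defaultdict(lambda: defaultdict(int))
    for row in data:
        counts[tuple(row[p] for p in sorted(parents))][row[child]] += 1
    r = arity[child]
    total = 0.0
    for states in counts.values():
        total += math.lgamma(r) - math.lgamma(sum(states.values()) + r)
        total += sum(math.lgamma(n + 1) for n in states.values())
    return total


def has_cycle(parents):
    """Depth-first search along child -> parent links."""
    state = {}

    def visit(node):
        if state.get(node) == "active":
            return True
        if state.get(node) == "done":
            return False
        state[node] = "active"
        if any(visit(p) for p in parents[node]):
            return True
        state[node] = "done"
        return False

    return any(visit(n) for n in parents)


def neighbours(parents, max_parents, check_reverse):
    """Yield every graph reachable by adding, removing or reversing one edge."""
    for a, b in permutations(parents, 2):
        if a in parents[b]:
            removed = {n: set(p) for n, p in parents.items()}
            removed[b].discard(a)
            yield removed  # remove a -> b
            if check_reverse and len(parents[a]) < max_parents:
                flipped = {n: set(p) for n, p in removed.items()}
                flipped[a].add(b)  # reverse a -> b into b -> a
                if not has_cycle(flipped):
                    yield flipped
        elif len(parents[b]) < max_parents:
            added = {n: set(p) for n, p in parents.items()}
            added[b].add(a)  # add a -> b
            if not has_cycle(added):
                yield added


def hill_climb(data, nodes, arity, max_parents=2, check_reverse=True):
    def total(graph):
        return sum(local_score(data, n, graph[n], arity) for n in graph)

    parents = {n: set() for n in nodes}
    best = total(parents)
    while True:
        candidates = [(total(g), g) for g in neighbours(parents, max_parents, check_reverse)]
        if not candidates:
            return parents
        score, graph = max(candidates, key=lambda c: c[0])
        if score <= best + 1e-12:  # stop when no operation strictly improves the score
            return parents
        parents, best = graph, score


# Hypothetical binary data: B copies A, C is unrelated to both.
data = [{"A": a, "B": a, "C": c}
        for a, c in zip([0, 0, 0, 0, 1, 1, 1, 1], [0, 1, 0, 1, 0, 1, 0, 1])]
network = hill_climb(data, ["A", "B", "C"], {"A": 2, "B": 2, "C": 2})
print(network)
```

On this data the search links A and B and leaves C isolated. The direction of the A-B edge is arbitrary because both orientations have the same score.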
8. Measurements
Cyni Toolbox does not only provide algorithms that are applied to data tables to produce new networks or modify the tables. Similarity measurements are another very important element in producing new networks, and researchers have proposed a wide variety of measures. These measures can also be used for many other purposes, and we think it is important to make them available to all Cytoscape users and to allow researchers to integrate new measures into Cytoscape. For this reason, the Cyni framework provides a way to add and store newly developed measures. Any newly added measure is available through Cyni Toolbox along with the already integrated ones, and the way to use or develop other measures is explained in detail in the Cyni App Development documentation.
Currently, Cyni Toolbox offers the following measurements, grouped by type.
8.1. Correlation Metrics
This group of measurements includes all metrics that look for a statistical relationship between two random variables or two sets of data. This implies that these metrics only compare one row of data against another one. Other possibilities, such as comparing one row of data against several rows, belong to other groups of measurements. The correlation metrics available in Cyni so far are:
Pearson Correlation: This value is obtained by dividing the covariance of the two variables by the product of their standard deviations.
Spearman's rank correlation: This is a nonparametric measure of the correlation between the two rows of data. In this metric, the values in the rows are replaced by their ranks and then the linear correlation coefficient is applied.
Kendall's Tau correlation: This is also a nonparametric measure. Instead of using the numerical difference of ranks, it uses only the relative ordering of the ranks.
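The three correlation metrics listed above can be illustrated with the following plain-Python sketch. The function names are hypothetical, and two details are assumptions: tied values receive average ranks in the Spearman computation, and Kendall's tau is computed in its tau-a form, where tied pairs count as neither concordant nor discordant. Cyni may handle ties differently.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Covariance divided by the product of the standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Pearson correlation applied to the ranks of the values."""
    return pearson(ranks(x), ranks(y))


def kendall_tau(x, y):
    """Tau-a: (concordant - discordant) pairs divided by the number of pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n = len(x)
    return (concordant - discordant) / (n * (n - 1) / 2)


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]
rho = spearman(x, y)
tau = kendall_tau(x, y)
print(rho, tau)
```

For these two rows, two of the ten pairs are out of order, which gives a Spearman coefficient of 0.8 and a Kendall tau of 0.6.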
8.2. Local Score Metrics
The measurements in this group have in common that the final value can be decomposed into the sum or product of the scores of the individual nodes. Therefore, in this case, it is possible to compare a row against several rows. The metrics available for this group in Cyni so far are:
Bayesian (K2) Metric: It is also known as K2 metric and is defined as
Entropy Metric: The Entropy is the average unpredictability in a random variable, which is equivalent to its information content. The base of the logarithm used is 10 and it is defined as
AIC Metric
Minimum Description Length (MDL): The MDL score metric (log joint probability of structure and database)
Bayesian Dirichlet Equivalent (BDe) Metric: The BDe metric is defined as
9. Authors
CYNI was implemented by Oriol Guitart Pla, and conceived by Frank Rügheimer and Benno Schwikowski.
10. Acknowledgments
The implementation of Bayesian Principal Component Analysis (BPCA) incorporates the original Java code by Shigeyuki Oba (Oba et al. 2003). The CYNI project is part of http://nrnb.org and is funded by NIH grant P41 GM103504.
11. Downloads
12. References
Cytoscape Original Paper: Shannon P, Markiel A, Ozier O, Baliga N, Amin N, Ramage D, Schwikowski B, Ideker T (2003) Cytoscape: A software environment for integrated models of biomolecular interaction networks. Genome Res 13(11): 2498-2504.
Cooper GF, Herskovits E (1992) A Bayesian Method for the Induction of Probabilistic Networks from Data. Machine Learning 9: 309-347.
Oba S, Sato M, Takemasa I, Monden M, Matsubara K, Ishii S (2003) A Bayesian Missing value estimation method for gene expression profile data. Bioinformatics 19, pp. 2088-2096.
Gamez JA, Mateo JL, Puerta JM (2011) Learning Bayesian networks by hill climbing: efficient methods based on progressive restriction of the neighborhood. Data Min. Knowl. Discov. 22, 1-2 (January 2011).
Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH (2009) The WEKA Data Mining Software: An Update. SIGKDD Explorations, Volume 11, Issue 1.
Butte AJ, Kohane IS (2000) Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements. Pac Symp Biocomput, pp. 418-429.