Generalized Score Functions for Causal Discovery
Huang et al. (2018). Generalized score functions for causal discovery.

A score-based algorithm requires two components: the score function and the search procedure. In addition to the Markov assumptions, an ideal score function should meet the following basic criteria:
- characterization of conditional dependency
- local consistency and score equivalence
- robustness to multiple, if not all, data types and distributions, including
  - non-linear relations
  - non-Gaussian distributions
  - (infinite) multi-dimensionality
  - heterogeneity (i.e., continuous/discrete, stationary/time-series data)
- tractability and computational efficiency
While previous algorithms are limited by assuming a certain model class or distribution from the start, the Generalized Score Function (GSF) is a novel effort to unify score-based approaches into a generalized framework that works on a wide class of data types and distributions.
| Methods (refs) | Non-linearity | Non-Gaussianity | Multi-dimensionality | Heterogeneous data |
|---|---|---|---|---|
| BIC, MDL, BGe, BDeu, BDe, BCCD | No | No | No | No |
| Mixed BIC | No | No | No | Stationary only |
| LiNGAM, ICA-LiNGAM, Direct-LiNGAM | No | Yes | No | Yes |
| ANM, CAM, PNL | Yes | Yes | No | Yes |
| GSF | Yes | Yes | Yes | Stationary only |
Notations
Let $X$ be a random variable with domain $\mathcal{X}$ and probability law $P_X$. Then:
- $\varphi : \mathcal{X} \to \mathcal{H}$ is a continuous feature map projecting a point onto the RKHS $\mathcal{H}$
- $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a positive definite kernel function such that $k(x, x') = \langle \varphi(x), \varphi(x') \rangle_{\mathcal{H}}$
- $L^2_X$ denotes the space of square-integrable functions of $X$
- for any point $x_j$ in a sample of $n$ observations, the reproducing property yields $f(x_j) = \langle f, \varphi(x_j) \rangle_{\mathcal{H}}$
- any $f \in \mathcal{H}$ in the span of the sample can be written $f = \sum_{j=1}^{n} \alpha_j \varphi(x_j)$, with $\alpha_j$ being the weights; its empirical feature map is $\hat{\varphi}(x) = [k(x_1, x), \dots, k(x_n, x)]^\top$
- $K_X$ represents the kernel matrix with each element $(K_X)_{ij} = k(x_i, x_j)$, and $\tilde{K}_X = H K_X H$ is the centralized kernel matrix, where $H = I_n - \frac{1}{n}\mathbf{1}\mathbf{1}^\top$ with identity matrix $I_n$ and vector of 1's $\mathbf{1}$
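These objects are easy to compute directly. A minimal sketch (assuming an RBF kernel; the kernel width is an illustrative choice):

```python
import numpy as np

def rbf_kernel_matrix(x, width=1.0):
    """Kernel matrix K with K_ij = k(x_i, x_j) for an RBF kernel."""
    d2 = (x[:, None] - x[None, :]) ** 2
    return np.exp(-d2 / (2.0 * width ** 2))

def centralize(K):
    """Centralized kernel matrix K~ = H K H, with H = I - (1/n) 1 1^T."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

rng = np.random.default_rng(0)
x = rng.normal(size=50)
K = rbf_kernel_matrix(x)
Kc = centralize(K)
# Centralization makes every row (and column) of K~ sum to zero.
print(np.allclose(Kc.sum(axis=0), 0.0))  # True
```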
1. Characterization of Conditional Independence
The score function must act as a proxy for a conditional independence test, which, in the general case, is characterized by conditional covariance operators. The kernel-based characterization of conditional independence is described as follows (Lemma 1, adapted from the paper; the statement follows the characterization by Fukumizu et al.).

Lemma 1. Let $\ddot{X} \triangleq (X, Z)$ and $\ddot{Y} \triangleq (Y, Z)$. Then

$$\Sigma_{\ddot{Y}\ddot{Y} \mid \ddot{X}} = \Sigma_{\ddot{Y}\ddot{Y} \mid Z} \iff X \perp\!\!\!\perp Y \mid Z,$$

where $\Sigma_{YY \mid X} \triangleq \Sigma_{YY} - \Sigma_{YX}\Sigma_{XX}^{-1}\Sigma_{XY}$ is the conditional covariance operator on the RKHS.

For all functions $g \in \mathcal{H}_Y$, this operator satisfies

$$\langle g, \Sigma_{YY \mid X}\, g \rangle_{\mathcal{H}_Y} = \mathbb{E}_X\big[\mathrm{Var}_{Y \mid X}[g(Y) \mid X]\big],$$

i.e., it measures the residual variance of $g(Y)$ after conditioning on $X$. Additionally, the operator decreases (in the partial order of self-adjoint operators) as the conditioning set grows; equality between the two operators above therefore means that adding $X$ to the conditioning set $Z$ explains nothing further about $Y$.
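A numerical intuition for the conditional covariance operator: its empirical trace can be written in Gram-matrix form as $\operatorname{tr}\hat{\Sigma}_{YY|X} = \frac{1}{n}\big[\operatorname{tr}\tilde{K}_Y - \operatorname{tr}\big(\tilde{K}_X(\tilde{K}_X + n\varepsilon I)^{-1}\tilde{K}_Y\big)\big]$. The sketch below (kernel width and regularizer $\varepsilon$ are illustrative choices) shows that conditioning on the true cause leaves far less residual variance of $\varphi(Y)$ than conditioning on an unrelated variable:

```python
import numpy as np

def rbf(x, w=1.0):
    d2 = (x[:, None] - x[None, :]) ** 2
    return np.exp(-d2 / (2 * w * w))

def center(K):
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def cond_cov_trace(Ky, Kx, eps=1e-3):
    """Empirical trace of the conditional covariance operator Sigma_{YY|X}."""
    n = Ky.shape[0]
    Kyc, Kxc = center(Ky), center(Kx)
    resid = Kyc - Kxc @ np.linalg.solve(Kxc + n * eps * np.eye(n), Kyc)
    return np.trace(resid) / n

rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
y = np.tanh(2 * x) + 0.1 * rng.normal(size=n)   # y depends on x
z = rng.normal(size=n)                          # z is independent of y
t_x = cond_cov_trace(rbf(y), rbf(x))
t_z = cond_cov_trace(rbf(y), rbf(z))
# Conditioning on the true cause explains away far more variance of phi(y).
print(t_x < t_z)  # True
```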
2. Regression in RKHS
Given random variables $X_i$ with parent set $PA_i$ in a candidate DAG $\mathcal{G}$, GSF regresses the feature map of the child on its parents in the RKHS:

$$\varphi(X_i) = f(PA_i) + \varepsilon_i, \qquad (1)$$

where $f$ is an unknown regression function and $\varepsilon_i$ is a noise term.

Furthermore, each component of the empirical feature map $\hat{\varphi}(X_i)$ obeys the same regression model, which turns the infinite-dimensional problem into $n$ ordinary regressions on the sample.

Given $n$ observations, the most straightforward test for goodness-of-fit is the maximum log-likelihood for the regression problem in (1). Unfortunately, we do not know the distribution of the noise term, so the likelihood is evaluated by assuming Gaussian residuals in the empirical feature space.
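The regression in feature space can be sketched with kernel ridge regression. Here the empirical kernel map of $X_i$ (the columns of its Gram matrix) stands in for $\hat{\varphi}(X_i)$, and every feature-map component is regressed on the parents at once; the kernel widths and ridge parameter $\lambda$ are illustrative choices:

```python
import numpy as np

def rbf(a, w=1.0):
    d2 = (a[:, None] - a[None, :]) ** 2
    return np.exp(-d2 / (2 * w * w))

rng = np.random.default_rng(2)
n = 150
pa = rng.normal(size=n)                        # parent variable PA_i
x = np.sin(3 * pa) + 0.2 * rng.normal(size=n)  # child X_i = f(PA_i) + eps

K_pa = rbf(pa)   # Gram matrix of the parents
Phi_x = rbf(x)   # empirical kernel map of X_i (one column per component)
lam = 1e-2

# Kernel ridge regression of every feature-map component at once:
# alpha = (K_PA + n*lambda*I)^{-1} Phi_X, predictions = K_PA @ alpha.
alpha = np.linalg.solve(K_pa + n * lam * np.eye(n), Phi_x)
resid = Phi_x - K_pa @ alpha
print(resid.shape)  # (150, 150): one residual vector per feature component
```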
3. Formulation
For a particular observation $x$, the feature map need not be evaluated explicitly: according to the reproducing property, $f(x) = \langle f, \varphi(x) \rangle_{\mathcal{H}}$, so both the regression function and the feature map of the child can be expressed through the kernel matrices of the child and of its parents.

We now turn our attention to the reformulated problem: maximizing the Gaussian log-likelihood of the kernel-regression residuals. The maximal value of the log-likelihood admits a closed form in the centralized kernel matrices $\tilde{K}_{X_i}$ and $\tilde{K}_{PA_i}$, and this closed form is what the score functions below evaluate.
4. Score Functions
Maximizing the likelihood directly can cause over-fitting. Thus, the proposed score functions are the cross-validated (CV) log-likelihood and the marginal log-likelihood.
4.1. CV Log-Likelihood
Given a dataset $D$, split it into $Q$ mutually exclusive folds $D = \{D_1, \dots, D_Q\}$. For each variable, the regression model is fitted on the training folds and the log-likelihood is evaluated on the held-out fold.

The score of DAG $\mathcal{G}$ is then

$$S_{CV}(\mathcal{G}; D) = \sum_{i=1}^{m} \frac{1}{Q} \sum_{q=1}^{Q} \log \hat{p}\big(D_i^{q,\mathrm{test}} \mid \hat{\theta}_i^{q}\big),$$

where $\hat{\theta}_i^{q}$ is the model estimated from the training folds of round $q$. The formula can be derived using matrix manipulation; the held-out term approximates the expected log-likelihood on unseen data, which is what protects the score against over-fitting.
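A minimal sketch of the cross-validated score for one child given one candidate parent. For simplicity it regresses the child itself (rather than its feature map) and fits a Gaussian to the residuals; $Q$, $\lambda$, and the kernel width are illustrative choices:

```python
import numpy as np

def rbf(a, b, w=1.0):
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-d2 / (2 * w * w))

def cv_loglik(child, parent, Q=5, lam=1e-2):
    """Q-fold cross-validated Gaussian log-likelihood of child | parent."""
    n = len(child)
    folds = np.array_split(np.arange(n), Q)
    total = 0.0
    for q in range(Q):
        test = folds[q]
        train = np.setdiff1d(np.arange(n), test)
        K_tr = rbf(parent[train], parent[train])
        # Kernel ridge fit on the training folds.
        alpha = np.linalg.solve(K_tr + len(train) * lam * np.eye(len(train)),
                                child[train])
        pred = rbf(parent[test], parent[train]) @ alpha
        sigma2 = np.mean((child[train] - K_tr @ alpha) ** 2)  # train noise var
        resid = child[test] - pred
        total += np.sum(-0.5 * np.log(2 * np.pi * sigma2)
                        - resid ** 2 / (2 * sigma2))
    return total / Q

rng = np.random.default_rng(3)
n = 200
x = rng.normal(size=n)
y = np.tanh(2 * x) + 0.2 * rng.normal(size=n)
z = rng.normal(size=n)
# The true parent explains y far better than an unrelated variable.
print(cv_loglik(y, x) > cv_loglik(y, z))  # True
```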
4.2. Marginal Likelihood
The marginal likelihood score places a Gaussian-process prior on the regression function $f$ in (1) and integrates it out. The score of DAG $\mathcal{G}$ is therefore estimated as the sum, over all variables, of the log marginal likelihood of each child given its parents; each term has the standard closed form of Gaussian-process regression in the kernel matrices of the parents.
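The per-variable term can be sketched with the standard Gaussian-process log marginal likelihood, again applied to the scalar child rather than its feature map for simplicity; the noise variance $\sigma^2$ is an illustrative fixed value rather than an optimized hyperparameter:

```python
import numpy as np

def rbf(a, w=1.0):
    d2 = (a[:, None] - a[None, :]) ** 2
    return np.exp(-d2 / (2 * w * w))

def gp_log_marginal(child, parent, sigma2=0.1):
    """log p(child | parent) = -1/2 y^T C^{-1} y - 1/2 log|C| - n/2 log(2 pi),
    with C = K_PA + sigma2 * I (standard GP regression marginal likelihood)."""
    n = len(child)
    C = rbf(parent) + sigma2 * np.eye(n)
    L = np.linalg.cholesky(C)          # C = L L^T
    v = np.linalg.solve(L, child)      # v^T v = y^T C^{-1} y
    logdet = 2 * np.sum(np.log(np.diag(L)))
    return -0.5 * v @ v - 0.5 * logdet - 0.5 * n * np.log(2 * np.pi)

rng = np.random.default_rng(4)
n = 150
x = rng.normal(size=n)
y = np.tanh(2 * x) + 0.2 * rng.normal(size=n)
z = rng.normal(size=n)
# The true parent yields a higher marginal likelihood than an unrelated one.
print(gp_log_marginal(y, x) > gp_log_marginal(y, z))  # True
```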
Reliability
A score function must also satisfy Local Consistency and Score Equivalence.
Local Consistency

The score function is expected to give lower scores to models that impose incorrect independence constraints. It has been shown that, under mild conditions, the CV likelihood score is locally consistent according to Lemma 2.

The model selected by CV likelihood performs as well as the model selected by the optimal benchmark fitted on the entire sample. That is, the CV likelihood score supports learning causal relations that hold in the true data-generating distribution, while at the same time guarding against over-fitting.
The marginal likelihood score is also proven to be locally consistent, according to Lemma 3.

The score can then be expressed as the Bayesian information criterion (BIC) plus a constant, and BIC is known to be locally consistent.
Score Equivalence

On the other hand, previous studies provide evidence that non-linear regression gives different scores to models within the same equivalence class. Thus, neither the CV nor the marginal likelihood score meets score equivalence.

However, the authors show that local consistency and the local score change can be exploited to compare the reliability of any two DAGs. Empirically, the learned DAG with the same structure and orientations as the data-generating process receives the highest score. Since there is as yet no theoretical proof of this, the problem is instead cast as searching for the optimal Markov equivalence class.

In general, if the conditions in Lemma 2 and Lemma 3 hold, the CV likelihood and marginal likelihood scores under the RKHS regression framework are guaranteed to find the Markov equivalence class consistent with the data-generating distribution when used with the Greedy Equivalence Search (GES) algorithm. Details can be found in Proposition 1 and Proposition 2.
Previous score-based methods
- BIC, MDL : Estimating the Dimension of a Model
- BGe : Learning Gaussian Networks
- BDeu, BDe : Learning Bayesian networks: The combination of knowledge and statistical data.
- BCCD : A Bayesian Approach to Constraint Based Causal Inference
- Mixed BIC : Causal discovery from databases with discrete and continuous variables.
- LiNGAM : LiNGAM: non-Gaussian methods for estimating causal structure
- ICA-LiNGAM : A linear non-Gaussian acyclic model for causal discovery
- Direct-LiNGAM : DirectLiNGAM: A direct method for learning a linear non-Gaussian structural equation model
- ANM : Nonlinear causal discovery with additive noise models
- CAM : CAM: Causal Additive Models, high-dimensional order search and penalized regression
- PNL : On the identifiability of the post-nonlinear causal model