Generalized Score Functions for Causal Discovery

Huang et al. (2018). Generalized score functions for causal discovery.

A score-based algorithm requires two components: the score function and the search procedure. In addition to the Markov assumptions, an ideal score function should meet the following basic criteria:

  • characterization of conditional dependency
  • local consistency and score equivalence
  • robustness to multiple, if not all, data types and distributions, including
    • non-linear relations
    • non-Gaussian distributions
    • (infinite) multi-dimensionality
    • heterogeneity (i.e., continuous/discrete, stationary/time-series data)
  • tractability and computational efficiency

While previous algorithms are limited by assuming a particular model class or distribution from the outset, the Generalized Score Function (GSF) is a novel effort to unify score-based approaches into a generalized framework that works on a wide class of data types and distributions.

| Methods (refs) | Non-linearity | Non-Gaussianity | Multi-dimensionality | Heterogeneous data |
| --- | --- | --- | --- | --- |
| BIC, MDL; BGe, BDeu; BDe, BCCD | No | No | No | No |
| Mixed BIC | No | No | No | Stationary only |
| LiNGAM, ICA-LiNGAM, Direct-LiNGAM | No | Yes | No | Yes |
| ANM, CAM, PNL | Yes | Yes | No | Yes |
| GSF | Yes | Yes | Yes | Stationary only |

Comparison of score-based methods

Notations

Let $X$ be a random variable with domain $\mathcal{X}$ of any dimension.
$\mathcal{H}_X$ is defined as a Reproducing Kernel Hilbert Space (RKHS) on $\mathcal{X}$, in which

  1. $\phi_X : \mathcal{X} \to \mathcal{H}_X$ is a continuous feature map projecting a point $x \in \mathcal{X}$ onto $\mathcal{H}_X$

  2. $k_X : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a positive definite kernel function such that $k_X(x, x') = \langle \phi_X(x), \phi_X(x') \rangle = \langle k_X(\cdot, x), k_X(\cdot, x') \rangle$
  3. the probability law of $X$ is $P_X$ and $\mathcal{H}_X \subset L^2(P_X)$, the space of square-integrable functions
    • This is a completeness property: with respect to the probability law $P_X$, $\mathcal{H}_X$ contains only square-integrable functions, so the $L^2$ norm is defined on $\mathcal{H}_X$.

  4. for any point $x$ in a sample of $n$ observations,
    • the reproducing property yields $\forall x \in \mathcal{X}, f \in \mathcal{H}_X : \langle f(\cdot), k_X(\cdot, x) \rangle = f(x) = \sum_{i=1}^n c_i\, k_X(x_i, x)$, with $c_i$ being the weights
    • its empirical feature map is $k_x = [k_X(x_1, x), \ldots, k_X(x_n, x)]^T$

  5. $K_X$ represents the $n \times n$ kernel matrix with elements $K_{ij} = k_X(x_i, x_j)$, and $\tilde{K}_X = H K_X H$ is the centralized kernel matrix, where $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^T$ with $n \times n$ identity matrix $I$ and vector of ones $\mathbf{1}$
Similar notations apply to the other variables $Y$ and $Z$.
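The kernel-matrix notation above can be made concrete in a few lines of numpy (a minimal sketch; the Gaussian kernel and its width are illustrative assumptions, not choices prescribed by the paper):

```python
import numpy as np

def gaussian_kernel_matrix(x, width=1.0):
    """n x n kernel matrix with K_ij = k_X(x_i, x_j) for a Gaussian kernel."""
    sq_dists = (x[:, None] - x[None, :]) ** 2
    return np.exp(-sq_dists / (2.0 * width ** 2))

def centralize(K):
    """K_tilde = H K H with H = I - (1/n) 1 1^T."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

rng = np.random.default_rng(0)
x = rng.normal(size=20)

K = gaussian_kernel_matrix(x)   # kernel matrix K_X
K_tilde = centralize(K)         # centralized kernel matrix

# Centralization removes the empirical mean in feature space,
# so every row and column of K_tilde sums to (numerically) zero.
print(np.allclose(K_tilde.sum(axis=0), 0.0))  # True
```

The centralized matrix $\tilde{K}_X$ is the quantity that enters the score formulas later in this note.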

1. Characterization of Conditional Independence

The score function must act as a proxy for a conditional independence test, which, in the general case, is characterized by conditional covariance operators. The kernel-based conditional independence test is described as follows.

Lemma 1 (extracted from the paper)

where $\Sigma_{XX|Z} = \Sigma_{XX} - \Sigma_{XZ}\Sigma_{ZZ}^{-1}\Sigma_{ZX}$.

For all functions $g \in \mathcal{H}_X$ and $f \in \mathcal{H}_Z$, the cross-covariance operator $\Sigma_{ZX} : \mathcal{H}_X \to \mathcal{H}_Z$ of a random vector $(X, Z)$ on $\mathcal{X} \times \mathcal{Z}$, defined in Fukumizu, Bach & Jordan (2004), satisfies $\langle f, \Sigma_{ZX} g \rangle_{\mathcal{H}_Z} = E_{XZ}[g(X)f(Z)] - E_X[g(X)]\,E_Z[f(Z)]$

Additionally, the operator $\Sigma_{XX|Z}$ captures the conditional variance of a random variable $X$ through $\langle g, \Sigma_{XX|Z}\, g \rangle_{\mathcal{H}_X} = E_Z[\mathrm{Var}_{X|Z}[g(X)|Z]]$

Let $\ddot{Z} := [Y, Z]$. Similarly, we have $\langle g, \Sigma_{XX|\ddot{Z}}\, g \rangle_{\mathcal{H}_X} = E_{\ddot{Z}}[\mathrm{Var}_{X|\ddot{Z}}[g(X)|\ddot{Z}]]$
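The defining identity of the cross-covariance operator can be checked empirically: $E_{XZ}[g(X)f(Z)] - E_X[g(X)]E_Z[f(Z)]$ is far from zero when $X$ depends on $Z$ and vanishes under independence (a numpy sketch; the data-generating mechanism and the test functions $g = f = \sin$ are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Dependent pair: Z drives X through a non-linear function plus noise.
z = rng.normal(size=n)
x = np.tanh(z) + 0.1 * rng.normal(size=n)

# Independent pair for contrast.
x_ind = rng.normal(size=n)

def cross_cov(gx, fz):
    """Empirical <f, Sigma_ZX g> = E[g(X)f(Z)] - E[g(X)]E[f(Z)]."""
    return np.mean(gx * fz) - np.mean(gx) * np.mean(fz)

dep = cross_cov(np.sin(x), np.sin(z))       # clearly non-zero
ind = cross_cov(np.sin(x_ind), np.sin(z))   # ~0 up to sampling noise
print(abs(dep) > abs(ind))  # True
```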

2. Regression in RKHS

Given random variables $X, Z$ over measurable spaces $\mathcal{X}, \mathcal{Z}$, the authors exploit the regression framework in RKHS and encode the dependency by regressing $\phi_X(X)$ on $Z$:

$\phi_X(X) = F(Z) + U, \quad F : \mathcal{Z} \to \mathcal{H}_X \quad (1)$

Consider the two regression models in RKHS:

$\phi_X(X) = F_1(Z) + U_1, \qquad \phi_X(X) = F_2(\ddot{Z}) + U_2 \quad (2)$

From Lemma 1, we can derive the equality between the conditional variance operators as

Equation 4 (extracted from the paper)

$\phi_X(X)$ is a vector in an infinite-dimensional space and can be written as $\phi_X(X) = [\phi_1(X), \ldots, \phi_i(X), \ldots]^T$. Since $\phi_X(X)$ is a feature map in $\mathcal{H}_X$, there exists a function $g \in \mathcal{H}_X$ such that $g(X) = \phi_i(X)$. Thus, Equation 4 holds for any $\phi_i(X)$.

Furthermore, the components $\phi_i(X)$ and $\phi_j(X)$ are orthogonal, so $\mathrm{Cov}(\phi_i(X), \phi_j(X)) = 0$ for $i \neq j$. Notice that $\phi_X(X)|Z$ is also a random vector, $\phi_X(X)|Z = [\phi_1(X)|Z, \ldots, \phi_i(X)|Z, \ldots]^T$, so the covariance matrix $\mathrm{Var}[\phi_X(X)|Z]$ is diagonal, with each non-zero element corresponding to $\mathrm{Var}[\phi_i(X)|Z]$. Thus, we can derive

Equation 3 (extracted from the paper)

Given $Z$, $Y$ gives no information to predict $X$. Graphically, if $Z$ is a parent of $X$ and $X$ and $Y$ are independent conditional on $Z$, there should be no arrow pointing directly from $Y$ to $X$. Thus, the first structural equation in (2) is the better fit. Here, causal learning can be viewed as a model-selection problem, and the task becomes how to construct a score function for selecting the regression model in RKHS.

The most straightforward test for goodness-of-fit is the maximum log-likelihood for the regression problem in (1). Unfortunately, we do not know the distribution of $\phi_X(X)$, since it lives in an infinite-dimensional space. The strategy is to reformulate the problem so that it is expressed only in terms of kernel functions, which is tractable in finite dimensions.

3. Formulation

For a particular observation $(x, z)$, we map $\phi_X(x)$ into the empirical feature map $k_x$, which lives in an $n$-dimensional space: $k_x = [\langle k_X(\cdot, x_1), \phi_X(x) \rangle_{\mathcal{H}_X}, \ldots, \langle k_X(\cdot, x_n), \phi_X(x) \rangle_{\mathcal{H}_X}]^T = [k_X(x_1, x), \ldots, k_X(x_n, x)]^T$

According to the reproducing property, $k_x$ has the same vector structure as $\phi_X(X)$, wherein each element is equivalent to a function $f_i(x)$ given by a linear combination of kernel functions. This means that regressing $k_x$ on $Z$ does not cause loss of information; that is, the log-likelihood score reflects model compatibility while remaining tractable.

We now turn our attention to the reformulated problem

$k_x = \tilde{F}(z) + \tilde{U} \quad (5)$

Assuming Gaussian errors, we fit a kernel ridge regression (regression with $L^2$ regularization) on a finite sample of size $n$ to estimate

$\hat{\tilde{F}}_Z(z) = \sum_{i=1}^n c_i\, k_Z(z_i, z)$

where the coefficients $c$ can be found by solving

$\min_c \frac{1}{n} \lVert \tilde{K}_X - c\tilde{K}_Z \rVert^2 + \lambda\, c^T \tilde{K}_Z c, \quad \lambda: \text{regularization parameter}$

This gives $\hat{c} = \tilde{K}_X(\tilde{K}_Z + n\lambda I)^{-1}$. The fitted function is

$\hat{\tilde{F}}_Z = \hat{c}\tilde{K}_Z = \tilde{K}_X(\tilde{K}_Z + n\lambda I)^{-1}\tilde{K}_Z$

The estimated covariance matrix is

$\hat{\Sigma} = \frac{1}{n}(\tilde{K}_X - \hat{\tilde{F}}_Z)(\tilde{K}_X - \hat{\tilde{F}}_Z)^T = n\lambda^2\, \tilde{K}_X(\tilde{K}_Z + n\lambda I)^{-2}\tilde{K}_X$
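These closed forms can be verified numerically: the residual covariance $\frac{1}{n}(\tilde{K}_X - \hat{\tilde{F}}_Z)(\tilde{K}_X - \hat{\tilde{F}}_Z)^T$ coincides with $n\lambda^2\,\tilde{K}_X(\tilde{K}_Z + n\lambda I)^{-2}\tilde{K}_X$ (a sketch with Gaussian kernels; the sample size, bandwidth, and $\lambda$ are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(2)
n, lam = 50, 0.1

z = rng.normal(size=n)
x = np.sin(z) + 0.2 * rng.normal(size=n)

def rbf(a):
    return np.exp(-(a[:, None] - a[None, :]) ** 2 / 2.0)

H = np.eye(n) - np.ones((n, n)) / n
Kx = H @ rbf(x) @ H                 # centralized kernel matrix of X
Kz = H @ rbf(z) @ H                 # centralized kernel matrix of Z

# Kernel ridge regression of the empirical feature map of X on Z.
reg = Kz + n * lam * np.eye(n)
F_hat = Kx @ np.linalg.solve(reg, Kz)

# Two algebraically equivalent forms of the residual covariance.
R = Kx - F_hat
Sigma_1 = R @ R.T / n
reg_inv = np.linalg.inv(reg)
Sigma_2 = n * lam ** 2 * Kx @ reg_inv @ reg_inv @ Kx
print(np.allclose(Sigma_1, Sigma_2))  # True
```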

The maximal value of log-likelihood is represented as

Maximum Value of Log-Likelihood (extracted from the paper)
From line 2 to line 3, by the cyclic property of the trace, $\mathrm{trace}\{(\tilde{K}_X - \hat{\tilde{W}}\Phi_Z)^T \hat{\Sigma}^{-1} (\tilde{K}_X - \hat{\tilde{W}}\Phi_Z)\} = \mathrm{trace}\{\hat{\Sigma}^{-1} (\tilde{K}_X - \hat{\tilde{W}}\Phi_Z)(\tilde{K}_X - \hat{\tilde{W}}\Phi_Z)^T\} = \mathrm{trace}\{\hat{\Sigma}^{-1} (\tilde{K}_X - \hat{\tilde{F}}_Z)(\tilde{K}_X - \hat{\tilde{F}}_Z)^T\} = \mathrm{trace}\{\hat{\Sigma}\hat{\Sigma}^{-1}\} = \mathrm{trace}\{I\} = n$

4. Score Functions

Maximizing the likelihood can cause over-fitting. Thus, the proposed score functions are Cross-Validation Log-Likelihood and Marginal Log-Likelihood.

4.1. CV Log-Likelihood

Given a dataset $D$ with $m$ variables $X_1, X_2, \ldots, X_m$, we split $D$ into a training set of size $n_1$ and a test set of size $n_0$, then repeat the procedure $Q$ times. Let $D_{1,i}^{(q)}$ and $D_{0,i}^{(q)}$ be the $q$th training set and the $q$th test set for variable $X_i$.

The score of DAG $G^h$ is represented by $S_{CV}(G^h; D) = \sum_{i=1}^m S_{CV}(X_i, PA_i^{G^h})$ with $S_{CV}(X_i, PA_i^{G^h}) = \frac{1}{Q}\sum_{q=1}^Q \ell(\hat{\tilde{F}}_i^{(q)} \mid D_{0,i}^{(q)})$, where $X_i$ is the target variable and $PA_i^{G^h}$ is the set of predictors in the regression function in Equation 5. The individual likelihood function is represented with kernel matrices as follows

CV Log-Likelihood (extracted from the paper)

The formula can be derived using matrix manipulation. Notice that line 2 approximates $\hat{\Sigma}^{(q)}$ by adding $\lambda$ to give rise to kernel functions. The paper also proposes removing the fitted values and using only $\tilde{K}_Z$ to reduce overfitting; this is equivalent to regressing $k_x$ on the empty set. Please refer to the paper for more details.
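To make the CV score concrete, here is a toy version for a scalar-output kernel ridge regression (a deliberate simplification: the paper's score regresses the empirical feature map $k_x$ rather than the raw variable, and the bandwidth, $\lambda$, and $Q$ below are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)

def rbf(a, b):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / 2.0)

def cv_loglik(x, z, Q=10, lam=0.1):
    """Simplified Q-fold CV Gaussian log-likelihood for kernel ridge
    regression of x on z; the noise variance is fitted on the training fold."""
    n = len(x)
    folds = np.array_split(rng.permutation(n), Q)
    total = 0.0
    for q in range(Q):
        test = folds[q]
        train = np.concatenate([folds[j] for j in range(Q) if j != q])
        Kzz = rbf(z[train], z[train])
        alpha = np.linalg.solve(Kzz + lam * len(train) * np.eye(len(train)), x[train])
        sigma2 = (x[train] - Kzz @ alpha).var() + 1e-12   # training residual variance
        resid = x[test] - rbf(z[test], z[train]) @ alpha  # held-out residuals
        total += np.sum(-0.5 * np.log(2 * np.pi * sigma2) - resid ** 2 / (2 * sigma2))
    return total / Q

n = 300
z = rng.normal(size=n)                     # true parent of X
y = rng.normal(size=n)                     # irrelevant variable
x = np.sin(z) + 0.2 * rng.normal(size=n)

score_parent = cv_loglik(x, z)
score_nonparent = cv_loglik(x, y)
print(score_parent > score_nonparent)  # True
```

Regressing $X$ on its true parent attains a higher CV log-likelihood than regressing on an irrelevant variable, which is how the score discriminates candidate parent sets.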

4.2. Marginal Likelihood

The marginal likelihood score is estimated as $S_M(G^h; D) = \sum_{i=1}^m S_M(X_i, PA_i^{G^h})$ with $S_M(X_i, PA_i^{G^h}) = \log p(X_i \mid PA_i^{G^h}, \hat{\sigma}_i)$, where the noise variance $\hat{\sigma}_i$ is learned by maximizing the marginal likelihood with gradient methods.

Marginal Log-Likelihood (extracted from the paper)

Reliability

A score function must also satisfy Local Consistency and Score Equivalence.

Local Consistency

Definition of Local Consistency (extracted from the paper)

The score function is expected to give lower scores to models with incorrect independence constraints. It has been shown that, under mild conditions, the CV likelihood score is locally consistent, according to Lemma 2.
Lemma 2 (extracted from the paper)

The model selected by CV likelihood has the same asymptotic performance as the model selected by the optimal benchmark fitted on the entire sample. That is, the CV likelihood score supports learning causal relations that hold in the true data-generating distribution. This conclusion also supports K-fold cross-validation with a reasonably large K.

The marginal likelihood score is also proven to be locally consistent, according to Lemma 3.

Lemma 3 (extracted from the paper)

The score can then be expressed as the Bayesian Information Criterion (BIC) plus a constant, and BIC is locally consistent.

Score Equivalence

Definition of Score Equivalence (extracted from the paper)

On the other hand, previous studies provide evidence that non-linear regression gives different scores to models within an equivalence class. Thus, neither the CV nor the marginal likelihood meets Score Equivalence.

However, the authors show that we can exploit local consistency and local score changes to compare the reliability of any two DAGs. It is observed empirically that the learned DAG with the same structure and orientation as the data-generating distribution gives the highest score. However, since there is no theoretical proof for this yet, we resort to the problem of searching for the optimal Markov equivalence class.

In general, if the conditions in Lemma 2 and Lemma 3 hold, the CV likelihood and marginal likelihood, under the RKHS regression framework, are guaranteed to find the Markov equivalence class consistent with the data-generating distribution when used with the Greedy Equivalence Search (GES) algorithm. Details can be found in Proposition 1 and Proposition 2.
