from IPython.display import Image
Image('../../Python_probability_statistics_machine_learning_2E.png',width=200)
So far, we have considered parametric methods that reduce inference or prediction to parameter-fitting. However, for these to work, we had to assume a specific functional form for the unknown probability distribution of the data. Nonparametric methods eliminate the need to assume a specific functional form by generalizing to classes of functions.
We have already made heavy use of this method with the histogram, which is a special case of kernel density estimation. The histogram can be considered the crudest and most useful nonparametric method, that estimates the underlying probability distribution of the data.
To be formal and place the histogram on the same footing as our earlier estimations, suppose that $\mathscr{X}=[0,1]^d$ is the $d$ dimensional unit cube and that $h$ is the bandwidth or size of a bin or sub-cube. Then, there are $N\approx(1/h)^d$ such bins, each with volume $h^d$, $\lbrace B_1,B_2,\ldots,B_N \rbrace$. With all this in place, we can write the histogram has a probability density estimator of the form,
$$ \hat{p}_h(x) = \sum_{k=1}^N \frac{\hat{\theta}_k}{h} I(x\in B_k) $$where
$$ \hat{\theta}_k=\frac{1}{n} \sum_{j=1}^n I(X_j\in B_k) $$is the fraction of data points ($X_k$) in each bin, $B_k$. We want to bound the bias and variance of $\hat{p}_h(x)$. Keep in mind that we are trying to estimate a function of $x$, but the set of all possible probability distribution functions is extremely large and hard to manage. Thus, we need to restrict our attention to the following class of probability distribution of so-called Lipschitz functions,
$$ \mathscr{P}(L) = \lbrace p\colon \vert p(x)-p(y)\vert \le L \Vert x-y\Vert, \forall \: x,y \rbrace $$Roughly speaking, these are the density functions whose slopes (i.e., growth rates) are bounded by $L$. It turns out that the bias of the histogram estimator is bounded in the following way,
$$ \int\vert p(x)-\mathbb{E}(\hat{p}_h(x))\vert dx \le L h\sqrt{d} $$Similarly, the variance is bounded by the following,
$$ \mathbb{V}(\hat{p}_h(x)) \le \frac{C}{n h^d} $$for some constant $C$. Putting these two facts together means that the risk is bounded by,
$$ R(p,\hat{p}) = \int \mathbb{E}(p(x) -\hat{p}_h(x))^2 dx \le L^2 h^2 d + \frac{C}{n h^d} $$This upper bound is minimized by choosing
$$ h = \left(\frac{C}{L^2 n d}\right)^\frac{1}{d+2} $$In particular, this means that,
$$ \sup_{p\in\mathscr{P}(L)} R(p,\hat{p}) \le C_0 \left(\frac{1}{n}\right)^{\frac{2}{d+2}} $$where the constant $C_0$ is a function of $L$. There is a theorem [wasserman2004all] that shows this bound in tight, which basically means that the histogram is a really powerful probability density estimator for Lipschitz functions with risk that goes as $\left(\frac{1}{n}\right)^{\frac{2}{d+2}}$. Note that this class of functions is not necessarily smooth because the Lipschitz condition admits non-smooth functions. While this is a reassuring result, we typically do not know which function class (Lipschitz or not) a particular probability belongs to ahead of time. Nonetheless, the rate at which the risk changes with both dimension $d$ and $n$ samples would be hard to understand without this result. Figure shows the probability distribution function of the $\beta(2,2)$ distribution compared to computed histograms for different values of $n$. The box plots on each of the points show how the variation in each bin of the histogram reduces with increasing $n$. The risk function $R(p,\hat{p})$ above is based upon integrating the squared difference between the histogram (as a piecewise function of $x$) and the probability distribution function.
Programming Tip.
The following snippet is the main element of the code for Figure.
def generate_samples(n,ntrials=500):
phat = np.zeros((nbins,ntrials))
for k in range(ntrials):
d = rv.rvs(n)
phat[:,k],_=histogram(d,bins,density=True)
return phat
The code uses the histogram
function from Numpy.
To be consistent with the
risk function $R(p,\hat{p})$, we have to make sure
the bins
keyword argument
is formatted correctly using a sequence of
bin-edges instead of just a single
integer. Also, the density=True
keyword
argument normalizes the histogram
appropriately so that the comparison between
it and the probability distribution
function of the simulated beta distribution
is correctly scaled.
The box plots on each of the points show how the variation in each bin of the histogram reduces with increasing $n$.
We can extend our methods to other function classes using kernel functions. A one-dimensional smoothing kernel is a smooth function $K$ with the following properties,
$$ \begin{align*} \int K(x) dx &= 1 \\\ \int x K(x) dx &= 0 \\\ 0< \int x^2 K(x) dx &< \infty \\\ \end{align*} $$For example, $K(x)=I(x)/2$ is the boxcar kernel, where $I(x)=1$ when $\vert x\vert\le 1$ and zero otherwise. The kernel density estimator is very similar to the histogram, except now we put a kernel function on every point as in the following,
$$ \hat{p}(x)=\frac{1}{n}\sum_{i=1}^n \frac{1}{h^d} K\left(\frac{\Vert x-X_i\Vert}{h}\right) $$where $X\in \mathbb{R}^d$. Figure shows an example of a kernel density estimate using a Gaussian kernel function, $K(x)=e^{-x^2/2}/\sqrt{2\pi}$. There are five data points shown by the vertical lines in the upper panel. The dotted lines show the individual $K(x)$ function at each of the data points. The lower panel shows the overall kernel density estimate, which is the scaled sum of the upper panel.
There is an important technical result in [wasserman2004all] that states that kernel density estimators are minimax in the sense we discussed in the maximum likelihood the section ch:stats:sec:mle. In broad strokes, this means that the analogous risk for the kernel density estimator is approximately bounded by the following factor,
$$ R(p,\hat{p}) \lesssim n^{-\frac{2 m}{2 m+d}} $$for some constant $C$ where $m$ is a factor related to bounding the derivatives of the probability density function. For example, if the second derivative of the density function is bounded, then $m=2$. This means that the convergence rate for this estimator decreases with increasing dimension $d$.
The upper panel shows the individual kernel functions placed at each of the data points. The lower panel shows the composite kernel density estimate which is the sum of the individual functions in the upper panel.
As a practical matter, the tricky part of the kernel density estimator (which includes the histogram as a special case) is that we need to somehow compute the bandwidth $h$ term using data. There are several rule-of-thumb methods that for some common kernels, including Silverman's rule and Scott's rule for Gaussian kernels. For example, Scott's factor is to simply compute $h=n^{ -1/(d+4) }$ and Silverman's is $h=(n (d+2)/4)^{ (-1/(d+4)) }$. Rules of this kind are derived by assuming the underlying probability density function is of a certain family (e.g., Gaussian), and then deriving the best $h$ for a certain type of kernel density estimator, usually equipped with extra functional properties (say, continuous derivatives of a certain order). In practice, these rules seem to work pretty well, especially for uni-modal probability density functions. Avoiding these kinds of assumptions means computing the bandwidth from data directly and that is where cross validation comes in.
Cross-validation is a method to estimate the bandwidth from the data itself. The idea is to write out the following Integrated Squared Error (ISE),
$$ \begin{align*} \textnormal{ISE}(\hat{p}_h,p)&=\int (p(x)-\hat{p}_h(x))^2 dx\\\ &= \int \hat{p}_h(x)^2 dx - 2\int p(x) \hat{p}_h dx + \int p(x)^2 dx \end{align*} $$The problem with this expression is the middle term 1,
only interested in relative changes in the ISE.
$$ \int p(x)\hat{p}_h dx $$where $p(x)$ is what we are trying to estimate with $\hat{p}_h$. The form of the last expression looks like an expectation of $\hat{p}_h$ over the density of $p(x)$, $\mathbb{E}(\hat{p}_h)$. The approach is to approximate this with the mean,
$$ \mathbb{E}(\hat{p}_h) \approx \frac{1}{n}\sum_{i=1}^n \hat{p}_h(X_i) $$The problem with this approach is that $\hat{p}_h$ is computed using the same data that the approximation utilizes. The way to get around this is to split the data into two equally sized chunks $D_1$, $D_2$; and then compute $\hat{p}_h$ for a sequence of different $h$ values over the $D_1$ set. Then, when we apply the above approximation for the data ($Z_i$) in the $D_2$ set,
$$ \mathbb{E}(\hat{p}_h) \approx \frac{1}{\vert D_2\vert}\sum_{Z_i\in D_2} \hat{p}_h(Z_i) $$Plugging this approximation back into the integrated squared error provides the objective function,
$$ \texttt{ISE}\approx \int \hat{p}_h(x)^2 dx-\frac{2}{\vert D_2\vert}\sum_{Z_i\in D_2} \hat{p}_h(Z_i) $$Some code will make these steps concrete. We will need some tools from Scikit- learn.
The last term is of no interest because we are↩
from sklearn.model_selection import train_test_split
from sklearn.neighbors.kde import KernelDensity
The train_test_split
function makes it easy to split and
keep track of the
$D_1$ and $D_2$ sets we need for cross validation. Scikit-learn
already has a
powerful and flexible implementation of kernel density estimators.
To compute
the objective function, we need some
basic numerical integration tools from
Scipy. For this example, we
will generate samples from a $\beta(2,2)$
distribution, which is
implemented in the stats
submodule in Scipy.
import numpy as np
np.random.seed(123456)
from scipy.integrate import quad
from scipy import stats
rv= stats.beta(2,2)
n=100 # number of samples to generate
d = rv.rvs(n)[:,None] # generate samples as column-vector
Programming Tip.
The use of the [:,None]
in the last line formats the
Numpy array returned by
the rvs
function into a Numpy vector with a column
dimension of one. This is
required by the KernelDensity
constructor because
the column dimension is
used for different features (in general) for Scikit-
learn. Thus, even though we
only have one feature, we still need to comply with
the structured input that
Scikit-learn relies upon. There are many ways to
inject the additional
dimension other than using None
. For example, the more
cryptic, np.c_
, or
the less cryptic [:,np.newaxis]
can do the same, as can
the np.reshape
function.
The next step is to split the data into two halves and loop over each of the $h_i$ bandwidths to create a separate kernel density estimator based on the $D_1$ data,
train,test,_,_=train_test_split(d,d,test_size=0.5)
kdes=[KernelDensity(bandwidth=i).fit(train)
for i in [.05,0.1,0.2,0.3]]
Programming Tip.
Note that the single underscore symbol in Python refers to
the last evaluated
result. the above code unpacks the tuple returned by
train_test_split
into
four elements. Because we are only interested in the
first two, we assign the
last two to the underscore symbol. This is a stylistic
usage to make it clear
to the reader that the last two elements of the tuple are
unused.
Alternatively, we could assign the last two elements to a pair of dummy
variables that we do not use later, but then the reader skimming the code may
think that those dummy variables are relevant.
The last step is to loop over the so-created kernel density estimators and compute the objective function.
for i in kdes:
f = lambda x: np.exp(i.score_samples(x))
f2 = lambda x: f([[x]])**2
print('h=%3.2f\t %3.4f'%(i.bandwidth,quad(f2,0,1)[0]
-2*np.mean(f(test))))
Programming Tip.
The lambda functions defined in the last block are
necessary because
Scikit-learn implements the return value of the kernel density
estimator as a
logarithm via the score_samples
function. The numerical
quadrature function
quad
from Scipy computes the $\int \hat{p}_h(x)^2 dx$ part
of the objective
function.
%matplotlib inline
from __future__ import division
from matplotlib.pylab import subplots
fig,ax=subplots()
xi = np.linspace(0,1,100)[:,None]
for i in kdes:
f=lambda x: np.exp(i.score_samples(x))
f2 = lambda x: f(x)**2
_=ax.plot(xi,f(xi),label='$h$='+str(i.bandwidth))
_=ax.set_xlabel('$x$',fontsize=28)
_=ax.set_ylabel('$y$',fontsize=28)
_=ax.plot(xi,rv.pdf(xi),'k:',lw=3,label='true')
_=ax.legend(loc=0)
ax2 = ax.twinx()
_=ax2.hist(d,20,alpha=.3,color='gray')
_=ax2.axis(ymax=50)
_=ax2.set_ylabel('count',fontsize=28)
fig.tight_layout()
fig.savefig('fig-statistics/nonparametric_003.png')
Each line above is a different kernel density estimator for the given bandwidth as an approximation to the true density function. A plain histogram is imprinted on the bottom for reference.
Scikit-learn has many more advanced tools to automate this kind of hyper-parameter (i.e., kernel density bandwidth) search. To utilize these advanced tools, we need to format the current problem slightly differently by defining the following wrapper class.
class KernelDensityWrapper(KernelDensity):
def predict(self,x):
return np.exp(self.score_samples(x))
def score(self,test):
f = lambda x: self.predict(x)
f2 = lambda x: f([[x]])**2
return -(quad(f2,0,1)[0]-2*np.mean(f(test)))
This is tantamount to reorganizing the above previous code
into functions that
Scikit-learn requires. Next, we create the
dictionary of parameters we want to
search over (params
) below
and then start the grid search with the fit
function,
from sklearn.model_selection import GridSearchCV
params = {'bandwidth':np.linspace(0.01,0.5,10)}
clf = GridSearchCV(KernelDensityWrapper(), param_grid=params,cv=2)
clf.fit(d)
print (clf.best_params_)
The grid search iterates over all the elements in the params
dictionary and
reports the best bandwidth over that list of parameter values.
The cv
keyword
argument above specifies that we want to split the data
into two equally-sized
sets for training and testing. We can
also examine the values of the objective
function for each point
on the grid as follow,
clf.cv_results_['mean_test_score']
Keep in mind that the grid search examines multiple folds for cross
validation
to compute the above means and standard deviations. Note that there
is also a
RandomizedSearchCV
in case you would rather specify a distribution
of
parameters instead of a list. This is particularly useful for searching very
large parameter spaces where an exhaustive grid search would be too
computationally expensive. Although kernel density estimators are easy to
understand and have many attractive analytical properties, they become
practically prohibitive for large, high-dimensional data sets.
Regression Estimators
Beyond estimating the underlying probability density, we can use nonparametric methods to compute estimators of the underlying function that is generating the data. Nonparametric regression estimators of the following form are known as linear smoothers,
$$ \hat{y}(x) = \sum_{i=1}^n \ell_i(x) y_i $$To understand the performance of these smoothers, we can define the risk as the following,
$$ R(\hat{y},y) = \mathbb{E}\left( \frac{1}{n} \sum_{i=1}^n (\hat{y}(x_i)-y(x_i))^2 \right) $$and find the best $\hat{y}$ that minimizes this. The problem with this metric is that we do not know $y(x)$, which is why we are trying to approximate it with $\hat{y}(x)$. We could construct an estimation by using the data at hand as in the following,
$$ \hat{R}(\hat{y},y) =\frac{1}{n} \sum_{i=1}^n (\hat{y}(x_i)-Y_i)^2 $$where we have substituted the data $Y_i$ for the unknown function value, $y(x_i)$. The problem with this approach is that we are using the data to estimate the function and then using the same data to evaluate the risk of doing so. This kind of double-dipping leads to overly optimistic estimators. One way out of this conundrum is to use leave-one-out cross validation, wherein the $\hat{y}$ function is estimated using all but one of the data pairs, $(X_i,Y_i)$. Then, this missing data element is used to estimate the above risk. Notationally, this is written as the following,
$$ \hat{R}(\hat{y},y) =\frac{1}{n} \sum_{i=1}^n (\hat{y}_{(-i)}(x_i)-Y_i)^2 $$where $\hat{y}_{(-i)}$ denotes computing the estimator without using the $i^{th}$ data pair. Unfortunately, for anything other than relatively small data sets, it quickly becomes computationally prohibitive to use leave-one-out cross validation in practice. We'll get back to this issue shortly, but let's consider a concrete example of such a nonparametric smoother.
Regression
The simplest possible nonparametric regression method is the $k$-nearest neighbors regression. This is easier to explain in words than to write out in math. Given an input $x$, find the closest one of the $k$ clusters that contains it and then return the mean of the data values in that cluster. As a univariate example, let's consider the following chirp waveform,
$$ y(x)=\cos\left(2\pi\left(f_o x + \frac{BW x^2}{2\tau}\right)\right) $$This waveform is important in high-resolution radar applications. The $f_o$ is the start frequency and $BW/\tau$ is the frequency slope of the signal. For our example, the fact that it is nonuniform over its domain is important. We can easily create some data by sampling the chirp as in the following,
from numpy import cos, pi
xi = np.linspace(0,1,100)[:,None]
xin = np.linspace(0,1,12)[:,None]
f0 = 1 # init frequency
BW = 5
y = np.cos(2*pi*(f0*xin+(BW/2.0)*xin**2))
We can use this data to construct a simple nearest neighbor estimator using Scikit-learn,
from sklearn.neighbors import KNeighborsRegressor
knr=KNeighborsRegressor(2)
knr.fit(xin,y)
Programming Tip.
Scikit-learn has a fantastically consistent interface. The
fit
function above
fits the model parameters to the data. The corresponding
predict
function
returns the output of the model given an arbitrary input. We
will spend a lot
more time on Scikit-learn in the machine learning chapter. The
[:,None]
part
at the end is just injecting a column dimension into the array
in order to
satisfy the dimensional requirements of Scikit-learn.
from matplotlib.pylab import subplots
fig,ax=subplots()
yi = cos(2*pi*(f0*xi+(BW/2.0)*xi**2))
_=ax.plot(xi,yi,'k--',lw=2,label=r'$y(x)$')
_=ax.plot(xin,y,'ko',lw=2,ms=11,color='gray',alpha=.8,label='$y(x_i)$')
_=ax.fill_between(xi.flat,yi.flat,knr.predict(xi).flat,color='gray',alpha=.3)
_=ax.plot(xi,knr.predict(xi),'k-',lw=2,label='$\hat{y}(x)$')
_=ax.set_aspect(1/4.)
_=ax.axis(ymax=1.05,ymin=-1.05)
_=ax.set_xlabel(r'$x$',fontsize=24)
_=ax.legend(loc=0)
fig.set_tight_layout(True)
fig.savefig('fig-statistics/nonparametric_004.png')
The dotted line shows the chirp signal and the solid line shows the nearest neighbor estimate. The gray circles are the sample points that we used to fit the nearest neighbor estimator. The shaded area shows the gaps between the estimator and the unsampled chirp.
Figure shows the sampled signal (gray circles) against the values generated by the nearest neighbor estimator (solid line). The dotted line is the full unsampled chirp signal, which increases in frequency with $x$. This is important for our example because it adds a non-stationary aspect to this problem in that the function gets progressively wigglier with increasing $x$. The area between the estimated curve and the signal is shaded in gray. Because the nearest neighbor estimator uses only two nearest neighbors, for each new $x$, it finds the two adjacent $X_i$ that bracket the $x$ in the training data and then averages the corresponding $Y_i$ values to compute the estimated value. That is, if you take every adjacent pair of sequential gray circles in the Figure, you find that the horizontal solid line splits the pair on the vertical axis. We can adjust the number of nearest neighbors by changing the constructor,
knr=KNeighborsRegressor(3)
knr.fit(xin,y)
fig,ax=subplots()
_=ax.plot(xi,yi,'k--',lw=2,label=r'$y(x)$')
_=ax.plot(xin,y,'ko',lw=2,ms=11,color='gray',alpha=.8,label='$y(x_i)$')
_=ax.fill_between(xi.flat,yi.flat,knr.predict(xi).flat,color='gray',alpha=.3)
_=ax.plot(xi,knr.predict(xi),'k-',lw=2,label='$\hat{y}(x)$')
_=ax.set_aspect(1/4.)
_=ax.axis(ymax=1.05,ymin=-1.05)
_=ax.set_xlabel(r'$x$',fontsize=24)
_=ax.legend(loc=0)
fig.set_tight_layout(True)
fig.savefig('fig-statistics/nonparametric_005.png')
which produces the following corresponding Figure.
This is the same as [Figure](#fig:nonparametric_004) except that here there are three nearest neighbors used to build the estimator.
For this example, Figure shows that with more nearest neighbors the fit performs poorly, especially towards the end of the signal, where there is increasing variation, because the chirp is not uniformly continuous.
Scikit- learn provides many tools for cross validation. The following code sets up the tools for leave-one-out cross validation,
from sklearn.model_selection import LeaveOneOut
loo=LeaveOneOut()
The LeaveOneOut
object is an iterable that produces a set of
disjoint indices
of the data --- one for fitting the model (training set) and
one for evaluating
the model (testing set). The next block loops over the
disjoint sets of
training and test indicies iterates provided by the loo
variable to evaluate
the estimated risk, which is accumulated in the out
list.
out=[]
for train_index, test_index in loo.split(xin):
_=knr.fit(xin[train_index],y[train_index])
out.append((knr.predict(xi[test_index])-y[test_index])**2)
print( 'Leave-one-out Estimated Risk: ',np.mean(out),)
The last line in the code above reports leave-one-out's estimated risk. Linear smoothers of this type can be rewritten in using the following matrix,
$$ \mathscr{S} = \left[ \ell_i(x_j) \right]_{i,j} $$so that
$$ \hat{\mathbf{y}} = \mathscr{S} \mathbf{y} $$where $\mathbf{y}=\left[Y_1,Y_2,\ldots,Y_n\right]\in \mathbb{R}^n$ and $\hat{ \mathbf{y} }=\left[\hat{y}(x_1),\hat{y}(x_2),\ldots,\hat{y}(x_n)\right]\in \mathbb{R}^n$. This leads to a quick way to approximate leave-one-out cross validation as the following,
$$ \hat{R}=\frac{1}{n}\sum_{i=1}^n\left(\frac{y_i-\hat{y}(x_i)}{1-\mathscr{S}_{i,i}}\right)^2 $$However, this does not reproduce the approach in the code above because it assumes that each $\hat{y}_{(-i)}(x_i)$ is consuming one fewer nearest neighbor than $\hat{y}(x)$.
We can get this $\mathscr{S}$ matrix from the knr
object
as in the following,
_= knr.fit(xin,y) # fit on all data
S=(knr.kneighbors_graph(xin)).todense()/float(knr.n_neighbors)
The todense
part reformats the sparse matrix that is
returned into a regular
Numpy matrix
. The following shows a subsection
of this $\mathcal{S}$ matrix,
print(S[:5,:5])
The sub-blocks show the windows of the the y
data that are being
processed by
the nearest neighbor estimator. For example,
print(np.hstack([knr.predict(xin[:5]),(S*y)[:5]]))#columns match
Or, more concisely checking all entries for approximate equality,
np.allclose(knr.predict(xin),S*y)
which shows that the results from the nearest neighbor object and the matrix multiply match.
Programming Tip.
Note that because we formatted the
returned $\mathscr{S}$ as a Numpy matrix, we
automatically get the matrix
multiplication instead of default element-wise
multiplication in the S*y
term.
For estimating the probability density, we started with the histogram and moved to the more general kernel density estimate. Likewise, we can also extend regression from nearest neighbors to kernel-based regression using the Nadaraya-Watson kernel regression estimator. Given a bandwidth $h>0$, the kernel regression estimator is defined as the following,
$$ \hat{y}(x)=\frac{\sum_{i=1}^n K\left(\frac{x-x_i}{h}\right) Y_i}{\sum_{i=1}^n K \left( \frac{x-x_i}{h} \right)} $$Unfortunately, Scikit-learn does not implement this
regression estimator;
however, Jan Hendrik Metzen makes a compatible
version available on
github.com
.
import sys
sys.path.append('../src-statistics')
xin = np.linspace(0,1,20)[:,None]
y = cos(2*pi*(f0*xin+(BW/2.0)*xin**2)).flatten()
from kernel_regression import KernelRegression
This code makes it possible to internally optimize over the bandwidth
parameter
using leave-one-out cross validation by specifying a grid of
potential bandwidth
values (gamma
), as in the following,
kr = KernelRegression(gamma=np.linspace(6e3,7e3,500))
kr.fit(xin,y)
Figure shows the kernel estimator (heavy black line) using the Gaussian kernel compared to the nearest neighbor estimator (solid light black line). As before, the data points are shown as circles. Figure shows that the kernel estimator can pick out the sharp peaks that are missed by the nearest neighbor estimator.
The heavy black line is the Gaussian kernel estimator. The light black line is the nearest neighbor estimator. The data points are shown as gray circles. Note that unlike the nearest neighbor estimator, the Gaussian kernel estimator is able to pick out the sharp peaks in the training data.
Thus, the difference between nearest neighbor and kernel estimation is that the latter provides a smooth moving averaging of points whereas the former provides a discontinuous averaging. Note that kernel estimates suffer near the boundaries where there is mismatch between the edges and the kernel function. This problem gets worse in higher dimensions because the data naturally drift towards the boundaries (this is a consequence of the curse of dimensionality). Indeed, it is not possible to simultaneously maintain local accuracy (i.e., low bias) and a generous neighborhood (i.e., low variance). One way to address this problem is to create a local polynomial regression using the kernel function as a window to localize a region of interest. For example,
$$ \hat{y}(x)=\sum_{i=1}^n K\left(\frac{x-x_i}{h}\right) (Y_i-\alpha - \beta x_i)^2 $$and now we have to optimize over the two linear parameters $\alpha$ and $\beta$. This method is known as local linear regression [loader2006local], [hastie2013elements]. Naturally, this can be extended to higher-order polynomials. Note that these methods are not yet implemented in Scikit-learn.
fig,ax=subplots()
#fig.set_size_inches((12,4))
_=ax.plot(xi,kr.predict(xi),'k-',label='kernel',lw=3)
_=ax.plot(xin,y,'o',lw=3,color='gray',ms=12)
_=ax.plot(xi,yi,'--',color='gray',label='chirp')
_=ax.plot(xi,knr.predict(xi),'k-',label='nearest')
_=ax.axis(ymax=1.1,ymin=-1.1)
_=ax.set_aspect(1/4.)
_=ax.axis(ymax=1.05,ymin=-1.05)
_=ax.set_xlabel(r'$x$',fontsize=24)
_=ax.set_ylabel(r'$y$',fontsize=24)
_=ax.legend(loc=0)
fig.savefig('fig-statistics/nonparametric_006.png')
The so-called curse of dimensionality occurs as we move into higher and higher dimensions. The term was coined by Bellman in 1961 while he was studying adaptive control processes. Nowadays, the term is vaguely refers to anything that becomes more complicated as the number of dimensions increases substantially. Nevertheless, the concept is useful for recognizing and characterizing the practical difficulties of high- dimensional analysis and estimation.
Consider the volume of an $d$-dimensional sphere of radius $r$,
$$ V_s(d,r)=\frac{\pi ^{d/2} r^d}{\Gamma \left(\frac{d}{2}+1\right)} $$Further, consider the sphere $V_s(d,1/2)$ enclosed by an $d$ dimensional unit cube. The volume of the cube is always equal to one, but $\lim_{d\rightarrow\infty} V_s(d,1/2) = 0$. What does this mean? It means that the volume of the cube is pushed away from its center, where the embedded hypersphere lives. Specifically, the distance from the center of the cube to its vertices in $d$ dimensions is $\sqrt{d}/2$, whereas the distance from the center of the inscribing sphere is $1/2$. This diagonal distance goes to infinity as $d$ does. For a fixed $d$, the tiny spherical region at the center of the cube has many long spines attached to it, like a hyper-dimensional sea urchin or porcupine.
Another way to think about this is to consider the $\epsilon>0$ thick peel of the hypersphere,
$$ \mathcal{P}_{\epsilon} =V_s(d,r) - V_s(d,r-\epsilon) $$Then, we consider the following limit,
$$ \begin{equation} \lim_{d\rightarrow\infty}\mathcal{P}_{\epsilon} =\lim_{d\rightarrow\infty} V_s(d,r)\left(1 - \frac{V_s(d,r-\epsilon)}{V_s(d,r)}\right) \label{_auto1} \tag{1} \end{equation} $$ $$ \begin{equation} \ =\lim_{d\rightarrow\infty} V_s(d,r)\left(1 -\lim_{d\rightarrow\infty} \left(\frac{r-\epsilon}{r}\right)^d\right) \label{_auto2} \tag{2} \end{equation} $$ $$ \begin{equation} \ =\lim_{d\rightarrow\infty} V_s(d,r) \label{_auto3} \tag{3} \end{equation} $$So, in the limit, the volume of the $\epsilon$-thick peel consumes the volume of the hypersphere.
What are the consequences of this? For methods that rely on nearest neighbors, exploiting locality to lower bias becomes intractable. For example, suppose we have an $d$ dimensional space and a point near the origin we want to localize around. To estimate behavior around this point, we need to average the unknown function about this point, but in a high-dimensional space, the chances of finding neighbors to average are slim. Looked at from the opposing point of view, suppose we have a binary variable, as in the coin- flipping problem. If we have 1000 trials, then, based on our earlier work, we can be confident about estimating the probability of heads. Now, suppose we have 10 binary variables. Now we have $2^{ 10 }=1024$ vertices to estimate. If we had the same 1000 points, then at least 24 vertices would not get any data. To keep the same resolution, we would need 1000 samples at each vertex for a grand total of $1000\times 1024 \approx 10^6$ data points. So, for a ten fold increase in the number of variables, we now have about 1000 more data points to collect to maintain the same statistical resolution. This is the curse of dimensionality.
Perhaps some code will clarify this. The following code generates samples in two dimensions that are plotted as points in Figure with the inscribed circle in two dimensions. Note that for $d=2$ dimensions, most of the points are contained in the circle.
import numpy as np
v=np.random.rand(1000,2)-1/2.
from matplotlib.patches import Circle
from matplotlib.pylab import subplots
fig,ax=subplots()
fig.set_size_inches((5,5))
_=ax.set_aspect(1)
_=ax.scatter(v[:,0],v[:,1],color='gray',alpha=.3)
_=ax.add_patch(Circle((0,0),0.5,alpha=.8,lw=3.,fill=False))
fig.savefig('fig-statistics/curse_of_dimensionality_001.pdf')
Two dimensional scatter of points randomly and independently uniformly distributed in the unit square. Note that most of the points are contained in the circle. Counter to intuition, this does not persist as the number of dimensions increases.
The next code block describes the core computation in Figure. For each of the dimensions, we create a set of uniformly distributed random variates along each dimension and then compute how close each $d$ dimensional vector is to the origin. Those that measure one half are those contained in the hypersphere. The histogram of each measurment is shown in the corresponding panel in the Figure. The dark vertical line shows the threshold value. Values to the left of this indicate the population that are contained in the hypersphere. Thus, Figure shows that as $d$ increases, fewer points are contained in the inscribed hypersphere. The following code paraphrases the content of Figure.
fig,ax=subplots()
for d in [2,3,5,10,20,50]:
v=np.random.rand(5000,d)-1/2.
ax.hist([np.linalg.norm(i) for i in v])
siz = [ 2,3,5,10,20,50 ]
fig,axs=subplots(3,2,sharex=True)
fig.set_size_inches((10,6))
for ax,k in zip(axs.flatten(),siz):
v=np.random.rand(5000,k)-1/2.
_=ax.hist([np.linalg.norm(i) for i in v],color='gray',density=True);
_=ax.vlines(0.5,0,ax.axis()[-1]*1.1,lw=3)
_=ax.set_title('$d=%d$'%k,fontsize=20)
_=ax.tick_params(labelsize='small',top=False,right=False)
_=ax.spines['top'].set_visible(False)
_=ax.spines['right'].set_visible(False)
_=ax.spines['left'].set_visible(False)
_=ax.yaxis.set_visible(False)
_=ax.axis(ymax=3.5)
fig.set_tight_layout(True)
fig.savefig('fig-statistics/curse_of_dimensionality_002.pdf')
Each panel shows the histogram of lengths of uniformly distributed $d$ dimensional random vectors. The population to the left of the dark vertical line are those that are contained in the inscribed hypersphere. This shows that fewer points are contained in the hypersphere with increasing dimension.
Nonparametric Tests
Determining whether or not two sets of observations derive from the same underlying probability distribution is an important problem. The most popular way to do this is with a standard t-test, but that requires assumptions about normality that may be hard to justify, which leads to nonparametric methods can get at this questions without such assumptions.
Let $V$ and $W$ be continuous random variables. The variable $V$ is stochastically larger than $W$ if,
$$ \mathbb{P}(V\ge x) \ge \mathbb{P}(W\ge x) $$for all $x\in \mathbb{R}$ with strict inequality for at least one $x$. The term stochastically smaller means the obverse of this. For example, the black line density function shown in Figure is stochastically larger than the gray one.
import numpy as np
from scipy import stats
from matplotlib.pylab import subplots
fig,ax=subplots()
xi = np.linspace(0,2,100)
_=ax.plot(xi,stats.norm(1,0.25).pdf(xi),lw=3,color='k')
_=ax.plot(xi,stats.beta(2,4).pdf(xi),lw=3,color='gray')
_=ax.spines['top'].set_visible(0)
_=ax.spines['right'].set_visible(0)
_=ax.tick_params(labelsize='medium',top=False,right=False)
ax.set_aspect(1/2.5)
fig.savefig('fig-statistics/nonparametric_tests_001.png')
The black line density function is stochastically larger than the gray one.
The Mann-Whitney-Wilcoxon Test approaches the following alternative hypotheses
Suppose we have two data sets $X$ and $Y$ and we want to know if they are drawn from the same underlying probability distribution or if one is stochastically greater than the other. There are $n_x$ elements in $X$ and $n_y$ elements in $Y$. If we combine these two data sets and rank them, then, under the null hypothesis, any data element should be as likely as any other to be assigned any particular rank. that is, the combined set $Z$,
$$ Z = \lbrace X_1,\ldots,X_{n_x}, Y_1,\ldots,Y_{n_y} \rbrace $$contains $n=n_x+n_y$ elements. Thus, any assignment of $n_y$ ranks from the integers $\lbrace 1,\ldots,n \rbrace$ to $\lbrace Y_1,\ldots,Y_{n_y} \rbrace$ should be equally likely (i.e., $\mathbb{P}={ \binom{n}{n_y} }^{-1}$). Importantly, this property is independent of the $F$ distribution.
That is, we can define the $U$ statistic as the following,
$$ U_X =\sum_{i=1}^{n_x}\sum_{j=1}^{n_y}\mathbb{I}(X_i\ge Y_j) $$where $\mathbb{I}(\cdot)$ is the usual indicator function. For an interpretation, this counts the number of times that elements of $Y$ outrank elements of $X$. For example, let us suppose that $X=\lbrace 1,3,4,5,6 \rbrace$ and $Y=\lbrace 2,7,8,10,11 \rbrace$. We can get a this in one move using Numpy broadcasting,
import numpy as np
x = np.array([ 1,3,4,5,6 ])
y = np.array([2,7,8,10,11])
U_X = (y <= x[:,None]).sum()
U_Y = (x <= y[:,None]).sum()
print (U_X, U_Y)
Note that
$$ U_X+U_Y =\sum_{i=1}^{n_x}\sum_{j=1}^{n_y} \mathbb{I}(Y_i\ge X_j)+\mathbb{I}(X_i\ge Y_j)= n_x n_y $$because $\mathbb{I}(Y_i\ge X_j)+\mathbb{I}(X_i\ge Y_j)=1$. We can verify this in Python,
print ((U_X+U_Y) == len(x)*len(y))
Now that we can compute the $U_X$ statistic, we have to characterize it. Let us consider $U_X$. If $H_0$ is true, then $X$ and $Y$ are identically distributed random variables. Thus all $\binom{n_x+n_y}{n_x}$ allocations of the $X$-variables in the ordered combined sample are equally likely. Among these, there are $\binom{n_x+n_y-1}{n_x}$ allocations have a $Y$ variable as the largest observation in the combined sample. For these, omitting this largest observation does not affect $U_X$ because it would not have been counted anyway. The other $\binom{n_x+n_y-1}{n_x-1}$ allocations have an element of $X$ as the largest observation. Omitting this observation reduces $U_X$ by $n_y$.
With all that, suppose $N_{n_x,n_y}(u)$ be the number of allocations of $X$ and $Y$ elements that result in $U_X=u$. Under $H_0$ situation of equally likely outcomes, we have
$$ p_{n_x, n_y}(u) = \mathbb{P}(U_X=u)=\frac{N_{n_x,n_y}(u)}{\binom{n_x+n_y}{n_x}} $$From our previous discussion, we have the recursive relationship,
$$ N_{n_x,n_y}(u) = N_{n_x,n_y-1}(u) + N_{n_x-1,n_y}(u-n_y) $$After dividing all of this by $\binom{n_x+n_y}{n_x}$ and using the $p_{n_x, n_y}(u)$ notation above, we obtain the following,
$$ p_{n_x, n_y}(u) = \frac{n_y}{n_x+n_y} p_{n_x,n_y-1}(u)+\frac{n_x}{n_x+n_y} p_{n_x-1,n_y}(u-n_y) $$where $0\le u\le n_x n_y$. To start this recursion, we need the following initial conditions,
$$ \begin{eqnarray*} p_{0,n_y}(u_x=0) & = & 1 \\ p_{0,n_y}(u_x>0) & = & 0 \\ p_{n_x,0}(u_x=0) & = & 1 \\ p_{n_x,0}(u_x>0) & = & 0 \end{eqnarray*} $$To see how this works in Python,
def prob(n,m,u):
if u<0: return 0
if n==0 or m==0:
return int(u==0)
else:
f = m/float(m+n)
return (f*prob(n,m-1,u) +
(1-f)*prob(n-1,m,u-m))
These are shown in Figure and approach a normal distribution for large $n_x,n_y$, with the following mean and variance,
$$ \begin{eqnarray} \mathbb{E}(U) & = & \frac{n_x n_y}{2} \\ \mathbb{V}(U) & = & \frac{n_x n_y (n_x+n_y+1)}{12} \end{eqnarray} \label{eq:ustatmv} \tag{4} $$The variance becomes more complicated when there are ties.
fig,axs=subplots(2,2)
fig.tight_layout()
ax=axs[0,0]
ax.tick_params(axis='both', which='major', labelsize=10)
_=ax.stem([prob(2,2,i) for i in range(2*2+1)],linefmt='k-',markerfmt='ko',basefmt='k-')
_=ax.set_title(r'$n_x=%d,n_y=%d$'%(2,2),fontsize=14)
ax=axs[0,1]
ax.tick_params(axis='both', which='major', labelsize=10)
_=ax.stem([prob(4,2,i) for i in range(4*2+1)],linefmt='k-',markerfmt='ko',basefmt='k-')
_=ax.set_title(r'$n_x=%d,n_y=%d$'%(4,2),fontsize=14)
ax=axs[1,0]
ax.tick_params(axis='both', which='major', labelsize=10)
_=ax.stem([prob(6,7,i) for i in range(6*7+1)],linefmt='k-',markerfmt='ko',basefmt='k-')
_=ax.set_title(r'$n_x=%d,n_y=%d$'%(6,7),fontsize=14)
ax=axs[1,1]
ax.tick_params(axis='both', which='major', labelsize=10)
_=ax.stem([prob(8,12,i) for i in range(8*12+1)],linefmt='k-',markerfmt='ko',basefmt='k-')
_=ax.set_title(r'$n_x=%d,n_y=%d$'%(8,12),fontsize=14)
fig.savefig('fig-statistics/nonparametric_tests_002.png')
The normal approximation to the distribution improves with increasing $n_x, n_y$.
Example
We are trying to determine whether or not one network configuration is faster than another. We obtain the following round-trip times for each of the networks.
X=np.array([ 50.6,31.9,40.5,38.1,39.4,35.1,33.1,36.5,38.7,42.3 ])
Y=np.array([ 28.8,30.1,18.2,38.5,44.2,28.2,32.9,48.8,39.5,30.7 ])
Because there are too few elements to use the
scipy.stats.mannwhitneyu
function (which internally uses the normal
approximation to the U-statistic), we
can use our custom function above, but
first we need to compute the $U_X$
statistic using Numpy,
U_X = (Y <= X[:,None]).sum()
For the p-value, we want to compute the probability that the observed $U_X$ statistic at least as great as what was observed,
print(sum(prob(10,10,i) for i in range(U_X,101)))
This is close to the usual five percent p-value threshold so it is possible at a slightly higher threshold to conclude that the two sets of samples do not originate from the same underlying distribution. Keep in mind that the usual five percent threshold is just a guideline. Ultimately, it is up to the analyst to make the call.
To prove Equation 4, we assume there are no ties. One way to get at the result $\mathbb{E}(U)= n_x n_y/2$,
$$ \mathbb{E}(U_Y) = \sum_j\sum_i\mathbb{P}(X_i \leq Y_j) $$because $\mathbb{E}(\mathbb{I}(X_i\leq Y_j))=\mathbb{P}(X_i \leq Y_j)$. Further, because all the subscripted $X$ and $Y$ variables are drawn independently from the same distribution, we have
$$ \mathbb{E}(U_Y) = n_x n_y \mathbb{P}(X \leq Y) $$and also,
$$ \mathbb{P}(X \leq Y) + \mathbb{P}(X \ge Y) =1 $$because those are the two mutually exclusive conditions. Because the $X$ variables and $Y$ variables are drawn from the same distribution, we have $\mathbb{P}(X \leq Y) = \mathbb{P}(X \ge Y)$, which means $ \mathbb{P}(X \leq Y)=1/2$ and therefore $\mathbb{E}(U_Y)= n_x n_y /2$. Another way to get the same result, is to note that, as we showed earlier, $U_X+U_Y = n_x n_y $. Then, taking the expectation of both sides noting that $\mathbb{E}(U_X)=\mathbb{E}(U_Y)=\mathbb{E}(U)$, gives
$$ 2 \mathbb{E}(U) = n_x n_y $$which gives $\mathbb{E}(U)=n_x n_y /2$.
Getting the variance is trickier. To start, we compute the following,
$$ \mathbb{E}(U_X U_Y) = \sum_i \sum_j \sum_k \sum_l \mathbb{P}( X_i\ge Y_j \land X_k \le Y_l ) $$Of these terms, we have $\mathbb{P}( Y_j \le X_i\le Y_j)=0$ because these are continuous random variables. Let's consider the terms of the following type, $\mathbb{P}( Y_i \le X_k \le Y_l)$. To reduce the notational noise, let's rewrite this as $\mathbb{P}( Z \le X \le Y)$. Writing this out gives
$$ \mathbb{P}( Z \le X \le Y) = \int_{\mathbb{R}} \int_Z^\infty (F(Y)-F(Z))f(y)f(z)dy dz $$where $F$ is the cumulative density function and $f$ is the probability density function ($dF(x)/dx = f(x)$). Let's break this up term by term. Using some calculus for the term,
$$ \int_Z^\infty F(Y)f(y)dy = \int_{F(Z)}^1 F dF\ = \frac{1}{2}\left(1-F(Z) \right) $$Then, integrating out the $Z$ variable from this result, we obtain the following,
$$ \int_{\mathbb{R}} \frac{1}{2}\left(1-\frac{F(Z)^2}{2}\right) f(z) dz = \frac{1}{3} $$Next, we compute,
$$ \begin{eqnarray*} \int_{\mathbb{R}} F(Z) \int_Z^\infty f(y) dy f(z) dz &=&\int_{\mathbb{R}} (1-F(Z)) F(Z) f(z) dz \\ &=&\int_{\mathbb{R}} (1-F) F dF =\frac{1}{6} \end{eqnarray*} $$Finally, assembling the result, we have
$$ \mathbb{P}( Z \le X \le Y) = \frac{1}{3}- \frac{1}{6} = \frac{1}{6} $$Also, terms like $\mathbb{P}(X_k\ge Y_i \land X_m \le Y_i) = \mathbb{P}(X_m\le Y_i \le X_k)=1/6$ by the same reasoning. That leaves the terms like $\mathbb{P}(X_k\ge Y_i\land X_m\le Y_l)=1/4$ because of mutual independence and $\mathbb{P}(X_k\ge Y_i)=1/2$. Now that we have all the terms, we have to assemble the combinatorics to get the final answer.
There are $ n_y (n_y -1) n_x + n_x (n_x -1) n_y $ terms of type $\mathbb{P}( Y_i \le X_k \le Y_l)$. There are $ n_y (n_y -1) n_x (n_x -1)$ terms like $\mathbb{P}(X_k\ge Y_i\land X_m\le Y_l)$. Putting this all together, this means that
$$ \mathbb{E}(U_X U_Y) = \frac{n_x n_y(n_x+n_y-2)}{6}+\frac{n_x n_y(n_x-1)(n_y-1)}{4} $$To assemble the $\mathbb{E}(U^2)$ result, we need to appeal to our earlier result,
$$ U_X+U_Y = n_x n_y $$Squaring both sides of this and taking the expectation gives,
$$ \mathbb{E}(U_X^2) + 2 \mathbb{E}(U_X U_Y)+\mathbb{E}(U_Y^2) = n_x^2 n_y^2 $$Because $\mathbb{E}(U_X^2)=\mathbb{E}(U_X^2)=\mathbb{E}(U)$, we can simplify this as the following,
$$ \begin{eqnarray*} \mathbb{E}(U^2) &=& \frac{n_x^2 n_y^2 - 2 \mathbb{E}(U_X U_Y)}{2}\\ \mathbb{E}(U^2) &=& \frac{n_x n_y (1+n_x +n_y +3 n_x n_y )}{12} \end{eqnarray*} $$Then, since $\mathbb{V}(U) = \mathbb{E}(U^2)- \mathbb{E}(U)^2$, we finally have
$$ \mathbb{V}(U) = \frac{n_x n_y (1+ n_x +n_y )}{12} $$