Case Study: Fitting a Spline

This study shows how to fit noisy data when the underlying functional form is unknown. The function is modeled with a spline.

The Problem

Our goal is to determine a function f(m) from data for a different function F(m,a) where

F(m,a) = f(m) + sum_n c_n * (a*m) ** (2*n)

and the sum is over positive integers (n=1,2,3...). So f(m) = F(m,a=0) is the desired output. We have three sets of data, each with a different value of a and a variety of m values:

set  1/a  a*m    F(m,a)
-----------------------
  A 10.0  0.1  0.41(10)
          0.3  0.89(10)
          0.5  1.04(10)
          0.7  1.21(10)
          0.9  1.63(10)
-----------------------
  B  5.0  0.3  0.68(10)
          0.5  0.94(10)
          0.7  1.17(10)
          0.9  1.57(10)
-----------------------
  C  2.5  0.5  0.70(10)
          0.7  1.00(10)
          0.9  1.47(10)
-----------------------

There are statistical correlations between the data values, so we have dumped the data (using gvar.dump(data, 'spline.p')) into a file called 'spline.p' that can be read by the fit code.

We do not know the functional form of f(m), so we parameterize it using a cubic spline, where the function is described by specifying its values at specific points called knots. The spline approximates the function between each adjacent pair of knots with a cubic polynomial tailored to that interval. The polynomials are stitched together at the knots to keep the function smooth from one interval to the next.

This kind of problem arises in analyses of numerical simulations of QCD, where parameter a is the grid spacing.

Spline Fit

The following code reads the fit data from file 'spline.p', and fits it using a cubic spline (gvar.cspline.CSpline()):

import gvar as gv
import lsqfit
import numpy as np

def main():
    # do the fit
    param, data = collect_data('spline.p')
    F, prior = make_fcn_prior(param)
    fit = lsqfit.nonlinear_fit(data=data, prior=prior, fcn=F)
    print(fit)

    # create f(m)
    f = gv.cspline.CSpline(fit.p['mknot'], fit.p['fknot'])

    # create error budget
    outputs = {'f(1)':f(1), 'f(5)':f(5), 'f(9)':f(9)}
    inputs = {'data':data}
    inputs.update(prior)
    print(gv.fmt_values(outputs))
    print(gv.fmt_errorbudget(outputs=outputs, inputs=inputs))

def make_fcn_prior(param):
    " return fit function, fit prior "
    def F(p):
        f = gv.cspline.CSpline(p['mknot'], p['fknot'])
        ans = {}
        for s in param:
            ainv, am = param[s]
            m  = am * ainv
            ans[s] = f(m)
            for i,ci in enumerate(p['c']):
                ans[s] += ci * am ** (2 * (i + 1))
        return ans
    prior = gv.gvar(dict(
        mknot=['1.00(1)', '1.5(5)', '3(1)', '9.00(1)'],
        fknot=['0(1)', '1(1)', '1(1)', '1(1)'],
        c=['0(1)'] * 5,
        ))
    return F, prior

def collect_data(datafile):
    " return parameters, data for data sets A, B, and C "
    # param[k] = (1/a, a*m) for k in ['A', 'B', 'C']
    param = dict(
        A=(10., np.array([0.1, 0.3, 0.5, 0.7, 0.9])),
        B=(5., np.array([0.3, 0.5, 0.7, 0.9])),
        C=(2.5, np.array([0.5, 0.7, 0.9])),
        )
    # data[k] = array of values for F(m,a)
    data = gv.load(datafile)
    return param, data

if __name__ == "__main__":
    main()

Data parameters are stored in dictionary param and the fit function is F(p). The fit function models f(m) using a cubic spline and then adds a*m terms appropriate for each data set.

The fit parameters are the locations mknot and function values fknot at the spline knots, in addition to the coefficients c in the a*m series. The number of knots and c terms is determined by experimentation: we start with a couple of terms and add more of each until the fit stops improving — that is, until chi2/dof stops going down and logGBF stops going up.

Running this script gives the following output:

Least Square Fit:
  chi2/dof [dof] = 0.46 [12]    Q = 0.94    logGBF = 9.2202

Parameters:
        mknot 0   1.000 (10)     [ 1.000 (10) ]  
              1    1.34 (13)     [  1.50 (50) ]  
              2    3.29 (30)     [  3.0 (1.0) ]  
              3   9.000 (10)     [ 9.000 (10) ]  
        fknot 0    0.40 (10)     [  0.0 (1.0) ]  
              1    0.60 (11)     [  1.0 (1.0) ]  
              2    0.85 (10)     [  1.0 (1.0) ]  
              3    0.92 (10)     [  1.0 (1.0) ]  
            c 0    0.49 (19)     [  0.0 (1.0) ]  
              1   -0.39 (57)     [  0.0 (1.0) ]  
              2    0.14 (81)     [  0.0 (1.0) ]  
              3    0.64 (79)     [  0.0 (1.0) ]  
              4    0.86 (71)     [  0.0 (1.0) ]  

Settings:
  svdcut/n = 1e-12/0    tol = (1e-08*,1e-10,1e-10)    (itns/time = 9/0.0)

Values:
               f(1): 0.40(10)            
               f(5): 0.89(10)            
               f(9): 0.92(10)            

Partial % Errors:
                f(1)      f(5)      f(9)
----------------------------------------
     data:     24.14     11.00     10.63
    mknot:      0.10      0.40      0.52
    fknot:      4.82      2.20      2.12
        c:      0.25      0.97      0.97
----------------------------------------
    total:     24.62     11.27     10.90

Given the knot values and locations from the fit, we construct the spline function f(m) using

f = gv.cspline.CSpline(fit.p['mknot'], fit.p['fknot'])

This is the function we sought from the fit.

The quality of a function’s spline representation depends critically on the number and location of the spline knots. Here the first and last knots are placed at the lowest and highest m values for which we have data, since splines are more reliable for interpolation than for extrapolation. The locations of the interior knots are weighted towards smaller m, based on inspection of the data, but are relatively unconstrained so the fitter can make the best choice. We use four knots in all; three knots give marginal fits (chi2/dof=2). Using five knots improves chi2 somewhat (chi2/dof=0.35), but also decreases the Bayes Factor significantly (logGBF=5.6); and the fit results are almost unchanged from the fit with four knots. More knots would be needed if the data were more accurate.

The script generates an error budget for f(m) at a few values of m. These show that the errors come almost entirely from the initial errors in the data; very little uncertainty comes from the spline parameters.

The fit result for f(m) is the black dotted line in the following figure, while the gray band shows the 1-sigma uncertainty in f(m).

_images/eg-spline.png

The data are shown in color, with dotted lines showing the fit results for each set. The fit is excellent overall. Even at m=9, where the data pull away, the fit gives 10% accuracy.

These data are artificially generated so we know what the real f(m) function is. It is plotted in the figure as a black dashed line. The fit result for f(m) agrees well with the exact result for all m.