A common statistical problem is to describe the relationship between two measurements that are not linearly related
(i.e. not related in a straight line).
When such a relationship can be mathematically defined (e.g. y = x^2), the variables can be transformed using the programs in the Numerical Transformation Program Page
and the relatively simple linear relationship retained.
Often, however, a curved relationship appears regular and consistent, but no mathematical definition of that relationship is available, and an empirical "best fit" algorithm, such as the polynomial curve fitting from the Curve Fitting Program Page,
is required.
The polynomial curve fit uses the formula y = a + b1x + b2x^2 + b3x^3 + b4x^4 + ...
As each increase in power bends the relationship into a sharper curve, the
combination of coefficients can produce a curve of
potentially any level of complexity. In the bio-social sciences, however, curve fitting
beyond the third power is seldom necessary or meaningful.
Curve fitting can easily be accomplished using multiple regression, as described in the Multiple Regression Explained Page:
the single x variable is transformed into x^2, x^3,
x^4, and so on, and the combination subjected to multiple regression analysis.
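As a minimal sketch of this idea (using numpy's least-squares solver rather than any particular statistics package), the single predictor is expanded into its powers and fitted as an ordinary multiple regression:

```python
import numpy as np

# Hypothetical noise-free data following y = 2 + 3x + 0.5x^2
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 + 3.0 * x + 0.5 * x**2

# Transform the single x into the columns [1, x, x^2] and treat
# them as separate predictors in an ordinary least-squares fit
X = np.column_stack([np.ones_like(x), x, x**2])
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(coeffs, 4))  # recovers approximately [2.0, 3.0, 0.5]
```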
Curve fitting has been used successfully in laboratories to define the relationship
between the result of a test (e.g. the depth of a color reaction) and the amount of a chemical (e.g. sugar) present.
The problem with using curve fitting when more than the mean values of the fit
are required is the difficulty of assigning a variance and confidence interval
to the fitted curve. The least squares statistics are seldom useful here, as
each coefficient has its own variance, and it is difficult to integrate them.
An even more difficult issue is that, for many biological measurements, variance
increases with the scale of measurement, so that the confidence interval
around y widens as the x value increases.
Altman (see reference) described a two-stage procedure that solves this problem.
In the first stage, the standard curve fit for the mean value is carried out.
In the second stage, the distance between the y of each data point and the mean y
from the curve fit is obtained, and its absolute value is used to perform another
curve fit, so that a variable confidence interval can be defined.
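Altman's two-stage procedure can be sketched in a few lines. The helper below is an illustration, not the page's own program; it assumes normally distributed residuals, so the mean absolute residual is scaled by sqrt(pi/2) to estimate the standard deviation.

```python
import numpy as np

def altman_two_stage(x, y, pw_mean=3, pw_sd=1):
    """Two-stage fit: stage 1 fits the mean curve, stage 2 fits the
    absolute residuals and rescales them into a standard deviation."""
    # Stage 1: polynomial least-squares fit of the mean
    # (numpy returns coefficients highest power first)
    mean_coef = np.polyfit(x, y, pw_mean)
    # Stage 2: fit |y - fitted mean| against x, then multiply by
    # sqrt(pi/2), since E|e| = sd * sqrt(2/pi) for normal errors
    abs_resid = np.abs(y - np.polyval(mean_coef, x))
    sd_coef = np.polyfit(x, abs_resid, pw_sd) * np.sqrt(np.pi / 2)
    return mean_coef, sd_coef

# The example data used later on this page: mean power 3, SD power 1
x = np.array([1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6], dtype=float)
y = np.array([10, 11, 18, 22, 20, 30, 19, 31, 30, 45, 40, 60], dtype=float)
mean_coef, sd_coef = altman_two_stage(x, y)
print(np.round(mean_coef, 4))  # highest power first
print(np.round(sd_coef, 4))
```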
The program in the Curve Fitting Program Page
uses
Altman's algorithm, and it can be used as follows.
- The data are entered as two columns separated by spaces or tabs. Column 1 is
the independent (x) variable, and column 2 the dependent (y) variable. Each data point is in a row.
- The power used to curve fit the mean can be defined: 1 is a straight line, 2 a curve with
one hump, 3 a curve with two humps, and so on. The power is capped at 5, as
curves fitted beyond that are seldom meaningful in the bio-social sciences.
- The power used to curve fit the standard deviation around the mean can also be
defined. Unless there is a good theoretical reason, 0 or 1 is usually sufficient.
The power is capped at 3.
- The percentage confidence interval required by the user can be set. The 95% confidence interval is the most commonly used, but the program allows users to change this to any percentage (such as 90% or 99%).
Reference
Altman DG (1993) Constructing age-related reference
centiles using absolute residuals. Statistics in Medicine 12(10):917-924
x | y |
1 | 10 |
1 | 11 |
2 | 18 |
2 | 22 |
3 | 20 |
3 | 30 |
4 | 19 |
4 | 31 |
5 | 30 |
5 | 45 |
6 | 40 |
6 | 60 |
The example data in the table to the left, from the
Curve Fitting Program Page,
are computer generated, so that x and y have a curved relationship and
the variance of y increases with x.
We will fit the mean y value to the power of 3 and the standard deviation to the power of 1, and require the program to draw the 95% confidence interval of the curve over the range of values.
The results are as follows.
Mean regression line
| Coeff |
Cons | -7 |
x1 | 23.5317 |
x2 | -6.4881 |
x3 | 0.6944 |
|
Standard Deviation
| Coeff |
Cons | -1.6711 |
x1 | 2.3276 |
|
The output is to the right. The first table gives the curve for the mean value: y = -7 + 23.53x
- 6.49x^2 + 0.69x^3.
This is followed by the regression line for the standard deviation, SD = -1.67 + 2.33x, which defines the standard deviation from
the curve-fitted mean for any x value.
By combining the two formulae, we obtain the two equations that
can be used to draw the 95% confidence interval lines.
From the first table, the mean curve is y = -7 + 23.5317x - 6.4881x^2
+ 0.6944x^3
95% CI lines
| Low | High |
Con | -3.7247 | -10.2753 |
x1 | 18.9697 | 28.0938 |
x2 | -6.4881 | -6.4881 |
x3 | 0.6944 | 0.6944 |
|
From the second table, the standard deviation from the mean curve is
SD = -1.6711 + 2.3276x
The 95% confidence interval is mean ±1.96 SD, so by combining the two
fitted lines we can obtain the upper and lower 95% CI lines, as shown in the table to the right. These are as follows.
- The lower line : y = -3.7247 + 18.9697x - 6.4881x^2 + 0.6944x^3
- The upper line : y = -10.2753 + 28.0938x - 6.4881x^2 + 0.6944x^3
Please note that the coefficients for the CI lines would be different if a confidence level other than 95% were used (such as 90% or 99%).
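As a check, the CI coefficients can be reproduced directly from the two fitted lines. This sketch uses only Python's standard library; the coefficient values are copied from the tables above.

```python
from statistics import NormalDist

mean_coef = [-7.0, 23.5317, -6.4881, 0.6944]  # constant, x, x^2, x^3
sd_coef = [-1.6711, 2.3276]                    # constant, x

# z for a two-sided 95% interval (about 1.96); change 0.95 for 90%, 99%, etc.
z = NormalDist().inv_cdf(1 - (1 - 0.95) / 2)

# CI lines are mean -/+ z * SD, combined term by term
# (the SD line has fewer terms, so pad it with zeros)
sd_pad = sd_coef + [0.0] * (len(mean_coef) - len(sd_coef))
lower = [round(m - z * s, 4) for m, s in zip(mean_coef, sd_pad)]
upper = [round(m + z * s, 4) for m, s in zip(mean_coef, sd_pad)]
print(lower)  # ≈ [-3.7247, 18.9697, -6.4881, 0.6944]
print(upper)  # ≈ [-10.2753, 28.0937, -6.4881, 0.6944]
```

Because the coefficients are rounded to four decimal places, the last digit may differ slightly from the program's table.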
Data points
X | Y | yx | sd | z | Percentile |
1 | 10 | 10.7381 | 0.6565 | -1.1243 | 13.04 |
1 | 11 | 10.7381 | 0.6565 | 0.3989 | 65.50 |
2 | 18 | 19.6667 | 2.9841 | -0.5585 | 28.82 |
2 | 22 | 19.6667 | 2.9841 | 0.7819 | 78.29 |
3 | 20 | 23.9524 | 5.3117 | -0.7441 | 22.84 |
3 | 30 | 23.9524 | 5.3117 | 1.1386 | 87.26 |
4 | 19 | 27.7619 | 7.6392 | -1.147 | 12.57 |
4 | 31 | 27.7619 | 7.6392 | 0.4239 | 66.42 |
5 | 30 | 35.2619 | 9.9668 | -0.5279 | 29.88 |
5 | 45 | 35.2619 | 9.9668 | 0.9771 | 83.57 |
6 | 40 | 50.619 | 12.2944 | -0.8637 | 19.39 |
6 | 60 | 50.619 | 12.2944 | 0.763 | 77.73 |
|
The data points and their deviation from the mean line are then presented, as in the second table to the right. The abbreviations are:
- X and Y are the original x and y values of the data point
- yx is the curve-fitted mean y for the x value X
- sd is the standard deviation of y at the x value X
- z = (Y - yx)/sd, and represents the difference between Y and its
curve fitted value yx in standard deviation units.
- Percentile is a transformation of z into probability percentile, assuming a normal distribution.
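The same columns can be reproduced for any single point using the standard library's normal distribution; the coefficients below are copied from the tables above, and z and the percentile follow the definitions just given.

```python
from statistics import NormalDist

def point_percentile(x, y, mean_coef, sd_coef):
    """Return (fitted mean yx, sd, z, percentile) for one data point.
    Coefficient lists are ordered constant term first."""
    yx = sum(c * x**i for i, c in enumerate(mean_coef))
    sd = sum(c * x**i for i, c in enumerate(sd_coef))
    z = (y - yx) / sd                 # deviation in SD units
    pct = NormalDist().cdf(z) * 100   # percentile, assuming normality
    return yx, sd, z, pct

# First row of the data-point table: X = 1, Y = 10
yx, sd, z, pct = point_percentile(
    1, 10, [-7.0, 23.5317, -6.4881, 0.6944], [-1.6711, 2.3276])
print(round(yx, 4), round(sd, 4), round(z, 4), round(pct, 2))
```

Because the coefficients here are rounded to four decimal places, the results differ from the table in the last digit.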
These coefficients are now available to transform any x value into a y value, using the polynomial transformation utility
available in the Numerical Transformation Program Page.

Finally, the curve-fit bitmap, with the original data points (black circles) and the 3 curves (the best-fit mean, the upper confidence interval, and the lower confidence interval; in this example the 95% confidence interval), is displayed.
The R code below was used to check the correctness of the PHP program during development. It is included for anyone interested in writing their own program or checking the algorithm. Feedback, advice, and criticism are all very welcome.
# Curvefit Altman's algorithm
# Altman DG (1993) Constructing age-related reference centiles using absolute residuals.
# Statistics in Medicine 12(10):917-924
myDat = ("
x y
1 10
1 11
2 18
2 22
3 20
3 30
4 19
4 31
5 30
5 45
6 40
6 60
")
# Set power of polynomial curve fitting
pwLine = 3 # power of fitting the line
pwVar = 1 # power of fitting the variance (1 to match the example above)
cfInt = 95 # % confidence interval
# Calculate z (2 tails) from %CI
z = qnorm((100 - (100 - cfInt) / 2) / 100)
# create dataframe from input data df is the dataframe
df <- read.table(textConnection(myDat),header=TRUE)
summary(df)
# Curve fit the line
resLine <- lm(formula = df$y ~ poly(df$x, pwLine, raw=TRUE))
summary(resLine)
# Extract Coefficients
coefLine <- coef(summary(resLine))[1:(pwLine+1)]
# Display coefficients for line
print("Coeff:Line")
coefLine
# Calculate curve fitted value for y using function
CalCurveFitValues <- function(coefVec, datVec)
{
f = length(coefVec)
n = length(datVec)
vecResult <- vector()
for ( i in 1:n)
{
x = datVec[i]
y = coefVec[1] + coefVec[2] * x
# add higher-order terms only when power > 1
# (seq(3, f) would run backwards when f < 3)
if (f > 2) { for (j in 3:f) { y = y + coefVec[j] * x^(j-1) } }
vecResult[i] = y
}
vecResult
}
# Add curve fitted y to data frame
df$Yx <- CalCurveFitValues(coefLine,df$x)
# Curve fit variance
#absolute difference between y and curve fitted y
vecAbsDif <- abs(df$Yx - df$y)
# Curve fit absolute difference against x
resVar<-lm(formula = vecAbsDif ~ poly(df$x, pwVar, raw=TRUE))
summary(resVar)
# Extract Coefficients
coefVar <- coef(summary(resVar))[1:(pwVar+1)]
# Calculate coefficients for SD: for normally distributed residuals
# the mean absolute residual is SD * sqrt(2/pi), so multiply by sqrt(pi/2)
coefSD <- coefVar * sqrt(pi / 2)
# Display coefficients for SD
# print("Coeff:SD")
coefSD
# Calculate SD, z, and percentile of data
df$SD <- CalCurveFitValues(coefSD,df$x)
# z = (Y - yx) / sd, the deviation in standard deviation units
df$z <- (df$y - df$Yx) / df$SD
df$Pctile <- round(pnorm(df$z) * 100, 2)
# Display the data frame
df
# Create table of calculations for range of predictor
# divided into 50 intervals
#Produce table
minv = min(df$x)
maxv = max(df$x)
x <- seq(from = minv, to = maxv, length.out = 50)
dfTable <- data.frame(x)
# Calculate and append all columns into dfTable,
# rounded to 4 decimal places
dfTable$y <- round(CalCurveFitValues(coefLine, dfTable$x), 4)
dfTable$SD <- round(CalCurveFitValues(coefSD, dfTable$x), 4)
dfTable$CILow <- round(dfTable$y - z * dfTable$SD, 4)
dfTable$CIHigh <- round(dfTable$y + z * dfTable$SD, 4)
# Display table
dfTable