Introducing BLA: A Comprehensive R Package for Boundary Line Analysis

Data Science

R Packages

CRAN

Author

Chawezi Miti

Published

June 3, 2024

Author's picture BLA Logo

BLA Logo

We are excited to announce the release of a new R package, BLA, on CRAN. This package is designed to provide a robust set of tools for fitting boundary line models to data sets. Whether you’re dealing with crop yield predictions or analyzing ecological data, BLA offers an array of methods to help you derive meaningful insights from your data. The boundary line approach is suitable for analyzing biological response datasets from surveys i.e. cases in which multiple potential limiting factors to biological response occur but are not controlled as done in experimental conditions. In the case, one can determine the largest expected biological response (also called the boundary value) for a given value of a factor interest. Additionally, the most limiting factor to the biological response can be modeled from the observed potential limiting factors.

Overview of BLA

The BLA package builds on the foundational work by Webb (1972) and incorporates a variety of heuristic and statistical boundary line fitting methods developed since then. We give an overview of these methods in a recent paper (Miti et al., 2024a). These includes the (1) Binning Methods that involves selecting of boundary points by segmenting the independent variable into bins of equal width from which the boundary point is chosen based on a given criteria e.g. 95\(^{\rm th}\) percentile of the response variable, (2) Bolides Algorithm, an advanced technique for selecting boundary points introduced by Schnug et al. (1995), (3) Quantile Regression, which models the boundary line as a given quantile (Cade and Noon, 2003), and (4) Censored Bivariate Normal Model, a statistical method developed by Milne et al. (2006) that is based on the maximum likelihood approach assuming the data takes a censored bivariate normal distribution.

Additionally, BLA provides data exploration tools to identify outliers and checking for evidence of boundary presence in your data before diving into the boundary line analysis. Post boundary line fit functions are also available for a comprehensive analysis. The Table below shows a summary of the available functions in BLA.

Exploratory	Model fitting	Post-hoc
`summastat()`	`blbin()`	`predictBL()`
`exp_bounary()`	`bolides()`	`limfactor()`
`startValues()`	`blqr()`
`ble_profile()`	`cbvn()`

Installation and loading

Getting started with BLA is straightforward. You can install the current version of the package directly from CRAN with the following command:

install.packages("BLA")

The package can now be loaded using the library() function. Additionally, the aplpack package is loaded which contains a function for the identification of outliers.

library(BLA)
library(aplpack)

Use case for BLA

BLA is particularly useful in scenarios where you need to model the relationship between a response variable,y, and a potentially limiting factor, x. For example, in agricultural studies, you might want to analyze the relationship between crop yield and soil P concentration data from non-experimental settings (Lark et al., 2020). BLA helps you determine the most efficient biological response (yield) for a given value of soil P concentration. Below is an example of how you can use the censored bivariate normal model (the cbvn() function) to carryout boundary line analysis. This process includes initial data exploration, fitting the boundary line model and prediction of boundary values.

Data exploration

There are three important exploratory steps required in boundary line analysis. These are (1) checking the distribution of the potential limiting and response factors to fulfill the assumption of normality or normality with a censor, (2) detection of outliers and (3) assessing the evidence that the dataset exhibits a boundary which limits the response variable (Miti et al., 2024b).

1. Checking variable distribution indices and plots

The function summastat() can be used for this purpose. We Start with soil P concentration which we will denote as x.

x <- soil$P 

summastat(x)

        Mean Median Quartile.1 Quartile.3 Variance       SD Skewness
[1,] 25.9647     22         16         32 207.0066 14.38772 1.840844
     Octile skewness Kurtosis No. outliers
[1,]       0.3571429 5.765138           43

From the plots and the summary statistics, the x variable can not be assumed to be from a normal distribution and hence requires transformation to fulfill the assumption of normality. You can do this by taking the natural log of x.

summastat(log(x))

         Mean   Median Quartile.1 Quartile.3  Variance        SD  Skewness
[1,] 3.126361 3.091042   2.772589   3.465736 0.2556936 0.5056615 0.1297406
     Octile skewness    Kurtosis No. outliers
[1,]      0.08395839 -0.05372586            0

Normality can now be assumed after transformation. Next, you check the yield variable which we denote as y.

y <- soil$yield

summastat(y)

         Mean  Median Quartile.1 Quartile.3 Variance       SD   Skewness
[1,] 9.254813 9.36468   8.203703   10.39477 3.456026 1.859039 -0.4819805
     Octile skewness Kurtosis No. outliers
[1,]     -0.05793291 1.292635            7

From the plots and the summary statistics, the variable y can be assumed to be from a normal distribution.

2. Outlier identification and removal

This is done using the bagplot() function from the aplpack package. Its input is a matrix and hence we assign the x and y variables to a matrix data_ur.

nobs <- length(soil$P) 
data_ur <- matrix(NA, nobs, 2) # Create a matrix
data_ur[,1] <- log(soil$P) 
data_ur[,2] <- soil$yield

bag <- bagplot(data_ur, show.whiskers = F,create.plot = TRUE) # Identify outliers

All the points outside the bag are considered outliers and can be removed from the analysis.

data <- rbind(bag$pxy.bag, bag$pxy.outer) # Exclude bivariate outliers

3. Evidence of boundary presence

This is done using the function expl_boundary() to check if the are grounds for us to fit a boundary line model to the dataset. there should be evidence of a limiting relationship of x and y at the upper bounds of the data scatter.

x <- data[,1]
y <- data[,2]

expl_boundary(x,y,10,1000)

Note: This function might take longer to execute when running a large number of simulations.

    Index Section    value
1      sd    Left 1.045711
2      sd   Right 1.115379
3 Mean sd    Left 1.131335
4 Mean sd   Right 1.203422
5 p_value    Left 0.032000
6 p_value   Right 0.025000

The p-values in the left and right sections are less than 0.05, indicating evidence of an upper boundary existence as opposed to a random scatter of points. This justifies the fitting of a boundary line model to the dataset.

Boundary line fitting

Based on the structure of the data at the upper edge, a trapezium model can be fitted. Take note that any other model of your choice that is biologically plausible can be fitted.

It arguments include (1) a dataframe,data, containing the x and y variables, (2) a vector of initial start values for the optimization that includes the parameters of the boundary model and for the distribution (i.e. mean(x), mean(y), sd(x), sd(y) and cor(x,y)), and (3) the measurement error value for the response variable, sigh. The rest of the inputs are related to the plot features as in the function plot() from base R.

The start values for the preferred model can be determined using the startValues() function. Set the argument model to the desired model e.g. model="trapezium", run the function and click on the plot of y against x, the points that make up the structure of the model at the upper edges of the data scatter.

startValues("trapezium")

Using the obtained values for the model, create a vector of start values.

data<-data.frame(x,y) 
start<-c(4,3,13.6, 35,-5,3,9,0.50,1.9,0.05) # initial start values for optimization

model <- cbvn(data, start = start, model = "trapezium", sigh=0.7, 
                 xlab = expression("Soil P/ log(mg l"^-1*")"), 
                 ylab = expression("Yield/ t ha"^-1), 
                 pch = 16, plot = TRUE, col = "grey40", cex = 0.8)

model

$Model
[1] "trapezium"

$Equation
[1] y = min (β₁ + β₂x, β₀, β₃ + β₄x)

$Parameters
         Estimate Standard error
β₁     4.29795522    1.035391840
β₂     3.23397375    0.460850826
β₀    13.15187257    0.206656034
β₃    33.17267393    1.693458789
β₄    -5.22857503    0.393758941
mux    3.12597270    0.006451086
muy    9.30482617    0.022879293
sdx    0.50053107    0.004561474
sdy    1.60754420    0.018448808
rcorr  0.04150832    0.014806076

$AIC
            
mvn 32429.55
BL  32391.07

The output gives the name of model fitted which is a trapezium in the case, its equation form, its parameters (the first 5 rows) and corresponding standard errors, and the AIC values for the boundary model, BL, and a corresponding multivariate normal model, mvn. Since the AIC for the BL is smaller than mvn, the boundary line model is appropriate. We interpret the first rising section of the boundary model as showing the P-limited response of yield, and the soil P concentration at which this intersects the horizontal section as a critical value (it is consistent with advisory index values). We think that the falling section of the line is likely to reflect the accumulation of P in the soil in areas with consistent large limitations on crop yield (e.g. soil compaction in headlands).

Predicting boundary yield

Using the fitted model, you can predict the boundary yield for each data point based on soil P values:

P_values <- log(soil$P)
P_values[is.na(x)] <- mean(x, na.rm = TRUE) # replace missing values with mean pH
predicted_yield <- predictBL(model, P_values)


head(predicted_yield) # predicted yield for the first six farms

[1] 11.74445 11.40372 11.74445 11.40372 11.40372 12.33408

Determining critical soil P concentration value

The critical soil P concentration value is the point beyond which yield increase is not expected. You can calculate it from the model parameters as follows:

intercept <- model$Parameters[1]
slope <- model$Parameters[2]
plateau <- model$Parameters[3]

critical_P <- (plateau - intercept) / slope

print(exp(critical_P))# results are in mg/l

[1] 15.45268

Closing remarks

The BLA package is a powerful tool for researchers and analysts working with data sets where analyzing boundary response is crucial. With methods ranging from the heuristic binning to advanced statistical algorithms like censored bivariate normal model, BLA provides all the necessary tools for a comprehensive boundary line analysis. For further details and advanced usage, visit the website https://chawezimiti.github.io/BLA/.

We hope you find BLA useful and look forward to your feedback and contributions (https://github.com/chawezimiti/BLA/issues).