From the aquaponic facility at Caldwell, two water samples were collected per week: one from the fish tank and the other from the plant bed, which was used to grow romaine lettuce, watercress, lettuce, and green peppers. Similarly, from the facility at Bryan, three water samples were collected each week: one from the tilapia tank, one from the goldfish tank, and the third from the main plant bed where lettuce and kale were cultivated. The aquaponic facility at Grimes was one of the largest in the state, and four water samples were collected there each week: one from the main growth tank, which was used for breeding tilapia and shrimp, one from the tank that bred bluegill, one from the tank used to raise plant seedlings, and the fourth from the main greenhouse where tomatoes, lettuce, and collard greens were grown. All of these tanks were connected with continuous water flow between them, and water purifiers were installed in each tank to ensure that the recycled water met the water quality requirements needed for the growth of fish and plants. The setup in which the experiments were recorded is shown in Fig. 1. The wastewater from the fish tank flows into the main setup, where extra nutrients are added to the aquaponic solution to optimize plant growth. The collected water samples were sent to the Soil, Water, and Forage Testing Laboratory, Texas A&M University, to determine the nutrient concentrations of each sample. The laboratory method used to measure each of these nutrient concentrations is listed in Table 1. This process was carried out each week for 9 months, and the resulting data were then analysed. Before the analysis is described, the pipeline used to process the data and design the Decision Support System is shown in Fig. 2 below.
As mentioned above, due to the limited size of the data-set, it is not possible to make accurate inferences when all the predictors are taken into account. Thus, several dimensionality reduction techniques were applied to the data-set to select the top predictors that define the nutrient concentration of the aquaponic solution. The predictors that had zero variance were removed from the data-set. Then, a correlation matrix was constructed between the predictors, and from each pair of predictors with higher than 90% pair-wise correlation, one was removed. Next, all the predictors with less than 5% importance were eliminated, as they would likely skew any inferences made from the data-set. As the primary goal was to bring the data-set down to 5 primary chemical predictors, the Recursive Feature Elimination technique with an XGBoost classifier was used, which ranked the predictors in order of their importance. This reduced the data-set to 5 chemical predictors. In addition, 2 categorical predictors were appended to the data-set, one storing the month and the other storing the place in which the observations were recorded. Therefore, a total of 143 observations with 7 predictors were used to design classification rules and carry out inferences.

To apply any classification rule to a given data-set, data visualization is an important tool, as it aids in understanding the structure of the data. It also helps in choosing the classification techniques that can be used on the data-set, depending on the separability between the classes. In this case, Principal Component Analysis (PCA) was used. As PCA treats the entire analysis as an unsupervised learning approach and performs an orthogonal transformation of the data, the Principal Components were calculated to visualize the variance in the data-set. The loading matrix was analysed to determine the predictor that contributes the most to each Principal Component, so that an inference could be drawn about the relative importance of each predictor from its loading value, i.e. the correlation observed between that predictor and the respective PC.
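As a concrete illustration of the selection steps described above (zero-variance removal, correlation filtering at the 90% threshold, and Recursive Feature Elimination with an XGBoost classifier), a minimal Python sketch is given below. The DataFrame layout, function name, and hyperparameter values are assumptions made for illustration, not the code used in the study.

```python
# A minimal sketch of the feature-selection pipeline described above; the
# DataFrame layout and parameter values are assumptions, not the authors' code.
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from xgboost import XGBClassifier

def select_predictors(X: pd.DataFrame, y: pd.Series, n_keep: int = 5) -> pd.DataFrame:
    """Reduce the chemical predictors to the n_keep most informative ones."""
    # 1. Drop zero-variance predictors (e.g. carbonate, which never changed).
    X = X.loc[:, X.var() > 0.0]

    # 2. From every pair with pairwise correlation above 0.90, drop one predictor.
    corr = X.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    X = X.drop(columns=[c for c in upper.columns if (upper[c] > 0.90).any()])

    # 3. Recursive Feature Elimination with an XGBoost classifier ranks the
    #    remaining predictors and keeps the top n_keep.
    rfe = RFE(estimator=XGBClassifier(n_estimators=100), n_features_to_select=n_keep)
    rfe.fit(X, y)
    return X.loc[:, rfe.support_]
```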
To further explore the interpretation of the PCs, the pairwise PC plots were studied to infer which classifiers would suit the data best, depending on the pattern of separability. All the data points belonging to class 0 have been color-coded in red and those belonging to class 1 in green, to give a clear visual picture of the binary distribution of the data.

As stated above, the size of the data-set poses a serious issue when designing the classifier in this context, which is why the classifiers have been trained and tested on the same data. This method of error estimation is referred to as Bolstered Error Estimation. One of the main reasons for choosing this type of estimator is its low bias as well as low variance. It is also faster than other resampling methods such as the bootstrap. The error estimates have been calculated for each of these estimators and for the four popular classification rules, with the sample size varied from small to moderate. The basic idea is to bolster the original empirical distribution of the available data using suitable Bolstering kernels placed at each data point. In this case, a uniform zero-mean, spherical Bolstering kernel fi⋄ was chosen for the analysis, with covariance matrices of the form σi²Ip. There is thus a family of Bolstered estimators corresponding to each choice of σi, where i varies from 1 to n. Larger values of σi result in wider Bolstering kernels and hence lower-variance estimators, but beyond a point the bias starts to increase. Therefore, choosing the standard deviations of the Bolstering kernels is a challenge, and several approaches have been attempted to find the error estimator that best balances the bias-variance trade-off.
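The following is a minimal Monte-Carlo sketch of Bolstered resubstitution error estimation. For simplicity of sampling it uses zero-mean spherical Gaussian Bolstering kernels rather than the uniform kernels described above; the classifier, the kernel standard deviations sigma, and the function name are placeholders, not the study's implementation.

```python
# A minimal Monte Carlo sketch of Bolstered resubstitution error; it uses
# zero-mean spherical Gaussian bolstering kernels (a common simplification)
# rather than uniform kernels, and sigma is a placeholder value.
import numpy as np

def bolstered_resub_error(clf, X, y, sigma, n_mc=100, rng=None):
    """Average misclassification mass of kernels centred at each data point."""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    errs = []
    for xi, yi, si in zip(X, y, np.broadcast_to(sigma, n)):
        # Draw Monte Carlo samples from the bolstering kernel placed at x_i.
        samples = xi + si * rng.standard_normal((n_mc, p))
        # Fraction of kernel mass falling on the wrong side of the decision boundary.
        errs.append(np.mean(clf.predict(samples) != yi))
    return float(np.mean(errs))
```

Here clf would be a classifier fitted on (X, y) themselves, mirroring the train-and-test-on-the-same-data setting described above.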
As stated before, the problem was treated as an unsupervised learning task, and the K-means algorithm was used to classify the observations into two classes. Out of the 143 observations used for analysis, 84 were classified into class 0 and 59 into class 1. As it was difficult to derive inferences considering all the predictors, dimensionality reduction techniques were applied. 'Carbonate' was dropped as its variance was zero throughout. Then, a correlation matrix was constructed between the remaining predictors. This led to the removal of 6 predictors from the data-set, namely magnesium, both measures of hardness, alkalinity, Total Dissolved Salts, and conductivity. Next, an Extra Trees classifier was used to find the percentage importance of each of the 14 remaining predictors, and the predictors with less than 5% importance were removed from the data-set. From the list of chemical predictors, Nitrates and Phosphorus, with importances of 4.51% and 4.44% respectively, were removed, and from the list of chemical properties, pH, SAR, and charge balance, with importances of 1.2%, 0.93%, and 0.93% respectively, were removed. Thus, the final list of chemical predictors used in the analysis was: Potassium, Boron, Bicarbonate, Sulfate, and Chloride concentrations in the solution. In addition, 2 categorical predictors were appended to the data-set, one storing the month and the other storing the place in which the observations were recorded. Therefore, a total of 143 observations with 7 predictors were used to design classification rules and to carry out inferences. Next, visualizations were carried out on the data-set to gauge the separability of the data and to choose the classifiers that could be used for inferencing.
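A small illustrative sketch of the labelling and importance-ranking steps described above follows; the 5% cutoff is taken from the text, while the function name and the Extra Trees hyperparameters are assumptions.

```python
# Illustrative sketch of the K-means labelling and Extra Trees importance
# ranking described above; hyperparameter values are assumptions.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.ensemble import ExtraTreesClassifier

def label_and_rank(X: pd.DataFrame, importance_cutoff: float = 0.05):
    # Unsupervised split of the weekly observations into two classes.
    y = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

    # Extra Trees importance for each remaining predictor (fractions summing to 1).
    et = ExtraTreesClassifier(n_estimators=200, random_state=0).fit(X, y)
    importance = pd.Series(et.feature_importances_, index=X.columns)

    # Keep only predictors carrying at least 5% of the total importance.
    kept = list(importance[importance >= importance_cutoff].index)
    return y, importance.sort_values(ascending=False), kept
```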
As inferred from Fig. 5, the first three PCs explain 46.34%, 34.40%, and 9.36% of the total variance in the data-set, respectively. The loading matrix was analysed to get an idea of which predictor contributes the most to each PC, so that an inference can be drawn about the relative importance of each predictor from its loading value. From Table 2, it can be inferred that the PLACE_CLASS, MONTH_CLASS, and bicarbonate variables had the highest loading values in the 1st, 2nd, and 3rd PC respectively, indicating their strong correlations with the respective PCs. From this, one of the strongest inferences that can be drawn is that both categorical predictors, one storing the place and the other storing the month in which the observations were recorded, are the most important predictors, as they show high loading values in the first two PCs, which together explain about 80% of the variance in the data-set. To further explore the interpretation of the PCs, the pairwise PC plots were studied to infer which classifiers would suit the data best, depending on the pattern of separability. All the data points belonging to class 0 have been color-coded in red and those belonging to class 1 in green, giving a clear visual picture of the two classes. From the PC plots, it was observed that most of the discriminatory information is contained in the first PC, as expected, since it explains most of the variance in the data-set. Next, because of the low sample size, there is a high probability of over-fitting the data, which is why the notion of different Bolstered error estimators has been introduced in this domain. Linear classifiers (LDA and Linear SVM) and non-linear classifiers (CART and 3-NN) have been used together with these Bolstered error estimators, and the results are discussed below. The depth of the decision tree was chosen as 2 owing to the low volume of the data-set, as a larger depth would over-fit the data and lead to unreasonably high accuracy on the training data-set.
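The PCA step described above can be sketched as follows; the scaling choice, function name, and number of components are assumptions made for illustration, and the column names of X would be the seven selected predictors.

```python
# A small sketch of the PCA analysis above: explained-variance ratios, the
# loading matrix, and the strongest-loading predictor per component.
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def pca_loadings(X: pd.DataFrame, n_components: int = 3):
    Z = StandardScaler().fit_transform(X)
    pca = PCA(n_components=n_components).fit(Z)
    # Rows = predictors, columns = PCs; entries are the loading values.
    loadings = pd.DataFrame(pca.components_.T, index=X.columns,
                            columns=[f"PC{i + 1}" for i in range(n_components)])
    top_per_pc = loadings.abs().idxmax()   # predictor with the highest loading per PC
    return pca.explained_variance_ratio_, loadings, top_per_pc
```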
The value of K for the K-Nearest Neighbor classifier was chosen as 3 using the elbow method, as the mean cluster distance on the training data-set was optimal at this value. The equation holds in the case of linear classifiers such as LDA and linear SVM, where the Bolstering kernels are given by uniform circular distributions. As the decision boundary in that case is a hyperplane, it is possible to find analytical solutions. However, when the error estimate is made for non-linear classifiers such as CART and KNN, an approximate solution is needed, which is obtained by applying Monte-Carlo integration. The error estimates for all four classifiers, i.e. LDA, KNN, CART, and Linear SVM, have been tabulated and the results are shown below. The value of N, i.e. the number of randomly sampled data points, was varied from 20% to 100% in increments of 10% for each of the classifiers. As shown in Fig. 7, the Bolstered Error Estimates for LDA decreased from 8% to 2.5% as the size of the sample set increased. The non-linear classifiers, in contrast, over-fit the training set because of its minuscule size, increasing the chance of the classifier performing badly on the testing set. Therefore, based on the Bolstered resubstitution error, a decision was made not to proceed with either of these classifiers, as they are very likely to over-fit when the size of the data-set is small. The opposite behaviour is observed when a linear SVC is used with different values of the penalty parameter: irrespective of the amount of data, the error estimates are around 50%, which makes it unsuitable for separating the data between the classes, as this is a clear case of under-fitting.

As the name suggests, in Bolstered Leave One Out error estimation the classifiers are trained on each subset separately. Every subset contains all the data points belonging to one particular class except one, which is used as the test data point. Therefore, in a binary classification problem, if m data points belong to class 0 and n data points belong to class 1, there will be m subsets of size m-1 for class 0 and n subsets of size n-1 for class 1, and the classifiers will be trained separately on all of them. In the end, the errors on these subsets are aggregated to obtain the total Bolstered Leave One Out error estimate for each classifier.
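A minimal sketch of the general Bolstered Leave One Out idea, using the same Monte-Carlo Gaussian kernels as in the earlier sketch, is given below; the classifier factory, kernel standard deviations, and function name are placeholders rather than the authors' implementation, and the per-class subset bookkeeping described above is simplified to a single leave-one-out loop.

```python
# A minimal sketch of Bolstered Leave One Out estimation with Gaussian
# Monte Carlo kernels; make_clf and sigma are placeholder assumptions.
import numpy as np

def bolstered_loo_error(make_clf, X, y, sigma, n_mc=100, rng=None):
    """Average kernel misclassification mass at each held-out point."""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    sig = np.broadcast_to(sigma, n)
    errs = []
    for i in range(n):
        mask = np.arange(n) != i
        clf = make_clf().fit(X[mask], y[mask])          # train without point i
        samples = X[i] + sig[i] * rng.standard_normal((n_mc, p))
        errs.append(np.mean(clf.predict(samples) != y[i]))
    return float(np.mean(errs))

# Example use: bolstered leave-one-out error for a depth-2 CART tree.
# from sklearn.tree import DecisionTreeClassifier
# err = bolstered_loo_error(lambda: DecisionTreeClassifier(max_depth=2), X, y, sigma=0.5)
```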