Raman spectroscopy has become popular in process analytical technology (PAT) because it requires no sample preparation. However, in multi-batch fermentation processes, a single model often fails to maintain good predictive ability across batches, which raises the problem of model generalization [1]. To address this problem, we compared several machine learning methods on a fermentation process lasting 55 hours. The metabolic product concentration corresponding to each spectrum was predicted with a model built on the previous fermentation batch, using sample concentrations measured by HPLC as reference values. Five algorithms were used to build the prediction models (Fig. 1a). Conventional partial least squares regression (PLSR) and random forest regression (RFR) performed poorly during the intermediate stage (20-40 hours). Because the data distributions of the source and target domains differ [2], we introduced the reference value at the 5th hour of the current fermentation as calibration data and retrained the models to obtain TRA-PLS and TRA-RF (Fig. 1a). In addition, the TrAdaBoost method combined with ridge regression was used for prediction. These models showed a significant improvement in prediction accuracy, with TrAdaBoost achieving the lowest error of 4.35 mM (Fig. 1b). As the base learner in TrAdaBoost, ridge regression mitigates multicollinearity and noise in the spectra, which improves model stability and reduces overfitting. This method shows potential for improving the accuracy and reliability of Raman predictions for online analysis and fermentation process optimization.
Key words: Online monitoring, bioprocesses, machine learning methods
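
The instance-reweighting idea behind combining TrAdaBoost with a ridge base learner can be illustrated with a minimal sketch. This is not the implementation used in this work. It is a simplified TrAdaBoost-style loop for regression, written under several assumptions: the data are synthetic stand-ins for spectra and HPLC reference values, the function names (`weighted_ridge`, `tradaboost_ridge`, `predict`) are hypothetical, three calibration points are used instead of the single 5th-hour value, and the final prediction is a vote-weighted average of the later rounds rather than a weighted median.

```python
import numpy as np


def weighted_ridge(X, y, w, alpha=1.0):
    """Closed-form ridge fit with per-sample weights; returns (coef, intercept)."""
    w = w / w.sum()
    x_mean, y_mean = w @ X, w @ y  # weighted centering keeps the intercept unpenalized
    Xc, yc = X - x_mean, y - y_mean
    A = Xc.T @ (Xc * w[:, None]) + alpha * np.eye(X.shape[1])
    coef = np.linalg.solve(A, Xc.T @ (w * yc))
    return coef, y_mean - x_mean @ coef


def tradaboost_ridge(X_src, y_src, X_cal, y_cal, n_rounds=10, alpha=1.0):
    """Simplified TrAdaBoost-style reweighting with a ridge base learner."""
    n_s = len(y_src)
    X = np.vstack([X_src, X_cal])
    y = np.concatenate([y_src, y_cal])
    w = np.full(len(y), 1.0 / len(y))
    # Fixed shrink factor for source-batch samples that are fitted poorly.
    beta_src = 1.0 / (1.0 + np.sqrt(2.0 * np.log(n_s) / n_rounds))
    models, votes = [], []
    for _ in range(n_rounds):
        coef, b = weighted_ridge(X, y, w, alpha)
        err = np.abs(X @ coef + b - y)
        err = err / (err.max() + 1e-12)  # scale errors to [0, 1]
        # Weighted error on the calibration (target-batch) samples only.
        eps = np.sum(w[n_s:] * err[n_s:]) / w[n_s:].sum()
        eps = float(np.clip(eps, 1e-6, 0.49))
        beta_t = eps / (1.0 - eps)
        models.append((coef, b))
        votes.append(np.log(1.0 / beta_t))
        w[:n_s] *= beta_src ** err[:n_s]    # down-weight misfit source samples
        w[n_s:] *= beta_t ** (-err[n_s:])   # up-weight misfit calibration samples
        w /= w.sum()
    half = n_rounds // 2  # keep only the later rounds for prediction
    return models[half:], np.array(votes[half:]), w


def predict(models, votes, X):
    """Vote-weighted average of the retained ridge models (a simplification)."""
    preds = np.array([X @ coef + b for coef, b in models])
    return votes @ preds / votes.sum()


# Synthetic example: the target batch follows the same trend with an offset.
rng = np.random.default_rng(0)
n_src, n_cal, n_feat = 40, 3, 20
true_coef = rng.normal(size=n_feat)
X_src = rng.normal(size=(n_src, n_feat))
y_src = X_src @ true_coef + rng.normal(scale=0.1, size=n_src)
X_cal = rng.normal(size=(n_cal, n_feat))
y_cal = X_cal @ true_coef + 5.0 + rng.normal(scale=0.1, size=n_cal)

models, votes, w = tradaboost_ridge(X_src, y_src, X_cal, y_cal)
y_hat = predict(models, votes, X_cal)
print("share of total weight on calibration samples:", w[n_src:].sum())
```

The ridge penalty `alpha` keeps the linear solve well conditioned when spectral channels are strongly correlated. The reweighting steadily shifts influence from the previous batch toward the calibration samples of the current batch.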