Construction of a nucleotide sequence using machine learning methods in the "Nanofor SPS" sequencer
Manoilov V. V.¹, Borodinov A. G.¹, Petrov A. I.¹, Zarutskiy I. V.¹, Bardin B. V.¹, Yamanovskaya A. Yu.¹, Saraev A. S.¹, Kurochkin V. E.¹
¹ Institute for Analytical Instrumentation of the Russian Academy of Sciences, Saint Petersburg, Russia
Email: manoilov_vv@mail.ru, borodinov@gmail.com, fataip@mail.ru, igorzv@yandex.ru, bardin.bv@mail.ru, alex.niispb@yandex.ru, vladimirkurockin638@gmail.com

The development of mathematical methods and information technologies for data processing is essential for identifying features of the analyzed nucleic acids and is a necessary part of developing and improving instruments for practical use in biology and medicine. Massively parallel sequencing of nucleic acids involves measuring fluorescence signal intensities through mathematical processing of images obtained from video cameras, and then constructing a nucleotide sequence from these measurements. The paper considers information processing methods that fall into two parts. The first part comprises methods for filtering images, detecting fluorescence clusters, and estimating the parameters of fluorescence signals, both for isolated clusters and for clusters that overlap one another. The second part comprises methods for constructing a sequence of DNA nucleotide letter codes from the fluorescence signal intensities obtained directly from image processing. These signals are not corrected for intensity changes caused by phasing/prephasing, signal attenuation, or cross-talk. The methods of the second part use classifiers based on machine learning. Testing of various machine learning models on the task of constructing a nucleotide sequence showed sufficiently high quality of genetic analysis: Phred quality scores ranged from 29 to 35 for the reference genome of the bacteriophage PhiX174.
Keywords: sequencing, nucleic acids, image processing, improving the quality of genetic analysis, machine learning.
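For readers less familiar with the Phred scale, the reported range of 29 to 35 can be translated into base-call error probabilities. The sketch below assumes the standard definition Q = -10 log10(P), where P is the probability that a called base is wrong; the function names are hypothetical and are not taken from the paper or from the sequencer software.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score Q into an error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability P (0 < P <= 1) into a Phred score Q = -10*log10(P)."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in the interval (0, 1]")
    return -10 * math.log10(p)


# Illustrative use: the two ends of the range quoted in the abstract.
for q in (29, 35):
    p = phred_to_error_prob(q)
    print(f"Q{q}: error probability ~ {p:.2e}, about 1 error in {round(1 / p)} bases")
```

Under this definition, scores between 29 and 35 correspond to roughly one expected miscall per 800 to 3200 bases.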

Publisher:

Ioffe Institute
