Predicting the toxicity of chemicals with AI

July 18, 2024 | Ori Schipper

Researchers at Eawag and the Swiss Data Science Center have trained AI algorithms with a comprehensive ecotoxicological dataset. Now their machine learning models can predict how toxic chemicals are to fish.

Chemicals play an important role in our everyday lives, for example in the production of food, medicines and various everyday goods. Their impact on human health and the environment is closely monitored using various control mechanisms. For instance, the EU stipulates in the REACH regulation that fish toxicity tests must be carried out for all chemicals with a minimum annual production volume of 10 tonnes. These tests are expensive – and require an estimated 50,000 fish each year in Europe.

Scientists have been working for several decades on alternative methods that are cheaper and, above all, do not require the use of laboratory animals. Great hopes are pinned on computer-based methods that can predict the effects of chemicals on fish.

Promising predictive power of the models

The aquatic research institute Eawag and the Swiss Data Science Center (SDSC) have joined forces to curate a comprehensive ecotoxicological dataset, made available to the scientific community, to help develop and benchmark new AI algorithms in ecotoxicology. The dataset, called “ADORE”, consists of around 26,000 data points that describe the effects of almost 2,000 chemicals on 140 fish species. It also includes a large set of characteristics of both the chemicals and the species.

As the researchers explain in their recently published scientific paper, the machine learning models are good at predicting the toxicity of chemicals. “The deviations observed are within the range of normal biological fluctuations,” say the two lead authors of the publication, Lilian Gasser, data scientist at the SDSC, and Christoph Schür, postdoctoral researcher at Eawag. The researchers therefore consider the investigated methods to be “promising for the prediction of acute fish mortality”. The methods could also be used for other species groups, provided that similar data are available.

“However, there are still limitations that need to be taken into account,” the researchers state self-critically. Although the algorithms provide useful predictions on average, they are still substantially off in some cases for individual fish species. For example, they overestimate the toxicity of a chemical for certain fish species and underestimate it for other species. “Evidently, the models are mainly influenced by a few chemical properties and do not yet adequately capture species-specific sensitivities,” says Gasser.

A proper testing procedure leads to meaningful results

In their work, Gasser and Schür took into account the fact that the way in which the data is divided into a training dataset and a test dataset has a decisive influence on the proper evaluation of the machine learning models. “It is essential that the algorithm is tested only on chemicals that are not present in the training set in order to show that it is able to identify chemical characteristics that are truly predictive of toxicity,” Gasser and Schür comment.
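
The idea of testing only on unseen chemicals can be illustrated with a minimal sketch. It is not the researchers' actual code: the function name `split_by_chemical`, the record fields, and the toy values are all assumptions made for illustration. The sketch assigns whole chemicals, rather than individual data points, to either the training or the test set.

```python
import random


def split_by_chemical(records, test_fraction=0.2, seed=0):
    """Split records so that no chemical occurs in both train and test.

    `records` is a list of dicts with a "chemical" key (hypothetical schema).
    """
    # Collect the distinct chemicals and shuffle them reproducibly.
    chemicals = sorted({r["chemical"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(chemicals)

    # Reserve a fraction of the chemicals (not of the rows) for testing.
    n_test = max(1, round(len(chemicals) * test_fraction))
    test_chemicals = set(chemicals[:n_test])

    train = [r for r in records if r["chemical"] not in test_chemicals]
    test = [r for r in records if r["chemical"] in test_chemicals]
    return train, test


# Toy records with made-up values, for illustration only.
records = [
    {"chemical": "chem_A", "species": "species_1", "log10_lc50": 0.5},
    {"chemical": "chem_A", "species": "species_2", "log10_lc50": 0.7},
    {"chemical": "chem_B", "species": "species_1", "log10_lc50": 1.9},
    {"chemical": "chem_C", "species": "species_2", "log10_lc50": -0.3},
    {"chemical": "chem_D", "species": "species_1", "log10_lc50": 1.1},
    {"chemical": "chem_D", "species": "species_3", "log10_lc50": 1.4},
]

train, test = split_by_chemical(records, test_fraction=0.25, seed=42)
train_chemicals = {r["chemical"] for r in train}
test_chemicals = {r["chemical"] for r in test}
```

A purely random split of the rows could place the same chemical, tested on different species, in both sets, so a model could score well simply by recognising the chemical. Splitting by chemical avoids this kind of leakage.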

The future of chemical safety

According to Gasser and Schür and their co-authors, it is unlikely that machine learning models and artificial intelligence will soon make fish toxicity tests obsolete, but they are likely to help reduce them in the long term. The researchers believe these models will provide a more targeted assessment of chemical safety, which in future will include other biological factors in addition to physicochemical properties of the chemicals and mortality data.

For example, the model predictions could be combined with the evaluations of a series of other – animal-free – tests, which are currently being developed and validated at Eawag using different fish cell lines. For the development of such a highly informative chemical safety system, the researchers are encouraging close cooperation with the regulatory authorities so that the translation of research into practice can be jointly advanced.
 

Cover picture: Fish are often used in experiments. Machine learning could be an alternative to fish testing. (Photo: AdobeStock)
 

Original publication

Gasser, L.; Schür, C.; Perez-Cruz, F.; Schirmer, K.; Baity-Jesi, M. (2024) Machine learning-based prediction of fish acute mortality: implementation, interpretation, and regulatory relevance, Environmental Science: Advances, 3(8), 1124-1138, doi:10.1039/d4va00072b
Schür, C.; Gasser, L.; Perez-Cruz, F.; Schirmer, K.; Baity-Jesi, M. (2023) A benchmark dataset for machine learning in ecotoxicology, Scientific Data, 10(1), 718 (20 pp.), doi:10.1038/s41597-023-02612-2

Financing / Cooperations

  • Eawag
  • Swiss Data Science Center (SDSC)
  • ETH Zürich
  • EPFL
  • European Partnership for the Assessment of Risks from Chemicals (PARC)
  • Horizon Europe