The growth and evolution of our civilization have been based on the use of biodiversity. Human survival was much sustained by the use of plant species, especially for nutrition and medicinal uses. Natural products are the most traditional source for the design and development of new drugs. Taking into account all drugs approved worldwide, 67% are either a natural product, semi-synthetic derivative, macromolecule isolated from an organism or have a pharmacophore group inspired by a natural product. Natural products researchers have been studying the medicinal properties of plants, and great advances regarding biosynthesis, ecology, and biological properties were achieved in the last century. Unfortunately, many species are endangered of extinction, and their chemistry, genes, and biological properties would also be lost.

The knowledge of over 50 years of studies on biodiversity available in scientific articles can become easier accessible when organized and shared through knowledge graphs. It can assist in the development of different fields of science and bio-friendly products with high added value as well as guide public policies to bring benefits both to science and to strengthen the bio-economy. All these benefits demonstrate the need for organized and standardized policies and procedures to produce high-quality and valuable information. In this sense, the Semantic Web architecture choices can facilitate the publishing, sharing, and consuming of this valuable information among researchers and leading institutions. Once the data is in the RDF format, contributions from the wide range of the so-called Linked Data fields such as knowledge discovery, machine learning, data fusion, information retrieval, and data quality can help to foster sustainable growth of biochemical data as well as facilitate harvest all its potential to a more environmentally conscious society.

However, to date, most of the structured biochemical information available on the Web is manually curated, and it is practically impossible to keep pace with the research being constantly published in scientific articles. Within this challenge, we want to speed up and promote research on automatic biochemical knowledge extraction mechanisms by the Semantic Web scientific community to increase the information available on natural products to promote the development of environmental-friendly products while increasing the community’s awareness of the biodiversity value.

Pre-registration

We encourage participants to perform a pre-registration to keep informed about any changes on the program. Pre-registrations can be performed through the online form: https://forms.gle/2sMUUondkULRwnE66

Challenge

We invite researchers to participate by re-using or designing new innovative Biochemical Extraction methods. The participants will have the opportunity to present, discuss and display their knowledge extraction approach at the BiKE workshop. The Challenge consists in extracting relevant information from a set of biochemical research articles and constructing a knowledge graph through a given ontology.

Training Data set

The dataset used for evaluation and training was generated from hundreds of peer-reviewed scientific articles with information on more than 2,521 possibilities of natural product extraction. The dataset was built manually by chemistry specialists that read the articles annotating four relevant properties associated with each natural product discussed in the academic publication. For this challenge, we focus on five NuBBE properties for training and prediction: (I) compound name (rdfs:label), (II) bioactivity (nubbe:biologicalActivity), (III) species from where natural products were extracted (nubbe:collectionSpecie), (IV) collection site of these species (nubbe:collectionSite), and (V) isolation type (nubbe:collectionType). The table below presents an overview of the number of unique properties.

All papers are present on all train splits, but the papers selected for each test split have all links to manually extracted characteristics removed. This means that these papers will not be connected to the rest of the knowledge graph. The provided code for generating the knowledge graph representation with python’s networkx uses BERTopic’s extracted topics for reconnecting the knowledge graph. The assigned topics are also filtered by the following rule: if the topic is present in more than 80% of examples it is eliminated since it does not discriminate from the others. Part of the challenge is to figure out other ways to reconnect the knowledge graph with automatically extracted characteristics like citation networks for the authors, conferences, and others.

We provide the original flat data and the original networkx knowledge graph. We also provide 10 previously randomized train/test splits that contain the links maintained and removed, respectively. For every train/test split, we also provide a prepared networkx knowledge graph. The provided data can be accessed here: https://drive.google.com/drive/folders/1NXLQQsIXe0hz32KSOeSG1PCAzFLHoSGh?usp=sharing The source code and documentation for NatUKE can also be accessed at: https://github.com/AKSW/natuke

For referencing the benchmark used please refer to:

@inproceedings{icsc/natuke/2023,
  title = {NatUKE: A Benchmark for Natural Product
    Knowledge Extraction from Academic Literature},
  author = {Paulo Viviurka do Carmo,
    Edgard Marx,
    Ricardo Marcacini,
    Marilia Valli,
    João Victor Silva e Silva,
    Alan Pilon},
  booktitle = {17th IEEE International Conference on Semantic Computing},
  year = {2023},
  publisher = {IEEE}
}

Evaluation Metrics

We are interested in ranking the correct document prediction of real links that were hidden in the knowledge graph. Together with MRR (Mean Reciprocal Rank), hits@k is a ranking metric for when there is only one correct document. On the other hand, mAP (mean Average Precision) and nDCG (normalized Discounted Cumulative Gain) are designed for ranking when a list of relevant documents is available. We chose hits@k because it allows us to evaluate each characteristic extraction with reasonable expectations by customizing the k value. Following the rule used in NatUKE the final k values in this table are from 1 to 50 considering values multiples of 5 and two thresholds: (1) a score equal or higher than 0.50 is achieved; and (2) a score equal or higher than 0.20 is achieved. Please refer to the NatUKE benchmark paper for further details.

Submission guidelines

On 12th of February, we decided to split the submission into two stages (1) results and (2) papers.

Rules

Paper

QA

To help participants to get on board, we created a QA channel on Slack, you can post your questions directly there:

Contact

Agenda

Awards

The first, second, and third best biochemical knowledge extraction methods are going to be awarded as follows:

Program

28.05.2023

08:30 - Registration

09:00 - Welcome (Edgard Marx)

Towards Natural Inspired Products from Biodiversity by Edgard Marx (Project Manager at the HTWK and Linked Data Expert at eccenca)

09:30 - Session 1 (Chair: Edgard Marx)

10:30 - Coffee Break

11:00 - Session 2 (Chair: Paulo do Carmo)

12:00 - Lunch

Results

First Place:

Second Place:

Third Place:

Organizing Committee

General Chair

Edgard Marx, HTWK, Germany

Organizing Committee

Sanju Tiwari, UAM, Mexico

Joao Victor da Silva e Silva, USP, Brazil

Marilia Valli, USP, Brazil

Paulo Ricardo Viviurka do Carmo, HTWK, Germany

Advisory Committee

Vanderlan da Silva Bozani, UNESP, Brazil

Adriano Defini Andricopulo, USP, Brazil

Thomas Riechert, HTWK, Germany

Alan Pilon, USP, Brazil