Examples include the GO for attributes of gene products such as subcellular localization, molecular function or biological role, and the Plant Ontology (PO) for plant attributes such as developmental stages or anatomical parts. When data are associated with appropriate ontology terms, data interoperability, retrieval and transfer are more effective. In this section, we review the challenges and opportunities in the use of ontologies and provide a set of recommendations for data curation with ontologies.

To identify the current status of, and challenges in, ontology use, an online survey was offered to AgBioData members. The survey results for ontology use in databases for each data type are provided in Table 1, and a summary of other survey questions, such as barriers to using ontologies, is provided in supplementary material 1. In addition, the ways ontologies are used in data descriptions in some GGB databases are described in supplementary material 2. To facilitate the adoption of ontologies by GGB databases, we describe the challenges identified by the survey along with some opportunities to meet these challenges, including a review of currently available ontologies for agriculture, ontology libraries and registries, and tools for working with ontologies.

Most GGB databases use GO, but fewer use additional ontologies such as the Plant Trait Ontology (TO) and PO to describe their data. In addition, with a few exceptions, these terms are assigned through computation instead of through rigorous manual annotation.
The use of ontologies could be facilitated if the list of applicable ontologies were readily available. Within the agricultural domain there are many reference ontologies applicable to model and crop plants, livestock, arthropods and other animal species. Table 2 lists some of the ontologies that are applicable to agricultural data. In supplementary material 3, we also describe ontology libraries and registries, including descriptions of the Planteome project, the Crop Ontology (CO) project, the Open Biological and Biomedical Ontology (OBO) Foundry, the NCBO BioPortal, OntoBee, the EBI Ontology Lookup Service, AberOWL and the AgroPortal project.

A further challenge is the lack of funding and resources to train and pay biocurators. While most databases recognize the importance of using ontologies for efficient data integration and retrieval, GGB databases typically lack sufficient funding to train and pay additional biocurators. The curation work could be somewhat eased by tools for the curation and validation of annotations and by standardized data formats for ontology annotation data exchange. Significant work has been done in these areas. Curation tools for GO annotation include TAIR's in-house curation tool PubSearch, TAIR's community portal the Online Annotation Submission Tool (TOAST), PomBase's Canto, and the GO consortium's Noctua and Table Editor. To facilitate sharing annotations among resources, there are some existing and emergent standards for ontology annotation data exchange. The GO Annotation File (GAF) format is the standard for GO annotation data exchange (a minimal example is sketched below), and Phenopackets is an extensible data model and data exchange format for phenotype data from any species. More details about these tools are provided in supplementary material 4.

Some databases develop in-house ontologies when the existing ontologies do not meet their needs. When using in-house ontologies, it is necessary to map their terms to reference ontologies to facilitate ontology development and/or data transfer among databases. In addition, it is often necessary to use species-specific ontologies; for example, analogous organs across plant species often do not have the same name.
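To make the GAF exchange format mentioned above concrete, the following minimal sketch writes a single GO annotation row in the 17-column, tab-separated GAF 2.2 layout. The database name, gene identifiers and reference are hypothetical placeholders; only the column order and version header follow the published format.

```python
# Minimal sketch: writing one GO annotation in GAF 2.2 layout.
# The record values below are hypothetical; only the column order
# (17 tab-separated fields) and the version header follow the GAF spec.

GAF_COLUMNS = [
    "DB", "DB Object ID", "DB Object Symbol", "Qualifier", "GO ID",
    "DB:Reference", "Evidence Code", "With/From", "Aspect",
    "DB Object Name", "DB Object Synonym", "DB Object Type",
    "Taxon", "Date", "Assigned By", "Annotation Extension",
    "Gene Product Form ID",
]

def write_gaf(path, rows):
    """Write annotation rows (lists of 17 strings) as a GAF 2.2 file."""
    with open(path, "w") as fh:
        fh.write("!gaf-version: 2.2\n")
        for row in rows:
            assert len(row) == len(GAF_COLUMNS), "GAF rows have 17 columns"
            fh.write("\t".join(row) + "\n")

# Hypothetical annotation: a gene product located in the nucleus
# (GO:0005634), supported by a direct assay (evidence code IDA)
# reported in a placeholder (fictitious) reference.
example_row = [
    "ExampleDB", "EX:0001", "ABC1", "located_in", "GO:0005634",
    "PMID:0000000", "IDA", "", "C", "ABC1 protein", "", "protein",
    "taxon:3702", "20240101", "ExampleDB", "", "",
]
write_gaf("annotations.gaf", [example_row])
```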
To ease this naming difficulty, the CO and Planteome projects work together to link terms in plant species-specific ontologies to more general terms in reference ontologies like GO and PO. In the case of incomplete ontologies, there is a need for a tool or system where researchers and biocurators can add terms that are reviewed in a timely manner for inclusion in the ontologies.

The core recommended set of ontologies to use for agricultural data is GO for gene function annotation, the Sequence Ontology (SO) to define sequence types and trait ontologies for quantitative trait loci (QTL), heritable phenotypic markers, gene models, transcripts, germplasm, molecular markers and trait descriptors for breeding and/or diversity projects. PO and TO are recommended for describing plant anatomy, developmental stages of plants and plant traits. When species-specific trait ontologies are used, it is recommended that they be annotated with reference PO and TO terms to enable cross-species comparison. Trait-related ontologies recommended for livestock, arthropods and other animal species are summarized in Table 4. All curatorial assignments of an ontology term to a database object should also be accompanied by the appropriate Evidence and Conclusions Ontology (ECO) term describing the evidence on which that assertion is based and a traceable reference to the source of the experimental data (a minimal sketch of such a record follows below). For agricultural animals, anatomy is represented using Uberon and the Cell Ontology. For describing traits, the Vertebrate Trait Ontology (VT) provides trait descriptors analogous to TO for plants. The Ontology of Biological Attributes is a Phenotype and Trait Ontology (PATO)-based ontology that provides traits in a completely species-neutral way, and integrates with VT and TO. The Mammalian Phenotype Ontology describes abnormal effects of gene mutations and other phenotypes.
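As recommended above, no curatorial assignment should be stored without an ECO evidence term and a traceable reference. The sketch below shows one way such a record might be structured inside a database; the field names and example identifiers are illustrative assumptions rather than a prescribed schema.

```python
from dataclasses import dataclass

@dataclass
class OntologyAnnotation:
    """One curated assignment of an ontology term to a database object.

    Hypothetical structure: the point is that no assignment is stored
    without an ECO evidence term and a traceable reference.
    """
    object_id: str    # database object being annotated
    term_id: str      # ontology term, e.g. a TO or GO accession
    evidence_id: str  # ECO term describing the supporting evidence
    reference: str    # traceable source, e.g. a literature identifier
    curator: str      # who made the assertion

# Hypothetical record: a trait annotation backed by direct assay evidence
# (ECO:0000314, "direct assay evidence used in manual assertion").
ann = OntologyAnnotation(
    object_id="ExampleDB:gene0001",
    term_id="TO:0000207",        # plant height
    evidence_id="ECO:0000314",
    reference="PMID:0000000",    # placeholder reference
    curator="curator01",
)
```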
In order to provide unified access to different types of agricultural data and enable large-scale analysis, it is crucial to have a network of domain ontologies. Each ontology focuses on its own domain, but often several ontologies are needed to fully describe the data. As an example of the creation of an ontology network, TO defines phenotypic traits in plants as Entity–Quality (EQ) statements. The quality terms come from PATO, whereas the entity terms come from PO, GO or ChEBI, depending on the entity; for example, a plant height trait can be decomposed into an entity term such as 'whole plant' from PO and a quality term such as 'height' from PATO. These efforts can reduce curation time in individual databases since, once data are curated with one ontology such as TO, they can be further associated with the other component ontologies. There are also tools for building relationships among reference ontologies. One example is the Intelligent Concept Assistant, a National Institutes of Health Big Data to Knowledge (BD2K)-funded project to develop an environment that helps scientists collaboratively create, extend and apply their knowledge to describe and interpret biomedical data sets.

We encourage the use of ontologies by implementing rules and procedures where available/applicable, and improving ontologies by enlisting community help in adding new terms, correcting existing terms as needed and, in general, modifying ontologies to be broadly adaptive. A good example of this process occurs in AnimalQTLdb, where ontologies were developed in parallel with improvements to AnimalQTLdb. One way to encourage researchers to use ontologies is to provide a system that requires collection of accepted ontology terms as part of the data and/or publication submission process. Some databases, such as TAIR, TreeGenes and GDR, work with journals to require that authors submit their data to the appropriate GGB database prior to manuscript submission. There are multiple approaches to this. GDR has downloadable data templates that researchers fill in and submit. TreeGenes has a web form for submitting association genetics and population genomics studies. The TreeGenes web form is currently being converted to a Tripal module, with the aim of fully adopting the Minimum Information About a Plant Phenotyping Experiment (MIAPPE) standard to include the full spectrum of data generation, including design, sampling, sequencing and informatic analysis. The TreeGenes system simplifies the submission process through the use of guided prompts that query researchers for the location of the experiment, the type of experiment and environmental conditions, and that determine which ontologies are necessary. TAIR's TOAST allows authors to submit GO and PO annotations for their own or for others' published works.

Public data are valuable for additional research and for reproducibility analyses, but data cannot be reused unless they are sufficiently described, including attribution, analysis methods, procedures, data formats and a description of the subjects and treatments. Nor can data be reused if they cannot be found via search engines or persistent identifiers. Take the data resulting from a genome-wide association study (GWAS) as an example. The accompanying metadata should include the species and specific individuals that were sampled; the study participants and publication; the genotypes and phenotypes and how they were obtained; the names, versions and parameters of software used; any scripts developed; parameters used to define significance; and data formats. Not only does this enable researchers to reuse data that may have been produced at considerable expense, it also enables researchers to reproduce results.
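As a minimal sketch of what such a GWAS metadata record might look like, the snippet below serializes the fields listed above to JSON alongside the data set. All field names, identifiers and values are illustrative assumptions, not a published standard; a real deposit would follow the repository's own schema.

```python
import json

# Hypothetical metadata record for a GWAS data set. Field names are
# illustrative; ontology accessions (e.g. a TO term for the trait)
# stand in for free-text descriptions.
gwas_metadata = {
    "study": {
        "title": "Example GWAS of plant height",
        "publication": "PMID:0000000",           # placeholder
        "contributors": ["Researcher A", "Researcher B"],
    },
    "samples": {
        "species": "NCBITaxon:3702",             # Arabidopsis thaliana
        "individuals": "accession list in samples.tsv",
    },
    "phenotypes": {
        "trait": "TO:0000207",                   # plant height
        "measurement_protocol": "described in protocol.pdf",
    },
    "genotypes": {
        "platform": "whole-genome resequencing",
        "variant_calls": "variants.vcf.gz",
    },
    "analysis": {
        "software": [{"name": "example-gwas-tool", "version": "1.0"}],
        "parameters": {"significance_threshold": 5e-8},
        "scripts": "https://example.org/repo",   # placeholder URL
    },
    "formats": ["VCF", "TSV"],
}

with open("gwas_metadata.json", "w") as fh:
    json.dump(gwas_metadata, fh, indent=2)
```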
Furthermore, having a persistent identifier attached to this data set, and having it deposited in a permanent repository, ensures that it can be found, retrieved and reused by multiple researchers for years to come.

Metadata is descriptive information about an object or resource, whether it be physical or electronic. The underlying concepts behind metadata have been in use for as long as collections of information have been organized. Library card catalogs represent a well-established type of metadata that has served as a collection management and resource discovery tool for decades.
A metadata standard can be either a set of core descriptors that apply in all instances and are extended as needed, or a comprehensive standard consisting of both required and optional data fields. The advantage of a core set is that its simplicity can greatly aid its adoption; the Dublin Core Metadata Initiative is an example of a core standard. For scientific research data, however, a 'core standard' will not be adequate to describe how the data were generated and analysed, and extensions to a 'core standard' are, by definition, not standardized, so extended fields likely cannot be compared. A 'comprehensive standard', on the other hand, may provide sufficient descriptive fields to enable reuse of research data, but its complexity may create a significant barrier to adoption. Another dimension of a metadata standard is the use of controlled vocabularies (CVs). To compare metadata for multiple data sets, there must be a means of directly comparing the contents of each field. CVs, especially in the form of a hierarchical ontology that contains information about the relatedness of values, are essential to metadata.

For conservation to achieve success in a dynamic and changing world, many issues must be addressed. Threats, costs, site availability, and type of conservation action, in addition to biodiversity targets or other conservation benefits, are important factors to consider when allocating limited conservation funds. Systematic conservation planning has traditionally emphasized designing reserves for biodiversity conservation. However, tools from spatial conservation prioritization, the prioritization of conservation actions through quantitative means, have been applied to prioritize other resources, such as ecosystem services, the future ranges of biodiversity, and multifunctional agricultural lands, and other conservation actions, such as restoration or invasive species control. The problem formulation of many cases of spatial conservation prioritization can generally be defined as finding either the minimum set solution, the network with the minimum area or cost that meets all of the conservation targets, or the maximal coverage solution, the network with the most conservation targets met at a specified budget (both are formalized in the sketch at the end of this section). Utility maximization is one case of the maximal coverage problem. Utility maximization problems are similar to maximal coverage problems in conservation planning in that the goal is to maximize the benefit of conservation actions subject to a resource constraint. The major difference between the two is in the calculation of site value: in the utility maximization approach, a site's marginal value is calculated based on the representation level of the resource, a target amount of the resource, and a benefit or utility function, whereas the maximal coverage formulation uses a step function that values all selected sites within the set equally, with non-selected sites having no value. A major advantage of the former is that it allows non-threatened areas outside of the selected set to have value and contribute towards conservation goals. A prime example of a utility maximization problem can be seen in Davis, Costello & Stoms, where conservation funds are allocated to maximize the averted loss of utility, which is a composite of three conservation criteria: hotspots of rare, threatened, and endangered species; under-represented wildlife habitat types; and sites for expanding existing small reserves, all in the Sierra Nevada region of California.
They highlight a priority acquisition schedule for conservation and discuss how the framework incorporates key elements of systematic conservation planning, including concepts of complementarity, efficiency, irreplaceability, and retention.
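For readers who prefer the formulations spelled out, the following is a generic sketch of the three problems described above. The notation (sites, features, costs, targets) is assumed for illustration and is not the exact formulation used in the cited studies.

```latex
% Sketch notation: x_i \in \{0,1\} indicates whether site i is selected,
% c_i is its cost, r_{ij} is the amount of conservation feature j held by
% site i, and T_j is the target for feature j.

% Minimum set problem: the cheapest network meeting every target.
\begin{align}
  \min_{x} \quad & \sum_{i} c_i x_i \\
  \text{s.t.} \quad & \sum_{i} r_{ij} x_i \ge T_j \quad \forall j,
  \qquad x_i \in \{0,1\}.
\end{align}

% Maximal coverage problem: meet as many targets as possible within a
% budget B, with y_j = 1 when the target for feature j is achieved.
\begin{align}
  \max_{x,\,y} \quad & \sum_{j} y_j \\
  \text{s.t.} \quad & \sum_{i} c_i x_i \le B, \quad
  \sum_{i} r_{ij} x_i \ge T_j\, y_j \quad \forall j,
  \qquad x_i,\, y_j \in \{0,1\}.
\end{align}

% Utility maximization replaces the step function y_j with a benefit
% function U_j, so an amount a_j already secure outside the selected
% set still contributes value toward the goal for feature j.
\begin{equation}
  \max_{x} \quad \sum_{j} U_j\!\Big(a_j + \sum_{i} r_{ij} x_i\Big)
  \quad \text{s.t.} \quad \sum_{i} c_i x_i \le B .
\end{equation}
```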