How does Nature Navigator process data?

Modified on: Tue, 3 Dec, 2024 at 4:35 PM

What types of analysis can I perform with the Nature Navigator?

There are two main steps in creating topic overviews and summaries for Nature Navigator:

Identification and selection of relevant content
Aggregation and summarisation of relevant content

How is relevant scientific content selected?

The selection of relevant content is based on selection criteria specified by the creator of a topic. The creator can be an editor, data specialist or scientist at Springer Nature, or an individual end user of the platform. There are two basic modes of content selection.

Direct content selection

Direct content selection means defining a set of metadata-related criteria that are used to filter all available content. These criteria relate directly to the attributes we source, which are described in "What data attributes is Nature Navigator using?". The following are a few examples of possible direct selection criteria, written in non-technical terms:

Available scientific publications from the last 5 years that mention at least one of the following concepts: wind energy, solar energy, hydropower
A predefined list of publication identifiers
Content authored by researchers affiliated with institutions in a certain country. A selection like this can be useful if the aim is to analyse the research landscape of a certain country or research organisation.

In general, we avoid selection bias based on reception or impact metrics, i.e. there is no up-front exclusion of content that is published in journals with lower impact.
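
As an illustration only, the first example criterion above could be expressed as a simple metadata filter. The sketch below is not Nature Navigator's implementation: the `Publication` record, its field names and the `select_direct` function are hypothetical, and "last 5 years" is assumed to mean a publication year no earlier than the current year minus five.

```python
from dataclasses import dataclass

# Hypothetical record shape; the real attribute names may differ.
@dataclass
class Publication:
    identifier: str
    year: int
    concepts: frozenset


def select_direct(publications, concepts, max_age_years, current_year):
    """Keep publications that are recent enough and mention at least one concept."""
    wanted = {c.lower() for c in concepts}
    earliest = current_year - max_age_years  # assumed reading of "last N years"
    return [
        p for p in publications
        if p.year >= earliest and wanted & {c.lower() for c in p.concepts}
    ]


corpus = [
    Publication("pub-1", 2023, frozenset({"wind energy", "turbines"})),
    Publication("pub-2", 2015, frozenset({"solar energy"})),  # too old
    Publication("pub-3", 2022, frozenset({"coal"})),          # no matching concept
    Publication("pub-4", 2021, frozenset({"Hydropower"})),    # matched case-insensitively
]

selected = select_direct(
    corpus,
    ["wind energy", "solar energy", "hydropower"],
    max_age_years=5,
    current_year=2024,
)
print([p.identifier for p in selected])  # ['pub-1', 'pub-4']
```

Note that the filter looks only at publication year and concepts. No impact or reception metric is involved, in line with the principle stated above.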

Indirect content selection

Another, more powerful way for our creators to select content is indirect selection. In this case, example content from the desired field of interest is used to select similar content using state-of-the-art classification techniques. Creators provide sample content as well as the type of similarity they would like the machine to use. Indirect selection via similarity is in principle more susceptible to bias, because the sample content the creator chooses to seed the process shapes the result. Please see "Improper selection criteria" for more information.
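
To make the idea of seeding with sample content concrete, here is a deliberately simple sketch. It swaps the classification techniques mentioned above for a plain Jaccard overlap of concept sets, which is an assumption made purely for illustration. The function names, the threshold and the sample data are all hypothetical.

```python
def jaccard(a, b):
    """Share of concepts two sets have in common (0.0 = none, 1.0 = identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_similar(candidates, seeds, threshold):
    """Return ids of candidates whose best similarity to any seed reaches the threshold."""
    selected = []
    for identifier, concepts in candidates.items():
        best = max((jaccard(concepts, seed) for seed in seeds), default=0.0)
        if best >= threshold:
            selected.append(identifier)
    return selected


# Sample content supplied by the creator (hypothetical concept sets).
seeds = [
    {"wind energy", "turbines", "offshore"},
    {"solar energy", "photovoltaics"},
]
candidates = {
    "pub-a": {"wind energy", "turbines", "grid"},
    "pub-b": {"photovoltaics", "solar energy"},
    "pub-c": {"protein folding", "genomics"},
}

similar = select_similar(candidates, seeds, threshold=0.5)
print(similar)  # ['pub-a', 'pub-b']
```

The sketch also shows where seeding bias comes from: a candidate such as "pub-c" can never be selected unless some seed shares concepts with it, so a narrow seed list narrows the result.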

How relevant scientific content is analysed

The selected content is analysed at different levels.

Statistical analysis

We offer standard ways of aggregating the selected content into meaningful charts for analysis. The simplest example of a statistical analysis is an aggregation of all relevant content based on one or more metadata criteria, e.g. the publication output in the topic per year.
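
The publication-output-per-year example amounts to counting content by a single metadata attribute. A minimal sketch, assuming the publication years of the selected content are available as a plain list (the function name and data are hypothetical):

```python
from collections import Counter


def output_per_year(publication_years):
    """Count publications per year and return the counts in chronological order."""
    counts = Counter(publication_years)
    return dict(sorted(counts.items()))


years = [2021, 2022, 2022, 2023, 2023, 2023]
per_year = output_per_year(years)
print(per_year)  # {2021: 1, 2022: 2, 2023: 3}
```

The resulting mapping is the kind of data a bar or line chart of yearly output would be drawn from.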

Relationship and network analysis

Certain metadata make it possible to establish relationships and connections between individual content pieces, such as co-authorship (two authors or affiliations contributing to the same content piece), co-citation (two content pieces referencing the same third content piece) or the use of a similar subset of concepts. We use common network analysis techniques to identify relationships, and the strength of those relationships, within the content selection. The result of such an analysis can be, for example, a network of authors that surfaces and visualises:

Active authors in the field
The amount of content authors contribute to a field
Collaboration between authors
Clustering the topic of "renewable energy" into "wind energy", "solar energy" and "hydropower" by analysing the overlapping concepts
Clustering researchers who collaborate on a certain sub-topic
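
The first three items in this list can be derived from author lists alone. The following sketch builds a small co-authorship network using only the Python standard library. It is an illustration under assumptions, not the network analysis actually used by the platform: the function name and the sample data are hypothetical, and clustering is left out.

```python
from collections import Counter
from itertools import combinations


def coauthor_network(author_lists):
    """Return (content count per author, collaboration count per author pair)."""
    content_per_author = Counter()
    collaborations = Counter()
    for authors in author_lists:
        unique = sorted(set(authors))  # ignore duplicate names on one content piece
        content_per_author.update(unique)
        # Every pair of authors on the same piece adds one to their edge weight.
        collaborations.update(combinations(unique, 2))
    return content_per_author, collaborations


# One list of authors per content piece (hypothetical names).
papers = [["Ada", "Ben"], ["Ada", "Ben", "Cho"], ["Cho"]]
nodes, edges = coauthor_network(papers)

print(nodes["Ada"])          # 2 content pieces
print(edges[("Ada", "Ben")])  # 2 joint content pieces
print(edges[("Ada", "Cho")])  # 1 joint content piece
```

Here the node counts correspond to the amount of content each author contributes, and the edge weights to the strength of collaboration between two authors.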

Is your analysis biased, and how do you avoid biases?

In principle, there are several sources of bias in the content processing described above. We are continuously working on minimising the impact of biases where we are in direct control.

Improper selection criteria

Any analysis is influenced directly by the raw data that is fed into it. Hence, the selection criteria are a common source of bias, because they can ignore or suppress a relevant amount of content. Creators therefore have to be mindful of the selection criteria they use. For topics created by Springer Nature, we try to use only filter criteria that are directly related to the question, e.g. applying a country filter only makes sense when the research question is about the research of a given country or region. When working with sample content for indirect selection, we aim to use large input lists with diverse content, e.g. content from many publishers.

Data completeness

Another source of bias is incomplete data. We try to address this by constantly improving and enriching our raw data sources, and by avoiding analysis on attributes that are known to be incomplete. In cases where the current data landscape does not allow for more complete data, and we believe this can have an impact on the analysis, we aim to indicate this to users.
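
One way to decide which attributes are complete enough to analyse is to measure, per attribute, the share of records that actually carry a value. The sketch below is a hypothetical illustration: the function names, the record layout and the 0.8 cut-off are assumptions, not values used by Nature Navigator.

```python
def is_filled(value):
    """Treat None and empty strings or collections as missing; 0 counts as a value."""
    if value is None:
        return False
    if isinstance(value, (str, list, tuple, set, dict)):
        return len(value) > 0
    return True


def attribute_completeness(records, attributes):
    """Return the fraction of records with a usable value for each attribute."""
    total = len(records)
    return {
        attr: (sum(is_filled(r.get(attr)) for r in records) / total) if total else 0.0
        for attr in attributes
    }


records = [
    {"year": 2021, "country": "DE"},
    {"year": 2022, "country": None},
    {"year": 2023},
    {"year": 2020, "country": "IN"},
]

completeness = attribute_completeness(records, ["year", "country"])
usable = [attr for attr, share in completeness.items() if share >= 0.8]
print(completeness)  # {'year': 1.0, 'country': 0.5}
print(usable)        # ['year']
```

In this example an analysis by country would either be skipped or shown with a note that only half of the records carry country information.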
