Selection of Data Sources

Information is available from many sources, and it can vary in substantial ways. Several criteria guide the selection of data for Indicators:

Representativeness. Data should represent the entire national or international populations of interest and should reflect the heterogeneity of those populations. Data should also be available for the subdomains of interest covered in Indicators (e.g., the population of scientists and engineers or the topic of R&D spending by universities).

Relevance. Data should include indicators central to the functioning of the science and technology enterprise.

Timeliness. Data that are not part of a time series should be timely (i.e., they should be the most recent data available that meet the selection criteria).

Statistical and methodological quality. Survey methods used to collect data should provide sufficient assurance that survey estimates are robust and that statements based on statistical analysis of the data are valid and reliable. Nonsurvey data (such as administrative records) and data from other third-party sources should similarly be assessed for quality, that is, fitness for use. All external data should be properly sourced and cited, and known limitations of those data must be clearly stated. Data included in Indicators must be of high quality. Key dimensions of data quality include the following:

Validity. Data have validity if they accurately measure the phenomenon they are supposed to represent.

Reliability. Data have reliability if similar results would be produced if the same measurement or procedure were performed multiple times on the same population.

Accuracy. Data are accurate if estimates from the data do not deviate widely from the true population value.
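
The difference between reliability and accuracy can be illustrated with a minimal Python sketch. It is an illustration only, not a procedure used for Indicators: the function name and all figures are hypothetical, and it assumes the true population value is known, which is rarely the case in practice. Spread across repeated estimates stands in for reliability, and the gap between the average estimate and the true value stands in for accuracy.

```python
from statistics import mean, pstdev


def summarize_repeated_estimates(estimates, true_value):
    """Summarize repeated estimates of one population quantity.

    spread: standard deviation across repetitions (smaller suggests more reliable)
    bias:   mean estimate minus the true value (closer to 0 suggests more accurate)
    """
    center = mean(estimates)
    return {
        "mean_estimate": center,
        "spread": pstdev(estimates),
        "bias": center - true_value,
    }


# Hypothetical figures, invented purely for illustration.
TRUE_VALUE = 50.0  # assumed known here; usually it is not
consistent_but_off = [60.0, 60.5, 59.5, 60.0]      # reliable, yet not accurate
scattered_but_centered = [40.0, 60.0, 45.0, 55.0]  # accurate on average, yet not reliable

a = summarize_repeated_estimates(consistent_but_off, TRUE_VALUE)
b = summarize_repeated_estimates(scattered_but_centered, TRUE_VALUE)

print(a)
print(b)
```

In this sketch the first series repeats almost exactly but sits well away from the true value, while the second centers on the true value but varies widely from one repetition to the next. High-quality data need both properties.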

Data that are collected by U.S. government agencies and that are products of the federal statistical system meet the rigorous statistical and methodological criteria described above. Unless otherwise indicated, these data are representative of the nation as a whole and of the demographic, organizational, or geographic subgroups that constitute it.

For data collected by governments in other countries and by nongovernment sources, including private survey firms and academic researchers, methodological information is examined to assess conformity with the criteria that U.S. federal agencies typically use. Government statistical agencies in the developed world cooperate extensively both in developing data-quality standards and in improving international comparability for key data, and these agencies ensure that the methodological information about the data generated by this international statistical system is relatively complete.

Methodological information about data from nongovernmental sources and from governmental agencies outside the international statistical system is often less complete. These data must meet basic scientific standards for representative sampling of survey respondents and for adequate and unbiased coverage of the population under study. The resulting measurements must be sufficiently relevant and meaningful to warrant publication despite methodological uncertainties that remain after the documentation has been scrutinized.

Many data sources that contain pertinent information about a segment of the S&E enterprise are not cited in Indicators because their coverage of the United States is partial in terms of geography, incomplete in terms of segments of the population, or otherwise not representative. For example, data may be available for only a limited number of states, or studies may be based on populations not representative of the United States as a whole. Similarly, data for other countries should cover and be representative of the entire country. In some cases, data that have limited coverage or are otherwise insufficiently representative are referenced in sidebars.