Data Accuracy

Accurate information is a primary goal of censuses and sample surveys. Accuracy can be defined as how closely results match the true values of the characteristics in the target population. Statisticians use the term “error” to refer to any deviation from those true values. Good survey design seeks to minimize survey error.

Statisticians usually classify the factors affecting the accuracy of survey data into two categories: nonsampling and sampling errors. Nonsampling error applies to administrative records and surveys, including censuses, whereas sampling error applies only to sample surveys.

Nonsampling Error

Nonsampling error refers to error related to the design, data collection, and processing procedures. Nonsampling error may occur at each stage of the survey process and is often difficult to measure. The sources of nonsampling error in surveys have analogues for administrative records: the purposes for which the records are created, and the processes through which they are created, affect how well the records capture the concepts of interest for the relevant populations (e.g., patents, journal articles, immigrant scientists and engineers). A brief description of five sources of nonsampling error follows. For convenience, the descriptions refer to samples, but they also apply to censuses and administrative records.

Specification error. Survey questions often do not perfectly measure the concept for which they are intended as indicators. For example, the number of patents does not perfectly quantify the amount of invention.

Coverage error. The sampling frame, the listing of the target population members used for selecting survey respondents, may be inaccurate or incomplete. If the frame has omissions, duplications, or other flaws, the survey is less representative because coverage of the target population is inaccurate. Frame errors often require extensive effort to correct.

Nonresponse error. Nonresponse error can occur if not all members of the sample respond to the survey. Response rates indicate the proportion of sample members that respond to the survey. The response rate alone, however, is not always a reliable indicator of nonresponse error.

Nonresponse can cause nonresponse bias, which occurs when the people or establishments that respond to a question, or to the survey as a whole, differ in systematic ways from those who do not respond. For example, in surveys of national populations, complete or partial nonresponse is often more likely among lower-income or less-educated respondents. Evidence of nonresponse bias is an important factor in decisions about whether survey data should be included in Indicators.

Managers of high-quality surveys, such as those in the U.S. federal statistical system, do research on nonresponse patterns to assess whether and how nonresponse might bias survey estimates. Indicators notes instances where reported data may be subject to substantial nonresponse bias.
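
As a rough illustration of why a response rate does not by itself reveal nonresponse error, the short Python sketch below computes the bias of a respondent-only mean. It assumes a simplified setting in which every sample member either responds or does not, and in which the nonrespondent mean is known (in practice it usually is not). The function name and the income figures are hypothetical and are not drawn from any survey used in Indicators.

```python
def respondent_mean_bias(resp_mean, nonresp_mean, response_rate):
    """Bias of the respondent-only mean relative to the full-sample mean.

    Illustrative only: assumes the nonrespondent mean is known,
    which is rarely true for a real survey.
    """
    if not 0 < response_rate <= 1:
        raise ValueError("response_rate must be in (0, 1]")
    # Full-sample mean is the response-rate-weighted mix of the two groups.
    full_mean = response_rate * resp_mean + (1 - response_rate) * nonresp_mean
    return resp_mean - full_mean


# Hypothetical incomes, same 70% response rate in both scenarios.
# Scenario A: respondents differ systematically from nonrespondents.
bias_differing = respondent_mean_bias(60_000, 40_000, response_rate=0.7)
# Scenario B: respondents and nonrespondents are alike on average.
bias_similar = respondent_mean_bias(50_000, 50_000, response_rate=0.7)

print(f"Bias when groups differ: {bias_differing:,.0f}")
print(f"Bias when groups are alike: {bias_similar:,.0f}")
```

In this sketch the bias equals the nonresponse rate multiplied by the difference between the respondent and nonrespondent means, so two surveys with identical response rates can carry very different amounts of nonresponse bias.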

Measurement error. There are many sources of measurement error, but respondents, interviewers, mode of administration, and survey questionnaires are the most common. Knowingly or unintentionally, respondents may provide incorrect information. Interviewers may influence respondents’ answers or record their answers incorrectly. The questionnaire can be a source of error if there are ambiguous, poorly worded, or confusing questions, instructions, or terms or if the questionnaire layout is confusing.

In addition, the records or systems of information that a respondent may refer to, the mode of data collection, and the setting for the survey administration may contribute to measurement error. Perceptions about whether data will be treated as confidential may affect the accuracy of survey responses to sensitive questions, such as those about business profits or personal incomes.

Processing error. Processing errors include errors in recording, checking, coding, and preparing survey data to make them ready for analysis.

Sampling Error

Sampling error is the most commonly reported measure of a survey’s precision. Unlike nonsampling error, sampling error can be quantitatively estimated in most scientific sample surveys.

Sampling error is the uncertainty in an estimate that results because not all units in the population are measured. Chance is involved in selecting the members of a sample. If the same random procedures were used repeatedly to select samples from the population, numerous samples would be selected, each containing different members of the population with different characteristics. Each sample would produce different population estimates. When there is great variation among the samples drawn from a given population, the sampling error is high, and there is a greater chance that the survey estimate is far from the true population value. In a census, because the entire population is surveyed, there is no sampling error, but nonsampling errors may still exist.
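
The repeated-sampling idea described above can be sketched with a small simulation. The Python example below is a minimal sketch, assuming simple random sampling without replacement from a synthetic population. The population values, sample size, number of repetitions, and the function name are all hypothetical choices made for illustration.

```python
import random
import statistics


def simulate_sample_means(population, sample_size, n_samples, seed=0):
    """Draw repeated simple random samples and return each sample's mean."""
    rng = random.Random(seed)
    return [
        statistics.mean(rng.sample(population, sample_size))
        for _ in range(n_samples)
    ]


# Hypothetical population of 10,000 units (synthetic values, not survey data).
pop_rng = random.Random(42)
population = [pop_rng.gauss(50, 10) for _ in range(10_000)]
true_mean = statistics.mean(population)

# Each of 500 samples of 100 units yields a somewhat different estimate.
means = simulate_sample_means(population, sample_size=100, n_samples=500)
spread = statistics.stdev(means)  # variation among the sample estimates

# A "census" measures every unit, so repeated draws give the same result.
census = simulate_sample_means(population, len(population), n_samples=3)

print(f"True population mean: {true_mean:.2f}")
print(f"Range of sample estimates: {min(means):.2f} to {max(means):.2f}")
print(f"Spread of sample estimates: {spread:.2f}")
print(f"Census estimates: {[round(c, 2) for c in census]}")
```

The spread of the sample estimates around the true mean is what sampling error describes. The census draws show no such spread, although in a real census nonsampling errors could still affect the result.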

Sampling error generally decreases as sample size increases, and most of the surveys used in Indicators have large samples. Typically, sampling error is a function of the sample design and size, the variability of the measure of interest, and the methods used to produce estimates from the sample data.

Sampling error associated with an estimate is often measured by the coefficient of variation or margin of error, both of which are measures of the amount of uncertainty in the estimate.
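
To make these two measures concrete, the Python sketch below computes a coefficient of variation and a margin of error for a sample mean. It is a simplified illustration that assumes a simple random sample, ignores any finite population correction, and uses a multiplier of 1.96 (an approximate 95% confidence level under a normal approximation). Surveys with complex sample designs use other variance estimation methods, and the function names and data values here are hypothetical.

```python
import math
import statistics


def standard_error_of_mean(sample):
    """Estimated standard error of the mean (simple random sample assumed)."""
    return statistics.stdev(sample) / math.sqrt(len(sample))


def coefficient_of_variation(estimate, standard_error):
    """Standard error expressed as a fraction of the estimate."""
    if estimate == 0:
        raise ValueError("coefficient of variation is undefined for a zero estimate")
    return standard_error / abs(estimate)


def margin_of_error(standard_error, z=1.96):
    """Half-width of a confidence interval; z=1.96 is an assumed 95% level."""
    return z * standard_error


# Hypothetical responses from five sample members.
sample = [4, 8, 6, 5, 7]
estimate = statistics.mean(sample)
se = standard_error_of_mean(sample)
cv = coefficient_of_variation(estimate, se)
moe = margin_of_error(se)

print(f"Estimate: {estimate:.2f}")
print(f"Standard error: {se:.3f}")
print(f"Coefficient of variation: {cv:.1%}")
print(f"Margin of error: +/- {moe:.3f}")
```

Both measures are built from the standard error: the coefficient of variation expresses it relative to the size of the estimate, whereas the margin of error expresses it in the units of the estimate at a chosen confidence level.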