A. Measurement

Validity and Reliability

For every dimension of interest and specific question or set of questions, there are a vast number of ways to make questions. Although the guiding principle should be the specific purposes of the research, there are better and worse questions for any particular operationalization. How to evaluate the measures?

Two of the primary criteria of evaluation in any measurement or observation are:

  1. Whether we are measuring what we intend to measure.
  2. Whether the same measurement process yields the same results.

These two concepts are validity and reliability.

Reliability is concerned with questions of stability and consistency - does the same measurement tool yield stable and consistent results when repeated over time. Think about measurement processes in other contexts - in construction or woodworking, a tape measure is a highly reliable measuring instrument.

Say you have a piece of wood that is 2 1/2 feet long. You measure it once with the tape
measure - you get a measurement of 2 1/2 feet. Measure it again and you get 2 1/2 feet. Measure it repeatedly and you consistently get a measurement of 2 1/2 feet. The tape measure yields reliable results.

Validity refers to the extent we are measuring what we hope to measure (and what we think we are measuring). To continue with the example of measuring the piece of wood, a tape measure that has been created with accurate spacing for inches, feet, etc. should yield valid results as well. Measuring this piece of wood with a "good" tape measure should produce a correct measurement of the wood's length.

To apply these concepts to social research, we want to use measurement tools that are both reliable and valid. We want questions that yield consistent responses when asked multiple times - this is reliability. Similarly, we want questions that get accurate responses from respondents - this is validity.


Reliability refers to a condition where a measurement process yields consistent scores (given an unchanged measured phenomenon) over repeat measurements. Perhaps the most straightforward way to assess reliability is to ensure that they meet the following three criteria of reliability. Measures that are high in reliability should exhibit all three.

Test-Retest Reliability

When a researcher administers the same measurement tool multiple times - asks the same question, follows the same research procedures, etc. - does he/she obtain consistent results, assuming that there has been no change in whatever he/she is measuring? This is really the simplest method for assessing reliability - when a researcher asks the same person the same question twice ("What's your name?"), does he/she get back the same results both times. If so, the measure has test-retest reliability. Measurement of the piece of wood talked about earlier has high test-retest reliability.

Inter-Item Reliability

This is a dimension that applies to cases where multiple items are used to measure a
single concept. In such cases, answers to a set of questions designed to measure some single concept (e.g., altruism) should be associated with each other.

Interobserver Reliability

Interobserver reliability concerns the extent to which different interviewers or observers using the same measure get equivalent results. If different observers or interviewers use the same instrument to score the same thing, their scores should match. For example, the interobserver reliability of an observational assessment of parent-child interaction is often evaluated by showing two observers a videotape of a parent and child at play. These observers are asked to use an assessment tool to score the interactions between parent and child on the tape. If the instrument has high interobserver reliability, the scores of the two observers should match.


To reiterate, validity refers to the extent we are measuring what we hope to measure (and what we think we are measuring). How to assess the validity of a set of measurements? A valid measure should satisfy four criteria.

Face Validity

This criterion is an assessment of whether a measure appears, on the face of it, to measure the concept it is intended to measure. This is a very minimum assessment - if a measure cannot satisfy this criterion, then the other criteria are inconsequential. We can think about observational measures of behavior that would have face validity. For example, striking out at another person would have face validity for an indicator of aggression. Similarly, offering assistance to a stranger would meet the criterion of face validity for helping. However, asking people about their favorite movie to measure racial prejudice has little face validity.

Content Validity

Content validity concerns the extent to which a measure adequately represents all facets of a concept. Consider a series of questions that serve as indicators of depression (don't feel like eating, lost interest in things usually enjoyed, etc.). If there were other kinds of common behaviors that mark a person as depressed that were not included in the index, then the index would have low content validity since it did not adequately represent
all facets of the concept.

Criterion-Related Validity

Criterion-related validity applies to instruments than have been developed for usefulness as indicator of specific trait or behavior, either now or in the future. For example, think about the driving test as a social measurement that has pretty good predictive validity. That is to say, an individual's performance on a driving test correlates well with his/her driving ability.

Construct Validity

But for a many things we want to measure, there is not necessarily a pertinent criterion available. In this case, turn to construct validity, which concerns the extent to which a measure is related to other measures as specified by theory or previous research. Does a measure stack up with other variables the way we expect it to? A good example of this form of validity comes from early self-esteem studies - self-esteem refers to a person's sense of self-worth or self-respect. Clinical observations in psychology had shown that people who had low self-esteem often had depression. Therefore, to establish the construct validity of the self-esteem measure, the researchers showed that those with higher scores on the self-esteem measure had lower depression scores, while those with low self-esteem had higher rates of depression.

Validity and Reliability Compared

So what is the relationship between validity and reliability? The two do not necessarily go hand-in-hand.

Graphic of Target
At best, we have a measure that has both high validity and high reliability. It yields consistent results in repeated application and it accurately reflects what we hope to represent.

It is possible to have a measure that has high reliability but low validity - one that is consistent in getting bad information or consistent in missing the mark. *It is also possible to have one that has low reliability and low validity - inconsistent and not on target.

Finally, it is not possible to have a measure that has low reliability and high validity - you can't really get at what you want or what you're interested in if your measure fluctuates wildly.