Week 3
Surveys

Soci—316

Sakeef M. Karim
Amherst College

SOCIAL RESEARCH

Unit I Begins–
February 10th

Some Reminders

Research Memo Deadline

Deadline for Memo on Research Interests

Memos are due by 8:00 PM on Friday, February 13th.

Some Reminders

Office Hours

Why Quantify Society?

Drawing Inferences

A Random Question: Was your first impression of someone ever “off”?

Drawing Inferences

First impressions often change as we gather more information—i.e., data points—allowing us to fine-tune our inferences about how person i might behave in different situations.

Just as in statistical inference, more data helps us move from crude hypotheses to well-defined expectations about how i might behave within a “margin of error.”
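To make the “margin of error” idea concrete, here is a minimal sketch in base R. It assumes a simple random sample, a made-up sample proportion of 0.5, and the usual normal approximation for a 95% interval. None of these numbers come from the lecture.

```r
# Assumption: p_hat = 0.5 is a hypothetical sample proportion
# (the value that makes the margin of error as wide as possible).
p_hat <- 0.5

# Candidate sample sizes to compare.
n <- c(20, 100, 1000, 5000)

# Approximate 95% margin of error for a proportion: 1.96 * sqrt(p(1 - p) / n).
moe <- 1.96 * sqrt(p_hat * (1 - p_hat) / n)

data.frame(n = n, margin_of_error = round(moe, 3))
```

The margin shrinks as n grows, but with diminishing returns: quadrupling the sample size only halves the margin of error.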

Drawing Inferences

Here’s a general rule of thumb:

In quantitative sociological research, we often strive to generalize our findings—not just to draw inferences about person i, but to make tenable claims about U, a broader target population. To wit, we want our insights to tell us something meaningful about a large set of units—people, countries, firms, documents, inter alia.

Drawing Inferences

We’ll Return to This Example in a Second

Drawing Inferences

Two Quick Questions

  1. How can surveys help us draw inferences?

  2. Are all surveys designed to capture population-level characteristics?

Sampling

The Logic of Sampling

In the social sciences, we often want to know the characteristics of a population of interest. Yet collecting data from every individual in the target population may be prohibitively expensive or simply not feasible … In survey research, we collect data from a subset of observations in order to understand the target population as a whole.

(Llaudet and Imai 2023:52, EMPHASIS ADDED)

The Logic of Sampling

Back to This Figure

The Logic of Sampling

Total Population, N

library(tidyverse)

# Generating a population with one million people

set.seed(2026)

population <- tibble(x = rnorm(1e6, mean = 0, sd = 27),
                     sample = "Overall") |> 
              mutate(x = scales::rescale(x, to = c(0, 1)))

ggplot(population, 
       mapping = aes(x = x)) +
geom_density(alpha = 0.75, fill = "#FFFFB3") +
theme_minimal() 

The Logic of Sampling

n = 20

set.seed(2026)

ggplot(population, 
       mapping = aes(x = x, fill = sample)) +
geom_density(alpha = 0.75) +
geom_density(colour = "white",
             alpha = 0.6,
             data = population |> 
                    # Toggle these numbers on your own time! Try n = 10, 100, 1000, 5000
                    slice_sample(n = 20) |> 
                    mutate(sample = "n = 20")) +
theme_minimal() +
scale_fill_brewer(palette = "Set3")

The Logic of Sampling

n = 1000

set.seed(2026)

ggplot(population, 
       mapping = aes(x = x, fill = sample)) +
geom_density(alpha = 0.75) +
geom_density(colour = "white",
             alpha = 0.6,
             data = population |> 
                    # Toggle these numbers on your own time! Try n = 10, 100, 1000, 5000
                    slice_sample(n = 1000) |> 
                    mutate(sample = "n = 1000")) +
theme_minimal() +
scale_fill_brewer(palette = "Set3")

The Logic of Sampling

n = 5000

set.seed(2026)

ggplot(population, 
       mapping = aes(x = x, fill = sample)) +
geom_density(alpha = 0.75) +
geom_density(colour = "white",
             alpha = 0.6,
             data = population |> 
                    # Toggle these numbers on your own time! Try n = 10, 100, 1000, 5000
                    slice_sample(n = 5000) |> 
                    mutate(sample = "n = 5000")) +
theme_minimal() +
scale_fill_brewer(palette = "Set3")
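The density plots can be complemented with a quick numeric check. The sketch below uses base R only, with the rescaling written out by hand in place of scales::rescale(). It draws one sample at each size and compares the sample mean with the population mean. With a single draw per size, the errors will not necessarily shrink in strict order, so treat the output as illustrative.

```r
set.seed(2026)

# Same setup as the slides: one million normal draws, rescaled to the [0, 1] interval.
x <- rnorm(1e6, mean = 0, sd = 27)
x <- (x - min(x)) / (max(x) - min(x))

pop_mean <- mean(x)

# Draw one sample at each size and record how far its mean lands from the population mean.
sizes <- c(20, 1000, 5000)
sample_means <- sapply(sizes, function(n) mean(sample(x, size = n)))

data.frame(n = sizes,
           sample_mean = sample_means,
           abs_error = abs(sample_means - pop_mean))
```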

Probabilistic Sampling

A Cautionary Tale

Image can be retrieved here.

A Cautionary Tale


Probability Samples

[W]hen we use random selection to determine the members of our sample, there is no systematic difference between the sample and the population from which it is drawn. Any differences are strictly due to random chance alone … Samples that are based on random selection are called probability samples. More precisely, a probability sample is one in which (a) random chance is used to select participants for the sample, and (b) each individual has a probability of being selected that can be calculated.

(Carr et al. 2020:158, EMPHASIS ADDED)

Probability Samples

Of course, random sampling alone is not a panacea. Other considerations include—but are not limited to—our sample size, n, and target population, U.

Consider the following list of potential respondents:

 [1] "Quimby, Kyle"         "Tierney, Braden"      "Owens, George"       
 [4] "Rio, Corinna"         "Learned, Gerad"       "Sullivan, Kevin"     
 [7] "Schroeder, Brandon"   "Cooper, Sarah"        "Taylor, Samantha"    
[10] "Mills, Colleen"       "Dye, Bryan"           "Bartholomew, Michael"
[13] "Price, Brooke"        "Hartz, Casey"         "Noxon, Kayla"        
[16] "Mccombs, Christine"   "Carter, Jonathon"     "Drassen, Taylor"     
[19] "Fraley, Alex"         "Musk, Elon"          

Let’s say we want to talk to two of our respondents. Through random sampling, we might end up with this duo:

# Fix the seed first, then draw two respondents at random
set.seed(2026)

c(randomNames::randomNames(19, ethnicity = 5), "Musk, Elon") |> 
sample(size = 2)
[1] "Musk, Elon"      "Drassen, Taylor"
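How likely is a draw like this one? Under simple random sampling, any one person on a list of 20 has a 2/20 chance of landing in a sample of two. The sketch below checks that figure by simulation in base R. The "Respondent" labels are placeholders standing in for the other 19 names on the list.

```r
set.seed(2026)

# Placeholder labels stand in for the other 19 names on the slide.
respondents <- c(paste("Respondent", 1:19), "Musk, Elon")

# Repeat the two-person draw many times and track how often "Musk, Elon" appears.
draws <- replicate(10000, "Musk, Elon" %in% sample(respondents, size = 2))

mean(draws)  # simulated inclusion probability
2 / 20       # analytic inclusion probability under simple random sampling
```

With n = 2, a single unusual respondent who is selected makes up half of the sample, which is why sample size matters even when selection is random.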

Probability Samples

A Question: What is a sampling frame?

Probability Sampling Techniques

A simple random sample has two key features. First, each individual has the same probability of being selected into the sample. Second—and this is the tricky part—each pair of individuals has the same probability of being selected.

(Carr et al. 2020:164, EMPHASIS ADDED)


Panel C in Figure 6.2 in Carr et al. (2020)
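The two features of a simple random sample can be sketched in base R. The frame of 1,000 student IDs and the sample size of 50 are hypothetical, chosen only for illustration.

```r
set.seed(2026)

# Hypothetical sampling frame: 1,000 student IDs (made up for illustration).
frame <- data.frame(id = 1:1000)
N <- nrow(frame)
n <- 50

# Simple random sample: draw n rows without replacement, all equally likely.
srs <- frame[sample(N, size = n), , drop = FALSE]

# Under this design, every individual shares one selection probability,
# and every pair of individuals shares another.
p_individual <- n / N
p_pair <- (n * (n - 1)) / (N * (N - 1))

c(individual = p_individual, pair = p_pair)
```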

In stratified sampling, the population is divided into groups (called strata), and the researcher selects some members of every group.

(Carr et al. 2020:166, EMPHASIS ADDED)


Panel B in Figure 6.2 in Carr et al. (2020)
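Here is a minimal base R sketch of stratified sampling. It assumes a hypothetical frame of 400 students split evenly across four class years (the strata) and draws 10 students from every stratum. The frame, the strata, and the per-stratum sample size are all made up.

```r
set.seed(2026)

# Hypothetical frame: 400 students, 100 per class year (the strata).
frame <- data.frame(
  id = 1:400,
  class_year = rep(c("First-year", "Sophomore", "Junior", "Senior"), each = 100)
)

per_stratum <- 10

# Split the frame by stratum, sample within each piece, then stack the results.
strat_sample <- do.call(
  rbind,
  lapply(split(frame, frame$class_year),
         function(s) s[sample(nrow(s), size = per_stratum), ])
)

# Every stratum is guaranteed to appear in the sample.
table(strat_sample$class_year)
```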

In cluster sampling, the target population is first divided into groups, called clusters. Some of these clusters are selected at random. Then, some individuals are selected at random from within each cluster.

(Carr et al. 2020:165, EMPHASIS ADDED)


Panel A in Figure 6.2 in Carr et al. (2020)
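Cluster sampling can be sketched in two stages. The example below assumes a hypothetical frame of 600 students living in 20 dorms (the clusters). It picks 4 dorms at random and then 5 students from within each chosen dorm. The dorm names and sizes are invented for illustration.

```r
set.seed(2026)

# Hypothetical frame: 600 students spread evenly across 20 dorms (the clusters).
frame <- data.frame(
  id = 1:600,
  dorm = rep(paste0("Dorm_", 1:20), each = 30)
)

# Stage 1: select some clusters at random.
chosen <- sample(unique(frame$dorm), size = 4)

# Stage 2: select some individuals at random from within each chosen cluster.
within_chosen <- frame[frame$dorm %in% chosen, ]
cluster_sample <- do.call(
  rbind,
  lapply(split(within_chosen, within_chosen$dorm),
         function(d) d[sample(nrow(d), size = 5), ])
)

# Unlike stratified sampling, most groups contribute no one to the sample.
table(cluster_sample$dorm)
```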

Probability Sampling Techniques

Simple Random Sampling

Probability Sampling Techniques

Stratified Sampling

Probability Sampling Techniques

Cluster Sampling

A Note on Weighting

In probability sampling, it is not a problem if some individuals have a higher likelihood of being in the sample than others, as long as we know what those probabilities are. If Person A is x times more likely to be in our sample than Person B, then we give Person A 1/x times as much weight as Person B when computing our estimates.

(Carr et al. 2020:168, EMPHASIS ADDED)
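The 1/x rule follows from weighting each person by the inverse of their selection probability. Here is a small sketch with made-up probabilities for Person A and Person B. The values 0.10 and 0.02 are assumptions, not figures from Carr et al. (2020).

```r
# Hypothetical selection probabilities (not from the lecture):
p_a <- 0.10   # Person A's chance of being sampled
p_b <- 0.02   # Person B's chance of being sampled

# Person A is x times more likely to be sampled than Person B.
x <- p_a / p_b

# Design weights are the inverse of each person's selection probability.
w_a <- 1 / p_a
w_b <- 1 / p_b

# The ratio of the two weights matches 1 / x (up to floating-point error).
all.equal(w_a / w_b, 1 / x)
```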

Let’s say that s_share captures the distribution of islands in our sample, s, while u_share represents the distribution of islands in our target population, U.

     island   s_share   u_share
1    Biscoe 0.5555556 0.4883721
2     Dream 0.3333333 0.3604651
3 Torgersen 0.1111111 0.1511628

Now, here’s the weighted distribution of islands in s after applying post-stratification weights:

set.seed(2026)

# Here's a "dummy" stratified sample of `penguins`:

dummy_sample <- penguins |> 
                slice_sample(prop = 0.03, by = island)


# We have data on the distribution of islands in our target population:

u_share <- penguins |> count(island) |> 
                       mutate(u_share = n/sum(n)) |> 
                       select(-n)

# Here's how we can generate simple post-stratification weights:

weights <- dummy_sample |> count(island) |> 
                           mutate(s_share = n/sum(n)) |> 
                           left_join(u_share, by = "island") |> 
                           mutate(weight = u_share/s_share) |> 
                           select(island, weight)

# And here's how we can apply those weights:

dummy_sample |> left_join(weights, by = "island") |> 
                count(island, wt = weight) |> 
                mutate(s_share = n/sum(n)) |> 
                select(-n)
     island   s_share
1    Biscoe 0.4883721
2     Dream 0.3604651
3 Torgersen 0.1511628

Non-Representative Samples

Another Question: Are non-representative samples ever desirable?

A Group Exercise

Recruiting A Sample

Move around the classroom and form groups of two. You and your teammate are fielding a survey to explore how phenomenon x varies by y (e.g., major) at Amherst College. You have three tasks:

  1. Clearly state your research question(s).

  2. Clearly define your target population(s).

  3. Clearly describe how you’d recruit respondents from Amherst’s student population to arrive at s, your sample. Specifically, how could you draw on:

    • A simple random sample?
    • A stratified sample (specify the strata)?
    • A cluster sample (specify the clusters)?

Survey Design–
February 12th

A Friendly Reminder

Research Memo Deadline

Deadline for Memo on Research Interests

Memos are due tomorrow at 8:00 PM.

Formats, Delivery, Error

Formats

A Basic Distinction

Formats

Adapted from Table 7.1 in Carr et al. (2020)
Characteristic | Cross-Sectional Surveys | Panel Surveys
Cost | Relatively low, as respondents are contacted only once | Relatively high, as respondents are contacted over time
Ease of Administration | Relatively easy | Relatively difficult
Causal Inference | Cannot ascertain causality clearly | Better suited to infer causality due to repeated measures
Sources of Bias | Typically exclude those difficult to reach | Same biases as cross-sectional surveys plus selective attrition
Ability to Document Change | Cannot assess within-person change | Well suited to assess within-person change over time
Attention to Social History | Cannot disentangle age, period, and cohort effects | Can partially disentangle age, period, and cohort effects if multiple cohorts included

Formats


Modes of Delivery

Adapted from Table 7.2 in Carr et al. (2020)
Attribute | Face-to-Face Interview | Mail or Self-Administered Questionnaire | Telephone Interview | Online Surveys
Cost | High | Low | Moderate | Low
Response Rate | High | Low | High | Moderate
Researcher Control Over Interview | High | Low | Moderate | Moderate
Interviewer Effects | High | Low | Moderate | Low

Sources of “Error”

Carr et al. (2020) discuss four major errors that surveys are susceptible to:

  1. Nonresponse

  2. Measurement Error

  3. Coverage Errors

  4. Sampling Error

Make sure you know what these errors refer to.
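As a study aid, here is a rough simulation contrasting two of these errors. Everything in it is hypothetical: a population of 10,000 people, 30% of whom are assumed to lack internet access, and a made-up outcome that differs between the two groups. A random sample from the full population misses the truth only by chance (sampling error), while a sample drawn from an online-only frame is off systematically (coverage error).

```r
set.seed(2026)

# Hypothetical population of 10,000 people; roughly 30% lack internet access.
N <- 10000
online <- rbinom(N, size = 1, prob = 0.7) == 1

# Made-up outcome whose average differs between the two groups.
y <- ifelse(online,
            rnorm(N, mean = 60, sd = 10),
            rnorm(N, mean = 40, sd = 10))

true_mean <- mean(y)

# Sampling error: a random sample from the full population misses only by chance.
full_frame_est <- mean(sample(y, size = 500))

# Coverage error: an online-only frame leaves out part of the target population.
online_frame_est <- mean(sample(y[online], size = 500))

c(truth = true_mean, full_frame = full_frame_est, online_only = online_frame_est)
```

Drawing a larger sample would shrink the first gap but not the second, since the people missing from the frame can never be selected.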

A Few Question Types

Closed-Ended

What is your relationship status?


Rating Scales

How much do you identify as “woke” or “anti-woke”?


Open-Ended

What do you understand by “wokeness” or “being woke”?
Please write at least one complete sentence.

Composite Measures

A Stylized Example


Items x1, x2, and x3 are each scored on a scale from 1 to 5.

Mean Score

Summative Score
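The two aggregate scores can be computed in a couple of lines of base R. The responses below are invented for three hypothetical respondents answering items x1, x2, and x3, each on a 1 to 5 scale.

```r
# Hypothetical responses from three people to items x1, x2, and x3 (each scored 1 to 5).
responses <- data.frame(x1 = c(1, 4, 5),
                        x2 = c(2, 4, 3),
                        x3 = c(3, 5, 5))

items <- c("x1", "x2", "x3")

# Mean score: the average across items for each respondent.
responses$mean_score <- rowMeans(responses[, items])

# Summative score: the total across items for each respondent.
responses$sum_score <- rowSums(responses[, items])

responses
```

The mean score stays on the original 1 to 5 scale, while the summative score ranges from 3 to 15 with three items. Both rank respondents identically when no answers are missing.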

Another Group Exercise

A Bad Survey on Qualtrics

Get into your groups from Tuesday. Then, complete the following tasks.

  • Discuss and review the characteristics of high-quality survey questions (see Carr et al. 2020:213–18).

  • Design a set of 5–10 low-quality (i.e., bad) questions.

  • Then, fire up Qualtrics and create your bad survey.

    Be sure to:

You’ll present your bad survey on Tuesday.

For some guidance on how to use Qualtrics, click here.

See You Tuesday

References

Carr, Deborah S., Elizabeth Heger Boyle, Benjamin Cornwell, Shelley J. Correll, Robert Crosnoe, et al. 2020. The Art and Science of Social Research. Second Edition. New York: W. W. Norton & Company, Inc.
Llaudet, Elena, and Kōsuke Imai. 2023. Data Analysis for Social Science: A Friendly and Practical Introduction. Princeton, NJ: Princeton University Press.