Causality

Task 1: SDO, ATE and Randomization

📂 1. Load the data and create variables

  • Data download link. Load the data with read.csv().

  • Variable descriptions

    • group_dummy: whether the unit received the treatment (1 = treatment group, 0 = control group).
    • Y0: potential outcome if the unit is not treated \((Y_i^0)\).
    • Y1: potential outcome if the unit is treated \((Y_i^1)\).
  • Create the following variables.

    • Observed outcome \(Y_i = D_i \times Y_i^1 + (1 - D_i) \times Y_i^0\)
    • Individual treatment effect \(\delta_i = Y_i^1 - Y_i^0\)
link <- "https://chung-jiwoong.github.io/FMB819-Slides/chapter_causality/data/toy_data_2.csv"
toy_data <- read.csv(link)

library(tidyverse)

toy_data <- toy_data %>%
    mutate(Y = Y1 * group_dummy + (1 - group_dummy) * Y0,
           delta = Y1 - Y0)

📊 2. Compute the ATE and the SDO

  • Compute the ATE (Average Treatment Effect).
  • Compute the SDO (Simple Difference in Mean Outcomes).
  • Is there any bias? How large is it?
ATE = mean(toy_data$delta)
ATE
## [1] 2.10418
treatment_mean <- mean(toy_data$Y[toy_data$group == "treatment"])
control_mean <- mean(toy_data$Y[toy_data$group == "control"])
SDO <- treatment_mean - control_mean

The SDO is about 50% larger than the ATE (3.21 versus 2.10), so the bias is roughly 1.10.
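As a quick check on the size of the bias, the two estimates can be compared directly. The sketch below hard-codes the ATE and SDO values printed in this document (2.10418 and 3.208584) instead of recomputing them from the data. The names `bias` and `relative_bias` are introduced here only for illustration.

```r
# Values copied from the outputs printed in this document (not recomputed)
ATE <- 2.10418
SDO <- 3.208584

# Bias of the naive comparison: how far the SDO is from the ATE
bias <- SDO - ATE
bias

# Bias expressed as a fraction of the ATE
relative_bias <- bias / ATE
relative_bias
```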

🔀 3. Compute the SDO in randomly assigned data

  • New data download link. In this dataset the same individuals have been randomly assigned to treatment (random assignment).

  • Recompute \(Y_i\). It must be recalculated to match the new treatment assignment.

  • Compute the SDO under random assignment

    • The bias should be close to 0.
    • But why is it not exactly 0?
link_rand <- "https://chung-jiwoong.github.io/FMB819-Slides/chapter_causality/data/toy_data_random.csv"
toy_data_random <- read.csv(link_rand)

toy_data_random <- toy_data_random %>%
    mutate(Y = Y1 * group_random_dummy + (1 - group_random_dummy) * Y0)

SDO_random = mean(toy_data_random$Y[toy_data_random$group_random == "treatment"]) - mean(toy_data_random$Y[toy_data_random$group_random == "control"])
SDO_random
## [1] 2.047158
ATE - SDO_random
## [1] 0.05702187

The gap between the ATE and the SDO is 0.057 (ATE - SDO_random), which is small but not exactly 0. The reason is that the sample is not large enough for the chance differences between the two randomly formed groups to cancel out completely.
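The role of sample size can be illustrated with a small simulation. This is a sketch on made-up data, not the toy data used above. The helper `sdo_gap`, the distributions of the potential outcomes, the seed, and the sample sizes are all assumptions chosen for illustration.

```r
set.seed(819)  # arbitrary seed, used only for reproducibility

# Hypothetical helper: simulate potential outcomes, assign treatment at random,
# and return the gap between the SDO and the sample ATE.
sdo_gap <- function(n) {
  y0 <- rnorm(n, mean = 10, sd = 2)           # untreated potential outcome
  y1 <- y0 + rnorm(n, mean = 2, sd = 1)       # treated potential outcome
  d  <- sample(rep(c(0, 1), length.out = n))  # random assignment, half treated
  y  <- d * y1 + (1 - d) * y0                 # observed outcome
  sdo <- mean(y[d == 1]) - mean(y[d == 0])
  ate <- mean(y1 - y0)
  sdo - ate
}

# Average absolute gap over 200 re-randomizations for each sample size
sizes <- c(50, 500, 5000)
avg_gap <- sapply(sizes, function(n) mean(abs(replicate(200, sdo_gap(n)))))
names(avg_gap) <- sizes
avg_gap
```

Under these assumptions the average gap shrinks as the sample grows, but for any finite sample a single random assignment will usually leave a small nonzero difference.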

📌 4. (Optional) Check the components of the bias

  1. Compute the selection bias.
  2. Compute the heterogeneous treatment effect bias.
  3. Verify that the following equation holds: \[ SDO = ATE + \text{Selection Bias} + \text{Heterogeneous Treatment Effect Bias} \]
selection_bias = mean(toy_data$Y0[toy_data$group == "treatment"]) - mean(toy_data$Y0[toy_data$group == "control"])

het_trt_effect_bias = (1 - sum(toy_data$group == "treatment") / nrow(toy_data)) * (mean(toy_data$delta[toy_data$group == "treatment"]) - mean(toy_data$delta[toy_data$group == "control"]))

SDO
## [1] 3.208584
ATE + selection_bias + het_trt_effect_bias
## [1] 3.208584
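For reference, the decomposition verified above can be written out term by term, matching the quantities computed in the code. The symbol \(\pi\) for the share of treated units and the labels ATT and ATU (average effect on the treated and on the untreated) are notation introduced here, not taken from the original exercise.

```latex
\[
\underbrace{E[Y_i \mid D_i = 1] - E[Y_i \mid D_i = 0]}_{SDO}
= \underbrace{E[\delta_i]}_{ATE}
+ \underbrace{E[Y_i^0 \mid D_i = 1] - E[Y_i^0 \mid D_i = 0]}_{\text{Selection Bias}}
+ \underbrace{(1 - \pi)\,(ATT - ATU)}_{\text{Heterogeneous Treatment Effect Bias}}
\]
\[
\pi = P(D_i = 1), \qquad
ATT = E[\delta_i \mid D_i = 1], \qquad
ATU = E[\delta_i \mid D_i = 0]
\]
```

The identity follows from \(SDO = ATT + \text{Selection Bias}\) together with \(ATE = \pi \, ATT + (1 - \pi) \, ATU\), which gives \(ATT = ATE + (1 - \pi)(ATT - ATU)\).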

Task 2: STAR data

📂 1. Load the data: data download link
- Store the data in the star_df object. Check the help page for the variable descriptions.
(📌 The data have been reshaped, so ignore suffixes such as "k" or "1" at the end of variable names.)

star_df = read_csv("https://chung-jiwoong.github.io/FMB819-Slides/chapter_causality/data/star_data.csv")

🔍 2. Check the basic information about the data

  • What is the unit of observation?
str(star_df)

The unit of observation is the student-grade pair (one row per student per grade).

    (i) Random class assignment: star, (ii) student's grade: grade, (iii) outcomes of interest: read & math

📊 3. Data size and missing values (NA)

  • How many observations are there in total? The original unit of observation was the student, but the data were restructured to the grade-student level, so there are many NA values. NAs also correspond to students who left the experiment for various reasons.
# Total number of observations
nrow(star_df)

# Total number of missing values
sum(is.na(star_df))

# Number of missing values in each variable
colSums(is.na(star_df))

🚀 4. Handle missing values (drop NAs)

  • Run the following code to keep only the rows with no missing values:
star_df <- star_df[complete.cases(star_df),] # or
star_df <- na.omit(star_df)
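The two calls above are alternatives that keep the same rows. A minimal sketch on a made-up data frame (the object `df` and its values are hypothetical, not the STAR data) shows this:

```r
# Hypothetical toy data frame with a few missing scores
df <- data.frame(
  id   = 1:4,
  read = c(430, NA, 455, 470),
  math = c(480, 500, NA, 510)
)

# Option 1: logical row filter, TRUE only for rows without any NA
kept_cc <- df[complete.cases(df), ]

# Option 2: na.omit() drops the same rows and also records them in an attribute
kept_omit <- na.omit(df)

nrow(kept_cc)
nrow(kept_omit)
```

Either call is enough on its own. Running both, as in the code above, is harmless because the second finds nothing left to drop.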

📈 5. Check the random assignment (Balancing Checks)

  • To check whether the random assignment worked, compute basic summary statistics by group.
  • Compare the average share (%) of each of the following by grade (grade) and by treatment class: 1️⃣ percentage of girls, 2️⃣ percentage of African Americans, 3️⃣ percentage of free lunch qualifiers.

(📌 Hint: the share of girls can be computed with the following code (requires dplyr): share_female = mean(gender == "female") * 100.)

star_df %>%
    group_by(grade, star) %>%
    summarise(
        share_female = mean(gender == "female") * 100,
        share_african_american = mean(ethnicity == "afam") * 100,
        share_free_lunch = mean(lunch == "free") * 100)
## # A tibble: 12 × 5
## # Groups:   grade [4]
##    grade star         share_female share_african_american share_free_lunch
##    <chr> <chr>               <dbl>                  <dbl>            <dbl>
##  1 1     regular              48.7                   37.3             51.5
##  2 1     regular+aide         47.8                   29.8             50.8
##  3 1     small                48.6                   31.9             47.9
##  4 2     regular              48.3                   36.7             50.6
##  5 2     regular+aide         47.7                   33.8             48.2
##  6 2     small                49.1                   33.3             46.6
##  7 3     regular              48.9                   35.2             49.7
##  8 3     regular+aide         47.2                   34.3             49.1
##  9 3     small                50.1                   31.6             46.8
## 10 k     regular              48.5                   28.5             46.0
## 11 k     regular+aide         48.8                   32.2             49.3
## 12 k     small                48.0                   30.2             46.7

Task 3

📌 1. Run the code below to keep only first-grade students (grade == "1") who are in a regular class (regular) or a small class (small).

star_df_1_small <- star_df %>%
    filter(star %in% c("small","regular") & grade == "1")

📊 2. Compute the mean math score of the two groups and their difference (using base R)

mean_small = mean(star_df_1_small$math[star_df_1_small$star == "small"])
mean_small
## [1] 539.0885
mean_regular = mean(star_df_1_small$math[star_df_1_small$star == "regular"])
mean_regular
## [1] 526.4434
ATE = mean_small - mean_regular
ATE
## [1] 12.64506

🔄 3. Create a dummy variable: small class (small) = 1 (TRUE), regular class (regular) = 0 (FALSE). Hint: treatment = (star == "small").

star_df_1_small <- star_df_1_small %>%
    mutate(treatment = (star == "small"))
table(star_df_1_small$treatment)
## 
## FALSE  TRUE 
##  2359  1786

📈 4. Run the regression

❓ 5. Interpret the results: does the regression result match the difference in means found in question 2?

The intercept is the expected math score of first-grade students in the control group (that is, regular-size classes): 526.44 points. This can be checked directly by comparing the coefficient with the corresponding mean computed in question 2.

The slope coefficient is the difference in expected math scores between first-grade students in the treatment group and those in the control group. That is, first graders in small classes are expected to score 12.65 points higher on average than those in regular-size classes. This coefficient can likewise be compared with the difference in means computed in question 2.
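The regression in step 4 and the comparison described above can be sketched as follows. The example runs on simulated data so that it is self-contained. The data frame `toy`, its sample size, and the score distribution are assumptions. With the data prepared in this task, the analogous call would presumably be `lm(math ~ treatment, data = star_df_1_small)`.

```r
set.seed(1)  # arbitrary seed, used only for reproducibility

# Hypothetical stand-in for star_df_1_small: a logical treatment dummy
# and a made-up math score
n <- 200
treatment <- rep(c(TRUE, FALSE), each = n / 2)
math <- 520 + 12 * treatment + rnorm(n, sd = 40)
toy <- data.frame(math, treatment)

# Regression of the outcome on the treatment dummy
fit <- lm(math ~ treatment, data = toy)
coef(fit)

# Group means computed by hand, as in question 2
mean_control <- mean(toy$math[!toy$treatment])
diff_means <- mean(toy$math[toy$treatment]) - mean_control

# The intercept equals the control mean; the slope equals the difference in means
all.equal(coef(fit)[["(Intercept)"]], mean_control)
all.equal(coef(fit)[["treatmentTRUE"]], diff_means)
```

With a single binary regressor, OLS reproduces the two group means exactly, which is why the regression coefficients and the base R calculation agree.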