Chapter 18 He or She Analysis
We examine the words that immediately follow “he” or “she” in the text. This section draws inspiration from a blog post by David Robinson.
train %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(word1 %in% c("he", "she"))
## # A tibble: 5,680 x 5
## id author len word1 word2
## <chr> <chr> <int> <chr> <chr>
## 1 id00004 EAP 134 he might
## 2 id00004 EAP 134 he necessarily
## 3 id00017 EAP 469 he makes
## 4 id00029 MWS 115 he entered
## 5 id00035 HPL 75 he was
## 6 id00036 HPL 201 he had
## 7 id00037 MWS 274 he owned
## 8 id00037 MWS 274 he the
## 9 id00043 HPL 167 he absorbed
## 10 id00045 MWS 237 he found
## # ... with 5,670 more rows
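To see what this pipeline does on a small scale, here is a minimal sketch on made-up data. The `toy` tibble and its two sentences are hypothetical stand-ins for `train`, not part of the dataset. The sketch relies on `unnest_tokens()` lowercasing the text and dropping punctuation by default, which is why "She" and "He" still match the lowercase filter.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Hypothetical toy data standing in for `train` (only `text` is needed here)
toy <- tibble(
  id   = c("a", "b"),
  text = c("She cried and he spoke.", "He found that she loved him.")
)

toy_bigrams <- toy %>%
  # Overlapping pairs of consecutive words, lowercased, punctuation removed
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  # Split each bigram into its first and second word
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  # Keep only the pairs whose first word is a pronoun of interest
  filter(word1 %in% c("he", "she"))

toy_bigrams
```

Of the nine bigrams in the two sentences, only four start with a pronoun: "she cried", "he spoke", "he found", and "she loved". Pairs such as "and he" or "that she" are dropped because the pronoun is the second word, not the first.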
18.1 Gender-associated verbs
Which words were most shifted towards occurring after “he” or “she”? We’ll filter for words that appeared at least 20 times.
train %>%
  # Split each text into bigrams and keep those that start with "he" or "she"
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(word1 %in% c("he", "she")) %>%
  # One row per following word, with a count column for each pronoun
  count(word1, word2) %>%
  spread(word1, n, fill = 0) %>%
  # Add-one smoothing, then the log2 ratio of the "she" share to the "he" share
  mutate(total = he + she,
         he = (he + 1) / sum(he + 1),
         she = (she + 1) / sum(she + 1),
         log_ratio = log2(she / he),
         abs_ratio = abs(log_ratio)) %>%
  arrange(desc(log_ratio)) %>%
  # Drop reflexive pronouns, stop words, and words seen fewer than 20 times
  filter(!word2 %in% c("himself", "herself"),
         !word2 %in% stop_words$word,
         total >= 20) %>%
  # Keep the 15 most shifted words in each direction
  group_by(direction = ifelse(log_ratio > 0, 'More "she"', "More 'he'")) %>%
  top_n(15, abs_ratio) %>%
  ungroup() %>%
  mutate(word2 = reorder(word2, log_ratio)) %>%
  ggplot(aes(word2, log_ratio, fill = direction)) +
  geom_col() +
  coord_flip() +
  labs(x = "",
       y = 'Relative appearance after "she" compared to "he"',
       fill = "",
       title = "Gender associated with Verbs") +
  scale_y_continuous(labels = c("4X", "2X", "Same", "2X"),
                     breaks = seq(-2, 1)) +
  guides(fill = guide_legend(reverse = TRUE)) +
  theme_bw()
“She cried”, “she loved”, “she died”, and “she heard” are relatively more common, while “he told”, “he spoke”, “he sat”, “he wished”, and “he found” are relatively more common on the “he” side.
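The smoothed log ratio behind the plot can be checked by hand on a few numbers. The sketch below uses base R only, and the counts are hypothetical, chosen so the arithmetic is easy to follow; they are not taken from the data. Adding 1 to every count before normalising keeps the ratio finite when a word never follows one of the pronouns.

```r
# Hypothetical counts of each word after "he" and after "she" (not real data)
counts <- data.frame(
  word2 = c("cried", "told", "found"),
  he    = c(1, 7, 5),
  she   = c(7, 1, 5)
)

# Add-one smoothing, then turn each column into a share of its own total
he_share  <- (counts$he + 1) / sum(counts$he + 1)    # 2/16, 8/16, 6/16
she_share <- (counts$she + 1) / sum(counts$she + 1)  # 8/16, 2/16, 6/16

# log2 of the "she" share over the "he" share
counts$log_ratio <- log2(she_share / he_share)
counts$log_ratio  # 2, -2, 0
```

A log ratio of 2 means the word is four times as likely after "she" as after "he", -2 means four times as likely after "he", and 0 means no difference. This is why the axis breaks at -2, -1, 0, and 1 are labelled "4X", "2X", "Same", and "2X".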