library(tidyverse)
library(quanteda)
library(readtext)
library(striprtf)
library(corpustools)
library(quanteda.textplots)
library(readr)
library(topicmodels)
library(tidytext)
library(dplyr)
library(ggplot2)
library(plotly)
library(tidyr)
library(tm)
library(stm)
Topic Modelling based on Gender Corpora in Sports News
Introduction
Newspapers often reflect the gender biases and gender roles present in society. Rao and Taboada (2021) found that English Canadian newspapers quote women more often in the Lifestyle, Entertainment, Arts and Healthcare categories, and men more often in the Business, Sports and United States Politics categories. Even within a single field such as sports, articles about men’s sports provide details of the sporting events, while articles about women’s sports focus only on the women’s achievements. Similarly, Devinney et al. (2020) studied mainstream English news articles, mainstream Swedish news articles and LGBTQ+ web content, and found that feminine topics were linked to the private sphere and masculine topics to the public sphere.
Research aim: To understand whether Indian newspapers reported women’s and men’s sports differently during the Tokyo Olympics held in 2021.
Data used in the project
The LexisNexis database was used to collect articles from July 22 to August 9, 2021 (the time when the 2021 Olympics were held).
The data included articles from Hindustan Times, Times of India (Electronic Edition), Free Press Journal (India), The Telegraph (India), Indian Express, Mint, DNA, India Today Online, The Hindu and Economic Times (E-Paper Edition).
The keyword searched was “Olympics”, and the following filters were applied: Men’s Sports, Women’s Sports, Sports Awards, Sports & Recreation, India and Newspapers.
Methodology
The quanteda package was used for preprocessing. The corpora used were either the entire set of files or a subset depending on the model used. Punctuation and stopwords were removed from the corpora. Additionally, words such as Olympics, India and Tokyo were removed to derive more meaningful results.
Structural Topic Modelling and LDA Topic Modelling were employed using the stm and topicmodels packages respectively. For this, subsets of the dataset were utilised to create corpora. These corpora were made using the metadata which had classification tags such as sports, women’s sports and men’s sports. The articles were categorised as men’s sports, women’s sports or both.
For structural topic modelling, the corpus had 468 articles. However, the structural topic model was not insightful: it showed only that women who played particular sports were mentioned more often in the women’s section, and vice versa. The terms used to describe the events did not appear when the model was run.
Hence, the final analysis used Latent Dirichlet Allocation (LDA) topic models fitted separately to the men’s and women’s corpora. A third corpus contained the articles that carried both the women’s sports and men’s sports tags.
There were 191 articles in the men’s sports corpus, 277 articles in the women’s sports corpus and 148 articles in the corpus tagged with both men’s and women’s sports.
For the LDA topic models and structural topic models, the searchK() function from the stm package was used to determine the optimal number of topics.
Semantic Network
The semantic network displayed here was made using 1128 articles.
I limited the document-feature matrix to terms that appeared at least 15 times and in at least 25% of the documents. This left 50 terms, which I plotted.
Unsurprisingly, the network shows that most of the articles discuss India at the Olympics (as Indian newspaper articles were used). One major theme is the hockey team: the men’s team, led by captain Manpreet Singh, placed third and made history by winning India’s first Olympic hockey medal in over four decades. Other significant terms include medals and medal colours, perhaps pertaining to victories by other Indian athletes; these are seen more clearly in the LDA topic models in the following sections.
articles_dfm <- readRDS(file = "_data/News_DFMForSemNet.rds")
dfm_refined <- dfm_trim(articles_dfm, min_termfreq = 15)
dfm_refined <- dfm_trim(dfm_refined, min_docfreq = .25, docfreq_type = "prop")
fcm <- fcm(dfm_refined)
dim(fcm)
[1] 50 50
top_features <- names(topfeatures(fcm, 50))
fcm_refined <- fcm_select(fcm, pattern = top_features, selection = "keep")
dim(fcm_refined)
[1] 50 50
size <- log(colSums(fcm_refined))
textplot_network(fcm_refined, vertex_size = size / max(size) * 3)
Reading in the files for the LDA topic models
df_All <- readRDS(file = "_data/FilesClassificationNoDuplicates.rds")
df_All <- df_All %>% distinct(body, .keep_all = TRUE)
df_3 <- df_All %>% split(df_All$Classification)
df_Men <- df_3$Men
dim(df_Men)
[1] 191 4
df_Women <- df_3$Women
dim(df_Women)
[1] 277 4
df_Both <- df_3$MenAndWomen
dim(df_Both)
[1] 148 4
Preprocessing for each corpus
Since the dataset was divided into 3 parts to be analysed separately, preprocessing for each part had to be conducted. Each dataframe was converted into a corpus and there was a check for metadata. After this, tokens were created, and punctuation and stopwords were removed.
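Because the same steps are repeated for all three parts, they could also be wrapped in a single function. The following is only a minimal sketch under stated assumptions: `preprocess_articles()` and the toy data frame are hypothetical and not part of the original analysis, and `tokens_remove()` is used as a shorthand for removing the stopword pattern.

```r
library(quanteda)

# Hypothetical helper (not part of the original analysis): wraps the
# corpus -> tokens -> punctuation removal -> stopword removal steps.
preprocess_articles <- function(df, text_field = "body") {
  corp <- corpus(df, text_field = text_field)
  toks <- tokens(corp, remove_punct = TRUE)
  tokens_remove(toks, pattern = stopwords("en"))
}

# Toy data frame standing in for df_Women, df_Men or df_Both
toy_df <- data.frame(
  body = c("The team won a medal, and the nation cheered.",
           "She is in the final of the event!"),
  newspaper = c("Paper A", "Paper B"),
  stringsAsFactors = FALSE
)

toy_tokens <- preprocess_articles(toy_df)
print(toy_tokens)
```

With such a helper, each of the three data frames would go through identical preprocessing, which reduces the risk of the steps drifting apart between corpora.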
Women’s corpus
corpus_w <- corpus(df_Women, text_field = "body")
head(corpus_w)
Corpus consisting of 6 documents and 3 docvars.
text1 :
" From a solitary two-day fixture between Great Britain and F..."
text2 :
" Tokyo Olympics Day 10 Full Schedule: Kamalpreet Kaur stunne..."
text3 :
" Haryana CM Manohar Lal Khattar has announced Rs.50 lakh cas..."
text4 :
" India's quest for another medal will continue on Day 8 of t..."
text5 :
" India vs Argentina Women's Hockey Semifinal Match Live Stre..."
text6 :
" India would fancy their chances of a medal finish as they g..."
corpus_w_summary <- summary(corpus_w)
head(corpus_w_summary)
Text Types Tokens Sentences newspaper date Classification
1 text1 260 542 22 Hindustan Times August 9, 2021 Women
2 text2 106 220 5 Hindustan Times August 2, 2021 Women
3 text3 126 298 12 Hindustan Times August 7, 2021 Women
4 text4 168 326 5 Hindustan Times July 31, 2021 Women
5 text5 104 292 12 Hindustan Times August 4, 2021 Women
6 text6 196 393 6 Hindustan Times July 30, 2021 Women
#corpus_w_summary$Tokens
#docvars(corpus_w)
corpus_w_tokens <- tokens(corpus_w)
head(corpus_w_tokens)
Tokens consisting of 6 documents and 3 docvars.
text1 :
[1] "From" "a" "solitary" "two-day" "fixture" "between"
[7] "Great" "Britain" "and" "France" "in" "the"
[ ... and 530 more ]
text2 :
[1] "Tokyo" "Olympics" "Day" "10" "Full"
[6] "Schedule" ":" "Kamalpreet" "Kaur" "stunned"
[11] "the" "nation"
[ ... and 208 more ]
text3 :
[1] "Haryana" "CM" "Manohar" "Lal" "Khattar" "has"
[7] "announced" "Rs" "." "50" "lakh" "cash"
[ ... and 286 more ]
text4 :
[1] "India's" "quest" "for" "another" "medal" "will"
[7] "continue" "on" "Day" "8" "of" "the"
[ ... and 314 more ]
text5 :
[1] "India" "vs" "Argentina" "Women's" "Hockey" "Semifinal"
[7] "Match" "Live" "Streaming" "," "Tokyo" "Olympics"
[ ... and 280 more ]
text6 :
[1] "India" "would" "fancy" "their" "chances" "of" "a"
[8] "medal" "finish" "as" "they" "gear"
[ ... and 381 more ]
corpus_w_tokens <- tokens(corpus_w_tokens,
                          remove_punct = TRUE)
head(corpus_w_tokens)
Tokens consisting of 6 documents and 3 docvars.
text1 :
[1] "From" "a" "solitary" "two-day" "fixture" "between"
[7] "Great" "Britain" "and" "France" "in" "the"
[ ... and 464 more ]
text2 :
[1] "Tokyo" "Olympics" "Day" "10" "Full"
[6] "Schedule" "Kamalpreet" "Kaur" "stunned" "the"
[11] "nation" "with"
[ ... and 171 more ]
text3 :
[1] "Haryana" "CM" "Manohar" "Lal" "Khattar" "has"
[7] "announced" "Rs" "50" "lakh" "cash" "award"
[ ... and 253 more ]
text4 :
[1] "India's" "quest" "for" "another" "medal" "will"
[7] "continue" "on" "Day" "8" "of" "the"
[ ... and 265 more ]
text5 :
[1] "India" "vs" "Argentina" "Women's" "Hockey" "Semifinal"
[7] "Match" "Live" "Streaming" "Tokyo" "Olympics" "Winning"
[ ... and 249 more ]
text6 :
[1] "India" "would" "fancy" "their" "chances" "of" "a"
[8] "medal" "finish" "as" "they" "gear"
[ ... and 317 more ]
corpus_w_tokens <- tokens_select(corpus_w_tokens,
                                 pattern = stopwords("en"),
                                 selection = "remove")
head(corpus_w_tokens)
Tokens consisting of 6 documents and 3 docvars.
text1 :
[1] "solitary" "two-day" "fixture" "Great" "Britain" "France"
[7] "1900" "Olympics" "prospects" "cricket's" "inclusion" "8-team"
[ ... and 271 more ]
text2 :
[1] "Tokyo" "Olympics" "Day" "10" "Full"
[6] "Schedule" "Kamalpreet" "Kaur" "stunned" "nation"
[11] "64m" "throw"
[ ... and 138 more ]
text3 :
[1] "Haryana" "CM" "Manohar" "Lal" "Khattar" "announced"
[7] "Rs" "50" "lakh" "cash" "award" "every"
[ ... and 155 more ]
text4 :
[1] "India's" "quest" "another" "medal" "will" "continue"
[7] "Day" "8" "Tokyo" "Olympics" "ace" "shuttler"
[ ... and 207 more ]
text5 :
[1] "India" "vs" "Argentina" "Women's" "Hockey" "Semifinal"
[7] "Match" "Live" "Streaming" "Tokyo" "Olympics" "Winning"
[ ... and 185 more ]
text6 :
[1] "India" "fancy" "chances" "medal" "finish" "gear"
[7] "Day" "7" "Tokyo" "Olympics" "eyes" "archer"
[ ... and 251 more ]
Men’s corpus
corpus_m <- corpus(df_Men, text_field = "body")
head(corpus_m)
Corpus consisting of 6 documents and 3 docvars.
text1 :
" India's 41-year-long wait for an Olympic medal in hockey ca..."
text2 :
" The Board of Control for Cricket in India (BCCI) on Saturda..."
text3 :
" Indian athletes who won laurels for the nation at Tokyo Oly..."
text4 :
" India vs Belgium Hockey Match Live Streaming, Tokyo Olympic..."
text5 :
" India vs Germany Hockey Match Live Streaming, Tokyo Olympic..."
text6 :
" Indian men's hockey team on Sunday defeated Great Britain i..."
corpus_m_summary <- summary(corpus_m)
head(corpus_m_summary)
Text Types Tokens Sentences newspaper date Classification
1 text1 206 547 22 Hindustan Times August 5, 2021 Men
2 text2 156 299 12 Hindustan Times August 7, 2021 Men
3 text3 290 644 29 MINT August 9, 2021 Men
4 text4 117 289 12 Hindustan Times August 2, 2021 Men
5 text5 120 300 12 Hindustan Times August 5, 2021 Men
6 text6 234 522 26 MINT August 1, 2021 Men
#corpus_m_summary$Tokens
#docvars(corpus_m)
corpus_m_tokens <- tokens(corpus_m)
head(corpus_m_tokens)
Tokens consisting of 6 documents and 3 docvars.
text1 :
[1] "India's" "41-year-long" "wait" "for" "an"
[6] "Olympic" "medal" "in" "hockey" "came"
[11] "to" "an"
[ ... and 535 more ]
text2 :
[1] "The" "Board" "of" "Control" "for" "Cricket" "in"
[8] "India" "(" "BCCI" ")" "on"
[ ... and 287 more ]
text3 :
[1] "Indian" "athletes" "who" "won" "laurels" "for"
[7] "the" "nation" "at" "Tokyo" "Olympics" "were"
[ ... and 632 more ]
text4 :
[1] "India" "vs" "Belgium" "Hockey" "Match" "Live"
[7] "Streaming" "," "Tokyo" "Olympics" "Live" "Streaming"
[ ... and 277 more ]
text5 :
[1] "India" "vs" "Germany" "Hockey" "Match" "Live"
[7] "Streaming" "," "Tokyo" "Olympics" ":" "After"
[ ... and 288 more ]
text6 :
[1] "Indian" "men's" "hockey" "team"
[5] "on" "Sunday" "defeated" "Great"
[9] "Britain" "in" "the" "quarterfinals"
[ ... and 510 more ]
corpus_m_tokens <- tokens(corpus_m_tokens,
                          remove_punct = TRUE)
head(corpus_m_tokens)
Tokens consisting of 6 documents and 3 docvars.
text1 :
[1] "India's" "41-year-long" "wait" "for" "an"
[6] "Olympic" "medal" "in" "hockey" "came"
[11] "to" "an"
[ ... and 482 more ]
text2 :
[1] "The" "Board" "of" "Control" "for" "Cricket"
[7] "in" "India" "BCCI" "on" "Saturday" "decided"
[ ... and 257 more ]
text3 :
[1] "Indian" "athletes" "who" "won" "laurels" "for"
[7] "the" "nation" "at" "Tokyo" "Olympics" "were"
[ ... and 551 more ]
text4 :
[1] "India" "vs" "Belgium" "Hockey" "Match" "Live"
[7] "Streaming" "Tokyo" "Olympics" "Live" "Streaming" "The"
[ ... and 245 more ]
text5 :
[1] "India" "vs" "Germany" "Hockey" "Match" "Live"
[7] "Streaming" "Tokyo" "Olympics" "After" "going" "down"
[ ... and 262 more ]
text6 :
[1] "Indian" "men's" "hockey" "team"
[5] "on" "Sunday" "defeated" "Great"
[9] "Britain" "in" "the" "quarterfinals"
[ ... and 447 more ]
corpus_m_tokens <- tokens_select(corpus_m_tokens,
                                 pattern = stopwords("en"),
                                 selection = "remove")
head(corpus_m_tokens)
Tokens consisting of 6 documents and 3 docvars.
text1 :
[1] "India's" "41-year-long" "wait" "Olympic" "medal"
[6] "hockey" "came" "end" "Thursday" "men's"
[11] "hockey" "team"
[ ... and 290 more ]
text2 :
[1] "Board" "Control" "Cricket" "India" "BCCI"
[6] "Saturday" "decided" "celebrate" "India's" "successful"
[11] "ever" "campaign"
[ ... and 172 more ]
text3 :
[1] "Indian" "athletes" "won" "laurels" "nation" "Tokyo"
[7] "Olympics" "honoured" "grand" "ceremony" "Delhi" "Sport"
[ ... and 329 more ]
text4 :
[1] "India" "vs" "Belgium" "Hockey" "Match" "Live"
[7] "Streaming" "Tokyo" "Olympics" "Live" "Streaming" "stage"
[ ... and 189 more ]
text5 :
[1] "India" "vs" "Germany" "Hockey" "Match" "Live"
[7] "Streaming" "Tokyo" "Olympics" "going" "fighting" "current"
[ ... and 196 more ]
text6 :
[1] "Indian" "men's" "hockey" "team"
[5] "Sunday" "defeated" "Great" "Britain"
[9] "quarterfinals" "3-1" "Now" "Indian"
[ ... and 285 more ]
Both men and women
corpus_b <- corpus(df_Both, text_field = "body")
head(corpus_b)
Corpus consisting of 6 documents and 3 docvars.
text1 :
" The Tokyo Olympics 2020 enters the fifth day which will beg..."
text2 :
" Day 5 of the Tokyo Olympics on Wednesday was a hot and cold..."
text3 :
" Day 8 of the Tokyo Olympics wasn't great in particular for ..."
text4 :
" India men's hockey team defeated Germany to win the bronze ..."
text5 :
" Neeraj Chopra on Saturday not only won gold for the country..."
text6 :
" While the Indian Men's Hockey team created history on Thurs..."
corpus_b_summary <- summary(corpus_b)
head(corpus_b_summary)
Text Types Tokens Sentences newspaper date Classification
1 text1 153 305 8 Hindustan Times July 28, 2021 MenAndWomen
2 text2 197 348 6 Hindustan Times July 29, 2021 MenAndWomen
3 text3 166 255 8 Hindustan Times August 1, 2021 MenAndWomen
4 text4 221 431 14 Hindustan Times August 5, 2021 MenAndWomen
5 text5 216 479 20 MINT August 7, 2021 MenAndWomen
6 text6 241 522 26 Hindustan Times August 5, 2021 MenAndWomen
#corpus_b_summary$Tokens
#docvars(corpus_b)
corpus_b_tokens <- tokens(corpus_b)
head(corpus_b_tokens)
Tokens consisting of 6 documents and 3 docvars.
text1 :
[1] "The" "Tokyo" "Olympics" "2020" "enters" "the"
[7] "fifth" "day" "which" "will" "begin" "with"
[ ... and 293 more ]
text2 :
[1] "Day" "5" "of" "the" "Tokyo" "Olympics"
[7] "on" "Wednesday" "was" "a" "hot" "and"
[ ... and 336 more ]
text3 :
[1] "Day" "8" "of" "the" "Tokyo"
[6] "Olympics" "wasn't" "great" "in" "particular"
[11] "for" "India"
[ ... and 243 more ]
text4 :
[1] "India" "men's" "hockey" "team" "defeated" "Germany"
[7] "to" "win" "the" "bronze" "medal" "at"
[ ... and 419 more ]
text5 :
[1] "Neeraj" "Chopra" "on" "Saturday" "not" "only"
[7] "won" "gold" "for" "the" "country" ","
[ ... and 467 more ]
text6 :
[1] "While" "the" "Indian" "Men's" "Hockey" "team"
[7] "created" "history" "on" "Thursday" "by" "winning"
[ ... and 510 more ]
corpus_b_tokens <- tokens(corpus_b_tokens,
                          remove_punct = TRUE)
head(corpus_b_tokens)
Tokens consisting of 6 documents and 3 docvars.
text1 :
[1] "The" "Tokyo" "Olympics" "2020" "enters" "the"
[7] "fifth" "day" "which" "will" "begin" "with"
[ ... and 255 more ]
text2 :
[1] "Day" "5" "of" "the" "Tokyo" "Olympics"
[7] "on" "Wednesday" "was" "a" "hot" "and"
[ ... and 274 more ]
text3 :
[1] "Day" "8" "of" "the" "Tokyo"
[6] "Olympics" "wasn't" "great" "in" "particular"
[11] "for" "India"
[ ... and 215 more ]
text4 :
[1] "India" "men's" "hockey" "team" "defeated" "Germany"
[7] "to" "win" "the" "bronze" "medal" "at"
[ ... and 377 more ]
text5 :
[1] "Neeraj" "Chopra" "on" "Saturday" "not" "only"
[7] "won" "gold" "for" "the" "country" "but"
[ ... and 394 more ]
text6 :
[1] "While" "the" "Indian" "Men's" "Hockey" "team"
[7] "created" "history" "on" "Thursday" "by" "winning"
[ ... and 451 more ]
corpus_b_tokens <- tokens_select(corpus_b_tokens,
                                 pattern = stopwords("en"),
                                 selection = "remove")
#print(corpus_b_tokens)
head(corpus_b_tokens)
Tokens consisting of 6 documents and 3 docvars.
text1 :
[1] "Tokyo" "Olympics" "2020" "enters" "fifth" "day"
[7] "will" "begin" "Indian" "women's" "Hockey" "team"
[ ... and 192 more ]
text2 :
[1] "Day" "5" "Tokyo" "Olympics" "Wednesday" "hot"
[7] "cold" "affair" "Shuttler" "PV" "Sindhu" "advanced"
[ ... and 217 more ]
text3 :
[1] "Day" "8" "Tokyo" "Olympics" "great"
[6] "particular" "India" "top" "guns" "failed"
[11] "make" "mark"
[ ... and 145 more ]
text4 :
[1] "India" "men's" "hockey" "team" "defeated" "Germany"
[7] "win" "bronze" "medal" "Tokyo" "Olympics" "Thursday"
[ ... and 239 more ]
text5 :
[1] "Neeraj" "Chopra" "Saturday" "won" "gold" "country"
[7] "also" "helped" "surpass" "previous" "best" "haul"
[ ... and 266 more ]
text6 :
[1] "Indian" "Men's" "Hockey" "team" "created" "history"
[7] "Thursday" "winning" "Bronze" "medal" "Olympics" "Germany"
[ ... and 217 more ]
Creating document feature matrices
dfm_women <- dfm(tokens(corpus_w_tokens))
dfm_women <- dfm_remove(dfm_women, c("said","also","says","can","just"), verbose = TRUE)
removed 5 features
dfm_men <- dfm(tokens(corpus_m_tokens))
dfm_men <- dfm_remove(dfm_men, c("said","also","says","can","just"), verbose = TRUE)
removed 5 features
dfm_both <- dfm(tokens(corpus_b_tokens))
dfm_both <- dfm_remove(dfm_both, c("said","also","says","can","just"), verbose = TRUE)
removed 5 features
Search K for each corpus
The optimal number of topics was checked for each corpus.
For women
Based on semantic coherence, the number of topics for the women’s corpus was chosen as 9.
diffKwomen <- searchK(dfm_women,
                      K = c(5,6,7,8,9,10),
                      N = floor(0.1 * 277),
                      data = df_Women,
                      max.em.its = 1000,
                      init.type = "Spectral",
                      verbose = FALSE)
plot(diffKwomen)
For men
Based on semantic coherence, the number of topics for the men’s corpus was chosen as 7.
diffKmen <- searchK(dfm_men,
                    K = c(5,6,7,8,9,10),
                    N = floor(0.1 * 191),
                    data = df_Men,
                    max.em.its = 1000,
                    init.type = "Spectral",
                    verbose = FALSE)
plot(diffKmen)
For both
Based on semantic coherence, the number of topics for the corpus with both men’s and women’s articles was chosen as 8.
diffKboth <- searchK(dfm_both,
                     K = c(5,6,7,8,9,10),
                     N = floor(0.1 * 148),
                     data = df_Both,
                     max.em.its = 1000,
                     init.type = "Spectral",
                     verbose = FALSE)
plot(diffKboth)
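The semantic-coherence criterion used to choose K can also be applied programmatically rather than by reading the plots. The sketch below is illustrative only: it assumes that a searchK() object stores its diagnostics in a `results` table with columns named `K` and `semcoh`, and it uses a toy table with made-up numbers rather than any output from this study.

```r
# Toy table shaped like the assumed `results` element of a searchK() object.
# The column names K and semcoh are an assumption about the stm package;
# the numbers are made up for illustration and are NOT results from this study.
toy_results <- data.frame(
  K      = c(2, 3, 4, 5),
  semcoh = c(-95.2, -88.7, -91.4, -102.9)
)

# unlist() is a no-op on plain numeric columns; it guards against the
# possibility that the columns are stored as lists.
k_values <- unlist(toy_results$K)
semcoh   <- unlist(toy_results$semcoh)

# Semantic coherence is negative; values closer to zero are better.
best_k <- k_values[which.max(semcoh)]
print(best_k)
```

In practice semantic coherence tends to favour fewer topics, so it is usually weighed against the other diagnostics that plot() displays instead of being maximised on its own.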
Interpretation of the models
LDA models
Topic models for women’s corpora
dfm_women <- dfm_remove(dfm_women, c("olympics","olympic","india","indian","tokyo","sports","#tokyo2020","2020","2021","india's","games","game","match","will","day"), verbose = TRUE)
removed 15 features
tidy_w <- tidy(dfm_women)
tidy_w
# A tibble: 49,119 × 3
document term count
<chr> <chr> <dbl>
1 text1 solitary 1
2 text50 solitary 1
3 text59 solitary 1
4 text146 solitary 1
5 text1 two-day 1
6 text85 two-day 1
7 text86 two-day 1
8 text224 two-day 1
9 text1 fixture 1
10 text1 great 1
# ℹ 49,109 more rows
women_dtm <- tidy_w %>%
  cast_dtm(document, term, count)
women_dtm
<<DocumentTermMatrix (documents: 277, terms: 10936)>>
Non-/sparse entries: 49119/2980153
Sparsity : 98%
Maximal term length: 29
Weighting : term frequency (tf)
The topics can be classified as follows:
Topic 1- Hockey match details
Topic 2- Mirabai Chanu placing second in weightlifting, hockey
Topic 3- Hockey and weightlifting
Topic 4- Lovlina Borgohain placing third in boxing
Topic 5- PV Sindhu placing third in badminton
Topic 6- Casteist remarks against women’s hockey team
Topic 7- Information about Simone Biles and importance of mental health
Topic 8- Aditi Ashok’s performance in golf and medals mentioned from other sports
Topic 9- Rewards offered to the hockey teams
Most of the topics in the women’s sports corpus are about the Indian women athletes who won medals at the Olympics or reached the final rounds. In addition, casteist remarks were made about Indian women hockey players after the women’s team lost a semifinal, which is reflected in topic 6. Finally, the only topic concerning international athletes and events was about Simone Biles and her decision to withdraw from competition for mental health reasons.
lda_women <- LDA(women_dtm, k = 9, control = list(seed = 2345))
lda_women
A LDA_VEM topic model with 9 topics.
#extracting per-topic-per-word probabilities
topics_women <- tidy(lda_women, matrix = "beta")
topics_women
# A tibble: 98,424 × 3
topic term beta
<int> <chr> <dbl>
1 1 solitary 3.86e-277
2 2 solitary 1.27e- 4
3 3 solitary 3.19e-276
4 4 solitary 3.49e-277
5 5 solitary 1.80e-278
6 6 solitary 2.53e-279
7 7 solitary 1.32e-277
8 8 solitary 4.64e- 4
9 9 solitary 1.04e-268
10 1 two-day 7.05e-274
# ℹ 98,414 more rows
#Finding the top 20 terms
top_20_w <- topics_women %>%
  group_by(topic) %>%
  slice_max(beta, n = 20) %>%
  ungroup() %>%
  arrange(topic, -beta)
top_20_w %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  scale_y_reordered() +
  labs(title = "Topic Modelling for the Women's Corpora")
Since topics 2 and 3 both contain words related to weightlifting and hockey, I checked for the words with the greatest difference between the two topics. The words that are more common in topic 2 include world, chanu and medal, whereas the words that are more common in topic 3 include hockey, win, team, time, mirabai, weightlifting and khan. This suggests that topic 2 contains more information specific to weightlifting, while topic 3 is a mixture of the two sports.
beta_2_3 <- topics_women %>%
  mutate(topic = paste0("topic", topic)) %>%
  filter(topic == "topic2" | topic == "topic3") %>%
  pivot_wider(names_from = topic, values_from = beta) %>%
  filter(topic2 > .006 | topic3 > .006) %>%
  mutate(log_ratio = log2(topic2 / topic3))
beta_2_3 %>% select(log_ratio) %>% max()
[1] 2.888065
beta_2_3 %>% select(log_ratio) %>% min()
[1] -4.567394
new <- beta_2_3 %>%
  group_by(direction = log_ratio > 0) %>%
  top_n(10, abs(log_ratio)) %>%
  ungroup() %>%
  mutate(term = reorder(term, log_ratio)) %>%
  ggplot(aes(term, log_ratio)) +
  geom_col() +
  labs(y = "Log2 ratio of beta in topic 2 / topic 3", title = "Words with the Greatest Difference in Topics 2 and 3 in the Women's Corpora") +
  coord_flip()
new
#ggplotly(new)
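The log2 ratio used to compare topics 2 and 3 can be illustrated with a small worked example. This is a sketch with made-up beta values (they are not taken from the fitted models); the terms are borrowed from the discussion above only to show how the sign of the ratio is read.

```r
library(dplyr)
library(tidyr)

# Made-up beta values for illustration only (not from the fitted models)
toy_beta <- tibble(
  topic = c(2, 3, 2, 3, 2, 3),
  term  = c("chanu", "chanu", "hockey", "hockey", "medal", "medal"),
  beta  = c(0.016, 0.002, 0.001, 0.008, 0.010, 0.010)
)

toy_ratio <- toy_beta %>%
  mutate(topic = paste0("topic", topic)) %>%
  pivot_wider(names_from = topic, values_from = beta) %>%
  # positive: more probable in topic 2; negative: more probable in topic 3
  mutate(log_ratio = log2(topic2 / topic3))

print(toy_ratio)
```

A log2 ratio of 3 means the term is 2^3 = 8 times more probable in topic 2 than in topic 3, a ratio of -3 means the reverse, and a ratio of 0 means the term is equally probable in both topics.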
Topic models for men’s corpora
dfm_men <- dfm_remove(dfm_men, c("olympics","olympic","india","indian","tokyo","sports","#tokyo2020","2020","2021","india's","games","game","match","will","day"), verbose = TRUE)
removed 15 features
tidy_m <- tidy(dfm_men)
tidy_m
# A tibble: 28,758 × 3
document term count
<chr> <chr> <dbl>
1 text1 41-year-long 1
2 text49 41-year-long 1
3 text1 wait 1
4 text34 wait 1
5 text49 wait 2
6 text52 wait 2
7 text53 wait 4
8 text55 wait 1
9 text63 wait 1
10 text67 wait 1
# ℹ 28,748 more rows
men_dtm <- tidy_m %>%
  cast_dtm(document, term, count)
men_dtm
<<DocumentTermMatrix (documents: 191, terms: 7369)>>
Non-/sparse entries: 28758/1378721
Sparsity : 98%
Maximal term length: 30
Weighting : term frequency (tf)
The topics can be classified as follows:
Topic 1- Hockey match details
Topic 2- Hockey rewards
Topic 3- Hockey and cash awards
Topic 4- Hockey, Shooting 10 m air pistol, Tennis
Topic 5- More details about hockey, related to the coach
Topic 6- Archery, Hockey, multiple Olympic winners from the same university
Topic 7- Many of the medal winners- PV Sindhu: Bronze medal in Badminton, Bajrang Punia: Bronze medal in Wrestling, Neeraj Chopra: Gold medal in Javelin throw, Ravi Kumar Dahiya: Silver medal in Wrestling
Most of the topics concern the men’s hockey team’s victory, including the details of the match and people’s reactions to it. The other people discussed in the corpus are also medallists. This suggests that, more than gender, Indian newspapers perhaps focused on the athletes who achieved victories. Moreover, even though this was the men’s corpus, the female badminton player PV Sindhu was among the top terms in topic 7. This shows that the tags in the metadata were not completely accurate.
lda_men <- LDA(men_dtm, k = 7, control = list(seed = 2345))
lda_men
A LDA_VEM topic model with 7 topics.
#extracting per-topic-per-word probabilities
topics_men <- tidy(lda_men, matrix = "beta")
topics_men
# A tibble: 51,583 × 3
topic term beta
<int> <chr> <dbl>
1 1 41-year-long 1.02e- 73
2 2 41-year-long 1.41e- 73
3 3 41-year-long 1.02e- 73
4 4 41-year-long 1.16e- 73
5 5 41-year-long 1.77e- 4
6 6 41-year-long 1.22e- 4
7 7 41-year-long 4.00e-218
8 1 wait 2.26e- 73
9 2 wait 3.72e- 44
10 3 wait 1.75e- 4
# ℹ 51,573 more rows
#Finding the top 20 terms
top_20_m <- topics_men %>%
  group_by(topic) %>%
  slice_max(beta, n = 20) %>%
  ungroup() %>%
  arrange(topic, -beta)
top_20_m %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  scale_y_reordered() +
  labs(title = "Topic Modelling for the Men's Corpora")
Since topics 2 and 3 both contain words related to hockey and rewards, I checked for the words with the greatest difference between the two topics. The words that are more common in topic 2 include men’s, hockey, team and medal, whereas the words that are more common in topic 3 include bronze, rs (rupees, the Indian currency), singh, contingent, athletes and village. This suggests that topic 2 contains more information specific to hockey and the medal, whereas topic 3 contains miscellaneous information such as cash prizes and the captain of the hockey team.
beta_2_3 <- topics_men %>%
  mutate(topic = paste0("topic", topic)) %>%
  filter(topic == "topic2" | topic == "topic3") %>%
  pivot_wider(names_from = topic, values_from = beta) %>%
  # note: both conditions test topic3 (the women's version tests topic2 | topic3);
  # kept as run so that the outputs below still correspond to this code
  filter(topic3 > .006 | topic3 > .006) %>%
  mutate(log_ratio = log2(topic2 / topic3))
beta_2_3 %>% select(log_ratio) %>% max()
[1] 1.029846
beta_2_3 %>% select(log_ratio) %>% min()
[1] -234.7674
new2 <- beta_2_3 %>%
  group_by(direction = log_ratio > 0) %>%
  top_n(15, abs(log_ratio)) %>%
  ungroup() %>%
  mutate(term = reorder(term, log_ratio)) %>%
  ggplot(aes(term, log_ratio)) +
  geom_col() +
  labs(y = "Log2 ratio of beta in topic 2 / topic 3", title = "Words with the Greatest Difference in Topics 2 and 3 in the Men's Corpora") +
  coord_flip()
new2
#ggplotly(new2)
Topic models for corpora with both men and women
dfm_both <- dfm_remove(dfm_both, c("olympics","olympic","india","indian","tokyo","sports","#tokyo2020","2020","2021","india's"), verbose = TRUE)
removed 10 features
tidy_b <- tidy(dfm_both)
tidy_b
# A tibble: 25,265 × 3
document term count
<chr> <chr> <dbl>
1 text1 enters 1
2 text1 fifth 1
3 text4 fifth 1
4 text69 fifth 1
5 text88 fifth 3
6 text91 fifth 1
7 text134 fifth 1
8 text1 day 2
9 text2 day 4
10 text3 day 3
# ℹ 25,255 more rows
both_dtm <- tidy_b %>%
  cast_dtm(document, term, count)
both_dtm
<<DocumentTermMatrix (documents: 148, terms: 7350)>>
Non-/sparse entries: 25265/1062535
Sparsity : 98%
Maximal term length: 26
Weighting : term frequency (tf)
The topics can be classified as follows:
Topic 1- Hockey, Tennis and Table Tennis
Topics 2 and 3- No clear topic can be distinguished
Topic 4- Odisha (Indian state) Government’s rewards for Hockey players
Topic 5- Hockey details and rewards
Topic 6- Reactions to the hockey team’s victory
Topic 7- Neeraj Chopra’s achievement in Javelin throw and receipt of highest sporting honour in India
Topic 8- Badminton and badminton player PV Sindhu
This corpus shows little difference from the other two. Aspects of hockey remain common across multiple topics. The other prominent athletes in this corpus were PV Sindhu (female badminton player) and Neeraj Chopra (male track and field athlete), who won bronze and gold medals respectively.
lda_both <- LDA(both_dtm, k = 8, control = list(seed = 2345))
lda_both
A LDA_VEM topic model with 8 topics.
#extracting per-topic-per-word probabilities
topics_both <- tidy(lda_both, matrix = "beta")
topics_both
# A tibble: 58,800 × 3
topic term beta
<int> <chr> <dbl>
1 1 enters 4.72e- 12
2 2 enters 1.21e-214
3 3 enters 9.35e-215
4 4 enters 1.66e-213
5 5 enters 1.27e-215
6 6 enters 3.25e-212
7 7 enters 2.11e-213
8 8 enters 1.89e- 4
9 1 fifth 8.47e- 20
10 2 fifth 9.62e-130
# ℹ 58,790 more rows
#Finding the top 20 terms
top_20_b <- topics_both %>%
  group_by(topic) %>%
  slice_max(beta, n = 20) %>%
  ungroup() %>%
  arrange(topic, -beta)
top_20_b %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  scale_y_reordered()
Conclusion
To summarise, the topic models revealed no clear difference in the language used to describe women’s and men’s sports. Most of the athletes mentioned in the Indian newspapers had won medals or progressed to the final rounds of their respective sports. The modelling did, however, show which aspects of the Olympics were most discussed in India, the most popular being the victory of the Indian men’s hockey team. The media covered various features of hockey, including details of the matches, awards and people’s reactions to the Indian team winning. Overall, men’s and women’s sports were not reported differently, but an interesting finding was that the media mainly focused on the athletes who won medals.
One limitation was that the classification in the metadata was not fully accurate: a few articles about women’s sports were labelled as men’s sports and vice versa. Another limitation was that a few duplicate articles could not be removed even after using the distinct() function. This may have caused some words to appear more prominent in the topics than they actually were.
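One possible reason for the surviving duplicates, offered as an assumption rather than a diagnosis of this dataset, is that distinct() only removes rows whose text matches exactly, so copies that differ in case or spacing are kept. The sketch below uses toy data and a hypothetical helper column, `body_key`, to show how normalising the text before deduplication could catch such cases.

```r
library(dplyr)

# Toy data: rows 1 and 2 differ only in case and spacing
toy_articles <- tibble(
  body = c("India wins bronze in hockey.",
           "  India wins  BRONZE in hockey. ",
           "PV Sindhu advances to the semifinal."),
  newspaper = c("Paper A", "Paper B", "Paper A")
)

deduped <- toy_articles %>%
  # body_key is a hypothetical helper column:
  # whitespace collapsed, trimmed and lower-cased text
  mutate(body_key = tolower(trimws(gsub("\\s+", " ", body)))) %>%
  distinct(body_key, .keep_all = TRUE) %>%
  select(-body_key)

print(nrow(deduped))
```

This only handles trivial formatting differences; articles that were lightly edited between outlets would need a similarity-based comparison instead.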
Future research can incorporate more categories beyond sports and a longer time period in order to determine whether a gender bias in Indian newspapers exists.
References
Devinney, H., Björklund, J., & Björklund, H. (2020). Semi-Supervised Topic Modeling for Gender Bias Discovery in English and Swedish. Proceedings of the Second Workshop on Gender Bias in Natural Language Processing, 79-92. https://aclanthology.org/2020.gebnlp-1.8
Grün, B., Hornik, K., Blei, D. M., Lafferty, J. D., Phan, X.-H., Matsumoto, M., Nishimura, T., & Cokus, S. (2022, December 6). topicmodels: Topic Models. The Comprehensive R Archive Network. https://cran.r-project.org/web/packages/topicmodels/
Nexis Data Lab. (2022). Olympics. Retrieved October 24, 2022, https://advance.lexis.com/nexisdatalabhome/?pdmfid=1534561&crid=7e8f5fed-48c5-45a3-a2dfbd9438b5d050&ecomp=zd54k&prid=6f411be1-3913-4a48-95be6a5c4d8b2367
R Core Team. (2022). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.R-project.org/
Rao, P., & Taboada, M. (2021). Gender Bias in the News: A Scalable Topic Modelling and Visualization Framework. Frontiers in Artificial Intelligence, 4:664737. doi: 10.3389/frai.2021.664737
Roberts, M., Stewart, B., Tingley, D., & Benoit, K. (2022, October 14). stm: Estimation of the Structural Topic Model. The Comprehensive R Archive Network. https://cran.r-project.org/web/packages/stm/stm.pdf
Silge, J., & Robinson, D. (2017). Text Mining with R: A Tidy Approach. O’Reilly Media. https://www.tidytextmining.com/topicmodeling.html