Topic Modelling based on Gender Corpora in Sports News

MekhalaKumar

Olympics2020

GenderandSports

LDATopicModelling

BlogPost6

Author

Mekhala Kumar

Published

December 19, 2022

library(tidyverse)
library(quanteda)
library(readtext)
library(striprtf)
library(corpustools)
library(quanteda.textplots)
library(readr)
library(topicmodels)
library(tidytext)
library(dplyr)
library(ggplot2)
library(plotly)
library(tidyr)
library(tm)
library(stm)

Introduction

Newspapers often reflect the gender biases and gender roles present in society. Rao and Taboada (2021) found that English Canadian newspapers quote women more often in the Lifestyle, Entertainment, Arts and Healthcare categories, and men more often in the Business, Sports and United States Politics categories. Even within a field such as sports, articles about men’s sports tend to describe the details of the events, whereas articles about women’s sports focus only on the women’s achievements. Similarly, Devinney et al. (2020) studied mainstream English news articles, mainstream Swedish articles and LGBTQ+ web content, and found that feminine topics were linked to the private sphere while masculine topics were linked to the public sphere.

Research aim: To understand whether Indian newspapers reported women’s and men’s sports differently during the Tokyo Olympics held in 2021.

Data used in the project

The LexisNexis database was used to collect articles from July 22 to August 9, 2021 (the time when the 2021 Olympics were held).
The data included articles from Hindustan Times, Times of India (Electronic Edition), Free Press Journal (India), The Telegraph (India), Indian Express, Mint, DNA, India Today Online, The Hindu and Economic Times (E-Paper Edition).
The key word searched was Olympics and filters including Men’s Sports, Women’s Sports, Sports Awards, Sports & Recreation, India and Newspapers were used.

Methodology

The quanteda package was used for preprocessing. Depending on the model, the corpus was either the entire set of files or a subset. Punctuation and stopwords were removed from each corpus. Additionally, words such as Olympics, India and Tokyo were removed to derive more meaningful results.
Structural Topic Modelling and LDA Topic Modelling were employed using the stm and topicmodels packages respectively. For this, subsets of the dataset were used to create corpora. The subsets were built from the metadata, which had classification tags such as sports, women’s sports and men’s sports. The articles were categorised as men’s sports, women’s sports or both.
For structural topic modelling, the corpus had 468 articles. However, the structural topic model did not produce anything insightful: it only showed that women who played particular sports were mentioned more often in the women’s section, and vice versa for men. The terms used to describe the events did not appear in the model output.
Hence, the final analysis used Latent Dirichlet Allocation (LDA) topic models fitted separately for each gender’s corpus. Additionally, a third corpus contained the articles tagged with both women’s and men’s sports.
There were 191 articles in the men’s sports corpus, 277 articles in the women’s sports corpus and 148 articles in the corpus tagged with both men’s and women’s sports.
For the LDA topic models and structural topic models, the searchK() function from the stm package was used to determine the optimal number of topics.

Semantic Network

The semantic network displayed here was made using 1128 articles.
I limited the document feature matrix to terms that appeared at least 15 times and in at least 25% of the documents. This left 50 terms, which I plotted.
Unsurprisingly, this shows that most of the articles discuss India in the Olympics (as Indian newspaper articles were used). One major theme is the hockey team: the men’s team, led by captain Manpreet Singh, made history by placing third and winning its first Olympic medal in over four decades. Other significant terms include medals and medal colours, perhaps pertaining to victories by other Indian athletes; these are observed more clearly through the LDA topic models in the following sections.
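To make the two trimming thresholds concrete, here is a minimal sketch on an invented four-document corpus. The texts and the thresholds (2 occurrences, 50% of documents) are assumptions chosen so the effect is visible on toy data; only `dfm_trim()` and its arguments mirror the code in this post.

```r
library(quanteda)

# Toy "articles", invented purely for illustration
toy_texts <- c(
  d1 = "hockey medal hockey team team",
  d2 = "hockey medal boxing",
  d3 = "hockey medal badminton",
  d4 = "golf"
)
toy_dfm <- dfm(tokens(toy_texts))

# Step 1: keep terms that occur at least 2 times across the corpus
toy_trimmed <- dfm_trim(toy_dfm, min_termfreq = 2)

# Step 2: keep terms that appear in at least 50% of the documents
toy_trimmed <- dfm_trim(toy_trimmed, min_docfreq = 0.5, docfreq_type = "prop")

# "team" passes step 1 (2 occurrences) but is dropped in step 2,
# because it appears in only 1 of 4 documents
featnames(toy_trimmed)
```

The same logic explains why only 50 terms survive in the real matrix: a term has to be both frequent overall and spread across many articles.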

articles_dfm<-readRDS(file = "_data/News_DFMForSemNet.rds")

dfm_refined <- dfm_trim(articles_dfm, min_termfreq = 15)
dfm_refined <- dfm_trim(dfm_refined, min_docfreq = .25, docfreq_type = "prop")

fcm<- fcm(dfm_refined)
dim(fcm)

[1] 50 50

top_features <- names(topfeatures(fcm,50))
fcm_refined <- fcm_select(fcm, pattern = top_features, selection = "keep")
dim(fcm_refined)

[1] 50 50

size <- log(colSums(fcm_refined))
textplot_network(fcm_refined, vertex_size = size / max(size) * 3)

Reading in the files for the LDA topic models

df_All<-readRDS(file = "_data/FilesClassificationNoDuplicates.rds")
df_All<-df_All %>% distinct(body, .keep_all = TRUE)

df_3<-df_All%>%split(df_All$Classification)
df_Men<-df_3$Men
dim(df_Men)

[1] 191   4

df_Women<-df_3$Women
dim(df_Women)

[1] 277   4

df_Both<-df_3$MenAndWomen
dim(df_Both)

[1] 148   4

Preprocessing for each corpus

Since the dataset was divided into 3 parts to be analysed separately, preprocessing had to be conducted for each part. Each dataframe was converted into a corpus and its metadata was checked. After this, tokens were created, and punctuation and stopwords were removed.
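Because the same three steps are repeated for each dataframe, they could be bundled into one function. The sketch below is only an illustration: `preprocess_articles()` and the toy dataframe are hypothetical and not part of the original analysis, although the quanteda calls are the same ones used in the sections that follow.

```r
library(quanteda)

# Hypothetical helper (not in the original post): corpus -> tokens ->
# punctuation removed -> English stopwords removed
preprocess_articles <- function(df, text_field = "body") {
  corp <- corpus(df, text_field = text_field)
  toks <- tokens(corp, remove_punct = TRUE)
  tokens_select(toks, pattern = stopwords("en"), selection = "remove")
}

# Toy stand-in for df_Women / df_Men / df_Both
toy_df <- data.frame(
  body = c("The team won a medal!", "She is in the final."),
  newspaper = c("A", "B"),
  stringsAsFactors = FALSE
)

toy_tokens <- preprocess_articles(toy_df)
as.list(toy_tokens)
```

With such a helper, each corpus would need a single call, for example `preprocess_articles(df_Women)`, which keeps the three pipelines identical by construction.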

Women’s corpus

corpus_w <- corpus(df_Women,text_field = "body")
head(corpus_w)

Corpus consisting of 6 documents and 3 docvars.
text1 :
" From a solitary two-day fixture between Great Britain and F..."

text2 :
" Tokyo Olympics Day 10 Full Schedule: Kamalpreet Kaur stunne..."

text3 :
" Haryana CM Manohar Lal Khattar has announced Rs.50 lakh cas..."

text4 :
" India's quest for another medal will continue on Day 8 of t..."

text5 :
" India vs Argentina Women's Hockey Semifinal Match Live Stre..."

text6 :
" India would fancy their chances of a medal finish as they g..."

corpus_w_summary <- summary(corpus_w)
head(corpus_w_summary)

   Text Types Tokens Sentences       newspaper            date Classification
1 text1   260    542        22 Hindustan Times August 9, 2021           Women
2 text2   106    220         5 Hindustan Times August 2, 2021           Women
3 text3   126    298        12 Hindustan Times August 7, 2021           Women
4 text4   168    326         5 Hindustan Times  July 31, 2021           Women
5 text5   104    292        12 Hindustan Times August 4, 2021           Women
6 text6   196    393         6 Hindustan Times  July 30, 2021           Women

#corpus_w_summary$Tokens
#docvars(corpus_w)

corpus_w_tokens <- tokens(corpus_w)
head(corpus_w_tokens)

Tokens consisting of 6 documents and 3 docvars.
text1 :
 [1] "From"     "a"        "solitary" "two-day"  "fixture"  "between" 
 [7] "Great"    "Britain"  "and"      "France"   "in"       "the"     
[ ... and 530 more ]

text2 :
 [1] "Tokyo"      "Olympics"   "Day"        "10"         "Full"      
 [6] "Schedule"   ":"          "Kamalpreet" "Kaur"       "stunned"   
[11] "the"        "nation"    
[ ... and 208 more ]

text3 :
 [1] "Haryana"   "CM"        "Manohar"   "Lal"       "Khattar"   "has"      
 [7] "announced" "Rs"        "."         "50"        "lakh"      "cash"     
[ ... and 286 more ]

text4 :
 [1] "India's"  "quest"    "for"      "another"  "medal"    "will"    
 [7] "continue" "on"       "Day"      "8"        "of"       "the"     
[ ... and 314 more ]

text5 :
 [1] "India"     "vs"        "Argentina" "Women's"   "Hockey"    "Semifinal"
 [7] "Match"     "Live"      "Streaming" ","         "Tokyo"     "Olympics" 
[ ... and 280 more ]

text6 :
 [1] "India"   "would"   "fancy"   "their"   "chances" "of"      "a"      
 [8] "medal"   "finish"  "as"      "they"    "gear"   
[ ... and 381 more ]

corpus_w_tokens <- tokens(corpus_w_tokens,
                          remove_punct = TRUE)
head(corpus_w_tokens)

Tokens consisting of 6 documents and 3 docvars.
text1 :
 [1] "From"     "a"        "solitary" "two-day"  "fixture"  "between" 
 [7] "Great"    "Britain"  "and"      "France"   "in"       "the"     
[ ... and 464 more ]

text2 :
 [1] "Tokyo"      "Olympics"   "Day"        "10"         "Full"      
 [6] "Schedule"   "Kamalpreet" "Kaur"       "stunned"    "the"       
[11] "nation"     "with"      
[ ... and 171 more ]

text3 :
 [1] "Haryana"   "CM"        "Manohar"   "Lal"       "Khattar"   "has"      
 [7] "announced" "Rs"        "50"        "lakh"      "cash"      "award"    
[ ... and 253 more ]

text4 :
 [1] "India's"  "quest"    "for"      "another"  "medal"    "will"    
 [7] "continue" "on"       "Day"      "8"        "of"       "the"     
[ ... and 265 more ]

text5 :
 [1] "India"     "vs"        "Argentina" "Women's"   "Hockey"    "Semifinal"
 [7] "Match"     "Live"      "Streaming" "Tokyo"     "Olympics"  "Winning"  
[ ... and 249 more ]

text6 :
 [1] "India"   "would"   "fancy"   "their"   "chances" "of"      "a"      
 [8] "medal"   "finish"  "as"      "they"    "gear"   
[ ... and 317 more ]

corpus_w_tokens <- tokens_select(corpus_w_tokens,
                                 pattern = stopwords("en"),
                                 selection = "remove")
head(corpus_w_tokens)

Tokens consisting of 6 documents and 3 docvars.
text1 :
 [1] "solitary"  "two-day"   "fixture"   "Great"     "Britain"   "France"   
 [7] "1900"      "Olympics"  "prospects" "cricket's" "inclusion" "8-team"   
[ ... and 271 more ]

text2 :
 [1] "Tokyo"      "Olympics"   "Day"        "10"         "Full"      
 [6] "Schedule"   "Kamalpreet" "Kaur"       "stunned"    "nation"    
[11] "64m"        "throw"     
[ ... and 138 more ]

text3 :
 [1] "Haryana"   "CM"        "Manohar"   "Lal"       "Khattar"   "announced"
 [7] "Rs"        "50"        "lakh"      "cash"      "award"     "every"    
[ ... and 155 more ]

text4 :
 [1] "India's"  "quest"    "another"  "medal"    "will"     "continue"
 [7] "Day"      "8"        "Tokyo"    "Olympics" "ace"      "shuttler"
[ ... and 207 more ]

text5 :
 [1] "India"     "vs"        "Argentina" "Women's"   "Hockey"    "Semifinal"
 [7] "Match"     "Live"      "Streaming" "Tokyo"     "Olympics"  "Winning"  
[ ... and 185 more ]

text6 :
 [1] "India"    "fancy"    "chances"  "medal"    "finish"   "gear"    
 [7] "Day"      "7"        "Tokyo"    "Olympics" "eyes"     "archer"  
[ ... and 251 more ]

Men’s corpus

corpus_m <- corpus(df_Men,text_field = "body")
head(corpus_m)

Corpus consisting of 6 documents and 3 docvars.
text1 :
" India's 41-year-long wait for an Olympic medal in hockey ca..."

text2 :
" The Board of Control for Cricket in India (BCCI) on Saturda..."

text3 :
" Indian athletes who won laurels for the nation at Tokyo Oly..."

text4 :
" India vs Belgium Hockey Match Live Streaming, Tokyo Olympic..."

text5 :
" India vs Germany Hockey Match Live Streaming, Tokyo Olympic..."

text6 :
" Indian men's hockey team on Sunday defeated Great Britain i..."

corpus_m_summary <- summary(corpus_m)
head(corpus_m_summary)

   Text Types Tokens Sentences       newspaper            date Classification
1 text1   206    547        22 Hindustan Times August 5, 2021             Men
2 text2   156    299        12 Hindustan Times August 7, 2021             Men
3 text3   290    644        29            MINT August 9, 2021             Men
4 text4   117    289        12 Hindustan Times August 2, 2021             Men
5 text5   120    300        12 Hindustan Times August 5, 2021             Men
6 text6   234    522        26            MINT August 1, 2021             Men

#corpus_m_summary$Tokens
#docvars(corpus_m)

corpus_m_tokens <- tokens(corpus_m)
head(corpus_m_tokens)

Tokens consisting of 6 documents and 3 docvars.
text1 :
 [1] "India's"      "41-year-long" "wait"         "for"          "an"          
 [6] "Olympic"      "medal"        "in"           "hockey"       "came"        
[11] "to"           "an"          
[ ... and 535 more ]

text2 :
 [1] "The"     "Board"   "of"      "Control" "for"     "Cricket" "in"     
 [8] "India"   "("       "BCCI"    ")"       "on"     
[ ... and 287 more ]

text3 :
 [1] "Indian"   "athletes" "who"      "won"      "laurels"  "for"     
 [7] "the"      "nation"   "at"       "Tokyo"    "Olympics" "were"    
[ ... and 632 more ]

text4 :
 [1] "India"     "vs"        "Belgium"   "Hockey"    "Match"     "Live"     
 [7] "Streaming" ","         "Tokyo"     "Olympics"  "Live"      "Streaming"
[ ... and 277 more ]

text5 :
 [1] "India"     "vs"        "Germany"   "Hockey"    "Match"     "Live"     
 [7] "Streaming" ","         "Tokyo"     "Olympics"  ":"         "After"    
[ ... and 288 more ]

text6 :
 [1] "Indian"        "men's"         "hockey"        "team"         
 [5] "on"            "Sunday"        "defeated"      "Great"        
 [9] "Britain"       "in"            "the"           "quarterfinals"
[ ... and 510 more ]

corpus_m_tokens <- tokens(corpus_m_tokens,
                          remove_punct = TRUE)
head(corpus_m_tokens)

Tokens consisting of 6 documents and 3 docvars.
text1 :
 [1] "India's"      "41-year-long" "wait"         "for"          "an"          
 [6] "Olympic"      "medal"        "in"           "hockey"       "came"        
[11] "to"           "an"          
[ ... and 482 more ]

text2 :
 [1] "The"      "Board"    "of"       "Control"  "for"      "Cricket" 
 [7] "in"       "India"    "BCCI"     "on"       "Saturday" "decided" 
[ ... and 257 more ]

text3 :
 [1] "Indian"   "athletes" "who"      "won"      "laurels"  "for"     
 [7] "the"      "nation"   "at"       "Tokyo"    "Olympics" "were"    
[ ... and 551 more ]

text4 :
 [1] "India"     "vs"        "Belgium"   "Hockey"    "Match"     "Live"     
 [7] "Streaming" "Tokyo"     "Olympics"  "Live"      "Streaming" "The"      
[ ... and 245 more ]

text5 :
 [1] "India"     "vs"        "Germany"   "Hockey"    "Match"     "Live"     
 [7] "Streaming" "Tokyo"     "Olympics"  "After"     "going"     "down"     
[ ... and 262 more ]

text6 :
 [1] "Indian"        "men's"         "hockey"        "team"         
 [5] "on"            "Sunday"        "defeated"      "Great"        
 [9] "Britain"       "in"            "the"           "quarterfinals"
[ ... and 447 more ]

corpus_m_tokens <- tokens_select(corpus_m_tokens,
                                 pattern = stopwords("en"),
                                 selection = "remove")
head(corpus_m_tokens)

Tokens consisting of 6 documents and 3 docvars.
text1 :
 [1] "India's"      "41-year-long" "wait"         "Olympic"      "medal"       
 [6] "hockey"       "came"         "end"          "Thursday"     "men's"       
[11] "hockey"       "team"        
[ ... and 290 more ]

text2 :
 [1] "Board"      "Control"    "Cricket"    "India"      "BCCI"      
 [6] "Saturday"   "decided"    "celebrate"  "India's"    "successful"
[11] "ever"       "campaign"  
[ ... and 172 more ]

text3 :
 [1] "Indian"   "athletes" "won"      "laurels"  "nation"   "Tokyo"   
 [7] "Olympics" "honoured" "grand"    "ceremony" "Delhi"    "Sport"   
[ ... and 329 more ]

text4 :
 [1] "India"     "vs"        "Belgium"   "Hockey"    "Match"     "Live"     
 [7] "Streaming" "Tokyo"     "Olympics"  "Live"      "Streaming" "stage"    
[ ... and 189 more ]

text5 :
 [1] "India"     "vs"        "Germany"   "Hockey"    "Match"     "Live"     
 [7] "Streaming" "Tokyo"     "Olympics"  "going"     "fighting"  "current"  
[ ... and 196 more ]

text6 :
 [1] "Indian"        "men's"         "hockey"        "team"         
 [5] "Sunday"        "defeated"      "Great"         "Britain"      
 [9] "quarterfinals" "3-1"           "Now"           "Indian"       
[ ... and 285 more ]

Both men and women

corpus_b <- corpus(df_Both,text_field = "body")
head(corpus_b)

Corpus consisting of 6 documents and 3 docvars.
text1 :
" The Tokyo Olympics 2020 enters the fifth day which will beg..."

text2 :
" Day 5 of the Tokyo Olympics on Wednesday was a hot and cold..."

text3 :
" Day 8 of the Tokyo Olympics wasn't great in particular for ..."

text4 :
" India men's hockey team defeated Germany to win the bronze ..."

text5 :
" Neeraj Chopra on Saturday not only won gold for the country..."

text6 :
" While the Indian Men's Hockey team created history on Thurs..."

corpus_b_summary <- summary(corpus_b)
head(corpus_b_summary)

   Text Types Tokens Sentences       newspaper            date Classification
1 text1   153    305         8 Hindustan Times  July 28, 2021     MenAndWomen
2 text2   197    348         6 Hindustan Times  July 29, 2021     MenAndWomen
3 text3   166    255         8 Hindustan Times August 1, 2021     MenAndWomen
4 text4   221    431        14 Hindustan Times August 5, 2021     MenAndWomen
5 text5   216    479        20            MINT August 7, 2021     MenAndWomen
6 text6   241    522        26 Hindustan Times August 5, 2021     MenAndWomen

#corpus_b_summary$Tokens
#docvars(corpus_b)

corpus_b_tokens <- tokens(corpus_b)
head(corpus_b_tokens)

Tokens consisting of 6 documents and 3 docvars.
text1 :
 [1] "The"      "Tokyo"    "Olympics" "2020"     "enters"   "the"     
 [7] "fifth"    "day"      "which"    "will"     "begin"    "with"    
[ ... and 293 more ]

text2 :
 [1] "Day"       "5"         "of"        "the"       "Tokyo"     "Olympics" 
 [7] "on"        "Wednesday" "was"       "a"         "hot"       "and"      
[ ... and 336 more ]

text3 :
 [1] "Day"        "8"          "of"         "the"        "Tokyo"     
 [6] "Olympics"   "wasn't"     "great"      "in"         "particular"
[11] "for"        "India"     
[ ... and 243 more ]

text4 :
 [1] "India"    "men's"    "hockey"   "team"     "defeated" "Germany" 
 [7] "to"       "win"      "the"      "bronze"   "medal"    "at"      
[ ... and 419 more ]

text5 :
 [1] "Neeraj"   "Chopra"   "on"       "Saturday" "not"      "only"    
 [7] "won"      "gold"     "for"      "the"      "country"  ","       
[ ... and 467 more ]

text6 :
 [1] "While"    "the"      "Indian"   "Men's"    "Hockey"   "team"    
 [7] "created"  "history"  "on"       "Thursday" "by"       "winning" 
[ ... and 510 more ]

corpus_b_tokens <- tokens(corpus_b_tokens,
                          remove_punct = TRUE)
head(corpus_b_tokens)

Tokens consisting of 6 documents and 3 docvars.
text1 :
 [1] "The"      "Tokyo"    "Olympics" "2020"     "enters"   "the"     
 [7] "fifth"    "day"      "which"    "will"     "begin"    "with"    
[ ... and 255 more ]

text2 :
 [1] "Day"       "5"         "of"        "the"       "Tokyo"     "Olympics" 
 [7] "on"        "Wednesday" "was"       "a"         "hot"       "and"      
[ ... and 274 more ]

text3 :
 [1] "Day"        "8"          "of"         "the"        "Tokyo"     
 [6] "Olympics"   "wasn't"     "great"      "in"         "particular"
[11] "for"        "India"     
[ ... and 215 more ]

text4 :
 [1] "India"    "men's"    "hockey"   "team"     "defeated" "Germany" 
 [7] "to"       "win"      "the"      "bronze"   "medal"    "at"      
[ ... and 377 more ]

text5 :
 [1] "Neeraj"   "Chopra"   "on"       "Saturday" "not"      "only"    
 [7] "won"      "gold"     "for"      "the"      "country"  "but"     
[ ... and 394 more ]

text6 :
 [1] "While"    "the"      "Indian"   "Men's"    "Hockey"   "team"    
 [7] "created"  "history"  "on"       "Thursday" "by"       "winning" 
[ ... and 451 more ]

corpus_b_tokens <- tokens_select(corpus_b_tokens,
                                 pattern = stopwords("en"),
                                 selection = "remove")
#print(corpus_b_tokens)
head(corpus_b_tokens)

Tokens consisting of 6 documents and 3 docvars.
text1 :
 [1] "Tokyo"    "Olympics" "2020"     "enters"   "fifth"    "day"     
 [7] "will"     "begin"    "Indian"   "women's"  "Hockey"   "team"    
[ ... and 192 more ]

text2 :
 [1] "Day"       "5"         "Tokyo"     "Olympics"  "Wednesday" "hot"      
 [7] "cold"      "affair"    "Shuttler"  "PV"        "Sindhu"    "advanced" 
[ ... and 217 more ]

text3 :
 [1] "Day"        "8"          "Tokyo"      "Olympics"   "great"     
 [6] "particular" "India"      "top"        "guns"       "failed"    
[11] "make"       "mark"      
[ ... and 145 more ]

text4 :
 [1] "India"    "men's"    "hockey"   "team"     "defeated" "Germany" 
 [7] "win"      "bronze"   "medal"    "Tokyo"    "Olympics" "Thursday"
[ ... and 239 more ]

text5 :
 [1] "Neeraj"   "Chopra"   "Saturday" "won"      "gold"     "country" 
 [7] "also"     "helped"   "surpass"  "previous" "best"     "haul"    
[ ... and 266 more ]

text6 :
 [1] "Indian"   "Men's"    "Hockey"   "team"     "created"  "history" 
 [7] "Thursday" "winning"  "Bronze"   "medal"    "Olympics" "Germany" 
[ ... and 217 more ]

Creating document feature matrices

dfm_women <- dfm(tokens(corpus_w_tokens))
dfm_women <- dfm_remove(dfm_women, c("said","also","says","can","just"), verbose = TRUE)

removed 5 features

dfm_men <- dfm(tokens(corpus_m_tokens))
dfm_men<- dfm_remove(dfm_men, c("said","also","says","can","just"), verbose = TRUE)

removed 5 features

dfm_both<- dfm(tokens(corpus_b_tokens))
dfm_both <- dfm_remove(dfm_both, c("said","also","says","can","just"), verbose = TRUE)

removed 5 features

Search K for each corpus

The optimal number of topics was checked for each corpus.

For women

Based on semantic coherence, the number of topics for the women’s corpus was chosen as 9.
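For reference, semantic coherence as reported by stm follows Mimno et al. (2011). This definition is not stated in the post itself and is included only as background. For a topic whose M most probable words are v_1, ..., v_M, with D(v) the number of documents containing word v and D(v, v') the number of documents containing both words, it is usually written as:

```latex
C = \sum_{i=2}^{M} \sum_{j=1}^{i-1} \log\!\left( \frac{D(v_i, v_j) + 1}{D(v_j)} \right)
```

Values are negative, and values closer to zero indicate that a topic's top words tend to co-occur in the same articles, which is the property used here to compare candidate numbers of topics.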

diffKwomen <- searchK(dfm_women,
                       K = c(5,6,7,8,9,10),
                       N = floor(0.1 * 277),
                       data = df_Women,
                       max.em.its = 1000,
                       init.type = "Spectral",
                       verbose=FALSE)

plot(diffKwomen)

For men

Based on semantic coherence, the number of topics for the men’s corpus was chosen as 7.

diffKmen <- searchK(dfm_men,
                       K = c(5,6,7,8,9,10),
                       N = floor(0.1 * 191),
                       data = df_Men,
                       max.em.its = 1000,
                       init.type = "Spectral",
                       verbose=FALSE)

plot(diffKmen)

For both

Based on semantic coherence, the number of topics for the corpus with both men’s and women’s articles was chosen as 8.

diffKboth <- searchK(dfm_both,
                       K = c(5,6,7,8,9,10),
                       N = floor(0.1 * 148),
                       data = df_Both,
                       max.em.its = 1000,
                       init.type = "Spectral",
                       verbose=FALSE)

plot(diffKboth)

Interpretation of the models

LDA models

Topic models for the women’s corpus

dfm_women <- dfm_remove(dfm_women, c("olympics","olympic","india","indian","tokyo","sports","#tokyo2020","2020","2021","india's","games","game","match","will","day"), verbose = TRUE)

removed 15 features

tidy_w<-tidy(dfm_women)
tidy_w

# A tibble: 49,119 × 3
   document term     count
   <chr>    <chr>    <dbl>
 1 text1    solitary     1
 2 text50   solitary     1
 3 text59   solitary     1
 4 text146  solitary     1
 5 text1    two-day      1
 6 text85   two-day      1
 7 text86   two-day      1
 8 text224  two-day      1
 9 text1    fixture      1
10 text1    great        1
# ℹ 49,109 more rows

women_dtm <- tidy_w %>%
  cast_dtm(document, term, count)
women_dtm

<<DocumentTermMatrix (documents: 277, terms: 10936)>>
Non-/sparse entries: 49119/2980153
Sparsity           : 98%
Maximal term length: 29
Weighting          : term frequency (tf)

The topics can be classified as follows:

Topic 1- Hockey match details

Topic 2- Mirabai Chanu placing second in weightlifting, hockey

Topic 3- Hockey and weightlifting

Topic 4- Lovlina Borgohain placing third in boxing

Topic 5- PV Sindhu placing third in badminton

Topic 6- Casteist remarks against women’s hockey team

Topic 7- Information about Simone Biles and importance of mental health

Topic 8- Aditi Ashok’s performance in golf and medals mentioned from other sports

Topic 9- Rewards offered to the hockey teams

Most of the topics in the women’s sports corpus are about the Indian women athletes who won medals at the Olympics or reached the final rounds. In addition, casteist remarks were made about Indian women hockey players after the women’s team lost a semifinal, which is reflected in topic 6. Finally, among international athletes and events, the only topic found was about Simone Biles and her decision to withdraw from events at the Olympics for mental health reasons.

lda_women<- LDA(women_dtm, k = 9, control = list(seed = 2345))
lda_women

A LDA_VEM topic model with 9 topics.

#extracting per-topic-per-word probabilities
topics_women <- tidy(lda_women, matrix = "beta")
topics_women

# A tibble: 98,424 × 3
   topic term          beta
   <int> <chr>        <dbl>
 1     1 solitary 3.86e-277
 2     2 solitary 1.27e-  4
 3     3 solitary 3.19e-276
 4     4 solitary 3.49e-277
 5     5 solitary 1.80e-278
 6     6 solitary 2.53e-279
 7     7 solitary 1.32e-277
 8     8 solitary 4.64e-  4
 9     9 solitary 1.04e-268
10     1 two-day  7.05e-274
# ℹ 98,414 more rows

#Finding the top 20 terms
top_20_w <- topics_women %>%
  group_by(topic) %>%
  slice_max(beta, n = 20) %>% 
  ungroup() %>%
  arrange(topic, -beta)

top_20_w%>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  scale_y_reordered()+
  labs(title = "Topic Modelling for the Women's Corpora")

Since topics 2 and 3 both contain words related to weightlifting and hockey, I checked for the words with the greatest difference between the two topics. The words that are more common in topic 2 include world, chanu and medal, whereas the words more common in topic 3 include hockey, win, team, time, mirabai, weightlifting and khan. This indicates that topic 2 might have more information specific to weightlifting, while topic 3 is a mixture of the two sports.
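As a reading aid for the log ratios used in this comparison, the sketch below works through the arithmetic. The two beta values are hypothetical and chosen only to give a round number; the last two lines convert the maximum and minimum log ratios printed in the output further down back into plain ratios.

```r
# Hypothetical per-topic word probabilities, for illustration only
beta_topic2 <- 0.008
beta_topic3 <- 0.001

# Same formula as in the post: positive values favour topic 2,
# negative values favour topic 3
log_ratio <- log2(beta_topic2 / beta_topic3)
log_ratio  # approximately 3, i.e. 2^3 = 8 times more probable in topic 2

# Converting the reported extremes back to plain ratios
ratio_max <- 2^2.888065  # times more probable in topic 2 than in topic 3
ratio_min <- 2^4.567394  # times more probable in topic 3 than in topic 2
c(ratio_max, ratio_min)
```

Each unit on the log2 scale is a doubling, so a term at +1 is twice as probable in topic 2, and a term at -1 is twice as probable in topic 3.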

beta_2_3<- topics_women %>%
  mutate(topic = paste0("topic", topic))%>%
  filter(topic=="topic2"|topic=="topic3")%>%
  pivot_wider(names_from =topic, values_from = beta)%>% 
  filter(topic2 > .006| topic3 > .006) %>%
  mutate(log_ratio = log2(topic2/ topic3))

beta_2_3%>%select(log_ratio)%>%max()

[1] 2.888065

beta_2_3%>%select(log_ratio)%>%min()

[1] -4.567394

new<-beta_2_3 %>%
  group_by(direction = log_ratio > 0) %>%
  top_n(10, abs(log_ratio)) %>%
  ungroup() %>%
  mutate(term = reorder(term, log_ratio)) %>%
  ggplot(aes(term, log_ratio)) +
  geom_col() +
  labs(y = "Log2 ratio of beta in topic 2 / topic 3", title = "Words with the Greatest Difference in Topics 2 and 3 in the Women's Corpora") +
  coord_flip()

new

#ggplotly(new)

Topic models for the men’s corpus

dfm_men <- dfm_remove(dfm_men, c("olympics","olympic","india","indian","tokyo","sports","#tokyo2020","2020","2021","india's","games","game","match","will","day"), verbose = TRUE)

removed 15 features

tidy_m<-tidy(dfm_men)
tidy_m

# A tibble: 28,758 × 3
   document term         count
   <chr>    <chr>        <dbl>
 1 text1    41-year-long     1
 2 text49   41-year-long     1
 3 text1    wait             1
 4 text34   wait             1
 5 text49   wait             2
 6 text52   wait             2
 7 text53   wait             4
 8 text55   wait             1
 9 text63   wait             1
10 text67   wait             1
# ℹ 28,748 more rows

men_dtm <- tidy_m %>%
  cast_dtm(document, term, count)
men_dtm

<<DocumentTermMatrix (documents: 191, terms: 7369)>>
Non-/sparse entries: 28758/1378721
Sparsity           : 98%
Maximal term length: 30
Weighting          : term frequency (tf)

The topics can be classified as follows:

Topic 1- Hockey match details

Topic 2- Hockey rewards

Topic 3- Hockey and cash awards

Topic 4- Hockey, Shooting 10 m air pistol, Tennis

Topic 5- More details about hockey, related to the coach

Topic 6- Archery, Hockey, multiple Olympic winners from the same university

Topic 7- Many of the medal winners- PV Sindhu: Bronze medal in Badminton, Bajrang Punia: Bronze medal in Wrestling, Neeraj Chopra: Gold medal in Javelin throw, Ravi Kumar Dahiya: Silver medal in Wrestling

Most of the topics concern the men’s hockey team’s victory, including the details of the match and people’s reactions to it. The other people discussed in the corpus are also medallists. This suggests that, more than gender, the Indian newspapers focused on the athletes who achieved victories. Moreover, even though this was the men’s corpus, the female badminton player PV Sindhu was among the top terms in topic 7. This shows that the tags present in the metadata were not completely accurate.
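One way to follow up on the tagging issue would be to count how many articles tagged as men’s sports mention an athlete from a women’s event. The sketch below uses a toy dataframe in place of df_Men; the column names body and Classification are taken from the output shown earlier, while the texts and the search term are assumptions for illustration.

```r
# Toy stand-in for df_Men (invented texts, same column names)
toy_men <- data.frame(
  body = c("Men's hockey team wins bronze after 41 years",
           "PV Sindhu wins bronze in badminton"),
  Classification = c("Men", "Men"),
  stringsAsFactors = FALSE
)

# Flag articles in the men's subset that mention the athlete by surname
flagged <- toy_men[grepl("sindhu", toy_men$body, ignore.case = TRUE), ]
nrow(flagged)
```

Running the same filter on the real df_Men would show whether the term comes from a handful of mistagged articles or from many articles that mention several medallists together.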

lda_men<- LDA(men_dtm, k = 7, control = list(seed = 2345))
lda_men

A LDA_VEM topic model with 7 topics.

#extracting per-topic-per-word probabilities
topics_men <- tidy(lda_men, matrix = "beta")
topics_men

# A tibble: 51,583 × 3
   topic term              beta
   <int> <chr>            <dbl>
 1     1 41-year-long 1.02e- 73
 2     2 41-year-long 1.41e- 73
 3     3 41-year-long 1.02e- 73
 4     4 41-year-long 1.16e- 73
 5     5 41-year-long 1.77e-  4
 6     6 41-year-long 1.22e-  4
 7     7 41-year-long 4.00e-218
 8     1 wait         2.26e- 73
 9     2 wait         3.72e- 44
10     3 wait         1.75e-  4
# ℹ 51,573 more rows

#Finding the top 20 terms
top_20_m <- topics_men %>%
  group_by(topic) %>%
  slice_max(beta, n = 20) %>% 
  ungroup() %>%
  arrange(topic, -beta)

top_20_m%>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  scale_y_reordered()+
  labs(title = "Topic Modelling for the Men's Corpora")

Since topics 2 and 3 both contain words related to hockey and rewards, I checked for the words with the greatest difference between the two topics. The words that are more common in topic 2 include men’s, hockey, team and medal, whereas the words more common in topic 3 include bronze, rs (rupees, the Indian currency), singh, contingent, athletes and village. This indicates that topic 2 might have more information specific to the details of hockey and the medal, whereas topic 3 has miscellaneous information such as cash prizes and the captain of the hockey team.

beta_2_3<- topics_men %>%
  mutate(topic = paste0("topic", topic))%>%
  filter(topic=="topic2"|topic=="topic3")%>%
  pivot_wider(names_from =topic, values_from = beta)%>% 
  filter(topic2 > .006 | topic3 > .006) %>%
  mutate(log_ratio = log2(topic2/ topic3))

beta_2_3%>%select(log_ratio)%>%max()

[1] 1.029846

beta_2_3%>%select(log_ratio)%>%min()

[1] -234.7674
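A minimum of about -235 on the log2 scale means that topic 2 assigns the term a probability that is numerically zero (the beta table above shows values around 1e-73), so the ratio reflects the floor of the estimate rather than a meaningful contrast. The sketch below, on invented beta values, shows one possible way to flag such terms before plotting; the 1e-10 cut-off and the near_zero column are assumptions, not part of the original analysis.

```r
library(dplyr)

# Invented beta values; "village" mimics a term that topic 2 never uses
toy_beta <- tibble(
  term   = c("hockey", "medal", "village"),
  topic2 = c(0.020, 0.012, 1e-73),
  topic3 = c(0.010, 0.012, 0.007)
)

toy_ratio <- toy_beta %>%
  filter(topic2 > .006 | topic3 > .006) %>%
  mutate(
    log_ratio = log2(topic2 / topic3),
    # TRUE when one of the two betas is effectively zero
    near_zero = pmin(topic2, topic3) < 1e-10
  )

toy_ratio
```

Terms flagged this way belong almost exclusively to one topic, so their bars dominate the plot and compress the scale for all other terms.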

new2<-beta_2_3 %>%
  group_by(direction = log_ratio > 0) %>%
  top_n(15, abs(log_ratio)) %>%
  ungroup() %>%
  mutate(term = reorder(term, log_ratio)) %>%
  ggplot(aes(term, log_ratio)) +
  geom_col() +
  labs(y = "Log2 ratio of beta in topic 2/ topic 3",title="Words with the Greatest Difference in Topics 2 and 3 in the Men's Corpora") +
  coord_flip()

new2

#ggplotly(new2)

Topic models for the corpus with both men and women

dfm_both <- dfm_remove(dfm_both, c("olympics","olympic","india","indian","tokyo","sports","#tokyo2020","2020","2021","india's"), verbose = TRUE)

removed 10 features

tidy_b<-tidy(dfm_both)
tidy_b

# A tibble: 25,265 × 3
   document term   count
   <chr>    <chr>  <dbl>
 1 text1    enters     1
 2 text1    fifth      1
 3 text4    fifth      1
 4 text69   fifth      1
 5 text88   fifth      3
 6 text91   fifth      1
 7 text134  fifth      1
 8 text1    day        2
 9 text2    day        4
10 text3    day        3
# ℹ 25,255 more rows

both_dtm <- tidy_b %>%
  cast_dtm(document, term, count)
both_dtm

<<DocumentTermMatrix (documents: 148, terms: 7350)>>
Non-/sparse entries: 25265/1062535
Sparsity           : 98%
Maximal term length: 26
Weighting          : term frequency (tf)

The topics can be classified as follows:

Topic 1- Hockey, Tennis and Table Tennis

Topics 2 and 3- No clear topic can be distinguished

Topic 4- Odisha (Indian state) Government’s rewards for Hockey players

Topic 5- Hockey details and rewards

Topic 6- Reactions to the hockey team’s victory

Topic 7- Neeraj Chopra’s achievement in Javelin throw and receipt of highest sporting honour in India

Topic 8- Badminton and badminton player PV Sindhu

There is not much difference in this corpus when compared to the other two. Aspects of hockey remain common across multiple topics. The other prominent sports players that stood out in this corpus were PV Sindhu (female badminton player) and Neeraj Chopra (male track and field athlete), who won the bronze and gold medals respectively.

lda_both<- LDA(both_dtm, k = 8, control = list(seed = 2345))
lda_both

A LDA_VEM topic model with 8 topics.

#extracting per-topic-per-word probabilities
topics_both <- tidy(lda_both, matrix = "beta")
topics_both

# A tibble: 58,800 × 3
   topic term        beta
   <int> <chr>      <dbl>
 1     1 enters 4.72e- 12
 2     2 enters 1.21e-214
 3     3 enters 9.35e-215
 4     4 enters 1.66e-213
 5     5 enters 1.27e-215
 6     6 enters 3.25e-212
 7     7 enters 2.11e-213
 8     8 enters 1.89e-  4
 9     1 fifth  8.47e- 20
10     2 fifth  9.62e-130
# ℹ 58,790 more rows
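
As a rough sanity check on this output: the tidy beta table has one row per topic-term pair, so 8 topics × 7,350 terms gives the 58,800 rows reported. Within each topic, the betas form a probability distribution over the vocabulary, which is why most individual values are tiny. The sketch below illustrates both points with a small made-up table (toy_beta and its values are hypothetical, not taken from the model).

```r
# Row count of the tidy beta table: one row per (topic, term) pair
n_topics <- 8
n_terms  <- 7350
n_topics * n_terms
#> [1] 58800

# Hypothetical toy beta table: 2 topics, 3 terms (values are made up)
toy_beta <- data.frame(
  topic = rep(1:2, each = 3),
  term  = rep(c("hockey", "medal", "bronze"), times = 2),
  beta  = c(0.6, 0.3, 0.1, 0.2, 0.3, 0.5)
)

# Within each topic, the betas add up to 1 (up to floating-point error)
beta_sums <- tapply(toy_beta$beta, toy_beta$topic, sum)
beta_sums
```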

#Finding the top 20 terms
top_20_b <- topics_both%>%
  group_by(topic) %>%
  slice_max(beta, n = 20) %>% 
  ungroup() %>%
  arrange(topic, -beta)

top_20_b%>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  scale_y_reordered()

Conclusion

To summarise, the topic models did not show a difference in the language used to describe women’s and men’s sports. Most of the athletes mentioned in the Indian newspapers had won medals or progressed to the final rounds of their respective sports. The modelling did, however, reveal which aspects of the Olympics were most discussed in India, the most popular being the victory of the Indian men’s hockey team. The media covered various features of hockey, including details of the matches, awards and people’s reactions to the Indian team winning. Overall, men’s and women’s sports were reported in a similar way, and an interesting finding was that the media mainly focused on the players who won medals.
One limitation was that the metadata classification was not always accurate: a few articles about women’s sports were labelled as men’s sports and vice versa. Another limitation was that a few duplicate articles could not be removed even after using the distinct function, which may have made some words appear more often in the topics than they actually did.
Future research can incorporate more categories beyond sports and a longer time period in order to determine whether a gender bias in Indian newspapers exists.

References

Devinney, H., Björklund, J., & Björklund, H. (2020). Semi-Supervised Topic Modeling for Gender Bias Discovery in English and Swedish. Proceedings of the Second Workshop on Gender Bias in Natural Language Processing, 79-92. https://aclanthology.org/2020.gebnlp-1.8

Nexis Data Lab. (2022). Olympics. Retrieved October 24, 2022, from https://advance.lexis.com/nexisdatalabhome/?pdmfid=1534561&crid=7e8f5fed-48c5-45a3-a2dfbd9438b5d050&ecomp=zd54k&prid=6f411be1-3913-4a48-95be6a5c4d8b2367

Grün, B., Hornik, K., Blei, D. M., Lafferty, J. D., Phan, X.-H., Matsumoto, M., Nishimura, T., & Cokus, S. (2022, December 6). topicmodels: Topic Models. The Comprehensive R Archive Network. https://cran.r-project.org/web/packages/topicmodels/

R Core Team. (2022). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.

Rao, P., & Taboada, M. (2021). Gender Bias in the News: A Scalable Topic Modelling and Visualization Framework. Frontiers in Artificial Intelligence, 4, 664737. https://doi.org/10.3389/frai.2021.664737

Roberts, M., Stewart, B., Tingley, D., & Benoit, K. (2022, October 14). STM: Estimation of the structural topic model. The Comprehensive R Archive Network. https://cran.r-project.org/web/packages/stm/stm.pdf

Silge, J., & Robinson, D. (2017). Text Mining with R: A Tidy Approach. O’Reilly Media. https://www.tidytextmining.com/topicmodeling.html