Using text analysis to identify the predominant gender in a text

2022-06-25 04:37:00 【Triumph19】

Building the gender vocabulary

MALE = 'male'
FEMALE = 'female'
UNKNOWN = 'unknown'  # the sentence contains neither male nor female words
BOTH = 'both'  # the sentence contains both male and female words

MALE_WORDS = set([
    'guy','spokesman','chairman',"men's",'men','him',"he's",'his',
    'boy','boyfriend','boyfriends','boys','brother','brothers','dad',
    'dads','dude','father','fathers','fiance','gentleman','gentlemen',
    'god','grandfather','grandpa','grandson','groom','he','himself',
    'husband','husbands','king','male','man','mr','nephew','nephews',
    'priest','prince','son','sons','uncle','uncles','waiter','widower',
    'widowers'
])

FEMALE_WORDS = set([
    'heroine','spokeswoman','chairwoman',"women's",'actress','women',
    "she's",'her','aunt','aunts','bride','daughter','daughters','female',
    'fiancee','girl','girlfriend','girlfriends','girls','goddess',
    'granddaughter','grandma','grandmother','herself','ladies','lady',
    'mom','moms','mother','mothers','mrs','ms','niece','nieces',
    'priestess','princess','queens','she','sister','sisters','waitress',
    'widow','widows','wife','wives','woman'
])

Now that we have a gender vocabulary, we need a way to assign a gender to each sentence. We define a genderize function that counts how many words of a sentence appear in the MALE_WORDS set and in the FEMALE_WORDS set. If a sentence contains only MALE_WORDS, we call it a male sentence; if it contains only FEMALE_WORDS, we call it a female sentence. If a sentence has a non-zero count for both, we call it both; if it contains neither male nor female words, we call it unknown:

Building a function for assigning gender

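def genderize(words):
    """Classify a tokenized sentence (an iterable of lowercase words)."""
    # Number of distinct male / female vocabulary words found in the sentence
    mwlen = len(MALE_WORDS.intersection(words))
    fwlen = len(FEMALE_WORDS.intersection(words))
    if mwlen > 0 and fwlen == 0:
        return MALE
    elif mwlen == 0 and fwlen > 0:
        return FEMALE
    elif mwlen > 0 and fwlen > 0:
        return BOTH
    else:
        return UNKNOWN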

For details on the intersection() method, see: Python set intersection() method.
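To make the role of the set intersection concrete, here is a minimal sketch. It uses a deliberately tiny, made-up vocabulary (`male_words`, `female_words`) in place of the full sets above, and a hypothetical helper `classify` that mirrors the logic of `genderize`; none of these names come from the original code.

```python
# Tiny illustrative vocabularies (assumption: stand-ins for MALE_WORDS / FEMALE_WORDS).
male_words = {"he", "his", "him", "man"}
female_words = {"she", "her", "woman"}

def classify(words):
    """Hypothetical mirror of genderize(): label a tokenized sentence."""
    has_male = bool(male_words.intersection(words))
    has_female = bool(female_words.intersection(words))
    if has_male and has_female:
        return "both"
    if has_male:
        return "male"
    if has_female:
        return "female"
    return "unknown"

# intersection() accepts any iterable, so a plain list of tokens works.
# Repeated tokens ("he" twice) count only once because the result is a set.
print(sorted(male_words.intersection(["he", "took", "his", "hat", "he"])))  # ['he', 'his']

print(classify(["she", "danced"]))                 # female
print(classify(["he", "saw", "her"]))              # both
print(classify(["the", "stage", "was", "empty"]))  # unknown
```

Because the lookup is an exact match on set members, the tokens need to be lowercased before classification, which is what the parsing step later in the article does.
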
We also need a way to count how often gendered words and sentences occur in the text, which we do with Python's built-in collections.Counter class. The count_gender function takes a list of sentences and applies genderize to each one to tally the total number of gendered words and sentences.
Each sentence is counted under its gender, and every word in the sentence is also counted as belonging to that gender:

Count the number of gendered words and sentences

from collections import Counter


def count_gender(sentences):
    sents = Counter()
    words = Counter()

    for sentence in sentences:
        gender = genderize(sentence)  # classify the sentence as male, female, both or unknown
        sents[gender] += 1  # count one more sentence for that gender
        words[gender] += len(sentence)  # credit every token in the sentence to that gender
    return sents, words

For details on Counter, see: Python collections.Counter usage.
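As a quick illustration of the counting step, the sketch below tallies sentences and words per label with `Counter`. The labels are supplied by hand here (an assumption made to keep the example self-contained, and the `labelled` data is invented); in the article's code they come from `genderize`.

```python
from collections import Counter

# Hypothetical pre-labelled, tokenized sentences: (label, tokens).
labelled = [
    ("female", ["she", "danced", "all", "night"]),
    ("male", ["he", "watched"]),
    ("female", ["her", "solo", "was", "flawless", "."]),
    ("unknown", ["the", "curtain", "fell"]),
]

sents = Counter()
words = Counter()
for label, tokens in labelled:
    sents[label] += 1            # missing keys start at 0, so no initialisation is needed
    words[label] += len(tokens)  # every token, punctuation included, is credited to the label

print(sents["female"], words["female"])  # 2 9
print(words["both"])                     # 0 (a label that was never seen)
```

Note that punctuation tokens are included in the word totals, since the count is simply the length of the token list.
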
Finally, to use the gender counter, we need to parse the raw article text into sentences and words. For this we use the NLTK library (discussed further later in this chapter) to split paragraphs into sentences. Once the sentences are split, each one can be tokenized into individual words and punctuation marks, and the tokenized text is passed to the gender counter, which prints the percentage of the document that is male, female, both or unknown:

Count the percentage of gendered sentences and words

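import nltk


def parse_gender(text):
    # Split the text into sentences, then each sentence into lowercase tokens.
    # Note: the NLTK tokenizers need the Punkt data to be installed
    # (e.g. via nltk.download('punkt'); the resource name may differ by NLTK version).
    sentences = [
        [word.lower() for word in nltk.word_tokenize(sentence)]
        for sentence in nltk.sent_tokenize(text)
    ]
    sents, words = count_gender(sentences)
    total = sum(words.values())

    for gender, count in words.items():
        pcent = (count / total) * 100
        nsents = sents[gender]

        print(
            "{:0.3f}% {} ({} sentences)".format(pcent, gender, nsents)
        )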

if __name__ == '__main__':
    with open('ballet.txt', 'r',encoding='utf8') as f:
        parse_gender(f.read())

39.269% unknown (48 sentences)
52.994% female (38 sentences)
4.393% both (2 sentences)
3.344% male (3 sentences)

Here, the scoring function takes the length of each sentence, measured in words, into account. As a result, although fewer sentences are classified as female than as unknown (38 versus 48), more than 50% of the article's words fall in female sentences. Extending this technique, we could analyze which words occur in male and female sentences to see whether other common words are associated with the male or female gender.
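
One possible starting point for that extension is sketched below. It assumes the sentences have already been tokenized and labelled; the `labelled` list is made-up data, and `cooccurring_words` and `stopwords` are hypothetical names that are not part of the article's code. The sketch counts which other words appear in sentences of each gender.

```python
from collections import Counter, defaultdict

def cooccurring_words(labelled_sentences, exclude=frozenset()):
    """Count, per gender label, the words appearing in sentences with that label.

    labelled_sentences: iterable of (label, tokens) pairs.
    exclude: words to skip, e.g. the gender vocabulary itself or stopwords.
    """
    counts = defaultdict(Counter)
    for label, tokens in labelled_sentences:
        # Keep alphabetic tokens only, so punctuation is ignored.
        counts[label].update(t for t in tokens if t.isalpha() and t not in exclude)
    return counts

# Invented example data; in practice the labels would come from genderize().
labelled = [
    ("female", ["she", "danced", "the", "lead", "."]),
    ("female", ["her", "debut", "was", "praised", "."]),
    ("male", ["he", "directed", "the", "company", "."]),
]
stopwords = {"she", "her", "he", "the", "was"}

counts = cooccurring_words(labelled, exclude=stopwords)
print(counts["female"].most_common(3))
print(counts["male"].most_common(3))
```

On a real corpus, comparing the most common words per label, ideally normalized by how many words each label has in total, would give a first indication of which terms tend to accompany male or female sentences.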