Use text analysis to identify the main gender in a text
2022-06-25 04:37:00 【Triumph19】
Building gender vocabulary
MALE = 'male'
FEMALE = 'female'
UNKNOWN = 'unknown' # This sentence is neither about men nor women
BOTH = 'both'
MALE_WORDS = set([
'guy','spokesman','chairman',"men's",'men','him',"he's",'his',
'boy','boyfriend','boyfriends','boys','brother','brothers','dad',
'dads','dude','father','fathers','fiance','gentleman','gentlemen',
'god','grandfather','grandpa','grandson','groom','he','himself',
'husband','husbands','king','male','man','mr','nephew','nephews',
'priest','prince','son','sons','uncle','uncles','waiter','widower',
'widowers'
])
FEMALE_WORDS = set([
'heroine','spokeswoman','chairwoman',"women's",'actress','women',
"she's",'her','aunt','aunts','bride','daughter','daughters','female',
'fiancee','girl','girlfriend','girlfriends','girls','goddess',
'granddaughter','grandma','grandmother','herself','ladies','lady',
'mom','moms','mother','mothers','mrs','ms','niece','nieces',
'priestess','princess','queens','she','sister','sisters','waitress',
'widow','widows','wife','wives','woman'
])
- Now that we have a gender vocabulary, we need a way to assign a gender to each sentence. We create a genderize function that counts how many distinct words from the MALE_WORDS and FEMALE_WORDS sets appear in a sentence. If a sentence contains only words from MALE_WORDS, we call it a male sentence; if it contains only words from FEMALE_WORDS, we call it a female sentence. If it has a nonzero count for both, we call it both; if it has neither male nor female words, we call it unknown:
Building a function for assigning gender
def genderize(words):
    mwlen = len(MALE_WORDS.intersection(words))
    fwlen = len(FEMALE_WORDS.intersection(words))
    if mwlen > 0 and fwlen == 0:
        return MALE
    elif mwlen == 0 and fwlen > 0:
        return FEMALE
    elif mwlen > 0 and fwlen > 0:
        return BOTH
    else:
        return UNKNOWN
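As a quick check of how the function behaves, here is a minimal sketch that applies the same logic to a few hand-tokenized sentences. The vocabularies are cut down to a handful of words so that the block runs on its own, and the sample sentences are made up for illustration; neither comes from the original article.

```python
MALE, FEMALE, UNKNOWN, BOTH = 'male', 'female', 'unknown', 'both'

# Assumption: a small subset of the full vocabularies, for illustration only.
MALE_WORDS = {'he', 'his', 'him', 'man', 'father'}
FEMALE_WORDS = {'she', 'her', 'woman', 'mother'}


def genderize(words):
    # intersection() accepts any iterable, so a list of tokens works directly.
    mwlen = len(MALE_WORDS.intersection(words))
    fwlen = len(FEMALE_WORDS.intersection(words))
    if mwlen > 0 and fwlen == 0:
        return MALE
    elif mwlen == 0 and fwlen > 0:
        return FEMALE
    elif mwlen > 0 and fwlen > 0:
        return BOTH
    else:
        return UNKNOWN


# Made-up sentences, already lowercased and split into tokens.
examples = [
    ['he', 'thanked', 'his', 'father', '.'],
    ['she', 'danced', 'with', 'her', 'mother', '.'],
    ['he', 'asked', 'her', 'to', 'dance', '.'],
    ['the', 'stage', 'was', 'empty', '.'],
]
labels = [genderize(tokens) for tokens in examples]
for tokens, label in zip(examples, labels):
    print(label, '<-', ' '.join(tokens))
```

The comparison is case-sensitive, so a token such as 'He' would not match; the tokens need to be lowercased before they are passed in.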
- For details on the intersection() method, see: Python set intersection() method
- We also need a way to count how often gendered words and sentences occur in the text. We use Python's built-in collections.Counter class for the counting. The count_gender function takes a list of sentences and applies genderize to each one to tally the total number of gendered words and sentences.
- Each sentence is counted under its gender, and every word in that sentence is also counted as belonging to that gender:
Count the number of gendered words and sentences
from collections import Counter

def count_gender(sentences):
    sents = Counter()
    words = Counter()
    for sentence in sentences:
        gender = genderize(sentence)    # label the sentence: male, female, both, or unknown
        sents[gender] += 1              # count one more sentence with this label
        words[gender] += len(sentence)  # credit every token in the sentence to this label
    return sents, words
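To make the two counters concrete, the following sketch runs the same counting logic on three made-up sentences. The tiny vocabularies and the sentences are assumptions chosen for illustration, and genderize is restated in compact form only so that the block is self-contained.

```python
from collections import Counter

# Assumption: abbreviated vocabularies, not the full sets from the article.
MALE_WORDS = {'he', 'his', 'him'}
FEMALE_WORDS = {'she', 'her'}


def genderize(words):
    has_male = bool(MALE_WORDS.intersection(words))
    has_female = bool(FEMALE_WORDS.intersection(words))
    if has_male and has_female:
        return 'both'
    if has_male:
        return 'male'
    if has_female:
        return 'female'
    return 'unknown'


def count_gender(sentences):
    sents = Counter()
    words = Counter()
    for sentence in sentences:
        gender = genderize(sentence)
        sents[gender] += 1
        words[gender] += len(sentence)
    return sents, words


# Made-up tokenized sentences: one long female sentence, two short male ones.
sentences = [
    ['she', 'rehearsed', 'every', 'morning', 'before', 'class', '.'],
    ['he', 'watched', '.'],
    ['he', 'left', '.'],
]
sents, words = count_gender(sentences)
print(sents)
print(words)
```

In this toy input the male label wins by sentence count (2 against 1) while the female label wins by word count (7 against 6), because len(sentence) counts every token, punctuation included.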
- For details on Counter, see: Python collections.Counter usage
- Finally, to use the gender counter, we need to parse the raw text of the article into sentences and words. For this we use the NLTK library (discussed further later in this chapter) to split paragraphs into sentences. Once the sentences are split, each one can be tokenized into its individual words and punctuation marks. The tokenized text is then passed to the gender counter, which prints the percentage of the document that is male, female, both, or unknown:
Count the percentage of gendered sentences and words
import nltk

def parse_gender(text):
    # sent_tokenize and word_tokenize rely on NLTK's Punkt tokenizer data being installed
    sentences = [
        [word.lower() for word in nltk.word_tokenize(sentence)]
        for sentence in nltk.sent_tokenize(text)
    ]
    sents, words = count_gender(sentences)
    total = sum(words.values())
    for gender, count in words.items():
        pcent = (count / total) * 100
        nsents = sents[gender]
        print(
            "{:0.3f}% {} ({} sentences)".format(pcent, gender, nsents)
        )

if __name__ == '__main__':
    with open('ballet.txt', 'r', encoding='utf8') as f:
        parse_gender(f.read())
39.269% unknown (48 sentences)
52.994% female (38 sentences)
4.393% both (2 sentences)
3.344% male (3 sentences)
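The percentages printed above are weighted by word count. For comparison, this short sketch computes the share of each label by sentence count alone, using only the sentence counts that appear in that output; the variable names are illustrative.

```python
# Sentence counts copied from the printed output above.
sentence_counts = {'unknown': 48, 'female': 38, 'both': 2, 'male': 3}

total = sum(sentence_counts.values())
sentence_share = {
    gender: 100 * count / total
    for gender, count in sentence_counts.items()
}

for gender, share in sentence_share.items():
    print("{:0.3f}% {} by sentence count".format(share, gender))
```

By sentence count the unknown label is the largest group, whereas by word count the female label is; the gap between the two views comes from differences in sentence length.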
- Here, the scoring function weights each sentence by its length in words. So although fewer sentences were labeled female (38) than unknown (48), more than 50% of the article's words fall in female sentences. Extending this technique, we can analyze which words occur in male and female sentences to see whether other common words are associated with either gender.
This article is drawn from the book "Intelligent Text Analysis Based on Python" by Benjamin Bengfort et al., translated by Chen Guang.
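One possible way to start on that extension is sketched below: it tallies the words that are not themselves in the gender vocabulary, grouped by the label of the sentence they occur in. The helper name words_by_gender, the abbreviated vocabularies, and the sample sentences are all assumptions made for illustration and do not come from the book.

```python
from collections import Counter, defaultdict

# Assumption: abbreviated vocabularies, not the full sets from the article.
MALE_WORDS = {'he', 'his', 'him'}
FEMALE_WORDS = {'she', 'her'}


def genderize(words):
    has_male = bool(MALE_WORDS.intersection(words))
    has_female = bool(FEMALE_WORDS.intersection(words))
    if has_male and has_female:
        return 'both'
    if has_male:
        return 'male'
    if has_female:
        return 'female'
    return 'unknown'


def words_by_gender(sentences):
    """Count non-vocabulary word tokens separately for each sentence label."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        gender = genderize(sentence)
        for token in sentence:
            # Skip punctuation and the gender words themselves.
            if (token.isalpha()
                    and token not in MALE_WORDS
                    and token not in FEMALE_WORDS):
                counts[gender][token] += 1
    return counts


# Made-up, already lowercased and tokenized sentences.
sentences = [
    ['she', 'danced', 'the', 'lead', 'role', '.'],
    ['her', 'teacher', 'danced', 'too', '.'],
    ['he', 'directed', 'the', 'company', '.'],
]
counts = words_by_gender(sentences)
print(counts['female'].most_common(3))
print(counts['male'].most_common(3))
```

On real text, very frequent function words such as "the" would likely dominate these tallies, so some form of stop-word filtering would probably be needed before the counts say much.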