Using Text Analysis to Identify the Predominant Gender in a Text
2022-06-25 04:00:00 【Triumph19】
Building the gendered word sets
MALE = 'male'
FEMALE = 'female'
UNKNOWN = 'unknown'  # the sentence is about neither men nor women
BOTH = 'both'
MALE_WORDS = set([
'guy','spokesman','chairman',"men's",'men','him',"he's",'his',
'boy','boyfriend','boyfriends','boys','brother','brothers','dad',
'dads','dude','father','fathers','fiance','gentleman','gentlemen',
'god','grandfather','grandpa','grandson','groom','he','himself',
'husband','husbands','king','male','man','mr','nephew','nephews',
'priest','prince','son','sons','uncle','uncles','waiter','widower',
'widowers'
])
FEMALE_WORDS = set([
'heroine','spokeswoman','chairwoman',"women's",'actress','women',
"she's",'her','aunt','aunts','bride','daughter','daughters','female',
'fiancee','girl','girlfriend','girlfriends','girls','goddess',
'granddaughter','grandma','grandmother','herself','ladies','lady',
'mom','moms','mother','mothers','mrs','ms','niece','nieces',
'priestess','princess','queens','she','sister','sisters','waitress',
'widow','widows','wife','wives','woman'
])
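- Now that we have the gendered word sets, we need a way to assign a gender to a sentence. The genderize function counts how many distinct words in a sentence appear in MALE_WORDS and how many appear in FEMALE_WORDS. If a sentence contains only words from MALE_WORDS, we call it a male sentence; if it contains only words from FEMALE_WORDS, we call it a female sentence. If it has a nonzero count for both, we call it both; if it contains neither male nor female words, we call it unknown:
Building the gender-assignment function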
def genderize(words):
    # Count the distinct gendered words that occur in the sentence
    mwlen = len(MALE_WORDS.intersection(words))
    fwlen = len(FEMALE_WORDS.intersection(words))

    if mwlen > 0 and fwlen == 0:
        return MALE
    elif mwlen == 0 and fwlen > 0:
        return FEMALE
    elif mwlen > 0 and fwlen > 0:
        return BOTH
    else:
        return UNKNOWN
- For details on the intersection() method, see: Python set intersection() method
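As a quick illustration of why intersection() suits this task, the sketch below uses two tiny hypothetical word sets (not the full sets above) and a hand-tokenized sentence. Note that intersection() accepts any iterable, so a plain list of tokens works, and that the result is a set, so a word repeated in the sentence is counted only once.

```python
# Hypothetical mini word sets, for illustration only
male_words = {"he", "his", "him"}
female_words = {"she", "her"}

tokens = ["he", "said", "his", "brother", "called", "his", "phone"]

male_hits = male_words.intersection(tokens)
female_hits = female_words.intersection(tokens)

print(sorted(male_hits))   # ['he', 'his']
print(len(male_hits))      # 2 ("his" appears twice but is counted once)
print(len(female_hits))    # 0
```

With these counts (2 male, 0 female), the decision rule in genderize would label the sentence male.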
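- We also need a way to count how often gendered words and sentences occur in an article, which we do with Python's built-in collections.Counter class. The count_gender function takes a list of sentences and applies genderize to each one to total the gendered words and gendered sentences.
- Each sentence's gender is tallied, and every word in the sentence is counted as belonging to that gender:
Counting gendered words and sentences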
from collections import Counter

def count_gender(sentences):
    sents = Counter()
    words = Counter()

    for sentence in sentences:
        gender = genderize(sentence)    # classify the sentence: male, female, both, or unknown
        sents[gender] += 1              # number of sentences per gender
        words[gender] += len(sentence)  # every token in the sentence is attributed to that gender

    return sents, words
- For details on Counter, see: Python collections.Counter usage
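The minimal sketch below shows the Counter behavior that count_gender relies on: a missing key reads as 0, so `+=` works without any initialization. The labels and tokens are made-up inputs, and the labels are supplied by hand here rather than computed by genderize.

```python
from collections import Counter

# Made-up (label, tokens) pairs standing in for genderize() results
labeled = [
    ("female", ["she", "danced", "beautifully", "."]),
    ("unknown", ["the", "curtain", "rose", "."]),
    ("female", ["her", "teacher", "applauded", "loudly", "afterwards", "."]),
]

sents = Counter()
words = Counter()
for label, tokens in labeled:
    sents[label] += 1            # missing keys start at 0
    words[label] += len(tokens)  # weight each sentence by its token count

print(sents["female"], words["female"])    # 2 10
print(sents["unknown"], words["unknown"])  # 1 4
print(sents["male"])                       # 0 (no KeyError for an unseen label)
```

Because the word counter adds the full token count of each sentence, one long sentence can outweigh several short ones when the percentages are computed.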
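- Finally, to use the gender counters we need to parse the raw text of the article into sentences and words. For this we use the NLTK library (discussed further later in this chapter) to split paragraphs into sentences. Once the sentences are split, each one is tokenized into its individual words and punctuation marks, and the tokenized text is passed to the gender counter, which prints the percentage of the document that is male, female, both, or unknown: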
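To make the shape of the tokenized data concrete without requiring NLTK and its data files, here is a rough stand-in built on regular expressions. It is not NLTK's algorithm: the helper name simple_tokenize and its splitting rules are assumptions for illustration, and it will mishandle abbreviations such as "Mr." at sentence boundaries. It also keeps a contraction like "she's" as one token, whereas NLTK's word_tokenize typically splits contractions into separate tokens.

```python
import re

def simple_tokenize(text):
    """Hypothetical, simplified stand-in for nltk.sent_tokenize + nltk.word_tokenize."""
    # Split after '.', '!' or '?' when followed by whitespace
    raw_sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # Keep word-like runs (allowing internal apostrophes) and single punctuation marks
    return [
        [tok.lower() for tok in re.findall(r"\w+(?:'\w+)*|[^\w\s]", sentence)]
        for sentence in raw_sentences
        if sentence
    ]

sentences = simple_tokenize("She's a dancer. He watched her!")
print(sentences)
# [["she's", 'a', 'dancer', '.'], ['he', 'watched', 'her', '!']]
```

The result is a list of sentences, each a list of lowercased tokens, which is the structure count_gender expects.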
Computing the percentages of gendered sentences and words
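import nltk

# Assumes NLTK's tokenizer data is installed (e.g. the 'punkt' resource;
# the exact resource name depends on the NLTK version)
def parse_gender(text):
    sentences = [
        [word.lower() for word in nltk.word_tokenize(sentence)]
        for sentence in nltk.sent_tokenize(text)
    ]

    sents, words = count_gender(sentences)
    total = sum(words.values())

    for gender, count in words.items():
        pcent = (count / total) * 100
        nsents = sents[gender]
        # "{:0.3f}" formats the percentage with three decimals; the original "{0.3f}" is not a valid format field
        print(
            "{:0.3f}% {} ({} sentences)".format(pcent, gender, nsents)
        )

if __name__ == '__main__':
    with open('ballet.txt', 'r', encoding='utf8') as f:
        parse_gender(f.read())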
39.269% unknown (48 sentences)
52.994% female (38 sentences)
4.393% both (2 sentences)
3.344% male (3 sentences)
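- Here, the scoring function weights each sentence by the number of words (tokens) it contains. As a result, although there are fewer female sentences than unknown ones (38 versus 48), more than 50% of the article is about women. Extending this technique, one could analyze which words occur in female sentences versus male sentences to see whether other common words are associated with either gender.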
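One way such an extension might be sketched: keep a separate Counter of ordinary words for each sentence label, then inspect the most frequent ones. Everything here is illustrative. The mini word sets, the label helper, and the three pre-tokenized sentences are assumptions, not data from the article.

```python
from collections import Counter, defaultdict

# Hypothetical mini word sets; the full MALE_WORDS / FEMALE_WORDS could be used instead
male_words = {"he", "his", "him"}
female_words = {"she", "her"}

def label(tokens):
    """Same decision rule as genderize, applied to the mini sets."""
    m = bool(male_words.intersection(tokens))
    f = bool(female_words.intersection(tokens))
    if m and f:
        return "both"
    if m:
        return "male"
    if f:
        return "female"
    return "unknown"

sentences = [
    ["she", "danced", "the", "lead", "role"],
    ["her", "teacher", "praised", "the", "role"],
    ["he", "directed", "the", "company"],
]

cooccurring = defaultdict(Counter)
for tokens in sentences:
    gender = label(tokens)
    # Count only the non-gendered words that share a sentence with gendered ones
    cooccurring[gender].update(
        t for t in tokens if t not in male_words and t not in female_words
    )

print(cooccurring["female"].most_common(2))  # [('the', 2), ('role', 2)]
```

In practice, common function words such as "the" will dominate these counts, so filtering stop words first is likely to be necessary before the results say anything about gender.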
This article is drawn from 《基于Python的智能文本分析》 (Applied Text Analysis with Python) by Benjamin Bengfort et al., translated by Chen Guang (陈光).