Code power is full of quantitative learning | How to find suitable financial announcements among financial reports
2022-07-01 06:25:00 【Code power full quantization】
When investing, we often read company announcements, but how can we work through them efficiently? This article traverses all the financial announcement PDF files in a folder,
parses their text, and screens the announcements by keyword to pick out the ones that meet our criteria.
1. First, traverse the folder and build the path of every PDF file in it. The paths are stored in a list so that the later steps can use them.
import os
import pdfplumber

path = r"C:\Users\86186\PycharmProjects\online\spider\requests\Financial statements"  # Folder that stores the financial statements
# Collect the paths of all PDF files under the folder
file_list = []
for root, dirs, files in os.walk(path):  # Walk the folder and all of its subfolders
    for file in files:  # Every file in the current folder
        if os.path.splitext(file)[1].lower() == '.pdf':  # Check the file extension, ignoring case
            file_list.append(os.path.join(root, file))  # Join with the folder the file is actually in
print(file_list)
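The same search can also be written with pathlib. The following is only a sketch, not part of the original script: the helper name find_pdfs is hypothetical, and it assumes that files in subfolders should be included as well.

```python
from pathlib import Path


def find_pdfs(folder):
    """Return the sorted paths of all PDF files under folder, including subfolders."""
    return sorted(
        str(p)
        for p in Path(folder).rglob("*")  # every entry below folder, at any depth
        if p.is_file() and p.suffix.lower() == ".pdf"  # match .pdf and .PDF alike
    )
```

Calling find_pdfs(path) with the path variable from step 1 would return a list that can be used in place of file_list.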
2. Open each PDF with pdfplumber, extract the text of every page with extract_text(), and join the page texts into a single string.
# Parse the text of each PDF
pdf_all = []
for i in range(len(file_list)):
    with pdfplumber.open(file_list[i]) as pdf:  # Open each PDF; it is closed when the block ends
        text_all = []
        for page in pdf.pages:  # Go through every page
            text = page.extract_text() or ""  # Text of the current page; a page without text may yield None
            text_all.append(text)  # Collect the text of each page
    text_all = "".join(text_all)  # Join the page texts into one string
3. Check the extracted text against a keyword filter. A PDF is kept only if its text contains at least one of the keywords "Overweight", "cash management" or "conduct financial transactions"; other keywords, such as "bill", can be added in the same way.
Natural language processing and machine learning could be used here for more in-depth analysis.
    # Still inside the loop over file_list from step 2
    if ("Overweight" in text_all) or ("cash management" in text_all) or ("conduct financial transactions" in text_all):
        pdf_all.append(file_list[i])
print(pdf_all)  # Print the PDF files that passed the filter
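Text extracted from a PDF often has line breaks in the middle of a phrase, so a plain substring test can miss a keyword that is split across two lines. A minimal sketch of a more tolerant check follows; the names KEYWORDS and matched_keywords are hypothetical, and collapsing whitespace to a single space is an assumption that suits space-separated text (for text written without spaces between words, removing the whitespace entirely may fit better).

```python
import re

# Keyword list taken from the filter above; extend it as needed.
KEYWORDS = ("Overweight", "cash management", "conduct financial transactions")


def matched_keywords(text, keywords=KEYWORDS):
    """Return the keywords found in text, ignoring line breaks and repeated spaces."""
    normalized = re.sub(r"\s+", " ", text or "")  # collapse every whitespace run to one space
    return [kw for kw in keywords if kw in normalized]
```

A file would then be kept when matched_keywords(text_all) returns a non-empty list, and the returned list also records which keywords caused the match.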
4. Finally, build a new path for each PDF that passed the filter and move the file into the new folder.
# Move the filtered PDF files
new_dir = r"C:\Users\86186\PycharmProjects\online\spider\requests\Filtered folder"
os.makedirs(new_dir, exist_ok=True)  # Create the target folder if it does not exist yet
for pdf_i in pdf_all:
    new_path = os.path.join(new_dir, os.path.basename(pdf_i))  # Same file name, new folder
    os.rename(pdf_i, new_path)  # Move the file
print("PDF text parsing and filtering are complete!")
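os.rename can fail when the target folder is on another drive, and on Windows it raises an error if a file with the same name already exists there. A possible alternative, given here only as a sketch, uses shutil.move; the helper name move_to_folder and the numbered-suffix naming scheme are assumptions and not part of the original script.

```python
import os
import shutil


def move_to_folder(src, dest_dir):
    """Move src into dest_dir without overwriting; return the new path."""
    os.makedirs(dest_dir, exist_ok=True)  # create the target folder if needed
    name = os.path.basename(src)
    stem, ext = os.path.splitext(name)
    target = os.path.join(dest_dir, name)
    counter = 1
    while os.path.exists(target):  # name taken: try name_1.pdf, name_2.pdf, ...
        target = os.path.join(dest_dir, f"{stem}_{counter}{ext}")
        counter += 1
    shutil.move(src, target)  # also works across drives
    return target
```

In the loop above, move_to_folder(pdf_i, folder) could replace the os.rename call, where folder is the path of the target folder.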
5. Complete code
import os
import pdfplumber

path = r"C:\Users\86186\PycharmProjects\online\spider\requests\Financial statements"  # Folder that stores the financial statements
# Collect the paths of all PDF files under the folder
file_list = []
for root, dirs, files in os.walk(path):  # Walk the folder and all of its subfolders
    for file in files:  # Every file in the current folder
        if os.path.splitext(file)[1].lower() == '.pdf':  # Check the file extension, ignoring case
            file_list.append(os.path.join(root, file))  # Join with the folder the file is actually in
print(file_list)

# Parse the text of each PDF and filter by keyword
pdf_all = []
for i in range(len(file_list)):
    with pdfplumber.open(file_list[i]) as pdf:  # Open each PDF; it is closed when the block ends
        text_all = []
        for page in pdf.pages:  # Go through every page
            text = page.extract_text() or ""  # Text of the current page; a page without text may yield None
            text_all.append(text)  # Collect the text of each page
    text_all = "".join(text_all)  # Join the page texts into one string
    # Keep the file if its text contains any of the keywords
    if ("Overweight" in text_all) or ("cash management" in text_all) or ("conduct financial transactions" in text_all):
        pdf_all.append(file_list[i])
print(pdf_all)  # Print the PDF files that passed the filter

# Move the filtered PDF files
new_dir = r"C:\Users\86186\PycharmProjects\online\spider\requests\Filtered folder"
os.makedirs(new_dir, exist_ok=True)  # Create the target folder if it does not exist yet
for pdf_i in pdf_all:
    new_path = os.path.join(new_dir, os.path.basename(pdf_i))  # Same file name, new folder
    os.rename(pdf_i, new_path)  # Move the file
print("PDF text parsing and filtering are complete!")