Code power is full of quantitative learning | How to find suitable financial announcements among financial reports
2022-07-01 06:25:00 【Code power full quantization】
When investing, we often read company announcements, but how can we get through them efficiently? This article walks through every financial announcement PDF in a folder, extracts the text of each file, and uses keyword matching on that text to pick out the announcements that meet our criteria.
1. First, walk through the folder and find every PDF file. Build the full path of each one and store the paths in a list so they are easy to use later.
import os
import pdfplumber

path = r"C:\Users\86186\PycharmProjects\online\spider\requests\Financial statements"  # Folder where the financial statements are stored
# Collect the path of every PDF file under the folder
file_list = []
for root, dirs, files in os.walk(path):  # Walk the folder and all of its subfolders
    for file in files:  # Check every file in the current folder
        if os.path.splitext(file)[1].lower() == '.pdf':  # Match both .pdf and .PDF
            file_list.append(os.path.join(root, file))  # Join with the folder the file is actually in
print(file_list)
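As an alternative to the loop above, the same search can be sketched with `pathlib`. The helper name `find_pdfs` is hypothetical and not part of the original script. It compares the suffix case-insensitively and returns each path relative to the folder the file was actually found in, so PDFs in subfolders are handled too.

```python
from pathlib import Path


def find_pdfs(folder):
    """Return the sorted paths of all PDF files under folder, including subfolders."""
    return sorted(
        str(p)
        for p in Path(folder).rglob("*")  # rglob("*") visits every entry recursively
        if p.is_file() and p.suffix.lower() == ".pdf"  # matches .pdf, .PDF, .Pdf
    )
```

Calling `find_pdfs(path)` would give a list that can be used in place of `file_list`.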
2. Open each PDF in turn, use pdfplumber's extract_text() to get the text of every page, and join the pages into a single string.
# Parse each PDF and collect its text
pdf_all = []
for i in range(len(file_list)):
    pdf = pdfplumber.open(file_list[i])  # Open each PDF file
    pages = pdf.pages
    text_all = []
    for page in pages:  # Go through every page
        text = page.extract_text()  # Extract the text of the current page
        text_all.append(text or "")  # Some pdfplumber versions return None for a page without text
    text_all = "".join(text_all)  # Join the list of page texts into one string
    pdf.close()
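The extraction step can also be wrapped in a small helper. The sketch below is one possible arrangement, not the original script: `join_page_text` and `read_pdf_text` are hypothetical names, and it assumes pdfplumber is installed. It opens the PDF in a `with` block so the file is closed even if a page fails to parse, and it treats a page with no extractable text as an empty string.

```python
def join_page_text(pages):
    """Join the text of all pages, treating pages without text as empty."""
    return "\n".join(page.extract_text() or "" for page in pages)


def read_pdf_text(pdf_path):
    """Return the full text of one PDF file as a single string."""
    import pdfplumber  # third-party; imported here so join_page_text works without it

    # The with block closes the file even if extraction raises an error
    with pdfplumber.open(pdf_path) as pdf:
        return join_page_text(pdf.pages)
```

With this helper, the body of the loop reduces to `text_all = read_pdf_text(file_list[i])`.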
3. Analyze the extracted text. The filter here is a simple keyword check that runs inside the loop from step 2: a PDF is kept if its text contains any of the keywords "Overweight", "cash management", or "conduct financial transactions". Natural language processing or machine learning could be used here for deeper analysis.
    # Still inside the for loop from step 2: keep the file if any keyword appears
    if ("Overweight" in text_all) or ("cash management" in text_all) or ("conduct financial transactions" in text_all):
        pdf_all.append(file_list[i])
print(pdf_all)  # Print the paths of the PDFs that passed the filter
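A slightly more tolerant version of the keyword check can be sketched as follows. PDF text extraction often breaks a phrase across lines, so this version strips all whitespace from both the text and the keywords before comparing. `KEYWORDS` and `matched_keywords` are hypothetical names, and the keyword list simply repeats the three terms used above; replace them with the terms as they actually appear in your announcements.

```python
# Assumed keyword list, copied from the filter above
KEYWORDS = ("Overweight", "cash management", "conduct financial transactions")


def matched_keywords(text, keywords=KEYWORDS):
    """Return the keywords found in text, ignoring whitespace differences."""
    compact = "".join(text.split())  # drop spaces, tabs, and line breaks
    return [kw for kw in keywords if "".join(kw.split()) in compact]
```

A file would then be kept when `matched_keywords(text_all)` is non-empty, and the returned list also records which keywords triggered the match.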
4. Finally, move the PDFs that passed the filter: build a new path for each one in a separate folder, then move the file there.
# Move the filtered PDF files
new_dir = r"C:\Users\86186\PycharmProjects\online\spider\requests\Filtered folder"
os.makedirs(new_dir, exist_ok=True)  # os.rename fails if the target folder does not exist
for pdf_i in pdf_all:
    new_path = os.path.join(new_dir, os.path.basename(pdf_i))
    os.rename(pdf_i, new_path)  # Move the file
print("PDF text parsing and filtering are completed!")
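The move step can be made a little more defensive. The sketch below swaps `os.rename` for `shutil.move`, which also works when the target folder is on a different drive, and it skips a file instead of overwriting one with the same name. `move_files` is a hypothetical helper name, not part of the original script.

```python
import os
import shutil


def move_files(paths, target_dir):
    """Move each file in paths into target_dir and return the new paths."""
    os.makedirs(target_dir, exist_ok=True)  # create the folder if it is missing
    moved = []
    for src in paths:
        dst = os.path.join(target_dir, os.path.basename(src))
        if os.path.exists(dst):
            # Skip rather than overwrite an existing file of the same name
            continue
        shutil.move(src, dst)
        moved.append(dst)
    return moved
```

For example, `move_files(pdf_all, new_dir)` would move every filtered PDF and return where each one ended up.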
5. Complete code
import os
import pdfplumber

path = r"C:\Users\86186\PycharmProjects\online\spider\requests\Financial statements"  # Folder where the financial statements are stored
# Collect the path of every PDF file under the folder
file_list = []
for root, dirs, files in os.walk(path):  # Walk the folder and all of its subfolders
    for file in files:  # Check every file in the current folder
        if os.path.splitext(file)[1].lower() == '.pdf':  # Match both .pdf and .PDF
            file_list.append(os.path.join(root, file))  # Join with the folder the file is actually in
print(file_list)

# Parse each PDF and filter on its text
pdf_all = []
for i in range(len(file_list)):
    pdf = pdfplumber.open(file_list[i])  # Open each PDF file
    pages = pdf.pages
    text_all = []
    for page in pages:  # Go through every page
        text = page.extract_text()  # Extract the text of the current page
        text_all.append(text or "")  # Some pdfplumber versions return None for a page without text
    text_all = "".join(text_all)  # Join the list of page texts into one string
    pdf.close()
    # Keep the file if any keyword appears in its text
    if ("Overweight" in text_all) or ("cash management" in text_all) or ("conduct financial transactions" in text_all):
        pdf_all.append(file_list[i])
print(pdf_all)  # Print the paths of the PDFs that passed the filter

# Move the filtered PDF files
new_dir = r"C:\Users\86186\PycharmProjects\online\spider\requests\Filtered folder"
os.makedirs(new_dir, exist_ok=True)  # os.rename fails if the target folder does not exist
for pdf_i in pdf_all:
    new_path = os.path.join(new_dir, os.path.basename(pdf_i))
    os.rename(pdf_i, new_path)  # Move the file
print("PDF text parsing and filtering are completed!")