Code power full quantization learning | How to find suitable financial announcements among financial reports
2022-07-01 06:25:00 【Code power full quantization】
When investing, we often read company announcements, but how should we interpret them? This article goes through all the financial announcement PDF files in a folder and screens them by parsing their text, so that only the announcements that meet our criteria remain.
1. First, walk through the folder and collect every PDF file in it, building the full path of each one. The paths are stored in a list so that later steps can use them easily.
import os
import pdfplumber

path = r"C:\Users\86186\PycharmProjects\online\spider\requests\Financial statements"  # Folder where the financial statements are stored
# Collect the path of every PDF file under the folder
file_list = []
for dirpath, dirnames, filenames in os.walk(path):  # Walk the folder and all of its subfolders
    for file in filenames:  # Every file in the current folder
        if os.path.splitext(file)[1].lower() == '.pdf':  # Check the file extension, ignoring case
            file_list.append(os.path.join(dirpath, file))  # Join against the folder the file is actually in
print(file_list)
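The same collection step can also be written with pathlib. The sketch below is only an alternative under our own assumptions: the helper name `find_pdfs` is hypothetical and does not appear in the original code.

```python
from pathlib import Path


def find_pdfs(root):
    """Return the paths of all PDF files under root, including subfolders.

    Hypothetical helper for illustration; not part of the original script.
    """
    root = Path(root)
    # rglob("*") walks the whole tree; comparing the lower-cased suffix
    # accepts .pdf, .PDF and mixed-case variants such as .Pdf.
    return sorted(
        str(p) for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

Calling `find_pdfs(path)` returns a sorted list of path strings that can be used in place of `file_list`.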
2. Open each PDF in turn, use pdfplumber to extract the text of every page, and join the page texts into a single string.
# Parse each PDF and collect its text
pdf_all = []
for i in range(len(file_list)):
    with pdfplumber.open(file_list[i]) as pdf:  # Open each PDF file; it is closed automatically
        text_all = []
        for page in pdf.pages:  # Go through every page
            text = page.extract_text()  # Extract the text of the current page
            text_all.append(text or "")  # Some pdfplumber versions return None for a page without text
    text_all = "".join(text_all)  # Join the page texts into one string
3. Analyze the text collected in step 2 with a simple keyword filter. Keywords of interest include “have”, “bill”, “conduct financial transactions” and “cash management”; the code here checks three keywords and keeps a PDF only when its text contains at least one of them. More in-depth analysis could be added at this point with natural language processing and machine learning.
    # Still inside the loop from step 2: keep the file if any keyword appears in its text
    if ("Overweight" in text_all) or ("cash management" in text_all) or ("conduct financial transactions" in text_all):
        pdf_all.append(file_list[i])
print(pdf_all)  # Print the PDF files that passed the filter
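When the keyword list grows, the chained `or` conditions become hard to maintain. A minimal sketch of one way to factor the check out is shown here; the function names `matched_keywords` and `is_relevant` and the `require_all` switch are our own assumptions, and the keyword strings are simply the ones used in the article's code.

```python
# Keyword strings taken from the article's filter; adjust them to the
# wording that actually appears in your announcements.
KEYWORDS = ["Overweight", "cash management", "conduct financial transactions"]


def matched_keywords(text, keywords):
    """Return the keywords that occur in text, in the order given."""
    text = text or ""  # tolerate None for documents without extractable text
    return [kw for kw in keywords if kw in text]


def is_relevant(text, keywords, require_all=False):
    """Decide whether a document passes the keyword filter.

    With require_all=False (the article's behaviour) one hit is enough;
    with require_all=True every keyword must be present.
    """
    hits = matched_keywords(text, keywords)
    if require_all:
        return len(hits) == len(keywords)
    return bool(hits)
```

Inside the loop, `if is_relevant(text_all, KEYWORDS):` would then replace the long condition, and `matched_keywords` shows which terms caused a file to be kept.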
4. After screening, move the selected PDF files: build a destination path in a new folder for each file, then move the file there.
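# Move the filtered PDF files
new_dir = r"C:\Users\86186\PycharmProjects\online\spider\requests\Filtered folder"
os.makedirs(new_dir, exist_ok=True)  # Create the destination folder if it does not exist yet
for pdf_i in pdf_all:
    new_path = os.path.join(new_dir, os.path.basename(pdf_i))  # Same file name, new folder
    os.rename(pdf_i, new_path)  # Move the file
print("PDF text parsing and filtering are complete!")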
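`os.rename` can fail when the destination is on a different drive, and on some platforms it silently replaces a file of the same name. A hedged sketch of a more defensive move step follows; it swaps in `shutil.move` from the standard library, and the helper name `move_files` is hypothetical.

```python
import os
import shutil


def move_files(paths, target_dir):
    """Move each file in paths into target_dir and return the new paths.

    Hypothetical helper for illustration; not part of the original script.
    """
    os.makedirs(target_dir, exist_ok=True)  # create the folder if it is missing
    moved = []
    for src in paths:
        dst = os.path.join(target_dir, os.path.basename(src))
        if os.path.exists(dst):
            # Skip rather than overwrite an existing file of the same name.
            continue
        shutil.move(src, dst)  # also works across drives, unlike os.rename
        moved.append(dst)
    return moved
```

Under these assumptions, `move_files(pdf_all, new_dir)` would do the work of the loop, where `new_dir` stands for the destination folder.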
5. Complete code
import os
import pdfplumber

path = r"C:\Users\86186\PycharmProjects\online\spider\requests\Financial statements"  # Folder where the financial statements are stored
# Collect the path of every PDF file under the folder
file_list = []
for dirpath, dirnames, filenames in os.walk(path):  # Walk the folder and all of its subfolders
    for file in filenames:  # Every file in the current folder
        if os.path.splitext(file)[1].lower() == '.pdf':  # Check the file extension, ignoring case
            file_list.append(os.path.join(dirpath, file))  # Join against the folder the file is actually in
print(file_list)

# Parse each PDF and filter on its text
pdf_all = []
for i in range(len(file_list)):
    with pdfplumber.open(file_list[i]) as pdf:  # Open each PDF file; it is closed automatically
        text_all = []
        for page in pdf.pages:  # Go through every page
            text = page.extract_text()  # Extract the text of the current page
            text_all.append(text or "")  # Some pdfplumber versions return None for a page without text
    text_all = "".join(text_all)  # Join the page texts into one string
    # Keep the file if any keyword appears in its text
    if ("Overweight" in text_all) or ("cash management" in text_all) or ("conduct financial transactions" in text_all):
        pdf_all.append(file_list[i])
print(pdf_all)  # Print the PDF files that passed the filter

# Move the filtered PDF files
new_dir = r"C:\Users\86186\PycharmProjects\online\spider\requests\Filtered folder"
os.makedirs(new_dir, exist_ok=True)  # Create the destination folder if it does not exist yet
for pdf_i in pdf_all:
    new_path = os.path.join(new_dir, os.path.basename(pdf_i))  # Same file name, new folder
    os.rename(pdf_i, new_path)  # Move the file
print("PDF text parsing and filtering are complete!")