Code power full quantization | How to find the right financial announcements among financial reports
2022-07-01 06:25:00 【Code power full quantization】
When investing, we often read company announcements. But how do we sift through them? This article walks through every financial announcement PDF file in a folder, extracts the text of each PDF, and screens the announcements by their content to find the ones that qualify.
1. First, walk through all the PDF files in the folder and build the full path of each one. Store the paths in a list so that they are easy to use later.
import os
import pdfplumber

path = r"C:\Users\86186\PycharmProjects\online\spider\requests\Financial statements"  # Folder where the financial statements are stored

# Collect the path of every PDF file under the folder
file_list = []
for root, dirs, files in os.walk(path):  # Walk the folder and all of its subfolders
    for file in files:  # Every file in the current folder
        if os.path.splitext(file)[1].lower() == '.pdf':  # Check the file extension (.pdf or .PDF)
            file_list.append(os.path.join(root, file))  # Join with the folder the file is actually in
print(file_list)
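The same search can also be written with the standard-library `pathlib` module. This is only a minimal sketch, not the code used in this article; the function name `find_pdfs` is a hypothetical helper introduced for illustration.

```python
from pathlib import Path


def find_pdfs(folder):
    """Return the paths of all PDF files under `folder`, including subfolders."""
    # rglob("*") walks the folder tree recursively. The suffix check is
    # case-insensitive, so both .pdf and .PDF are matched.
    return sorted(
        str(p) for p in Path(folder).rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

Calling `find_pdfs(path)` would produce a list that can be used in place of `file_list`.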
2. Open each PDF in turn, use pdfplumber to extract the text of every page, and join the text of all pages into a single string.
# Parse the PDF text and filter the files by content
pdf_all = []
for i in range(len(file_list)):
    pdf = pdfplumber.open(file_list[i])  # Open each PDF file
    pages = pdf.pages
    text_all = []
    for page in pages:  # Go through every page
        text = page.extract_text() or ""  # Extract the page text; fall back to "" if nothing is returned
        text_all.append(text)  # Collect the text of each page
    text_all = "".join(text_all)  # Convert the list to a single string
    pdf.close()
3. Analyze the extracted text. The filter here is a simple keyword check that runs inside the same loop: a PDF is selected if its text contains any of “Overweight”, “cash management”, or “conduct financial transactions”. Natural language processing or machine learning could be used here for deeper analysis.
    # Still inside the loop over file_list: keep the file if any keyword appears in its text
    if ("Overweight" in text_all) or ("cash management" in text_all) or ("conduct financial transactions" in text_all):
        pdf_all.append(file_list[i])
print(pdf_all)  # Print the PDF files that passed the filter
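As a small step toward the deeper analysis mentioned in step 3, the check can report which keywords matched instead of only yes or no. The sketch below is illustrative; `matched_keywords`, `KEYWORDS`, and the sample sentence are hypothetical names and data, not part of the original script.

```python
def matched_keywords(text, keywords):
    """Return the keywords that occur in `text`, in the order given."""
    return [kw for kw in keywords if kw in text]


# The three keywords used by the filter in step 3
KEYWORDS = ["Overweight", "cash management", "conduct financial transactions"]

# Made-up sample text, for illustration only
sample = "The company plans to use idle funds for cash management."
print(matched_keywords(sample, KEYWORDS))  # prints ['cash management']
```

A file passes the filter whenever the returned list is non-empty, and keeping the keywords in one list makes them easy to change.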
4. After screening, move the selected PDF files. Build a new path for each file in the destination folder, then move the file there.
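# Move the filtered PDF files
new_dir = r"C:\Users\86186\PycharmProjects\online\spider\requests\Filtered folder"  # Destination folder
os.makedirs(new_dir, exist_ok=True)  # Create the destination folder if it does not exist
for pdf_i in pdf_all:
    new_path = os.path.join(new_dir, os.path.basename(pdf_i))  # Same file name, new folder
    os.rename(pdf_i, new_path)  # Move the file
print("PDF text parsing and filtering are complete!")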
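`os.rename` can fail when the source and destination are on different drives or filesystems. A hedged alternative is `shutil.move` from the standard library, which falls back to copying and deleting in that case. The helper name `move_files` below is hypothetical and only sketches the idea.

```python
import os
import shutil


def move_files(paths, dest_dir):
    """Move each file in `paths` into `dest_dir` and return the new paths."""
    os.makedirs(dest_dir, exist_ok=True)  # Create the destination folder if needed
    moved = []
    for src in paths:
        dst = os.path.join(dest_dir, os.path.basename(src))  # Keep the file name
        shutil.move(src, dst)  # Also works across drives, unlike os.rename
        moved.append(dst)
    return moved
```

With this helper, the loop in step 4 would reduce to a single call that passes `pdf_all` and the destination folder.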
5. Complete code
import os
import pdfplumber

path = r"C:\Users\86186\PycharmProjects\online\spider\requests\Financial statements"  # Folder where the financial statements are stored

# Collect the path of every PDF file under the folder
file_list = []
for root, dirs, files in os.walk(path):  # Walk the folder and all of its subfolders
    for file in files:  # Every file in the current folder
        if os.path.splitext(file)[1].lower() == '.pdf':  # Check the file extension (.pdf or .PDF)
            file_list.append(os.path.join(root, file))  # Join with the folder the file is actually in
print(file_list)

# Parse the PDF text and filter the files by content
pdf_all = []
for i in range(len(file_list)):
    pdf = pdfplumber.open(file_list[i])  # Open each PDF file
    pages = pdf.pages
    text_all = []
    for page in pages:  # Go through every page
        text = page.extract_text() or ""  # Extract the page text; fall back to "" if nothing is returned
        text_all.append(text)  # Collect the text of each page
    text_all = "".join(text_all)  # Convert the list to a single string
    pdf.close()
    # Keep the file if any keyword appears in its text
    if ("Overweight" in text_all) or ("cash management" in text_all) or ("conduct financial transactions" in text_all):
        pdf_all.append(file_list[i])
print(pdf_all)  # Print the PDF files that passed the filter

# Move the filtered PDF files
new_dir = r"C:\Users\86186\PycharmProjects\online\spider\requests\Filtered folder"  # Destination folder
os.makedirs(new_dir, exist_ok=True)  # Create the destination folder if it does not exist
for pdf_i in pdf_all:
    new_path = os.path.join(new_dir, os.path.basename(pdf_i))  # Same file name, new folder
    os.rename(pdf_i, new_path)  # Move the file
print("PDF text parsing and filtering are complete!")