Code power is full of quantitative learning | How to find suitable financial announcements among financial reports

2022-07-01 06:25:00 【Code power full quantization】

When investing, we often need to read company announcements, but how do we find the ones worth interpreting? This article traverses all the financial announcement PDF files in a folder, parses their text, and filters the announcements by content to pick out the ones that meet our criteria.

1. First, traverse the folder to find every PDF file and build each file's full path. Store the paths in a list so they are easy to use later.

import os
import pdfplumber

path = r"C:\Users\86186\PycharmProjects\online\spider\requests\Financial statements"  # Storage path of the financial statements

# Find the paths of all pdf files under the folder
file_list = []
for dirpath, dirnames, filenames in os.walk(path):  # Traverse the folder and all of its subfolders
    for file in filenames:  # Traverse all files in the current folder
        if os.path.splitext(file)[1].lower() == '.pdf':  # Check the file suffix (case-insensitive, so .pdf and .PDF both match)
            file_list.append(os.path.join(dirpath, file))  # Join with dirpath so files in subfolders get the correct path
print(file_list)
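
As an alternative to the os.walk loop in step 1, the same search can be sketched with the standard-library pathlib module. The function name find_pdfs and the demo file names are illustrative assumptions, not part of the original code. The suffix comparison is case-insensitive, so .pdf and .PDF are both accepted.

```python
import tempfile
from pathlib import Path


def find_pdfs(root):
    """Return the paths of all PDF files under root, including subfolders."""
    root = Path(root)
    # rglob("*") walks the tree recursively; comparing the lower-cased suffix
    # accepts .pdf, .PDF, and mixed-case variants such as .Pdf.
    return sorted(
        str(p) for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


# Small demonstration in a throwaway directory (hypothetical file names).
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "sub").mkdir()
    for name in ("a.pdf", "b.PDF", "notes.txt", "sub/c.pdf"):
        (base / name).touch()
    for found in find_pdfs(base):
        print(found)
```

In real use, find_pdfs(path) would take the place of the loop that fills file_list.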

2. Traverse every PDF in the list, use pdfplumber's extract_text() to get the text of each page, and join the text of all pages into one string.

# Parse the PDF text and filter on its content
pdf_all = []
for i in range(len(file_list)):
    pdf = pdfplumber.open(file_list[i])  # Open each pdf file
    pages = pdf.pages
    text_all = []
    for page in pages:  # Traverse every page
        text = page.extract_text() or ""  # Extract the page text; fall back to "" if nothing could be extracted
        text_all.append(text)  # Collect the text of each page
    text_all = "".join(text_all)  # Convert the list to a string
    pdf.close()

3. Analyze the extracted text. The keyword filter here keeps a PDF only when its text contains one of the chosen keywords, such as "have", "bill", "conduct financial transactions", or "cash management"; the code below checks for "Overweight", "cash management", and "conduct financial transactions". More in-depth analysis is possible with natural language processing and machine learning.

    # Still inside the for loop from step 2: keep the file if any keyword appears
    if ("Overweight" in text_all) or ("cash management" in text_all) or ("conduct financial transactions" in text_all):
        pdf_all.append(file_list[i])
print(pdf_all)  # Print the filtered pdf files
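
The keyword check in step 3 can be pulled into a small helper, which makes the keyword list easy to change. This is only a sketch: matched_keywords and KEYWORDS are hypothetical names, and removing whitespace before matching is an added assumption meant to cope with line breaks that PDF extraction can insert in the middle of a phrase. That suits Chinese text; in space-separated languages it can produce false matches across word boundaries.

```python
import re


def matched_keywords(text, keywords):
    """Return the keywords that occur in text, ignoring all whitespace."""
    # Extracted PDF text often contains line breaks or stray spaces in the
    # middle of a phrase, so whitespace is removed from both sides first.
    compact = re.sub(r"\s+", "", text)
    return [kw for kw in keywords if re.sub(r"\s+", "", kw) in compact]


# Hypothetical keyword list, mirroring the three keywords used in step 3.
KEYWORDS = ["Overweight", "cash management", "conduct financial transactions"]

sample = "The board approved cash\nmanagement of idle funds."
print(matched_keywords(sample, KEYWORDS))  # prints ['cash management']
```

Inside the loop, a file would be kept when matched_keywords(text_all, KEYWORDS) returns a non-empty list, and the returned keywords show why it matched.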

4. After filtering, move the matching PDF files: build a new path for each file, then move it into a new folder.

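# Move the filtered pdf files
new_dir = r"C:\Users\86186\PycharmProjects\online\spider\requests\Filtered folder"
os.makedirs(new_dir, exist_ok=True)  # Create the destination folder if it does not exist yet
for pdf_i in pdf_all:
    new_path = os.path.join(new_dir, os.path.basename(pdf_i))  # Same file name, new folder
    os.rename(pdf_i, new_path)  # Perform the file move
print("PDF text parsing and filtering are complete!")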
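
Because os.rename fails when the destination folder does not exist and can fail when source and destination are on different drives, step 4 could also be sketched with shutil.move. The helper name move_files and the demo paths are assumptions for illustration. Skipping files that already exist at the destination is a design choice of this sketch, not something the original code does.

```python
import shutil
import tempfile
from pathlib import Path


def move_files(paths, dest_dir):
    """Move each file in paths into dest_dir and return the new paths."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # create the folder if missing
    moved = []
    for src in map(Path, paths):
        target = dest / src.name
        if target.exists():
            # Skip rather than overwrite a file that is already there.
            print(f"skipped (already exists): {target}")
            continue
        shutil.move(str(src), str(target))  # also works across drives
        moved.append(str(target))
    return moved


# Demonstration in a throwaway directory (hypothetical file name).
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "report.pdf"
    source.touch()
    print(move_files([source], Path(tmp) / "filtered"))
```

With the lists from the earlier steps, the call would be move_files(pdf_all, new_folder), where new_folder is the folder for the filtered announcements.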

5. Complete code

import os
import pdfplumber

path = r"C:\Users\86186\PycharmProjects\online\spider\requests\Financial statements"  # Storage path of the financial statements

# Find the paths of all pdf files under the folder
file_list = []
for dirpath, dirnames, filenames in os.walk(path):  # Traverse the folder and all of its subfolders
    for file in filenames:  # Traverse all files in the current folder
        if os.path.splitext(file)[1].lower() == '.pdf':  # Check the file suffix (case-insensitive, so .pdf and .PDF both match)
            file_list.append(os.path.join(dirpath, file))  # Join with dirpath so files in subfolders get the correct path
print(file_list)

# Parse the PDF text and filter on its content
pdf_all = []
for i in range(len(file_list)):
    pdf = pdfplumber.open(file_list[i])  # Open each pdf file
    pages = pdf.pages
    text_all = []
    for page in pages:  # Traverse every page
        text = page.extract_text() or ""  # Extract the page text; fall back to "" if nothing could be extracted
        text_all.append(text)  # Collect the text of each page
    text_all = "".join(text_all)  # Convert the list to a string
    pdf.close()

    # Filter on the content of the text: keep the file if any keyword appears
    if ("Overweight" in text_all) or ("cash management" in text_all) or ("conduct financial transactions" in text_all):
        pdf_all.append(file_list[i])
print(pdf_all)  # Print the filtered pdf files

# Move the filtered pdf files
new_dir = r"C:\Users\86186\PycharmProjects\online\spider\requests\Filtered folder"
os.makedirs(new_dir, exist_ok=True)  # Create the destination folder if it does not exist yet
for pdf_i in pdf_all:
    new_path = os.path.join(new_dir, os.path.basename(pdf_i))  # Same file name, new folder
    os.rename(pdf_i, new_path)  # Perform the file move
print("PDF text parsing and filtering are complete!")

Copyright notice
This article was written by [Code power full quantization]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/182/202207010617416777.html
