Use C# to extract all text from PDF files (supports .NET Core)
2022-07-04 10:10:00 【Dotnet cross platform】
PDF is short for Portable Document Format, a file format developed by Adobe Systems for exchanging documents in a way that is independent of the application, operating system, and hardware. PDF is based on the PostScript imaging model, so it prints with accurate color and layout on any printer; that is, a PDF faithfully reproduces every character, color, and image of the original.
Because the PDF file format is complex, PDF files are generally handled with third-party components. This article uses itext7.
Official website: https://itextpdf.com/
NuGet: https://www.nuget.org/packages/itext7/
After adding the itext7 package through NuGet, you can use the following code to extract the text from a PDF document:
using System.Collections.Generic;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

public static class PdfHelper
{
    // Lazily yields the extracted text of each page, in page order.
    public static IEnumerable<string> ExtractText(string filename)
    {
        using (var r = new PdfReader(filename))
        using (var doc = new PdfDocument(r))
        {
            // Page numbers are 1-based, so the loop must include the last page (<=).
            for (int i = 1; i <= doc.GetNumberOfPages(); i++)
            {
                // A new strategy instance is created for each page.
                ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
                string text = PdfTextExtractor.GetTextFromPage(doc.GetPage(i), strategy);
                yield return text;
            }
        }
    }
}

Sample code:
// ToList() requires "using System.Linq;". Each element holds the text of one page.
var lines = PdfHelper.ExtractText("{PDF file path}").ToList();

It should be noted that if your PDF is a scanned, image-based document, the code in this article cannot extract its text; what you need is OCR technology.
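The caveat about scanned PDFs can be checked for before reaching for OCR. The following is a minimal sketch that is not part of the original article: `PdfTextCheck.FindPagesWithoutText` is a hypothetical helper that takes the per-page strings returned by `PdfHelper.ExtractText` and reports the 1-based numbers of the pages that produced no text. Treating such a page as "probably scanned" is only a heuristic assumed here, since a page may also simply be blank.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class PdfTextCheck
{
    // Hypothetical helper (not from the article or from itext7):
    // returns the 1-based numbers of pages whose extracted text is
    // null, empty, or whitespace only.
    public static List<int> FindPagesWithoutText(IEnumerable<string> pageTexts)
    {
        return pageTexts
            // Pair each page's text with its 1-based page number.
            .Select((text, index) => new { Text = text, PageNumber = index + 1 })
            // Keep only pages from which no visible text was extracted.
            .Where(p => string.IsNullOrWhiteSpace(p.Text))
            .Select(p => p.PageNumber)
            .ToList();
    }
}
```

It could be called as `PdfTextCheck.FindPagesWithoutText(PdfHelper.ExtractText(path))`, where `path` is your PDF file path. If the returned list covers most of the document, the file is likely image-based and a candidate for OCR.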