当前位置:网站首页>使用 C# 提取 PDF 文件中的所有文字(支持 .NET Core)
使用 C# 提取 PDF 文件中的所有文字(支持 .NET Core)
2022-07-04 09:35:00 【dotNET跨平台】
PDF 是 Portable Document Format 的简称,意为“可携带文档格式”,是由 Adobe Systems 用于与应用程序、操作系统、硬件无关的方式进行文件交换所发展出的文件格式。PDF 文件以 PostScript 语言图象模型为基础,无论在哪种打印机上都可保证精确的颜色和准确的打印效果,即 PDF 会忠实地再现原稿的每一个字符、颜色以及图象。
鉴于 PDF 文件格式比较复杂,一般通过第三方组件来对 PDF 进行操作,本文使用的是 itext7 。
官网:https://itextpdf.com/
NuGet:https://www.nuget.org/packages/itext7/
通过 NuGet 引入 itext7 组件之后,可以使用以下代码提取 PDF 文件中的文字:
using System.Collections.Generic;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;
public static class PdfHelper
{
public static IEnumerable<string> ExtractText(string filename)
{
using (var r = new PdfReader(filename))
using (var doc = new PdfDocument(r))
{
for (int i = 1; i < doc.GetNumberOfPages(); i++)
{
ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
string text = PdfTextExtractor.GetTextFromPage(doc.GetPage(i), strategy);
yield return text;
}
}
}
}示例代码:
var lines = PdfHelper.ExtractText("{PDF文件路径}").ToList();需要注意的是:如果你的 PDF 文件是基于图片的扫描版,那么本文的代码是无法提取到文字的,你需要的是 OCR 技术。
边栏推荐
- Flutter 小技巧之 ListView 和 PageView 的各種花式嵌套
- Golang Modules
- View CSDN personal resource download details
- How web pages interact with applets
- C # use gdi+ to add text to the picture and make the text adaptive to the rectangular area
- 自动化的优点有哪些?
- Launpad | 基礎知識
- Hands on deep learning (40) -- short and long term memory network (LSTM)
- Matlab tips (25) competitive neural network and SOM neural network
- How can Huawei online match improve the success rate of player matching
猜你喜欢

Application of safety monitoring in zhizhilu Denggan reservoir area

C语言指针经典面试题——第一弹

Hands on deep learning (45) -- bundle search

Hands on deep learning (III) -- Torch Operation (sorting out documents in detail)
Web端自动化测试失败原因汇总

How do microservices aggregate API documents? This wave of show~

Histogram equalization

mmclassification 标注文件生成

2022-2028 global probiotics industry research and trend analysis report

Hands on deep learning (34) -- sequence model
随机推荐
2022-2028 global tensile strain sensor industry research and trend analysis report
Daughter love: frequency spectrum analysis of a piece of music
2. Data type
Application of safety monitoring in zhizhilu Denggan reservoir area
Golang Modules
法向量点云旋转
Logstack configuration details -- elasticstack (elk) work notes 020
In the case of easyUI DataGrid paging, click the small triangle icon in the header to reorder all the data in the database
C language pointer classic interview question - the first bullet
C # use gdi+ to add text with center rotation (arbitrary angle)
品牌连锁店5G/4G无线组网方案
H5 audio tag custom style modification and adding playback control events
C language pointer interview question - the second bullet
Are there any principal guaranteed financial products in 2022?
Intelligent gateway helps improve industrial data acquisition and utilization
C # use smtpclient The sendasync method fails to send mail, and always returns canceled
技术管理进阶——如何设计并跟进不同层级同学的绩效
Hands on deep learning (46) -- attention mechanism
Write a jison parser from scratch (4/10): detailed explanation of the syntax format of the jison parser generator
How do microservices aggregate API documents? This wave of show~