Extracting All Text from a PDF File with C# (.NET Core Supported)
2022-07-04 09:35:00 【dotNET跨平台】
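PDF is short for Portable Document Format, a file format developed by Adobe Systems for exchanging documents in a way that is independent of the application, operating system, and hardware. PDF is based on the PostScript imaging model, so a document is meant to print with the same colors and layout on any printer: a PDF faithfully reproduces every character, color, and image of the original.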
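Because the PDF format is fairly complex, PDF files are usually handled through a third-party library. This article uses itext7.
Official site: https://itextpdf.com/
NuGet: https://www.nuget.org/packages/itext7/
After adding the itext7 package through NuGet, you can extract the text of a PDF file with the following code: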
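using System.Collections.Generic;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

public static class PdfHelper
{
    // Lazily yields the extracted text of each page, in page order.
    public static IEnumerable<string> ExtractText(string filename)
    {
        using (var r = new PdfReader(filename))
        using (var doc = new PdfDocument(r))
        {
            // Page numbers are 1-based, so the last page is GetNumberOfPages() itself.
            for (int i = 1; i <= doc.GetNumberOfPages(); i++)
            {
                // Use a fresh strategy per page so text from earlier pages is not carried over.
                ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
                string text = PdfTextExtractor.GetTextFromPage(doc.GetPage(i), strategy);
                yield return text;
            }
        }
    }
}

Example usage:

using System.Linq; // needed for ToList()

var lines = PdfHelper.ExtractText("{path to PDF file}").ToList();

Note that if your PDF is an image-based scan, this code cannot extract any text from it. In that case you need OCR instead.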
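One rough way to notice a scanned PDF is to check which pages produced no text at all. The helper below is only a sketch and is not part of itext7: the names `ScanCheck` and `FindPagesWithoutText` are made up for illustration, and it assumes the input is the per-page text returned by `PdfHelper.ExtractText`. An empty page is only a hint, since a page can also be genuinely blank.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (not an itext7 API): flags pages whose extracted text is empty.
public static class ScanCheck
{
    // Returns the 1-based numbers of pages whose text is null, empty, or whitespace only.
    public static List<int> FindPagesWithoutText(IEnumerable<string> pageTexts)
    {
        return pageTexts
            .Select((text, index) => new { Text = text, Page = index + 1 })
            .Where(p => string.IsNullOrWhiteSpace(p.Text))
            .Select(p => p.Page)
            .ToList();
    }
}
```

Calling `ScanCheck.FindPagesWithoutText(PdfHelper.ExtractText(path))` would then list the pages that are candidates for OCR. If every page is listed, the file is probably an image-only scan.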