Extracting All Text from a PDF File with C# (.NET Core Supported)
2022-07-04 09:35:00 【dotNET跨平台】
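PDF is short for Portable Document Format. It is a file format developed by Adobe Systems for exchanging documents in a way that is independent of the application, operating system, and hardware. PDF is based on the PostScript imaging model and is designed to give accurate color and print output on any printer; that is, a PDF faithfully reproduces every character, color, and image of the original document.
Because the PDF format is fairly complex, PDF files are usually handled through a third-party component. This article uses itext7.
Website: https://itextpdf.com/
NuGet: https://www.nuget.org/packages/itext7/
After adding the itext7 package through NuGet, you can extract the text of a PDF file with the following code: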
using System.Collections.Generic;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

public static class PdfHelper
{
    // Lazily yields the extracted text of each page, one string per page.
    public static IEnumerable<string> ExtractText(string filename)
    {
        using (var r = new PdfReader(filename))
        using (var doc = new PdfDocument(r))
        {
            // Page numbers are 1-based, so the last page is GetNumberOfPages() itself.
            for (int i = 1; i <= doc.GetNumberOfPages(); i++)
            {
                // Use a fresh strategy per page so text from earlier pages does not carry over.
                ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
                string text = PdfTextExtractor.GetTextFromPage(doc.GetPage(i), strategy);
                yield return text;
            }
        }
    }
}
Example usage (ToList() requires using System.Linq;):
var lines = PdfHelper.ExtractText("{path to PDF file}").ToList();
Note: if your PDF is an image-based scan, the code in this article cannot extract any text from it. In that case you need OCR.
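One rough way to spot pages that may be scans is to check which pages come back with no text at all. The sketch below is only a heuristic, not part of itext7: `PdfScanCheck` and `FindPagesWithoutText` are hypothetical names introduced here for illustration, and an empty result can also mean the page is simply blank.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PdfScanCheck
{
    // Hypothetical helper: given one extracted string per page (in page order),
    // returns the 1-based numbers of pages whose text is null, empty, or whitespace.
    public static List<int> FindPagesWithoutText(IEnumerable<string> pageTexts)
    {
        return pageTexts
            .Select((text, index) => new { Text = text, Page = index + 1 })
            .Where(x => string.IsNullOrWhiteSpace(x.Text))
            .Select(x => x.Page)
            .ToList();
    }
}
```

It can be fed directly with the per-page output of the extraction method, for example `PdfScanCheck.FindPagesWithoutText(PdfHelper.ExtractText(path))`, where `path` is your own file path. Pages it reports are candidates for OCR rather than confirmed scans.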