Extracting All Text from a PDF File with C# (Supports .NET Core)
2022-07-04 09:35:00 【dotNET跨平台】
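PDF stands for Portable Document Format. Adobe Systems developed it as a file format for exchanging documents independently of the application, operating system, and hardware in use. PDF is based on the PostScript imaging model, so a PDF file is meant to reproduce the characters, colors, and images of the original faithfully, whichever printer it is printed on.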
Because the PDF file format is fairly complex, PDF files are usually handled through a third-party component. This article uses itext7.
Official website: https://itextpdf.com/
NuGet: https://www.nuget.org/packages/itext7/
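As a minimal sketch, assuming the .NET CLI is installed and the command is run from the directory that contains your project file, the package can be added from the command line (the NuGet Package Manager in Visual Studio works just as well):

```shell
# Adds the itext7 package reference to the project in the current directory.
dotnet add package itext7
```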
After adding the itext7 package through NuGet, you can use the following code to extract the text from a PDF file:
using System.Collections.Generic;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

public static class PdfHelper
{
    // Lazily returns the text of each page, one string per page.
    public static IEnumerable<string> ExtractText(string filename)
    {
        using (var r = new PdfReader(filename))
        using (var doc = new PdfDocument(r))
        {
            // Page numbers are 1-based, so the last page must be included (<=).
            for (int i = 1; i <= doc.GetNumberOfPages(); i++)
            {
                // Use a fresh strategy per page so text from earlier pages is not carried over.
                ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
                string text = PdfTextExtractor.GetTextFromPage(doc.GetPage(i), strategy);
                yield return text;
            }
        }
    }
}

Example usage:
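// ToList() requires "using System.Linq;". Each list element holds the text of one page.
var lines = PdfHelper.ExtractText("{path to the PDF file}").ToList();

Note: if your PDF file is an image-based scan, the code in this article cannot extract any text from it. In that case you need OCR.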