Use C# to extract all text from PDF files (supports .NET Core)
2022-07-04 10:10:00 【Dotnet cross platform】
PDF is short for Portable Document Format, a file format developed by Adobe Systems for exchanging documents independently of the application, operating system, and hardware. PDF is based on the PostScript language imaging model, so colors and printed output stay accurate on any printer; that is, a PDF faithfully reproduces every character, color, and image of the original.
Because the PDF file format is complex, PDF files are usually manipulated through third-party components. This article uses itext7.
Official website: https://itextpdf.com/
NuGet: https://www.nuget.org/packages/itext7/
After adding the itext7 component through NuGet, you can use the following code to extract the text from a PDF document:
using System.Collections.Generic;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

public static class PdfHelper
{
    // Lazily yields the extracted text of each page, in page order.
    public static IEnumerable<string> ExtractText(string filename)
    {
        using (var r = new PdfReader(filename))
        using (var doc = new PdfDocument(r))
        {
            // Page numbers are 1-based, so the loop must include GetNumberOfPages().
            for (int i = 1; i <= doc.GetNumberOfPages(); i++)
            {
                // A fresh strategy per page keeps text from accumulating across pages.
                ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
                string text = PdfTextExtractor.GetTextFromPage(doc.GetPage(i), strategy);
                yield return text;
            }
        }
    }
}
Sample usage (ToList requires using System.Linq; each element is the text of one page):
var pages = PdfHelper.ExtractText("{PDF file path}").ToList();
Note: if your PDF is a scanned, image-based document, the code in this article cannot extract its text. What you need in that case is OCR technology.
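One rough way to spot pages that may need OCR is to check which pages produced no text at all. The sketch below is only an illustration: ScannedPageDetector and FindPagesWithoutText are hypothetical names, not part of itext7, and the heuristic is an assumption. A genuinely blank page also yields no text, and a scanned PDF that already carries an OCR text layer will not be flagged.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ScannedPageDetector
{
    // Hypothetical helper (not an itext7 API): returns the 1-based numbers of
    // pages whose extracted text is null, empty, or whitespace only.
    public static List<int> FindPagesWithoutText(IEnumerable<string> pageTexts)
    {
        return pageTexts
            // Pair each page's text with its 1-based page number.
            .Select((text, index) => new { Text = text, PageNumber = index + 1 })
            // Keep only the pages where nothing readable was extracted.
            .Where(p => string.IsNullOrWhiteSpace(p.Text))
            .Select(p => p.PageNumber)
            .ToList();
    }
}
```

It can be fed directly with the output of the helper from this article, for example ScannedPageDetector.FindPagesWithoutText(PdfHelper.ExtractText("{PDF file path}")); the returned page numbers are candidates for OCR, not a definitive verdict.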