Use C# to extract all text from PDF files (supports .NET Core)
2022-07-04 10:10:00 【Dotnet cross platform】
PDF is short for Portable Document Format, a file format developed by Adobe Systems for exchanging documents independently of application, operating system, and hardware. PDF is based on the PostScript imaging model, so a document prints with accurate color and layout on any printer. In other words, a PDF faithfully reproduces every character, color, and image of the original.
Because the PDF file format is complex, third-party components are generally used to work with PDF files. This article uses itext7.
Official website: https://itextpdf.com/
NuGet: https://www.nuget.org/packages/itext7/
After adding the itext7 component through NuGet, you can use the following code to extract the text from a PDF document:
using System.Collections.Generic;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

public static class PdfHelper
{
    // Lazily yields the extracted text of each page, in page order.
    public static IEnumerable<string> ExtractText(string filename)
    {
        using (var r = new PdfReader(filename))
        using (var doc = new PdfDocument(r))
        {
            // Page numbers are 1-based, so the last page must be included (<=).
            for (int i = 1; i <= doc.GetNumberOfPages(); i++)
            {
                // Use a fresh strategy per page so text does not accumulate across pages.
                ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
                string text = PdfTextExtractor.GetTextFromPage(doc.GetPage(i), strategy);
                yield return text;
            }
        }
    }
}

Sample code:
var lines = PdfHelper.ExtractText("{PDF file path}").ToList(); // ToList() requires using System.Linq;

Note: if your PDF is a scanned, image-based document, the code in this article cannot extract its text. In that case you need OCR.
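One rough way to spot pages that may need OCR is to look for pages whose extracted text is empty. The sketch below assumes that image-only pages come back as empty or whitespace-only strings, which is a heuristic and not something iText guarantees. A genuinely blank page would be flagged too. `PdfTextChecks` and `FindPagesWithoutText` are hypothetical names introduced here for illustration and are not part of itext7.

```csharp
using System.Collections.Generic;

// Hypothetical helper (not part of itext7): flags pages with no extractable text.
public static class PdfTextChecks
{
    // Takes the per-page text in page order and returns the 1-based page
    // numbers whose text is null, empty, or whitespace only.
    public static List<int> FindPagesWithoutText(IEnumerable<string> pageTexts)
    {
        var emptyPages = new List<int>();
        int pageNumber = 0;
        foreach (string text in pageTexts)
        {
            pageNumber++; // PDF page numbers start at 1
            if (string.IsNullOrWhiteSpace(text))
            {
                emptyPages.Add(pageNumber);
            }
        }
        return emptyPages;
    }
}
```

It can be combined with the helper from this article, for example: var suspectPages = PdfTextChecks.FindPagesWithoutText(PdfHelper.ExtractText("{PDF file path}"));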