C#/VB.NET: Extracting Tables from a PDF
2022-08-03 10:56:00 【InfoQ】
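Program environment: the code below relies on the Spire.PDF library (the Spire.Pdf and Spire.Pdf.Utilities namespaces), which the project must reference.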
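Steps for extracting tables from a PDF:
- Create a PdfDocument object and call PdfDocument.LoadFromFile() to load the document.
- Call PdfTableExtractor.ExtractTable(int pageIndex) to extract the tables on a given page.
- Call PdfTable.GetText(int rowIndex, int columnIndex) to read the text of the cell at a given row and column.
- Save the extracted table content to a TXT file.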
Complete code (C#):
using Spire.Pdf;
using Spire.Pdf.Utilities;
using System.IO;
using System.Text;
namespace ExtractTable
{
    class Program
    {
        static void Main(string[] args)
        {
            // Create a PdfDocument object
            PdfDocument pdf = new PdfDocument();
            // Load the PDF document
            pdf.LoadFromFile("编程语言1.pdf");
            // Create a StringBuilder to collect the extracted text
            StringBuilder builder = new StringBuilder();
            // Create a PdfTableExtractor for the document
            PdfTableExtractor extractor = new PdfTableExtractor(pdf);
            // Declare an array to hold the tables found on each page
            PdfTable[] tableLists;
            // Loop through the pages of the PDF
            for (int pageIndex = 0; pageIndex < pdf.Pages.Count; pageIndex++)
            {
                // Extract the tables from the current page
                tableLists = extractor.ExtractTable(pageIndex);
                // Skip pages on which no table was found
                if (tableLists != null && tableLists.Length > 0)
                {
                    // Loop through the tables
                    foreach (PdfTable table in tableLists)
                    {
                        // Get the number of rows and columns in the table
                        int row = table.GetRowCount();
                        int column = table.GetColumnCount();
                        // Loop through the rows and columns
                        for (int i = 0; i < row; i++)
                        {
                            for (int j = 0; j < column; j++)
                            {
                                // Get the text of the cell at row i, column j
                                string text = table.GetText(i, j);
                                // Append the cell text, followed by a space, to the StringBuilder
                                builder.Append(text + " ");
                            }
                            // End the current table row with a line break
                            builder.Append("\r\n");
                        }
                    }
                }
            }
            // Save the extracted table content to a TXT file
            File.WriteAllText("提取表格.txt", builder.ToString());
        }
    }
}
Complete code (VB.NET):
Imports Spire.Pdf
Imports Spire.Pdf.Utilities
Imports System.IO
Imports System.Text
Namespace ExtractTable
    Class Program
        Private Shared Sub Main(args As String())
            ' Create a PdfDocument object
            Dim pdf As New PdfDocument()
            ' Load the PDF document
            pdf.LoadFromFile("编程语言1.pdf")
            ' Create a StringBuilder to collect the extracted text
            Dim builder As New StringBuilder()
            ' Create a PdfTableExtractor for the document
            Dim extractor As New PdfTableExtractor(pdf)
            ' Declare an array to hold the tables found on each page
            Dim tableLists As PdfTable()
            ' Loop through the pages of the PDF
            For pageIndex As Integer = 0 To pdf.Pages.Count - 1
                ' Extract the tables from the current page
                tableLists = extractor.ExtractTable(pageIndex)
                ' Skip pages on which no table was found
                If tableLists IsNot Nothing AndAlso tableLists.Length > 0 Then
                    ' Loop through the tables
                    For Each table As PdfTable In tableLists
                        ' Get the number of rows and columns in the table
                        Dim row As Integer = table.GetRowCount()
                        Dim column As Integer = table.GetColumnCount()
                        ' Loop through the rows and columns
                        For i As Integer = 0 To row - 1
                            For j As Integer = 0 To column - 1
                                ' Get the text of the cell at row i, column j
                                Dim text As String = table.GetText(i, j)
                                ' Append the cell text, followed by a space, to the StringBuilder
                                builder.Append(text & " ")
                            Next
                            ' End the current table row with a line break
                            builder.Append(vbCrLf)
                        Next
                    Next
                End If
            Next
            ' Save the extracted table content to a TXT file
            File.WriteAllText("提取表格.txt", builder.ToString())
        End Sub
    End Class
End Namespace
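The listings above separate cells with a single space, so a cell whose text itself contains spaces cannot be told apart from two adjacent cells once it is written to the TXT file. One possible alternative, shown here only as a minimal sketch, is to write each row as a CSV line with RFC 4180 style quoting. The names CsvRows, EscapeCell, and ToCsvLine are hypothetical helpers introduced for illustration; they are not part of Spire.PDF, and the sketch assumes CSV output is acceptable in place of plain text.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (not part of Spire.PDF): formats cell texts as CSV.
public static class CsvRows
{
    // Quote a cell when it contains a comma, a double quote, or a line break;
    // embedded double quotes are doubled. A null cell becomes an empty field.
    public static string EscapeCell(string cell)
    {
        string value = cell ?? string.Empty;
        bool needsQuotes = value.IndexOfAny(new[] { ',', '"', '\r', '\n' }) >= 0;
        return needsQuotes ? "\"" + value.Replace("\"", "\"\"") + "\"" : value;
    }

    // Join the cell texts of one table row into a single CSV line.
    public static string ToCsvLine(IEnumerable<string> cells)
    {
        return string.Join(",", cells.Select(EscapeCell));
    }
}
```

To use it with the extraction loop, one would collect the values returned by table.GetText(i, j) for a row into a list and append CsvRows.ToCsvLine(list) to the StringBuilder instead of appending each cell followed by a space. For instance, the cells C#, 1,000 and say "hi" would produce the line C#,"1,000","say ""hi""".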
Result: [screenshot]
