Reading Tables from Word Documents – Power Query爱好者

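As with reading a zip archive, the docx file is parsed as a binary stream, its table elements are replaced with their HTML equivalents, and the result is parsed as a web page with Web.Page.
The core code comes from KenR; I modified and simplified parts of it and wrapped it in a custom function.
The process is too involved to explain here, and you do not need to understand it: just call the custom function when you need it.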
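For readers who want a feel for the last two steps anyway, here is a minimal sketch of the tag replacement followed by Web.Page. The XML string is a hand-written fragment without attributes, not the content of a real document, and the step names are made up for illustration.

```powerquery
let
    // Hypothetical, simplified fragment: a 2x2 Word table with all other markup stripped
    Xml = "<w:tbl><w:tr><w:tc>Name</w:tc><w:tc>Qty</w:tc></w:tr><w:tr><w:tc>Apple</w:tc><w:tc>3</w:tc></w:tr></w:tbl>",
    // Word table tags paired with the HTML tags that replace them
    Pairs = {
        {"<w:tbl>", "<table>"}, {"</w:tbl>", "</table>"},
        {"<w:tr>", "<tr>"}, {"</w:tr>", "</tr>"},
        {"<w:tc>", "<td>"}, {"</w:tc>", "</td>"}
    },
    // Apply each replacement in turn to the running text
    Html = List.Accumulate(Pairs, Xml, (state, pair) => Text.Replace(state, pair{0}, pair{1})),
    // Web.Page parses the HTML text; the [Data] column holds what it extracted
    Parsed = Web.Page(Html)[Data]
in
    Parsed
```

The full function does the same thing on the real `word/document.xml`, after first removing every tag that is not part of the table markup.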

For example, suppose there is a docx file on the desktop that contains two tables like these:

The M code is as follows:

let
    docx=(path as text,optional index as number)=>
    let
        DecompressFiles = (ZIPFile, Position, FileToExtract, XMLSoFar) => 
            let 
                MyBinaryFormat = try BinaryFormat.Record([DataToSkip=BinaryFormat.Binary(Position),MiscHeader=BinaryFormat.Binary(18),FileSize=BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger32, ByteOrder.LittleEndian),UnCompressedFileSize=BinaryFormat.Binary(4),FileNameLen=BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger16, ByteOrder.LittleEndian),ExtrasLen=BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger16,ByteOrder.LittleEndian),TheRest=BinaryFormat.Binary()]) otherwise null,
                MyCompressedFileSize = try MyBinaryFormat(ZIPFile)[FileSize]+1 otherwise null,
                MyFileNameLen = try MyBinaryFormat(ZIPFile)[FileNameLen] otherwise null,
                MyExtrasLen = try MyBinaryFormat(ZIPFile)[ExtrasLen] otherwise null,
                MyBinaryFormat2 = try BinaryFormat.Record([DataToSkip=BinaryFormat.Binary(Position), Header=BinaryFormat.Binary(30), Filename=BinaryFormat.Text(MyFileNameLen), Extras=BinaryFormat.Binary(MyExtrasLen), Data=BinaryFormat.Binary(MyCompressedFileSize), TheRest=BinaryFormat.Binary()]) otherwise null,
                MyFileName = try MyBinaryFormat2(ZIPFile)[Filename] otherwise null,
                GetDataToDecompress = try MyBinaryFormat2(ZIPFile)[Data] otherwise null,
                DecompressData = try Binary.Decompress(GetDataToDecompress, Compression.Deflate) otherwise null,
                NewPosition = try Position + 30 + MyFileNameLen + MyExtrasLen + MyCompressedFileSize - 1 otherwise null,
                ImportedXML = DecompressData,
                AddedCustom = try Table.AddColumn(ImportedXML, "Filename", each MyFileName) otherwise ImportedXML,
                AppendedQuery = if ImportedXML = null then XMLSoFar else if (MyFileName = FileToExtract) then AddedCustom else if (FileToExtract = "") and Position <> 0 then Table.Combine({AddedCustom, XMLSoFar}) else AddedCustom   
            in
                if  (MyFileName = FileToExtract) or (AppendedQuery = XMLSoFar) then AppendedQuery else @DecompressFiles(ZIPFile, NewPosition, FileToExtract, AppendedQuery),
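        // Assumes the body of word/document.xml sits on the second line, after the XML declaration
        MyXML = Lines.FromBinary(DecompressFiles(File.Contents(path), 0, "word/document.xml", "")){1},
        // Word table tags paired with their HTML equivalents
        ReplaceList = {{"<w:tbl>","<table>"},{"</w:tbl>","</table>"},{"<w:tr>","<tr>"},{"</w:tr>","</tr>"},{"<w:tc>","<td>"},{"</w:tc>","</td>"}},
        // Every other tag found in the body; these are stripped so only table markup and text remain
        RemoveList = List.Select(List.Transform({0..List.Count(Text.PositionOf(MyXML,"<",Occurrence.All))-1},each "<"&Text.BetweenDelimiters(MyXML,"<",">",_)&">"),each not List.Contains(List.Zip(ReplaceList){0},_)),
        // Strip the other tags, swap in the HTML tags, parse with Web.Page, and drop the last entry of the result
        Result = List.RemoveLastN(Web.Page(List.Accumulate(List.Transform(RemoveList,each {_}&{""})&ReplaceList,MyXML,(s,c)=>Text.Replace(s,c{0},c{1})))[Data])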
    in
        try Result{index-1} otherwise Result
in
    docx
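
The `DecompressFiles` helper walks the archive one entry at a time by reading each ZIP local file header: 18 bytes of signature and flags, then the compressed size, the uncompressed size, the file name length and the extra field length, 30 bytes in total. As a rough sketch of just that step, the query below reads the header of the first entry only. The path and the field names are placeholders chosen for illustration, and the sketch assumes the archive stores its sizes in the local header; archives written with data descriptors may show zero there.

```powerquery
let
    // Hypothetical path; replace with a real docx file
    Source = File.Contents("C:\Users\me\Desktop\test.docx"),
    // Layout of a ZIP local file header (30 fixed bytes), little-endian numbers
    LocalHeader = BinaryFormat.Record([
        Misc = BinaryFormat.Binary(18),
        CompressedSize = BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger32, ByteOrder.LittleEndian),
        UncompressedSize = BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger32, ByteOrder.LittleEndian),
        FileNameLen = BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger16, ByteOrder.LittleEndian),
        ExtrasLen = BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger16, ByteOrder.LittleEndian),
        // Everything after the fixed header: file name, extra field, compressed data, later entries
        TheRest = BinaryFormat.Binary()
    ]),
    Header = LocalHeader(Source),
    // The file name follows immediately after the 30 fixed bytes
    FileName = Text.FromBinary(Binary.Range(Header[TheRest], 0, Header[FileNameLen])),
    Summary = [
        FileName = FileName,
        CompressedSize = Header[CompressedSize],
        ExtrasLen = Header[ExtrasLen]
    ]
in
    Summary
```

The function above repeats this read at each new position until it reaches the entry named `word/document.xml`, then inflates that entry with `Binary.Decompress`.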

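To use it, copy the code above into the Advanced Editor. A custom function appears and asks for two parameters:

The first parameter is the path of the docx document. The second is an optional index: the test file contains two tables, so passing 1 reads the first table and passing 2 reads the second. If the index is omitted, the function returns a list of all tables. The test result looks like this: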
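
As a rough usage sketch, the query below calls the function in three ways. It assumes the function was saved as a query named `docx`; the file path, the folder path, and the step names are placeholders, not values from the attachment. The folder variant is one possible way to apply the function to several documents at once.

```powerquery
let
    // Hypothetical path to a single document
    Path = "C:\Users\me\Desktop\test.docx",
    // Second table only
    SecondTable = docx(Path, 2),
    // Index omitted: a list containing every table in the document
    AllTables = docx(Path),
    // Hypothetical folder holding several docx files
    Files = Folder.Files("C:\Users\me\Desktop\reports"),
    // Keep docx files and skip the "~$" lock files Word creates for open documents
    DocxOnly = Table.SelectRows(Files, each Text.Lower([Extension]) = ".docx" and not Text.StartsWith([Name], "~$")),
    // One list of tables per file
    WithTables = Table.AddColumn(DocxOnly, "Tables", each docx([Folder Path] & [Name]))
in
    WithTables
```

Because M evaluates lazily, only the steps that the final expression depends on are run; change the expression after `in` to `SecondTable` or `AllTables` to see the single-file results.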

Attachment

Reading Tables from Word Documents (15 kB)


11 Replies to “Reading Tables from Word Documents”

李伟坚 said:

January 21, 2018 at 2:56 PM

Real depth; truly master-level work.
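Leo said:

February 28, 2018 at 7:09 AM

Sharp. Extracting several tables in one go is really convenient.
Could you write a post on comparing Excel files, something like what Spreadsheet Compare 2016 does, or at least outline an approach? Thanks.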
txbzgh said:

July 9, 2018 at 11:08 AM

Too advanced for me. It looks very useful, but I can't follow it yet.
hugo said:

July 26, 2018 at 7:30 PM

Can this batch-edit Word files?.....
1. 施阳 said:
  
  July 26, 2018 at 7:32 PM
  
  That is not what PQ is for.
hugo said:

November 15, 2018 at 10:22 PM

Thanks, it works, but extraction is too slow.....
baigu3 said:

June 6, 2019 at 8:32 AM

I can't follow it, so I'll just take it and use it as-is.
绿谷龙芽 said:

June 7, 2019 at 11:50 AM

So this has been possible all along. I'm throwing out the VBA code I found on Baidu!
Learned something!
老渔翁 said:

June 25, 2022 at 3:54 PM

A folder contains many single-page Word documents. How do I load them?
1. paul said:
  
  September 5, 2023 at 12:58 PM
  
  Extraction is slow: batch-extracting from 38 Word files, an hour in and still waiting!
黑白大猫 said:

April 26, 2024 at 4:33 PM

This code has a bug: it cannot read tables that have only one column.
