Reading Tables from a Word Document

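As with reading zip files, the docx file is parsed as a binary stream; the table elements are then replaced with their HTML equivalents, and finally Web.Page parses the result as a web page.
The core code comes from KenR; I modified and simplified parts of it and wrapped it in a custom function.
The process is too involved to explain here, and you do not need to understand it: just call the custom function when you need it.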
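
For readers who still want a rough picture of the replacement step, here is a minimal sketch. The `Sample` string is a made-up fragment shaped like the table markup in word/document.xml (real documents carry many more tags, which the full function strips out first); `ReplaceList` is the same list used in the function.

```powerquery
let
    // Hypothetical fragment, not taken from a real document
    Sample = "<w:tbl><w:tr><w:tc>A</w:tc><w:tc>B</w:tc></w:tr><w:tr><w:tc>1</w:tc><w:tc>2</w:tc></w:tr></w:tbl>",
    // Word table tags paired with their HTML equivalents
    ReplaceList = {{"<w:tbl>","<table>"},{"</w:tbl>","</table>"},{"<w:tr>","<tr>"},{"</w:tr>","</tr>"},{"<w:tc>","<td>"},{"</w:tc>","</td>"}},
    // Apply each pair in turn; Html becomes
    // "<table><tr><td>A</td><td>B</td></tr><tr><td>1</td><td>2</td></tr></table>"
    Html = List.Accumulate(ReplaceList, Sample, (s, c) => Text.Replace(s, c{0}, c{1})),
    // Web.Page parses the HTML text and exposes the tables it finds in the Data column
    Tables = Web.Page(Html)[Data]
in
    [Html = Html, Tables = Tables]
```

Once the markup is plain HTML, Web.Page does the actual table extraction, which is why the function only needs text replacement rather than a full XML parser.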

For example, suppose there is a docx file on the desktop that contains these two tables:

The M code is as follows:

let
    docx=(path as text,optional index as number)=>
    let
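        // Recursively walks the ZIP entries starting at Position and returns the inflated bytes of FileToExtract
        DecompressFiles = (ZIPFile, Position, FileToExtract, XMLSoFar) => 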
            let 
                MyBinaryFormat = try BinaryFormat.Record([DataToSkip=BinaryFormat.Binary(Position),MiscHeader=BinaryFormat.Binary(18),FileSize=BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger32, ByteOrder.LittleEndian),UnCompressedFileSize=BinaryFormat.Binary(4),FileNameLen=BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger16, ByteOrder.LittleEndian),ExtrasLen=BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger16,ByteOrder.LittleEndian),TheRest=BinaryFormat.Binary()]) otherwise null,
                MyCompressedFileSize = try MyBinaryFormat(ZIPFile)[FileSize]+1 otherwise null,
                MyFileNameLen = try MyBinaryFormat(ZIPFile)[FileNameLen] otherwise null,
                MyExtrasLen = try MyBinaryFormat(ZIPFile)[ExtrasLen] otherwise null,
                MyBinaryFormat2 = try BinaryFormat.Record([DataToSkip=BinaryFormat.Binary(Position), Header=BinaryFormat.Binary(30), Filename=BinaryFormat.Text(MyFileNameLen), Extras=BinaryFormat.Binary(MyExtrasLen), Data=BinaryFormat.Binary(MyCompressedFileSize), TheRest=BinaryFormat.Binary()]) otherwise null,
                MyFileName = try MyBinaryFormat2(ZIPFile)[Filename] otherwise null,
                GetDataToDecompress = try MyBinaryFormat2(ZIPFile)[Data] otherwise null,
                DecompressData = try Binary.Decompress(GetDataToDecompress, Compression.Deflate) otherwise null,
                NewPosition = try Position + 30 + MyFileNameLen + MyExtrasLen + MyCompressedFileSize - 1 otherwise null,
                ImportedXML = DecompressData,
                AddedCustom = try Table.AddColumn(ImportedXML, "Filename", each MyFileName) otherwise ImportedXML,
                AppendedQuery = if ImportedXML = null then XMLSoFar else if (MyFileName = FileToExtract) then AddedCustom else if (FileToExtract = "") and Position <> 0 then Table.Combine({AddedCustom, XMLSoFar}) else AddedCustom   
            in
                if  (MyFileName = FileToExtract) or (AppendedQuery = XMLSoFar) then AppendedQuery else @DecompressFiles(ZIPFile, NewPosition, FileToExtract, AppendedQuery),
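        // word/document.xml normally has the XML declaration on its first line; the document body is on the second
        MyXML = Lines.FromBinary(DecompressFiles(File.Contents(path), 0, "word/document.xml", "")){1},
        // Word table tags paired with their HTML equivalents
        ReplaceList = {{"<w:tbl>","<table>"},{"</w:tbl>","</table>"},{"<w:tr>","<tr>"},{"</w:tr>","</tr>"},{"<w:tc>","<td>"},{"</w:tc>","</td>"}},
        // Every other tag found in the XML; these are stripped so that only table markup and text remain
        RemoveList = List.Select(List.Transform({0..List.Count(Text.PositionOf(MyXML,"<",Occurrence.All))-1},each "<"&Text.BetweenDelimiters(MyXML,"<",">",_)&">"),each not List.Contains(List.Zip(ReplaceList){0},_)),
        // Apply all replacements, parse the HTML with Web.Page, and drop the last entry (the page itself rather than a table)
        Result = List.RemoveLastN(Web.Page(List.Accumulate(List.Transform(RemoveList,each {_}&{""})&ReplaceList,MyXML,(s,c)=>Text.Replace(s,c{0},c{1})))[Data])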
    in
        try Result{index-1} otherwise Result
in
    docx

To use it, copy and paste the code above into the Advanced Editor. A custom function appears and asks for two parameters:

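The first parameter is the path to the docx document; the second is an optional index. The test file contains two tables, so passing 1 as the second parameter reads the first table and passing 2 reads the second. If the index is omitted, the function returns a list of all tables. The results on the test file are as follows: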
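
As a sketch, a call could look like the following. It assumes the function above has been saved as a query named `docx`; the file path is hypothetical and should be replaced with the location of your own file.

```powerquery
let
    // Hypothetical path; point this at your own document
    Path = "C:\Users\me\Desktop\test.docx",
    // Index 1 returns the first table in the document
    FirstTable = docx(Path, 1),
    // Omitting the index returns a list of all tables
    AllTables = docx(Path)
in
    [FirstTable = FirstTable, AllTables = AllTables]
```

Because the function ends with `try Result{index-1} otherwise Result`, an index larger than the number of tables also falls back to the full list instead of raising an error.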

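11 Replies to “Reading Tables from a Word Document”

  1. Impressive. Extracting several tables in one go is very convenient.
    Could you write a post about comparing Excel files, achieving something similar to Spreadsheet Compare 2016, or share some ideas on how to do it? Thanks.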
