Using R To Get Data *Out Of* Word Docs

NOTE: after reading this post, head on over to the newer docxtractr post, which wraps this functionality (and more!) into a package.

Also: docxtractr is now on CRAN


This was asked on Twitter recently:

The answer is a very cautious “yes”. Much depends on how well-formed and un-formatted the table is.

Take this really simple docx file: data.docx.

It has a single table in it:

[Image: data_docx]

Now, .docx files are just zipped directories, so rename that to data.zip, unzip it and navigate to data/word/document.xml and you’ll see something like this (though it’ll be more compressed):

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:mo="http://schemas.microsoft.com/office/mac/office/2008/main" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:mv="urn:schemas-microsoft-com:mac:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape" mc:Ignorable="w14 w15 wp14">
<w:body>
    <w:tbl>
        <w:tblPr>
            <w:tblStyle w:val="TableGrid"/>
            <w:tblW w:w="0" w:type="auto"/>
            <w:tblLook w:val="04A0" w:firstRow="1" w:lastRow="0" w:firstColumn="1" w:lastColumn="0" w:noHBand="0" w:noVBand="1"/>
        </w:tblPr>
        <w:tblGrid>
            <w:gridCol w:w="2337"/>
            <w:gridCol w:w="2337"/>
            <w:gridCol w:w="2338"/>
            <w:gridCol w:w="2338"/>
        </w:tblGrid>
        <w:tr w:rsidR="00244D8A" w14:paraId="6808A6FE" w14:textId="77777777" w:rsidTr="00244D8A">
            <w:tc>
                <w:tcPr>
                    <w:tcW w:w="2337" w:type="dxa"/>
                </w:tcPr>
                <w:p w14:paraId="7D006905" w14:textId="77777777" w:rsidR="00244D8A" w:rsidRDefault="00244D8A">
                    <w:r>
                        <w:t>This</w:t>
                    </w:r>
                </w:p>
            </w:tc>
            <w:tc>
                <w:tcPr>
                    <w:tcW w:w="2337" w:type="dxa"/>
                </w:tcPr>
                <w:p w14:paraId="13C9E52C" w14:textId="77777777" w:rsidR="00244D8A" w:rsidRDefault="00244D8A">
                    <w:r>
                        <w:t>Is</w:t>
                    </w:r>
                </w:p>
            </w:tc>
...

We can easily make out a table structure with rows and columns. In the simplest cases (which is all I’ll cover in this post), where the rows and columns are uniform, it’s pretty easy to grab the data:

library(xml2)

# read in the XML file
doc <- read_xml("data/word/document.xml")

# there is an egregious use of namespaces in these files
ns <- xml_ns(doc)

# extract all the table cells (this is assuming one table in the document)
cells <- xml_find_all(doc, ".//w:tbl/w:tr/w:tc", ns=ns)

# convert the cells to a matrix, then to a data.frame
dat <- data.frame(matrix(xml_text(cells), ncol=4, byrow=TRUE), 
                  stringsAsFactors=FALSE)

# if there are column headers, make them the column names and remove that row
colnames(dat) <- dat[1,]
dat <- dat[-1,]
rownames(dat) <- NULL

dat

##   This      Is     A   Column
## 1    1     Cat   3.4      Dog
## 2    3    Fish 100.3     Bird
## 3    5 Pelican   -99 Kangaroo

You’ll need to clean up the column types, but you have at least freed the data from the evil file format it was in.
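
One way to do that cleanup, as a minimal sketch in base R: the data frame below is rebuilt by hand from the output above (so the snippet runs on its own), and `type.convert()` is asked to guess a type for each column. This is just one option, not the only way to fix the types.

```r
# rebuild the extracted table by hand; every column starts out as character
dat <- data.frame(
  This   = c("1", "3", "5"),
  Is     = c("Cat", "Fish", "Pelican"),
  A      = c("3.4", "100.3", "-99"),
  Column = c("Dog", "Bird", "Kangaroo"),
  stringsAsFactors = FALSE
)

# let base R guess each column's type; as.is=TRUE keeps text as character
# instead of turning it into factors
dat[] <- lapply(dat, type.convert, as.is = TRUE)

# This becomes integer, A becomes numeric, Is and Column stay character
sapply(dat, class)
```

Assigning into `dat[]` keeps the result a data.frame with the same column names rather than a plain list.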

If there is more than one table you can use XML node targeting to process each one separately or into a list. I’ve wrapped that functionality into a rudimentary function that will:

  • auto-copy a Word doc to a temporary location
  • rename it to a zip
  • unzip it to a temporary location
  • read in the document.xml
  • auto-determine the number of tables in the document
  • auto-calculate # rows & # columns per table
  • convert each table
  • return all the tables into a list
  • clean up the temporarily created items

library(xml2)

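get_tbls <- function(word_doc) {

  # copy the Word doc to a temporary file with a .zip extension
  tmpd <- tempdir()
  tmpf <- tempfile(tmpdir=tmpd, fileext=".zip")

  file.copy(word_doc, tmpf)

  # unzip it to a temporary location
  unzip(tmpf, exdir=sprintf("%s/docdata", tmpd))

  # read in the main document part
  doc <- read_xml(sprintf("%s/docdata/word/document.xml", tmpd))

  # the XML is now in memory, so clean up the temporarily created items
  unlink(tmpf)
  unlink(sprintf("%s/docdata", tmpd), recursive=TRUE)

  ns <- xml_ns(doc)

  # find every table in the document
  tbls <- xml_find_all(doc, ".//w:tbl", ns=ns)

  # convert each table to a data.frame and return them all in a list
  lapply(tbls, function(tbl) {

    cells <- xml_find_all(tbl, "./w:tr/w:tc", ns=ns)
    rows <- xml_find_all(tbl, "./w:tr", ns=ns)

    # columns per row = total cells / number of rows (assumes a uniform table)
    dat <- data.frame(matrix(xml_text(cells),
                             ncol=(length(cells)/length(rows)),
                             byrow=TRUE),
                      stringsAsFactors=FALSE)

    # treat the first row as the column headers
    colnames(dat) <- dat[1,]
    dat <- dat[-1,]
    rownames(dat) <- NULL
    dat

  })

}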

Using this multi-table Word doc – doc3:

[Image: data3]

we can extract the three tables thusly:

get_tbls("~/Dropbox/data3.docx")

## [[1]]
##   This      Is     A   Column
## 1    1     Cat   3.4      Dog
## 2    3    Fish 100.3     Bird
## 3    5 Pelican   -99 Kangaroo
## 
## [[2]]
##   Foo Bar Baz
## 1  Aa  Bb  Cc
## 2  Dd  Ee  Ff
## 3  Gg  Hh  ii
## 
## [[3]]
##   Foo Bar
## 1  Aa  Bb
## 2  Dd  Ee
## 3  Gg  Hh
## 4  1    2
## 5  Zz  Jj
## 6  Tt  ii

This function tries to calculate the rows/columns per table but it does rely on a uniform table structure.
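
If you want to know up front whether a table is uniform, a rough check is to count the cells in each row and see whether the counts all match. The sketch below is only an illustration: `is_uniform()` is a hypothetical helper (not part of xml2 or of the function above), and the XML is a tiny made-up fragment whose second row has a single cell, as can happen when cells are merged horizontally.

```r
library(xml2)

# a made-up WordprocessingML fragment: row 1 has two cells, row 2 has one
xml <- '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
<w:body><w:tbl>
<w:tr><w:tc><w:p><w:r><w:t>A</w:t></w:r></w:p></w:tc><w:tc><w:p><w:r><w:t>B</w:t></w:r></w:p></w:tc></w:tr>
<w:tr><w:tc><w:p><w:r><w:t>merged</w:t></w:r></w:p></w:tc></w:tr>
</w:tbl></w:body></w:document>'

doc <- read_xml(xml)
ns <- xml_ns(doc)

# hypothetical helper: TRUE when every row of the table has the same cell count
is_uniform <- function(tbl, ns) {
  rows <- xml_find_all(tbl, "./w:tr", ns=ns)
  n_cells <- vapply(rows,
                    function(r) length(xml_find_all(r, "./w:tc", ns=ns)),
                    integer(1))
  length(unique(n_cells)) <= 1
}

tbl <- xml_find_first(doc, ".//w:tbl", ns=ns)

is_uniform(tbl, ns)
## [1] FALSE
```

A table that fails this check would need its rows handled one at a time rather than being reshaped with a single `matrix()` call.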

Have an alternate method or more feature-complete way of handling Word docs as tabular data sources? Then definitely drop a note in the comments.


5 Comments Using R To Get Data *Out Of* Word Docs

  2. David Luckett

    Bob,
    This is an excellent post and a very useful function – well done, and thanks for sharing.
    Cheers
    David

  3. richard telford

    Many thanks. I’ll try this – if my collaborators share the data (long story). There are also some metadata in each file that needs to be captured and are hopefully in a consistent format…
