Sunday, 10 May 2009

How To: Edit Mac OS .pages documents in Linux

Seeing as I had to spend an hour or two figuring this out, I thought others might profit from my experience, at least given sufficient Google-fu...

Pages is, according to a Machead friend of mine, "the Word of Mac OS X". Yuk... Oh well, there is no arguing with these people. I had promised to help out with some grammar editing and translation on a document for another Mac user friend. My first step was to try OpenOffice, but it does not appear to have an import/export filter for the format, and neither Gmail nor Google Documents seemed to recognise it.

I was about to give up and suggest using some kind of open format or, if all else failed, some old, ghastly incarnation of .doc, when I noticed that pcmanfm, my file manager, had classified the files as 'zip archives'. Aha! It turns out that the file format is nothing more than a zip archive containing an XML file, whatever images have been added to the document, and a QuickLook folder with a low quality PDF version and a JPEG thumbnail.
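
A quick way to check this from a terminal, without unpacking anything, is to ask unzip for a listing. This is only a sketch: it assumes the Info-ZIP unzip tool is installed, and the function name and the file name in the usage comment are made up for illustration.

```shell
# Hypothetical helper: list the members of a .pages file without unpacking it.
# Assumes the Info-ZIP `unzip` command is available.
list_pages() {
    # -l prints a table of the archive members (size, date, name).
    unzip -l "$1"
}

# Usage (mydocument.pages is a placeholder name):
#   list_pages mydocument.pages
```

If the listing shows index.xml and a QuickLook folder, you are looking at the layout described here.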

What follows is a step-by-step guide to my patented hi-tech-lo-fi approach to editing .pages files on a Linux box. The outline is that you unzip the archive, make a few small adjustments using trusted old Unix/GNU tools like sed, cat, and tr, and can then easily edit the file using a text editor. Creating a new .pages file is simply a matter of running the process in reverse.
  1. Create a folder in which to unpack the archive or you'll have a small tar bomb on your hands. Using pcmanfm I just throw a copy of the archive into the new folder, choose extract here, and delete the archive copy.
  2. As mentioned above, the contents are very easy to understand:

    • The index.xml file is the main content file. This is the one to edit.

    • buildVersionHistory.plist is a file containing the version of the file format. I tried mailing an early example of my editing efforts to Machead friend 1 based on a file from Machead friend 2. The former got told by his copy of Pages that he needed to upgrade, so obviously the format has a version history.

    • The QuickLook folder contains a thumbnail image, presumably for the file open dialog, and a low quality PDF file (any images included are extremely blurry).
    • Whatever images have been added to the document will be in the 'root' folder, probably in TIFF format.

  3. The index.xml file can be 'hand edited' as it is, but the entire file is one long line (maybe because it saves a few kilobytes?), so we'll need to make some adjustments. Incidentally, opening the XML file as it was in any XML-aware text editor consumed practically all of my 2.0 GHz CPU. It was like editing in treacle. You will probably need and want to insert some newlines in order to make it easier on yourself. Note that 'text wrapping' will not do it: the CPU will still be struggling to handle this one huge line. We can insert the newlines with one line of sed. In the directory containing the unpacked archive files, do:

    sed -i 's,>,>\n,g' index.xml

    This will insert a line break after each tag ending and make the XML file a lot easier on the eyes and CPU.

  4. Now the XML file is ready for editing. Open it in your favorite text editor, preferably something with the ability to colour code XML. I recommend Geany, a light-weight but very capable IDE, or medit, an all-purpose text editor, both of them GTK2. The first large chunk of the file will be incomprehensible formatting codes. The actual text is at the bottom of the file. Try searching for an easily identifiable word or expression to find it. You can use the PDF file in the QuickLook folder as a visual guide to get some WYSIWYG help while editing. Be careful not to delete any XML tags: if the XML document is not well-formed and valid, chances are Pages won't open it. A few tags can, however, be easily identified and manipulated.

    [sf:br/]

    for instance is obviously a line break, and

    [sf:p style=""][/sf:p]

    is a paragraph. (Note that I have swapped the <'s for ['s, and so on, in order to circumvent the sabotage of Blogger's moronic editor.) Still, since you cannot know if anything is broken before the Machead in question opens the file, it might be better to leave well enough alone.

  5. Once you're done editing, save and close. Now, we need to reverse the sed effect. sed is very good at inserting line breaks but due to its line editor nature, not very good at removing them. So I resorted to tr:

    cat index.xml | tr -d '\n' > index.xml.new

    which will remove all the line breaks sed inserted. One line break, however, should be kept: the one after the XML version declaration ([?xml version="1.0"?]), so open index.xml.new and add it manually. Then

    mv index.xml.new index.xml

  6. Now, we just need to re-zip the whole thing. I prefer using pcmanfm's option: just select the entire content of the directory (index.xml, QuickLook, etc.), right click, choose 'compress', and select the 'zip' format. The same option is probably built into most file managers. The corresponding command line, run from inside the directory, is 'zip -r ../document.pages *': the first argument after the options is the name and location of the archive to create (document.pages is just a placeholder here), so you can give it the .pages extension directly. If you used the file manager instead, rename the archive from .zip to .pages. Then email it back to the Machead. That's it, you're done.
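
For the terminally inclined, the steps above can be strung together as a handful of shell functions. Treat this as a rough sketch rather than a finished tool: all function names are invented for illustration, it assumes GNU sed (for `-i` and `\n` in the replacement) and the Info-ZIP zip/unzip commands, and the automatic re-insertion of the line break after the XML declaration is my shortcut for the manual fix in step 5.

```shell
#!/bin/sh
# Sketch of the unpack / edit / repack round trip. All function names are
# hypothetical. Assumes GNU sed and the Info-ZIP zip and unzip commands.

# Step 1: unpack the archive into its own directory (avoids the "bomb").
# usage: unpack_pages document.pages workdir
unpack_pages() {
    mkdir -p "$2" && unzip -q "$1" -d "$2"
}

# Step 3: insert a line break after every '>' so editors can cope.
# usage: expand_xml workdir/index.xml
expand_xml() {
    sed -i 's,>,>\n,g' "$1"
}

# Step 5: remove every line break again, then put back the single one
# after the XML declaration (first '?>' only, since there is no 'g' flag).
# usage: collapse_xml workdir/index.xml
collapse_xml() {
    tr -d '\n' < "$1" | sed 's,?>,?>\n,' > "$1.new" && mv "$1.new" "$1"
}

# Step 6: zip the directory contents straight into a .pages file.
# The output path must be absolute because we cd into the work directory.
# usage: repack_pages workdir /absolute/path/to/output.pages
repack_pages() {
    ( cd "$1" && zip -q -r "$2" . )
}
```

A session would then look like `unpack_pages in.pages work`, `expand_xml work/index.xml`, edit, `collapse_xml work/index.xml`, and finally `repack_pages work "$PWD/out.pages"` (all file names being placeholders). Like the tr command in step 5, this strips every line break in the file, so it only suits documents whose XML contained none apart from the one after the declaration.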


I hope this has proven useful to you (and that I haven't just missed the correct option in OOo, in which case this is just stupid geekery for the sake of geekery...)
