Google Docs to HTML — cleanly…
In this tutorial, we’ll explore a quick way to repurpose a Google word processing document for the web, using a trick that helps you produce clean, validated code.
by Scott Frangos, Managing Editor - WebHelperMagazine.com
Hey… have you used Google Documents lately? They’re getting more and more sophisticated, and there are more and more reasons to use them instead of Microsoft Office for spreadsheet and word processing work. Who cares… you’re a “WebMaster”, and you don’t use word processors to create your pages, right? Wait a minute, “big gun.” Sure, when your primary purpose is to code for the web, you would probably begin with a good coding tool (refer to the Amaya versus JEDIT article and the APTANA overview article for some good open-source choices).
But even though Google produces ugly code (see below), there are some reasons why you might choose to start your web content writing there:
- Clients: You might have a client that provides copy using Google Documents, or another Word Processing program, where this same process will work for you. This will make you a “client friendly” WebMaster.
- Focus on Writing: You might like to use the Document editing tools to focus on writing prior to creating a page — like using Google for Content Management (The Web IS your Content Management System).
- Auto Blog Posting: BlogMasters can use Google Docs to write and edit stories, then automatically post them to their blogs (after inputting some information about the blog).
- Collaboration: One great feature of Google Docs is the ability to see a “revision” of your document for each time it was saved, a great tool when you are collaborating with another editor or writer on the same document (The Web is a CMS).
- Quickly Clean Code: You can quickly clean up the mess of code Google Docs produces using the trick I’ll teach you here. This same trick will help you clean up code for any webpage!
Steps… here we go:
1. To follow along, get yourself a Google account (the spreadsheet and word processing tools are great), and create a sample document, at Docs.Google.com.
2. Create a test document — Google Docs allows you to insert tables and photos, so I did that:

Note the tab controls at top left. To insert images and tables, you first click the “Insert” tab to access those controls.
3. Take a look at the HTML… a run-on mess with no DocType declaration and no header or body tags:

This view is accessed by clicking the “Edit HTML” tab at top left. Sure, if you understand HTML you can read this, but it would be nice to have line numbers and properly structured code.
4. To confirm it’s an unvalidated mess, first click “Preview” (not “Publish”, unless you are ready to place your document online) to see the document as a web page. Google allows you to “publish” web pages right from Google Docs. Then run the HTML validator on it using the Web Developer toolbar in FireFox. Alternatively, you could jump over to the online validator at W3C.org and input the URL for the Google page you have made.

First we use the Web Developer extension for FireFox (essential for every Open Source WebMaster to have in their tool kit) to select Tools > Validate HTML.

The Validator makes a point of noting that no doctype is declared, and then lists 28 Validation Output errors by line number. A careful read of the Validator results shows a number of problems, some of which are in code that Google slaps into your document but did not show you when you viewed the document under “Edit HTML.”
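If you’d like a quick local check before heading to the validator, the structural problems noted above (no doctype, no html, head or body tags) can be spotted with a few lines of Python. This is only a minimal sketch using the standard library’s html.parser. The names StructureCheck and missing_structure, and the sample fragment, are made up for illustration; they are not part of Google Docs or the W3C validator, and the check is no substitute for real validation.

```python
from html.parser import HTMLParser


class StructureCheck(HTMLParser):
    """Record whether a doctype and the basic structural tags are present."""

    def __init__(self):
        super().__init__()
        self.has_doctype = False
        self.seen_tags = set()

    def handle_decl(self, decl):
        # Called for declarations such as <!DOCTYPE html ...>
        if decl.lower().startswith("doctype"):
            self.has_doctype = True

    def handle_starttag(self, tag, attrs):
        # The parser hands us tag names already lowercased
        self.seen_tags.add(tag)


def missing_structure(html_text):
    """Return a list of the structural pieces the markup lacks."""
    checker = StructureCheck()
    checker.feed(html_text)
    checker.close()
    missing = []
    if not checker.has_doctype:
        missing.append("doctype")
    for tag in ("html", "head", "body"):
        if tag not in checker.seen_tags:
            missing.append(tag)
    return missing


# Hypothetical fragment resembling an editor's raw output (an assumption,
# not the actual markup Google Docs produced for the article's test page)
fragment = '<h1>Moon facts</h1><table border="1"><tr><td>Heat map</td></tr></table>'
print(missing_structure(fragment))
```

An empty list from missing_structure only means the skeleton is there. The validator is still the judge of whether the page is actually valid.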
5. To clean the code, we’ll use the HTML Validator FireFox extension, which puts some teeth in the View Source dialog and actually allows you to clean your code. If you’re following along, you’ll need FireFox, or another Mozilla browser, with the HTML Validator extension installed.
6. View the page in preview mode again, then select View Page Source from the main menu at top:

Thanks to the HTML Validator extension, we now have a WebMaster’s view of the document with line numbers and some excellent clean-up options. The large window at top shows the source code with the additions from our friends at Google, while the bottom left window shows warnings which the “Tidy” program built into the extension can fix. Note that at times it may show errors it cannot fix.
* Note also that Tidy located 22 errors, while the W3C validator identified 28. Why? That’s because we’re showing the code cleaning option using the Tidy Parser, while the W3C program uses the SGML parser. But you get to choose which one to use. I like the Tidy Parser, even though it doesn’t “agree” with the W3C, because it offers the feature to “clean up the page.”
7. Next we’ll choose “Clean up the page” — your time saving trick.

You click “Clean up the page”, and the dialog box you see above appears with your “cleaned code”, courtesy of Tidy. Nice. Note that this view allows you to compare the original source code to the cleaned code and also see how each looks in the browser (top row of buttons). Keep “replace FONT tags by CSS” checked (bottom right) if you want the program to generate a nice internal style sheet for you.
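To get a feel for what “replace FONT tags by CSS” means, here is a rough Python sketch of the idea. It is not Tidy’s algorithm: Tidy generates an internal style sheet, while this illustration writes inline styles for brevity, handles only the color and face attributes, and assumes double-quoted attribute values. The function name font_to_span and the sample markup are hypothetical.

```python
import re

FONT_OPEN_RE = re.compile(r"<font\b([^>]*)>", re.IGNORECASE)
FONT_CLOSE_RE = re.compile(r"</font\s*>", re.IGNORECASE)
ATTR_RE = re.compile(r'(\w+)\s*=\s*"([^"]*)"')

# FONT attributes this sketch knows how to express as CSS properties
CSS_PROPERTY = {"color": "color", "face": "font-family"}


def font_to_span(html_text):
    """Swap <font ...> tags for <span style="..."> equivalents."""

    def replace_open(match):
        declarations = []
        for name, value in ATTR_RE.findall(match.group(1)):
            prop = CSS_PROPERTY.get(name.lower())
            if prop:
                declarations.append("%s: %s" % (prop, value))
        if not declarations:
            # Nothing we can translate, so emit a plain span
            return "<span>"
        return '<span style="%s">' % "; ".join(declarations)

    converted = FONT_OPEN_RE.sub(replace_open, html_text)
    return FONT_CLOSE_RE.sub("</span>", converted)


# Hypothetical sample, not taken from the article's page
sample = '<font color="#000099" face="Verdana">Moon</font>'
print(font_to_span(sample))
```

For real pages, let Tidy do this job. The sketch is only meant to show the kind of transformation the checkbox switches on.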
8. Once the page is “clean” (according to Tidy), I recommend you paste the cleaned code into your “real” webmaster editing program and remove any unwanted Google code. You’ll see that Google places a number of links to external style sheets and some JavaScript in your code (a lot of which has to do with the ability to edit your document using Google Docs), which you may simply delete. Note that at this writing, Google also pasted in code to tie your page into Google Analytics, a mind-blowing, comprehensive statistics tool, which you may wish to leave in.

Above is a shot of my editing program (“Taco HTML Edit”, excellent and free for Mac users… http://tacosw.com/) with some of the code that I later removed highlighted in blue. This program has a built-in validator I could use for my final check, but I’ve included the following step (Step 9) to indicate preferred use of the W3C.org validator… unless you know that your editing program already uses that validator.
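If you find yourself deleting the same kind of leftover code from page after page, the removal in Step 8 can be roughed out in Python. This is a minimal sketch under some loud assumptions: it uses regular expressions rather than a real HTML parser, it removes every script block (including any Analytics snippet you might want to keep), and the function name strip_editor_code and the sample markup are hypothetical.

```python
import re

# Matches a whole <script>...</script> block, across line breaks
SCRIPT_RE = re.compile(r"<script\b[^>]*>.*?</script\s*>", re.IGNORECASE | re.DOTALL)

# Matches <link> tags whose rel attribute is "stylesheet"
STYLESHEET_LINK_RE = re.compile(
    r'''<link\b[^>]*\brel\s*=\s*["']?stylesheet["']?[^>]*>''',
    re.IGNORECASE,
)


def strip_editor_code(html_text):
    """Remove script blocks and external style sheet links."""
    without_scripts = SCRIPT_RE.sub("", html_text)
    return STYLESHEET_LINK_RE.sub("", without_scripts)


# Hypothetical sample, not the actual code Google inserted
sample = (
    '<head><link rel="stylesheet" href="editor.css">'
    '<script type="text/javascript">var editing = true;</script>'
    '<title>Moon</title></head>'
)
print(strip_editor_code(sample))
```

Review the result by hand before you publish. A blunt pattern like this can just as easily remove a script or style sheet you meant to keep.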
9. Then validate your page once again using the W3C.org validator, and make the final changes by hand (you don’t want the machines to render you useless, do you?). For my page with a headline, a table and some copy… I now only have 6 errors, mostly having to do with how the table was coded — all easy, quick fixes.
Here’s the list of the final errors, for your reference:
* Line 77, Column 18: there is no attribute "TOPMARGIN". BODY TOPMARGIN="0" RIGHTMARGIN="0" LEFTMARGIN=
* Line 77, Column 34: there is no attribute "RIGHTMARGIN". BODY TOPMARGIN="0" RIGHTMARGIN="0" LEFTMARGIN=
* Line 78, Column 2: there is no attribute "LEFTMARGIN". "0" BOTTOMMARGIN="0"
* Line 78, Column 19: there is no attribute "BOTTOMMARGIN". "0" BOTTOMMARGIN="0"
* Line 87, Column 8: there is no attribute "BORDERCOLOR". "#000099" CELLPADDING="10" CELLSPACING=
* Line 259, Column 8: end tag for "DIV" omitted, but its declaration does not permit this. /BODY
* Line 85, Column 6: start tag was here. DIV CLASS="
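Five of these errors come from non-standard attributes, so the hand fix is mostly deletion (the same look can be had with CSS margin and border-color rules). As an illustration only, here is a Python sketch that strips those five attributes. The attribute list is taken from the errors above; the function name drop_invalid_attrs and the sample markup are hypothetical, and a regular expression this simple would also touch matching text outside of tags, so treat it as a helper to check by eye, not a validator.

```python
import re

# Attribute names reported as invalid in the list above
INVALID_ATTRS = ("topmargin", "rightmargin", "leftmargin", "bottommargin", "bordercolor")

# One attribute with a double-quoted, single-quoted, or bare value,
# plus the whitespace in front of it
ATTR_RE = re.compile(
    r"""\s+(?:{names})\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)""".format(
        names="|".join(INVALID_ATTRS)
    ),
    re.IGNORECASE,
)


def drop_invalid_attrs(html_text):
    """Delete the listed non-standard attributes wherever they appear."""
    return ATTR_RE.sub("", html_text)


# Hypothetical sample modeled on the error messages, not the real page
sample = (
    '<BODY TOPMARGIN="0" RIGHTMARGIN="0" LEFTMARGIN="0" BOTTOMMARGIN="0">'
    '<TABLE BORDERCOLOR="#000099" CELLPADDING="10"></TABLE></BODY>'
)
print(drop_invalid_attrs(sample))
```

The unclosed DIV is a different animal. Find where it opens (Line 85 in my page) and add the missing end tag by hand.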
The Bottom line (with some Q & A)…
The big question for webmasters is this: Is starting with Google Docs to create a document worth the effort to “tidy” up your code for the web? The answer is definitely “yes” if you have more than one purpose for the document, and particularly so if that’s the best way to work with a client. I also like the ability to focus on writing first, and since Tidy cleans 80%-95% of the code problems automatically, there isn’t too much work I have to do in order to get this Google Docs writing advantage. Besides, at this key transition/update time (when will it end?), it is good to exercise your hand validation skills as a WebMaster of the 2000s.
Q & A…
- Where can I go to understand how Tidy “fixes” my documents? Take a good look at the user guide for the HTML Validator extension.
- Why are the code error results in the Validator extension different from those reported by W3C.org’s validator? Again, the Validator example above shows results using the Tidy parser, while W3C.org uses the SGML parser. You can set the “algorithm” to use either parser after clicking the “Options” link:
Here is a table comparing the features of the three choices you have for parser algorithms in the Validator extension:

At first I thought the Serial algorithm might be a good choice (above right), but note that if there are errors, it does not open HTML Tidy, and therefore no cleanup page option is provided.
- Can you recommend some resources for understanding validation issues? Start with this surprisingly easy to read tutorial at W3C.org (usually they write in a stuffier style).
- What should I know about “doctype” declarations? They’re important. Check out this article at AListApart. Then read the official lowdown from the big boys at the W3C.org, and check out their list of different doctypes (bet you didn’t know there were that many!).
- Where did you get the heat map image of the Moon? I went to Wikipedia.org and looked up “Moon.” Then I selected an image that was not copyrighted, and in the public domain. Wikipedia is an excellent source for such images.
About the Author: Scott Frangos is a web developer, college instructor and graphic designer. You can catch his thoughts on WebMaster matters at his blog, the OpenSourceWebMaster.com. He is Managing Partner at WebFadds.com, a web development firm specializing in WordPress Content Management Websites. He lives in Portland, Oregon with his wife and partner, Pepper, and their three dogs: Wisdom, Spirit, & Steggman. © 2007 - Scott Frangos, all rights reserved.
WebMaster Skills is an evolving series of articles which you may access in any order, depending on your level of knowledge and needs. The series covers techniques, applications, and concepts for today’s webmasters including AJAX, HTML, PHP, and Web 2.0 topics.
This entry was posted on Thursday, December 27th, 2007 at 8:30 am and is filed under Blogging Help, OS WebMaster.