Strip HTML Tags from your OpenOffice Document using Regular Expressions


OpenOffice ships with a stripped-down regular expression engine, used by both its Find/Replace utility and its Calc formulae. It still packs a punch for most common operations, however. Here is how you can strip HTML tags from an OpenOffice document using the Find/Replace dialog.

  • Open the Find/Replace dialog.
  • Click the More Options button and enable the regular expressions checkbox.
  • Enter <([:alpha:]+)[^>]*>([^<]*)</(\1)> in the find box.
  • Enter $2 in the replace box.
  • Press Replace All repeatedly until all tags are stripped out. Each pass only matches tags whose content contains no other tags, so nested tags come off from the innermost outward.
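
To see why Replace All has to be pressed more than once, here is a minimal Python sketch of the same idea. It is only an approximation of what the dialog does: Python's re module has no [:alpha:] shorthand, so [A-Za-z] stands in for it, \2 plays the role of $2, and the name strip_paired_tags is made up for this illustration.

```python
import re

# Python translation of the find pattern; [A-Za-z] is an assumed
# stand-in for OpenOffice's [:alpha:].
PAIRED_TAG = re.compile(r"<([A-Za-z]+)[^>]*>([^<]*)</\1>")


def strip_paired_tags(text):
    """Repeat the replace-all pass until a pass changes nothing."""
    while True:
        # One pass is the equivalent of pressing Replace All once:
        # every innermost tag pair is replaced by its content.
        text, count = PAIRED_TAG.subn(r"\2", text)
        if count == 0:
            return text


print(strip_paired_tags("<p>Hello <b>world</b>!</p>"))
```

The first pass can only remove the inner b pair, because the content of the p element still contains a "<". The second pass removes the p pair, and the third finds nothing and stops.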

This method will not work if the HTML is not well formed, and it will not remove self-closing tags. Properly self-closed tags can be removed by entering the pattern <[:alpha:]+[^>/]*/> in the find box and leaving the replace box empty.
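
The self-closed pattern can be tried out the same way. Again, this is a rough Python sketch rather than what OpenOffice runs internally: [A-Za-z] is assumed as a substitute for [:alpha:], and strip_self_closed is a hypothetical helper name.

```python
import re

# Python translation of the self-closed tag pattern; [A-Za-z] is an
# assumed stand-in for OpenOffice's [:alpha:].
SELF_CLOSED_TAG = re.compile(r"<[A-Za-z]+[^>/]*/>")


def strip_self_closed(text):
    """Replace every properly self-closed tag with an empty string."""
    return SELF_CLOSED_TAG.sub("", text)


print(strip_self_closed("line one<br/>line two<hr />"))
```

Because [^>/]* refuses to cross a slash, a tag whose attributes contain one, such as an img tag with a path in its src, is left untouched, as is a bare <br> that was never self-closed.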

For everything else you will have to devise your own custom expressions :-)