Converting HTML to text is simple but tedious. HTML files are text files with an .html or .htm extension, so turning one into plain text means removing the HTML tags and keeping the text of the web page. There are several ways to do this: third-party software can strip the tags from the file, or a user can delete the tags by hand in a text editor, or copy the text from a browser and paste it without formatting into a text editor.
Copy and paste the text
- Open the .html file in the browser by clicking “File” and “Open File”.
- Select the text by clicking on the page and dragging the cursor over it to highlight it. Press “Ctrl-C” (“Command-C” on a Mac) to copy the text.
- Open a text editor, such as Notepad on Windows or TextEdit on Mac OS X. Click “Edit” and select “Paste.” A plain-text editor like Notepad pastes only the text, so no HTML tags or formatting come along. However, if you are using a word processor such as Word as your text editor, choose “Paste Special” and paste as plain text or “text only” to remove the formatting.
Use third-party software
- Download third-party software of your choice that strips HTML tags from text, such as Doxillion Document Converter Software or HTMLAsText.
- Open the file in the third-party software. Depending on the software, you may need to click “File” and “Open”, or click “Browse”, to load the file you want to convert.
- Click “OK” or “Save” or “Convert”, depending on the software, to start the conversion process.
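If you would rather not install a dedicated converter, a short script can do the same job. The following is a minimal sketch, assuming Python 3 and its standard-library `html.parser` module. The names `TextExtractor` and `html_to_text` are made up for this illustration and do not come from any of the programs mentioned above. The sketch skips everything inside `<head>`, `<script>` and `<style>` and keeps the remaining text.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping <head>, <script> and <style> contents."""

    # Hypothetical choice for this sketch: elements whose contents are not page text.
    SKIPPED = {"head", "script", "style"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the text of an HTML string with the tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse the runs of whitespace left behind by the removed markup.
    return " ".join("".join(parser.parts).split())


if __name__ == "__main__":
    sample = "<html><head><title>Demo</title></head><body><p>Hello &amp; <b>world</b></p></body></html>"
    print(html_to_text(sample))
```

To convert a real file, read its contents into a string first and pass that string to `html_to_text`. Note that this sketch flattens all whitespace into single spaces, so paragraph breaks are lost.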
Remove tags in a text editor
- Open the file in a text editor by clicking “File” and “Open.”
- Find the <body> tag in the file. Delete the <body> tag and everything above it. That information tells the browser how to parse and display the file and is not part of the text.
- Look for the </body> tag near the bottom of the document. Delete it and everything below it.
- Remove all words and code between the less-than (<) and greater-than (>) symbols, along with the symbols themselves. These are the HTML tags. If your text editor has a search and replace command in the “Edit” menu and supports wildcards, search for “<*>” and replace it with a blank field. The asterisk is a wildcard that covers any text between the two symbols. In editors that use regular expressions instead, the equivalent pattern is “<[^>]*>”.
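The manual steps above can also be automated. The following is a rough sketch in Python 3 that mirrors them with the standard `re` and `html` modules. The function name `strip_tags` is hypothetical. Like the search-and-replace approach, it is a blunt tool: it leaves the contents of any script or style block in the body as text, and it can be confused by a “>” inside an attribute value or a comment.

```python
import html
import re


def strip_tags(source):
    """Mirror the manual steps: keep the body, then delete every <...> tag."""
    # Keep only what lies between the <body ...> tag and the </body> tag.
    match = re.search(r"<body\b[^>]*>(.*?)</body\s*>", source,
                      re.IGNORECASE | re.DOTALL)
    # If no body element is found, fall back to the whole input.
    body = match.group(1) if match else source
    # Remove each tag: "<", then anything that is not ">", then ">".
    text = re.sub(r"<[^>]*>", "", body)
    # Turn entities such as &amp; back into the characters they stand for.
    return html.unescape(text).strip()
```

Because each tag is replaced with nothing, text from adjacent elements runs together, just as it does in the text editor. Replacing tags with a space or a newline instead is a reasonable variation.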