25 Years of Programming
An open source source for C, C++, OWL, BASIC, MDB, XLS, DOT, and more...

Online calculator to count repeated words and phrases in text or web page HTML source code

Instructions

Enter text in the box and click Submit. The analysis is displayed in the text box.

  • Text length is limited to 200,000 characters. Anything beyond that is truncated.
  • It is OK to paste web page source code. HTML tags, comments, JavaScript, PHP, and ASP are removed automatically (see note 5).
  • The output report is a TAB-separated list for copying and pasting into a spreadsheet:
    PHRASE = the repeated word or phrase.
    COUNT = the number of times it occurred in the text.
    PERCENT = the percentage of the total text accounted for by the repetitions of this word or phrase (see note 4).
  • The report starts with a few statistics and status lines, not in 3-column format:
    --Total word count (including stop words) after tag removal.
    --Number of unique words.
    --Number of characters and lines (including spaces and blank lines) of the raw source text submitted.
    --Number of characters in the text actually analyzed (after tag removal and some other processing).
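
As a rough illustration of the tag removal and the status-line statistics described above, here is a minimal Python sketch. It is not the calculator's actual code: the names `TextExtractor`, `strip_tags`, and `report_stats` are made up for this example, the word-splitting rule is an assumption, and only tags, comments, scripts, and style sheets are handled (PHP and ASP blocks are left out of the sketch).

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Comments never reach this method; the base class drops them.
        if self._skip_depth == 0:
            self.parts.append(data)


def strip_tags(source, limit=200_000):
    """Truncate to the character limit, drop markup, collapse whitespace."""
    parser = TextExtractor()
    parser.feed(source[:limit])
    parser.close()
    # Joining with a space keeps adjacent paragraphs from running together,
    # at the cost of splitting a word that has a tag in the middle of it.
    return " ".join(" ".join(parser.parts).split())


def report_stats(raw):
    """Compute figures similar to the report's status lines (assumed rules)."""
    text = strip_tags(raw)
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {
        "total_words": len(words),
        "unique_words": len(set(words)),
        "raw_chars": len(raw),
        "raw_lines": len(raw.splitlines()),
        "analyzed_chars": len(text),
    }
```

Calling `report_stats(page_source)` on a page's HTML returns a dictionary with the word counts and the raw and analyzed character counts.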

Phrase Counter background

The program began as a small part of an artificial intelligence project, to explore whether analyzing repeated words and phrases might help summarize text to get a sense of what it is about. It seems to be useful for that, at least as a quick and easy first step in the analysis. It might prove even more useful when combined with analysis of the word meanings, relationships, and grammatical constructions, tasks that are more complex and difficult and that I made some progress with in my "WTalk.cpp" chatterbot project.

Using Phrase Counter for search engine optimization

One day, I realized that a list of word and phrase repetitions has a more immediately practical application: analysis of keyword density on a web page.

"Keywords" are the words and phrases that search engines extract from a web page to determine what it is about (sound familiar?) and how well its content matches a particular search query, that is, how "relevant" the page is for the query. If the query is about "search engine optimization" or "SEO", the search engine will look in its index for pages mentioning those words. A page that mentions those terms many times is probably about that topic, and so highly relevant. A page that mentions them only once or twice might be mostly about some other topic, with only tangential references to SEO, and so less relevant.

That makes it sound as though the best way to make your page relevant to a particular search query is to use your targeted search terms as many times as humanly possible in your text, right? Maybe, but that practice is called "keyword stuffing", and it is not the way to achieve top rankings! Although it can make your page look highly relevant, it also signals to search engines that your page is probably low quality, causing it to drop far down in search results.

There are usually many relevant pages for a search query. Determining which ones to show first in the results is based on factors other than keywords. It has to do with how good an authority the search engine thinks the page is likely to be for the topic. A highly relevant but poor quality page might not appear in the search results at all.

The formula that a search engine uses for estimating the likely quality of a page, and therefore how high to place it in search results, is known as its ranking algorithm. The algorithms of different search engines are probably very different from each other. How much (or whether) a particular SEO technique will affect rankings at one or more search engines is something that can only be determined by experimentation.

Adjusting keyword density (the number of repetitions of keywords and phrases) to an optimum level (not too much, not too little) is a technique used by SEO consultants to make a page look relevant for targeted search queries while trying to avoid penalties for keyword stuffing.

Some webmasters in competitive market niches adjust their keyword densities to match those of competitor sites that rank above them in search results.    

I have even seen recommendations online that a web page should consist of a specific percentage of targeted keywords. The exact percentage varies over time, according to what people think is working best at Google.
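
To check a single targeted phrase against such a percentage, the PERCENT formula from note 4 on this page can be applied directly. The Python sketch below is only an illustration: the function names `tokenize` and `keyword_density` are made up here, and the word-splitting rule is an assumption that may not match how Phrase Counter tokenizes text.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (assumed rule)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def keyword_density(text, phrase):
    """Return (count, percent) for one target phrase.

    percent = count * words_in_phrase / total_words * 100,
    the same formula as the PERCENT column of the report.
    """
    words = tokenize(text)
    target = tokenize(phrase)
    n = len(target)
    if not words or n == 0:
        return 0, 0.0
    # Slide a window of n words over the text; overlapping matches count.
    count = sum(
        1 for i in range(len(words) - n + 1) if words[i:i + n] == target
    )
    return count, count * n / len(words) * 100
```

For example, `count, percent = keyword_density(page_text, "search engine optimization")` gives the number of occurrences of the phrase and the share of the page's words it accounts for, where `page_text` is whatever text you want to check.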

Using Phrase Counter to improve AdSense targeting

AdSense advertising is "contextually targeted", which means that AdSense ads on a page are supposed to be about the same topics that the page is about. That in turn means Google must determine, in a way not as specific as matching text against a search query, what the page is generally "about". Word and phrase repetitions must play a part in that determination.

It has happened several times that I've published a new web page, only to discover that the AdSense targeting was poor, not relevant to the topic at all. It was easy to see where AdSense was going wrong, usually by misinterpreting the meaning or context of a repeated word. More than once, I was able to fix the problem, and trigger correctly targeted ads, by changing one repeated word on the page, or by changing every occurrence of an ambiguous word to a two-word phrase that made the meaning clear and that could not be misinterpreted. 

I still have some pages where AdSense ads are not about the topics I'd expect, and not what I'd prefer them to be. When I ran the text of one of the pages through Phrase Counter, I was surprised to see that the most often repeated words were not what I expected, and not tightly focused on the topic of the page. A human reader couldn't mistake what the page is about, but it appears that the method AdSense uses for determining the page topic might be misinterpreting it. I'm not in general much of an advocate for the importance of keyword density, but in this case it does appear that the lack of it might be creating confusion, and an adjustment might be called for.

If the full text of a page can't be keyword-adjusted for better AdSense targeting, there is always AdSense section targeting. It allows you to manually tag the text that you want AdSense to either emphasize, or ignore, for the purpose of targeting ads. 

One factor that can affect page topic interpretation for AdSense purposes more than for search engine purposes is keyword "heat". One or two occurrences of a high-heat keyword can potentially trigger off-topic AdSense ads. That is, some keywords seem to be so important to AdSense that a single occurrence can outrank other words and phrases on the page even if they are repeated. 

Notes about Phrase Counter

  1. Most punctuation and many words don't contribute much to the meaning of a text. Some examples are "the", "a", and "and". These insignificant but very common words are sometimes called "stop words".

    In the output report, these stop words are by default not listed as single words. It doesn't matter how many times "the" is repeated in a text. Phrases that consist only of stopwords are not reported, either.

    When stopwords occur in a repeated phrase that does have significant words in it, the stopwords are included to preserve the meaning of the phrase.

    The list of ignored stopwords (English words only, but with some international punctuation characters also) is in this plain text (UTF-8) file. (Last revised Fri 2011-03-18 18:27:46 -0700.)

    To force individual stopwords and stopword-only phrases to be included in the analysis, clear the "Omit insignificant words" checkbox.
     
  2. If a long text has words and phrases that are repeated 40, 50, 60 times, the ones that are only repeated a few times fade into insignificance. There's not much point reporting them. The "repeat threshold" setting allows you to omit phrases that weren't repeated frequently enough to be of interest.
     
  3. The program always reports repeated single words (phrases of length 1). It is then able to search for repeated phrases of increasing lengths until it stops finding any. However, some input texts can contain very long sections repeated in their entirety, causing the calculator to waste time finding useless repeated phrases hundreds of words long. To prevent this, the maximum phrase length to look for is limited. Phrases longer than 8 words (tokens) are seldom useful, so that is the maximum allowed. Often, a maximum phrase length of 2 is sufficient.
     
  4. The PERCENT column = ([# of repetitions of the phrase] * [# of words in the phrase] / [total word count of the text]) * 100, to make it a percentage. When phrases overlap, it's possible for this figure to exceed 100%.

    Example: input text = "test test test test test test".
    The 3-word phrase "test test test" is repeated 4 times, starting from positions 1, 2, 3, and 4.
    4 repetitions x 3 words / 6 words total = 12/6 = 2 = 200%.
     
  5. When analyzing HTML source code: the calculator requires well-formed (valid) HTML to properly parse and remove all the tags. If there are errors in the HTML, there are likely to be errors in the result. If your HTML is invalid, an alternative method that will leave out the tags is to load your web page into a browser, then click and drag or use Ctrl+A to select all the text, and paste that into the Phrase Counter.
     
  6. The N-word sequences extracted by Phrase Counter are called n-grams.
     
  7. The output report can be incomplete (with a warning in the status lines) if Phrase Counter exceeds its maximum allowed execution time or if the amount of text in the report is excessive. Make one or more of the following changes to get a complete report:

    a) Check the "Omit insignificant words" box (this only makes a difference if the input text is English).
    b) Set "Only show words/phrases repeated at least this many times" to a higher number.
    c) Set "Show phrases whose length in words = 1 to N" to a lower number.
    d) Shorten the amount of text you submit.
     
  8. If you need a different format from the default TAB-delimited output, my online regular expression search-and-replace text editor can help convert the text to a different format.
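
The notes above (stop words, repeat threshold, maximum phrase length, and the PERCENT formula) can be pulled together in a short Python sketch. This is not Phrase Counter's source code: the function names, the tiny stop word list, the word-splitting rule, and the sort order are all assumptions made for the illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop word list; the real list is much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it"}


def tokenize(text):
    """Lowercase the text and split it into word tokens (assumed rule)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def count_phrases(text, max_len=2, min_repeats=2, omit_stop_words=True):
    """Return (phrase, count, percent) rows for repeated n-grams."""
    words = tokenize(text)
    total = len(words)
    rows = []
    for n in range(1, min(max_len, 8) + 1):  # note 3: never longer than 8
        grams = Counter(tuple(words[i:i + n]) for i in range(total - n + 1))
        found = False
        for gram, count in grams.items():
            if count < min_repeats:  # note 2: repeat threshold
                continue
            found = True
            if omit_stop_words and all(w in STOP_WORDS for w in gram):
                continue  # note 1: skip stop-word-only phrases
            percent = count * n / total * 100  # note 4: PERCENT column
            rows.append((" ".join(gram), count, percent))
        if not found:
            break  # no repeated n-grams means no repeated (n+1)-grams
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows


def format_report(rows):
    """Render rows as TAB-separated PHRASE, COUNT, PERCENT lines."""
    return "\n".join(f"{p}\t{c}\t{pct:.2f}" for p, c, pct in rows)
```

Running `count_phrases("test test test test test test", max_len=3)` reproduces the example from note 4: the 3-word phrase "test test test" is counted 4 times, and its PERCENT works out to 200.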
     

Please submit bug reports, feature suggestions, and other feedback in the Discussion Forum.

 
