Computer lessons

Removing HTML tags from a string in PHP. Cleaning text of unnecessary HTML tags - Parsing from A to Z

When you receive data from users (for example, from an input field), it makes sense to process that data and keep only plain text as output.

I’ll tell you now how this can be done in different ways.

How to remove all HTML tags from a string in PHP?

There is a function in PHP called "strip_tags". It allows you to quickly and easily remove all HTML tags from a variable.

Implementation:

In this case, we keep the tags passed in the second argument. Closing tags do not need to be specified when listing the tags to keep.
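A minimal sketch of both calls. The input string and the choice of <p> and <a> as the tags to keep are assumptions made purely for illustration:

```php
<?php
// Hypothetical input; the tags <p>, <a> and <b> are chosen only for illustration.
$html = '<p>Hello, <a href="/about">world</a>! <b>Bold</b> text.</p>';

// Remove every tag.
$plain = strip_tags($html);

// Keep <p> and <a>. Only the opening tags are listed;
// the matching closing tags are kept automatically.
$partial = strip_tags($html, '<p><a>');

echo $plain, "\n";
echo $partial, "\n";
```

The first call leaves only text, while the second keeps the paragraph and the link and drops <b>.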

Please note that the function does not check the HTML markup for validity, and if there are unclosed tags, then you risk losing plain text.
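A short sketch of that risk, using a made-up string with an unclosed tag:

```php
<?php
// Hypothetical input: "<b" is never closed with ">".
$broken = 'Hello <b world, this text is lost';

// strip_tags() treats everything after "<b" as part of the tag.
$result = strip_tags($broken);

var_dump($result);
```

Here everything after the unclosed "<b" disappears along with the tag, so only the leading "Hello " survives.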

How to remove all HTML tags from a string in JavaScript?

We will write our own small function in JavaScript, with which we will subsequently process the received data.

Implementation:

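function strip(html) {
    // Let the browser parse the markup inside a detached element
    var tmp = document.createElement("div");
    tmp.innerHTML = html;
    // textContent is the standard property; innerText is a fallback for old browsers
    return tmp.textContent || tmp.innerText;
}

var content = strip("Hello, world!");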

This example works on a hard-coded string, but you can adapt it to process received content, for example, the value of an input field.


strip_tags

(PHP 3 >= 3.0.8, PHP 4, PHP 5)

strip_tags - Removes HTML and PHP tags from a string

Description

string strip_tags (string str [, string allowable_tags])

This function returns the string str with HTML and PHP tags removed. To remove tags, an automaton similar to that used in the fgetss() function is used.

An optional second argument can be used to specify tags that should not be removed.

Note: The allowable_tags argument was added in PHP 3.0.13 and PHP 4.0b3. As of PHP 4.3.0, HTML comments are also removed.

Attention

Since strip_tags() does not check the correctness of the HTML code, incomplete tags can lead to the removal of text that is not part of the tags.

Example 1. Example of using strip_tags()

$text = "<p>Paragraph.</p> A little more text";
echo strip_tags($text);
echo "\n\n-------\n";

// Do not remove <p>
echo strip_tags($text, "<p>");

This example will output:

Paragraph. A little more text

-------
<p>Paragraph.</p> A little more text

Attention

This function does not change the attributes of tags specified in the allowable_tags argument, including style and onmouseover.
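A sketch of what this means in practice. The input is made up, and the attribute-stripping regular expression is only a naive illustration (it is an assumption, not part of strip_tags(), and it breaks on attribute values containing ">"):

```php
<?php
// Hypothetical input with inline style and an event handler.
$html = '<p style="color:red" onmouseover="alert(1)">Hi</p><b>there</b>';

// <p> is allowed, so it is kept together with all of its attributes.
$kept = strip_tags($html, '<p>');

// Naive follow-up step: drop attributes from the opening tags that remain.
$clean = preg_replace('/<(\w+)[^>]*>/', '<$1>', $kept);

echo $kept, "\n";
echo $clean, "\n";
```

The allowed <p> tag comes through with its onmouseover handler intact, so allowing a tag is not the same as making it safe.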

As of PHP 5.0.0, strip_tags() is binary-safe.

This function has a significant drawback: it glues words together when removing tags. In addition, the function has vulnerabilities. An alternative function similar to strip_tags can be used instead.
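The gluing effect, and one possible workaround, can be sketched as follows. The helper name strip_tags_spaced is hypothetical and is not the alternative function referred to above:

```php
<?php
// Hypothetical input: two adjacent paragraphs with no whitespace between them.
$html = '<p>First paragraph</p><p>Second paragraph</p>';

// Plain strip_tags() joins the last and first words of the two paragraphs.
$glued = strip_tags($html);

function strip_tags_spaced(string $html): string
{
    // Put a space in front of every tag so neighbouring words stay apart.
    $spaced = str_replace('<', ' <', $html);
    $text = strip_tags($spaced);

    // Collapse the extra whitespace introduced above.
    return trim(preg_replace('/\s+/', ' ', $text));
}

echo $glued, "\n";
echo strip_tags_spaced($html), "\n";
```

The trade-off is that inline markup inside a word (such as a single bold letter) will also be split by a space.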


Absolutely everyone faces the task of cleaning HTML from unnecessary tags.

The first thing that comes to mind is to use the PHP function strip_tags():
string strip_tags (string str [, string allowable_tags])

The function returns a string stripped of tags. Tags that should not be removed are passed in the allowable_tags argument. The function works but, to put it mildly, it is not ideal: it does not check the validity of the markup, which may lead to deleting text that is not part of any tag.
Enterprising developers did not sit idly by, and improved functions can be found online. A good example is strip_tags_smart.

To use or not to use ready-made solutions is the personal choice of the programmer. It so happens that most often I do not need a “universal” handler and it is more convenient to clean up the code with regular expressions.

What determines the choice of one processing method or another?

1. The source material and the complexity of its analysis.
If you need to process fairly simple HTML texts, without any fancy layout, clear as day :), then you can use standard functions.
If the texts have certain features that need to be taken into account, then special handlers are written. Some may simply use str_replace. For example:

$s = array(
    "â€™" => "’", // Right apostrophe (e.g. in I'm)
    "â€œ" => "“", // Opening speech mark
    "â€“" => "—", // Long dash
    "â€" => "”", // Closing speech mark
    "Ã©" => "é", // e acute accent
    chr(226) . chr(128) . chr(153) => "’", // Right apostrophe again
    chr(226) . chr(128) . chr(147) => "—", // Long dash again
    chr(226) . chr(128) . chr(156) => "“", // Opening speech mark
    chr(226) . chr(128) . chr(148) => "—", // M dash again
    chr(226) . chr(128) => "”", // Right speech mark
    chr(195) . chr(169) => "é", // e acute again
);

// Order matters: the longer sequences must be replaced before their shorter prefixes.
foreach ($s as $needle => $replace)
{
    $htmlText = str_replace($needle, $replace, $htmlText);
}

Others may be based on regular expressions. As an example:

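function getTextFromHTML($htmlText)
{
    $search = array(
        "'<script[^>]*?>.*?</script>'si", // Remove JavaScript
        "'<style[^>]*?>.*?</style>'si",   // Remove styles
        "'<xml[^>]*?>.*?</xml>'si",       // Remove xml tags
        "'<[\/\!]*?[^<>]*?>'si",          // Remove HTML tags
        "'([\r\n])[\s]+'",                // Remove whitespace after a line break
        "'&(quot|#34);'i",                // Replace HTML special chars
        "'&(amp|#38);'i",
        "'&(lt|#60);'i",
        "'&(gt|#62);'i",
        "'&(nbsp|#160);'i",
        "'&(iexcl|#161);'i",
        "'&(cent|#162);'i",
        "'&(pound|#163);'i",
        "'&(copy|#169);'i",
    );

    // chr(161)..chr(169) produce single-byte (ISO-8859-1) characters.
    $replace = array("", "", "", "", "\\1", "\"", "&", "<", ">", " ",
        chr(161), chr(162), chr(163), chr(169));

    $text = preg_replace($search, $replace, $htmlText);

    // The /e modifier was removed in PHP 7, so the remaining numeric
    // entities are converted with a callback instead of evaluated code.
    return preg_replace_callback("'&#(\d+);'", function ($m) {
        return chr((int) $m[1]);
    }, $text);
}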
(At moments like this, the ability of preg_replace to take arrays as parameters is especially welcome.) If necessary, extend the arrays with your own regular expressions. A regular expression constructor can help you compose them. Beginning developers may find the article "All about HTML tags. 9 Regular Expressions to strip HTML tags" useful: look at the examples there and analyze the logic.

2. The volume.
Volume is directly related to the complexity of the analysis (see the previous point). A large number of texts increases the likelihood that you will miss something while trying to anticipate and clean everything with regular expressions. In this case, a "multi-stage" cleaning method is suitable. That is, clean the text first, for example, with the strip_tags_smart function (keeping the source, just in case). Then selectively review a certain number of texts to identify "anomalies". Finally, clean up the anomalies with regular expressions.

3. The desired result.
The processing algorithm can be simplified in different ways depending on the situation. The case I described in one of my previous articles demonstrates this well. Let me remind you that the text there was in a div which also contained a div with "breadcrumbs", an Adsense advertisement, and a list of similar articles. When analyzing a sample of articles, it turned out that the articles did not contain pictures and were simply divided into paragraphs using <p> tags. Instead of cleaning the "main" div of extraneous things, you can find all the paragraphs (with Simple HTML DOM Parser this is very easy) and join their contents. So before you write regular expressions for cleaning, see if you can get by with less effort.
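The paragraph-collecting idea can be sketched as follows. Note the substitution: this sketch uses PHP's built-in DOMDocument rather than Simple HTML DOM Parser, and the function name paragraphsText and the sample markup are assumptions for illustration only:

```php
<?php
function paragraphsText(string $html): string
{
    $doc = new DOMDocument();

    // Real-world markup is rarely valid; collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    // The XML declaration prefix is a common way to make loadHTML() read UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Take the text of every <p> and ignore everything else in the document.
    $parts = [];
    foreach ($doc->getElementsByTagName('p') as $p) {
        $parts[] = trim($p->textContent);
    }

    return implode("\n\n", $parts);
}

// Hypothetical page fragment: breadcrumbs, an advert and a related list around two paragraphs.
$html = '<div><div class="crumbs">Home &gt; Articles</div>'
      . '<p>First paragraph.</p><ins>advert</ins><p>Second paragraph.</p>'
      . '<ul><li>Similar article</li></ul></div>';

echo paragraphsText($html), "\n";
```

Only the paragraph text is returned, so the breadcrumbs, the advert and the list never need to be cleaned out.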

In general, real firefights flare up on the Internet between supporters of parsing HTML purely with regular expressions and supporters of parsing based on the DOM structure of a document, for example on Stack Overflow.

Validating and processing incoming data is one of the most common programming tasks. PHP is usually used for web applications, so the most important thing here is to remove HTML tags from the text, because they are the most susceptible to third-party injections. In this article, I want to remind you about good old strip_tags() and its features, as well as offer solutions for removing sectional HTML tags and a couple more useful bonuses to go along with it.

So, our main tool for removing HTML tags from text is the strip_tags() function. We pass it a string value, and it removes HTML and PHP tags from it, for example:

$s = "<p>Paragraph.</p> More text.";
echo strip_tags($s);

This example will output the line:

Paragraph. More text.

It is noteworthy here that the function also has a second (optional, but useful) parameter, the value of which is a string with a list of allowed HTML tags, for example:

$s = "<p>Paragraph.</p> More text.";
echo strip_tags($s, "<p>");

This example will output the line:

<p>Paragraph.</p> More text.

In my opinion, it is very convenient. However, this does not solve one important problem: removing sectional HTML tags together with their contents, for example script, noscript and style, which are the most common. When I need to remove such sectional tags, as well as anything that starts with "<" and ends with ">", I use the following PHP code:

$p = array(
    "'<script[^>]*?>.*?</script>'si",
    "'<noscript[^>]*?>.*?</noscript>'si",
    "'<style[^>]*?>.*?</style>'si",
    "'<[\/\!]*?[^<>]*?>'si",
);
$r = array(" ", " ", " ", " ");
$s = preg_replace($p, $r, $s);

Here the variable $p contains an array of regular expressions, and $r is an array of their corresponding replacements (I use spaces). All that remains is to make a replacement in the line, and we will remove HTML garbage from the text.

Obviously, the two above solutions can be combined. At the beginning I use replacement through regular expressions, and then strip_tags() and I get my own nohtml() function.
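A minimal sketch of such a combined function. The name nohtml() comes from the text above, while the body is an assumption about how the two steps might be put together, not the author's actual code:

```php
<?php
function nohtml(string $s): string
{
    // Step 1: remove sectional tags together with their contents.
    $patterns = [
        '~<script[^>]*?>.*?</script>~si',
        '~<noscript[^>]*?>.*?</noscript>~si',
        '~<style[^>]*?>.*?</style>~si',
    ];
    $s = preg_replace($patterns, ' ', $s);

    // Step 2: remove whatever tags are left.
    return strip_tags($s);
}

// Hypothetical input for illustration.
$dirty = '<p>Text</p><script>alert("x");</script><style>p{color:red}</style> end';
echo nohtml($dirty), "\n";
```

Running the regular expressions first matters: strip_tags() alone would remove the script and style tags but leave their contents in the text.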

Finally, I want to offer you a few more useful solutions. It is better to replace tabs in the text with spaces: the browser renders both identically, and there will be less hassle, for example:

$s = str_replace("\t", " ", $s);

If you don't need line breaks, they can also be replaced with spaces, for example:

$s = str_replace(array("\n", "\r"), " ", $s);

You can get rid of extra spaces using a simple regular expression, for example:

$s = preg_replace("/\s+/", " ", $s);
$s = trim($s); // will not be superfluous

That's all I have. Thank you for your attention. Good luck!
