RICKdenHAAN.net


Filtering out part of some text

This is a very basic tutorial on how to take a piece of text, and remove part of it. We'll be using two functions to do this: strpos() and substr().

For this tutorial, we'll be filtering an unordered list out of an HTML page. Here's the text:

HTML:
  <p>This is just a demo paragraph</p>
  <ul>
    <li>Skip this list item</li>
    <li>Skip this as well</li>
  </ul>
  <p>And here's another demo paragraph</p>

What we want to have when this is over, is this:

HTML:
  <p>This is just a demo paragraph</p>
  <p>And here's another demo paragraph</p>

Okay, great. Now that the mission is clear, let's take a look at what steps we need to take to make this happen.

  1. First of all, we'll need to store the original text in a variable, so we can work with it.
  2. Then, we need to find out where the list starts in the text.
  3. We also need to know where the list ends.
  4. Then we need to take the beginning of the text, up to the starting point of the list, and append the end of the text to that, starting from the end of the list.

All right. We won't handle opening an HTML file and reading its contents; we'll assume the starting code is a fixed text in our PHP script, or entered by the user somewhere. We'll store it in $original:

PHP:
  $original = "<p>This is just a demo paragraph</p><ul><li>Skip this list item</li><li>Skip this as well</li></ul><p>And here's another demo paragraph</p>";

To find the starting point of the <ul>, and the ending point of the </ul>, we'll use the strpos() function. Why is it called strpos()? Because a variable containing a piece of text is called a "string", and this function gets the position of a character or series of characters (also called a "substring") within a string.

To get the information we need, we have to give the strpos()-function two bits of information: the string it needs to search in, and the character or substring to search for. In that order.

NOTE: This function is case-sensitive. That means that if you're searching for "<ul>" in lowercase, but the HTML file contains "<UL>" in uppercase, it won't find it. If necessary, you can use stripos(), which does exactly the same thing as strpos() but is case-insensitive.
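
To see the difference between the two functions, here's a small sketch. The $html string is made up for this example and is not part of the tutorial's text:

```php
<?php
// Hypothetical sample: the list tag is written in uppercase here.
$html = "<p>Intro</p><UL><li>Item</li></UL>";

// strpos() is case-sensitive, so the lowercase search finds nothing
// and returns false.
$caseSensitive = strpos($html, "<ul>");
var_dump($caseSensitive);   // bool(false)

// stripos() ignores case and reports the position of "<UL>".
$caseInsensitive = stripos($html, "<ul>");
var_dump($caseInsensitive); // int(12)
```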

So, to get the position in our HTML string where the list starts, we need to do this:

PHP:
  $listStartingPoint = strpos($original, "<ul>");

Remember that $original contains the HTML code? Right.
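
One thing worth knowing: when strpos() can't find what you asked for, it returns false instead of a position. The sketch below assumes a made-up piece of text without any list in it, and shows one way to check for that. Note the three equals signs: a match at the very beginning of the string is position 0, and a loose comparison would mistake that 0 for "not found".

```php
<?php
// Hypothetical input that contains no list at all.
$original = "<p>No list in this text</p>";

$listStartingPoint = strpos($original, "<ul>");

if ($listStartingPoint === false) {
    // Nothing to remove, so keep the text as it is.
    $result = $original;
} else {
    $result = substr($original, 0, $listStartingPoint);
}

echo $result, "\n";

// Why "===" matters: a match at the very start is position 0.
$atStart = strpos("<ul><li>First</li></ul>", "<ul>");
var_dump($atStart === false); // bool(false): found, at position 0
var_dump($atStart == false);  // bool(true): the loose check treats 0 as false
```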

Now let's get the position in our HTML string where the list ends. This works the exact same way:

PHP:
  $listEndingPoint = strpos($original, "</ul>");

Now there is a small problem with our $listEndingPoint. And that problem is, it now contains the position in our HTML string, where the text "</ul>" begins. What we want to know, is where "</ul>" ends. Fortunately, we know that "</ul>" is five characters, so we can add those to the result of our strpos():

PHP:
  $listEndingPoint = $listEndingPoint + 5;
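
If you'd rather not count characters by hand, you could let PHP do it with strlen(). This is just a sketch of that alternative; the shortened $original and the $closingTag variable are made up for the example, and it assumes the closing tag really is in the text:

```php
<?php
// Shortened, hypothetical text for this example.
$original = "<p>Before</p><ul><li>Item</li></ul><p>After</p>";
$closingTag = "</ul>";

// Position where "</ul>" begins, plus the length of "</ul>" itself.
$listEndingPoint = strpos($original, $closingTag) + strlen($closingTag);

// Everything from the end of the list onwards.
$afterList = substr($original, $listEndingPoint);
echo $afterList, "\n"; // <p>After</p>
```

The nice thing about this variant is that it keeps working when you search for a different closing tag, such as "</ol>" or "</table>".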

Now that we know where the list begins and ends, we can use the substr()-function to get the text around it. I said earlier, that a bit of text is called a string. A string inside a string is called a substring. The substr()-function returns a substring, which is where it gets its name.

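First, we want to get the substring from the beginning of the text, up to the starting point of the list. To do this, we need to give the substr()-function three bits of information: the string to take a bit out of, the position where the substring starts, and the length of the substring, that is, how many characters to take. We'll be starting at the very beginning of the string, so the length happens to be the same number as the position where the list starts. This last bit of information is optional, as we'll see later.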

Remember one very important thing about PHP (and most other programming languages): counting starts at zero. We humans have a tendency to start counting at one, so this takes some getting used to. The first character of a string is at position 0, the second at position 1, and so on.

The reason I'm telling you this, is because we need our substring to start at the beginning, or first character, of the original string. Because PHP starts counting at zero, that first character has a position of 0 in the original string.

PHP:
  $beforeList = substr($original, 0, $listStartingPoint);

$beforeList now contains the substring of our original HTML snippet, from the first character, up to the point where the list starts.
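
To get a feel for how substr() counts, here's a tiny sketch on a made-up six-letter string. The second argument is the (zero-based) starting position and the third is the number of characters to take:

```php
<?php
// Hypothetical short string, so the positions are easy to follow:
// a=0, b=1, c=2, d=3, e=4, f=5
$text = "abcdef";

$firstThree = substr($text, 0, 3); // start at position 0, take 3 characters
$middle     = substr($text, 2, 3); // start at position 2, take 3 characters
$toTheEnd   = substr($text, 2);    // no length given: run to the end

echo $firstThree, "\n"; // abc
echo $middle, "\n";     // cde
echo $toTheEnd, "\n";   // cdef
```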

To get the part of our HTML code from the end of the list to the end of the string, we need to tell the substr()-function not to stop. We do this simply by not telling it where to stop. So:

PHP:
  $afterList = substr($original, $listEndingPoint);

$afterList now contains the substring of $original, starting at the point where the list ends, up to the end of $original.

As a final step, we need to combine our $beforeList and $afterList substrings, so that we have one string containing everything we want. That couldn't be simpler: in PHP, the period is the operator that glues two strings together. Just place one between them:

PHP:
  $withoutList = $beforeList . $afterList;

$withoutList now contains "<p>This is just a demo paragraph</p><p>And here's another demo paragraph</p>", which is exactly what we wanted.
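
If you need this more than once, the steps could be wrapped in a function. The sketch below is only one possible way to do it: removeBetween() is a made-up name, the type declarations assume PHP 7 or newer, and it only removes the first match. It also returns the text untouched when either tag is missing.

```php
<?php
/**
 * Hypothetical helper: remove the first occurrence of everything from
 * $startTag up to and including $endTag. Returns the text unchanged
 * when either tag is missing.
 */
function removeBetween(string $text, string $startTag, string $endTag): string
{
    $start = strpos($text, $startTag);
    if ($start === false) {
        return $text;
    }

    // Look for the end tag only after the start tag.
    $end = strpos($text, $endTag, $start + strlen($startTag));
    if ($end === false) {
        return $text;
    }

    // Move from the beginning of the end tag to just past it.
    $end += strlen($endTag);

    // Glue the part before the match to the part after it.
    return substr($text, 0, $start) . substr($text, $end);
}

$original = "<p>This is just a demo paragraph</p><ul><li>Skip this list item</li><li>Skip this as well</li></ul><p>And here's another demo paragraph</p>";

echo removeBetween($original, "<ul>", "</ul>"), "\n";
```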

But what if we wanted the exact opposite? What if we wanted to keep the list, and remove everything else? Well, we'd still need the exact same information, that is, the starting point and ending point of the list. But we'd use a different substring. Keep in mind that the third bit of information substr() takes is a length, not a position, so we have to subtract the starting point from the ending point to get the number of characters in the list:

PHP:
  $listOnly = substr($original, $listStartingPoint, $listEndingPoint - $listStartingPoint);

$listOnly now contains the substring of $original, from the starting point of the list, to the ending point of the list.
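
A quick way to convince yourself that the third argument has to be a length: the sketch below runs substr() both ways on our text. The $listLength and $tooMuch variables are made up for this comparison.

```php
<?php
$original = "<p>This is just a demo paragraph</p><ul><li>Skip this list item</li><li>Skip this as well</li></ul><p>And here's another demo paragraph</p>";

$listStartingPoint = strpos($original, "<ul>");
$listEndingPoint = strpos($original, "</ul>") + 5;

// Length of the list = ending position minus starting position.
$listLength = $listEndingPoint - $listStartingPoint;

// Correct: take exactly as many characters as the list is long.
$listOnly = substr($original, $listStartingPoint, $listLength);
echo $listOnly, "\n";

// Not what we want: using the ending POSITION as the length takes too
// many characters and runs on into the paragraph after the list.
$tooMuch = substr($original, $listStartingPoint, $listEndingPoint);
echo $tooMuch, "\n";
```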

So, to top it off, here are both scripts in their entirety:

PHP:
  // Get the original piece of text we need to remove a part from
  $original = "<p>This is just a demo paragraph</p><ul><li>Skip this list item</li><li>Skip this as well</li></ul><p>And here's another demo paragraph</p>";

  // Determine the position in the original string, where our list starts
  $listStartingPoint = strpos($original, "<ul>");

  // Determine the position in the original string, where our list ends
  $listEndingPoint = strpos($original, "</ul>");

  // This is now the STARTING POINT of </ul>, but we want to know
  // the ENDING POINT of </ul>, which is five characters further on
  $listEndingPoint = $listEndingPoint + 5;

  // Get the substring from the beginning of the original string to the
  // start of our list
  $beforeList = substr($original, 0, $listStartingPoint);

  // Get the substring from the end of our list, to the end of the
  // original string
  $afterList = substr($original, $listEndingPoint);

  // Combine the two substrings from before and after the list
  $withoutList = $beforeList . $afterList;

And to keep only the list:

PHP:
  // Get the original piece of text we need to take the list out of
  $original = "<p>This is just a demo paragraph</p><ul><li>Skip this list item</li><li>Skip this as well</li></ul><p>And here's another demo paragraph</p>";

  // Determine the position in the original string, where our list starts
  $listStartingPoint = strpos($original, "<ul>");

  // Determine the position in the original string, where our list ends
  $listEndingPoint = strpos($original, "</ul>");

  // This is now the STARTING POINT of </ul>, but we want to know
  // the ENDING POINT of </ul>, which is five characters further on
  $listEndingPoint = $listEndingPoint + 5;

  // Get the substring from the beginning of the list to the end of it.
  // The third argument is a LENGTH, so subtract the starting point.
  $listOnly = substr($original, $listStartingPoint, $listEndingPoint - $listStartingPoint);