Lesson 4 - StringUtils library for working with texts in PHP
In the previous lesson, Finishing DateUtils library for PHP, we finished the DateUtils library, which helps us format and parse regional date and time. In today's tutorial, we'll program a StringUtils library.
Motivation
PHP provides a large variety of string functions. However, a few functions that would help us greatly are missing. If we skipped making this library, we'd have to implement them over and over again.
StringUtils
We'll add several functions that PHP is missing into the StringUtils library. The class will be auxiliary, meaning that it won't have an inner state. Also, we'll use its methods often and from different places in our applications. Therefore, we'll make all of its methods static.
class StringUtils
{
}
You can put the class in the Utility namespace if you'd like. As you'd expect, the implementation will rely on the PHP functions with the "mb_" prefix, which support Unicode.
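To see why the "mb_" variants matter, here is a small sketch that compares the byte-oriented strlen() with mb_strlen() on a Czech word. It assumes the mbstring extension is enabled. The sample string is written with \u{...} escapes (PHP 7.0+) only so that the snippet doesn't depend on how the file itself is saved.

```php
<?php
// Assumes the mbstring extension is available.
mb_internal_encoding('UTF-8');

// "Příliš" written with Unicode escapes
$word = "P\u{0159}\u{00ED}li\u{0161}";

// strlen() counts bytes: ř, í and š take two bytes each in UTF-8
var_dump(strlen($word));     // int(9)

// mb_strlen() counts characters
var_dump(mb_strlen($word));  // int(6)
```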
startsWith() and endsWith()
We often need to determine whether a string starts or ends with a particular substring, for example when we're parsing items and want to process them based on their names. endsWith() is especially useful in an autoloader, which can choose the folder to load a class from based on what the class name ends with.
public static function startsWith($haystack, $needle)
{
    return (mb_strpos($haystack, $needle) === 0);
}

public static function endsWith($haystack, $needle)
{
    return ((mb_strlen($haystack) >= mb_strlen($needle)) && ((mb_strpos($haystack, $needle, mb_strlen($haystack) - mb_strlen($needle))) !== false));
}
Both methods accept the main text as the first parameter and the substring as the second one. You'll often encounter the terms "haystack" and "needle" when it comes to searching in programming. I assume most of you get why.
The startsWith() method is straightforward. We have to use the triple equal sign (===), since the mb_strpos() function returns false when it doesn't find the substring and 0 when it finds the substring at position 0. With a loose comparison (==), false would be treated as equal to 0.
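The difference between the two comparisons is easy to check. The following sketch, which assumes the mbstring extension is enabled, shows the values mb_strpos() returns and how === and == treat them:

```php
<?php
// Assumes the mbstring extension is available.
mb_internal_encoding('UTF-8');

$found   = mb_strpos('Quarter classification', 'Quarter');   // int(0)
$missing = mb_strpos('Semester classification', 'Quarter');  // bool(false)

var_dump($found === 0);    // bool(true)
var_dump($missing === 0);  // bool(false)
var_dump($missing == 0);   // bool(true): loose comparison treats false as 0
```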
The endsWith() method is more interesting. The haystack has to be at least as long as the needle; otherwise, there's no point in searching. Then, the mb_strpos() function searches the haystack starting from the only position where the needle could sit at the very end (haystack length - needle length), since we're not interested in earlier occurrences.
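If your project targets PHP 8.0 or newer, the language ships with str_starts_with() and str_ends_with(). They compare bytes rather than characters, but for well-formed UTF-8 strings a prefix or suffix check should give the same answer either way. A minimal sketch, assuming PHP 8.0+:

```php
<?php
// Assumes PHP 8.0+, where both functions are built in.
var_dump(str_starts_with('Quarter classification', 'Quarter'));   // bool(true)
var_dump(str_starts_with('Semester classification', 'Quarter'));  // bool(false)
var_dump(str_ends_with('ArticleController', 'Controller'));       // bool(true)
var_dump(str_ends_with('ArticleHelperController', 'Helper'));     // bool(false)
```

On older PHP versions, the hand-written methods remain the way to go.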
Let's try the functions out. Don't forget to set the internal encoding to UTF-8:
require_once('models/System/Utility/StringUtils.php');
mb_internal_encoding('utf-8');
var_dump(StringUtils::startsWith('Quarter classification', 'Quarter'));
var_dump(StringUtils::startsWith('Semester classification', 'Quarter'));
var_dump(StringUtils::startsWith('Semester and quarter classification', 'Quarter'));
var_dump(StringUtils::endsWith('ArticleController', 'Controller'));
var_dump(StringUtils::endsWith('ArticleHelperController', 'Helper'));
The result:
bool(true)
bool(false)
bool(false)
bool(true)
bool(false)
capitalize() and uncapitalize()
These two methods convert the first character of a string to uppercase or lowercase. The capitalize() method is useful when printing data into a table, for example: we want the user to see the values capitalized while we keep working with them in lowercase internally. Conversely, the uncapitalize() method converts the first character to lowercase. We'll also use both methods internally in the library.
Note: I'm not 100% sure about the term "uncapitalize"; however, I think it's better than naming it something like lowerFirst().
public static function capitalize($text)
{
    return mb_strtoupper(mb_substr($text, 0, 1)) . mb_substr($text, 1, mb_strlen($text));
}

public static function uncapitalize($text)
{
    return mb_strtolower(mb_substr($text, 0, 1)) . mb_substr($text, 1, mb_strlen($text));
}
The code is trivial. Let's go ahead and try it out:
echo(StringUtils::capitalize('the quick brown fox') . '<br />');
echo(StringUtils::uncapitalize('The quick brown fox') . '<br />');
The result:
The quick brown fox
the quick brown fox
Note: You can try the capitalization methods out on accented characters as well. For example, á will become Á and ř will become Ř. This wouldn't work if it weren't for the mb_* functions.
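Here is a small sketch of that accent behavior. It repeats the capitalization logic as a standalone function (the name capitalizeFirst() is hypothetical, chosen only so the snippet runs on its own) and assumes the mbstring extension is enabled. PHP's built-in ucfirst() only handles single-byte characters, so it can't be relied on for a first letter such as ř.

```php
<?php
// Assumes the mbstring extension is available.
mb_internal_encoding('UTF-8');

// Hypothetical standalone copy of the capitalization logic
function capitalizeFirst($text)
{
    return mb_strtoupper(mb_substr($text, 0, 1)) . mb_substr($text, 1);
}

$word = "\u{0159}eka"; // "řeka", Czech for "river"

var_dump(capitalizeFirst($word) === "\u{0158}eka"); // bool(true): ř became Ř
var_dump(capitalizeFirst('the quick brown fox'));   // string(19) "The quick brown fox"
```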
Shortening text
We often print labels for items in e-shops and need to specify a maximum length for the label, since very long texts won't fit in the template. If the text is longer than the specified maximum length, we'll truncate it and append three dots to indicate that something is missing. If it isn't, we'll leave it unchanged. The maximum length includes said three dots.
public static function shorten($text, $length)
{
    // The limit includes the three dots, so we truncate whenever the text exceeds it
    if (mb_strlen($text) > $length)
        $text = mb_substr($text, 0, $length - 3) . '...';
    return $text;
}
In some frameworks, you may find this method labeled as truncate(). Go ahead and try it out:
echo(StringUtils::shorten('Notebook - Intel Pentium 2020M Ivy Bridge, 15.6" LED 1366x768 glossy, RAM 4GB, Intel HD Graphics, HDD 500GB 5400 rpm, DVD, WiFi, Bluetooth, Webcam, USB 3.0, Windows 8 64-bit', 71) . '<br />');
echo(StringUtils::shorten('Notebook - Intel Pentium 2020M', 71) . '<br />');
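Check the result to verify that everything is working as expected:
Notebook - Intel Pentium 2020M Ivy Bridge, 15.6" LED 1366x768 glossy...
Notebook - Intel Pentium 2020M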
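The boundary cases are easy to get wrong, so here is a small self-contained sketch of the rule as stated in the text: truncate only when the text is longer than the limit, and count the three dots toward that limit. The name shortenTo() is hypothetical, and the snippet assumes the mbstring extension is enabled.

```php
<?php
// Assumes the mbstring extension is available.
mb_internal_encoding('UTF-8');

// Hypothetical helper: the returned string, dots included,
// is never longer than $length characters.
function shortenTo($text, $length)
{
    if (mb_strlen($text) > $length) {
        $text = mb_substr($text, 0, $length - 3) . '...';
    }
    return $text;
}

var_dump(shortenTo('abcdefghij', 10));   // string(10) "abcdefghij" (fits exactly)
var_dump(shortenTo('abcdefghijk', 10));  // string(10) "abcdefg..." (one over the limit)
var_dump(shortenTo('abc', 10));          // string(3) "abc"
```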
Removing accent characters
We'll end today's lesson with a method that removes accents. We'll need it when we're generating an email address based on someone's name, building the URL address for an article, or checking whether a user entered a password without accented characters. In that last case, we compare the entered password with its accent-free version; if the two differ, the password contains accented characters.
Admittedly, I googled the method shown below. Let's take a look at it before we go over how it works:
public static function removeAccents($text)
{
    $chars = array(
        // Decompositions for Latin-1 Supplement
        chr(195).chr(128) => 'A', chr(195).chr(129) => 'A', chr(195).chr(130) => 'A', chr(195).chr(131) => 'A',
        chr(195).chr(132) => 'A', chr(195).chr(133) => 'A', chr(195).chr(135) => 'C', chr(195).chr(136) => 'E',
        chr(195).chr(137) => 'E', chr(195).chr(138) => 'E', chr(195).chr(139) => 'E', chr(195).chr(140) => 'I',
        chr(195).chr(141) => 'I', chr(195).chr(142) => 'I', chr(195).chr(143) => 'I', chr(195).chr(145) => 'N',
        chr(195).chr(146) => 'O', chr(195).chr(147) => 'O', chr(195).chr(148) => 'O', chr(195).chr(149) => 'O',
        chr(195).chr(150) => 'O', chr(195).chr(153) => 'U', chr(195).chr(154) => 'U', chr(195).chr(155) => 'U',
        chr(195).chr(156) => 'U', chr(195).chr(157) => 'Y', chr(195).chr(159) => 's', chr(195).chr(160) => 'a',
        chr(195).chr(161) => 'a', chr(195).chr(162) => 'a', chr(195).chr(163) => 'a', chr(195).chr(164) => 'a',
        chr(195).chr(165) => 'a', chr(195).chr(167) => 'c', chr(195).chr(168) => 'e', chr(195).chr(169) => 'e',
        chr(195).chr(170) => 'e', chr(195).chr(171) => 'e', chr(195).chr(172) => 'i', chr(195).chr(173) => 'i',
        chr(195).chr(174) => 'i', chr(195).chr(175) => 'i', chr(195).chr(177) => 'n', chr(195).chr(178) => 'o',
        chr(195).chr(179) => 'o', chr(195).chr(180) => 'o', chr(195).chr(181) => 'o', chr(195).chr(182) => 'o',
        chr(195).chr(185) => 'u', chr(195).chr(186) => 'u', chr(195).chr(187) => 'u', chr(195).chr(188) => 'u',
        chr(195).chr(189) => 'y', chr(195).chr(191) => 'y',
        // Decompositions for Latin Extended-A
        chr(196).chr(128) => 'A', chr(196).chr(129) => 'a', chr(196).chr(130) => 'A', chr(196).chr(131) => 'a',
        chr(196).chr(132) => 'A', chr(196).chr(133) => 'a', chr(196).chr(134) => 'C', chr(196).chr(135) => 'c',
        chr(196).chr(136) => 'C', chr(196).chr(137) => 'c', chr(196).chr(138) => 'C', chr(196).chr(139) => 'c',
        chr(196).chr(140) => 'C', chr(196).chr(141) => 'c', chr(196).chr(142) => 'D', chr(196).chr(143) => 'd',
        chr(196).chr(144) => 'D', chr(196).chr(145) => 'd', chr(196).chr(146) => 'E', chr(196).chr(147) => 'e',
        chr(196).chr(148) => 'E', chr(196).chr(149) => 'e', chr(196).chr(150) => 'E', chr(196).chr(151) => 'e',
        chr(196).chr(152) => 'E', chr(196).chr(153) => 'e', chr(196).chr(154) => 'E', chr(196).chr(155) => 'e',
        chr(196).chr(156) => 'G', chr(196).chr(157) => 'g', chr(196).chr(158) => 'G', chr(196).chr(159) => 'g',
        chr(196).chr(160) => 'G', chr(196).chr(161) => 'g', chr(196).chr(162) => 'G', chr(196).chr(163) => 'g',
        chr(196).chr(164) => 'H', chr(196).chr(165) => 'h', chr(196).chr(166) => 'H', chr(196).chr(167) => 'h',
        chr(196).chr(168) => 'I', chr(196).chr(169) => 'i', chr(196).chr(170) => 'I', chr(196).chr(171) => 'i',
        chr(196).chr(172) => 'I', chr(196).chr(173) => 'i', chr(196).chr(174) => 'I', chr(196).chr(175) => 'i',
        chr(196).chr(176) => 'I', chr(196).chr(177) => 'i', chr(196).chr(178) => 'IJ', chr(196).chr(179) => 'ij',
        chr(196).chr(180) => 'J', chr(196).chr(181) => 'j', chr(196).chr(182) => 'K', chr(196).chr(183) => 'k',
        chr(196).chr(184) => 'k', chr(196).chr(185) => 'L', chr(196).chr(186) => 'l', chr(196).chr(187) => 'L',
        chr(196).chr(188) => 'l', chr(196).chr(189) => 'L', chr(196).chr(190) => 'l', chr(196).chr(191) => 'L',
        chr(197).chr(128) => 'l', chr(197).chr(129) => 'L', chr(197).chr(130) => 'l', chr(197).chr(131) => 'N',
        chr(197).chr(132) => 'n', chr(197).chr(133) => 'N', chr(197).chr(134) => 'n', chr(197).chr(135) => 'N',
        chr(197).chr(136) => 'n', chr(197).chr(137) => 'N', chr(197).chr(138) => 'n', chr(197).chr(139) => 'N',
        chr(197).chr(140) => 'O', chr(197).chr(141) => 'o', chr(197).chr(142) => 'O', chr(197).chr(143) => 'o',
        chr(197).chr(144) => 'O', chr(197).chr(145) => 'o', chr(197).chr(146) => 'OE', chr(197).chr(147) => 'oe',
        chr(197).chr(148) => 'R', chr(197).chr(149) => 'r', chr(197).chr(150) => 'R', chr(197).chr(151) => 'r',
        chr(197).chr(152) => 'R', chr(197).chr(153) => 'r', chr(197).chr(154) => 'S', chr(197).chr(155) => 's',
        chr(197).chr(156) => 'S', chr(197).chr(157) => 's', chr(197).chr(158) => 'S', chr(197).chr(159) => 's',
        chr(197).chr(160) => 'S', chr(197).chr(161) => 's', chr(197).chr(162) => 'T', chr(197).chr(163) => 't',
        chr(197).chr(164) => 'T', chr(197).chr(165) => 't', chr(197).chr(166) => 'T', chr(197).chr(167) => 't',
        chr(197).chr(168) => 'U', chr(197).chr(169) => 'u', chr(197).chr(170) => 'U', chr(197).chr(171) => 'u',
        chr(197).chr(172) => 'U', chr(197).chr(173) => 'u', chr(197).chr(174) => 'U', chr(197).chr(175) => 'u',
        chr(197).chr(176) => 'U', chr(197).chr(177) => 'u', chr(197).chr(178) => 'U', chr(197).chr(179) => 'u',
        chr(197).chr(180) => 'W', chr(197).chr(181) => 'w', chr(197).chr(182) => 'Y', chr(197).chr(183) => 'y',
        chr(197).chr(184) => 'Y', chr(197).chr(185) => 'Z', chr(197).chr(186) => 'z', chr(197).chr(187) => 'Z',
        chr(197).chr(188) => 'z', chr(197).chr(189) => 'Z', chr(197).chr(190) => 'z', chr(197).chr(191) => 's',
        // Euro Sign
        chr(226).chr(130).chr(172) => 'E',
        // GBP (Pound) Sign
        chr(194).chr(163) => ''
    );
    return strtr($text, $chars);
}
It's simply a set of accented characters. In UTF-8, each of them is encoded as a sequence of two or more bytes (e.g. Č is the bytes 196 and 140), which is why the keys are assembled with chr(). The set is represented as an array so that it works as a dictionary for PHP's strtr() (STRing TRanslate) function, which replaces every key it finds in the text with the corresponding value.
The code comes from WordPress (see https://core.trac.wordpress.org/…rmatting.php). The original covers even more characters, but these will do for our purposes.
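To make the byte sequences less mysterious, here is a minimal sketch with a three-entry map in the same style. The map and the sample word are illustrative only; the \u{...} escapes need PHP 7.0 or newer.

```php
<?php
// "č" (U+010D) is stored in UTF-8 as the two bytes 0xC4 0x8D, i.e. 196 and 141
var_dump(chr(196) . chr(141) === "\u{010D}"); // bool(true)
var_dump(bin2hex("\u{010D}"));                // string(4) "c48d"

// A tiny map built the same way as the full table (illustrative only)
$map = array(
    chr(196) . chr(141) => 'c', // č
    chr(197) . chr(153) => 'r', // ř
    chr(195) . chr(169) => 'e', // é
);

// With an array argument, strtr() replaces every key found in the text
var_dump(strtr("\u{0159}e\u{010D}", $map));   // string(3) "rec" ("řeč" without accents)
```

Since strtr() works on raw bytes, this approach only works when the input text really is UTF-8 encoded.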
Let's try the method out:
// The following sentence includes all of the accent characters in the Czech language
// It could be roughly translated to "The horse who is too yellow was moaning devilish melodies"
// It's something like the English pangram "The quick brown fox jumps over the lazy dog"
echo(StringUtils::removeAccents('Příliš žluťoučký kůň úpěl ďábelské ódy. PŘÍLIŠ ŽLUŤOUČKÝ KŮŇ ÚPĚL ĎÁBELSKÉ ÓDY. <br />'));
The result:
Prilis zlutoucky kun upel dabelske ody. PRILIS ZLUTOUCKY KUN UPEL DABELSKE ODY.
That'll be all for today. We'll finish up with the StringUtils library in the next lesson - Finishing StringUtils library for PHP.