Replace HTML Text without Regular Expressions

One of my latest job requirements was replacing some basic American English words with their British spellings (color/colour, favor/favour, etc.). The original iteration of this project used JavaScript to scan the page content and then replace the words. The only problem with this version was that AJAX calls would make browsers complain about breaking the DOM. The script for this iteration was:

function replaceTerms(){	
	var searchArray = new Array("favors","labors","colors","favor","labor","color");
	var replaceArray = new Array("favours","labours","colours","favour","labour","colour");
 
	if (!document.body || typeof(document.body.innerHTML) == "undefined") {
		//alert("Sorry, for some reason the text of this page is unavailable. Searching will not work.");
		return false;
	}
 
	var bodyText = document.body.innerHTML;
	for (var i = 0; i < searchArray.length; i++) {
		bodyText = doReplace(bodyText, searchArray[i], replaceArray[i]);
	}
 
	//document.body.innerHTML = bodyText;
	return true;
}
 
function doReplace(bodyText, searchTerm, replaceWith) {
 
	// find all occurrences of the search term in the given text and replace them (we're not using a
	// regular expression search, because we want to filter out matches that occur within HTML tags and script blocks, so
	// we have to do a little extra validation)
 
	var newText = "";
	var i = -1;
	var lcSearchTerm = searchTerm.toLowerCase();
	var lcBodyText = bodyText.toLowerCase();
 
	while (bodyText.length > 0) {
 
		//Get index of search term
		i = lcBodyText.indexOf(lcSearchTerm, i+1);
 
		//if we can't find it, append the rest of bodyText to newText and finish
		if (i < 0) {
			newText += bodyText;
			bodyText = "";
		} else {
 
			// skip anything inside an HTML tag
			if (bodyText.lastIndexOf(">", i) >= bodyText.lastIndexOf("<", i)) {
				// skip anything inside a <script> block
				if (lcBodyText.lastIndexOf("/script>", i) >= lcBodyText.lastIndexOf("<script", i)) {
 
					//Get the character code of the first matched character
					var charCode = bodyText.charCodeAt(i);
 
					//Is this uppercase
					var isUpper = (charCode >= 65 && charCode <= 90);
 
					//Do replacing (no padding is added, so "color." becomes "colour." and not "colour .")
					if(isUpper){
						newText += bodyText.substring(0, i) + replaceWith.charAt(0).toUpperCase() + replaceWith.substr(1);
					}else{
						newText += bodyText.substring(0, i) + replaceWith;
					}
					bodyText = bodyText.substr(i + searchTerm.length);
					lcBodyText = bodyText.toLowerCase();
					i = -1;
				}
			}
		}
	}
 
	return newText;
}
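
The two checks that do the real work here are the `lastIndexOf` comparison that skips matches inside a tag and the capitalization test. Pulled out on their own they might look roughly like the sketch below. The helper names `isOutsideTag` and `matchCase` and the sample markup are made up for illustration; they are not part of the original script.

```javascript
// Hypothetical helper: true when `index` is not between a "<" and its closing ">".
function isOutsideTag(html, index) {
  // The nearest ">" at or before index must come after the nearest "<".
  return html.lastIndexOf(">", index) >= html.lastIndexOf("<", index);
}

// Hypothetical helper: capitalize the replacement when the matched word was capitalized.
function matchCase(original, replacement) {
  var first = original.charAt(0);
  var isUpper = first >= "A" && first <= "Z";
  return isUpper
    ? replacement.charAt(0).toUpperCase() + replacement.slice(1)
    : replacement;
}

// Assumed sample markup: one "color" in an attribute, one "Color" in visible text.
var sample = '<font color="red">Color</font>';
var lcSample = sample.toLowerCase();
var inAttribute = lcSample.indexOf("color");              // match inside the <font ...> tag
var inText = lcSample.indexOf("color", inAttribute + 1);  // match in the visible text

console.log(isOutsideTag(sample, inAttribute)); // false
console.log(isOutsideTag(sample, inText));      // true
console.log(matchCase("Color", "colour"));      // Colour
```

The same comparison with "<script" and "/script>" is what keeps the replacement out of script blocks.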

Since we could not have our site breaking the DOM, the filtering was moved into the Visual Basic code-behind of the project. The trick to filtering content in .NET is to leverage the Response.Filter property along with a custom class. This class intercepts the content and rewrites it however you see fit. More details on this can be found here.

The first version of our filter used regular expressions to replace the appropriate words in the page. This caused a severe problem: the RegEx was also replacing tag attributes, CSS, and control names (remember that the word 'color' was being replaced). My first attempt at a solution was to find the proper RegEx to replace words only inside paragraph tags. It turns out that writing a RegEx to parse HTML is nearly impossible ;)
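
To make the failure concrete, here is a minimal sketch of the problem. It is written in JavaScript rather than the .NET filter we actually used, and the sample markup and variable names are assumptions for illustration only.

```javascript
// Assumed sample markup: "color" appears both as a CSS property and as visible text.
var markup = '<p style="color: red">My favorite color</p>';

// A naive global replace has no idea which match is markup and which is text.
var naive = markup.replace(/color/g, "colour");

console.log(naive);
// <p style="colour: red">My favorite colour</p>
```

The visible word is fixed, but the `style` attribute now contains `colour: red`, which is not a valid CSS property, so the styling silently breaks.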

The solution? The original JavaScript code was migrated into the Visual Basic filter. This worked like a charm, as it used no RegExes and was based on the original code that already worked. The final Visual Basic code is as follows:

Private Function doReplace(ByVal bodyText As String, ByVal searchTerm As String, ByVal replaceWith As String) As String
	Dim newText As String = ""
	Dim i As Integer = -1
	Dim lcSearchTerm As String = searchTerm.ToLower()
	Dim lcBodyText As String = bodyText.ToLower()
 
	While bodyText.Length > 0
		'Get the index of the search term
		i = lcBodyText.IndexOf(lcSearchTerm, i + 1)
 
		'If it isn't there, just return
		If i < 0 Then
			newText += bodyText
			bodyText = ""
		Else
			'Avoid tags
			If bodyText.LastIndexOf(">", i) >= bodyText.LastIndexOf("<", i) Then
				'Avoid scripts
				If lcBodyText.LastIndexOf("/script>", i) >= lcBodyText.LastIndexOf("<script", i) Then
					'Is the first character uppercase?
					Dim isUpper As Boolean = Char.IsUpper(bodyText.Chars(i))
 
					'If it is, then capitalize the replacement (no padding is added after the word)
					If isUpper Then
						newText += bodyText.Substring(0, i) + Char.ToUpper(replaceWith.Chars(0)) + replaceWith.Substring(1)
					Else
						newText += bodyText.Substring(0, i) + replaceWith
					End If
 
					'Truncate body text
					bodyText = bodyText.Substring(i + searchTerm.Length())
 
					'Reset current text
					lcBodyText = bodyText.ToLower()
					i = -1
				End If
			End If
		End If
	End While
 
	Return newText
End Function

The lesson learned here? Think before you code. If I had simply migrated the JavaScript code in the first place, I would have saved a lot of time.
