One of my latest job requirements was replacing some basic American English words with their British spellings (color/colour, favor/favour, etc.). The original iteration of this project used JavaScript to scan the page content and then replace the words. The only problem with this version was that AJAX calls would make browsers complain about breaking the DOM. The script for this iteration was:
```javascript
function replaceTerms() {
    var searchArray = new Array("favors", "labors", "colors", "favor", "labor", "color");
    var replaceArray = new Array("favours", "labours", "colours", "favour", "labour", "colour");

    if (!document.body || typeof(document.body.innerHTML) == "undefined") {
        //alert("Sorry, for some reason the text of this page is unavailable. Searching will not work.");
        return false;
    }

    var bodyText = document.body.innerHTML;
    for (var i = 0; i < searchArray.length; i++) {
        bodyText = doReplace(bodyText, searchArray[i], replaceArray[i]);
    }

    //document.body.innerHTML = bodyText;
    return true;
}

function doReplace(bodyText, searchTerm, replaceWith) {
    // Find all occurrences of the search term in the given text and replace them
    // (we're not using a regular expression search, because we want to filter out
    // matches that occur within HTML tags and script blocks, so we have to do a
    // little extra validation).
    var newText = "";
    var i = -1;
    var lcSearchTerm = searchTerm.toLowerCase();
    var lcBodyText = bodyText.toLowerCase();

    while (bodyText.length > 0) {
        // Get the index of the next occurrence of the search term
        i = lcBodyText.indexOf(lcSearchTerm, i + 1);

        // If we can't find it, append the remaining body text and finish
        if (i < 0) {
            newText += bodyText;
            bodyText = "";
        } else {
            // Skip anything inside an HTML tag
            if (bodyText.lastIndexOf(">", i) >= bodyText.lastIndexOf("<", i)) {
                // Skip anything inside a <script> block
                if (lcBodyText.lastIndexOf("/script>", i) >= lcBodyText.lastIndexOf("<script", i)) {
                    // Get the ASCII code of the first matched character
                    var charCode = bodyText.charAt(i).charCodeAt(0);

                    // Is it an uppercase letter (A-Z)?
                    var isUpper = (charCode >= 65 && charCode <= 90);

                    // Do the replacement, keeping the capitalization of the match
                    // (note: a trailing space is appended after each replacement)
                    if (isUpper) {
                        newText += bodyText.substring(0, i) + replaceWith.charAt(0).toUpperCase() + replaceWith.substr(1) + " ";
                    } else {
                        newText += bodyText.substring(0, i) + replaceWith + " ";
                    }

                    // Continue with the text after the match
                    bodyText = bodyText.substr(i + searchTerm.length);
                    lcBodyText = bodyText.toLowerCase();
                    i = -1;
                }
            }
        }
    }

    return newText;
}
```
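The two checks that do the real work in `doReplace` are easier to see in isolation. The following is only a minimal sketch: the helper names `isInsideTag` and `matchCase` are hypothetical and do not appear in the original script. They restate the two ideas it relies on. A match sits inside a tag when the nearest `<` before it comes after the nearest `>`, and the replacement is capitalized when the matched word starts with an uppercase ASCII letter.

```javascript
// Hypothetical helper (not part of the original script): returns true when
// the character at `index` sits inside an HTML tag, i.e. the closest "<"
// at or before it has not yet been closed by a ">".
function isInsideTag(html, index) {
  return html.lastIndexOf("<", index) > html.lastIndexOf(">", index);
}

// Hypothetical helper: capitalize the replacement when the original word
// starts with an uppercase ASCII letter (A-Z), otherwise return it unchanged.
function matchCase(original, replacement) {
  var first = original.charAt(0);
  var isUpper = first >= "A" && first <= "Z";
  return isUpper
    ? replacement.charAt(0).toUpperCase() + replacement.slice(1)
    : replacement;
}

// Made-up sample markup: "color" as an attribute value, "Color" as visible text.
var html = '<p class="color">Color me</p>';

console.log(isInsideTag(html, html.indexOf("color"))); // true  (attribute value)
console.log(isInsideTag(html, html.indexOf("Color"))); // false (visible text)
console.log(matchCase("Color", "colour"));             // Colour
```

Like the original check, this is a heuristic based on the nearest angle brackets; it assumes that `<` and `>` only appear as tag delimiters in the markup being scanned.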
Since we could not have our site breaking the DOM, the filtering was moved into the Visual Basic code-behind of the project. The trick to filtering content in .NET is to leverage the Response.Filter property along with a custom class (a Stream subclass, since Response.Filter is a Stream). This class intercepts the outgoing content and rewrites it however you see fit. More details on this can be found here.
The first version of our filter used regular expressions to replace the appropriate words in the page. This caused a severe problem: the RegEx was also replacing tag attributes, CSS, and control names (keep in mind that the word 'color' was one of the words being replaced). My first attempt at a solution was to find a RegEx that would replace words only inside paragraph tags. It turns out that writing a RegEx to parse HTML is nearly impossible.
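The failure mode is easy to reproduce. The sketch below is written in JavaScript rather than the Visual Basic of the filter, and the sample markup and the `naiveReplace` name are made up for illustration. It shows how a blanket regular expression substitution rewrites a CSS property along with the visible text.

```javascript
// Hypothetical sample markup: "color" appears both as a CSS property
// and in the text the visitor actually reads.
var sample = '<p style="color: red">My favorite color</p>';

// Naive approach: replace every occurrence, regardless of where it appears.
function naiveReplace(html) {
  return html.replace(/color/g, "colour");
}

console.log(naiveReplace(sample));
// <p style="colour: red">My favorite colour</p>
```

Since `colour` is not a valid CSS property, the browser ignores the declaration and the paragraph loses its styling, even though the visible text now reads correctly.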
The solution? The original JavaScript code was migrated into the Visual Basic filter. This worked like a charm: it used no RegExes and was based on the original code, which already worked. The final Visual Basic code is as follows:
```vb
Private Function doReplace(ByVal bodyText As String, ByVal searchTerm As String, ByVal replaceWith As String) As String
    Dim newText As String = ""
    Dim i As Integer = -1
    Dim lcSearchTerm As String = searchTerm.ToLower()
    Dim lcBodyText As String = bodyText.ToLower()

    While bodyText.Length > 0
        'Get the index of the search term
        i = lcBodyText.IndexOf(lcSearchTerm, i + 1)

        'If it isn't there, just return
        If i < 0 Then
            newText += bodyText
            bodyText = ""
        Else
            'Avoid tags
            If bodyText.LastIndexOf(">", i) >= bodyText.LastIndexOf("<", i) Then
                'Avoid scripts
                If lcBodyText.LastIndexOf("/script>", i) >= lcBodyText.LastIndexOf("<script", i) Then
                    'Is the first character uppercase?
                    Dim isUpper As Boolean = Char.IsUpper(bodyText.Chars(i))

                    'If it is, then capitalize the replacement
                    If isUpper Then
                        newText += bodyText.Substring(0, i) + Char.ToUpper(replaceWith.Chars(0)) + replaceWith.Substring(1) + " "
                    Else
                        newText += bodyText.Substring(0, i) + replaceWith + " "
                    End If

                    'Truncate body text
                    bodyText = bodyText.Substring(i + searchTerm.Length())

                    'Reset current text
                    lcBodyText = bodyText.ToLower()
                    i = -1
                End If
            End If
        End If
    End While

    Return newText
End Function
```
The lesson learned here? Think before you code. If I had simply migrated the JavaScript code from the start, I would have saved a lot of time.