
#1 atraub

URLEncoding - Reasoning with my professor

Posted 15 November 2018 - 04:29 PM

Hey all, long time no see. I'm in a grad class and I'm having a "spirited disagreement" with my professor. We had an assignment, and one part of it was the following:


Write an input sanitization function called valid Input using Java or C++ language to filter urls containing javascript to prevent from XSS attacks. The input is a url and the output is the sanitized url as following. You can use either blacklist-based approach or whitelist-based approach. (25 pts)

I opted for a whitelist approach. It didn't take long, and once it was in decent shape, I tested my output against java.net.URLEncoder.encode; after a few tweaks, the two were identical. I felt good about this one. Out of 25 points, I received 15 with the comment: "This implementation will disable all html tags including some Javascript." What?! Surely this must be some mistake, I thought. I approached her in the next class, and we had a very long disagreement on the subject.
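For readers who want to see what such a whitelist encoder might look like, here is a minimal sketch. It is not the code I submitted; the class name `WhitelistEncoder`, the method name `validInput`, and the sample input are assumptions made up for illustration. The safe set mirrors the characters that java.net.URLEncoder leaves untouched (letters, digits, `.`, `-`, `*`, `_`), with a space becoming `+` and everything else percent-encoded as UTF-8 bytes.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class WhitelistEncoder {
    // Whitelist: the characters java.net.URLEncoder passes through unchanged.
    private static final String SAFE =
        "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.-*_";

    // Hypothetical sanitizer: anything outside the whitelist is percent-encoded.
    static String validInput(String url) {
        StringBuilder out = new StringBuilder();
        for (byte b : url.getBytes(StandardCharsets.UTF_8)) {
            int v = b & 0xFF;              // treat the byte as unsigned
            char c = (char) v;
            if (v < 0x80 && SAFE.indexOf(c) >= 0) {
                out.append(c);             // whitelisted: keep as-is
            } else if (c == ' ') {
                out.append('+');           // form-style encoding of a space
            } else {
                out.append(String.format("%%%02X", v)); // e.g. '<' becomes %3C
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String input = "<script>alert('xss')</script>";
        String mine = validInput(input);
        String reference = URLEncoder.encode(input, "UTF-8");
        System.out.println(mine);
        // Compare against the JDK encoder, as described in the post.
        System.out.println(mine.equals(reference));
    }
}
```

Note that nothing is deleted from the input: every character is either kept or replaced by an escape that a decoder can turn back into the original.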

Me: "I'm confused, this is a URL encoder, why are we concerned about HTML?"
Her: "What if HTML was passed in a URL, your encoding would break it and make it unable to render on the page?"
Me: "I have literally never heard of anyone doing that."
Her: "I think some blogging software does it."
Me: "Why? They should use a POST rather than a GET if they're going to pass around HTML. That's more secure and you won't have to worry about the length limitations of Query Strings."
Her: "Well, I still think your approach is too heavy handed. Also, this was a big assignment and you didn't put in much effort."
Me: "I literally ran test cases comparing mine to Oracle's Encoding algorithm until mine had the same output as the industry standard. Yes, all of my code fits on 1 page, but that's because it's concise, clear code."
Her: "Still, it's too heavy handed, you might be blocking safe code."
Me: "I get that, but I don't think we should be writing our code from the perspective of 'what if they want to do something really weird or really bad form?' - there's really no reason why I should have server side code just so that I can make an ajax call to weather.com, but because it looks suspicious, Chrome still won't let me do an ajax request directly to their site from my client-side code."
Her: "Well, I was fair though, anyone who broke HTML lost the same amount"
Me: "I'm not trying to be disrespectful, but I'm not arguing that the rule was unfair because you selectively applied it, I'm saying it's unfair because we're being penalized for not following guidelines that weren't included with the assignment."
Her: "Well, I saw other students' approaches that I liked more."
Me: "Wouldn't you agree that it's a more fair assessment to judge my algorithm, not by students whose answers you like more, but by the industry standard put in place by Oracle?"
Her Frustrated: "Let's talk about this later"

Guys, what can I do? I've tried to find info on HTML inside query strings and there's just nothing out there. What can I say to prove to her that the case of HTML in a URL is not realistic and does not merit such a heavy hit to our grades?

This post has been edited by atraub: 15 November 2018 - 04:42 PM


Replies To: URLEncoding - Reasoning with my professor

#2 BetaWar

Re: URLEncoding - Reasoning with my professor

Posted 15 November 2018 - 05:18 PM

Teachers, in general, hate it when people argue with them. However, if you want to work towards getting points back for you and apparently other classmates, you could point her at the RFC for URIs, which specifies the characters allowed in a URI; neither < nor > is a valid character, especially in a query string segment (RFC: https://tools.ietf.o...section-1.1.2). Additionally, you could look at the JavaScript URI encoder and decoder functions - this is literally what the web browser does when it encodes a form, assuming it is done properly: https://developer.mo...jects/encodeURI
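To make the character-set point concrete, here is a rough sketch of a check based on the unreserved and reserved sets in RFC 3986 (sections 2.2 and 2.3). The class name `Rfc3986Chars` and the method `isAllowedLiteral` are made up for illustration, and the check only looks at single characters; it does not validate where in a URI a reserved character may appear.

```java
public class Rfc3986Chars {
    // RFC 3986 section 2.3: unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
    private static final String UNRESERVED_PUNCT = "-._~";
    // RFC 3986 section 2.2: gen-delims and sub-delims
    private static final String RESERVED = ":/?#[]@!$&'()*+,;=";

    // Hypothetical helper: may this character appear literally somewhere in a URI?
    static boolean isAllowedLiteral(char c) {
        boolean alnum = (c >= 'a' && c <= 'z')
                     || (c >= 'A' && c <= 'Z')
                     || (c >= '0' && c <= '9');
        return alnum
            || UNRESERVED_PUNCT.indexOf(c) >= 0
            || RESERVED.indexOf(c) >= 0
            || c == '%';                   // introduces a percent-encoded octet
    }

    public static void main(String[] args) {
        for (char c : new char[] {'<', '>', 'a', '~', '&'}) {
            System.out.println("'" + c + "' allowed: " + isAllowedLiteral(c));
        }
    }
}
```

Under this reading, < and > fall in neither set, so they have to be percent-encoded (%3C and %3E) to appear in a URI at all.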

It is, however, worth noting that Chrome doesn't appear to care about < or > showing up in the URL, so it is possible it is following a different spec, as this:
<!DOCTYPE html>
        <form action="/form.html" method="GET">
            <input type="text" name="htmlQuery" value="<b>This is some bold text</b> and some other stuff. <a href='http://www.google.com'>Google</a>"/>
            <input type="submit" value="Submit"/>
        </form>

Goes to the following URL:

My personal take would be that if you are preserving the content, even though you are being heavy handed in the encoding, then everything is good. If you are stripping characters (such as < and >), then that could be a problem. I will agree that people who pass HTML in a GET request are dumb, but that's the world for you.

So, in short: encoding more than strictly necessary characters into a hex-based value would be something I wouldn't find an issue with (as they are still there, just encoded); removing characters would be the problem.
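The difference between the two can be sketched in a few lines of Java. The class name `EncodeVsStrip` and its helper methods are made up for illustration; the only library calls are the JDK's `URLEncoder.encode` and `URLDecoder.decode`.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

public class EncodeVsStrip {
    // Encode, then decode: the original text should come back unchanged.
    static String roundTrip(String s) {
        try {
            String encoded = URLEncoder.encode(s, "UTF-8");
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // Hypothetical "stripping" sanitizer: deletes angle brackets outright.
    static String strip(String s) {
        return s.replaceAll("[<>]", "");
    }

    public static void main(String[] args) {
        String html = "<b>This is some bold text</b>";
        // Encoding is reversible, so the content is still there.
        System.out.println(roundTrip(html).equals(html)); // true
        // Stripping is lossy: the tags cannot be recovered.
        System.out.println(strip(html)); // bThis is some bold text/b
    }
}
```

A receiver that decodes the encoded value gets the markup back intact, while the stripped value has lost the tags for good.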

#3 atraub

Re: URLEncoding - Reasoning with my professor

Posted 15 November 2018 - 05:56 PM

It's funny, I brought up RFC 3986 while we were "disagreeing" but I didn't have time to find the exact section where it sets the allowable list of characters. After reading it more closely, I can definitely show her that < and > are not allowed by the spec. This helps a lot, thanks!
