Funny things about HTML
Additional HTML and BODY tags merge attributes.
Every HTML document implicitly contains an HTML node and a BODY node. Most documents also contain HTML and BODY tags. Some documents contain more than one HTML or BODY tag. When an unexpected HTML or BODY opening tag appears, any of its attributes that are not already present on the existing element are added to that element, and the unexpected tag is otherwise ignored.
<html one>
<body two>
<script>
console.log( document.querySelector( 'html' ).getAttributeNames() ); // one
console.log( document.querySelector( 'body' ).getAttributeNames() ); // two
</script>
<html three>
<script>
console.log( document.querySelector( 'html' ).getAttributeNames() ); // one, three
console.log( document.querySelector( 'body' ).getAttributeNames() ); // two
</script>
<body four>
<script>
console.log( document.querySelector( 'html' ).getAttributeNames() ); // one, three
console.log( document.querySelector( 'body' ).getAttributeNames() ); // two, four
</script>
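The merge rule can also be modeled outside a browser. The following is a minimal sketch, not the parser's actual implementation: `mergeStrayTagAttributes` is a hypothetical helper, and plain `Map` objects stand in for the attribute lists of the existing element and the stray tag. It assumes the behavior described in the HTML Standard, where an attribute is only added if the element does not already have one with that name.

```javascript
// Hypothetical helper: models how a stray <html> or <body> start tag is handled.
// `element` and `strayTag` are plain Maps of attribute name -> value.
function mergeStrayTagAttributes(element, strayTag) {
  for (const [name, value] of strayTag) {
    // Only attributes that are not already present get added;
    // existing values are never overwritten.
    if (!element.has(name)) {
      element.set(name, value);
    }
  }
  return element; // The stray tag itself creates no new node.
}

const html = new Map([["one", ""], ["lang", "en"]]);
mergeStrayTagAttributes(html, new Map([["three", ""], ["lang", "fr"]]));
console.log([...html.keys()]); // one, lang, three
console.log(html.get("lang")); // en
```

Note that the first value of a repeated attribute wins, so a later stray tag cannot change an attribute that was already set.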
Two URLs may not be the same even if they have identical strings.
Consider a link to /search?q=©, where the © may have been written as the named character reference &copy;.
In a UTF-8 HTML document the q value is sent as %c2%a9, but in a latin1 document it is sent as %a9. This is because a named character reference identifies a code point, not a byte sequence, and the query part of a URL is percent-encoded from the bytes of the document's character encoding rather than always from UTF-8. This should fail, because a lone %a9 byte is not valid UTF-8, but some encoders, notably PHP’s, do not validate the input before encoding bytes.
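The difference between the two encodings can be reproduced with a few lines of JavaScript. This is a rough sketch under stated assumptions: `percentEncodeBytes`, `utf8Bytes`, and `latin1Bytes` are hypothetical helpers written for this illustration, and "latin1" is treated as plain ISO-8859-1, where each code point up to U+00FF maps to the byte of the same value. Only `TextEncoder` and `decodeURIComponent` are standard built-ins.

```javascript
// Percent-encode every byte. Real URL encoders leave unreserved ASCII
// characters alone; that does not matter here because "©" has none.
function percentEncodeBytes(bytes) {
  return Array.from(bytes, (b) => "%" + b.toString(16).padStart(2, "0")).join("");
}

// UTF-8 bytes via the standard TextEncoder.
function utf8Bytes(text) {
  return new TextEncoder().encode(text);
}

// ISO-8859-1 bytes: each code point up to U+00FF maps to the same byte value.
function latin1Bytes(text) {
  return Array.from(text, (ch) => {
    const cp = ch.codePointAt(0);
    if (cp > 0xff) throw new RangeError("not representable in latin1: " + ch);
    return cp;
  });
}

const asUtf8 = percentEncodeBytes(utf8Bytes("©"));     // "%c2%a9"
const asLatin1 = percentEncodeBytes(latin1Bytes("©")); // "%a9"
console.log(asUtf8, asLatin1);

console.log(decodeURIComponent(asUtf8)); // "©"

// A lone 0xA9 byte is not valid UTF-8, so a validating decoder rejects it.
let latin1DecodeFailed = false;
try {
  decodeURIComponent(asLatin1);
} catch (err) {
  latin1DecodeFailed = err instanceof URIError;
}
console.log(latin1DecodeFailed); // true
```

`decodeURIComponent` always interprets the bytes as UTF-8, which is why it throws on the latin1 form instead of silently producing a different character.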
XML and HTML have funny newline behavior because of network buffers.
Citation needed
To limit the memory buffers required, newlines inside the markup syntax provide points at which parsing can be split. They are therefore superfluous: they are inserted to provide these split points and are not part of the user content. If you want newlines in an SGML/XML document, create a tag or entity to encode them rather than relying on the newline character in the document.
- Vague mention of systems with a limited line length. [History\WWW\Text]
- Suggestion of ignoring CR/LF altogether because of their multiple meanings, and general use to make text fit within readable lines. XML ought to use U+2028 and U+2029 instead for line and paragraph separators. [xml-dev Aug. 1997]
- Concern about infinite parser lookahead and confusion whether a CR “is a linend or just part of a CRLF sequence.” [xml-dev Aug. 1997]
- RFC 1521 (MIME) requires that 8bit-encoded lines not exceed 1000 octets, including the terminating CRLF. [RFC 1521]