mirror of https://github.com/GNOME/libxml2.git synced 2025-10-15 21:27:33 +08:00
557 Commits

Author SHA1 Message Date
Nick Wellnhofer
7a41b18c62 parser: Remove xmlHaltParser
Always halt the parser on resource limit and entity loop errors and
remove the remaining calls which seem unnecessary.
2025-07-31 14:23:23 +02:00
Caolán McNamara
408bd0e18e const up allowPCData
similar to htmlScriptAttributes
2025-07-24 11:13:58 +01:00
Nick Wellnhofer
0c948334a8 html: Add newline to error message 2025-07-10 12:46:40 +02:00
Nick Wellnhofer
bc0bb67b57 html: Don't abort on encoding errors
Always enable recovery mode when parsing HTML, so we don't raise fatal
errors.

Regressed with 462bf0b7. Fixes #947.
2025-07-10 12:46:22 +02:00
Nick Wellnhofer
413cdfb34a html: Fix push parsing of doctype decls
Don't set state to "content" as we might still be in "misc" or "prolog".
2025-06-25 12:41:32 +02:00
Nick Wellnhofer
7c91385040 parser: Remove unnecessary dict checks when freeing strings
The following strings are never allocated from a dict:

- xmlParserCtxt.version
- xmlParserCtxt.encoding
- xmlParserCtxt.extSubURI
- xmlParserCtxt.extSubSystem
- xmlDoc.version
- xmlDoc.encoding
- xmlDoc.URL
- xmlDTD.ExternalID
- xmlDTD.SystemID
- xmlID.value

Also make the struct members point to non-const chars to avoid casts
when freeing.
2025-06-23 22:59:31 +02:00
Nick Wellnhofer
b424bae705 html: Fix pull-parsing of initial comments and doctype decls
- Parse more bogus comments and multiple doctype declarations before
  switching to content.
- Grow buffer after parsing comment.
2025-06-22 14:31:38 +02:00
Nick Wellnhofer
6b50d8c888 html: Add missing call to grow parser in htmlParseComment
Otherwise, long chains of short comments could exhaust the input buffer
when pull parsing.
2025-06-08 14:22:32 +02:00
Nick Wellnhofer
70335c41fc html: Don't stop on unsupported encoding
Continue to parse unlike in the XML case.
2025-06-08 14:22:32 +02:00
Nick Wellnhofer
416da89d0b html: Make htmlCtxtReset call xmlCtxtReset
The two implementations shouldn't diverge.
2025-06-08 14:22:32 +02:00
Nick Wellnhofer
c6206c9387 html: Ignore ASCII-incompatible encoding in meta tag
After successfully parsing an ASCII-encoded meta tag, switching to an
encoding that isn't ASCII-compatible cannot work.
2025-06-05 22:24:50 +02:00
Nick Wellnhofer
6a6a46f017 doc: Fix autolink errors
Fix links, remove links to internal functions.
2025-05-28 16:02:41 +02:00
Nick Wellnhofer
7bd8d1d9cc doc: Prefix autolinks with '#'
Use `#func` instead of `func()` to ignore parameters and make all
autolinks work.
2025-05-28 16:01:52 +02:00
Nick Wellnhofer
c5b45fbc07 doc: Misc fixes 2025-05-16 19:04:20 +02:00
Nick Wellnhofer
6f4b452742 parser: Stop using ctxt->linenumbers
I think this was used to avoid setting the `line` member before it was
added (20+ years ago).
2025-05-16 18:03:12 +02:00
Nick Wellnhofer
258d870629 codegen: Consolidate tools for code generation
Move tools, source files and output tables into codegen directory.

Rename some files.

Adjust tools to match modified files. Remove generation date and source
files from output.

Distribute all tools and sources.
2025-05-16 18:03:12 +02:00
Nick Wellnhofer
adfbeb7e08 doc: Stop using *Ptr typedefs in documentation 2025-05-16 18:03:12 +02:00
Nick Wellnhofer
a40f36e7f2 include: Stop using *Ptr typedefs in public headers 2025-05-16 18:03:12 +02:00
Nick Wellnhofer
2d83a84ca6 doc: Misc improvements 2025-05-16 18:03:12 +02:00
Nick Wellnhofer
f0983199e8 html: Map some encodings according to HTML5
Windows-1252 is a superset of ISO-8859-1 and should be used instead.
Same for ASCII.

Also map UCS-2 and UTF-16 to UTF-16LE.
2025-05-12 14:04:30 +02:00
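
The mapping described in f0983199e8 can be pictured as a small lookup table. This is a minimal sketch, not the libxml2 implementation: `demoMapHtmlEncoding` and `demoEqualsNoCase` are hypothetical names, and the table holds only the four names the commit message mentions.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Case-insensitive ASCII comparison, written out to avoid relying on
 * the non-standard strcasecmp. */
static int
demoEqualsNoCase(const char *a, const char *b) {
    while (*a != '\0' && *b != '\0') {
        if (tolower((unsigned char) *a) != tolower((unsigned char) *b))
            return 0;
        a++;
        b++;
    }
    return *a == *b;
}

/* Hypothetical helper: return the encoding an HTML parser would use
 * for a declared name, following the commit message above. */
const char *
demoMapHtmlEncoding(const char *name) {
    static const struct { const char *from; const char *to; } map[] = {
        { "ISO-8859-1", "windows-1252" },
        { "ASCII",      "windows-1252" },
        { "UCS-2",      "UTF-16LE" },
        { "UTF-16",     "UTF-16LE" },
    };
    size_t i;

    assert(name != NULL);
    for (i = 0; i < sizeof(map) / sizeof(map[0]); i++) {
        if (demoEqualsNoCase(name, map[i].from))
            return map[i].to;
    }
    return name; /* every other name passes through unchanged */
}
```

A caller would apply the helper to the declared name before opening a converter; names that are not in the table are returned as-is.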
Nick Wellnhofer
05b8fe0a06 html: Don't escape RAWTEXT and PLAINTEXT
Align with HTML5.
2025-05-11 20:57:07 +02:00
Nick Wellnhofer
809ded586b html: Add more empty elements
Add empty HTML5 elements <bgsound>, <keygen>, <source>, <track> and
<wbr>.

Make <embed> an empty element.
2025-05-11 20:46:50 +02:00
Nick Wellnhofer
c7c4964342 html: Move DTD creation to endDocument SAX callback 2025-05-11 20:29:25 +02:00
Nick Wellnhofer
46f05ea4d5 html: Rework meta charset handling
Don't use encoding from meta tags when serializing. Only use the value
in `doc->encoding`, matching the XML serializer. This is the actual
encoding used when parsing.

Stop modifying the input document by setting meta tags before
serializing. Meta tags are now injected during serialization.

Add full support for <meta charset=""> which is also used when adding
meta tags.

Align with HTML5 and implement the "algorithm for extracting a character
encoding from a meta element". Only modify the encoding substring in
Content-Type meta tags.

Only switch encoding once when parsing.

Fix htmlSaveFileFormat with a NULL encoding not to declare a misleading
UTF-8 charset.

Fixes #909.
2025-05-11 20:29:25 +02:00
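
The "algorithm for extracting a character encoding from a meta element" named in 46f05ea4d5 comes from the HTML standard. The sketch below follows the steps of that algorithm over a NUL-terminated `content` attribute value. It is an illustration only: `demoExtractCharset` and `demoIsSpace` are hypothetical names, and the code is not taken from libxml2.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* ASCII whitespace as defined by HTML5: tab, LF, FF, CR and space. */
static int
demoIsSpace(char c) {
    return c == '\t' || c == '\n' || c == '\f' || c == '\r' || c == ' ';
}

/* On success, return a pointer into `s` and store the length of the
 * encoding name in *len. Return NULL if no encoding is found. */
const char *
demoExtractCharset(const char *s, size_t *len) {
    static const char key[] = "charset";
    const size_t keyLen = sizeof(key) - 1;
    const char *p, *end;
    size_t i;

    assert(s != NULL && len != NULL);
    p = s;
    while (*p != '\0') {
        /* Case-insensitive match of "charset" at p. */
        for (i = 0; i < keyLen; i++) {
            if (tolower((unsigned char) p[i]) != key[i])
                break;
        }
        if (i < keyLen) {
            p++;
            continue;
        }
        p += keyLen;
        while (demoIsSpace(*p))
            p++;
        if (*p != '=')
            continue;           /* resume the search from here */
        p++;
        while (demoIsSpace(*p))
            p++;
        if (*p == '"' || *p == '\'') {
            end = strchr(p + 1, *p);
            if (end == NULL)
                return NULL;    /* unmatched quote */
            *len = (size_t) (end - (p + 1));
            return p + 1;
        }
        if (*p == '\0')
            return NULL;        /* nothing after the '=' */
        /* Unquoted value: runs up to whitespace, ';' or the end. */
        end = p;
        while (*end != '\0' && !demoIsSpace(*end) && *end != ';')
            end++;
        *len = (size_t) (end - p);
        return p;
    }
    return NULL;
}
```

Returning a pointer and a length rather than a copy fits the commit's note that only the encoding substring of a Content-Type meta tag is modified: the caller knows exactly which bytes to replace.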
Nick Wellnhofer
f3a080bc48 html: Ignore U+0000 in body text
Align with HTML5. Fixes #908.
2025-05-11 20:29:25 +02:00
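
Ignoring U+0000 in body text amounts to dropping NUL bytes before the text reaches the tree. A minimal sketch of that filtering step, assuming an in-place pass over a byte buffer; `demoDropNulBytes` is a hypothetical helper, not a libxml2 function.

```c
#include <assert.h>
#include <stddef.h>

/* Remove NUL bytes from buf[0..len) in place and return the new
 * length. Bytes past the returned length are left unspecified. */
size_t
demoDropNulBytes(char *buf, size_t len) {
    size_t in, out = 0;

    assert(buf != NULL || len == 0);
    for (in = 0; in < len; in++) {
        if (buf[in] != '\0')
            buf[out++] = buf[in];
    }
    return out;
}
```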
Nick Wellnhofer
9bbffec568 doc: Move brief to top, params to bottom of doc comments 2025-05-06 19:51:38 +02:00
Nick Wellnhofer
b7274fb02f doc: Misc fixes to HTML parser docs 2025-05-06 19:51:38 +02:00
Nick Wellnhofer
4a01087585 doc: Move parser option docs to enum 2025-05-06 19:51:38 +02:00
Nick Wellnhofer
cb1635a642 doc: Use @since command 2025-05-02 19:05:25 +02:00
Nick Wellnhofer
e78e05c990 doc: Fix autolinks to functions
Unfortunately, autolinks in .c files aren't converted by Doxygen for
some reason.
2025-05-02 17:45:31 +02:00
Nick Wellnhofer
f7c412874b doc: Remove more comment block headers 2025-05-02 17:41:26 +02:00
Nick Wellnhofer
e525564f65 doc: Remove empty lines at start of block
These lines were left over after automatic conversion.
2025-05-02 11:42:05 +02:00
Nick Wellnhofer
e549622bc5 doc: Convert documentation to Doxygen
Automated conversion based on a few regexes.
2025-05-01 23:23:42 +02:00
Nick Wellnhofer
69879da88f doc: Remove email addresses from documentation
Also remove authorship information from generated files, hash.c and
globals.c which were rewritten.
2025-05-01 23:23:42 +02:00
Nick Wellnhofer
61890e399d doc: Prepare for conversion to Doxygen
Fix many params in internal functions (not really necessary but Doxygen
warns about that in XML mode).

Fix formatting in a few corner cases that automatic conversion can't
handle.

Rearrange some DOC_DISABLE blocks.
2025-05-01 23:23:42 +02:00
Nick Wellnhofer
4ba1f9238a html: Avoid HTML_PARSE_HTML5 clashing with XML_PARSE_NOENT
There are several users that pass invalid XML parser options to the
HTML parser. Choose a value that is less likely to clash.
2025-04-18 18:48:25 +02:00
Nick Wellnhofer
b8018afa4c html: Fix documentation of parser options 2025-04-10 16:36:03 +02:00
Nick Wellnhofer
2ecc08f6dc html: Deprecate more functions 2025-04-10 16:36:03 +02:00
Nick Wellnhofer
69b83bb68e encoding: Detect truncated multi-byte sequences with ICU
Unlike iconv or the internal converters, ICU consumes truncated multi-
byte sequences at the end of an input buffer. We currently check for a
non-empty raw input buffer to detect truncated sequences, so this fails
with ICU.

It might be possible to inspect the pivot buffer pointers, but it seems
cleaner to implement a `flush` flag for some encoding and I/O functions.
After flushing, we can check for U_TRUNCATED_CHAR_FOUND with ICU, or
detect remaining input with other converters.

Also fix detection of truncated sequences for HTML, XML content and
DTDs with iconv.
2025-03-13 22:15:10 +01:00
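
To make "truncated multi-byte sequence" concrete, the sketch below inspects the tail of a UTF-8 buffer and reports how many trailing bytes belong to an incomplete sequence. It only illustrates the condition that 69b83bb68e detects. The commit itself relies on the converters (ICU's U_TRUNCATED_CHAR_FOUND, or leftover input with other converters), and `demoTruncatedUtf8Tail` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Return the number of bytes at the end of buf[0..len) that form an
 * incomplete UTF-8 sequence, or 0 if the buffer ends on a character
 * boundary. Invalid bytes are not reported as truncation. */
size_t
demoTruncatedUtf8Tail(const unsigned char *buf, size_t len) {
    size_t back;

    assert(buf != NULL || len == 0);
    /* The lead byte of a truncated sequence is at most 3 bytes from
     * the end, so looking back over 3 bytes is enough. */
    for (back = 1; back <= 3 && back <= len; back++) {
        unsigned char c = buf[len - back];
        size_t need;

        if ((c & 0xC0) == 0x80)
            continue;           /* continuation byte, keep looking */
        if (c < 0x80)
            return 0;           /* ASCII byte */
        if ((c & 0xE0) == 0xC0)
            need = 2;
        else if ((c & 0xF0) == 0xE0)
            need = 3;
        else if ((c & 0xF8) == 0xF0)
            need = 4;
        else
            return 0;           /* invalid lead byte */
        return (back < need) ? back : 0;
    }
    return 0;
}
```

A non-zero result after the final chunk of input corresponds to the error case described above. In the middle of a stream it only means that more input is needed.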
Nick Wellnhofer
8873a49846 html: Fix areBlanks check
Short-lived regression from 71122421.
2025-03-09 16:21:13 +01:00
Nick Wellnhofer
5f0b1378d7 parser: Add more parser context accessors
Fixes #763.
2025-03-08 22:36:06 +01:00
Nick Wellnhofer
5237d90fae html: Process data before switching encoding
This reduces the amount of data to convert and avoids issues with EOF
detection.

Also reset EOF flag after switching encoding as a precaution.
2025-03-07 21:19:16 +01:00
Nick Wellnhofer
0b27097a92 encoding: Rename unprefixed public functions 2025-03-04 16:46:53 +01:00
Nick Wellnhofer
5ed4eafd8a html: Don't invoke SAX callbacks if parser was stopped 2025-02-22 14:52:47 +01:00
Nick Wellnhofer
63dfcca670 fuzz: Reduce initial array size 2025-02-20 12:22:12 +01:00
Nick Wellnhofer
b8234e8c73 html: Fix check for partial named character references
Digits are allowed after the first character.
2025-02-19 12:53:32 +01:00
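
The rule in b8234e8c73 (digits are allowed after the first character) can be sketched as a check on the tail of an input chunk. The code below is an assumption about what such a check looks like, not the libxml2 code: `demoEndsWithPartialCharRef` is a hypothetical name, and treating a lone trailing `&` as partial is a choice made for this sketch.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Return 1 if buf[0..len) ends with what could be the start of a
 * named character reference: '&', then a letter, then letters or
 * digits, with no terminating ';' yet. Return 0 otherwise. */
int
demoEndsWithPartialCharRef(const char *buf, size_t len) {
    size_t i = len;

    assert(buf != NULL || len == 0);
    /* Walk back over trailing letters and digits. */
    while (i > 0 && isalnum((unsigned char) buf[i - 1]))
        i--;
    if (i == 0 || buf[i - 1] != '&')
        return 0;               /* no '&' directly before the name */
    if (i == len)
        return 1;               /* a lone '&' at the very end */
    /* The first name character must be a letter, not a digit. */
    return isalpha((unsigned char) buf[i]) != 0;
}
```

With this rule a chunk ending in `&frac1` is treated as partial, since it may continue as `&frac12;`, while `&1x` is not.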
Nick Wellnhofer
7a61c32bfa html: Use enum instead of magic values for insertion modes 2025-02-17 11:41:57 +01:00
Nick Wellnhofer
8cf6129bbd html: Stop implying <p> start tags
Only <html>, <head> or <body> should be implied. Opening extra <p> tags
has always been a libxml2 quirk.
2025-02-13 20:20:17 +01:00
Nick Wellnhofer
71122421a1 html: Make implied <p> tags more deterministic
libxml2's HTML parser adds <p> start tags in some situations. This
behavior, which doesn't follow any standard, was added in 2000, see
here: http://veillard.com/XML/messages/0655.html

Text nodes that only contain whitespace don't imply a <p> tag, but the
whitespace check cannot work reliably if we're parsing partial text data
which can happen with both the pull and the push parser.

The logic in `areBlanks` is hard to follow. The checks involving `CUR`
depend on the position of the input pointer and seem dubious. It's also
possible that the behavior changed inadvertently with a later commit.
As a result, it's hard to come up with good test cases.

We now process leading whitespace before creating implied tags. This is
more in line with HTML5 and should avoid at least some issues with
partial text data.

For example, parsing the string "<head>   x" used to result in:

<html>
<head></head>
<body><p>   x</p></body>
</html>

And now results in:

<html>
<head>   </head>
<body><p>x</p></body>
</html>

Except for the implied <p> tag, this matches HTML5.
2025-02-13 14:31:44 +01:00
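
The change in 71122421a1 hinges on splitting leading whitespace from the rest of a text chunk before any implied tags are created. A minimal sketch of that split, assuming the HTML5 definition of ASCII whitespace; `demoLeadingBlanks` and `demoIsHtmlSpace` are hypothetical names and not part of libxml2.

```c
#include <assert.h>
#include <stddef.h>

/* ASCII whitespace as defined by HTML5: tab, LF, FF, CR and space. */
static int
demoIsHtmlSpace(char c) {
    return c == '\t' || c == '\n' || c == '\f' || c == '\r' || c == ' ';
}

/* Return the number of leading whitespace bytes in text[0..len). */
size_t
demoLeadingBlanks(const char *text, size_t len) {
    size_t n = 0;

    assert(text != NULL || len == 0);
    while (n < len && demoIsHtmlSpace(text[n]))
        n++;
    return n;
}
```

For the text `   x` after `<head>`, the helper returns 3. A parser following the commit's approach would add those three spaces to the open `<head>` element, then create the implied tags, and only then process `x`.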
Nick Wellnhofer
8d7e38d536 fuzz: Ignore encodings when fuzzing on Apple
Not long ago, Apple decided to replace GNU libiconv with a patched up
version of FreeBSD's iconv implementation in their operating systems.
Unfortunately, the quality of both the original implementation as well
as Apple's patches is so abysmal that you routinely find issues when
fuzzing your own code.
2025-02-02 11:15:45 +01:00