Hi,
I'm seeing a strange encoding issue that started with libxml2 2.11.1+ (released a week ago, https://gitlab.gnome.org/GNOME/libxml2/-/tags), using the libxml Rust crate 0.3.2.
My sample:
- I have the following HTML document:
<data>café</data>
- I evaluate the following XPath expression:
normalize-space(//data)
Sample code:
use std::ffi::CStr;
use std::os::raw;

use libxml::parser::{Parser, ParserOptions};
use libxml::xpath::Context;

fn main() {
    let parser = Parser::default_html();
    let options = ParserOptions { encoding: Some("utf-8"), ..Default::default() };
    let data = "<data>café</data>";
    let doc = parser.parse_string_with_options(data, options).unwrap();
    let context = Context::new(&doc).unwrap();
    let result = context.evaluate("normalize-space(//data)").unwrap();
    assert_eq!(unsafe { *result.ptr }.type_, libxml::bindings::xmlXPathObjectType_XPATH_STRING);
    let value = unsafe { *result.ptr }.stringval;
    let value = value as *const raw::c_char;
    let value = unsafe { CStr::from_ptr(value) };
    let value = value.to_string_lossy();
    println!("{value}");
}
With libxml2 2.11.0, the value printed is café; with libxml2 2.11.1 and later (2.11.3 below), the value printed is cafÃ©:
$ export LIBXML2=/Users/jc/Documents/Dev/libxml/libxml2-2.11.0/lib/libxml2.2.dylib
$ cargo clean && cargo run
café
$ export LIBXML2=/Users/jc/Documents/Dev/libxml/libxml2-2.11.3/lib/libxml2.2.dylib
$ cargo clean && cargo run
cafÃ©
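To illustrate what a dropped encoding hint would look like, here is a minimal sketch in plain Rust (no libxml involved). It assumes that, without an encoding, the HTML parser ends up treating the input as a single-byte ISO-8859-1 stream; I have not verified that against the libxml2 source. `decode_latin1` is a hypothetical helper written only for this illustration.

```rust
/// Hypothetical helper: decodes raw bytes as ISO-8859-1 (Latin-1).
/// Every Latin-1 byte maps to the Unicode code point with the same value,
/// so `u8 as char` is an exact Latin-1 decode.
fn decode_latin1(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| b as char).collect()
}

/// Takes the UTF-8 bytes of the sample text and decodes them as Latin-1,
/// which is the assumed behaviour when the "utf-8" hint is lost.
fn misdecoded_sample() -> String {
    // "é" is encoded in UTF-8 as the two bytes 0xC3 0xA9.
    let utf8_bytes = "café".as_bytes();
    decode_latin1(utf8_bytes)
}
```

Under that assumption, the two bytes of "é" are read as two separate characters (U+00C3 and U+00A9), while the ASCII part of the string is unaffected.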
I have the impression that the encoding value of ParserOptions is not passed correctly through the crate (note: to reproduce the bug, you have to use Parser::default_html() and not Parser::default()).
To confirm this, I tested the "equivalent" code in plain C against libxml2 2.11.0 and 2.11.3:
#include <stdio.h>
#include <libxml/HTMLparser.h>
#include <libxml/xpath.h>

int main(void) {
    xmlDocPtr doc = NULL;
    xmlXPathContextPtr context = NULL;
    xmlXPathObjectPtr result = NULL;
    // <data>café</data> in UTF-8. The array has no NUL terminator,
    // so its length is taken from sizeof rather than strlen.
    const unsigned char data[] = {0x3c, 0x64, 0x61, 0x74, 0x61, 0x3e, 0x63, 0x61, 0x66, 0xc3, 0xa9, 0x3c, 0x2f, 0x64, 0x61,
                                  0x74, 0x61, 0x3e};
    doc = htmlReadMemory((const char *) data, (int) sizeof(data), NULL, "utf-8",
                         HTML_PARSE_RECOVER | HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING);
    // Evaluate the XPath expression
    context = xmlXPathNewContext(doc);
    result = xmlXPathEvalExpression((const xmlChar *) "normalize-space(//data)", context);
    if (result != NULL && result->type == XPATH_STRING) {
        printf("%s\n", (const char *) result->stringval);
    }
    xmlXPathFreeObject(result);
    xmlXPathFreeContext(context);
    xmlFreeDoc(doc);
    return 0;
}
$ gcc -L/Users/jc/Documents/Dev/libxml/libxml2-2.11.0/lib -l xml2 test.c
$ ./a.out
café
$ gcc -L/Users/jc/Documents/Dev/libxml/libxml2-2.11.3/lib -l xml2 test.c
$ ./a.out
café
Both versions print the expected value here, so the C API itself handles the encoding correctly.
My suspicion is that the problem is in rust-libxml/src/parser.rs (line 292 in a10a5a6):
pub fn parse_string_with_options<Bytes: AsRef<[u8]>>(
When I debug the following code:
// Process encoding.
let encoding_cstring: Option<CString> =
    parser_options.encoding.map(|v| CString::new(v).unwrap());
let encoding_ptr = match encoding_cstring {
    Some(v) => v.as_ptr(),
    None => DEFAULT_ENCODING,
};
// Process url.
let url_ptr = DEFAULT_URL;
If the parser encoding is initialized with Some("utf-8"), encoding_ptr is no longer valid just before // Process url (it points to a NUL char).
So the call to the htmlReadMemory binding is effectively made with no encoding. The unsafe part of the code is at the limit of my Rust understanding, so I'm unable to tell whether something is wrong there. I hope my issue is clear and, as I should have said at the start, thank you for your work on this crate!
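One possible explanation, offered as a guess rather than a confirmed diagnosis: `match encoding_cstring { Some(v) => v.as_ptr(), ... }` moves the CString into `v`, which is dropped at the end of the match arm, so the returned pointer would dangle from then on. The sketch below uses only the standard library and shows a variant that borrows the CString instead, so the buffer outlives the pointer. `encoding_to_cstring`, `encoding_ptr` and `read_back` are hypothetical names for this illustration, and the null pointer stands in for DEFAULT_ENCODING, whose actual value I am assuming.

```rust
use std::ffi::{CStr, CString};
use std::os::raw::c_char;
use std::ptr;

/// Hypothetical helper: builds an owned CString that the caller must keep
/// alive for as long as any pointer derived from it is in use.
fn encoding_to_cstring(encoding: Option<&str>) -> Option<CString> {
    encoding.map(|v| CString::new(v).expect("encoding name must not contain NUL"))
}

/// Borrows the CString rather than moving it, so nothing is dropped here.
/// A null pointer is used as a stand-in for DEFAULT_ENCODING (assumption).
fn encoding_ptr(encoding: &Option<CString>) -> *const c_char {
    match encoding {
        Some(v) => v.as_ptr(),
        None => ptr::null(),
    }
}

/// Reads the pointer back the way a C function would, to check it is valid.
fn read_back(encoding: Option<&str>) -> Option<String> {
    let owned = encoding_to_cstring(encoding);
    let p = encoding_ptr(&owned);
    if p.is_null() {
        return None;
    }
    // SAFETY: `owned` is still alive in this scope, so `p` points to its
    // NUL-terminated buffer.
    let s = unsafe { CStr::from_ptr(p) };
    Some(s.to_string_lossy().into_owned())
}
```

If this guess is right, matching on `&encoding_cstring` (or calling `.as_ref()` on it) in parse_string_with_options would keep the pointer valid until the htmlReadMemory call.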
Regards,
Jc