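In Java, invalid XML characters can be stripped from a string before it is handed to an XML parser. The following code example shows one way to strip them: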
import org.xml.sax.*;
import org.xml.sax.helpers.DefaultHandler;
import javax.xml.parsers.SAXParserFactory;
import java.io.*;

public class StripInvalidXmlChars {
    public static void main(String[] args) throws Exception {
        // Sample input with two characters that XML 1.0 forbids: U+0000 and U+000B.
        String inputText = "<msg>Hello,\u0000 world!\u000B</msg>";
        String strippedText = stripInvalidXmlChars(inputText);
        System.out.println("Original Text: \n" + inputText);
        System.out.println("Stripped Text: \n" + strippedText);
        // The cleaned string is well-formed, so the SAX parser accepts it.
        System.out.println("Parsed Text: \n" + extractText(strippedText));
    }

    // Keep only the code points allowed by the XML 1.0 Char production.
    public static String stripInvalidXmlChars(String inputText) {
        StringBuilder out = new StringBuilder(inputText.length());
        for (int i = 0; i < inputText.length(); ) {
            int cp = inputText.codePointAt(i);
            if (isValidXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            // Advance by one or two chars, depending on whether cp is a surrogate pair.
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    // XML 1.0: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Parse the already cleaned XML and return its character data.
    public static String extractText(String xml) throws Exception {
        XMLReader xmlReader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        StripContentHandler handler = new StripContentHandler();
        xmlReader.setContentHandler(handler);
        xmlReader.parse(new InputSource(new StringReader(xml)));
        return handler.getText();
    }
}

// ContentHandler (via DefaultHandler) that collects the character data the parser reports.
class StripContentHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        text.append(ch, start, length);
    }

    public String getText() {
        return text.toString();
    }
}
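The code above pairs the stripping step with Java's SAX parser. StripContentHandler is a ContentHandler implementation: the parser calls it back for events such as the start and end of the document, the start and end of each element, character data, and ignorable whitespace (comments are reported through LexicalHandler, not ContentHandler). A conforming parser does not silently drop an invalid character; it reports a fatal error and stops, so invalid characters have to be removed from the raw text before it is parsed. Under XML 1.0 a character is valid only if it is U+0009, U+000A, U+000D, or falls in one of the ranges U+0020 to U+D7FF, U+E000 to U+FFFD, or U+10000 to U+10FFFF. Letters, digits, colons, periods, equals signs, and square brackets are all inside those ranges; the C0 control characters other than tab, line feed, and carriage return are outside them.
A character-by-character filter over those ranges removes every invalid character in a single pass, so resetting the parser and re-parsing the document repeatedly is not needed. The final string is the input text with the invalid characters stripped.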
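As a quick check on the idea of stripping invalid characters before parsing, here is a minimal sketch that uses a regular expression instead of a hand-written loop. The class name XmlCharFilterDemo and the helpers strip and parses are hypothetical names chosen for this illustration; the pattern is assumed to mirror the XML 1.0 Char ranges, and the parser is whatever SAXParserFactory returns by default.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.regex.Pattern;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of the original answer.
class XmlCharFilterDemo {
    // Matches any code point outside the XML 1.0 Char ranges.
    private static final Pattern INVALID_XML_CHARS = Pattern.compile(
            "[^\\x09\\x0A\\x0D\\x20-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]");

    // Remove every character the pattern flags as invalid.
    static String strip(String text) {
        return INVALID_XML_CHARS.matcher(text).replaceAll("");
    }

    // Returns true if the default SAX parser accepts the string as well-formed XML.
    static boolean parses(String xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException | IOException | ParserConfigurationException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String dirty = "<msg>Hello,\u0000 world!\u000B</msg>";
        System.out.println("dirty parses:   " + parses(dirty));
        System.out.println("cleaned parses: " + parses(strip(dirty)));
        System.out.println("cleaned text:   " + strip(dirty));
    }
}
```

With the parser bundled in the JDK, the raw string should be rejected and the cleaned one accepted, which illustrates why the filtering has to happen before parsing, not inside a handler callback.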