RFC v3 XML Issues
- John Levine, Temporary RFC Series Project Manager
3 Nov 2020
We have three general categories of issues: stabilizing and documenting the v3 grammar, managing the costs and speed of the v3 editing process, and managing the output formats.
XML Vocabulary and Style Guide
The XML dialect that xml2rfc currently accepts differs somewhat from the one specified in RFC 7991. The differences are documented in three places: the implementation notes, which describe changes through July 2020; the RNG file in each version of xml2rfc; and a descriptive file output by xml2rfc. As of xml2rfc 3.0.0 we have stopped changing the grammar.
The author of RFC 7991 and I are slowly updating 7991bis to document the actual vocabulary that xml2rfc accepts and that RFCs use. As of xml2rfc 3.0.0, the tool has an option to generate an XML description of the vocabulary, which I have spliced into the draft, so all of the elements are present but some have no description.
A key role of the RFC Series Editor (RSE) has been to support the RFC Production Center (RPC) and the xml2rfc developer, Henrik Levkowetz, in triaging incoming change requests for the v3 XML and the RFC Style Guide and deciding how those are addressed. This could mean saying 'no', or any combination of changing the XML of an I-D, changing the v3 XML schema, and changing the Style Guide.
My role doesn’t include this responsibility and that has left a gap with no clear management of these decisions. Consequently, the RSOC has formed an "RFC XML and Style Guide change management team" to fill this gap as a temporary measure while the IAB process to determine the future role of an RSE progresses. This team is made up of Henrik Levkowetz as the developer, me as the Temporary RFC Series Project Manager (TRFCSPM), Robert Sparks as the Tools Team PM and Peter Saint-Andre as an RSOC representative. The role of the team will be to discuss and approve all changes to the RFC XML and Style Guide.
There are a number of changes to the XML Vocabulary and Style Guide that were implemented after the departure of the RSE and before the introduction of the “RFC XML and Style Guide change management team”. These changes need to be reviewed by the new change management team to decide if they stay in place or are rolled back. Once this project has been completed, 7991bis can be finalised.
Document production and SLA
RFCs are edited in XML, then run through xml2rfc to produce the three output formats: unpaginated text, HTML, and PDF. RPC staff proofread one output, inserting any edits in the XML, and then do a side-by-side review with the other output versions, checking for visual discrepancies. The RPC also carefully checks typically problematic areas such as lists and tables.
After a year using xml2rfc, the RPC is still finding a few formatting problems where the XML is fine, but the output looks bad. These are generally fixed with changes to the xml2rfc rendering code rather than any change to the XML or the content of the RFC.
For the PDF output, xml2rfc first creates an HTML version and then renders the HTML into a PDF using an external library. This sometimes causes ugly output due to the limits of HTML’s cascading style sheets. Since none of the output formats are canonical, the results are treated as good enough for now.
v3 XML and the SLA
The introduction of the v3 XML led to a temporary suspension of the SLA while the impact of the switch was understood. As it currently stands, the impact is a permanent increase in workload: the RPC is spending significant time and effort adding many of the new v3 tags because authors are not adding them themselves. This in turn means that more resources are needed to maintain the same SLA targets as before the introduction of the v3 XML.
The “RFC XML and Style Guide change management team” has been asked to work with the RPC to identify if it is possible to agree a minimal subset of the v3 XML, which limits the tags that the RPC adds to documents thereby reducing the workload.
At the same time, a survey is being prepared to go to document authors to understand more about why they do not submit their files in native XML v3 with all of the tags added.
RFCs are currently rendered in three formats: unpaginated text, HTML, and PDF.
Recent PDF rerendering
The PDF versions of RFCs are in a profile called PDF/A-3U that is intended to be bitrot-resistant, and have the original XML embedded as an attachment. We do this by taking the PDF output from xml2rfc and running it through a commercial package from Callas software to do the embedding and conversion to PDF/A. We discovered in mid-2020 that the published PDFs were not PDF/A, due to a coding error in the production script. We regenerated the PDFs from the existing xml2rfc output as PDF/A and advised the community. This did not change the appearance or contents of the PDFs.
Over the years, the number of renderings of RFCs, both “official” and “unofficial”, has grown and looks likely to continue to grow as technology develops. For example, it is now relatively easy for someone to create a new HTML rendering simply by applying new CSS to the official HTML version. We might also see people producing their own PDFs with page numbers in the table of contents and with footers, as those are no longer in the “official” versions.
This suggests that the probability of confusion over what is an official version and what is not is likely to increase and it may be appropriate to act now to mitigate the impact of that confusion.
Since it seems unlikely that we’ll be doing a lot of regenerating, date stamps on the generated files should be adequate. The HTML and PDF output have internal tokens identifying the version of xml2rfc and the time they were created but the text versions do not.
However, if we are concerned that confusion over different versions of the same rendering will become a problem, then we may wish to consider including specific identification data as part of the rendered output as a means of disambiguation. For example, the output could identify the publisher, the tool that generated it, and the time it was generated.
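As an illustration of what such identification data might look like, here is a minimal Python sketch that builds a one-line stamp from a publisher name, a tool name and version, and a timestamp. The function name, the wording of the line, and the sample values are all assumptions made for illustration; this is not something xml2rfc emits today.

```python
from datetime import datetime, timezone


def identification_line(publisher: str, tool: str, tool_version: str,
                        generated: datetime) -> str:
    """Build a one-line provenance stamp for a rendered file.

    Hypothetical format: the field order and wording are assumptions,
    not an existing xml2rfc convention.
    """
    # Normalise to UTC so stamps from different machines are comparable.
    stamp = generated.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return f"Rendered by {publisher} using {tool} {tool_version} at {stamp}"


# Sample values chosen only for illustration.
line = identification_line(
    "RFC Editor", "xml2rfc", "3.0.0",
    datetime(2020, 11, 3, 12, 0, tzinfo=timezone.utc),
)
print(line)
```

A line like this could be placed in a comment or metadata field of each output format, including the text version, which currently carries no such token.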
Changing the XML of published RFCs
Until now, RFCs once published have been immutable. The switch to XML as the canonical format has introduced some new issues that may require us to take a more nuanced approach that maintains the immutability of the text of an RFC while allowing for limited circumstances under which the formatting and/or semantic markup of an RFC can change after publication.
What follows is not a plan but an explanation of the two key issues that are suggesting a need to refine the concept of immutability.
Stable v3 tidy up
There are two reasons we could update or republish some or all of the existing XML RFCs at some point when we have a stable v3 XML. One is that the “RFC XML and Style Guide change management team” may decide to back out or deprecate some of the vocabulary changes, so we would need to change any RFCs that use them. The other is that some of the changes are not used consistently. For example, a new <toc> element identifies the table of contents. That’s reasonable and we may keep it, but for some reason RFC 8651 does not have the <toc> element while all the other RFCs do. If we keep the element, we should add it to RFC 8651; if we don’t, we should delete it from all the others.
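Inconsistencies of this kind can be found mechanically. The following Python sketch, using only the standard library, splits a set of XML documents into those that contain a given element and those that do not. The helper names and the two tiny sample documents are hypothetical and stand in for real RFC XML files.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple


def has_element(xml_text: str, tag: str) -> bool:
    """Return True if any descendant of the root has the given tag."""
    root = ET.fromstring(xml_text)
    return root.find(f".//{tag}") is not None


def split_by_element(docs: Dict[str, str], tag: str) -> Tuple[List[str], List[str]]:
    """Partition document names into (with tag, without tag)."""
    with_tag: List[str] = []
    without_tag: List[str] = []
    for name, xml_text in docs.items():
        (with_tag if has_element(xml_text, tag) else without_tag).append(name)
    return with_tag, without_tag


# Hypothetical, heavily simplified documents for illustration only.
docs = {
    "doc-a": "<rfc><front><toc/></front></rfc>",
    "doc-b": "<rfc><front/></rfc>",
}
with_toc, without_toc = split_by_element(docs, "toc")
print("with <toc>:", with_toc)
print("without <toc>:", without_toc)
```

Run over the full set of published XML files, a report like this would show which documents need to change under either decision.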
Once we have this all sorted out with a finalized RFC 7991bis, we can adjust xml2rfc and the existing XML RFCs to agree with the final vocabulary, and then verify that the text hasn’t changed (automatic tools should help here). This might involve mechanically editing the XML to back out deprecated elements, after which the RPC would republish the RFCs by rerunning them through the latest xml2rfc. Rerunning xml2rfc is a mechanical process that takes a few hours to reprocess all XML RFCs to date.
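One way an automatic check of this kind could work is to strip the markup from the old and new XML and compare what remains. The Python sketch below does that with the standard library, collapsing whitespace so that reindentation does not count as a change. The function names and the `<oldtag>` element are hypothetical, and the check ignores attribute values, so it is a first-pass filter rather than a complete verification of the rendered output.

```python
import xml.etree.ElementTree as ET


def text_content(xml_text: str) -> str:
    """Return the document's character data with markup removed
    and runs of whitespace collapsed to single spaces."""
    root = ET.fromstring(xml_text)
    return " ".join("".join(root.itertext()).split())


def same_text(before: str, after: str) -> bool:
    """True if two XML documents carry the same text, ignoring markup."""
    return text_content(before) == text_content(after)


# <oldtag> stands in for a hypothetical deprecated element.
before = "<rfc><section><oldtag>Hello</oldtag> world.</section></rfc>"
after = "<rfc><section>Hello world.</section></rfc>"
edited = "<rfc><section>Hello there.</section></rfc>"

print(same_text(before, after))   # markup removed, text preserved
print(same_text(before, edited))  # text itself was altered
```

A comparison of the regenerated text output against the published text output would complement this, since it also catches changes introduced by the renderer.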
Retroactive semantic markup
The v3 XML introduced a number of semantic markup elements, which the renderers also use for formatting. For example, <sourcecode> can now be used to mark up source code, in preference to <artwork>. It is likely that new semantic markup that does not change the meaning of the text will be agreed at some point in the future. We may also choose to use RDF-compliant semantic markup with standard document ontologies, rather than our own private semantic space. It is not inconceivable that at some point we may need to consider whether our decision that RFCs are immutable applies to the content, or the content and markup, or the published PDF/A file, keeping in mind that the PDF/A includes a copy of the XML.