The current specification lists a large number of possible fields. Many are not used in typical scenarios.
The first field is a canonicalized URL. I.e. an URL with trivial elements (such as protocol) removed so that equivalent URLs end up with the same canonical URL here. This serves as the primary search key.
The only problem with this is that searching for content in all subdomains is not possible without scanning the entire CDX. This is because the subdomain comes before the domain. Instead, we should use a SURT (Sort-friendly URI Reordering Transform) form of the canonical URL instead. SURT URLs turn the domain/sub-domain structure around, making such queries fairly straightforward. There is essentially no downside to doing this and, in fact, a number of CDXs have been built in this manner, regardless of any "formal" standardization (as there isn't really any formal standard).
I suggest that any revised CDX format mandate the use of SURT URLs for the first field. Furthermore, we should utilize the correct SURT format. In most (probably all) current CDXs with SURT URLs, an annoying mistake has been made where the closing comma is missing. An URL that should read:
The protocol prefix has been removed as unnecessary along with the opening ellipse.
The second field should remain the timestamp with whatever precision is available in the ARC/WARC. I.e. an w3c-iso8601 of varying accuracy as per this proposed revision the WARC standard (the revision is extremely likely to be included in WARC 1.1).
The third field would remain the original URL.
The fourth field should be a content digest including the hashing algorithm. Presently, this field is missing the algorithm.
The fifth field would be the WARC record type (or a special value to indicate an ARC response record). This is the most significant change as it allows us to capture additional WARC record types (such as metadata and conversion) while also handling the existing fields in a more targeted manner (e.g. response vs revisit). It might be argued that this should be the second field to facilitate searches of a specific record type. I believe that, probably implemented, this field would allow replay tools to effectively surface any content "related" to the URL currently being viewed, a problem that I know many are trying to tackle.
The next two fields would be the WARC (or ARC) filename (this is supposed to be unique) of the file containing the record and offset at which the record exists within the (W)ARC. This is as it works currently. Some would argue for a more expressive resource locator here, but I believe that is best handled be a separate (W)ARC resolution service. Otherwise you may have to substantially rebuild your CDX index just because you moved your (W)ARCs to a new disk or service.
Lastly, there should be a single line JSON "blob" containing record type relevant additional data. For response records, this would include HTTP status code and content type which I've excluded from the "base" fields in the CDX. This part would be significantly more flexible due to the JSON format, allowing us to include optional data where appropriate etc. The full range of possible values is beyond the scope of this blog post.
There is clearly more work to be done on the JSON aspect, plus some adjustments may be necessary to the base data, but I believe that, at minimum, this is the right direction to head in. Of course, this means we have to rebuild all our CDX files in order to implement this. That's a tall order, but the benefits should be more than enough to justify that one-time cost.