String `xsd:double` value crash `toRrdf()` #202

anatoly-scherbakov · 2025-10-02T18:26:55Z

Why

Here's a JSON-LD document.

{
  "@context": {
    "xsd": "http://www.w3.org/2001/XMLSchema#",
    "geoLongitude": "http://www.w3.org/2003/01/geo/wgs84_pos#longitude"
  },
  "@graph": [
    {
      "@id": "http://www.wikidata.org/entity/Q399",
      "geoLongitude": {
        "@type": "xsd:double",
        "@value": "45"
      }
    }
  ]
}

Conversion of this document to a RDF dataset works at JSON-LD playground, but it will crash pyld 2.0.4:

    def _object_to_rdf(self, item, issuer, triples, rdfDirection):
        """
        Converts a JSON-LD value object to an RDF literal or a JSON-LD string
        or node object to an RDF resource.
    
        :param item: the JSON-LD value or node object.
        :param issuer: the IdentifierIssuer for issuing blank node identifiers.
        :param triples: the array of triples to append list entries to.
        :param rdfDirection: for creating datatyped literals.
        :param rdfDirection: for creating datatyped literals.
    
        :return: the RDF literal or RDF resource.
        """
        object = {}
    
        if _is_value(item):
            object['type'] = 'literal'
            value = item['@value']
            datatype = item.get('@type')
    
            # convert to XSD datatypes as appropriate
            if item.get('@type') == '@json':
                object['value'] = canonicalize(value).decode('UTF-8')
                object['datatype'] = RDF_JSON_LITERAL
            elif _is_bool(value):
                object['value'] = 'true' if value else 'false'
                object['datatype'] = datatype or XSD_BOOLEAN
            elif _is_double(value) or datatype == XSD_DOUBLE:
                # canonical double representation
                object['value'] = re.sub(
                    r'(\d)0*E\+?0*(\d)', r'\1E\2',
>                   ('%1.15E' % value))
E               TypeError: must be real number, not str

What

This PR

Creates a unit test demonstrating the problem,
Contains a fix to it.

anatoly-scherbakov · 2025-10-02T18:41:48Z

@BigBlueHat @davidlehn please let me know what you think. Thanks!

davidlehn · 2025-10-31T15:51:00Z

The code style is different. Extra newlines and such. Would be better to match what is there.
None of the other code has type annotations. (This code pre-dates that even existing.) I'm not sure if we want to add that in new code piece by piece or add it all at once later.
Does the test suite not cover this? If not, it should, so all implementations test this case. That might remove the need for the test file.

anatoly-scherbakov · 2025-11-01T15:20:46Z

I agree about importance of maintaining a unified code style. However, both describing such as style and mimicking it, are rather difficult if done informally. An automatic formatter, such as black, is often considered a fit resolution to such issues (even though I dislike certain decisions black does, I have to agree its benefits are worthwhile). We might consider introducing it to the project and to CI.
This is a matter of taste; IMHO type annotations are de facto standard in industry programming in Python, and I tend to use them in every piece of code I write, not even questioning whether I should, I am afraid. A typechecker can be introduced into development workflow and in CI to validate type annotations.
The purpose of the test is to a) illustrate the existence of the issue and b) the ability of the proposed fix to remedy it. Writing a test was a part of my debugging process when I discovered the problem when parsing a document from Wikidata, and I included the test as a material proof.

An extra test should not bring about much harm. It can be deduplicated and, if necessary, merged into the test suite of the API specification later.

davidlehn · 2025-11-13T05:13:30Z

Conversion of this document to a RDF dataset works at JSON-LD playground, but it will crash pyld 2.0.4:

Please don't assume "it works in the json-ld.org playground" means the output is correct! We need the shared tests to be reviewed to ensure they are actually correct. It's entirely possible that the playground js code is doing something wrong and needs fixing too. Better to have the test in the approved shared suite so everyone agrees on behavior.

I agree about importance of maintaining a unified code style. [...]

Great, please update the patch. Looks like it's probably a few blank lines, trailing commas, and some trailing whitespace to fix?

Open an issue and/or PR about using other tooling.

This is a matter of taste; [...]

Again, seems like this patch should match the style of the existing code, which doesn't yet use annotations. Probably best to discuss it in an issue then slowly or bulk add everywhere needed.

The purpose of the test is to a) illustrate the existence of the issue and b) the ability of the proposed fix to remedy it. Writing a test was a part of my debugging process when I discovered the problem when parsing a document from Wikidata, and I included the test as a material proof.

I wasn't questioning the need or reason to have a test. Did you determine if the test suite had anything to cover this? That a test needed to be written seems to imply no. This 100% needs to be in the spec api test suite if that is the case. All implementations need to cover the issue.

I know it's a bit awkward to figure out how to add tests to the spec suite. Naming there has been a mess. But adding something that can get some feedback and adjusted later is not too bad.

At least in the js implementation there are (clunky) features to run single or a group of tests. I use that all the time. It's often as easy as local per-implementation tests and the general tests need to be in the api test suite anyway.

An extra test should not bring about much harm.

It is absolutely harmful to have general tests that only exist in one implementation! Interoperability is what JSON-LD is about. All implementations need to know they function the same on a test suite shared and approved by the community.

It can be deduplicated and, if necessary, merged into the test suite of the API specification later.

Duplication is a problem. It would mix up which tests are Python specific vs ones that need to be added to the API test suite and removed. If it is a duplicate, than please clearly document why and refer to the API test. But what would that be for?

It's now later and it is necessary. Please add it to the spec api test suite.

I also find adding to the test suite sometimes encourages me or others to add more variations. Especially when it's about a handful of triples. Like here, do positive, negative, zero, lots of sig digits, values near maximums and minimums, maybe some non-numbers if it makes sense, etc.

All that being said, thank you for the test and fix. :-)

davidlehn · 2025-11-13T04:31:34Z

lib/pyld/jsonld.py

-                    r'(\d)0*E\+?0*(\d)', r'\1E\2',
-                    ('%1.15E' % value))
-                object['datatype'] = datatype or XSD_DOUBLE
+


Match style and remove appropriate newlines.

davidlehn · 2025-11-13T04:31:47Z

lib/pyld/jsonld.py

+                return {
+                    **object,
+                    'value': _canonicalize_double(value),
+                    'datatype': datatype or XSD_DOUBLE,


Match style and remove trailing commas.

The reason to add trailing commas is to reduce the size of a diff on a subsequent change.

If, below

'datatype': datatype or XSD_DOUBLE

someone would decide another item in the dictionary, — the diff will cover two lines:

the existing line, with an extra comma being added,

and the next line, which has been inserted.

Is there a style guide in this project which would forbid certain style elements?

I am aware of pros/cons of this and other style choices. The code is the style guide. Please match it for now.

For the time being, I will refrain from making any additional changes to this PR until an automatic formatter, which would take care of the code style, is introduced. Otherwise, we are stuck in a discussion about "how I would have done it", which has no resolution. A formatter alleviates such questions altogether. I might look into its introduction later.

davidlehn · 2025-11-13T04:35:52Z

lib/pyld/jsonld.py

-                object['datatype'] = datatype or XSD_DOUBLE
+
+            elif _is_double(value):
+                return {


Why use this construct? Would matching the other code and setting value and datatype work? Or if it is to maybe avoid the IRI conditional below, could style use that style and a return, rather than create a new object. But in that case, everything else here should match. Maybe better to use existing style and refactor it all later if needed.

This is a matter of taste. In the matters of taste, there is no right and wrong.

To my taste, the creation of this object makes it very easy for a person reading the code to understand what will be returned, making it very visual, if you will. Very declarative.

If the community agrees on a formatter rule which bans this kind of expression, I will be fine with removing it.

Match the other code

Looking at the other code, it is not clear which of its mannerisms are

a conscious engineering choice,

or, part of style dictated for the project,

or, an expression of someone's subjective taste.

My opnion is as follows.

Style choices which can be deterministically maintained by linters & formatters can be implemented as project policies.

Style choices which cannot be maintained in such a way should not be enforced at all.

davidlehn · 2025-11-13T04:40:21Z

tests/test_double_to_rdf.py

+
+    def test_offline_pyld_bug_reproduction(self):
+        """Test reproducing the PyLD bug with captured Wikidata data structure."""
+        # This is the exact problematic data structure captured from Wikidata Q399


The comments here and below should be adjust to only say what is happening, if needed at all. Text about a contemporary bug will not make much sense after the issue is fixed. That seems like text for the PR comments, not something that needs to be in the code.

This test was primarily written as an illustration to prove that the bug exists to those who read this PR. Thanks for pointing this out, I will clean it up if/when this PR is being prepared to merge after the more urgent changes. At that point, this test might also require refactoring because it will be running under an updated test system.

davidlehn · 2025-11-13T05:10:33Z

tests/test_double_to_rdf.py

+        data = {
+            "@context": {
+                "xsd": "http://www.w3.org/2001/XMLSchema#",
+                "geoLongitude": "http://www.w3.org/2003/01/geo/wgs84_pos#longitude"
+            },
+            "@graph": [
+                {
+                    "@id": "http://www.wikidata.org/entity/Q399",
+                    "geoLongitude": {
+                        "@type": "xsd:double",
+                        "@value": "45"  # This string number causes the PyLD bug
+                    }
+                }
+            ]
+        }


I think tests should be as minimal as possible. This could inline xsd prefix, but if this were to be slightly expanded to have various values, xsd: is easier to read. How about something like:

Suggested change

data = {

"@context": {

"xsd": "http://www.w3.org/2001/XMLSchema#",

"geoLongitude": "http://www.w3.org/2003/01/geo/wgs84_pos#longitude"

},

"@graph": [

{

"@id": "http://www.wikidata.org/entity/Q399",

"geoLongitude": {

"@type": "xsd:double",

"@value": "45" # This string number causes the PyLD bug

}

}

]

}

data = {

"@context": {

"xsd": "http://www.w3.org/2001/XMLSchema#"

},

"@graph": [

{

"@id": "ex:1",

"ex:p": {

"@type": "xsd:double",

"@value": "45"

}

}

]

}

davidlehn · 2025-11-13T05:14:34Z

tests/test_double_to_rdf.py

+        # The bug was in PyLD's _object_to_rdf method where string values
+        # with @type: "xsd:double" were not being converted to float
+        result = pyld.jsonld.to_rdf(data)
+


Remove trailing whitespace here and elsewhere.

davidlehn · 2025-11-13T05:18:07Z

tests/test_double_to_rdf.py

+Tests for to_rdf functionality, specifically focusing on double/float handling bugs.
+"""
+
+import json


anatoly-scherbakov added 3 commits October 2, 2025 22:23

Add gitignore patterns for JetBrains IDEs & pyenv

94246ff

Test a string xsd:double to RDF conversion

a467fcc

Handle a string xsd:double when converting to RDF

a98377e

anatoly-scherbakov changed the title ~~to rdf crashing on string xsd double~~ String xsd:double value crash toRrdf() Oct 2, 2025

BigBlueHat requested a review from davidlehn October 29, 2025 16:02

mielvds added this to the v2.0.5 milestone Nov 12, 2025

mielvds added the bug label Nov 12, 2025

davidlehn reviewed Nov 13, 2025

View reviewed changes

String xsd:double value crash toRrdf() #202

Are you sure you want to change the base?

String xsd:double value crash toRrdf() #202

Conversation

anatoly-scherbakov commented Oct 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Why

What

Uh oh!

anatoly-scherbakov commented Oct 2, 2025

Uh oh!

davidlehn commented Oct 31, 2025

Uh oh!

anatoly-scherbakov commented Nov 1, 2025

Uh oh!

davidlehn commented Nov 13, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

String `xsd:double` value crash `toRrdf()` #202

String `xsd:double` value crash `toRrdf()` #202

anatoly-scherbakov commented Oct 2, 2025 •

edited

Loading