Skip to content

Conversation

@anatoly-scherbakov
Copy link
Collaborator

@anatoly-scherbakov anatoly-scherbakov commented Oct 2, 2025

Why

Here's a JSON-LD document.

{
  "@context": {
    "xsd": "http://www.w3.org/2001/XMLSchema#",
    "geoLongitude": "http://www.w3.org/2003/01/geo/wgs84_pos#longitude"
  },
  "@graph": [
    {
      "@id": "http://www.wikidata.org/entity/Q399",
      "geoLongitude": {
        "@type": "xsd:double",
        "@value": "45"
      }
    }
  ]
}

Conversion of this document to a RDF dataset works at JSON-LD playground, but it will crash pyld 2.0.4:

    def _object_to_rdf(self, item, issuer, triples, rdfDirection):
        """
        Converts a JSON-LD value object to an RDF literal or a JSON-LD string
        or node object to an RDF resource.
    
        :param item: the JSON-LD value or node object.
        :param issuer: the IdentifierIssuer for issuing blank node identifiers.
        :param triples: the array of triples to append list entries to.
        :param rdfDirection: for creating datatyped literals.
        :param rdfDirection: for creating datatyped literals.
    
        :return: the RDF literal or RDF resource.
        """
        object = {}
    
        if _is_value(item):
            object['type'] = 'literal'
            value = item['@value']
            datatype = item.get('@type')
    
            # convert to XSD datatypes as appropriate
            if item.get('@type') == '@json':
                object['value'] = canonicalize(value).decode('UTF-8')
                object['datatype'] = RDF_JSON_LITERAL
            elif _is_bool(value):
                object['value'] = 'true' if value else 'false'
                object['datatype'] = datatype or XSD_BOOLEAN
            elif _is_double(value) or datatype == XSD_DOUBLE:
                # canonical double representation
                object['value'] = re.sub(
                    r'(\d)0*E\+?0*(\d)', r'\1E\2',
>                   ('%1.15E' % value))
E               TypeError: must be real number, not str

What

This PR

  • Creates a unit test demonstrating the problem,
  • Contains a fix to it.

@anatoly-scherbakov anatoly-scherbakov changed the title to rdf crashing on string xsd double String xsd:double value crash toRrdf() Oct 2, 2025
@anatoly-scherbakov
Copy link
Collaborator Author

@BigBlueHat @davidlehn please let me know what you think. Thanks!

@BigBlueHat BigBlueHat requested a review from davidlehn October 29, 2025 16:02
@davidlehn
Copy link
Member

  • The code style is different. Extra newlines and such. Would be better to match what is there.
  • None of the other code has type annotations. (This code pre-dates that even existing.) I'm not sure if we want to add that in new code piece by piece or add it all at once later.
  • Does the test suite not cover this? If not, it should, so all implementations test this case. That might remove the need for the test file.

@anatoly-scherbakov
Copy link
Collaborator Author

  1. I agree about importance of maintaining a unified code style. However, both describing such as style and mimicking it, are rather difficult if done informally. An automatic formatter, such as black, is often considered a fit resolution to such issues (even though I dislike certain decisions black does, I have to agree its benefits are worthwhile). We might consider introducing it to the project and to CI.

  2. This is a matter of taste; IMHO type annotations are de facto standard in industry programming in Python, and I tend to use them in every piece of code I write, not even questioning whether I should, I am afraid. A typechecker can be introduced into development workflow and in CI to validate type annotations.

  3. The purpose of the test is to a) illustrate the existence of the issue and b) the ability of the proposed fix to remedy it. Writing a test was a part of my debugging process when I discovered the problem when parsing a document from Wikidata, and I included the test as a material proof.

An extra test should not bring about much harm. It can be deduplicated and, if necessary, merged into the test suite of the API specification later.

@mielvds mielvds added this to the v2.0.5 milestone Nov 12, 2025
@mielvds mielvds added the bug label Nov 12, 2025
@davidlehn
Copy link
Member

Conversion of this document to a RDF dataset works at JSON-LD playground, but it will crash pyld 2.0.4:

Please don't assume "it works in the json-ld.org playground" means the output is correct! We need the shared tests to be reviewed to ensure they are actually correct. It's entirely possible that the playground js code is doing something wrong and needs fixing too. Better to have the test in the approved shared suite so everyone agrees on behavior.

  1. I agree about importance of maintaining a unified code style. [...]

Great, please update the patch. Looks like it's probably a few blank lines, trailing commas, and some trailing whitespace to fix?

Open an issue and/or PR about using other tooling.

  1. This is a matter of taste; [...]

Again, seems like this patch should match the style of the existing code, which doesn't yet use annotations. Probably best to discuss it in an issue then slowly or bulk add everywhere needed.

  1. The purpose of the test is to a) illustrate the existence of the issue and b) the ability of the proposed fix to remedy it. Writing a test was a part of my debugging process when I discovered the problem when parsing a document from Wikidata, and I included the test as a material proof.

I wasn't questioning the need or reason to have a test. Did you determine if the test suite had anything to cover this? That a test needed to be written seems to imply no. This 100% needs to be in the spec api test suite if that is the case. All implementations need to cover the issue.

I know it's a bit awkward to figure out how to add tests to the spec suite. Naming there has been a mess. But adding something that can get some feedback and adjusted later is not too bad.

At least in the js implementation there are (clunky) features to run single or a group of tests. I use that all the time. It's often as easy as local per-implementation tests and the general tests need to be in the api test suite anyway.

An extra test should not bring about much harm.

It is absolutely harmful to have general tests that only exist in one implementation! Interoperability is what JSON-LD is about. All implementations need to know they function the same on a test suite shared and approved by the community.

It can be deduplicated and, if necessary, merged into the test suite of the API specification later.

Duplication is a problem. It would mix up which tests are Python specific vs ones that need to be added to the API test suite and removed. If it is a duplicate, than please clearly document why and refer to the API test. But what would that be for?

It's now later and it is necessary. Please add it to the spec api test suite.

I also find adding to the test suite sometimes encourages me or others to add more variations. Especially when it's about a handful of triples. Like here, do positive, negative, zero, lots of sig digits, values near maximums and minimums, maybe some non-numbers if it makes sense, etc.

All that being said, thank you for the test and fix. :-)

r'(\d)0*E\+?0*(\d)', r'\1E\2',
('%1.15E' % value))
object['datatype'] = datatype or XSD_DOUBLE

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Match style and remove appropriate newlines.

return {
**object,
'value': _canonicalize_double(value),
'datatype': datatype or XSD_DOUBLE,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Match style and remove trailing commas.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason to add trailing commas is to reduce the size of a diff on a subsequent change.

If, below

'datatype': datatype or XSD_DOUBLE

someone would decide another item in the dictionary, — the diff will cover two lines:

  • the existing line, with an extra comma being added,
  • and the next line, which has been inserted.

Is there a style guide in this project which would forbid certain style elements?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am aware of pros/cons of this and other style choices. The code is the style guide. Please match it for now.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the time being, I will refrain from making any additional changes to this PR until an automatic formatter, which would take care of the code style, is introduced. Otherwise, we are stuck in a discussion about "how I would have done it", which has no resolution. A formatter alleviates such questions altogether. I might look into its introduction later.

object['datatype'] = datatype or XSD_DOUBLE

elif _is_double(value):
return {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why use this construct? Would matching the other code and setting value and datatype work? Or if it is to maybe avoid the IRI conditional below, could style use that style and a return, rather than create a new object. But in that case, everything else here should match. Maybe better to use existing style and refactor it all later if needed.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a matter of taste. In the matters of taste, there is no right and wrong.

To my taste, the creation of this object makes it very easy for a person reading the code to understand what will be returned, making it very visual, if you will. Very declarative.

If the community agrees on a formatter rule which bans this kind of expression, I will be fine with removing it.

Match the other code

Looking at the other code, it is not clear which of its mannerisms are

  • a conscious engineering choice,
  • or, part of style dictated for the project,
  • or, an expression of someone's subjective taste.

My opnion is as follows.

  • Style choices which can be deterministically maintained by linters & formatters can be implemented as project policies.
  • Style choices which cannot be maintained in such a way should not be enforced at all.


def test_offline_pyld_bug_reproduction(self):
"""Test reproducing the PyLD bug with captured Wikidata data structure."""
# This is the exact problematic data structure captured from Wikidata Q399
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comments here and below should be adjust to only say what is happening, if needed at all. Text about a contemporary bug will not make much sense after the issue is fixed. That seems like text for the PR comments, not something that needs to be in the code.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test was primarily written as an illustration to prove that the bug exists to those who read this PR. Thanks for pointing this out, I will clean it up if/when this PR is being prepared to merge after the more urgent changes. At that point, this test might also require refactoring because it will be running under an updated test system.

Comment on lines +23 to +37
data = {
"@context": {
"xsd": "http://www.w3.org/2001/XMLSchema#",
"geoLongitude": "http://www.w3.org/2003/01/geo/wgs84_pos#longitude"
},
"@graph": [
{
"@id": "http://www.wikidata.org/entity/Q399",
"geoLongitude": {
"@type": "xsd:double",
"@value": "45" # This string number causes the PyLD bug
}
}
]
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think tests should be as minimal as possible. This could inline xsd prefix, but if this were to be slightly expanded to have various values, xsd: is easier to read. How about something like:

Suggested change
data = {
"@context": {
"xsd": "http://www.w3.org/2001/XMLSchema#",
"geoLongitude": "http://www.w3.org/2003/01/geo/wgs84_pos#longitude"
},
"@graph": [
{
"@id": "http://www.wikidata.org/entity/Q399",
"geoLongitude": {
"@type": "xsd:double",
"@value": "45" # This string number causes the PyLD bug
}
}
]
}
data = {
"@context": {
"xsd": "http://www.w3.org/2001/XMLSchema#"
},
"@graph": [
{
"@id": "ex:1",
"ex:p": {
"@type": "xsd:double",
"@value": "45"
}
}
]
}

# The bug was in PyLD's _object_to_rdf method where string values
# with @type: "xsd:double" were not being converted to float
result = pyld.jsonld.to_rdf(data)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove trailing whitespace here and elsewhere.

Tests for to_rdf functionality, specifically focusing on double/float handling bugs.
"""

import json
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants