WIP: Python buffer (PEP 3118) implementation#70

Open
jakebolewski wants to merge 5 commits into JuliaPy:master from jakebolewski:jcb/pybuffer

Conversation

@jakebolewski
Collaborator

This is very much a work in progress branch to explore replacing PyCall's numpy dependency with Python Buffers / memory views (#38, and JuliaLang/IJulia.jl#49). Link to PEP proposal: http://legacy.python.org/dev/peps/pep-3118/
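As background on what the buffer protocol exposes, Python's built-in memoryview surfaces the same PEP 3118 fields (format, shape, strides, contiguity) that this branch reads from the C-level Py_buffer struct. A quick stdlib-only illustration, no numpy required:

```python
import struct

# Pack three C doubles and view them through the buffer protocol.
data = bytearray(struct.pack("3d", 1.0, 2.0, 3.0))
m = memoryview(data).cast("d")

print(m.format)        # 'd'  (struct-style format string)
print(m.shape)         # (3,)
print(m.strides)       # (8,)
print(m.c_contiguous)  # True
m[1] = 42.0            # writes through to the underlying bytearray
print(struct.unpack("3d", bytes(data)))  # (1.0, 42.0, 3.0)
```

These are exactly the fields (format, shape, strides) that the Julia-side PyBuffer wrapper needs to interpret.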

Note: to get something working I've ignored, for the time being, important things like proper reference counting and Python initialization. I have also removed the numpy code to make testing easier.

I want to get some feedback and to bring up some questions / problems I can foresee in implementing the rest of this. This pull request currently only works for Python versions above 3.2, as the layout of the buffer structure changed in 3.3. Python initialization is deferred until runtime, so is it best to defer PyBuffer type creation until runtime as well? Now with faster startup times, is there a reason not to initialize libpython when the module is first loaded? Another wrinkle is that buffer support has only been backported as far as Python 2.7. The other option is to keep the pointer opaque and copy the data upon buffer creation.

Currently the buffer implementation matches the functionality of the numpy interface (most of the code is largely the same, with few modifications) under the assumption that the buffers are contiguous. Python's buffer interface allows for non-contiguous buffers, so this will have to be implemented. Should there be a distinction between the two PyArray wrapper types? There have been recent discussions about Julia's Array type hierarchy reflecting the underlying data layout.
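For a concrete picture of the non-contiguous case: slicing a memoryview with a step produces exactly such a buffer, where the strides no longer match the itemsize (stdlib-only sketch):

```python
data = bytes(range(8))
m = memoryview(data)
s = m[::2]           # every other byte: still zero-copy
print(s.tolist())    # [0, 2, 4, 6]
print(s.strides)     # (2,) -- itemsize is 1, so this view is non-contiguous
print(s.contiguous)  # False
```

A wrapper that assumes contiguity would read the wrong bytes here, which is why the strides field has to be honored.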

The biggest advantage of utilizing Python's buffer interface is that it supports heterogeneous structure types. This will allow for conversions between numpy's record arrays and Julia arrays of immutable types. Parsing the buffer format specification is going to take a bit more work. The main question I have is how to best generate the types for these buffers. For instance, upon being passed a numpy record array specifying some structure, do we create the requisite type on the fly, or look to the global environment and try to match an existing type? I would be interested in any thoughts on how this should work.
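For reference, PEP 3118 format strings for such records extend the struct module's syntax, and the stdlib can already parse the unnamed-field subset, which hints at what a Julia-side parser must handle (numpy's named-field extensions like T{...} go beyond what struct accepts):

```python
import struct

# A record of (int32, float64), like a simple numpy record array element.
fmt = "=id"            # '=' requests standard sizes and no alignment padding
itemsize = struct.calcsize(fmt)
print(itemsize)        # 12

buf = struct.pack(fmt, 7, 2.5) * 3   # three packed records back to back
records = list(struct.iter_unpack(fmt, buf))
print(records)         # [(7, 2.5), (7, 2.5), (7, 2.5)]
```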

The last thing to do would be to implement the buffer interface for Julia's array types to support the Julia -> Python direction. What is the best way to do this?

@stevengj
Member

Hi Jake, thanks for doing this!

I don't think non-contiguous buffers should be handled differently. As long as it is strided data, it falls under the same DenseArray abstract class (or whatever we decide to call it in 0.3).

The reason I don't initialize when first loading the module is to allow more freedom in how Python is initialized. But maybe I should switch to doing that purely via environment variables?

I would generate a separate type for each heterogeneous structure, but cache the types. This way, PyCall will return consistent types for multiple calls with the same data, which I think is important. I don't think we should look at the global environment for matching structures, as that could lead to very surprising and inconsistent results.
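A minimal sketch of that caching policy, in Python terms with hypothetical names: generate one record type per distinct buffer description and memoize it, so repeated conversions of identically-described data yield the identical type:

```python
from collections import namedtuple

_record_type_cache = {}

def record_type_for(format_key, field_names):
    """Return a cached record type for a given buffer format description."""
    key = (format_key, tuple(field_names))
    if key not in _record_type_cache:
        _record_type_cache[key] = namedtuple(
            "Record_%d" % len(_record_type_cache), field_names)
    return _record_type_cache[key]

T1 = record_type_for("=id", ["x", "y"])
T2 = record_type_for("=id", ["x", "y"])
print(T1 is T2)   # True: multiple calls with the same data give one type
```

The cache key is the full structural description, so distinct layouts get distinct types while repeats are shared.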

To implement Julia arrays as Python buffers, we'll want to implement a new Python type that implements the requisite method slots. See, for example, jl_IOType in io.jl.

Comment thread src/buffer.jl
Member


If we stick with the current pyinitialize pattern for now, you could always define PyBuffer at initialization time by calling eval.

Collaborator Author


Ok, that seems like the least disruptive change.

Member


@jakebolewski, why do you say that the obj field is not in the struct for versions ≤ 3.2?

I just looked in the Python 2.7.6 Include/object.h header file, and Py_buffer is defined as

/* Py3k buffer interface */
typedef struct bufferinfo {
    void *buf;
    PyObject *obj;        /* owned reference */
    Py_ssize_t len;
    Py_ssize_t itemsize;  /* This is Py_ssize_t so it can be                    
                             pointed to by strides in simple case.*/
    int readonly;
    int ndim;
    char *format;
    Py_ssize_t *shape;
    Py_ssize_t *strides;
    Py_ssize_t *suboffsets;
    Py_ssize_t smalltable[2];  /* static store for shape and strides of         
                                  mono-dimensional buffers. */
    void *internal;
} Py_buffer;

which seems identical except for the smalltable field (which is missing in Python 3.4). However, we can just add some padding to the structure for compatibility, since we don't actually access those fields ourselves.
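To illustrate the padding idea, one can mirror both layouts with ctypes and measure the difference; the fields are taken from the headers quoted above, and this is only a model of the struct, not the live Py_buffer:

```python
import ctypes

Py_ssize_t = ctypes.c_ssize_t
common = [
    ("buf", ctypes.c_void_p),
    ("obj", ctypes.c_void_p),
    ("len", Py_ssize_t),
    ("itemsize", Py_ssize_t),
    ("readonly", ctypes.c_int),
    ("ndim", ctypes.c_int),
    ("format", ctypes.c_char_p),
    ("shape", ctypes.POINTER(Py_ssize_t)),
    ("strides", ctypes.POINTER(Py_ssize_t)),
    ("suboffsets", ctypes.POINTER(Py_ssize_t)),
]

class PyBuffer27(ctypes.Structure):   # Python 2.7 layout
    _fields_ = common + [("smalltable", Py_ssize_t * 2),
                         ("internal", ctypes.c_void_p)]

class PyBuffer33(ctypes.Structure):   # Python 3.3+ layout (no smalltable)
    _fields_ = common + [("internal", ctypes.c_void_p)]

# Declaring the larger 2.7 layout everywhere over-allocates by two
# Py_ssize_t, which is harmless since we never touch those trailing fields.
pad = ctypes.sizeof(PyBuffer27) - ctypes.sizeof(PyBuffer33)
print(pad)   # two Py_ssize_t: 16 on a 64-bit platform
```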

Collaborator Author


@stevengj I had a look; that comment is just wrong. Perhaps I wrote it initially based on the documentation: comparing Python 2.7 vs Python 3.4, the obj field is only documented in the latter.

It was a bit speculative at the time, but I really should finish this PR. Bidirectional zero copy buffers would be a really nice feature.

@jakebolewski
Collaborator Author

@stevengj thanks for the comments, I'll work towards integrating your suggestions. This was prompted by doing a bit of profiling with pyjulia. Using memory views over numpy data types delivers a performance boost with Cython; since buffers eliminate object introspection for numpy arrays, I'm hoping the same will be true for PyCall when all is said and done.

@stevengj
Member

stevengj commented Apr 6, 2014

Yes, the NumPy array interface that we are currently using seems like it will have greater overhead than the buffer-interface C calls. Though for operations on large arrays the overhead of introspection shouldn't be an issue.

Comment thread src/buffer.jl Outdated


copied from?

@jakevdp

jakevdp commented May 2, 2014

Just catching up on this - why did you go with buffer objects rather than memoryview objects? Or maybe the thing to do here is to actually define a new JuliaArray object which correctly exposes the buffer protocol?

@jakebolewski
Collaborator Author

PyBuffer should be low-level; it just redefines the Py_buffer struct in Julia so we can pass the Julia object by reference in ccall when working with the buffer C API directly. We could wrap this in a Julia memoryview object similar to how Cython works, but this is really what PyArray is doing. Maybe PyArray should be renamed PyMemoryView, as it is similar to Cython's memory view object.

To expose Julia objects to Python through the buffer protocol, you need to define a new type as you suggested, the way @stevengj did with Julia's IO type in src/io.jl. He would be the best person to comment on how to go about doing this.

@jakevdp

jakevdp commented May 2, 2014

Cool – I spent some time reading through src/io.jl and I think I understand what's going on. It looks like src/pytype.jl defines a bunch of convenience routines and macros for creating a new C-level Python object within Julia itself.

What needs to be done is to use this to create a Python object which exposes the buffer interface and contains a reference to a Julia array object. Once that's done, it's just a matter of creating a PyObject constructor specialization for the Julia array type which returns this custom Python type.

I think that if we went that route, we wouldn't need the current PyArray type at all. We'd just need the above PyObject constructor which returns the appropriate Python-viewable Julia structure.

Does that sound about right?

@stevengj
Member

stevengj commented May 3, 2014

@jakevdp, yes, that sounds about right.

However, that would be for Julia -> Python conversions; PyArray is for Python -> Julia conversions, and we certainly want some Array (eventually DenseArray) subtype that provides a transparent copy-free wrapper around NumPy arrays and Python buffers. I'm fine with using the buffer interface rather than the NumPy array interface for this.

@jakevdp

jakevdp commented May 3, 2014

I think the best way to do numpy->Julia would be via the buffer interface as well. If we make a method whereby anything with a buffer interface can be transparently viewed as a Julia array, then numpy will come for free! I haven't thought much about that direction, but it seems like it wouldn't be too much extra work.
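The same "comes for free" generality can be seen in pure Python: a consumer written against memoryview accepts bytes, bytearray, array.array, mmap objects, numpy arrays, and anything else exposing the protocol, with no copying:

```python
import array

def nbytes_of(obj):
    """Works for ANY producer of the buffer protocol, not just one type."""
    with memoryview(obj) as m:
        return m.nbytes

print(nbytes_of(b"abcd"))                       # 4
print(nbytes_of(bytearray(10)))                 # 10
print(nbytes_of(array.array("d", [1.0, 2.0])))  # 16
```

A Julia-side wrapper driven by PyObject_GetBuffer would inherit the same breadth.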

@stevengj
Member

stevengj commented May 3, 2014

@jakevdp, that's basically what @jakebolewski has done with PyBuffer. Re-implementing PyArray on top of PyBuffer (or replacing the former with the latter) should be simple; I would prefer to keep the name PyArray for this functionality.

@jakevdp

jakevdp commented May 3, 2014

OK - the functionality I desire, though, is to have something like this in Python:

j = julia.Julia()
x = j.run("[1:10]")  # x is now a JuliaArray object, which exposes the buffer interface
xA = np.asarray(x)  # xA is a numpy array view of the julia array

Is this possible with the current PyBuffer/PyArray approach?

@jakevdp

jakevdp commented May 3, 2014

I should specify that I want any julia expression which returns an object compatible with a strided array to have the same behavior.

@stevengj
Member

stevengj commented May 3, 2014

@jakevdp, that's the Julia -> Python direction, and yes, that requires defining a new type like in io.jl. (Straightforward, but a bit tedious, and it requires some care because of the dangerous nature of such low-level coding.)

@jakevdp

jakevdp commented May 3, 2014

Do you see any way to have the Python->Julia direction and Julia->Python direction using the same framework? I don't think I've quite wrapped my mind around the whole problem yet.

@stevengj
Member

stevengj commented May 3, 2014

No, the two directions generally require different code. If you look at the PyCall source, you'll see that for every converted type T there are two functions: a PyObject(x::T) function (Julia→Python) and a convert(::Type{T}, o::PyObject) function (Python→Julia).
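The shape of that two-function-per-type scheme can be sketched in Python (hypothetical names; PyCall itself uses Julia's multiple dispatch rather than dictionaries):

```python
# One converter per direction, registered per type; this mirrors PyCall's
# PyObject(x::T) and convert(::Type{T}, o::PyObject) pairing.
to_py = {}    # Julia -> Python direction
from_py = {}  # Python -> Julia direction

to_py[list] = lambda xs: tuple(xs)   # stand-in conversion for illustration
from_py[list] = lambda o: list(o)

obj = to_py[list]([1, 2, 3])   # "convert to Python"
back = from_py[list](obj)      # "convert back to Julia"
print(obj, back)               # (1, 2, 3) [1, 2, 3]
```

The two tables share a key (the type) but no code, which is the point being made here.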

@jakevdp

jakevdp commented May 4, 2014

Just had a thought on this: what if the JuliaArray object contains a constructor which uses Python's buffer interface to create the internal Julia array? Then the single object structure could handle both directions for a very general set of Python array-like objects.

@stevengj
Member

stevengj commented May 4, 2014

@jakevdp, whether you put the function in PyObject(x::DenseArray) or in PyArray(x::DenseArray), you still have to implement separate functions for Julia→Python and Python→Julia conversions. But you need the former in any case, because a PyObject constructor is needed in PyCall for every Julia type that is going to get passed to Python.

(Also, in Julia, functions don't really "belong" to objects in the way that they do in an OO language like Python.)

@jakevdp

jakevdp commented May 5, 2014

I understand that it's two distinct functions that are needed - the point I was making was that perhaps you could take advantage of Python's buffer interface for both directions, rather than re-implementing the concept within the PyObject constructor for only numpy arrays. Then rather than requiring a numpy array to create a Julia array via PyObject, you'd be able to create the Julia array from any Python object which defines the buffer interface. It seems like that would be much more general and much more useful in the long run.

@stevengj
Member

stevengj commented May 5, 2014

@jakevdp, I think we all agree that it would be better to make a buffer object for Julia→Python than relying on NumPy.

@jakevdp

jakevdp commented May 5, 2014

@stevengj Yes - but my primary point is that it would also be good to exploit the buffer interface for Python→Julia.

Comment thread src/buffer.jl Outdated


I think that buf and obj are switched: https://docs.python.org/3.3/c-api/buffer.html#Py_buffer


@jakevdp

jakevdp commented May 5, 2014

> @stevengj Yes - but my primary point is that it would also be good to exploit the buffer interface for Python→Julia.

Just looking more closely at this... I see that this is already done in the PyArray object. Sorry for the confusion on that.

Just to be clear, it sounds like what you have in mind is to keep the current PyArray object and create a new Python object (let's say JuliaArray) which implements the Python buffer interface by a means similar to that in src/io.jl. So then there would basically be two object types: PyArray, which is a Julia object that uses the buffer interface to convert Python→Julia, and JuliaArray, which is a Python object (defined in Julia) that uses the buffer interface to convert Julia→Python.

My question here (which may have been lost in my own confusion on things) is this: would it not be simpler to define a single structure which accomplishes both these things? The components at the beginning of the structure could define the Python side of things, as in src/io.jl, and additional pieces could be used to define what Julia needs. Then there could be both Python-side and Julia-side constructors and operations on this unified interface object.

Is there any particular reason to separate the two functionalities rather than taking this unified approach?

@stevengj
Member

stevengj commented May 5, 2014

The problem is that JuliaArray is not a type in Julia, it is a type in Python. Hence it cannot be the same structure as PyArray, and in fact will have virtually no code in common with PyArray.

As an analogy, look at the conversion of Function objects to/from Python callable objects. Converting Python→Julia has literally zero code in common with converting Julia→Python, nor could the two conceivably share any code.

@jakevdp

jakevdp commented May 5, 2014

Ah, OK. I think I'm convinced now 😄 Thanks for the patience.

I've been working today on understanding the Python buffer protocol. There's not much out there, so I'm writing a quick tutorial that I'll put on my blog. Once I've figured that out, I'll take a stab at implementing it in Julia.

turn off np.frombuffer tests and switch to RECORDS buffer protocol support
Only immutable types have C-ABI compatibility in Julia.
Here we make the Py_buffer struct immutable and wrap it with PyBuffer
so we can attach a finalizer for automatic memory management. This will
enable us to reuse the Py_buffer struct for the Julia -> Python buffer
implementation.
…on of the original buffer. Calling asarray preserves this information.
Comment thread src/buffer.jl
Member


Why is the length zero for ndim == 0 ... aren't zero-dimensional arrays normally length 1? Why do you have this check here?

The Python documentation says: "The number of dimensions the memory represents as an n-dimensional array. If it is 0, buf points to a single item representing a scalar."
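For reference, a zero-dimensional buffer really does carry exactly one scalar item; the stdlib can produce one via memoryview.cast with an empty shape:

```python
import struct

raw = struct.pack("d", 3.5)         # one C double: 8 bytes
m = memoryview(raw).cast("d", ())   # cast to a 0-dimensional view
print(m.ndim)    # 0
print(m.shape)   # ()
print(m.nbytes)  # 8: one scalar item, not zero bytes
print(struct.unpack("d", m.tobytes()))  # (3.5,)
```

So treating ndim == 0 as length zero would drop the scalar; its length in items is 1.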

Member


Also, if shape is NULL, then itemsize should be disregarded and assumed to be 1, according to the docs.

@stevengj
Member

stevengj commented Mar 3, 2015

@jakebolewski, I merged a modified subset of this PR in order to fix JuliaPy/PyPlot.jl#118 — basically, my PyObject(::IO) wrappers weren't working because they need to use the buffer interface to get access to the raw bytes in order to implement write.

It would be good to have an updated PR which adds the other stuff I omitted, in particular your new NumPy-free PyArray (probably in a separate pyarray.jl file).

@PallHaraldsson
Contributor

PallHaraldsson commented Oct 15, 2016

"I merged a modified subset of this PR in order to fix JuliaPy/PyPlot.jl#118"

[Because of this and the old comments] I only skimmed this issue; should it still be open?

Is my understanding correct that even without this, PyCall is pretty good (at least the multidimensional aspects)? This issue would just be icing on the cake: not really faster, just losing a dependency? [I see (unrelated) bugs fixed all the time in PyCall, I think, and the README seems to support that it mostly works, rather than the opposite.]

In the source code (can ignore if you want, I'm just trying to understand, so I can maybe help..):

function NpyArray{T<:NPY_TYPES}(a::StridedArray{T}, revdims::Bool) #not needing default =false (as I think not exported)
[..]

PyReverseDims{T<:NPY_TYPES}(a::StridedArray{T}) = NpyArray(a, true)
PyReverseDims(a::BitArray) = PyReverseDims(Array(a))

[doc]
PyReverseDims(a::AbstractArray) #this stray line is puzzling to me.. I thought a function body needed, can't do similar myself..

[PyReverseDims of course comes at an unavoidable performance cost, unless you choose not to flip, which is often OK; it's unclear from my reading of the code whether there's a cost for 1D arrays, if someone uses PyReverseDims out of habit.]
