WIP: Python buffer (PEP 3118) implementation#70
jakebolewski wants to merge 5 commits into JuliaPy:master from
Conversation
Hi Jake, thanks for doing this!

I don't think non-contiguous buffers should be handled differently. As long as it is strided data, it falls under the same case.

The reason I don't initialize when first loading the module is to allow more freedom in how Python is initialized. But maybe I should switch to doing that purely via environment variables?

I would generate a separate type for each heterogeneous structure, but cache the types. This way, PyCall will return consistent types for multiple calls with the same data, which I think is important. I don't think we should look at the global environment for matching structures, as that could lead to very surprising and inconsistent results.

To implement Julia arrays as Python buffers, we'll want to implement a new Python type that implements the requisite method slots. See, for example,
If we stick with the current pyinitialize pattern for now, you could always define PyBuffer at initialization time by calling eval.
Ok, that seems like the least disruptive change.
@jakebolewski, why do you say that the obj field is not in the struct for versions ≤ 3.2?
I just looked in the Python 2.7.6 Include/object.h header file, and Py_buffer is defined as

```c
/* Py3k buffer interface */
typedef struct bufferinfo {
    void *buf;
    PyObject *obj;        /* owned reference */
    Py_ssize_t len;
    Py_ssize_t itemsize;  /* This is Py_ssize_t so it can be
                             pointed to by strides in simple case. */
    int readonly;
    int ndim;
    char *format;
    Py_ssize_t *shape;
    Py_ssize_t *strides;
    Py_ssize_t *suboffsets;
    Py_ssize_t smalltable[2];  /* static store for shape and strides of
                                  mono-dimensional buffers. */
    void *internal;
} Py_buffer;
```

which seems identical except for the smalltable field (which is missing in Python 3.4). However, we can just add some padding to the structure for compatibility, since we don't actually access those fields ourselves.
@stevengj I had a look, and that comment is just wrong. Perhaps I wrote it initially based on the documentation: Python 2.7 vs. Python 3.4, where the obj field is documented only in the latter.
It was a bit speculative at the time, but I really should finish this PR. Bidirectional zero-copy buffers would be a really nice feature.
@stevengj thanks for the comments, I'll work towards integrating your suggestions. This was prompted by doing a bit of profiling with pyjulia. Using memory views over NumPy data types delivers a performance boost with Cython, and I hope the same will be true of PyCall when all is said and done. Using buffers eliminates object introspection for NumPy arrays, so I'm hoping it will give a boost in performance.
Yes, the NumPy array interface that we are currently using seems like it will have greater overhead than the buffer-interface C calls, though for operations on large arrays the overhead of introspection shouldn't be an issue.
Just catching up on this: why did you go with
To expose Julia objects to Python through the buffer protocol, you need to define a new type, like you suggested and like @stevengj did with Julia's IO type in
Cool – I spent some time reading through the code. What needs to be done is to use this to create a Python object which exposes the buffer interface and contains a reference to a Julia array object. Once that's done, it's just a matter of creating a suitable wrapper.

I think that if we went that route, we wouldn't need the current NumPy-based machinery.

Does that sound about right?
@jakevdp, yes, that sounds about right. However, that would be for Julia -> Python conversions.
I think the best way to do NumPy -> Julia would be via the buffer interface as well. If we make a method whereby anything with a buffer interface can be transparently viewed as a Julia array, then NumPy will come for free! I haven't thought much about that direction, but it seems like it wouldn't be too much extra work.
@jakevdp, that's basically what @jakebolewski has done with
OK - the functionality I desire, though, is to have something like this in Python:

```python
j = julia.Julia()
x = j.run("[1:10]")  # x is now a JuliaArray object, which exposes the buffer interface
xA = np.asarray(x)   # xA is a numpy array view of the julia array
```

Is this possible with the current
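(For illustration only: the zero-copy "view" semantics being asked for here, where `np.asarray(x)` would alias the Julia array's memory rather than copy it, can be demonstrated with pure-Python buffer objects, no NumPy or Julia required.)

```python
# A memoryview is a buffer-protocol consumer: it aliases the exporter's
# memory instead of copying it, so writes through the view are visible
# in the original object, and vice versa.
buf = bytearray(b"hello")
view = memoryview(buf)  # zero-copy view over buf's memory

view[0] = ord(b"J")     # write through the view...
print(buf)              # ...and the underlying object sees the change
```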
I should specify that I want any Julia expression which returns an object compatible with a strided array to have the same behavior.
@jakevdp, that's the Julia -> Python direction, and yes, that requires defining a new type like in
Do you see any way to handle the Python -> Julia direction and the Julia -> Python direction with the same framework? I don't think I've quite wrapped my mind around the whole problem yet.
No, the two directions generally require different code. If you look at the PyCall source, you'll see that for every converted type
Just had a thought on this: what if the
@jakevdp, whether you put the function in one place or another, you still need two distinct functions.

(Also, in Julia, functions don't really "belong" to objects in the way that they do in an OO language like Python.)
I understand that it's two distinct functions that are needed; the point I was making was that perhaps you could take advantage of Python's buffer interface for both directions, rather than re-implementing the concept within the
@jakevdp, I think we all agree that it would be better to make a buffer object for Julia→Python than to rely on NumPy.
@stevengj Yes, but my primary point is that it would also be good to exploit the buffer interface for Python→Julia.
I think that buf and obj are switched: https://docs.python.org/3.3/c-api/buffer.html#Py_buffer
That looks like a mistake in the doc: see https://github.com/python/cpython/blob/master/Include/object.h#L178-191
Just looking more closely at this... I see that this is already done in the current branch.

Just to be clear, it sounds like what you have in mind is to keep the current conversion machinery. My question here (which may have been lost in my own confusion on things) is this: would it not be simpler to define a single structure which accomplishes both of these things? The components at the beginning of the structure could define the Python side of things, ala Py_buffer.

Is there any particular reason to separate the two functionalities rather than taking this unified approach?
The problem is that the two directions really are different operations.

As an analogy, look at the conversion of
Ah, OK. I think I'm convinced now 😄 Thanks for your patience.

I've been working today on understanding the Python buffer protocol. There's not much out there, so I'm writing a quick tutorial that I'll put on my blog. Once I've figured that out, I'll take a stab at implementing it in Julia.
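(For anyone else learning the protocol: most of its fields can be explored from pure Python, since `memoryview` exposes exactly what a PEP 3118 exporter fills in. A small standard-library sketch, using `array` as the exporter:)

```python
import array

# array.array exports its memory via the buffer protocol; memoryview
# consumes it and surfaces the Py_buffer metadata as attributes.
a = array.array("d", [1.0, 2.0, 3.0])
m = memoryview(a)

print(m.format)    # 'd' -- C double, same type codes as the struct module
print(m.itemsize)  # 8
print(m.ndim)      # 1
print(m.shape)     # (3,)
print(m.strides)   # (8,)
print(m.readonly)  # False
```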
turn off np.frombuffer tests and switch to RECORDS buffer protocol support
Only immutable types have C-ABI compatibility in Julia. Here we make the Py_buffer struct immutable and wrap it in PyBuffer so we can attach a finalizer for automatic memory management. This will enable us to reuse the Py_buffer struct for the Julia -> Python buffer implementation.
…on of the original buffer. Calling asarray preserves this information.
Why is the length zero for ndim == 0? Aren't zero-dimensional arrays normally length 1? Why do you have this check here?
The Python documentation says: "The number of dimensions the memory represents as an n-dimensional array. If it is 0, buf points to a single item representing a scalar."
Also, if shape is NULL then itemsize should be disregarded and assumed to be 1, according to the docs.
@jakebolewski, I merged a modified subset of this PR in order to fix JuliaPy/PyPlot.jl#118; basically, my

It would be good to have an updated PR which adds the other stuff I omitted, in particular your new NumPy-free
"I merged a modified subset of this PR in order to fix JuliaPy/PyPlot.jl#118"

[Because of this and old comments] I only scanned this issue; should it still be open? Is my understanding correct that even without it, PyCall is pretty good (at least the multidimensional aspects)? This issue would just be icing on the cake: not really faster, just losing a dependency? [I see (unrelated) bugs fixed all the time in PyCall, I think, and the README seems to support that it mostly works, rather than the opposite.]

In the source code (can ignore if you want, I'm just trying to understand, so I can maybe help):

```julia
function NpyArray{T<:NPY_TYPES}(a::StridedArray{T}, revdims::Bool)  # not needing default = false (as I think not exported)
PyReverseDims{T<:NPY_TYPES}(a::StridedArray{T}) = NpyArray(a, true)
```

[doc] "PyReverseDims of course comes at an unavoidable performance cost", unless you choose not to flip, which is often OK. It's unclear from my reading of the code whether there's a cost for 1-D arrays, if someone uses PyReverseDims out of habit.
This is very much a work-in-progress branch to explore replacing PyCall's NumPy dependency with Python buffers / memory views (#38, and JuliaLang/IJulia.jl#49). Link to the PEP proposal: http://legacy.python.org/dev/peps/pep-3118/
Note: to get something working, I've ignored for the time being important things like proper reference counting and Python initialization. I have also removed the included NumPy code to make testing easier.
I want to get some feedback and bring up some questions / problems I can foresee in implementing the rest of this. This pull request currently only works for Python > 3.2, as the layout of the buffer structure changed in 3.3. Python initialization is deferred until runtime, so is it best to defer PyBuffer type creation until runtime as well? Now, with faster startup times, is there a reason not to initialize libpython when first loading the module? Another wrinkle is that buffer support has only been backported to 2.7. The other option is to keep the pointer opaque and do a copy of the data upon buffer creation.
Currently the buffer implementation matches the functionality of the NumPy interface (most of the code is largely the same, with few modifications) under the assumption that the buffers are contiguous. Python's buffer interface allows for non-contiguous buffers, so this will have to be implemented. Should there be a distinction between the two PyArray wrapper types? There have been recent discussions about Julia's Array type hierarchy reflecting the underlying data layout.
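To see what the non-contiguous case looks like from the Python side: slicing a memoryview with a step yields a strided, non-contiguous buffer, exactly the kind of export a contiguous-only wrapper can't represent. An illustrative standard-library sketch:

```python
# Slicing with step 2 keeps the original bytes object's memory but
# reports strides larger than itemsize, so the view is non-contiguous.
m = memoryview(b"0123456789")[::2]

print(bytes(m))      # materializes the strided elements
print(m.strides)     # stride in bytes between elements, here (2,)
print(m.contiguous)  # False -- consumers must honor strides
```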
The biggest advantage of utilizing Python's buffer interface is that it supports heterogeneous structure types. This will allow for conversions between NumPy's record arrays and Julia arrays of immutable types. Parsing the buffer format specification is going to take a bit more work. The main question I have is how best to generate the types for these buffers. For instance, upon being passed a NumPy record array specifying some structure, do we create the requisite type on the fly, or look to the global environment and try to match an existing type? I would be interested in any thoughts anyone might have on how this should work.
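As a sketch of what that format parsing involves: Python's struct module understands the same core type codes that PEP 3118 format strings use (though not the full PEP 3118 grammar, e.g. nested `T{}` structs), so it can illustrate mapping a hypothetical (int32, float64) record format onto sizes and values, which is roughly what a Julia-side parser would do when building an immutable type:

```python
import struct

# "=id" describes one record: a standard-size int (4 bytes) followed by
# a double (8 bytes), with '=' disabling C alignment padding.
fmt = "=id"
print(struct.calcsize(fmt))        # bytes per record: 4 + 8 = 12

# Round-trip one record through the packed representation.
record = struct.pack(fmt, 7, 2.5)
print(struct.unpack(fmt, record))  # (7, 2.5)
```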
The last thing to do would be to implement the buffer interface for Julia's array types, to support the Julia -> Python direction. What is the best way to do this?