Conversation
…ent to os.makedirs
|
I guess you could have a dedicated job dispatching task that does while True:
await sleep(...)
job = await q.get()
asyncio.ensure_future(job)I guess the tricky bit then would be figuring out when to shut everything down... the problem is you don't know ahead of time how many jobs you'll need... |
|
I was thinking that something like This really seems like there should be a ready, nice pattern from somewhere in the concurrent programming universe…isn't part of a server's job being able to handle the case where you don't know ahead of time how many jobs you'll need? |
|
@mpacer, happy to explain the different checking steps of
So 1 and 4 are standard checks that could be pretty easily incorporated into an async workflow, whereas 2 and 3 are checking a different set of files and/or downloading or comparing files than are likely not to be in the temporary directory. |
|
@mpacer Could you resolve the merge conflicts? |
|
I'm not too concerned with the merge conflicts as I don't want this to be merged yet. It's still a long way off & I am going to need a lot more time to think about how to approach this. When I push more commits I'll definitely fix the conflicts, but I'll wait until then so that I fix all the merge conflicts all at once. That's generally how I approach slow moving WIP PRs and merge conflicts since it only matters at the point when it's no longer WIP. |
|
Ok, I didn't know, my criteria was not to open a PR until ready to merge.
…On Dec 9, 2017 4:12 PM, "M Pacer" ***@***.***> wrote:
I'm not too concerned with the merge conflicts as I don't want this to be
merged yet. It's still a long way off & I am going to need a lot more time
to think about how to approach this. When I push more commits I'll
definitely fix the conflicts, but I'll wait until then so that I fix all
the merge conflicts all at once. That's generally how I approach slow
moving WIP PRs and merge conflicts since it only matters at the point when
it's no longer WIP.
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#46 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AAOjta9_10kQG8zQLniXA2kJNXjBpjHFks5s-yHngaJpZM4QxyFx>
.
|
This is a first pass at making the corpus update commands work asynchronously.
There's still a lot I need to do, but this is enough to at least get the ball rolling about how to think about this.
In order to use
async/awaitsyntax it requires python3.5+, so I updated the requirements accordingly.Some of the code I have there originates from the original tutorials I was splicing together to accomplish this… so if something seems to not make sense, please point it out so I can address that non-sensemaking-ness.
Notes to individuals
@eseiver
Explaining `plos_corpus.download_check_and_move`
Could you describe explicitly what the reasoning behind each of the steps of your `plos_corpus.download_check_and_move` function from the perspective of working on an individual file? Right now the approach is to process everything as a batch in steps, but from an async perspective, that actually works against using the available cycles efficiently.Review of `Article.from_bytes`
Also… if I don't use the BytesIO wrapper and read the bytes directly, the object that is created when using et.from_string has different content than what is currently being saved to disk.I don't like the
from_bytesmethod having the ability to write articles to the corpus, but it seems like there should be some way to do that. I wanted to start with minimal surface area on existing objects, so I just included a single method that did everything I want. Nonetheless, I think that there's a lot of API discussions that we'll need to have around this.@njsmith
it would be nice to be able to add tasks from inside the inner loop (`fetch`, based on how the code is structured as of 07ffabc )… That way we could trigger the new corrected article download if it was needed not in a batch after the fact, but all at once. Is the only way to add tasks from an inner coroutine to either use a global variable, define the inner loop as a closure (so it has access to the higher scope list), or to pass it in? It seems like that kind of state management could be handled more cleanly (e.g., a global variable seems like a terrible idea), but I'm not sure how to approach this.