Baseline phase2 #2527
We're trying to minimize changes to the code and want to ensure that we can translate the results to other natural languages (while not requiring translators to do unnecessary work). Signed-off-by: David A. Wheeler <dwheeler@dwheeler.com>
This also improves fix_markdown. Signed-off-by: David A. Wheeler <dwheeler@dwheeler.com>
The script/fix_markdown script isn't ready for mass use. Make that clear, and do some fixups of it so that maybe someday it will be. Signed-off-by: David A. Wheeler <dwheeler@dwheeler.com>
Signed-off-by: David A. Wheeler <dwheeler@dwheeler.com>
Signed-off-by: David A. Wheeler <dwheeler@dwheeler.com>
For now, let's *not* require URLs for any baseline answers. That will make it easy to use as we get started. We can change this later. Signed-off-by: David A. Wheeler <dwheeler@dwheeler.com>
Signed-off-by: David A. Wheeler <dwheeler@dwheeler.com>
The old plan for machine translation was terrible. Here's a better one. Signed-off-by: David A. Wheeler <dwheeler@dwheeler.com>
Signed-off-by: David A. Wheeler <dwheeler@dwheeler.com>
Signed-off-by: David A. Wheeler <dwheeler@dwheeler.com>
As an optimization, detect trivial strings & don't call the full markdown processor to process them. We expect to have many cases where only trivial strings are provided. Signed-off-by: David A. Wheeler <dwheeler@dwheeler.com>
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff            @@
##              main    #2527    +/-   ##
=========================================
  Coverage   100.00%  100.00%
=========================================
  Files           60       60
  Lines         2388     2400    +12
=========================================
+ Hits          2388     2400    +12
Signed-off-by: David A. Wheeler <dwheeler@dwheeler.com>
Signed-off-by: David A. Wheeler <dwheeler@dwheeler.com>
Fix issues and clean up the configuration for code coverage reporting. Signed-off-by: David A. Wheeler <dwheeler@dwheeler.com>
@TonyLHansen @SecurityCRob - here's the pull request that implements baseline phase 2. This yanks the data from the baseline site, loads it into our system, and creates the necessary database fields. It doesn't let you see or edit those fields - that's phase 3, where we finally get to see some actual results :-).
As part of this process I've completely changed the documentation on how I plan to handle machine translation of natural language text. Basically, if there's no human translation, we'll use a machine translation from an LLM, and have the LLM double-check its work. We'll create those missing translations incrementally. Machine translation is NOT as good as human translation, but it'll make the material understandable to many more people.
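The fallback described above could be sketched roughly as follows. This is an illustration under assumptions, not the project's implementation: `get_translation`, the dictionary layout, and the `llm_translate` / `llm_review` callables are all hypothetical names, and no real LLM API is called.

```python
from typing import Callable, Dict, Tuple

Key = Tuple[str, str]  # (segment key, locale), a hypothetical layout


def get_translation(
    key: str,
    locale: str,
    english: Dict[str, str],
    human: Dict[Key, str],
    machine: Dict[Key, str],
    llm_translate: Callable[[str, str], str],
    llm_review: Callable[[str, str, str], bool],
) -> str:
    """Prefer a human translation, then a reviewed machine translation."""
    source = english[key]
    if locale == "en":
        return source
    if (key, locale) in human:
        return human[(key, locale)]
    if (key, locale) not in machine:
        # Missing translations are created incrementally, one segment at a time.
        candidate = llm_translate(source, locale)
        # Second pass: ask the model to check its own work before storing it.
        if not llm_review(source, candidate, locale):
            return source  # Rejected: fall back to the English text.
        machine[(key, locale)] = candidate
    return machine[(key, locale)]
```

Passing the two LLM steps in as callables keeps the sketch testable with fakes and leaves the choice of model open.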
Signed-off-by: David A. Wheeler <dwheeler@dwheeler.com>
Signed-off-by: David A. Wheeler <dwheeler@dwheeler.com>
Signed-off-by: David A. Wheeler <dwheeler@dwheeler.com>
Thanks for the update, David. Lots of good work you’re doing.
I’ve definitely gotten different responses when translating text using a couple different LLMs. Perhaps it might be useful to involve multiple LLMs in some way.
Would it be possible to include a note on the bottom of pages that were machine-translated indicating that such happened, along with a link to how someone can help make such translations better?
The problem is that we don't translate a "page". A page typically has hundreds of text segments, and we translate many individual segments of text. Some segments will be human-translated, some machine-translated. Machine translation isn't perfect. Human translation is better when we can get it, but it's not perfect either.
Maybe we need to add something to our footer. Something like, "We provide this material in many natural languages, using a combination of human and machine translation. If there is an error, the English version governs. We [welcome proficient speakers] willing to help perform translations." and include a link from the bracketed text to a URL describing our translation approach & how people can get involved.
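To illustrate why a per-page "machine-translated" label does not fit, here is a small sketch of a page assembled from individually resolved segments. The function names and data layout are assumptions for illustration; only the footer wording comes from the comment above.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Footer wording taken from the proposal above; in practice it would be
# translated like any other segment.
FOOTER = (
    "We provide this material in many natural languages, using a combination "
    "of human and machine translation. If there is an error, the English "
    "version governs."
)


def resolve(
    key: str,
    locale: str,
    english: Dict[str, str],
    human: Dict[str, Dict[str, str]],
    machine: Dict[str, Dict[str, str]],
) -> Tuple[str, str]:
    """Return (text, source) for one segment: human, then machine, then English."""
    if key in human.get(locale, {}):
        return human[locale][key], "human"
    if key in machine.get(locale, {}):
        return machine[locale][key], "machine"
    return english[key], "english"


def build_page(
    keys: List[str],
    locale: str,
    english: Dict[str, str],
    human: Dict[str, Dict[str, str]],
    machine: Dict[str, Dict[str, str]],
) -> Tuple[List[str], Counter]:
    """Assemble a page segment by segment and tally where each text came from."""
    texts: List[str] = []
    sources: Counter = Counter()
    for key in keys:
        text, source = resolve(key, locale, english, human, machine)
        texts.append(text)
        sources[source] += 1
    texts.append(FOOTER)  # One general note instead of a per-page label.
    return texts, sources
```

A single page can easily mix all three sources, which is why one general footer is simpler than trying to label individual pages.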
Yes, I suppose that's the best we can do. Thanks for considering it.