diff --git a/docs/manual/cellediting.md b/docs/manual/cellediting.md
index c22434ec..dabd289d 100644
--- a/docs/manual/cellediting.md
+++ b/docs/manual/cellediting.md
@@ -57,7 +57,7 @@ You can also convert cells into null values or empty strings. This can be useful
 ## Fill down and blank down {#fill-down-and-blank-down}
-Fill down and blank down are two functions most frequently used when encountering data organized into [records](exploring#row-types-rows-vs-records) - that is, multiple rows associated with one specific entity.
+Fill down and blank down are two functions most frequently used when encountering data organized into [records](exploring.md#rows-vs-records) - that is, multiple rows associated with one specific entity.
 If you receive information in rows mode and want to convert it to records mode, the easiest way is to sort your first column by the value that you want to use as a unique records key, [make that sorting permanent](transforming#edit-rows), then blank down all the duplicates in that column. OpenRefine will retain the first unique value and erase the rest. Then you can switch from “Show as rows” to “Show as records” and OpenRefine will associate rows to each other based on the remaining values in the first column.
@@ -125,7 +125,7 @@ The clustering pop-up window offers you two categories of clustering methods: 6
 **Key collisions** are very fast and can process millions of cells in seconds:
-**Fingerprinting**
+##### Fingerprinting {#fingerprinting}
 Fingerprinting is the least likely to produce false positives, so it’s a good place to start.
It does the same kind of data cleaning behind the scenes that you might think to do manually:
@@ -138,33 +138,33 @@ Fingerprinting is the least likely to produce false positives, so it’s a good
 For an in-depth understanding of fingerprinting, check this [document](../technical-reference/clustering-in-depth)
-**N-gram Fingerprinting**
+##### N-gram Fingerprinting {#n-gram}
 N-gram fingerprinting allows you to set the _n_ value to whatever number you’d like and will create n-grams of _n_ size (after doing some cleaning), alphabetize them, then join them back together into a fingerprint.
-**For example**, a 1-gram fingerprint will simply organize all the letters in the cell into alphabetical order - by creating segments one character in length. A 2-gram fingerprint will find all the two-character segments, remove duplicates, alphabetize them, and join them back together (for example, “banana” generates “ba an na an na,” which becomes “anbana”).
+For example, a 1-gram fingerprint will simply organize all the letters in the cell into alphabetical order - by creating segments one character in length. A 2-gram fingerprint will find all the two-character segments, remove duplicates, alphabetize them, and join them back together (for example, “banana” generates “ba an na an na,” which becomes “anbana”).
 This can help match cells that have typos, or incorrect spaces (such as matching “lookout” and “look out,” which fingerprinting itself won’t identify because it separates words). The higher the _n_ value, the fewer clusters will be identified. With 1-grams, keep an eye out for mismatched values that are near-anagrams of each other (such as “Wellington” and “Elgin Town”).
 For an in-depth understanding of N-gram fingerprinting, check this [document](../technical-reference/clustering-in-depth#n-gram-fingerprint)
-**Phonetic Clustering**
+##### Phonetic Clustering {#phonetic-clustering}
 The next four methods are phonetic algorithms: they identify letters that sound the same when pronounced out loud, and assess text values based on that (such as knowing that a word with an “S” might be a mistype of a word with a “Z”). They are great for spotting mistakes made by not knowing the spelling of a word or name after hearing it spoken aloud.
-**Metaphone3 Fingerprinting**
+##### Metaphone3 Fingerprinting {#metaphone3-fingerprinting}
 Metaphone3 fingerprinting is an English-language phonetic algorithm. For example, “Reuben Gevorkiantz” and “Ruben Gevorkyants” share the same phonetic fingerprint in English.
-**Cologne Fingerprinting**
+##### Cologne Fingerprinting {#cologne-fingerprinting}
 Cologne fingerprinting is another phonetic algorithm, but for German pronunciation.
-**Daitch-Mokotoff**
+##### Daitch-Mokitoff {#daitch-mokotoff}
 Daitch-Mokotoff is a phonetic algorithm for Slavic and Yiddish words, especially names.
-**Baider-Morse**
+##### Baider-Morse {#baider-morse}
 Baider-Morse is a version of Daitch-Mokotoff that is slightly more strict.
@@ -182,13 +182,13 @@ We recommend setting the block number to at least 3, and then increasing it if y
 **Note** that bigger block values will take much longer to process, while smaller blocks may miss matches. Increasing the radius will make the matches more lax, as bigger differences will be clustered.
-**Levenshtein Distance**
+#### Levenshtein Distance {#levenshtein-distance}
 Levenshtein distance counts the number of edits required to make one value perfectly match another. As in the key collision methods above, it will do things like change uppercase to lowercase, fix whitespace, change special characters, etc.
Each character that gets changed counts as 1 “distance.” “New York” and “newyork” have an edit distance value of 3 (“N” to “n”; “Y” to “y”; remove the space). It can do relatively advanced edits, such as understanding the distance between “M. Makeba” and “Miriam Makeba” (5), but it may create false positives if these distances are greater than other, simpler transformations (such as the one-character distance to “B. Makeba,” another person entirely).
-**PPM (Prediction by Partial Matching)**
+#### PPM {#ppm}
 PPM (Prediction by Partial Matching) uses compression to see whether two values are similar or different. In practice, this method is very lax even for small radius values and tends to generate many false positives, but because it operates at a sub-character level it is capable of finding substructures that are not easily identifiable by distances that work at the character level. So it should be used as a “last resort” clustering method. It is also more effective on longer strings than on shorter ones.
diff --git a/docs/manual/columnediting.md b/docs/manual/columnediting.md
index 754a61c6..373b4f8a 100644
--- a/docs/manual/columnediting.md
+++ b/docs/manual/columnediting.md
@@ -6,17 +6,17 @@ sidebar_label: Column editing
 ## Overview {#overview}
-Column editing contains some of the most powerful data-improvement methods in OpenRefine. The operations in the Edit column menu involve using one column of data to add entirely new columns and fields to your dataset.
+Column editing contains some of the most powerful data-improvement methods in OpenRefine. The operations in the Edit column menu involve using one column of data to add entirely new columns and fields to your dataset.
 ## Splitting or joining {#splitting-or-joining}
-Many users find that they frequently need to make their data more granular: for example, splitting a “Firstname Lastname” column into two columns, one for first names and one for last names. The reverse is also often true: you may have several columns of category values that you want to join into one “category” column.
-.
+Many users find that they frequently need to make their data more granular: for example, splitting a “Firstname Lastname” column into two columns, one for first names and one for last names. The reverse is also often true: you may have several columns of category values that you want to join into one “category” column.
+
 ### Split into several columns {#split-into-several-columns}
 ![A screenshot of the settings window for splitting columns.](/img/columnsplit.png)
-You can find this operation at Edit columnSplit into several columns.... Splitting one column into several columns requires you to identify the character, string lengths, or evaluating expression you want to split on. Just like [splitting multi-valued cells into rows](cellediting#split-multi-valued-cells), splitting cells into multiple columns will remove the separator character or string you indicate. Splitting by lengths will discard any information that comes after the specified total length.
+You can find this operation at Edit columnSplit into several columns.... Splitting one column into several columns requires you to identify the character, string lengths, or evaluating expression you want to split on. Just like [splitting multi-valued cells into rows](cellediting#split-multi-valued-cells), splitting cells into multiple columns will remove the separator character or string you indicate. Splitting by lengths will discard any information that comes after the specified total length.
 You can also specify a maximum number of new columns to be made: separator characters after this limit will be ignored, and the remaining characters will end up in the last column.
@@ -26,45 +26,45 @@ New columns will be named after the original column, with a number: “Location
 ![A screenshot of the settings window for joining columns.](/img/columnjoin.png)
-You can join columns by selecting Edit columnJoin columns.... All the columns currently in your dataset will appear in the pop-up window. You can select or un-select all the columns you want to join, and drag columns to put them in the order you want to join them in. You will define a separator character (optional) and define a string to insert into empty cells (nulls).
+You can join columns by selecting Edit columnJoin columns.... All the columns currently in your dataset will appear in the pop-up window. You can select or un-select all the columns you want to join, and drag columns to put them in the order you want to join them in. You will define a separator character (optional) and define a string to insert into empty cells (nulls).
-The joined data will appear in the column you originally selected, or you can create a new column for this content and specify a name. You can delete all the columns that were used in this join operation.
+The joined data will appear in the column you originally selected, or you can create a new column for this content and specify a name. You can delete all the columns that were used in this join operation.
 ## Add column based on this column {#add-column-based-on-this-column}
-Selecting Edit columnAdd column based on this column... will open up an [expressions](expressions) window where you can transform the data from this column (using `value`), or write a more complex expression that takes information from any number of columns or from external sources.
+Selecting Edit columnAdd column based on this column... will open up an [expressions](expressions) window where you can transform the data from this column (using `value`), or write a more complex expression that takes information from any number of columns or from external sources.
 Expressions used in this operation will rely on your knowledge of variables. You can learn more in the [Expressions section on variables](expressions#variables).
-The simplest way to use this operation is simply leave the default `value` in the expression field, to create an exact copy of your column. For a column of [reconciled data](reconciling), you can use the variable `cell` instead, to copy both the original string and the existing reconciliation data. This will include matched values, candidates, and new items.
+The simplest way to use this operation is simply leave the default `value` in the expression field, to create an exact copy of your column. For a column of [reconciled data](reconciling), you can use the variable `cell` instead, to copy both the original string and the existing reconciliation data. This will include matched values, candidates, and new items.
 One useful expression is to create a column based on concatenating (merging) two other columns. Select either of the source columns, choose Edit columnAdd column based on this column..., name your new column, and use the following format in the expression window:
-```
+```grel
 cells["Column 1"].value + cells["Column 2"].value
 ```
 If your column names do not contain spaces, you can use the following format instead:
-```
+```grel
 cells.Column1.value + cells.Column2.value
 ```
 If you are in records mode instead of rows mode, you can concatenate using the following format:
-```
+```grel
 row.record.cells.Column1.value + row.record.cells.Column2.value
 ```
-You may wish to add separators or spaces, or modify your input during this operation with more advanced expressions.
+You may wish to add separators or spaces, or modify your input during this operation with more advanced expressions.
 ## Add column by fetching URLs {#add-column-by-fetching-urls}
-Through the Add column by fetching URLs function, OpenRefine supports the ability to fetch HTML or data from web pages or services. In this operation you will be building URL strings based on your column of data, by using `value` to insert a relevant substring. Your chosen column needs to contains parts of paths to valid HTML pages or files online.
+Through the Add column by fetching URLs function, OpenRefine supports the ability to fetch HTML or data from web pages or services. In this operation you will be building URL strings based on your column of data, by using `value` to insert a relevant substring. Your chosen column needs to contains parts of paths to valid HTML pages or files online.
 If you have a column of URLs and want to fetch the information that they point to, you can simply run the expression as `value`. If your column has, for example, unique identifiers for Wikidata entities (numerical values starting with Q), you can download the JSON-formatted metadata about each entity with
-```
+```grel
 "https://www.wikidata.org/wiki/Special:EntityData/" + value + ".json"
 ```
@@ -72,14 +72,16 @@ or whatever metadata format you prefer. Information about the format options in
 ![A screenshot of the settings window for fetching URLs.](/img/fetchingURLs.png)
-This service is more useful when getting metadata files instead of HTML, but you may wish to work with a page’s entire HTML contents and then parse out information from that.
+This service is more useful when getting metadata files instead of HTML, but you may wish to work with a page’s entire HTML contents and then parse out information from that.
 :::caution
-Be aware that the fetching process can take quite some time and that servers may not want to fulfill hundreds or thousands of page requests in seconds. Fetching allows you to set a “throttle delay” which determines the amount of time between requests. The default is 5 seconds per row in your dataset (5000 milliseconds). We recommend leaving this at 1000 or greater.
+Be aware that the fetching process can take quite some time and that servers may not want to fulfill hundreds or thousands of page requests in seconds. Fetching allows you to set a “throttle delay” which determines the amount of time between requests. The default is 5 seconds per row in your dataset (5000 milliseconds). We recommend leaving this at 1000 or greater.
 :::
 Note the following:
+
 * Before pressing “OK,” copy and paste a URL or two from the preview and test them in another browser tab to make sure they work.
+
 * In some situations you may need to set [HTTP request headers](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers). To set these, click the small “Show” button next to “HTTP headers to be used when fetching URLs” in the settings window. The authorization credentials get logged in your operation history in plain text, which may be a security concern for you. You can set the following request headers:
     * [User-Agent](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent)
     * [Accept](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Accept)
@@ -89,25 +91,25 @@ Note the following:
 When OpenRefine attempts to fetch information from a web service, it can fail in a variety of ways. The following information is meant to help troubleshoot and fix problems encountered when using this function.
-First, make sure that your fetching operation is storing errors (check “store error”). Then run the fetch and look at the error messages.
+First, make sure that your fetching operation is storing errors (check “store error”). Then run the fetch and look at the error messages.
 **“HTTP error 403 : Forbidden”** can be simply down to you not having access to the URL you are trying to use. If you can access the same URL with your browser, the remote site may be blocking OpenRefine because it doesn't recognize its request as valid.
Changing the [User-Agent](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent) request header may help. If you believe you should have access to a site but are “forbidden,” you may wish to contract the administrators.
-**“HTTP error 404 : Not Found”** indicates that the information you are requesting does not exist, perhaps due to a problem with your cell values if it only happening in certain rows.
+**“HTTP error 404 : Not Found”** indicates that the information you are requesting does not exist, perhaps due to a problem with your cell values if it only happening in certain rows.
-**“HTTP error 500 : Internal Server Error”** indicates the remote server is having a problem filling your request. You may wish to simply wait and try again later, or double-check the URLs.
+**“HTTP error 500 : Internal Server Error”** indicates the remote server is having a problem filling your request. You may wish to simply wait and try again later, or double-check the URLs.
 **“error: javax.net.ssl.SSLHandshakeException: Received fatal alert: handshake_failure”** can occur when you are trying to retrieve information over HTTPS but the remote site is using an encryption not supported by the Java virtual machine being used by OpenRefine.
-You can check which encryption methods are supported by your OpenRefine/Java installation by using a service such as **How's my SSL**. Add the URL `https://www.howsmyssl.com/a/check` to an OpenRefine cell and run “Add column by fetching URLs” on it, which will provide a description of the SSL client being used.
+You can check which encryption methods are supported by your OpenRefine/Java installation by using a service such as **How's my SSL**. Add the URL `https://www.howsmyssl.com/a/check` to an OpenRefine cell and run “Add column by fetching URLs” on it, which will provide a description of the SSL client being used.
-You can try installing additional encryption supports by installing the [Java Cryptography Extension](https://www.oracle.com/java/technologies/javase-jce8-downloads.html).
-Note that for Mac users and for Windows users with the OpenRefine installation with bundled JRE, these updated cipher suites need to be dropped into the Java install within the OpenRefine application:
+You can try installing additional encryption supports by installing the [Java Cryptography Extension](https://www.oracle.com/java/technologies/javase-jce8-downloads.html).
+Note that for Mac users and for Windows users with the OpenRefine installation with bundled JRE, these updated cipher suites need to be dropped into the Java install within the OpenRefine application:
 * On Mac, it will look something like `/Applications/OpenRefine.app/Contents/PlugIns/jdk1.8.0_60.jdk/Contents/Home/jre/lib/security`.
 * On Windows: `\server\target\jre\lib\security`.
-**“javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed”** can appear when the remote site is using an HTTPS certificate not trusted by your local Java installation. You will need to make sure that the certificate, or (more likely) the root certificate, is trusted.
+**“javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed”** can appear when the remote site is using an HTTPS certificate not trusted by your local Java installation. You will need to make sure that the certificate, or (more likely) the root certificate, is trusted.
 The list of trusted certificates is stored in an encrypted file called `cacerts` in your local Java installation.
This can be read and updated by a tool called “keytool.” You can find directions on how to add a security certificate to the list of trusted certificates for a Java installation [here](http://magicmonster.com/kb/prg/java/ssl/pkix_path_building_failed.html) and [here](http://javarevisited.blogspot.co.uk/2012/03/add-list-certficates-java-keystore.html).
@@ -118,7 +120,7 @@ Note that for Mac users and for Windows users with the OpenRefine installation w
 ## Renaming, removing, and moving {#renaming-removing-and-moving}
-Every column's Edit column dropdown contains options to move it (to the beginning, end, left, or right), rename it, and delete it.
-These operations can be undone, but a removed column cannot be restored later if you keep modifying your data. If you wish to temporarily hide a column, go to [View](sortview#view)Collapse this column instead.
+Every column's Edit column dropdown contains options to move it (to the beginning, end, left, or right), rename it, and delete it.
+These operations can be undone, but a removed column cannot be restored later if you keep modifying your data. If you wish to temporarily hide a column, go to [View](sortview#view)Collapse this column instead.
-Be cautious about moving columns in [records mode](cellediting#rows-vs-records): if you change the first column in your dataset (the key column), your records may change in unintended ways.
+Be cautious about moving columns in [records mode](exploring#rows-vs-records): if you change the first column in your dataset (the key column), your records may change in unintended ways.
diff --git a/docs/manual/exploring.md b/docs/manual/exploring.md
index b3b23c18..2c9cf1ed 100644
--- a/docs/manual/exploring.md
+++ b/docs/manual/exploring.md
@@ -38,12 +38,11 @@ A “null” data type is a special type that means “this cell has no value.
 Changing a cell's data type is not the same operation as transforming its contents.
For example, using a column-wide transform such as TransformCommon transformsTo date may not convert all values successfully, but going to an individual cell, clicking “edit”, and changing the data type can successfully convert text to a date. These operations use different underlying code. Learn more about date formatting and transformations in the next section.
-To transform data from one type to another, see [Transforming data](cellediting#data-type-transforms) for information on using common tranforms, and see [Expressions](expressions) for information on using [toString()](grelfunctions#tostringo-string-format-optional), [toDate()](grelfunctions#todateo-b-monthfirst-s-format1-s-format2-), and other functions.
-
+To transform data from one type to another, see [Transforming data](cellediting#data-type-transforms) for information on using common tranforms, and see [Expressions](expressions) for information on using [toString()](grelfunctions#tostringo-string-format-optional), [toDate()](grelfunctions#todateo-b-monthfirst-s-format1-s-format2-), and other functions.
 ### Dates {#dates}
-A “date” type is created when a column is [transformed into dates](transforming#to-date), when an expression is used to [convert cells to dates](grelfunctions#todateo-b-monthfirst-s-format1-s-format2-) or when individual cells are set to have the data type “date”.
+A “date” type is created when a column is [transformed into dates](cellediting#data-type-transforms), when an expression is used to [convert cells to dates](grelfunctions#todateo-b-monthfirst-s-format1-s-format2-) or when individual cells are set to have the data type “date”.
 Date-formatted data in OpenRefine relies on a number of conversion tools and standards. For something to be considered a date in OpenRefine, it will be converted into the ISO-8601-compliant extended format with time in UTC: YYYY-MM-DDTHH:MM:SSZ.
@@ -66,27 +65,27 @@ You can convert dates into a more human-readable format when you [export your da
 The following table shows some example [date and time formatting styles for the U.S. and French locales](https://docs.oracle.com/javase/tutorial/i18n/format/dateFormat.html):
-|Style |U.S. Locale |French Locale|
+|Style|U.S. Locale|French Locale|
 |---|---|---|
-|Default |Jun 30, 2009 7:03:47 AM |30 juin 2009 07:03:47|
-|Short |6/30/09 7:03 AM |30/06/09 07:03|
-|Medium |Jun 30, 2009 7:03:47 AM |30 juin 2009 07:03:47|
-|Long |June 30, 2009 7:03:47 AM PDT |30 juin 2009 07:03:47 PDT|
-|Full |Tuesday, June 30, 2009 7:03:47 AM PDT |mardi 30 juin 2009 07 h 03 PDT|
+|Default|Jun 30, 2009 7:03:47 AM|30 juin 2009 07:03:47|
+|Short|6/30/09 7:03 AM|30/06/09 07:03|
+|Medium|Jun 30, 2009 7:03:47 AM|30 juin 2009 07:03:47|
+|Long|June 30, 2009 7:03:47 AM PDT|30 juin 2009 07:03:47 PDT|
+|Full|Tuesday, June 30, 2009 7:03:47 AM PDT|mardi 30 juin 2009 07 h 03 PDT|
 ## Rows vs. records {#rows-vs-records}
-A row is a simple way to organize data: a series of cells, one cell per column. Sometimes there are multiple pieces of information in one cell, such as when a survey respondent can select more than one response.
+A row is a simple way to organize data: a series of cells, one cell per column. Sometimes there are multiple pieces of information in one cell, such as when a survey respondent can select more than one response.
-In cases where there is more than one value for a single column in one or more rows, you may wish to use OpenRefine’s records mode: this defines a single record as potentially containing more than one row. From there you can transform cells into multiple rows, each cell containing one value you’d like to work with.
+In cases where there is more than one value for a single column in one or more rows, you may wish to use OpenRefine’s records mode: this defines a single record as potentially containing more than one row. From there you can transform cells into multiple rows, each cell containing one value you’d like to work with.
-Generally, when you import some data, OpenRefine reads that data in row mode. From the project screen, you can convert the project into records mode. OpenRefine remembers this action and will present you with records mode each time you open the project from then on.
+Generally, when you import some data, OpenRefine reads that data in row mode. From the project screen, you can convert the project into records mode. OpenRefine remembers this action and will present you with records mode each time you open the project from then on.
-OpenRefine understands records based on the content of the first column, what we call the “key column.” Splitting a row into a multi-row record will base all association on the first column in your dataset.
+OpenRefine understands records based on the content of the first column, what we call the “key column.” Splitting a row into a multi-row record will base all association on the first column in your dataset.
-If you have more than one column to split out into multiple rows, OpenRefine will keep your data associated with its original record, and associate subgroups based on the top-most row in each group.
+If you have more than one column to split out into multiple rows, OpenRefine will keep your data associated with its original record, and associate subgroups based on the top-most row in each group.
-You can imagine the structure as a tree with many branches, all leading back to the same trunk.
+You can imagine the structure as a tree with many branches, all leading back to the same trunk.
 For example, your key column may be a film or television show, with multiple cast members identified by name, associated to that work. You may have one or more roles listed for each person. The roles are linked to the actors, which are linked to the title.
@@ -107,12 +106,12 @@ For example, your key column may be a film or television show, with multiple cas
 | | Margaret Hamilton | Miss Almira Gulch |
 | | | The Wicked Witch of the West |
-Once you are in records mode, you can still move some columns around, but if you move a column to the beginning, you may find your data becomes misaligned. The new key column will sort into records based on empty cells, and values in the old key column will be assigned to the last row in the old record (the key value sitting above those values).
+Once you are in records mode, you can still move some columns around, but if you move a column to the beginning, you may find your data becomes misaligned. The new key column will sort into records based on empty cells, and values in the old key column will be assigned to the last row in the old record (the key value sitting above those values).
 OpenRefine assigns a unique key behind the scenes, so your records don’t need a unique identifier in the key column. You can keep track of which rows are assigned to each record by the record number that appears under the All column.
-To [split multi-valued cells](transforming#split-multi-valued-cells) and apply other operations that take advantage of records mode, see [Transforming data](transforming).
+To split multi-valued cells and apply other operations that take advantage of records mode, see [Transforming data](transforming).
-Be careful when in records mode that you do not accidentally delete rows based on being blank in one column where there is a value in another.
+Be careful when in records mode that you do not accidentally delete rows based on being blank in one column where there is a value in another.
 This feature is related to [Column Groups](../technical-reference/architecture-before-4#column-groups), which however is incomplete and deprecated.
diff --git a/docs/manual/exporting.md b/docs/manual/exporting.md
index d827b556..4367acc1 100644
--- a/docs/manual/exporting.md
+++ b/docs/manual/exporting.md
@@ -18,20 +18,20 @@ Many of the options only export data in the current view - that is, with current
 To export data from a project, click the Export dropdown button in the top right corner and pick the format you want. Your options are:
-* Tab-separated value (TSV) or Comma-separated value (CSV)
-* HTML-formatted table
-* Excel spreadsheet (XLS or XLSX)
-* Open Document Format (ODF) spreadsheet (ODS)
-* Upload to Google Sheets (requires [Google account authorization](starting#google-sheet-from-drive))
-* [Custom tabular exporter](#custom-tabular-exporter)
-* [SQL statement exporter](#sql-statement-exporter)
-* [Templating exporter](#templating-exporter), which generates JSON by default
+* Tab-separated value (TSV) or Comma-separated value (CSV)
+* HTML-formatted table
+* Excel spreadsheet (XLS or XLSX)
+* Open Document Format (ODF) spreadsheet (ODS)
+* Upload to Google Sheets (requires [Google account authorization](starting#google-sheet-from-drive))
+ [Custom tabular exporter](#custom-tabular-exporter)
+* [SQL statement exporter](#sql-exporter)
+* [Templating exporter](#templating-exporter), which generates JSON by default
 You can also export reconciled data to Wikidata, or export your Wikidata schema for future use with other OpenRefine projects:
-* [Upload edits to Wikidata](wikibase/uploading#uploading-with-openrefine)
-* [Export to QuickStatements](wikibase/uploading#uploading-with-quickstatements) (version 1)
-* [Export Wikidata schema](wikibase/overview#import-and-export-schema)
+* [Upload edits to Wikidata](wikibase/uploading#uploading-with-openrefine)
+* [Export to QuickStatements](wikibase/uploading#uploading-with-quickstatements) (version 1)
+* [Export Wikidata schema](wikibase/overview#import-and-export-schema)
 ### Custom tabular exporter {#custom-tabular-exporter}
diff --git a/docs/manual/expressions.md b/docs/manual/expressions.md
index a86cbffa..d8a9d73f 100644
--- a/docs/manual/expressions.md
+++ b/docs/manual/expressions.md
@@ -17,22 +17,22 @@ You can use expressions in multiple places in OpenRefine to extend data cleanup
 * Transform…
 * Split multi-valued cells…
 * Join multi-valued cells…
-* Edit column:
+* Edit column:
     * Split
     * Join
     * Add column based on this column
     * Add column by fetching URLs.
-In the expressions editor window you have the opportunity to select a supported language. The default is [GREL (General Refine Expression Language)](grel); OpenRefine also comes with support for [Clojure](jythonclojure#clojure) and [Jython](jythonclojure#jython). Extensions may offer support for more expressions languages.
+In the expressions editor window you have the opportunity to select a supported language. The default is [GREL (General Refine Expression Language)](grel); OpenRefine also comes with support for [Clojure](jythonclojure#clojure) and [Jython](jythonclojure#jython). Extensions may offer support for more expressions languages.
 These languages have some syntax differences but support many of the same [variables](#variables). For example, the GREL expression `value.split(" ")[1]` would be written in Jython as `return value.split(" ")[1]`.
-This page is a general reference for available functions, variables, and syntax. For examples that use these expressions for common data tasks, look at the [Recipes section on the wiki](https://github.com/OpenRefine/OpenRefine/wiki/Documentation-For-Users#recipes-and-worked-examples).
+This page is a general reference for available functions, variables, and syntax. For examples that use these expressions for common data tasks, look at the [Recipes section on the wiki](https://github.com/OpenRefine/OpenRefine/wiki/Documentation-For-Users#recipes-and-worked-examples).
## Expressions {#expressions} -There are significant differences between OpenRefine's expressions and the spreadsheet formulas you may be used to using for data manipulation. OpenRefine does not store formulas in cells and display output dynamically: OpenRefine’s transformations are one-time operations that can change column contents or generate new columns. These are applied using variables such as `value` or `cell` to perform the same modification to each cell in a column. +There are significant differences between OpenRefine's expressions and the spreadsheet formulas you may be used to using for data manipulation. OpenRefine does not store formulas in cells and display output dynamically: OpenRefine’s transformations are one-time operations that can change column contents or generate new columns. These are applied using variables such as `value` or `cell` to perform the same modification to each cell in a column. Take the following example: @@ -43,30 +43,31 @@ Take the following example: Were you to apply a transformation to the “friend” column with the expression -``` +```grel value.split(" ")[1] ``` OpenRefine would work through each row, splitting the “friend” values based on a space character. The `value` for row 1 is “John Smith” so the output would be “Smith” (as "[1]" selects the second part of the created output); the `value` for row 2 is “Jane Doe” so the output would be “Doe”. Using variables, a single expression yields different results for different rows. The old information would be discarded; you couldn't get "John" and "Jane" back unless you undid the operation in the [History](running#history-undoredo) tab. -For another example, if you were to create a new column based on your data using the expression `row.starred`, it would generate a column of true and false values based on whether your rows were starred at that moment. 
If you were to then star more rows and unstar some rows, that data would not dynamically update - you would need to run the operation again to have current true/false values. +For another example, if you were to create a new column based on your data using the expression `row.starred`, it would generate a column of true and false values based on whether your rows were starred at that moment. If you were to then star more rows and unstar some rows, that data would not dynamically update - you would need to run the operation again to have current true/false values. Note that an expression is typically based on one particular column in the data - the column whose drop-down menu is first selected. Many variables are created to stand for things about the cell in that “base column” of the current row on which the expression is evaluated. There are also variables about rows, which you can use to access cells in other columns. ## The expressions editor {#the-expressions-editor} -When you select a function that accepts expressions, you will see a window overlay the screen with what we call the expressions editor. +When you select a function that accepts expressions, you will see a window overlay the screen with what we call the expressions editor. ![The expressions editor window with a simple expression: value + 10.](/img/expression-editor.png) -The expressions editor offers you a field for entering your formula and shows you a preview of its transformation on your first few rows of cells. +The expressions editor offers you a field for entering your formula and shows you a preview of its transformation on your first few rows of cells. -There is a dropdown menu from which you can choose an expression language. The default at first is GREL; if you begin working with another language, that selection will persist across OpenRefine. 
Jython and Clojure are also offered with the installation package, and you may be able to add more language support with third-party extensions and customizations. +There is a dropdown menu from which you can choose an expression language. The default at first is GREL; if you begin working with another language, that selection will persist across OpenRefine. Jython and Clojure are also offered with the installation package, and you may be able to add more language support with third-party extensions and customizations. There are also tabs for: -* History, which shows you formulas you’ve recently used from across all your projects -* Starred, which shows you formulas from your History that you’ve starred for reuse -* Help, a quick reference to GREL functions. + +* History, which shows you formulas you’ve recently used from across all your projects +* Starred, which shows you formulas from your History that you’ve starred for reuse +* Help, a quick reference to GREL functions. Starring formulas you’ve used in the past can be helpful for repetitive tasks you’re performing in batches. @@ -82,7 +83,7 @@ If this is your first time working with regex, you may wish to read [this tutori To write a regular expression inside a GREL expression, wrap it between a pair of forward slashes (/) much like the way you would in Javascript. For example, in -``` +```grel value.replace(/\s+/, " ") ``` @@ -91,20 +92,21 @@ the regular _expression_ is `\s+`, and the _syntax_ used to denote a regular exp Do not use slashes to wrap regular expressions outside of a GREL expression. 
On the [GREL functions](grelfunctions) page, functions that support regex will indicate that with a “p” for “pattern.” The GREL functions that support regex are: -* [contains](grelfunctions#containss-sub-or-p) -* [replace](grelfunctions#find-and-replace) -* [find](grelfunctions#find-and-replace) -* [match](grelfunctions#matchs-p) -* [partition](grelfunctions#partitions-s-or-p-fragment-b-omitfragment-optional) -* [rpartition](grelfunctions#rpartitions-s-or-p-fragment-b-omitfragment-optional) -* [split](grelfunctions#splits-s-or-p-sep-b-preservetokens-optional) -* [smartSplit](grelfunctions#smartsplits-s-or-p-sep-optional) + +* [contains](grelfunctions#containss-sub-or-p) +* [replace](grelfunctions#find-and-replace) +* [find](grelfunctions#find-and-replace) +* [match](grelfunctions#matchs-p) +* [partition](grelfunctions#partitions-s-or-p-fragment-b-omitfragment-optional) +* [rpartition](grelfunctions#rpartitions-s-or-p-fragment-b-omitfragment-optional) +* [split](grelfunctions#splits-s-or-p-sep-b-preservetokens-optional) +* [smartSplit](grelfunctions#smartsplits-s-or-p-sep-optional) ### Jython-supported regex {#jython-supported-regex} You can also use [regex with Jython expressions](http://www.jython.org/docs/library/re.html), instead of GREL, for example with a Custom Text Facet: -``` +```grel python import re g = re.search(ur"\u2014 (.*),\s*BWV", value) return g.group(1) ``` @@ -112,7 +114,7 @@ python import re g = re.search(ur"\u2014 (.*),\s*BWV", value) return g.group(1) [Clojure](https://clojure.org/reference/reader) uses the same regex engine as Java, and can be invoked with [re-find](http://clojure.github.io/clojure/clojure.core-api.html#clojure.core/re-find), [re-matches](http://clojure.github.io/clojure/clojure.core-api.html#clojure.core/re-matches), etc. You can use the #"pattern" reader macro as described [in the Clojure documentation](https://clojure.org/reference/other_functions#regex). 
For example, to get the nth element of a returned sequence, you can use the nth function: -``` +```grel clojure (nth (re-find #"\u2014 (.*),\s*BWV" value) 1) ``` @@ -120,8 +122,8 @@ clojure (nth (re-find #"\u2014 (.*),\s*BWV" value) 1) Most OpenRefine variables have attributes: aspects of the variables that can be called separately. We call these attributes “member fields” because they belong to certain variables. For example, you can query a record to find out how many rows it contains with `row.record.rowCount`: `rowCount` is a member field specific to the `record` variable, which is a member field of `row`. Member fields can be called using a dot separator, or with square brackets (`row["record"]`). The square bracket syntax is also used for variables that can call columns by name, for example, `cells["Postal Code"]`. -|Variable |Meaning | -|-|-| +| Variable | Meaning | +| - | - | | `value` | The value of the cell in the current column of the current row (can be null) | | `row` | The current row | | `row.record` | One or more rows grouped together to form a record | @@ -135,8 +137,8 @@ Most OpenRefine variables have attributes: aspects of the variables that can be The `row` variable itself is best used to access its member fields, which you can do using either a dot operator or square brackets: `row.index` or `row["index"]`. -|Field |Meaning | -|-|-| +| Field | Meaning | +| - | - | | `row.index` | The index value of the current row (the first row is 0) | | `row.cells` | The cells of the row, returned as an array | | `row.columnNames` | An array of the column names of the project. This will report all columns, even those with null cell values in that particular row. 
Call a column by number with `row.columnNames[3]` | @@ -146,7 +148,7 @@ The `row` variable itself is best used to access its member fields, which you ca For array objects such as `row.columnNames` you can preview the array using the expressions window, and output it as a string using `toString(row.columnNames)` or with something like: -``` +```grel forEach(row.columnNames,v,v).join("; ") ``` @@ -160,49 +162,50 @@ A `cell` object contains all the data of a cell and is stored as a single object You can use `cell` on its own in the expressions editor to copy all the contents of a column to another column, including reconciliation information. Although the preview in the expressions editor will only show a small representation (“[object Cell]”), it will actually copy all the cell's data. Try this with Edit Column → Add Column based on this column.... -|Field |Meaning |Member fields | -|-|-|-| +| Field | Meaning | Member fields | +| - | - | - | | `cell` | An object containing the entire contents of the cell | .value, .recon, .errorMessage | | `cell.value` | The value in the cell, which can be a string, a number, a boolean, null, or an error | | | `cell.recon` | An object encapsulating reconciliation results for that cell | See the [reconciliation](expressions#reconciliation) section | -| `cell.errorMessage` | Returns the message of an *EvalError* instead of the error object itself (use value to return the error object) | .value | +| `cell.errorMessage` | Returns the message of an _EvalError_ instead of the error object itself (use value to return the error object) | .value | ### Reconciliation {#reconciliation} Several of the fields here provide the data used in [reconciliation facets](reconciling#reconciliation-facets). You must type `cell.recon`; `recon` on its own will not work.
-|Field|Meaning |Member fields | -|-|-|-| -| `cell.recon.judgment` | A string: either “matched”, "new”, "none” | | -| `cell.recon.judgmentAction` | A string: either "single” or “similar” (or “unknown”) | | -| `cell.recon.judgmentHistory` | A number, the epoch timestamp (in milliseconds) of your judgment | | -| `cell.recon.matched` | A boolean, true if judgment is “matched” | | +| Field | Meaning | Member fields | +| - | - | - | +| `cell.recon.judgment` | A string: either “matched”, "new”, "none” | | +| `cell.recon.judgmentAction` | A string: either "single” or “similar” (or “unknown”) | | +| `cell.recon.judgmentHistory` | A number, the epoch timestamp (in milliseconds) of your judgment | | +| `cell.recon.matched` | A boolean, true if judgment is “matched” | | | `cell.recon.match` | The recon candidate that has been matched against this cell (or null) | .id, .name, .type | | `cell.recon.best` | The highest scoring recon candidate from the reconciliation service (or null) | .id, .name, .type, .score | -| `cell.recon.features` | An array of reconciliation features to help you assess the accuracy of your matches | .typeMatch, .nameMatch, .nameLevenshtein, .nameWordDistance | -| `cell.recon.features.typeMatch` | A boolean, true if your chosen type is “matched” and false if not (or “(no type)” if unreconciled) | | -| `cell.recon.features.nameMatch` | A boolean, true if the cell and candidate strings are identical and false if not (or “(unreconciled)”) | | -| `cell.recon.features.nameLevenshtein` | A number representing the [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance): larger if the difference is greater between value and candidate | | -| `cell.recon.features.nameWordDistance` | A number based on the [word similarity](reconciling#reconciliation-facets) | | +| `cell.recon.features` | An array of reconciliation features to help you assess the accuracy of your matches | .typeMatch, .nameMatch, .nameLevenshtein, .nameWordDistance | +| 
`cell.recon.features.typeMatch` | A boolean, true if your chosen type is “matched” and false if not (or “(no type)” if unreconciled) | | +| `cell.recon.features.nameMatch` | A boolean, true if the cell and candidate strings are identical and false if not (or “(unreconciled)”) | | +| `cell.recon.features.nameLevenshtein` | A number representing the [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance): larger if the difference is greater between value and candidate | | +| `cell.recon.features.nameWordDistance` | A number based on the [word similarity](reconciling#reconciliation-facets) | | | `cell.recon.candidates` | An array of the top 3 candidates (default) | .id, .name, .type, .score | -The `cell.recon.candidates` and `cell.recon.best` objects have a few deeper fields: `id`, `name`, `type`, and `score`. `type` is an array of type identifiers for a list of candidates, or a single string for the best candidate. +The `cell.recon.candidates` and `cell.recon.best` objects have a few deeper fields: `id`, `name`, `type`, and `score`. `type` is an array of type identifiers for a list of candidates, or a single string for the best candidate. Arrays such as `cell.recon.candidates` and `cell.recon.candidates.type` can be joined into lists and stored as strings with something like: -``` + +```grel forEach(cell.recon.candidates,v,v.name).join("; ") ``` ### Record {#record} -A `row.record` object encapsulates one or more rows that are grouped together, when your project is in records mode. You must call it as `row.record`; `record` will not return values. +A `row.record` object encapsulates one or more rows that are grouped together, when your project is in records mode. You must call it as `row.record`; `record` will not return values. 
-|Field|Meaning | -|-|-| +| Field | Meaning | +| - | - | | `row.record.index` | The index of the current record (starting at 0) | | `row.record.cells` | An array of the [cells](#cells) in the given column of the record | | `row.record.fromRowIndex` | The row index of the first row in the record | | `row.record.toRowIndex` | The row index of the last row in the record + 1 (i.e. the next record) | | `row.record.rowCount` | A count of the number of rows in the record | -For example, you can facet by number of rows in each record by creating a Custom Numeric Facet (or a Custom Text Facet) and entering `row.record.rowCount`. +For example, you can facet by number of rows in each record by creating a Custom Numeric Facet (or a Custom Text Facet) and entering `row.record.rowCount`. diff --git a/docs/manual/facets.md b/docs/manual/facets.md index 2319a5aa..af4afa08 100644 --- a/docs/manual/facets.md +++ b/docs/manual/facets.md @@ -6,23 +6,22 @@ sidebar_label: Facets ## Overview {#overview} -Facets are one of OpenRefine’s strongest features - that’s where the diamond logo comes from! +Facets are one of OpenRefine’s strongest features - that’s where the diamond logo comes from! Faceting allows you to look for patterns and trends. Facets are essentially aspects or angles of data variance in a given column. For example, if you had survey data where respondents indicated one of five responses from “Strongly agree” to “Strongly disagree,” those five responses make up a text facet, showing how many people selected each option. -Faceted browsing gives you a big-picture look at your data (do they agree or disagree?) and also allows you to filter down to a specific subset to explore it more (what do people who disagree say in other responses?). +Faceted browsing gives you a big-picture look at your data (do they agree or disagree?) and also allows you to filter down to a specific subset to explore it more (what do people who disagree say in other responses?). 
Typically, you create a facet on a particular column. That facet selection appears on the left, in the Facet/Filter tab, and you can click on a displayed facet to view all the records that match. You can also “exclude” the facet, to view every record that does _not_ match, and you can select more than one facet by clicking “include.” - ### An example {#an-example} You can learn about facets and filtering with the following example. You can copy the following table and paste it using the Clipboard method of starting a project if you would like to try it yourself. Check the "Attempt to parse cell text into numbers" option so that you can use numeric faceting. -We collected a list of the [10 most populous cities from Wikidata](https://w.wiki/3Em), using an example query of theirs. We removed the GPS coordinates and added the country. +We collected a list of the [10 most populous cities from Wikidata](https://w.wiki/3Em), using an example query of theirs. We removed the GPS coordinates and added the country. | cityLabel | population | countryLabel | -|-|-|-| +| - | - | - | | Shanghai | 23390000 | People's Republic of China | | Beijing | 21710000 | People's Republic of China | | Lagos | 21324000 | Nigeria | @@ -34,9 +33,9 @@ We collected a list of the [10 most populous cities from Wikidata](https://w.wik | Guangzhou | 13080500 | People's Republic of China | | São Paulo | 12106920 | Brazil | -If we want to see which countries have the most populous cities, we can create a text facet on the “countryLabel” column and OpenRefine will generate a list of all the different strings used in these cells. +If we want to see which countries have the most populous cities, we can create a text facet on the “countryLabel” column and OpenRefine will generate a list of all the different strings used in these cells. -We will see in the sidebar that the countries identified are displayed, along with the number of matches (the “count”). We can sort this list alphabetically or by the count. 
If you sort by count at the top of the facet window, you’ll learn which countries hold the most populous cities. +We will see in the sidebar that the countries identified are displayed, along with the number of matches (the “count”). We can sort this list alphabetically or by the count. If you sort by count at the top of the facet window, you’ll learn which countries hold the most populous cities. |Facet|Count| |---|---| @@ -54,25 +53,25 @@ You’ll see the “10 rows” indicator change to “4 matching rows (10 total) If you want to go back to the original dataset, click Reset All or the small “exclude” text next to the facet. If you want to view the most populous cities in both China and India, click “include” next to each facet. Now you’ll see 5 rows - #1, 2, 5, 8, 9. -We can also explore our data using the population information. In this case, because population is a number, we can create a numeric facet. This will give us the ability to explore by range rather than by exact matching values. +We can also explore our data using the population information. In this case, because population is a number, we can create a numeric facet. This will give us the ability to explore by range rather than by exact matching values. -With the numeric facet, we are given a scale from the smallest to the largest value in the column. We can drag the range minimum and maximum to narrow the results. In this case, if we narrow down to only cities with more than 20 million in population, we get 3 matching rows out of the original 10. +With the numeric facet, we are given a scale from the smallest to the largest value in the column. We can drag the range minimum and maximum to narrow the results. In this case, if we narrow down to only cities with more than 20 million in population, we get 3 matching rows out of the original 10. 
-When you look back at the text facet display of country names, you should see a smaller list with a reduced count: OpenRefine is now displaying the facets of the 3 matching rows, not the total dataset of 10 rows. +When you look back at the text facet display of country names, you should see a smaller list with a reduced count: OpenRefine is now displaying the facets of the 3 matching rows, not the total dataset of 10 rows. -We can combine these facets - say, by narrowing to only the Chinese cities with populations greater than 20 million - simply by clicking in both. You should see 2 matching rows for both these criteria. +We can combine these facets - say, by narrowing to only the Chinese cities with populations greater than 20 million - simply by clicking in both. You should see 2 matching rows for both these criteria. ### Things to know about facets {#things-to-know-about-facets} -When you have facets applied, you will see “matching rows” in the [project grid header](running#project-grid-header). If you click Export and copy your data out of OpenRefine while facets are active, many of the exporting options will only export the matching rows, not all the rows in your project. +When you have facets applied, you will see “matching rows” in the [project grid header](running#the-grid-header). If you click Export and copy your data out of OpenRefine while facets are active, many of the exporting options will only export the matching rows, not all the rows in your project. -OpenRefine has several default facets, which you’ll learn about below. The most powerful facets are the ones designed by you - custom facets, written using [expressions](expressions) to transform the data behind the scenes and help you narrow down to precisely what you’re looking for. +OpenRefine has several default facets, which you’ll learn about below. 
The most powerful facets are the ones designed by you - custom facets, written using [expressions](expressions) to transform the data behind the scenes and help you narrow down to precisely what you’re looking for. Facets are not saved in the project along with the data. But you can save a link to the current state of the application. Find the [Permalink](running#the-project-bar) next to the project’s name. You can modify any facet expression by clicking the “change” button to the right of the column name in the facet sidebar. -Facet boxes that appear in the sidebar can be resized and rearranged. You can drag and drop the title bar of each box to reorder them, and drag on the bottom bar of text facet boxes. +Facet boxes that appear in the sidebar can be resized and rearranged. You can drag and drop the title bar of each box to reorder them, and drag on the bottom bar of text facet boxes. :::info Operations that don't respect facets @@ -86,13 +85,13 @@ Certain operations don't respect facet settings. If you perform any of the follo ## Text facet {#text-facet} -A text facet can be generated on any column with the “text” data type. Select the column dropdown and go to Facet → Text facet. The created facet will be sorted alphabetically, and can be sorted by count. +A text facet can be generated on any column with the “text” data type. Select the column dropdown and go to Facet → Text facet. The created facet will be sorted alphabetically, and can be sorted by count. -A text facet is very simple: it takes the total contents of the cells of the column in question and matches them up. It does no guessing about typos or near-matches. +A text facet is very simple: it takes the total contents of the cells of the column in question and matches them up. It does no guessing about typos or near-matches. -You can edit any entry that appears in the facet display, by hovering over the facet and clicking the “edit” button that appears. You can then type in a new value manually.
This will mass-edit every identical cell in the column. This is a great way to fix typos, whitespace, and other issues that may be affecting the way facets appear. You can also automate the cleanup of facets by using [clustering](transforming#cluster-and-edit): a “Cluster” button is displayed within the facet window. It may be most efficient to cluster cells to one value, and then mass-edit that value to your desired string within the clustering operation window. +You can edit any entry that appears in the facet display, by hovering over the facet and clicking the “edit” button that appears. You can then type in a new value manually. This will mass-edit every identical cell in the column. This is a great way to fix typos, whitespace, and other issues that may be affecting the way facets appear. You can also automate the cleanup of facets by using [clustering](cellediting#cluster-and-edit): a “Cluster” button is displayed within the facet window. It may be most efficient to cluster cells to one value, and then mass-edit that value to your desired string within the clustering operation window. -Each text facet shows up to 2,000 choices by default. You can [increase this limit on the Preferences screen](running#preferences) if you need to, which may slow down your browser. If your applied facet has more choices than the current limit, you'll be offered the option to increase the limit, which will permanently edit that preference for you. +Each text facet shows up to 2,000 choices by default. You can [increase this limit on the Preferences screen](running#preferences) if you need to, which may slow down your browser. If your applied facet has more choices than the current limit, you'll be offered the option to increase the limit, which will permanently edit that preference for you. The choices and counts displayed in each facet can be copied as tab-separated values. To do so, click on the "X choices" link near the top left corner of the facet. 
This can be useful to generate small summary tables of your data. @@ -102,7 +101,7 @@ The choices and counts displayed in each facet can be copied as tab-separated va ![A screenshot of an example numeric facet.](/img/numericfacet.png) -Whereas a text facet groups unique text values into groups, a numeric facet sorts numbers by their range - smallest to biggest. This displays visually as a histogram, and allows you to set a custom facet within that range. You can drag the minimum and maximum range markers to set a range. OpenRefine snaps to some basic equal-sized divisions - 19 in the example set above. +Whereas a text facet groups unique text values into groups, a numeric facet sorts numbers by their range - smallest to biggest. This displays visually as a histogram, and allows you to set a custom facet within that range. You can drag the minimum and maximum range markers to set a range. OpenRefine snaps to some basic equal-sized divisions - 19 in the example set above. You will be offered the option to include blank, non-numeric, and error values in your numeric visualization; these will appear in the visual range as “0” values. @@ -124,79 +123,79 @@ The facet appears with a count of blank cells and those with errors, which can h ## Scatterplot facet {#scatterplot-facet} -A scatterplot is a visual representation of two related sets of numeric data. +A scatterplot is a visual representation of two related sets of numeric data. -You have the option to generate linear scatterplots (where the X and Y axes show continuous increases) or logarithmic scatterplots (where the X and Y axes show exponential or scaled increases). You can also rotate the plot by 45 degrees in either direction, and you can choose the size of the dot indicating a datapoint. You can make these choices in both the preview and in the facet display. 
+You have the option to generate linear scatterplots (where the X and Y axes show continuous increases) or logarithmic scatterplots (where the X and Y axes show exponential or scaled increases). You can also rotate the plot by 45 degrees in either direction, and you can choose the size of the dot indicating a datapoint. You can make these choices in both the preview and in the facet display. -A scatterplot facet can be generated on any column. You require two or more number columns to generate scatterplots. Selecting Facet → Scatterplot facet will create a preview of data plotted from every number-formatted column in your dataset, comparing every column against every other column. Each scatterplot will show in its own square, allowing you to choose which data comparison you would like to analyze further. You can control which columns are on the X and Y axes by rearranging the columns in your dataset. +A scatterplot facet can be generated on any column. You require two or more number columns to generate scatterplots. Selecting Facet → Scatterplot facet will create a preview of data plotted from every number-formatted column in your dataset, comparing every column against every other column. Each scatterplot will show in its own square, allowing you to choose which data comparison you would like to analyze further. You can control which columns are on the X and Y axes by rearranging the columns in your dataset. ![A simple scatterplot of two numeric values.](/img/scatterplot.png) -When you click on your desired square, that two-column comparison will appear in the facets sidebar. From here, you can drag your mouse to draw a rectangle inside the scatterplot, which will narrow down to just the rows matching the points plotted inside that rectangle (as shown by the rectangle inside the square in the image above). This rectangle can be resized by dragging any of the four edges. To draw a new rectangle, simply click and drag your mouse again.
To add more scatterplots to the facet sidebar, re-run this process and select a different square. +When you click on your desired square, that two-column comparison will appear in the facets sidebar. From here, you can drag your mouse to draw a rectangle inside the scatterplot, which will narrow down to just the rows matching the points plotted inside that rectangle (as shown by the rectangle inside the square in the image above). This rectangle can be resized by dragging any of the four edges. To draw a new rectangle, simply click and drag your mouse again. To add more scatterplots to the facet sidebar, re-run this process and select a different square. If you have multiple facets applied, plotted points in your scatterplot displays will be greyed out if they are not part of the current matching data subset. If the rectangle you have drawn within a scatterplot display only includes grey dots, you will see no matching rows. -If you would like to export a scatterplot, OpenRefine will open a new tab with a generated PNG file that you can save. +If you would like to export a scatterplot, OpenRefine will open a new tab with a generated PNG file that you can save. ## Custom text facet {#custom-text-facet} -You may want to explore your textual data with modifications that aren't permanent. Creating custom text facets will load your column into memory, transform the data temporarily, and store those transformations inside the facet. +You may want to explore your textual data with modifications that aren't permanent. Creating custom text facets will load your column into memory, transform the data temporarily, and store those transformations inside the facet. -You can also use custom text facets to analyze numerical data, such as by analyzing a number as a string, or by creating a test that will return “true” and “false” as values. 
+You can also use custom text facets to analyze numerical data, such as by analyzing a number as a string, or by creating a test that will return “true” and “false” as values. -Clicking on Facet → Custom text facet… will bring up an [expressions](expressions) window where you can enter in a GREL, Jython, or Clojure expression to modify how the facet works. +Clicking on Facet → Custom text facet… will bring up an [expressions](expressions) window where you can enter in a GREL, Jython, or Clojure expression to modify how the facet works. -A custom text facet operates just like a [text facet](#text-facet) by default. Unlike a text facet, however, you cannot click “edit” on the facets that appear in the sidebar and change the matching cells in your dataset - because what they display is modified, not the original entries. +A custom text facet operates just like a [text facet](#text-facet) by default. Unlike a text facet, however, you cannot click “edit” on the facets that appear in the sidebar and change the matching cells in your dataset - because what they display is modified, not the original entries. For example, you may wish to analyze only the first word in a text field - perhaps the first name in a column of “[First Name] [Last Name]” entries. In this case, you can tell OpenRefine to facet only on the information that comes before the first space: -``` +```grel value.split(" ")[0] ``` -In this case, `split()` is creating an array of text strings based on every space in the cells ["Firstname", "Lastname"]. Because arrays number their entries starting with 0, we want the first value, so we ask for `[0]`. (Assuming the first name is one word, not something like “Mary Anne.”) We can do the same splitting and ask for the last name with +In this case, `split()` is creating an array of text strings based on every space in the cells ["Firstname", "Lastname"]. Because arrays number their entries starting with 0, we want the first value, so we ask for `[0]`.
(Assuming the first name is one word, not something like “Mary Anne.”) We can do the same splitting and ask for the last name with -``` +```grel value.split(" ")[1] ``` You may want to create a facet that references several columns. For example, let’s say you have two columns, “First Name” and “Last Name”, and you want out how many people have the same initial letter for both names (e.g., Marilyn Monroe, Steven Segal). To do so, create a custom text facet on either column and enter the expression -``` +```grel cells["First Name"].value[0] == cells["Last Name"].value[0] ``` -That expression will look for the first letter (the character at index 0) of each entry and compare them. Then it will facet your rows into “true” and “false.” +That expression will look for the first letter (the character at index 0) of each entry and compare them. Then it will facet your rows into “true” and “false.” -You can learn more about text-modification functions on the [Expressions page](expressions). +You can learn more about text-modification functions on the [Expressions page](expressions). ## Custom numeric facet {#custom-numeric-facet} -You may want to explore your numerical data with modifications that aren't permanent. You can also use custom numeric facets to analyze textual data, such as by getting the length of text strings (with `value.length()`), or by analyzing it as though it were formatted as numbers (with `toNumber(value)`). +You may want to explore your numerical data with modifications that aren't permanent. You can also use custom numeric facets to analyze textual data, such as by getting the length of text strings (with `value.length()`), or by analyzing it as though it were formatted as numbers (with `toNumber(value)`). If you would like to build your own version of a numeric facet, you can use the Custom Numeric Facet option. 
Clicking on FacetCustom Numeric Facet… will bring up an [expressions](expressions) window where you can enter in a GREL, Jython, or Clojure expression to modify how the facet works. A custom numeric facet operates just like a [numeric facet](#numeric-facet) by default. For example, you may wish to create a numeric facet that rounds your value to the nearest integer, enter -``` +```grel round(value) ``` If you have two columns of numbers and for each row you wish to create a numeric facet only on the larger of the two, enter -``` +```grel max(cells["Column1"].value, cells["Column2"].value) ``` If the numeric values in a column are drawn from a power law distribution, then it's better to group them by their logs: -``` +```grel value.log() ``` If the values are periodic you could take the modulus by the period to understand if there's a pattern: -``` +```grel mod(value, 7) ``` @@ -206,31 +205,31 @@ You can learn more about numeric-modification functions on the [Expressions page Customized facets have been added to expand the number of default facets users can apply with a single click. They represent some common and useful functions you shouldn’t have to work out using an [expression](expressions). -All facets that display in the Facet/Filter tab can be edited by clicking on the “change” button to the right of the column title. This brings up the expressions window that will allow you to modify and preview the expression being used. +All facets that display in the Facet/Filter tab can be edited by clicking on the “change” button to the right of the column title. This brings up the expressions window that will allow you to modify and preview the expression being used. 
### Word facet {#word-facet} A Word facet is a simple version of a text facet: it splits up the content of the cells based on spaces, and outputs each character string as a facet: -``` +```grel value.split(" ") ``` -This can be useful for exploring the language used in a corpus, looking for common first and last names or titles, or seeing what’s in multi-valued cells you don’t wish to split up. +This can be useful for exploring the language used in a corpus, looking for common first and last names or titles, or seeing what’s in multi-valued cells you don’t wish to split up. -Word facet is case-sensitive and only splits by spaces, not by line breaks or other natural divisions. +Word facet is case-sensitive and only splits by spaces, not by line breaks or other natural divisions. ### Duplicates facet {#duplicates-facet} A Duplicates facet will return only rows that have non-unique values in the column you’ve selected. It will create a facet of “true” and “false” values - true being cells that are not unique, and “false” being cells that are. The actual expression being used is -``` +```grel facetCount(value, 'value', '[Column]') > 1 ``` Duplicates facets are case-sensitive and you may wish to filter out things like leading and trailing whitespace or other hard-to-see issues. 
You can modify the facet expression, for example, with: -``` +```grel facetCount(trim(toLowercase(value)), 'trim(toLowercase(value))', 'cityLabel') > 1 ``` @@ -240,8 +239,8 @@ Logarithmic scales reduce wide-ranging quantities to more compact and manageable For example, we can look at [this data about the body weight of various mammals](http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_Brain2BodyWeight): -|Species|BodyWeight (kg)| -|---|---| +| Species | BodyWeight (kg) | +| --- | --- | | Newborn_Human | 3.2 | | Adult_Human | 73 | | Pithecanthropus_Man | 70 | @@ -260,20 +259,19 @@ Most values will be clustered in the 0-100 range, but 35,000 is many magnitudes ![A screenshot of a numeric facet first and a numeric log facet second.](/img/numericlogfacet.png) -A 1-bounded numeric log facet can be used if you'd like to exclude all the values below 1 (including zero and negative numbers). +A 1-bounded numeric log facet can be used if you'd like to exclude all the values below 1 (including zero and negative numbers). ### Text-length facet {#text-length-facet} The Text-length facet returns a numerical value for each cell and plots it on a numeric facet chart. The expression used is -``` +```grel value.length() ``` -This can be useful to, for example, look for values that did not successfully split on an earlier split operation, or to validate that data is a certain expected length (such as whether a date in YYYY/MM/DD is eight to ten characters). - -You can also employ a Log of text-length facet that allows you to navigate more easily a wide range of string lengths. This can be useful in the case of web-scraping, where lots of textual data is loaded into single cells and needs to be parsed out. +This can be useful to, for example, look for values that did not successfully split on an earlier split operation, or to validate that data is a certain expected length (such as whether a date in YYYY/MM/DD is eight to ten characters). 
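
The length check described for the Text-length facet can be tried out outside OpenRefine. The following is a minimal sketch in Python, not GREL: the sample values are made up for illustration, and `len()` stands in for the `value.length()` expression the facet uses.

```python
# Illustrative only: Python, not GREL. Sample values are hypothetical.
dates = ["2021/03/04", "2021/3/4", "21/3/4", "2021/03/04 12:00"]

# What a Text-length facet would plot: one length per cell.
lengths = [len(value) for value in dates]

# Cells whose length falls outside the expected eight-to-ten range.
outside_range = [value for value in dates if not 8 <= len(value) <= 10]

print(lengths)        # [10, 8, 6, 16]
print(outside_range)  # ['21/3/4', '2021/03/04 12:00']
```

In OpenRefine itself, the equivalent step would be dragging the numeric facet's range handles so that only lengths outside 8 to 10 remain selected.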
+You can also employ a Log of text-length facet that allows you to navigate more easily a wide range of string lengths. This can be useful in the case of web-scraping, where lots of textual data is loaded into single cells and needs to be parsed out. ### Unicode character-code facet {#unicode-character-code-facet} @@ -285,55 +283,56 @@ This facet creates a numerical chart, which offers you the ability to narrow dow ### Facet by error {#facet-by-error} -An error is a data type created by OpenRefine in the process of transforming data. For example, say you had converted a column to the number data type. If one cell had text characters in it, OpenRefine could either output the original text string unchanged or output an error. If you allow errors to be created, you can facet by them later to search for them and fix them. +An error is a data type created by OpenRefine in the process of transforming data. For example, say you had converted a column to the number data type. If one cell had text characters in it, OpenRefine could either output the original text string unchanged or output an error. If you allow errors to be created, you can facet by them later to search for them and fix them. ![A view of the expressions window with an error converting a string to a number.](/img/error.png) -To store errors in cells, ensure that you have store error selected for the “On error” option in the expressions window. +To store errors in cells, ensure that you have store error selected for the “On error” option in the expressions window. ### Facet by null, empty, or blank {#facet-by-null-empty-or-blank} -Any column can be faceted for [null and/or empty cells](#cell-data-types). These can help you find cells where you want to manually enter content. +Any column can be faceted for [null and/or empty cells](exploring#data-types). These can help you find cells where you want to manually enter content. -“Blank” means both null values and empty values. 
All three facets will generate “true” and “false” facets, “true” being blank. +“Blank” means both null values and empty values. All three facets will generate “true” and “false” facets, “true” being blank. An empty cell is a cell that is set to contain a string, but doesn’t have any characters in it (a zero-length string). This can be left over from an operation that removed characters, or from manually editing a cell and deleting its contents. ### Facet by star or flag {#facet-by-star-or-flag} -Stars and flags offer you the opportunity to mark specific rows for yourself for later focus. Stars and flags persist through closing and opening your project, and thus can provide a different function than using a permalink to persist your facets. Stars and flags can be used in any way you want, although they are designed to help you flag errors and star rows of particular importance. +Stars and flags offer you the opportunity to mark specific rows for yourself for later focus. Stars and flags persist through closing and opening your project, and thus can provide a different function than using a permalink to persist your facets. Stars and flags can be used in any way you want, although they are designed to help you flag errors and star rows of particular importance. -You can manually star or flag rows simply by clicking on the icons to the left of each row. +You can manually star or flag rows simply by clicking on the icons to the left of each row. -You can also apply stars or flags to all matching rows by using the All dropdown menu (on the first column) and selecting Edit rowsStar rows or Flag rows. This will create “true” and “false” facets in the Facet/Filter. These operations will modify all matching rows in your current subset. You can unstar or unflag them as well. +You can also apply stars or flags to all matching rows by using the All dropdown menu (on the first column) and selecting Edit rowsStar rows or Flag rows. 
This will create “true” and “false” facets in the Facet/Filter. These operations will modify all matching rows in your current subset. You can unstar or unflag them as well. -You may wish to create a custom subset of your data through a series of separate faceting activities (rather than successively narrowing down with multiple facets applied). For example, you may wish to: -* apply a facet -* star all the matching rows -* remove that facet -* apply another, unrelated facet -* star all the new matching rows (which will not modify already-starred rows) -* remove that facet -* and then work with all of the cumulative starred rows. +You may wish to create a custom subset of your data through a series of separate faceting activities (rather than successively narrowing down with multiple facets applied). For example, you may wish to: + +* apply a facet +* star all the matching rows +* remove that facet +* apply another, unrelated facet +* star all the new matching rows (which will not modify already-starred rows) +* remove that facet +* and then work with all of the cumulative starred rows. You can also create a text facet on any column with the expression `row.starred` or `row.flagged`. ## Text filter {#text-filter} -Filters allow you to narrow down your data based on whether a given column includes a text string. +Filters allow you to narrow down your data based on whether a given column includes a text string. -When you choose Text filter a box appears in the Facet/Filter tab that allows you to enter in text. Matching rows will narrow dynamically with every character you enter. You can set the search to be case-sensitive or not, and you can use this box to enter in a regular expression. +When you choose Text filter a box appears in the Facet/Filter tab that allows you to enter in text. Matching rows will narrow dynamically with every character you enter. You can set the search to be case-sensitive or not, and you can use this box to enter in a regular expression. 
-For example, you can enter in “side” as a text filter, and it will return all cells in that column containing “side,” “sideways,” “offside,” etc. +For example, you can enter in “side” as a text filter, and it will return all cells in that column containing “side,” “sideways,” “offside,” etc. The text filter field supports [regular expressions](expressions#regular-expressions). For example, you can employ a regular expression to view all properly-formatted emails: -``` +```regex ([a-zA-Z0-9_\-\.\+]+)@([a-zA-Z0-9\-\.]+)\.([a-zA-Z0-9\-]{2,15}) ``` -You can press “invert” on this facet to then see blank cells or invalid email addresses. +You can press “invert” on this facet to then see blank cells or invalid email addresses. -This filter works differently than facets because it is always active as long as it appears in the sidebar. If you “reset” it, you will delete all the text or expression you have entered. +This filter works differently than facets because it is always active as long as it appears in the sidebar. If you “reset” it, you will delete all the text or expression you have entered. You can apply multiple text filters in succession, which will successively narrow your data subset. This can be useful if you apply multiple inverted filters, such as to filter out all rows that respond “yes” or “maybe” and only look at the remaining responses. diff --git a/docs/manual/grel.md b/docs/manual/grel.md index 37bcf076..8644b6e0 100644 --- a/docs/manual/grel.md +++ b/docs/manual/grel.md @@ -8,15 +8,15 @@ sidebar_label: General Refine Expression Language GREL (General Refine Expression Language) is designed to resemble Javascript. 
Formulas use variables and depend on data types to do things like string manipulation or mathematical calculations: -|Example|Output| -|---|---| +| Example | Output | +| --- | --- | | `value + " (approved)"` | Concatenate two strings; whatever is in the cell gets converted to a string first | | `value + 2.239` | Add 2.239 to the existing value (if a number); append text "2.239" to the end of the string otherwise | | `value.trim().length()`     | Trim leading and trailing whitespace of the cell value and then output the length of the result | | `value.substring(7, 10)` | Output the substring of the value from character index 7, 8, and 9 (excluding character index 10) | | `value.substring(13)` | Output the substring from index 13 to the end of the string | -Note that the operator for string concatenation is `+` (not “&” as is used in Excel). +Note that the operator for string concatenation is `+` (not “&” as is used in Excel). Evaluating conditions uses symbols such as `<`, `>`, `*`, `/`, etc. To check whether two objects are equal, use two equal signs (`value=="true"`). @@ -24,19 +24,19 @@ See the [GREL functions page for a thorough reference](grelfunctions) on each fu ## Operators {#operators} -#### Arithmetic Operators {#arithmetic-operators} +### Arithmetic Operators {#arithmetic-operators} Refer [GREL functions page](/docs/manual/grelfunctions#math-functions) for details on Division Operator. -###### Modulus {#modulus} +#### Modulus {#modulus} When using the `%` operator, if both operands are numbers such as `1 % 2` the result will be a whole number. However, if either or both of the operands are floating-point numbers like `1.0 % 2` they will be promoted to floating point and the result will also be in floating-point format. It's important to note that the `%` operator may not behave as expected with floating-point numbers due to precision issues. 
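
The integer versus floating-point behavior described for `%` can be previewed outside OpenRefine. The following is a minimal sketch in Python, not GREL. It assumes that, for non-negative operands, Python's `%` gives the same results as the GREL operator described above, since both represent floating-point values as IEEE 754 doubles; that equivalence is an assumption, not something this page states.

```python
# Illustrative only: Python, not GREL. Assumes non-negative operands.

int_result = 1 % 2        # both operands are integers -> integer result
float_result = 1.0 % 2    # one floating-point operand -> floating-point result

# Precision issue: 0.1 has no exact binary representation, so the
# remainder of 0.3 divided by 0.1 is not exactly 0.
remainder = 0.3 % 0.1

print(type(int_result).__name__, int_result)      # int 1
print(type(float_result).__name__, float_result)  # float 1.0
print(remainder == 0.0)                           # False
```

If a facet or transform depends on an exact remainder, it is usually safer to work with whole numbers (for example, by scaling the values first) than to compare a floating-point remainder against zero.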
-###### Multiplication {#multiplication} +#### Multiplication {#multiplication} The behavior of the `*` operator is nuanced based on the data types of the operands. When both operands are integers such as `1 * 2`, the result is an integer. Conversely, if either or both operands are floating-point numbers the result becomes a floating-point number. You can use simple evaluations such as `3.5 * 2` -#### Relational Operators {#relational-operators} +### Relational Operators {#relational-operators} `==` and `!=` operators are used to assess equality and inequality. For instance, `"a" == "b"` returns false and `"a" != "b"` returns true. When applied to integers, `5 == 5` returns true, while `3 != 3` returns false. @@ -44,19 +44,20 @@ The `<` operator checks if the left operand is less than the right operand, whil #### References {#references} -- [String Concatenation](/docs/manual/grel#basic) +- [String Concatenation](/docs/manual/grel#basics) - [Logical Functions](/docs/manual/grelfunctions#boolean-functions) ## Syntax {#syntax} In GREL, functions can use either of these two forms: -* functionName(arg0, arg1, ...) -* arg0.functionName(arg1, ...) + +- `functionName(arg0, arg1, ...)` +- `arg0.functionName(arg1, ...)` The second form is a shorthand to make expressions easier to read. It simply pulls the first argument out and appends it to the front of the function, with a dot: -|Dot notation |Full notation | -|-|-| +| Dot notation | Full notation | +| - | - | | `value.trim().length()` | `length(trim(value))` | | `value.substring(7, 10)` | `substring(value, 7, 10)` | @@ -64,20 +65,20 @@ So, in the dot shorthand, the functions occur from left to right in the order of The dot notation can also be used to access the member fields of [variables](expressions#variables). 
For referring to column names that contain spaces (anything not a continuous string), use square brackets instead of dot notation: -|Example |Description | -|-|-| +| Example | Description | +| - | - | | `cells.FirstName` | Access the cell in the column named “FirstName” of the current row | | `cells["First Name"]` | Access the cell in the column called “First Name” of the current row | Square brackets can also be used to get substrings and sub-arrays, and single items from arrays: -|Example |Description | -|-|-| +| Example | Description | +| - | - | | `value[1,3]` | A substring of value, starting from character 1 up to but excluding character 3 | | `"internationalization"[1,-2]` | Will return “nternationalizati” (negative indexes are counted from the end) | | `row.columnNames[5]` | Will return the name of the fifth column | -Any function that outputs an array can use square brackets to select only one part of the array to output as a string (remember that the index of the items in an array starts with 0). +Any function that outputs an array can use square brackets to select only one part of the array to output as a string (remember that the index of the items in an array starts with 0). For example, [partition()](grelfunctions#partitions-s-or-p-fragment-b-omitfragment-optional) would normally output an array of three items: the part before your chosen fragment, the fragment you've identified, and the part after. Selecting only the third part with `"internationalization".partition("nation")[2]` will output “alization” (and so will [-1], indicating the final item in the array). @@ -87,93 +88,97 @@ GREL offers controls to support branching and looping (that is, “if” and “ Please note that the GREL control names are case-sensitive: for example, the isError() control can't be called with iserror(). -#### if(e, eTrue, eFalse) {#ife-etrue-efalse} +### if(e, eTrue, eFalse) {#ife-etrue-efalse} Expression e is evaluated to a value. 
If that value is true, then expression eTrue is evaluated and the result is the value of the whole if() expression. Otherwise, expression eFalse is evaluated and that result is the value. Examples: -| Example expression | Result | +| Example expression | Result | | ------------------------------------------------------------------------ | ------------ | | `if("internationalization".length() > 10, "big string", "small string")` | “big string” | -| `if(mod(37, 2) == 0, "even", "odd")` | “odd” | +| `if(mod(37, 2) == 0, "even", "odd")` | “odd” | Nested if (switch case) example: - if(value == 'Place', 'http://www.example.com/Location', +```grel + +if(value == 'Place', 'http://www.example.com/Location', - if(value == 'Person', 'http://www.example.com/Agent', + if(value == 'Person', 'http://www.example.com/Agent', - if(value == 'Book', 'http://www.example.com/Publication', + if(value == 'Book', 'http://www.example.com/Publication', - null))) +null))) +``` -#### with(e1, variable v, e2) {#withe1-variable-v-e2} +### with(e1, variable v, e2) {#withe1-variable-v-e2} Evaluates expression e1 and binds its value to variable v. Then evaluates expression e2 and returns that result. -| Example expression | Result | +| Example expression | Result | | ------------------------------------------------------------------------------------ | ---------- | -| `with("european union".split(" "), a, a.length())` | 2 | -| `with("european union".split(" "), a, forEach(a, v, v.length()))` | [ 8, 5 ] | -| `with("european union".split(" "), a, forEach(a, v, v.length()).sum() / a.length())` | 6.5 | +| `with("european union".split(" "), a, a.length())` | 2 | +| `with("european union".split(" "), a, forEach(a, v, v.length()))` | [ 8, 5 ] | +| `with("european union".split(" "), a, forEach(a, v, v.length()).sum() / a.length())` | 6.5 | -#### filter(e1, v, e test) {#filtere1-v-e-test} +### filter(e1, v, e test) {#filtere1-v-e-test} Evaluates expression e1 to an array. 
Then for each array element, binds its value to variable v, evaluates expression test - which should return a boolean. If the boolean is true, pushes v onto the result array. -| Expression | Result | +| Expression | Result | | ---------------------------------------------- | ------------- | | `filter([ 3, 4, 8, 7, 9 ], v, mod(v, 2) == 1)` | [ 3, 7, 9 ] | -#### forEach(e1, v, e2) {#foreache1-v-e2} +### forEach(e1, v, e2) {#foreache1-v-e2} Evaluates expression e1 to an array. Then for each array element, binds its value to variable v, evaluates expression e2, and pushes the result onto the result array. When e1 is a JSON object, `forEach` iterates over its keys. -| Expression | Result | +| Expression | Result | | ------------------------------------------ | ------------------- | | `forEach([ 3, 4, 8, 7, 9 ], v, mod(v, 2))` | [ 1, 0, 0, 1, 1 ] | -#### forEachIndex(e1, i, v, e2) {#foreachindexe1-i-v-e2} +### forEachIndex(e1, i, v, e2) {#foreachindexe1-i-v-e2} Evaluates expression e1 to an array. Then for each array element, binds its index to variable i and its value to variable v, evaluates expression e2, and pushes the result onto the result array. -| Expression | Result | +| Expression | Result | | ------------------------------------------------------------------------------- | --------------------------- | | `forEachIndex([ "anne", "ben", "cindy" ], i, v, (i + 1) + ". " + v).join(", ")` | 1. anne, 2. ben, 3. cindy | -#### forRange(n from, n to, n step, v, e) {#forrangen-from-n-to-n-step-v-e} +### forRange(n from, n to, n step, v, e) {#forrangen-from-n-to-n-step-v-e} Iterates over the variable v starting at from, incrementing by the value of step each time while less than to. At each iteration, evaluates expression e, and pushes the result onto the result array. -#### forNonBlank(e, v, eNonBlank, eBlank) {#fornonblanke-v-enonblank-eblank} +### forNonBlank(e, v, eNonBlank, eBlank) {#fornonblanke-v-enonblank-eblank} Evaluates expression e. 
If it is non-blank, forNonBlank() binds its value to variable v, evaluates expression eNonBlank and returns the result. Otherwise (if e evaluates to blank), forNonBlank() evaluates expression eBlank and returns that result instead. Unlike other GREL functions beginning with “for,” forNonBlank() is not iterative. forNonBlank() essentially offers a shorter syntax to achieving the same outcome by using the isNonBlank() function within an “if” statement. -#### isBlank(e), isNonBlank(e), isNull(e), isNotNull(e), isNumeric(e), isError(e) {#isblanke-isnonblanke-isnulle-isnotnulle-isnumerice-iserrore} +### isBlank(e), isNonBlank(e), isNull(e), isNotNull(e), isNumeric(e), isError(e) {#isblanke-isnonblanke-isnulle-isnotnulle-isnumerice-iserrore} Evaluates the expression e, and returns a boolean based on the named evaluation. Examples: -| Expression | Result | +| Expression | Result | | ------------------- | ------- | -| `isBlank("abc")` | false | -| `isNonBlank("abc")` | true | -| `isNull("abc")` | false | -| `isNotNull("abc")` | true | -| `isNumeric(2)` | true | -| `isError(1)` | false | -| `isError("abc")` | false | -| `isError(1 / 0)` | true | +| `isBlank("abc")` | false | +| `isNonBlank("abc")` | true | +| `isNull("abc")` | false | +| `isNotNull("abc")` | true | +| `isNumeric(2)` | true | +| `isError(1)` | false | +| `isError("abc")` | false | +| `isError(1 / 0)` | true | Remember that these are controls and not functions: you can’t use dot notation (for example, the format `e.isX()` will not work). ## Constants {#constants} -|Name |Meaning | -|-|-| + +| Name | Meaning | +| - | - | | true | The boolean constant true | | false | The boolean constant false | | PI | From [Java's Math.PI](https://docs.oracle.com/javase/8/docs/api/java/lang/Math.html#PI), the value of pi (that is, 3.1415...) 
| diff --git a/docs/manual/grelfunctions.md b/docs/manual/grelfunctions.md index 81509602..15649b72 100644 --- a/docs/manual/grelfunctions.md +++ b/docs/manual/grelfunctions.md @@ -71,15 +71,15 @@ Date/times without timezone info were interpreted as **local** up until May 2018 ###### startsWith(s, sub) {#startswiths-sub} -Returns a boolean indicating whether s starts with sub. For example, `"food".startsWith("foo")` returns true, whereas `"food".startsWith("bar")` returns false. +Returns a boolean indicating whether s starts with sub. For example, `"food".startsWith("foo")` returns true, whereas `"food".startsWith("bar")` returns false. ###### endsWith(s, sub) {#endswiths-sub} -Returns a boolean indicating whether s ends with sub. For example, `"food".endsWith("ood")` returns true, whereas `"food".endsWith("odd")` returns false. +Returns a boolean indicating whether s ends with sub. For example, `"food".endsWith("ood")` returns true, whereas `"food".endsWith("odd")` returns false. ###### contains(s, sub or p) {#containss-sub-or-p} -Returns a boolean indicating whether s contains sub, which is either a substring or a regex pattern. For example, `"food".contains("oo")` returns true whereas `"food".contains("ee")` returns false. +Returns a boolean indicating whether s contains sub, which is either a substring or a regex pattern. For example, `"food".contains("oo")` returns true whereas `"food".contains("ee")` returns false. You can search for a regular expression by wrapping it in forward slashes rather than quotes: `"rose is a rose".contains(/\s+/)` returns true. startsWith() and endsWith() can only take strings, while contains() can take a regex pattern, so you can use contains() to look for beginning and ending string patterns. 
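
The point about using contains() with a regex to test the beginning or end of a string can be sketched as follows. This is Python, not GREL: the `contains` helper below is a hypothetical stand-in that assumes GREL's contains() succeeds when the pattern matches anywhere in the string, which is what the `/\s+/` example above suggests.

```python
import re


def contains(s, pattern):
    """Hypothetical stand-in for GREL contains(s, /pattern/).

    Returns True if the pattern matches anywhere in s.
    """
    return re.search(pattern, s) is not None


print(contains("rose is a rose", r"\s+"))  # True: whitespace occurs somewhere
print(contains("food", r"^fo+"))           # True: "^" anchors the match at the start
print(contains("food", r"o+d$"))           # True: "$" anchors the match at the end
print(contains("food", r"^o"))             # False: the string does not start with "o"
```

In GREL the corresponding patterns would be written between forward slashes, for example `value.contains(/^fo+/)`.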
@@ -143,7 +143,7 @@ Returns the first character index of sub as it last occurs in s; or, returns -1 ###### replace(s, s or p find, s replace) {#replaces-s-or-p-find-s-replace} -Returns the string obtained by replacing the find string with the replace string in the inputted string. For example, `"The cow jumps over the moon and moos".replace("oo", "ee")` returns the string “The cow jumps over the meen and mees”. Find can be a regex pattern. For example, `"The cow jumps over the moon and moos".replace(/\s+/, "_")` will return “The_cow_jumps_over_the_moon_and_moos”. +Returns the string obtained by replacing the find string with the replace string in the inputted string. For example, `"The cow jumps over the moon and moos".replace("oo", "ee")` returns the string “The cow jumps over the meen and mees”. Find can be a regex pattern. For example, `"The cow jumps over the moon and moos".replace(/\s+/, "_")` will return “The_cow_jumps_over_the_moon_and_moos”. You cannot find or replace nulls with this, as null is not a string. You can instead: @@ -166,7 +166,7 @@ This function is available since OpenRefine 3.6. Outputs an array of all consecutive substrings inside string s that match the substring or [regex](expressions#grel-supported-regex) pattern p. For example, `"abeadsabmoloei".find(/[aeio]+/)` would result in the array [ "a", "ea", "a", "o", "oei" ]. -You can supply a substring instead of p, by putting it in quotes, and OpenRefine will compile it into a regex pattern. Anytime you supply quotes, OpenRefine interprets the contents as a string, not regex. If you wish to use any regex notation, wrap the pattern in forward slashes. +You can supply a substring instead of p, by putting it in quotes, and OpenRefine will compile it into a regex pattern. Anytime you supply quotes, OpenRefine interprets the contents as a string, not regex. If you wish to use any regex notation, wrap the pattern in forward slashes. 
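
The difference between passing find() a quoted substring and a regex in forward slashes can be illustrated with a rough Python equivalent. This is a sketch, not GREL: `find_regex` and `find_literal` are hypothetical helper names, and using `re.escape` to model "quotes mean a literal string" is an assumption based on the description above.

```python
import re


def find_regex(s, pattern):
    """Stand-in for s.find(/pattern/): all consecutive matches of a regex."""
    return re.findall(pattern, s)


def find_literal(s, sub):
    """Stand-in for s.find("sub"): the substring is matched literally."""
    return re.findall(re.escape(sub), s)


print(find_regex("abeadsabmoloei", r"[aeio]+"))  # ['a', 'ea', 'a', 'o', 'oei']
print(find_literal("abeadsabmoloei", "ab"))      # ['ab', 'ab']
print(find_literal("a.b axb", "a.b"))            # ['a.b']: the dot is not a wildcard
```

The first call reproduces the array given in the `"abeadsabmoloei".find(/[aeio]+/)` example; the last one shows why regex notation inside quotes has no special meaning.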
:::tip @@ -184,12 +184,12 @@ Remember to enclose your regex in forward slashes, and to escape characters and For example, if `value` is “hello 123456 goodbye”, the following would occur: -|Expression|Result| -|-|-| -|`value.match(/\d{6}/)` |null (does not match the full string)| -|`value.match(/.*\d{6}.*/)` |[ ] (no indicated substring)| -|`value.match(/.*(\d{6}).*/)` |[ "123456" ] (array with one value)| -|`value.match(/(.*)(\d{6})(.*)/)` |[ "hello ", "123456", " goodbye" ] (array with three values)| +| Expression | Result | +| - | - | +| `value.match(/\d{6}/)` | null (does not match the full string) | +| `value.match(/.*\d{6}.*/)` | [ ] (no indicated substring) | +| `value.match(/.*(\d{6}).*/)` | [ "123456" ] (array with one value) | +| `value.match(/(.*)(\d{6})(.*)/)` | [ "hello ", "123456", " goodbye" ] (array with three values) | :::tip @@ -214,12 +214,12 @@ Returns the array of strings obtained by splitting s into substrings with the gi Like other functions that return an array, it also allows array slicing on the returned array. In that case, it returns the array consisting of a subset of elements between i1 and (i2 – 1). For example, -|Expression|Result| -|-|-| -|`"internationalization".splitByLengths(5, 6, 3)[0,3]` |Returns an array of 3 strings: [ "inter", "nation", “ali” .| -|`"internationalization".splitByLengths(5, 6, 3)[0,2]` |Returns an array of 2 strings: [ "inter", "nation" ]| -|`"internationalization".splitByLengths(5, 6, 3)[1,3]` |Returns an array of 2 string: [ "nation", “ali” ]| -|`"internationalization".splitByLengths(5, 6, 3)[1]` |Returns string at position 1: "nation" | +| Expression | Result | +| - | - | +| `"internationalization".splitByLengths(5, 6, 3)[0,3]` | Returns an array of 3 strings: [ "inter", "nation", “ali” ]. 
| +| `"internationalization".splitByLengths(5, 6, 3)[0,2]` | Returns an array of 2 strings: [ "inter", "nation" ] | +| `"internationalization".splitByLengths(5, 6, 3)[1,3]` | Returns an array of 2 string: [ "nation", “ali” ] | +| `"internationalization".splitByLengths(5, 6, 3)[1]` | Returns string at position 1: "nation" | ###### smartSplit(s, s or p sep (optional)) {#smartsplits-s-or-p-sep-optional} @@ -233,7 +233,7 @@ Returns an array of strings obtained by splitting s into groups of consecutive c ###### partition(s, s or p fragment, b omitFragment (optional)) {#partitions-s-or-p-fragment-b-omitfragment-optional} -Returns an array of strings [ a, fragment, z ] where a is the substring within s before the first occurrence of fragment, and z is the substring after fragment. Fragment can be a string or a regex. For example, `"internationalization".partition("nation")` returns 3 strings: [ "inter", "nation", "alization" ]. If s does not contain fragment, it returns an array of [ s, "", "" ] (the original unpartitioned string, and two empty strings). +Returns an array of strings [ a, fragment, z ] where a is the substring within s before the first occurrence of fragment, and z is the substring after fragment. Fragment can be a string or a regex. For example, `"internationalization".partition("nation")` returns 3 strings: [ "inter", "nation", "alization" ]. If s does not contain fragment, it returns an array of [ s, "", "" ] (the original unpartitioned string, and two empty strings). If the omitFragment boolean is true, for example with `"internationalization".partition("nation", true)`, the fragment is not returned. The output is [ "inter", "alization" ]. @@ -255,7 +255,7 @@ Escapes s in the given escaping mode. The mode can be one of: "html", "xml", "cs ###### unescape(s, s mode) {#unescapes-s-mode} -Unescapes s in the given escaping mode. The mode can be one of: "html", "xml", "csv", "url", "javascript". Note that quotes are required around your mode. 
See the [recipes](https://github.com/OpenRefine/OpenRefine/wiki/Recipes#atampampt----att) for examples of escaping and unescaping. +Unescapes s in the given escaping mode. The mode can be one of: "html", "xml", "csv", "url", "javascript". Note that quotes are required around your mode. See the [recipes](https://github.com/OpenRefine/OpenRefine/wiki/Recipes#atampampt----att) for examples of escaping and unescaping. ###### encode(s, s encoding) {#encodes-s-encoding} @@ -281,13 +281,13 @@ Returns the [SHA-1 hash](https://en.wikipedia.org/wiki/SHA-1) of an object. If f Returns a phonetic encoding of a string, based on an available phonetic algorithm. See the [section on phonetic clustering](cellediting#clustering-methods) for more information. Can be one of the following supported phonetic methods: [metaphone, doublemetaphone, metaphone3](https://www.wikipedia.org/wiki/Metaphone), [soundex](https://en.wikipedia.org/wiki/Soundex), [cologne-phonetic](https://en.wikipedia.org/wiki/Cologne_phonetics), [daitch-mokotoff](https://en.wikipedia.org/wiki/Daitch%E2%80%93Mokotoff_Soundex), [beider-morse](https://en.wikipedia.org/wiki/Daitch%E2%80%93Mokotoff_Soundex#Beider%E2%80%93Morse_Phonetic_Name_Matching_Algorithm). Quotes are required around your encoding method. For example, `"Ruth Prawer Jhabvala".phonetic("metaphone")` outputs the string “R0PRWRJHBFL”. -###### reinterpret(s, s encoderTarget, s encoderSource) {#reinterprets-s-encodertarget-s-encodersource} +###### reinterpret(s, s encoderTarget, s encoderSource) {#reinterpret} Returns s reinterpreted through the given character encoders. You must supply one of the [supported encodings](http://java.sun.com/j2se/1.5.0/docs/guide/intl/encoding.doc.html) for each of the original source and the target output. Note that quotes are required around your character encoder. -When an OpenRefine project is started, data is imported and interpreted. 
A specific character encoding is identified or manually selected at that time (such as UTF-8). You can reinterpret a column into another specificed encoding using this function. This function may not fix your data; it may be better to use this in conjunction with new projects to test the interpretation, and pre-format your data as needed. +When an OpenRefine project is started, data is imported and interpreted. A specific character encoding is identified or manually selected at that time (such as UTF-8). You can reinterpret a column into another specified encoding using this function. This function may not fix your data; it may be better to use this in conjunction with new projects to test the interpretation, and pre-format your data as needed. -###### fingerprint(s) {#fingerprints} +###### fingerprint(s) {#fingerprint} Returns the fingerprint of s, a string that is the first step in [fingerprint clustering methods](cellediting#clustering-methods): it will trim whitespaces, convert all characters to lowercase, remove punctuation, sort words alphabetically, etc. For example, `"Ruth Prawer Jhabvala".fingerprint()` outputs the string “jhabvala prawer ruth”. @@ -354,36 +354,40 @@ The GREL expression `forEach(value.parseJson().keywords,v,v.text).join(":::")` w ### Jsoup XML and HTML parsing {#jsoup-xml-and-html-parsing} ###### parseHtml(s) {#parsehtmls} -Given a cell full of HTML-formatted text, parseHtml() simplifies HTML tags (such as by removing “ /” at the end of self-closing tags), closes any unclosed tags, and inserts linebreaks and indents for cleaner code. You cannot pass parseHtml() a URL, but you can pre-fetch HTML with the [Add column by fetching URLs](columnediting#add-column-by-fetching-urls) menu option. +Given a cell full of HTML-formatted text, parseHtml() simplifies HTML tags (such as by removing “ /” at the end of self-closing tags), closes any unclosed tags, and inserts linebreaks and indents for cleaner code. 
You cannot pass parseHtml() a URL, but you can pre-fetch HTML with the [Add column by fetching URLs](columnediting#add-column-by-fetching-urls) menu option. -A cell cannot store the output of parseHtml() unless you convert it with toString(): for example, `value.parseHtml().toString()`. +A cell cannot store the output of parseHtml() unless you convert it with toString(): for example, `value.parseHtml().toString()`. -When parseHtml() simplifies HTML, it can sometimes introduce errors. When closing tags, it makes its best guesses based on line breaks, indentation, and the presence of other tags. You may need to manually check the results. +When parseHtml() simplifies HTML, it can sometimes introduce errors. When closing tags, it makes its best guesses based on line breaks, indentation, and the presence of other tags. You may need to manually check the results. You can then extract or [select()](#selects-element) which portions of the HTML document you need for further splitting, partitioning, etc. An example of extracting all table rows from a div using parseHtml().select() together is described more in depth at [StrippingHTML](https://github.com/OpenRefine/OpenRefine/wiki/StrippingHTML). ###### parseXml(s) {#parsexmls} -Given a cell full of XML-formatted text, parseXml() returns a full XML document and adds any missing closing tags. You can then extract or [select()](#selects-element) which portions of the XML document you need for further splitting, partitioning, etc. Functions the same way as parseHtml() is described above. +Given a cell full of XML-formatted text, parseXml() returns a full XML document and adds any missing closing tags. You can then extract or [select()](#selects-element) which portions of the XML document you need for further splitting, partitioning, etc. Functions the same way as parseHtml() is described above. 
###### select(s, element) {#selects-element} Returns an array of all the desired elements from an HTML or XML document, if the element exists. Elements are identified using the [Jsoup selector syntax](https://jsoup.org/apidocs/org/jsoup/select/Selector.html). For example, `value.parseHtml().select("img.portrait")[0]` would return the entirety of the first “img” tag with the “portrait” class found in the parsed HTML inside `value`. Returns an empty array if no matching element is found. Use with toString() to capture the results in a cell. A tutorial of select() is shown in [StrippingHTML](https://github.com/OpenRefine/OpenRefine/wiki/StrippingHTML). You can use select() more than once: -``` +```grel value.parseHtml().select("div#content")[0].select("tr").toString() ``` ###### htmlAttr(s, element) {#htmlattrs-element} + Returns a string from an attribute on an HTML element. Use it in conjunction with parseHtml() as in the following example: `value.parseHtml().select("a.email")[0].htmlAttr("href")` would retrieve the email address attached to a link with the “email” class. ###### xmlAttr(s, element) {#xmlattrs-element} + Returns a string from an attribute on an XML element. Functions the same way htmlAttr() is described above. Use it in conjunction with parseXml(). ###### htmlText(element) {#htmltextelement} -Returns a string of the text from within an HTML element (including all child elements), removing HTML tags and line breaks inside the string. Use it in conjunction with parseHtml() and select() to provide an element, as in the following example: `value.parseHtml().select("div.footer")[0].htmlText()`. + +Returns a string of the text from within an HTML element (including all child elements), removing HTML tags and line breaks inside the string. Use it in conjunction with parseHtml() and select() to provide an element, as in the following example: `value.parseHtml().select("div.footer")[0].htmlText()`. 
###### xmlText(element) {#xmltextelement} + Returns a string of the text from within an XML element (including all child elements). Functions the same way htmlText() is described above. Use it in conjunction with parseXml() and select() to provide an element. ###### wholeText(element) {#wholetextelement} @@ -393,15 +397,19 @@ Selects the (unencoded) text of an element and its children, including any new l This function is available since OpenRefine 3.5. ###### innerHtml(element) {#innerhtmlelement} + Returns the [inner HTML](https://developer.mozilla.org/en-US/docs/Web/API/Element/innerHTML) of an HTML element. This will include text and children elements within the element selected. Use it in conjunction with parseHtml() and select() to provide an element. ###### innerXml(element) {#innerxmlelement} + Returns the inner XML elements of an XML element. Does not return the text directly inside your chosen XML element - only the contents of its children. To select the direct text, use ownText(). To select both, use xmlText(). Use it in conjunction with parseXml() and select() to provide an element. ###### ownText(element) {#owntextelement} + Returns the text directly inside the selected XML or HTML element only, ignoring text inside children elements (for this, use innerXml()). Use it in conjunction with a parser and select() to provide an element. ###### parent(element) {#parentelement} + Returns the parent node or null if no parent. Use it in conjunction with parseHtml() and select() to provide an element. This function is available since OpenRefine 3.6. @@ -409,50 +417,61 @@ This function is available since OpenRefine 3.6. ### URI parsing {#uri-parsing} ###### parseUri(s) {#parseUris} + Given a valid URI string (for example: https://www.openrefine.org:80/documentation#download?format=xml&os=mac), parseUri() returns a JSON object with the following properties: - - `scheme`: The scheme of the URI, e.g. `http` - - `host`: the host of the URI (e.g. 
`www.openrefine.org`) - - `port`: the port of the URI (e.g. `80`) - - `path`: the path of the URI (e.g. `/documentation`) - - `query`: the query of the URI (e.g. `format=xml&os=mac`) - - `authority`: the authority of the URI (e.g. `www.openrefine.org:80`) - - `fragment`: the fragment of the URI (e.g. `download`) - - `query_params`: the query of the URI as an object (e.g. `{format: "xml", os: "mac"}`) + +- `scheme`: The scheme of the URI, e.g. `http` +- `host`: the host of the URI (e.g. `www.openrefine.org`) +- `port`: the port of the URI (e.g. `80`) +- `path`: the path of the URI (e.g. `/documentation`) +- `query`: the query of the URI (e.g. `format=xml&os=mac`) +- `authority`: the authority of the URI (e.g. `www.openrefine.org:80`) +- `fragment`: the fragment of the URI (e.g. `download`) +- `query_params`: the query of the URI as an object (e.g. `{format: "xml", os: "mac"}`) This function is available since OpenRefine 3.6. ## Array functions {#array-functions} ###### length(a) {#lengtha} -Returns the size of an array, meaning the number of objects inside it. Arrays can be empty, in which case length() will return 0. + +Returns the size of an array, meaning the number of objects inside it. Arrays can be empty, in which case length() will return 0. ###### slice(a, n from, n to (optional)) {#slicea-n-from-n-to-optional} + Returns a sub-array of a given array, from the first index provided and up to and excluding the optional last index provided. Remember that array objects are indexed starting at 0. If the to value is omitted, it is understood to be the end of the array. For example, `[0, 1, 2, 3, 4].slice(1, 3)` returns [ 1, 2 ], and `[ 0, 1, 2, 3, 4].slice(2)` returns [ 2, 3, 4 ]. Also works with strings; see [String functions](#slices-n-from-n-to-optional). ###### get(a, n from, n to (optional)) {#geta-n-from-n-to-optional} -Returns a sub-array of a given array, from the first index provided and up to and excluding the optional last index provided. 
Remember that array objects are indexed starting at 0. + +Returns a sub-array of a given array, from the first index provided and up to and excluding the optional last index provided. Remember that array objects are indexed starting at 0. If the to value is omitted, only one array item is returned, as a string, instead of a sub-array. To return a sub-array from one index to the end, you can set the to argument to a very high number such as `value.get(2,999)` or you can use something like `with(value,a,a.get(1,a.length()))` to count the length of each array. Also works with strings; see [String functions](#gets-n-from-n-to-optional). ###### inArray(a, s) {#inarraya-s} + Returns true if the array contains the desired string, and false otherwise. Will not convert data types; for example, `[ 1, 2, 3, 4 ].inArray("3")` will return false. ###### reverse(a) {#reversea} + Reverses the array. For example, `[ 0, 1, 2, 3].reverse()` returns the array [ 3, 2, 1, 0 ]. ###### sort(a) {#sorta} -Sorts the array in ascending order. Sorting is case-sensitive, uppercase first and lowercase second. For example, `[ "al", "Joe", "Bob", "jim" ].sort()` returns the array [ "Bob", "Joe", "al", "jim" ]. + +Sorts the array in ascending order. Sorting is case-sensitive, uppercase first and lowercase second. For example, `[ "al", "Joe", "Bob", "jim" ].sort()` returns the array [ "Bob", "Joe", "al", "jim" ]. ###### sum(a) {#suma} + Return the sum of the numbers in the array. For example, `[ 2, 1, 0, 3 ].sum()` returns 6. ###### join(a, sep) {#joina-sep} + Joins the items in the array with sep, and returns it all as a string. For example, `[ "and", "or", "not" ].join("/")` returns the string “and/or/not”. ###### uniques(a) {#uniquesa} -Returns the array with duplicates removed. Case-sensitive. For example, `[ "al", "Joe", "Bob", "Joe", "Al", "Bob" ].uniques()` returns the array [ "Joe", "al", "Al", "Bob" ]. + +Returns the array with duplicates removed. Case-sensitive. 
For example, `[ "al", "Joe", "Bob", "Joe", "Al", "Bob" ].uniques()` returns the array [ "Joe", "al", "Al", "Bob" ]. As of OpenRefine 3.4.1, uniques() reorders the array items it returns; in 3.4 beta 644 and onwards, it preserves the original order (in this case, [ "al", "Joe", "Bob", "Al" ]). @@ -471,18 +490,18 @@ Returns the date object according to your system clock. For example, `now()` ret Returns the inputted object converted to a date object. Without arguments, it returns the ISO 8601 extended format. With arguments, you can control the output format: * monthFirst: set false if the date is formatted with the day before the month. -* formatN: attempt to parse the date using an ordered list of possible formats. Supply formats based on the [SimpleDateFormat](https://docs.oracle.com/javase/8/docs/api/java/text/SimpleDateFormat.html) syntax (and see the table below for a handy reference). +* formatN: attempt to parse the date using an ordered list of possible formats. Supply formats based on the [SimpleDateFormat](https://docs.oracle.com/javase/8/docs/api/java/text/SimpleDateFormat.html) syntax (and see the table below for a handy reference). For example, you can parse a column containing dates in different formats, such as cells with “Nov-09” and “11/09”, using `value.toDate('MM/yy','MMM-yy').toString('yyyy-MM')` and both will output “2009-11”. For another example, “1/4/2012 13:30:00” can be parsed into a date using `value.toDate('d/M/y H:m:s')`. If parsing a date with text components in a language other than your system language you can specify a language code as the format1 argument. For example, a French language date such as "10 janvier 2023" could be parsed with `value.toDate('fr','dd MMM yyyy')`. 
| Letter | Date or Time Component | Presentation | Examples | -|-|-|-|-| +| - | - | - | - | | G | Era designator | Text | AD | | u | Year | Year | 1996; 96 | | y | year-of-era | year | 1996; 96 | -| M/L | Month in year | Number/text |7; 07; Jul; July; J | +| M/L | Month in year | Number/text | 7; 07; Jul; July; J | | Q/q | quarter-of-year | number/text | 3; 03; Q3; 3rd quarter | -| Y | week-based-year | year | 1996;96 | +| Y | week-based-year | year | 1996;96 | | w | Week in year | Number | 27 | | W | Week in month | Number | 2 | | D | Day in year | Number | 189 | @@ -507,7 +526,7 @@ For example, you can parse a column containing dates in different formats, such Given two dates, returns a number indicating the difference in a given time unit (see the table below). For example, `diff(("Nov-11".toDate('MMM-yy')), ("Nov-09".toDate('MMM-yy')), "weeks")` will return 104, for 104 weeks, or two years. The later date should go first. If the output is negative, invert d1 and d2. -Also works with strings; see [diff() in string functions](#diffsd1-sd2-s-timeunit-optional). +Also works with strings; see [diff() in string functions](#diffs1-s2-s-timeunit-optional). ###### inc(d, n, s timeUnit) {#incd-n-s-timeunit} @@ -515,12 +534,12 @@ Returns a date changed by the given amount in the given unit of time (see the ta ###### datePart(d, s timeUnit) {#datepartd-s-timeunit} -Returns part of a date. The data type returned depends on the unit (see the table below). +Returns part of a date. The data type returned depends on the unit (see the table below). 
OpenRefine supports the following values for timeUnit: | Unit | Date part returned | Returned data type | Example using [date 2014-03-14T05:30:04.000789000Z] as value | -|-|-|-|-| +| - | - | - | - | | years, year | Year | Number | value.datePart("years") → 2014 | | months, month | Month | Number | value.datePart("months") → 2 | | weeks, week, w | Week of the month | Number | value.datePart("weeks") → 3 | @@ -587,13 +606,15 @@ Some of these math functions don't recognize integers when supplied as the first ## Other functions {#other-functions} ###### type(o) {#typeo} + Returns a string with the data type of o, such as undefined, string, number, boolean, etc. For example, a [Transform](cellediting#transform) operation using `value.type()` will convert all cells in a column to strings of their data types. ###### facetCount(choiceValue, s facetExpression, s columnName) {#facetcountchoicevalue-s-facetexpression-s-columnname} + Returns the facet count corresponding to the given choice value, by looking for the facetExpression in the choiceValue in columnName. For example, to create facet counts for the following table, we could generate a new column based on “Gift” and enter in `value.facetCount("value", "Gift")`. This would add the column we've named “Count”: | Gift | Recipient | Price | Count | -|-|-|-|-| +| - | - | - | - | | lamp | Mary | 20 | 1 | | clock | John | 57 | 2 | | watch | Amit | 80 | 1 | @@ -602,21 +623,25 @@ Returns the facet count corresponding to the given choice value, by looking for The facet expression, wrapped in quotes, can be useful to manipulate the inputted values before counting. For example, you could do a textual cleanup using fingerprint(): `(value.fingerprint()).facetCount(value.fingerprint(),"Gift")`. ###### hasField(o, s name) {#hasfieldo-s-name} -Returns a boolean indicating whether o has a member field called [name](expressions#variables). 
For example, `cell.recon.hasField("match")` will return false if a reconciliation match hasn’t been selected yet, or true if it has. You cannot chain your desired fields: for example, `cell.hasField("recon.match")` will return false even if the above expression returns true). + +Returns a boolean indicating whether o has a member field called [name](expressions#variables). For example, `cell.recon.hasField("match")` will return false if a reconciliation match hasn’t been selected yet, or true if it has. You cannot chain your desired fields: for example, `cell.hasField("recon.match")` will return false even if the above expression returns true. ###### coalesce(o1, o2, o3, ...) {#coalesceo1-o2-o3-} + Returns the first non-null from a series of objects. For example, `coalesce(value, "")` would return an empty string “” if `value` was null, but otherwise return `value`. -###### cross(cell, s projectName (optional), s columnName (optional)) {#crosscell-s-projectname-optional-s-columnname-optional} -Returns an array of zero or more rows in the project projectName for which the cells in their column columnName have the same content as the cell in your chosen column. For example, if two projects contained matching names, and you wanted to pull addresses for people by their names from a project called “People” you would apply the following expression to your column of names: -``` +###### cross(cell, s projectName (optional), s columnName (optional)) {#cross} + +Returns an array of zero or more rows in the project projectName for which the cells in their column columnName have the same content as the cell in your chosen column. 
For example, if two projects contained matching names, and you wanted to pull addresses for people by their names from a project called “People” you would apply the following expression to your column of names: + +```grel cell.cross("People","Name")[0].cells["Address"].value ``` -This would match your current column to the “Name” column in “People” and, using those matches, pull the respective “Address” value into your current project. +This would match your current column to the “Name” column in “People” and, using those matches, pull the respective “Address” value into your current project. You may need to do some data preparation with cross(), such as using trim() on your key columns or deduplicating values. -The first argument will be interpreted as `cell.value` if set to `cell`. If you omit projectName and columnName, they will default to the current project and index column (number 0). +The first argument will be interpreted as `cell.value` if set to `cell`. If you omit projectName and columnName, they will default to the current project and index column (number 0). Recipes and more examples for using cross() can be found [on our wiki](https://github.com/OpenRefine/OpenRefine/wiki/Recipes#combining-datasets). diff --git a/docs/manual/installing.md b/docs/manual/installing.md index 863ec14e..276a1e0b 100644 --- a/docs/manual/installing.md +++ b/docs/manual/installing.md @@ -16,9 +16,10 @@ OpenRefine is designed to work with **Windows**, **Mac**, and **Linux** operatin #### Java {#java} -Java must be installed and configured on your computer to run OpenRefine. The Mac version of OpenRefine includes Java; new in OpenRefine 3.4, there is also a Windows package with Java included. +Java must be installed and configured on your computer to run OpenRefine. The Mac version of OpenRefine includes Java. +Since OpenRefine 3.4, there is also an OpenRefine Windows package with Java already included. 
-If you want to install Java yourself, you can install a pre-built Java Runtime Environment (JRE) from [Adoptium.net](https://adoptium.net/releases.html). Please note that OpenRefine works with Java 11 to Java 17 for OpenRefine 3.7. +If you want to manually install Java yourself, you can install a pre-built Java Runtime Environment (JRE) from [Adoptium.net](https://adoptium.net/releases.html). Please note that OpenRefine works with Java 11 to Java 17 for OpenRefine 3.7. If you install and start OpenRefine on a Windows computer without Java, it will automatically open up a browser window to this page. @@ -26,13 +27,13 @@ If you install and start OpenRefine on a Windows computer without Java, it will OpenRefine works best on browsers based on WebKit, such as: -* [Google Chrome](https://www.google.com/chrome/) -* [Chromium](https://ungoogled-software.github.io/) -* [Opera](https://www.opera.com/) -* [Microsoft Edge](https://www.microsoft.com/edge) -* [Safari](https://www.apple.com/safari/) +* [Google Chrome](https://www.google.com/chrome/) +* [Chromium](https://ungoogled-software.github.io/) +* [Opera](https://www.opera.com/) +* [Microsoft Edge](https://www.microsoft.com/edge) +* [Safari](https://www.apple.com/safari/) -We are aware of some minor rendering and performance issues on other browsers such as Firefox. We don't support Internet Explorer. If you are having issues running OpenRefine, see the [section on Running](running.md#troubleshooting). +We are aware of some minor rendering and performance issues on other browsers such as Firefox. We don't support Internet Explorer. If you are having issues running OpenRefine, see [Troubleshooting](troubleshooting.md). ### Release versions {#release-versions} @@ -318,9 +319,10 @@ Using a Mac, you can [run OpenRefine using the terminal](running#starting-and-ex ## Increasing memory allocation {#increasing-memory-allocation} OpenRefine relies on having computer memory available to it to work effectively. 
If you are planning to work with large datasets, you may wish to set up OpenRefine to handle it at the outset. By “large” we generally mean one of the following indicators: -* more than one million total cells -* an input file size of more than 50 megabytes (MB) -* more than 50 [rows per record in records mode](running#records-mode) + +* more than one million total cells +* an input file size of more than 50 megabytes (MB) +* more than 50 [rows per record in records mode](exploring#rows-vs-records) By default OpenRefine is set to operate with 1 gigabyte (GB) of memory (1024MB). If you feel that OpenRefine is running slowly, or you are getting “out of memory” errors (for example, `java.lang.OutOfMemoryError`), you can try allocating more memory. @@ -348,7 +350,7 @@ If your project is big enough to need more than the default amount of memory, co If you run `openrefine.exe`, you will need to edit the `openrefine.l4j.ini` file found in the program directory and edit the line -``` +```ini # max memory memory heap size -Xmx1024M ``` @@ -356,7 +358,7 @@ If you run `openrefine.exe`, you will need to edit the `openrefine.l4j.ini` file The line “-Xmx1024M” defines the amount of memory available in megabytes. Change the number “1024” - for example, edit the line to “-Xmx2048M” to make 2048MB [2GB] of memory available. :::caution openrefine.exe not running? -Once you increase the memory allocation, you may find that you cannot run `openrefine.exe`. In this case, your computer needs a 64-bit version of [Java](https://www.java.com/en/download/help/index_installing.xml) (this is different from [Java JDK](#install-or-upgrade-java). Look for the “Windows Offline (64-bit)” download on the Downloads page and install that. Your system must also be set to use the 64-bit version of Java by [changing the Java configuration](https://www.java.com/en/download/help/update_runtime_settings.xml). +Once you increase the memory allocation, you may find that you cannot run `openrefine.exe`. 
In this case, your computer needs a 64-bit version of [Java](https://www.java.com/en/download/help/index_installing.xml) (this is different from [Java JDK](#java)). Look for the “Windows Offline (64-bit)” download on the Downloads page and install that. Your system must also be set to use the 64-bit version of Java by [changing the Java configuration](https://www.java.com/en/download/help/update_runtime_settings.xml). ::: #### Using refine.bat {#using-refinebat} @@ -365,45 +367,52 @@ On Windows, OpenRefine can also be run by using the file `refine.bat` in the pro To set the maximum amount of memory on the command line when using `refine.bat`, `cd` to the program directory, then type -```refine.bat /m 2048m``` +```cmd +refine.bat /m 2048m +``` where “2048” is the maximum amount of MB that you want OpenRefine to use. To change the default that `refine.bat` uses, edit the `refine.ini` line that reads -```REFINE_MEMORY=1024M``` +```ini +REFINE_MEMORY=1024M +``` Note that this file is only read if you use `refine.bat`, not `openrefine.exe`. -:::caution +:::caution Before proceeding, double-check that you've completed the installation steps outlined above. Skipping those steps may result in an error about a read-only volume when you try to edit the `Info.plist` file in the next steps. 
::: If you have downloaded the `.dmg` package and you start OpenRefine by double-clicking on it: -* close OpenRefine -* control-click on the OpenRefine icon (opens the contextual menu) -* click on "show package content” (a finder window opens) -* open the “Contents” folder -* open and edit the `Info.plist` file with any text editor (like Mac's default TextEdit) -* Change “-Xmx1024M” into, for example, “-Xmx2048M” or “-Xmx8G” -* save the file -* restart OpenRefine +* close OpenRefine +* control-click on the OpenRefine icon (opens the contextual menu) +* click on "show package content” (a finder window opens) +* open the “Contents” folder +* open and edit the `Info.plist` file with any text editor (like Mac's default TextEdit) +* Change “-Xmx1024M” into, for example, “-Xmx2048M” or “-Xmx8G” +* save the file +* restart OpenRefine If you have downloaded the `.tar.gz` package and you start OpenRefine from the command line, add the “-m xxxxM” parameter like this: -`./refine -m 2048m` + +```sh +./refine -m 2048m +``` #### Setting a default {#setting-a-default} If you don't want to set this option on the command line each time, you can also set it in the `refine.ini` file. Edit the line -``` +```ini REFINE_MEMORY=1024M ``` @@ -415,7 +424,6 @@ Make sure it is not commented out (that is, that the line doesn't start with a --- - ## Installing extensions {#installing-extensions} Extensions have been created by our contributor community to add functionality or provide convenient shortcuts for common uses of OpenRefine. [We list extensions we know about on our extensions page](/extensions). 
@@ -428,8 +436,8 @@ If you’d like to create or modify an extension, [see our developer documentati You can [install extensions in one of two places](#set-where-data-is-stored): -* Into your OpenRefine program folder, so they will only be available to that version/installation of OpenRefine (meaning the extension will not run if you upgrade OpenRefine), or -* Into your workspace, where your projects are stored, so they will be available no matter which version of OpenRefine you’re using. +* Into your OpenRefine program folder, so they will only be available to that version/installation of OpenRefine (meaning the extension will not run if you upgrade OpenRefine), or +* Into your workspace, where your projects are stored, so they will be available no matter which version of OpenRefine you’re using. We provide these options because you may wish to reinstall a given extension manually each time you upgrade OpenRefine, in order to be sure it works properly. @@ -438,8 +446,9 @@ We provide these options because you may wish to reinstall a given extension man If you want to install the extension into the program folder, go to your program directory and then go to `webapp\extensions` (or create it if not does not exist). If you want to install the extension into your workspace, you can: -* [Locate your workspace directory](#set-where-data-is-stored) -* Create a new folder called “extensions” inside the workspace if it does not exist. + +* [Locate your workspace directory](#set-where-data-is-stored) +* Create a new folder called “extensions” inside the workspace if it does not exist. You can also [find your workspace on each operating system using these instructions](#set-where-data-is-stored). 
@@ -451,8 +460,8 @@ Some extensions may have multiple versions, to match OpenRefine versions, so be Generally, the installation process will be: -* Download the extension (usually as a zip file from GitHub) -* Extract the zip contents into the `webapp\extensions` directory, making sure all the contents go into one folder with the name of the extension -* Start (or restart) OpenRefine. +* Download the extension (usually as a zip file from GitHub) +* Extract the zip contents into the `webapp\extensions` directory, making sure all the contents go into one folder with the name of the extension +* Start (or restart) OpenRefine. To confirm that installation was a success, follow the instructions provided by the extension. Each extension will appear in its own way inside the OpenRefine interface. Make sure you read its documentation to know where the functionality will appear, such as under specific dropdown menus. diff --git a/docs/manual/jythonclojure.md b/docs/manual/jythonclojure.md index eecfde11..87cd9159 100644 --- a/docs/manual/jythonclojure.md +++ b/docs/manual/jythonclojure.md @@ -12,7 +12,7 @@ Python code that depends on C bindings will not work in OpenRefine, which uses J You will need to restart OpenRefine, so that new Jython or Python libraries are initialized during startup. -OpenRefine now has [most of the Jsoup.org library built into GREL functions](grelfunctions#jsoup-xml-and-html-parsing-functions) for parsing and working with HTML and XML elements. +OpenRefine now has [most of the Jsoup.org library built into GREL functions](grelfunctions#jsoup-xml-and-html-parsing) for parsing and working with HTML and XML elements. 
### Syntax {#syntax} diff --git a/docs/manual/reconciling.md b/docs/manual/reconciling.md index ebc045b3..bf7f8592 100644 --- a/docs/manual/reconciling.md +++ b/docs/manual/reconciling.md @@ -211,7 +211,7 @@ Remember to set an appropriate throttle and to refer to the service documentatio ## Keep all the suggestions made {#keep-all-the-suggestions-made} -To generate a list of each suggestion made, rather than only the best candidate, you can use a [GREL expression](expressions#GREL). Go to Edit columnAdd column based on this column. To create a list of all the possible matches, use something like +To generate a list of each suggestion made, rather than only the best candidate, you can use a [GREL expression](expressions.md). Go to Edit columnAdd column based on this column. To create a list of all the possible matches, use something like ``` forEach(cell.recon.candidates,c,c.name).join(", ") @@ -236,7 +236,7 @@ OpenRefine supplies a number of variables related specifically to reconciled val * `cell.recon.judgmentHistory` (the values used in the “judgment action timestamp” facet) * `cell.recon.matched` (a “true” or “false” value) -You can find out more in the [reconciliaton variables](expressions#reconciliaton-variables) section. +You can find out more in the [reconciliation variables](expressions#reconciliation) section. :::tip Make a copy of a reconciled column diff --git a/docs/manual/running.md b/docs/manual/running.md index 5c9541d0..f0611507 100644 --- a/docs/manual/running.md +++ b/docs/manual/running.md @@ -6,22 +6,23 @@ sidebar_label: Running ## Starting and exiting {#starting-and-exiting} -OpenRefine does not require internet access to run its basic functions. Once you download and install it, it runs as a small web server on your own computer, and you access that local web server by using your browser. +OpenRefine does not require internet access to run its basic functions.
Once you download and install it, it runs as a small web server on your own computer, and you access that local web server by using your browser. -You will see a command line window open when you run OpenRefine. Ignore that window while you work on datasets in your browser. +You will see a command line window open when you run OpenRefine. Ignore that window while you work on datasets in your browser. No matter how you start OpenRefine, it will load its interface in your computer’s default browser. If you would like to use another browser instead, start OpenRefine and then point your chosen browser at the home screen: [http://127.0.0.1:3333/](http://127.0.0.1:3333/). OpenRefine works best on browsers based on WebKit, such as: -* [Google Chrome](https://www.google.com/chrome/) -* [Chromium](https://ungoogled-software.github.io/) -* [Opera](https://www.opera.com/) -* [Microsoft Edge](https://www.microsoft.com/edge) -* [Safari](https://www.apple.com/safari/) + +* [Google Chrome](https://www.google.com/chrome/) +* [Chromium](https://ungoogled-software.github.io/) +* [Opera](https://www.opera.com/) +* [Microsoft Edge](https://www.microsoft.com/edge) +* [Safari](https://www.apple.com/safari/) We are aware of some minor rendering and performance issues on other browsers such as Firefox. We don't support Internet Explorer. -You can view and work on multiple projects at the same time by simply having multiple tabs or browser windows open. From the Open Project screen, you can right-click on project names and open them in new tabs or windows. +You can view and work on multiple projects at the same time by simply having multiple tabs or browser windows open. From the Open Project screen, you can right-click on project names and open them in new tabs or windows. 
import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; @@ -38,12 +39,14 @@ import TabItem from '@theme/TabItem'; -#### With openrefine.exe {#with-openrefineexe} +### With openrefine.exe {#with-openrefineexe} + You can run OpenRefine by double-clicking `openrefine.exe` or calling it from the command line. If you want to [modify the way `openrefine.exe` opens](#starting-with-modifications), you can edit the `openrefine.l4j.ini` file. -#### With refine.bat {#with-refinebat} +### With refine.bat {#with-refinebat} + On Windows, OpenRefine can also be run by using the file `refine.bat` in the program directory. If you start OpenRefine using `refine.bat`, you can do so by opening the file itself, or by calling it from the command line. If you call `refine.bat` from the command line, you can [start OpenRefine with modifications](#starting-with-modifications). @@ -51,7 +54,7 @@ If you want to modify the way `refine.bat` opens through double-clicking or usin #### Exiting {#exiting} -To exit OpenRefine, close all the browser tabs or windows, then navigate to the command line window. To close this window and ensure OpenRefine exits properly, hold down `Control` and press `C` on your keyboard. This will save any last changes to your projects. +To exit OpenRefine, close all the browser tabs or windows, then navigate to the command line window. To close this window and ensure OpenRefine exits properly, hold down `Control` and press `C` on your keyboard. This will save any last changes to your projects. @@ -61,14 +64,14 @@ You can find OpenRefine in your Applications folder, or you can open it using Te To run OpenRefine using Terminal: -* Open the Terminal by using the “Go” menu, choose Utilities -* In the Utilities window, start the Terminal application. 
-* Inside the terminal, write this: `/Applications/OpenRefine.app/Contents/MacOS/JavaAppLauncher` +* Open the Terminal by using the “Go” menu, choose Utilities +* In the Utilities window, start the Terminal application. +* Inside the terminal, write this: `/Applications/OpenRefine.app/Contents/MacOS/JavaAppLauncher` To exit, close all your OpenRefine browser tabs, find the applications called “JavaAppLauncher”, press `Command` and `Q` to close it down, and do the same with the Terminal application. :::caution Problems starting? -If you are using an older version of OpenRefine or are on an older version of MacOS, [check our Wiki for solutions to problems with MacOS](https://github.com/OpenRefine/OpenRefine/wiki/Installation-Instructions#macos). +If you are using an older version of OpenRefine or are on an older version of MacOS, [check our Wiki for solutions to problems with MacOS](https://github.com/OpenRefine/OpenRefine/wiki/Installation-Instructions#macos). ::: @@ -77,7 +80,7 @@ If you are using an older version of OpenRefine or are on an older version of Ma Use a terminal to launch OpenRefine. First, navigate to the installation folder. Then call the program: -``` +```shell cd openrefine-3.4.1 ./refine ``` @@ -89,7 +92,7 @@ To exit, close all the browser tabs, and then press `control` and `C` in the ter :::caution Did you get a JAVA_HOME error? “Error: Could not find the ‘java’ executable at ‘’, are you sure your JAVA_HOME environment variable is pointing to a proper java installation?” -If you see this error, you need to [install and configure a JDK package](installing#linux), including setting up `JAVA_HOME`. +If you see this error, you need to [install and configure a JDK package](installing#java), including setting up `JAVA_HOME`. 
::: @@ -98,13 +101,9 @@ If you see this error, you need to [install and configure a JDK package](install --- -### Troubleshooting {#troubleshooting} - -If you are having problems connecting to OpenRefine with your browser, [check our Wiki for information about browser settings and operating-system issues](https://github.com/OpenRefine/OpenRefine/wiki/FAQ#i-am-having-trouble-connecting-to-openrefine-with-my-browser). - ### Starting with modifications {#starting-with-modifications} -When you run OpenRefine from a command line, you can change a number of default settings. +When you run OpenRefine from a command line, you can change a number of default settings. refine /i 127.0.0.2 /p 3334 ``` -Get a list of all the commands with `refine /?`. +Get a list of all the commands with `refine /?`. -| Command |Use|Syntax example| +|Command|Use|Syntax example| |---|---|---| |/w|Path to the webapp|refine /w /path/to/openrefine| |/m|Memory maximum heap|refine /m 6000M| @@ -149,7 +148,11 @@ You cannot start the Mac version with modifications using Terminal, but you can -To see the full list of command-line options, run `./refine -h`. +To see the full list of command-line options, run + +```sh +./refine -h +``` |Command|Use|Syntax example| |---|---|---| @@ -173,16 +176,17 @@ To see the full list of command-line options, run `./refine -h`. #### Modifications set within files {#modifications-set-within-files} -On Windows, you can modify the way `openrefine.exe` runs by editing `openrefine.l4j.ini`; you can modify the way `refine.bat` runs by editing `refine.ini`. +On Windows, you can modify the way `openrefine.exe` runs by editing `openrefine.l4j.ini`; you can modify the way `refine.bat` runs by editing `refine.ini`. -You can modify the Mac application by editing `info.plist`. +You can modify the Mac application by editing `info.plist`. -On Linux, you can edit `refine.ini`. +On Linux, you can edit `refine.ini`.
-Some settings, such as changing memory allocations, are already set inside these files, and all you have to do is change the values. Some lines need to be un-commented to work. +Some settings, such as changing memory allocations, are already set inside these files, and all you have to do is change the values. Some lines need to be un-commented to work. -For example, inside `refine.ini`, you should see: -``` +For example, inside `refine.ini`, you should see: + +```ini no_proxy="localhost,127.0.0.1" #REFINE_PORT=3334 #REFINE_INTERFACE=127.0.0.1 @@ -195,33 +199,33 @@ REFINE_MEMORY=1400M # Set initial java heap space (default: 256M) for better performance with large datasets REFINE_MIN_MEMORY=1400M -... + +# Any personal preferences with a JAVA_OPTIONS= line like for autosaving every 8 minutes +JAVA_OPTIONS=-Drefine.autosave=8 ``` ##### JVM preferences {#jvm-preferences} -Further modifications can be performed by using JVM preferences. These JVM preferences are different options and have different syntax than the key/value descriptions used on the command line. +Further modifications can be performed by using JVM preferences. These JVM preferences are different options and have different syntax than the key/value descriptions used on the command line. 
Some of the most common keys (with their defaults) are: |Description|Argument|Syntax example| |---|---|---| -|Proxy host or IP|`-Dhttp.proxyHost`|proxy.example.org or 192.168.1.10 -|Proxy port|`-Dhttp.proxyPort`|8080 -|The project [autosave](starting#autosaving) frequency|`-Drefine.autosave`|5 [minutes] -|The workspace director|`-Drefine.data_dir`|/ -|Development mode|`-Drefine.development`|false -|Headless mode|`-Drefine.headless`|false -|IP|`-Drefine.interface`|127.0.0.1 -|Domain name|`-Drefine.host`|mymachine.local -|Port|`-Drefine.port`|3333 -|The application folder|`-Drefine.webapp`|main/webapp -|New version notice|`-Drefine.display.new.version.notice`|true -|Google Data Client ID|`-Dext.gdata.clientid`|000000000000-********************************.apps.googleusercontent.com -|Google Data Client secret|`-Dext.gdata.clientsecret`|************************ -|Google Data API Key|`-Dext.gdata.apikey`|*************************************** - - +|Proxy host or IP|`-Dhttp.proxyHost`|proxy.example.org or 192.168.1.10| +|Proxy port|`-Dhttp.proxyPort`|8080| +|The project [autosave](starting#autosaving) frequency|`-Drefine.autosave`|5 [minutes]| +|The workspace directory|`-Drefine.data_dir`|/| +|Development mode|`-Drefine.development`|false| +|Headless mode|`-Drefine.headless`|false| +|IP|`-Drefine.interface`|127.0.0.1| +|Domain name|`-Drefine.host`|mymachine.local| +|Port|`-Drefine.port`|3333| +|The application folder|`-Drefine.webapp`|main/webapp| +|New version notice|`-Drefine.display.new.version.notice`|true| +|Google Data Client ID|`-Dext.gdata.clientid`|000000000000-********************************.apps.googleusercontent.com| +|Google Data Client secret|`-Dext.gdata.clientsecret`|************************| +|Google Data API Key|`-Dext.gdata.apikey`|***************************************| The syntax is as follows: @@ -239,7 +243,7 @@ The syntax is as follows: Locate the `refine.l4j.ini` file, and insert lines in this way: -``` +```ini -Drefine.port=3334
-Drefine.interface=127.0.0.2 -Drefine.host=mymachine.local @@ -250,23 +254,24 @@ Locate the `refine.l4j.ini` file, and insert lines in this way: In `refine.ini`, use a similar syntax, but set multiple parameters within a single line starting with `JAVA_OPTIONS=`: -``` +```ini JAVA_OPTIONS=-Drefine.data_dir=C:\Users\user\Documents\OpenRefine\ -Drefine.port=3334 ``` + Locate the `info.plist`, and find the `array` element that follows the line -``` +```plist JVMOptions ``` Typically this looks something like: -``` +```plist JVMOptions -Xms256M @@ -277,14 +282,15 @@ Typically this looks something like: ``` To see this list with the Terminal, use this command: -``` + +```sh export OR_INFO="/Applications/OpenRefine.app/Contents/Info.plist" defaults read $OR_INFO JVMOptions ``` Add in values such as: -``` +```plist JVMOptions -Xms256M @@ -304,10 +310,12 @@ Add in values such as: ``` If the values aren’t already there, you can add them easily with this Terminal command: -``` + +```sh export OR_INFO="/Applications/OpenRefine.app/Contents/Info.plist" defaults write $OR_INFO JVMOptions -array-add "-Drefine.interface=192.168.0.10" ``` + This will not work if you already have the value defined, so whatch out for that. @@ -316,7 +324,7 @@ This will not work if you already have the value defined, so whatch out for that Locate the `refine.ini` file, and add `JAVA_OPTIONS=` before the `-Drefine.preference` declaration. 
You can un-comment and edit the existing suggested lines, or add lines: -``` +```ini JAVA_OPTIONS=-Drefine.autosave=2 JAVA_OPTIONS=-Drefine.port=3334 JAVA_OPTIONS=-Drefine.interface=192.168.0.10 @@ -332,7 +340,6 @@ JAVA_OPTIONS=-Dext.gdata.apikey=*************************************** - --- @@ -340,13 +347,13 @@ Refer to the [official Java documentation](https://docs.oracle.com/javase/8/docs ## The home screen {#the-home-screen} -When you first launch OpenRefine, you will see a screen with a menu on the left hand side that includes Create Project, Open Project, Import Project, and Language Settings. This is called the “home screen,” where you can manage your projects and general settings. +When you first launch OpenRefine, you will see a screen with a menu on the left hand side that includes Create Project, Open Project, Import Project, and Language Settings. This is called the “home screen,” where you can manage your projects and general settings. In the lower left-hand corner of the screen, you'll see Preferences, Help, and About. ### Language settings {#language-settings} -From the home screen, look in the options to the left for Language Settings. You can set your preferred interface language here. This language setting will persist until you change it again in the future. Languages are translated as a community effort; some languages are partially complete and default back to English where unfinished. Currently OpenRefine supports the following languages for 75% or more of the interface: +From the home screen, look in the options to the left for Language Settings. You can set your preferred interface language here. This language setting will persist until you change it again in the future. Languages are translated as a community effort; some languages are partially complete and default back to English where unfinished. 
Currently OpenRefine supports the following languages for 75% or more of the interface: * Cebuano * German @@ -371,7 +378,7 @@ We use Weblate to provide translations for the interface. You can check [our pro ### Preferences {#preferences} -In the bottom left corner of the screen, look for Preferences. At this time you can set preferences using a key/value pair: that is, selecting one of the keys below and setting a value for it. +In the bottom left corner of the screen, look for Preferences. At this time you can set preferences using a key/value pair: that is, selecting one of the keys below and setting a value for it. |Setting|Key|Value syntax|Default|Example|Version| |---|---|---|---|---|---| @@ -399,7 +406,7 @@ The project screen (or work screen) is where you will spend most of your time on ### The project bar {#the-project-bar} -The project bar runs across the very top of the project screen. It contains the the OpenRefine logo, the project title, and the project control buttons on the right side. +The project bar runs across the very top of the project screen. It contains the OpenRefine logo, the project title, and the project control buttons on the right side. At any time you can close your current project and go back to the home screen by clicking on the OpenRefine logo. If you’d like to open another project in a new browser tab or window, you can right-click on the logo and use “Open in a new tab.” You will lose [your current facets and view settings](#facetfilter) if you close your project (but data transformations will be saved in the [History](#history-undoredo) of the project). @@ -411,7 +418,7 @@ You can rename a project at any time by clicking inside the project title, which The Permalink allows you to return to a project at a specific view state - that is, with [facets and filters](facets) applied. The Permalink can help you pick up where you left off if you have to close your project while working with facets and filters.
It puts view-specific information directly into the URL: clicking on it will load this current-view URL in the existing tab. You can right-click and copy the Permalink URL to copy the current view state to your clipboard, without refreshing the tab you’re using. -The Open… button will open up a new browser tab showing the Create Project screen. From here you can change settings, start a new project, or open an existing project. +The Open… button will open up a new browser tab showing the Create Project screen. From here you can change settings, start a new project, or open an existing project. Export is a dropdown menu that allows you to pick a format for exporting a dataset. Many of the export options will only export rows and records that are currently visible - the currently selected facets and filters, not the total data in the project. @@ -419,13 +426,13 @@ The Open… button will open up a new browser tab ### The grid header {#the-grid-header} -The grid header sits below the project bar and above the project grid (where the data of your project is displayed). The grid header will tell you the total number of rows or records in your project, and indicate whether you are in [rows or records mode](exploring#rows-vs-records). +The grid header sits below the project bar and above the project grid (where the data of your project is displayed). The grid header will tell you the total number of rows or records in your project, and indicate whether you are in [rows or records mode](exploring#rows-vs-records). 
-It will also tell you if you’re currently looking at a select number of rows via facets or filtering, rather than the entire dataset, by displaying either, for example, “180 rows” or “67 matching rows (180 total).” +It will also tell you if you’re currently looking at a select number of rows via facets or filtering, rather than the entire dataset, by displaying either, for example, “180 rows” or “67 matching rows (180 total).” -Directly below the row number, you have the ability to switch between [row mode and records mode](exploring#rows-vs-records). OpenRefine stores projects persistently in one of the two modes, and displays your data as records by default if you are. +Directly below the row number, you have the ability to switch between [row mode and records mode](exploring#rows-vs-records). OpenRefine stores projects persistently in one of the two modes, and displays your data as records by default if you are. -To the right of the rows/records selection is the array of options for how many rows/records to view on screen at one time. At the far right of the screen you can navigate through your entire dataset one page at a time. +To the right of the rows/records selection is the array of options for how many rows/records to view on screen at one time. At the far right of the screen you can navigate through your entire dataset one page at a time. ### Extensions {#extensions} @@ -433,15 +440,15 @@ The Extensions dropdown offers you options for ex ### The grid {#the-grid} -The area of the project screen that displays your dataset is called the “grid” (or the “data grid,” or the “project grid”). The grid presents data in a tabular format, which may look like a normal spreadsheet program to you. +The area of the project screen that displays your dataset is called the “grid” (or the “data grid,” or the “project grid”). The grid presents data in a tabular format, which may look like a normal spreadsheet program to you. 
-Columns widths are automatically set based on their contents; some column headers may be cut off, but can be viewed by mousing over the headers. +Column widths are automatically set based on their contents; some column headers may be cut off, but can be viewed by mousing over the headers. In each column header you will see a small arrow. Clicking on this arrow brings up a dropdown menu containing column-specific data exploration and transformation options. You will learn about each of these options in the [Exploring data](exploring) and [Transforming data](transforming) sections. -The first column in every project will always be All, which contains options to flag, star, and do non-column-specific operations. The All column is also where rows/records are numbered. Numbering shows the permanent order of rows and records; a temporary sorting or facet may reorder the rows or show a limited set, but numbering will show you the original identifiers unless you make a permanent change. +The first column in every project will always be All, which contains options to flag, star, and do non-column-specific operations. The All column is also where rows/records are numbered. Numbering shows the permanent order of rows and records; a temporary sorting or facet may reorder the rows or show a limited set, but numbering will show you the original identifiers unless you make a permanent change. -The project grid may display with both vertical and horizontal scrolling, depending on the number and width of columns, and the number of rows/records displayed. You can control the display of the project grid by using [Sort and View options](exploring#sort-and-view). +The project grid may display with both vertical and horizontal scrolling, depending on the number and width of columns, and the number of rows/records displayed. You can control the display of the project grid by using [Sort and View options](sortview).
Mousing over individual cells will allow you to [edit cells individually](cellediting#edit-one-cell-at-a-time). @@ -449,11 +456,11 @@ Mousing over individual cells will allow you to [edit cells individually](celled The Facet/Filter tab is one of the main ways of exploring your data: displaying the patterns and trends in your data, and helping you narrow your focus and modify that data. [Facets](facets) and [filters](facets#text-filter) are explained more in [Exploring data](exploring). -![A screenshot of facets and filters in action.](/img/facetfilter.png) +![A screenshot of facets and filters in action.](/img/facetfilter.png) -In the tab, you will see three buttons: Refresh, Reset all, and Remove all. +In the tab, you will see three buttons: Refresh, Reset all, and Remove all. -Refreshing your facets will ensure you are looking at the latest information about each facet, for example if you have changed the counts or eliminated some options. +Refreshing your facets will ensure you are looking at the latest information about each facet, for example if you have changed the counts or eliminated some options. Resetting your facets will remove any inclusion or exclusion you may have set - the facet options will stay in the sidebar, but your view settings will be undone. @@ -465,34 +472,34 @@ You can preserve your facets and filters for future use by copying a Undo / Redo tab in the sidebar of any project, that project’s history is shown as a list of changes in order, with the first “change” being the action of creating the project itself. (That first change, indexed as step zero, cannot be undone.) Here is a sample history with 3 changes: -``` +```text 0. Create project 1. Remove 7 rows 2. Create new column Last Name based on column Name with grel:value.split(" ") 3. Split 230 cell(s) in column Address into several columns by separator ``` -The current state of the project is highlighted with a dark blue background. 
If you move back and forth on the timeline you will see the current state become highlighted, while the actions that came after that state will be grayed out. +The current state of the project is highlighted with a dark blue background. If you move back and forth on the timeline you will see the current state become highlighted, while the actions that came after that state will be grayed out. To revert your data back to an earlier state, simply click on the last action in the timeline you want to keep. In the example above, if we keep the removal of 7 rows but revert everything we did after that, then click on “Remove 7 rows.” The last 2 changes will be undone, in order to bring the project back to state #1. -In this example, changes #2 and #3 will now be grayed out. You can redo a change by clicking on it in the history - everything up to and including it will be redone. +In this example, changes #2 and #3 will now be grayed out. You can redo a change by clicking on it in the history - everything up to and including it will be redone. -If you have moved back one or more states, and then you perform a new operation on your data, the later actions (everything that’s greyed out) will be erased and cannot be re-applied. +If you have moved back one or more states, and then you perform a new operation on your data, the later actions (everything that’s greyed out) will be erased and cannot be re-applied. The Undo/Redo tab will indicate which step you’re on, and if you’re about to risk erasing work - by saying something like “4/5" or “1/7” at the end. #### Reusing operations {#reusing-operations} -Operations that you perform in OpenRefine can be reused. For example, a formula you wrote inside one project can be copied and applied to another project later. +Operations that you perform in OpenRefine can be reused. For example, a formula you wrote inside one project can be copied and applied to another project later. 
-To reuse one or more operations, first extract it from the project where it was first applied. Click to the Undo/Redo tab and click Extract…. This brings up a box that lists all operations up to the current state (it does not show undone operations). Select the operation or operations you want to extract using the checkboxes on the left, and they will be encoded as JSON on the right. Copy that JSON to the clipboard. +To reuse one or more operations, first extract them from the project where they were first applied. Go to the Undo/Redo tab and click Extract…. This brings up a box that lists all operations up to the current state (it does not show undone operations). Select the operation or operations you want to extract using the checkboxes on the left, and they will be encoded as JSON on the right. Copy that JSON to the clipboard. Move to the second project, go to the Undo/Redo tab, click Apply… and paste in that JSON. @@ -514,24 +521,27 @@ By default (and for security reasons), OpenRefine only listens to TCP requests c In these examples below, `mymachine.local` is used as the hostname, which must be a valid domain name that either resolve through the DNS server, or that is defined in the `hosts` file of your machine. -``` +```sh ./refine -i 0.0.0.0 -H mymachine.local ``` or set this option in `refine.ini`: -``` + +```ini REFINE_INTERFACE=0.0.0.0 REFINE_HOST=mymachine.local ``` or set this JVM option: + ``` -Drefine.interface=0.0.0.0 -Drefine.host=mymachine.local ``` On Mac, you can add a specific entry to the `Info.plist` file located within the app bundle (`/Applications/OpenRefine.app/Contents/Info.plist`): -``` + +```plist JVMOptions @@ -547,10 +557,10 @@ OpenRefine has no built-in security or version control for multi-user scenarios. ### Automating OpenRefine {#automating-openrefine} -Some users may wish to employ OpenRefine for batch processing as part of a larger automated pipeline. Not all OpenRefine features can work without human supervision and advancement (such as clustering), but many data transformation tasks can be automated. +Some users may wish to employ OpenRefine for batch processing as part of a larger automated pipeline.
Not all OpenRefine features can work without human supervision and advancement (such as clustering), but many data transformation tasks can be automated. +Some users may wish to employ OpenRefine for batch processing as part of a larger automated pipeline. Not all OpenRefine features can work without human supervision and advancement (such as clustering), but many data transformation tasks can be automated. :::caution -The following are all third-party extensions and code; the OpenRefine team does not maintain them and cannot guarantee that any of them work. +The following are all third-party extensions and code; the OpenRefine team does not maintain them and cannot guarantee that any of them work. ::: Some examples: @@ -559,6 +569,6 @@ Some examples: * This project allows OpenRefine to be run from the command line using [operations saved in a JSON file](#reusing-operations): [OpenRefine batch processing](https://github.com/opencultureconsulting/openrefine-batch) * A Python project for applying a JSON file of operations to a data file, outputting the new file, and deleting the temporary project, written by David Huynh and Max Ogden: [Python client library for Google Refine](https://github.com/maxogden/refine-python) * And the same in Ruby: [Refine-Ruby](https://github.com/maxogden/refine-ruby) -* Another Python client library, by Paul Makepeace: [OpenRefine Python Client Library](https://github.com/PaulMakepeace/refine-client-py) +* Another Python client library, by Paul Makepeace: [OpenRefine Python Client Library](https://github.com/PaulMakepeace/refine-client-py) To look for other instances, search our former Google Groups [for users](https://groups.google.com/g/openrefine) and [for developers](https://groups.google.com/g/openrefine-dev), where [these projects were originally posted](https://groups.google.com/g/openrefine/c/GfS1bfCBJow/m/qWYOZo3PKe4J). 
diff --git a/docs/manual/starting.md b/docs/manual/starting.md index d1920d22..fb49f47d 100644 --- a/docs/manual/starting.md +++ b/docs/manual/starting.md @@ -8,37 +8,37 @@ sidebar_label: Starting a project An OpenRefine project is started by importing in some existing data - OpenRefine doesn’t allow you to create a dataset from nothing. -No matter where your data comes from, OpenRefine won’t modify your original data source. It copies all the information from your input, creates its own project file, and stores it in your [workspace directory](installing#set-where-data-is-stored). +No matter where your data comes from, OpenRefine won’t modify your original data source. It copies all the information from your input, creates its own project file, and stores it in your [workspace directory](installing#set-where-data-is-stored). -The data and all of your edits are [automatically saved](#autosaving) inside the project file. When you’re finished modifying the data, you can [export it back out](exporting) into the file format of your choice. +The data and all of your edits are [automatically saved](#autosaving) inside the project file. When you’re finished modifying the data, you can [export it back out](exporting) into the file format of your choice. -You can also receive and open other people’s projects, or send them yours, by [exporting a project archive](exporting#export-a-project) and [importing it](#import-a-project). +You can also receive and open other people’s projects, or send them yours, by [exporting a project archive](exporting#export-a-project) and [importing it](#import-a-project). ## Create a project by importing data {#create-a-project-by-importing-data} -When you start OpenRefine, you’ll be taken to the Create Project screen. You’ll see on the left side of the screen that your options are to: +When you start OpenRefine, you’ll be taken to the Create Project screen. 
You’ll see on the left side of the screen that your options are to: -* import data from one or more files on your computer -* import data from one or more links on the web -* import data by pasting in text from your clipboard -* import data from a database (using SQL), and -* import one or more Sheets from Google Drive. +* import data from one or more files on your computer +* import data from one or more links on the web +* import data by pasting in text from your clipboard +* import data from a database (using SQL), and +* import one or more Sheets from Google Drive. From these sources, you can load any of the following file formats: -* comma-separated values (CSV) or text-separated values (TSV) -* Text files -* Fixed-width columns -* JSON -* XML -* OpenDocument spreadsheet (ODS) -* Excel spreadsheet (XLS or XLSX) -* PC-Axis (PX) -* MARC -* RDF data (JSON-LD, N3, N-Triples, Turtle, RDF/XML) -* Wikitext +* comma-separated values (CSV) or text-separated values (TSV) +* Text files +* Fixed-width columns +* JSON +* XML +* OpenDocument spreadsheet (ODS) +* Excel spreadsheet (XLS or XLSX) +* PC-Axis (PX) +* MARC +* RDF data (JSON-LD, N3, N-Triples, Turtle, RDF/XML) +* Wikitext -More formats can be imported by [adding extensions to provide that functionality](/extensions). +More formats can be imported by [adding extensions to provide that functionality](/extensions). If you supply two or more files for one project, the files’ rows will be loaded in the order that you specify, and OpenRefine will create a column at the beginning of the dataset with the source URL or file name in it to help you identify where each row came from. 
If the files have columns with identical names, the data will load in those columns; if not, the successive files will append all of their new columns to the end of the dataset: @@ -49,31 +49,31 @@ If you supply two or more files for one project, the files’ rows will be loade |berries.csv||9|Mulberry|Greece| |berries.csv||2|Blueberry|Canada| -You cannot combine two datasets into one project by appending data within rows. You can, however, combine two projects later using functions such as [cross()](grelfunctions/#crosscell-s-projectname-s-columnname), or [fetch further data](columnediting) using other methods. +You cannot combine two datasets into one project by appending data within rows. You can, however, combine two projects later using functions such as [cross()](grelfunctions/#cross), or [fetch further data](columnediting) using other methods. For whichever method you choose to start your project, when you click Next >> you will be given a preview and a chance to configure the way OpenRefine interprets the data you input. ### Get data from this computer {#get-data-from-this-computer} -Click on Browse… and select a file (or several) on your hard drive. All files will be shown, not just compatible ones. +Click on Browse… and select a file (or several) on your hard drive. All files will be shown, not just compatible ones. -If you import an archive file (something with the extension `.zip`, `.tar.gz`, `.tgz`, `.tar.bz2`, `.gz`, or `.bz2`), OpenRefine detects the files inside it, shows you a preview screen, and allows you to select which ones to load. This does not work with `.rar` files. When importing multiple archives you can store the name of the archive each file was extracted from by ticking the `Store archive file` option upon import. 
+If you import an archive file (something with the extension `.zip`, `.tar.gz`, `.tgz`, `.tar.bz2`, `.gz`, or `.bz2`), OpenRefine detects the files inside it, shows you a preview screen, and allows you to select which ones to load. This does not work with `.rar` files. When importing multiple archives you can store the name of the archive each file was extracted from by ticking the `Store archive file` option upon import. ### Web addresses (URLs) {#web-addresses-urls} -Type or paste the URL to a data file into the field provided. You can add as many fields as you want. OpenRefine will download the file and preview the project for you. +Type or paste the URL to a data file into the field provided. You can add as many fields as you want. OpenRefine will download the file and preview the project for you. -If you supply two or more file URLs, OpenRefine will identify each one and ask you to choose which (or all) to load. +If you supply two or more file URLs, OpenRefine will identify each one and ask you to choose which (or all) to load. -Do not use this form to load a Google Sheet by its link; use [the Google Data form instead](#google-data). +Do not use this form to load a Google Sheet by its link; use [the Google Data form instead](#google-data). ### Clipboard {#clipboard} -You can copy and paste in data from anywhere. OpenRefine will recognize comma-separated, tab-separated, or table-formatted information copied from sources such as word-processing documents, spreadsheets, and tables in PDFs. You can also just paste in a list of items that you want to turn into rows. OpenRefine recognizes each new text line as a row. +You can copy and paste in data from anywhere. OpenRefine will recognize comma-separated, tab-separated, or table-formatted information copied from sources such as word-processing documents, spreadsheets, and tables in PDFs. You can also just paste in a list of items that you want to turn into rows. OpenRefine recognizes each new text line as a row. 
-This can be useful if you want to pre-select a specific number of rows from your source data, or paste together rows from different places, rather than delete unwanted rows later in the project interace. +This can be useful if you want to pre-select a specific number of rows from your source data, or paste together rows from different places, rather than delete unwanted rows later in the project interface. -This can also be useful if you would like to paste in a list of URLs, which you can use later to [fetch more data](columnediting). +This can also be useful if you would like to paste in a list of URLs, which you can use later to [fetch more data](columnediting). ### Database (SQL) {#database-sql} @@ -94,8 +94,9 @@ If your connection is successful, you will see a Query Editor where you can run ### Google data {#google-data} You have two ways to load in data from Google Sheets: -* providing a link to an accessible Google Sheet (that is, one with link-sharing turned on), and -* selecting a Google Sheet in your Google Drive. + +* providing a link to an accessible Google Sheet (that is, one with link-sharing turned on), and +* selecting a Google Sheet in your Google Drive. #### Google Sheet by URL {#google-sheet-by-url} @@ -111,57 +112,55 @@ This will only work with Sheets, not with any other Google Drive file that might You can authorize OpenRefine to access your Google Drive data and import data from any Google Sheet it finds there. This will include Sheets that belong to you and Sheets that are shared with you, as well as Sheets that are in your trash. -When you select a Google option (either here, or [when exporting project data to Google Drive or Google Sheets](exporting), you will see a pop-up window that asks you to select a Google account to authorize with. You may see an error message when you authorize: if so, try your import or export operation again and it should succeed.
+When you select a Google option (either here, or [when exporting project data to Google Drive or Google Sheets](exporting)), you will see a pop-up window that asks you to select a Google account to authorize with. You may see an error message when you authorize: if so, try your import or export operation again and it should succeed. OpenRefine will not show spreadsheets that are in your email inbox or stored in any other Google property - only in Drive. It also won’t show all compatible file formats, only Sheets files. -OpenRefine will generate a list of all Sheets it finds, with the most recently modified Sheets at the top. If a file you’ve just added isn’t showing in this list, you can close and restart OpenRefine, or simply navigate to an existing project, open it, then head back to the Create Project window and check again. +OpenRefine will generate a list of all Sheets it finds, with the most recently modified Sheets at the top. If a file you’ve just added isn’t showing in this list, you can close and restart OpenRefine, or simply navigate to an existing project, open it, then head back to the Create Project window and check again. When you click Preview the Sheet will open in a new browser tab. When you click the Sheet title, OpenRefine will begin to process the data. - ## Project preview {#project-preview} -Once OpenRefine is ready to import the data, you will see a screen with Configure Parsing Options at the top. You’ll see a preview of the first 100 rows and all identified columns. +Once OpenRefine is ready to import the data, you will see a screen with Configure Parsing Options at the top. You’ll see a preview of the first 100 rows and all identified columns. At the bottom of the screen you will find options for telling OpenRefine how to process what it has found. You can tell it which row(s) to parse as column headers, as well as to ignore any number of rows at the top. 
You can also select a specific range of rows to work with, by discarding some rows at the top (excluding the header) and limiting the total number of rows it loads. OpenRefine tries to guess how to parse your data based on the file extension. For example, `.xml` files are going to be parsed as though they are formatted in XML. An unknown file extension (or your clipboard copy-paste) is assumed to be either tab-separated or comma-separated. OpenRefine looks for a tab character, and if one is found, it assumes you have imported tab-separated data. -If OpenRefine isn’t certain what format you imported, it will provide a list of possibilities under Parse data as and some settings. You can specify a custom separator now, or split columns later while [transforming your data](transforming). +If OpenRefine isn’t certain what format you imported, it will provide a list of possibilities under Parse data as and some settings. You can specify a custom separator now, or split columns later while [transforming your data](transforming). -If you imported a spreadsheet with multiple worksheets, they will be listed along with the number of rows they contain. You can only select data from one worksheet. +If you imported a spreadsheet with multiple worksheets, they will be listed along with the number of rows they contain. You can only select data from one worksheet. -Note that OpenRefine does not preserve any formatting, such as cell or text colour, that my have been in the original data file. Hyperlinked text will be input as plain text, but OpenRefine will recognize links and make them clickable inside the project interface. +Note that OpenRefine does not preserve any formatting, such as cell or text colour, that may have been in the original data file. Hyperlinked text will be input as plain text, but OpenRefine will recognize links and make them clickable inside the project interface. :::info Encoding issues? -Look for character encoding issues at this stage.
You may want to manually select an encoding, such as UTF-8, UTF-16, or ASCII, if OpenRefine does not display some characters correctly in the preview. Once your project is created, you can specify another encoding for specific columns using the [reinterpret() function](grelfunctions#reinterprets-s-encoder). +Look for character encoding issues at this stage. You may want to manually select an encoding, such as UTF-8, UTF-16, or ASCII, if OpenRefine does not display some characters correctly in the preview. Once your project is created, you can specify another encoding for specific columns using the [reinterpret() function](grelfunctions#reinterpret). ::: You should create a project name at this stage. You can also supply tags to keep your projects organized. When you’re happy with the preview, click Create Project. - ## Import a project {#import-a-project} -Because OpenRefine only runs locally on your computer, you can’t have a project accessible to more than one person at the same time. +Because OpenRefine only runs locally on your computer, you can’t have a project accessible to more than one person at the same time. The best way to collaborate with another person is to export and import projects that save all your changes, so that you can pick up where someone else left off. You can also [export projects](exporting#export-a-project) and import them to other computers, such as for working on the same project from the office and from home. -An exported project will include all of the [history](running#history-undoredo), so you can see (and undo) all the changes from the previous user. It is essentially a point-in-time snapshot of their work. OpenRefine only exports projects as `.tar.gz` files at this time. +An exported project will include all of the [history](running#history-undoredo), so you can see (and undo) all the changes from the previous user. It is essentially a point-in-time snapshot of their work. 
OpenRefine only exports projects as `.tar.gz` files at this time. + :::caution If you wish to hide the original state of your data and your history of edits (for example, if you are using OpenRefine to anonymize information), export your cleaned dataset only and do not share your project archive. ::: -Once someone has sent you a project archive file from their computer, you can save it anywhere. OpenRefine will import it like a new project and save its information to your workspace directory. +Once someone has sent you a project archive file from their computer, you can save it anywhere. OpenRefine will import it like a new project and save its information to your workspace directory. -In the left-hand menu of the home screen, click Import Project. Click Browse… and navigate to wherever you saved the file you were sent (for example, your Downloads folder). +In the left-hand menu of the home screen, click Import Project. Click Browse… and navigate to wherever you saved the file you were sent (for example, your Downloads folder). You can rename the project if you’d like - we recommend adding your name, a date, or a version number, if you’re planning to continue collaborating with another person (or working from multiple computers). -Then, click Import Project. Your project should appear with a step count beside Undo/Redo if steps were saved by the exporter. - -OpenRefine will store the project in its own workspace directory, so you can now delete the original file that was sent to you. +Then, click Import Project. Your project should appear with a step count beside Undo/Redo if steps were saved by the exporter. +OpenRefine will store the project in its own workspace directory, so you can now delete the original file that was sent to you. ## Project management {#project-management} @@ -169,26 +168,26 @@ You can access all of your created projects by clicking on Clipboard importing. 
Project names don’t have to be unique, and OpenRefine will create many projects with the same name unless you intervene. +You may have multiple projects from the same dataset, or multiple versions from sharing a project with another person. OpenRefine automatically generates a project name from the imported file, or “clipboard” when you use Clipboard importing. Project names don’t have to be unique, and OpenRefine will create many projects with the same name unless you intervene. -You can edit a project's name when you create it or import it, and you can rename a project later by opening it and clicking on the project name at the top of the screen. +You can edit a project's name when you create it or import it, and you can rename a project later by opening it and clicking on the project name at the top of the screen. ### Autosaving {#autosaving} OpenRefine [saves all of your actions](running#history-undoredo) (everything you can see in the Undo/Redo panel). That includes flagging and starring rows. -It doesn’t, however, save your facets, filters, or any kind of view you may have in place while you work. This includes the number of rows showing, and any sorting or column collapsing you may have done. A good rule of thumb is: if it’s not showing in Undo/Redo, you will lose it when you leave the project workspace. +It doesn’t, however, save your facets, filters, or any kind of view you may have in place while you work. This includes the number of rows showing, and any sorting or column collapsing you may have done. A good rule of thumb is: if it’s not showing in Undo/Redo, you will lose it when you leave the project workspace. Autosaving happens by default every five minutes. You can [change this preference by following these directions](running#jvm-preferences). -You can only save and share facets and filters, not any other type of view. To save current facets and filters, click Permalink. 
The project will reload with a different URL, which you can then copy and save elsewhere. This permalink will save both the facets and filters you’ve set, and the settings for each one (such as sorting by count rather than by name). +You can only save and share facets and filters, not any other type of view. To save current facets and filters, click Permalink. The project will reload with a different URL, which you can then copy and save elsewhere. This permalink will save both the facets and filters you’ve set, and the settings for each one (such as sorting by count rather than by name). ### Deleting projects {#deleting-projects} You can delete projects, which will erase the project files from the workspace directory on your computer. This is immediate and cannot be undone. -Go to Open Project and find the project you want to delete. Click on the X to the left of the project name. There will be a confirmation dialog. +Go to Open Project and find the project you want to delete. Click on the X to the left of the project name. There will be a confirmation dialog. ### Project files {#project-files} -You can find all of your raw project files in your work directory. They will be named according to the unique “Project ID” that OpenRefine has assigned them, which you can find on the Open Project screen, under the “About” link for each project. +You can find all of your raw project files in your work directory. They will be named according to the unique “Project ID” that OpenRefine has assigned them, which you can find on the Open Project screen, under the “About” link for each project. diff --git a/docs/manual/transforming.md b/docs/manual/transforming.md index 53a43637..c1082c8b 100644 --- a/docs/manual/transforming.md +++ b/docs/manual/transforming.md @@ -6,29 +6,33 @@ sidebar_label: Overview ## Overview {#overview} -OpenRefine gives you powerful ways to clean, correct, codify, and extend your data. 
Without ever needing to type inside a single cell, you can automatically fix typos, convert things to the right format, and add structured categories from trusted sources. +OpenRefine gives you powerful ways to clean, correct, codify, and extend your data. Without ever needing to type inside a single cell, you can automatically fix typos, convert things to the right format, and add structured categories from trusted sources. This section of ways to improve data are organized by their appearance in the menu options in OpenRefine. You can: -* change the order of [rows](#edit-rows) or [columns](columnediting#rename-remove-and-move) -* edit [cell contents](cellediting) within a particular column -* [transform](transposing) rows into columns, and columns into rows -* [split or join columns](columnediting#split-or-join) -* [add new columns](columnediting) based on existing data, with fetching new information, or through [reconciliation](reconciling) -* convert your rows of data into [multi-row records](exploring#rows-vs-records). +* change the order of [rows](#edit-rows) or [columns](columnediting#renaming-removing-and-moving) +* edit [cell contents](cellediting) within a particular column +* [transform](transposing) rows into columns, and columns into rows +* [split or join columns](columnediting#splitting-or-joining) +* [add new columns](columnediting) based on existing data, with fetching new information, or through [reconciliation](reconciling) +* convert your rows of data into [multi-row records](exploring#rows-vs-records). +* [split or join multi-valued cells](cellediting#split-multi-valued-cells) + +and an especially **powerful** feature called + +* [Clustering](cellediting#cluster-and-edit) ## Edit rows {#edit-rows} -Moving rows around is a permanent change to your data. +Moving rows around is a permanent change to your data. -You can [sort your data](sortview#sort) based on the values in one column, but that change is a temporary view setting. 
With that setting applied, you can make that new order permanent. +You can [sort your data](sortview#sort) based on the values in one column, but that change is a temporary view setting. With that setting applied, you can make that new order permanent. ![A screenshot of where to find the Sort menu with a sorting applied.](/img/sortPermanent.png) -In the project grid header, the word “Sort” will appear when a sort operation is applied. Click on it to show the dropdown menu, and select Reorder rows permanently. You will see the numbering of the rows change under the All column. - +In the project grid header, the word “Sort” will appear when a sort operation is applied. Click on it to show the dropdown menu, and select Reorder rows permanently. You will see the numbering of the rows change under the All column. :::info Reordering all rows -Reordering rows permanently will affect all rows in the dataset, not just those currently viewed through [facets and filters](facets). +Reordering rows permanently will affect all rows in the dataset, not just those currently viewed through [facets and filters](facets). ::: -You can undo this action using the [History tab](running#history-undoredo). \ No newline at end of file +You can undo this action using the [History tab](running#history-undoredo). diff --git a/docs/manual/wikibase/overview.md b/docs/manual/wikibase/overview.md index 46b49a16..15f1b171 100644 --- a/docs/manual/wikibase/overview.md +++ b/docs/manual/wikibase/overview.md @@ -2,6 +2,7 @@ id: overview title: Overview of Wikibase support sidebar_label: Overview +remark-directive: --- [Wikibase](https://wikiba.se/) is free software (a set of MediaWiki extensions) used by many organizations around the world to store and publish Linked Open Data. Wikibase is the software behind [Wikidata](https://www.wikidata.org/), a free, multilingual collaborative knowledge base and a sister project of Wikipedia. 
Wikidata offers structured data about the world and can be edited by anyone. Wikibase also powers [structured data](https://commons.wikimedia.org/wiki/Commons:Structured_data) on [Wikimedia Commons](https://commons.wikimedia.org/), the media repository of Wikipedia. @@ -20,11 +21,12 @@ Wikidata is built by creating entities (such as people, organizations, or places For example, you may wish to create entities for local authors and the books they've set in your community. Each writer will be an entity with the occupation [author (Q482980)](https://www.wikidata.org/wiki/Q482980), each book will be an entity with the property “instance of” ([P31](https://www.wikidata.org/wiki/Property:P31)) linking it to a class such as [literary work (Q7725634)](https://www.wikidata.org/wiki/Q7725634), and books will be related to authors through a property [author (P50)](https://www.wikidata.org/wiki/Property:P50). Books can have places where they are set, with the property [narrative location (P840)](https://www.wikidata.org/wiki/Property:P840). -To do this with OpenRefine, you'll need a column of publication titles that you have reconciled (and create new items where needed); each publication will have one or more locations in a “setting” column, which is also reconciled to municipalities or regions where they exist (and create new items where needed). Then you can add those new relationships, and create new entities for authors, books, and places where needed. You do not need columns for properties; those are defined later, in the creation of your [schema](#edit-wikidata-schema). +To do this with OpenRefine, you'll need a column of publication titles that you have reconciled (and create new items where needed); each publication will have one or more locations in a “setting” column, which is also reconciled to municipalities or regions where they exist (and create new items where needed). 
Then you can add those new relationships, and create new entities for authors, books, and places where needed. You do not need columns for properties; those are defined later, in the creation of your [schema](#wikidata-schema). There is a list of [tutorials and walkthroughs on Wikidata](https://www.wikidata.org/wiki/Wikidata:Tools/OpenRefine/Editing) that will allow you to see the full process. You can save your schemas and drafts in OpenRefine, and your progress stays in draft until you are ready to upload it to Wikidata. Batches of edits to Wikidata that are created with OpenRefine can be undone. You can test out the uploading process by reconciling to several “sandbox” entities created specifically for drafting edits and learning about Wikidata: + * https://www.wikidata.org/wiki/Q4115189 * https://www.wikidata.org/wiki/Q13406268 * https://www.wikidata.org/wiki/Q15397819 @@ -37,10 +39,11 @@ You can use OpenRefine's reconciliation preview to look at the target Wikidata e ### Wikidata schema {#wikidata-schema} A [schema](https://en.wikipedia.org/wiki/Database_schema) is a plan for how to structure information in a database. In OpenRefine, the schema operates as a template for how Wikidata edits should be applied: how to translate your tabular data into statements. With a schema, you can: -* preview the Wikidata edits and inspect them manually; -* analyze and fix any issues highlighted by OpenRefine; -* upload your changes to Wikidata by logging in with your own account; -* export the changes to the QuickStatements v1 format. + +* preview the Wikidata edits and inspect them manually; +* analyze and fix any issues highlighted by OpenRefine; +* upload your changes to Wikidata by logging in with your own account; +* export the changes to the QuickStatements v1 format. 
For example, if your dataset has columns for authors, publication titles, and publication years, your schema can be conceptualized as: [publication title] has the author [author], and was published in [publication year]. To establish these facts, you need to establish one or more columns as “items,” for which you will make “statements” that relate them to other columns. @@ -82,6 +85,7 @@ You could upload the “Translated titles” to “Label” with the language sp #### Unsupported field types With OpenRefine, it is not yet possible to edit: + * sitelinks (links to Wikipedia or other Wikimedia projects, in the case of Wikidata); * any field on Wikibase properties; * lexemes, forms or senses. @@ -95,6 +99,7 @@ Use the Extensions menu to select Edit columnAdd columns from reconciled values... -See [Add columns from reconciled values](./reconciling#add-columns-from-reconciled-values) for general information about this feature. +See [Add columns from reconciled values](../reconciling.md#add-columns-from-reconciled-values) for general information about this feature. ### Prepare columns with structured data diff --git a/docs/manual/wikibase/quality-assurance.md b/docs/manual/wikibase/quality-assurance.md index a8ead43e..3ae58530 100644 --- a/docs/manual/wikibase/quality-assurance.md +++ b/docs/manual/wikibase/quality-assurance.md @@ -23,9 +23,10 @@ You should always assess the quality of your reconciliation results first. OpenR ## Constraint violations {#constraint-violations} -Constraints are retrieved as defined on the properties, using [ (P2302)](https://www.wikidata.org/wiki/Property:P2302). +Constraints are retrieved as defined on the properties, using [(P2302)](https://www.wikidata.org/wiki/Property:P2302). 
The following constraints are supported: + * [format constraint (Q21502404)](https://www.wikidata.org/wiki/Q21502404), checked on all values * [inverse constraint (Q21510855)](https://www.wikidata.org/wiki/Q21510855): OpenRefine assumes that the inverses of the candidate statements are not in Wikidata yet. If you know that the inverse statements are already in Wikidata, you can safely ignore this issue. * [used for values only constraint (Q21528958)](https://www.wikidata.org/wiki/Q21528958), [used as qualifier constraint (Q21510863)](https://www.wikidata.org/wiki/Q21510863) and [used as reference constraint (Q21528959)](https://www.wikidata.org/wiki/Q21528959) @@ -39,6 +40,7 @@ A comparison of the supported constraints with respect to other implementations ## Generic issues {#generic-issues} OpenRefine also detects issues that are not flagged (yet) by constraint violations on Wikidata: + * Statements without references. This does not rely on [citation needed constraint (Q54554025)](https://www.wikidata.org/wiki/Q54554025): all statements are expected to have references. (The idea is that when importing a dataset, every statement you add * should link to this dataset - it does not hurt to do it even for generic properties such as [instance of (P31)](https://www.wikidata.org/wiki/Property:P31).) 
* Spurious whitespace and non-printable characters in strings (including labels, descriptions and aliases); diff --git a/docs/manual/wikibase/reconciling.md b/docs/manual/wikibase/reconciling.md index e1f2fc69..48ed7e1f 100644 --- a/docs/manual/wikibase/reconciling.md +++ b/docs/manual/wikibase/reconciling.md @@ -5,14 +5,15 @@ sidebar_label: Reconciling with Wikibase --- The Wikidata [reconciliation service](reconciling) for OpenRefine [supports](https://reconciliation-api.github.io/testbench/): -* A large number of potential types to reconcile against -* Previewing and viewing entities -* Suggesting entities, types, and properties -* Augmenting your project with more information pulled from Wikidata. + +- A large number of potential types to reconcile against +- Previewing and viewing entities +- Suggesting entities, types, and properties +- Augmenting your project with more information pulled from Wikidata. You can find documentation and further resources on the reconciliation API [here](https://wikidata.reconci.link/). -For the most part, Wikidata reconciliation behaves the same way other reconciliation services do, but there are a few processes and features specific to Wikidata. +For the most part, Wikidata reconciliation behaves the same way other reconciliation services do, but there are a few processes and features specific to Wikidata. ## Language settings {#language-settings} @@ -24,7 +25,7 @@ When reconciling using this interface, items and properties will be displayed in ## Restricting matches by type {#restricting-matches-by-type} -In Wikidata, types are items themselves. For instance, the [university of Ljubljana (Q1377)](https://www.wikidata.org/wiki/Q1377) has the type [public university (Q875538)](https://www.wikidata.org/wiki/Q875538), using the [instance of (P31)](https://www.wikidata.org/wiki/Property:P31) property. Types can be subclasses of other types, using the [subclass of (P279)](https://www.wikidata.org/wiki/Property:P279) property. 
For instance, [public university (Q875538)](https://www.wikidata.org/wiki/Q875538) is a subclass of [university (Q3918)](https://www.wikidata.org/wiki/Q3918). You can visualize these structures with the [Wikidata Graph Builder](https://angryloki.github.io/wikidata-graph-builder/). +In Wikidata, types are items themselves. For instance, the [university of Ljubljana (Q1377)](https://www.wikidata.org/wiki/Q1377) has the type [public university (Q875538)](https://www.wikidata.org/wiki/Q875538), using the [instance of (P31)](https://www.wikidata.org/wiki/Property:P31) property. Types can be subclasses of other types, using the [subclass of (P279)](https://www.wikidata.org/wiki/Property:P279) property. For instance, [public university (Q875538)](https://www.wikidata.org/wiki/Q875538) is a subclass of [university (Q3918)](https://www.wikidata.org/wiki/Q3918). You can visualize these structures with the [Wikidata Graph Builder](https://angryloki.github.io/wikidata-graph-builder/). When you select or enter a type for reconciliation, OpenRefine will include that type and all of its subtypes. For instance, if you select [university (Q3918)](https://www.wikidata.org/wiki/Q3918), then [university of Ljubljana (Q1377)](https://www.wikidata.org/wiki/Q1377) will be a possible match, though that item isn't directly linked to Q3918 - because it is directly linked to Q875538, the subclass of Q3918. @@ -32,7 +33,7 @@ Some items and types may not yet be set as an instance or subclass of anything ( ## Reconciling via unique identifiers {#reconciling-via-unique-identifiers} -You can supply a column of unique identifiers (in the form "Q###" for entities) directly to Wikidata in order to pull more data, but [these strings will not be “reconciled” against the external dataset](reconciling#reconciling-with-unique-identifiers). Apply the operation ReconcileUse values as identifiers on your column of QIDs. All cells will appear as dark blue “confirmed” matches. 
Some of the “matches” may be errors, which you will need to hover over or click on to identify. You cannot use this to reconcile properties (in the form "P###"). +You can supply a column of unique identifiers (in the form "Q###" for entities) directly to Wikidata in order to pull more data, but [these strings will not be “reconciled” against the external dataset](../reconciling.md#reconciling-with-unique-identifiers). Apply the operation Reconcile → Use values as identifiers on your column of QIDs. All cells will appear as dark blue “confirmed” matches. Some of the “matches” may be errors, which you will need to hover over or click on to identify. You cannot use this to reconcile properties (in the form "P###"). If the identifier you submit is assigned to multiple Wikidata items (because Wikidata is crowdsourced), all of the items are returned as candidates, with none automatically matched. @@ -42,14 +43,12 @@ Wikidata's hierarchical property structure can be called by using property paths Labels, aliases, descriptions and sitelinks can be accessed as follows (L for label , D for description, A for aliases, S for sitelink): - Len for Label in English - Dfi for Description in Finnish - Apt for Alias in Portuguese - Sdewiki for Sitelink in German Wikipedia page titles - Scommonswiki for Commons sitelink +- `Len` for **Label** in English +- `Dfi` for **Description** in Finnish +- `Apt` for **Alias** in Portuguese +- `Sdewiki` for **Sitelink** in German Wikipedia page titles +- `Scommonswiki` for Commons **Sitelink** The lowercase letters are Wikimedia language codes which select which language the terms will be fetched. No language fall-back is performed when retrieving the values. For information on how to do this, read the [documentation and further resources here](https://wikidata.reconci.link/#documentation).
- - diff --git a/docs/manual/wikibase/schema-alignment.md b/docs/manual/wikibase/schema-alignment.md index 6b77259a..0e1d8209 100644 --- a/docs/manual/wikibase/schema-alignment.md +++ b/docs/manual/wikibase/schema-alignment.md @@ -23,20 +23,20 @@ should be made is the same for all rows), or any reconciled column can be dropped in this field. In this case, the edits will depend on the reconciliation status of each cell: -- If the cell is matched to an item, edits will be made on that item; -- If the cell is marked as corresponding to a new item, a new item - will be created for it. See [New items](./new-entities) for more +- If the cell is matched to an item, edits will be made on that item; +- If the cell is marked as corresponding to a new item, a new item + will be created for it. See [New items](new-entities) for more details about how this works; -- If the cell has reconciliation candidates but has not been matched +- If the cell has reconciliation candidates but has not been matched to any of them, the edit will be skipped (even if there is only one candidate with a high reconciliation score); -- If the cell is not reconciled or blank, the edit will be skipped. +- If the cell is not reconciled or blank, the edit will be skipped. Do not worry about the ordering of items in the schema or the order of your rows, as OpenRefine will rearrange your edits to optimize their upload. If your project makes edits on the same item across multiple rows, these edits will be merged together and performed in one edit. See -[Uploading your changes](./uploading) about that. +[Uploading your changes](uploading) about that. ## Terms {#terms} @@ -52,12 +52,12 @@ are designated by language codes. For each term that you want to add to an item, you will need to specify the language for this term. 
There are two cases: -- Either the language is constant across your dataset: you know that +- Either the language is constant across your dataset: you know that all the names in a given column are spelled in the same language. In this case, type the name of the language in the input and select the language in the drop-down suggestion dialog. This will place the appropriate language code in the input. -- Or the language varies across your dataset. In this case, you need +- Or the language varies across your dataset. In this case, you need to provide a column of Wikimedia language codes that indicates the language for each term that you want to add. Just drag and drop this column to the language field. If there are any invalid language @@ -91,7 +91,7 @@ not possible to remove aliases or to override any existing aliases. You can add statements in the schema: this will generate new statements on the corresponding items. These statements will be merged with any -existing statements on the actual Wikibase items and [this merging process depends on the upload medium](./uploading#Merging-strategies-for-statements). +existing statements on the actual Wikibase items and [this merging process depends on the upload medium](uploading#merging-strategy-for-terms-and-statements). It is forecast to give more control over the merging strategy in the near future. @@ -127,33 +127,33 @@ reference will be discarded but the reference will still be added The editing mode of a statement determines how it contributes to the corresponding entity. OpenRefine offers three editing modes: -* **Add or merge**, which adds the statement or merges it with the first existing statement that matches it; -* **Add**, which only adds the statement if there are no matching statements on the entity. Otherwise, leave those statements untouched; -* **Delete**, which deletes all matching statements. 
+ - **Add or merge**, which adds the statement or merges it with the first existing statement that matches it; +- **Add**, which only adds the statement if there are no matching statements on the entity. Otherwise, it leaves those statements untouched; +- **Delete**, which deletes all matching statements. The way statements are matched is controlled by the matching strategy, which can be configured for each statement in the schema. ### Matching strategy {#matching-strategy} The matching strategy determines how the candidate statements generated by the schema are compared to the existing statements on the entity. OpenRefine offers three merging strategies: -* **Property**, which compares statements by their main property only. This means that any two statements using the same main property will be considered equivalent. For intance, using this merging strategy in conjunction with the **Delete** editing - mode will delete all statements with a particular main property on the target entity. -* **Property and value**, which compares statements by their main property and main value only. This is what QuickStatements does. In addition, it is possible (and enabled by default) to match statement values in a lax way, for instance to ignore - differences in trailing whitespace or rounding of quantities. -* **Qualifiers**, which compare statements using their property, main value and qualifiers. It is possible to define a list of property identifiers which determines which qualifiers are discriminating. Other qualifiers will not be taken into account when - comparing statements. By default, all qualifiers are taken into account. This matching strategy also supports lax value matching. + +- **Property**, which compares statements by their main property only. This means that any two statements using the same main property will be considered equivalent.
For instance, using this merging strategy in conjunction with the **Delete** editing mode will delete all statements with a particular main property on the target entity. +- **Property and value**, which compares statements by their main property and main value only. This is what QuickStatements does. In addition, it is possible (and enabled by default) to match statement values in a lax way, for instance to ignore differences in trailing whitespace or rounding of quantities. +- **Qualifiers**, which compares statements using their property, main value and qualifiers. It is possible to define a list of property identifiers which determines which qualifiers are discriminating. Other qualifiers will not be taken into account when comparing statements. By default, all qualifiers are taken into account. This matching strategy also supports lax value matching. These matching strategies are not honoured when exporting to QuickStatements, as the QuickStatements formats do not make it possible to represent them. #### Lax value matching {#lax-value-matching} When lax value matching is enabled, the following values are considered equal for statement matching purposes: -* strings which differ by whitespace at the beginning or end (such as ` Berlin` and `Berlin `); -* URLs which differ by trailing slash or `http` / `https` differences (such as `http://wikiba.se` and `https://wikiba.se/`); -* quantities with the same unit, whose uncertainty domain overlap (such as `47±1` and `48±0.5`); -* geographical coordinates whose uncertainty domain overlap (note that since the uncertainty of geographical coordinates is expressed in degrees, this does not guarantee a distance threshold below which the coordinates will match); -* monolingual text values whose values differ by leading or trailing whitespace; -* dates which differ in attributes which are rendered irrelevant by the lowest precision of both values to compare (such as `1976-01-01` and `1976`).
+ - strings which differ by whitespace at the beginning or end (such as ` Berlin` and `Berlin `); +- URLs which differ by trailing slash or `http` / `https` differences (such as `http://wikiba.se` and `https://wikiba.se/`); +- quantities with the same unit, whose uncertainty domains overlap (such as `47±1` and `48±0.5`); +- geographical coordinates whose uncertainty domains overlap (note that since the uncertainty of geographical coordinates is expressed in degrees, this does not guarantee a distance threshold below which the coordinates will match); +- monolingual text values whose values differ by leading or trailing whitespace; +- dates which differ only in attributes that are rendered irrelevant by the lower precision of the two values being compared (such as `1976-01-01` and `1976`). ### Ranks {#ranks} @@ -189,9 +189,9 @@ null. Monolingual texts consist of two parts: -- the language: see [Languages](#languages) for their +- the language: see [Languages](#languages) for their structure; -- the value of the text: see [the section above](#strings-and-external-identifiers). +- the value of the text: see [the section above](#strings-and-external-identifiers). A monolingual text is skipped when any of its parts is skipped (that is, if the language or the text are invalid). @@ -202,37 +202,37 @@ Dates are parsed from cell contents (or from any constant provided in the schema) and the precision of the date is inferred from its format.
Here are the valid formats: -- `YYYYM`, such as `2001M` (millenium precision) -- `YYYYC`, such as `1901C` (century precision) -- `YYYYD`, such as `1981D` (decade precision) -- `YYYY`, such as `1984` (year precision) -- `YYYY-MM`, such as `2019-03` (month precision) -- `YYYY-MM-DD`, such as `1897-08-14` (day precision) +- `YYYYM`, such as `2001M` (millennium precision) +- `YYYYC`, such as `1901C` (century precision) +- `YYYYD`, such as `1981D` (decade precision) +- `YYYY`, such as `1984` (year precision) +- `YYYY-MM`, such as `2019-03` (month precision) +- `YYYY-MM-DD`, such as `1897-08-14` (day precision) Any value that does not match any of these formats will be ignored. All dates are represented in UTC, Gregorian calendar. In OpenRefine 3.3, the following new formats have been introduced: -- `TODAY` returns today's date with day precision. This will be +- `TODAY` returns today's date with day precision. This will be evaluated when performing the edits (or exporting to QuickStatements); -- `YYYY-MM-DD_QID` can be used to specify a date in a particular - calendar (such as the [proleptic Julian calendar (Q1985786)](https://www.wikidata.org/wiki/Q1985786). +- `YYYY-MM-DD_QID` can be used to specify a date in a particular + calendar (such as the [proleptic Julian calendar (Q1985786)](https://www.wikidata.org/wiki/Q1985786)). In OpenRefine 3.5, the following new format has been introduced: -- `-234` represents the year 234 [BCE](https://en.wikipedia.org/wiki/Common_Era) +- `-234` represents the year 234 [BCE](https://en.wikipedia.org/wiki/Common_Era) :::tip -See also https://www.wikidata.org/wiki/Help:Dates +See also [Wikidata Help:Dates](https://www.wikidata.org/wiki/Help:Dates) ::: ### Quantities {#quantities} Quantities consist of two parts: the amount and the unit. -- the amount is mandatory and must be a string, such as `18,229.1020`.
The precision that is displayed will be respected (the same number of trailing zeros will be shown in Wikibase). By default, no upper and lower bounds will be set. To define these, one needs to use the @@ -240,7 +240,7 @@ Quantities consist of two parts: the amount and the unit. as `3,450±5`. As usual, the amount can be provided as a constant or as a column variable. In the latter case, the values in the column must be strings. -- the unit is optional. It is an item, so it can be provided either +- the unit is optional. It is an item, so it can be provided either with the auto-suggest dialog or as a reconciled column. It is important to note that if a reconciled column is used, any unreconciled cells will discard the entire quantity value. So a @@ -249,16 +249,19 @@ Quantities consist of two parts: the amount and the unit. ### Globe coordinates {#globe-coordinates} +:::note +OpenRefine has [extensions](http://openrefine.org/download.html) that can sometimes help with Geo shaped data +::: + Geographic coordinates are specified as strings with the following formats, where all components are floating point numbers in degrees: -- `latitude,longitude` for a default precision of ten micro degrees +- `latitude,longitude` for a default precision of ten micro degrees (for instance: [`49.265278,4.028611`](https://tools.wmflabs.org/geohack/geohack.php?params=49.265277777778_N_4.0286111111111_E_globe:earth&language=en) can be used indicate the position of Reims, France. - -- `latitude,longitude,precision` when specifying an explicit precision +- `latitude,longitude,precision` when specifying an explicit precision (for instance: `49.265278,4.028611,0.1` can be used indicate the position of Reims within a tenth of a degree). 
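The two globe coordinate formats described in the hunk above can be parsed roughly as follows. This is a minimal sketch and not OpenRefine's own implementation: the function name `parseGlobeCoordinate`, the returned object shape, and the choice to return `null` for malformed input are all assumptions made for illustration. The default precision of `0.00001` is simply "ten micro degrees" written out.

```javascript
// Hypothetical helper (not part of OpenRefine): parses "latitude,longitude"
// or "latitude,longitude,precision", with all components in degrees.
const DEFAULT_PRECISION = 0.00001; // ten micro degrees, per the documentation
const DECIMAL = /^[-+]?\d+(\.\d+)?$/; // plain decimal numbers only (assumption)

function parseGlobeCoordinate(text) {
  const parts = String(text).split(",").map((part) => part.trim());
  // Only the two- and three-component forms are accepted.
  if (parts.length !== 2 && parts.length !== 3) {
    return null;
  }
  // Reject empty or non-numeric components.
  if (!parts.every((part) => DECIMAL.test(part))) {
    return null;
  }
  // A missing third component falls back to the default precision.
  const [latitude, longitude, precision = DEFAULT_PRECISION] = parts.map(Number);
  return { latitude, longitude, precision };
}
```

With this sketch, `parseGlobeCoordinate("49.265278,4.028611,0.1")` yields latitude `49.265278`, longitude `4.028611` and precision `0.1`, while the two-component form keeps the same latitude and longitude and uses the default precision.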
diff --git a/docs/manual/wikibase/uploading.md b/docs/manual/wikibase/uploading.md index 1360c0d1..1e6a7421 100644 --- a/docs/manual/wikibase/uploading.md +++ b/docs/manual/wikibase/uploading.md @@ -29,14 +29,13 @@ This requires that the Wikibase site has an associated [QuickStatements](https:/ ### Merging strategy for terms and statements {#merging-strategy-for-terms-and-statements} -OpenRefine offers various merging strategies for terms and statements. QuickStatements only supports one non-configurable merging strategy. Therefore, the merging strategies specified by the user in the schema are ignored when exporting to QuickStatements, -which can result in unintended changes. +OpenRefine offers various merging strategies for terms and statements. QuickStatements only supports one non-configurable merging strategy. Therefore, the merging strategies specified by the user in the schema are ignored when exporting to QuickStatements, which can result in unintended changes. ### New item creation {#new-item-creation} OpenRefine supports creating new items with arbitrary relations between them. -QuickStatements supports creating new items with the CREATE instruction, and subsequent instructions can use the LAST placeholder to use the Qid of the last created item. When generating QuickStatements instructions, OpenRefine reorders your edits so that this syntax can be used. In rare cases, such as when a statement links two newly-created items, it is impossible to use QuickStatements to perform the edit. In this case, no QuickStatements script will be generated. +QuickStatements supports creating new items with the `CREATE` instruction, and subsequent instructions can use the `LAST` placeholder to use the Qid of the last created item. When generating QuickStatements instructions, OpenRefine reorders your edits so that this syntax can be used. In rare cases, such as when a statement links two newly-created items, it is impossible to use QuickStatements to perform the edit. 
In this case, no QuickStatements script will be generated. ### Speed and number of edits {#speed-and-number-of-edits} diff --git a/docs/technical-reference/architecture-before-4.md b/docs/technical-reference/architecture-before-4.md index e679db08..c0f88921 100644 --- a/docs/technical-reference/architecture-before-4.md +++ b/docs/technical-reference/architecture-before-4.md @@ -10,14 +10,15 @@ This architecture provides a good separation of concerns (data vs. UI); allows t ## Technology stack {#technology-stack} -The server-side (back-end) part of OpenRefine is implemented in Java as one single servlet which is executed by the [Jetty](http://jetty.codehaus.org/jetty/) web server and servlet container. The use of Java strikes a balance between performance and portability across operating systems (there is very little OS-specific code and has mostly to do with starting the application). +The server-side (back-end) part of OpenRefine is implemented in Java as one single servlet which is executed by the [Jetty](http://jetty.codehaus.org/jetty/) web server and servlet container. The use of Java strikes a balance between performance and portability across operating systems (there is very little OS-specific code and has mostly to do with starting the application). The functional extensibility of OpenRefine is provided by a fork of the [SIMILE Butterfly](https://github.com/OpenRefine/simile-butterfly) modular web application framework. With this framework, extensions are able to provide new functionality both in the server- and client-side. A [list of known extensions](/extensions) is maintained on our website and we have [specific documentation for extension developers](technical-reference/writing-extensions.md). The client-side part of OpenRefine is implemented in HTML, CSS and plain Javascript. 
It primariy uses the following libraries: -* [jQuery](http://jquery.com/) -* [Wikimedia's jQuery.i18n](https://github.com/wikimedia/jquery.i18n) + +- [jQuery](http://jquery.com/) +- [Wikimedia's jQuery.i18n](https://github.com/wikimedia/jquery.i18n) The front-end dependencies are fetched at build time via [NPM](https://www.npmjs.com/). The server-side part of OpenRefine relies on many libraries, for instance to implement import and export in many different formats. @@ -109,10 +110,12 @@ In summary, - generalizable processes can be re-constructed from abstract operations ## Client-side architecture {#client-side-architecture} + The client-side part of OpenRefine is implemented in HTML, CSS and Javascript and uses the following Javascript libraries: -* [jQuery](http://jquery.com/) -* [jQueryUI](http:jqueryui.com/) -* [Recurser jquery-i18n](https://github.com/recurser/jquery-i18n) + +- [jQuery](http://jquery.com/) +- [jQueryUI](http:jqueryui.com/) +- [Recurser jquery-i18n](https://github.com/recurser/jquery-i18n) ### Importing architecture {#importing-architecture} @@ -161,7 +164,7 @@ Refine.DefaultImportingController = function(createProjectUI) { Refine.CreateProjectUI.controllers.push(Refine.DefaultImportingController); // register the controller ``` -We will cover the server-side code [below](#importingcontrollers). +We will cover the server-side code [below](#importingcontroller). #### Data Source Selection UIs {#data-source-selection-uis} @@ -215,8 +218,9 @@ is not uncommon that this initial choice must be overriden by the user. Beyond this choice of format, the parsing UI panel offers a configuration panel for the chosen importer. This part of the UI can be defined independently for each input format, given that not all options are relevant for all formats. For instance, when selecting the "Text file" option, the specific UI of the `LinedBasedImporter` will be shown. 
This UI is defined in: -* `main/webapp/modules/core/scripts/index/parser-interfaces/line-based-parser-ui.html` -* `main/webapp/modules/core/scripts/index/parser-interfaces/line-based-parser-ui.js` + +- `main/webapp/modules/core/scripts/index/parser-interfaces/line-based-parser-ui.html` +- `main/webapp/modules/core/scripts/index/parser-interfaces/line-based-parser-ui.js` Other importers generally define their own parsing configuration panel as well. @@ -227,12 +231,11 @@ where those components are registered together in the `ImportingManager`. #### ImportingController {#importingcontroller} -An importing controller is a component of the back-end which is in charge of the entire importing workflow, from the initial transfer of the raw data to be imported to the created project, with all the configuration steps in between, [as described in the -earlier section](#importing-controllers). OpenRefine comes with -a default importing controller which implements this for data coming from: -* file upload by the user via the web interface -* upload of textual information using the clipboard import form -* download of a file by supplying a URL +An importing controller is a component of the back-end which is in charge of the entire importing workflow, from the initial transfer of the raw data to be imported to the created project, with all the configuration steps in between, [as described in the earlier section](#importing-controllers). OpenRefine comes with a default importing controller which implements this for data coming from: + +- file upload by the user via the web interface +- upload of textual information using the clipboard import form +- download of a file by supplying a URL For all of these data sources, the first step consists of storing the corresponding input files in a temporary directory inside the workspace. 
The default importing controller provides an HTTP API used by the front-end to select which files to import, [predict the format they are in](#formatguesser), provide default importing options for the selected format, preview the project's first few rows with the given options, and finally create the project.