impl(bigquery): basic row parsing#5916
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces robust error handling and type conversion capabilities for BigQuery query results, including the FromSql trait for mapping database values to Rust types and a schema-aware Row parser. The reviewer feedback highlights critical correctness and performance improvements: addressing unnecessary clones and allocations during row parsing, preventing silent data corruption when handling special float values (like NaN and Infinity), ensuring consistent row-to-schema length validation, and improving error diagnostics by propagating actual field names instead of hardcoding 'unknown' in conversion errors.
| match schema.get_field_by_index(i) { | ||
| Some(f) => { | ||
| let field_type = f.r#type.clone(); | ||
| let schema = Arc::new(Schema::new_from_field(f.clone())); | ||
| let value = convert_value(value, field_type, &schema)?; | ||
| values.push(value); | ||
| } |
There was a problem hiding this comment.
The current implementation clones f.r#type (a String), clones f (a potentially large TableFieldSchema), and allocates a new Arc<Schema> for every single cell in every row. Furthermore, the _schema parameter in convert_value is completely unused.
To avoid these expensive allocations and clones, and to allow passing the actual column name for better error diagnostics, we can pass &f.r#type and &f.name to convert_value.
match schema.get_field_by_index(i) {
Some(f) => {
let value = convert_value(value, &f.r#type, &f.name)?;
values.push(value);
}
None => continue,
}| "FLOAT" | "FLOAT64" => { | ||
| let num = value.parse::<f64>().map_err(|e| RowError::TypeConversion { | ||
| column: "unknown".to_string(), | ||
| source: ConvertError::Convert(Box::new(e)), | ||
| })?; | ||
| Ok(Value::Number( | ||
| serde_json::Number::from_f64(num).unwrap_or_else(|| serde_json::Number::from(0)), | ||
| )) | ||
| } |
There was a problem hiding this comment.
BigQuery FLOAT64 columns can return special values like "Infinity", "-Infinity", or "NaN". While f64::parse successfully parses these, serde_json::Number::from_f64 returns None for non-finite floats. The current unwrap_or_else silently defaults these values to 0, causing silent data corruption.
Instead of defaulting to 0, we can fall back to storing the original string representation as Value::String(value). Since FromSql for f64 already supports parsing from Value::String, this will correctly preserve and parse special float values.
"FLOAT" | "FLOAT64" => {
let num = value.parse::<f64>().map_err(|e| RowError::TypeConversion {
column: field_name.to_string(),
source: ConvertError::Convert(Box::new(e)),
})?;
match serde_json::Number::from_f64(num) {
Some(n) => Ok(Value::Number(n)),
None => Ok(Value::String(value)),
}
}| pub(crate) fn try_new(row: &Struct, schema: &Arc<Schema>) -> Result<Self> { | ||
| let field_list = get_field_list(row)?; | ||
|
|
||
| let mut values = ListValue::new(); | ||
| for (i, cell) in field_list.iter().enumerate() { | ||
| let value = get_field_value(cell)?; | ||
| match schema.get_field_by_index(i) { | ||
| Some(f) => { | ||
| let field_type = f.r#type.clone(); | ||
| let schema = Arc::new(Schema::new_from_field(f.clone())); | ||
| let value = convert_value(value, field_type, &schema)?; | ||
| values.push(value); | ||
| } | ||
| None => continue, | ||
| } | ||
| } | ||
|
|
||
| if values.len() != schema.len() { | ||
| return Err(RowError::InvalidRowFormat(format!( | ||
| "schema and row cell mismatch (expected {}, got {})", | ||
| schema.len(), | ||
| values.len() | ||
| ))); | ||
| } |
There was a problem hiding this comment.
Checking the length mismatch at the end of the loop is inconsistent because if field_list has more elements than schema, the extra elements are silently ignored (since schema.get_field_by_index(i) returns None and the loop continues), resulting in values.len() == schema.len() and succeeding.
Performing an early length check at the beginning of try_new ensures consistent error behavior for any mismatch (more or fewer cells) and avoids unnecessary processing.
| fn convert_value(value: Value, field_type: String, _schema: &Arc<Schema>) -> Result<Value> { | ||
| match value { | ||
| Value::Null => Ok(Value::Null), | ||
| Value::String(v) => convert_basic_type(v, field_type), | ||
| Value::Object(_) => unimplemented!("TODO(#5592): nested records not implemented"), | ||
| Value::Array(_) => unimplemented!("TODO(#5592): repeated fields not implemented"), | ||
| _ => Err(RowError::InvalidRowFormat(format!( | ||
| "cell value is not an object: value={:?}, field_type={:?}", | ||
| value, field_type | ||
| ))), | ||
| } | ||
| } |
There was a problem hiding this comment.
Update convert_value to accept field_type as &str and field_name as &str to avoid unnecessary clones and provide better error context.
| fn convert_value(value: Value, field_type: String, _schema: &Arc<Schema>) -> Result<Value> { | |
| match value { | |
| Value::Null => Ok(Value::Null), | |
| Value::String(v) => convert_basic_type(v, field_type), | |
| Value::Object(_) => unimplemented!("TODO(#5592): nested records not implemented"), | |
| Value::Array(_) => unimplemented!("TODO(#5592): repeated fields not implemented"), | |
| _ => Err(RowError::InvalidRowFormat(format!( | |
| "cell value is not an object: value={:?}, field_type={:?}", | |
| value, field_type | |
| ))), | |
| } | |
| } | |
| fn convert_value(value: Value, field_type: &str, field_name: &str) -> Result<Value> { | |
| match value { | |
| Value::Null => Ok(Value::Null), | |
| Value::String(v) => convert_basic_type(v, field_type, field_name), | |
| Value::Object(_) => unimplemented!("TODO(#5592): nested records not implemented"), | |
| Value::Array(_) => unimplemented!("TODO(#5592): repeated fields not implemented"), | |
| _ => Err(RowError::InvalidRowFormat(format!( | |
| "cell value is not an object: value={:?}, field_type={:?}", | |
| value, field_type | |
| ))), | |
| } | |
| } |
| fn convert_basic_type(value: String, field_type: String) -> Result<Value> { | ||
| match field_type.as_str() { |
There was a problem hiding this comment.
| "INTEGER" | "INT64" => { | ||
| let num = value.parse::<i64>().map_err(|e| RowError::TypeConversion { | ||
| column: "unknown".to_string(), | ||
| source: ConvertError::Convert(Box::new(e)), | ||
| })?; | ||
| Ok(Value::Number(serde_json::Number::from(num))) | ||
| } |
There was a problem hiding this comment.
Use the actual field_name instead of the hardcoded "unknown" for better error diagnostics.
| "INTEGER" | "INT64" => { | |
| let num = value.parse::<i64>().map_err(|e| RowError::TypeConversion { | |
| column: "unknown".to_string(), | |
| source: ConvertError::Convert(Box::new(e)), | |
| })?; | |
| Ok(Value::Number(serde_json::Number::from(num))) | |
| } | |
| "INTEGER" | "INT64" => { | |
| let num = value.parse::<i64>().map_err(|e| RowError::TypeConversion { | |
| column: field_name.to_string(), | |
| source: ConvertError::Convert(Box::new(e)), | |
| })?; | |
| Ok(Value::Number(serde_json::Number::from(num))) | |
| } |
| "BOOLEAN" | "BOOL" => { | ||
| let b = value | ||
| .to_lowercase() | ||
| .parse::<bool>() | ||
| .map_err(|e| RowError::TypeConversion { | ||
| column: "unknown".to_string(), | ||
| source: ConvertError::Convert(Box::new(e)), | ||
| })?; | ||
| Ok(Value::Bool(b)) | ||
| } |
There was a problem hiding this comment.
Use the actual field_name instead of the hardcoded "unknown" for better error diagnostics.
| "BOOLEAN" | "BOOL" => { | |
| let b = value | |
| .to_lowercase() | |
| .parse::<bool>() | |
| .map_err(|e| RowError::TypeConversion { | |
| column: "unknown".to_string(), | |
| source: ConvertError::Convert(Box::new(e)), | |
| })?; | |
| Ok(Value::Bool(b)) | |
| } | |
| "BOOLEAN" | "BOOL" => { | |
| let b = value | |
| .to_lowercase() | |
| .parse::<bool>() | |
| .map_err(|e| RowError::TypeConversion { | |
| column: field_name.to_string(), | |
| source: ConvertError::Convert(Box::new(e)), | |
| })?; | |
| Ok(Value::Bool(b)) | |
| } |
|
Splitting errors into #5917. Later will split FromSql initial impl. |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #5916 +/- ##
==========================================
- Coverage 97.90% 97.84% -0.07%
==========================================
Files 234 236 +2
Lines 59940 60202 +262
==========================================
+ Hits 58683 58902 +219
- Misses 1257 1300 +43 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
|
Split FromSql trait and some basic conversion into PR #5924 |
Towards #5844