[SPARK-45357][CONNECT][TESTS] Normalize `dataframeId` when comparing `CollectMetrics` in `SparkConnectProtoSuite` #43155
Conversation
```scala
@@ -1068,6 +1068,10 @@ class SparkConnectProtoSuite extends PlanTest with SparkConnectPlanTest {
  // Compares proto plan with LogicalPlan.
  private def comparePlans(connectPlan: proto.Relation, sparkPlan: LogicalPlan): Unit = {
    val connectAnalyzed = analyzePlan(transform(connectPlan))
    comparePlans(connectAnalyzed, sparkPlan, false)
    (connectAnalyzed, sparkPlan) match {
```
Since this is the first such case, this PR only makes a simple fix.
In the current scenario, `connectAnalyzed` is transformed from a `proto.Relation`. When it is a `CollectMetrics`, the `dataframeId` is always 0. But if the `sparkPlan` is a `CollectMetrics`, its `dataframeId` value is determined by its corresponding DataFrame.
In the sbt tests, `SparkConnectProtoSuite` runs early, so `sparkTestRelation` is the first `DataFrame` created and its `id` is 0; the GA tests therefore didn't trigger the failure described in this PR. When testing with Maven, `SparkConnectProtoSuite` runs later, `sparkTestRelation` is not the first `DataFrame` created, its `id` is not 0, and the test fails.
Thanks for the clarification!
also cc @amaliujia for a double check
```scala
@@ -1068,6 +1068,10 @@ class SparkConnectProtoSuite extends PlanTest with SparkConnectPlanTest {
  // Compares proto plan with LogicalPlan.
  private def comparePlans(connectPlan: proto.Relation, sparkPlan: LogicalPlan): Unit = {
    val connectAnalyzed = analyzePlan(transform(connectPlan))
    comparePlans(connectAnalyzed, sparkPlan, false)
    (connectAnalyzed, sparkPlan) match {
      case (l: CollectMetrics, r: CollectMetrics) =>
```
why not just add a small normalize function to reset the df id to 0 for all `CollectMetrics` in the query plan?
```scala
@@ -1067,7 +1067,11 @@ class SparkConnectProtoSuite extends PlanTest with SparkConnectPlanTest {

  // Compares proto plan with LogicalPlan.
  private def comparePlans(connectPlan: proto.Relation, sparkPlan: LogicalPlan): Unit = {
    def normalizeDataframeId(plan: LogicalPlan): LogicalPlan = plan match {
```
Added a new small function; is it OK?
shall we use transform? why only the top-level `CollectMetrics`?
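The transform-style normalization being discussed can be sketched on a toy plan ADT. This is a hypothetical sketch, not the suite's actual code: the real function operates on Catalyst's `LogicalPlan`, while the types here (`ToyPlan`, `ToyLocalRelation`, `ToyCollectMetrics`, `NormalizeSketch`) are invented for illustration. The idea is to reset `dataframeId` to 0 on every `CollectMetrics`-like node before comparing plans.

```scala
// Hypothetical sketch on a toy ADT, not the suite's actual code.
sealed trait ToyPlan
final case class ToyLocalRelation(name: String) extends ToyPlan
final case class ToyCollectMetrics(metricName: String,
                                   dataframeId: Long,
                                   child: ToyPlan) extends ToyPlan

object NormalizeSketch {
  // Recurse through the whole plan rather than matching only the root,
  // so nested CollectMetrics nodes are normalized too. That is the point
  // of the "shall we use transform?" question above.
  def normalizeDataframeId(plan: ToyPlan): ToyPlan = plan match {
    case ToyCollectMetrics(name, _, child) =>
      ToyCollectMetrics(name, 0L, normalizeDataframeId(child))
    case other => other
  }
}
```

Comparing `normalizeDataframeId(l)` with `normalizeDataframeId(r)` then ignores only the order-dependent id while still checking everything else in the plans.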
LGTM. Another way to fix this would be to manually construct the `CollectMetrics` to compare with the proto-generated version, but this PR's approach is fine too.
In `spark/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/PlanGenerationTestSuite.scala`, lines 119 to 121 in 4863dec:

IIRC, the …
No, maybe it's not feasible. The current test case is comparing the plans generated by:

`spark/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala`, lines 2219 to 2222 in 97597ba

Personally, I think this change is not worth it. On the other hand, even if we are willing to make the above changes, we might still need to address how to synchronize …
The GA failure is unrelated to the current PR.
@zhengruifeng Moreover, this test case is in:

`spark/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/dsl/package.scala`, lines 1089 to 1111 in 5e6986b

It seems that there is no … And the:

`spark/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Dataset.scala`, lines 3286 to 3288 in 5e6986b
rebase to fix python lint
also cc @hvanhovell
merged to master
Thanks @zhengruifeng @cloud-fan @amaliujia ~
…`CollectMetrics` in `SparkConnectProtoSuite`

### What changes were proposed in this pull request?

This PR adds a new function `normalizeDataframeId` that sets the `dataframeId` of `CollectMetrics` to the constant 0 before comparing `LogicalPlan`s in the test cases of `SparkConnectProtoSuite`.

### Why are the changes needed?

The test scenario in `SparkConnectProtoSuite` does not need to compare the `dataframeId` in `CollectMetrics`.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

- Manually checked by running

```
build/mvn clean install -pl connector/connect/server -am -DskipTests
build/mvn test -pl connector/connect/server
```

**Before**

```
- Test observe *** FAILED ***
  == FAIL: Plans do not match ===
  !CollectMetrics my_metric, [min(id#0) AS min_val#0, max(id#0) AS max_val#0, sum(id#0) AS sum(id)#0L], 0
  +- LocalRelation <empty>, [id#0, name#0]

  CollectMetrics my_metric, [min(id#0) AS min_val#0, max(id#0) AS max_val#0, sum(id#0) AS sum(id)#0L], 53
  +- LocalRelation <empty>, [id#0, name#0]
  (PlanTest.scala:179)
```

**After**

```
Run completed in 41 seconds, 631 milliseconds.
Total number of tests run: 882
Suites: completed 24, aborted 0
Tests: succeeded 882, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#43155 from LuciferYang/SPARK-45357.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
### What changes were proposed in this pull request?

This PR adds a new function `normalizeDataframeId` that sets the `dataframeId` of `CollectMetrics` to the constant 0 before comparing `LogicalPlan`s in the test cases of `SparkConnectProtoSuite`.

### Why are the changes needed?

The test scenario in `SparkConnectProtoSuite` does not need to compare the `dataframeId` in `CollectMetrics`.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Manually checked by running

```
build/mvn clean install -pl connector/connect/server -am -DskipTests
build/mvn test -pl connector/connect/server
```

**Before**

```
- Test observe *** FAILED ***
  == FAIL: Plans do not match ===
  !CollectMetrics my_metric, [min(id#0) AS min_val#0, max(id#0) AS max_val#0, sum(id#0) AS sum(id)#0L], 0
  +- LocalRelation <empty>, [id#0, name#0]

  CollectMetrics my_metric, [min(id#0) AS min_val#0, max(id#0) AS max_val#0, sum(id#0) AS sum(id)#0L], 53
  +- LocalRelation <empty>, [id#0, name#0]
  (PlanTest.scala:179)
```

**After**

```
Run completed in 41 seconds, 631 milliseconds.
Total number of tests run: 882
Suites: completed 24, aborted 0
Tests: succeeded 882, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```

### Was this patch authored or co-authored using generative AI tooling?

No