
[SPARK-45357][CONNECT][TESTS] Normalize dataframeId when comparing CollectMetrics in SparkConnectProtoSuite #43155

Closed
wants to merge 8 commits

Conversation


@LuciferYang LuciferYang commented Sep 27, 2023

What changes were proposed in this pull request?

This PR adds a new function `normalizeDataframeId` that sets the `dataframeId` of `CollectMetrics` to the constant 0 before comparing `LogicalPlan`s in the test cases of `SparkConnectProtoSuite`.
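Based on the partial diff shown later in the review (`def normalizeDataframeId(plan: LogicalPlan): LogicalPlan = ...`) and a reviewer's suggestion to use `transform`, a minimal sketch of such a normalization could look like the following; the body is an assumption for illustration, not the merged code:

```scala
import org.apache.spark.sql.catalyst.plans.logical.{CollectMetrics, LogicalPlan}

// Sketch (assumed shape): rewrite every CollectMetrics node so its
// dataframeId becomes the constant 0, making plan comparison insensitive
// to the non-deterministic id assigned at DataFrame creation time.
private def normalizeDataframeId(plan: LogicalPlan): LogicalPlan =
  plan.transform {
    case cm: CollectMetrics => cm.copy(dataframeId = 0)
  }
```

Applying this to both sides before `comparePlans` makes the comparison independent of how many DataFrames earlier suites created.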

Why are the changes needed?

The test scenarios in `SparkConnectProtoSuite` do not need to compare the `dataframeId` in `CollectMetrics`.

Does this PR introduce any user-facing change?

No

How was this patch tested?

  • Manually check

run

```
build/mvn clean install -pl connector/connect/server -am -DskipTests
build/mvn test -pl connector/connect/server
```

Before

```
- Test observe *** FAILED ***
  == FAIL: Plans do not match ===
  !CollectMetrics my_metric, [min(id#0) AS min_val#0, max(id#0) AS max_val#0, sum(id#0) AS sum(id)#0L], 0   CollectMetrics my_metric, [min(id#0) AS min_val#0, max(id#0) AS max_val#0, sum(id#0) AS sum(id)#0L], 53
   +- LocalRelation <empty>, [id#0, name#0]                                                                 +- LocalRelation <empty>, [id#0, name#0] (PlanTest.scala:179)
```

After

```
Run completed in 41 seconds, 631 milliseconds.
Total number of tests run: 882
Suites: completed 24, aborted 0
Tests: succeeded 882, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```

Was this patch authored or co-authored using generative AI tooling?

No

@LuciferYang LuciferYang marked this pull request as draft September 27, 2023 17:39
@LuciferYang LuciferYang changed the title [SPARK-45357][CONNECT][TESTS] Ignore dataframeId when comparing CollectMetrics in SparkConnectProtoSuite. [SPARK-45357][CONNECT][TESTS] Ignore dataframeId when comparing CollectMetrics in SparkConnectProtoSuite. Sep 27, 2023
```
@@ -1068,6 +1068,10 @@ class SparkConnectProtoSuite extends PlanTest with SparkConnectPlanTest {
// Compares proto plan with LogicalPlan.
private def comparePlans(connectPlan: proto.Relation, sparkPlan: LogicalPlan): Unit = {
val connectAnalyzed = analyzePlan(transform(connectPlan))
comparePlans(connectAnalyzed, sparkPlan, false)
(connectAnalyzed, sparkPlan) match {
```
LuciferYang (author):
Since this is the first such case, this PR only makes a simple fix.

@LuciferYang LuciferYang changed the title [SPARK-45357][CONNECT][TESTS] Ignore dataframeId when comparing CollectMetrics in SparkConnectProtoSuite. [SPARK-45357][CONNECT][TESTS] Ignore dataframeId when comparing CollectMetrics in SparkConnectProtoSuite Sep 27, 2023
@LuciferYang LuciferYang changed the title [SPARK-45357][CONNECT][TESTS] Ignore dataframeId when comparing CollectMetrics in SparkConnectProtoSuite [SPARK-45357][CONNECT][TESTS] Normalize dataframeId when comparing CollectMetrics in SparkConnectProtoSuite Sep 27, 2023
```
@@ -1068,6 +1068,10 @@ class SparkConnectProtoSuite extends PlanTest with SparkConnectPlanTest {
// Compares proto plan with LogicalPlan.
private def comparePlans(connectPlan: proto.Relation, sparkPlan: LogicalPlan): Unit = {
val connectAnalyzed = analyzePlan(transform(connectPlan))
comparePlans(connectAnalyzed, sparkPlan, false)
(connectAnalyzed, sparkPlan) match {
```
LuciferYang (author):
In the current scenario, `connectAnalyzed` is transformed from `proto.Relation`. When it is `CollectMetrics`, its `dataframeId` is always 0.

But if the `sparkPlan` is a `CollectMetrics`, its `dataframeId` is determined by its corresponding `DataFrame`.

In the sbt tests, `SparkConnectProtoSuite` runs early, so `sparkTestRelation` is the first `DataFrame` created and its id is 0; that is why the GA tests did not trigger the failure described in this PR.

When testing with Maven, `SparkConnectProtoSuite` runs later, so `sparkTestRelation` is not the first `DataFrame` created and its id is not 0, which causes the test to fail.

A reviewer (Contributor):

Thanks for the clarification!

@LuciferYang LuciferYang marked this pull request as ready for review September 27, 2023 18:10
@LuciferYang
cc @cloud-fan @zhengruifeng

@LuciferYang

also cc @amaliujia for double check

```
@@ -1068,6 +1068,10 @@ class SparkConnectProtoSuite extends PlanTest with SparkConnectPlanTest {
// Compares proto plan with LogicalPlan.
private def comparePlans(connectPlan: proto.Relation, sparkPlan: LogicalPlan): Unit = {
val connectAnalyzed = analyzePlan(transform(connectPlan))
comparePlans(connectAnalyzed, sparkPlan, false)
(connectAnalyzed, sparkPlan) match {
case (l: CollectMetrics, r: CollectMetrics) =>
```
A reviewer (Contributor):
why not just add a small normalize function to reset df id to 0 for all CollectMetrics in the query plan?

```
@@ -1067,7 +1067,11 @@ class SparkConnectProtoSuite extends PlanTest with SparkConnectPlanTest {

// Compares proto plan with LogicalPlan.
private def comparePlans(connectPlan: proto.Relation, sparkPlan: LogicalPlan): Unit = {
def normalizeDataframeId(plan: LogicalPlan): LogicalPlan = plan match {
```
LuciferYang (author):
Added a new small function; is it OK?

A reviewer (Contributor):
shall we use transform? why only top-level CollectMetrics?

@amaliujia

LGTM

Another way to fix this would be to manually construct the CollectMetrics to compare with the proto-generated version, but this PR's approach is fine too.

@zhengruifeng

In PlanGenerationTestSuite, the planId was reset before each test

```scala
override protected def beforeEach(): Unit = {
  session.resetPlanIdGenerator()
}
```

IIRC, the dataframeId in CollectMetrics is also the planId, so is it possible to simply reset the planId before problematic test suites?

@LuciferYang

LuciferYang commented Oct 4, 2023

In PlanGenerationTestSuite, the planId was reset before each test

```scala
override protected def beforeEach(): Unit = {
  session.resetPlanIdGenerator()
}
```

IIRC, the dataframeId in CollectMetrics is also the planId, so is it possible to simply reset the planId before problematic test suites?

No, it may not be feasible. The current test case compares the plans generated by `connectTestRelation.observe` and `sparkTestRelation.observe` respectively, so we cannot just clear one side's `dataframeId`. If we followed that approach, we would also need a new function on `Dataset` to reset `Dataset#curId`.

```scala
val curId = new java.util.concurrent.atomic.AtomicLong()

private val id = Dataset.curId.getAndIncrement()

@varargs
def observe(name: String, expr: Column, exprs: Column*): Dataset[T] = withTypedPlan {
  CollectMetrics(name, (expr +: exprs).map(_.named), logicalPlan, id)
}
```

Personally, I don't think that change is worth it. On the other hand, even if we were willing to make the above changes, we would still need to keep `curId` and `planId` synchronized; otherwise they still would not produce the same values.
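To make the mismatch concrete: `Dataset#curId` and the Connect plan-id generator are independent counters, so resetting one cannot align ids already stamped into plans. A self-contained toy model (the names and the 0-vs-53 numbers are ours, echoing the failure output above, not Spark code):

```scala
import java.util.concurrent.atomic.AtomicLong

object IdDriftDemo extends App {
  val curId  = new AtomicLong() // stands in for Dataset.curId on the Spark side
  val planId = new AtomicLong() // stands in for the Connect plan-id generator

  // Earlier suites in the same JVM create 53 DataFrames, bumping only curId.
  (1 to 53).foreach(_ => curId.getAndIncrement())

  val sparkSideId   = curId.getAndIncrement()  // id stamped into CollectMetrics
  val connectSideId = planId.getAndIncrement() // Connect side still starts at 0

  // The two ids disagree, which is exactly why the plans fail to match.
  assert(sparkSideId == 53 && connectSideId == 0)
  println(s"spark=$sparkSideId connect=$connectSideId")
}
```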

@LuciferYang

The GA failure is unrelated to the current PR:

```
starting mypy annotations test...
annotations failed mypy checks:
/usr/local/lib/python3.9/dist-packages/torch/_dynamo/variables/tensor.py:369: error: INTERNAL ERROR -- Please try using mypy master on GitHub:
https://mypy.readthedocs.io/en/stable/common_issues.html#using-a-development-mypy-build
If this issue continues with mypy master, please report a bug at https://github.com/python/mypy/issues
version: 0.982
/usr/local/lib/python3.9/dist-packages/torch/_dynamo/variables/tensor.py:369: : note: please use --show-traceback to print a traceback when reporting a bug
2
Error: Process completed with exit code 2.
```

@LuciferYang

LuciferYang commented Oct 4, 2023

@zhengruifeng Moreover, this test case is in the connect-server module; the function used by `connectTestRelation.observe` in the test is:

```scala
def observe(name: String, expr: Expression, exprs: Expression*): Relation = {
  Relation
    .newBuilder()
    .setCollectMetrics(
      CollectMetrics
        .newBuilder()
        .setInput(logicalPlan)
        .setName(name)
        .addAllMetrics((expr +: exprs).asJava))
    .build()
}

def observe(observation: Observation, expr: Expression, exprs: Expression*): Relation = {
  Relation
    .newBuilder()
    .setCollectMetrics(
      CollectMetrics
        .newBuilder()
        .setInput(logicalPlan)
        .setName(observation.name)
        .addAllMetrics((expr +: exprs).asJava))
    .build()
}
```

It seems there is no `planId` here, and there is no way to call `session.resetPlanIdGenerator()`.

And the observe function on the client side has not been implemented yet.

```scala
def observe(name: String, expr: Column, exprs: Column*): Dataset[T] = {
  throw new UnsupportedOperationException("observe is not implemented.")
}
```

@LuciferYang

Rebased to fix the Python lint failure.

@zhengruifeng

also cc @hvanhovell

@zhengruifeng

merged to master

@LuciferYang

Thanks @zhengruifeng @cloud-fan @amaliujia ~

LuciferYang added a commit to LuciferYang/spark that referenced this pull request Oct 7, 2023
…`CollectMetrics` in `SparkConnectProtoSuite`

### What changes were proposed in this pull request?
This PR adds a new function `normalizeDataframeId` that sets the `dataframeId` of `CollectMetrics` to the constant 0 before comparing `LogicalPlan`s in the test cases of `SparkConnectProtoSuite`.

### Why are the changes needed?
The test scenarios in `SparkConnectProtoSuite` do not need to compare the `dataframeId` in `CollectMetrics`.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- Manually check

run

```
build/mvn clean install -pl connector/connect/server -am -DskipTests
build/mvn test -pl connector/connect/server
```

**Before**

```
- Test observe *** FAILED ***
  == FAIL: Plans do not match ===
  !CollectMetrics my_metric, [min(id#0) AS min_val#0, max(id#0) AS max_val#0, sum(id#0) AS sum(id)#0L], 0   CollectMetrics my_metric, [min(id#0) AS min_val#0, max(id#0) AS max_val#0, sum(id#0) AS sum(id)#0L], 53
   +- LocalRelation <empty>, [id#0, name#0]                                                                 +- LocalRelation <empty>, [id#0, name#0] (PlanTest.scala:179)
```

**After**

```
Run completed in 41 seconds, 631 milliseconds.
Total number of tests run: 882
Suites: completed 24, aborted 0
Tests: succeeded 882, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#43155 from LuciferYang/SPARK-45357.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
LuciferYang added a commit to LuciferYang/spark that referenced this pull request Feb 16, 2024
…`CollectMetrics` in `SparkConnectProtoSuite`
