[SPARK-45242][SQL] Use DataFrame ID to semantically validate CollectMetrics #43010
Conversation
@cloud-fan trying this idea
python/pyspark/sql/connect/plan.py (outdated)
@@ -1197,6 +1197,7 @@ def plan(self, session: "SparkConnectClient") -> proto.Relation:
         plan.collect_metrics.input.CopyFrom(self._child.plan(session))
         plan.collect_metrics.name = self._name
         plan.collect_metrics.metrics.extend([self.col_to_expr(x, session) for x in self._exprs])
+        plan.collect_metrics.dataframe_id = self._child._plan_id
Does a Spark Connect DataFrame also have a unique id? cc @HyukjinKwon @zhengruifeng
The `_plan_id` is unique:
spark/python/pyspark/sql/connect/plan.py, lines 57 to 72 in 5299e54:
class LogicalPlan:

    _lock: Lock = Lock()
    _nextPlanId: int = 0

    INDENT = 2

    def __init__(self, child: Optional["LogicalPlan"]) -> None:
        self._child = child

        plan_id: Optional[int] = None
        with LogicalPlan._lock:
            plan_id = LogicalPlan._nextPlanId
            LogicalPlan._nextPlanId += 1

        assert plan_id is not None
        self._plan_id = plan_id
Yes, `plan_id` alone is functionally sufficient.
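The uniqueness guarantee discussed above comes from a class-level counter guarded by a lock. A minimal standalone sketch of the same pattern (simplified from the real `LogicalPlan` in `python/pyspark/sql/connect/plan.py`; names here are illustrative):

```python
from threading import Lock, Thread
from typing import List, Optional

class LogicalPlan:
    # Class-level counter guarded by a lock, so every plan instance,
    # even across threads, receives a distinct monotonically increasing id.
    _lock: Lock = Lock()
    _next_plan_id: int = 0

    def __init__(self, child: Optional["LogicalPlan"] = None) -> None:
        self._child = child
        with LogicalPlan._lock:
            self._plan_id = LogicalPlan._next_plan_id
            LogicalPlan._next_plan_id += 1

# Even when plans are created concurrently, the ids never collide.
plans: List[LogicalPlan] = []

def make_plans() -> None:
    for _ in range(100):
        plans.append(LogicalPlan())

threads = [Thread(target=make_plans) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

ids = [p._plan_id for p in plans]
assert len(set(ids)) == len(ids) == 400  # all 400 ids are distinct
```

Because `observe` is only reachable through the DataFrame API, this per-process counter is enough to identify the to-be-observed DataFrame.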
def check(plan: LogicalPlan): Unit = plan.foreach { node =>
  node match {
    case metrics @ CollectMetrics(name, _, _) =>
      val simplifiedMetrics = simplifyPlanForCollectedMetrics(metrics.canonicalized)
nit: we can remove `simplifyPlanForCollectedMetrics` now
done
   expectedErrorClass = "INVALID_OBSERVED_METRICS.NON_AGGREGATE_FUNC_ARG_IS_ATTRIBUTE",
   expectedMessageParameters = Map("expr" -> "\"a\"")
 )

 // Unwrapped non-deterministic expression
 assertAnalysisErrorClass(
-  CollectMetrics("event", Rand(10).as("rnd") :: Nil, testRelation),
+  CollectMetrics("event", Rand(10).as("rnd") :: Nil, testRelation, 5),
I think we can always use 0 in this test suite? Unless we have multiple `CollectMetrics` in one plan tree.
I kept distinct dataframe ids where it makes sense; otherwise I changed them back to 0.
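The rule this PR switches to can be sketched as follows. This is a hypothetical Python illustration, not the actual Scala `CheckAnalysis` code, and the error message is illustrative: two `CollectMetrics` nodes may share a name only if they observe the same DataFrame (same id).

```python
from typing import Dict, Iterable, Tuple

def check_collect_metrics(nodes: Iterable[Tuple[str, int]]) -> None:
    """nodes: (name, dataframe_id) pairs collected from a plan tree.

    Two CollectMetrics are semantically the same iff they have the same
    name AND the same DataFrame id; a reused name on a different
    DataFrame is rejected.
    """
    seen: Dict[str, int] = {}
    for name, df_id in nodes:
        if name in seen and seen[name] != df_id:
            raise ValueError(
                f"duplicate observed-metrics name {name!r} "
                f"on different DataFrames ({seen[name]} vs {df_id})")
        seen[name] = df_id

# Same name, same DataFrame id: accepted.
check_collect_metrics([("event", 5), ("event", 5)])

# Same name, different DataFrame ids: rejected.
try:
    check_collect_metrics([("event", 5), ("event", 7)])
except ValueError as e:
    print("rejected:", e)
```

Compared with the previous plan-matching approach, this needs no canonicalization or plan simplification, which is why `simplifyPlanForCollectedMetrics` becomes removable.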
@@ -990,6 +990,9 @@ message CollectMetrics {

   // (Required) The metric sequence.
   repeated Expression metrics = 3;

+  // (Required) A unique DataFrame id.
+  int64 dataframe_id = 4;
Since `dataframe_id` is set to the `plan_id`, why not reuse the `plan_id` in `RelationCommon`?
Good point. done
@cloud-fan @zhengruifeng This PR is ready for another review.
@@ -1192,6 +1192,7 @@ def plan(self, session: "SparkConnectClient") -> proto.Relation:
         assert self._child is not None

         plan = proto.Relation()
+        plan.common.plan_id = self._child._plan_id
Maybe we can add a comment here: we treat the id of the to-be-observed plan as the dataframe id for CollectMetrics
I am doing this because `CollectMetrics` does not reuse `LogicalPlan`'s `_create_proto_relation`, which is what sets the `plan_id`. It looks like the majority of the plans set the `plan_id` by default; however, I do not have context on why `CollectMetrics` is implemented the current way.
Your change here is fine; it is equivalent to `_create_proto_relation`. I noticed there are a few places not using `_create_proto_relation`; I will check and modify them in separate work.
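What the change above amounts to can be sketched with plain dataclasses standing in for the real protobuf `Relation`/`RelationCommon` messages (simplified, assumed types; not the actual Spark Connect code):

```python
from dataclasses import dataclass, field

@dataclass
class RelationCommon:
    # Mirrors the plan_id field in the RelationCommon proto message.
    plan_id: int = 0

@dataclass
class Relation:
    common: RelationCommon = field(default_factory=RelationCommon)

def build_collect_metrics_relation(child_plan_id: int) -> Relation:
    """Equivalent in spirit to _create_proto_relation: the id of the
    to-be-observed child plan is carried as the dataframe id for
    CollectMetrics."""
    plan = Relation()
    plan.common.plan_id = child_plan_id
    return plan

rel = build_collect_metrics_relation(7)
assert rel.common.plan_id == 7
```

Reusing `RelationCommon.plan_id` avoids adding a redundant `dataframe_id` field to the `CollectMetrics` message, since the value would always be the same.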
thanks, merging to master/3.5!
[SPARK-45242][SQL] Use DataFrame ID to semantically validate CollectMetrics

### What changes were proposed in this pull request?
In the existing code, plan matching is used to validate whether two CollectMetrics with the same name have different semantics. However, the plan-matching approach is fragile. A better way to tackle this is to utilize the unique DataFrame id: the observe API is only supported by the DataFrame API, and SQL has no such syntax, so two CollectMetrics are semantically the same if and only if they have the same name and the same DataFrame id.

### Why are the changes needed?
This replaces a fragile approach with a more stable one.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
UT.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #43010 from amaliujia/another_approch_for_collect_metrics.

Authored-by: Rui Wang <rui.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 7c3c7c5)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@@ -1969,7 +1969,8 @@ trait SupportsSubquery extends LogicalPlan
 case class CollectMetrics(
     name: String,
     metrics: Seq[NamedExpression],
-    child: LogicalPlan)
+    child: LogicalPlan,
+    dataframeId: Long)
@amaliujia when I execute the following commands:

build/mvn clean install -pl connector/connect/server -am -DskipTests
mvn test -pl connector/connect/server

the following test failure occurs:
- Test observe *** FAILED ***
== FAIL: Plans do not match ===
!CollectMetrics my_metric, [min(id#0) AS min_val#0, max(id#0) AS max_val#0, sum(id#0) AS sum(id)#0L], 0 CollectMetrics my_metric, [min(id#0) AS min_val#0, max(id#0) AS max_val#0, sum(id#0) AS sum(id)#0L], 53
+- LocalRelation <empty>, [id#0, name#0] +- LocalRelation <empty>, [id#0, name#0] (PlanTest.scala:179)
It seems that the failure is due to the difference in `dataframeId` when comparing plans. For this, I've created SPARK-45357.
Lines 958 to 1017 in ab92cae
test("Test observe") {
  val connectPlan0 =
    connectTestRelation.observe(
      "my_metric",
      proto_min("id".protoAttr).as("min_val"),
      proto_max("id".protoAttr).as("max_val"),
      proto_sum("id".protoAttr))
  val sparkPlan0 =
    sparkTestRelation.observe(
      "my_metric",
      min(Column("id")).as("min_val"),
      max(Column("id")).as("max_val"),
      sum(Column("id")))
  comparePlans(connectPlan0, sparkPlan0)

  val connectPlan1 =
    connectTestRelation.observe("my_metric", proto_min("id".protoAttr).as("min_val"))
  val sparkPlan1 =
    sparkTestRelation.observe("my_metric", min(Column("id")).as("min_val"))
  comparePlans(connectPlan1, sparkPlan1)

  checkError(
    exception = intercept[AnalysisException] {
      analyzePlan(
        transform(connectTestRelation.observe("my_metric", "id".protoAttr.cast("string"))))
    },
    errorClass = "INVALID_OBSERVED_METRICS.NON_AGGREGATE_FUNC_ARG_IS_ATTRIBUTE",
    parameters = Map("expr" -> "\"id AS id\""))

  val connectPlan2 =
    connectTestRelation.observe(
      Observation("my_metric"),
      proto_min("id".protoAttr).as("min_val"),
      proto_max("id".protoAttr).as("max_val"),
      proto_sum("id".protoAttr))
  val sparkPlan2 =
    sparkTestRelation.observe(
      Observation("my_metric"),
      min(Column("id")).as("min_val"),
      max(Column("id")).as("max_val"),
      sum(Column("id")))
  comparePlans(connectPlan2, sparkPlan2)

  val connectPlan3 =
    connectTestRelation.observe(
      Observation("my_metric"),
      proto_min("id".protoAttr).as("min_val"))
  val sparkPlan3 =
    sparkTestRelation.observe(Observation("my_metric"), min(Column("id")).as("min_val"))
  comparePlans(connectPlan3, sparkPlan3)

  checkError(
    exception = intercept[AnalysisException] {
      analyzePlan(
        transform(
          connectTestRelation.observe(Observation("my_metric"), "id".protoAttr.cast("string")))))
    },
    errorClass = "INVALID_OBSERVED_METRICS.NON_AGGREGATE_FUNC_ARG_IS_ATTRIBUTE",
    parameters = Map("expr" -> "\"id AS id\""))
}
For this test case, should we ignore the comparison of dataframeId?
GA can pass the test now, which seems to be because the test order makes `sparkTestRelation.id` exactly 0. However, Maven's test ordering differs from sbt's, so `sparkTestRelation.id` is not 0 when running under Maven.
omitting df id in the comparison of this test case makes sense to me.
### What changes were proposed in this pull request?
In #43010, a new DataFrame id field was added to `CollectMetrics`. We should also canonicalize the new DataFrame id field to avoid downstream plan comparison failures.

### Why are the changes needed?
Avoid downstream plan comparison failures.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
UT.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #43594 from amaliujia/do_not_canonicalize_dataframe_id.

Authored-by: Rui Wang <rui.wang@databricks.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
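The canonicalization fix can be sketched with a simplified dataclass standing in for the real Catalyst node (illustrative, not the actual Scala code): normalize the per-session dataframe id to a fixed value so two otherwise-identical `CollectMetrics` plans compare equal.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class CollectMetrics:
    # Simplified stand-in: the real node also carries metrics and a child.
    name: str
    dataframe_id: int

def canonicalized(node: CollectMetrics) -> CollectMetrics:
    # Normalize the per-session dataframe id so that plan comparison
    # (e.g. comparePlans in PlanTest) is not affected by the order in
    # which DataFrames were created.
    return replace(node, dataframe_id=0)

a = CollectMetrics("my_metric", dataframe_id=0)
b = CollectMetrics("my_metric", dataframe_id=53)
assert a != b                                # raw plans differ
assert canonicalized(a) == canonicalized(b)  # canonical forms match
```

This is exactly the failure mode from the Maven report above: the connect plan carried dataframe id 0 while the Spark plan carried 53, so the pre-fix comparison failed even though the plans were otherwise identical.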