Skip to content

Commit 2f17844

Browse files
committed
Allow UPDATE to move rows between partitions.
When an UPDATE causes a row to no longer match the partition constraint, try to move it to a different partition where it does match the partition constraint. In essence, the UPDATE is split into a DELETE from the old partition and an INSERT into the new one. This can lead to surprising behavior in concurrency scenarios because EvalPlanQual rechecks won't work as they normally did; the known problems are documented. (There is a pending patch to improve the situation further, but it needs more review.) Amit Khandekar, reviewed and tested by Amit Langote, David Rowley, Rajkumar Raghuwanshi, Dilip Kumar, Amul Sul, Thomas Munro, Álvaro Herrera, Amit Kapila, and me. A few final revisions by me. Discussion: http://postgr.es/m/CAJ3gD9do9o2ccQ7j7+tSgiE1REY65XRiMb=yJO3u3QhyP8EEPQ@mail.gmail.com
1 parent 7f17fd6 commit 2f17844

File tree

27 files changed

+1957
-274
lines changed

27 files changed

+1957
-274
lines changed

contrib/file_fdw/input/file_fdw.source

+1
Original file line numberDiff line numberDiff line change
@@ -178,6 +178,7 @@ SELECT tableoid::regclass, * FROM p1;
178178
SELECT tableoid::regclass, * FROM p2;
179179
INSERT INTO pt VALUES (1, 'xyzzy'); -- ERROR
180180
INSERT INTO pt VALUES (2, 'xyzzy');
181+
UPDATE pt set a = 1 where a = 2; -- ERROR
181182
SELECT tableoid::regclass, * FROM pt;
182183
SELECT tableoid::regclass, * FROM p1;
183184
SELECT tableoid::regclass, * FROM p2;

contrib/file_fdw/output/file_fdw.source

+2
Original file line numberDiff line numberDiff line change
@@ -344,6 +344,8 @@ SELECT tableoid::regclass, * FROM p2;
344344
INSERT INTO pt VALUES (1, 'xyzzy'); -- ERROR
345345
ERROR: cannot route inserted tuples to a foreign table
346346
INSERT INTO pt VALUES (2, 'xyzzy');
347+
UPDATE pt set a = 1 where a = 2; -- ERROR
348+
ERROR: cannot route inserted tuples to a foreign table
347349
SELECT tableoid::regclass, * FROM pt;
348350
tableoid | a | b
349351
----------+---+-------

doc/src/sgml/ddl.sgml

+21-3
Original file line numberDiff line numberDiff line change
@@ -3005,6 +3005,11 @@ VALUES ('Albany', NULL, NULL, 'NY');
30053005
foreign table partitions.
30063006
</para>
30073007

3008+
<para>
3009+
Updating the partition key of a row might cause it to be moved into a
3010+
different partition where this row satisfies its partition constraint.
3011+
</para>
3012+
30083013
<sect3 id="ddl-partitioning-declarative-example">
30093014
<title>Example</title>
30103015

@@ -3302,9 +3307,22 @@ ALTER TABLE measurement ATTACH PARTITION measurement_y2008m02
33023307

33033308
<listitem>
33043309
<para>
3305-
An <command>UPDATE</command> that causes a row to move from one partition to
3306-
another fails, because the new value of the row fails to satisfy the
3307-
implicit partition constraint of the original partition.
3310+
When an <command>UPDATE</command> causes a row to move from one
3311+
partition to another, there is a chance that another concurrent
3312+
<command>UPDATE</command> or <command>DELETE</command> misses this row.
3313+
Suppose session 1 is performing an <command>UPDATE</command> on a
3314+
partition key, and meanwhile a concurrent session 2 for which this row
3315+
is visible performs an <command>UPDATE</command> or
3316+
<command>DELETE</command> operation on this row. Session 2 can silently
3317+
miss the row if the row is deleted from the partition due to session
3318+
1's activity. In such case, session 2's
3319+
<command>UPDATE</command> or <command>DELETE</command>, being unaware of
3320+
the row movement thinks that the row has just been deleted and concludes
3321+
that there is nothing to be done for this row. In the usual case where
3322+
the table is not partitioned, or where there is no row movement,
3323+
session 2 would have identified the newly updated row and carried out
3324+
the <command>UPDATE</command>/<command>DELETE</command> on this new row
3325+
version.
33083326
</para>
33093327
</listitem>
33103328

doc/src/sgml/ref/update.sgml

+9-4
Original file line numberDiff line numberDiff line change
@@ -282,10 +282,15 @@ UPDATE <replaceable class="parameter">count</replaceable>
282282

283283
<para>
284284
In the case of a partitioned table, updating a row might cause it to no
285-
longer satisfy the partition constraint. Since there is no provision to
286-
move the row to the partition appropriate to the new value of its
287-
partitioning key, an error will occur in this case. This can also happen
288-
when updating a partition directly.
285+
longer satisfy the partition constraint of the containing partition. In that
286+
case, if there is some other partition in the partition tree for which this
287+
row satisfies its partition constraint, then the row is moved to that
288+
partition. If there is no such partition, an error will occur. Behind the
289+
scenes, the row movement is actually a <command>DELETE</command> and
290+
<command>INSERT</command> operation. However, there is a possibility that a
291+
concurrent <command>UPDATE</command> or <command>DELETE</command> on the
292+
same row may miss this row. For details see the section
293+
<xref linkend="ddl-partitioning-declarative-limitations"/>.
289294
</para>
290295
</refsect1>
291296

doc/src/sgml/trigger.sgml

+23
Original file line numberDiff line numberDiff line change
@@ -153,6 +153,29 @@
153153
triggers.
154154
</para>
155155

156+
<para>
157+
If an <command>UPDATE</command> on a partitioned table causes a row to move
158+
to another partition, it will be performed as a <command>DELETE</command>
159+
from the original partition followed by an <command>INSERT</command> into
160+
the new partition. In this case, all row-level <literal>BEFORE</literal>
161+
<command>UPDATE</command> triggers and all row-level
162+
<literal>BEFORE</literal> <command>DELETE</command> triggers are fired on
163+
the original partition. Then all row-level <literal>BEFORE</literal>
164+
<command>INSERT</command> triggers are fired on the destination partition.
165+
The possibility of surprising outcomes should be considered when all these
166+
triggers affect the row being moved. As far as <literal>AFTER ROW</literal>
167+
triggers are concerned, <literal>AFTER</literal> <command>DELETE</command>
168+
and <literal>AFTER</literal> <command>INSERT</command> triggers are
169+
applied; but <literal>AFTER</literal> <command>UPDATE</command> triggers
170+
are not applied because the <command>UPDATE</command> has been converted to
171+
a <command>DELETE</command> and an <command>INSERT</command>. As far as
172+
statement-level triggers are concerned, none of the
173+
<command>DELETE</command> or <command>INSERT</command> triggers are fired,
174+
even if row movement occurs; only the <command>UPDATE</command> triggers
175+
defined on the target table used in the <command>UPDATE</command> statement
176+
will be fired.
177+
</para>
178+
156179
<para>
157180
Trigger functions invoked by per-statement triggers should always
158181
return <symbol>NULL</symbol>. Trigger functions invoked by per-row

src/backend/commands/copy.c

+7-33
Original file line numberDiff line numberDiff line change
@@ -170,7 +170,6 @@ typedef struct CopyStateData
170170
PartitionTupleRouting *partition_tuple_routing;
171171

172172
TransitionCaptureState *transition_capture;
173-
TupleConversionMap **transition_tupconv_maps;
174173

175174
/*
176175
* These variables are used to reduce overhead in textual COPY FROM.
@@ -2481,19 +2480,7 @@ CopyFrom(CopyState cstate)
24812480
* tuple).
24822481
*/
24832482
if (cstate->transition_capture != NULL)
2484-
{
2485-
int i;
2486-
2487-
cstate->transition_tupconv_maps = (TupleConversionMap **)
2488-
palloc0(sizeof(TupleConversionMap *) * proute->num_partitions);
2489-
for (i = 0; i < proute->num_partitions; ++i)
2490-
{
2491-
cstate->transition_tupconv_maps[i] =
2492-
convert_tuples_by_name(RelationGetDescr(proute->partitions[i]->ri_RelationDesc),
2493-
RelationGetDescr(cstate->rel),
2494-
gettext_noop("could not convert row type"));
2495-
}
2496-
}
2483+
ExecSetupChildParentMapForLeaf(proute);
24972484
}
24982485

24992486
/*
@@ -2587,7 +2574,6 @@ CopyFrom(CopyState cstate)
25872574
if (cstate->partition_tuple_routing)
25882575
{
25892576
int leaf_part_index;
2590-
TupleConversionMap *map;
25912577
PartitionTupleRouting *proute = cstate->partition_tuple_routing;
25922578

25932579
/*
@@ -2651,7 +2637,8 @@ CopyFrom(CopyState cstate)
26512637
*/
26522638
cstate->transition_capture->tcs_original_insert_tuple = NULL;
26532639
cstate->transition_capture->tcs_map =
2654-
cstate->transition_tupconv_maps[leaf_part_index];
2640+
TupConvMapForLeaf(proute, saved_resultRelInfo,
2641+
leaf_part_index);
26552642
}
26562643
else
26572644
{
@@ -2668,23 +2655,10 @@ CopyFrom(CopyState cstate)
26682655
* We might need to convert from the parent rowtype to the
26692656
* partition rowtype.
26702657
*/
2671-
map = proute->partition_tupconv_maps[leaf_part_index];
2672-
if (map)
2673-
{
2674-
Relation partrel = resultRelInfo->ri_RelationDesc;
2675-
2676-
tuple = do_convert_tuple(tuple, map);
2677-
2678-
/*
2679-
* We must use the partition's tuple descriptor from this
2680-
* point on. Use a dedicated slot from this point on until
2681-
* we're finished dealing with the partition.
2682-
*/
2683-
slot = proute->partition_tuple_slot;
2684-
Assert(slot != NULL);
2685-
ExecSetSlotDescriptor(slot, RelationGetDescr(partrel));
2686-
ExecStoreTuple(tuple, slot, InvalidBuffer, true);
2687-
}
2658+
tuple = ConvertPartitionTupleSlot(proute->parent_child_tupconv_maps[leaf_part_index],
2659+
tuple,
2660+
proute->partition_tuple_slot,
2661+
&slot);
26882662

26892663
tuple->t_tableOid = RelationGetRelid(resultRelInfo->ri_RelationDesc);
26902664
}

src/backend/commands/trigger.c

+41-11
Original file line numberDiff line numberDiff line change
@@ -2854,8 +2854,13 @@ ExecARUpdateTriggers(EState *estate, ResultRelInfo *relinfo,
28542854
{
28552855
HeapTuple trigtuple;
28562856

2857-
Assert(HeapTupleIsValid(fdw_trigtuple) ^ ItemPointerIsValid(tupleid));
2858-
if (fdw_trigtuple == NULL)
2857+
/*
2858+
* Note: if the UPDATE is converted into a DELETE+INSERT as part of
2859+
* update-partition-key operation, then this function is also called
2860+
* separately for DELETE and INSERT to capture transition table rows.
2861+
* In such case, either old tuple or new tuple can be NULL.
2862+
*/
2863+
if (fdw_trigtuple == NULL && ItemPointerIsValid(tupleid))
28592864
trigtuple = GetTupleForTrigger(estate,
28602865
NULL,
28612866
relinfo,
@@ -5414,7 +5419,12 @@ AfterTriggerPendingOnRel(Oid relid)
54145419
* triggers actually need to be queued. It is also called after each row,
54155420
* even if there are no triggers for that event, if there are any AFTER
54165421
* STATEMENT triggers for the statement which use transition tables, so that
5417-
* the transition tuplestores can be built.
5422+
* the transition tuplestores can be built. Furthermore, if the transition
5423+
* capture is happening for UPDATEd rows being moved to another partition due
5424+
* to the partition-key being changed, then this function is called once when
5425+
* the row is deleted (to capture OLD row), and once when the row is inserted
5426+
* into another partition (to capture NEW row). This is done separately because
5427+
* DELETE and INSERT happen on different tables.
54185428
*
54195429
* Transition tuplestores are built now, rather than when events are pulled
54205430
* off of the queue because AFTER ROW triggers are allowed to select from the
@@ -5463,12 +5473,25 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
54635473
bool update_new_table = transition_capture->tcs_update_new_table;
54645474
bool insert_new_table = transition_capture->tcs_insert_new_table;;
54655475

5466-
if ((event == TRIGGER_EVENT_DELETE && delete_old_table) ||
5467-
(event == TRIGGER_EVENT_UPDATE && update_old_table))
5476+
/*
5477+
* For INSERT events newtup should be non-NULL, for DELETE events
5478+
* oldtup should be non-NULL, whereas for UPDATE events normally both
5479+
* oldtup and newtup are non-NULL. But for UPDATE events fired for
5480+
* capturing transition tuples during UPDATE partition-key row
5481+
* movement, oldtup is NULL when the event is for a row being inserted,
5482+
* whereas newtup is NULL when the event is for a row being deleted.
5483+
*/
5484+
Assert(!(event == TRIGGER_EVENT_DELETE && delete_old_table &&
5485+
oldtup == NULL));
5486+
Assert(!(event == TRIGGER_EVENT_INSERT && insert_new_table &&
5487+
newtup == NULL));
5488+
5489+
if (oldtup != NULL &&
5490+
((event == TRIGGER_EVENT_DELETE && delete_old_table) ||
5491+
(event == TRIGGER_EVENT_UPDATE && update_old_table)))
54685492
{
54695493
Tuplestorestate *old_tuplestore;
54705494

5471-
Assert(oldtup != NULL);
54725495
old_tuplestore = transition_capture->tcs_private->old_tuplestore;
54735496

54745497
if (map != NULL)
@@ -5481,12 +5504,12 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
54815504
else
54825505
tuplestore_puttuple(old_tuplestore, oldtup);
54835506
}
5484-
if ((event == TRIGGER_EVENT_INSERT && insert_new_table) ||
5485-
(event == TRIGGER_EVENT_UPDATE && update_new_table))
5507+
if (newtup != NULL &&
5508+
((event == TRIGGER_EVENT_INSERT && insert_new_table) ||
5509+
(event == TRIGGER_EVENT_UPDATE && update_new_table)))
54865510
{
54875511
Tuplestorestate *new_tuplestore;
54885512

5489-
Assert(newtup != NULL);
54905513
new_tuplestore = transition_capture->tcs_private->new_tuplestore;
54915514

54925515
if (original_insert_tuple != NULL)
@@ -5502,11 +5525,18 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
55025525
tuplestore_puttuple(new_tuplestore, newtup);
55035526
}
55045527

5505-
/* If transition tables are the only reason we're here, return. */
5528+
/*
5529+
* If transition tables are the only reason we're here, return. As
5530+
* mentioned above, we can also be here during update tuple routing in
5531+
* presence of transition tables, in which case this function is called
5532+
* separately for oldtup and newtup, so we expect exactly one of them
5533+
* to be NULL.
5534+
*/
55065535
if (trigdesc == NULL ||
55075536
(event == TRIGGER_EVENT_DELETE && !trigdesc->trig_delete_after_row) ||
55085537
(event == TRIGGER_EVENT_INSERT && !trigdesc->trig_insert_after_row) ||
5509-
(event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row))
5538+
(event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row) ||
5539+
(event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup == NULL))))
55105540
return;
55115541
}
55125542

0 commit comments

Comments
 (0)