Copy on write

Copy on Write (CoW) is a mechanism to simplify the indexing API and improve performance by avoiding copies where possible. Under CoW, any DataFrame or Series derived from another object always behaves as a copy. An explanation of how to use Copy on Write efficiently can be found in the Copy-on-Write section of the pandas user guide.
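As a small illustration of the "always behaves as a copy" rule, the sketch below derives a Series from a DataFrame and writes into it. It assumes a pandas version that provides the `mode.copy_on_write` option (newer releases may enable CoW by default and warn when the option is set), so treat it as a sketch under that assumption rather than a reference.

```python
import pandas as pd

# Assumption: the "mode.copy_on_write" option is available in the installed pandas.
with pd.option_context("mode.copy_on_write", True):
    df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

    # Deriving a Series does not copy the data eagerly ...
    col = df["a"]

    # ... but writing into the derived object must not affect the parent,
    # so the write is what triggers the copy.
    col.iloc[0] = 100

    parent_values = df["a"].tolist()
    derived_values = col.tolist()

print(parent_values)
print(derived_values)
```

The parent keeps its original values while the derived Series sees the new one, even though no copy was made when `col` was created.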

Reference tracking

To determine whether we have to make a copy when writing into a DataFrame, we need to know whether its values are shared with another DataFrame. pandas internally keeps track of all Blocks that share values with another Block, so that it can tell when a copy needs to be triggered. This reference tracking mechanism is implemented at the Block level.

We use a custom reference tracker object, BlockValuesRefs, that keeps track of every block whose values share memory with those of another block. Each block is held through a weak reference. Every pair of blocks that share some memory should point to the same BlockValuesRefs object. If one block goes out of scope, the weak reference to that block dies. As a consequence, the reference tracker object always knows how many blocks are alive and share memory.
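The idea can be sketched with Python's `weakref` module. The classes below, `ToyRefs` and `ToyBlock`, are hypothetical stand-ins written for this illustration; they are not the pandas implementation and only mimic the behavior described above.

```python
import gc
import weakref


class ToyRefs:
    """Hypothetical tracker: weakly references every block sharing one buffer."""

    def __init__(self):
        self.referenced_blocks = []

    def add_reference(self, block):
        # A weak reference does not keep the block alive.
        self.referenced_blocks.append(weakref.ref(block))

    def has_reference(self):
        # Drop weak references whose block has been garbage collected.
        self.referenced_blocks = [
            ref for ref in self.referenced_blocks if ref() is not None
        ]
        # More than one live block means the values are shared.
        return len(self.referenced_blocks) > 1


class ToyBlock:
    """Hypothetical block: holds values and points at the shared tracker."""

    def __init__(self, values, refs=None):
        self.values = values
        self.refs = refs if refs is not None else ToyRefs()
        self.refs.add_reference(self)


values = [1, 2, 3]
first = ToyBlock(values)
second = ToyBlock(values, refs=first.refs)  # shares memory, shares tracker

shared_before = first.refs.has_reference()

del second    # the second block goes out of scope
gc.collect()  # not needed on CPython, harmless elsewhere

shared_after = first.refs.has_reference()
print(shared_before, shared_after)
```

Because the tracker holds only weak references, it never keeps a block alive on its own, and the count of live blocks stays accurate without any explicit unregister step.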

Whenever a DataFrame or Series object shares data with another object, each of those objects must have its own BlockManager and Block objects. In other words, one Block instance (one that is held by a DataFrame, not necessarily an intermediate object) should only ever be used by a single DataFrame/Series object. For example, when you want to use the same Block for another object, you can create a shallow copy of the Block instance with block.copy(deep=False), which creates a new Block instance with the same underlying values and correctly sets up the references.

Before writing into the values, we can ask the reference tracking object whether another block that shares data with us is still alive. If there is one, we trigger a copy before writing.
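A minimal sketch of this check-then-write step follows. `ToyRefs`, `ToyBlock`, and the `setitem` method are hypothetical names used only for illustration and do not mirror the actual pandas signatures. A shallow copy shares both the values and the tracker, and a write copies only when another block is still alive.

```python
import gc
import weakref


class ToyRefs:
    """Hypothetical tracker of all live blocks sharing one buffer."""

    def __init__(self):
        self.referenced_blocks = []

    def add_reference(self, block):
        self.referenced_blocks.append(weakref.ref(block))

    def has_reference(self):
        self.referenced_blocks = [
            ref for ref in self.referenced_blocks if ref() is not None
        ]
        return len(self.referenced_blocks) > 1


class ToyBlock:
    """Hypothetical block with a shallow copy and a copy-before-write setter."""

    def __init__(self, values, refs=None):
        self.values = values
        self.refs = refs if refs is not None else ToyRefs()
        self.refs.add_reference(self)

    def copy(self, deep=True):
        if deep:
            # New buffer, new tracker: nothing is shared.
            return ToyBlock(list(self.values))
        # Same buffer, same tracker: the sharing is recorded.
        return ToyBlock(self.values, refs=self.refs)

    def setitem(self, index, value):
        block = self
        if self.refs.has_reference():
            # Another live block shares our values: copy before writing.
            block = self.copy(deep=True)
        block.values[index] = value
        return block


parent = ToyBlock([1, 2, 3])
child = parent.copy(deep=False)
shares_buffer = child.values is parent.values

# The write triggers a copy; rebinding drops the old shallow-copy block.
child = child.setitem(0, 100)
gc.collect()  # not needed on CPython, harmless elsewhere

# The parent is now the only live block on its tracker, so it writes in place.
result = parent.setitem(1, 50)
wrote_in_place = result is parent

print(parent.values, child.values, wrote_in_place)
```

Returning the block that was written to, rather than mutating the tracker in place, lets the old block die naturally, so the tracker clears itself through the weak references.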