溫馨提示×

溫馨提示×

您好,登錄后才能下訂單哦!

密碼登錄×
登錄注冊×
其他方式登錄
點擊 登錄注冊 即表示同意《億速云用戶服務條款》

golang中三色GC的實現原理

發布時間:2020-06-23 10:33:49 來源:億速云 閱讀:301 作者:Leah 欄目:編程語言

今天就跟大家聊聊有關golang中三色GC的實現原理,可能很多人都不太了解,為了讓大家更加了解,小編給大家總結了以下內容,希望大家根據這篇文章可以有所收獲。

基礎概念

內存結構

go在程序啟動時會分配一塊虛擬內存地址是連續的內存, 結構如下:

golang中三色GC的實現原理

這一塊內存分為了3個區域, 在X64上大小分別是512M, 16G和512G, 它們的作用如下:

arena

arena區域就是我們通常說的heap, go從heap分配的內存都在這個區域中.

bitmap

bitmap區域用于表示arena區域中哪些地址保存了對象, 并且對象中哪些地址包含了指針。

bitmap區域中一個byte(8 bit)對應了arena區域中的四個指針大小的內存, 也就是2 bit對應一個指針大小的內存。

所以bitmap區域的大小是 512GB / 指針大小(8 byte) / 4 = 16GB。

bitmap區域中的一個byte對應arena區域的四個指針大小的內存的結構如下,每一個指針大小的內存都會有兩個bit分別表示是否應該繼續掃描和是否包含指針:

golang中三色GC的實現原理

bitmap中的byte和arena的對應關系從末尾開始, 也就是隨著內存分配會向兩邊擴展:

golang中三色GC的實現原理

spans

spans區域用于表示arena區中的某一頁(Page)屬于哪個span, 什么是span將在下面介紹。

spans區域中一個指針(8 byte)對應了arena區域中的一頁(在go中一頁=8KB)。

所以spans的大小是 512GB / 頁大小(8KB) * 指針大小(8 byte) = 512MB。

spans區域的一個指針對應arena區域的一頁的結構如下, 和bitmap不一樣的是對應關系會從開頭開始:

golang中三色GC的實現原理

什么時候從Heap分配對象

很多講解go的文章和書籍中都提到過, go會自動確定哪些對象應該放在棧上, 哪些對象應該放在堆上。

簡單的來說, 當一個對象的內容可能在生成該對象的函數結束后被訪問, 那么這個對象就會分配在堆上。

在堆上分配對象的情況包括:

1、返回對象的指針

2、傳遞了對象的指針到其他函數

3、在閉包中使用了對象并且需要修改對象

4、使用new

在C語言中函數返回在棧上的對象的指針是非常危險的事情, 但在go中卻是安全的, 因為這個對象會自動在堆上分配.
go決定是否使用堆分配對象的過程也叫"逃逸分析".

GC Bitmap

GC在標記時需要知道哪些地方包含了指針, 例如上面提到的bitmap區域涵蓋了arena區域中的指針信息。

除此之外, GC還需要知道??臻g上哪些地方包含了指針,因為??臻g不屬于arena區域, ??臻g的指針信息將會在函數信息里面。

另外, GC在分配對象時也需要根據對象的類型設置bitmap區域, 來源的指針信息將會在類型信息里面。

總結起來go中有以下的GC Bitmap:

1、bitmap區域: 涵蓋了arena區域, 使用2 bit表示一個指針大小的內存

2、函數信息: 涵蓋了函數的??臻g, 使用1 bit表示一個指針大小的內存 (位于stackmap.bytedata)

3、類型信息: 在分配對象時會復制到bitmap區域, 使用1 bit表示一個指針大小的內存 (位于_type.gcdata)

Span

span是用于分配對象的區塊, 下圖是簡單說明了Span的內部結構:

golang中三色GC的實現原理通常一個span包含了多個大小相同的元素, 一個元素會保存一個對象, 除非:

1、span用于保存大對象, 這種情況span只有一個元素

2、span用于保存極小對象且不包含指針的對象(tiny object), 這種情況span會用一個元素保存多個對象

span中有一個freeindex標記下一次分配對象時應該開始搜索的地址, 分配后freeindex會增加,在freeindex之前的元素都是已分配的, 在freeindex之后的元素有可能已分配, 也有可能未分配。

span每次GC以后都可能會回收掉一些元素, allocBits用于標記哪些元素是已分配的, 哪些元素是未分配的。

使用freeindex + allocBits可以在分配時跳過已分配的元素, 把對象設置在未分配的元素中,但因為每次都去訪問allocBits效率會比較慢, span中有一個整數型的allocCache用于緩存freeindex開始的bitmap, 緩存的bit值與原值相反。

gcmarkBits用于在gc時標記哪些對象存活, 每次gc以后gcmarkBits會變為allocBits.
需要注意的是span結構本身的內存是從系統分配的, 上面提到的spans區域和bitmap區域都只是一個索引.

Span的類型

span根據大小可以分為67個類型, 如下:

// class  bytes/obj  bytes/span  objects  tail waste  max waste
//     1          8        8192     1024           0     87.50%
//     2         16        8192      512           0     43.75%
//     3         32        8192      256           0     46.88%
//     4         48        8192      170          32     31.52%
//     5         64        8192      128           0     23.44%
//     6         80        8192      102          32     19.07%
//     7         96        8192       85          32     15.95%
//     8        112        8192       73          16     13.56%
//     9        128        8192       64           0     11.72%
//    10        144        8192       56         128     11.82%
//    11        160        8192       51          32      9.73%
//    12        176        8192       46          96      9.59%
//    13        192        8192       42         128      9.25%
//    14        208        8192       39          80      8.12%
//    15        224        8192       36         128      8.15%
//    16        240        8192       34          32      6.62%
//    17        256        8192       32           0      5.86%
//    18        288        8192       28         128     12.16%
//    19        320        8192       25         192     11.80%
//    20        352        8192       23          96      9.88%
//    21        384        8192       21         128      9.51%
//    22        416        8192       19         288     10.71%
//    23        448        8192       18         128      8.37%
//    24        480        8192       17          32      6.82%
//    25        512        8192       16           0      6.05%
//    26        576        8192       14         128     12.33%
//    27        640        8192       12         512     15.48%
//    28        704        8192       11         448     13.93%
//    29        768        8192       10         512     13.94%
//    30        896        8192        9         128     15.52%
//    31       1024        8192        8           0     12.40%
//    32       1152        8192        7         128     12.41%
//    33       1280        8192        6         512     15.55%
//    34       1408       16384       11         896     14.00%
//    35       1536        8192        5         512     14.00%
//    36       1792       16384        9         256     15.57%
//    37       2048        8192        4           0     12.45%
//    38       2304       16384        7         256     12.46%
//    39       2688        8192        3         128     15.59%
//    40       3072       24576        8           0     12.47%
//    41       3200       16384        5         384      6.22%
//    42       3456       24576        7         384      8.83%
//    43       4096        8192        2           0     15.60%
//    44       4864       24576        5         256     16.65%
//    45       5376       16384        3         256     10.92%
//    46       6144       24576        4           0     12.48%
//    47       6528       32768        5         128      6.23%
//    48       6784       40960        6         256      4.36%
//    49       6912       49152        7         768      3.37%
//    50       8192        8192        1           0     15.61%
//    51       9472       57344        6         512     14.28%
//    52       9728       49152        5         512      3.64%
//    53      10240       40960        4           0      4.99%
//    54      10880       32768        3         128      6.24%
//    55      12288       24576        2           0     11.45%
//    56      13568       40960        3         256      9.99%
//    57      14336       57344        4           0      5.35%
//    58      16384       16384        1           0     12.49%
//    59      18432       73728        4           0     11.11%
//    60      19072       57344        3         128      3.57%
//    61      20480       40960        2           0      6.87%
//    62      21760       65536        3         256      6.25%
//    63      24576       24576        1           0     11.45%
//    64      27264       81920        3         128     10.00%
//    65      28672       57344        2           0      4.91%
//    66      32768       32768        1           0     12.50%

以類型(class)為1的span為例,
span中的元素大小是8 byte, span本身占1頁也就是8K, 一共可以保存1024個對象.

在分配對象時, 會根據對象的大小決定使用什么類型的span,
例如16 byte的對象會使用span 2, 17 byte的對象會使用span 3, 32 byte的對象會使用span 3.
從這個例子也可以看到, 分配17和32 byte的對象都會使用span 3, 也就是說部分大小的對象在分配時會浪費一定的空間.

有人可能會注意到, 上面最大的span的元素大小是32K, 那么分配超過32K的對象會在哪里分配呢?
超過32K的對象稱為"大對象", 分配大對象時, 會直接從heap分配一個特殊的span,
這個特殊的span的類型(class)是0, 只包含了一個大對象, span的大小由對象的大小決定.

特殊的span加上的66個標準的span, 一共組成了67個span類型.

Span的位置

在前一篇中我提到了P是一個虛擬的資源, 同一時間只能有一個線程訪問同一個P, 所以P中的數據不需要鎖.
為了分配對象時有更好的性能, 各個P中都有span的緩存(也叫mcache), 緩存的結構如下:

golang中三色GC的實現原理

各個P中按span類型的不同, 有67*2=134個span的緩存,

其中scan和noscan的區別在于,
如果對象包含了指針, 分配對象時會使用scan的span,
如果對象不包含指針, 分配對象時會使用noscan的span.
把span分為scan和noscan的意義在于,
GC掃描對象的時候對于noscan的span可以不去查看bitmap區域來標記子對象, 這樣可以大幅提升標記的效率.

在分配對象時將會從以下的位置獲取適合的span用于分配:

  • 首先從P的緩存(mcache)獲取, 如果有緩存的span并且未滿則使用, 這個步驟不需要鎖

  • 然后從全局緩存(mcentral)獲取, 如果獲取成功則設置到P, 這個步驟需要鎖

  • 最后從mheap獲取, 獲取后設置到全局緩存, 這個步驟需要鎖

在P中緩存span的做法跟CoreCLR中線程緩存分配上下文(Allocation Context)的做法相似,
都可以讓分配對象時大部分時候不需要線程鎖, 改進分配的性能.

分配對象的處理

分配對象的流程

go從堆分配對象時會調用newobject函數, 這個函數的流程大致如下:

golang中三色GC的實現原理

首先會檢查GC是否在工作中, 如果GC在工作中并且當前的G分配了一定大小的內存則需要協助GC做一定的工作,
這個機制叫GC Assist, 用于防止分配內存太快導致GC回收跟不上的情況發生.

之后會判斷是小對象還是大對象, 如果是大對象則直接調用largeAlloc從堆中分配,
如果是小對象分3個階段獲取可用的span, 然后從span中分配對象:

  • 首先從P的緩存(mcache)獲取

  • 然后從全局緩存(mcentral)獲取, 全局緩存中有可用的span的列表

  • 最后從mheap獲取, mheap中也有span的自由列表, 如果都獲取失敗則從arena區域分配

這三個階段的詳細結構如下圖:

golang中三色GC的實現原理

數據類型的定義

分配對象涉及的數據類型包含:

p: 前一篇提到過, P是協程中的用于運行go代碼的虛擬資源
m: 前一篇提到過, M目前代表系統線程
g: 前一篇提到過, G就是goroutine
mspan: 用于分配對象的區塊
mcentral: 全局的mspan緩存, 一共有67*2=134個
mheap: 用于管理heap的對象, 全局只有一個

源代碼分析

go從堆分配對象時會調用newobject函數, 先從這個函數看起:

// implementation of new builtin
// compiler (both frontend and SSA backend) knows the signature
// of this function
func newobject(typ *_type) unsafe.Pointer {
    return mallocgc(typ.size, typ, true)
}

newobject調用了mallocgc函數:

// Allocate an object of size bytes.
// Small objects are allocated from the per-P cache's free lists.
// Large objects (> 32 kB) are allocated straight from the heap.
func mallocgc(size uintptr, typ *_type, needzero bool) unsafe.Pointer {
    if gcphase == _GCmarktermination {
        throw("mallocgc called with gcphase == _GCmarktermination")
    }

    if size == 0 {
        return unsafe.Pointer(&zerobase)
    }

    if debug.sbrk != 0 {
        align := uintptr(16)
        if typ != nil {
            align = uintptr(typ.align)
        }
        return persistentalloc(size, align, &memstats.other_sys)
    }

    // 判斷是否要輔助GC工作
    // gcBlackenEnabled在GC的標記階段會開啟
    // assistG is the G to charge for this allocation, or nil if
    // GC is not currently active.
    var assistG *g
    if gcBlackenEnabled != 0 {
        // Charge the current user G for this allocation.
        assistG = getg()
        if assistG.m.curg != nil {
            assistG = assistG.m.curg
        }
        // Charge the allocation against the G. We'll account
        // for internal fragmentation at the end of mallocgc.
        assistG.gcAssistBytes -= int64(size)

        // 會按分配的大小判斷需要協助GC完成多少工作
        // 具體的算法將在下面講解收集器時說明
        if assistG.gcAssistBytes < 0 {
            // This G is in debt. Assist the GC to correct
            // this before allocating. This must happen
            // before disabling preemption.
            gcAssistAlloc(assistG)
        }
    }

    // 增加當前G對應的M的lock計數, 防止這個G被搶占
    // Set mp.mallocing to keep from being preempted by GC.
    mp := acquirem()
    if mp.mallocing != 0 {
        throw("malloc deadlock")
    }
    if mp.gsignal == getg() {
        throw("malloc during signal")
    }
    mp.mallocing = 1

    shouldhelpgc := false
    dataSize := size
    // 獲取當前G對應的M對應的P的本地span緩存(mcache)
    // 因為M在擁有P后會把P的mcache設到M中, 這里返回的是getg().m.mcache
    c := gomcache()
    var x unsafe.Pointer
    noscan := typ == nil || typ.kind&kindNoPointers != 0
    // 判斷是否小對象, maxSmallSize當前的值是32K
    if size <= maxSmallSize {
        // 如果對象不包含指針, 并且對象的大小小于16 bytes, 可以做特殊處理
        // 這里是針對非常小的對象的優化, 因為span的元素最小只能是8 byte, 如果對象更小那么很多空間都會被浪費掉
        // 非常小的對象可以整合在"class 2 noscan"的元素(大小為16 byte)中
        if noscan && size < maxTinySize {
            // Tiny allocator.
            //
            // Tiny allocator combines several tiny allocation requests
            // into a single memory block. The resulting memory block
            // is freed when all subobjects are unreachable. The subobjects
            // must be noscan (don't have pointers), this ensures that
            // the amount of potentially wasted memory is bounded.
            //
            // Size of the memory block used for combining (maxTinySize) is tunable.
            // Current setting is 16 bytes, which relates to 2x worst case memory
            // wastage (when all but one subobjects are unreachable).
            // 8 bytes would result in no wastage at all, but provides less
            // opportunities for combining.
            // 32 bytes provides more opportunities for combining,
            // but can lead to 4x worst case wastage.
            // The best case winning is 8x regardless of block size.
            //
            // Objects obtained from tiny allocator must not be freed explicitly.
            // So when an object will be freed explicitly, we ensure that
            // its size >= maxTinySize.
            //
            // SetFinalizer has a special case for objects potentially coming
            // from tiny allocator, it such case it allows to set finalizers
            // for an inner byte of a memory block.
            //
            // The main targets of tiny allocator are small strings and
            // standalone escaping variables. On a json benchmark
            // the allocator reduces number of allocations by ~12% and
            // reduces heap size by ~20%.
            off := c.tinyoffset
            // Align tiny pointer for required (conservative) alignment.
            if size&7 == 0 {
                off = round(off, 8)
            } else if size&3 == 0 {
                off = round(off, 4)
            } else if size&1 == 0 {
                off = round(off, 2)
            }
            if off+size <= maxTinySize && c.tiny != 0 {
                // The object fits into existing tiny block.
                x = unsafe.Pointer(c.tiny + off)
                c.tinyoffset = off + size
                c.local_tinyallocs++
                mp.mallocing = 0
                releasem(mp)
                return x
            }
            // Allocate a new maxTinySize block.
            span := c.alloc[tinySpanClass]
            v := nextFreeFast(span)
            if v == 0 {
                v, _, shouldhelpgc = c.nextFree(tinySpanClass)
            }
            x = unsafe.Pointer(v)
            (*[2]uint64)(x)[0] = 0
            (*[2]uint64)(x)[1] = 0
            // See if we need to replace the existing tiny block with the new one
            // based on amount of remaining free space.
            if size < c.tinyoffset || c.tiny == 0 {
                c.tiny = uintptr(x)
                c.tinyoffset = size
            }
            size = maxTinySize
        } else {
            // 否則按普通的小對象分配
            // 首先獲取對象的大小應該使用哪個span類型
            var sizeclass uint8
            if size <= smallSizeMax-8 {
                sizeclass = size_to_class8[(size+smallSizeDiv-1)/smallSizeDiv]
            } else {
                sizeclass = size_to_class128[(size-smallSizeMax+largeSizeDiv-1)/largeSizeDiv]
            }
            size = uintptr(class_to_size[sizeclass])
            // 等于sizeclass * 2 + (noscan ? 1 : 0)
            spc := makeSpanClass(sizeclass, noscan)
            span := c.alloc[spc]
            // 嘗試快速的從這個span中分配
            v := nextFreeFast(span)
            if v == 0 {
                // 分配失敗, 可能需要從mcentral或者mheap中獲取
                // 如果從mcentral或者mheap獲取了新的span, 則shouldhelpgc會等于true
                // shouldhelpgc會等于true時會在下面判斷是否要觸發GC
                v, span, shouldhelpgc = c.nextFree(spc)
            }
            x = unsafe.Pointer(v)
            if needzero && span.needzero != 0 {
                memclrNoHeapPointers(unsafe.Pointer(v), size)
            }
        }
    } else {
        // 大對象直接從mheap分配, 這里的s是一個特殊的span, 它的class是0
        var s *mspan
        shouldhelpgc = true
        systemstack(func() {
            s = largeAlloc(size, needzero, noscan)
        })
        s.freeindex = 1
        s.allocCount = 1
        x = unsafe.Pointer(s.base())
        size = s.elemsize
    }

    // 設置arena對應的bitmap, 記錄哪些位置包含了指針, GC會使用bitmap掃描所有可到達的對象
    var scanSize uintptr
    if !noscan {
        // If allocating a defer+arg block, now that we've picked a malloc size
        // large enough to hold everything, cut the "asked for" size down to
        // just the defer header, so that the GC bitmap will record the arg block
        // as containing nothing at all (as if it were unused space at the end of
        // a malloc block caused by size rounding).
        // The defer arg areas are scanned as part of scanstack.
        if typ == deferType {
            dataSize = unsafe.Sizeof(_defer{})
        }
        // 這個函數非常的長, 有興趣的可以看
        // https://github.com/golang/go/blob/go1.9.2/src/runtime/mbitmap.go#L855
        // 雖然代碼很長但是設置的內容跟上面說過的bitmap區域的結構一樣
        // 根據類型信息設置scan bit跟pointer bit, scan bit成立表示應該繼續掃描, pointer bit成立表示該位置是指針
        // 需要注意的地方有
        // - 如果一個類型只有開頭的地方包含指針, 例如[ptr, ptr, large non-pointer data]
        //   那么后面的部分的scan bit將會為0, 這樣可以大幅提升標記的效率
        // - 第二個slot的scan bit用途比較特殊, 它并不用于標記是否繼續scan, 而是標記checkmark
        // 什么是checkmark
        // - 因為go的并行GC比較復雜, 為了檢查實現是否正確, go需要在有一個檢查所有應該被標記的對象是否被標記的機制
        //   這個機制就是checkmark, 在開啟checkmark時go會在標記階段的最后停止整個世界然后重新執行一次標記
        //   上面的第二個slot的scan bit就是用于標記對象在checkmark標記中是否被標記的
        // - 有的人可能會發現第二個slot要求對象最少有兩個指針的大小, 那么只有一個指針的大小的對象呢?
        //   只有一個指針的大小的對象可以分為兩種情況
        //   對象就是指針, 因為大小剛好是1個指針所以并不需要看bitmap區域, 這時第一個slot就是checkmark
        //   對象不是指針, 因為有tiny alloc的機制, 不是指針且只有一個指針大小的對象會分配在兩個指針的span中
        //               這時候也不需要看bitmap區域, 所以和上面一樣第一個slot就是checkmark
        heapBitsSetType(uintptr(x), size, dataSize, typ)
        if dataSize > typ.size {
            // Array allocation. If there are any
            // pointers, GC has to scan to the last
            // element.
            if typ.ptrdata != 0 {
                scanSize = dataSize - typ.size + typ.ptrdata
            }
        } else {
            scanSize = typ.ptrdata
        }
        c.local_scan += scanSize
    }

    // 內存屏障, 因為x86和x64的store不會亂序所以這里只是個針對編譯器的屏障, 匯編中是ret
    // Ensure that the stores above that initialize x to
    // type-safe memory and set the heap bits occur before
    // the caller can make x observable to the garbage
    // collector. Otherwise, on weakly ordered machines,
    // the garbage collector could follow a pointer to x,
    // but see uninitialized memory or stale heap bits.
    publicationBarrier()

    // 如果當前在GC中, 需要立刻標記分配后的對象為"黑色", 防止它被回收
    // Allocate black during GC.
    // All slots hold nil so no scanning is needed.
    // This may be racing with GC so do it atomically if there can be
    // a race marking the bit.
    if gcphase != _GCoff {
        gcmarknewobject(uintptr(x), size, scanSize)
    }

    // Race Detector的處理(用于檢測線程沖突問題)
    if raceenabled {
        racemalloc(x, size)
    }

    // Memory Sanitizer的處理(用于檢測危險指針等內存問題)
    if msanenabled {
        msanmalloc(x, size)
    }

    // 重新允許當前的G被搶占
    mp.mallocing = 0
    releasem(mp)

    // 除錯記錄
    if debug.allocfreetrace != 0 {
        tracealloc(x, size, typ)
    }

    // Profiler記錄
    if rate := MemProfileRate; rate > 0 {
        if size < uintptr(rate) && int32(size) < c.next_sample {
            c.next_sample -= int32(size)
        } else {
            mp := acquirem()
            profilealloc(mp, x, size)
            releasem(mp)
        }
    }

    // gcAssistBytes減去"實際分配大小 - 要求分配大小", 調整到準確值
    if assistG != nil {
        // Account for internal fragmentation in the assist
        // debt now that we know it.
        assistG.gcAssistBytes -= int64(size - dataSize)
    }

    // 如果之前獲取了新的span, 則判斷是否需要后臺啟動GC
    // 這里的判斷邏輯(gcTrigger)會在下面詳細說明
    if shouldhelpgc {
        if t := (gcTrigger{kind: gcTriggerHeap}); t.test() {
            gcStart(gcBackgroundMode, t)
        }
    }

    return x
}

接下來看看如何從span里面分配對象, 首先會調用nextFreeFast嘗試快速分配:

// nextFreeFast returns the next free object if one is quickly available.
// Otherwise it returns 0.
func nextFreeFast(s *mspan) gclinkptr {
    // 獲取第一個非0的bit是第幾個bit, 也就是哪個元素是未分配的
    theBit := sys.Ctz64(s.allocCache) // Is there a free object in the allocCache?
    // 找到未分配的元素
    if theBit < 64 {
        result := s.freeindex + uintptr(theBit)
        // 要求索引值小于元素數量
        if result < s.nelems {
            // 下一個freeindex
            freeidx := result + 1
            // 可以被64整除時需要特殊處理(參考nextFree)
            if freeidx%64 == 0 && freeidx != s.nelems {
                return 0
            }
            // 更新freeindex和allocCache(高位都是0, 用盡以后會更新)
            s.allocCache >>= uint(theBit + 1)
            s.freeindex = freeidx
            // 返回元素所在的地址
            v := gclinkptr(result*s.elemsize + s.base())
            // 添加已分配的元素計數
            s.allocCount++
            return v
        }
    }
    return 0
}

如果在freeindex后無法快速找到未分配的元素, 就需要調用nextFree做出更復雜的處理:

// nextFree returns the next free object from the cached span if one is available.
// Otherwise it refills the cache with a span with an available object and
// returns that object along with a flag indicating that this was a heavy
// weight allocation. If it is a heavy weight allocation the caller must
// determine whether a new GC cycle needs to be started or if the GC is active
// whether this goroutine needs to assist the GC.
func (c *mcache) nextFree(spc spanClass) (v gclinkptr, s *mspan, shouldhelpgc bool) {
    // 找到下一個freeindex和更新allocCache
    s = c.alloc[spc]
    shouldhelpgc = false
    freeIndex := s.nextFreeIndex()
    // 如果span里面所有元素都已分配, 則需要獲取新的span
    if freeIndex == s.nelems {
        // The span is full.
        if uintptr(s.allocCount) != s.nelems {
            println("runtime: s.allocCount=", s.allocCount, "s.nelems=", s.nelems)
            throw("s.allocCount != s.nelems && freeIndex == s.nelems")
        }
        // 申請新的span
        systemstack(func() {
            c.refill(spc)
        })
        // 獲取申請后的新的span, 并設置需要檢查是否執行GC
        shouldhelpgc = true
        s = c.alloc[spc]

        freeIndex = s.nextFreeIndex()
    }

    if freeIndex >= s.nelems {
        throw("freeIndex is not valid")
    }

    // 返回元素所在的地址
    v = gclinkptr(freeIndex*s.elemsize + s.base())
    // 添加已分配的元素計數
    s.allocCount++
    if uintptr(s.allocCount) > s.nelems {
        println("s.allocCount=", s.allocCount, "s.nelems=", s.nelems)
        throw("s.allocCount > s.nelems")
    }
    return
}

如果mcache中指定類型的span已滿, 就需要調用refill函數申請新的span:

// Gets a span that has a free object in it and assigns it
// to be the cached span for the given sizeclass. Returns this span.
func (c *mcache) refill(spc spanClass) *mspan {
    _g_ := getg()

    // 防止G被搶占
    _g_.m.locks++
    // Return the current cached span to the central lists.
    s := c.alloc[spc]

    // 確保當前的span所有元素都已分配
    if uintptr(s.allocCount) != s.nelems {
        throw("refill of span with free space remaining")
    }

    // 設置span的incache屬性, 除非是全局使用的空span(也就是mcache里面span指針的默認值)
    if s != &emptymspan {
        s.incache = false
    }

    // 向mcentral申請一個新的span
    // Get a new cached span from the central lists.
    s = mheap_.central[spc].mcentral.cacheSpan()
    if s == nil {
        throw("out of memory")
    }

    if uintptr(s.allocCount) == s.nelems {
        throw("span has no free space")
    }

    // 設置新的span到mcache中
    c.alloc[spc] = s
    // 允許G被搶占
    _g_.m.locks--
    return s
}

向mcentral申請一個新的span會通過cacheSpan函數:
mcentral首先嘗試從內部的鏈表復用原有的span, 如果復用失敗則向mheap申請.

// Allocate a span to use in an MCache.
func (c *mcentral) cacheSpan() *mspan {
    // 讓當前G協助一部分的sweep工作
    // Deduct credit for this span allocation and sweep if necessary.
    spanBytes := uintptr(class_to_allocnpages[c.spanclass.sizeclass()]) * _PageSize
    deductSweepCredit(spanBytes, 0)

    // 對mcentral上鎖, 因為可能會有多個M(P)同時訪問
    lock(&c.lock)
    traceDone := false
    if trace.enabled {
        traceGCSweepStart()
    }
    sg := mheap_.sweepgen
retry:
    // mcentral里面有兩個span的鏈表
    // - nonempty表示確定該span最少有一個未分配的元素
    // - empty表示不確定該span最少有一個未分配的元素
    // 這里優先查找nonempty的鏈表
    // sweepgen每次GC都會增加2
    // - sweepgen == 全局sweepgen, 表示span已經sweep過
    // - sweepgen == 全局sweepgen-1, 表示span正在sweep
    // - sweepgen == 全局sweepgen-2, 表示span等待sweep
    var s *mspan
    for s = c.nonempty.first; s != nil; s = s.next {
        // 如果span等待sweep, 嘗試原子修改sweepgen為全局sweepgen-1
        if s.sweepgen == sg-2 && atomic.Cas(&s.sweepgen, sg-2, sg-1) {
            // 修改成功則把span移到empty鏈表, sweep它然后跳到havespan
            c.nonempty.remove(s)
            c.empty.insertBack(s)
            unlock(&c.lock)
            s.sweep(true)
            goto havespan
        }
        // 如果這個span正在被其他線程sweep, 就跳過
        if s.sweepgen == sg-1 {
            // the span is being swept by background sweeper, skip
            continue
        }
        // span已經sweep過
        // 因為nonempty鏈表中的span確定最少有一個未分配的元素, 這里可以直接使用它
        // we have a nonempty span that does not require sweeping, allocate from it
        c.nonempty.remove(s)
        c.empty.insertBack(s)
        unlock(&c.lock)
        goto havespan
    }

    // 查找empty的鏈表
    for s = c.empty.first; s != nil; s = s.next {
        // 如果span等待sweep, 嘗試原子修改sweepgen為全局sweepgen-1
        if s.sweepgen == sg-2 && atomic.Cas(&s.sweepgen, sg-2, sg-1) {
            // 把span放到empty鏈表的最后
            // we have an empty span that requires sweeping,
            // sweep it and see if we can free some space in it
            c.empty.remove(s)
            // swept spans are at the end of the list
            c.empty.insertBack(s)
            unlock(&c.lock)
            // 嘗試sweep
            s.sweep(true)
            // sweep以后還需要檢測是否有未分配的對象, 如果有則可以使用它
            freeIndex := s.nextFreeIndex()
            if freeIndex != s.nelems {
                s.freeindex = freeIndex
                goto havespan
            }
            lock(&c.lock)
            // the span is still empty after sweep
            // it is already in the empty list, so just retry
            goto retry
        }
        // 如果這個span正在被其他線程sweep, 就跳過
        if s.sweepgen == sg-1 {
            // the span is being swept by background sweeper, skip
            continue
        }
        // 找不到有未分配對象的span
        // already swept empty span,
        // all subsequent ones must also be either swept or in process of sweeping
        break
    }
    if trace.enabled {
        traceGCSweepDone()
        traceDone = true
    }
    unlock(&c.lock)

    // 找不到有未分配對象的span, 需要從mheap分配
    // 分配完成后加到empty鏈表中
    // Replenish central list if empty.
    s = c.grow()
    if s == nil {
        return nil
    }
    lock(&c.lock)
    c.empty.insertBack(s)
    unlock(&c.lock)

    // At this point s is a non-empty span, queued at the end of the empty list,
    // c is unlocked.
havespan:
    if trace.enabled && !traceDone {
        traceGCSweepDone()
    }
    // 統計span中未分配的元素數量, 加到mcentral.nmalloc中
    // 統計span中未分配的元素總大小, 加到memstats.heap_live中
    cap := int32((s.npages << _PageShift) / s.elemsize)
    n := cap - int32(s.allocCount)
    if n == 0 || s.freeindex == s.nelems || uintptr(s.allocCount) == s.nelems {
        throw("span has no free objects")
    }
    // Assume all objects from this span will be allocated in the
    // mcache. If it gets uncached, we'll adjust this.
    atomic.Xadd64(&c.nmalloc, int64(n))
    usedBytes := uintptr(s.allocCount) * s.elemsize
    atomic.Xadd64(&memstats.heap_live, int64(spanBytes)-int64(usedBytes))
    // 跟蹤處理
    if trace.enabled {
        // heap_live changed.
        traceHeapAlloc()
    }
    // 如果當前在GC中, 因為heap_live改變了, 重新調整G輔助標記工作的值
    // 詳細請參考下面對revise函數的解析
    if gcBlackenEnabled != 0 {
        // heap_live changed.
        gcController.revise()
    }
    // 設置span的incache屬性, 表示span正在mcache中
    s.incache = true
    // 根據freeindex更新allocCache
    freeByteBase := s.freeindex &^ (64 - 1)
    whichByte := freeByteBase / 8
    // Init alloc bits cache.
    s.refillAllocCache(whichByte)

    // Adjust the allocCache so that s.freeindex corresponds to the low bit in
    // s.allocCache.
    s.allocCache >>= s.freeindex % 64

    return s
}

mcentral向mheap申請一個新的span會使用grow函數:

// grow allocates a new empty span from the heap and initializes it for c's size class.
func (c *mcentral) grow() *mspan {
    // 根據mcentral的類型計算需要申請的span的大小(除以8K = 有多少頁)和可以保存多少個元素
    npages := uintptr(class_to_allocnpages[c.spanclass.sizeclass()])
    size := uintptr(class_to_size[c.spanclass.sizeclass()])
    n := (npages << _PageShift) / size

    // 向mheap申請一個新的span, 以頁(8K)為單位
    s := mheap_.alloc(npages, c.spanclass, false, true)
    if s == nil {
        return nil
    }

    p := s.base()
    s.limit = p + size*n

    // 分配并初始化span的allocBits和gcmarkBits
    heapBitsForSpan(s.base()).initSpan(s)
    return s
}

mheap分配span的函數是alloc:

func (h *mheap) alloc(npage uintptr, spanclass spanClass, large bool, needzero bool) *mspan {
    // 在g0的??臻g中調用alloc_m函數
    // 關于systemstack的說明請看前一篇文章
    // Don't do any operations that lock the heap on the G stack.
    // It might trigger stack growth, and the stack growth code needs
    // to be able to allocate heap.
    var s *mspan
    systemstack(func() {
        s = h.alloc_m(npage, spanclass, large)
    })

    if s != nil {
        if needzero && s.needzero != 0 {
            memclrNoHeapPointers(unsafe.Pointer(s.base()), s.npages<<_PageShift)
        }
        s.needzero = 0
    }
    return s
}

alloc函數會在g0的??臻g中調用alloc_m函數:

// Allocate a new span of npage pages from the heap for GC'd memory
// and record its size class in the HeapMap and HeapMapCache.
func (h *mheap) alloc_m(npage uintptr, spanclass spanClass, large bool) *mspan {
    _g_ := getg()
    if _g_ != _g_.m.g0 {
        throw("_mheap_alloc not on g0 stack")
    }
    // 對mheap上鎖, 這里的鎖是全局鎖
    lock(&h.lock)

    // 為了防止heap增速太快, 在分配n頁之前要先sweep和回收n頁
    // 會先枚舉busy列表然后再枚舉busyLarge列表進行sweep, 具體參考reclaim和reclaimList函數
    // To prevent excessive heap growth, before allocating n pages
    // we need to sweep and reclaim at least n pages.
    if h.sweepdone == 0 {
        // TODO(austin): This tends to sweep a large number of
        // spans in order to find a few completely free spans
        // (for example, in the garbage benchmark, this sweeps
        // ~30x the number of pages its trying to allocate).
        // If GC kept a bit for whether there were any marks
        // in a span, we could release these free spans
        // at the end of GC and eliminate this entirely.
        if trace.enabled {
            traceGCSweepStart()
        }
        h.reclaim(npage)
        if trace.enabled {
            traceGCSweepDone()
        }
    }

    // 把mcache中的本地統計數據加到全局
    // transfer stats from cache to global
    memstats.heap_scan += uint64(_g_.m.mcache.local_scan)
    _g_.m.mcache.local_scan = 0
    memstats.tinyallocs += uint64(_g_.m.mcache.local_tinyallocs)
    _g_.m.mcache.local_tinyallocs = 0

    // 調用allocSpanLocked分配span, allocSpanLocked函數要求當前已經對mheap上鎖
    s := h.allocSpanLocked(npage, &memstats.heap_inuse)
    if s != nil {
        // Record span info, because gc needs to be
        // able to map interior pointer to containing span.
        // 設置span的sweepgen = 全局sweepgen
        atomic.Store(&s.sweepgen, h.sweepgen)
        // 放到全局span列表中, 這里的sweepSpans的長度是2
        // sweepSpans[h.sweepgen/2%2]保存當前正在使用的span列表
        // sweepSpans[1-h.sweepgen/2%2]保存等待sweep的span列表
        // 因為每次gcsweepgen都會加2, 每次gc這兩個列表都會交換
        h.sweepSpans[h.sweepgen/2%2].push(s) // Add to swept in-use list.
        // 初始化span成員
        s.state = _MSpanInUse
        s.allocCount = 0
        s.spanclass = spanclass
        if sizeclass := spanclass.sizeclass(); sizeclass == 0 {
            s.elemsize = s.npages << _PageShift
            s.divShift = 0
            s.divMul = 0
            s.divShift2 = 0
            s.baseMask = 0
        } else {
            s.elemsize = uintptr(class_to_size[sizeclass])
            m := &class_to_divmagic[sizeclass]
            s.divShift = m.shift
            s.divMul = m.mul
            s.divShift2 = m.shift2
            s.baseMask = m.baseMask
        }

        // update stats, sweep lists
        h.pagesInUse += uint64(npage)
        // 上面grow函數會傳入true, 也就是通過grow調用到這里large會等于true
        // 添加已分配的span到busy列表, 如果頁數超過_MaxMHeapList(128頁=8K*128=1M)則放到busylarge列表
        if large {
            memstats.heap_objects++
            mheap_.largealloc += uint64(s.elemsize)
            mheap_.nlargealloc++
            atomic.Xadd64(&memstats.heap_live, int64(npage<<_PageShift))
            // Swept spans are at the end of lists.
            if s.npages < uintptr(len(h.busy)) {
                h.busy[s.npages].insertBack(s)
            } else {
                h.busylarge.insertBack(s)
            }
        }
    }
    // 如果當前在GC中, 因為heap_live改變了, 重新調整G輔助標記工作的值
    // 詳細請參考下面對revise函數的解析
    // heap_scan and heap_live were updated.
    if gcBlackenEnabled != 0 {
        gcController.revise()
    }

    // 跟蹤處理
    if trace.enabled {
        traceHeapAlloc()
    }

    // h.spans is accessed concurrently without synchronization
    // from other threads. Hence, there must be a store/store
    // barrier here to ensure the writes to h.spans above happen
    // before the caller can publish a pointer p to an object
    // allocated from s. As soon as this happens, the garbage
    // collector running on another processor could read p and
    // look up s in h.spans. The unlock acts as the barrier to
    // order these writes. On the read side, the data dependency
    // between p and the index in h.spans orders the reads.
    unlock(&h.lock)
    return s
}

繼續查看allocSpanLocked函數:

// Allocates a span of the given size.  h must be locked.
// The returned span has been removed from the
// free list, but its state is still MSpanFree.
func (h *mheap) allocSpanLocked(npage uintptr, stat *uint64) *mspan {
    var list *mSpanList
    var s *mspan

    // 嘗試在mheap中的自由列表分配
    // 頁數小于_MaxMHeapList(128頁=1M)的自由span都會在free列表中
    // 頁數大于_MaxMHeapList的自由span都會在freelarge列表中
    // Try in fixed-size lists up to max.
    for i := int(npage); i < len(h.free); i++ {
        list = &h.free[i]
        if !list.isEmpty() {
            s = list.first
            list.remove(s)
            goto HaveSpan
        }
    }
    // free列表找不到則查找freelarge列表
    // 查找不到就向arena區域申請一個新的span加到freelarge中, 然后再查找freelarge列表
    // Best fit in list of large spans.
    s = h.allocLarge(npage) // allocLarge removed s from h.freelarge for us
    if s == nil {
        if !h.grow(npage) {
            return nil
        }
        s = h.allocLarge(npage)
        if s == nil {
            return nil
        }
    }

HaveSpan:
    // Mark span in use.
    if s.state != _MSpanFree {
        throw("MHeap_AllocLocked - MSpan not free")
    }
    if s.npages < npage {
        throw("MHeap_AllocLocked - bad npages")
    }
    // 如果span有已釋放(解除虛擬內存和物理內存關系)的頁, 提醒這些頁會被使用然后更新統計數據
    if s.npreleased > 0 {
        sysUsed(unsafe.Pointer(s.base()), s.npages<<_PageShift)
        memstats.heap_released -= uint64(s.npreleased << _PageShift)
        s.npreleased = 0
    }

    // 如果獲取到的span頁數比要求的頁數多
    // 分割剩余的頁數到另一個span并且放到自由列表中
    if s.npages > npage {
        // Trim extra and put it back in the heap.
        t := (*mspan)(h.spanalloc.alloc())
        t.init(s.base()+npage<<_PageShift, s.npages-npage)
        s.npages = npage
        p := (t.base() - h.arena_start) >> _PageShift
        if p > 0 {
            h.spans[p-1] = s
        }
        h.spans[p] = t
        h.spans[p+t.npages-1] = t
        t.needzero = s.needzero
        s.state = _MSpanManual // prevent coalescing with s
        t.state = _MSpanManual
        h.freeSpanLocked(t, false, false, s.unusedsince)
        s.state = _MSpanFree
    }
    s.unusedsince = 0

    // 設置spans區域, 哪些地址對應哪個mspan對象
    p := (s.base() - h.arena_start) >> _PageShift
    for n := uintptr(0); n < npage; n++ {
        h.spans[p+n] = s
    }

    // 更新統計數據
    *stat += uint64(npage << _PageShift)
    memstats.heap_idle -= uint64(npage << _PageShift)

    //println("spanalloc", hex(s.start<<_PageShift))
    if s.inList() {
        throw("still in list")
    }
    return s
}

繼續查看allocLarge函數:

// allocLarge allocates a span of at least npage pages from the treap of large spans.
// Returns nil if no such span currently exists.
func (h *mheap) allocLarge(npage uintptr) *mspan {
    // Search treap for smallest span with >= npage pages.
    return h.freelarge.remove(npage)
}

freelarge的類型是mTreap, 調用remove函數會在樹里面搜索一個至少npage且在樹中的最小的span返回:

// remove searches for, finds, removes from the treap, and returns the smallest
// span that can hold npages. If no span has at least npages return nil.
// This is slightly more complicated than a simple binary tree search
// since if an exact match is not found the next larger node is
// returned.
// If the last node inspected > npagesKey not holding
// a left node (a smaller npages) is the "best fit" node.
func (root *mTreap) remove(npages uintptr) *mspan {
    t := root.treap
    for t != nil {
        if t.spanKey == nil {
            throw("treap node with nil spanKey found")
        }
        if t.npagesKey < npages {
            t = t.right
        } else if t.left != nil && t.left.npagesKey >= npages {
            t = t.left
        } else {
            result := t.spanKey
            root.removeNode(t)
            return result
        }
    }
    return nil
}

向arena區域申請新span的函數是mheap類的grow函數:

// Try to add at least npage pages of memory to the heap,
// returning whether it worked.
//
// h must be locked.
func (h *mheap) grow(npage uintptr) bool {
    // Ask for a big chunk, to reduce the number of mappings
    // the operating system needs to track; also amortizes
    // the overhead of an operating system mapping.
    // Allocate a multiple of 64kB.
    npage = round(npage, (64<<10)/_PageSize)
    ask := npage << _PageShift
    if ask < _HeapAllocChunk {
        ask = _HeapAllocChunk
    }

    // 調用mheap.sysAlloc函數申請
    v := h.sysAlloc(ask)
    if v == nil {
        if ask > npage<<_PageShift {
            ask = npage << _PageShift
            v = h.sysAlloc(ask)
        }
        if v == nil {
            print("runtime: out of memory: cannot allocate ", ask, "-byte block (", memstats.heap_sys, " in use)\n")
            return false
        }
    }

    // 創建一個新的span并加到自由列表中
    // Create a fake "in use" span and free it, so that the
    // right coalescing happens.
    s := (*mspan)(h.spanalloc.alloc())
    s.init(uintptr(v), ask>>_PageShift)
    p := (s.base() - h.arena_start) >> _PageShift
    for i := p; i < p+s.npages; i++ {
        h.spans[i] = s
    }
    atomic.Store(&s.sweepgen, h.sweepgen)
    s.state = _MSpanInUse
    h.pagesInUse += uint64(s.npages)
    h.freeSpanLocked(s, false, true, 0)
    return true
}

繼續查看mheap的sysAlloc函數:

// sysAlloc allocates the next n bytes from the heap arena. The
// returned pointer is always _PageSize aligned and between
// h.arena_start and h.arena_end. sysAlloc returns nil on failure.
// There is no corresponding free function.
func (h *mheap) sysAlloc(n uintptr) unsafe.Pointer {
    // strandLimit is the maximum number of bytes to strand from
    // the current arena block. If we would need to strand more
    // than this, we fall back to sysAlloc'ing just enough for
    // this allocation.
    const strandLimit = 16 << 20

    // 如果arena區域當前已提交的區域不足, 則調用sysReserve預留更多的空間, 然后更新arena_end
    // sysReserve在linux上調用的是mmap函數
    // mmap(v, n, _PROT_NONE, _MAP_ANON|_MAP_PRIVATE, -1, 0)
    if n > h.arena_end-h.arena_alloc {
        // If we haven't grown the arena to _MaxMem yet, try
        // to reserve some more address space.
        p_size := round(n+_PageSize, 256<<20)
        new_end := h.arena_end + p_size // Careful: can overflow
        if h.arena_end <= new_end && new_end-h.arena_start-1 <= _MaxMem {
            // TODO: It would be bad if part of the arena
            // is reserved and part is not.
            var reserved bool
            p := uintptr(sysReserve(unsafe.Pointer(h.arena_end), p_size, &reserved))
            if p == 0 {
                // TODO: Try smaller reservation
                // growths in case we're in a crowded
                // 32-bit address space.
                goto reservationFailed
            }
            // p can be just about anywhere in the address
            // space, including before arena_end.
            if p == h.arena_end {
                // The new block is contiguous with
                // the current block. Extend the
                // current arena block.
                h.arena_end = new_end
                h.arena_reserved = reserved
            } else if h.arena_start <= p && p+p_size-h.arena_start-1 <= _MaxMem && h.arena_end-h.arena_alloc < strandLimit {
                // We were able to reserve more memory
                // within the arena space, but it's
                // not contiguous with our previous
                // reservation. It could be before or
                // after our current arena_used.
                //
                // Keep everything page-aligned.
                // Our pages are bigger than hardware pages.
                h.arena_end = p + p_size
                p = round(p, _PageSize)
                h.arena_alloc = p
                h.arena_reserved = reserved
            } else {
                // We got a mapping, but either
                //
                // 1) It's not in the arena, so we
                // can't use it. (This should never
                // happen on 32-bit.)
                //
                // 2) We would need to discard too
                // much of our current arena block to
                // use it.
                //
                // We haven't added this allocation to
                // the stats, so subtract it from a
                // fake stat (but avoid underflow).
                //
                // We'll fall back to a small sysAlloc.
                stat := uint64(p_size)
                sysFree(unsafe.Pointer(p), p_size, &stat)
            }
        }
    }

    // 預留的空間足夠時只需要增加arena_alloc
    if n <= h.arena_end-h.arena_alloc {
        // Keep taking from our reservation.
        p := h.arena_alloc
        sysMap(unsafe.Pointer(p), n, h.arena_reserved, &memstats.heap_sys)
        h.arena_alloc += n
        if h.arena_alloc > h.arena_used {
            h.setArenaUsed(h.arena_alloc, true)
        }

        if p&(_PageSize-1) != 0 {
            throw("misrounded allocation in MHeap_SysAlloc")
        }
        return unsafe.Pointer(p)
    }

    // 預留空間失敗后的處理
reservationFailed:
    // If using 64-bit, our reservation is all we have.
    if sys.PtrSize != 4 {
        return nil
    }

    // On 32-bit, once the reservation is gone we can
    // try to get memory at a location chosen by the OS.
    p_size := round(n, _PageSize) + _PageSize
    p := uintptr(sysAlloc(p_size, &memstats.heap_sys))
    if p == 0 {
        return nil
    }

    if p < h.arena_start || p+p_size-h.arena_start > _MaxMem {
        // This shouldn't be possible because _MaxMem is the
        // whole address space on 32-bit.
        top := uint64(h.arena_start) + _MaxMem
        print("runtime: memory allocated by OS (", hex(p), ") not in usable range [", hex(h.arena_start), ",", hex(top), ")\n")
        sysFree(unsafe.Pointer(p), p_size, &memstats.heap_sys)
        return nil
    }

    p += -p & (_PageSize - 1)
    if p+n > h.arena_used {
        h.setArenaUsed(p+n, true)
    }

    if p&(_PageSize-1) != 0 {
        throw("misrounded allocation in MHeap_SysAlloc")
    }
    return unsafe.Pointer(p)
}

以上就是分配對象的完整流程了, 接下來分析GC標記和回收對象的處理.

回收對象的處理

回收對象的流程

GO的GC是并行GC, 也就是GC的大部分處理和普通的go代碼是同時運行的, 這讓GO的GC流程比較復雜.

首先GC有四個階段, 它們分別是:

Sweep Termination: 對未清掃的span進行清掃, 只有上一輪的GC的清掃工作完成才可以開始新一輪的GC

Mark: 掃描所有根對象, 和根對象可以到達的所有對象, 標記它們不被回收

Mark Termination: 完成標記工作, 重新掃描部分根對象(要求STW)

Sweep: 按標記結果清掃span

下圖是比較完整的GC流程, 并按顏色對這四個階段進行了分類:

golang中三色GC的實現原理

在GC過程中會有兩種后臺任務(G), 一種是標記用的后臺任務, 一種是清掃用的后臺任務.

標記用的后臺任務會在需要時啟動, 可以同時工作的后臺任務數量大約是P的數量的25%, 也就是go所講的讓25%的cpu用在GC上的根據.

清掃用的后臺任務在程序啟動時會啟動一個, 進入清掃階段時喚醒.

目前整個GC流程會進行兩次STW(Stop The World), 第一次是Mark階段的開始, 第二次是Mark Termination階段.

第一次STW會準備根對象的掃描, 啟動寫屏障(Write Barrier)和輔助GC(mutator assist).

第二次STW會重新掃描部分根對象, 禁用寫屏障(Write Barrier)和輔助GC(mutator assist).

需要注意的是, 不是所有根對象的掃描都需要STW, 例如掃描棧上的對象只需要停止擁有該棧的G.

從go 1.9開始, 寫屏障的實現使用了Hybrid Write Barrier, 大幅減少了第二次STW的時間.

GC的觸發條件

GC在滿足一定條件后會被觸發, 觸發條件有以下幾種:

  • gcTriggerAlways: 強制觸發GC

  • gcTriggerHeap: 當前分配的內存達到一定值就觸發GC

  • gcTriggerTime: 當一定時間沒有執行過GC就觸發GC

  • gcTriggerCycle: 要求啟動新一輪的GC, 已啟動則跳過, 手動觸發GC的runtime.GC()會使用這個條件

觸發條件的判斷在gctrigger的test函數.

其中gcTriggerHeap和gcTriggerTime這兩個條件是自然觸發的, gcTriggerHeap的判斷代碼如下:

return memstats.heap_live >= memstats.gc_trigger

heap_live的增加在上面對分配器的代碼分析中可以看到, 當值達到gc_trigger就會觸發GC, 那么gc_trigger是如何決定的?
gc_trigger的計算在gcSetTriggerRatio函數中, 公式是:

trigger = uint64(float64(memstats.heap_marked) * (1 + triggerRatio))

當前標記存活的大小乘以1+系數triggerRatio, 就是下次出發GC需要的分配量.
triggerRatio在每次GC后都會調整, 計算triggerRatio的函數是encCycle, 公式是:

const triggerGain = 0.5
// 目標Heap增長率, 默認是1.0
goalGrowthRatio := float64(gcpercent) / 100
// 實際Heap增長率, 等于總大小/存活大小-1
actualGrowthRatio := float64(memstats.heap_live)/float64(memstats.heap_marked) - 1
// GC標記階段的使用時間(因為endCycle是在Mark Termination階段調用的)
assistDuration := nanotime() - c.markStartTime
// GC標記階段的CPU占用率, 目標值是0.25
utilization := gcGoalUtilization
if assistDuration > 0 {
    // assistTime是G輔助GC標記對象所使用的時間合計
    // (nanosecnds spent in mutator assists during this cycle)
    // 額外的CPU占用率 = 輔助GC標記對象的總時間 / (GC標記使用時間 * P的數量)
    utilization += float64(c.assistTime) / float64(assistDuration*int64(gomaxprocs))
}
// 觸發系數偏移值 = 目標增長率 - 原觸發系數 - CPU占用率 / 目標CPU占用率 * (實際增長率 - 原觸發系數)
// 參數的分析:
// 實際增長率越大, 觸發系數偏移值越小, 小于0時下次觸發GC會提早
// CPU占用率越大, 觸發系數偏移值越小, 小于0時下次觸發GC會提早
// 原觸發系數越大, 觸發系數偏移值越小, 小于0時下次觸發GC會提早
triggerError := goalGrowthRatio - memstats.triggerRatio - utilization/gcGoalUtilization*(actualGrowthRatio-memstats.triggerRatio)
// 根據偏移值調整觸發系數, 每次只調整偏移值的一半(漸進式調整)
triggerRatio := memstats.triggerRatio + triggerGain*triggerError

公式中的"目標Heap增長率"可以通過設置環境變量"GOGC"調整, 默認值是100, 增加它的值可以減少GC的觸發.
設置"GOGC=off"可以徹底關掉GC.

gcTriggerTime的判斷代碼如下:

lastgc := int64(atomic.Load64(&memstats.last_gc_nanotime))
return lastgc != 0 && t.now-lastgc > forcegcperiod

forcegcperiod的定義是2分鐘, 也就是2分鐘內沒有執行過GC就會強制觸發.

三色的定義(黑, 灰, 白)

我看過的對三色GC的"三色"這個概念解釋的最好的文章就是這一篇了, 強烈建議先看這一篇中的講解.

"三色"的概念可以簡單的理解為:

  • 黑色: 對象在這次GC中已標記, 且這個對象包含的子對象也已標記

  • 灰色: 對象在這次GC中已標記, 但這個對象包含的子對象未標記

  • 白色: 對象在這次GC中未標記

在go內部對象并沒有保存顏色的屬性, 三色只是對它們的狀態的描述,

白色的對象在它所在的span的gcmarkBits中對應的bit為0,

灰色的對象在它所在的span的gcmarkBits中對應的bit為1, 并且對象在標記隊列中,

黑色的對象在它所在的span的gcmarkBits中對應的bit為1, 并且對象已經從標記隊列中取出并處理.

gc完成后, gcmarkBits會移動到allocBits然后重新分配一個全部為0的bitmap, 這樣黑色的對象就變為了白色.

寫屏障(Write Barrier)

因為go支持并行GC, GC的掃描和go代碼可以同時運行, 這樣帶來的問題是GC掃描的過程中go代碼有可能改變了對象的依賴樹,

例如開始掃描時發現根對象A和B, B擁有C的指針, GC先掃描A, 然后B把C的指針交給A, GC再掃描B, 這時C就不會被掃描到.

為了避免這個問題, go在GC的標記階段會啟用寫屏障(Write Barrier).

啟用了寫屏障(Write Barrier)后, 當B把C的指針交給A時, GC會認為在這一輪的掃描中C的指針是存活的,

即使A可能會在稍后丟掉C, 那么C就在下一輪回收.

寫屏障只針對指針啟用, 而且只在GC的標記階段啟用, 平時會直接把值寫入到目標地址.

go在1.9開始啟用了混合寫屏障(Hybrid Write Barrier), 偽代碼如下:

writePointer(slot, ptr):
    shade(*slot)
    if any stack is grey:
        shade(ptr)
    *slot = ptr

混合寫屏障會同時標記指針寫入目標的"原指針"和“新指針".

標記原指針的原因是, 其他運行中的線程有可能會同時把這個指針的值復制到寄存器或者棧上的本地變量,
因為復制指針到寄存器或者棧上的本地變量不會經過寫屏障, 所以有可能會導致指針不被標記, 試想下面的情況:

[go] b = obj
[go] oldx = nil
[gc] scan oldx...
[go] oldx = b.x // 復制b.x到本地變量, 不進過寫屏障
[go] b.x = ptr // 寫屏障應該標記b.x的原值
[gc] scan b...
如果寫屏障不標記原值, 那么oldx就不會被掃描到.

標記新指針的原因是, 其他運行中的線程有可能會轉移指針的位置, 試想下面的情況:

[go] a = ptr
[go] b = obj
[gc] scan b...
[go] b.x = a // 寫屏障應該標記b.x的新值
[go] a = nil
[gc] scan a...
如果寫屏障不標記新值, 那么ptr就不會被掃描到.

混合寫屏障可以讓GC在并行標記結束后不需要重新掃描各個G的堆棧, 可以減少Mark Termination中的STW時間.
除了寫屏障外, 在GC的過程中所有新分配的對象都會立刻變為黑色, 在上面的mallocgc函數中可以看到.

輔助GC(mutator assist)

為了防止heap增速太快, 在GC執行的過程中如果同時運行的G分配了內存, 那么這個G會被要求輔助GC做一部分的工作.
在GC的過程中同時運行的G稱為"mutator", "mutator assist"機制就是G輔助GC做一部分工作的機制.

輔助GC做的工作有兩種類型, 一種是標記(Mark), 另一種是清掃(Sweep).
輔助標記的觸發可以查看上面的mallocgc函數, 觸發時G會幫助掃描"工作量"個對象, 工作量的計算公式是:

debtBytes * assistWorkPerByte

意思是分配的大小乘以系數assistWorkPerByte, assistWorkPerByte的計算在函數revise中, 公式是:

// 等待掃描的對象數量 = 未掃描的對象數量 - 已掃描的對象數量
scanWorkExpected := int64(memstats.heap_scan) - c.scanWork
if scanWorkExpected < 1000 {
    scanWorkExpected = 1000
}
// 距離觸發GC的Heap大小 = 期待觸發GC的Heap大小 - 當前的Heap大小
// 注意next_gc的計算跟gc_trigger不一樣, next_gc等于heap_marked * (1 + gcpercent / 100)
heapDistance := int64(memstats.next_gc) - int64(atomic.Load64(&memstats.heap_live))
if heapDistance <= 0 {
    heapDistance = 1
}
// 每分配1 byte需要輔助掃描的對象數量 = 等待掃描的對象數量 / 距離觸發GC的Heap大小
c.assistWorkPerByte = float64(scanWorkExpected) / float64(heapDistance)
c.assistBytesPerWork = float64(heapDistance) / float64(scanWorkExpected)

和輔助標記不一樣的是, 輔助清掃申請新span時才會檢查, 而輔助標記是每次分配對象時都會檢查.
輔助清掃的觸發可以看上面的cacheSpan函數, 觸發時G會幫助回收"工作量"頁的對象, 工作量的計算公式是:

spanBytes * sweepPagesPerByte // 不完全相同, 具體看deductSweepCredit函數

意思是分配的大小乘以系數sweepPagesPerByte, sweepPagesPerByte的計算在函數gcSetTriggerRatio中, 公式是:

// 當前的Heap大小
heapLiveBasis := atomic.Load64(&memstats.heap_live)
// 距離觸發GC的Heap大小 = 下次觸發GC的Heap大小 - 當前的Heap大小
heapDistance := int64(trigger) - int64(heapLiveBasis)
heapDistance -= 1024 * 1024
if heapDistance < _PageSize {
    heapDistance = _PageSize
}
// 已清掃的頁數
pagesSwept := atomic.Load64(&mheap_.pagesSwept)
// 未清掃的頁數 = 使用中的頁數 - 已清掃的頁數
sweepDistancePages := int64(mheap_.pagesInUse) - int64(pagesSwept)
if sweepDistancePages <= 0 {
    mheap_.sweepPagesPerByte = 0
} else {
    // 每分配1 byte(的span)需要輔助清掃的頁數 = 未清掃的頁數 / 距離觸發GC的Heap大小
    mheap_.sweepPagesPerByte = float64(sweepDistancePages) / float64(heapDistance)
}

根對象

在GC的標記階段首先需要標記的就是"根對象", 從根對象開始可到達的所有對象都會被認為是存活的.
根對象包含了全局變量, 各個G的棧上的變量等, GC會先掃描根對象然后再掃描根對象可到達的所有對象.
掃描根對象包含了一系列的工作, 它們定義在函數:

Fixed Roots: 特殊的掃描工作

fixedRootFinalizers: 掃描析構器隊列

fixedRootFreeGStacks: 釋放已中止的G的棧

Flush Cache Roots: 釋放mcache中的所有span, 要求STW

Data Roots: 掃描可讀寫的全局變量

BSS Roots: 掃描只讀的全局變量

Span Roots: 掃描各個span中特殊對象(析構器列表)

Stack Roots: 掃描各個G的棧

標記階段(Mark)會做其中的"Fixed Roots", "Data Roots", "BSS Roots", "Span Roots", "Stack Roots".

完成標記階段(Mark Termination)會做其中的"Fixed Roots", "Flush Cache Roots".

標記隊列

GC的標記階段會使用"標記隊列"來確定所有可從根對象到達的對象都已標記, 上面提到的"灰色"的對象就是在標記隊列中的對象.

舉例來說, 如果當前有[A, B, C]這三個根對象, 那么掃描根對象時就會把它們放到標記隊列:

work queue: [A, B, C]

后臺標記任務從標記隊列中取出A, 如果A引用了D, 則把D放入標記隊列:

work queue: [B, C, D]

后臺標記任務從標記隊列取出B, 如果B也引用了D, 這時因為D在gcmarkBits中對應的bit已經是1所以會跳過:

work queue: [C, D]

如果并行運行的go代碼分配了一個對象E, 對象E會被立刻標記, 但不會進入標記隊列(因為確定E沒有引用其他對象).

然后并行運行的go代碼把對象F設置給對象E的成員, 寫屏障會標記對象F然后把對象F加到運行隊列:

work queue: [C, D, F]

后臺標記任務從標記隊列取出C, 如果C沒有引用其他對象, 則不需要處理:

work queue: [D, F]

后臺標記任務從標記隊列取出D, 如果D引用了X, 則把X放入標記隊列:

work queue: [F, X]

后臺標記任務從標記隊列取出F, 如果F沒有引用其他對象, 則不需要處理.

后臺標記任務從標記隊列取出X, 如果X沒有引用其他對象, 則不需要處理.

最后標記隊列為空, 標記完成, 存活的對象有[A, B, C, D, E, F, X].

實際的狀況會比上面介紹的狀況稍微復雜一點.

標記隊列會分為全局標記隊列和各個P的本地標記隊列, 這點和協程中的運行隊列相似.

并且標記隊列為空以后, 還需要停止整個世界并禁止寫屏障, 然后再次檢查是否為空.

源代碼分析

go觸發gc會從gcStart函數開始:

// gcStart transitions the GC from _GCoff to _GCmark (if
// !mode.stwMark) or _GCmarktermination (if mode.stwMark) by
// performing sweep termination and GC initialization.
//
// This may return without performing this transition in some cases,
// such as when called on a system stack or with locks held.
func gcStart(mode gcMode, trigger gcTrigger) {
    // 判斷當前G是否可搶占, 不可搶占時不觸發GC
    // Since this is called from malloc and malloc is called in
    // the guts of a number of libraries that might be holding
    // locks, don't attempt to start GC in non-preemptible or
    // potentially unstable situations.
    mp := acquirem()
    if gp := getg(); gp == mp.g0 || mp.locks > 1 || mp.preemptoff != "" {
        releasem(mp)
        return
    }
    releasem(mp)
    mp = nil

    // 并行清掃上一輪GC未清掃的span
    // Pick up the remaining unswept/not being swept spans concurrently
    //
    // This shouldn't happen if we're being invoked in background
    // mode since proportional sweep should have just finished
    // sweeping everything, but rounding errors, etc, may leave a
    // few spans unswept. In forced mode, this is necessary since
    // GC can be forced at any point in the sweeping cycle.
    //
    // We check the transition condition continuously here in case
    // this G gets delayed in to the next GC cycle.
    for trigger.test() && gosweepone() != ^uintptr(0) {
        sweep.nbgsweep++
    }

    // 上鎖, 然后重新檢查gcTrigger的條件是否成立, 不成立時不觸發GC
    // Perform GC initialization and the sweep termination
    // transition.
    semacquire(&work.startSema)
    // Re-check transition condition under transition lock.
    if !trigger.test() {
        semrelease(&work.startSema)
        return
    }

    // 記錄是否強制觸發, gcTriggerCycle是runtime.GC用的
    // For stats, check if this GC was forced by the user.
    work.userForced = trigger.kind == gcTriggerAlways || trigger.kind == gcTriggerCycle

    // 判斷是否指定了禁止并行GC的參數
    // In gcstoptheworld debug mode, upgrade the mode accordingly.
    // We do this after re-checking the transition condition so
    // that multiple goroutines that detect the heap trigger don't
    // start multiple STW GCs.
    if mode == gcBackgroundMode {
        if debug.gcstoptheworld == 1 {
            mode = gcForceMode
        } else if debug.gcstoptheworld == 2 {
            mode = gcForceBlockMode
        }
    }

    // Ok, we're doing it!  Stop everybody else
    semacquire(&worldsema)

    // 跟蹤處理
    if trace.enabled {
        traceGCStart()
    }

    // 啟動后臺掃描任務(G)
    if mode == gcBackgroundMode {
        gcBgMarkStartWorkers()
    }

    // 重置標記相關的狀態
    gcResetMarkState()

    // 重置參數
    work.stwprocs, work.maxprocs = gcprocs(), gomaxprocs
    work.heap0 = atomic.Load64(&memstats.heap_live)
    work.pauseNS = 0
    work.mode = mode

    // 記錄開始時間
    now := nanotime()
    work.tSweepTerm = now
    work.pauseStart = now
    
    // 停止所有運行中的G, 并禁止它們運行
    systemstack(stopTheWorldWithSema)
    
    // !!!!!!!!!!!!!!!!
    // 世界已停止(STW)...
    // !!!!!!!!!!!!!!!!
    
    // 清掃上一輪GC未清掃的span, 確保上一輪GC已完成
    // Finish sweep before we start concurrent scan.
    systemstack(func() {
        finishsweep_m()
    })
    // 清掃sched.sudogcache和sched.deferpool
    // clearpools before we start the GC. If we wait they memory will not be
    // reclaimed until the next GC cycle.
    clearpools()

    // 增加GC計數
    work.cycles++
    
    // 判斷是否并行GC模式
    if mode == gcBackgroundMode { // Do as much work concurrently as possible
        // 標記新一輪GC已開始
        gcController.startCycle()
        work.heapGoal = memstats.next_gc

        // 設置全局變量中的GC狀態為_GCmark
        // 然后啟用寫屏障
        // Enter concurrent mark phase and enable
        // write barriers.
        //
        // Because the world is stopped, all Ps will
        // observe that write barriers are enabled by
        // the time we start the world and begin
        // scanning.
        //
        // Write barriers must be enabled before assists are
        // enabled because they must be enabled before
        // any non-leaf heap objects are marked. Since
        // allocations are blocked until assists can
        // happen, we want enable assists as early as
        // possible.
        setGCPhase(_GCmark)

        // 重置后臺標記任務的計數
        gcBgMarkPrepare() // Must happen before assist enable.

        // 計算掃描根對象的任務數量
        gcMarkRootPrepare()

        // 標記所有tiny alloc等待合并的對象
        // Mark all active tinyalloc blocks. Since we're
        // allocating from these, they need to be black like
        // other allocations. The alternative is to blacken
        // the tiny block on every allocation from it, which
        // would slow down the tiny allocator.
        gcMarkTinyAllocs()

        // 啟用輔助GC
        // At this point all Ps have enabled the write
        // barrier, thus maintaining the no white to
        // black invariant. Enable mutator assists to
        // put back-pressure on fast allocating
        // mutators.
        atomic.Store(&gcBlackenEnabled, 1)

        // 記錄標記開始的時間
        // Assists and workers can start the moment we start
        // the world.
        gcController.markStartTime = now

        // 重新啟動世界
        // 前面創建的后臺標記任務會開始工作, 所有后臺標記任務都完成工作后, 進入完成標記階段
        // Concurrent mark.
        systemstack(startTheWorldWithSema)
        
        // !!!!!!!!!!!!!!!
        // 世界已重新啟動...
        // !!!!!!!!!!!!!!!
        
        // 記錄停止了多久, 和標記階段開始的時間
        now = nanotime()
        work.pauseNS += now - work.pauseStart
        work.tMark = now
    } else {
        // 不是并行GC模式
        // 記錄完成標記階段開始的時間
        t := nanotime()
        work.tMark, work.tMarkTerm = t, t
        work.heapGoal = work.heap0

        // 跳過標記階段, 執行完成標記階段
        // 所有標記工作都會在世界已停止的狀態執行
        // (標記階段會設置work.markrootDone=true, 如果跳過則它的值是false, 完成標記階段會執行所有工作)
        // 完成標記階段會重新啟動世界
        // Perform mark termination. This will restart the world.
        gcMarkTermination(memstats.triggerRatio)
    }

    semrelease(&work.startSema)
}

接下來一個個分析gcStart調用的函數, 建議配合上面的"回收對象的流程"中的圖理解.

函數gcBgMarkStartWorkers用于啟動后臺標記任務, 先分別對每個P啟動一個:

// gcBgMarkStartWorkers prepares background mark worker goroutines.
// These goroutines will not run until the mark phase, but they must
// be started while the work is not stopped and from a regular G
// stack. The caller must hold worldsema.
func gcBgMarkStartWorkers() {
    // Background marking is performed by per-P G's. Ensure that
    // each P has a background GC G.
    for _, p := range &allp {
        if p == nil || p.status == _Pdead {
            break
        }
        // 如果已啟動則不重復啟動
        if p.gcBgMarkWorker == 0 {
            go gcBgMarkWorker(p)
            // 啟動后等待該任務通知信號量bgMarkReady再繼續
            notetsleepg(&work.bgMarkReady, -1)
            noteclear(&work.bgMarkReady)
        }
    }
}

這里雖然為每個P啟動了一個后臺標記任務, 但是可以同時工作的只有25%, 這個邏輯在協程M獲取G時調用的findRunnableGCWorker中:

// findRunnableGCWorker returns the background mark worker for _p_ if it
// should be run. This must only be called when gcBlackenEnabled != 0.
func (c *gcControllerState) findRunnableGCWorker(_p_ *p) *g {
    if gcBlackenEnabled == 0 {
        throw("gcControllerState.findRunnable: blackening not enabled")
    }
    if _p_.gcBgMarkWorker == 0 {
        // The mark worker associated with this P is blocked
        // performing a mark transition. We can't run it
        // because it may be on some other run or wait queue.
        return nil
    }

    if !gcMarkWorkAvailable(_p_) {
        // No work to be done right now. This can happen at
        // the end of the mark phase when there are still
        // assists tapering off. Don't bother running a worker
        // now because it'll just return immediately.
        return nil
    }

    // 原子減少對應的值, 如果減少后大于等于0則返回true, 否則返回false
    decIfPositive := func(ptr *int64) bool {
        if *ptr > 0 {
            if atomic.Xaddint64(ptr, -1) >= 0 {
                return true
            }
            // We lost a race
            atomic.Xaddint64(ptr, +1)
        }
        return false
    }

    // 減少dedicatedMarkWorkersNeeded, 成功時后臺標記任務的模式是Dedicated
    // dedicatedMarkWorkersNeeded是當前P的數量的25%去除小數點
    // 詳見startCycle函數
    if decIfPositive(&c.dedicatedMarkWorkersNeeded) {
        // This P is now dedicated to marking until the end of
        // the concurrent mark phase.
        _p_.gcMarkWorkerMode = gcMarkWorkerDedicatedMode
    } else {
        // 減少fractionalMarkWorkersNeeded, 成功是后臺標記任務的模式是Fractional
        // 上面的計算如果小數點后有數值(不能夠整除)則fractionalMarkWorkersNeeded為1, 否則為0
        // 詳見startCycle函數
        // 舉例來說, 4個P時會執行1個Dedicated模式的任務, 5個P時會執行1個Dedicated模式和1個Fractional模式的任務
        if !decIfPositive(&c.fractionalMarkWorkersNeeded) {
            // No more workers are need right now.
            return nil
        }

        // 按Dedicated模式的任務的執行時間判斷cpu占用率是否超過預算值, 超過時不啟動
        // This P has picked the token for the fractional worker.
        // Is the GC currently under or at the utilization goal?
        // If so, do more work.
        //
        // We used to check whether doing one time slice of work
        // would remain under the utilization goal, but that has the
        // effect of delaying work until the mutator has run for
        // enough time slices to pay for the work. During those time
        // slices, write barriers are enabled, so the mutator is running slower.
        // Now instead we do the work whenever we're under or at the
        // utilization work and pay for it by letting the mutator run later.
        // This doesn't change the overall utilization averages, but it
        // front loads the GC work so that the GC finishes earlier and
        // write barriers can be turned off sooner, effectively giving
        // the mutator a faster machine.
        //
        // The old, slower behavior can be restored by setting
        //  gcForcePreemptNS = forcePreemptNS.
        const gcForcePreemptNS = 0

        // TODO(austin): We could fast path this and basically
        // eliminate contention on c.fractionalMarkWorkersNeeded by
        // precomputing the minimum time at which it's worth
        // next scheduling the fractional worker. Then Ps
        // don't have to fight in the window where we've
        // passed that deadline and no one has started the
        // worker yet.
        //
        // TODO(austin): Shorter preemption interval for mark
        // worker to improve fairness and give this
        // finer-grained control over schedule?
        now := nanotime() - gcController.markStartTime
        then := now + gcForcePreemptNS
        timeUsed := c.fractionalMarkTime + gcForcePreemptNS
        if then > 0 && float64(timeUsed)/float64(then) > c.fractionalUtilizationGoal {
            // Nope, we'd overshoot the utilization goal
            atomic.Xaddint64(&c.fractionalMarkWorkersNeeded, +1)
            return nil
        }
        _p_.gcMarkWorkerMode = gcMarkWorkerFractionalMode
    }

    // 安排后臺標記任務執行
    // Run the background mark worker
    gp := _p_.gcBgMarkWorker.ptr()
    casgstatus(gp, _Gwaiting, _Grunnable)
    if trace.enabled {
        traceGoUnpark(gp, 0)
    }
    return gp
}

gcResetMarkState函數會重置標記相關的狀態:

// gcResetMarkState resets global state prior to marking (concurrent
// or STW) and resets the stack scan state of all Gs.
//
// This is safe to do without the world stopped because any Gs created
// during or after this will start out in the reset state.
func gcResetMarkState() {
    // This may be called during a concurrent phase, so make sure
    // allgs doesn't change.
    lock(&allglock)
    for _, gp := range allgs {
        gp.gcscandone = false  // set to true in gcphasework
        gp.gcscanvalid = false // stack has not been scanned
        gp.gcAssistBytes = 0
    }
    unlock(&allglock)

    work.bytesMarked = 0
    work.initialHeapLive = atomic.Load64(&memstats.heap_live)
    work.markrootDone = false
}

stopTheWorldWithSema函數會停止整個世界, 這個函數必須在g0中運行:

// stopTheWorldWithSema is the core implementation of stopTheWorld.
// The caller is responsible for acquiring worldsema and disabling
// preemption first and then should stopTheWorldWithSema on the system
// stack:
//
//  semacquire(&worldsema, 0)
//  m.preemptoff = "reason"
//  systemstack(stopTheWorldWithSema)
//
// When finished, the caller must either call startTheWorld or undo
// these three operations separately:
//
//  m.preemptoff = ""
//  systemstack(startTheWorldWithSema)
//  semrelease(&worldsema)
//
// It is allowed to acquire worldsema once and then execute multiple
// startTheWorldWithSema/stopTheWorldWithSema pairs.
// Other P's are able to execute between successive calls to
// startTheWorldWithSema and stopTheWorldWithSema.
// Holding worldsema causes any other goroutines invoking
// stopTheWorld to block.
func stopTheWorldWithSema() {
    _g_ := getg()

    // If we hold a lock, then we won't be able to stop another M
    // that is blocked trying to acquire the lock.
    if _g_.m.locks > 0 {
        throw("stopTheWorld: holding locks")
    }

    lock(&sched.lock)
    
    // 需要停止的P數量
    sched.stopwait = gomaxprocs
    
    // 設置gc等待標記, 調度時看見此標記會進入等待
    atomic.Store(&sched.gcwaiting, 1)
    
    // 搶占所有運行中的G
    preemptall()
    
    // 停止當前的P
    // stop current P
    _g_.m.p.ptr().status = _Pgcstop // Pgcstop is only diagnostic.
    
    // 減少需要停止的P數量(當前的P算一個)
    sched.stopwait--
    
    // 搶占所有在Psyscall狀態的P, 防止它們重新參與調度
    // try to retake all P's in Psyscall status
    for i := 0; i < int(gomaxprocs); i++ {
        p := allp[i]
        s := p.status
        if s == _Psyscall && atomic.Cas(&p.status, s, _Pgcstop) {
            if trace.enabled {
                traceGoSysBlock(p)
                traceProcStop(p)
            }
            p.syscalltick++
            sched.stopwait--
        }
    }
    
    // 防止所有空閑的P重新參與調度
    // stop idle P's
    for {
        p := pidleget()
        if p == nil {
            break
        }
        p.status = _Pgcstop
        sched.stopwait--
    }
    wait := sched.stopwait > 0
    unlock(&sched.lock)

    // 如果仍有需要停止的P, 則等待它們停止
    // wait for remaining P's to stop voluntarily
    if wait {
        for {
            // 循環等待 + 搶占所有運行中的G
            // wait for 100us, then try to re-preempt in case of any races
            if notetsleep(&sched.stopnote, 100*1000) {
                noteclear(&sched.stopnote)
                break
            }
            preemptall()
        }
    }

    // 邏輯正確性檢查
    // sanity checks
    bad := ""
    if sched.stopwait != 0 {
        bad = "stopTheWorld: not stopped (stopwait != 0)"
    } else {
        for i := 0; i < int(gomaxprocs); i++ {
            p := allp[i]
            if p.status != _Pgcstop {
                bad = "stopTheWorld: not stopped (status != _Pgcstop)"
            }
        }
    }
    if atomic.Load(&freezing) != 0 {
        // Some other thread is panicking. This can cause the
        // sanity checks above to fail if the panic happens in
        // the signal handler on a stopped thread. Either way,
        // we should halt this thread.
        lock(&deadlock)
        lock(&deadlock)
    }
    if bad != "" {
        throw(bad)
    }
    
    // 到這里所有運行中的G都會變為待運行, 并且所有的P都不能被M獲取
    // 也就是說所有的go代碼(除了當前的)都會停止運行, 并且不能運行新的go代碼
}

finishsweep_m函數會清掃上一輪GC未清掃的span, 確保上一輪GC已完成:

// finishsweep_m ensures that all spans are swept.
//
// The world must be stopped. This ensures there are no sweeps in
// progress.
//
//go:nowritebarrier
func finishsweep_m() {
    // sweepone會取出一個未sweep的span然后執行sweep
    // 詳細將在下面sweep階段時分析
    // Sweeping must be complete before marking commences, so
    // sweep any unswept spans. If this is a concurrent GC, there
    // shouldn't be any spans left to sweep, so this should finish
    // instantly. If GC was forced before the concurrent sweep
    // finished, there may be spans to sweep.
    for sweepone() != ^uintptr(0) {
        sweep.npausesweep++
    }

    // 所有span都sweep完成后, 啟動一個新的markbit時代
    // 這個函數是實現span的gcmarkBits和allocBits的分配和復用的關鍵, 流程如下
    // - span分配gcmarkBits和allocBits
    // - span完成sweep
    //   - 原allocBits不再被使用
    //   - gcmarkBits變為allocBits
    //   - 分配新的gcmarkBits
    // - 開啟新的markbit時代
    // - span完成sweep, 同上
    // - 開啟新的markbit時代
    //   - 2個時代之前的bitmap將不再被使用, 可以復用這些bitmap
    nextMarkBitArenaEpoch()
}

clearpools函數會清理sched.sudogcache和sched.deferpool, 讓它們的內存可以被回收:

func clearpools() {
    // clear sync.Pools
    if poolcleanup != nil {
        poolcleanup()
    }

    // Clear central sudog cache.
    // Leave per-P caches alone, they have strictly bounded size.
    // Disconnect cached list before dropping it on the floor,
    // so that a dangling ref to one entry does not pin all of them.
    lock(&sched.sudoglock)
    var sg, sgnext *sudog
    for sg = sched.sudogcache; sg != nil; sg = sgnext {
        sgnext = sg.next
        sg.next = nil
    }
    sched.sudogcache = nil
    unlock(&sched.sudoglock)

    // Clear central defer pools.
    // Leave per-P pools alone, they have strictly bounded size.
    lock(&sched.deferlock)
    for i := range sched.deferpool {
        // disconnect cached list before dropping it on the floor,
        // so that a dangling ref to one entry does not pin all of them.
        var d, dlink *_defer
        for d = sched.deferpool[i]; d != nil; d = dlink {
            dlink = d.link
            d.link = nil
        }
        sched.deferpool[i] = nil
    }
    unlock(&sched.deferlock)
}

startCycle標記開始了新一輪的GC:

// startCycle resets the GC controller's state and computes estimates
// for a new GC cycle. The caller must hold worldsema.
func (c *gcControllerState) startCycle() {
    c.scanWork = 0
    c.bgScanCredit = 0
    c.assistTime = 0
    c.dedicatedMarkTime = 0
    c.fractionalMarkTime = 0
    c.idleMarkTime = 0

    // 偽裝heap_marked的值如果gc_trigger的值很小, 防止后面對triggerRatio做出錯誤的調整
    // If this is the first GC cycle or we're operating on a very
    // small heap, fake heap_marked so it looks like gc_trigger is
    // the appropriate growth from heap_marked, even though the
    // real heap_marked may not have a meaningful value (on the
    // first cycle) or may be much smaller (resulting in a large
    // error response).
    if memstats.gc_trigger <= heapminimum {
        memstats.heap_marked = uint64(float64(memstats.gc_trigger) / (1 + memstats.triggerRatio))
    }

    // 重新計算next_gc, 注意next_gc的計算跟gc_trigger不一樣
    // Re-compute the heap goal for this cycle in case something
    // changed. This is the same calculation we use elsewhere.
    memstats.next_gc = memstats.heap_marked + memstats.heap_marked*uint64(gcpercent)/100
    if gcpercent < 0 {
        memstats.next_gc = ^uint64(0)
    }

    // 確保next_gc和heap_live之間最少有1MB
    // Ensure that the heap goal is at least a little larger than
    // the current live heap size. This may not be the case if GC
    // start is delayed or if the allocation that pushed heap_live
    // over gc_trigger is large or if the trigger is really close to
    // GOGC. Assist is proportional to this distance, so enforce a
    // minimum distance, even if it means going over the GOGC goal
    // by a tiny bit.
    if memstats.next_gc < memstats.heap_live+1024*1024 {
        memstats.next_gc = memstats.heap_live + 1024*1024
    }

    // 計算可以同時執行的后臺標記任務的數量
    // dedicatedMarkWorkersNeeded等于P的數量的25%去除小數點
    // 如果可以整除則fractionalMarkWorkersNeeded等于0否則等于1
    // totalUtilizationGoal是GC所占的P的目標值(例如P一共有5個時目標是1.25個P)
    // fractionalUtilizationGoal是Fractiona模式的任務所占的P的目標值(例如P一共有5個時目標是0.25個P)
    // Compute the total mark utilization goal and divide it among
    // dedicated and fractional workers.
    totalUtilizationGoal := float64(gomaxprocs) * gcGoalUtilization
    c.dedicatedMarkWorkersNeeded = int64(totalUtilizationGoal)
    c.fractionalUtilizationGoal = totalUtilizationGoal - float64(c.dedicatedMarkWorkersNeeded)
    if c.fractionalUtilizationGoal > 0 {
        c.fractionalMarkWorkersNeeded = 1
    } else {
        c.fractionalMarkWorkersNeeded = 0
    }

    // 重置P中的輔助GC所用的時間統計
    // Clear per-P state
    for _, p := range &allp {
        if p == nil {
            break
        }
        p.gcAssistTime = 0
    }

    // 計算輔助GC的參數
    // 參考上面對計算assistWorkPerByte的公式的分析
    // Compute initial values for controls that are updated
    // throughout the cycle.
    c.revise()

    if debug.gcpacertrace > 0 {
        print("pacer: assist ratio=", c.assistWorkPerByte,
            " (scan ", memstats.heap_scan>>20, " MB in ",
            work.initialHeapLive>>20, "->",
            memstats.next_gc>>20, " MB)",
            " workers=", c.dedicatedMarkWorkersNeeded,
            "+", c.fractionalMarkWorkersNeeded, "\n")
    }
}

setGCPhase函數會修改表示當前GC階段的全局變量和是否開啟寫屏障的全局變量:

//go:nosplit
func setGCPhase(x uint32) {
    atomic.Store(&gcphase, x)
    writeBarrier.needed = gcphase == _GCmark || gcphase == _GCmarktermination
    writeBarrier.enabled = writeBarrier.needed || writeBarrier.cgo
}

gcBgMarkPrepare函數會重置后臺標記任務的計數:

// gcBgMarkPrepare sets up state for background marking.
// Mutator assists must not yet be enabled.
func gcBgMarkPrepare() {
    // Background marking will stop when the work queues are empty
    // and there are no more workers (note that, since this is
    // concurrent, this may be a transient state, but mark
    // termination will clean it up). Between background workers
    // and assists, we don't really know how many workers there
    // will be, so we pretend to have an arbitrarily large number
    // of workers, almost all of which are "waiting". While a
    // worker is working it decrements nwait. If nproc == nwait,
    // there are no workers.
    work.nproc = ^uint32(0)
    work.nwait = ^uint32(0)
}

gcMarkRootPrepare函數會計算掃描根對象的任務數量:

// gcMarkRootPrepare queues root scanning jobs (stacks, globals, and
// some miscellany) and initializes scanning-related state.
//
// The caller must have call gcCopySpans().
//
// The world must be stopped.
//
//go:nowritebarrier
func gcMarkRootPrepare() {
    // 釋放mcache中的所有span的任務, 只在完成標記階段(mark termination)中執行
    if gcphase == _GCmarktermination {
        work.nFlushCacheRoots = int(gomaxprocs)
    } else {
        work.nFlushCacheRoots = 0
    }

    // 計算block數量的函數, rootBlockBytes是256KB
    // Compute how many data and BSS root blocks there are.
    nBlocks := func(bytes uintptr) int {
        return int((bytes + rootBlockBytes - 1) / rootBlockBytes)
    }

    work.nDataRoots = 0
    work.nBSSRoots = 0

    // data和bss每一輪GC只掃描一次
    // 并行GC中會在后臺標記任務中掃描, 完成標記階段(mark termination)中不掃描
    // 非并行GC會在完成標記階段(mark termination)中掃描
    // Only scan globals once per cycle; preferably concurrently.
    if !work.markrootDone {
        // 計算掃描可讀寫的全局變量的任務數量
        for _, datap := range activeModules() {
            nDataRoots := nBlocks(datap.edata - datap.data)
            if nDataRoots > work.nDataRoots {
                work.nDataRoots = nDataRoots
            }
        }

        // 計算掃描只讀的全局變量的任務數量
        for _, datap := range activeModules() {
            nBSSRoots := nBlocks(datap.ebss - datap.bss)
            if nBSSRoots > work.nBSSRoots {
                work.nBSSRoots = nBSSRoots
            }
        }
    }

    // span中的finalizer和各個G的棧每一輪GC只掃描一次
    // 同上
    if !work.markrootDone {
        // 計算掃描span中的finalizer的任務數量
        // On the first markroot, we need to scan span roots.
        // In concurrent GC, this happens during concurrent
        // mark and we depend on addfinalizer to ensure the
        // above invariants for objects that get finalizers
        // after concurrent mark. In STW GC, this will happen
        // during mark termination.
        //
        // We're only interested in scanning the in-use spans,
        // which will all be swept at this point. More spans
        // may be added to this list during concurrent GC, but
        // we only care about spans that were allocated before
        // this mark phase.
        work.nSpanRoots = mheap_.sweepSpans[mheap_.sweepgen/2%2].numBlocks()

        // 計算掃描各個G的棧的任務數量
        // On the first markroot, we need to scan all Gs. Gs
        // may be created after this point, but it's okay that
        // we ignore them because they begin life without any
        // roots, so there's nothing to scan, and any roots
        // they create during the concurrent phase will be
        // scanned during mark termination. During mark
        // termination, allglen isn't changing, so we'll scan
        // all Gs.
        work.nStackRoots = int(atomic.Loaduintptr(&allglen))
    } else {
        // We've already scanned span roots and kept the scan
        // up-to-date during concurrent mark.
        work.nSpanRoots = 0

        // The hybrid barrier ensures that stacks can't
        // contain pointers to unmarked objects, so on the
        // second markroot, there's no need to scan stacks.
        work.nStackRoots = 0

        if debug.gcrescanstacks > 0 {
            // Scan stacks anyway for debugging.
            work.nStackRoots = int(atomic.Loaduintptr(&allglen))
        }
    }

    // 計算總任務數量
    // 后臺標記任務會對markrootNext進行原子遞增, 來決定做哪個任務
    // 這種用數值來實現鎖自由隊列的辦法挺聰明的, 盡管google工程師覺得不好(看后面markroot函數的分析)
    work.markrootNext = 0
    work.markrootJobs = uint32(fixedRootCount + work.nFlushCacheRoots + work.nDataRoots + work.nBSSRoots + work.nSpanRoots + work.nStackRoots)
}

gcMarkTinyAllocs函數會標記所有tiny alloc等待合并的對象:

// gcMarkTinyAllocs greys all active tiny alloc blocks.
//
// The world must be stopped.
func gcMarkTinyAllocs() {
    for _, p := range &allp {
        if p == nil || p.status == _Pdead {
            break
        }
        c := p.mcache
        if c == nil || c.tiny == 0 {
            continue
        }
        // 標記各個P中的mcache中的tiny
        // 在上面的mallocgc函數中可以看到tiny是當前等待合并的對象
        _, hbits, span, objIndex := heapBitsForObject(c.tiny, 0, 0)
        gcw := &p.gcw
        // 標記一個對象存活, 并把它加到標記隊列(該對象變為灰色)
        greyobject(c.tiny, 0, 0, hbits, span, gcw, objIndex)
        // gcBlackenPromptly變量表示當前是否禁止本地隊列, 如果已禁止則把標記任務flush到全局隊列
        if gcBlackenPromptly {
            gcw.dispose()
        }
    }
}

startTheWorldWithSema函數會重新啟動世界:

func startTheWorldWithSema() {
    _g_ := getg()
    
    // 禁止G被搶占
    _g_.m.locks++        // disable preemption because it can be holding p in a local var
    
    // 判斷收到的網絡事件(fd可讀可寫或錯誤)并添加對應的G到待運行隊列
    gp := netpoll(false) // non-blocking
    injectglist(gp)
    
    // 判斷是否要啟動gc helper
    add := needaddgcproc()
    lock(&sched.lock)
    
    // 如果要求改變gomaxprocs則調整P的數量
    // procresize會返回有可運行任務的P的鏈表
    procs := gomaxprocs
    if newprocs != 0 {
        procs = newprocs
        newprocs = 0
    }
    p1 := procresize(procs)
    
    // 取消GC等待標記
    sched.gcwaiting = 0
    
    // 如果sysmon在等待則喚醒它
    if sched.sysmonwait != 0 {
        sched.sysmonwait = 0
        notewakeup(&sched.sysmonnote)
    }
    unlock(&sched.lock)
    
    // 喚醒有可運行任務的P
    for p1 != nil {
        p := p1
        p1 = p1.link.ptr()
        if p.m != 0 {
            mp := p.m.ptr()
            p.m = 0
            if mp.nextp != 0 {
                throw("startTheWorld: inconsistent mp->nextp")
            }
            mp.nextp.set(p)
            notewakeup(&mp.park)
        } else {
            // Start M to run P.  Do not start another M below.
            newm(nil, p)
            add = false
        }
    }
    
    // 如果有空閑的P,并且沒有自旋中的M則喚醒或者創建一個M
    // Wakeup an additional proc in case we have excessive runnable goroutines
    // in local queues or in the global queue. If we don't, the proc will park itself.
    // If we have lots of excessive work, resetspinning will unpark additional procs as necessary.
    if atomic.Load(&sched.npidle) != 0 && atomic.Load(&sched.nmspinning) == 0 {
        wakep()
    }
    
    // 啟動gc helper
    if add {
        // If GC could have used another helper proc, start one now,
        // in the hope that it will be available next time.
        // It would have been even better to start it before the collection,
        // but doing so requires allocating memory, so it's tricky to
        // coordinate. This lazy approach works out in practice:
        // we don't mind if the first couple gc rounds don't have quite
        // the maximum number of procs.
        newm(mhelpgc, nil)
    }
    
    // 允許G被搶占
    _g_.m.locks--
    
    // 如果當前G要求被搶占則重新嘗試
    if _g_.m.locks == 0 && _g_.preempt { // restore the preemption request in case we've cleared it in newstack
        _g_.stackguard0 = stackPreempt
    }
}

重啟世界后各個M會重新開始調度, 調度時會優先使用上面提到的findRunnableGCWorker函數查找任務, 之后就有大約25%的P運行后臺標記任務.
后臺標記任務的函數是gcBgMarkWorker:

func gcBgMarkWorker(_p_ *p) {
    gp := getg()
    
    // 用于休眠后重新獲取P的構造體
    type parkInfo struct {
        m      muintptr // Release this m on park.
        attach puintptr // If non-nil, attach to this p on park.
    }
    // We pass park to a gopark unlock function, so it can't be on
    // the stack (see gopark). Prevent deadlock from recursively
    // starting GC by disabling preemption.
    gp.m.preemptoff = "GC worker init"
    park := new(parkInfo)
    gp.m.preemptoff = ""
    
    // 設置當前的M并禁止搶占
    park.m.set(acquirem())
    // 設置當前的P(需要關聯到的P)
    park.attach.set(_p_)
    
    // 通知gcBgMarkStartWorkers可以繼續處理
    // Inform gcBgMarkStartWorkers that this worker is ready.
    // After this point, the background mark worker is scheduled
    // cooperatively by gcController.findRunnable. Hence, it must
    // never be preempted, as this would put it into _Grunnable
    // and put it on a run queue. Instead, when the preempt flag
    // is set, this puts itself into _Gwaiting to be woken up by
    // gcController.findRunnable at the appropriate time.
    notewakeup(&work.bgMarkReady)
    
    for {
        // 讓當前G進入休眠
        // Go to sleep until woken by gcController.findRunnable.
        // We can't releasem yet since even the call to gopark
        // may be preempted.
        gopark(func(g *g, parkp unsafe.Pointer) bool {
            park := (*parkInfo)(parkp)
            
            // 重新允許搶占
            // The worker G is no longer running, so it's
            // now safe to allow preemption.
            releasem(park.m.ptr())
            
            // 設置關聯的P
            // 把當前的G設到P的gcBgMarkWorker成員, 下次findRunnableGCWorker會使用
            // 設置失敗時不休眠
            // If the worker isn't attached to its P,
            // attach now. During initialization and after
            // a phase change, the worker may have been
            // running on a different P. As soon as we
            // attach, the owner P may schedule the
            // worker, so this must be done after the G is
            // stopped.
            if park.attach != 0 {
                p := park.attach.ptr()
                park.attach.set(nil)
                // cas the worker because we may be
                // racing with a new worker starting
                // on this P.
                if !p.gcBgMarkWorker.cas(0, guintptr(unsafe.Pointer(g))) {
                    // The P got a new worker.
                    // Exit this worker.
                    return false
                }
            }
            return true
        }, unsafe.Pointer(park), "GC worker (idle)", traceEvGoBlock, 0)
        
        // 檢查P的gcBgMarkWorker是否和當前的G一致, 不一致時結束當前的任務
        // Loop until the P dies and disassociates this
        // worker (the P may later be reused, in which case
        // it will get a new worker) or we failed to associate.
        if _p_.gcBgMarkWorker.ptr() != gp {
            break
        }
        
        // 禁止G被搶占
        // Disable preemption so we can use the gcw. If the
        // scheduler wants to preempt us, we'll stop draining,
        // dispose the gcw, and then preempt.
        park.m.set(acquirem())
        
        if gcBlackenEnabled == 0 {
            throw("gcBgMarkWorker: blackening not enabled")
        }
        
        // 記錄開始時間
        startTime := nanotime()
        
        decnwait := atomic.Xadd(&work.nwait, -1)
        if decnwait == work.nproc {
            println("runtime: work.nwait=", decnwait, "work.nproc=", work.nproc)
            throw("work.nwait was > work.nproc")
        }
        
        // 切換到g0運行
        systemstack(func() {
            // 設置G的狀態為等待中這樣它的??梢员粧呙?兩個后臺標記任務可以互相掃描對方的棧)
            // Mark our goroutine preemptible so its stack
            // can be scanned. This lets two mark workers
            // scan each other (otherwise, they would
            // deadlock). We must not modify anything on
            // the G stack. However, stack shrinking is
            // disabled for mark workers, so it is safe to
            // read from the G stack.
            casgstatus(gp, _Grunning, _Gwaiting)
            
            // 判斷后臺標記任務的模式
            switch _p_.gcMarkWorkerMode {
            default:
                throw("gcBgMarkWorker: unexpected gcMarkWorkerMode")
            case gcMarkWorkerDedicatedMode:
                // 這個模式下P應該專心執行標記
                // 執行標記, 直到被搶占, 并且需要計算后臺的掃描量來減少輔助GC和喚醒等待中的G
                gcDrain(&_p_.gcw, gcDrainUntilPreempt|gcDrainFlushBgCredit)
                // 被搶占時把本地運行隊列中的所有G都踢到全局運行隊列
                if gp.preempt {
                    // We were preempted. This is
                    // a useful signal to kick
                    // everything out of the run
                    // queue so it can run
                    // somewhere else.
                    lock(&sched.lock)
                    for {
                        gp, _ := runqget(_p_)
                        if gp == nil {
                            break
                        }
                        globrunqput(gp)
                    }
                    unlock(&sched.lock)
                }
                // 繼續執行標記, 直到無更多任務, 并且需要計算后臺的掃描量來減少輔助GC和喚醒等待中的G
                // Go back to draining, this time
                // without preemption.
                gcDrain(&_p_.gcw, gcDrainNoBlock|gcDrainFlushBgCredit)
            case gcMarkWorkerFractionalMode:
                // 這個模式下P應該適當執行標記
                // 執行標記, 直到被搶占, 并且需要計算后臺的掃描量來減少輔助GC和喚醒等待中的G
                gcDrain(&_p_.gcw, gcDrainUntilPreempt|gcDrainFlushBgCredit)
            case gcMarkWorkerIdleMode:
                // 這個模式下P只在空閑時執行標記
                // 執行標記, 直到被搶占或者達到一定的量, 并且需要計算后臺的掃描量來減少輔助GC和喚醒等待中的G
                gcDrain(&_p_.gcw, gcDrainIdle|gcDrainUntilPreempt|gcDrainFlushBgCredit)
            }
            
            // 恢復G的狀態到運行中
            casgstatus(gp, _Gwaiting, _Grunning)
        })
        
        // 如果標記了禁止本地標記隊列則flush到全局標記隊列
        // If we are nearing the end of mark, dispose
        // of the cache promptly. We must do this
        // before signaling that we're no longer
        // working so that other workers can't observe
        // no workers and no work while we have this
        // cached, and before we compute done.
        if gcBlackenPromptly {
            _p_.gcw.dispose()
        }
        
        // 累加所用時間
        // Account for time.
        duration := nanotime() - startTime
        switch _p_.gcMarkWorkerMode {
        case gcMarkWorkerDedicatedMode:
            atomic.Xaddint64(&gcController.dedicatedMarkTime, duration)
            atomic.Xaddint64(&gcController.dedicatedMarkWorkersNeeded, 1)
        case gcMarkWorkerFractionalMode:
            atomic.Xaddint64(&gcController.fractionalMarkTime, duration)
            atomic.Xaddint64(&gcController.fractionalMarkWorkersNeeded, 1)
        case gcMarkWorkerIdleMode:
            atomic.Xaddint64(&gcController.idleMarkTime, duration)
        }
        
        // Was this the last worker and did we run out
        // of work?
        incnwait := atomic.Xadd(&work.nwait, +1)
        if incnwait > work.nproc {
            println("runtime: p.gcMarkWorkerMode=", _p_.gcMarkWorkerMode,
                "work.nwait=", incnwait, "work.nproc=", work.nproc)
            throw("work.nwait > work.nproc")
        }
        
        // 判斷是否所有后臺標記任務都完成, 并且沒有更多的任務
        // If this worker reached a background mark completion
        // point, signal the main GC goroutine.
        if incnwait == work.nproc && !gcMarkWorkAvailable(nil) {
            // 取消和P的關聯
            // Make this G preemptible and disassociate it
            // as the worker for this P so
            // findRunnableGCWorker doesn't try to
            // schedule it.
            _p_.gcBgMarkWorker.set(nil)
            
            // 允許G被搶占
            releasem(park.m.ptr())
            
            // 準備進入完成標記階段
            gcMarkDone()
            
            // 休眠之前會重新關聯P
            // 因為上面允許被搶占, 到這里的時候可能就會變成其他P
            // 如果重新關聯P失敗則這個任務會結束
            // Disable preemption and prepare to reattach
            // to the P.
            //
            // We may be running on a different P at this
            // point, so we can't reattach until this G is
            // parked.
            park.m.set(acquirem())
            park.attach.set(_p_)
        }
    }
}

gcDrain函數用于執行標記:

// gcDrain scans roots and objects in work buffers, blackening grey
// objects until all roots and work buffers have been drained.
//
// If flags&gcDrainUntilPreempt != 0, gcDrain returns when g.preempt
// is set. This implies gcDrainNoBlock.
//
// If flags&gcDrainIdle != 0, gcDrain returns when there is other work
// to do. This implies gcDrainNoBlock.
//
// If flags&gcDrainNoBlock != 0, gcDrain returns as soon as it is
// unable to get more work. Otherwise, it will block until all
// blocking calls are blocked in gcDrain.
//
// If flags&gcDrainFlushBgCredit != 0, gcDrain flushes scan work
// credit to gcController.bgScanCredit every gcCreditSlack units of
// scan work.
//
//go:nowritebarrier
func gcDrain(gcw *gcWork, flags gcDrainFlags) {
    if !writeBarrier.needed {
        throw("gcDrain phase incorrect")
    }
    
    gp := getg().m.curg
    
    // 看到搶占標志時是否要返回
    preemptible := flags&gcDrainUntilPreempt != 0
    
    // 沒有任務時是否要等待任務
    blocking := flags&(gcDrainUntilPreempt|gcDrainIdle|gcDrainNoBlock) == 0
    
    // 是否計算后臺的掃描量來減少輔助GC和喚醒等待中的G
    flushBgCredit := flags&gcDrainFlushBgCredit != 0
    
    // 是否只執行一定量的工作
    idle := flags&gcDrainIdle != 0
    
    // 記錄初始的已掃描數量
    initScanWork := gcw.scanWork
    
    // 掃描idleCheckThreshold(100000)個對象以后檢查是否要返回
    // idleCheck is the scan work at which to perform the next
    // idle check with the scheduler.
    idleCheck := initScanWork + idleCheckThreshold
    
    // 如果根對象未掃描完, 則先掃描根對象
    // Drain root marking jobs.
    if work.markrootNext < work.markrootJobs {
        // 如果標記了preemptible, 循環直到被搶占
        for !(preemptible && gp.preempt) {
            // 從根對象掃描隊列取出一個值(原子遞增)
            job := atomic.Xadd(&work.markrootNext, +1) - 1
            if job >= work.markrootJobs {
                break
            }
            // 執行根對象掃描工作
            markroot(gcw, job)
            // 如果是idle模式并且有其他工作, 則返回
            if idle && pollWork() {
                goto done
            }
        }
    }
    
    // 根對象已經在標記隊列中, 消費標記隊列
    // 如果標記了preemptible, 循環直到被搶占
    // Drain heap marking jobs.
    for !(preemptible && gp.preempt) {
        // 如果全局標記隊列為空, 把本地標記隊列的一部分工作分過去
        // (如果wbuf2不為空則移動wbuf2過去, 否則移動wbuf1的一半過去)
        // Try to keep work available on the global queue. We used to
        // check if there were waiting workers, but it's better to
        // just keep work available than to make workers wait. In the
        // worst case, we'll do O(log(_WorkbufSize)) unnecessary
        // balances.
        if work.full == 0 {
            gcw.balance()
        }
        
        // 從本地標記隊列中獲取對象, 獲取不到則從全局標記隊列獲取
        var b uintptr
        if blocking {
            // 阻塞獲取
            b = gcw.get()
        } else {
            // 非阻塞獲取
            b = gcw.tryGetFast()
            if b == 0 {
                b = gcw.tryGet()
            }
        }
        
        // 獲取不到對象, 標記隊列已為空, 跳出循環
        if b == 0 {
            // work barrier reached or tryGet failed.
            break
        }
        
        // 掃描獲取到的對象
        scanobject(b, gcw)
        
        // 如果已經掃描了一定數量的對象(gcCreditSlack的值是2000)
        // Flush background scan work credit to the global
        // account if we've accumulated enough locally so
        // mutator assists can draw on it.
        if gcw.scanWork >= gcCreditSlack {
            // 把掃描的對象數量添加到全局
            atomic.Xaddint64(&gcController.scanWork, gcw.scanWork)
            // 減少輔助GC的工作量和喚醒等待中的G
            if flushBgCredit {
                gcFlushBgCredit(gcw.scanWork - initScanWork)
                initScanWork = 0
            }
            idleCheck -= gcw.scanWork
            gcw.scanWork = 0
            
            // 如果是idle模式且達到了檢查的掃描量, 則檢查是否有其他任務(G), 如果有則跳出循環
            if idle && idleCheck <= 0 {
                idleCheck += idleCheckThreshold
                if pollWork() {
                    break
                }
            }
        }
    }
    
    // In blocking mode, write barriers are not allowed after this
    // point because we must preserve the condition that the work
    // buffers are empty.
    
done:
    // 把掃描的對象數量添加到全局
    // Flush remaining scan work credit.
    if gcw.scanWork > 0 {
        atomic.Xaddint64(&gcController.scanWork, gcw.scanWork)
        // 減少輔助GC的工作量和喚醒等待中的G
        if flushBgCredit {
            gcFlushBgCredit(gcw.scanWork - initScanWork)
        }
        gcw.scanWork = 0
    }
}

markroot函數用于執行根對象掃描工作:

// markroot scans the i'th root.
//
// Preemption must be disabled (because this uses a gcWork).
//
// nowritebarrier is only advisory here.
//
//go:nowritebarrier
func markroot(gcw *gcWork, i uint32) {
    // 判斷取出的數值對應哪種任務
    // (google的工程師覺得這種辦法可笑)
    // TODO(austin): This is a bit ridiculous. Compute and store
    // the bases in gcMarkRootPrepare instead of the counts.
    baseFlushCache := uint32(fixedRootCount)
    baseData := baseFlushCache + uint32(work.nFlushCacheRoots)
    baseBSS := baseData + uint32(work.nDataRoots)
    baseSpans := baseBSS + uint32(work.nBSSRoots)
    baseStacks := baseSpans + uint32(work.nSpanRoots)
    end := baseStacks + uint32(work.nStackRoots)

    // Note: if you add a case here, please also update heapdump.go:dumproots.
    switch {
    // 釋放mcache中的所有span, 要求STW
    case baseFlushCache <= i && i < baseData:
        flushmcache(int(i - baseFlushCache))

    // 掃描可讀寫的全局變量
    // 這里只會掃描i對應的block, 掃描時傳入包含哪里有指針的bitmap數據
    case baseData <= i && i < baseBSS:
        for _, datap := range activeModules() {
            markrootBlock(datap.data, datap.edata-datap.data, datap.gcdatamask.bytedata, gcw, int(i-baseData))
        }

    // 掃描只讀的全局變量
    // 這里只會掃描i對應的block, 掃描時傳入包含哪里有指針的bitmap數據
    case baseBSS <= i && i < baseSpans:
        for _, datap := range activeModules() {
            markrootBlock(datap.bss, datap.ebss-datap.bss, datap.gcbssmask.bytedata, gcw, int(i-baseBSS))
        }

    // 掃描析構器隊列
    case i == fixedRootFinalizers:
        // Only do this once per GC cycle since we don't call
        // queuefinalizer during marking.
        if work.markrootDone {
            break
        }
        for fb := allfin; fb != nil; fb = fb.alllink {
            cnt := uintptr(atomic.Load(&fb.cnt))
            scanblock(uintptr(unsafe.Pointer(&fb.fin[0])), cnt*unsafe.Sizeof(fb.fin[0]), &finptrmask[0], gcw)
        }

    // 釋放已中止的G的棧
    case i == fixedRootFreeGStacks:
        // Only do this once per GC cycle; preferably
        // concurrently.
        if !work.markrootDone {
            // Switch to the system stack so we can call
            // stackfree.
            systemstack(markrootFreeGStacks)
        }

    // 掃描各個span中特殊對象(析構器列表)
    case baseSpans <= i && i < baseStacks:
        // mark MSpan.specials
        markrootSpans(gcw, int(i-baseSpans))

    // 掃描各個G的棧
    default:
        // 獲取需要掃描的G
        // the rest is scanning goroutine stacks
        var gp *g
        if baseStacks <= i && i < end {
            gp = allgs[i-baseStacks]
        } else {
            throw("markroot: bad index")
        }

        // 記錄等待開始的時間
        // remember when we've first observed the G blocked
        // needed only to output in traceback
        status := readgstatus(gp) // We are not in a scan state
        if (status == _Gwaiting || status == _Gsyscall) && gp.waitsince == 0 {
            gp.waitsince = work.tstart
        }

        // 切換到g0運行(有可能會掃到自己的棧)
        // scang must be done on the system stack in case
        // we're trying to scan our own stack.
        systemstack(func() {
            // 判斷掃描的棧是否自己的
            // If this is a self-scan, put the user G in
            // _Gwaiting to prevent self-deadlock. It may
            // already be in _Gwaiting if this is a mark
            // worker or we're in mark termination.
            userG := getg().m.curg
            selfScan := gp == userG && readgstatus(userG) == _Grunning
            
            // 如果正在掃描自己的棧則切換狀態到等待中防止死鎖
            if selfScan {
                casgstatus(userG, _Grunning, _Gwaiting)
                userG.waitreason = "garbage collection scan"
            }
            
            // 掃描G的棧
            // TODO: scang blocks until gp's stack has
            // been scanned, which may take a while for
            // running goroutines. Consider doing this in
            // two phases where the first is non-blocking:
            // we scan the stacks we can and ask running
            // goroutines to scan themselves; and the
            // second blocks.
            scang(gp, gcw)
            
            // 如果正在掃描自己的棧則把狀態切換回運行中
            if selfScan {
                casgstatus(userG, _Gwaiting, _Grunning)
            }
        })
    }
}

scang函數負責掃描G的棧:

// scang blocks until gp's stack has been scanned.
// It might be scanned by scang or it might be scanned by the goroutine itself.
// Either way, the stack scan has completed when scang returns.
func scang(gp *g, gcw *gcWork) {
    // Invariant; we (the caller, markroot for a specific goroutine) own gp.gcscandone.
    // Nothing is racing with us now, but gcscandone might be set to true left over
    // from an earlier round of stack scanning (we scan twice per GC).
    // We use gcscandone to record whether the scan has been done during this round.

    // 標記掃描未完成
    gp.gcscandone = false

    // See http://golang.org/cl/21503 for justification of the yield delay.
    const yieldDelay = 10 * 1000
    var nextYield int64

    // 循環直到掃描完成
    // Endeavor to get gcscandone set to true,
    // either by doing the stack scan ourselves or by coercing gp to scan itself.
    // gp.gcscandone can transition from false to true when we're not looking
    // (if we asked for preemption), so any time we lock the status using
    // castogscanstatus we have to double-check that the scan is still not done.
loop:
    for i := 0; !gp.gcscandone; i++ {
        // 判斷G的當前狀態
        switch s := readgstatus(gp); s {
        default:
            dumpgstatus(gp)
            throw("stopg: invalid status")

        // G已中止, 不需要掃描它
        case _Gdead:
            // No stack.
            gp.gcscandone = true
            break loop

        // G的棧正在擴展, 下一輪重試
        case _Gcopystack:
        // Stack being switched. Go around again.

        // G不是運行中, 首先需要防止它運行
        case _Grunnable, _Gsyscall, _Gwaiting:
            // Claim goroutine by setting scan bit.
            // Racing with execution or readying of gp.
            // The scan bit keeps them from running
            // the goroutine until we're done.
            if castogscanstatus(gp, s, s|_Gscan) {
                // 原子切換狀態成功時掃描它的棧
                if !gp.gcscandone {
                    scanstack(gp, gcw)
                    gp.gcscandone = true
                }
                // 恢復G的狀態, 并跳出循環
                restartg(gp)
                break loop
            }

        // G正在掃描它自己, 等待掃描完畢
        case _Gscanwaiting:
        // newstack is doing a scan for us right now. Wait.

        // G正在運行
        case _Grunning:
            // Goroutine running. Try to preempt execution so it can scan itself.
            // The preemption handler (in newstack) does the actual scan.

            // 如果已經有搶占請求, 則搶占成功時會幫我們處理
            // Optimization: if there is already a pending preemption request
            // (from the previous loop iteration), don't bother with the atomics.
            if gp.preemptscan && gp.preempt && gp.stackguard0 == stackPreempt {
                break
            }

            // 搶占G, 搶占成功時G會掃描它自己
            // Ask for preemption and self scan.
            if castogscanstatus(gp, _Grunning, _Gscanrunning) {
                if !gp.gcscandone {
                    gp.preemptscan = true
                    gp.preempt = true
                    gp.stackguard0 = stackPreempt
                }
                casfrom_Gscanstatus(gp, _Gscanrunning, _Grunning)
            }
        }

        // 第一輪休眠10毫秒, 第二輪休眠5毫秒
        if i == 0 {
            nextYield = nanotime() + yieldDelay
        }
        if nanotime() < nextYield {
            procyield(10)
        } else {
            osyield()
            nextYield = nanotime() + yieldDelay/2
        }
    }

    // 掃描完成, 取消搶占掃描的請求
    gp.preemptscan = false // cancel scan request if no longer needed
}

設置preemptscan后, 在搶占G成功時會調用scanstack掃描它自己的棧, 具體代碼在這里.
掃描棧用的函數是scanstack:

// scanstack scans gp's stack, greying all pointers found on the stack.
//
// scanstack is marked go:systemstack because it must not be preempted
// while using a workbuf.
//
//go:nowritebarrier
//go:systemstack
func scanstack(gp *g, gcw *gcWork) {
    if gp.gcscanvalid {
        return
    }

    if readgstatus(gp)&_Gscan == 0 {
        print("runtime:scanstack: gp=", gp, ", goid=", gp.goid, ", gp->atomicstatus=", hex(readgstatus(gp)), "\n")
        throw("scanstack - bad status")
    }

    switch readgstatus(gp) &^ _Gscan {
    default:
        print("runtime: gp=", gp, ", goid=", gp.goid, ", gp->atomicstatus=", readgstatus(gp), "\n")
        throw("mark - bad status")
    case _Gdead:
        return
    case _Grunning:
        print("runtime: gp=", gp, ", goid=", gp.goid, ", gp->atomicstatus=", readgstatus(gp), "\n")
        throw("scanstack: goroutine not stopped")
    case _Grunnable, _Gsyscall, _Gwaiting:
        // ok
    }

    if gp == getg() {
        throw("can't scan our own stack")
    }
    mp := gp.m
    if mp != nil && mp.helpgc != 0 {
        throw("can't scan gchelper stack")
    }

    // Shrink the stack if not much of it is being used. During
    // concurrent GC, we can do this during concurrent mark.
    if !work.markrootDone {
        shrinkstack(gp)
    }

    // Scan the stack.
    var cache pcvalueCache
    scanframe := func(frame *stkframe, unused unsafe.Pointer) bool {
        // scanframeworker會根據代碼地址(pc)獲取函數信息
        // 然后找到函數信息中的stackmap.bytedata, 它保存了函數的棧上哪些地方有指針
        // 再調用scanblock來掃描函數的??臻g, 同時函數的參數也會這樣掃描
        scanframeworker(frame, &cache, gcw)
        return true
    }
    // 枚舉所有調用幀, 分別調用scanframe函數
    gentraceback(^uintptr(0), ^uintptr(0), 0, gp, 0, nil, 0x7fffffff, scanframe, nil, 0)
    // 枚舉所有defer的調用幀, 分別調用scanframe函數
    tracebackdefers(gp, scanframe, nil)
    gp.gcscanvalid = true
}

scanblock函數是一個通用的掃描函數, 掃描全局變量和??臻g都會用它, 和scanobject不同的是bitmap需要手動傳入:

// scanblock scans b as scanobject would, but using an explicit
// pointer bitmap instead of the heap bitmap.
//
// This is used to scan non-heap roots, so it does not update
// gcw.bytesMarked or gcw.scanWork.
//
//go:nowritebarrier
func scanblock(b0, n0 uintptr, ptrmask *uint8, gcw *gcWork) {
    // Use local copies of original parameters, so that a stack trace
    // due to one of the throws below shows the original block
    // base and extent.
    b := b0
    n := n0

    arena_start := mheap_.arena_start
    arena_used := mheap_.arena_used

    // 枚舉掃描的地址
    for i := uintptr(0); i < n; {
        // 找到bitmap中對應的byte
        // Find bits for the next word.
        bits := uint32(*addb(ptrmask, i/(sys.PtrSize*8)))
        if bits == 0 {
            i += sys.PtrSize * 8
            continue
        }
        // 枚舉byte
        for j := 0; j < 8 && i < n; j++ {
            // 如果該地址包含指針
            if bits&1 != 0 {
                // 標記在該地址的對象存活, 并把它加到標記隊列(該對象變為灰色)
                // Same work as in scanobject; see comments there.
                obj := *(*uintptr)(unsafe.Pointer(b + i))
                if obj != 0 && arena_start <= obj && obj < arena_used {
                    // 找到該對象對應的span和bitmap
                    if obj, hbits, span, objIndex := heapBitsForObject(obj, b, i); obj != 0 {
                        // 標記一個對象存活, 并把它加到標記隊列(該對象變為灰色)
                        greyobject(obj, b, i, hbits, span, gcw, objIndex)
                    }
                }
            }
            // 處理下一個指針下一個bit
            bits >>= 1
            i += sys.PtrSize
        }
    }
}

greyobject用于標記一個對象存活, 并把它加到標記隊列(該對象變為灰色):

// obj is the start of an object with mark mbits.
// If it isn't already marked, mark it and enqueue into gcw.
// base and off are for debugging only and could be removed.
//go:nowritebarrierrec
func greyobject(obj, base, off uintptr, hbits heapBits, span *mspan, gcw *gcWork, objIndex uintptr) {
    // obj should be start of allocation, and so must be at least pointer-aligned.
    if obj&(sys.PtrSize-1) != 0 {
        throw("greyobject: obj not pointer-aligned")
    }
    mbits := span.markBitsForIndex(objIndex)

    if useCheckmark {
        // checkmark是用于檢查是否所有可到達的對象都被正確標記的機制, 僅除錯使用
        if !mbits.isMarked() {
            printlock()
            print("runtime:greyobject: checkmarks finds unexpected unmarked object obj=", hex(obj), "\n")
            print("runtime: found obj at *(", hex(base), "+", hex(off), ")\n")

            // Dump the source (base) object
            gcDumpObject("base", base, off)

            // Dump the object
            gcDumpObject("obj", obj, ^uintptr(0))

            getg().m.traceback = 2
            throw("checkmark found unmarked object")
        }
        if hbits.isCheckmarked(span.elemsize) {
            return
        }
        hbits.setCheckmarked(span.elemsize)
        if !hbits.isCheckmarked(span.elemsize) {
            throw("setCheckmarked and isCheckmarked disagree")
        }
    } else {
        if debug.gccheckmark > 0 && span.isFree(objIndex) {
            print("runtime: marking free object ", hex(obj), " found at *(", hex(base), "+", hex(off), ")\n")
            gcDumpObject("base", base, off)
            gcDumpObject("obj", obj, ^uintptr(0))
            getg().m.traceback = 2
            throw("marking free object")
        }

        // 如果對象所在的span中的gcmarkBits對應的bit已經設置為1則可以跳過處理
        // If marked we have nothing to do.
        if mbits.isMarked() {
            return
        }
        
        // 設置對象所在的span中的gcmarkBits對應的bit為1
        // mbits.setMarked() // Avoid extra call overhead with manual inlining.
        atomic.Or8(mbits.bytep, mbits.mask)
        
        // 如果確定對象不包含指針(所在span的類型是noscan), 則不需要把對象放入標記隊列
        // If this is a noscan object, fast-track it to black
        // instead of greying it.
        if span.spanclass.noscan() {
            gcw.bytesMarked += uint64(span.elemsize)
            return
        }
    }

    // 把對象放入標記隊列
    // 先放入本地標記隊列, 失敗時把本地標記隊列中的部分工作轉移到全局標記隊列, 再放入本地標記隊列
    // Queue the obj for scanning. The PREFETCH(obj) logic has been removed but
    // seems like a nice optimization that can be added back in.
    // There needs to be time between the PREFETCH and the use.
    // Previously we put the obj in an 8 element buffer that is drained at a rate
    // to give the PREFETCH time to do its work.
    // Use of PREFETCHNTA might be more appropriate than PREFETCH
    if !gcw.putFast(obj) {
        gcw.put(obj)
    }
}

gcDrain函數掃描完根對象, 就會開始消費標記隊列, 對從標記隊列中取出的對象調用scanobject函數:

// scanobject scans the object starting at b, adding pointers to gcw.
// b must point to the beginning of a heap object or an oblet.
// scanobject consults the GC bitmap for the pointer mask and the
// spans for the size of the object.
//
//go:nowritebarrier
func scanobject(b uintptr, gcw *gcWork) {
    // Note that arena_used may change concurrently during
    // scanobject and hence scanobject may encounter a pointer to
    // a newly allocated heap object that is *not* in
    // [start,used). It will not mark this object; however, we
    // know that it was just installed by a mutator, which means
    // that mutator will execute a write barrier and take care of
    // marking it. This is even more pronounced on relaxed memory
    // architectures since we access arena_used without barriers
    // or synchronization, but the same logic applies.
    arena_start := mheap_.arena_start
    arena_used := mheap_.arena_used

    // Find the bits for b and the size of the object at b.
    //
    // b is either the beginning of an object, in which case this
    // is the size of the object to scan, or it points to an
    // oblet, in which case we compute the size to scan below.
    // 獲取對象對應的bitmap
    hbits := heapBitsForAddr(b)
    
    // 獲取對象所在的span
    s := spanOfUnchecked(b)
    
    // 獲取對象的大小
    n := s.elemsize
    if n == 0 {
        throw("scanobject n == 0")
    }

    // 對象大小過大時(maxObletBytes是128KB)需要分割掃描
    // 每次最多只掃描128KB
    if n > maxObletBytes {
        // Large object. Break into oblets for better
        // parallelism and lower latency.
        if b == s.base() {
            // It's possible this is a noscan object (not
            // from greyobject, but from other code
            // paths), in which case we must *not* enqueue
            // oblets since their bitmaps will be
            // uninitialized.
            if s.spanclass.noscan() {
                // Bypass the whole scan.
                gcw.bytesMarked += uint64(n)
                return
            }

            // Enqueue the other oblets to scan later.
            // Some oblets may be in b's scalar tail, but
            // these will be marked as "no more pointers",
            // so we'll drop out immediately when we go to
            // scan those.
            for oblet := b + maxObletBytes; oblet < s.base()+s.elemsize; oblet += maxObletBytes {
                if !gcw.putFast(oblet) {
                    gcw.put(oblet)
                }
            }
        }

        // Compute the size of the oblet. Since this object
        // must be a large object, s.base() is the beginning
        // of the object.
        n = s.base() + s.elemsize - b
        if n > maxObletBytes {
            n = maxObletBytes
        }
    }

    // 掃描對象中的指針
    var i uintptr
    for i = 0; i < n; i += sys.PtrSize {
        // 獲取對應的bit
        // Find bits for this word.
        if i != 0 {
            // Avoid needless hbits.next() on last iteration.
            hbits = hbits.next()
        }
        // Load bits once. See CL 22712 and issue 16973 for discussion.
        bits := hbits.bits()
        
        // 檢查scan bit判斷是否繼續掃描, 注意第二個scan bit是checkmark
        // During checkmarking, 1-word objects store the checkmark
        // in the type bit for the one word. The only one-word objects
        // are pointers, or else they'd be merged with other non-pointer
        // data into larger allocations.
        if i != 1*sys.PtrSize && bits&bitScan == 0 {
            break // no more pointers in this object
        }
        
        // 檢查pointer bit, 不是指針則繼續
        if bits&bitPointer == 0 {
            continue // not a pointer
        }

        // 取出指針的值
        // Work here is duplicated in scanblock and above.
        // If you make changes here, make changes there too.
        obj := *(*uintptr)(unsafe.Pointer(b + i))

        // 如果指針在arena區域中, 則調用greyobject標記對象并把對象放到標記隊列中
        // At this point we have extracted the next potential pointer.
        // Check if it points into heap and not back at the current object.
        if obj != 0 && arena_start <= obj && obj < arena_used && obj-b >= n {
            // Mark the object.
            if obj, hbits, span, objIndex := heapBitsForObject(obj, b, i); obj != 0 {
                greyobject(obj, b, i, hbits, span, gcw, objIndex)
            }
        }
    }
    
    // 統計掃描過的大小和對象數量
    gcw.bytesMarked += uint64(n)
    gcw.scanWork += int64(i)
}

在所有后臺標記任務都把標記隊列消費完畢時, 會執行gcMarkDone函數準備進入完成標記階段(mark termination):
在并行GC中gcMarkDone會被執行兩次, 第一次會禁止本地標記隊列然后重新開始后臺標記任務, 第二次會進入完成標記階段(mark termination)。

// gcMarkDone transitions the GC from mark 1 to mark 2 and from mark 2
// to mark termination.
//
// This should be called when all mark work has been drained. In mark
// 1, this includes all root marking jobs, global work buffers, and
// active work buffers in assists and background workers; however,
// work may still be cached in per-P work buffers. In mark 2, per-P
// caches are disabled.
//
// The calling context must be preemptible.
//
// Note that it is explicitly okay to have write barriers in this
// function because completion of concurrent mark is best-effort
// anyway. Any work created by write barriers here will be cleaned up
// by mark termination.
func gcMarkDone() {
top:
    semacquire(&work.markDoneSema)

    // Re-check transition condition under transition lock.
    if !(gcphase == _GCmark && work.nwait == work.nproc && !gcMarkWorkAvailable(nil)) {
        semrelease(&work.markDoneSema)
        return
    }

    // 暫時禁止啟動新的后臺標記任務
    // Disallow starting new workers so that any remaining workers
    // in the current mark phase will drain out.
    //
    // TODO(austin): Should dedicated workers keep an eye on this
    // and exit gcDrain promptly?
    atomic.Xaddint64(&gcController.dedicatedMarkWorkersNeeded, -0xffffffff)
    atomic.Xaddint64(&gcController.fractionalMarkWorkersNeeded, -0xffffffff)

    // 判斷本地標記隊列是否已禁用
    if !gcBlackenPromptly {
        // 本地標記隊列是否未禁用, 禁用然后重新開始后臺標記任務
        // Transition from mark 1 to mark 2.
        //
        // The global work list is empty, but there can still be work
        // sitting in the per-P work caches.
        // Flush and disable work caches.

        // 禁用本地標記隊列
        // Disallow caching workbufs and indicate that we're in mark 2.
        gcBlackenPromptly = true

        // Prevent completion of mark 2 until we've flushed
        // cached workbufs.
        atomic.Xadd(&work.nwait, -1)

        // GC is set up for mark 2. Let Gs blocked on the
        // transition lock go while we flush caches.
        semrelease(&work.markDoneSema)

        // 把所有本地標記隊列中的對象都推到全局標記隊列
        systemstack(func() {
            // Flush all currently cached workbufs and
            // ensure all Ps see gcBlackenPromptly. This
            // also blocks until any remaining mark 1
            // workers have exited their loop so we can
            // start new mark 2 workers.
            forEachP(func(_p_ *p) {
                _p_.gcw.dispose()
            })
        })

        // 除錯用
        // Check that roots are marked. We should be able to
        // do this before the forEachP, but based on issue
        // #16083 there may be a (harmless) race where we can
        // enter mark 2 while some workers are still scanning
        // stacks. The forEachP ensures these scans are done.
        //
        // TODO(austin): Figure out the race and fix this
        // properly.
        gcMarkRootCheck()

        // 允許啟動新的后臺標記任務
        // Now we can start up mark 2 workers.
        atomic.Xaddint64(&gcController.dedicatedMarkWorkersNeeded, 0xffffffff)
        atomic.Xaddint64(&gcController.fractionalMarkWorkersNeeded, 0xffffffff)

        // 如果確定沒有更多的任務則可以直接跳到函數頂部
        // 這樣就當作是第二次調用了
        incnwait := atomic.Xadd(&work.nwait, +1)
        if incnwait == work.nproc && !gcMarkWorkAvailable(nil) {
            // This loop will make progress because
            // gcBlackenPromptly is now true, so it won't
            // take this same "if" branch.
            goto top
        }
    } else {
        // 記錄完成標記階段開始的時間和STW開始的時間
        // Transition to mark termination.
        now := nanotime()
        work.tMarkTerm = now
        work.pauseStart = now
        
        // 禁止G被搶占
        getg().m.preemptoff = "gcing"
        
        // 停止所有運行中的G, 并禁止它們運行
        systemstack(stopTheWorldWithSema)
        
        // !!!!!!!!!!!!!!!!
        // 世界已停止(STW)...
        // !!!!!!!!!!!!!!!!
        
        // The gcphase is _GCmark, it will transition to _GCmarktermination
        // below. The important thing is that the wb remains active until
        // all marking is complete. This includes writes made by the GC.
        
        // 標記對根對象的掃描已完成, 會影響gcMarkRootPrepare中的處理
        // Record that one root marking pass has completed.
        work.markrootDone = true
        
        // 禁止輔助GC和后臺標記任務的運行
        // Disable assists and background workers. We must do
        // this before waking blocked assists.
        atomic.Store(&gcBlackenEnabled, 0)
        
        // 喚醒所有因為輔助GC而休眠的G
        // Wake all blocked assists. These will run when we
        // start the world again.
        gcWakeAllAssists()
        
        // Likewise, release the transition lock. Blocked
        // workers and assists will run when we start the
        // world again.
        semrelease(&work.markDoneSema)
        
        // 計算下一次觸發gc需要的heap大小
        // endCycle depends on all gcWork cache stats being
        // flushed. This is ensured by mark 2.
        nextTriggerRatio := gcController.endCycle()
        
        // 進入完成標記階段, 會重新啟動世界
        // Perform mark termination. This will restart the world.
        gcMarkTermination(nextTriggerRatio)
    }
}

gcMarkTermination函數會進入完成標記階段:

func gcMarkTermination(nextTriggerRatio float64) {
    // World is stopped.
    // Start marktermination which includes enabling the write barrier.
    // 禁止輔助GC和后臺標記任務的運行
    atomic.Store(&gcBlackenEnabled, 0)
    
    // 重新允許本地標記隊列(下次GC使用)
    gcBlackenPromptly = false
    
    // 設置當前GC階段到完成標記階段, 并啟用寫屏障
    setGCPhase(_GCmarktermination)

    // 記錄開始時間
    work.heap1 = memstats.heap_live
    startTime := nanotime()

    // 禁止G被搶占
    mp := acquirem()
    mp.preemptoff = "gcing"
    _g_ := getg()
    _g_.m.traceback = 2
    
    // 設置G的狀態為等待中這樣它的??梢员粧呙?
    gp := _g_.m.curg
    casgstatus(gp, _Grunning, _Gwaiting)
    gp.waitreason = "garbage collection"

    // 切換到g0運行
    // Run gc on the g0 stack. We do this so that the g stack
    // we're currently running on will no longer change. Cuts
    // the root set down a bit (g0 stacks are not scanned, and
    // we don't need to scan gc's internal state).  We also
    // need to switch to g0 so we can shrink the stack.
    systemstack(func() {
        // 開始STW中的標記
        gcMark(startTime)
        
        // 必須立刻返回, 因為外面的G的棧有可能被移動, 不能在這之后訪問外面的變量
        // Must return immediately.
        // The outer function's stack may have moved
        // during gcMark (it shrinks stacks, including the
        // outer function's stack), so we must not refer
        // to any of its variables. Return back to the
        // non-system stack to pick up the new addresses
        // before continuing.
    })

    // 重新切換到g0運行
    systemstack(func() {
        work.heap2 = work.bytesMarked
        
        // 如果啟用了checkmark則執行檢查, 檢查是否所有可到達的對象都有標記
        if debug.gccheckmark > 0 {
            // Run a full stop-the-world mark using checkmark bits,
            // to check that we didn't forget to mark anything during
            // the concurrent mark process.
            gcResetMarkState()
            initCheckmarks()
            gcMark(startTime)
            clearCheckmarks()
        }

        // 設置當前GC階段到關閉, 并禁用寫屏障
        // marking is complete so we can turn the write barrier off
        setGCPhase(_GCoff)
        
        // 喚醒后臺清掃任務, 將在STW結束后開始運行
        gcSweep(work.mode)

        // 除錯用
        if debug.gctrace > 1 {
            startTime = nanotime()
            // The g stacks have been scanned so
            // they have gcscanvalid==true and gcworkdone==true.
            // Reset these so that all stacks will be rescanned.
            gcResetMarkState()
            finishsweep_m()

            // Still in STW but gcphase is _GCoff, reset to _GCmarktermination
            // At this point all objects will be found during the gcMark which
            // does a complete STW mark and object scan.
            setGCPhase(_GCmarktermination)
            gcMark(startTime)
            setGCPhase(_GCoff) // marking is done, turn off wb.
            gcSweep(work.mode)
        }
    })

    // 設置G的狀態為運行中
    _g_.m.traceback = 0
    casgstatus(gp, _Gwaiting, _Grunning)

    // 跟蹤處理
    if trace.enabled {
        traceGCDone()
    }

    // all done
    mp.preemptoff = ""

    if gcphase != _GCoff {
        throw("gc done but gcphase != _GCoff")
    }

    // 更新下一次觸發gc需要的heap大小(gc_trigger)
    // Update GC trigger and pacing for the next cycle.
    gcSetTriggerRatio(nextTriggerRatio)

    // 更新用時記錄
    // Update timing memstats
    now := nanotime()
    sec, nsec, _ := time_now()
    unixNow := sec*1e9 + int64(nsec)
    work.pauseNS += now - work.pauseStart
    work.tEnd = now
    atomic.Store64(&memstats.last_gc_unix, uint64(unixNow)) // must be Unix time to make sense to user
    atomic.Store64(&memstats.last_gc_nanotime, uint64(now)) // monotonic time for us
    memstats.pause_ns[memstats.numgc%uint32(len(memstats.pause_ns))] = uint64(work.pauseNS)
    memstats.pause_end[memstats.numgc%uint32(len(memstats.pause_end))] = uint64(unixNow)
    memstats.pause_total_ns += uint64(work.pauseNS)

    // 更新所用cpu記錄
    // Update work.totaltime.
    sweepTermCpu := int64(work.stwprocs) * (work.tMark - work.tSweepTerm)
    // We report idle marking time below, but omit it from the
    // overall utilization here since it's "free".
    markCpu := gcController.assistTime + gcController.dedicatedMarkTime + gcController.fractionalMarkTime
    markTermCpu := int64(work.stwprocs) * (work.tEnd - work.tMarkTerm)
    cycleCpu := sweepTermCpu + markCpu + markTermCpu
    work.totaltime += cycleCpu

    // Compute overall GC CPU utilization.
    totalCpu := sched.totaltime + (now-sched.procresizetime)*int64(gomaxprocs)
    memstats.gc_cpu_fraction = float64(work.totaltime) / float64(totalCpu)

    // 重置清掃狀態
    // Reset sweep state.
    sweep.nbgsweep = 0
    sweep.npausesweep = 0

    // 統計強制開始GC的次數
    if work.userForced {
        memstats.numforcedgc++
    }

    // 統計執行GC的次數然后喚醒等待清掃的G
    // Bump GC cycle count and wake goroutines waiting on sweep.
    lock(&work.sweepWaiters.lock)
    memstats.numgc++
    injectglist(work.sweepWaiters.head.ptr())
    work.sweepWaiters.head = 0
    unlock(&work.sweepWaiters.lock)

    // 性能統計用
    // Finish the current heap profiling cycle and start a new
    // heap profiling cycle. We do this before starting the world
    // so events don't leak into the wrong cycle.
    mProf_NextCycle()

    // 重新啟動世界
    systemstack(startTheWorldWithSema)

    // !!!!!!!!!!!!!!!
    // 世界已重新啟動...
    // !!!!!!!!!!!!!!!

    // 性能統計用
    // Flush the heap profile so we can start a new cycle next GC.
    // This is relatively expensive, so we don't do it with the
    // world stopped.
    mProf_Flush()

    // 移動標記隊列使用的緩沖區到自由列表, 使得它們可以被回收
    // Prepare workbufs for freeing by the sweeper. We do this
    // asynchronously because it can take non-trivial time.
    prepareFreeWorkbufs()

    // 釋放未使用的棧
    // Free stack spans. This must be done between GC cycles.
    systemstack(freeStackSpans)

    // 除錯用
    // Print gctrace before dropping worldsema. As soon as we drop
    // worldsema another cycle could start and smash the stats
    // we're trying to print.
    if debug.gctrace > 0 {
        util := int(memstats.gc_cpu_fraction * 100)

        var sbuf [24]byte
        printlock()
        print("gc ", memstats.numgc,
            " @", string(itoaDiv(sbuf[:], uint64(work.tSweepTerm-runtimeInitTime)/1e6, 3)), "s ",
            util, "%: ")
        prev := work.tSweepTerm
        for i, ns := range []int64{work.tMark, work.tMarkTerm, work.tEnd} {
            if i != 0 {
                print("+")
            }
            print(string(fmtNSAsMS(sbuf[:], uint64(ns-prev))))
            prev = ns
        }
        print(" ms clock, ")
        for i, ns := range []int64{sweepTermCpu, gcController.assistTime, gcController.dedicatedMarkTime + gcController.fractionalMarkTime, gcController.idleMarkTime, markTermCpu} {
            if i == 2 || i == 3 {
                // Separate mark time components with /.
                print("/")
            } else if i != 0 {
                print("+")
            }
            print(string(fmtNSAsMS(sbuf[:], uint64(ns))))
        }
        print(" ms cpu, ",
            work.heap0>>20, "->", work.heap1>>20, "->", work.heap2>>20, " MB, ",
            work.heapGoal>>20, " MB goal, ",
            work.maxprocs, " P")
        if work.userForced {
            print(" (forced)")
        }
        print("\n")
        printunlock()
    }

    semrelease(&worldsema)
    // Careful: another GC cycle may start now.

    // 重新允許當前的G被搶占
    releasem(mp)
    mp = nil

    // 如果是并行GC, 讓當前M繼續運行(會回到gcBgMarkWorker然后休眠)
    // 如果不是并行GC, 則讓當前M開始調度
    // now that gc is done, kick off finalizer thread if needed
    if !concurrentSweep {
        // give the queued finalizers, if any, a chance to run
        Gosched()
    }
}

gcSweep函數會喚醒后臺清掃任務:
后臺清掃任務會在程序啟動時調用的gcenable函數中啟動.

func gcSweep(mode gcMode) {
    if gcphase != _GCoff {
        throw("gcSweep being done but phase is not GCoff")
    }

    // 增加sweepgen, 這樣sweepSpans中兩個隊列角色會交換, 所有span都會變為"待清掃"的span
    lock(&mheap_.lock)
    mheap_.sweepgen += 2
    mheap_.sweepdone = 0
    if mheap_.sweepSpans[mheap_.sweepgen/2%2].index != 0 {
        // We should have drained this list during the last
        // sweep phase. We certainly need to start this phase
        // with an empty swept list.
        throw("non-empty swept list")
    }
    mheap_.pagesSwept = 0
    unlock(&mheap_.lock)

    // 如果非并行GC則在這里完成所有工作(STW中)
    if !_ConcurrentSweep || mode == gcForceBlockMode {
        // Special case synchronous sweep.
        // Record that no proportional sweeping has to happen.
        lock(&mheap_.lock)
        mheap_.sweepPagesPerByte = 0
        unlock(&mheap_.lock)
        // Sweep all spans eagerly.
        for sweepone() != ^uintptr(0) {
            sweep.npausesweep++
        }
        // Free workbufs eagerly.
        prepareFreeWorkbufs()
        for freeSomeWbufs(false) {
        }
        // All "free" events for this mark/sweep cycle have
        // now happened, so we can make this profile cycle
        // available immediately.
        mProf_NextCycle()
        mProf_Flush()
        return
    }

    // 喚醒后臺清掃任務
    // Background sweep.
    lock(&sweep.lock)
    if sweep.parked {
        sweep.parked = false
        ready(sweep.g, 0, true)
    }
    unlock(&sweep.lock)
}

后臺清掃任務的函數是bgsweep:

func bgsweep(c chan int) {
    sweep.g = getg()

    // 等待喚醒
    lock(&sweep.lock)
    sweep.parked = true
    c <- 1
    goparkunlock(&sweep.lock, "GC sweep wait", traceEvGoBlock, 1)

    // 循環清掃
    for {
        // 清掃一個span, 然后進入調度(一次只做少量工作)
        for gosweepone() != ^uintptr(0) {
            sweep.nbgsweep++
            Gosched()
        }
        // 釋放一些未使用的標記隊列緩沖區到heap
        for freeSomeWbufs(true) {
            Gosched()
        }
        // 如果清掃未完成則繼續循環
        lock(&sweep.lock)
        if !gosweepdone() {
            // This can happen if a GC runs between
            // gosweepone returning ^0 above
            // and the lock being acquired.
            unlock(&sweep.lock)
            continue
        }
        // 否則讓后臺清掃任務進入休眠, 當前M繼續調度
        sweep.parked = true
        goparkunlock(&sweep.lock, "GC sweep wait", traceEvGoBlock, 1)
    }
}

gosweepone函數會從sweepSpans中取出單個span清掃:

//go:nowritebarrier
func gosweepone() uintptr {
    var ret uintptr
    // 切換到g0運行
    systemstack(func() {
        ret = sweepone()
    })
    return ret
}

sweepone函數如下:

// sweeps one span
// returns number of pages returned to heap, or ^uintptr(0) if there is nothing to sweep
//go:nowritebarrier
func sweepone() uintptr {
    _g_ := getg()
    sweepRatio := mheap_.sweepPagesPerByte // For debugging

    // 禁止G被搶占
    // increment locks to ensure that the goroutine is not preempted
    // in the middle of sweep thus leaving the span in an inconsistent state for next GC
    _g_.m.locks++
    
    // 檢查是否已完成清掃
    if atomic.Load(&mheap_.sweepdone) != 0 {
        _g_.m.locks--
        return ^uintptr(0)
    }
    
    // 更新同時執行sweep的任務數量
    atomic.Xadd(&mheap_.sweepers, +1)

    npages := ^uintptr(0)
    sg := mheap_.sweepgen
    for {
        // 從sweepSpans中取出一個span
        s := mheap_.sweepSpans[1-sg/2%2].pop()
        // 全部清掃完畢時跳出循環
        if s == nil {
            atomic.Store(&mheap_.sweepdone, 1)
            break
        }
        // 其他M已經在清掃這個span時跳過
        if s.state != mSpanInUse {
            // This can happen if direct sweeping already
            // swept this span, but in that case the sweep
            // generation should always be up-to-date.
            if s.sweepgen != sg {
                print("runtime: bad span s.state=", s.state, " s.sweepgen=", s.sweepgen, " sweepgen=", sg, "\n")
                throw("non in-use span in unswept list")
            }
            continue
        }
        // 原子增加span的sweepgen, 失敗表示其他M已經開始清掃這個span, 跳過
        if s.sweepgen != sg-2 || !atomic.Cas(&s.sweepgen, sg-2, sg-1) {
            continue
        }
        // 清掃這個span, 然后跳出循環
        npages = s.npages
        if !s.sweep(false) {
            // Span is still in-use, so this returned no
            // pages to the heap and the span needs to
            // move to the swept in-use list.
            npages = 0
        }
        break
    }

    // 更新同時執行sweep的任務數量
    // Decrement the number of active sweepers and if this is the
    // last one print trace information.
    if atomic.Xadd(&mheap_.sweepers, -1) == 0 && atomic.Load(&mheap_.sweepdone) != 0 {
        if debug.gcpacertrace > 0 {
            print("pacer: sweep done at heap size ", memstats.heap_live>>20, "MB; allocated ", (memstats.heap_live-mheap_.sweepHeapLiveBasis)>>20, "MB during sweep; swept ", mheap_.pagesSwept, " pages at ", sweepRatio, " pages/byte\n")
        }
    }
    // 允許G被搶占
    _g_.m.locks--
    // 返回清掃的頁數
    return npages
}

span的sweep函數用于清掃單個span:

// Sweep frees or collects finalizers for blocks not marked in the mark phase.
// It clears the mark bits in preparation for the next GC round.
// Returns true if the span was returned to heap.
// If preserve=true, don't return it to heap nor relink in MCentral lists;
// caller takes care of it.
//TODO go:nowritebarrier
func (s *mspan) sweep(preserve bool) bool {
    // It's critical that we enter this function with preemption disabled,
    // GC must not start while we are in the middle of this function.
    _g_ := getg()
    if _g_.m.locks == 0 && _g_.m.mallocing == 0 && _g_ != _g_.m.g0 {
        throw("MSpan_Sweep: m is not locked")
    }
    sweepgen := mheap_.sweepgen
    if s.state != mSpanInUse || s.sweepgen != sweepgen-1 {
        print("MSpan_Sweep: state=", s.state, " sweepgen=", s.sweepgen, " mheap.sweepgen=", sweepgen, "\n")
        throw("MSpan_Sweep: bad span state")
    }

    if trace.enabled {
        traceGCSweepSpan(s.npages * _PageSize)
    }

    // 統計已清理的頁數
    atomic.Xadd64(&mheap_.pagesSwept, int64(s.npages))

    spc := s.spanclass
    size := s.elemsize
    res := false

    c := _g_.m.mcache
    freeToHeap := false

    // The allocBits indicate which unmarked objects don't need to be
    // processed since they were free at the end of the last GC cycle
    // and were not allocated since then.
    // If the allocBits index is >= s.freeindex and the bit
    // is not marked then the object remains unallocated
    // since the last GC.
    // This situation is analogous to being on a freelist.

    // 判斷在special中的析構器, 如果對應的對象已經不再存活則標記對象存活防止回收, 然后把析構器移到運行隊列
    // Unlink & free special records for any objects we're about to free.
    // Two complications here:
    // 1. An object can have both finalizer and profile special records.
    //    In such case we need to queue finalizer for execution,
    //    mark the object as live and preserve the profile special.
    // 2. A tiny object can have several finalizers setup for different offsets.
    //    If such object is not marked, we need to queue all finalizers at once.
    // Both 1 and 2 are possible at the same time.
    specialp := &s.specials
    special := *specialp
    for special != nil {
        // A finalizer can be set for an inner byte of an object, find object beginning.
        objIndex := uintptr(special.offset) / size
        p := s.base() + objIndex*size
        mbits := s.markBitsForIndex(objIndex)
        if !mbits.isMarked() {
            // This object is not marked and has at least one special record.
            // Pass 1: see if it has at least one finalizer.
            hasFin := false
            endOffset := p - s.base() + size
            for tmp := special; tmp != nil && uintptr(tmp.offset) < endOffset; tmp = tmp.next {
                if tmp.kind == _KindSpecialFinalizer {
                    // Stop freeing of object if it has a finalizer.
                    mbits.setMarkedNonAtomic()
                    hasFin = true
                    break
                }
            }
            // Pass 2: queue all finalizers _or_ handle profile record.
            for special != nil && uintptr(special.offset) < endOffset {
                // Find the exact byte for which the special was setup
                // (as opposed to object beginning).
                p := s.base() + uintptr(special.offset)
                if special.kind == _KindSpecialFinalizer || !hasFin {
                    // Splice out special record.
                    y := special
                    special = special.next
                    *specialp = special
                    freespecial(y, unsafe.Pointer(p), size)
                } else {
                    // This is profile record, but the object has finalizers (so kept alive).
                    // Keep special record.
                    specialp = &special.next
                    special = *specialp
                }
            }
        } else {
            // object is still live: keep special record
            specialp = &special.next
            special = *specialp
        }
    }

    // 除錯用
    if debug.allocfreetrace != 0 || raceenabled || msanenabled {
        // Find all newly freed objects. This doesn't have to
        // efficient; allocfreetrace has massive overhead.
        mbits := s.markBitsForBase()
        abits := s.allocBitsForIndex(0)
        for i := uintptr(0); i < s.nelems; i++ {
            if !mbits.isMarked() && (abits.index < s.freeindex || abits.isMarked()) {
                x := s.base() + i*s.elemsize
                if debug.allocfreetrace != 0 {
                    tracefree(unsafe.Pointer(x), size)
                }
                if raceenabled {
                    racefree(unsafe.Pointer(x), size)
                }
                if msanenabled {
                    msanfree(unsafe.Pointer(x), size)
                }
            }
            mbits.advance()
            abits.advance()
        }
    }

    // 計算釋放的對象數量
    // Count the number of free objects in this span.
    nalloc := uint16(s.countAlloc())
    if spc.sizeclass() == 0 && nalloc == 0 {
        // 如果span的類型是0(大對象)并且其中的對象已經不存活則釋放到heap
        s.needzero = 1
        freeToHeap = true
    }
    nfreed := s.allocCount - nalloc
    if nalloc > s.allocCount {
        print("runtime: nelems=", s.nelems, " nalloc=", nalloc, " previous allocCount=", s.allocCount, " nfreed=", nfreed, "\n")
        throw("sweep increased allocation count")
    }

    // 設置新的allocCount
    s.allocCount = nalloc

    // 判斷span是否無未分配的對象
    wasempty := s.nextFreeIndex() == s.nelems

    // 重置freeindex, 下次分配從0開始搜索
    s.freeindex = 0 // reset allocation index to start of span.
    if trace.enabled {
        getg().m.p.ptr().traceReclaimed += uintptr(nfreed) * s.elemsize
    }

    // gcmarkBits變為新的allocBits
    // 然后重新分配一塊全部為0的gcmarkBits
    // 下次分配對象時可以根據allocBits得知哪些元素是未分配的
    // gcmarkBits becomes the allocBits.
    // get a fresh cleared gcmarkBits in preparation for next GC
    s.allocBits = s.gcmarkBits
    s.gcmarkBits = newMarkBits(s.nelems)

    // 更新freeindex開始的allocCache
    // Initialize alloc bits cache.
    s.refillAllocCache(0)

    // 如果span中已經無存活的對象則更新sweepgen到最新
    // 下面會把span加到mcentral或者mheap
    // We need to set s.sweepgen = h.sweepgen only when all blocks are swept,
    // because of the potential for a concurrent free/SetFinalizer.
    // But we need to set it before we make the span available for allocation
    // (return it to heap or mcentral), because allocation code assumes that a
    // span is already swept if available for allocation.
    if freeToHeap || nfreed == 0 {
        // The span must be in our exclusive ownership until we update sweepgen,
        // check for potential races.
        if s.state != mSpanInUse || s.sweepgen != sweepgen-1 {
            print("MSpan_Sweep: state=", s.state, " sweepgen=", s.sweepgen, " mheap.sweepgen=", sweepgen, "\n")
            throw("MSpan_Sweep: bad span state after sweep")
        }
        // Serialization point.
        // At this point the mark bits are cleared and allocation ready
        // to go so release the span.
        atomic.Store(&s.sweepgen, sweepgen)
    }

    if nfreed > 0 && spc.sizeclass() != 0 {
        // 把span加到mcentral, res等于是否添加成功
        c.local_nsmallfree[spc.sizeclass()] += uintptr(nfreed)
        res = mheap_.central[spc].mcentral.freeSpan(s, preserve, wasempty)
        // freeSpan會更新sweepgen
        // MCentral_FreeSpan updates sweepgen
    } else if freeToHeap {
        // 把span釋放到mheap
        // Free large span to heap

        // NOTE(rsc,dvyukov): The original implementation of efence
        // in CL 22060046 used SysFree instead of SysFault, so that
        // the operating system would eventually give the memory
        // back to us again, so that an efence program could run
        // longer without running out of memory. Unfortunately,
        // calling SysFree here without any kind of adjustment of the
        // heap data structures means that when the memory does
        // come back to us, we have the wrong metadata for it, either in
        // the MSpan structures or in the garbage collection bitmap.
        // Using SysFault here means that the program will run out of
        // memory fairly quickly in efence mode, but at least it won't
        // have mysterious crashes due to confused memory reuse.
        // It should be possible to switch back to SysFree if we also
        // implement and then call some kind of MHeap_DeleteSpan.
        if debug.efence > 0 {
            s.limit = 0 // prevent mlookup from finding this span
            sysFault(unsafe.Pointer(s.base()), size)
        } else {
            mheap_.freeSpan(s, 1)
        }
        c.local_nlargefree++
        c.local_largefree += size
        res = true
    }
    
    // 如果span未加到mcentral或者未釋放到mheap, 則表示span仍在使用
    if !res {
        // 把仍在使用的span加到sweepSpans的"已清掃"隊列中
        // The span has been swept and is still in-use, so put
        // it on the swept in-use list.
        mheap_.sweepSpans[sweepgen/2%2].push(s)
    }
    return res
}

從bgsweep和前面的分配器可以看出掃描階段的工作是十分懶惰(lazy)的,
實際可能會出現前一階段的掃描還未完成, 就需要開始新一輪的GC的情況,
所以每一輪GC開始之前都需要完成前一輪GC的掃描工作(Sweep Termination階段).

GC的整個流程都分析完畢了, 最后貼上寫屏障函數writebarrierptr的實現:

// NOTE: Really dst *unsafe.Pointer, src unsafe.Pointer,
// but if we do that, Go inserts a write barrier on *dst = src.
//go:nosplit
func writebarrierptr(dst *uintptr, src uintptr) {
    if writeBarrier.cgo {
        cgoCheckWriteBarrier(dst, src)
    }
    if !writeBarrier.needed {
        *dst = src
        return
    }
    if src != 0 && src < minPhysPageSize {
        systemstack(func() {
            print("runtime: writebarrierptr *", dst, " = ", hex(src), "\n")
            throw("bad pointer in write barrier")
        })
    }
    // 標記指針
    writebarrierptr_prewrite1(dst, src)
    // 設置指針到目標
    *dst = src
}

writebarrierptr_prewrite1函數如下:

// writebarrierptr_prewrite1 invokes a write barrier for *dst = src
// prior to the write happening.
//
// Write barrier calls must not happen during critical GC and scheduler
// related operations. In particular there are times when the GC assumes
// that the world is stopped but scheduler related code is still being
// executed, dealing with syscalls, dealing with putting gs on runnable
// queues and so forth. This code cannot execute write barriers because
// the GC might drop them on the floor. Stopping the world involves removing
// the p associated with an m. We use the fact that m.p == nil to indicate
// that we are in one these critical section and throw if the write is of
// a pointer to a heap object.
//go:nosplit
func writebarrierptr_prewrite1(dst *uintptr, src uintptr) {
    mp := acquirem()
    if mp.inwb || mp.dying > 0 {
        releasem(mp)
        return
    }
    systemstack(func() {
        if mp.p == 0 && memstats.enablegc && !mp.inwb && inheap(src) {
            throw("writebarrierptr_prewrite1 called with mp.p == nil")
        }
        mp.inwb = true
        gcmarkwb_m(dst, src)
    })
    mp.inwb = false
    releasem(mp)
}

gcmarkwb_m函數如下:

func gcmarkwb_m(slot *uintptr, ptr uintptr) {    if writeBarrier.needed {        // Note: This turns bad pointer writes into bad
        // pointer reads, which could be confusing. We avoid
        // reading from obviously bad pointers, which should
        // take care of the vast majority of these. We could
        // patch this up in the signal handler, or use XCHG to
        // combine the read and the write. Checking inheap is
        // insufficient since we need to track changes to
        // roots outside the heap.
        //
        // Note: profbuf.go omits a barrier during signal handler
        // profile logging; that's safe only because this deletion barrier exists.
        // If we remove the deletion barrier, we'll have to work out
        // a new way to handle the profile logging.
        if slot1 := uintptr(unsafe.Pointer(slot)); slot1 >= minPhysPageSize {            if optr := *slot; optr != 0 {                // 標記舊指針
                shade(optr)
            }
        }        // TODO: Make this conditional on the caller's stack color.
        if ptr != 0 && inheap(ptr) {            // 標記新指針
            shade(ptr)
        }
    }
}

shade函數如下:

// Shade the object if it isn't already.
// The object is not nil and known to be in the heap.
// Preemption must be disabled.
//go:nowritebarrier
func shade(b uintptr) {
    if obj, hbits, span, objIndex := heapBitsForObject(b, 0, 0); obj != 0 {
        gcw := &getg().m.p.ptr().gcw
        // 標記一個對象存活, 并把它加到標記隊列(該對象變為灰色)
        greyobject(obj, 0, 0, hbits, span, gcw, objIndex)
        // 如果標記了禁止本地標記隊列則flush到全局標記隊列
        if gcphase == _GCmarktermination || gcBlackenPromptly {
            // Ps aren't allowed to cache work during mark
            // termination.
            gcw.dispose()
        }
    }
}

看完上述內容,你們對golang中三色GC的實現原理有進一步的了解嗎?如果還想了解更多相關內容,歡迎關注億速云行業資訊頻道,感謝各位的閱讀。

向AI問一下細節

免責聲明:本站發布的內容(圖片、視頻和文字)以原創、轉載和分享為主,文章觀點不代表本網站立場,如果涉及侵權請聯系站長郵箱:is@yisu.com進行舉報,并提供相關證據,一經查實,將立刻刪除涉嫌侵權內容。

AI

亚洲午夜精品一区二区_中文无码日韩欧免_久久香蕉精品视频_欧美主播一区二区三区美女