
Paper summary: "Segment Anything"

A personal summary of the paper.

Hidenori Itaya

July 17, 2023

Transcript

  1. Hidenori Itaya (Chubu University)
    Paper summary
    Segment Anything
    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo,
    Piotr Dollár, Ross Girshick [Meta AI Research]


  2. Segment Anything Model (SAM) [Kirillov+, arXiv2023]
    • A zero-shot segmentation technique developed by Meta
    - Can segment (partition into regions) arbitrary objects without training for the specific task (zero-shot)
    - Achieves accuracy comparable to or better than task-specific supervised learning
    Figure 3: Each column shows 3 valid masks generated by SAM from a single ambiguous point prompt (green circle).
    Masks generated by SAM from a single point prompt
                         COCO [66]                       LVIS v1 [44]
    method               AP    AP_S  AP_M  AP_L          AP    AP_S  AP_M  AP_L
    ViTDet-H [62]        51.0  32.0  54.3  68.9          46.6  35.0  58.0  66.3
    zero-shot transfer methods (segmentation module only):
    SAM                  46.5  30.8  51.0  61.7          44.7  32.5  57.6  65.5
    Table 5: Instance segmentation results. SAM is prompted with ViTDet boxes to do zero-shot segmentation. The fully-supervised
    ViTDet outperforms SAM, but the gap shrinks on the higher-quality LVIS masks. Interestingly, SAM outperforms ViTDet
    according to human ratings (see Fig. 11).


    Figure 11: Mask quality rating distribution from our human study for ViTDet and SAM, both applied to LVIS ground truth
    boxes. We also report LVIS and COCO ground truth quality. The legend shows rating means and 95% confidence intervals.
    Zero-shot text-to-mask
    Figure 12: Zero-shot text-to-mask. SAM can work with simple and nuanced text prompts. When SAM fails to make
    a correct prediction, an additional point prompt can help.
    Results: SAM can segment objects based on simple text prompts like "a wheel" as well as phrases like "beaver tooth grille".
    Zero-shot edge prediction
    Figure 10: Zero-shot edge prediction on BSDS500. SAM was not trained to predict edge maps nor did it have access
    to BSDS images or annotations during training.
    method       year   ODS    OIS    AP     R50
    HED [108]    2015   .788   .808   .840   .923
    EDETR [79]   2022   .840   .858   .896   .930
    mask AR@1000
    method               all    small  med.   large  freq.  com.   rare
    ViTDet-H [62]        63.0   51.7   80.8   87.0   63.1   63.3   58
    zero-shot transfer methods:
    SAM – single out.    54.9   42.8   76.7   74.4   54.7   59.8   62
    SAM                  59.3   45.5   81.6   86.9   59.1   63.9   65
    Table 4: Object proposal generation on LVIS v1. SAM was applied zero-shot, i.e. it was not trained for object proposal
    generation nor did it access LVIS images or annotations. To generate object proposals, a slightly modified version of the
    automatic mask generation pipeline is run and the masks are output as proposals (see §D.3 for details). The standard
    average recall (AR) metric is computed on LVIS v1 [44].
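    Table 5 above uses detector boxes as prompts. As a minimal, hedged sketch of box-prompted zero-shot segmentation with
    the released segment-anything package (the checkpoint path, image file, and box coordinates are placeholder values):

    import cv2
    import numpy as np
    from segment_anything import sam_model_registry, SamPredictor

    # Load a SAM checkpoint (placeholder path) and wrap it in a predictor.
    sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
    predictor = SamPredictor(sam)

    # The predictor expects an HxWx3 RGB uint8 array.
    image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
    predictor.set_image(image)

    # Prompt with a detection box in XYXY pixel coordinates (placeholder values).
    box = np.array([100, 150, 400, 500])
    masks, scores, _ = predictor.predict(box=box, multimask_output=False)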


  3. The three components of SA: Segment Anything
    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead,
    Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, Ross Girshick (Meta AI Research, FAIR)
    (a) Task: promptable segmentation
    (b) Model: Segment Anything Model (SAM) with an image encoder, a prompt encoder, and a lightweight mask decoder
    (c) Data: data engine (top) & dataset (bottom), Segment Anything 1B (SA-1B): 1+ billion masks, 11 million images,
    privacy respecting, licensed images
    Figure 1: We aim to build a foundation model for segmentation by introducing three interconnected components: a promptable
    segmentation task, a segmentation model (SAM) that powers data annotation and enables zero-shot transfer to a range
    of tasks via prompt engineering, and a data engine for collecting SA-1B, our dataset of over 1 billion masks.
    Task: promptable segmentation
    Model: Segment Anything Model (SAM)
    Data: data engine & dataset


  4. The three components of SA - Task -
    Data
    Model
    Task
    (Figure 1 (a), highlighting the Task component: promptable segmentation)
    Promptable segmentation task
    Background
    Foundation models in NLP and CV can handle new datasets and tasks zero-shot or few-shot by using "prompting" techniques.
    Task: return a valid segmentation mask for any segmentation prompt.
    Because prompts can be ambiguous, the correct mask is not necessarily unique.
    * Prompt: information specifying what in the image should be segmented.
    Target of Tasks
    Promptable segmentation is used as the pre-training objective, and the model is then transferred zero-shot to downstream
    segmentation tasks.


  5. The three components of SA - Model -
    Data
    Model
    Task
    Segment Anything Model (SAM)
    Structure
    • Encoder
    - Prompt encoder
    - Image encoder
    • Decoder
    - Mask decoder
    Process Flow
    • The image and the prompt are each embedded by their encoders
    • A Transformer-based decoder generates the mask from these embeddings


  6. The three components of SA - Model -
    Data
    Model
    Task
    Segment Anything Model (SAM)
    Image encoder
    • Embeds the image features
    • Network architecture is a ViT
    • Uses a ViT pre-trained with Masked AutoEncoder (MAE) [He+, CVPR2022]
    • The most computationally expensive part; however, if the image embedding is kept at inference time, the prompt can be
    changed in real time
    Figure 4: Segment Anything Model (SAM) overview. A heavyweight image encoder outputs an image embedding that can
    then be efficiently queried by a variety of input prompts to produce object masks at amortized real-time speed. For ambiguous
    prompts corresponding to more than one object, SAM can output multiple valid masks and associated confidence scores.

  7. The three components of SA - Model -
    Data
    Model
    Task
    Segment Anything Model (SAM)
    Prompt encoder (point, box)
    • Embeds the prompt
    • Each point or box corner is represented with a positional encoding and summed with a learnable embedding for its
    prompt type
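    A simplified sketch of this idea (illustrative only, not the authors' implementation): coordinates are mapped to random
    Fourier positional features, and a learned embedding for the prompt type (e.g. foreground point, background point, box
    corner) is added.

    import math
    import torch
    import torch.nn as nn

    class PointPromptEncoder(nn.Module):
        """Sketch: Fourier positional encoding plus learned per-type embeddings."""
        def __init__(self, dim: int = 256, n_types: int = 4):
            super().__init__()
            # Random Gaussian projection for Fourier features of normalized (x, y).
            self.register_buffer("freqs", torch.randn(2, dim // 2))
            self.type_embed = nn.Embedding(n_types, dim)  # fg point, bg point, box corners, ...

        def forward(self, xy: torch.Tensor, types: torch.Tensor) -> torch.Tensor:
            # xy: (N, 2) coordinates normalized to [0, 1]; types: (N,) integer labels.
            proj = 2 * math.pi * xy @ self.freqs               # (N, dim/2)
            pos = torch.cat([proj.sin(), proj.cos()], dim=-1)  # (N, dim)
            return pos + self.type_embed(types)

    enc = PointPromptEncoder()
    tokens = enc(torch.rand(3, 2), torch.tensor([0, 0, 1]))  # 3 prompt tokens of dimension 256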


  8. The three components of SA - Model -
    Data
    Model
    Task
    Segment Anything Model (SAM)
    Prompt encoder (text)
    • Embeds the prompt
    • Uses the text encoder of CLIP [Radford+, ICML2021]
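    The public SAM release does not ship the text-prompt path, but as a hedged illustration of how a CLIP text embedding
    can be obtained (using the openai clip package; the model name and prompt string are placeholders):

    import torch
    import clip  # https://github.com/openai/CLIP

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, _ = clip.load("ViT-L/14", device=device)  # CLIP model whose text encoder serves as the prompt encoder

    tokens = clip.tokenize(["a wheel"]).to(device)
    with torch.no_grad():
        text_embedding = model.encode_text(tokens)   # (1, embed_dim) prompt embedding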


  9. The three components of SA - Model -
    Data
    Model
    Task
    Segment Anything Model (SAM)
    Prompt encoder (mask)
    • Embeds the prompt
    • A feature map produced by convolution layers is added element-wise to the image embedding
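    A minimal sketch of this dense path (channel sizes and strides are illustrative, not necessarily the released configuration):
    the input mask is downscaled with strided convolutions and added element-wise to the image embedding.

    import torch
    import torch.nn as nn

    class MaskPromptEncoder(nn.Module):
        """Sketch: downscale a low-resolution mask with convolutions, add it to the image embedding."""
        def __init__(self, embed_dim: int = 256):
            super().__init__()
            self.downscale = nn.Sequential(
                nn.Conv2d(1, 4, kernel_size=2, stride=2), nn.GELU(),
                nn.Conv2d(4, 16, kernel_size=2, stride=2), nn.GELU(),
                nn.Conv2d(16, embed_dim, kernel_size=1),
            )

        def forward(self, mask: torch.Tensor, image_embedding: torch.Tensor) -> torch.Tensor:
            # mask: (B, 1, 4H, 4W); image_embedding: (B, embed_dim, H, W)
            return image_embedding + self.downscale(mask)

    enc = MaskPromptEncoder()
    fused = enc(torch.rand(1, 1, 256, 256), torch.rand(1, 256, 64, 64))  # -> (1, 256, 64, 64)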


  10. The three components of SA - Model -
    Data
    Model
    Task
    Segment Anything Model (SAM)
    Mask decoder
    • Outputs mask candidates
    • Outputs 3 candidate masks to account for prompt ambiguity
    • Network architecture is a Transformer decoder
    • Uses self-attention over the prompt tokens and cross-attention in both directions
    • Decoder layer (4 steps; a simplified sketch follows below)
    1. Self-attention over the tokens
    2. Cross-attention from the tokens to the image embedding
    3. A point-wise MLP that updates each token
    4. Cross-attention from the image embedding to the tokens
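    A simplified sketch of one such layer (dimensions and module choices are illustrative, not the released implementation):

    import torch
    import torch.nn as nn

    class DecoderLayer(nn.Module):
        """Sketch of one SAM-style decoder layer: token self-attention plus two-way cross-attention."""
        def __init__(self, dim: int = 256, heads: int = 8):
            super().__init__()
            self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.token_to_image = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.mlp = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))
            self.image_to_token = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, tokens, image):  # tokens: (B, N, C); image: (B, H*W, C)
            tokens = tokens + self.self_attn(tokens, tokens, tokens)[0]     # 1. self-attention on tokens
            tokens = tokens + self.token_to_image(tokens, image, image)[0]  # 2. tokens -> image cross-attention
            tokens = tokens + self.mlp(tokens)                              # 3. point-wise MLP update
            image = image + self.image_to_token(image, tokens, tokens)[0]   # 4. image -> tokens cross-attention
            return tokens, image

    layer = DecoderLayer()
    toks, img = layer(torch.rand(1, 8, 256), torch.rand(1, 64 * 64, 256))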
    Figure 14: Details of the lightweight mask decoder. A two-layer decoder updates both the image embedding and prompt
    tokens via cross-attention. Then the image embedding is upscaled, from which the updated output tokens are used to
    dynamically predict masks. (Not illustrated for figure clarity: At every attention layer, positional encodings are added to
    the image embedding, and the entire original prompt token (including position encoding) is re-added to the token queries
    and keys.)

  11. The three components of SA - Model -
    Data
    Model
    Task
    Segment Anything Model (SAM)
    Training
    • The mask loss is a combination of focal loss [Lin+, ICCV2017] and Dice loss [Milletari+, 3DV2016]
    • Prompts are sampled randomly during training
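    A hedged sketch of such a combined mask loss over predicted mask logits (the 20:1 weighting below is illustrative):

    import torch
    import torch.nn.functional as F

    def focal_loss(logits, target, alpha=0.25, gamma=2.0):
        """Binary focal loss on mask logits; target has the same shape, with values in {0, 1}."""
        prob = torch.sigmoid(logits)
        ce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
        p_t = prob * target + (1 - prob) * (1 - target)
        alpha_t = alpha * target + (1 - alpha) * (1 - target)
        return (alpha_t * (1 - p_t) ** gamma * ce).mean()

    def dice_loss(logits, target, eps=1.0):
        """Soft Dice loss computed per mask, then averaged over the batch."""
        prob = torch.sigmoid(logits).flatten(1)
        target = target.flatten(1)
        inter = (prob * target).sum(-1)
        return (1 - (2 * inter + eps) / (prob.sum(-1) + target.sum(-1) + eps)).mean()

    def mask_loss(logits, target, w_focal=20.0, w_dice=1.0):
        # Linear combination of the two terms (weights are illustrative).
        return w_focal * focal_loss(logits, target) + w_dice * dice_loss(logits, target)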


  12. The three components of SA - Data -
    Data
    Model
    Task
    (c) Data: data engine (top) & dataset (bottom)
    Segment Anything 1B (SA-1B): 1+ billion masks, 11 million images, privacy respecting, licensed images
    Data engine: how the dataset was built
    Model-in-the-loop: the dataset is created by using SAM itself for annotation
    1. Assisted-manual stage: manual annotation assisted by the model
    Masks predicted by SAM are corrected by hand (using a SAM trained on other public datasets).
    Once a certain amount of corrected data has been collected, SAM is retrained on it.
    Annotation rule: annotate what can be labeled within about 30 seconds per image.
    2. Semi-automatic stage: manual annotation combined with automatic annotation by the model
    Objects missed by the model's predictions are additionally annotated by hand.
    SAM is then retrained with the newly added data.
    * About 10.2 million masks had been collected at this point.
    3. Fully-automatic stage: fully automatic annotation by the model, without annotators
    The dataset is created directly from SAM's predictions.
    High-confidence predictions are selected and duplicates are removed with NMS.
    * By the end of stage 2, SAM is accurate enough that its predictions can be used as-is.
    (A usage sketch of the released automatic mask generator follows below.)
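    The fully automatic stage corresponds closely to what the released SamAutomaticMaskGenerator does: prompt SAM with a
    grid of points, keep confident and stable masks, and remove duplicates with NMS. A minimal, hedged usage sketch (the
    checkpoint path, image file, and thresholds are placeholder values):

    import cv2
    from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

    sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")  # placeholder path
    generator = SamAutomaticMaskGenerator(
        sam,
        points_per_side=32,           # regular grid of point prompts over the image
        pred_iou_thresh=0.88,         # keep only masks the model is confident about
        stability_score_thresh=0.95,  # keep only masks that are stable under thresholding
        box_nms_thresh=0.7,           # suppress duplicate masks with box NMS
    )

    image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
    masks = generator.generate(image)  # list of dicts with 'segmentation', 'predicted_iou', 'area', ...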


  13. The three components of SA - Data -
    Data
    Model
    Task
    Figure 2: Example images with overlaid masks from our newly introduced dataset, SA-1B. SA-1B contains 11M diverse,
    high-resolution, licensed, and privacy protecting images and 1.1B high-quality segmentation masks. These masks were
    annotated fully automatically by SAM, and as we verify by human ratings and numerous experiments, are of high quality
    and diversity. We group images by number of masks per image for visualization (there are ~100 masks per image on average).
    New dataset: Segment Anything 1B (SA-1B)
    • A dataset of 11 million images and 1.1 billion masks
    • Compared with existing datasets, the number of masks per image is far larger
    Figure 6: Dataset mask properties. The legend references the number of images and masks in each dataset. Note that SA-1B
    has 11× more images and 400× more masks than the largest existing segmentation dataset, Open Images [60].
    Figure 7: Estimated geographic distribution of SA-1B images. Most of the world's countries have more than 1000 images
    in SA-1B, and the three countries with the most images are from different parts of the world.
    Mask properties. In Fig. 5 we plot the spatial distribution of object centers in SA-1B compared to the largest existing
    segmentation datasets. Common photographer biases are present in all datasets. We observe that SA-1B has greater coverage
    of image corners compared to LVIS v1 [44] and ADE20K [117], the two most similarly distributed datasets, while COCO [66]
    and Open Images V5 [60] have a more prominent center bias.
    Geographic distribution (% images):
                              # countries   # imgs   # masks   SA-1B   COCO    O.I.
    Africa                    54            300k     28M       2.8%    3.0%    1.7%
    Asia & Oceania            70            3.9M     423M      36.2%   11.4%   14.3%
    Europe                    47            5.4M     540M      49.8%   34.2%   36.2%
    Latin America & Carib.    42            380k     36M       3.5%    3.1%    5.0%
    North America             4             830k     80M       7.7%    48.3%   42.8%
    At the start of this stage, SAM was trained using common public segmentation datasets. After sufficient data annotation,
    SAM was retrained using only newly annotated masks. As more masks were collected, the image encoder was scaled from
    ViT-B to ViT-H and other architectural details evolved; in total the model was retrained 6 times. Average annotation time
    per mask decreased from 34 to 14 seconds.
    Figure 5: Image-size normalized mask center distributions.
    • Mask locations are also less spatially biased
    (existing datasets are biased toward the image center)


  14. Evaluation experiments
    • Mask prediction from a point
    Figure 8: Samples from the 23 diverse segmentation datasets used to evaluate SAM's zero-shot transfer capabilities.
    (a) SAM vs. RITM [92] on 23 datasets: per-dataset IoU delta at 1 center point, ranging from -21.4 (GTEA) to +46.9 (PPDLS)
    (b) Mask quality ratings by human annotators
    (c) Center points (default)  (d) Random points
    Figure 9: Point to mask evaluation on 23 datasets. (a) Mean IoU of SAM and the strongest single point segmenter, RITM [92].
    Due to ambiguity, a single mask may not match ground truth; circles show "oracle" results of the most relevant of SAM's 3
    predictions. (b) Per-dataset comparison of mask quality ratings by annotators from 1 (worst) to 10 (best). All methods use
    the ground truth mask center as the prompt. (c, d) mIoU with varying number of points. SAM significantly outperforms prior
    interactive segmenters with 1 point and is on par with more points. Low absolute mIoU at 1 point is the result of ambiguity.
    Comparison of SAM and RITM [Sofiiuk+, ICIP2022] on a variety of datasets (mean IoU)
    The circles mark the most relevant of SAM's three predictions ("oracle"; a small sketch of this metric follows below).
    → SAM achieves a higher IoU on 16 of the 23 datasets
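    A minimal sketch of the oracle metric (masks are assumed to be boolean numpy arrays of the same shape):

    import numpy as np

    def iou(pred: np.ndarray, gt: np.ndarray) -> float:
        inter = np.logical_and(pred, gt).sum()
        union = np.logical_or(pred, gt).sum()
        return float(inter) / union if union > 0 else 0.0

    def oracle_iou(candidate_masks, gt_mask) -> float:
        """Best IoU among SAM's (typically 3) candidate masks for one point prompt."""
        return max(iou(m, gt_mask) for m in candidate_masks)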
    Mask quality ratings by human annotators
    → Annotators rate SAM's mask quality substantially higher than RITM's
    → With zero-shot transfer, SAM outperforms prior models on many benchmarks


  15. Various methods derived from SA
    CLIP_Surgery
    [Li+, Apr.2023]
    Segment Anything Is Not Always Perfect
    [Ji+, Apr.2023]
    PerSAM
    [Zhang+, May2023]
    Matcher
    [Liu+, May2023]
    Segment Anything in High Quality
    [Ke+, Jun2023]
    Analysis and extensions of SA
    Fast Segment Anything
    [Zhao+, Jun.2023]
    Detect Any Shadow
    [Wang+, May2023]
    MobileSAM
    [Zhang+, July2023]
    Image restoration
    Super-resolution
    Remote sensing
    Medical image segmentation
    SAM for Digital Pathology Surgery
    [Deng+, Apr.2023]
    Segment Anything in Medical Images
    [Ma+, Apr.2023]
    SAM Fails to Segment Anything?
    [Chen+, Apr.2023]
    SAM for Medical Image Analysis
    [Mazurowski+, May2023]
    Image matting
    Matte Anything
    [Yao+, Jun2023]
    Matting Anything
    [Li+, Jun2023]
    3D data
    Seal
    [Liu+, Jun2023]
    TomoSAM
    [Semerato+, Jun2023]
    Methods derived from SA
    Inpaint Anything
    [Yu+, Apr2023]
    Robotics
    Instruct2Act
    [Huang+, May2023]
    Segment Anything in Video Super-resolution
    [Lu+, May2023]
    SAM-IQA
    [Li+, Jul2023]
    Bioinformatics
    IAMSAM
    [Lee+, May2023]
    RSPrompter
    [Chen+, Jun2023]
    Camouflaged object detection
    SAMCOD
    [Tang+, Apr2023]
    ※ Reference : https://github.com/Hedlen/awesome-segment-anything


  16. References
    • Official site
    - https://segment-anything.com/
    • Paper
    - https://scontent.fhnd2-3.fna.fbcdn.net/v/t39.2365-6/10000000_900554171201033_1602411987825904100_n.pdf?_nc_cat=100&ccb=1-7&_nc_sid=3c67a6&_nc_ohc=iMsE1fjDr4EAX__pmB2&_nc_ht=scontent.fhnd2-3.fna&oh=00_AfDUvhlaLmAdep94YXoayUUE9T_A1lAcXDNM8Si7T5M-jA&oe=648220A7
    • Code
    - https://github.com/facebookresearch/segment-anything
    • Dataset
    - https://segment-anything.com/dataset/index.html
    • Demo
    - https://segment-anything.com/demo
