Aggregation Framework

In this post, we look at how to analyze data in MongoDB.

Pipeline, Stages and Tunables

The aggregation framework is MongoDB's tool for analyzing the documents in a collection. It is built on the concept of a pipeline. The pipeline takes a collection as its initial input, and each stage takes a stream of documents as input and produces a stream of documents as output.

Each stage of an aggregation pipeline is a data processing unit.

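Tunables are the knobs a stage exposes: the operators and parameters we supply to tell the stage what to do, such as modifying fields, sorting, or performing arithmetic.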

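The same type of stage can appear more than once in a single pipeline. A common case is filtering: an initial $match keeps the whole collection from flowing through the pipeline, and a later $match filters again with different criteria once other stages have reshaped the documents.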
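The pipeline idea can be sketched in plain JavaScript without a database. This is only an in-memory illustration, not how MongoDB is implemented: the helpers match, project, and runPipeline are hypothetical names, and the three sample documents are made up. Each stage consumes a stream of documents and yields a new stream, and the same type of stage (match) is used twice.

```javascript
// In-memory sketch of the pipeline idea; not MongoDB's implementation.
// Each stage is a generator function: it consumes a stream of documents
// and yields a (possibly smaller or reshaped) stream of documents.

// Hypothetical stage builders, named after the real stages for readability.
const match = (predicate) => function* (docs) {
  for (const doc of docs) {
    if (predicate(doc)) yield doc;
  }
};

const project = (shape) => function* (docs) {
  for (const doc of docs) yield shape(doc);
};

// Feed the output of each stage into the next one, then collect the result.
function runPipeline(collection, stages) {
  let stream = collection;
  for (const stage of stages) stream = stage(stream);
  return [...stream];
}

// Made-up sample data standing in for a collection.
const companies = [
  { name: "A", founded_year: 2004, category_code: "social" },
  { name: "B", founded_year: 2004, category_code: "web" },
  { name: "C", founded_year: 2005, category_code: "social" },
];

const result = runPipeline(companies, [
  match((d) => d.founded_year === 2004), // initial filter
  project((d) => ({ name: d.name, category: d.category_code })),
  match((d) => d.category === "social"), // same stage type, different criteria
]);

console.log(result);
```

The second match can only be written after the project stage, because it filters on the renamed category field. That is one reason a stage type may show up more than once.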

Aggregation Framework Usage Examples

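Now let's see how stages are used, working with a prepared collection. The collection holds information about companies; the document below is the entry for Facebook, including the funding rounds it raised and its IPO details.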

{
  "_id" : "52cdef7c4bab8bd675297d8e",
  "name" : "Facebook",
  "category_code" : "social",
  "founded_year" : 2004,
  "description" : "Social network",
  "funding_rounds" : [{
      "id" : 4,
      "round_code" : "b",
      "raised_amount" : 27500000,
      "raised_currency_code" : "USD",
      "funded_year" : 2006,
      "investments" : [
        {
          "company" : null,
          "financial_org" : {
            "name" : "Greylock Partners",
            "permalink" : "greylock"
          },
          "person" : null
        },
        {
          "company" : null,
          "financial_org" : {
            "name" : "Meritech Capital Partners",
            "permalink" : "meritech-capital-partners"
          },
          "person" : null
        },
        {
          "company" : null,
          "financial_org" : {
            "name" : "Founders Fund",
            "permalink" : "founders-fund"
          },
          "person" : null
        },
        {
          "company" : null,
          "financial_org" : {
            "name" : "SV Angel",
            "permalink" : "sv-angel"
          },
          "person" : null
        }
      ]
    },
    {
      "id" : 2197,
      "round_code" : "c",
      "raised_amount" : 15000000,
      "raised_currency_code" : "USD",
      "funded_year" : 2008,
      "investments" : [
        {
          "company" : null,
          "financial_org" : {
            "name" : "European Founders Fund",
            "permalink" : "european-founders-fund"
          },
          "person" : null
        }
      ]
    }],
  "ipo" : {
    "valuation_amount" : NumberLong("104000000000"),
    "valuation_currency_code" : "USD",
    "pub_year" : 2012,
    "pub_month" : 5,
    "pub_day" : 18,
    "stock_symbol" : "NASDAQ:FB"
  }

}

match

Let's start with a simple example that filters for companies founded in 2004.

db.companies.aggregate([
  {$match: {founded_year: 2004}}
])

This pipeline is equivalent to the following find query.

db.companies.find({founded_year: 2004})

project

Let's add a $project stage so that the output contains only the fields we are interested in.

db.companies.aggregate([
  {$match: {founded_year: 2004}},
  {$project: {
    _id: 0,
    name: 1,
    founded_year: 1
  }}
])

Running this query produces output like the following.

{"name": "Redfin", "founded_year": 2004 }
{"name": "Wink", "founded_year": 2004 }
{"name": "Techmeme", "founded_year": 2004 }
{"name": "Eventful", "founded_year": 2004 }
{"name": "Oodle", "founded_year": 2004 }

...

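To summarize, the query above uses two stages: a $match stage that keeps companies founded in 2004, and a $project stage that controls which fields appear in the output.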

limit

db.companies.aggregate([
  {$match: {founded_year: 2004}},
  {$limit: 5},
  {$project: {
    _id: 0,
    name: 1
  }}
])

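Swapping the order of $limit and $project gives the same result. With $project first, however, the projection may have to process every matched document instead of just five, so the two orderings can differ in performance.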

For efficiency, it is better to place the stages that filter out the most documents early in the pipeline.
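The cost difference between the two orderings can be illustrated with a small in-memory sketch. It deliberately evaluates each stage eagerly, which is a worst-case assumption: a real MongoDB server streams documents and its optimizer may reorder stages, so the actual numbers can differ. The helpers limit, projectName, and run are hypothetical, and the 1000 documents are made up.

```javascript
// Eager in-memory sketch; the stage functions are stand-ins, not MongoDB code.
let projected = 0; // how many documents the projection had to touch

const limit = (n) => (docs) => docs.slice(0, n);
const projectName = (docs) =>
  docs.map((doc) => {
    projected += 1;
    return { name: doc.name };
  });

// Apply the stages one after another, each on the full output of the previous.
const run = (docs, stages) => stages.reduce((acc, stage) => stage(acc), docs);

// 1000 made-up documents standing in for the companies matched by $match.
const matched = Array.from({ length: 1000 }, (_, i) => ({
  name: `company-${i}`,
  founded_year: 2004,
}));

projected = 0;
const limitFirst = run(matched, [limit(5), projectName]);
const countLimitFirst = projected; // projection runs on 5 documents

projected = 0;
const projectFirst = run(matched, [projectName, limit(5)]);
const countProjectFirst = projected; // projection runs on all 1000 documents

console.log(countLimitFirst, countProjectFirst);
```

Both orderings return the same five documents; only the amount of work done by the projection changes.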

skip, sort

db.companies.aggregate([
  {$match: {founded_year: 2004}},
  {$sort: {name: 1}},
  {$skip: 10},
  {$limit: 5},
  {$project: {
    _id: 0,
    name: 1
  }}
])

This query sorts by name in ascending order, skips the first 10 documents, and then outputs the next 5.
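The sort, skip, limit combination is essentially pagination. As a rough sketch, the $skip and $limit values can be derived from a page number. The helper pageStages is a hypothetical name introduced here, not part of MongoDB; it only builds plain stage objects, so it runs without a database.

```javascript
// Hypothetical helper: builds the $skip/$limit pair for a 1-based page number.
function pageStages(page, pageSize) {
  if (!Number.isInteger(page) || page < 1) {
    throw new RangeError("page must be an integer >= 1");
  }
  if (!Number.isInteger(pageSize) || pageSize < 1) {
    throw new RangeError("pageSize must be an integer >= 1");
  }
  return [{ $skip: (page - 1) * pageSize }, { $limit: pageSize }];
}

// Page 3 with 5 documents per page: skip 10, then take 5.
const pipeline = [
  { $match: { founded_year: 2004 } },
  { $sort: { name: 1 } }, // sort before paging so that pages are stable
  ...pageStages(3, 5),
  { $project: { _id: 0, name: 1 } },
];

// In the mongo shell this array would be passed to db.companies.aggregate(pipeline).
console.log(JSON.stringify(pipeline));
```

The $sort stage has to come before $skip and $limit; otherwise the page boundaries would be applied to documents in an unspecified order.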

A slightly more complex example

db.companies.aggregate([
  {$match: {"funding_rounds.investments.financial_org.permalink": "greylock"}},
  {$project: {
    _id: 0,
    name: 1,
    ipo: "$ipo.pub_year",
    valuation: "$ipo.valuation_amount",
    funders: "$funding_rounds.investments.financial_org.permalink"
  }}
]).pretty()

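A string that starts with $ is a field path: it refers to the value of that field in the input document. Here the output field ipo is set to the value of pub_year inside the ipo subdocument. Because funding_rounds and investments are both arrays, funders comes out as an array of arrays, with one inner array of permalinks per funding round.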
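How a field path walks through nested documents and arrays can be approximated in a few lines of JavaScript. This is a sketch under stated assumptions, not MongoDB's actual code: resolvePath is a hypothetical function, it assumes that arrays along the path are mapped element by element and that missing values are dropped, and the document is a trimmed copy of the Facebook example above.

```javascript
// Approximation of how a field path such as "$a.b.c" is resolved.
// Assumption: arrays along the path are mapped element by element and
// missing values are dropped. A sketch only, not MongoDB's implementation.
function resolvePath(doc, path) {
  const keys = path.replace(/^\$/, "").split(".");
  const walk = (value, i) => {
    if (i === keys.length) return value; // whole path consumed
    if (Array.isArray(value)) {
      // Apply the rest of the path to every element of the array.
      return value.map((item) => walk(item, i)).filter((v) => v !== undefined);
    }
    if (value === null || typeof value !== "object") return undefined;
    return walk(value[keys[i]], i + 1);
  };
  return walk(doc, 0);
}

// Trimmed copy of the Facebook document shown earlier.
const facebook = {
  name: "Facebook",
  funding_rounds: [
    { investments: [
      { financial_org: { permalink: "greylock" } },
      { financial_org: { permalink: "meritech-capital-partners" } },
      { financial_org: { permalink: "founders-fund" } },
      { financial_org: { permalink: "sv-angel" } },
    ] },
    { investments: [
      { financial_org: { permalink: "european-founders-fund" } },
    ] },
  ],
  ipo: { pub_year: 2012 },
};

const ipoYear = resolvePath(facebook, "$ipo.pub_year");
const funders = resolvePath(
  facebook,
  "$funding_rounds.investments.financial_org.permalink"
);

console.log(ipoYear, JSON.stringify(funders));
```

In this sketch ipoYear is a single number, while funders keeps the nesting of the source arrays: one inner array for each funding round.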

Aggregation framework - Expressions
