当前位置：网站首页>Elasticsearch 第六篇：聚合統計查詢

Elasticsearch 第六篇：聚合統計查詢

2020-11-06 20:10:00 【itread01】

前面一直沒有記錄 Elasticsearch 的聚合查詢或者其它複雜的查詢。本篇做一下筆記，為了方便測試，索引資料依然是第五篇生成的測試索引庫 db_student_test ，別名是 student_test

第一部分基本聚合

1、最大值 max、最小值 min、平均值 avg 、總和 sum

場景：查詢語文、數學、英語這三科的最大值、最小值、平均值

POST  http://localhost:9200/student_test1/_search?size=0
{
    "aggs" : {
        "max_chinese" : { "max" : { "field" : "chinese" } },
        "min_chinese" : { "min" : { "field" : "chinese" } },
        "avg_chinese" : { "avg" : { "field" : "chinese" } },
        "max_math": { "max" : { "field" : "math" } },
        "min_math": { "min" : { "field" : "math" } },
        "avg_math": { "avg" : { "field" : "math" } },
        "max_english": { "max" : { "field" : "english" } },
        "min_english": { "min" : { "field" : "english" } },
        "avg_english": { "avg" : { "field" : "english" } }
    }
}

查詢結果是：

{
    "took": 0,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 10000,
            "relation": "gte"
        },
        "max_score": null,
        "hits": []
    },
    "aggregations": {
        "avg_english": {
            "value": 57.78366490546798
        },
        "max_chinese": {
            "value": 98
        },
        "min_chinese": {
            "value": 25
        },
        "min_math": {
            "value": 15
        },
        "max_english": {
            "value": 98
        },
        "avg_chinese": {
            "value": 59.353859695794505
        },
        "avg_math": {
            "value": 56.92907568735187
        },
        "min_english": {
            "value": 21
        },
        "max_math": {
            "value": 99
        }
    }
}

也可以來查詢語文科目分數總和，相當於 sql 的 sum 邏輯，雖然在這裡並沒有什麼意義：

POST  http://localhost:9200/student_test1/_search?size=0
{
    "aggs" : {
        "sum_chinese" : { "sum" : { "field" : "chinese" } }
    }
}

2、求個數，相當於 sql 的 count 邏輯

場景：查詢所有學生總數，這裡隨便 count 一個欄位就可以，例如數學這個欄位

POST  http://localhost:9200/student_test1/_search?size=0
{
  "aggs": {
    "age_count": {
      "value_count": {
        "field": "math"
      }
    }
  }
}

返回結果是：

{
    "took": 0,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 10000,
            "relation": "gte"
        },
        "max_score": null,
        "hits": []
    },
    "aggregations": {
        "age_count": {
            "value": 50084828
        }
    }
}

課間總數是：50084828 跟第五篇我們生成的資料總量一致

3、distinct 聚合，相當於 sql 的 count ( distinct )

場景：統計語文成績有多少種值

POST  http://localhost:9200/student_test1/_search?size=0
{
    "aggs" : {
        "type_count" : {
            "cardinality" : {
                "field" : "chinese"
            }
        }
    }
}

返回結果是：

{
    "took": 0,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 10000,
            "relation": "gte"
        },
        "max_score": null,
        "hits": []
    },
    "aggregations": {
        "type_count": {
            "value": 74
        }
    }
}

從結果上看，只有74個不同的分數，與第五篇隨機生成資料的規則匹配

4、統計聚合

場景：查詢語文成績總個數、最大值、最小值、平均值、總和等

POST  http://localhost:9200/student_test1/_search?size=0
{
  "aggs": {
    "chinese_stats": {
      "stats": {
        "field": "chinese"
      }
    }
  }
}

返回結果是：

{
    "took": 0,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 10000,
            "relation": "gte"
        },
        "max_score": null,
        "hits": []
    },
    "aggregations": {
        "chinese_stats": {
            "count": 50084828,
            "min": 25,
            "max": 98,
            "avg": 59.353859695794505,
            "sum": 2972727854
        }
    }
}

5、加強版統計聚合，查詢結果在上面的基礎上，加上方差等統計學上的資料

POST  http://localhost:9200/student_test1/_search?size=0 
{
  "aggs": {
    "chinese_stats": {
      "extended_stats": {
        "field": "chinese"
      }
    }
  }
}

6、分位聚合統計

預設的分位是 1% 5% 25% 50% 75% 95% 99% 《= 的概念

分位數的概念：25% 的分位數是 54，意思是小於等於 54 的樣本佔據了總樣本的 25% ，即是 54 這個數將最底層的1/4 的資料分割出來。

POST  http://localhost:9200/student_test1/_search?size=0 
{
  "aggs": {
    "chinese_percents": {
      "percentiles": {
        "field": "chinese"
      }
    }
  }
}

也可以自定義分位：

POST  http://localhost:9200/student_test1/_search?size=0 
{
  "aggs": {
    "chinese_percents": {
      "percentiles": {
        "field": "chinese",
        "percents" : [10,20,30,40,50,60,70,80,90] 
      }
    }
  }
}

7、範圍聚合統計

場景：分別查詢語文成績小於40分、小於50分、小於60分的比例

POST  http://localhost:9200/student_test1/_search?size=0 
{
  "aggs": {
    "gge_perc_rank": {
      "percentile_ranks": {
        "field": "chinese",
        "values": [40,50,60]
      }
    }
  }
}

以上是查詢成績小於40，小於50，小於60的佔比，得到的資料是： 21.29% 36.09% 51.12% 可以看到這是一個接近等差的數列，可見測試資料的隨機性還是很好的。

第二部分其它聚合方式

1、Term 聚合

場景：想知道學生的語文成績，在所有分數值上的個數

POST  http://localhost:9200/student_test1/_search?size=0
{
    "aggs" : {
        "genres" : {
            "terms" : { 
                "field" : "chinese"
            }
        }
    }
}

這個查詢會將欄位Chinese進行聚合，例如87分聚合成一個組，88分聚合成一個組，等等；

但是這裡預設是按組的大小排序，而且不會將所有的組都顯示出來，數量太小的組可能被忽略，查詢結果如下：

{
    "took": 1,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 10000,
            "relation": "gte"
        },
        "max_score": null,
        "hits": []
    },
    "aggregations": {
        "genres": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 42560269,
            "buckets": [
                {
                    "key": 61,
                    "doc_count": 752863
                },
                {
                    "key": 68,
                    "doc_count": 752835
                },
                {
                    "key": 55,
                    "doc_count": 752749
                },
                {
                    "key": 59,
                    "doc_count": 752444
                },
                {
                    "key": 76,
                    "doc_count": 752405
                },
                {
                    "key": 74,
                    "doc_count": 752309
                },
                {
                    "key": 56,
                    "doc_count": 752283
                },
                {
                    "key": 49,
                    "doc_count": 752273
                },
                {
                    "key": 52,
                    "doc_count": 752201
                },
                {
                    "key": 50,
                    "doc_count": 752197
                }
            ]
        }
    }
}

如果想要自定義篩選條件，Term聚合還可以按照以下設定來查詢：

post  http://localhost:9200/student_test1/_search?size=0
{
    "aggs" : {
        "genres" : {
            "terms" : { 
                "field" : "chinese",
                 "size" : 100,                     // 可能有100個不用的分數，我們將全部都展示出來
                 "order" : { "_count" : "asc" },   // 按照組數由小到大排序
                  "min_doc_count": 752200          //過濾條件：組數最小值是752200
            }
        }
    }
}

查詢結果是：

{
    "took": 0,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 10000,
            "relation": "gte"
        },
        "max_score": null,
        "hits": []
    },
    "aggregations": {
        "genres": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
                {
                    "key": 52,
                    "doc_count": 752201
                },
                {
                    "key": 49,
                    "doc_count": 752273
                },
                {
                    "key": 56,
                    "doc_count": 752283
                },
                {
                    "key": 74,
                    "doc_count": 752309
                },
                {
                    "key": 76,
                    "doc_count": 752405
                },
                {
                    "key": 59,
                    "doc_count": 752444
                },
                {
                    "key": 55,
                    "doc_count": 752749
                },
                {
                    "key": 68,
                    "doc_count": 752835
                },
                {
                    "key": 61,
                    "doc_count": 752863
                }
            ]
        }
    }
}

2、Filter 聚合

Filter 聚合會先進行條件過濾，在進行聚合

場景：查詢華南理工大學的學生的數學科目平均分（先篩選學校，再進行分數統計聚合）

{
    "aggs" : {
        "scut_math_avg" : {
            "filter" : { "term": { "school": "華南理工大學" } },
            "aggs" : {
                "avg_price" : { "avg" : { "field" : "math" } }
            }
        }
    }
}

查詢結果是：

{
    "took": 0,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 10000,
            "relation": "gte"
        },
        "max_score": null,
        "hits": []
    },
    "aggregations": {
        "scut_math_avg": {
            "doc_count": 1854993,
            "avg_price": {
                "value": 56.93080027795253
            }
        }
    }
}

3、Filters 多重聚合

場景：查詢各個學校，語文、數學、英語的平均分都是多少，可以採用多重聚合，速度可能有點慢，如下

POST  http://localhost:9200/student_test1/_search?size=0
{
  "aggs" : {
    "messages" : {
      "filters" : {
        "filters" : {
          "school_1" :   { "term" : { "school" : "華南理工大學" }},
          "school_2" : { "term" : { "school" : "中山大學" }},
          "school_3" : { "match" : { "school" : "暨南大學" }}
        }
      },
      "aggs" : {
           "avg_chinese" : { "avg" : { "field" : "chinese" } },
           "avg_math" : { "avg" : { "field" : "math" } }
      }
    }
  }
}

於是得到結果：

{
    "took": 0,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 10000,
            "relation": "gte"
        },
        "max_score": null,
        "hits": []
    },
    "aggregations": {
        "messages": {
            "buckets": {
                "school_1": {
                    "doc_count": 1854993,
                    "avg_chinese": {
                        "value": 59.353236912484306
                    },
                    "avg_math": {
                        "value": 56.93080027795253
                    }
                },
                "school_2": {
                    "doc_count": 1855016,
                    "avg_chinese": {
                        "value": 59.349129064115886
                    },
                    "avg_math": {
                        "value": 56.93540918245449
                    }
                },
                "school_3": {
                    "doc_count": 44519876,
                    "avg_chinese": {
                        "value": 59.35397212247402
                    },
                    "avg_math": {
                        "value": 56.92948502372289
                    }
                }
            }
        }
    }
}

4、Range 範圍聚合

場景：想要查詢語文成績各個分數段的人數，可以這樣查詢

POST  http://localhost:9200/student_test1/_search?size=0
{
    "aggs" : {
        "chinese_ranges" : {
            "range" : {
                "field" : "chinese",
                "ranges" : [
                    { "to" : 60 },
                    { "from" : 60, "to" : 75 },
                    { "from" : 75, "to" : 85 },
                    { "from" : 85 }
                ]
            }
        }
    }
}

查詢結果是：

{
    "took": 0,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 10000,
            "relation": "gte"
        },
        "max_score": null,
        "hits": []
    },
    "aggregations": {
        "chinese_ranges": {
            "buckets": [
                {
                    "key": "*-60.0",
                    "to": 60,
                    "doc_count": 25096839
                },
                {
                    "key": "60.0-75.0",
                    "from": 60,
                    "to": 75,
                    "doc_count": 11278543
                },
                {
                    "key": "75.0-85.0",
                    "from": 75,
                    "to": 85,
                    "doc_count": 7424634
                },
                {
                    "key": "85.0-*",
                    "from": 85,
                    "doc_count": 6284812
                }
            ]
        }
    }
}

這個返回結果的組名分別是 *-60.0 60.0-75.0 75.0-85.0 85.0-*
如果我們不想要這樣的組名，可以自定義組名，例如：

POST  http://localhost:9200/student_test1/_search?size=0
{
    "aggs" : {
        "chinese_ranges" : {
            "range" : {
                "field" : "chinese",
                "keyed" : true,
                "ranges" : [
                    { "key" : "不及格", "to" : 60 },
                    { "key" : "及格", "from" : 60, "to" : 75 },
                    { "key" : "良好", "from" : 75, "to" : 85 },
                    { "key" : "優秀", "from" : 85 }
                ]
            }
        }
    }
}

查詢結果將會是：

{
    "took": 1675,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 10000,
            "relation": "gte"
        },
        "max_score": null,
        "hits": []
    },
    "aggregations": {
        "chinese_ranges": {
            "buckets": {
                "不及格": {
                    "to": 60,
                    "doc_count": 25096839
                },
                "及格": {
                    "from": 60,
                    "to": 75,
                    "doc_count": 11278543
                },
                "良好": {
                    "from": 75,
                    "to": 85,
                    "doc_count": 7424634
                },
                "優秀": {
                    "from": 85,
                    "doc_count": 6284812
                }
            }
        }
    }
}

還有其它各種各樣的、複雜的聚合查詢，都是可以網上查資料，甚至還支援推薦系統的一些計算方法，例如矩陣的概念等等。

還可以參考 https://blog.csdn.net/alex_xfboy/article/details/8610