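[Trick] Implementing SciPy's rankdata Methods in PyTorch

Posted on 2025-09-26

During data processing we often need to sort data and assign ranks. The rankdata function in SciPy offers several methods for handling repeated values (ties).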

average: The average of the ranks that would have been assigned to all the tied values is assigned to each value.

min: The minimum of the ranks that would have been assigned to all the tied values is assigned to each value. (This is also referred to as “competition” ranking.)

max: The maximum of the ranks that would have been assigned to all the tied values is assigned to each value.

dense: Like min, but the rank of the next highest element is assigned the rank immediately after those assigned to the tied elements.

ordinal: All values are given a distinct rank, corresponding to the order that the values occur in a.

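However, SciPy does not run on the GPU, so it cannot use GPU compute to speed up large-scale data. This post shows how to implement SciPy-like rankdata functionality in PyTorch, with support for the different tie-handling methods.

A common simple trick is to apply argsort twice to obtain ranks: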

def rankdata(input: torch.Tensor, dim: int = -1) -> torch.Tensor:
    return torch.argsort(torch.argsort(input, dim=dim), dim=dim) + 1

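However, this approach cannot handle ties. The general idea for fixing this is to sort the input first and then use searchsorted to locate each element within the sorted array. By switching searchsorted between its left and right insertion side (the right flag, or equivalently the side argument), we obtain the interval occupied by each group of tied values, which is enough to implement the different ranking strategies. Let left and right denote the 0-based left and right boundaries of each element in the sorted array. The average, min, and max rankings then follow directly: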

average: (left + right + 1) / 2
min: left + 1
max: right
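The three formulas can be checked on a plain Python list before moving to tensors. The sketch below is only an illustration: rank_with_ties is a hypothetical helper, and the standard-library bisect functions stand in for torch.searchsorted.

```python
from bisect import bisect_left, bisect_right


def rank_with_ties(values, method="average"):
    """Rank a 1-D list (ranks start at 1) from left/right insertion points."""
    ordered = sorted(values)
    ranks = []
    for v in values:
        left = bisect_left(ordered, v)    # count of elements strictly smaller than v
        right = bisect_right(ordered, v)  # count of elements smaller than or equal to v
        if method == "average":
            ranks.append((left + right + 1) / 2)
        elif method == "min":
            ranks.append(left + 1)
        elif method == "max":
            ranks.append(right)
        else:
            raise ValueError(f"unknown method: {method}")
    return ranks


print(rank_with_ties([0, 2, 3, 2], "average"))  # [1.0, 2.5, 4.0, 2.5]
print(rank_with_ties([0, 2, 3, 2], "min"))      # [1, 2, 4, 2]
print(rank_with_ties([0, 2, 3, 2], "max"))      # [1, 3, 4, 3]
```

For the tied value 2, left is 1 and right is 3, so the tie occupies ranks 2 and 3, and the three methods pick their average, minimum, and maximum.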

@torch.jit.script
def rankdata_avg(input: torch.Tensor, dim: int = -1) -> torch.Tensor:
    """Assign ranks to data, ranks begin at 1.

    The average of the ranks that would have been assigned to all the tied values is assigned to each value.

    Examples:
        >>> input = torch.tensor([0, 2, 3, 2])
        >>> rankdata_avg(input)
        tensor([1.0000, 2.5000, 4.0000, 2.5000])
    """
    input = input.swapdims(dim, -1).contiguous()
    sorted_input, _ = torch.sort(input, dim=-1)
    left = torch.searchsorted(sorted_input, input, right=False).swapdims(dim, -1)
    right = torch.searchsorted(sorted_input, input, right=True).swapdims(dim, -1)
    ranks = (left + right + 1) * 0.5
    return ranks

@torch.jit.script
def rankdata_min(input: torch.Tensor, dim: int = -1) -> torch.Tensor:
    """Assign ranks to data, ranks begin at 1.

    The minimum of the ranks that would have been assigned to all the tied values is assigned to each value.

    Examples:
        >>> input = torch.tensor([0, 2, 3, 2])
        >>> rankdata_min(input)
        tensor([1, 2, 4, 2])
    """
    input = input.swapdims(dim, -1).contiguous()
    sorted_input, _ = torch.sort(input, dim=-1)
    ranks = torch.searchsorted(sorted_input, input, right=False).swapdims(dim, -1) + 1
    return ranks

@torch.jit.script
def rankdata_max(input: torch.Tensor, dim: int = -1) -> torch.Tensor:
    """Assign ranks to data, ranks begin at 1.

    The maximum of the ranks that would have been assigned to all the tied values is assigned to each value.

    Examples:
        >>> input = torch.tensor([0, 2, 3, 2])
        >>> rankdata_max(input)
        tensor([1, 3, 4, 3])
    """
    input = input.swapdims(dim, -1).contiguous()
    sorted_input, _ = torch.sort(input, dim=-1)
    ranks = torch.searchsorted(sorted_input, input, right=True).swapdims(dim, -1)
    return ranks

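For the dense method, we can first replace the repeated values in the sorted array with a large number (for example the maximum), sort once more, and finally use searchsorted to obtain the ranks: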

@torch.jit.script
def rankdata_dense(input: torch.Tensor, dim: int = -1) -> torch.Tensor:
    """Assign ranks to data, ranks begin at 1.

    Like `min` mode, but the rank of the next highest element is assigned the rank immediately after those assigned to the tied elements.

    Examples:
        >>> input = torch.tensor([0, 2, 3, 2])
        >>> rankdata_dense(input)
        tensor([1, 2, 3, 2])
    """
    input = input.swapdims(dim, -1).contiguous()
    sorted_input, _ = torch.sort(input, dim=-1)
    sorted_input[..., 1:].masked_fill_(sorted_input[..., 1:] == sorted_input[..., :-1], sorted_input.max())
    sorted_input, _ = torch.sort(sorted_input, dim=-1)
    ranks = torch.searchsorted(sorted_input, input, right=False).swapdims(dim, -1) + 1
    return ranks
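To see why the fill-and-resort step yields dense ranks, the same idea can be traced on a plain list. This is a minimal sketch assuming a non-empty 1-D input; dense_rank is a hypothetical name and bisect_left stands in for torch.searchsorted.

```python
from bisect import bisect_left


def dense_rank(values):
    """Dense ranks (starting at 1) for a non-empty 1-D list."""
    ordered = sorted(values)
    largest = ordered[-1]
    # Overwrite every repeated entry with the maximum, mirroring masked_fill_.
    deduped = [ordered[0]] + [
        largest if cur == prev else cur
        for prev, cur in zip(ordered, ordered[1:])
    ]
    # After re-sorting, the distinct values sit at the front in order and
    # the fillers collect at the tail.
    deduped.sort()
    # The left insertion point is now the number of distinct smaller values.
    return [bisect_left(deduped, v) + 1 for v in values]


print(dense_rank([0, 2, 3, 2]))     # [1, 2, 3, 2]
print(dense_rank([5, 5, 1, 1, 3]))  # [3, 3, 1, 1, 2]
```

Because the fillers are never smaller than any real value, they do not change the left insertion point of any element.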

For the ordinal method, tied values do not share a rank, so the plain double-argsort approach is sufficient. Here is an alternative implementation based on scatter_:

@torch.jit.script
def rankdata_ordinal(input: torch.Tensor, dim: int = -1) -> torch.Tensor:
    """Assign ranks to data, ranks begin at 1.

    All values are given a distinct rank, corresponding to the order that the values occur in `input`.

    Examples:
        >>> input = torch.tensor([0, 2, 3, 2])
        >>> rankdata_ordinal(input)
        tensor([1, 2, 4, 3])
    """
    dim = (dim + input.ndim) % input.ndim
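    # stable=True keeps tied values in their original order, which ordinal ranking relies on.
    indices = torch.argsort(input, dim=dim, stable=True)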
    shape = [1 if i != dim else -1 for i in range(input.ndim)]
    ranks = torch.arange(1, input.size(dim) + 1, device=input.device).view(shape).expand_as(input)
    output = torch.empty_like(input, dtype=torch.long)
    output.scatter_(dim, indices, ranks)
    return output
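The scatter step can be pictured on a plain list: sort the positions, then write rank i + 1 into the position that ended up i-th. The sketch below is illustrative only (ordinal_rank is a hypothetical name); it relies on Python's sorted being stable so that tied values keep their original order.

```python
def ordinal_rank(values):
    """Ordinal ranks (starting at 1); ties are broken by position."""
    # sorted() is stable, so equal values keep their original relative order.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank  # scatter: position idx receives its rank
    return ranks


print(ordinal_rank([0, 2, 3, 2]))  # [1, 2, 4, 3]
```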

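Loading All Data Points Stored Under a TensorBoard Log Directory

Posted on 2025-04-22

When a tfevents file stores a very large number of data points (more than 10K), TensorBoard automatically downsamples them so that at most 10K points are loaded. This can cause misaligned steps when comparing results. The loading logic lives in event_accumulator.py, but there is no direct interface for forcing a full load, and the behavior is not described in the documentation. The parameter list of EventAccumulator describes size_guidance as follows, which shows that the default downsampling can be avoided by setting size_guidance.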

size_guidance: Information on how much data the EventAccumulator should store in memory. The DEFAULT_SIZE_GUIDANCE tries not to store too much so as to avoid OOMing the client. The size_guidance should be a map from a tagType string to an integer representing the number of items to keep per tag for items of that tagType. If the size is 0, all events are stored.

The DEFAULT_SIZE_GUIDANCE mentioned there is defined as:

DEFAULT_SIZE_GUIDANCE = {
    COMPRESSED_HISTOGRAMS: 500,
    IMAGES: 4,
    AUDIO: 4,
    SCALARS: 10000,
    HISTOGRAMS: 1,
    TENSORS: 10,
}

Here we define a new size_guidance that requests all data for every tag type:

class NoneSizeGuidance:
    def __getitem__(self, _, /):
        return 0
    
    def __contains__(self, _, /):
        return True
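The class works because the accumulator only needs membership tests and item lookups on size_guidance. The sketch below imitates that lookup with a hypothetical resolve_sizes helper and a two-entry subset of the defaults; it is an assumption about how the merge behaves, not a copy of the TensorBoard source.

```python
DEFAULT_SIZE_GUIDANCE = {"scalars": 10000, "images": 4}  # illustrative subset


class NoneSizeGuidance:
    def __getitem__(self, _, /):
        return 0

    def __contains__(self, _, /):
        return True


def resolve_sizes(size_guidance):
    """Hypothetical stand-in for merging user guidance with the defaults."""
    return {
        key: size_guidance[key] if key in size_guidance else default
        for key, default in DEFAULT_SIZE_GUIDANCE.items()
    }


print(resolve_sizes({"scalars": 0}))      # {'scalars': 0, 'images': 4}
print(resolve_sizes(NoneSizeGuidance()))  # {'scalars': 0, 'images': 0}
```

A plain dict has to list every tag type explicitly, whereas the catch-all object answers 0 (keep everything) for any key it is asked about.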

A usage example:

import os 

import pandas as pd 
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

def load_tensorboard_scalar(logdir: os.PathLike, tag: str, duplicate: str = "mean") -> pd.Series:
    accumulator = EventAccumulator(
        logdir,
        size_guidance=NoneSizeGuidance(),
    ).Reload()
    output = pd.DataFrame(accumulator.Scalars(tag), columns=["wall_time", "step", tag])
    output: pd.Series = output.drop(columns=["wall_time"]).set_index("step")[tag]

    if duplicate == "mean":
        return output.groupby(level=0).mean()
    elif duplicate == "first":
        return output.groupby(level=0).first()
    elif duplicate == "last":
        return output.groupby(level=0).last()
    elif duplicate == "none":
        return output
    else:
        raise ValueError(f"Unknown duplicate method: {duplicate}")


load_tensorboard_scalar(
    logdir="path/to/logdir",
    tag="Train/Loss",
    duplicate="mean",
)

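Managing and Switching Between Multiple GCC Versions on Linux

Posted on 2025-04-22

Linux distributions usually ship with a preinstalled GCC, but the preinstalled version tends to favor stability. Real projects may need newer features, in which case a newer compiler has to be installed manually. As shown below, this Ubuntu 22.04 system already has both gcc-11 and gcc-13 installed, and the default gcc is still the preinstalled gcc-11.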

~ ❯❯❯ gcc --version
gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

~ ❯❯❯ ll /usr/bin/gcc*
lrwxrwxrwx 1 root root  6 Aug  5  2021 /usr/bin/gcc -> gcc-11
lrwxrwxrwx 1 root root 23 May 13  2023 /usr/bin/gcc-11 -> x86_64-linux-gnu-gcc-11
lrwxrwxrwx 1 root root 23 Jul 11  2023 /usr/bin/gcc-13 -> x86_64-linux-gnu-gcc-13
lrwxrwxrwx 1 root root  9 Aug  5  2021 /usr/bin/gcc-ar -> gcc-ar-11
lrwxrwxrwx 1 root root 26 May 13  2023 /usr/bin/gcc-ar-11 -> x86_64-linux-gnu-gcc-ar-11
lrwxrwxrwx 1 root root 26 Jul 11  2023 /usr/bin/gcc-ar-13 -> x86_64-linux-gnu-gcc-ar-13
lrwxrwxrwx 1 root root  9 Aug  5  2021 /usr/bin/gcc-nm -> gcc-nm-11
lrwxrwxrwx 1 root root 26 May 13  2023 /usr/bin/gcc-nm-11 -> x86_64-linux-gnu-gcc-nm-11
lrwxrwxrwx 1 root root 26 Jul 11  2023 /usr/bin/gcc-nm-13 -> x86_64-linux-gnu-gcc-nm-13
lrwxrwxrwx 1 root root 13 Aug  5  2021 /usr/bin/gcc-ranlib -> gcc-ranlib-11
lrwxrwxrwx 1 root root 30 May 13  2023 /usr/bin/gcc-ranlib-11 -> x86_64-linux-gnu-gcc-ranlib-11
lrwxrwxrwx 1 root root 30 Jul 11  2023 /usr/bin/gcc-ranlib-13 -> x86_64-linux-gnu-gcc-ranlib-13

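The default gcc version can be set with the update-alternatives command. The --install option registers a new version, the --config option selects the current default, and the --slave option used here ties the g++ and gcov versions to the chosen gcc.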

~ ❯❯❯ sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-11 11 --slave /usr/bin/g++ g++ /usr/bin/g++-11 --slave /usr/bin/gcov gcov /usr/bin/gcov-11
~ ❯❯❯ sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-13 13 --slave /usr/bin/g++ g++ /usr/bin/g++-13 --slave /usr/bin/gcov gcov /usr/bin/gcov-13

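After the commands finish, the /usr/bin/gcc symlink points to /etc/alternatives/gcc. Checking the version shows that gcc is now gcc-13 (in auto mode the alternative with the highest priority is selected), and g++ has been updated accordingly.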

~ ❯❯❯ gcc --version
gcc (Ubuntu 13.1.0-8ubuntu1~22.04) 13.1.0
Copyright (C) 2023 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

~ ❯❯❯ g++ --version
g++ (Ubuntu 13.1.0-8ubuntu1~22.04) 13.1.0
Copyright (C) 2023 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

~ ❯❯❯ ll /usr/bin/gcc*
lrwxrwxrwx 1 root root 21 Apr 22 16:09 /usr/bin/gcc -> /etc/alternatives/gcc
lrwxrwxrwx 1 root root 23 May 13  2023 /usr/bin/gcc-11 -> x86_64-linux-gnu-gcc-11
lrwxrwxrwx 1 root root 23 Jul 11  2023 /usr/bin/gcc-13 -> x86_64-linux-gnu-gcc-13
lrwxrwxrwx 1 root root  9 Aug  5  2021 /usr/bin/gcc-ar -> gcc-ar-11
lrwxrwxrwx 1 root root 26 May 13  2023 /usr/bin/gcc-ar-11 -> x86_64-linux-gnu-gcc-ar-11
lrwxrwxrwx 1 root root 26 Jul 11  2023 /usr/bin/gcc-ar-13 -> x86_64-linux-gnu-gcc-ar-13
lrwxrwxrwx 1 root root  9 Aug  5  2021 /usr/bin/gcc-nm -> gcc-nm-11
lrwxrwxrwx 1 root root 26 May 13  2023 /usr/bin/gcc-nm-11 -> x86_64-linux-gnu-gcc-nm-11
lrwxrwxrwx 1 root root 26 Jul 11  2023 /usr/bin/gcc-nm-13 -> x86_64-linux-gnu-gcc-nm-13
lrwxrwxrwx 1 root root 13 Aug  5  2021 /usr/bin/gcc-ranlib -> gcc-ranlib-11
lrwxrwxrwx 1 root root 30 May 13  2023 /usr/bin/gcc-ranlib-11 -> x86_64-linux-gnu-gcc-ranlib-11
lrwxrwxrwx 1 root root 30 Jul 11  2023 /usr/bin/gcc-ranlib-13 -> x86_64-linux-gnu-gcc-ranlib-13

~ ❯❯❯ ll /usr/bin/g++*
lrwxrwxrwx 1 root root 21 Apr 22 16:09 /usr/bin/g++ -> /etc/alternatives/g++
lrwxrwxrwx 1 root root 23 May 13  2023 /usr/bin/g++-11 -> x86_64-linux-gnu-g++-11
lrwxrwxrwx 1 root root 23 Jul 11  2023 /usr/bin/g++-13 -> x86_64-linux-gnu-g++-13

The --config option lists all registered gcc versions, and the desired one can be selected simply by its number.

~ ❯❯❯ update-alternatives --config gcc
There are 2 choices for the alternative gcc (providing /usr/bin/gcc).

  Selection    Path             Priority   Status
------------------------------------------------------------
* 0            /usr/bin/gcc-13   13        auto mode
  1            /usr/bin/gcc-11   11        manual mode
  2            /usr/bin/gcc-13   13        manual mode

Press <enter> to keep the current choice[*], or type selection number:

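Using std::shared_ptr Objects in Spirit Qi

Posted on 2024-12-09

boost::spirit::qi is a parsing library that can parse strings and construct objects from them. When a syntax tree has to be built during parsing, phx::new_ or phx::construct is typically used to create raw pointers or objects. In practice, however, it is often desirable to manage object lifetimes with a smart pointer such as std::shared_ptr, and boost::phoenix does not provide a make_shared-style function object.

Based on an answer on Stack Overflow, we can build a make_shared_ function object that lazily constructs std::shared_ptr objects inside boost::spirit::qi grammar rules.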

namespace boost::phoenix {
template <typename T>
struct make_shared_f {
  template <typename... A>
  struct result {
    typedef std::shared_ptr<T> type;
  };

  template <typename... A>
  typename result<A...>::type operator()(A&&... a) const {
    return std::make_shared<T>(std::forward<A>(a)...);
  }
};
template <typename T>
using make_shared_ = function<make_shared_f<T>>;
}  // namespace boost::phoenix

It can then be used to build semantic actions that create std::shared_ptr objects, as follows:

namespace qi = boost::spirit::qi;
namespace phx = boost::phoenix;

using phx::make_shared_;
using qi::_1;
using qi::_val;

...

expr = term[_val = _1] >>
       *(('+' >> term[_val = make_shared_<AddNode>()(_val, _1)]) |
         ('-' >> term[_val = make_shared_<SubNode>()(_val, _1)]));

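Installing the Latest CMake & GCC on Ubuntu

Posted on 2024-12-09

I wanted to use C++23 features in WSL, which requires recent versions of CMake and GCC. On Ubuntu 22.04 the default CMake version is 3.22 and the default GCC version is 11, so newer versions have to be installed. This post records the installation steps.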

CMake

sudo apt-get update
sudo apt-get install apt-transport-https ca-certificates gnupg software-properties-common wget

wget -O - https://apt.kitware.com/keys/kitware-archive-latest.asc 2>/dev/null | sudo apt-key add -
sudo apt-add-repository "deb https://apt.kitware.com/ubuntu/ $(lsb_release -cs) main"
sudo apt-get update

sudo apt-get install cmake

GCC

sudo add-apt-repository ppa:ubuntu-toolchain-r/test
sudo apt-get update
sudo apt-get install gcc-?? g++-??

gcc-?? --version

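Here ?? stands for the version number, for example gcc-13 and g++-13.

Configuring the Ruff Extension in VS Code

Posted on 2024-06-11

Ruff is a Python linter and code formatter written in Rust. Its formatter is similar to black but, thanks to Rust, considerably faster, and it also supports isort-like import sorting.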

To use Ruff in VS Code, search for the Ruff extension in the marketplace and install it. After installation, the following settings make it the default formatter:

"[python]": {
    "editor.formatOnSave": true,
    "editor.defaultFormatter": "charliermarsh.ruff",
    "editor.codeActionsOnSave": {
        "source.fixAll": "explicit",
        "source.organizeImports": "explicit"
    }
},
"ruff.format.args": [
    "--line-length=120"
]

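With this configuration, code is formatted automatically when a file is saved and all imports are sorted automatically. The extra command-line argument sets Ruff's formatting line width to 120.

WSL2 Proxy Settings in NAT Mode

Posted on 2024-06-11

While configuring WSL on Windows 11, I ran into a problem with WSL2's network configuration. This post records the fix.

After installing Ubuntu, some traffic needed to go through a proxy, but WSL showed this warning:

wsl: A localhost proxy configuration was detected but not mirrored into WSL. WSL in NAT mode does not support localhost proxies.

I first tried a solution commonly suggested online: read the Windows DNS server address from /etc/resolv.conf and use it to set the proxy. That did not work.

Eventually I found that the address of the Windows host can be obtained from the default gateway reported by ip route show, which allows the proxy to be set dynamically.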

set_proxy() {
  host_ip=$(ip route show | grep -i default | awk '{print $3}')
  export http_proxy="http://${host_ip}:7890"
  export https_proxy="http://${host_ip}:7890"
}

unset_proxy() {
  unset http_proxy
  unset https_proxy
}

After adding the functions above to .zshrc, the proxy can be enabled and disabled with set_proxy and unset_proxy.
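The same gateway lookup is useful from scripts that cannot source shell functions. The sketch below parses the text printed by ip route show; default_gateway is a hypothetical helper, the sample addresses are made up for illustration, and port 7890 is the proxy port used in the shell functions above.

```python
def default_gateway(route_output):
    """Return the gateway of the first default route, or None if there is none."""
    for line in route_output.splitlines():
        fields = line.split()
        if fields and fields[0] == "default" and "via" in fields:
            return fields[fields.index("via") + 1]
    return None


# Illustrative output; real addresses differ from machine to machine.
sample = (
    "default via 172.22.96.1 dev eth0 proto kernel\n"
    "172.22.96.0/20 dev eth0 proto kernel scope link src 172.22.101.7\n"
)
host_ip = default_gateway(sample)
print(f"http://{host_ip}:7890")  # http://172.22.96.1:7890
```

In a real script the sample string would be replaced by the captured output of ip route show, for instance through subprocess.run.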

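Backing Up and Migrating a Hexo Blog

Posted on 2023-02-20

Over the past couple of days I planned to migrate the blog, previously deployed from a MacBook, to a Linux machine in the lab and maintain it there. Since the files Hexo pushes to GitHub are generated output rather than the original sources, I also wanted a backup of the source files to avoid losing them. This post records the steps so the work does not have to be repeated in the next migration.

Backing Up the Hexo Blog

The blog's repo is username.github.io, where the master branch holds the deployed blog. A new backup branch is created to store the blog's source files.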

git add -A
git commit -m "source file backup"
git push -u origin main:backup --force

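Setting Up the New Machine

First install nvm, then use nvm to install Node:

nvm install --lts

Then check that the installation succeeded:

node -v
npm -v

Once the Node environment is confirmed to work, install Hexo:

npm install -g hexo-cli
npm install -g hexo

With the environment ready, the blog can be redeployed locally, starting by cloning the backup from GitHub:

git clone https://github.com/path/to/your/repo

Switch to the backup branch:

git checkout origin/backup

Install the dependencies:

npm install

Test locally that the blog builds and serves correctly:

hexo g && hexo s

Updating Packages

My blog's packages had never been updated since it was first deployed, so many were far behind the latest versions. The following command lists outdated packages:

npm outdated

Many packages are well behind their latest versions. We will try to update them, but updating directly may run into dependency problems.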

Package                         Current  Wanted  Latest  Location                                     Depended by
hexo                              4.2.1   4.2.1   6.3.0  node_modules/hexo                            hexo
hexo-deployer-git                 2.1.0   2.1.0   4.0.0  node_modules/hexo-deployer-git               hexo
hexo-deployer-rsync               1.0.0   1.0.0   2.0.0  node_modules/hexo-deployer-rsync             hexo
hexo-generator-archive            1.0.0   1.0.0   2.0.0  node_modules/hexo-generator-archive          hexo
hexo-generator-category           1.0.0   1.0.0   2.0.0  node_modules/hexo-generator-category         hexo
hexo-generator-feed               2.2.0   2.2.0   3.0.0  node_modules/hexo-generator-feed             hexo
hexo-generator-index              1.0.0   1.0.0   3.0.0  node_modules/hexo-generator-index            hexo
hexo-generator-sitemap            2.1.0   2.2.0   3.0.1  node_modules/hexo-generator-sitemap          hexo
hexo-generator-tag                1.0.0   1.0.0   2.0.0  node_modules/hexo-generator-tag              hexo
hexo-renderer-ejs                 1.0.0   1.0.0   2.0.0  node_modules/hexo-renderer-ejs               hexo
hexo-renderer-markdown-it-plus    1.0.4   1.0.6   1.0.6  node_modules/hexo-renderer-markdown-it-plus  hexo
hexo-server                       1.0.0   1.0.0   3.0.0  node_modules/hexo-server                     hexo

First install npm-check-updates, then use it to check which dependencies can be upgraded:

npm install -g npm-check-updates
ncu

Checking /home/yunchong/Documents/hexo/package.json
[====================] 17/17 100%

 hexo                            ^4.0.0  →  ^6.3.0
 hexo-deployer-git               ^2.1.0  →  ^4.0.0
 hexo-deployer-rsync             ^1.0.0  →  ^2.0.0
 hexo-generator-archive          ^1.0.0  →  ^2.0.0
 hexo-generator-category         ^1.0.0  →  ^2.0.0
 hexo-generator-feed             ^2.2.0  →  ^3.0.0
 hexo-generator-index            ^1.0.0  →  ^3.0.0
 hexo-generator-sitemap          ^2.0.0  →  ^3.0.1
 hexo-generator-tag              ^1.0.0  →  ^2.0.0
 hexo-renderer-ejs               ^1.0.0  →  ^2.0.0
 hexo-renderer-markdown-it-plus  ^1.0.4  →  ^1.0.6
 hexo-server                     ^1.0.0  →  ^3.0.0

Use ncu to update package.json:

ncu -u

Upgrading /home/yunchong/Documents/hexo/package.json
[====================] 17/17 100%

 hexo                            ^4.0.0  →  ^6.3.0
 hexo-deployer-git               ^2.1.0  →  ^4.0.0
 hexo-deployer-rsync             ^1.0.0  →  ^2.0.0
 hexo-generator-archive          ^1.0.0  →  ^2.0.0
 hexo-generator-category         ^1.0.0  →  ^2.0.0
 hexo-generator-feed             ^2.2.0  →  ^3.0.0
 hexo-generator-index            ^1.0.0  →  ^3.0.0
 hexo-generator-sitemap          ^2.0.0  →  ^3.0.1
 hexo-generator-tag              ^1.0.0  →  ^2.0.0
 hexo-renderer-ejs               ^1.0.0  →  ^2.0.0
 hexo-renderer-markdown-it-plus  ^1.0.4  →  ^1.0.6
 hexo-server                     ^1.0.0  →  ^3.0.0

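After that, npm can install the new versions according to the updated package.json:

npm install

Exponentially Weighted Moving Average for Irregularly Spaced Time Series

Posted on 2022-12-03

The exponentially weighted moving average (EWMA or EWA) is a simple yet powerful tool in quantitative trading, especially for intraday trading. It lets traders quickly track the average price of a security over a given period, and it can be used to identify trends and inform trading decisions.

The EWMA is usually written as:

$\text{EWMA}_{t} = \alpha x_t +(1 - \alpha) \text{EWMA}_{t-1}$

The key is therefore how $\alpha$ is determined. The pandas API offers four parameters, alpha, halflife, span, and com, which are expressed differently but are equivalent to one another. alpha and span are the most commonly used.

Here com is the center of mass. It can be viewed as the average lag of the time points, weighted by the weight each point receives, i.e. the position where the weights balance: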

$\begin{aligned} \text{CoM} &= \sum_{t=0}^{\infty} (1 - \alpha)^t \alpha t \\ &= \alpha(1 - \alpha)\sum_{t=0}^{\infty} t(1-\alpha)^{t-1} \\ &= \alpha(1 - \alpha) \sum_{t=0}^{\infty} \left[-(1 - \alpha)^t\right]'\\ &= \alpha(1 - \alpha) \left[-\sum_{t=0}^{\infty} (1 - \alpha)^t\right]'\\ &= \alpha(1 - \alpha) \left[ - \frac{1}{\alpha}\right]' \\ &= \frac{1 - \alpha}{\alpha} \end{aligned}$

Rearranging the result gives:

$\alpha = 1 / (1 + \text{CoM})$

The half-life is the time it takes for a weight to decay to half of its value, so:

$(1 - \alpha)^H = 0.5 \Rightarrow \alpha = 1 - \exp \left(-\frac{\log2}{H}\right)$
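Both relations are easy to check numerically. The sketch below uses only the standard library; the helper names are hypothetical, and the infinite center-of-mass sum is truncated at 10,000 terms, by which point the remaining weights are negligible for the chosen $\alpha$.

```python
import math


def alpha_from_com(com):
    return 1 / (1 + com)


def alpha_from_halflife(halflife):
    return 1 - math.exp(-math.log(2) / halflife)


alpha = 0.2
# Truncated version of sum_t alpha * (1 - alpha)^t * t.
com = sum(alpha * (1 - alpha) ** t * t for t in range(10_000))
print(com, (1 - alpha) / alpha)  # both are approximately 4.0
print(alpha_from_com(com))       # recovers approximately 0.2

a = alpha_from_halflife(10)
print((1 - a) ** 10)             # approximately 0.5: the weight halves after 10 steps
```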

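All of the above assumes equally spaced observations. For a time series with irregular spacing, we can pass an index argument that gives the timestamps, which makes the result more accurate. Suppose the gap between two timestamps is dt; then alpha can be computed as: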

$\alpha' = 1 - \exp(-\alpha \text{d}t) \approx 1 - (1 - \alpha \text{d}t) = \alpha \text{d}t$

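When the interval is always 1, this is approximately equivalent to the original formula for small $\alpha$.

Consider three consecutive timestamps t0, t1, and t2, with dt1 = t1 - t0 and dt2 = t2 - t1. The EWMA at t2 then expands as: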

$\begin{aligned} \text{EWMA}_2 &= \alpha_2 x_2 + (1 - \alpha_2) \text{EWMA}_1 \\ &= \alpha_2 x_2 + (1 - \alpha_2) \left(\alpha_1 x_1 + (1 - \alpha_1) \text{EWMA}_0\right) \\ &= \alpha_2 x_2 + \alpha_1(1 - \alpha_2) x_1 + (1 - \alpha_1)(1 - \alpha_2) \text{EWMA}_0 \end{aligned}$

The weight on the value at t0 is:

$(1 - \alpha_1)(1 - \alpha_2) = \exp(-\alpha \text{d}t_1 - \alpha \text{d}t_2) = \exp(-\alpha (t_2 - t_0))$

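So even if several timestamps arrive in between, data points separated by the same time interval still receive the same weight. The corresponding Python code is: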

from typing import Optional

import numpy as np


class EWMA(object):
    def __init__(
        self,
        com: Optional[float] = None,
        span: Optional[float] = None,
        halflife: Optional[float] = None,
        alpha: Optional[float] = None,
    ) -> None:
        assert (
            (com is None) + (span is None) + (halflife is None) + (alpha is None)
        ) == 3, "only one of com, span, halflife, alpha should be not None"
        if com is not None:
            self.alpha = 1 / (1 + com)
        elif span is not None:
            self.alpha = 2 / (span + 1)
        elif halflife is not None:
            self.alpha = 1 - np.exp(np.log(0.5) / halflife)
        elif alpha is not None:
            self.alpha = alpha

    def __call__(self, x: np.ndarray, index: Optional[np.ndarray] = None) -> np.ndarray:
        if index is not None:
            alpha = 1 - np.exp(-np.diff(index, prepend=0) * self.alpha)
        else:
            alpha = np.ones_like(x) * self.alpha

        # Use a float buffer so that integer inputs are not truncated.
        ewma = np.zeros_like(x, dtype=float)
        ewma[0] = x[0]
        for i in range(1, len(x)):
            ewma[i] = alpha[i] * x[i] + (1 - alpha[i]) * ewma[i - 1]
        return ewma
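The weight-composition property can be verified with a dependency-free version of the same recursion. This is a minimal sketch: ewma_irregular is a hypothetical function that takes explicit timestamps and applies the per-step $\alpha' = 1 - \exp(-\alpha\,\text{d}t)$.

```python
import math


def ewma_irregular(xs, ts, alpha):
    """EWMA over samples xs observed at increasing times ts."""
    out = [xs[0]]
    for i in range(1, len(xs)):
        a = 1 - math.exp(-alpha * (ts[i] - ts[i - 1]))  # per-step alpha'
        out.append(a * xs[i] + (1 - a) * out[-1])
    return out


# A zero-valued sample in the middle only splits the decay of x0 into two
# steps, so the weight left on x0 at t = 3 is exp(-alpha * 3) either way.
direct = ewma_irregular([1.0, 0.0], [0.0, 3.0], alpha=0.5)[-1]
split = ewma_irregular([1.0, 0.0, 0.0], [0.0, 1.0, 3.0], alpha=0.5)[-1]
print(direct, split, math.exp(-1.5))  # all three agree up to rounding
```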

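[Paper Notes] XGBoost: A Scalable Tree Boosting System

Posted on 2022-03-16

Gradient Boosted Trees

First consider gradient boosted trees. Given a dataset $\mathcal{D} = \{(\mathrm{x}_i, y_i)\}$ with $n$ samples and $m$ features per sample, a tree ensemble model uses $K$ additive functions, and its output is: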

$\hat{y}_i = \phi(\mathrm{x}_i) = \sum_{k=1}^K f_k(\mathrm{x}_i),\quad f_k \in \mathcal{F}$

In each tree, an input is mapped to a leaf node, and the weight on that leaf is the tree's output for the input. The objective takes a regularized form:

$\mathcal{L}(\phi) = \sum_{i}l(\hat y_i, y_i) + \sum_{k=1}^K \Omega(f_k)$

$\text{where} \quad \Omega(f) = \gamma T + \frac12 \lambda \|w\|^2$

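The first part, $l$, is the loss function, which measures the gap between the prediction and the true value. The second part is the regularization term, which controls model complexity and guards against overfitting. For tree models, the first regularization term controls the number of leaves $T$ and the second controls the leaf weights $w$. Without the regularization term, this reduces to ordinary gradient tree boosting.

When fitting the $t$-th tree, the objective to optimize is: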

$\mathcal{L}^{(t)} = \sum_{i=1}^n l(y_i, \hat y_i^{(t-1)} + f_t(\mathrm{x}_i)) + \Omega(f_t)$

Expanding the loss to a second-order approximation:

$\mathcal{L}^{(t)} \approx \sum_{i=1}^n \left[ l(y_i, \hat y_i^{(t-1)}) + g_i f_t(\mathrm{x}_i) + \frac12 h_i f_t^2(\mathrm{x}_i)\right] + \Omega(f_t)$

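Here $g_i = \partial_{\hat y^{(t-1)}} l(y_i, \hat y_i^{(t-1)})$ and $h_i = \partial^2_{\hat y^{(t-1)}} l(y_i, \hat y_i^{(t-1)})$ are the first and second derivatives of the loss, and $\hat y_i^{(t-1)}$ is the prediction of the first $t-1$ trees. The term $l(y_i, \hat y_i^{(t-1)})$ is therefore a constant in the current optimization. Removing it gives the objective at step $t$: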

$\tilde{\mathcal{L}}^{(t)} = \sum_{i=1}^n \left[ g_i f_t(\mathrm{x}_i) + \frac12 h_i f_t^2(\mathrm{x}_i) \right] + \Omega(f_t)$

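Define $I_j = \{i \mid q(\mathrm{x}_i) = j\}$ as the set of samples assigned to leaf $j$, where $q$ maps a sample to its leaf index. The sum can then be regrouped by leaf: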

$\begin{aligned} \tilde{\mathcal{L}}^{(t)} &= \sum_{i=1}^n \left[ g_i f_t(\mathrm{x}_i) + \frac12 h_i f_t^2(\mathrm{x}_i) \right] + \Omega(f_t) \\ &= \sum_{j=1}^T \sum_{i \in I_j} \left[ g_i f_t(\mathrm{x}_i) + \frac12 h_i f_t^2(\mathrm{x}_i) \right] + \gamma T + \frac12 \lambda \sum_{j=1}^T w_j^2 \\ &= \sum_{j=1}^T \sum_{i \in I_j} \left[ g_i w_j + \frac12 h_i w_j^2 \right] + \gamma T + \frac12 \lambda \sum_{j=1}^T w_j^2 \\ &= \sum_{j=1}^T \left[ \left(\sum_{i \in I_j} g_i\right) w_j + \frac12 \left(\sum_{i \in I_j}h_i + \lambda\right) w_j^2 \right] + \gamma T \end{aligned}$

For a fixed tree structure, $\gamma T$ is a constant and the remaining term is a quadratic in each $w_j$, so the optimal weight is:

$w_j^\star = - \frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j}h_i + \lambda}$

Substituting back gives the optimal objective value:

$\tilde{\mathcal{L}}^{(t)}(q) = - \frac12 \sum_{j=1}^T \frac{(\sum_{i \in I_j} g_i)^2}{\sum_{i \in I_j}h_i + \lambda} + \gamma T$

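This expression depends only on the tree structure $q$, so it can serve as a score for the structure: the smaller it is, the better the structure. Since it is impossible to enumerate all possible tree structures, the tree is grown greedily by adding new branches. Suppose a node with sample set $I$ is split into two subsets $I_L$ and $I_R$; the resulting reduction in $\tilde{\mathcal{L}}$ is: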

$\begin{aligned} \mathcal{L}_{\text{split}} &= \mathcal{L}_{\text{before}} - \mathcal{L}_{\text{after}} \\ &= \frac12 \left[ \frac{(\sum_{i \in I_L} g_i)^2}{\sum_{i \in I_L}h_i + \lambda} + \frac{(\sum_{i \in I_R} g_i)^2}{\sum_{i \in I_R}h_i + \lambda} - \frac{(\sum_{i \in I} g_i)^2}{\sum_{i \in I}h_i + \lambda} \right] - \gamma \end{aligned}$

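This value is the gain of the split, and larger is better. The first term is the improvement brought by the split, and the second is the penalty for the added model complexity. $\gamma$ therefore acts as a threshold for splitting: a node is split only when the improvement exceeds this threshold, which has a pruning effect.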
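The leaf weight and split gain formulas translate almost directly into code. The sketch below is an illustration rather than XGBoost's implementation: the function names are hypothetical, and the example gradients assume the squared-error loss $l = \frac12 (y - \hat y)^2$, for which $g_i = \hat y_i - y_i$ and $h_i = 1$.

```python
def leaf_weight(g, h, lam):
    """Optimal weight w* = -G / (H + lambda) for a leaf with gradients g, h."""
    return -sum(g) / (sum(h) + lam)


def leaf_score(g, h, lam):
    """G^2 / (H + lambda), the per-leaf term of the structure score."""
    return sum(g) ** 2 / (sum(h) + lam)


def split_gain(g_left, h_left, g_right, h_right, lam, gamma):
    """Reduction of the objective when a node is split into left and right."""
    parent = leaf_score(g_left + g_right, h_left + h_right, lam)
    children = leaf_score(g_left, h_left, lam) + leaf_score(g_right, h_right, lam)
    return 0.5 * (children - parent) - gamma


g_left, h_left = [-1.0, -1.0], [1.0, 1.0]
g_right, h_right = [1.0, 1.0], [1.0, 1.0]
print(leaf_weight(g_left, h_left, lam=1.0))  # 2/3: push the left predictions up
print(split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.5))  # 4/3 - 1/2
```

With opposite-signed gradients on the two sides the parent's gradient sum is zero, so the whole improvement comes from the children, and the split is kept because it exceeds $\gamma$.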

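Split-Finding Algorithms

Exact Greedy Algorithm

The key question is how to find the split with the largest gain. The most direct method is enumeration: scan all possible splits of the data and pick the one with the largest gain. To make this efficient, the data is first sorted by feature value, so that a single pass over the sorted data is enough.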
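For a single feature, the sorted scan can be sketched as follows. This is a simplified illustration, not the library's code: best_split is a hypothetical name, and placing the threshold at the midpoint between two neighbouring values is an assumption made here for readability.

```python
def best_split(x, g, h, lam=1.0, gamma=0.0):
    """Scan one feature in sorted order and return (best_gain, threshold)."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    G, H = sum(g), sum(h)
    parent = G * G / (H + lam)
    GL = HL = 0.0
    best = (float("-inf"), None)
    for pos in range(len(order) - 1):
        i, nxt = order[pos], order[pos + 1]
        # Running sums give the left-child statistics in O(1) per candidate.
        GL += g[i]
        HL += h[i]
        if x[i] == x[nxt]:
            continue  # no threshold separates identical feature values
        GR, HR = G - GL, H - HL
        gain = 0.5 * (GL * GL / (HL + lam) + GR * GR / (HR + lam) - parent) - gamma
        if gain > best[0]:
            best = (gain, (x[i] + x[nxt]) / 2)
    return best


print(best_split([1, 2, 3, 4], [-1.0, -1.0, 1.0, 1.0], [1.0] * 4))  # gain 4/3 at threshold 2.5
```

Sorting costs O(n log n) once, after which every candidate split is evaluated from the running sums instead of re-summing the two sides.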

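Approximate Algorithm

Because the exact greedy algorithm enumerates every possibility, it is very time-consuming, and it is clearly infeasible when the data does not fit in memory, so an approximate algorithm is needed. The exact greedy algorithm effectively treats every gap between consecutive values of a continuous feature as a candidate split point. A natural approximation is to use only a subset of them as candidates: map the continuous feature into buckets and choose the split point from the buckets' aggregated statistics. In the resulting procedure (Algorithm 2 in the paper), each bucket is simply treated as if it were a single sample.

How to form the buckets is the crux of the approximate algorithm. The XGBoost paper proposes two variants:

Global: buckets are proposed once during the initial construction phase, and the same proposal is reused for every split.
- Requires fewer proposal steps
Local: buckets are re-proposed at every split.
- Refines the candidates each time, which works better for deeper trees
- Needs fewer candidates

That said, the results show that the global variant can match the local variant once it is given enough candidates.

To choose the candidate split points, the paper proposes a method called Weighted Quantile Sketch. Let $\mathcal{D}_k = \{(x_{1k}, h_1), (x_{2k}, h_2) ,\ldots, (x_{nk}, h_n)\}$ denote the values of the $k$-th feature together with each sample's second-order gradient, and define a rank function $r_k: \mathbb{R} \rightarrow [0, +\infty)$ as: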

$r_k(z) = \frac{1}{\sum_{(x, h) \in \mathcal{D}_k} h} \sum_{(x, h) \in \mathcal{D}_k, x<z}h$

The $\epsilon$ in the approximate algorithm then means finding a set of split points $\{s_{k1}, s_{k2}, \ldots , s_{kl}\}$ such that:

$|r_k({s_{k, j}}) - r_k (s_{k, j+1})| < \epsilon$

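$\epsilon$ thus controls the number of candidate points: the smaller $\epsilon$ is, the more split points there are, roughly $1 / \epsilon$ of them. The reason for weighting by the second-order gradient comes from the earlier objective: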

$\begin{aligned} \tilde{\mathcal{L}}^{(t)} &= \sum_{i=1}^n \left[ g_i f_t(\mathrm{x}_i) + \frac12 h_i f_t^2(\mathrm{x}_i) \right] + \Omega(f_t) \\ &= \sum_{i=1}^n \frac 12 h_i \left(f_t(\mathrm{x}_i) + \frac{g_i}{h_i}\right)^2 + \Omega(f_t) + \text{constant} \end{aligned}$

The objective can therefore be viewed as a weighted squared loss with labels $-g_i / h_i$ and weights $h_i$, which is why $h_i$ is used to compute the split points. For the implementation of Weighted Quantile Sketch, the paper introduces a new data structure, described in its appendix for interested readers.
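The role of the rank function can be illustrated with a deliberately simple, in-memory selection of candidates. This is not the paper's sketch data structure and carries none of its guarantees: candidate_splits is a hypothetical helper that adds a new candidate whenever the weighted rank has advanced by at least $\epsilon$ since the last one, so consecutive candidates end up roughly $\epsilon$ apart.

```python
def rank(z, x, h):
    """r_k(z): share of the total hessian weight carried by samples with x < z."""
    return sum(hi for xi, hi in zip(x, h) if xi < z) / sum(h)


def candidate_splits(x, h, eps):
    """Pick candidate split points spaced about eps apart in weighted rank."""
    values = sorted(set(x))
    candidates = [values[0]]
    for v in values[1:]:
        if rank(v, x, h) - rank(candidates[-1], x, h) >= eps:
            candidates.append(v)
    if candidates[-1] != values[-1]:
        candidates.append(values[-1])  # always keep the maximum as the last point
    return candidates


x = [1, 2, 3, 4, 5, 6, 7, 8]
print(candidate_splits(x, [1.0] * 8, eps=0.25))             # [1, 3, 5, 7, 8]
print(candidate_splits(x, [1, 1, 1, 1, 4, 4, 4, 4], 0.25))  # [1, 6, 8]
```

With uniform weights the candidates are ordinary quantiles. When the last four samples carry more weight, the first four together account for only 20% of the total, so no candidate is spent on separating them.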

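XGBoost has a dedicated optimization for sparse features. Sparsity may come from missing values or from one-hot encoding. Sparse entries can be handled specially: we could send them arbitrarily to the left or right subtree, or learn the assignment from the data.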

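In the sparsity-aware algorithm (Algorithm 3 in the paper), split points are enumerated only over samples whose value for the feature is present, while all missing values are sent together either to the left or to the right. Among both directions and all split points, the combination with the largest gain is chosen as the split.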
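The default-direction search can be sketched for one feature as below. It is a simplified illustration under stated assumptions: missing values are represented by None, split_with_missing is a hypothetical name, thresholds are midpoints, and $\gamma$ is omitted because it is the same for every candidate.

```python
def split_with_missing(x, g, h, lam=1.0):
    """Return (gain, threshold, default_direction) for a feature with None entries."""
    present = sorted(
        ((xi, gi, hi) for xi, gi, hi in zip(x, g, h) if xi is not None),
        key=lambda item: item[0],
    )
    G, H = sum(g), sum(h)
    G_miss = sum(gi for xi, gi in zip(x, g) if xi is None)
    H_miss = sum(hi for xi, hi in zip(x, h) if xi is None)

    def score(g_sum, h_sum):
        return g_sum * g_sum / (h_sum + lam)

    best = (float("-inf"), None, None)
    GL = HL = 0.0
    for (xi, gi, hi), (x_next, _, _) in zip(present, present[1:]):
        GL += gi
        HL += hi
        if xi == x_next:
            continue
        thr = (xi + x_next) / 2
        # Try both defaults: missing samples join the right child or the left child.
        for direction, gl, hl in (("right", GL, HL), ("left", GL + G_miss, HL + H_miss)):
            gain = 0.5 * (score(gl, hl) + score(G - gl, H - hl) - score(G, H))
            if gain > best[0]:
                best = (gain, thr, direction)
    return best


print(split_with_missing([1, 2, None, 4], [-1.0, -1.0, -1.0, 1.0], [1.0] * 4))
```

In this example the missing sample has a negative gradient like the two smallest values, so the best split sends missing values to the left at threshold 3.0.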