【Question Title】: HBase shell scan bytes to string conversion
【Posted】: 2026-02-03 19:10:01
【Question】:

I want to scan an HBase table and display integers as strings (not as their binary representation). I can do the conversion, but I don't know how to write the scan statement using the Java API from the HBase shell:

org.apache.hadoop.hbase.util.Bytes.toString(
  "\x48\x65\x6c\x6c\x6f\x20\x48\x42\x61\x73\x65".to_java_bytes)

org.apache.hadoop.hbase.util.Bytes.toString("Hello HBase".to_java_bytes)

I would appreciate a scan example that fetches binary data (longs) and outputs plain strings. I am using the HBase shell, not Java.

【Comments】:

    Tags: jruby hbase


    【Solution 1】:

    HBase stores data as untyped byte arrays. So when you scan a table, the data is displayed in a common format (an escaped hex string), e.g.:
    "\x48\x65\x6c\x6c\x6f\x20\x48\x42\x61\x73\x65" -> Hello HBase
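The escaped form above is plain Ruby string syntax, so you can verify the decoding in any Ruby (or JRuby) session without HBase at all:

```ruby
# The hex-escaped string the shell prints is a valid Ruby string literal,
# so decoding it is just evaluating the literal.
escaped = "\x48\x65\x6c\x6c\x6f\x20\x48\x42\x61\x73\x65"
puts escaped  # => Hello HBase
```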

    If you want to get the typed values back out of the serialized byte arrays, you have to do it manually. You have the following options:

    • Java code (Bytes.toString(...))
    • Hack the to_string function in $HBASE_HOME/lib/ruby/hbase/table.rb: for non-meta tables, replace toStringBinary with toInt
    • Write a get/scan JRuby function that converts the byte arrays to the appropriate types

    Since you want the HBase shell, consider the last option.
    Create a file get_result.rb:

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.HTable
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Result;
    import java.util.ArrayList;
    
    # Simple function equivalent to scan 'test', {COLUMNS => 'c:c2'}
    def get_result()
      htable = HTable.new(HBaseConfiguration.new, "test")
      rs = htable.getScanner(Bytes.toBytes("c"), Bytes.toBytes("c2"))
      output = ArrayList.new
  output.add "ROW\t\t\t\t\t\tCOLUMN+CELL"
      rs.each { |r| 
        r.raw.each { |kv|
          row = Bytes.toString(kv.getRow)
          fam = Bytes.toString(kv.getFamily)
          ql = Bytes.toString(kv.getQualifier)
          ts = kv.getTimestamp
          val = Bytes.toInt(kv.getValue)
          output.add " #{row} \t\t\t\t\t\t column=#{fam}:#{ql}, timestamp=#{ts}, value=#{val}"
        }
      }
      output.each {|line| puts "#{line}\n"}
    end
    

    Load and use it in the HBase shell:

    require '/path/to/get_result'
    get_result
    

    Note: modify/extend/fix the code as needed.
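Since the question mentions longs: Bytes.toLong reads eight big-endian bytes, while the Bytes.toInt used above reads four. The same decoding can be sketched in plain Ruby with pack/unpack (hypothetical byte values, just for illustration; note that "Q>"/"L>" are unsigned, whereas HBase's decoders are signed, which only matters for negative values):

```ruby
# Big-endian decoding as done by HBase's Bytes.toLong / Bytes.toInt,
# sketched in plain Ruby.
long_bytes = [0, 0, 0, 0, 0, 0, 0, 42].pack("C*")  # 8 bytes, big-endian 42
int_bytes  = [0, 0, 0, 42].pack("C*")              # 4 bytes, big-endian 42

puts long_bytes.unpack1("Q>")  # => 42, like Bytes.toLong
puts int_bytes.unpack1("L>")   # => 42, like Bytes.toInt
```

So if your cells hold longs rather than ints, swap Bytes.toInt for Bytes.toLong in the function above.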

    【Discussion】:

      【Solution 2】:

      For completeness: it turns out that calling Bytes::toStringBinary gives the hex-escaped sequence you see in the HBase shell:

      \x0B\x2_SOME_ASCII_TEXT_\x10\x00...

      Bytes::toString, on the other hand, tries to deserialize the bytes as a string assumed to be UTF-8, which looks more like:

      \u8900\u0710\u0115\u0320\u0000_SOME_UTF8_TEXT_\u4009...
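The toStringBinary behavior (printable ASCII kept, everything else escaped as \xNN) can be approximated in plain Ruby; this is a sketch, not HBase's exact implementation (for instance, the real method also escapes the backslash itself):

```ruby
# Rough plain-Ruby approximation of HBase's Bytes.toStringBinary:
# keep printable ASCII bytes, escape everything else as \xNN (uppercase hex).
def to_string_binary(bytes)
  bytes.each_byte.map { |b|
    (32..126).cover?(b) ? b.chr : format("\\x%02X", b)
  }.join
end

puts to_string_binary("\x0BHi\x10\x00")  # => \x0BHi\x10\x00
```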

      【Discussion】:

        【Solution 3】:

        You can add a scan_counter command to the HBase shell.

        First:

        Add to /usr/lib/hbase/lib/ruby/hbase/table.rb (after the scan function):

        #----------------------------------------------------------------------------------------------
  # Scans the whole table or a range of keys and returns rows matching specific criteria, with values rendered as numbers
          def scan_counter(args = {})
            unless args.kind_of?(Hash)
              raise ArgumentError, "Arguments should be a hash. Failed to parse #{args.inspect}, #{args.class}"
            end
        
            limit = args.delete("LIMIT") || -1
            maxlength = args.delete("MAXLENGTH") || -1
        
            if args.any?
              filter = args["FILTER"]
              startrow = args["STARTROW"] || ''
              stoprow = args["STOPROW"]
              timestamp = args["TIMESTAMP"]
              columns = args["COLUMNS"] || args["COLUMN"] || get_all_columns
              cache = args["CACHE_BLOCKS"] || true
              versions = args["VERSIONS"] || 1
      timerange = args["TIMERANGE"]
        
              # Normalize column names
              columns = [columns] if columns.class == String
              unless columns.kind_of?(Array)
                raise ArgumentError.new("COLUMNS must be specified as a String or an Array")
              end
        
              scan = if stoprow
                org.apache.hadoop.hbase.client.Scan.new(startrow.to_java_bytes, stoprow.to_java_bytes)
              else
                org.apache.hadoop.hbase.client.Scan.new(startrow.to_java_bytes)
              end
        
              columns.each { |c| scan.addColumns(c) }
              scan.setFilter(filter) if filter
              scan.setTimeStamp(timestamp) if timestamp
              scan.setCacheBlocks(cache)
              scan.setMaxVersions(versions) if versions > 1
              scan.setTimeRange(timerange[0], timerange[1]) if timerange
            else
              scan = org.apache.hadoop.hbase.client.Scan.new
            end
        
            # Start the scanner
            scanner = @table.getScanner(scan)
            count = 0
            res = {}
            iter = scanner.iterator
        
            # Iterate results
            while iter.hasNext
              if limit > 0 && count >= limit
                break
              end
        
              row = iter.next
              key = org.apache.hadoop.hbase.util.Bytes::toStringBinary(row.getRow)
        
              row.list.each do |kv|
                family = String.from_java_bytes(kv.getFamily)
                qualifier = org.apache.hadoop.hbase.util.Bytes::toStringBinary(kv.getQualifier)
        
                column = "#{family}:#{qualifier}"
                cell = to_string_scan_counter(column, kv, maxlength)
        
                if block_given?
                  yield(key, "column=#{column}, #{cell}")
                else
                  res[key] ||= {}
                  res[key][column] = cell
                end
              end
        
              # One more row processed
              count += 1
            end
        
            return ((block_given?) ? count : res)
          end
        
          #----------------------------------------------------------------------------------------
          # Helper methods
        
          # Returns a list of column names in the table
          def get_all_columns
            @table.table_descriptor.getFamilies.map do |family|
              "#{family.getNameAsString}:"
            end
          end
        
          # Checks if current table is one of the 'meta' tables
          def is_meta_table?
            tn = @table.table_name
            org.apache.hadoop.hbase.util.Bytes.equals(tn, org.apache.hadoop.hbase.HConstants::META_TABLE_NAME) || org.apache.hadoop.hbase.util.Bytes.equals(tn, org.apache.hadoop.hbase.HConstants::ROOT_TABLE_NAME)
          end
        
          # Returns family and (when has it) qualifier for a column name
          def parse_column_name(column)
            split = org.apache.hadoop.hbase.KeyValue.parseColumn(column.to_java_bytes)
            return split[0], (split.length > 1) ? split[1] : nil
          end
        
          # Make a String of the passed kv
          # Intercept cells whose format we know such as the info:regioninfo in .META.
          def to_string(column, kv, maxlength = -1)
            if is_meta_table?
              if column == 'info:regioninfo' or column == 'info:splitA' or column == 'info:splitB'
                hri = org.apache.hadoop.hbase.util.Writables.getHRegionInfoOrNull(kv.getValue)
                return "timestamp=%d, value=%s" % [kv.getTimestamp, hri.toString]
              end
              if column == 'info:serverstartcode'
                if kv.getValue.length > 0
                  str_val = org.apache.hadoop.hbase.util.Bytes.toLong(kv.getValue)
                else
                  str_val = org.apache.hadoop.hbase.util.Bytes.toStringBinary(kv.getValue)
                end
                return "timestamp=%d, value=%s" % [kv.getTimestamp, str_val]
              end
            end
        
            val = "timestamp=#{kv.getTimestamp}, value=#{org.apache.hadoop.hbase.util.Bytes::toStringBinary(kv.getValue)}"
            (maxlength != -1) ? val[0, maxlength] : val
          end
        
        
          def to_string_scan_counter(column, kv, maxlength = -1)
            if is_meta_table?
              if column == 'info:regioninfo' or column == 'info:splitA' or column == 'info:splitB'
                hri = org.apache.hadoop.hbase.util.Writables.getHRegionInfoOrNull(kv.getValue)
                return "timestamp=%d, value=%s" % [kv.getTimestamp, hri.toString]
              end
              if column == 'info:serverstartcode'
                if kv.getValue.length > 0
                  str_val = org.apache.hadoop.hbase.util.Bytes.toLong(kv.getValue)
                else
                  str_val = org.apache.hadoop.hbase.util.Bytes.toStringBinary(kv.getValue)
                end
                return "timestamp=%d, value=%s" % [kv.getTimestamp, str_val]
              end
            end
        
            val = "timestamp=#{kv.getTimestamp}, value=#{org.apache.hadoop.hbase.util.Bytes::toLong(kv.getValue)}"
            (maxlength != -1) ? val[0, maxlength] : val
          end
        

        Second:

        Add the following file, named scan_counter.rb, to /usr/lib/hbase/lib/ruby/shell/commands/:

        #
        # Copyright 2010 The Apache Software Foundation
        #
        # Licensed to the Apache Software Foundation (ASF) under one
        # or more contributor license agreements.  See the NOTICE file
        # distributed with this work for additional information
        # regarding copyright ownership.  The ASF licenses this file
        # to you under the Apache License, Version 2.0 (the
        # "License"); you may not use this file except in compliance
        # with the License.  You may obtain a copy of the License at
        #
        #     http://www.apache.org/licenses/LICENSE-2.0
        #
        # Unless required by applicable law or agreed to in writing, software
        # distributed under the License is distributed on an "AS IS" BASIS,
        # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
        # See the License for the specific language governing permissions and
        # limitations under the License.
        #
        
        module Shell
          module Commands
            class ScanCounter < Command
              def help
                return <<-EOF
        Scan a table with cell value that is long; pass table name and optionally a dictionary of scanner
        specifications.  Scanner specifications may include one or more of:
        TIMERANGE, FILTER, LIMIT, STARTROW, STOPROW, TIMESTAMP, MAXLENGTH,
        or COLUMNS. If no columns are specified, all columns will be scanned.
        To scan all members of a column family, leave the qualifier empty as in
        'col_family:'.
        
        Some examples:
        
          hbase> scan_counter '.META.'
          hbase> scan_counter '.META.', {COLUMNS => 'info:regioninfo'}
          hbase> scan_counter 't1', {COLUMNS => ['c1', 'c2'], LIMIT => 10, STARTROW => 'xyz'}
          hbase> scan_counter 't1', {FILTER => org.apache.hadoop.hbase.filter.ColumnPaginationFilter.new(1, 0)}
          hbase> scan_counter 't1', {COLUMNS => 'c1', TIMERANGE => [1303668804, 1303668904]}
        
        For experts, there is an additional option -- CACHE_BLOCKS -- which
        switches block caching for the scanner on (true) or off (false).  By
        default it is enabled.  Examples:
        
          hbase> scan_counter 't1', {COLUMNS => ['c1', 'c2'], CACHE_BLOCKS => false}
        EOF
              end
        
              def command(table, args = {})
                now = Time.now
                formatter.header(["ROW", "COLUMN+CELL"])
        
                count = table(table).scan_counter(args) do |row, cells|
                  formatter.row([ row, cells ])
                end
        
                formatter.footer(now, count)
              end
            end
          end
        end
        

        Finally:

        Add the scan_counter command to /usr/lib/hbase/lib/ruby/shell.rb.

        Replace the current command group with this (you can find it by searching for 'DATA MANIPULATION COMMANDS'):

        Shell.load_command_group(
          'dml',
          :full_name => 'DATA MANIPULATION COMMANDS',
          :commands => %w[
            count
            delete
            deleteall
            get
            get_counter
            incr
            put
            scan
            scan_counter
            truncate
          ]
        )
        

        【Discussion】:

        • The whole code is the same as the original scan function, and the only difference is that in to_string() the value is read as value=#{org.apache.hadoop.hbase.util.Bytes::toLong(kv.getValue)} instead of toStringBinary, right?
        • Yes. That is the only difference from table.rb's scan function in scan_counter.