Sort Java String array by multiple numbers: answers

【Question title】: Sort Java String array by multiple numbers
【Posted】: 2014-10-23 17:17:19
【Question】:

I have a list of data in a txt file like this:

Date,Lat,Lon,Depth,Mag

20000101,34.6920,-116.3550,12.30,1.21
20000101,34.4420,-116.2280,7.32,1.01
20000101,37.4172,-121.7667,5.88,1.14
20000101,-41.1300,174.7600,27.00,1.90
20000101,37.6392,-119.0482,2.40,1.03
20000101,32.1790,-115.0730,6.00,2.44
20000101,59.7753,-152.2192,86.34,1.48
20000101,34.5230,-116.2410,11.63,1.61
20000101,59.5369,-153.1360,100.15,1.62
20000101,44.7357,-110.7932,4.96,2.20
20000101,34.6320,-116.2950,9.00,1.73
20000101,44.7370,-110.7938,5.32,1.75
20000101,35.7040,-117.6320,4.15,1.45
20000101,41.9270,20.5430,10.00,4.80

My task is to sort this data by each criterion, e.g. by date, latitude, and longitude.

I tried a bubble sort with a comparison like this:

if ( Double.parseDouble(a[0].split(",")[1]) <  Double.parseDouble(a[1].split(",")[1]))

This works, but it takes too much time.

The txt file contains 40,000 records.

Is there any other way to sort this data?

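【Question comments】:

Store each line in a class object, then define different comparators for a list of those objects. Use this as a reference: stackoverflow.com/questions/5245093/…
How about merge sort? It is O(n log n).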
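The first comment's idea, parsing each line once and then swapping comparators per criterion, can be sketched as follows. This is only a minimal illustration under assumptions: `ColumnSortSketch`, `parseRow`, `byColumn`, and `sortBy` are hypothetical names, every column is read as a `double` (the date 20000101 is small enough to be represented exactly), and the header line is assumed to have been removed already.

```java
import java.util.Arrays;
import java.util.Comparator;

class ColumnSortSketch {

    // Hypothetical helper: parse one CSV line (Date,Lat,Lon,Depth,Mag) into numeric columns, once.
    static double[] parseRow(String line) {
        String[] parts = line.split(",");
        double[] row = new double[parts.length];
        for (int i = 0; i < parts.length; i++) {
            row[i] = Double.parseDouble(parts[i].trim());
        }
        return row;
    }

    // One comparator per criterion: 0 = date, 1 = latitude, 2 = longitude, 3 = depth, 4 = magnitude.
    static Comparator<double[]> byColumn(int column) {
        return Comparator.comparingDouble(row -> row[column]);
    }

    // Parse every line once, then let Arrays.sort do the O(n log n) work.
    static double[][] sortBy(String[] lines, Comparator<double[]> order) {
        double[][] rows = Arrays.stream(lines)
                .map(ColumnSortSketch::parseRow)
                .toArray(double[][]::new);
        Arrays.sort(rows, order);
        return rows;
    }

    public static void main(String[] args) {
        String[] lines = {
            "20000101,34.6920,-116.3550,12.30,1.21",
            "20000101,-41.1300,174.7600,27.00,1.90",
            "20000101,34.4420,-116.2280,7.32,1.01"
        };
        // Sort by date, then latitude, then longitude.
        Comparator<double[]> order = byColumn(0).thenComparing(byColumn(1)).thenComparing(byColumn(2));
        for (double[] row : sortBy(lines, order)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Parsing once up front matters here: the comparison in the question calls `split` and `parseDouble` on every comparison, which repeats the same work many times per line.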

Tags: java arrays string

【Solution 1】:

Try merge sort. Merge sort's worst-case running time is O(n log n), whereas bubble sort's worst case is O(n^2).
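For reference, a top-down merge sort might look roughly like the sketch below. `MergeSortSketch` and `byLatitude` are hypothetical names, and the comparator reuses the parse-on-compare approach from the question purely for brevity. Note that the JDK's own `Arrays.sort` and `Collections.sort` for objects are already documented as a stable merge sort variant, so hand-writing one is mainly useful as an exercise.

```java
import java.util.Arrays;
import java.util.Comparator;

class MergeSortSketch {

    // Sorts the array in place: split in half, sort each half, then merge the halves.
    static <T> void mergeSort(T[] items, Comparator<? super T> order) {
        if (items.length < 2) {
            return; // Zero or one element is already sorted.
        }
        T[] left = Arrays.copyOfRange(items, 0, items.length / 2);
        T[] right = Arrays.copyOfRange(items, items.length / 2, items.length);
        mergeSort(left, order);
        mergeSort(right, order);

        int i = 0, j = 0, k = 0;
        while (i < left.length && j < right.length) {
            // "<= 0" takes from the left half on ties, which keeps the sort stable.
            items[k++] = order.compare(left[i], right[j]) <= 0 ? left[i++] : right[j++];
        }
        while (i < left.length) {
            items[k++] = left[i++];
        }
        while (j < right.length) {
            items[k++] = right[j++];
        }
    }

    // Hypothetical comparator: order raw CSV lines by their latitude column (index 1).
    static Comparator<String> byLatitude() {
        return Comparator.comparingDouble(line -> Double.parseDouble(line.split(",")[1]));
    }

    public static void main(String[] args) {
        String[] lines = {
            "20000101,34.6920,-116.3550,12.30,1.21",
            "20000101,-41.1300,174.7600,27.00,1.90",
            "20000101,34.4420,-116.2280,7.32,1.01"
        };
        mergeSort(lines, byLatitude());
        for (String line : lines) {
            System.out.println(line);
        }
    }
}
```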

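【Comments】:

In case you don't know what this notation means, it is Big O notation. O(n log n) grows more slowly than O(n^2), which means that in the worst case merge sort will perform better than bubble sort on large inputs. For more information, see: en.wikipedia.org/wiki/Big_O_notation#Orders_of_common_functions (this will matter to you in the future).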
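To make the difference concrete for the 40,000 records in the question, the rough arithmetic can be sketched like this. `GrowthSketch` and its methods are hypothetical names, and the figures ignore the constant factors that Big O notation hides, so treat them as orders of magnitude rather than measured comparison counts.

```java
class GrowthSketch {

    // n^2: the order of growth of bubble sort's worst case.
    static long quadratic(long n) {
        return n * n;
    }

    // n * log2(n): the order of growth of merge sort's worst case.
    static long linearithmic(long n) {
        return Math.round(n * (Math.log(n) / Math.log(2)));
    }

    public static void main(String[] args) {
        long n = 40_000;
        System.out.println("n^2       = " + quadratic(n));    // 1600000000
        System.out.println("n log2(n) = " + linearithmic(n)); // on the order of 600 thousand
    }
}
```

For n = 40,000 that is about 1.6 billion steps against roughly 600 thousand, a gap of more than three orders of magnitude.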

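【Solution 2】:

I may be ruining some student's homework assignment, but here goes…

As suggested in the comments on the question, the natural way in Java is to create a class to represent your data. Then implement a Comparator to pass to the utility method Collections.sort.

On my MacBook Pro 2.3 GHz Intel Core i7, running a Parallels virtual machine with Java 8, sorting a data set of 42,000 elements takes 45-90 milliseconds.

I changed your example data to make it more interesting, introducing some different dates and duplicate latitudes.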

20000101,34.6920,-116.3550,12.30,1.21
20000101,34.4420,-116.2280,7.32,1.01
20000101,34.6920,-121.7667,5.88,1.14
20000101,-41.1300,174.7600,27.00,1.90
20000101,37.6392,-119.0482,2.40,1.03
20000101,32.1790,-115.0730,6.00,2.44
20000101,34.6920,-152.2192,86.34,1.48
20000102,34.6320,-116.2410,11.63,1.61
20000102,59.5369,-153.1360,100.15,1.62
20000102,44.7357,-110.7932,4.96,2.20
20000102,34.6320,-116.2950,9.00,1.73
20000102,34.6320,-110.7938,5.32,1.75
20000102,34.6320,-117.6320,4.15,1.45
20000102,41.9270,20.5430,10.00,4.80

My GeoReading class to represent the data.

import java.math.BigDecimal;
import java.util.Arrays;
import java.util.List;
import org.joda.time.LocalDate;                   // Joda-Time library, not java.time.
import org.joda.time.format.ISODateTimeFormat;

class GeoReading
{

    LocalDate localDate = null;
    BigDecimal latitude = null;
    BigDecimal longitude = null;
    BigDecimal depth = null;
    BigDecimal magnitude = null;

    public GeoReading( String arg )
    {
        // String is comma-separated values of: Date,Lat,Lon,Depth,Mag
        // The regex splits on commas and swallows surrounding whitespace. Explained here: http://stackoverflow.com/a/7488676/642706
        List<String> items = Arrays.asList( arg.split( "\\s*,\\s*" ) );
        this.localDate = ISODateTimeFormat.basicDate().parseLocalDate( items.get( 0 ) ); // Parses the yyyyMMdd form, e.g. 20000101.
        this.latitude = new BigDecimal( items.get( 1 ) );
        this.longitude = new BigDecimal( items.get( 2 ) );
        this.depth = new BigDecimal( items.get( 3 ) );
        this.magnitude = new BigDecimal( items.get( 4 ) );
    }

    @Override
    public String toString()
    {
        return "GeoReading{" + "localDate=" + localDate + ", latitude=" + latitude + ", longitude=" + longitude + ", depth=" + depth + ", magnitude=" + magnitude + '}';
    }

}

Here is the comparator implementation.

import java.util.Comparator;

// Orders readings ascending by date, then latitude, then longitude.
class GeoReadingAscendingComparator implements Comparator<GeoReading>
{

    @Override
    public int compare( GeoReading o1 , GeoReading o2 )
    {
        int localDateCompare = o1.localDate.compareTo( o2.localDate );
        if ( localDateCompare != 0 ) { // Dates differ, so the date alone decides the order.
            return localDateCompare;
        }

        int latitudeCompare = o1.latitude.compareTo( o2.latitude );
        if ( latitudeCompare != 0 ) { // Same date but latitudes differ, so latitude decides.
            return latitudeCompare;
        }

        // Same date and latitude: fall back to longitude.
        return o1.longitude.compareTo( o2.longitude );

    }
}
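Since the answer runs on Java 8, the same date, latitude, longitude ordering could also be expressed with `Comparator.comparing` and `thenComparing` instead of a hand-written class. The sketch below is an alternative, not the answer's own code: `Reading` is a hypothetical stand-in for GeoReading that keeps only the three sort keys, and it swaps Joda-Time for `java.time.LocalDate` so the example has no external dependency.

```java
import java.math.BigDecimal;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for GeoReading, holding only the three sort keys.
class Reading {
    private final LocalDate date;
    private final BigDecimal latitude;
    private final BigDecimal longitude;

    Reading(String line) {
        String[] items = line.split("\\s*,\\s*");
        this.date = LocalDate.parse(items[0], DateTimeFormatter.BASIC_ISO_DATE); // yyyyMMdd
        this.latitude = new BigDecimal(items[1]);
        this.longitude = new BigDecimal(items[2]);
    }

    LocalDate getDate() { return date; }
    BigDecimal getLatitude() { return latitude; }
    BigDecimal getLongitude() { return longitude; }

    @Override
    public String toString() {
        return date + " " + latitude + " " + longitude;
    }
}

class ChainedComparatorSketch {

    // Each thenComparing is consulted only when all earlier keys compare as equal.
    static final Comparator<Reading> BY_DATE_LAT_LON =
            Comparator.comparing(Reading::getDate)
                    .thenComparing(Reading::getLatitude)
                    .thenComparing(Reading::getLongitude);

    // Returns a sorted copy and leaves the input list untouched.
    static List<Reading> sortedCopy(List<Reading> readings) {
        List<Reading> copy = new ArrayList<>(readings);
        copy.sort(BY_DATE_LAT_LON);
        return copy;
    }

    public static void main(String[] args) {
        List<Reading> readings = new ArrayList<>();
        readings.add(new Reading("20000102,34.6320,-116.2410,11.63,1.61"));
        readings.add(new Reading("20000101,34.6920,-121.7667,5.88,1.14"));
        readings.add(new Reading("20000101,34.6920,-152.2192,86.34,1.48"));
        readings.add(new Reading("20000101,-41.1300,174.7600,27.00,1.90"));
        sortedCopy(readings).forEach(System.out::println);
    }
}
```

Other criteria, such as depth or magnitude, would then only need another one-line comparator rather than another class.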

The main code.

// Imports this snippet relies on (the snippet itself belongs inside a method).
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.stream.Stream;

Path path = Paths.get( "/Users/basil/lat-lon.txt" );  // Path for Mac OS X.
List<GeoReading> list = new ArrayList<>();
// The file must hold data rows only: a "Date,Lat,Lon,Depth,Mag" header line would fail to parse.
try ( Stream<String> lines = Files.lines( path ) ) { // try-with-resources closes the file when done.
    lines.forEach( line -> list.add( new GeoReading( line ) ) );
    // Take those 14 lines and multiply to simulate large text file. 14 * 3,000 = 42,000.
    int count = 3000;
    List<GeoReading> bigList = new ArrayList<>( list.size() * count ); // Initialize capacity to expected number of elements.
    for ( int i = 0 ; i < count ; i++ ) {
        bigList.addAll( list );
    }
    long start = System.nanoTime();
    Collections.sort( bigList , new GeoReadingAscendingComparator() );
    long elapsed = ( System.nanoTime() - start );
    System.out.println( "Done sorting the GeoReading list. Sorting " + bigList.size() + " took: " + TimeUnit.MILLISECONDS.convert( elapsed , TimeUnit.NANOSECONDS ) + " ms ( " + elapsed + " nanos )." );

    System.out.println( "Dump…" );
    for ( GeoReading g : bigList ) {
        System.out.println( g );
    }
} catch ( IOException ex ) {
    System.out.println( "ERROR - ex: " + ex );
}

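In the real world, I would add some defensive programming code to validate the incoming data. Data from external sources is always flawed and/or changing.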
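As one possible shape for that validation, the sketch below filters lines before any object is built from them. Everything in it is an assumption for illustration: the names `DefensiveParseSketch`, `isValid`, and `keepValid`, the choice to silently skip bad lines rather than report them, and the range checks (latitude within ±90, longitude within ±180). It uses `java.time` rather than Joda-Time to stay dependency-free.

```java
import java.math.BigDecimal;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.ArrayList;
import java.util.List;

class DefensiveParseSketch {

    private static final BigDecimal MAX_LATITUDE = new BigDecimal("90");
    private static final BigDecimal MAX_LONGITUDE = new BigDecimal("180");

    // True only if the line has exactly five fields, a parseable yyyyMMdd date,
    // four parseable numbers, and coordinates within their geographic ranges.
    static boolean isValid(String line) {
        if (line == null) {
            return false;
        }
        String[] items = line.trim().split("\\s*,\\s*");
        if (items.length != 5) {
            return false;
        }
        try {
            LocalDate.parse(items[0], DateTimeFormatter.BASIC_ISO_DATE);
            BigDecimal latitude = new BigDecimal(items[1]);
            BigDecimal longitude = new BigDecimal(items[2]);
            new BigDecimal(items[3]); // Depth: only checked for being numeric.
            new BigDecimal(items[4]); // Magnitude: only checked for being numeric.
            return latitude.abs().compareTo(MAX_LATITUDE) <= 0
                    && longitude.abs().compareTo(MAX_LONGITUDE) <= 0;
        } catch (DateTimeParseException | NumberFormatException ex) {
            return false; // Header lines such as "Date,Lat,Lon,Depth,Mag" end up here.
        }
    }

    // Keeps the valid lines in their original order and drops the rest.
    static List<String> keepValid(List<String> lines) {
        List<String> valid = new ArrayList<>();
        for (String line : lines) {
            if (isValid(line)) {
                valid.add(line);
            }
        }
        return valid;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        lines.add("Date,Lat,Lon,Depth,Mag");
        lines.add("20000101,34.6920,-116.3550,12.30,1.21");
        lines.add("20000101,34.4420,oops,7.32,1.01");
        System.out.println(keepValid(lines));
    }
}
```

In a real import, logging or counting the rejected lines would probably be preferable to dropping them silently.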

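【Comments】:

Why the BigXXX fields? Why not use doubles and ints?
@user949300 Accuracy > performance. I habitually use BigDecimal in my daily projects for anything that needs accuracy. If this scientific data can tolerate the inaccuracy of floating-point drift, then go for it. It may be faster, but with a total execution time under one second I consider that premature optimization.
The OP has 40K records with 5 fields each, so using BigXXX means 200K objects. An object is, very roughly, 10 bytes more than a primitive, so that is 2 MB. Not that much nowadays, I guess, but it is something.
@user949300 A BigDecimal is at least 2-3 times bigger than just the minimal object overhead you mentioned. But that is still not a lot of memory on most machines. If memory were an issue, I would move the data into a database such as Postgres. So, again, this is not a matter of preference but of need. If you are sure floating-point inaccuracy is tolerable, then use float (32-bit) or double (64-bit), which is better for both memory and speed. If you need accuracy, always use BigDecimal.
For those who don't know, read up on how floating-point number technology trades away accuracy for performance (speed of execution). The floating-point primitive data types in Java are float (32-bit) and double (64-bit).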
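The trade-off discussed in these comments can be seen in a tiny example. `FloatingPointSketch` and its methods are hypothetical names. The sketch only demonstrates inexact arithmetic: ordering values parsed from text, as in this question, is deterministic with either type, so the difference mostly matters once the values are added, scaled, or compared after computation.

```java
import java.math.BigDecimal;

class FloatingPointSketch {

    // Binary floating point cannot represent 0.1, 0.2, or 0.3 exactly, so the sum drifts.
    static boolean doubleSumIsExact() {
        return 0.1 + 0.2 == 0.3;
    }

    // BigDecimal built from strings keeps the decimal digits exactly as written.
    static boolean decimalSumIsExact() {
        BigDecimal sum = new BigDecimal("0.1").add(new BigDecimal("0.2"));
        return sum.compareTo(new BigDecimal("0.3")) == 0;
    }

    public static void main(String[] args) {
        System.out.println(0.1 + 0.2);           // prints 0.30000000000000004
        System.out.println(doubleSumIsExact());  // prints false
        System.out.println(decimalSumIsExact()); // prints true
    }
}
```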