The Evolution of java.lang.String

The Java 9 release

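1. String is one of the most commonly used classes in Java. Before Java 9, a String was always backed by a char[]. A char is a two-byte UTF-16 code unit, so it can hold Chinese, Japanese, Korean and other non-Latin characters directly, but if the data is plain English text, half of the storage is wasted. When you inspect a heap with jmap, char[] often turns out to be the type that occupies the most space. In that situation, calling intern() on strings places them in the string pool so that equal strings share one instance. This removes a lot of duplication, but it does not shrink the underlying storage of each distinct string. Java source files are commonly encoded as UTF-8. UTF-8 is a variable-length encoding of Unicode that can represent every Unicode character and is compatible with ASCII: an ASCII character takes one byte, and other characters take two to four bytes.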
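To make the size difference concrete, here is a minimal sketch that compares how many bytes the same text needs as UTF-16 code units (what a char[] holds) and as UTF-8, and shows that intern() returns one shared instance for equal strings. The class name StringFootprintDemo and its helper methods are made up for this illustration; only the standard library calls are real.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the JDK.
class StringFootprintDemo {

    // Bytes needed when every char is stored as a 2-byte UTF-16 code unit.
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    // Bytes needed when the same text is encoded as variable-length UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // intern() returns the canonical pooled instance for equal strings.
    static boolean sameAfterIntern(String a, String b) {
        return a.intern() == b.intern();
    }

    public static void main(String[] args) {
        String latin = "hello";
        String cjk = "\u4e2d\u6587"; // the two characters of "Chinese", written as escapes

        System.out.println(utf16Bytes(latin)); // 10
        System.out.println(utf8Bytes(latin));  // 5
        System.out.println(utf16Bytes(cjk));   // 4
        System.out.println(utf8Bytes(cjk));    // 6

        String a = new String("hello");
        String b = new String("hello");
        System.out.println(a == b);                // false: two separate objects
        System.out.println(sameAfterIntern(a, b)); // true: one pooled instance
    }
}
```

Note that intern() only removes duplicate instances; each distinct string still occupies its full backing array.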

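Question: if UTF-8 is itself a variable-length encoding, then for text that is mostly Latin characters nearly every character should be expected to take a single byte, and the wasted-space argument would seem to have no basis.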

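So I went to the OpenJDK website and found the description below. According to it, the change to String is not, as is commonly claimed online, about optimizing memory usage. It is about space reclamation in the G1 GC, and the explanation starts from compressed oops on 64-bit platforms.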

The OpenJDK description of the String change:

At dump time, a designated string space is allocated within the Java heap during heap initialization. Pointers to the interned String objects and their underlying char-array objects are modified, as if those objects are from the designated space, when writing out the interned string table and the String objects.

The string table is compressed and then stored in the archive at dump time. The compression technique for the string table is the same as for the shared symbol table (see JDK-8059510). The regular narrow oop encoding and decoding is used to access the shared String objects from the compressed-string table.

On 64-bit platforms with compressed oop pointers, the narrow oops are encoded using offsets (with or without scaling) from the narrow oop base. Currently there are four different encoding modes: 32-bit unscaled, zero based, disjoint heap based, and heap based. Depending on the heap size and the heap minimum base, an appropriate encoding mode is selected. The narrow-oop encoding mode (including the encoding shift) must be the same at both dump time and run time, so that the oop pointers within the shared string space remain valid at run time. The shared-string space can be considered relocatable, with restrictions, at runtime. It is not required to be mapped at the same address as at dump time, but it should be at the same offset from the narrow oop base at dump time and run time. The heap size is not required to be the same at dump time and run time, as long as the same encoding mode is used. The offset of the string space and the oop-encoding mode (and shift) should be stored in the archive for run-time validation. If the encoding mode changes, it will invalidate the encoding of the oop pointer to the char array from each shared String. In such cases the shared-string data is ignored while the rest of the shared data can still be used by the VM. A warning indicating that shared strings are not used due to incompatible GC configuration will be reported by the VM.
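The arithmetic behind "offset from the narrow oop base, with or without scaling" can be sketched as follows. This is a simplified model, not HotSpot code: the class NarrowOopSketch, its method names, and the example addresses are all assumptions chosen for illustration. It shows why the archived narrow oops stay valid when the string space is mapped at a different address but at the same offset from the base, and why a different shift invalidates them.

```java
// Simplified model of narrow-oop encoding; hypothetical, not the HotSpot implementation.
class NarrowOopSketch {

    // Encode a 64-bit address as a 32-bit offset from the base, scaled down by the shift.
    static int encode(long address, long base, int shift) {
        return (int) ((address - base) >>> shift);
    }

    // Decode by treating the narrow value as unsigned, scaling up, and adding the base.
    static long decode(int narrow, long base, int shift) {
        return base + ((narrow & 0xFFFFFFFFL) << shift);
    }

    public static void main(String[] args) {
        long dumpBase = 0x8_0000_0000L; // assumed narrow oop base at dump time
        long runBase  = 0x9_0000_0000L; // assumed (different) base at run time
        long offset   = 0x1000L;        // string object sits at the same offset from both bases
        int shift     = 3;              // 8-byte object alignment

        int atDump = encode(dumpBase + offset, dumpBase, shift);
        int atRun  = encode(runBase + offset, runBase, shift);
        System.out.println(atDump == atRun); // true: same offset and shift, same narrow oop

        // With a different shift the stored value no longer decodes to the right address.
        long wrong = decode(atDump, runBase, 0);
        System.out.println(wrong == runBase + offset); // false
    }
}
```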

At run time, the string space is mapped as part of the Java heap at the same offset from the oop encoding base as at dump time. The mapping starts at the lowest page-aligned address of the string space saved in the archive. The mapped string space contains the shared String and char-array objects. All G1 regions which overlap this mapped space will be marked as pinned; these G1 regions are unavailable for run-time allocation. There may be unused space wasted in a region that partially overlaps, but there should be at most one such region, at the end of the mapping. No patching is required for the oop pointers within the string space since the same narrow oop encoding is used. The shared-string space is writable, but the GC should not write to the oops in the space in order to preserve shareability across different processes. An application that attempts to lock one of these shared strings, and thus writes to the shared space, will get a private copy of the page, and therefore lose the benefit of sharing that particular page. Such cases are rare.

The shared-string table is distinct from the regular string table at runtime. Both tables are searched when looking up interned strings. The shared-string table is a read-only table at run time; no entries can be added or removed from it.
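The two-table lookup described above can be modelled with ordinary collections. The sketch below is only an analogy: TwoTableInternSketch is a made-up class, the maps stand in for the VM's native tables, and checking the shared table first is an assumption (the quoted text only says that both tables are searched).

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of a read-only shared table plus a regular, growable table.
class TwoTableInternSketch {
    private final Map<String, String> shared;                 // fixed at "dump time", read-only
    private final Map<String, String> regular = new HashMap<>(); // filled at run time

    TwoTableInternSketch(Map<String, String> archived) {
        this.shared = Collections.unmodifiableMap(new HashMap<>(archived));
    }

    // Search both tables; only the regular table can receive new entries.
    String intern(String s) {
        String hit = shared.get(s);
        if (hit != null) {
            return hit;
        }
        return regular.computeIfAbsent(s, k -> k);
    }

    int sharedSize() { return shared.size(); }
    int regularSize() { return regular.size(); }

    public static void main(String[] args) {
        Map<String, String> archived = new HashMap<>();
        String canonical = new String("java");
        archived.put(canonical, canonical);

        TwoTableInternSketch table = new TwoTableInternSketch(archived);
        System.out.println(table.intern(new String("java")) == canonical); // true: found in shared table
        System.out.println(table.regularSize());                           // 0: nothing was added

        String first = table.intern(new String("heap"));
        System.out.println(table.intern(new String("heap")) == first);     // true: found in regular table
        System.out.println(table.sharedSize());                            // 1: shared table unchanged
    }
}
```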

The G1 string-deduplication table is a separate hash table containing the char arrays for deduplication at runtime. When a string is interned and added to the StringTable, the string is deduplicated and the underlying char array is added to the deduplication table if it is not there already. The deduplication table is not stored into the archive. The deduplication table is populated during VM startup using the shared-string data. As an optimization, the work is done in the G1StringDedupThread (in G1StringDedupThread::run(), after initialize_in_thread()) to reduce startup time. The shared strings’ hash values are precomputed and stored in the strings at dump time to avoid the deduplication code writing the hash values at runtime.

The Goals section of the page states:

  • Reduce memory consumption by sharing the String objects and underlying char array objects amongst different JVM processes.
  • Only support shared strings for the G1 GC. Shared strings require a pinned region, and G1 is the only HotSpot GC that supports pinning.
  • Only support 64-bit platforms with compressed object and class pointers.
  • No significant degradation (< 2-3%) on startup time, string-lookup time, GC pause time, or runtime performance using the usual benchmarks.

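Conclusion: in the editor's view, this change to String is mainly about sharing strings between different processes, and the optimization ultimately rests on the mechanisms of the G1 GC. G1 is the default collector in JDK 9, the feature currently supports only G1, and compressed oops must be enabled (-XX:+UseCompressedOops).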
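To check whether a running JVM meets these preconditions, the flag values can be read through the HotSpot diagnostic MXBean. This is a minimal sketch that assumes a HotSpot-based JDK; the class VmFlagCheck and the "unknown" fallback are inventions for this example, and a flag that does not exist on a given VM is simply reported as unknown.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper that reads HotSpot VM flags at run time.
class VmFlagCheck {

    // Returns the flag's current value, or "unknown" if the VM does not expose it.
    static String flagValue(String name) {
        try {
            HotSpotDiagnosticMXBean bean =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            if (bean == null) {
                return "unknown";
            }
            return bean.getVMOption(name).getValue();
        } catch (RuntimeException e) {
            // Thrown, for example, when the flag does not exist on this VM.
            return "unknown";
        }
    }

    public static void main(String[] args) {
        System.out.println("UseG1GC = " + flagValue("UseG1GC"));
        System.out.println("UseCompressedOops = " + flagValue("UseCompressedOops"));
        System.out.println("UseCompressedClassPointers = " + flagValue("UseCompressedClassPointers"));
    }
}
```

Running it with, for example, `java -XX:+UseG1GC -XX:+UseCompressedOops VmFlagCheck` shows the values the VM actually selected.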