Bug 565707 - JGit Arithmetic Overflow when file system does not support Nanos resolution
Summary: JGit Arithmetic Overflow when file system does not support Nanos resolution
Status: RESOLVED FIXED
Alias: None
Product: JGit
Classification: Technology
Component: JGit (show other bugs)
Version: unspecified   Edit
Hardware: PC Windows 10
: P3 normal (vote)
Target Milestone: 5.9   Edit
Assignee: Project Inbox CLA
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-07-30 08:32 EDT by Randall Becker CLA
Modified: 2020-08-23 16:49 EDT (History)
2 users (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Randall Becker CLA 2020-07-30 08:32:47 EDT
We repeatedly get the following exception in Jenkins from JGit when many commits are compared. This results from lack of support for nanosecond resolution on the NonStop platform, and inability to perform operations with the UNDEFINED value in a Java long. Specifically, when a valid file stamp is converted to nanoseconds, then compared with another timestamp and one has an UNDEFINED value, the Java long range is exceeded. This seems like a pervasive issue in JGit. The multiplyExact, for example in the event/trace method appears to not deal with this condition. The situation is file system implementation specific and is not related to the JVM version.

Jenkins version: all
Jenkins Git Client version: 4.3.0
JGit Version: unknown but packaged with the Git Client.
JVM: 1.8.0_45 - 1.8.0.251
Platform HPE NonStop TNS/X (L19.20), and HPE NonStop TNS/E (J06.22)
Likelihood: Always

2020-07-30 12:06:27.004+0000 [id=99]    SEVERE  o.e.j.u.FS$FileStoreAttributes#lambda$3: java.lang.ArithmeticException: long overflow
java.lang.ArithmeticException: long overflow
        at java.lang.Math.multiplyExact(Math.java:1052)
        at java.time.Duration.toNanos(Duration.java:1186)
        at org.eclipse.jgit.util.FS$FileStoreAttributes.lambda$1(FS.java:363)
        at org.eclipse.jgit.util.FS$FileStoreAttributes$$Lambda$245/120256761.get(Unknown Source)
        at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1582)
Caused: java.util.concurrent.CompletionException
        at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273)
        at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280)
        at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1584)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Comment 1 Thomas Wolf CLA 2020-08-05 07:40:46 EDT
I don't quite understand the description. JGit runs well on FAT (2 seconds file timestamp resolution), and also on Mac with Java 8 (1 second). OK, you _did_ write it was file system specific.

From code inspection I don't see where we would ever compare FileStoreAttributes.UNDEFINED_DURATION < or > something.

The only other reason for getting that exception shown in your stack trace would be measureFsTimestampResolution returning a crazy value.

Duration.toNanos() should not overflow unless the duration is > 9223372036 seconds.

So what UNDEFINED value do you mean? Is this some bogus value that this nonstop platform's filesystem returns if one has called Files.setLastModifiedTime() with some nanosecond value?
Comment 2 Randall Becker CLA 2020-08-05 10:44:44 EDT
The problem comes down to:

private static final Duration UNDEFINED_DURATION = Duration.ofNanos(Long.MAX_VALUE);

When this value is set to any Duration, and even 1 nano is added to the duration, this will cause an arithmetic overflow. What appears is that nano-second resolution on the platform is not supported, so Duration ends up with UNDEFINED_DURATION, so when multiplyExact is called, UNDEFINED_DIRECTION * anything non-zero will overflow. That seems to be the case here.
Comment 3 Thomas Wolf CLA 2020-08-06 05:13:03 EDT
(In reply to Randall Becker from comment #2)
> The problem comes down to:
> 
> private static final Duration UNDEFINED_DURATION =
> Duration.ofNanos(Long.MAX_VALUE);
> 
> When this value is set to any Duration, and even 1 nano is added to the
> duration, this will cause an arithmetic overflow. What appears is that
> nano-second resolution on the platform is not supported, so Duration ends up
> with UNDEFINED_DURATION, so when multiplyExact is called,
> UNDEFINED_DIRECTION * anything non-zero will overflow. That seems to be the
> case here.

UNDEFINED_DURATION is used only as a default marker when reading the .jgitconfig entry. I don't see anything added to it anywhere or toNanos() called on it; only two equals comparisons.

The stack trace shows an exception after measureFsTimestampResolution(), so the measurement must have produced a ridiculously high (or negative?) value, but I don't see from code inspection how this could happen unless the file system returned bogus file modification times.

Can you try the main loop of that measurement on your system?

Files.createFile(probe);
FileTime t1 = Files.getLastModifiedTime(probe); 
FileTime t2 = t1;
Instant t1i = t1.toInstant();
for (long i = 1; t2.compareTo(t1) <= 0; i += 1 + i / 20) {
	Files.setLastModifiedTime(probe,
			FileTime.from(t1i.plusNanos(i * 1000)));
	t2 = Files.getLastModifiedTime(probe); // print this
}
Duration fsResolution = Duration.between(t1.toInstant(), t2.toInstant());
Duration clockResolution = measureClockResolution();
fsResolution = fsResolution.plus(clockResolution);

Somewhere in this code one of t1, t2, fsResolution or clockResolution must produce a very large value.
Comment 4 Randall Becker CLA 2020-08-07 11:31:27 EDT
I am working on getting testing time for this code snippet and will report back shortly.
Comment 5 Randall Becker CLA 2020-08-11 17:10:51 EDT
This might explain it:

t1=2020-08-11T21:09:12Z
t1i=2020-08-11T21:09:12Z
t2.compareTo(t1)=0 (at loop start)
t1i.plusNanos(i * 1000)=2020-08-11T21:09:12.000001Z(1000 nanos)
t2=1952-08-24T02:04:48Z
t2i=1952-08-24T02:04:48Z
fsResolution=PT-595795H-4M-24S(-2144862264000000000 nanos)

So the Files.setLastModifiedTime(Instant) appears to be malfunctioning.
Comment 6 Randall Becker CLA 2020-08-11 17:29:15 EDT
I have verified that setLastModified(long) works. The underlying file system resolution is in microseconds, although setLastModified(long) is in milliseconds.
Comment 7 Randall Becker CLA 2020-08-11 17:42:26 EDT
I have run the same test on z/OS USS R2.4. The resulting fsResolution is 0 although the setLastModifiedTime call works but truncates the nanosecond resolution off. The issue is in all JVMs I can find on both platforms.
Comment 8 Matthias Sohn CLA 2020-08-12 03:42:00 EDT
(In reply to Randall Becker from comment #5)
> This might explain it:
> 
> t1=2020-08-11T21:09:12Z
> t1i=2020-08-11T21:09:12Z
> t2.compareTo(t1)=0 (at loop start)
> t1i.plusNanos(i * 1000)=2020-08-11T21:09:12.000001Z(1000 nanos)
> t2=1952-08-24T02:04:48Z
> t2i=1952-08-24T02:04:48Z
> fsResolution=PT-595795H-4M-24S(-2144862264000000000 nanos)
> 
> So the Files.setLastModifiedTime(Instant) appears to be malfunctioning.

report this bug for the corresponding JVM port ?
Did you also try Java 11 or higher ?
Comment 9 Thomas Wolf CLA 2020-08-12 07:09:08 EDT
(In reply to Randall Becker from comment #5)
> This might explain it:
> 
> t1=2020-08-11T21:09:12Z
> t1i=2020-08-11T21:09:12Z
> t2.compareTo(t1)=0 (at loop start)
> t1i.plusNanos(i * 1000)=2020-08-11T21:09:12.000001Z(1000 nanos)
> t2=1952-08-24T02:04:48Z
> t2i=1952-08-24T02:04:48Z
> fsResolution=PT-595795H-4M-24S(-2144862264000000000 nanos)
> 
> So the Files.setLastModifiedTime(Instant) appears to be malfunctioning.

That is... interesting. A FileTime before Instant.EPOCH, and at the same time t2.compareTo(t1) <= 0 is false. (Otherwise that loop would not terminate!)

AWhy would fsResolution come out as zero on z/OS? That should only be possible if we lose resolution when converting from FileTime to Instant. Now the _range_ of FileTime is bigger than that of Instant, but the conversion should not lose precision.
Comment 10 Randall Becker CLA 2020-08-12 07:56:37 EDT
It is correct that the loop does not terminate. I had to put a break in it.

On z/OS, the timestamp is truncated to milliseconds, so setting the time does not actually change the time and so fsResolution calculates as zero. I do not know whether the actual usable resolution is better than 1s.

On NonStop, the timestamp is truncated to microseconds, but the JVM defect is that setLastModifiedTime(Instant) is setting the value incorrectly. I have opened a case with HPE. This will probably not get fixed on their TNS/E platform, which is bad, and it could take months to get a fix for TNS/X. Using setLastModifiedTime(long) would work as the Instant value is correct - although it would truncate. I do not think there is a working exposed interface in Java to set timestamps to microseconds, so the usable resolution really is 1milli.

Note: I am not an employee/contractor for either company, so very little of a fix/change for this is within my control.
Comment 11 Randall Becker CLA 2020-08-12 07:59:15 EDT
Also, there is no JVM higher than 1.8 on NonStop TNS/E or on my z/OS system. There is a candidate JVM 11 on NonStop TNS/X, but I do not have access to that as of yet. Still, Jenkins has a minimum support of Java 8, so the code needs to work there.
Comment 12 Thomas Wolf CLA 2020-08-12 11:59:51 EDT
(In reply to Randall Becker from comment #10)
> It is correct that the loop does not terminate. I had to put a break in it.

Only on NonStop, or also on z/OS?

> On z/OS, the timestamp is truncated to milliseconds, so setting the time
> does not actually change the time and so fsResolution calculates as zero. I
> do not know whether the actual usable resolution is better than 1s.

Even if the resolution is only milliseconds: we keep adding more and more, so after enough iterations, we should be adding milliseconds, and we should see a difference, not zero.

We exit the loop when t2.compareTo(t1) > 0, and next we do a Duration.between(t1.toInstant(), t2.toInstant()). So how can we exit that loop and still get zero? Millis or not. Does the Duration.between() call drop them? It does have some exception handling where nanos might get set to zero. FileTime.toInstant() shouldn't drop precision.
Comment 13 Randall Becker CLA 2020-08-12 12:41:43 EDT
(In reply to Thomas Wolf from comment #12)
> (In reply to Randall Becker from comment #10)
> > It is correct that the loop does not terminate. I had to put a break in it.
> 
> Only on NonStop, or also on z/OS?

Both platforms have this. t2 maintains less than or equal to t1 and the loop does not terminate for a very long time - never on NonStop, and 1 second (10^9 nanos) on z/OS. Obviously, I did not let it run that long.

> > On z/OS, the timestamp is truncated to milliseconds, so setting the time
> > does not actually change the time and so fsResolution calculates as zero. I
> > do not know whether the actual usable resolution is better than 1s.
> 
> Even if the resolution is only milliseconds: we keep adding more and more,
> so after enough iterations, we should be adding milliseconds, and we should
> see a difference, not zero.

We are still seeing 0 on z/OS so I think the actual resolution set through setLastModifiedTime is 1 second, not 1 millisecond.

> We exit the loop when t2.compareTo(t1) > 0, and next we do a
> Duration.between(t1.toInstant(), t2.toInstant()). So how can we exit that
> loop and still get zero? Millis or not. Does the Duration.between() call
> drop them? It does have some exception handling where nanos might get set to
> zero. FileTime.toInstant() shouldn't drop precision.

We never exit on NonStop because t2 ends up before the EPOCH. We will eventually exit on z/OS once we count all nanos in a second. FileTime.toInstant() is not dropping resolution, but setLastModifiedTime(Instant) is. So when t2 is picked up from the file system it is either really bad (NonStop) or the same as t1 for a long time (z/OS).

Note: I have officially opened a case with HPE Support on this issue. I do not think IBM will consider this a defect.
Comment 14 Thomas Wolf CLA 2020-08-12 13:06:10 EDT
(In reply to Randall Becker from comment #13)
> (In reply to Thomas Wolf from comment #12)
> > (In reply to Randall Becker from comment #10)
> > > It is correct that the loop does not terminate. I had to put a break in it.
> > 
> > Only on NonStop, or also on z/OS?
> 
> Both platforms have this. t2 maintains less than or equal to t1 and the loop
> does not terminate for a very long time - never on NonStop, and 1 second
> (10^9 nanos) on z/OS. Obviously, I did not let it run that long.

All right, so on z/OS it does eventually exit the loop, and does give a resolution >= 1sec after the loop?

When I saw that code I also thought that this would loop quite long if the resolution was large. :-) We can do something about that.

I don't know whether whatever we might do for that would also side-step the NonStop issue. Since we don't have a NonStop platform to test test on (at least I don't) we can't make any promises about that.

(I'm thinking along the line of doing a kind of binary search between 2s and 0 to find the resolution, using setLastModified(long) for the range > millis and setLastModifiedTime(Instant) for smaller attempts.)
Comment 15 Randall Becker CLA 2020-08-12 13:20:33 EDT
(In reply to Thomas Wolf from comment #14)
> (In reply to Randall Becker from comment #13)
> > (In reply to Thomas Wolf from comment #12)
> > > (In reply to Randall Becker from comment #10)
> > > > It is correct that the loop does not terminate. I had to put a break in it.
> > > 
> > > Only on NonStop, or also on z/OS?
> > 
> > Both platforms have this. t2 maintains less than or equal to t1 and the loop
> > does not terminate for a very long time - never on NonStop, and 1 second
> > (10^9 nanos) on z/OS. Obviously, I did not let it run that long.
> 
> All right, so on z/OS it does eventually exit the loop, and does give a
> resolution >= 1sec after the loop?
> 
> When I saw that code I also thought that this would loop quite long if the
> resolution was large. :-) We can do something about that.
> 
> I don't know whether whatever we might do for that would also side-step the
> NonStop issue. Since we don't have a NonStop platform to test test on (at
> least I don't) we can't make any promises about that.
> 
> (I'm thinking along the line of doing a kind of binary search between 2s and
> 0 to find the resolution, using setLastModified(long) for the range > millis
> and setLastModifiedTime(Instant) for smaller attempts.)

I'm going to assume z/OS will stop eventually. Counting that long is actually expensive - in real dollars (we pay for usage).

I can certainly test the proposed code.
Comment 16 Eclipse Genie CLA 2020-08-14 09:50:05 EDT
New Gerrit change created: https://git.eclipse.org/r/c/jgit/jgit/+/167706
Comment 17 Thomas Wolf CLA 2020-08-14 10:30:01 EDT
(In reply to Randall Becker from comment #15)
> I'm going to assume z/OS will stop eventually. Counting that long is
> actually expensive - in real dollars (we pay for usage).

With Java 8 on Mac APFS (1 second resolution) current code takes 233 iterations and about 10ms.

> I can certainly test the proposed code.

(In reply to Eclipse Genie from comment #16)
> New Gerrit change created: https://git.eclipse.org/r/c/jgit/jgit/+/167706

Please test the code from that Gerrit change on your z/OS and NonStop platforms.
Comment 18 Randall Becker CLA 2020-08-14 11:02:08 EDT
(In reply to Thomas Wolf from comment #17)
> (In reply to Randall Becker from comment #15)
> > I'm going to assume z/OS will stop eventually. Counting that long is
> > actually expensive - in real dollars (we pay for usage).
> 
> With Java 8 on Mac APFS (1 second resolution) current code takes 233
> iterations and about 10ms.
> 
> > I can certainly test the proposed code.
> 
> (In reply to Eclipse Genie from comment #16)
> > New Gerrit change created: https://git.eclipse.org/r/c/jgit/jgit/+/167706
> 
> Please test the code from that Gerrit change on your z/OS and NonStop
> platforms.

Tested the method in isolation. The result is printed as:

fsResolution=PT1S(1000000000 nanos)

There were no messages logged, which is a bit suspicious. I'm not sure the code is working as intended. On HP NonStop, the result should be at millisecond resolution at worst. I added some debugging messages to my copy with the following trace:

Initial Loop
 Trying 1000
 t2=1960-02-29T02:54:48Z
 Trying 1000000
 t2=2020-08-14T15:02:03Z
Before binary search. high=2,low=0,minScale=1000000
 Loop iteration with high=2,low=0,mid=1
  delta=1000000000, current scale=1000000000
  t2=2020-08-14T15:02:04Z
  cmp=1
   diff=PT1S
 Loop iteration with high=1,low=0,mid=0
  delta=500000000, current scale=100000000
  t2=2020-08-14T15:02:03Z
  cmp=0
  no discernible difference
 Loop iteration with high=10,low=6,mid=8
  delta=800000000, current scale=100000000
  t2=2020-08-14T15:02:03Z
  cmp=0
  no discernible difference
 Loop iteration with high=10,low=9,mid=9
  delta=900000000, current scale=100000000
  t2=2020-08-14T15:02:03Z
  cmp=0
  no discernible difference

fsResolution=PT1S(1000000000 nanos)
Comment 19 Thomas Wolf CLA 2020-08-14 11:12:25 EDT
(In reply to Randall Becker from comment #18)
> Tested the method in isolation. The result is printed as:
> 
> fsResolution=PT1S(1000000000 nanos)
> 
> There were no messages logged, which is a bit suspicious.

No, that's correct. The initial test for 1000 (1 microsecond) doesn't log, and later the loop goes from larger to smaller intervals, but never even gets near the millisecond range.

> I'm not sure the
> code is working as intended. On HP NonStop, the result should be at
> millisecond resolution at worst.

May also depend on the Java version used. On my Mac, I get 1s with Java 8, and 1 microsecond with Java 11. On the same file system. (Which BTW, is normal and correct.)

> I added some debugging messages to my copy
> with the following trace:
> 
> Initial Loop
>  Trying 1000
>  t2=1960-02-29T02:54:48Z
>  Trying 1000000
>  t2=2020-08-14T15:02:03Z
> Before binary search. high=2,low=0,minScale=1000000
>  Loop iteration with high=2,low=0,mid=1
>   delta=1000000000, current scale=1000000000
>   t2=2020-08-14T15:02:04Z
>   cmp=1
>    diff=PT1S
>  Loop iteration with high=1,low=0,mid=0
>   delta=500000000, current scale=100000000
>   t2=2020-08-14T15:02:03Z
>   cmp=0
>   no discernible difference
>  Loop iteration with high=10,low=6,mid=8
>   delta=800000000, current scale=100000000
>   t2=2020-08-14T15:02:03Z
>   cmp=0
>   no discernible difference
>  Loop iteration with high=10,low=9,mid=9
>   delta=900000000, current scale=100000000
>   t2=2020-08-14T15:02:03Z
>   cmp=0
>   no discernible difference
> 
> fsResolution=PT1S(1000000000 nanos)

Thanks for testing, but this shows quite clearly that adding 1 second gave a difference, but adding 0.5s did _not_ give any difference. And neither did 0.8 or 0.9. So 1 second is correct.
Comment 20 Thomas Wolf CLA 2020-08-14 11:30:26 EDT
Note that the code is supposed to report file timestamp resolution *as seen by Java*. This may be worse that what a corresponding C program would get.
Comment 21 Randall Becker CLA 2020-08-14 11:36:19 EDT
(In reply to Thomas Wolf from comment #20)
> Note that the code is supposed to report file timestamp resolution *as seen
> by Java*. This may be worse that what a corresponding C program would get.

Too right. I suspect that "my way" of getting the timestamp in C is not how the JVM does it, but then "my way" involves OS primitives. Even stat() only reports 1 second resolution.
Comment 22 Thomas Wolf CLA 2020-08-14 11:46:37 EDT
There are a few related JDK bugs for this.

The upshot is: it depends on the file system mounted (for instance Mac HFS/HFS+ only stores 1sec resolution), on the OS, and on top of that it depends on the Java version. Java 8 on Mac APFS with OS X 10.14.6 (1ns resolution natively) still only gives you 1s resolution. Java 11 and 13 would give microsecond resolution, and starting from Java 14 one should be able to get nanoseconds on Linux/OS X. (But currently the code stops at 1 microsecond.) On NTFS, the native resolution is 100ns, but you can get that also only from Java 14 on.