Graphic card upgrade

I know there’s frequent chat on the site about the advantages to be gained by using xx or yy graphic card.

I’ve got a new graphic card on order that I plan to install in my PC on Thursday (an Nvidia RTX 4060 8 GB). I thought I would post the export times for (say) 10 RAW files with the old card (an Nvidia GTX 1650) and the new one, with all settings kept the same.

I will do as described above (just to add to the general knowledge base), but thought I would post here in advance in case there is some specific test someone would find helpful (obviously I would need to run stage one of any such test before I remove the old graphic card).

Best wishes
Malcolm

Expect a HUGE improvement.

With the 4060, you should get something like 6-7 seconds per export with DeepPRIME XD2 and 25 MP images.


That will depend on the size of the raw files and the number and types of edits applied to them. Generally, I achieve six to seven seconds per file with the 20.9 MP raws from my Nikon Z fc and the same modest number of edits to each image, using the RTX 4060. If I were instead shooting a 45 MP Nikon D850, with significant and varied edits to each image, I would expect processing to take much longer.

Mark

@MalcolmC I am afraid that the Google spreadsheet is really only effective at showing the performance of the CPU and GPU combined.

When you run your test, please set up the following: 10 (or more) RAW images with appropriate edits but with NO NR set. Arguably that test need be run only once, but for clarity I would create groups a bit like I have here


Instead of the DP XD tests you can run DP XD2s tests, or, if you have both PL7 and PL8, run both the DP XD and the DP XD2s tests.

Essentially, the NO NR tests give some idea of how much of the export time is spent rendering the edits, as opposed to rendering the edits and applying noise reduction.

It isn’t quite as simple as that, but it is a lot more useful than a single export figure that lumps rendering the edits and applying NR together.
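As a minimal sketch of that arithmetic (in Python; the helper name is my own, and the example figures are the 10-image 5900X + RTX 2060 batch timings quoted later in this thread):

```python
# Minimal sketch: subtracting the NO NR batch time from a denoised
# batch time approximates the seconds spent on the noise-reduction
# pass itself. The helper name is hypothetical.

def nr_only_seconds(total_export_s: float, no_nr_s: float) -> float:
    """Estimate time spent on noise reduction alone (per batch)."""
    return total_export_s - no_nr_s

# 10-image batch timings in seconds (5900X + RTX 2060 figures from this thread)
batch = {"NO NR": 20.0, "DP": 28.0, "DP XD2s": 77.0}
for mode in ("DP", "DP XD2s"):
    print(f"{mode}: ~{nr_only_seconds(batch[mode], batch['NO NR']):.0f} s on NR")
# -> DP: ~8 s on NR
# -> DP XD2s: ~57 s on NR
```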

Some PL8.1/Win denoising timings, in seconds per photo, with an RTX 4070 + i7-14700KF and 45 MP Nikon Z8 raws (averages over 10-raw batches, 14-bit, lossless compression):

  • 2.0 – no NR, CPU 40%, GPU 30-40%
  • 2.1 – HQ, CPU 50%, GPU 30-40%
  • 10.2 – PRIME, CPU 70%, GPU 1%
  • 2.2 – DP, CPU 40%, GPU 30%
  • 5.1 – DP XD2s, CPU 20-30%, GPU 70%

The CPU and GPU figures are a crude approximation. I ran a second round of tests to confirm the results. Surprises:

  • NoNR and HQ use GPU
  • NoNR, HQ, and DP performed almost equally

IIRC, in PL7 DP was much slower than HQ.
Currently DP XD2s is about 2.5 times slower than pure DP on my desktop. In PL7 I used pure DP most of the time and XD in 20-30% of cases. In PL8 I have switched to XD2s almost completely. With XD2s at Luminance = 15 some luma noise is retained (which I use sometimes); at 30 it starts to be clean. This may depend on camera/lens/photo.

Tested on a 10-file batch (default 2 threads in PL settings, JPEG 90% quality export), 14-bit raws with lossless compression, not especially high ISO (800-6400), uncropped, with all optical corrections applied, including LSC, plus Smart Lighting and Selective Tone. Most of my work is at ISO 8k-12k, and there DP XD2s is 20-40% slower, usually 6-7 seconds per photo in a batch. Timings depend on cropping, edits, ISO, amount of detail, sensor type, lens, …

Then I switched to ‘Use CPU only’ in PL Preferences->Performance tab, and restarted PL:

  • 2.1 – no NR, CPU 45% GPU 30% (sic !)
  • 2.1 – HQ, CPU 50%, GPU 50% (sic!)
  • 10.1 – PRIME, CPU 75%, GPU 1%
  • 45.2 – DP, CPU 70%, GPU 1%
  • 109.7 – DP XD2s, CPU 90%, GPU 1%

The surprising thing here is again GPU usage for NoNR and HQ.

@Wlodek I need to dig out some of my timings, but the i7-14700KF has no onboard graphics, I believe? So some of the GPU will be used to drive the monitor, but the NO NR and HQ figures look completely “barmy”!?

When I am timing I use the DxPL export figures to provide the data, e.g.


What it fails to capture is the set-up time, but typically that is small compared with the actual export times.

So I wind up with

I have similar figures for PL8 tests made during testing but haven’t repeated the above tests with PL8.1 production release, which I must do.

Early on in GPU testing I used a spreadsheet to do the calculations, but that didn’t seem to attract much attention in the forum. So I started presenting results using the export table and added Process Explorer and GPU-Z outputs, i.e.

I doubt that attracted much attention either!

If you look at your figures with NO NR, it takes 2 seconds with 40% CPU; why DxPL is using 30-40% GPU I do not know, and that seems excessive.

The following shows my export tests with my 5900X and a 2060 GPU fitted, run on the Egypt image. It shows GPU usage for batches of 60 images using PL7 (in this case) for “NO NR”, “DP” and “DP XD” exports (the 5900X has no integrated GPU, so all monitor traffic is handled by the graphics card).

To improve the export timings for “NO NR”, increasing the number of export copies to max out the processor, or fitting an even more powerful processor, would reduce that figure; but increasing the number of copies when exporting DP XD2s in particular will rapidly max out the GPU.

DP is using 0.2 seconds of GPU time and DP XD2s 3.1 seconds (the CPU is not idle during the denoising operation, but the GPU is the main engine), and on my system with the 2060, or more usually the 3060, fitted, the GPU is actually bottlenecking the CPU, as demonstrated by this Process Explorer graph

Increasing the GPU power even further, beyond the 4070, would bring that figure down, but at what price?

On my system I use 3 threads. I have run tests where 4, 5 and 6 threads performed about the same as 3 and beyond that got worse. The exports are 100% JPGs.


Unfortunately, you are right, but I had no reasonable alternative at that time and I was in a hurry.
As long as I don’t use any video, the GPU stays at 0% and around 10 W; only the standard circuitry is used then, and it doesn’t interfere with PL exports. Before and after the NoNR and HQ exports the GPU stays at 0-1%, as expected, but during the export it goes up, unlike for PRIME. I was surprised by the GPU usage too, and I double-checked it. I have no performance problem, so I didn’t dig further. And during the exports my PC gets hot enough (the 4070 goes up to 200 W) without any extra tuning.

That’s what I did too. You can find more exact timings in the PhotoLab log. Just the libraw unpack() function takes about 0.6 s for Z8 14-bit losslessly compressed NEFs (a single-threaded, CPU-only task), so PL/driver performance looks really good.


Actually, I used it about a year ago, while planning my PC configuration. Thank you very much for the effort; it proved to be really useful.


OK, I swapped my graphic card over yesterday.

The constants are an Intel 11700K, 32 GB RAM, and all the source and target files on a Samsung 990 Pro M.2 2 TB.

There were 10 RAW CR3 files (EOS R), and I am using PL 8.1 build 434 and the Nvidia Studio driver 566.14. I carried out no edits to the photos.

With the GTX 1650, export times (approximate; I used a stopwatch) were, all in seconds:

  • No NR – 25
  • HQ – 30
  • Prime – 172
  • Deep Prime – 155
  • Deep Prime XD/XD2 – 399

With the RTX 4060:

  • No NR – 22
  • HQ – 27
  • Prime – 171
  • Deep Prime – 27
  • Deep Prime XD/XD2 – 60

Therefore there is no appreciable difference for No NR, HQ and Prime, but there are significant improvements (more than I had anticipated) for Deep Prime and Deep Prime XD/XD2.
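For anyone who wants to reproduce the ratios, here is a short Python sketch using the stopwatch timings above (batch seconds for 10 files):

```python
# Speed-up ratios computed from the stopwatch timings above
# (10-image batch, seconds; GTX 1650 -> RTX 4060).
gtx_1650 = {"No NR": 25, "HQ": 30, "Prime": 172, "Deep Prime": 155, "DP XD/XD2": 399}
rtx_4060 = {"No NR": 22, "HQ": 27, "Prime": 171, "Deep Prime": 27, "DP XD/XD2": 60}

for mode, old_s in gtx_1650.items():
    print(f"{mode}: {old_s / rtx_4060[mode]:.2f}x faster")
# -> No NR: 1.14x, HQ: 1.11x, Prime: 1.01x,
#    Deep Prime: 5.74x, DP XD/XD2: 6.65x
```

As the output shows, the GPU swap barely moves the CPU-bound modes but gives a 5-7x gain on the DeepPRIME variants.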

I think (though the impression is subjective) that the picture in the loupe tool renders more quickly with the new GPU too.

I realise the above isn’t going to change the world in any way, but I post it in the hope that it will give anyone thinking about upgrading their own graphic card some data on the improvement they might achieve.

Best wishes
Malcolm


@MalcolmC It will, because the graphics card is used to apply your chosen NR to the image (section) in the Loupe.

I will look at your post in greater detail later, I am backing up between machines at the moment.

Regards

Bryan

@MalcolmC The GPU won’t affect NO NR, HQ and Prime because they are CPU driven, except in the case of @Wlodek, where there is some unexpected GPU usage (!?).

My figures for PL7 are in the large chart in my post above, for two machines: a Ryzen 5600G, which was later replaced with a 5900X (and then built into a new case with a new motherboard, replacing an old i7-4790K, which is still intact and usable for testing).

The chart contains NO NR, DP and DP XD and there are rows for a Canon R6 DP using a batch of 20 images.

For the 5900X I included tests for 4, 3 and 2 export copies; the 5600G is only for 2 export copies. The 5600G is a bit slower than your CPU and the 5900X somewhat faster, but the tests were conducted with an RTX 3060, which, on the site I tend to use, is 22% faster on their GPU benchmark.

I also have two eBay 2060s (each about 80% of a 3060), one for the rebuilt 5600G and the other for my remaining i7-4790K, plus a 1050 Ti and a 2 GB 1050, which now fails the PL8 1 GB test (please note that a 2 GB card fails the 1 GB test!?).

I downloaded some EOS R images and ran two sets of tests, the first with some basic edits and the second with only NR as appropriate and got the following timings with the 5900X and a 2060 GPU

Data kindly provided by @MalcolmC confirms two findings for PL8.1/Win and sensors with a standard Bayer matrix and “normal” raw encodings:

  1. There’s almost no penalty for using “pure” DP over not using NR at all, provided a sufficiently capable GPU is used, like a desktop RTX 4060.
  2. DP XD2s is 2-3 times slower than pure DP.

The first point is quite striking.
Also, Malcolm’s RTX 4060/GTX 1650 results are unexpected to me – I wouldn’t have imagined that DP on the 4060 is about 6 times faster than on the 1650.

It looks like normal floats (fp32) are used for the GPU calculations – the GTX 1650 is only 2.5 times slower than the RTX 4060 for fp16 (half-float) operations, but 5.1 times slower for fp32 operations. There’s a much stronger correlation between DP(XD) and GPU fp32 performance than with the PassMark G2D and G3D benchmarks (2.1 and 2.5 times difference there). Of course, there are too many variables here to make precise judgements: it’s unknown how exactly DP(XD) splits the work between CPU and GPU, there are various cache hits/misses, GPUs with equal labels are not always equal, the denoising timing depends greatly on the actual image data, and so on.
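A rough sanity check of that correlation, as a Python sketch (the spec ratios are the approximate ones quoted in this post, not authoritative figures):

```python
# Which published 4060-vs-1650 ratio best predicts the measured DP
# speed-up? Spec ratios are those quoted above; treat as approximate.
spec_ratios = {"fp16": 2.5, "fp32": 5.1, "PassMark G2D": 2.1, "PassMark G3D": 2.5}
measured_dp = 155 / 27  # overall DP batch times from Malcolm's test, ~5.74x

closest = min(spec_ratios, key=lambda k: abs(spec_ratios[k] - measured_dp))
print(f"measured ~{measured_dp:.2f}x; closest spec ratio: {closest} ({spec_ratios[closest]}x)")
# -> measured ~5.74x; closest spec ratio: fp32 (5.1x)
```

With only one GPU pair this is suggestive rather than conclusive, for the reasons listed above.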

Thank you for your analysis of the speed increase I achieved as a result of the graphic card update.

I looked at the GPU PassMarks and found the 4060 to be around 2.5x that of the GTX 1650 and (being a pessimist) thought I would be happy with a 2x improvement. Accordingly, I am delighted by the actual improvement achieved.

As others have explained above, there are a lot of factors involved here, but I cannot afford to buy a 5090 (when they arrive) to investigate further (unless someone offers to sponsor me!)

@Wlodek Although it might be a “red herring”: when I tested DP XD2s on the GTX 1050 Ti, it was 50% slower than DP XD.

I theorised in one post or another that this might hold for all GTX cards, because my RTX 2060 and RTX 3060 showed what I had seen during testing: that DP XD2s was typically (very) slightly slower than DP XD on my systems (during the product’s testing stage I tested first with the 5600G and later with the 5900X, but with the RTX 3060 throughout).

I only started testing PL8 with the RTX 2060 after the product was released. The following are the tests with the GTX 1050 Ti, which I also only ran after PL8 was released.

The reason for asking for NO NR test figures is that they represent the rendering of the edits minus noise reduction (as the name implies).

So I subtract the NO NR figure from the DP and DP XD(2s) figures to give “more accurate” timings for the portion of the export that is largely determined by the GPU (with some CPU involvement, which isn’t easily measured!).

So with the @MalcolmC figures we have CPU = 25 (for NO NR), (155 - 25) = 130 (DP) and (399 - 25) = 374 (DP XD2s) for the GTX 1650,

and then CPU = 22 (NO NR!), (27 - 22) = 5 (DP) and (60 - 22) = 38 (DP XD2s) for the RTX 4060

That yields an improvement of (130/5) = 26 (DP) and (374/38) = 9.84 (DP XD2s), which is really “scary” and completely “out of whack” with the benchmark sites’ assessment of the 1650 against the 4060!?

Comparing the overall timings would give 25/22 = 1.136 (but it’s the same processor!?), 155/27 = 5.74 (DP) and 399/60 = 6.65 (DP XD2s).

My favourite GPU site (which doesn’t have a figure for the GTX 1650 Ti) puts the 1650 versus the 3060 as performing like this

and against the 4060 like this

Rather than comparing the 20-image run I had done with a 10-image run, I reran my tests with 10 images, and then ran the NO NR test again. I start all my tests one after the other, which means the PL8 set-up for the DP and DP XD2s tests will overlap with the actual running of the first NO NR test and potentially steal processor time from it (I sometimes use a DUMMY test for that purpose), so we have

My figures for 10 potentially similar images with the 5900X and an RTX 2060 are 20 (NO NR), (28 - 20) = 8 (DP) and (77 - 20) = 57 (DP XD2s),

versus, for 10 images with an Intel 11700K and an RTX 4060, 22 (NO NR!), (27 - 22) = 5 (DP) and (60 - 22) = 38 (DP XD2s).

The slower CPU (the Intel) wasn’t much slower in the NO NR test: 22/20 = 1.1 in favour of the 5900X; the DP tests give 8/5 = 1.6 in favour of the 4060, and the DP XD2s tests 57/38 = 1.5 in favour of the 4060 over the 2060.

If you compared the overall times you would have 22/20 = 1.1 times faster in favour of the 5900X, 28/27 = 1.03 times faster in favour of the 4060, and 77/60 = 1.28 in favour of the 4060.

Hence my trying to be slightly more “accurate” by using the NO NR figure; but these figures tend to show GPUs performing much worse on DxPL exports than GPU gaming benchmarks would suggest when comparing one GPU with another.