Author Topic: Team render issues  (Read 1992 times)

2019-02-07, 21:54:13
Reply #15

kizo

  • Active Users
  • **
  • Posts: 28
    • View Profile
Tom, I would also like to get an answer on the color difference issue, if possible. Here is the issue again for easier reading, but please check the attachments in the first post in this thread.

Thanks


"PICTURE VIEWER OUTPUT DIFFERENT FROM VFB

In all the tests made so far using TR through the PV (the only possible way), but on a single machine too, there is a difference: the image saved from the PV is always darker.
The C4D project settings are set to linear workflow and sRGB.

I also had a situation where the lights didn't match between the render saved as .jpg from the PV and the .jpg saved from the VFB. It seems the VFB is showing and saving the LightMix while the PV shows the beauty pass. That would explain why the color and intensity of the light were different. Please check the attached images.
If the above is true, how would one save a non-layered format out of the PV and have it match the VFB?
"

Hi kizo,

after reading your description, it seems to me that you somehow managed to save only the non-lightmix version of the image out of the PV. The functionality of the "Save as..." option of the C4D Picture Viewer depends on the current layer mode (Image vs. Single-Pass vs. Multi-Pass) and possibly on what layer you have selected.

As for the different lightness, we are aware of a very slight (almost imperceptible) difference between PV and VFB, which is probably a result of different sRGB handling. But it seems to me (based on your description) that you have a much bigger difference between the two. Might I ask whether you have any PostProcessing filters enabled? And if so, are the results from PV and VFB the same after disabling PostProcessing?

Hi Houska,

thanks for the help. I'm aware of the saving process from the PV. I was just saving a file without first going into single-layer mode and choosing a layer.

I made further tests regarding the color and light difference and can reproduce the issue.

Rendering on a local machine only. The scene uses LightMix. Tried both .jpg and .tif.
No multi-pass file is set to save, so both formats save a single-layer file.

1st time: rendered in the PV and saved the files automatically from the render settings.
2nd time: rendered from the VFB but left the save path set to save automatically.
3rd time: turned off auto-saving the render, fired it from the VFB and saved manually.

And this is where I get the difference:
the auto-saved render from the VFB and the manually saved one look different. Check the attached .jpgs; the .tifs look the same.

So this is not TR-related, but I encountered it while testing TR, as I usually never render to the PV on a single machine.
I sent the scene I used for testing to Corona support 3 days ago so you can try to reproduce it. Please let me know if you manage to test it.

Thanks

kizo


The render saved from the PV, or from the VFB when set to save in the render settings, saves the beauty pass, while a render saved manually from the VFB saves the LightMix. At least that's what we concluded after comparing the saved renders and the separate passes. It would need checking.
« Last Edit: 2019-02-08, 11:36:42 by kizo »

2019-02-08, 09:26:54
Reply #16

Nelaton

  • Active Users
  • **
  • Posts: 52
    • View Profile
Hello,

@TomG: We have a 10 Gb switch and a 1 Gb network card per computer. I'm not sure, but is it the switch that handles the distribution, so that its speed is what matters?

NB: I was mistaken when I said the packet size was set to 100 MB in manual mode for our animations.
It is set to automatic.

Cheers,
Nelaton
« Last Edit: 2019-02-08, 09:32:08 by Nelaton »

2019-02-08, 14:25:22
Reply #17

TomG

  • Corona Team
  • Active Users
  • ****
  • Posts: 2684
    • View Profile
The 1 Gb cards would be the bottleneck (think of it as a highway at one speed, but with on- and off-ramps at one tenth of that speed: things can get delayed on those ramps, which has the same effect of data not reaching its destination in time when machines ask "Hi! Are you still there?").
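To put rough numbers on that highway analogy, here is a quick sketch. The 100 MB figure echoes the packet size discussed in this thread; the arithmetic is idealized and ignores protocol overhead and contention, so treat it as illustration only:

```python
# Ideal transfer time for one render packet over a network link.
# The packet size and link speeds below are illustrative assumptions.

def transfer_seconds(packet_mb: float, link_gbps: float) -> float:
    """Ideal time to move one packet over a link, ignoring protocol
    overhead and contention."""
    bits = packet_mb * 8 * 1e6        # MB -> megabits -> bits
    return bits / (link_gbps * 1e9)   # bits / (bits per second)

for gbps in (1.0, 10.0):
    t = transfer_seconds(100, gbps)
    print(f"100 MB packet over a {gbps:g} Gb/s link: {t:.2f} s")
```

A 100 MB packet ties up a 1 Gb/s card roughly ten times longer than a 10 Gb/s one, which is exactly the "slow on-ramp" in the analogy.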

If you have time to test, I would be interested to know what happens if you do use Automatic - mostly, at the moment, testing longer Intervals is what would be interesting. As another thought, if you use the Web Server version of Team Render, you can still render a single image, and I *think* (but am not sure off the top of my head) that you can still get all machines contributing to that single image... if that is possible, it would be interesting to know whether it has the same issues, since Team Render may manage things differently there than from a live-and-running version of Cinema 4D. I'll have a look and see if there is such a possibility with TR and the standalone server (though I can't test the network overload here, as I only have 1 WS and 2 nodes here :) So I will just be checking whether "all machines rendering one image" is possible, and won't know if it has any impact on the issues you are experiencing).

2019-02-08, 14:40:51
Reply #18

Nelaton

  • Active Users
  • **
  • Posts: 52
    • View Profile
Ok, so in the past we tested a scenario where we have 12 cameras, 1 per frame. And then, as you say, the machines work on one image at a time. And the render times are no comparison with rendering stills (no animation) in TR.

But a still image calculated not in animation mode takes approximately the same time to render with TR as on our local machine(s). So this is disappointing. On the other hand, when rendering a still image with TR (no animation mode), we get lines that keep appearing, and it takes very long for the render to finish (and we have 8 clients running for this).

Also,
we know that having 1 Gb network cards per machine makes little sense with a 10 Gb switch, and we are aiming to change them in the near future.
Cheers,
Nelaton
 
« Last Edit: 2019-02-08, 14:45:51 by Nelaton »

2019-02-08, 15:24:32
Reply #19

TomG

  • Corona Team
  • Active Users
  • ****
  • Posts: 2684
    • View Profile
Hi Nelaton,

I am not sure what you mean by "a still image calculated not in animation" - do you mean when using Team Render to Picture Viewer? When you mention lines appearing and slow render times, this is where setting to Manual and raising the packet size helps (if you see the render build up in the Picture Viewer in only thin lines, it means not enough data is being sent by the machines in a single packet, and they are spending all their time sending these small packets rather than focusing on rendering). Raising the packet size helps here - but beyond a certain size, you may end up with a congested network and start losing connection to the nodes (it all depends on how many nodes you have and the speed of your network). In this second case, a test with an increased Interval would be good, as "larger packets sent less often" may resolve both the slow renders from the small default packet size and the network congestion from sending too much data over the network (but the impact of the Interval has yet to be tested by anyone experiencing this, so we don't know yet how much it will help).

So the process right now is
a) My Team Render to Picture Viewer is slower than it should be, almost as slow or slower than rendering locally (and I see the image build in very small strips) - raise packet size
b) I raised packet size, now nodes keep disconnecting - try increasing the Interval

As a note, upgrading your ethernet cards to 10 gig rather than 1 gig should mean you can raise the packet size with less risk of ending up with network problems from a large amount of data, and hopefully everything will then just work :)
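The packet-size / Interval trade-off above can be sketched with some back-of-the-envelope arithmetic. The figures here (100 MB packets, a 1 Gb/s link into the master, the node counts) are illustrative assumptions, not measurements from this thread:

```python
# Worst-case link saturation when every node reports at once.
# All figures below are illustrative assumptions.

def burst_seconds(nodes: int, packet_mb: float, link_gbps: float) -> float:
    """Worst-case time the master's link stays saturated if every node
    sends one packet at the same moment: total bits / link speed."""
    return nodes * packet_mb * 8e6 / (link_gbps * 1e9)

for nodes in (3, 8, 16):
    t = burst_seconds(nodes, packet_mb=100, link_gbps=1.0)
    print(f"{nodes:2d} nodes x 100 MB on a 1 Gb/s link: ~{t:.1f} s saturated")

# While the link is saturated, "are you still there?" keep-alives queue
# up behind render data. A longer Interval does not shrink the burst,
# but it leaves more idle link time between bursts for those messages.
```

This is consistent with the pattern above: a faster link (10 Gb cards) shrinks each burst, while a longer Interval spaces the bursts out; either can keep the keep-alive messages from timing out.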

2019-02-08, 16:22:50
Reply #20

Nelaton

  • Active Users
  • **
  • Posts: 52
    • View Profile
Thanks for your kind explanation. It doesn't appear to be a solution, as Kizo pointed out earlier, since other render engines do not suffer from traffic congestion at 1 Gb. I'm sure you get this essential point.
So then one question: why not contact the V-Ray team to see how they handle this, and implement a Corona DR interface?
Because, right now, I see TR as very unstable with Corona.

Cheers,
« Last Edit: 2019-02-13, 13:28:23 by Nelaton »

2019-02-08, 16:23:26
Reply #21

TomG

  • Corona Team
  • Active Users
  • ****
  • Posts: 2684
    • View Profile
Here is another option that may help, which is submitting a job to the Team Render Server rather than using Team Render to Picture Viewer. The downside is that you can't save to CXR, but other than that it should be fine.

The Server may handle sending data back and forth differently, since it is not showing the ongoing progress in the Picture Viewer / VFB, and so may not experience the dropouts - it seems to be independent of packet size, so could be it has its own inbuilt and separate method for handling sending data back and forth between Server and Client.

For me the easiest way to run a job through the Server is
a) Run the TR Server, the Corona License Server, and the TR Clients (including on the master machine, if you want it to contribute to rendering)
b) Use the folder icon in the top left of the TR Server UI to open the local file location where jobs are stored, and then make sure you are in the "admin" folder
c) Copy that address from Windows explorer
d) In C4D, use "Save Project with Assets" and paste the folder location in from above
e) This automatically creates a job for the TR Server
f) Open the browser UI for TR Server using the Globe icon almost top left in the TR Server
g) Start the job. So long as this is only a single frame job, all machines will contribute to it (if it is an animation, each machine will be given a different frame - and tests from Nelaton suggest that this for sure doesn't have issues with either slow rendering or nodes dropping out)

Since I don't have that many machines to test on, I can't say for sure if the possible difference in how the Server handles server-client communication will prevent network traffic problems, but I am kind of optimistic :) If anyone who has a large number of Clients and who experiences these issues can test, I'd be interested to hear the results.
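Steps (c) through (e) above could in principle be scripted. Below is a small sketch of that idea; the two paths in the example are hypothetical placeholders (use whatever folder the TR Server UI's folder icon actually opens on your machine), and this assumes, as described above, that a project folder dropped into the "admin" jobs folder shows up as a new job:

```python
# Sketch: copy a "Save Project with Assets" folder into the TR Server's
# job store so it appears as a new job. Paths are hypothetical examples.

import shutil
from pathlib import Path

def submit_job(project_dir: str, admin_jobs_dir: str) -> Path:
    """Copy a self-contained C4D project folder into the TR Server's
    'admin' jobs folder and return the destination path."""
    src = Path(project_dir)
    dest = Path(admin_jobs_dir) / src.name
    if dest.exists():
        raise FileExistsError(f"Job '{src.name}' already submitted")
    shutil.copytree(src, dest)  # copies the scene plus all its assets
    return dest

# Example with made-up paths:
# submit_job(r"D:\renders\shot010", r"C:\TeamRenderServer\admin")
```

You would still start the job from the browser UI (steps f and g), but this at least removes the manual copy-paste of the folder location.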

2019-02-08, 16:24:39
Reply #22

TomG

  • Corona Team
  • Active Users
  • ****
  • Posts: 2684
    • View Profile
On the V-Ray issues, two things - one, the V-Ray implementation in the past was not done by Chaos Group but by an external team. And two, they likely aren't updating the VFB in the same way we are (which is what makes our use of TR different from other engines). So, unfortunately, there isn't anything we can learn or do there in that regard, sorry.

2019-02-08, 16:25:07
Reply #23

kizo

  • Active Users
  • **
  • Posts: 28
    • View Profile
Hi,

so we have been testing various combinations and suggestions from this thread.

Raising the interval does help with disconnecting nodes. But the overall speed is extremely poor.
These are the specs of the machines used:

      x45   192.168.178.45:5401   PC, 40x2.3GHz, 64.00 GB RAM, (Studio Client)   Windows 10, 64 Bit, Professional Edition (build 17763)   Idle   
      x44   192.168.178.44:5401   PC, 40x2.3GHz, 64.00 GB RAM, (Studio Client)   Windows 10, 64 Bit, Professional Edition (build 17763)   Idle   
      x43   192.168.178.43:5401   PC, 40x2.3GHz, 64.00 GB RAM, (Studio Client)   Windows 10, 64 Bit, Professional Edition (build 17763)   Idle   
      x42   192.168.178.42:5401   PC, 40x2.3GHz, 64.00 GB RAM, (Studio Client)   Windows 10, 64 Bit, Professional Edition (build 17763)   Idle   
      x41   192.168.178.41:5401   PC, 40x2.3GHz, 64.00 GB RAM, (Studio Client)   Windows 10, 64 Bit, Professional Edition (build 17763)   Idle   
      ws35   192.168.178.35:5401   PC, 16x3.6GHz, 64.00 GB RAM, (Studio Client)   Windows 10, 64 Bit, Professional Edition (build 17763)   Idle   
      ws34   192.168.178.34:5401   PC, 32x3.8GHz, 64.00 GB RAM, (Studio Client)   Windows 10, 64 Bit, Professional Edition (build 17763)   Idle   
      ws32   192.168.178.32:5401   PC, 32x4GHz,    64.00 GB RAM, (Studio Client)   Windows 10, 64 Bit, Professional Edition (build 17763)   Idle
      ws31   192.168.178.31:5401   PC, 64x3.8GHz, 128.00 GB RAM, (Studio Client)   Windows 10, 64 Bit, Professional Edition (build 17763)   Idle


We rendered the same 4000x4000 px image with different variations.

Rendering on the single 2990WX machine takes 14:34.

Using 8 machines from the list we got these results. The tests below were rendered from the PV with TR (Interval / packet size):

30 s / 100 MB    most nodes disconnected
40 s / 100 MB    1 disconnected       total render time stamped 14:26 (it took 6:20 to gather all the chunks and do the post-processing)
50 s / 100 MB    0-1 disconnected     14:07 (6:01)
60 s / 100 MB    none disconnected    15:54 (5:52)
90 s / 100 MB    none disconnected    19:57 (7:28)


Using 5 machines:

30 s / 100 MB    none disconnected    14:23

Using 3 machines:

30 s / 100 MB    none disconnected    15:41


With a packet size higher than 100 MB, times got a lot slower at any Interval. When nodes disconnected, this is the error we got:

"
Sending chunk 12/13 to the server
Frame synchronization failed: Communication Error"

also this one:


"2019/02/08 14:45:42  [Corona4D]
Sending chunk 2/5 to the server
2019/02/08 14:45:42  [Corona4D]
MEMORY_ERROR while sending data
2019/02/08 14:45:42  [Corona4D]
Frame synchronization failed: Memory Error"

With the above error, the machine list would still show the nodes rendering, but the image wasn't updating, and at over 20 minutes it was nowhere near done.

We also tried starting from the Server app using the web interface, this time without the 2990WX. The remaining machines rendered the image in 14:55. The save path was set, but the render wasn't saved in that location at all; it was saved in the "results" folder.
The render is in linear color space and extremely different from any of the tests made so far.


After this set of tests I'm even more worried. It seems that for our network and number of nodes, on this particular render, the best combination was 50 s / 100 MB, as it rendered with 8 machines in 14:07; but compared to the 14:34 of the single 2990WX, that is very disappointing.
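The disappointment in those numbers can be made concrete with a quick calculation. This naively treats all 8 machines as equal to the 2990WX (they are not, per the spec list above), so the efficiency figure is only a rough illustration:

```python
# Rough parallel-efficiency arithmetic on the timings reported above
# (14:34 on one machine vs 14:07 on 8 machines at 50 s / 100 MB).
# Assumes equal machines, which is not true here; illustration only.

def to_seconds(mmss: str) -> int:
    """Convert an mm:ss timestamp to seconds."""
    m, s = mmss.split(":")
    return int(m) * 60 + int(s)

single = to_seconds("14:34")    # one 2990WX, rendering locally
cluster = to_seconds("14:07")   # 8 machines via TR to PV

speedup = single / cluster
efficiency = speedup / 8        # fraction of ideal 8x scaling
print(f"speedup: {speedup:.2f}x, efficiency: {efficiency:.1%}")
```

A speedup barely above 1x across 8 machines means almost all of the added compute is being eaten by network transfer and the chunk-gathering phase noted in the timings above.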

   




2019-02-08, 16:26:03
Reply #24

TomG

  • Corona Team
  • Active Users
  • ****
  • Posts: 2684
    • View Profile
The other note as regards stability - it is fine for most people; so far these reports relate only to yourself and Kizo, who are in the rare situation of having 8 and 10 Clients (most people have far fewer clients, and so much less network traffic).

2019-02-08, 16:27:54
Reply #25

TomG

  • Corona Team
  • Active Users
  • ****
  • Posts: 2684
    • View Profile
TY for the results Kizo! What happens when you don't use all the Clients? (That is, rendering with just 2 clients, just 4 clients, etc., rather than all 10.) You may get better results with fewer clients, as then there won't be network congestion and you won't need to raise the Interval (so you'll keep the speed benefit).

And what happens when using the TR Server approach to have all the machines rendering on that one job?

2019-02-08, 16:32:09
Reply #26

kizo

  • Active Users
  • **
  • Posts: 28
    • View Profile
Quote from TomG:

Here is another option that may help, which is submitting a job to the Team Render Server rather than using Team Render to Picture Viewer. The downside is that you can't save to CXR, but other than that it should be fine.

The Server may handle sending data back and forth differently, since it is not showing the ongoing progress in the Picture Viewer / VFB, and so may not experience the dropouts - it seems to be independent of packet size, so could be it has its own inbuilt and separate method for handling sending data back and forth between Server and Client.

For me the easiest way to run a job through the Server is
a) Run the TR Server, the Corona License Server, and the TR Clients (including on the master machine, if you want it to contribute to rendering)
b) Use the folder icon in the top left of the TR Server UI to open the local file location where jobs are stored, and then make sure you are in the "admin" folder
c) Copy that address from Windows explorer
d) In C4D, use "Save Project with Assets" and paste the folder location in from above
e) This automatically creates a job for the TR Server
f) Open the browser UI for TR Server using the Globe icon almost top left in the TR Server
g) Start the job. So long as this is only a single frame job, all machines will contribute to it (if it is an animation, each machine will be given a different frame - and tests from Nelaton suggest that this for sure doesn't have issues with either slow rendering or nodes dropping out)

Since I don't have that many machines to test on, I can't say for sure if the possible difference in how the Server handles server-client communication will prevent network traffic problems, but I am kind of optimistic :) If anyone who has a large number of Clients and who experiences these issues can test, I'd be interested to hear the results.

Thanks for the effort, but this is a workflow killer even if it worked - and it doesn't.
Saving projects locally just to be able to render them out each time? That's just crazy, as is going through the web interface every time.

I mean, even if there were no issues, it's an overly long and frustrating procedure for 1 render, let alone 20 or more.
Also, there is no speed benefit in using the Server.

2019-02-08, 16:33:23
Reply #27

kizo

  • Active Users
  • **
  • Posts: 28
    • View Profile
Quote from TomG:

TY for the results Kizo! What happens when you don't use all the Clients? (that is, rendering with just 2 clients, just 4 clients, etc., rather than all 10) You may get better results with less clients, as then there won't be network congestion and you won't need to raise the Interval (so you'll keep the speed benefit).

And what happens when using the TR Server approach to have all the machines rendering on that one job?

Hi Tom,

rendering with 2 and 4 clients + the main machine is on the above test list too. No benefit at all.

2019-02-08, 16:35:28
Reply #28

TomG

  • Corona Team
  • Active Users
  • ****
  • Posts: 2684
    • View Profile
TY! The fewer-machines tests were with the raised packet size, right?

And was the "no speed benefit using the server" comment based on an actual test? (Workflow killer aside.) Cheers!

2019-02-08, 16:49:20
Reply #29

TomG

  • Corona Team
  • Active Users
  • ****
  • Posts: 2684
    • View Profile
BTW, all this information has been incredibly helpful! I just wanted to add that, as well as the tests and potential workarounds I've been writing about here, the developers have been continuing to think about and research the situation based on the tests everyone has done for us (which have helped point to causes, and that helps in thinking about how it might be fixed). I didn't want you to think that what I write here is the only thing happening :) So, we are still looking into this and at what might be possible in terms of improvements.