Room suddenly crashes, stopping streams & messages

Level 3

Our room crashed last night, where all the video streams suddenly stopped and chat messages couldn't be sent.

This seems similar to what we reported in http://forums.adobe.com/post!reply.jspa?message=3674477 - did you ever reproduce or resolve the room crashing issue documented there?

This crash seemed different, however. In the previous room crashes, the current streams would continue; you just couldn't add or retract items, and the room would stay in a corrupt state where items could be added but not modified or retracted.  This time, the room had recovered when we checked it again this morning.

Switching to a new room, everything worked fine, but we lost a lot of users in the process.  

Recreating the circumstances as closely as possible, we weren't able to reproduce the crash today.  There were 70 users in the room at the time of the crash, perhaps 15 of them streaming.

The crash occurred at approximately 9:02pm ET last night in room yrtv17.  Could you please investigate any logs from your side and see what might have caused the crash?  Anything we can do to avoid similar room crashes in the future?

Thanks,

-Trace

30 Replies

Level 3

Our room crashed again last night, in room 'yrtv22' at about 8:30pm.  Identical symptoms as the time below - video streams stopped and no new chats could be entered.  This morning, the room was fine again. 

Any idea what might be causing this, or plans for a fix?

Thanks,

-Trace

Former Community Member

Hi Trace,

Thanks for the specific details. We'll pull up some logs and do some forensics here.

I know you guys have been customers for quite a while - can you characterize when you started encountering these full "room crashes"? I feel as though this issue started sometime in the Feb-March timeframe.

thanks

nigel

Level 3

Thanks again Nigel for looking into this.

We've had repeated room crashing issues at 3 different points.  First in October, when we started our latest project, then again the first week in May, culminating in repeated tests and failures when we did a solid day of testing on just this issue on May 7th. That is documented in the forum posts related to this issue:  http://forums.adobe.com/message/3661931.  We included steps that consistently reproduced the issue quickly on the build we had at the time.  Raff mentioned he'd investigate, but there were no more posts to the thread, and we never followed up as the problem didn't recur for a couple of weeks.

The most recent issues have very similar symptoms in that the video streams stop and additional chats are not allowed, but there are a couple of things that are different from the problems we saw a month ago:

1) A month ago, the rooms would stay in a broken state and would not reset themselves except the one time Jamie reset the room manually.  Now, the rooms appear to be broken until the session ends, but then the rooms reset themselves and work normally when we've checked back after a few hours.

2) A month ago, if we tried to make changes within a broken room via the LCCS navigator, we found we could not modify or retract users or items, but we could add new ones.  I can't recall exactly what happened when entering the most recent broken rooms via the LCCS navigator, but I seem to recall being able to make some changes, but not others.  I also recall video streams working correctly, but audio streams not working correctly (or it might have been vice versa).  Will provide more exact detail if we can reproduce the problem again.

When the issue next recurs, we'll contact you immediately and try to keep people in the room so the session doesn't expire and reset the state of the room.  It seems to recur about once every week, but we lose all our users whenever it happens on production.

Meanwhile, here are the steps to reproduce I mentioned to Raff in the thread mentioned above:

"We were able to reproduce the issue in a few minutes several times today by doing the following:

*User 1 enters room and starts publishing webcam.  No one subscribed to the webcam during this test.

*Users 2 and 3 enter room.

*User 1 starts and stops publishing their audio repeatedly, once or twice per second or so.

An example room configuration we're using can be found in room 'yrtv8', if needed.

Our code that publishes and unpublishes audio is similar to the following:

private function checkAudio():void
{
  if (view.audioPublisher.isPublishing) {
    view.audioPublisher.stop();
  } else {
    view.audioPublisher.publish();
  }
}

Please let me know if you are able to reproduce the problem."

Please let us know what you find from the timing of the issues we encountered and if you're able to reproduce via the steps mentioned above.

Kind regards,

-Trace

Former Community Member

Hi Trace,

Ok, I'm on to more thorough research of this second issue. It's possible this issue is related to ratcheting, and that eliminating the one will clear the other.

I've looked through the logs at the time you mention, and the good news (I think) is that you aren't straight up crashing the server, which was my previous guess. Other rooms are happily continuing on the same box with no issue, but your room literally just goes MIA at 20:28:20. No traces, no nothing, until 20:50:26, fully 22 minutes later, when there are a few register-hook calls (go heroku!), and at 20:53:16, there's at least one user arriving and publishing items successfully.

I'll try to replicate the steps you've got here - could you guys produce the smallest working app you can that reproduces the issue? It would really help speed things along.

thanks again for your help in narrowing down these issues - I'm confident that given time and focus, we'll have you in good shape.

nigel

Level 3

Hi Nigel,

We just caused room yrtv22 to have issues again, today at 5:00pm ET. 

The room wasn't as badly broken as before (video was still streaming in the room and chats were possible), but we were unable to retract items through the LCCS console, which was one of the symptoms of the more seriously broken rooms.

To break the room, we had a camera streaming in the room and someone watching.  Then we ran the following code and clicked to start and stop the audio repeatedly.  This isn't something that happens this rapidly when using our app, but it seems to reproduce similar issues.

Thanks,

-Trace

<?xml version="1.0" encoding="utf-8"?>
<mx:Application xmlns:mx="http://www.adobe.com/2006/mxml" layout="absolute" xmlns:rtc="http://ns.adobe.com/rtc">
  <mx:Script>
    <![CDATA[
      import com.adobe.rtc.collaboration.AudioSubscriber;

      /**
       *  Handler for the stop and start buttons.
       */
      protected function startBtn_clickHandler(event:MouseEvent):void
      {
        if (startBtn.label == "Start") {
          try {
            audioPub.publish();
          }
          catch(error:*) {}
          startBtn.label = "Stop";
        } else if (startBtn.label == "Stop") {
          audioPub.stop();
          startBtn.label = "Start";
        }
      }
    ]]>
  </mx:Script>

  <!--
    You would likely use external authentication here for a deployed application;
    you would certainly not hard code Adobe IDs here.
  -->
  <rtc:AdobeHSAuthenticator
    id="auth"
    userName="[enter your user name]"
    password="[enter your password]" />

  <rtc:ConnectSessionContainer id="cSession" authenticator="{auth}" width="100%" height="100%" roomURL="https://collaboration.adobelivecycle.com/[enter path to your room]">
    <mx:VBox id="rootContainer" width="100%" height="800" horizontalAlign="center">
      <rtc:AudioPublisher width="1" height="1" id="audioPub"/>
      <mx:Button id="startBtn" label="Start" click="startBtn_clickHandler(event)" height="20"/>
    </mx:VBox>
  </rtc:ConnectSessionContainer>
</mx:Application>
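
Clicking the button by hand makes the toggle rate hard to repeat from one run to the next. A minimal sketch of a timer-driven version follows. It assumes the publisher exposes the same `isPublishing`, `publish()`, and `stop()` members used in the snippets above; the class name `AudioToggleRepro`, the 500 ms interval, and the count of 200 toggles are hypothetical values chosen for illustration, not settings taken from this thread.

```actionscript
package
{
    import flash.events.TimerEvent;
    import flash.utils.Timer;

    /**
     * Hypothetical helper: toggles audio publishing at a fixed interval so
     * that a repro run does not depend on manual clicking.
     */
    public class AudioToggleRepro
    {
        // Any object with isPublishing, publish() and stop(),
        // e.g. the AudioPublisher with id "audioPub" in the app above.
        private var publisher:Object;
        private var timer:Timer;

        public function AudioToggleRepro(publisher:Object,
                                         intervalMs:Number = 500,
                                         toggleCount:int = 200)
        {
            this.publisher = publisher;
            // The timer stops by itself after toggleCount ticks.
            timer = new Timer(intervalMs, toggleCount);
            timer.addEventListener(TimerEvent.TIMER, onTick);
        }

        public function start():void
        {
            timer.start();
        }

        public function stop():void
        {
            timer.stop();
        }

        private function onTick(event:TimerEvent):void
        {
            // Same start/stop logic as the click handler, driven by the timer.
            if (publisher.isPublishing) {
                publisher.stop();
            } else {
                publisher.publish();
            }
            // Log each toggle with the local time for later comparison with server logs.
            trace(new Date().toString() + " toggle #" + timer.currentCount);
        }
    }
}
```

From the test app this could be started with `repro = new AudioToggleRepro(audioPub); repro.start();`, keeping `repro` in a member variable for the duration of the run. Whether a fixed rate triggers the problem any more reliably than manual clicking is not established here.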

Level 3

To clarify, after running this test, we were able to go into the navigator and make changes to other rooms, but not yrtv22.

The problem happens the moment we turn a second cam and audio on in the room, and others subscribe to it.

Thanks,

-Trace

Former Community Member

Sorry, I'm a little lost - too many different repro steps.

Did you repro this problem with the code you sent? If not, could you try?

The best way to get to speedy resolution here is to build the simplest possible app which shows the issue, with the minimum set of steps. I spent about 20 minutes turning my audio on and off in a couple of simple apps, but didn't encounter any issues.

I'm going to sign off for today (time to go to a BBQ), but I'll be looking at more tomorrow. I'm sure we'll get to the bottom of this soon enough.

thanks

nigel

Level 3

I think this problem might be related to mine as well. After a few hours, an "automatic kill switch" seems to kick in, and the video and chat stop working. I sent an email further describing the problem... I am not sure if you guys have had any time to check the logs yet?

Level 3

Room yrtv24 showed symptoms very similar to the ones we saw previously, where video streams froze and the room stopped accepting replacements of existing chat message items, but audio streams were fine and the room accepted new chat message items.  This happened from about 7:15-7:18 this evening, but then the room recovered within a few minutes and everything worked normally. 

Do your logs show any clues as to what might have happened at this time?

Right before the room crashed, we'd posted a slew of chat messages where a single chat message item was repeatedly replaced in very rapid succession.

Right after the room crashed, users got the following errors, which we don't usually get:

Error: UserManager.anonymousPresence : Wrong time to check anonymousPresence. Wait for the UserManager to synchronize. The value of anonymousPresence is returned after its actually set in the server
   at com.adobe.rtc.sharedManagers::UserManager/get anonymousPresence()
   at com.adobe.rtc.sharedManagers::UserManager/setCustomUserField()
   at .../onUserChange()
   at flash.events::EventDispatcher/dispatchEventFunction()
   at flash.events::EventDispatcher/dispatchEvent()
   at com.adobe.rtc.sharedManagers::UserManager/userReceivedOrEdited()
   at com.adobe.rtc.sharedManagers::UserManager/onItemReceive()
   at flash.events::EventDispatcher/dispatchEventFunction()
   at flash.events::EventDispatcher/dispatchEvent()
   at com.adobe.rtc.sharedModel::CollectionNode/http://www.adobe.com/2006/connect/cocomo/messaging/internal::receiveItem()
   at com.adobe.rtc.messaging.manager::MessageManager/http://www.adobe.com/2006/connect/cocomo/messaging/internal::receiveItem()
   at com.adobe.rtc.messaging.manager::MessageManager/http://www.adobe.com/2006/connect/cocomo/messaging/internal::receiveItems()
   at com.adobe.rtc.session.managers::SessionManagerBase/receiveItems()

Error: Error #2154: The NetStream Object is invalid.  This may be due to a failed NetConnection.
   at flash.net::NetStream/get soundTransform()
   at com.adobe.rtc.collaboration::AudioSubscriber/setLocalVolume()
   at com.../setSoundLevel()
   at com.../onAudioStreamReceived()
   at flash.events::EventDispatcher/dispatchEventFunction()
   at flash.events::EventDispatcher/dispatchEvent()
   at mx.core::UIComponent/dispatchEvent()
   at com.adobe.rtc.collaboration::AudioSubscriber/onStreamReceive()
   at flash.events::EventDispatcher/dispatchEventFunction()
   at flash.events::EventDispatcher/dispatchEvent()
   at com.adobe.rtc.sharedManagers::StreamManager/onItemReceive()
   at flash.events::EventDispatcher/dispatchEventFunction()
   at flash.events::EventDispatcher/dispatchEvent()
   at com.adobe.rtc.sharedModel::CollectionNode/http://www.adobe.com/2006/connect/cocomo/messaging/internal::receiveItem()
   at com.adobe.rtc.messaging.manager::MessageManager/http://www.adobe.com/2006/connect/cocomo/messaging/internal::receiveItem()
   at com.adobe.rtc.session.managers::SessionManagerBase/receiveItem()

We haven't yet been able to create a test case that reliably reproduces these symptoms, but they do seem to happen to us fairly often.

Thanks,

-Trace
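
Since the room broke right after a single chat item was replaced in very rapid succession, one experiment would be to cap how often that replacement is sent. The sketch below coalesces updates so that at most one goes out per interval, always carrying the latest value. It is plain ActionScript with no LCCS calls: the class name `CoalescingSender`, the `send` callback (which would wrap whatever call the app already uses to replace the item), and the 250 ms default are all assumptions for illustration. Nothing in this thread shows that rate limiting avoids the failure.

```actionscript
package
{
    import flash.events.TimerEvent;
    import flash.utils.Timer;

    /**
     * Hypothetical helper: coalesces rapid updates to one item so that at
     * most one replacement is sent per interval, always with the latest value.
     */
    public class CoalescingSender
    {
        private var send:Function;          // callback that performs the real publish
        private var timer:Timer;            // one-shot cool-down timer
        private var pending:Object = null;
        private var hasPending:Boolean = false;

        public function CoalescingSender(send:Function, minIntervalMs:Number = 250)
        {
            this.send = send;
            timer = new Timer(minIntervalMs, 1);
            timer.addEventListener(TimerEvent.TIMER_COMPLETE, onCoolDownDone);
        }

        /** Sends immediately if no cool-down is running, otherwise queues. */
        public function update(value:Object):void
        {
            if (timer.running) {
                // Overwrite any queued value: only the latest one matters.
                pending = value;
                hasPending = true;
            } else {
                send(value);
                restartCoolDown();
            }
        }

        private function onCoolDownDone(event:TimerEvent):void
        {
            if (hasPending) {
                var value:Object = pending;
                pending = null;
                hasPending = false;
                send(value);
                restartCoolDown();  // a flush also starts a new cool-down
            }
        }

        private function restartCoolDown():void
        {
            timer.reset();
            timer.start();
        }
    }
}
```

With this in place, a burst of ten `update()` calls inside one interval results in two sends: the first value immediately, and the last value when the cool-down ends.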

Level 3

Room yrtv25 is currently broken as well, with the usual characteristics -- it's not currently possible in that room to stream video or enter new chat messages (by replacing our singleton chat item); new chat items work fine.

The room's currently in a broken state, listing many users who have actually left the room, and I've left a computer on in the room as well to keep the session active.  If you're available to inspect what might be going on, this might be a great time to check the logs from your side - especially since this occurred twice in rapid succession.

Thanks,

-Trace

Former Community Member

Hi Trace,

If you can stay in the room and PM me your account username/password, I can take a look in the room console. I don't see anything strange in the logs so far.

nigel

Former Community Member

So here's something weird,

I'm seeing the same pattern as last time - your room's activity stops for a while, and then I see this:

2011-06-16 20:48:22 2981 (s)2641173 INFO SessionManager.IS_REGISTERHOOK <http://YOUR_HEROKU_ACCOUNT,> -

7 times in a row, within the span of 5-10 seconds.

I'm wondering,

A) Do you have code calling this repeatedly?

B) Could you try your app without the webhooks (no AccountManager.subscribeCollection) and see if this bug is still reproducible?

thanks

nigel

Level 3

Just PMd you the info for yrtv25, which is still currently broken.

When you log into the room via the LCCS console, you can see the session's still active even though everyone left, and you can't delete or update any of the users or streams.

The Heroku activity was all 8 dynos from Heroku registering the hook when they restart - we'll make it so there's only 1 call to register the hook.  That happened several minutes after the room broke, however.  Also, 2 rooms broke tonight (yrtv24 broke also), but there would have been only 1 burst of simultaneous register-hook calls.  The Heroku issue is likely a red herring, but is worth investigating.

Please let me know what you find -- I'll leave a computer signed on to the room tonight.

Thanks,

-Trace

Former Community Member

Actually, I'm wondering if you're receiving notifications (on the userList and MESSAGE_NODE) in your heroku gateway. A good experiment would be to unregisterHook from the account and try to break a room in the same way. I'm just wondering if there's something about the way you're using hooks which is causing an issue.

nigel

Level 3

Sounds like a useful test - unfortunately, we're still not able to reliably reproduce the problem, so it will be difficult to tell if removing the hooks made a difference.  We're dependent on the hooks for a few key aspects of our service, so we'd like to continue using them if possible.

We'll try recreating the environment when the room broke; if we can reproduce the problem reliably, we'll run the experiment where we remove the hooks.

We can also tell you exactly when the hook calls happen and when the room breaks next time this happens, so we can see if there's a correlation.  I think this time the hook calls happened after the room broke, and there are events that can still happen in the room after it's broken -- it's just specifically been the server creating new items which are then picked up by each of the clients.

In the meantime, did you use our credentials to log in to room yrtv25, and did you find anything interesting while there?  The room was still broken when I signed off late last night, but seemed to be fine when I logged in this morning.

Thanks again,

-Trace

Former Community Member

Hi Trace,

Yup, I spent some time last night investigating a couple of things:

A) Do hooks cause the crash by themselves? Answer: not that I can tell. I set up a test where a room was firing 2 hooks a second, and did 1000s of messages (much more than one of your average rooms). No issues there.

B) What state does your room seem to be in? I hung around and wrote a couple of test apps against your room, and watched the logs and the dev console. The behavior is characterized by a short burst (as you connect) where you're able to send and receive messages. Then, the ability to send just drops out entirely, although receiving is still functional (notably, server to server publish still works). This looks suspiciously like a Player bug we've seen in the past (it was on Linux/Mobile, and we got a fix in before release).

One thing I'm wondering is whether there's some sort of "poison pill" message you're sending to trigger this behavior, such that once a client receives it, it shuts down in this way. In both logs I've got, the last set of successful messages published I see (before all clients shut down publishing) are about your UserManager customFields - specifically status and state. Could you run a test that never publishes anything to those UM nodes and see if it makes any difference?

onwards!

nigel

Level 3

Thanks again Nigel!

Again, the problem is still sufficiently intermittent (weekly), with no clear trigger, and those custom fields are key to our functionality, so it would be hard for us to run a test without those fields for a week.  However, we can likely reduce their use and ensure they aren't set simultaneously as they are now, and see if that has any effect.

Will also inspect our code around the places where those fields are set, in case that provides any clues.

Will let you know the moment the problem recurs, to see if your logs corroborate the problems we've been seeing.

Kind regards,

-Trace

Former Community Member

Sure, I understand these features are important to the functionality of the app. For debugging, it's valuable to first work to get a relatively reliable repro (even if it's stochastic, like "40% of the time when we do this sequence, it blows up"), then strip down features in the lab and measure the impact on reproducibility. We'll keep working with you guys to help narrow down what we can.

nigel

Level 3

Very much appreciate your working with us to narrow down what's causing the rooms to crash.

Ran several tests today, still including the updates the app is making to the 'status' and 'state' CustomFields.  Couldn't reproduce any crashes.  The frequency with which the crashes occur currently might be closer to 1-2% than 40%, and the tests involve multiple people and computers, so it's been hard to reproduce and track this down despite much effort.

This seems to recur right before our live sessions where we have a large number of users, so can let you know immediately if the issue recurs then; we'll be having another session this evening; will see if the room breaks again right beforehand, at about 8:30pm-8:45pm Eastern Time tonight.  If it breaks again, we can also see if the errors you see in the logs match what you were seeing before.

The particular custom field change you were observing usually happens at the same time as when the publisherIDs switch for everyone in the room, and the same time as a new user's audio goes live, so it's possible the error has more to do with the streaming changes than with the custom field changes.

Kind regards,

-Trace
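
The gap between a 40% and a 1-2% failure rate matters for test planning, because it sets how many runs are needed before a clean result means anything. As a rough sketch, assuming each test run is independent and fails with the same probability p, the chance of seeing at least one failure in n runs is 1 - (1 - p)^n. The helper below (the names `ReproMath` and `trialsNeeded` are hypothetical) solves that for n.

```actionscript
package
{
    /**
     * Hypothetical helper for planning repro runs.
     */
    public class ReproMath
    {
        /**
         * Smallest number of independent trials n such that the chance of
         * seeing at least one failure, 1 - (1 - p)^n, reaches `confidence`.
         */
        public static function trialsNeeded(failureRate:Number,
                                            confidence:Number = 0.95):int
        {
            if (failureRate <= 0 || failureRate >= 1 ||
                confidence <= 0 || confidence >= 1) {
                throw new ArgumentError(
                    "failureRate and confidence must be strictly between 0 and 1");
            }
            // Solve (1 - p)^n <= 1 - confidence for n, then round up.
            return int(Math.ceil(Math.log(1 - confidence) / Math.log(1 - failureRate)));
        }
    }
}
```

At 95% confidence this gives `trialsNeeded(0.40)` = 6, `trialsNeeded(0.02)` = 149, and `trialsNeeded(0.01)` = 299. So at a 1-2% rate, a handful of clean runs after removing a feature says very little about whether that feature was involved.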

Former Community Member

Cool, thanks for all the good info. If you encounter any problems tonight, please be sure to track the time of incidence as precisely as you can - the more precision you can provide, the better we can map it to log activity on the service.

thanks

nigel
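
One way to capture incident times to the second is to have the client stamp key events itself. The sketch below formats times as `YYYY-MM-DD HH:MM:SS`, the layout of the log line quoted earlier in the thread. The time zone of the service logs is not stated here, so the helper records both local time and UTC; the class name `IncidentClock` and its methods are hypothetical.

```actionscript
package
{
    /**
     * Hypothetical helper: formats incident times as "YYYY-MM-DD HH:MM:SS"
     * in both local time and UTC.
     */
    public class IncidentClock
    {
        /** Zero-pads a value below 10 to two digits. */
        private static function pad(n:int):String
        {
            return n < 10 ? "0" + n : String(n);
        }

        public static function formatLocal(d:Date):String
        {
            // Date.month is zero-based, so add 1 for display.
            return d.fullYear + "-" + pad(d.month + 1) + "-" + pad(d.date) + " "
                + pad(d.hours) + ":" + pad(d.minutes) + ":" + pad(d.seconds);
        }

        public static function formatUTC(d:Date):String
        {
            return d.fullYearUTC + "-" + pad(d.monthUTC + 1) + "-" + pad(d.dateUTC) + " "
                + pad(d.hoursUTC) + ":" + pad(d.minutesUTC) + ":" + pad(d.secondsUTC);
        }

        /** Traces and returns one line describing the event at the current time. */
        public static function stamp(label:String):String
        {
            var now:Date = new Date();
            var line:String = formatLocal(now) + " local / "
                + formatUTC(now) + " UTC  " + label;
            trace(line);
            return line;
        }
    }
}
```

For example, `IncidentClock.formatUTC(new Date(Date.UTC(2011, 5, 16, 20, 48, 22)))` returns `2011-06-16 20:48:22`. Calling `IncidentClock.stamp("video streams stopped")` from the app's own error or stream handlers would leave a timeline that can be lined up against the service logs.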