Want us to send you a calendar invitation so you don’t forget? Register now to mark your calendar and receive reminders!
Jon Tehero is a Group Product Manager for Adobe Target. He has overseen hundreds of new features within the Target platform and has played a key role in migrating functionality from Target's classic platforms into the new Adobe Target UI. Jon is currently focused on expanding the Target feature set to address an even broader set of use cases. Prior to joining the Product Management team, Jon consulted for over sixty mid- to enterprise-sized customers and was a subject matter expert within the Adobe Consulting group.
Curious about what an Adobe Target Community Q&A Coffee Break looks like? Check out the threads from our first series of Adobe Target Community Q&A Coffee Breaks.
Hi @Jon_Tehero, thank you for your time today. These coffee break sessions are great.
I wanted to share our experience with A4T offline significance calculations. As a whole, the process is quite time-consuming, and most of our experiments require offline calculations. It would be more practical if the Target UI, Analytics reporting, or the A4T Workspace panel could compute calculated metrics, but even better performance from the Data Warehouse UI would be a great improvement, since it is required for those of us who select A4T as the reporting source in the activity.
With the current process, monitoring tests is not really practical because of the manual steps involved, yet it is a firm requirement for high-risk tests: we can't wait for a test to reach the required sample size or number of conversions before running the significance analysis.
Currently, because of continuous variables in Analytics, we have to take these steps to perform offline significance calculations for A4T activities: 1) create multiple segments that are compatible with Data Warehouse; 2) pull various reports from Data Warehouse to break the data down into a digestible format; 3) once the reports are available, sometimes hours or a day or two later, enter formulas in Excel to calculate visitors and compute the sum of the squared success metric; and 4) enter the data into the Excel confidence calculator spreadsheet.
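For readers following along, steps 3 and 4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses a large-sample normal approximation to Welch's t-test, which may not match the formula in the Excel confidence calculator, and the names `ArmSummary` and `confidence` are hypothetical, as are the numbers in the usage line.

```python
from dataclasses import dataclass
from math import sqrt
from statistics import NormalDist


@dataclass
class ArmSummary:
    """Per-experience totals, e.g. as pulled from a Data Warehouse report."""
    visitors: int         # unique visitors in the experience
    metric_sum: float     # sum of the success metric over those visitors
    metric_sum_sq: float  # sum of the squared per-visitor success metric

    @property
    def mean(self) -> float:
        return self.metric_sum / self.visitors

    @property
    def variance(self) -> float:
        # Unbiased sample variance recovered from the two running sums.
        n = self.visitors
        return (self.metric_sum_sq - self.metric_sum ** 2 / n) / (n - 1)


def confidence(control: ArmSummary, challenger: ArmSummary) -> float:
    """Two-sided confidence (1 - p-value) that the two means differ.

    Assumption: samples are large enough for a normal approximation.
    """
    se = sqrt(control.variance / control.visitors
              + challenger.variance / challenger.visitors)
    z = (challenger.mean - control.mean) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return 1 - p_value


# Hypothetical totals for a binary conversion metric (sum == sum of squares).
control = ArmSummary(visitors=10_000, metric_sum=500, metric_sum_sq=500)
challenger = ArmSummary(visitors=10_000, metric_sum=560, metric_sum_sq=560)
print(f"confidence: {confidence(control, challenger):.3f}")
```

The sum of squares is what makes the variance recoverable without per-visitor rows, which is why it has to be exported alongside the visitor count and the metric total.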
In general, the process makes monitoring tests difficult and very time-consuming, and I would go as far as to say it may even discourage the monitoring cycle of the testing process because it requires so much effort. The level of effort isn't ideal even after a test ends, but redoing these steps at weekly intervals while a high-risk test is still running isn't practical.
Is an improvement to this process on the roadmap, or is there any recommendation for creating efficiencies with existing functionality? I couldn't find much in the Adobe Cloud documentation about alternative solutions, so I was hoping you could share more insight into future improvements or other ways to achieve the same result with less effort.
Thank you!
Hi Shani2,
Thank you for your question! We've received a lot of requests for supporting calculated metrics, and we know that this would improve the overall workflow for our customers. My peers on the Analytics product management team have this feature in their backlog, but we do not have any specific dates at this time.
If you have access to Experience Platform and your Analytics data is landing on Platform, Query Service is probably the best option for achieving this today.
@Shani2 Check out this Spark page on additional best practices for leveraging A4T: https://spark.adobe.com/page/Lo3Spm4oBOvwF/
Thank you, I will look into!
@Jon_Tehero The cloud documentation states that the logic for only 2 experiences works differently from the logic for 3+ experiences. Please see the image below; I also included the link for reference.
@Shani2,
Thank you for pointing this out. I will review with our engineers and doc writers to make sure our documentation is accurate for the A/A behavior.
@Jon_Tehero I would really appreciate being informed of the outcome of that discussion. Or should I create a customer service ticket? Thank you!
Yes, that would be the best way to get automatic updates. Thank you for your great questions today and for your patience on this documentation discrepancy.
Hi again, @Jon_Tehero, I have a statistical question. Is there a feature on the roadmap that would let the user select a one-tailed test as the statistical method in the Target UI during experience setup? Currently, both the Target statistical engine and the Sample Size Calculator default to a two-tailed test. Technically, we can manually adjust the significance level in the Sample Size Calculator to convert to a one-tailed test and then choose to end a test based on a given set of parameters; however, it would take far less effort if we could simply select one-sided or two-sided tests in the UI. When testing for a specific direction (positive or negative), a one-tailed test needs less time to run than a two-tailed test at the same significance level, because alpha is not split between two tails. If, for example, we are truly testing for superiority, then testing for a negative impact has no value, and the cost is a test that runs longer than the business question requires. As we look for ways to be more agile and efficient with our testing program, I am eager to learn whether expanding the ways of computing statistical significance is on Target's product roadmap. Thank you!
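To make the runtime difference concrete, here is a rough Python sketch of the textbook sample-size formula for comparing two conversion rates. It is an assumption-laden illustration, not the formula behind Target's Sample Size Calculator; the function name `visitors_per_experience` and the 5% and 6% rates are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def visitors_per_experience(p_control: float, p_challenger: float,
                            alpha: float = 0.05, power: float = 0.80,
                            one_tailed: bool = False) -> int:
    """Approximate visitors needed per experience for a two-proportion z-test."""
    z = NormalDist().inv_cdf
    # A two-tailed test splits alpha across both tails; a one-tailed test does not.
    z_alpha = z(1 - alpha) if one_tailed else z(1 - alpha / 2)
    z_beta = z(power)
    variance = p_control * (1 - p_control) + p_challenger * (1 - p_challenger)
    effect = p_challenger - p_control
    return ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# Hypothetical scenario: detect a lift from a 5% to a 6% conversion rate.
two_tailed = visitors_per_experience(0.05, 0.06)
one_tailed = visitors_per_experience(0.05, 0.06, one_tailed=True)
print(two_tailed, one_tailed)
```

At alpha = 0.05 and 80% power, the one-tailed design in this sketch needs roughly a fifth fewer visitors per experience, which is the saving the question refers to.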
Thank you for sharing your feedback on one-tailed experiments. We do not have anything on our roadmap at this time for supporting one-tailed tests.
@Jon_Tehero Sorry, I am full of questions; you can tell I couldn't wait for this event.
This should be my last one, and it's about the A/B Auto-Allocate (AA) algorithm. The concept and mechanism of AA is a great one; however, I think there are some limitations in one particular use case. I won't go into the benefits of AA, as those are plentiful, but I do want to highlight one specifically: optimization occurs in parallel with learning.
In the event there are only 2 experiences, I think there is a real risk of false positives (higher than 5%) with the current algorithm logic: after the better-performing experience reaches 95% confidence, 100% of traffic is allocated to the experience identified as the winner. This differs from the logic for 3 or more experiences, in which 80% of traffic is allocated to the winner and 20% continues to be served randomly to all experiences. That reserve is key if user behavior shifts and confidence intervals begin to overlap with those of other experiences while the test is running.
I've encountered a few cases with Target's manual A/B test in which the stats engine called a winner early and displayed a badge in the activity; however, after hours, days, or weeks of collecting more data, the engine removed the badge once it recognized that confidence levels were still overlapping or fluctuating. This is a prime example of why it is important to determine sample size and test parameters before running a test, both to avoid ending a test prematurely and to ensure statistically valid results. It is also why I raise my concern with the AA logic specifically for the 2-experience scenario. Currently, there is no room for the algorithm to correct itself if it identifies a winner that truly was not one, because no traffic is held in reserve for learning if user behavior changes. In this use case it is not truly a multi-armed bandit approach, because after 95% confidence is reached, optimization no longer occurs in parallel with learning.
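The early-badge effect described here can be reproduced with a small simulation. The sketch below is hypothetical and is not Target's stats engine: it assumes two experiences with the same true conversion rate, applies a plain pooled two-proportion z-test, and compares checking at every interim look against checking once at the planned sample size. All names and parameter values are illustrative.

```python
import random
from math import sqrt
from statistics import NormalDist

Z_CRIT = NormalDist().inv_cdf(0.975)  # two-sided 95% confidence threshold


def is_significant(conv_a: int, conv_b: int, n: int) -> bool:
    """Pooled two-proportion z-test with n visitors in each experience."""
    pooled = (conv_a + conv_b) / (2 * n)
    if pooled in (0.0, 1.0):
        return False  # no variance yet, nothing to test
    se = sqrt(2 * pooled * (1 - pooled) / n)
    return abs(conv_a - conv_b) / n / se > Z_CRIT


def false_positive_rates(trials=500, visitors=1000, looks=10,
                         rate=0.10, seed=7):
    """Both experiences share one true rate, so every 'winner' is false."""
    rng = random.Random(seed)
    step = visitors // looks
    peeking_hits = fixed_hits = 0
    for _ in range(trials):
        conv_a = conv_b = 0
        flagged_at_any_look = False
        for n in range(1, visitors + 1):
            conv_a += rng.random() < rate
            conv_b += rng.random() < rate
            if n % step == 0 and is_significant(conv_a, conv_b, n):
                flagged_at_any_look = True  # a badge would have appeared
        peeking_hits += flagged_at_any_look
        fixed_hits += is_significant(conv_a, conv_b, visitors)
    return peeking_hits / trials, fixed_hits / trials


peeking, fixed = false_positive_rates()
print(f"flagged at some look: {peeking:.1%}, flagged at the end: {fixed:.1%}")
```

Because the final look is one of the interim looks, the share of runs flagged at some point can never be lower than the share flagged at the planned sample size, and in simulations like this it is typically well above it.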
Another concern with the algorithm's logic for two experiences is that, hypothetically, we cannot detect a novelty effect, because the algorithm may declare a winner too early. We have observed novelty effects in manual A/B tests after adding an attention-grabbing new feature: for the first two weeks a challenger may outperform the default experience and display a badge, but the positive effect wears off as more data is collected, confirming that the lift was only an illusion.
In sum, I hesitate to use AA for 2 experiences because of the current AI logic. The dilemma is that our organization rarely tests more than 2 experiences. Are there any suggestions for mitigating false positives with AA for 2 experiences? Is enhancing the algorithm for two experiences on the roadmap, so that it serves as a true multi-armed bandit approach to optimization? Lastly, will users be able to set the significance level for AI-driven activities? Not all tests are created equal: they do not carry the same risks and costs, so some tests may require a false-positive rate lower or higher than 5%.
Please note that I am aware of the time-correlated caveat for AA, and the cases I discussed above regarding manual A/B tests were not contextually varying.
Thank you!
Our logic for 2 experiences and for more than 2 experiences is actually the same: in both scenarios, once a winner is declared, we allocate 80% of traffic to the winner and split the remaining 20% among all experiences. So when 2 experiences are present, at the time we declare a winner we'll send 90% of traffic to the winning experience and 10% of traffic to the other experience.
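The arithmetic behind that 90/10 split can be written out as a short Python sketch. The function name `traffic_after_winner` is hypothetical, and the even split of the remaining 20% across all experiences (winner included) is taken from the description above rather than from Target's implementation.

```python
def traffic_after_winner(experiences: list[str], winner: str,
                         winner_share: float = 0.80) -> dict[str, float]:
    """Effective traffic share per experience once a winner is declared."""
    if winner not in experiences:
        raise ValueError(f"unknown winner: {winner}")
    # The remaining traffic is spread evenly over all experiences,
    # so the winner also receives its slice of it.
    explore_each = (1 - winner_share) / len(experiences)
    return {
        name: (winner_share if name == winner else 0.0) + explore_each
        for name in experiences
    }


for arms in (["A", "B"], ["A", "B", "C", "D"]):
    shares = traffic_after_winner(arms, winner="B")
    print({name: round(share, 3) for name, share in shares.items()})
```

For two experiences this gives 90% to the winner and 10% to the other experience; for four experiences it gives 85% to the winner and 5% to each of the others.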
If for any reason you are seeing behavior different from what I've described above, please submit a ticket to Customer Care so that we can take a look.
Hello everyone! I am looking forward to chatting with you in a few minutes and answering your questions.
@Jon_Tehero
What is the best way to combine testing methods (A/B testing while personalizing, i.e. XT and A/B) to make sure that you are always improving?
Hello @frihed30,
Great question! Within our A/B activity, you can do both personalization and experimentation. This allows you to test your personalization techniques and learn what is working best. You can get pretty sophisticated in testing your personalization by using a feature that lets you serve different variations of the same experience to different audiences. This could come in handy if you were personalizing content across different geos with different languages, for example.
Hi Everyone! I'm looking forward to hearing from Jon today. He always impresses with his Target and Recommendations depth and breadth of knowledge.
@Jon_Tehero Hi, with the new A4T view in Workspace, will we ever be able to use calculated metrics within the view to get confidence levels? Any tips for doing this now? Thanks.
@mravlich The Analytics team is working on enabling support for calculated metrics, but the complexity arises from how Analytics collects data based on visitors. There are good suggestions on best practices for using A4T and success metrics on this Spark page: https://spark.adobe.com/page/Lo3Spm4oBOvwF/
@drewb6915421 The link doesn't seem to work; it just redirects me back to this thread.